[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844571#comment-16844571 ]

Gowtham SB commented on SPARK-15348:

I hit the same issue (Spark with Hive ACID tables) and was able to work around it with a JDBC call from Spark. That JDBC route may have to do until Spark gets native ACID support. Thanks.

> Hive ACID
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
> Reporter: Ran Haim
> Priority: Major
>
> Spark does not support any feature of Hive's transactional tables: you cannot use Spark to delete or update such a table, and Spark also has problems reading the aggregated data when no compaction has been done.
> Compaction itself also appears to be unsupported: alter table ... partition ... COMPACT 'major'.
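A minimal sketch of the JDBC workaround described in the comment above: reading the Hive ACID table through HiveServer2 rather than Spark's native readers. The host/port, table name, and driver class below are placeholders, not values from the ticket.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acid-via-jdbc").getOrCreate()

# Hypothetical connection details; substitute your own HiveServer2 endpoint and table.
acid_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:hive2://hiveserver2-host:10000/default")
    .option("dbtable", "my_acid_table")
    .option("driver", "org.apache.hive.jdbc.HiveDriver")  # driver jar must be on the classpath
    .load()
)
acid_df.show()
{code}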
[jira] [Assigned] (SPARK-27787) Eliminate unnecessary job to compute SSreg
[ https://issues.apache.org/jira/browse/SPARK-27787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27787:

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-27787) Eliminate unnecessary job to compute SSreg
[ https://issues.apache.org/jira/browse/SPARK-27787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27787:

    Assignee: Apache Spark
[jira] [Created] (SPARK-27787) Eliminate unnecessary job to compute SSreg
zhengruifeng created SPARK-27787:

Summary: Eliminate unnecessary job to compute SSreg
Key: SPARK-27787
URL: https://issues.apache.org/jira/browse/SPARK-27787
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng

In {{RegressionMetrics}}, a separate job is currently needed to compute SSreg. However, we only need to aggregate a summary of the prediction column; SSreg can then be computed directly from that summary.
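For readers following the proposal, a hedged pure-Python sketch of the algebra that makes this possible: given the prediction column's count, mean, and centered second moment (plus the label mean), SSreg = sum((pred_i - label_mean)^2) falls out without another pass over the data. The function and variable names are illustrative, not Spark API.

{code:python}
def ssreg_from_summary(n, pred_mean, pred_m2, label_mean):
    # SSreg = sum((pred_i - label_mean)^2)
    #       = sum((pred_i - pred_mean)^2) + n * (pred_mean - label_mean)^2
    return pred_m2 + n * (pred_mean - label_mean) ** 2

# Quick check against the direct definition on a toy example.
preds, label_mean = [1.0, 2.0, 4.0], 2.0
n = len(preds)
pred_mean = sum(preds) / n
pred_m2 = sum((p - pred_mean) ** 2 for p in preds)
direct = sum((p - label_mean) ** 2 for p in preds)
assert abs(ssreg_from_summary(n, pred_mean, pred_m2, label_mean) - direct) < 1e-9
{code}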
[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-19248:

    Labels:  (was: bulk-closed)

> Regex_replace works in 1.6 but not in 2.0
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.2, 2.4.3
> Reporter: Lucas Tittmann
> Priority: Major
>
> We found an error in Spark 2.0.2's execution of regexes. Using PySpark in 1.6.2, we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the Spark version. We checked both regexes in Java, and both should be correct and work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not have the possibility to confirm on 2.1 at the moment.
[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-19248:

    Affects Version/s: 2.4.3
[jira] [Reopened] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas reopened SPARK-19248:
[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-19248:

    Component/s: PySpark
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844547#comment-16844547 ]

Nicholas Chammas commented on SPARK-19248:

Looks like Spark 2.4.3 still exhibits the behavior reported in the original issue:

{code:java}
Welcome to Spark version 2.4.3

Using Python version 3.7.3 (default, Mar 27 2019 13:25:00)
SparkSession available as 'spark'.
>>> df = spark.createDataFrame([('.. 5.',)], ['col'])
>>> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
>>> dfout
[Row(col='5')]
>>> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
>>> dfout2
[Row(col='')]
>>>
{code}

[~hyukjin.kwon] - I'm going to reopen this issue.
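One plausible reading of the regression, offered here as a hedged guess rather than a confirmed diagnosis: Spark 2.x treats backslashes in SQL string literals as escapes, so the literal '( |\.)*' reaches the regex engine as '( |.)*', which matches everything. Under that reading, doubling the backslash restores the 1.6 behaviour:

{code:python}
# Assumes the same SparkSession / DataFrame as in the comment above.
df = spark.createDataFrame([('.. 5.',)], ['col'])

# Double the backslash so the dot stays escaped after SQL literal parsing.
df.selectExpr("regexp_replace(col, '( |\\\\.)*', '') AS col").collect()
# If the escaping explanation holds, this returns [Row(col='5')] on 2.x as well.
{code}

Later 2.x releases also expose spark.sql.parser.escapedStringLiterals, which may be the less invasive way to restore the pre-2.0 literal handling.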
[jira] [Assigned] (SPARK-27637) If an exception occurred while fetching blocks by the Netty block transfer service, check whether the related executor is alive before retrying
[ https://issues.apache.org/jira/browse/SPARK-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-27637:

    Assignee: feiwang

> If an exception occurred while fetching blocks by the Netty block transfer service, check whether the related executor is alive before retrying
>
> Key: SPARK-27637
> URL: https://issues.apache.org/jira/browse/SPARK-27637
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.3.3, 2.4.3
> Reporter: feiwang
> Assignee: feiwang
> Priority: Major
> Fix For: 3.0.0
>
> There are several kinds of shuffle client: blockTransferService and externalShuffleClient.
> For the externalShuffleClient, there is a corresponding external shuffle service, which serves the shuffle block data regardless of the state of executors.
> The blockTransferService is used to fetch broadcast blocks, and to fetch shuffle data when the external shuffle service is not enabled.
> When fetching data through the blockTransferService, the shuffle client connects to the related executor's blockManager, so if that executor is dead, the fetch can never succeed.
> When spark.shuffle.service.enabled is true and spark.dynamicAllocation.enabled is true, an executor is removed once it has been idle for more than idleTimeout.
> If a blockTransferService creates a connection to the related executor successfully, but that executor is removed just as the broadcast block fetch begins, it will retry (see RetryingBlockFetcher), which is ineffective.
> If spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries are large, say 30s and 10 retries, this can waste 5 minutes.
> So I think we should check whether the related executor is alive before retrying.
[jira] [Resolved] (SPARK-27637) If an exception occurred while fetching blocks by the Netty block transfer service, check whether the related executor is alive before retrying
[ https://issues.apache.org/jira/browse/SPARK-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-27637.

    Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24533: https://github.com/apache/spark/pull/24533
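To make the waste described in the ticket concrete, here is the arithmetic with the two settings it names; the values are the ticket's example, not recommendations.

{code:python}
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.shuffle.io.maxRetries", "10")   # ticket's example: 10 attempts
    .set("spark.shuffle.io.retryWait", "30s")   # ticket's example: 30 s between attempts
)

# If the target executor is already gone, every retry is wasted:
wasted_seconds = 10 * 30
print(f"worst case spent retrying a dead executor: {wasted_seconds} s (~5 minutes)")
{code}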
[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844530#comment-16844530 ]

Nicholas Chammas commented on SPARK-18277:

[~hyukjin.kwon] - If I still think this issue is relevant, should I just reopen it?

> na.fill() and friends should work on struct fields
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Nicholas Chammas
> Priority: Minor
> Labels: bulk-closed
>
> It appears that you cannot use {{fill()}} and friends to quickly modify struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  |    |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +-----------+-------+
> |          a|      c|
> +-----------+-------+
> |[yeah yeah]|alright|
> |     [null]|   null|
> +-----------+-------+
> >>> df.na.fill('').show()
> +-----------+-------+
> |          a|      c|
> +-----------+-------+
> |[yeah yeah]|alright|
> |     [null]|       |
> +-----------+-------+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values inside structs. If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, '')}} it doesn't catch the null values.
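A hedged workaround sketch (not part of the ticket): rebuild the struct column with coalesce() on each nested field, since na.fill() only touches top-level columns. This assumes the same df as in the description above.

{code:python}
from pyspark.sql import functions as F

filled = df.na.fill('').withColumn(
    'a',
    F.struct(F.coalesce(F.col('a.b'), F.lit('')).alias('b'))
)
filled.show()  # a.b is now '' instead of null on the second row
{code}

This gets tedious quickly for wide or deeply nested structs, which is part of the argument for native support.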
[jira] [Updated] (SPARK-1921) Allow duplicate jar files among the app jar and secondary jars in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-1921:

    Labels: bulk-closed  (was: )

> Allow duplicate jar files among the app jar and secondary jars in yarn-cluster mode
>
> Key: SPARK-1921
> URL: https://issues.apache.org/jira/browse/SPARK-1921
> Project: Spark
> Issue Type: Sub-task
> Components: Deploy
> Affects Versions: 1.0.0
> Reporter: Xiangrui Meng
> Priority: Minor
> Labels: bulk-closed
>
> In yarn-cluster mode, jars are uploaded to a staging folder on HDFS. If there are duplicates among the app jar and secondary jars, there will be overwrites that cause inconsistent timestamps. I saw the following message:
> {code}
> Application application_1400965808642_0021 failed 2 times due to AM Container for appattempt_1400965808642_0021_02 exited with exitCode: -1000 due to: Resource hdfs://localhost.localdomain:8020/user/cloudera/.sparkStaging/application_1400965808642_0021/app_2.10-0.1.jar changed on src filesystem (expected 1400998721965, was 1400998723123
> {code}
> Tested on a CDH-5 quickstart VM.
[jira] [Updated] (SPARK-3251) Clarify learning interfaces
[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-3251:

    Labels: bulk-closed  (was: )

> Clarify learning interfaces
>
> Key: SPARK-3251
> URL: https://issues.apache.org/jira/browse/SPARK-3251
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.1.0, 1.1.1
> Reporter: Christoph Sawade
> Priority: Major
> Labels: bulk-closed
>
> *Make threshold mandatory*
> Currently, the output of predict for an example is either the score or the class. This side effect is caused by clearThreshold. To clarify that behaviour, three different kinds of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional.
> *Clarify classification interfaces*
> Currently, some functionality is spread over multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced:
> - BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. It comprises the thresholding functionality to derive a prediction from a score, and basically captures SVMModel and LogisticRegressionModel.
> - ProbabilisticClassificationModel: this trait defines the interface for models that return a calibrated confidence score (aka probability).
> *Misc*
> - some renaming
> - add a test for probabilistic output
[jira] [Updated] (SPARK-4229) Create hadoop configuration in a consistent way
[ https://issues.apache.org/jira/browse/SPARK-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-4229:

    Labels: bulk-closed  (was: )

> Create hadoop configuration in a consistent way
>
> Key: SPARK-4229
> URL: https://issues.apache.org/jira/browse/SPARK-4229
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Cody Koeninger
> Priority: Minor
> Labels: bulk-closed
>
> Some places use SparkHadoopUtil.get.conf, some create a new Hadoop config. Prefer SparkHadoopUtil so that spark.hadoop.* properties are pulled in.
> http://apache-spark-developers-list.1001551.n3.nabble.com/Hadoop-configuration-for-checkpointing-td9084.html
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-3717:

    Labels: bulk-closed  (was: )

> DecisionTree, RandomForest: Partition by feature
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Priority: Major
> Labels: bulk-closed
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: for each node in the level, compute (best feature to split, info gain).
> ** Reduce (P x M) values to M values to find the best split for each node.
> ** Workers who have features used in best splits communicate left/right for relevant instances. Gather a total of N bits to the master, then broadcast.
> * Total communication:
> ** Depth D iterations.
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each).
> ** Estimate: D * (M * 8 + N)
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: for each instance, add to aggregate statistics.
> ** The aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** ("# classes" is for classification; 3 for regression)
> ** Reduce the aggregate.
> ** The master chooses the best split for each node in the group and broadcasts it.
> * Local training: once all instances for a node fit on one machine, it can be best to shuffle data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained.
> * Summing over all iterations, reduce a total of:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right-hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example with many instances:
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
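The closing example can be reproduced with a few lines of arithmetic; this snippet just restates the estimates above in code.

{code:python}
# 2M instances, 3500 features, 100 bins, 5 classes, trees of depth 5 (6 levels).
N, M, B, C = 2_000_000, 3_500, 100, 5
depth, levels = 5, 6
bytes_per_value = 8

partition_by_feature = levels * (M * bytes_per_value + N)            # ~1.2e7
partition_by_instance = (2 ** depth) * M * B * C * bytes_per_value   # ~4.5e8

print(f"partition by feature:  ~{partition_by_feature:.2e}")
print(f"partition by instance: ~{partition_by_instance:.2e}")
# Per the sketch, partitioning by feature wins whenever N < (2^D * M * B * C * 8) / D.
{code}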
[jira] [Resolved] (SPARK-6619) Improve Jar caching on executors
[ https://issues.apache.org/jira/browse/SPARK-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-6619.

    Resolution: Incomplete

> Improve Jar caching on executors
>
> Key: SPARK-6619
> URL: https://issues.apache.org/jira/browse/SPARK-6619
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Mingyu Kim
> Priority: Major
> Labels: bulk-closed
>
> Taking SPARK-2713 one step further so that:
> - The cached jars can be used by multiple applications. In order to do that, I'm planning to use MD5 as the cache key as opposed to URL hash and timestamp.
> - The cached jars are hard-linked into the work directory as opposed to being copied.
> Re: perf. Computing MD5 using "openssl" on my local MacBook Pro took 1.2 s for 158 jars with a total size of 56 MB, and these take ~10 s to ship to the executor at start-up.
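A small sketch of the MD5-keyed cache idea from the description: hash each jar's contents so identical jars map to the same cache entry regardless of URL or timestamp. The directory name below is a placeholder.

{code:python}
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local jar directory; the per-jar digest would serve as the cache key.
cache_keys = {p.name: md5_of(p) for p in Path("libs").glob("*.jar")}
{code}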
[jira] [Resolved] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[ https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-3601.

    Resolution: Incomplete

> Kryo NPE for output operations on Avro complex Objects even after registering.
>
> Key: SPARK-3601
> URL: https://issues.apache.org/jira/browse/SPARK-3601
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Environment: local, standalone cluster
> Reporter: mohan gaddam
> Priority: Major
> Labels: bulk-closed
>
> The Kryo serializer works well when Avro objects carry simple data, but when the same Avro object carries complex data (like unions/arrays), Kryo fails during output operations; map operations are fine. Note that I have registered all the Avro-generated classes with Kryo. I'm using Java as the programming language.
> When a complex message is used, it throws an NPE; stack trace as follows:
> {noformat}
> ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> value (xyz.Datum)
> data (xyz.ResMsg)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}
> In the above exception, Datum and ResMsg are project-specific classes generated by Avro using the AVDL snippet below:
> {noformat}
> record KeyValueObject {
>   union {boolean, int, long, float, double, bytes, string} name;
>   union {boolean, int, long, float, double, bytes, string, array<union {boolean, int, long, float, double, bytes, string, KeyValueObject}>, KeyValueObject} value;
> }
> record Datum {
>   union {boolean, int, long, float, double, bytes, string, array<union {boolean, int, long, float, double, bytes, string, KeyValueObject}>, KeyValueObject} value;
> }
> record ResMsg {
>   string version;
>   string sequence;
>   string resourceGUID;
>   string GWID;
>   string GWTimestamp;
>   union {Datum, array<Datum>} data;
> }
> {noformat}
> Avro message samples are as follows:
> 1) {"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", "GWTimestamp": "1409823150737", "data": {"value": "30"}}
> 2) {"version": "01", "sequence": "1", "resource": "sensor-001", "controller": "002", "controllerTimestamp": "1411038710358", "data": {"value": [ {"name": "Temperature", "value": "30"}, {"name": "Speed", "value": "60"}, {"name": "Location", "value": ["+401213.1", "-0750015.1"]}, {"name": "Timestamp", "value": "2014-09-09T08:15:25-05:00"}]}}
> Both 1 and 2 adhere to the Avro schema, so the decoder is able to convert them into Avro objects in the Spark Streaming API. BTW, the messages were pulled from a Kafka source and decoded using a Kafka decoder.
[jira] [Resolved] (SPARK-5431) SparkSubmitSuite and DriverSuite hang indefinitely if Master fails
[ https://issues.apache.org/jira/browse/SPARK-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-5431.

    Resolution: Incomplete

> SparkSubmitSuite and DriverSuite hang indefinitely if Master fails
>
> Key: SPARK-5431
> URL: https://issues.apache.org/jira/browse/SPARK-5431
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Andrew Or
> Priority: Major
> Labels: bulk-closed
>
> Steps to reproduce:
> (1) Add a throw new exception anywhere in Master's constructor
> (2) Run the DriverSuite or SparkSubmitSuite
> (3) zzz
> We need to somehow propagate the exception all the way to the suites themselves.
[jira] [Resolved] (SPARK-4206) BlockManager warnings in local mode: "Block $blockId already exists on this machine; not re-adding it"
[ https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4206.

    Resolution: Incomplete

> BlockManager warnings in local mode: "Block $blockId already exists on this machine; not re-adding it"
>
> Key: SPARK-4206
> URL: https://issues.apache.org/jira/browse/SPARK-4206
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Environment: local mode, branch-1.1 & master
> Reporter: Imran Rashid
> Priority: Minor
> Labels: bulk-closed
>
> When running in local mode, you often get log warning messages like:
> WARN storage.BlockManager: Block input-0-1415022975000 already exists on this machine; not re-adding it
> (e.g., try running the TwitterPopularTags example in local mode)
> I think these warning messages are pretty unsettling for a new user and should be removed. If they are truly innocuous, they should be changed to logInfo, or maybe even logDebug. Or, if they might actually indicate a problem, we should find the root cause and fix it.
> I *think* the problem is caused by a replication level > 1 when running in local mode. In BlockManager.doPut, first the block is put locally:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L692
> and then, if the replication level > 1, a request is sent out to replicate the block:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L827
> However, in local mode there isn't anywhere else to replicate the block; the request comes back to the same node, which then issues the warning that the block has already been added.
> If that analysis is right, the easy fix would be to make sure replicationLevel = 1 in local mode. But it's a little disturbing that a replication request could result in an attempt to replicate on the same node -- and that if something is wrong, we only issue a warning and keep going. If this really is the culprit, then it might be worth taking a closer look at the logic of replication.
[jira] [Resolved] (SPARK-4716) Avoid shuffle when all-to-all operation has single input and output partition
[ https://issues.apache.org/jira/browse/SPARK-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4716.

    Resolution: Incomplete

> Avoid shuffle when all-to-all operation has single input and output partition
>
> Key: SPARK-4716
> URL: https://issues.apache.org/jira/browse/SPARK-4716
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Sandy Ryza
> Priority: Major
> Labels: bulk-closed
>
> I encountered an application that performs joins on a bunch of small RDDs, unions the results, and then performs larger aggregations across them. Many of these small RDDs fit in a single partition. For these operations with only a single partition, there's no reason to write data to disk and then fetch it over a socket.
[jira] [Resolved] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-1823.

    Resolution: Incomplete

> ExternalAppendOnlyMap can still OOM if one key is very large
>
> Key: SPARK-1823
> URL: https://issues.apache.org/jira/browse/SPARK-1823
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Andrew Or
> Priority: Major
> Labels: bulk-closed
>
> If the values for one key do not collectively fit into memory, then the map will still OOM when you merge the spilled contents back in.
> This is a problem especially for PySpark, since we hash the keys (Python objects) before a shuffle, and there are only so many integers out there in the world, so there could potentially be many collisions.
[jira] [Resolved] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests
[ https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-5079.

    Resolution: Incomplete

> Detect failed jobs / batches in Spark Streaming unit tests
>
> Key: SPARK-5079
> URL: https://issues.apache.org/jira/browse/SPARK-5079
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Reporter: Josh Rosen
> Assignee: Ilya Ganelin
> Priority: Major
> Labels: bulk-closed
>
> Currently, it is possible to write Spark Streaming unit tests where Spark jobs fail but the streaming tests succeed, because we rely on wall-clock time plus output comparison to check whether a test has passed, and hence may miss cases where errors occurred if they didn't affect these results. We should strengthen the tests to check that no job failures occurred while processing batches.
> See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for additional context.
> The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might also fix this.
[jira] [Resolved] (SPARK-4911) Report the inputs and outputs of Spark jobs so that external systems can track data lineage
[ https://issues.apache.org/jira/browse/SPARK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4911.

    Resolution: Incomplete

> Report the inputs and outputs of Spark jobs so that external systems can track data lineage
>
> Key: SPARK-4911
> URL: https://issues.apache.org/jira/browse/SPARK-4911
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Sandy Ryza
> Priority: Major
> Labels: bulk-closed
>
> When Spark runs a job, it would be useful to log its filesystem inputs and outputs somewhere. This allows external tools to track which persisted datasets are derived from other persisted datasets.
[jira] [Resolved] (SPARK-5043) Implement updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-5043.

    Resolution: Incomplete

> Implement updated Receiver API
>
> Key: SPARK-5043
> URL: https://issues.apache.org/jira/browse/SPARK-5043
> Project: Spark
> Issue Type: Sub-task
> Components: DStreams
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Priority: Major
> Labels: bulk-closed
>
> Implement the updated Receiver API based on this design:
> https://docs.google.com/document/d/1J66JCF1g-H1y2YnBnzdshTA1-hEJy3VvxXSEGgFpqws/edit?usp=sharing
[jira] [Resolved] (SPARK-4488) Add control over map-side aggregation
[ https://issues.apache.org/jira/browse/SPARK-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4488.

    Resolution: Incomplete

> Add control over map-side aggregation
>
> Key: SPARK-4488
> URL: https://issues.apache.org/jira/browse/SPARK-4488
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 1.1.0
> Reporter: Genmao Yu
> Priority: Minor
> Labels: bulk-closed
[jira] [Resolved] (SPARK-5713) Support python serialization for RandomForest
[ https://issues.apache.org/jira/browse/SPARK-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-5713.

    Resolution: Incomplete

> Support python serialization for RandomForest
>
> Key: SPARK-5713
> URL: https://issues.apache.org/jira/browse/SPARK-5713
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.2.0
> Environment: Tested on MacOS
> Reporter: Guillaume Charhon
> Priority: Major
> Labels: bulk-closed
>
> I was trying to pickle a trained random forest model. Unfortunately, it is impossible to serialize the model for future use:
> model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={}, numTrees=nb_tree, featureSubsetStrategy="auto", impurity='variance', maxDepth=depth)
> output = open('model.ml', 'wb')
> pickle.dump(model, output)
> I am getting this error:
> TypeError: can't pickle lock objects
> I am using Spark 1.2.0. I have also tested Spark 1.2.1.
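A hedged workaround sketch (not from the ticket): the model object wraps a JVM handle, which is where the un-picklable lock comes from. Later releases (1.3+) added native save/load on MLlib tree models, which sidesteps pickle entirely; the path below is a placeholder, and trainingData, nb_tree, and depth are the ticket's own variables.

{code:python}
from pyspark.mllib.tree import RandomForest, RandomForestModel

model = RandomForest.trainRegressor(
    trainingData, categoricalFeaturesInfo={}, numTrees=nb_tree,
    featureSubsetStrategy="auto", impurity="variance", maxDepth=depth)

model.save(sc, "hdfs:///models/rf-regressor")               # assumed output path
restored = RandomForestModel.load(sc, "hdfs:///models/rf-regressor")
{code}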
[jira] [Resolved] (SPARK-6798) Fix Date serialization in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-6798.

    Resolution: Incomplete

> Fix Date serialization in SparkR
>
> Key: SPARK-6798
> URL: https://issues.apache.org/jira/browse/SPARK-6798
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Reporter: Shivaram Venkataraman
> Assignee: Davies Liu
> Priority: Minor
> Labels: bulk-closed
>
> SparkR's date serialization currently sends strings from R to the JVM. We should convert this to integers and also account for timezones correctly by using DateUtils.
[jira] [Resolved] (SPARK-6497) Class is not registered: scala.reflect.ManifestFactory$$anon$9
[ https://issues.apache.org/jira/browse/SPARK-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-6497.

    Resolution: Incomplete

> Class is not registered: scala.reflect.ManifestFactory$$anon$9
>
> Key: SPARK-6497
> URL: https://issues.apache.org/jira/browse/SPARK-6497
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.5.2
> Reporter: Daniel Darabos
> Priority: Major
> Labels: bulk-closed
>
> This is a slight regression from Spark 1.2.1 to 1.3.0.
> {noformat}
> spark-1.2.1-bin-hadoop2.4/bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrationRequired=true --conf 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
> scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
> res0: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1)))
> {noformat}
> {noformat}
> spark-1.3.0-bin-hadoop2.4/bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrationRequired=true --conf 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
> scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
> Lost task 1.0 in stage 3.0 (TID 25, localhost): com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered: scala.reflect.ManifestFactory$$anon$9
> Note: To register this class use: kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
> Serialization trace:
> evidence$1 (org.apache.spark.util.collection.CompactBuffer)
> at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
> at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
> at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: Class is not registered: scala.reflect.ManifestFactory$$anon$9
> Note: To register this class use: kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
> at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
> at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:561)
> ... 13 more
> {noformat}
> In our production code the exception is actually about {{scala.reflect.ManifestFactory$$anon$8}} instead of {{scala.reflect.ManifestFactory$$anon$9}}, but it's probably the same thing. Any idea what changed from 1.2.1 to 1.3.0 that could be causing this?
> We also get exceptions in 1.3.0 for {{scala.reflect.ClassTag$$anon$1}} and {{java.lang.Class}}, but I haven't reduced them to a spark-shell reproduction yet. We can of course just register these classes ourselves.
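A hedged sketch of the workaround the report already mentions ("register these classes ourselves"): append the classes named in the exceptions to the existing registration list. Whether registering these JVM-internal anonymous classes by name behaves well across Spark versions is not something the ticket confirms.

{code:python}
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    .set(
        "spark.kryo.classesToRegister",
        ",".join([
            "scala.collection.mutable.WrappedArray$ofRef",
            "[Lscala.Tuple2;",
            # classes taken from the error messages above
            "scala.reflect.ManifestFactory$$anon$9",
            "scala.reflect.ManifestFactory$$anon$8",
            "scala.reflect.ClassTag$$anon$1",
            "java.lang.Class",
        ]),
    )
)
{code}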
[jira] [Resolved] (SPARK-2913) Spark's log4j.properties should always appear ahead of Hadoop's on classpath
[ https://issues.apache.org/jira/browse/SPARK-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-2913.

    Resolution: Incomplete

> Spark's log4j.properties should always appear ahead of Hadoop's on classpath
>
> Key: SPARK-2913
> URL: https://issues.apache.org/jira/browse/SPARK-2913
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.0.0, 1.0.2, 1.1.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Major
> Labels: bulk-closed
>
> In the current {{compute-classpath}} scripts, the Hadoop conf directory may appear before Spark's conf directory in the computed classpath. This leads to Hadoop's log4j.properties being used instead of Spark's, preventing users from easily changing Spark's logging settings.
> To fix this, we should add a new classpath entry for Spark's log4j.properties file.
[jira] [Resolved] (SPARK-3115) Improve task broadcast latency for small tasks
[ https://issues.apache.org/jira/browse/SPARK-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-3115.

    Resolution: Incomplete

> Improve task broadcast latency for small tasks
>
> Key: SPARK-3115
> URL: https://issues.apache.org/jira/browse/SPARK-3115
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Shivaram Venkataraman
> Assignee: Reynold Xin
> Priority: Major
> Labels: bulk-closed
>
> Broadcasting the task information helps reduce the amount of data transferred for large tasks. However, we've seen that this adds more latency for small tasks. It would be great to profile and fix this.
[jira] [Resolved] (SPARK-6815) Support accumulators in R
[ https://issues.apache.org/jira/browse/SPARK-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-6815.

    Resolution: Incomplete

> Support accumulators in R
>
> Key: SPARK-6815
> URL: https://issues.apache.org/jira/browse/SPARK-6815
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Shivaram Venkataraman
> Priority: Minor
> Labels: bulk-closed
>
> SparkR doesn't support accumulators right now. It might be good to add support for this to get feature parity with PySpark.
[jira] [Resolved] (SPARK-2253) [Core] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-2253.

    Resolution: Incomplete

> [Core] Disable partial aggregation automatically when reduction factor is low
>
> Key: SPARK-2253
> URL: https://issues.apache.org/jira/browse/SPARK-2253
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Reynold Xin
> Priority: Major
> Labels: bulk-closed
>
> Once we have seen enough rows in partial aggregation and don't observe any reduction, the Aggregator should just turn off partial aggregation. This reduces memory usage for high-cardinality aggregations.
> This ticket is for Spark core. There is another ticket tracking this for SQL.
[jira] [Resolved] (SPARK-4545) If first Spark Streaming batch fails, it waits 10x batch duration before stopping
[ https://issues.apache.org/jira/browse/SPARK-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-4545.

    Resolution: Incomplete

> If first Spark Streaming batch fails, it waits 10x batch duration before stopping
>
> Key: SPARK-4545
> URL: https://issues.apache.org/jira/browse/SPARK-4545
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 1.1.0, 1.2.1
> Reporter: Sean Owen
> Priority: Major
> Labels: bulk-closed
>
> (I'd like to track the issue raised at http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdKY=QCT0YUdrkvbVuqXdFCGp1+6g-=s71fk8zr4uat...@mail.gmail.com%3E as a JIRA since I think it's a legitimate issue that I can take a look into, with some help.)
> This bit of {{JobGenerator.stop()}} executes, since the message appears in the logs:
> {code}
> def haveAllBatchesBeenProcessed = {
>   lastProcessedBatch != null && lastProcessedBatch.milliseconds == stopTime
> }
> logInfo("Waiting for jobs to be processed and checkpoints to be written")
> while (!hasTimedOut && !haveAllBatchesBeenProcessed) {
>   Thread.sleep(pollTime)
> }
> // ... 10x batch duration wait here, before seeing the next line log:
> logInfo("Waited for jobs to be processed and checkpoints to be written")
> {code}
> I think that {{lastProcessedBatch}} is always null since no batch ever succeeds. Of course, for all this code knows, the next batch might succeed, and so it waits for it. But it should proceed after one more batch completes, even if that batch failed.
> {{JobGenerator.onBatchCompleted}} is only called for a successful batch. Can it be called when a batch fails too? I think that would fix it.
> Should the condition also not be {{lastProcessedBatch.milliseconds <= stopTime}} instead of ==?
[jira] [Resolved] (SPARK-1107) Add shutdown hook on executor stop to stop running tasks
[ https://issues.apache.org/jira/browse/SPARK-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-1107.

    Resolution: Incomplete

> Add shutdown hook on executor stop to stop running tasks
>
> Key: SPARK-1107
> URL: https://issues.apache.org/jira/browse/SPARK-1107
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 0.9.0
> Reporter: Andrew Ash
> Priority: Major
> Labels: bulk-closed
>
> Originally reported by aash: http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA%2B-p3AHXYhpjXH9fr8jQ5%2B_gc%3DNHjLbOiJB9bHSahfEET5aHBQ%40mail.gmail.com%3E
> Latest in thread: http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA+-p3AFi7vz=2oty3caa0g+5ekg+a84uvqrl9tgstvgwgyb...@mail.gmail.com%3E
> The most popular approach is to add a shutdown hook that stops running tasks in the executors.
[jira] [Resolved] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers
[ https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-636.

    Resolution: Incomplete

> Add mechanism to run system management/configuration tasks on all workers
>
> Key: SPARK-636
> URL: https://issues.apache.org/jira/browse/SPARK-636
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Josh Rosen
> Priority: Major
> Labels: bulk-closed
>
> It would be useful to have a mechanism to run a task on all workers in order to perform system management tasks, such as purging caches or changing system properties. This is useful for automated experiments and benchmarking; I don't envision this being used for heavy computation.
> Right now, I can mimic this with something like
> {code}
> sc.parallelize(0 until numMachines, numMachines).foreach { }
> {code}
> but this does not guarantee that every worker runs a task and requires my user code to know the number of workers.
> One sample use case is setup and teardown for benchmark tests. For example, I might want to drop cached RDDs, purge shuffle data, and call {{System.gc()}} between test runs. It makes sense to incorporate some of this functionality, such as dropping cached RDDs, into Spark itself, but it might be helpful to have a general mechanism for running ad-hoc tasks like {{System.gc()}}.
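A PySpark rendering of the workaround quoted in the description, with the same caveats: it only approximates "one task per worker" and the caller still has to know the worker count. num_machines and the per-worker action are placeholders.

{code:python}
num_machines = 8  # assumed; must match the cluster

def per_worker_maintenance(_):
    import gc
    gc.collect()  # stand-in for whatever management task is needed

sc.parallelize(range(num_machines), num_machines).foreach(per_worker_maintenance)
{code}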
[jira] [Resolved] (SPARK-2280) Java & Scala reference docs should describe function reference behavior.
[ https://issues.apache.org/jira/browse/SPARK-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2280. - Resolution: Incomplete > Java & Scala reference docs should describe function reference behavior. > > > Key: SPARK-2280 > URL: https://issues.apache.org/jira/browse/SPARK-2280 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Hans Uhlig >Priority: Minor > Labels: bulk-closed, starter > > Example > JavaPairRDD> groupBy(Function f) > Return an RDD of grouped elements. Each group consists of a key and a > sequence of elements mapping to that key. > T and K are not described and there is no explanation of what the function's > inputs and outputs should be and how GroupBy uses this information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2545) Add a diagnosis mode for closures to figure out what they're bringing in
[ https://issues.apache.org/jira/browse/SPARK-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2545. - Resolution: Incomplete > Add a diagnosis mode for closures to figure out what they're bringing in > > > Key: SPARK-2545 > URL: https://issues.apache.org/jira/browse/SPARK-2545 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Aaron Davidson >Priority: Major > Labels: bulk-closed > > Today, it's pretty hard to figure out why your closure is bigger than > expected, because it's not obvious what objects are being included or who is > including them. We should have some sort of diagnosis available to users with > very large closures that displays the contents of the closure. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4500) Improve exact stratified sampling implementation
[ https://issues.apache.org/jira/browse/SPARK-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4500. - Resolution: Incomplete > Improve exact stratified sampling implementation > > > Key: SPARK-4500 > URL: https://issues.apache.org/jira/browse/SPARK-4500 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > The current implementation for exact stratified sampling (sampleByKeyExact) > could be more efficient. Proposed algorithm sketch: > * Sampling is done separately for each stratum. Here, all counts are w.r.t. > a fixed stratum. > * Let n_partition = number of elements in a given partition > * This method uses 2-tiered sampling: > ** Tier 1 (on driver): Select the sample sizes {n_partition} for each > partition. > *** Without replacement, this uses a multivariate hypergeometric distribution. > *** With replacement, this should use a multinomial distribution. > ** Tier 2 (in parallel): Select a sample of size n_partition. > If anyone is interested in implementing this, I have a rough draft which > works without replacement, but it needs to be cleaned up and augmented to do > sampling with replacement too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
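For context, a short usage sketch of the existing exact sampler whose implementation the ticket above wants to speed up; the RDD contents, fractions, and seed below are made up for illustration.
{code}
// sampleByKeyExact draws a sample whose per-key (per-stratum) sizes match the requested
// fractions exactly; the ticket proposes a 2-tier (driver + per-partition) scheme to do
// this more efficiently than the current implementation.
val data: org.apache.spark.rdd.RDD[(String, Double)] = sc.parallelize(Seq(
  ("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0), ("b", 5.0)))
val fractions = Map("a" -> 0.5, "b" -> 0.4)  // one sampling fraction per stratum (key)
val exact = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 42L)
{code}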
[jira] [Resolved] (SPARK-5488) SPARK_LOCAL_IP not read by mesos scheduler
[ https://issues.apache.org/jira/browse/SPARK-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5488. - Resolution: Incomplete > SPARK_LOCAL_IP not read by mesos scheduler > -- > > Key: SPARK-5488 > URL: https://issues.apache.org/jira/browse/SPARK-5488 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 1.1.1 >Reporter: Martin Tapp >Priority: Minor > Labels: bulk-closed > > My environment sets SPARK_LOCAL_IP and my driver sees it, but Mesos sees the > address of my first available network adapter. > I can even see that SPARK_LOCAL_IP is read correctly by Utils.localHostName > and Utils.localIpAddress > (core/src/main/scala/org/apache/spark/util/Utils.scala); it seems the Spark Mesos > framework doesn't use it. > The workaround for now is to disable my first adapter so that the second one > becomes the one seen by Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3916) recognize appended data in textFileStream()
[ https://issues.apache.org/jira/browse/SPARK-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3916. - Resolution: Incomplete > recognize appended data in textFileStream() > --- > > Key: SPARK-3916 > URL: https://issues.apache.org/jira/browse/SPARK-3916 > Project: Spark > Issue Type: New Feature > Components: DStreams >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Major > Labels: bulk-closed > > Right now, we only find new data from new files, the data written to old > files (processed in last batch) will not be processed. > In order to support this, we need partialRDD(), which is an RDD for part of > file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features
[ https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5272. - Resolution: Incomplete > Refactor NaiveBayes to support discrete and continuous labels,features > -- > > Key: SPARK-5272 > URL: https://issues.apache.org/jira/browse/SPARK-5272 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed, clustering > > This JIRA is to discuss refactoring NaiveBayes in order to support both > discrete and continuous labels and features. > Currently, NaiveBayes supports only discrete labels and features. > Proposal: Generalize it to support continuous values as well. > Some items to discuss are: > * How commonly are continuous labels/features used in practice? (Is this > necessary?) > * What should the API look like? > ** E.g., should NB have multiple classes for each type of label/feature, or > should it take a general Factor type parameter? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5077) Map output statuses can still exceed spark.akka.frameSize
[ https://issues.apache.org/jira/browse/SPARK-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5077. - Resolution: Incomplete > Map output statuses can still exceed spark.akka.frameSize > - > > Key: SPARK-5077 > URL: https://issues.apache.org/jira/browse/SPARK-5077 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.2.0, 1.3.0, 1.4.1 >Reporter: Josh Rosen >Priority: Major > Labels: bulk-closed > > Since HighlyCompressedMapOutputStatuses uses a bitmap for tracking empty > blocks, its size is not bounded and thus Spark is still susceptible to > "MapOutputTrackerMasterActor: Map output statuses > were 11141547 bytes which exceeds spark.akka.frameSize"-type errors, even in > 1.2.0. > We needed to use a bitmap for tracking zero-sized blocks (see SPARK-3740; > this isn't just a performance issue; it's necessary for correctness). This > will require a bit more effort to fix, since we'll either have to find a way > to use a fixed size / capped size encoding for MapOutputStatuses (which might > require changes to let us fetch empty blocks safely) or figure out some other > strategy for shipping these statues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6026) Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle within the Sort shuffle code
[ https://issues.apache.org/jira/browse/SPARK-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6026. - Resolution: Incomplete > Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle > within the Sort shuffle code > - > > Key: SPARK-6026 > URL: https://issues.apache.org/jira/browse/SPARK-6026 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.3.0 >Reporter: Kay Ousterhout >Priority: Major > Labels: bulk-closed > > The bypassMergeThreshold parameter (and associated use of a hash-ish shuffle > when the number of partitions is less than this) is basically a workaround > for SparkSQL, because the fact that the sort-based shuffle stores > non-serialized objects is a deal-breaker for SparkSQL, which re-uses objects. > Once the sort-based shuffle is changed to store serialized objects, we > should never be secretly doing hash-ish shuffle even when the user has > specified to use sort-based shuffle (because of its otherwise worse > performance). > [~rxin][~adav], masters of shuffle, it would be helpful to get agreement from > you on this proposal (and also a sanity check that I've correctly > characterized the issue). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4144. - Resolution: Incomplete > Support incremental model training of Naive Bayes classifier > > > Key: SPARK-4144 > URL: https://issues.apache.org/jira/browse/SPARK-4144 > Project: Spark > Issue Type: Improvement > Components: DStreams, MLlib >Reporter: Chris Fregly >Priority: Major > Labels: bulk-closed > > Per Xiangrui Meng from the following user list discussion: > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E > > "For Naive Bayes, we need to update the priors and conditional > probabilities, which means we should also remember the number of > observations for the updates." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2408) RDD.map(func) dependencies issue after checkpoint & count
[ https://issues.apache.org/jira/browse/SPARK-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2408. - Resolution: Incomplete > RDD.map(func) dependencies issue after checkpoint & count > - > > Key: SPARK-2408 > URL: https://issues.apache.org/jira/browse/SPARK-2408 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.1, 1.0.0 >Reporter: Daniel Fry >Priority: Major > Labels: bulk-closed > > i am noticing strange behavior with a simple example use of rdd.checkpoint(). > you can paste the following code into any spark-shell (e.g. with > MASTER=local[*]) > // build an array of 100 random lowercase strings of length 10 > val r = new scala.util.Random() > val str_arr = (1 to 100).map(a => (1 to 10).map(b => new > Character(((Math.abs(r.nextInt) % 26) + 97).toChar)).mkString("")) > // make this into an rdd > val str_rdd = sc.parallelize(str_arr) > // checkpoint & count > sc.setCheckpointDir("hdfs://[namenode]:54310/path/to/some/spark_checkpoint_dir") > str_rdd.checkpoint() > str_rdd.count > // rdd.map some dummy function > def test(a : String) : String = { return a } > str_rdd.map(test).count > this results in a surprising exception! > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: scala.util.Random > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
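A hedged workaround sketch for the reproduction above (the ticket itself does not propose a fix): keeping the non-serializable scala.util.Random instance out of the enclosing shell scope may avoid dragging it into later closures; the variable names are made up for illustration.
{code}
// Build the random strings via the scala.util.Random companion object inside the mapping
// function, so no Random instance lives in the shell's enclosing scope and the later
// rdd.map(...) closure has nothing non-serializable to capture.
val strArr = (1 to 100).map { _ =>
  (1 to 10).map(_ => ('a' + scala.util.Random.nextInt(26)).toChar).mkString
}
val strRdd = sc.parallelize(strArr)
{code}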
[jira] [Resolved] (SPARK-4698) Data-locality aware Partitioners
[ https://issues.apache.org/jira/browse/SPARK-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4698. - Resolution: Incomplete > Data-locality aware Partitioners > > > Key: SPARK-4698 > URL: https://issues.apache.org/jira/browse/SPARK-4698 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kevin Mader >Priority: Minor > Labels: bulk-closed > > The current hash and range partitioner tools do not seem to respect the > existing data-locality. A 'dictionary' driven partitioner that calculated the > partitions based on the existing key locations instead of re-calculating them > would be ideal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3835) Spark applications that are killed should show up as "KILLED" or "CANCELLED" in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3835. - Resolution: Incomplete > Spark applications that are killed should show up as "KILLED" or "CANCELLED" > in the Spark UI > > > Key: SPARK-3835 > URL: https://issues.apache.org/jira/browse/SPARK-3835 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.1.0 >Reporter: Matt Cheah >Priority: Major > Labels: UI, bulk-closed > > Spark applications that crash or are killed are listed as FINISHED in the > Spark UI. > It looks like the Master only passes back a list of "Running" applications > and a list of "Completed" applications, All of the applications under > "Completed" have status "FINISHED", however if they were killed manually they > should show "CANCELLED", or if they failed they should read "FAILED". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4489) JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4489. - Resolution: Incomplete > JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException > - > > Key: SPARK-4489 > URL: https://issues.apache.org/jira/browse/SPARK-4489 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.1.0 >Reporter: Christopher Ng >Priority: Major > Labels: bulk-closed > > Calling collectAsMap() on a JavaPairRDD reconstructed from a checkpoint fails > with a ClassCastException: > Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; > cannot be cast to [Lscala.Tuple2; > at > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:595) > at > org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:569) > at org.facboy.spark.CheckpointBug.main(CheckpointBug.java:46) > Code sample reproducing the issue: > https://gist.github.com/facboy/8387e950ffb0746a8272 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
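A hedged driver-side workaround sketch for the failure above, written in Scala for brevity and assuming the pair RDD is small enough to collect; it sidesteps the PairRDDFunctions.collectAsMap path that throws the ClassCastException by building the map from a plain collect().
{code}
import org.apache.spark.rdd.RDD

// Workaround sketch: avoid collectAsMap() on the checkpoint-restored RDD and assemble
// the driver-side map from the collected tuples instead.
def collectAsLocalMap[K, V](pairs: RDD[(K, V)]): Map[K, V] =
  pairs.collect().toMap
{code}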
[jira] [Resolved] (SPARK-5091) Hooks for PySpark tasks
[ https://issues.apache.org/jira/browse/SPARK-5091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5091. - Resolution: Incomplete > Hooks for PySpark tasks > --- > > Key: SPARK-5091 > URL: https://issues.apache.org/jira/browse/SPARK-5091 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Davies Liu >Priority: Major > Labels: bulk-closed > > Currently, it's not convenient to add a package to PYTHONPATH on the executors (we > do not assume the environments of the driver and executors are identical). > It would be nice to have a hook called before/after every task; users could then > manipulate sys.path with pre-task hooks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3380) DecisionTree: overflow and precision in aggregation
[ https://issues.apache.org/jira/browse/SPARK-3380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3380. - Resolution: Incomplete > DecisionTree: overflow and precision in aggregation > --- > > Key: SPARK-3380 > URL: https://issues.apache.org/jira/browse/SPARK-3380 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > Labels: bulk-closed > > DecisionTree does not check for overflows or loss of precision while > aggregating sufficient statistics (binAggregates). It uses Double, which may > be a problem for DecisionTree regression since the variance calculation could > blow up. At the least, it could check for overflow and renormalize as needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3153) shuffle will run out of space when disks have different free space
[ https://issues.apache.org/jira/browse/SPARK-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3153. - Resolution: Incomplete > shuffle will run out of space when disks have different free space > -- > > Key: SPARK-3153 > URL: https://issues.apache.org/jira/browse/SPARK-3153 > Project: Spark > Issue Type: Bug > Components: Shuffle >Reporter: Davies Liu >Priority: Major > Labels: bulk-closed > > If we have several disks in SPARK_LOCAL_DIRS, and one of them is much smaller > than the others (maybe added by mistake, or a special disk such as an SSD), then the > shuffle will run out of space on this smaller disk. > PySpark also has this issue during spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4868) Twitter DStream.map() throws "Task not serializable"
[ https://issues.apache.org/jira/browse/SPARK-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4868. - Resolution: Incomplete > Twitter DStream.map() throws "Task not serializable" > > > Key: SPARK-4868 > URL: https://issues.apache.org/jira/browse/SPARK-4868 > Project: Spark > Issue Type: Bug > Components: DStreams, Spark Shell >Affects Versions: 1.1.1, 1.2.0 > Environment: * Spark 1.1.1 or 1.2.0 > * EC2 cluster with 1 slave spun up using {{spark-ec2}} > * twitter4j 3.0.3 > * {{spark-shell}} called with {{--jars}} argument to load > {{spark-streaming-twitter_2.10-1.0.0.jar}} as well as all the twitter4j jars. >Reporter: Nicholas Chammas >Priority: Minor > Labels: bulk-closed > > _(Continuing the discussion [started here on the Spark user > list|http://apache-spark-user-list.1001560.n3.nabble.com/NotSerializableException-in-Spark-Streaming-td5725.html].)_ > The following Spark Streaming code throws a serialization exception I do not > understand. > {code} > import twitter4j.auth.{Authorization, OAuthAuthorization} > import twitter4j.conf.ConfigurationBuilder > import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext} > import org.apache.spark.streaming.twitter.TwitterUtils > def getAuth(): Option[Authorization] = { > System.setProperty("twitter4j.oauth.consumerKey", "consumerKey") > System.setProperty("twitter4j.oauth.consumerSecret", "consumerSecret") > System.setProperty("twitter4j.oauth.accessToken", "accessToken") > System.setProperty("twitter4j.oauth.accessTokenSecret", "accessTokenSecret") > Some(new OAuthAuthorization(new ConfigurationBuilder().build())) > } > def noop(a: Any): Any = { > a > } > val ssc = new StreamingContext(sc, Seconds(5)) > val liveTweetObjects = TwitterUtils.createStream(ssc, getAuth()) > val liveTweets = liveTweetObjects.map(_.getText) > liveTweets.map(t => noop(t)).print() // exception here > ssc.start() > {code} > So before I even start the StreamingContext, I get the following stack trace: > {code} > scala> liveTweets.map(t => noop(t)).print() > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) > at org.apache.spark.SparkContext.clean(SparkContext.scala:1264) > at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:438) > at $iwC$$iwC$$iwC$$iwC.(:27) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > at (:38) > at .(:42) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) > at 
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) > at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.in
[jira] [Resolved] (SPARK-4653) DAGScheduler refactoring and cleanup
[ https://issues.apache.org/jira/browse/SPARK-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4653. - Resolution: Incomplete > DAGScheduler refactoring and cleanup > > > Key: SPARK-4653 > URL: https://issues.apache.org/jira/browse/SPARK-4653 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Labels: bulk-closed > > This is an umbrella JIRA for DAGScheduler refactoring and cleanup. Please > comment or open sub-issues if you have refactoring suggestions that should > fall under this umbrella. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3717. - Resolution: Incomplete > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
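A small arithmetic check of the example in the ticket above, written as a hedged Scala sketch. Note the ticket's own estimates plug in the number of levels (6) on the feature-partitioning side and 2^D with D=5 on the instance-partitioning side; those numbers are reproduced as-is rather than "corrected".
{code}
// Reproduce the ticket's cost estimates for N = 2e6 instances, M = 3500 features,
// B = 100 bins, C = 5 classes, depth 5 (6 levels).
val (levels, n, m, b, c) = (6, 2e6, 3500.0, 100.0, 5.0)
val featurePartitionCost  = levels * (m * 8 + n)            // ~1.2e7, matches the ticket
val instancePartitionCost = math.pow(2, 5) * m * b * c * 8  // ~4.5e8, matches the ticket
println(f"$featurePartitionCost%.3g vs $instancePartitionCost%.3g")
{code}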
[jira] [Resolved] (SPARK-5506) java.lang.ClassCastException using lambda expressions in combination of spark and Servlet
[ https://issues.apache.org/jira/browse/SPARK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5506. - Resolution: Incomplete > java.lang.ClassCastException using lambda expressions in combination of spark > and Servlet > - > > Key: SPARK-5506 > URL: https://issues.apache.org/jira/browse/SPARK-5506 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 1.2.0 > Environment: spark server: Ubuntu 14.04 amd64 > $ java -version > java version "1.8.0_25" > Java(TM) SE Runtime Environment (build 1.8.0_25-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode) >Reporter: Milad Khajavi >Priority: Major > Labels: bulk-closed > > I'm trying to build a web API for my Apache spark jobs using sparkjava.com > framework. My code is: > @Override > public void init() { > get("/hello", > (req, res) -> { > String sourcePath = "hdfs://spark:54310/input/*"; > SparkConf conf = new SparkConf().setAppName("LineCount"); > conf.setJars(new String[] { > "/home/sam/resin-4.0.42/webapps/test.war" }); > File configFile = new File("config.properties"); > String sparkURI = "spark://hamrah:7077"; > conf.setMaster(sparkURI); > conf.set("spark.driver.allowMultipleContexts", "true"); > JavaSparkContext sc = new JavaSparkContext(conf); > @SuppressWarnings("resource") > JavaRDD log = sc.textFile(sourcePath); > JavaRDD lines = log.filter(x -> { > return true; > }); > return lines.count(); > }); > } > If I remove the lambda expression or put it inside a simple jar rather than a > web service (somehow a Servlet) it will run without any error. But using a > lambda expression inside a Servlet will result this exception: > 15/01/28 10:36:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > hamrah): java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.f$1 of type > org.apache.spark.api.java.function.Function in instance of > org.apache.spark.api.java.JavaRDD$$anonfun$filter$1 > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2089) > at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1999) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > P.S: I tried combination of jersey and javaspark with jetty, tomcat and resin > and all of them led me to the same result. > Here the same issue: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-java-lang-ClassCastException-SerializedLambda-to-org-apache-spark-api-java-function-Fu1-tt21261.html > This is my colleague question in stackoverflow: > http://stackoverflow.com/questions/28186607/java-lang-classcastexception-using-lambda-expressions-in-spark-job-on-remote-ser -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e
[jira] [Resolved] (SPARK-5497) start-all script not working properly on Standalone HA cluster (with Zookeeper)
[ https://issues.apache.org/jira/browse/SPARK-5497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5497. - Resolution: Incomplete > start-all script not working properly on Standalone HA cluster (with > Zookeeper) > --- > > Key: SPARK-5497 > URL: https://issues.apache.org/jira/browse/SPARK-5497 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.2.0 >Reporter: Roque Vassal'lo >Priority: Major > Labels: Configuration, Deployment, Spark, bulk-closed, start > > I have configured a Standalone HA cluster with Zookeeper with: > - 3 Zookeeper nodes > - 2 Spark master nodes (1 alive and 1 in standby mode) > - 2 Spark slave nodes > While executing start-all.sh on each master, it will start the master and > start a worker on each configured slave. > If alive master goes down, those worker are supposed to reconfigure > themselves to use the new active master automatically. > I have noticed that the spark-env property SPARK_MASTER_IP is used in both > called scripts, start-master and start-slaves. > The problem is that if you configure SPARK_MASTER_IP with the active master > ip, when it goes down, workers don't reassign themselves to the new active > master. > And if you configure SPARK_MASTER_IP with the masters cluster route (well, an > approximation, because you have to write master's port in all-but-last ips, > that is "master1:7077,master2", in order to make it work), slaves start > properly but master doesn't. > So, the start-master script needs SPARK_MASTER_IP property to contain its ip > in order to start master properly; and start-slaves script needs > SPARK_MASTER_IP property to contain the masters cluster ips (that is > "master1:7077,master2") > To test that idea, I have modified start-slaves and spark-env scripts on > master nodes. > On spark-env.sh, I have set SPARK_MASTER_IP property to master's own ip on > each master node (that is, on master node 1, SPARK_MASTER_IP=master1; and on > master node 2, SPARK_MASTER_IP=master2) > On spark-env.sh, I have added a new property SPARK_MASTER_CLUSTER_IP with the > pseudo-masters-cluster-ips (SPARK_MASTER_CLUSTER_IP=master1:7077,master2) on > both masters. > On start-slaves.sh, I have modified all references to SPARK_MASTER_IP to > SPARK_MASTER_CLUSTER_IP. > I have tried that and it works great! When active master node goes down, all > workers reassign themselves to the new active node. > Maybe there is a better fix for this issue. > Hope this quick-fix idea can help. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5685) Show warning when users open text files compressed with non-splittable algorithms like gzip
[ https://issues.apache.org/jira/browse/SPARK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5685. - Resolution: Incomplete > Show warning when users open text files compressed with non-splittable > algorithms like gzip > --- > > Key: SPARK-5685 > URL: https://issues.apache.org/jira/browse/SPARK-5685 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nicholas Chammas >Priority: Minor > Labels: bulk-closed > > This is a usability or user-friendliness issue. > It's extremely common for people to load a text file compressed with gzip, > process it, and then wonder why only 1 core in their cluster is doing any > work. > Some examples: > * http://stackoverflow.com/q/28127119/877069 > * http://stackoverflow.com/q/27531816/877069 > I'm not sure how this problem can be generalized, but at the very least it > would be helpful if Spark displayed some kind of warning in the common case > when someone opens a gzipped file with {{sc.textFile}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
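A hedged illustration of the symptom the ticket above wants to warn about, with a made-up HDFS path: a single gzip file is not splittable, so the resulting RDD has one partition and one core does all the work.
{code}
// Hypothetical gzipped input; everything downstream of this textFile runs in a single
// task until the data is repartitioned (or stored in a splittable format).
val logs = sc.textFile("hdfs:///data/access_log.gz")
println(logs.partitions.length)    // typically 1 for a single .gz file
val spread = logs.repartition(64)  // common manual workaround after reading
{code}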
[jira] [Resolved] (SPARK-5045) Update FlumePollingReceiver to use updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5045. - Resolution: Incomplete > Update FlumePollingReceiver to use updated Receiver API > --- > > Key: SPARK-5045 > URL: https://issues.apache.org/jira/browse/SPARK-5045 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Reporter: Tathagata Das >Assignee: Hari Shreedharan >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6462) UpdateStateByKey should allow inner join of new with old keys
[ https://issues.apache.org/jira/browse/SPARK-6462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6462. - Resolution: Incomplete > UpdateStateByKey should allow inner join of new with old keys > - > > Key: SPARK-6462 > URL: https://issues.apache.org/jira/browse/SPARK-6462 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 1.3.0 >Reporter: Andre Schumacher >Priority: Major > Labels: bulk-closed > > In a nutshell: provide an inner join instead of a cogroup for > updateStateByKey in StateDStream. > Details: > It is common to read data (say, weblog data) from a streaming source (say, > Kafka) and each time update the state of a relatively small number of keys. > If only the state changes need to be propagated to a downstream sink, then instead of > making the user program filter out unchanged state, the API could > provide this functionality directly (say, by adding an > updateStateChangesByKey method). > Note that this is related but not identical to: > https://issues.apache.org/jira/browse/SPARK-2629 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
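A hedged sketch of the status quo the ticket above wants to improve, with made-up types and stream names: with updateStateByKey the user has to carry a "changed" flag in the state and filter on it to emit only the keys touched in the current batch.
{code}
import org.apache.spark.streaming.dstream.DStream

case class Counter(total: Long, changedThisBatch: Boolean)  // hypothetical state type

// `events` is a hypothetical DStream of (key, increment) pairs from Kafka or similar.
def changedStateOnly(events: DStream[(String, Long)]): DStream[(String, Counter)] = {
  val updated = events.updateStateByKey[Counter] { (newValues: Seq[Long], old: Option[Counter]) =>
    val prev = old.map(_.total).getOrElse(0L)
    Some(Counter(prev + newValues.sum, changedThisBatch = newValues.nonEmpty))
  }
  // The manual filtering a built-in updateStateChangesByKey could make unnecessary.
  updated.filter { case (_, state) => state.changedThisBatch }
}
{code}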
[jira] [Resolved] (SPARK-4540) Improve Executor ID Logging
[ https://issues.apache.org/jira/browse/SPARK-4540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4540. - Resolution: Incomplete > Improve Executor ID Logging > --- > > Key: SPARK-4540 > URL: https://issues.apache.org/jira/browse/SPARK-4540 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Arun Ahuja >Priority: Minor > Labels: bulk-closed > > A few things that would be useful here: > - An executor should log which executor ID it is running as; AFAICT this does not > happen, and only the driver reports that executor 10 is running on xyz.host.com > - For YARN, when an executor fails, in addition to reporting the executor ID > of the lost executor, report the container ID as well > The latter is useful when multiple executors run on the same machine, where > it may be easier to find the container directly than via the executor ID or > host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3134) Update block locations asynchronously in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3134. - Resolution: Incomplete > Update block locations asynchronously in TorrentBroadcast > - > > Key: SPARK-3134 > URL: https://issues.apache.org/jira/browse/SPARK-3134 > Project: Spark > Issue Type: Sub-task > Components: Block Manager >Reporter: Reynold Xin >Priority: Major > Labels: bulk-closed > > Once the TorrentBroadcast gets the data blocks, it needs to tell the master > the new location. We should make the location update non-blocking to reduce > roundtrips we need to launch tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5674) Spark Job Explain Plan Proof of Concept
[ https://issues.apache.org/jira/browse/SPARK-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5674. - Resolution: Incomplete > Spark Job Explain Plan Proof of Concept > --- > > Key: SPARK-5674 > URL: https://issues.apache.org/jira/browse/SPARK-5674 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Priority: Major > Labels: bulk-closed > > This is just a prototype of creating an explain plan for a job. Code can be > found here: https://github.com/ksakellis/spark/tree/kostas-explainPlan-poc > The code was written very quickly and so doesn't have any comments, tests and > is probably buggy - hence it being a proof of concept. > *How to Use* > # {code}sc.explainOn <=> sc.explainOff{code} This will generate the explain > plain and print it in the logs > # {code}sc.enableExecution <=> sc.disableExecution{code} This will disable > executing of the job. > Using these two knobs a user can choose to print the explain plan and/or > disable the running of the job if they only want to see the plan. > *Implementation* > This is only a prototype and it is by no means production ready. The code is > pretty hacky in places and a few shortcuts were made just to get the > prototype working. > The most interesting part of this commit is in the ExecutionPlanner.scala > class. This class creates its own private instance of the DAGScheduler and > passes into it a NoopTaskScheduler. The NoopTaskScheduler receives the > created TaskSets from the DAGScheduler and records the stages -> tasksets. > The NoopTaskScheduler also creates fake CompletionsEvents and sends them to > the DAGScheduler to move the scheduling along. It is done this way so that we > can use the DAGScheduler unmodified thus reducing code divergence. > The rest of the code is about processing the information produced by the > ExecutionPlanner, creating a DAG with a bunch of metadata and printing it as > a pretty ascii drawing. For drawing the DAG, > https://github.com/mdr/ascii-graphs is used. This was just easier again to > prototype. > *How is this different than RDD#toDebugString?* > The execution planner runs the job through the entire DAGScheduler so we can > collect some metrics that are not presently available in the debugString. For > example, we can report the binary size of the task which might be important > if the closures are referencing large object. > In addition, because we execute the scheduler code from an action, we can get > a more accurate picture of where the stage boundaries and dependencies. An > action such ask treeReduce will generate a number of stages that you can't > get just by doing .toDebugString on the rdd. > *Limitations of this Implementation* > Because this is a prototype there are is a lot of lame stuff in this commit. > # All of the code in SparkContext in particular sucks. This adds some code in > the runJob() call and when it gets the plan it just writes it to the INFO > log. We need to find a better way of exposing the plan to the caller so that > they can print it, analyze it etc. Maybe we can use implicits or something? > Not sure how best to do this yet. > # Some of the actions will return through exceptions because we are basically > faking a runJob(). If you want ot try this, it is best to just use count() > instead of say collect(). This will get fixed when we fix 1) > # Because the ExplainPlanner creates its own DAGScheduler, there currently is > no way to map the real stages to the "explainPlan" stages. 
So if a user turns > on explain plan, and doesn't disable execution, we can't automatically add > more metrics to the explain plan as they become available. The stageId in the > plan and the stageId in the real scheduler will be different. This is > important for when we add it to the webUI and users can track progress on the > DAG > # We are using https://github.com/mdr/ascii-graphs to draw the DAG - not sure > if we want to depend on that project. > *Next Steps* > # It would be good to get a few people to take a look at the code > specifically at how the plan gets generated. Clone the package and give it a > try with some of your jobs > # If the approach looks okay overall, I can put together a mini design doc > and add some answers to the above limitations of this approach. > #Feedback most welcome. > *Example Code:* > {code} > sc.explainOn > sc.disableExecution > val rdd = sc.parallelize(1 to 10, 4).map(key => (key.toString, key)) > val rdd2 = sc.parallelize(1 to 5, 2).map(key => (key.toString, key)) > rdd.join(rdd2) >.count() > {code} > *Example Output:* > {noformat} > EXPLAIN PLAN: > +---+ +---+ > | | | | > |Stage: 0 @ map | |Stage: 1 @ m
[jira] [Resolved] (SPARK-6165) Aggregate and reduce should be able to work with very large number of tasks.
[ https://issues.apache.org/jira/browse/SPARK-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6165. - Resolution: Incomplete > Aggregate and reduce should be able to work with very large number of tasks. > > > Key: SPARK-6165 > URL: https://issues.apache.org/jira/browse/SPARK-6165 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > Labels: bulk-closed > > To prevent data from workers causing OOM at the master, we have the property > 'spark.driver.maxResultSize'. > But the OOM at the master can be due to two reasons: > a) Data being sent from workers is too large, causing OOM at the master. > b) A large number of moderate (to low) sized results being sent to the master, causing > OOM. > (For example: 500k tasks, 1k each) > spark.driver.maxResultSize protects against both, but (b) should be handled > more gracefully by the master: for example, spool it to disk, aggregate without > waiting for the entire result set to be fetched, etc. > Currently we are forced to use treeReduce and co. to work around this problem, > adding to the latency of jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
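For reference, a hedged sketch of the treeReduce workaround mentioned in the last sentence of the ticket above, with a hypothetical RDD: partial results are combined in extra intermediate rounds, so the driver never has to hold one result per task at once.
{code}
import org.apache.spark.rdd.RDD

// With depth > 1, executors pre-combine results in intermediate stages, trading extra
// latency (the cost the ticket complains about) for a smaller load on the driver.
def safeSum(values: RDD[Long]): Long =
  values.treeReduce((a, b) => a + b, depth = 3)
{code}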
[jira] [Resolved] (SPARK-2581) complete or withdraw visitedStages optimization in DAGScheduler’s stageDependsOn
[ https://issues.apache.org/jira/browse/SPARK-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2581. - Resolution: Incomplete > complete or withdraw visitedStages optimization in DAGScheduler’s > stageDependsOn > > > Key: SPARK-2581 > URL: https://issues.apache.org/jira/browse/SPARK-2581 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Aaron Staple >Priority: Minor > Labels: bulk-closed > > Right now the visitedStages HashSet is populated with stages, but never > queried to limit examination of previously visited stages. It may make sense > to check whether a mapStage has been visited previously before visiting it > again, as in the nearby visitedRdds check. Or it may be that the existing > visitedRdds check sufficiently optimizes this function, and visitedStages can > simply be removed. > See discussion here: > https://github.com/apache/spark/pull/1362#discussion-diff-15018046L1107 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun
[ https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5490. - Resolution: Incomplete > KMeans costs can be incorrect if tasks need to be rerun > --- > > Key: SPARK-5490 > URL: https://issues.apache.org/jira/browse/SPARK-5490 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Priority: Major > Labels: bulk-closed, clustering > > KMeans uses accumulators inside ShuffleMapTasks to compute the cost of a clustering at each > iteration. > Each time a ShuffleMapTask completes, it increments the accumulators at the > driver; if a task runs twice because of failures, the accumulators get > incremented twice. This means that a task's contribution to the cost > can end up being double-counted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
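A hedged, KMeans-free illustration of the pattern the ticket above describes, using the Spark 1.x accumulator API: incrementing an accumulator inside a transformation means every task attempt adds to it, so a retried task contributes twice.
{code}
// Toy example of the double-counting pitfall: the accumulator update lives inside a
// transformation, so re-executed tasks (failures, speculative attempts) add their
// contribution again. KMeans' per-iteration cost accumulator has the same shape.
val cost = sc.accumulator(0.0)
val data = sc.parallelize(1 to 1000000, 100)
val centered = data.map { x => cost += x.toDouble; x - 1 }
centered.count()
println(cost.value)  // may exceed the true sum if any task attempt was retried
{code}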
[jira] [Resolved] (SPARK-5142) Possibly data may be ruined in Spark Streaming's WAL mechanism.
[ https://issues.apache.org/jira/browse/SPARK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5142. - Resolution: Incomplete > Possibly data may be ruined in Spark Streaming's WAL mechanism. > --- > > Key: SPARK-5142 > URL: https://issues.apache.org/jira/browse/SPARK-5142 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.2.0 >Reporter: Saisai Shao >Priority: Major > Labels: bulk-closed > > Currently in Spark Streaming's WAL manager, data is written into HDFS > with multiple retries on failure. Because there is no transactional > guarantee, previously partially-written data is not rolled back and the retried > data is appended after it; this can corrupt the file and make the > WriteAheadLogReader fail when reading it back. > Firstly, I think this problem is hard to fix because HDFS does not support a > truncate operation (HDFS-3107) or random writes at a specific offset. > Secondly, I think that when we hit such a write exception it is better not to retry, > since retrying corrupts the file and makes reads fail. > Sorry if I misunderstand anything. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5480. - Resolution: Incomplete > GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: > --- > > Key: SPARK-5480 > URL: https://issues.apache.org/jira/browse/SPARK-5480 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0, 1.3.1 > Environment: Yarn client >Reporter: Stephane Maarek >Priority: Major > Labels: bulk-closed > > Running the following code: > val subgraph = graph.subgraph ( > vpred = (id,article) => //working predicate) > ).cache() > println( s"Subgraph contains ${subgraph.vertices.count} nodes and > ${subgraph.edges.count} edges") > val prGraph = subgraph.staticPageRank(5).cache > val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) { > (v, title, rank) => (rank.getOrElse(0.0), title) > } > titleAndPrGraph.vertices.top(13) { > Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1) > }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1)) > Returns a graph with 5000 nodes and 4000 edges. > Then it crashes during the PageRank with the following: > 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage > 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes) > 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage > 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64) > at > org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91) > at > org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) > at > org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110) > at > org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.
[jira] [Resolved] (SPARK-4885) Enable fetched blocks to exceed 2 GB
[ https://issues.apache.org/jira/browse/SPARK-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4885. - Resolution: Incomplete > Enable fetched blocks to exceed 2 GB > > > Key: SPARK-4885 > URL: https://issues.apache.org/jira/browse/SPARK-4885 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Sandy Ryza >Priority: Major > Labels: bulk-closed > > {code} > 14/12/18 09:53:13 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught > exception in thread Thread[handle-message-executor-12,5,main] > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.util.Arrays.copyOf(Arrays.java:2271) > at > java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at > java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at > java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) > at com.esotericsoftware.kryo.io.Output.flush(Output.java:155) > at > com.esotericsoftware.kryo.io.Output.require(Output.java:135) > at > com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477) > at > com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200) > at > com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at > com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) > at > org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:119) > at > org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110) > at > org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1054) > at > org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1063) > at > org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:154) > at > org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:428) > at > org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:394) > at > org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:100) > at > org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:79) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1921) Allow duplicate jar files among the app jar and secondary jars in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-1921. - Resolution: Incomplete > Allow duplicate jar files among the app jar and secondary jars in > yarn-cluster mode > --- > > Key: SPARK-1921 > URL: https://issues.apache.org/jira/browse/SPARK-1921 > Project: Spark > Issue Type: Sub-task > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng >Priority: Minor > Labels: bulk-closed > > In yarn-cluster mode, jars are uploaded to a staging folder on hdfs. If there > are duplicates among the app jar and secondary jars, there will be overwrites > that cause inconsistent timestamps. I saw the following message: > {code} > Application application_1400965808642_0021 failed 2 times due to AM Container > for appattempt_1400965808642_0021_02 exited with exitCode: -1000 due to: > Resource > hdfs://localhost.localdomain:8020/user/cloudera/.sparkStaging/application_1400965808642_0021/app_2.10-0.1.jar > changed on src filesystem (expected 1400998721965, was 1400998723123 > {code} > Tested on a CDH-5 quickstart VM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3631. - Resolution: Incomplete > Add docs for checkpoint usage > - > > Key: SPARK-3631 > URL: https://issues.apache.org/jira/browse/SPARK-3631 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Andrew Ash >Priority: Major > Labels: bulk-closed > > We should include general documentation on using checkpoints. Right now the > docs only cover checkpoints in the Spark Streaming use case which is slightly > different from Core. > Some content to consider for inclusion from [~brkyvz]: > {quote} > If you set the checkpointing directory however, the intermediate state of the > RDDs will be saved in HDFS, and the lineage will pick off from there. > You won't need to keep the shuffle data before the checkpointed state, > therefore those can be safely removed (will be removed automatically). > However, checkpoint must be called explicitly as in > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 > ,just setting the directory will not be enough. > {quote} > {quote} > Yes, writing to HDFS is more expensive, but I feel it is still a small price > to pay when compared to having a Disk Space Full error three hours in > and having to start from scratch. > The main goal of checkpointing is to truncate the lineage. Clearing up > shuffle writes come as a bonus to checkpointing, it is not the main goal. The > subtlety here is that .checkpoint() is just like .cache(). Until you call an > action, nothing happens. Therefore, if you're going to do 1000 maps in a > row and you don't want to checkpoint in the meantime until a shuffle happens, > you will still get a StackOverflowError, because the lineage is too long. > I went through some of the code for checkpointing. As far as I can tell, it > materializes the data in HDFS, and resets all its dependencies, so you start > a fresh lineage. My understanding would be that checkpointing still should be > done every N operations to reset the lineage. However, an action must be > performed before the lineage grows too long. > {quote} > A good place to put this information would be at > https://spark.apache.org/docs/latest/programming-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
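A minimal sketch of the checkpoint-then-action pattern described in SPARK-3631 above (illustrative only, not Spark documentation; the local master, the /tmp checkpoint directory, and the every-10-iterations cadence are assumptions):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))
    // Setting the directory alone does nothing; checkpoint() must be called explicitly,
    // and an action must run before the lineage grows too long.
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    var rdd = sc.parallelize(1 to 1000)
    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)
      if (i % 10 == 0) {
        rdd.checkpoint() // like cache(): lazy until an action runs
        rdd.count()      // the action materializes the checkpoint and truncates the lineage
      }
    }
    sc.stop()
  }
}
{code}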
[jira] [Resolved] (SPARK-6378) srcAttr in graph.triplets don't update when the size of graph is huge
[ https://issues.apache.org/jira/browse/SPARK-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6378. - Resolution: Incomplete > srcAttr in graph.triplets don't update when the size of graph is huge > - > > Key: SPARK-6378 > URL: https://issues.apache.org/jira/browse/SPARK-6378 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.1 >Reporter: zhangzhenyue >Priority: Major > Labels: bulk-closed > Attachments: TripletsViewDonotUpdate.scala > > > When the size of the graph is huge (0.2 billion vertices, 6 billion edges), the > srcAttr and dstAttr in graph.triplets don't update after > Graph.outerJoinVertices changes the vertex data. > The code and the log are as follows: > {quote} > g = graph.outerJoinVertices()... > g.vertices.count() > g.edges.count() > println("example edge " + g.triplets.filter(e => e.srcId == > 51L).collect() > .map(e =>(e.srcId + ":" + e.srcAttr + ", " + e.dstId + ":" + > e.dstAttr)).mkString("\n")) > println("example vertex " + g.vertices.filter(e => e._1 == > 51L).collect() > .map(e => (e._1 + "," + e._2)).mkString("\n")) > {quote} > The result: > {quote} > example edge 51:0, 2467451620:61 > 51:0, 1962741310:83 // attr of vertex 51 is 0 in > Graph.triplets > example vertex 51,2 // attr of vertex 51 is 2 in > Graph.vertices > {quote} > When the graph is smaller (10 million vertices), the code is OK: the triplets > update when the vertex data is changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app
[ https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6415. - Resolution: Incomplete > Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills > the app > - > > Key: SPARK-6415 > URL: https://issues.apache.org/jira/browse/SPARK-6415 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Hari Shreedharan >Priority: Major > Labels: bulk-closed > > Of course, this would have to be done as a configurable param, but such a > fail-fast is useful; otherwise it is painful to figure out what is happening when > there are cascading failures. In some cases, the SparkContext shuts down and > streaming keeps scheduling jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6148) cachedDataSourceTables may store outdated metadata if the table is updated from another HiveContext
[ https://issues.apache.org/jira/browse/SPARK-6148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6148. - Resolution: Incomplete > cachedDataSourceTables may store outdated metadata if the table is updated > from another HiveContext > --- > > Key: SPARK-6148 > URL: https://issues.apache.org/jira/browse/SPARK-6148 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > Labels: bulk-closed > > If we have two HiveContexts and we change a table through one (e.g. append or > overwrite with new data), cachedDataSourceTables in the other one will not be > updated automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6637) Test lambda weighting in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6637. - Resolution: Incomplete > Test lambda weighting in implicit ALS > - > > Key: SPARK-6637 > URL: https://issues.apache.org/jira/browse/SPARK-6637 > Project: Spark > Issue Type: Task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Priority: Major > Labels: bulk-closed > > As discussed on the user-list, there is some inconsistency between 1.2 and > 1.3 on how we scale lambda in the implicit formulation. In 1.2 we scale by > the number of explicit ratings, while in 1.3 we scale by the number of > users/items. > https://www.mail-archive.com/user@spark.apache.org/msg24817.html > It is worth testing which is more proper for implicit feedback. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6312) ChiSqTest should check for too few counts
[ https://issues.apache.org/jira/browse/SPARK-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6312. - Resolution: Incomplete > ChiSqTest should check for too few counts > - > > Key: SPARK-6312 > URL: https://issues.apache.org/jira/browse/SPARK-6312 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > Labels: bulk-closed > > ChiSqTest assumes that elements of the contingency matrix are large enough > (have enough counts) s.t. the central limit theorem kicks in. It would be > reasonable to do one or more of the following: > * Add a note in the docs about making sure there are a reasonable number of > instances being used (or counts in the contingency table entries, to be more > precise and account for skewed category distributions). > * Add a check in the code which could: > ** Log a warning message > ** Alter the p-value to make sure it indicates the test result is > insignificant -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
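A rough sketch of the low-count warning proposed in SPARK-6312 above (plain Scala, not MLlib code; the threshold of 5 expected counts per cell is the usual rule of thumb and is assumed here):
{code}
object ChiSqCountCheck {
  // Common rule-of-thumb threshold for the chi-squared approximation; assumed, not from MLlib.
  val minExpectedCount = 5.0

  def warnOnLowCounts(counts: Array[Array[Double]]): Unit = {
    val rowSums = counts.map(_.sum)
    val colSums = counts.transpose.map(_.sum)
    val total = rowSums.sum
    for (i <- counts.indices; j <- counts(i).indices) {
      val expected = rowSums(i) * colSums(j) / total
      if (expected < minExpectedCount) {
        Console.err.println(
          s"WARN: expected count $expected in cell ($i, $j) is below $minExpectedCount; " +
            "the chi-squared p-value may be unreliable")
      }
    }
  }
}
{code}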
[jira] [Resolved] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3306. - Resolution: Incomplete > Addition of external resource dependency in executors > - > > Key: SPARK-3306 > URL: https://issues.apache.org/jira/browse/SPARK-3306 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Yan >Priority: Major > Labels: bulk-closed > > Currently, Spark executors only support static and read-only external > resources of side files and jar files. With emerging disparate data sources, > there is a need to support more versatile external resources, such as > connections to data sources, to facilitate efficient data access to the > sources. For one, the JDBCRDD, with some modifications, could benefit from > this feature by reusing JDBC connections previously established in the same Spark > context. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5748) Improve Vectors.sqdist implementation
[ https://issues.apache.org/jira/browse/SPARK-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5748. - Resolution: Incomplete > Improve Vectors.sqdist implementation > - > > Key: SPARK-5748 > URL: https://issues.apache.org/jira/browse/SPARK-5748 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: bulk-closed > > Saw some regression of k-means in the 1.3 performance tests. I think the problem > is that the sqdist implementation is slow, but I am not very sure. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5372) Change the default storage level of window operators
[ https://issues.apache.org/jira/browse/SPARK-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5372. - Resolution: Incomplete > Change the default storage level of window operators > > > Key: SPARK-5372 > URL: https://issues.apache.org/jira/browse/SPARK-5372 > Project: Spark > Issue Type: Task > Components: DStreams >Affects Versions: 1.2.0 >Reporter: Saisai Shao >Priority: Minor > Labels: bulk-closed > > The current storage level of window operators is MEMORY_ONLY_SER; if the memory > is not enough to hold all the window data, the cached RDD will be discarded, > which leads to unexpected behavior. > Besides, the default storage level of input data is MEMORY_AND_DISK_SER_2, so it > is better to align with it and change the storage level of > window operators to MEMORY_AND_DISK_SER. > This change has no effect when memory is sufficient, so I'd propose changing > the default storage level to MEMORY_AND_DISK_SER. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
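Until such a default change lands, callers can already choose the storage level explicitly. A minimal sketch of overriding the window storage level (the socket source, window sizes, and local master are assumptions, not part of SPARK-5372):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowPersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-persist-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Override the MEMORY_ONLY_SER default on the windowed stream so blocks
    // spill to disk instead of being dropped when memory is tight.
    val windowed = lines
      .window(Seconds(30), Seconds(10))
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
    windowed.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}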
[jira] [Resolved] (SPARK-6808) Checkpointing after zipPartitions results in NODE_LOCAL execution
[ https://issues.apache.org/jira/browse/SPARK-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6808. - Resolution: Incomplete > Checkpointing after zipPartitions results in NODE_LOCAL execution > - > > Key: SPARK-6808 > URL: https://issues.apache.org/jira/browse/SPARK-6808 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.3.0 > Environment: EC2 Ubuntu r3.8xlarge machines >Reporter: Xinghao Pan >Priority: Minor > Labels: bulk-closed > > I'm encountering a weird issue where a simple iterative zipPartitions is > PROCESS_LOCAL before checkpointing, but turns NODE_LOCAL for all iterations > after checkpointing. More often than not, tasks are fetching remote blocks > from the network, leading to a 10x increase in runtime. > Here's an example snippet of code: > var R : RDD[(Long,Int)] > = sc.parallelize((0 until numPartitions), numPartitions) > .mapPartitions(_ => new Array[(Long,Int)](1000).map(i => > (0L,0)).toSeq.iterator).cache() > sc.setCheckpointDir(checkpointDir) > var iteration = 0 > while (iteration < 50){ > R = R.zipPartitions(R)((x,y) => x).cache() > if ((iteration+1) % checkpointIter == 0) R.checkpoint() > R.foreachPartition(_ => {}) > iteration += 1 > } > I've also tried to unpersist the old RDDs, and increased spark.locality.wait, > but neither helps. > Strangely, adding a simple identity map > R = R.map(x => x).cache() > after the zipPartitions appears to partially mitigate the issue. > The problem was originally triggered when I attempted to checkpoint after > doing joinVertices in GraphX, but the above example shows that the issue is > in Spark core too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1863) Allowing user jars to take precedence over Spark jars does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-1863. - Resolution: Incomplete > Allowing user jars to take precedence over Spark jars does not work as > expected > --- > > Key: SPARK-1863 > URL: https://issues.apache.org/jira/browse/SPARK-1863 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: koert kuipers >Priority: Minor > Labels: bulk-closed > > See here: > http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html > The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader > has no visibility on classes managed by parentClassLoader because there is no > parent/child relationship. What this means is that if a class is loaded by > userClassLoader and it refers to a class loaded by parentClassLoader, you get > a NoClassDefFoundError. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2610) When spark.serializer is set as org.apache.spark.serializer.KryoSerializer, importing a method causes multiple spark applications creations
[ https://issues.apache.org/jira/browse/SPARK-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2610. - Resolution: Incomplete > When spark.serializer is set as org.apache.spark.serializer.KryoSerializer, > importing a method causes multiple spark applications creations > - > > Key: SPARK-2610 > URL: https://issues.apache.org/jira/browse/SPARK-2610 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.0.1 >Reporter: Yin Huai >Priority: Minor > Labels: bulk-closed > > To reproduce, set > {code} > spark.serializer org.apache.spark.serializer.KryoSerializer > {code} > in conf/spark-defaults.conf and launch a spark shell. > Then, execute > {code} > class X() { println("What!"); def y = 3 } > val x = new X > import x.y > case class Person(name: String, age: Int) > val serializer = org.apache.spark.serializer.Serializer.getSerializer(null) > val kryoSerializer = serializer.newInstance > val value = kryoSerializer.serialize(Person("abc", 1)) > kryoSerializer.deserialize(value): Person > // Once you execute this line, you will see ... > // What! > // What! > // res1: Person = Person(abc,1) > {code} > Basically, importing a method of a class causes the constructor of that class > to be called twice. > It affects our branch 1.0 and master. > For the master, you can use > {code} > val serializer = org.apache.spark.serializer.Serializer.getSerializer(None) > {code} > to get the serializer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6160. - Resolution: Incomplete > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > Labels: bulk-closed > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5046) Update KinesisReceiver to use updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5046. - Resolution: Incomplete > Update KinesisReceiver to use updated Receiver API > -- > > Key: SPARK-5046 > URL: https://issues.apache.org/jira/browse/SPARK-5046 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Reporter: Tathagata Das >Priority: Major > Labels: bulk-closed > > Currently the KinesisReceiver is not reliable as it does not correctly > acknowledge the source. This task is to update the receiver to use the > updated Receiver API in SPARK-5042 and implement a reliable receiver. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5150) Strange implicit resolution behavior in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5150. - Resolution: Incomplete > Strange implicit resolution behavior in Spark REPL > -- > > Key: SPARK-5150 > URL: https://issues.apache.org/jira/browse/SPARK-5150 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Tobias Schlatter >Priority: Major > Labels: bulk-closed > > Consider the following Spark REPL session: > {code} > scala> def showInt(implicit x: Int) = println(x) > showInt: (implicit x: Int)Unit > scala> object IntHolder { implicit val myInt = 5 } > defined module IntHolder > scala> import IntHolder.myInt > import IntHolder.myInt > scala> showInt > 5 > scala> class A; showInt > :11: error: could not find implicit value for parameter x: Int > showInt > ^ > {code} > This was most likely caused by the fix to SPARK-2632 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5104) Distributed Representations of Sentences and Documents
[ https://issues.apache.org/jira/browse/SPARK-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5104. - Resolution: Incomplete > Distributed Representations of Sentences and Documents > -- > > Key: SPARK-5104 > URL: https://issues.apache.org/jira/browse/SPARK-5104 > Project: Spark > Issue Type: Wish > Components: ML, MLlib >Reporter: Guoqiang Li >Priority: Major > Labels: bulk-closed > > The Paper [Distributed Representations of Sentences and > Documents|http://arxiv.org/abs/1405.4053] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3750) Log ulimit settings at warning if they are too low
[ https://issues.apache.org/jira/browse/SPARK-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3750. - Resolution: Incomplete > Log ulimit settings at warning if they are too low > -- > > Key: SPARK-3750 > URL: https://issues.apache.org/jira/browse/SPARK-3750 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Priority: Minor > Labels: bulk-closed > > In recent versions of Spark the shuffle implementation is much more > aggressive about writing many files out to disk at once. Most linux kernels > have a default limit in the number of open files per process, and Spark can > exhaust this limit. The current hash-based shuffle implementation requires > as many files as the product of the map and reduce partition counts in a wide > dependency. > In order to reduce the errors we're seeing on the user list, we should > determine a value that is considered "too low" for normal operations and log > a warning on executor startup when that value isn't met. > 1. determine what ulimit is acceptable > 2. log when that value isn't met -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
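A rough sketch of the startup check proposed in SPARK-3750 above (not Spark code; it shells out to `ulimit -n`, and the 10,000-file threshold is purely illustrative, not the "acceptable value" the issue asks to determine):
{code}
import scala.sys.process._
import scala.util.Try

object UlimitCheck {
  // Illustrative threshold only; the issue asks for a properly determined value.
  val recommendedOpenFiles = 10000L

  // Read the soft open-files limit via the shell; None if it cannot be parsed (e.g. "unlimited").
  def softOpenFileLimit(): Option[Long] =
    Try(Seq("sh", "-c", "ulimit -n").!!.trim.toLong).toOption

  def warnIfTooLow(): Unit = softOpenFileLimit() match {
    case Some(limit) if limit < recommendedOpenFiles =>
      Console.err.println(
        s"WARN: open-file limit $limit is below $recommendedOpenFiles; " +
          "shuffle-heavy jobs may fail with 'Too many open files'")
    case _ => // unknown or high enough; nothing to log
  }
}
{code}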
[jira] [Resolved] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations
[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3504. - Resolution: Incomplete > KMeans optimization: track distances and unmoved cluster centers across > iterations > -- > > Key: SPARK-3504 > URL: https://issues.apache.org/jira/browse/SPARK-3504 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns >Priority: Major > Labels: bulk-closed, clustering > > The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because it > recomputes all distances to all cluster centers on each iteration. In later > iterations of Lloyd's algorithm, points don't change clusters and clusters > don't move. > By 1) tracking which clusters move and 2) tracking for each point which > cluster it belongs to and the distance to that cluster, one can avoid > recomputing distances in many cases with very little increase in memory > requirements. > I implemented this new algorithm and the results were fantastic. Using 16 > c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on > 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here > are the running times for the first 7 rounds: > 6 minutes and 42 seconds > 7 minutes and 7 seconds > 7 minutes 13 seconds > 1 minute 18 seconds > 30 seconds > 18 seconds > 12 seconds > Without this improvement, all rounds would have taken roughly 7 minutes, > resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, > this improvement resulted in a reduction of roughly 75% in running time with > no loss of accuracy. > My implementation is a rewrite of the existing 1.0.2 implementation. It is > not a simple modification of the existing implementation. Please let me know > if you are interested in this new implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
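A simplified, single-machine sketch of the bookkeeping described in SPARK-3504 above (not the reporter's implementation and not MLlib code): cache each point's assigned center and distance; if that center did not move, only centers that moved can beat the cached distance, so only those are re-evaluated:
{code}
object FastAssignSketch {
  type Point = Array[Double]

  def sqdist(a: Point, b: Point): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  // One Lloyd assignment step reusing cached (centerIndex, distance) pairs.
  def assign(points: Array[Point],
             centers: Array[Point],
             moved: Array[Boolean],
             cached: Array[(Int, Double)]): Array[(Int, Double)] =
    points.indices.map { p =>
      val (oldC, oldD) = cached(p)
      if (moved(oldC)) {
        // Our assigned center moved, so the cached distance is stale: full scan for this point.
        var best = 0
        var bestD = sqdist(points(p), centers(0))
        for (c <- 1 until centers.length) {
          val d = sqdist(points(p), centers(c))
          if (d < bestD) { best = c; bestD = d }
        }
        (best, bestD)
      } else {
        // Unmoved centers kept their old distances, which were all >= oldD,
        // so only moved centers can improve on the cached assignment.
        var best = oldC
        var bestD = oldD
        for (c <- centers.indices if moved(c)) {
          val d = sqdist(points(p), centers(c))
          if (d < bestD) { best = c; bestD = d }
        }
        (best, bestD)
      }
    }.toArray
}
{code}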
[jira] [Resolved] (SPARK-799) Windows versions of the deploy scripts
[ https://issues.apache.org/jira/browse/SPARK-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-799. Resolution: Incomplete > Windows versions of the deploy scripts > -- > > Key: SPARK-799 > URL: https://issues.apache.org/jira/browse/SPARK-799 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Reporter: Matei Zaharia >Priority: Major > Labels: Starter, bulk-closed > > Although the Spark daemons run fine on Windows with run.cmd, the deploy > scripts (bin/start-all.sh and such) don't do so unless you have Cygwin. It > would be nice to make .cmd versions of those. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4684) Add a script to run JDBC server on Windows
[ https://issues.apache.org/jira/browse/SPARK-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4684. - Resolution: Incomplete > Add a script to run JDBC server on Windows > -- > > Key: SPARK-4684 > URL: https://issues.apache.org/jira/browse/SPARK-4684 > Project: Spark > Issue Type: New Feature > Components: SQL, Windows >Reporter: Matei Zaharia >Assignee: Cheng Lian >Priority: Minor > Labels: bulk-closed > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5575. - Resolution: Incomplete > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov >Priority: Major > Labels: bulk-closed > > *Goal:* Implement various types of artificial neural networks > *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) > Having deep learning within Spark's ML library is a question of convenience. > Spark has broad analytic capabilities and it is useful to have deep learning > as one of these tools at hand. Deep learning is a model of choice for several > important modern use-cases, and Spark ML might want to cover them. > Eventually, it is hard to explain, why do we have PCA in ML but don't provide > Autoencoder. To summarize this, Spark should have at least the most widely > used deep learning models, such as fully connected artificial neural network, > convolutional network and autoencoder. Advanced and experimental deep > learning features might reside within packages or as pluggable external > tools. These 3 will provide a comprehensive deep learning set for Spark ML. > We might also include recurrent networks as well. > *Requirements:* > # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, > Layer, Error, Regularization, Forward and Backpropagation etc. should be > implemented as traits or interfaces, so they can be easily extended or > reused. Define the Spark ML API for deep learning. This interface is similar > to the other analytics tools in Spark and supports ML pipelines. This makes > deep learning easy to use and plug in into analytics workloads for Spark > users. > # Efficiency. The current implementation of multilayer perceptron in Spark is > less than 2x slower than Caffe, both measured on CPU. The main overhead > sources are JVM and Spark's communication layer. For more details, please > refer to https://github.com/avulanov/ann-benchmark. Having said that, the > efficient implementation of deep learning in Spark should be only few times > slower than in specialized tool. This is very reasonable for the platform > that does much more than deep learning and I believe it is understood by the > community. > # Scalability. Implement efficient distributed training. It relies heavily on > the efficient communication and scheduling mechanisms. The default > implementation is based on Spark. More efficient implementations might > include some external libraries but use the same interface defined. > *Main features:* > # Multilayer perceptron classifier (MLP) > # Autoencoder > # Convolutional neural networks for computer vision. The interface has to > provide few architectures for deep learning that are widely used in practice, > such as AlexNet > *Additional features:* > # Other architectures, such as Recurrent neural network (RNN), Long-short > term memory (LSTM), Restricted boltzmann machine (RBM), deep belief network > (DBN), MLP multivariate regression > # Regularizers, such as L1, L2, drop-out > # Normalizers > # Network customization. The internal API of Spark ANN is designed to be > flexible and can handle different types of layers. However, only a part of > the API is made public. 
We have to limit the number of public classes in > order to make it simpler to support other languages. This forces us to use > (String or Number) parameters instead of introducing of new public classes. > One of the options to specify the architecture of ANN is to use text > configuration with layer-wise description. We have considered using Caffe > format for this. It gives the benefit of compatibility with well known deep > learning tool and simplifies the support of other languages in Spark. > Implementation of a parser for the subset of Caffe format might be the first > step towards the support of general ANN architectures in Spark. > # Hardware specific optimization. One can wrap other deep learning > implementations with this interface allowing users to pick a particular > back-end, e.g. Caffe or TensorFlow, along with the default one. The interface > has to provide few architectures for deep learning that are widely used in > practice, such as AlexNet. The main motivation for using specialized > libraries for deep learning would be to fully take advantage of the hardware > where Spark runs, in particular GPUs. Having the default interface in Spark, > we will need to wrap only a subset of functions from a given specialized > library. It does require an effort, how
[jira] [Resolved] (SPARK-5915) Spillable should check every N bytes rather than every 32 elements
[ https://issues.apache.org/jira/browse/SPARK-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5915. - Resolution: Incomplete > Spillable should check every N bytes rather than every 32 elements > -- > > Key: SPARK-5915 > URL: https://issues.apache.org/jira/browse/SPARK-5915 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.0.0 >Reporter: Mingyu Kim >Priority: Major > Labels: bulk-closed > > Spillable currently checks for spill every 32 elements. However, this puts it > at risk of OOM if each element is large enough. A better alternative is to > check every N bytes accumulated. > N should be set to a reasonable value via proper testing. > This is a follow-up of SPARK-4808, and was discussed originally in > https://github.com/apache/spark/pull/4420. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
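A hedged sketch of the byte-based check proposed in SPARK-5915 above (hypothetical trait, not Spark's Spillable; the 1 MB check interval and 5 MB threshold stand in for the "reasonable value via proper testing"):
{code}
trait ByteBasedSpillable {
  // Placeholder values; the issue asks for these to be tuned via testing.
  private val checkIntervalBytes: Long = 1L << 20        // check roughly every 1 MB added
  private val memoryThresholdBytes: Long = 5L * (1L << 20)
  private var bytesSinceLastCheck: Long = 0L

  // Called after inserting an element of `elementBytes` bytes; returns true if we spilled.
  def maybeSpill(currentSizeBytes: Long, elementBytes: Long): Boolean = {
    bytesSinceLastCheck += elementBytes
    if (bytesSinceLastCheck >= checkIntervalBytes) {
      bytesSinceLastCheck = 0L
      if (currentSizeBytes > memoryThresholdBytes) {
        spill()
        return true
      }
    }
    false
  }

  // Subclasses write their in-memory data to disk here.
  protected def spill(): Unit
}
{code}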
[jira] [Resolved] (SPARK-3244) Add fate sharing across related files in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3244. - Resolution: Incomplete > Add fate sharing across related files in Jenkins > > > Key: SPARK-3244 > URL: https://issues.apache.org/jira/browse/SPARK-3244 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 1.1.0 >Reporter: Andrew Or >Priority: Major > Labels: bulk-closed > > A few files are closely linked with each other. For instance, changes in > "bin/spark-submit" must be reflected in "bin/spark-submit.cmd" and > "SparkSubmitDriverBootstrapper.scala". It would be good if Jenkins gives a > warning if one file is changed but not the related ones. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1272) Don't fail job if some local directories are buggy
[ https://issues.apache.org/jira/browse/SPARK-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-1272. - Resolution: Incomplete > Don't fail job if some local directories are buggy > -- > > Key: SPARK-1272 > URL: https://issues.apache.org/jira/browse/SPARK-1272 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Patrick Wendell >Priority: Major > Labels: bulk-closed > > If Spark cannot create shuffle directories inside of a local directory it > might make sense to just log an error and continue, provided that at least > one valid shuffle directory exists. Otherwise if a single disk is wonky the > entire job can fail. > The down side is that this might mask failures if the person actually > misconfigures the local directories to point to the wrong disk(s). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
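A minimal sketch of the behavior suggested in SPARK-1272 above (not Spark's DiskBlockManager; the probe subdirectory name is made up): probe each configured local directory, skip the ones that cannot be created, and fail only when none is usable:
{code}
import java.io.File

object LocalDirProbe {
  def usableDirs(configured: Seq[String]): Seq[File] = {
    val ok = configured.flatMap { path =>
      try {
        val dir = new File(path, "spark-local-probe") // hypothetical subdirectory name
        if (dir.isDirectory || dir.mkdirs()) Some(dir)
        else {
          Console.err.println(s"WARN: ignoring unusable local dir $path")
          None
        }
      } catch {
        case e: SecurityException =>
          Console.err.println(s"WARN: ignoring local dir $path: ${e.getMessage}")
          None
      }
    }
    require(ok.nonEmpty, "No usable local directories; cannot continue")
    ok
  }
}
{code}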
[jira] [Resolved] (SPARK-4524) Add documentation on packaging Python dependencies / installing them on clusters
[ https://issues.apache.org/jira/browse/SPARK-4524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-4524. - Resolution: Incomplete > Add documentation on packaging Python dependencies / installing them on > clusters > > > Key: SPARK-4524 > URL: https://issues.apache.org/jira/browse/SPARK-4524 > Project: Spark > Issue Type: New Feature > Components: Documentation >Reporter: Josh Rosen >Priority: Major > Labels: bulk-closed > > The documentation should have a section on ways to package Python > dependencies, add them to jobs, install them on clusters, etc. with an > overview of the different mechanisms and discussion of best practices. > This could incorporate some information from > https://stackoverflow.com/questions/24686474/shipping-python-modules-in-pyspark-to-other-nodes/24686708#24686708 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6208) executor-memory does not work when using local cluster
[ https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6208. - Resolution: Incomplete > executor-memory does not work when using local cluster > -- > > Key: SPARK-6208 > URL: https://issues.apache.org/jira/browse/SPARK-6208 > Project: Spark > Issue Type: New Feature > Components: Spark Submit >Reporter: Yin Huai >Priority: Minor > Labels: bulk-closed > > It seems the executor memory set with a local cluster is not correctly applied (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377). > Also, totalExecutorCores seems to have the same issue > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS
[ https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-3735. - Resolution: Incomplete > Sending the factor directly or AtA based on the cost in ALS > --- > > Key: SPARK-3735 > URL: https://issues.apache.org/jira/browse/SPARK-3735 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: bulk-closed > > It is common to have some super popular products in the dataset. In this > case, sending many user factors to the target product block could be more > expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i > r_ij` to the product block. The cost of sending a single factor is `k`, while > the cost of sending a normal equation is much more expensive, `k * (k + 3) / > 2`. However, if we use normal equation for all products associated with a > user, we don't need to send this user factor. > Determining the optimal assignment is hard. But we could use a simple > heuristic. Inside any rating block, > 1) order the product ids by the number of user ids associated with them in > desc order > 2) starting from the most popular product, mark popular products as "use > normal eq" and calculate the cost > Remember the best assignment that comes with the lowest cost and use it for > computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
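A tiny sketch of the cost comparison quoted in SPARK-3735 above (assumed helper, not ALS code): sending one factor costs k values, while sending the normal equation (the upper triangle of `\sum_i u_i u_i^T` plus `\sum_i u_i r_ij`) costs k * (k + 3) / 2 values, so the normal equation wins once enough user factors would otherwise be shipped:
{code}
object AlsSendCost {
  // Cost (in number of values) of shipping one k-dimensional factor.
  def factorCost(k: Int): Long = k.toLong

  // Cost of shipping the normal equation: upper triangle of AtA (k*(k+1)/2 values) plus Atb (k values).
  def normalEqCost(k: Int): Long = k.toLong * (k + 3) / 2

  // For a product rated by `numUsers` users in a rating block, is shipping
  // the normal equation once cheaper than shipping every user factor?
  def preferNormalEq(k: Int, numUsers: Long): Boolean =
    normalEqCost(k) < numUsers * factorCost(k)
}
{code}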
[jira] [Resolved] (SPARK-2296) Refactor util.JsonProtocol for evolvability
[ https://issues.apache.org/jira/browse/SPARK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2296. - Resolution: Incomplete > Refactor util.JsonProtocol for evolvability > --- > > Key: SPARK-2296 > URL: https://issues.apache.org/jira/browse/SPARK-2296 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Priority: Major > Labels: bulk-closed > > The current design is not very evolvable. For backwards compatibility, every > time we add a new field in one of the relevant objects (e.g. StageInfo) we > need to add a default value to the field. Otherwise, the test suite still > passes, but it throws some sort of obscure json exception if the field does > not exist. > We should let a common interface (JsonSerializable) handle this logic, so we > don't need to do it for all classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
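A hedged sketch of the kind of interface SPARK-2296 above suggests (hypothetical trait and field names; Spark's event logs do use json4s, but this is not the actual JsonProtocol code). Each type supplies defaults for fields that may be missing from logs written by older versions:
{code}
import org.json4s._
import org.json4s.JsonDSL._

trait JsonSerializable[T] {
  def toJson(value: T): JValue
  def fromJson(json: JValue): T
}

// Stand-in for something like StageInfo; attemptId plays the role of the "field added later".
case class StageInfoLike(stageId: Int, name: String, attemptId: Int = 0)

object StageInfoLikeJson extends JsonSerializable[StageInfoLike] {
  implicit val formats: Formats = DefaultFormats

  def toJson(s: StageInfoLike): JValue =
    ("Stage ID" -> s.stageId) ~ ("Stage Name" -> s.name) ~ ("Attempt ID" -> s.attemptId)

  def fromJson(json: JValue): StageInfoLike =
    StageInfoLike(
      stageId = (json \ "Stage ID").extract[Int],
      name = (json \ "Stage Name").extract[String],
      // Default when the field is absent in logs written by an older version.
      attemptId = (json \ "Attempt ID").extractOpt[Int].getOrElse(0))
}
{code}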
[jira] [Resolved] (SPARK-5918) Spark Thrift server reports metadata for VARCHAR column as STRING in result set schema
[ https://issues.apache.org/jira/browse/SPARK-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5918. - Resolution: Incomplete > Spark Thrift server reports metadata for VARCHAR column as STRING in result > set schema > -- > > Key: SPARK-5918 > URL: https://issues.apache.org/jira/browse/SPARK-5918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1, 1.2.0 >Reporter: Holman Lan >Assignee: Cheng Lian >Priority: Major > Labels: bulk-closed > > This is reproducible using the open source JDBC driver by executing a query > that will return a VARCHAR column then retrieving the result set metadata. > The type name returned by the JDBC driver is VARCHAR which is expected but > reports the column type as string[12] and precision/column length as > 2147483647 (which is what the JDBC driver would return for STRING column) > even though we created a VARCHAR column with max length of 1000. > Further investigation indicates the GetResultSetMetadata Thrift client API > call returns the incorrect metadata. > We have confirmed this behaviour in versions 1.1.1 and 1.2.0. We have not > yet tested this against 1.2.1 but will do so and report our findings. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2723) Block Manager should catch exceptions in putValues
[ https://issues.apache.org/jira/browse/SPARK-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-2723. - Resolution: Incomplete > Block Manager should catch exceptions in putValues > -- > > Key: SPARK-2723 > URL: https://issues.apache.org/jira/browse/SPARK-2723 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Shivaram Venkataraman >Priority: Major > Labels: bulk-closed > > The BlockManager should catch exceptions encountered while writing out files > to disk. Right now these exceptions get counted as user-level task failures > and the job is aborted after failing 4 times. We should either fail the > executor or handle this better to prevent the job from dying. > I ran into an issue where one disk on a large EC2 cluster failed and this > resulted in a long running job terminating. Longer term, we should also look > at black-listing local directories when one of them become unusable ? > Exception pasted below: > 14/07/29 00:55:39 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: > /mnt2/spark/spark-local-20140728175256-e7cb/28/broadcast_264_piece20 > (Input/output error) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:221) > at java.io.FileOutputStream.(FileOutputStream.java:171) > at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79) > at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:66) > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:847) > at > org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:267) > at > org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:256) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.storage.MemoryStore.ensureFreeSpace(MemoryStore.scala:256) > at > org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:179) > at > org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:76) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:663) > at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
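A rough sketch of the defensive handling SPARK-2723 above asks for (not BlockManager code; names are made up): turn an I/O failure during a disk write into a storage-level decision instead of a user-level task failure:
{code}
import java.io.{File, FileOutputStream, IOException}

object SafeDiskWrite {
  // Returns true if the block was written; false if the write failed, in which case
  // the caller can drop the block or blacklist the directory instead of failing the task.
  def tryWriteBlock(file: File, bytes: Array[Byte]): Boolean =
    try {
      val out = new FileOutputStream(file)
      try out.write(bytes) finally out.close()
      true
    } catch {
      case e: IOException =>
        Console.err.println(s"WARN: failed to write block ${file.getName}: ${e.getMessage}")
        false
    }
}
{code}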