[jira] [Created] (SPARK-9845) Add built-in UDF
Alex Liu created SPARK-9845:
--------------------------------

    Summary: Add built-in UDF
    Key: SPARK-9845
    URL: https://issues.apache.org/jira/browse/SPARK-9845
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 1.4.1, 1.3.1
    Reporter: Alex Liu

Hive has many built-in functions, as listed in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Can we add similar functions to Spark SQL?
[jira] [Created] (SPARK-9847) ML Params copyValues should copy default values to default map, not set map
Joseph K. Bradley created SPARK-9847:
--------------------------------

    Summary: ML Params copyValues should copy default values to default map, not set map
    Key: SPARK-9847
    URL: https://issues.apache.org/jira/browse/SPARK-9847
    Project: Spark
    Issue Type: Improvement
    Components: ML
    Reporter: Joseph K. Bradley
    Assignee: Joseph K. Bradley
    Priority: Critical

Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.

This issue arose in [SPARK-9789], where the 2 params threshold and thresholds for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.
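[Editorial note] A minimal sketch of the intended fix, using simplified mutable maps rather than the real ML Params API; the class and method names here are illustrative only:

{code}
// Simplified model of Params.copyValues: a parent's *default* values should
// land in the child's defaultParamMap, not its paramMap, so that copied
// defaults are not mistaken for explicitly set values.
class SimpleParams {
  val paramMap = scala.collection.mutable.Map.empty[String, Any]        // explicitly set
  val defaultParamMap = scala.collection.mutable.Map.empty[String, Any] // defaults

  def copyValuesTo(to: SimpleParams): SimpleParams = {
    defaultParamMap.foreach { case (k, v) =>
      // Fixed behavior: route defaults into to.defaultParamMap (was: to.paramMap).
      if (!to.defaultParamMap.contains(k)) to.defaultParamMap(k) = v
    }
    paramMap.foreach { case (k, v) => to.paramMap(k) = v }
    to
  }
}
{code}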
[jira] [Updated] (SPARK-9816) Support BinaryType in Concat
[ https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9816:
-------------------------------
    Target Version/s: 1.6.0

Support BinaryType in Concat

    Key: SPARK-9816
    URL: https://issues.apache.org/jira/browse/SPARK-9816
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 1.4.1
    Reporter: Takeshi Yamamuro

Support BinaryType in catalyst Concat according to Hive behaviour.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
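[Editorial note] For reference, a hedged sketch of the binary-concat semantics outside of Catalyst; the actual change would live in the Concat expression itself, and this standalone helper is an illustration:

{code}
// CONCAT over BINARY appends byte arrays; following Hive's CONCAT behaviour,
// any null input makes the whole result null.
def concatBinary(inputs: Seq[Array[Byte]]): Array[Byte] =
  if (inputs.contains(null)) null
  else inputs.foldLeft(Array.empty[Byte])(_ ++ _)
{code}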
[jira] [Resolved] (SPARK-8925) Add @since tags to mllib.util
[ https://issues.apache.org/jira/browse/SPARK-8925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-8925.
----------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 7436
[https://github.com/apache/spark/pull/7436]

Add @since tags to mllib.util

    Key: SPARK-8925
    URL: https://issues.apache.org/jira/browse/SPARK-8925
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, MLlib
    Reporter: Xiangrui Meng
    Priority: Minor
    Labels: starter
    Fix For: 1.5.0
    Original Estimate: 1h
    Remaining Estimate: 1h
[jira] [Updated] (SPARK-8925) Add @since tags to mllib.util
[ https://issues.apache.org/jira/browse/SPARK-8925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-8925:
---------------------------------
    Assignee: Sudhakar Thota

Add @since tags to mllib.util

    Key: SPARK-8925
    URL: https://issues.apache.org/jira/browse/SPARK-8925
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, MLlib
    Reporter: Xiangrui Meng
    Assignee: Sudhakar Thota
    Priority: Minor
    Labels: starter
    Fix For: 1.5.0
    Original Estimate: 1h
    Remaining Estimate: 1h
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692293#comment-14692293 ]

Yin Huai commented on SPARK-9740:
---------------------------------
Actually, it seems our old first/last functions do not respect nulls.

first/last aggregate NULL behavior

    Key: SPARK-9740
    URL: https://issues.apache.org/jira/browse/SPARK-9740
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Herman van Hovell
    Assignee: Yin Huai

The FIRST/LAST aggregates implemented as part of the new UDAF interface return the first or last non-null value (if any) found. This is a departure from the behavior of the old FIRST/LAST aggregates and from the FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' this behavior for the old UDAF interface.

Hive makes this behavior configurable by adding a skipNulls flag. I would suggest doing the same, and making the default behavior compatible with Hive.
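[Editorial note] A hedged sketch of the proposed semantics, using a plain function over an input sequence instead of the real UDAF interface:

{code}
// FIRST with a configurable skipNulls flag. With skipNulls = false (the
// Hive-compatible default suggested above), a leading null is a legitimate
// result; with skipNulls = true, nulls are ignored while searching.
def first(values: Seq[Any], skipNulls: Boolean): Any = {
  val candidates = if (skipNulls) values.filter(_ != null) else values
  candidates.headOption.orNull
}
{code}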
[jira] [Created] (SPARK-9848) Add @since tag to new public APIs in 1.5
Xiangrui Meng created SPARK-9848:
--------------------------------

    Summary: Add @since tag to new public APIs in 1.5
    Key: SPARK-9848
    URL: https://issues.apache.org/jira/browse/SPARK-9848
    Project: Spark
    Issue Type: Documentation
    Components: Documentation, ML, MLlib
    Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-9848) Add @since tag to new public APIs in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-9848:
---------------------------------
    Labels: starter (was: )

Add @since tag to new public APIs in 1.5

    Key: SPARK-9848
    URL: https://issues.apache.org/jira/browse/SPARK-9848
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, ML, MLlib
    Reporter: Xiangrui Meng
    Labels: starter

We should get a list of new APIs from SPARK-9660. cc: [~fliang]
[jira] [Updated] (SPARK-9848) Add @since tag to new public APIs in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-9848:
---------------------------------
    Issue Type: Sub-task (was: Documentation)
    Parent: SPARK-7751

Add @since tag to new public APIs in 1.5

    Key: SPARK-9848
    URL: https://issues.apache.org/jira/browse/SPARK-9848
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, ML, MLlib
    Reporter: Xiangrui Meng
    Labels: starter
[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692303#comment-14692303 ]

Xiangrui Meng commented on SPARK-7751:
--------------------------------------
This issue is addressed in SPARK-8967. We tried to use an annotation instead of a JavaDoc tag for since; however, I didn't find a way to make it work.

Add @since to stable and experimental methods in MLlib

    Key: SPARK-7751
    URL: https://issues.apache.org/jira/browse/SPARK-7751
    Project: Spark
    Issue Type: Umbrella
    Components: Documentation, MLlib
    Affects Versions: 1.4.0
    Reporter: Xiangrui Meng
    Assignee: Xiangrui Meng
    Priority: Minor
    Labels: starter

This is useful to check whether a feature exists in some version of Spark. This is an umbrella JIRA to track the progress. We want to have a @since tag for both stable (those without any Experimental/DeveloperApi/AlphaComponent annotations) and experimental methods in MLlib. (Do NOT tag private or package private classes or methods.)

* an example PR for Scala: https://github.com/apache/spark/pull/6101
* an example PR for Python: https://github.com/apache/spark/pull/6295

We need to dig through the git commit history to figure out which Spark version first introduced a method. Take `NaiveBayes.setModelType` as an example. We can grep `def setModelType` at different version git tags.

{code}
meng@xm:~/src/spark $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
meng@xm:~/src/spark $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
  def setModelType(modelType: String): NaiveBayes = {
{code}

If there are better ways, please let us know. We cannot add all the @since tags in a single PR, which would be hard to review, so we made subtasks for each package, for example `org.apache.spark.classification`. Feel free to add more sub-tasks for Python and the `spark.ml` package.
[jira] [Assigned] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8971:
-----------------------------------
    Assignee: Seth Hendrickson (was: Apache Spark)

Support balanced class labels when splitting train/cross validation sets

    Key: SPARK-8971
    URL: https://issues.apache.org/jira/browse/SPARK-8971
    Project: Spark
    Issue Type: New Feature
    Components: ML
    Reporter: Feynman Liang
    Assignee: Seth Hendrickson

{{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are Spark classes which partition data into training and evaluation sets for performing hyperparameter selection via cross validation. Both methods currently perform the split by randomly sampling the datasets. However, when class probabilities are highly imbalanced (e.g. detection of extremely low-frequency events), random sampling may result in cross validation sets that are not representative of actual out-of-training performance (e.g. no positive training examples might be included). Mainstream R packages like [caret|http://topepo.github.io/caret/splitting.html] already support splitting the data based upon the class labels.
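[Editorial note] One possible approach, sketched under the assumption that per-class stratified sampling is acceptable. sampleByKeyExact is an existing RDD operation; the function name and signature here are illustrative and not the PR's actual API:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Stratified train/validation split: sample the same fraction from every
// class label so that rare labels appear in both sets.
def stratifiedSplit(
    data: RDD[(Double, Vector)], // (label, features)
    trainFraction: Double,
    seed: Long): (RDD[(Double, Vector)], RDD[(Double, Vector)]) = {
  val fractions = data.keys.distinct().collect()
    .map(label => label -> trainFraction).toMap
  val train = data.sampleByKeyExact(withReplacement = false, fractions, seed)
  val validation = data.subtract(train)
  (train, validation)
}
{code}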
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692312#comment-14692312 ]

Seth Hendrickson commented on SPARK-8971:
-----------------------------------------
I went ahead and created the PR for this issue, even though some of the design choices still merit discussion. This way, others can at least see the code and make comments. I did not mark it as WIP, but I can do that if needed.

Support balanced class labels when splitting train/cross validation sets

    Key: SPARK-8971
    URL: https://issues.apache.org/jira/browse/SPARK-8971
    Project: Spark
    Issue Type: New Feature
    Components: ML
    Reporter: Feynman Liang
    Assignee: Seth Hendrickson

{{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are Spark classes which partition data into training and evaluation sets for performing hyperparameter selection via cross validation. Both methods currently perform the split by randomly sampling the datasets. However, when class probabilities are highly imbalanced (e.g. detection of extremely low-frequency events), random sampling may result in cross validation sets that are not representative of actual out-of-training performance (e.g. no positive training examples might be included). Mainstream R packages like [caret|http://topepo.github.io/caret/splitting.html] already support splitting the data based upon the class labels.
[jira] [Commented] (SPARK-8967) Implement @since as an annotation
[ https://issues.apache.org/jira/browse/SPARK-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692318#comment-14692318 ]

Xiangrui Meng commented on SPARK-8967:
--------------------------------------
One example is the `deprecated` annotation in Scala: https://github.com/scala/scala/blob/2.10.x/src/library/scala/deprecated.scala. However, ScalaDoc may have special handling for this annotation.

Implement @since as an annotation

    Key: SPARK-8967
    URL: https://issues.apache.org/jira/browse/SPARK-8967
    Project: Spark
    Issue Type: New Feature
    Components: Documentation, Spark Core
    Reporter: Xiangrui Meng
    Assignee: Xiangrui Meng
    Original Estimate: 1h
    Remaining Estimate: 1h

We use the @since tag in JavaDoc. There is one issue: an overloaded method inherits the doc from its parent if no JavaDoc is provided. However, if we want to add @since, we have to add JavaDoc, and then we need to copy the JavaDoc from the parent, which makes it hard to keep the docs in sync. A better solution would be implementing @since as an annotation, which is not part of the JavaDoc.
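[Editorial note] A hedged sketch of what such an annotation could look like, modeled on scala.deprecated; this is an illustration, not necessarily the form Spark ultimately adopted:

{code}
// A static annotation carrying the version in which a member first appeared.
// Unlike a ScalaDoc tag, it can be attached to an overloaded method without
// replacing the documentation the method inherits from its parent.
class Since(version: String) extends scala.annotation.StaticAnnotation
{code}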
[jira] [Created] (SPARK-9846) User guide for Multilayer Perceptron Classifier
Xiangrui Meng created SPARK-9846:
--------------------------------

    Summary: User guide for Multilayer Perceptron Classifier
    Key: SPARK-9846
    URL: https://issues.apache.org/jira/browse/SPARK-9846
    Project: Spark
    Issue Type: Documentation
    Components: Documentation, ML
    Affects Versions: 1.5.0
    Reporter: Xiangrui Meng
    Assignee: Alexander Ulanov
[jira] [Resolved] (SPARK-9814) EqualNullSafe not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-9814.
--------------------------------
    Resolution: Fixed
    Assignee: Hyukjin Kwon
    Fix Version/s: 1.5.0

EqualNullSafe not passing to data sources

    Key: SPARK-9814
    URL: https://issues.apache.org/jira/browse/SPARK-9814
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Reporter: Hyukjin Kwon
    Assignee: Hyukjin Kwon
    Priority: Minor
    Fix For: 1.5.0

When a data source (such as Parquet) tries to filter data while reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though this seems possible to pass for data sources such as Parquet and JSON.

In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},

{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}

even though the binary capability issue has been solved (https://issues.apache.org/jira/browse/SPARK-8747).

I understand that {{CatalystScan}} can take all the raw expressions from the query planner. However, it is experimental, needs different interfaces, and is unstable (for reasons such as binary capability).

In general, the problem below can happen.

1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}

2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}

The second query can be hugely slower, although it is functionally almost identical, because data that is not filtered at the source leads to large network traffic (etc.) from the source RDD.
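[Editorial note] A hedged sketch of where such a filter surfaces in a data source. EqualTo is an existing filter in org.apache.spark.sql.sources; the EqualNullSafe case is the proposed addition and is shown commented out:

{code}
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Illustrative filter translation inside a PrunedFilteredScan.buildScan
// implementation. Filters that return None are re-evaluated by Spark.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value) => Some(s"$attr = $value")
  // case EqualNullSafe(attr, value) => Some(s"$attr <=> $value") // proposed
  case _ => None
}
{code}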
[jira] [Resolved] (SPARK-9824) Internal Accumulators will leak WeakReferences
[ https://issues.apache.org/jira/browse/SPARK-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-9824.
--------------------------------
    Resolution: Fixed
    Assignee: Shixiong Zhu
    Fix Version/s: 1.5.0

Internal Accumulators will leak WeakReferences

    Key: SPARK-9824
    URL: https://issues.apache.org/jira/browse/SPARK-9824
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Reporter: Shixiong Zhu
    Assignee: Shixiong Zhu
    Priority: Blocker
    Fix For: 1.5.0

InternalAccumulator.create doesn't call `registerAccumulatorForCleanup` to register itself with ContextCleaner, so `WeakReference`s for these accumulators in Accumulators.originals won't be removed.
[jira] [Updated] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-9776:
-----------------------------
    Priority: Major (was: Blocker)

[~sthota] Blocker is for committers to set. This does not rise to that level at this stage, especially as there is no target version. That doesn't mean it's not important; it's just 'normal' now.

Another instance of Derby may have already booted the database

    Key: SPARK-9776
    URL: https://issues.apache.org/jira/browse/SPARK-9776
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.5.0
    Environment: Mac Yosemite, spark-1.5.0
    Reporter: Sudhakar Thota
    Attachments: SPARK-9776-FL1.rtf

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an error, though the same works in spark-1.4.1.

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database
[jira] [Updated] (SPARK-9789) Reinstate LogisticRegression threshold Param
[ https://issues.apache.org/jira/browse/SPARK-9789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9789:
-------------------------------------
    Shepherd: DB Tsai

Reinstate LogisticRegression threshold Param

    Key: SPARK-9789
    URL: https://issues.apache.org/jira/browse/SPARK-9789
    Project: Spark
    Issue Type: Improvement
    Components: ML
    Reporter: Joseph K. Bradley
    Assignee: Joseph K. Bradley

From [SPARK-9658]: LogisticRegression.threshold was replaced by thresholds, but we could keep threshold for backwards compatibility. We should add it back, but we should maintain the current semantics whereby thresholds overrides threshold.
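[Editorial note] A hedged sketch of the override semantics, using a simplified function rather than the real Param plumbing; the derivation of the scalar from the vector is an assumption based on how thresholded prediction works for the binary case:

{code}
// thresholds (vector) overrides threshold (scalar). For binary classification
// with thresholds = Array(t0, t1), predicting class 1 when p/t1 > (1-p)/t0
// is equivalent to p > 1 / (1 + t0/t1), which is the effective scalar value.
def getEffectiveThreshold(
    thresholds: Option[Array[Double]],
    threshold: Double): Double =
  thresholds match {
    case Some(ts) => 1.0 / (1.0 + ts(0) / ts(1))
    case None     => threshold
  }
{code}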
[jira] [Resolved] (SPARK-9788) LDA docConcentration, gammaShape 1.5 binary incompatibility fixes
[ https://issues.apache.org/jira/browse/SPARK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-9788.
--------------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 8077
[https://github.com/apache/spark/pull/8077]

LDA docConcentration, gammaShape 1.5 binary incompatibility fixes

    Key: SPARK-9788
    URL: https://issues.apache.org/jira/browse/SPARK-9788
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Reporter: Joseph K. Bradley
    Assignee: Feynman Liang
    Fix For: 1.5.0

From [SPARK-9658]:

1. LDA.docConcentration
It will be nice to keep the old APIs unchanged. Proposal:
* Add "asymmetricDocConcentration" and revert the docConcentration changes.
* If the (internal) doc concentration vector is a single value, "getDocConcentration" returns it. If it is a constant vector, getDocConcentration returns the first item, and fails otherwise.

2. LDAModel.gammaShape
This should be given a default value.
[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)
[ https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692299#comment-14692299 ]

Joseph K. Bradley commented on SPARK-7454:
------------------------------------------
If you won't have time, please say so, so that someone else can take over. Thanks!

Perf test for power iteration clustering (PIC)

    Key: SPARK-7454
    URL: https://issues.apache.org/jira/browse/SPARK-7454
    Project: Spark
    Issue Type: Sub-task
    Components: MLlib
    Affects Versions: 1.4.0
    Reporter: Xiangrui Meng
    Assignee: Stephen Boesch
[jira] [Updated] (SPARK-9848) Add @since tag to new public APIs in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-9848:
---------------------------------
    Description: We should get a list of new APIs from SPARK-9660. cc: [~fliang]

Add @since tag to new public APIs in 1.5

    Key: SPARK-9848
    URL: https://issues.apache.org/jira/browse/SPARK-9848
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, ML, MLlib
    Reporter: Xiangrui Meng
    Labels: starter

We should get a list of new APIs from SPARK-9660. cc: [~fliang]
[jira] [Assigned] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8971:
-----------------------------------
    Assignee: Apache Spark (was: Seth Hendrickson)

Support balanced class labels when splitting train/cross validation sets

    Key: SPARK-8971
    URL: https://issues.apache.org/jira/browse/SPARK-8971
    Project: Spark
    Issue Type: New Feature
    Components: ML
    Reporter: Feynman Liang
    Assignee: Apache Spark

{{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are Spark classes which partition data into training and evaluation sets for performing hyperparameter selection via cross validation. Both methods currently perform the split by randomly sampling the datasets. However, when class probabilities are highly imbalanced (e.g. detection of extremely low-frequency events), random sampling may result in cross validation sets that are not representative of actual out-of-training performance (e.g. no positive training examples might be included). Mainstream R packages like [caret|http://topepo.github.io/caret/splitting.html] already support splitting the data based upon the class labels.
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692309#comment-14692309 ]

Apache Spark commented on SPARK-8971:
-------------------------------------
User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/8112

Support balanced class labels when splitting train/cross validation sets

    Key: SPARK-8971
    URL: https://issues.apache.org/jira/browse/SPARK-8971
    Project: Spark
    Issue Type: New Feature
    Components: ML
    Reporter: Feynman Liang
    Assignee: Seth Hendrickson

{{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are Spark classes which partition data into training and evaluation sets for performing hyperparameter selection via cross validation. Both methods currently perform the split by randomly sampling the datasets. However, when class probabilities are highly imbalanced (e.g. detection of extremely low-frequency events), random sampling may result in cross validation sets that are not representative of actual out-of-training performance (e.g. no positive training examples might be included). Mainstream R packages like [caret|http://topepo.github.io/caret/splitting.html] already support splitting the data based upon the class labels.
[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692316#comment-14692316 ]

Joseph K. Bradley commented on SPARK-7751:
------------------------------------------
OK, I guess we just need to be more careful about the PRs adding since tags.

Add @since to stable and experimental methods in MLlib

    Key: SPARK-7751
    URL: https://issues.apache.org/jira/browse/SPARK-7751
    Project: Spark
    Issue Type: Umbrella
    Components: Documentation, MLlib
    Affects Versions: 1.4.0
    Reporter: Xiangrui Meng
    Assignee: Xiangrui Meng
    Priority: Minor
    Labels: starter

This is useful to check whether a feature exists in some version of Spark. This is an umbrella JIRA to track the progress. We want to have a @since tag for both stable (those without any Experimental/DeveloperApi/AlphaComponent annotations) and experimental methods in MLlib. (Do NOT tag private or package private classes or methods.)

* an example PR for Scala: https://github.com/apache/spark/pull/6101
* an example PR for Python: https://github.com/apache/spark/pull/6295

We need to dig through the git commit history to figure out which Spark version first introduced a method. Take `NaiveBayes.setModelType` as an example. We can grep `def setModelType` at different version git tags.

{code}
meng@xm:~/src/spark $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
meng@xm:~/src/spark $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
  def setModelType(modelType: String): NaiveBayes = {
{code}

If there are better ways, please let us know. We cannot add all the @since tags in a single PR, which would be hard to review, so we made subtasks for each package, for example `org.apache.spark.classification`. Feel free to add more sub-tasks for Python and the `spark.ml` package.
[jira] [Updated] (SPARK-9850) Adaptive execution in Spark
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated SPARK-9850:
---------------------------------
    Assignee: Yin Huai

Adaptive execution in Spark

    Key: SPARK-9850
    URL: https://issues.apache.org/jira/browse/SPARK-9850
    Project: Spark
    Issue Type: New Feature
    Components: Spark Core, SQL
    Reporter: Matei Zaharia
    Assignee: Yin Huai
    Attachments: AdaptiveExecutionInSpark.pdf

Query planning is one of the main factors in high performance, but the current Spark engine requires the execution DAG for a job to be set in advance. Even with cost-based optimization, it is hard to know the behavior of data and user-defined functions well enough to always get great execution plans. This JIRA proposes to add adaptive query execution, so that the engine can change the plan for each query as it sees what data earlier stages produced.

We propose adding this to Spark SQL / DataFrames first, using a new API in the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the functionality could be extended to other libraries or the RDD API, but that is more difficult than adding it in SQL.

I've attached a design doc by Yin Huai and myself explaining how it would work in more detail.
[jira] [Commented] (SPARK-9427) Add expression functions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692600#comment-14692600 ]

Yu Ishikawa commented on SPARK-9427:
------------------------------------
[~shivaram] After all, I'd like to split this issue into a few sub-issues, since it is quite difficult to add all the listed expressions at once, and a single PR for the whole issue would be hard to review. I think we could classify them into at least three types in SparkR. What do you think?

1. Add expressions whose parameters are only {{(Column)}} or {{(Column, Column)}}, like {{md5(e: Column)}}
2. Add expressions whose parameters are a little complicated, like {{conv(num: Column, fromBase: Int, toBase: Int)}}
3. Add expressions which conflict with already existing generics, like {{coalesce(e: Column*)}}

{{1}} is not a difficult task: it's mostly extracting method definitions from the Scala code, and I think we rarely need to consider conflicts with the current SparkR code. However, {{2}} and {{3}} are a little hard because of the complexity. For example, in {{3}}, if we must modify an existing R generic for new expressions, we should check whether the modification affects the existing code or not.

Add expression functions in SparkR

    Key: SPARK-9427
    URL: https://issues.apache.org/jira/browse/SPARK-9427
    Project: Spark
    Issue Type: New Feature
    Components: SparkR
    Reporter: Yu Ishikawa

The list of functions to add is based on SQL's functions, and it would be better to add them in one shot.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
[jira] [Updated] (SPARK-9407) Parquet shouldn't fail when pushing down predicates over a column whose underlying Parquet type is an ENUM
[ https://issues.apache.org/jira/browse/SPARK-9407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-9407:
------------------------------
    Summary: Parquet shouldn't fail when pushing down predicates over a column whose underlying Parquet type is an ENUM (was: Parquet shouldn't push down predicates over a column whose underlying Parquet type is an ENUM)

Parquet shouldn't fail when pushing down predicates over a column whose underlying Parquet type is an ENUM

    Key: SPARK-9407
    URL: https://issues.apache.org/jira/browse/SPARK-9407
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.5.0
    Reporter: Cheng Lian
    Assignee: Cheng Lian
    Priority: Blocker

Spark SQL doesn't have an equivalent data type to Parquet {{BINARY (ENUM)}}, and always treats it as a UTF-8 encoded {{StringType}}. Thus, a predicate over a Parquet {{ENUM}} column may be pushed down. However, Parquet 1.7.0 and prior versions only support filter push-down optimization for [a limited set of data types|https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ValidTypeMap.java#L66-L80], and such a query fails. The simplest solution seems to be upgrading parquet-mr to 1.8.1, which fixes this issue via PARQUET-201.
[jira] [Resolved] (SPARK-7165) Sort Merge Join for outer joins
[ https://issues.apache.org/jira/browse/SPARK-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-7165.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

Sort Merge Join for outer joins

    Key: SPARK-7165
    URL: https://issues.apache.org/jira/browse/SPARK-7165
    Project: Spark
    Issue Type: Story
    Components: SQL
    Reporter: Adrian Wang
    Assignee: Josh Rosen
    Priority: Blocker
    Fix For: 1.5.0
[jira] [Updated] (SPARK-9730) Sort Merge Join for Full Outer Join
[ https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9730:
-------------------------------
    Parent Issue: SPARK-9697 (was: SPARK-7165)

Sort Merge Join for Full Outer Join

    Key: SPARK-9730
    URL: https://issues.apache.org/jira/browse/SPARK-9730
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Josh Rosen
    Assignee: Josh Rosen
[jira] [Updated] (SPARK-9730) Sort Merge Join for Full Outer Join
[ https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9730:
-------------------------------
    Target Version/s: 1.6.0 (was: 1.5.0)

Sort Merge Join for Full Outer Join

    Key: SPARK-9730
    URL: https://issues.apache.org/jira/browse/SPARK-9730
    Project: Spark
    Issue Type: New Feature
    Components: SQL
    Reporter: Josh Rosen
[jira] [Updated] (SPARK-9730) Sort Merge Join for Full Outer Join
[ https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9730:
-------------------------------
    Assignee: (was: Josh Rosen)

Sort Merge Join for Full Outer Join

    Key: SPARK-9730
    URL: https://issues.apache.org/jira/browse/SPARK-9730
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Josh Rosen
[jira] [Updated] (SPARK-9730) Sort Merge Join for Full Outer Join
[ https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9730:
-------------------------------
    Target Version/s: 1.5.0

Sort Merge Join for Full Outer Join

    Key: SPARK-9730
    URL: https://issues.apache.org/jira/browse/SPARK-9730
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Josh Rosen
[jira] [Issue Comment Deleted] (SPARK-9829) peakExecutionMemory is not correct
[ https://issues.apache.org/jira/browse/SPARK-9829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-9829:
--------------------------------
    Comment: was deleted

(was: How many tasks? peakExecutionMemory in the Web UI is the sum of peakExecutionMemory over all tasks, so this value may be confusing sometimes. E.g., assume we have 2 tasks: at 10:00am, task 1's memory usage reaches its peak of 10G, and it finishes at 10:02am; then task 2 starts at 10:03am and reaches its peak of 10G at 10:04am. peakExecutionMemory in the Web UI will then be 20G, although we never used more than 10G. BTW, did you modify the code? These values should not be shown directly in the Web UI. /cc [~andrewor14])

peakExecutionMemory is not correct

    Key: SPARK-9829
    URL: https://issues.apache.org/jira/browse/SPARK-9829
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Reporter: Davies Liu
    Assignee: Shixiong Zhu

When running a query with 8G memory, the peakExecutionMemory in the Web UI said 40344371200 (40G). Also, there are lots of accumulators with the same name, so it's hard to know what they mean.

{code}
Accumulable            Value
number of output rows  439614
number of output rows  7711
number of output rows  965
number of rows         7829
number of rows         7711
number of input rows   965
number of rows         52
number of input rows   439614
number of output rows  30
number of input rows   7726
number of rows         277000
peakExecutionMemory    40344371200
number of rows         7829
number of rows         965
number of rows         7726
number of rows         30
number of rows         138000
number of rows         8028
number of rows         439614
number of input rows   30
{code}

How to reproduce: run TPCDS q19 with scale=5 and check out the Web UI.
[jira] [Updated] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible
[ https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9849:
-------------------------------
    Description: DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer qualified name.
    (was: DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer path.)

DirectParquetOutputCommitter qualified name should be backward compatible

    Key: SPARK-9849
    URL: https://issues.apache.org/jira/browse/SPARK-9849
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin
    Priority: Blocker

DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer qualified name.
[jira] [Updated] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible
[ https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9849:
-------------------------------
    Summary: DirectParquetOutputCommitter qualified name should be backward compatible (was: DirectParquetOutputCommitter path should be backward compatible)

DirectParquetOutputCommitter qualified name should be backward compatible

    Key: SPARK-9849
    URL: https://issues.apache.org/jira/browse/SPARK-9849
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin
    Priority: Blocker

DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class, so we must be able to resolve the old committer path.
[jira] [Updated] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible
[ https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-9849:
-------------------------------
    Description: DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer path.
    (was: DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class, so we must be able to resolve the old committer path.)

DirectParquetOutputCommitter qualified name should be backward compatible

    Key: SPARK-9849
    URL: https://issues.apache.org/jira/browse/SPARK-9849
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin
    Priority: Blocker

DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer path.
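[Editorial note] A hedged sketch of one way to resolve the old qualified name; the remapping table and function are illustrative, not necessarily the approach the PR took, and the old package name is an assumption based on the pre-SPARK-9763 layout:

{code}
// Map the pre-SPARK-9763 committer class name to its new location before
// loading, so existing user configs keep working.
private val committerRenames = Map(
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter" ->
    "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

def resolveCommitterClass(configured: String): Class[_] =
  Class.forName(committerRenames.getOrElse(configured, configured))
{code}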
[jira] [Assigned] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9740:
-----------------------------------
    Assignee: Yin Huai (was: Apache Spark)

first/last aggregate NULL behavior

    Key: SPARK-9740
    URL: https://issues.apache.org/jira/browse/SPARK-9740
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Herman van Hovell
    Assignee: Yin Huai

The FIRST/LAST aggregates implemented as part of the new UDAF interface return the first or last non-null value (if any) found. This is a departure from the behavior of the old FIRST/LAST aggregates and from the FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' this behavior for the old UDAF interface.

Hive makes this behavior configurable by adding a skipNulls flag. I would suggest doing the same, and making the default behavior compatible with Hive.
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692392#comment-14692392 ]

Apache Spark commented on SPARK-9740:
-------------------------------------
User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8113

first/last aggregate NULL behavior

    Key: SPARK-9740
    URL: https://issues.apache.org/jira/browse/SPARK-9740
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Herman van Hovell
    Assignee: Yin Huai

The FIRST/LAST aggregates implemented as part of the new UDAF interface return the first or last non-null value (if any) found. This is a departure from the behavior of the old FIRST/LAST aggregates and from the FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' this behavior for the old UDAF interface.

Hive makes this behavior configurable by adding a skipNulls flag. I would suggest doing the same, and making the default behavior compatible with Hive.
[jira] [Assigned] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9740:
-----------------------------------
    Assignee: Apache Spark (was: Yin Huai)

first/last aggregate NULL behavior

    Key: SPARK-9740
    URL: https://issues.apache.org/jira/browse/SPARK-9740
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Herman van Hovell
    Assignee: Apache Spark

The FIRST/LAST aggregates implemented as part of the new UDAF interface return the first or last non-null value (if any) found. This is a departure from the behavior of the old FIRST/LAST aggregates and from the FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' this behavior for the old UDAF interface.

Hive makes this behavior configurable by adding a skipNulls flag. I would suggest doing the same, and making the default behavior compatible with Hive.
[jira] [Assigned] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible
[ https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9849:
-----------------------------------
    Assignee: Reynold Xin (was: Apache Spark)

DirectParquetOutputCommitter qualified name should be backward compatible

    Key: SPARK-9849
    URL: https://issues.apache.org/jira/browse/SPARK-9849
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin
    Priority: Blocker

DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer qualified name.
[jira] [Assigned] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible
[ https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9849:
-----------------------------------
    Assignee: Apache Spark (was: Reynold Xin)

DirectParquetOutputCommitter qualified name should be backward compatible

    Key: SPARK-9849
    URL: https://issues.apache.org/jira/browse/SPARK-9849
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Apache Spark
    Priority: Blocker

DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer qualified name.
[jira] [Commented] (SPARK-9849) DirectParquetOutputCommitter qualified name should be backward compatible
[ https://issues.apache.org/jira/browse/SPARK-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692402#comment-14692402 ]

Apache Spark commented on SPARK-9849:
-------------------------------------
User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8114

DirectParquetOutputCommitter qualified name should be backward compatible

    Key: SPARK-9849
    URL: https://issues.apache.org/jira/browse/SPARK-9849
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin
    Priority: Blocker

DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer qualified name.
[jira] [Assigned] (SPARK-9847) ML Params copyValues should copy default values to default map, not set map
[ https://issues.apache.org/jira/browse/SPARK-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9847:
-----------------------------------
    Assignee: Apache Spark (was: Joseph K. Bradley)

ML Params copyValues should copy default values to default map, not set map

    Key: SPARK-9847
    URL: https://issues.apache.org/jira/browse/SPARK-9847
    Project: Spark
    Issue Type: Improvement
    Components: ML
    Reporter: Joseph K. Bradley
    Assignee: Apache Spark
    Priority: Critical

Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.

This issue arose in [SPARK-9789], where the 2 params threshold and thresholds for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.
[jira] [Assigned] (SPARK-9847) ML Params copyValues should copy default values to default map, not set map
[ https://issues.apache.org/jira/browse/SPARK-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9847:
-----------------------------------
    Assignee: Joseph K. Bradley (was: Apache Spark)

ML Params copyValues should copy default values to default map, not set map

    Key: SPARK-9847
    URL: https://issues.apache.org/jira/browse/SPARK-9847
    Project: Spark
    Issue Type: Improvement
    Components: ML
    Reporter: Joseph K. Bradley
    Assignee: Joseph K. Bradley
    Priority: Critical

Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.

This issue arose in [SPARK-9789], where the 2 params threshold and thresholds for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.
[jira] [Commented] (SPARK-9847) ML Params copyValues should copy default values to default map, not set map
[ https://issues.apache.org/jira/browse/SPARK-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692422#comment-14692422 ]

Apache Spark commented on SPARK-9847:
-------------------------------------
User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/8115

ML Params copyValues should copy default values to default map, not set map

    Key: SPARK-9847
    URL: https://issues.apache.org/jira/browse/SPARK-9847
    Project: Spark
    Issue Type: Improvement
    Components: ML
    Reporter: Joseph K. Bradley
    Assignee: Joseph K. Bradley
    Priority: Critical

Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.

This issue arose in [SPARK-9789], where the 2 params threshold and thresholds for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.
[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)
[ https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692437#comment-14692437 ]

Stephen Boesch commented on SPARK-7454:
---------------------------------------
Hi, I had intended to clean this up in the past few days, but yes - I am overwhelmed by other tasks. I abdicate.

Perf test for power iteration clustering (PIC)

    Key: SPARK-7454
    URL: https://issues.apache.org/jira/browse/SPARK-7454
    Project: Spark
    Issue Type: Sub-task
    Components: MLlib
    Affects Versions: 1.4.0
    Reporter: Xiangrui Meng
    Assignee: Stephen Boesch
[jira] [Commented] (SPARK-9827) Too many open files in TungstenExchange
[ https://issues.apache.org/jira/browse/SPARK-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692441#comment-14692441 ]

Apache Spark commented on SPARK-9827:
-------------------------------------
User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8116

Too many open files in TungstenExchange

    Key: SPARK-9827
    URL: https://issues.apache.org/jira/browse/SPARK-9827
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.5.0
    Reporter: Davies Liu
    Assignee: Josh Rosen
    Priority: Blocker

When running q19 on the TPCDS (scale=5) dataset with 8G memory, it opens 10k shuffle files, crashing many things (even Chrome).

{code}
davies@localhost:~/work/spark$ jps
95385 Jps
95316 SparkSubmit
davies@localhost:~/work/spark$ lsof -p 95316 | wc -l
9827
davies@localhost:~/work/spark$ lsof -p 95316 | tail
java 95316 davies 9772r REG 1,2 9522 97350739 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/2a/shuffle_0_112_0.data
java 95316 davies 9773r REG 1,2 8449 97351388 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/1a/shuffle_0_116_0.data
java 95316 davies 9774r REG 1,2 8200 97351134 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/09/shuffle_0_113_0.data
java 95316 davies 9775r REG 1,2 8057 97351941 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/05/shuffle_0_117_0.data
java 95316 davies 9776r REG 1,2 8565 97351133 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/18/shuffle_0_114_0.data
java 95316 davies 9777r REG 1,2 8185 97351942 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/1c/shuffle_0_118_0.data
java 95316 davies 9778r REG 1,2 8865 97351135 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/07/shuffle_0_115_0.data
java 95316 davies 9779r REG 1,2 8255 97351987 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/3d/shuffle_0_119_0.data
java 95316 davies 9780r REG 1,2 8449 97351388 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/1a/shuffle_0_116_0.data
java 95316 davies 9781r REG 1,2 9105 97352148 /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-6dd1dd1b-735d-4eae-a7eb-72820e0a2e7b/13/shuffle_0_120_0.data
davies@localhost:~/work/spark$ ls -l /private/var/folders/r1/j51v8t_x4bq6fqt43nymzddwgn/T/blockmgr-71afa3af-f2a5-4b72-8b2d-45aa70ff7466//3a/
total 68
-rw-r--r-- 1 davies staff 8272 Aug 11 09:57 shuffle_0_105_0.data
-rw-r--r-- 1 davies staff 1608 Aug 11 09:57 shuffle_0_109_0.index
-rw-r--r-- 1 davies staff 8414 Aug 11 09:57 shuffle_0_127_0.data
-rw-r--r-- 1 davies staff 8368 Aug 11 09:57 shuffle_0_149_0.data
-rw-r--r-- 1 davies staff 1608 Aug 11 09:57 shuffle_0_40_0.index
-rw-r--r-- 1 davies staff 1608 Aug 11 09:57 shuffle_0_62_0.index
-rw-r--r-- 1 davies staff 7965 Aug 11 09:57 shuffle_0_6_0.data
-rw-r--r-- 1 davies staff 8419 Aug 11 09:57 shuffle_0_80_0.data
{code}
[jira] [Resolved] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated
[ https://issues.apache.org/jira/browse/SPARK-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-9640.
----------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated

    Key: SPARK-9640
    URL: https://issues.apache.org/jira/browse/SPARK-9640
    Project: Spark
    Issue Type: Test
    Components: Streaming, Tests
    Reporter: Tathagata Das
    Assignee: Tathagata Das
    Fix For: 1.5.0
[jira] [Comment Edited] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681331#comment-14681331 ]

Cheng Lian edited comment on SPARK-8824 at 8/11/15 6:55 AM:
------------------------------------------------------------
Oh sorry, I meant to say {{TIMESTAMP_MICROS}} and I mistook your request for {{TIMESTAMP_MICROS}}. I'm afraid it's already too late for 1.5. Another thing is that Spark SQL 1.5 now only has microsecond precision, so even if we support {{TIMESTAMP_MILLIS}} in 1.6, we'll probably only read Parquet {{TIMESTAMP_MILLIS}} values and convert them to microsecond timestamps.

was (Author: lian cheng):
Oh sorry, I mistook your request for {{TIMESTAMP_MICROS}}. I'm afraid it's already too late for 1.5. Another thing is that Spark SQL 1.5 now only has microsecond precision, so even if we support {{TIMESTAMP_MILLIS}} in 1.6, we'll probably only read Parquet {{TIMESTAMP_MILLIS}} values and convert them to microsecond timestamps.

Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS

    Key: SPARK-8824
    URL: https://issues.apache.org/jira/browse/SPARK-8824
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 1.5.0
    Reporter: Cheng Lian
[jira] [Commented] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681331#comment-14681331 ] Cheng Lian commented on SPARK-8824: --- Oh sorry, I mistook your request for {{TIMESTAMP_MICROS}}. I'm afraid it's already too late for 1.5. Another thing is that, Spark SQL 1.5 now only has microsecond precision, so even if we support {{TIMESTAMP_MILLIS}} in 1.6, we'll probably only read Parquet {{TIMESTAMP_MILLIS}} values and convert them to microsecond timestamps. Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS --- Key: SPARK-8824 URL: https://issues.apache.org/jira/browse/SPARK-8824 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
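To make the precision point above concrete, a minimal sketch of the described conversion (names are illustrative, not Spark internals):
{code}
// Reading a Parquet TIMESTAMP_MILLIS value (epoch milliseconds) into a
// microsecond-precision timestamp just scales by 1000; the reverse direction
// would lose sub-millisecond precision.
def millisToMicros(epochMillis: Long): Long = epochMillis * 1000L
def microsToMillis(epochMicros: Long): Long = epochMicros / 1000L
{code}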
[jira] [Resolved] (SPARK-9802) spark configuration page should mention spark.executor.cores yarn property
[ https://issues.apache.org/jira/browse/SPARK-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9802. -- Resolution: Not A Problem It's documented already, but in the latest docs: https://spark.apache.org/docs/latest/configuration.html search for 'spark.executor.cores'. It looks like this got addressed along with https://github.com/apache/spark/commit/8f8dc45f6d4c8d7b740eaa3d2ea09d0b531af9dd spark configuration page should mention spark.executor.cores yarn property --- Key: SPARK-9802 URL: https://issues.apache.org/jira/browse/SPARK-9802 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.1 Reporter: nirav patel Hi, I see that there's a --executor-cores argument available for the spark-submit script, which internally sets spark.executor.cores. However, that property should also be available on the configuration page, so people who don't use the spark-submit script know how to set the number of cores per executor (container). https://spark.apache.org/docs/1.3.1/configuration.html Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
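For anyone landing here, the property can also be set without the spark-submit flag; a minimal sketch (values illustrative):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to `spark-submit --executor-cores 4`: set the property directly.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.cores", "4")
val sc = new SparkContext(conf)
{code}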
[jira] [Resolved] (SPARK-9727) Make the Kinesis project SBT name consistent with other streaming projects
[ https://issues.apache.org/jira/browse/SPARK-9727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9727. -- Resolution: Fixed Fix Version/s: 1.5.0 Make the Kinesis project SBT name consistent with other streaming projects -- Key: SPARK-9727 URL: https://issues.apache.org/jira/browse/SPARK-9727 Project: Spark Issue Type: Improvement Components: Build Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor Fix For: 1.5.0
pom.xml - SBT project name: kinesis-asl --- streaming-kinesis-asl
SparkBuild - project name: sparkKinesisAsl --- streamingKinesisAsl
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9818) Revert 6136, use docker to test JDBC datasources
[ https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9818: --- Assignee: Apache Spark Revert 6136, use docker to test JDBC datasources Key: SPARK-9818 URL: https://issues.apache.org/jira/browse/SPARK-9818 Project: Spark Issue Type: Improvement Reporter: Yijie Shen Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9814) EqualNullSafe not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9814: --- Assignee: (was: Apache Spark) EqualNullSafe not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor
When a data source (such as Parquet) tries to filter data while reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible to pass it to data sources such as Parquet and JSON. In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},
{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions from the query planner. However, it is experimental, needs different interfaces, and is unstable (for reasons such as binary capability). In general, the problem below can happen:
1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}
The second query can be hugely slow even though it is functionally almost identical, because of the potentially large network traffic (etc.) caused by unfiltered data coming from the source RDD.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
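To make the gap concrete, here is a minimal sketch (a hypothetical helper, not Spark source) of how a data source typically pattern-matches on the pushed-down {{org.apache.spark.sql.sources.Filter}} objects; once {{EqualNullSafe}} is passed through, it can be handled like any other comparison:
{code}
import org.apache.spark.sql.sources._

// Hypothetical translator from source filters to a SQL-ish predicate string.
// EqualNullSafe corresponds to the null-safe comparison operator <=>.
def translate(filter: Filter): Option[String] = filter match {
  case EqualTo(attribute, value)       => Some(s"$attribute = $value")
  case EqualNullSafe(attribute, value) => Some(s"$attribute <=> $value")
  case _                               => None // unsupported filters fall back to Spark-side evaluation
}
{code}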
[jira] [Issue Comment Deleted] (SPARK-9814) EqualNullSafe not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Comment: was deleted (was: I just made it. https://github.com/apache/spark/pull/8096) EqualNullSafe not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor
When a data source (such as Parquet) tries to filter data while reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible to pass it to data sources such as Parquet and JSON. In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},
{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions from the query planner. However, it is experimental, needs different interfaces, and is unstable (for reasons such as binary capability). In general, the problem below can happen:
1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}
The second query can be hugely slow even though it is functionally almost identical, because of the potentially large network traffic (etc.) caused by unfiltered data coming from the source RDD.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9814) EqualNullSafe not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681294#comment-14681294 ] Hyukjin Kwon commented on SPARK-9814: - I just made it :) EqualNullSafe not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor
When a data source (such as Parquet) tries to filter data while reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible to pass it to data sources such as Parquet and JSON. In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},
{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions from the query planner. However, it is experimental, needs different interfaces, and is unstable (for reasons such as binary capability). In general, the problem below can happen:
1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}
The second query can be hugely slow even though it is functionally almost identical, because of the potentially large network traffic (etc.) caused by unfiltered data coming from the source RDD.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8757) Check missing and add user guide for MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-8757: --- Description: Some MLlib algorithms are missing a user guide for Python; we need to check and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more.
* For MLlib
** Isotonic regression (Python example)
** LDA (Python example)
** Streaming k-means (Java/Python examples)
** PCA (Python example)
** SVD (Python example)
** FP-growth (Python example)
* For ML
** feature
*** CountVectorizerModel (user guide)
*** DCT (user guide)
*** MinMaxScaler (user guide)
*** StopWordsRemover (user guide)
*** VectorSlicer (user guide)
*** ElementwiseProduct (python example)
was: Some MLlib algorithms are missing a user guide for Python; we need to check and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more.
* For MLlib
** Isotonic regression
** LDA
** Streaming k-means
** PCA
** SVD
** FP-growth
* For ML
** feature
*** CountVectorizerModel
*** DCT
*** MinMaxScaler
*** StopWordsRemover
*** VectorSlicer
*** ElementwiseProduct
Check missing and add user guide for MLlib Python API - Key: SPARK-8757 URL: https://issues.apache.org/jira/browse/SPARK-8757 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang
Some MLlib algorithms are missing a user guide for Python; we need to check and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more.
* For MLlib
** Isotonic regression (Python example)
** LDA (Python example)
** Streaming k-means (Java/Python examples)
** PCA (Python example)
** SVD (Python example)
** FP-growth (Python example)
* For ML
** feature
*** CountVectorizerModel (user guide)
*** DCT (user guide)
*** MinMaxScaler (user guide)
*** StopWordsRemover (user guide)
*** VectorSlicer (user guide)
*** ElementwiseProduct (python example)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9818) Revert 6136, use docker to test JDBC datasources
[ https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681560#comment-14681560 ] Apache Spark commented on SPARK-9818: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8101 Revert 6136, use docker to test JDBC datasources Key: SPARK-9818 URL: https://issues.apache.org/jira/browse/SPARK-9818 Project: Spark Issue Type: Improvement Reporter: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities
[ https://issues.apache.org/jira/browse/SPARK-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681561#comment-14681561 ] Apache Spark commented on SPARK-6136: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8101 Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities -- Key: SPARK-6136 URL: https://issues.apache.org/jira/browse/SPARK-6136 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 Integration test suites in the JDBC data source ({{MySQLIntegration}} and {{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary incompatibility issues when Spark is compiled against Hadoop 2.4.
{code}
$ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 -Dhadoop.version=2.4.1
...
sql/test-only *.ParquetDataSourceOffIOSuite
...
[info] ParquetDataSourceOffIOSuite:
[info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 milliseconds)
[info] java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
[info] at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261)
[info] at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277)
[info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437)
[info] at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
[info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info] at scala.Option.getOrElse(Option.scala:120)
[info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info] at scala.Option.getOrElse(Option.scala:120)
[info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info] at scala.Option.getOrElse(Option.scala:120)
[info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525)
[info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
[info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
[info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797)
[info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60)
[info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
[info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76)
[info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
[info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105)
[info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94)
[info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92)
[info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71)
[info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67)
[info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92)
[info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67)
[info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105)
[info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67)
[info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76)
{code}
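One conventional mitigation for this kind of transitive-dependency clash, sketched in sbt syntax (the com.spotify coordinates are an assumption from the description; this is not necessarily the fix that was merged):
{code}
// Sketch: keep docker-client for the integration tests but exclude its
// transitive Guava 17.0 so the Hadoop-compatible Guava version wins.
libraryDependencies += ("com.spotify" % "docker-client" % "2.7.5" % "test")
  .exclude("com.google.guava", "guava")
{code}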
[jira] [Assigned] (SPARK-9818) Revert 6136, use docker to test JDBC datasources
[ https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9818: --- Assignee: (was: Apache Spark) Revert 6136, use docker to test JDBC datasources Key: SPARK-9818 URL: https://issues.apache.org/jira/browse/SPARK-9818 Project: Spark Issue Type: Improvement Reporter: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9814) EqualNullSafe not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9814: --- Assignee: Apache Spark EqualNullSafe not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Assignee: Apache Spark Priority: Minor
When a data source (such as Parquet) tries to filter data while reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible to pass it to data sources such as Parquet and JSON. In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},
{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions from the query planner. However, it is experimental, needs different interfaces, and is unstable (for reasons such as binary capability). In general, the problem below can happen:
1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}
The second query can be hugely slow even though it is functionally almost identical, because of the potentially large network traffic (etc.) caused by unfiltered data coming from the source RDD.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9814) EqualNullSafe not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681293#comment-14681293 ] Hyukjin Kwon commented on SPARK-9814: - I just made it. https://github.com/apache/spark/pull/8096 EqualNullSafe not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor
When a data source (such as Parquet) tries to filter data while reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible to pass it to data sources such as Parquet and JSON. In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},
{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions from the query planner. However, it is experimental, needs different interfaces, and is unstable (for reasons such as binary capability). In general, the problem below can happen:
1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}
The second query can be hugely slow even though it is functionally almost identical, because of the potentially large network traffic (etc.) caused by unfiltered data coming from the source RDD.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681493#comment-14681493 ] Yanbo Liang commented on SPARK-9663: [~josephkb] I have finished the check, linked the existing JIRAs here, and closed the duplicated ones. Thanks! ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley
This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9818) Revert 6136, use docker to test JDBC datasources
[ https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yijie Shen updated SPARK-9818: -- Description: (was: https://issues.apache.org/jira/browse/SPARK-6136) Revert 6136, use docker to test JDBC datasources Key: SPARK-9818 URL: https://issues.apache.org/jira/browse/SPARK-9818 Project: Spark Issue Type: Improvement Reporter: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9818) Revert 6136, use docker to test JDBC datasources
[ https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yijie Shen updated SPARK-9818: -- External issue ID: (was: 6136) Revert 6136, use docker to test JDBC datasources Key: SPARK-9818 URL: https://issues.apache.org/jira/browse/SPARK-9818 Project: Spark Issue Type: Improvement Reporter: Yijie Shen https://issues.apache.org/jira/browse/SPARK-6136 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9818) Revert 6136, use docker to test JDBC datasources
[ https://issues.apache.org/jira/browse/SPARK-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yijie Shen updated SPARK-9818: -- External issue ID: 6136 Revert 6136, use docker to test JDBC datasources Key: SPARK-9818 URL: https://issues.apache.org/jira/browse/SPARK-9818 Project: Spark Issue Type: Improvement Reporter: Yijie Shen https://issues.apache.org/jira/browse/SPARK-6136 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9810) Remove individual commit messages from the squash commit message
[ https://issues.apache.org/jira/browse/SPARK-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9810. Resolution: Fixed Fix Version/s: 1.5.0 Target Version/s: 1.5.0 (was: 1.6.0) Remove individual commit messages from the squash commit message Key: SPARK-9810 URL: https://issues.apache.org/jira/browse/SPARK-9810 Project: Spark Issue Type: Task Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0
I took a look at the commit messages in git log -- it looks like the individual commit messages are not that useful to include, but do make the commit messages more verbose. They are usually just a bunch of extremely concise descriptions of bug fixes, merges, etc:
{code}
cb3f12d [xxx] add whitespace
6d874a6 [xxx] support pyspark for yarn-client
89b01f5 [yyy] Update the unit test to add more cases
275d252 [yyy] Address the comments
7cc146d [yyy] Address the comments
2624723 [yyy] Fix rebase conflict
45befaa [yyy] Update the unit test
bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
{code}
See mailing list discussions: http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-Removing-individual-commit-messages-from-the-squash-commit-message-td13295.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
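For illustration only (the actual merge script is Python; this Scala sketch just shows the shape of the filtering, and the line format regex is an assumption):
{code}
// Drop per-commit lines of the form "<sha> [author] message" from a squash
// commit body, keeping all other lines.
val commitLine = """^[0-9a-f]{7,}\s+\[[^\]]+\]\s.*$""".r
def stripCommitLines(body: String): String =
  body.split("\n").filterNot(l => commitLine.pattern.matcher(l).matches()).mkString("\n")
{code}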
[jira] [Created] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests
Saisai Shao created SPARK-9817: -- Summary: Improve the container placement strategy by considering the localities of pending container requests Key: SPARK-9817 URL: https://issues.apache.org/jira/browse/SPARK-9817 Project: Spark Issue Type: Improvement Components: YARN Reporter: Saisai Shao Priority: Minor The current implementation does not consider the localities of pending container requests, since the required locality preferences of tasks shift from time to time. It is better to discard outdated container requests and recalculate them with the container placement strategy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
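A rough sketch of the proposed behaviour (all names are hypothetical; the real logic would live in the YARN allocator):
{code}
// Hypothetical model: a pending request remembers the nodes it asked for.
case class PendingRequest(preferredNodes: Set[String])

// Discard requests whose locality preferences no longer overlap the nodes
// currently preferred by pending tasks, and re-issue them from the placement
// strategy's fresh calculation.
def refreshRequests(
    pending: Seq[PendingRequest],
    currentPreferences: Set[String]): Seq[PendingRequest] = {
  val (fresh, outdated) =
    pending.partition(_.preferredNodes.exists(currentPreferences.contains))
  // `outdated` would be cancelled against YARN here before re-requesting.
  fresh ++ outdated.map(_ => PendingRequest(currentPreferences))
}
{code}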
[jira] [Issue Comment Deleted] (SPARK-8757) Check missing and add user guide for MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-8757: --- Comment: was deleted (was: [~josephkb] Yes, some of those items do have sections and need updates. I have specified more details about what is missing.) Check missing and add user guide for MLlib Python API - Key: SPARK-8757 URL: https://issues.apache.org/jira/browse/SPARK-8757 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang
Some MLlib algorithms are missing a user guide for Python; we need to check and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more.
* For MLlib
** Isotonic regression (Python example)
** LDA (Python example)
** Streaming k-means (Java/Python examples)
** PCA (Python example)
** SVD (Python example)
** FP-growth (Python example)
* For ML
** feature
*** CountVectorizerModel (user guide and examples)
*** DCT (user guide and examples)
*** MinMaxScaler (user guide and examples)
*** StopWordsRemover (user guide and examples)
*** VectorSlicer (user guide and examples)
*** ElementwiseProduct (python example)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8757) Check missing and add user guide for MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681536#comment-14681536 ] Yanbo Liang commented on SPARK-8757: [~josephkb] Yes, some of those items do have sections and need updates. I have specified more details about what is missing. Check missing and add user guide for MLlib Python API - Key: SPARK-8757 URL: https://issues.apache.org/jira/browse/SPARK-8757 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang
Some MLlib algorithms are missing a user guide for Python; we need to check and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more.
* For MLlib
** Isotonic regression (Python example)
** LDA (Python example)
** Streaming k-means (Java/Python examples)
** PCA (Python example)
** SVD (Python example)
** FP-growth (Python example)
* For ML
** feature
*** CountVectorizerModel (user guide and examples)
*** DCT (user guide and examples)
*** MinMaxScaler (user guide and examples)
*** StopWordsRemover (user guide and examples)
*** VectorSlicer (user guide and examples)
*** ElementwiseProduct (python example)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8757) Check missing and add user guide for MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681537#comment-14681537 ] Yanbo Liang commented on SPARK-8757: [~josephkb] Yes, some of those items do have sections and need updates. I have specified more details about what is missing. Check missing and add user guide for MLlib Python API - Key: SPARK-8757 URL: https://issues.apache.org/jira/browse/SPARK-8757 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang
Some MLlib algorithms are missing a user guide for Python; we need to check and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more.
* For MLlib
** Isotonic regression (Python example)
** LDA (Python example)
** Streaming k-means (Java/Python examples)
** PCA (Python example)
** SVD (Python example)
** FP-growth (Python example)
* For ML
** feature
*** CountVectorizerModel (user guide and examples)
*** DCT (user guide and examples)
*** MinMaxScaler (user guide and examples)
*** StopWordsRemover (user guide and examples)
*** VectorSlicer (user guide and examples)
*** ElementwiseProduct (python example)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9814) EqualNullSafe not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681291#comment-14681291 ] Apache Spark commented on SPARK-9814: - User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/8096 EqualNullSafe not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor
When a data source (such as Parquet) tries to filter data while reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible to pass it to data sources such as Parquet and JSON. In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},
{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions from the query planner. However, it is experimental, needs different interfaces, and is unstable (for reasons such as binary capability). In general, the problem below can happen:
1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}
The second query can be hugely slow even though it is functionally almost identical, because of the potentially large network traffic (etc.) caused by unfiltered data coming from the source RDD.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9814) EqualNullSafe not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681289#comment-14681289 ] Reynold Xin commented on SPARK-9814: [~hyukjin.kwon] would you like to submit a patch for this? EqualNullSafe not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor
When a data source (such as Parquet) tries to filter data while reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible to pass it to data sources such as Parquet and JSON. In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},
{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions from the query planner. However, it is experimental, needs different interfaces, and is unstable (for reasons such as binary capability). In general, the problem below can happen:
1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}
The second query can be hugely slow even though it is functionally almost identical, because of the potentially large network traffic (etc.) caused by unfiltered data coming from the source RDD.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9813) Incorrect UNION ALL behavior
[ https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326 ] Simeon Simeonov edited comment on SPARK-9813 at 8/11/15 6:47 AM: - [~hvanhovell] Oracle requires the number of columns to be the same and the data types to be compatible. (See http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we take that approach with Spark, then: - The first case would be OK (but different from Hive, which will cause its own set of problems as there is essentially no documentation on Spark SQL so everyone goes to the Hive Language Manual) - The second case would still be a bug because (a) the number of columns are different and (b) a numeric column is mixed into a string column - The third case still produces an opaque and confusing exception. was (Author: simeons): [~hvanhovell] Oracle requires the number of columns to be the same and the data types to be compatible. (See http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we take that approach with Spark, then: - The first case would be OK (but different from Hive, which will cause its own set of problems as there is essentially no documentation on Spark SQL so everyone goes to the Hive Language Manual) - The second case would still be a bug because (a) the number of columns were different and (b) a numeric column was mixed into a string column - The third case still produces an opaque and confusing exception. Incorrect UNION ALL behavior Key: SPARK-9813 URL: https://issues.apache.org/jira/browse/SPARK-9813 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.4.1 Environment: Ubuntu on AWS Reporter: Simeon Simeonov Labels: sql, union
According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL: {quote} The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown. {quote} Spark SQL silently swallows an error when the tables being joined with UNION ALL have the same number of columns but different names. Reproducible example:
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note "category" vs. "cat" names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

// +--------+---+
// |category|num|
// +--------+---+
// |       A|  5|
// |       A|  5|
// +--------+---+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
When the number of columns is different, Spark can even mix in datatypes.
Reproducible example (requires a new spark-shell session):
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note test_another is missing the category column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

// +--------+
// |category|
// +--------+
// |       A|
// |       5|
// +--------+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:
{code}
scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks union all select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse
{code}
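Until Spark enforces this, a caller-side guard is straightforward; a minimal sketch against the 1.4 DataFrame API, reusing the table names from the example above:
{code}
// Fail fast when the two sides of a UNION ALL disagree on schema, instead of
// letting Spark silently mix columns.
val left = ctx.table("test_one")
val right = ctx.table("test_another")
require(left.schema == right.schema,
  s"UNION ALL schema mismatch: ${left.schema.simpleString} vs ${right.schema.simpleString}")
val unioned = left.unionAll(right)
{code}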
[jira] [Resolved] (SPARK-9076) Improve NaN value handling
[ https://issues.apache.org/jira/browse/SPARK-9076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9076. Resolution: Fixed Fix Version/s: 1.5.0 Target Version/s: (was: ) Improve NaN value handling -- Key: SPARK-9076 URL: https://issues.apache.org/jira/browse/SPARK-9076 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 This is an umbrella ticket for handling NaN values. For general design, please see https://issues.apache.org/jira/browse/SPARK-9079 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3059) Spark internal module interface design
[ https://issues.apache.org/jira/browse/SPARK-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-3059. -- Resolution: Later Closing this one since I'm not sure whether it is useful to have a long-term JIRA ticket like this. Spark internal module interface design -- Key: SPARK-3059 URL: https://issues.apache.org/jira/browse/SPARK-3059 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin An umbrella ticket to track various internal module interface designs implementations for Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2456) Scheduler refactoring
[ https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-2456. -- Resolution: Later Closing this one since I'm not sure whether it is useful to have a long-term JIRA ticket like this. Scheduler refactoring - Key: SPARK-2456 URL: https://issues.apache.org/jira/browse/SPARK-2456 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Reynold Xin This is an umbrella ticket to track scheduler refactoring. We want to clearly define semantics and responsibilities of each component, and define explicit public interfaces for them so it is easier to understand and to contribute (also less buggy). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681418#comment-14681418 ] Konstantin Shaposhnikov commented on SPARK-8824: Ok, thank you for the update. Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS --- Key: SPARK-8824 URL: https://issues.apache.org/jira/browse/SPARK-8824 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9770) Add Python API for ml.feature.DCT
[ https://issues.apache.org/jira/browse/SPARK-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang closed SPARK-9770. -- Resolution: Duplicate Add Python API for ml.feature.DCT - Key: SPARK-9770 URL: https://issues.apache.org/jira/browse/SPARK-9770 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add Python API, user guide and example for ml.feature.DCT -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9663: --- Description: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-9771
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
was: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-9770
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-9771
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley
This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-9771
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9816) Support BinaryType in Concat
Takeshi Yamamuro created SPARK-9816: --- Summary: Support BinaryType in Concat Key: SPARK-9816 URL: https://issues.apache.org/jira/browse/SPARK-9816 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Takeshi Yamamuro Support BinaryType in catalyst Concat according to hive behaviours. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
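For reference, the Hive semantics being requested amount to byte-array concatenation with null propagation; a minimal sketch outside Catalyst:
{code}
// concat over BinaryType: null if any input is null, otherwise the inputs'
// bytes joined in order (mirroring Hive's behaviour for string concat).
def concatBinary(inputs: Seq[Array[Byte]]): Array[Byte] =
  if (inputs.exists(_ == null)) null else inputs.flatten.toArray
{code}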
[jira] [Closed] (SPARK-9771) Add Python API for ml.feature.MinMaxScaler
[ https://issues.apache.org/jira/browse/SPARK-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang closed SPARK-9771. -- Resolution: Duplicate Add Python API for ml.feature.MinMaxScaler -- Key: SPARK-9771 URL: https://issues.apache.org/jira/browse/SPARK-9771 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add Python API, user guide and example for ml.feature.MinMaxScaler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9663: --- Description: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
was: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-9771
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley
This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark.
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9148) User-facing documentation for NaN handling semantics
[ https://issues.apache.org/jira/browse/SPARK-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9148: --- Parent Issue: SPARK-9565 (was: SPARK-9076) User-facing documentation for NaN handling semantics Key: SPARK-9148 URL: https://issues.apache.org/jira/browse/SPARK-9148 Project: Spark Issue Type: Technical task Components: Documentation, SQL Reporter: Josh Rosen Priority: Blocker Once we've finalized our NaN changes for Spark 1.5, we need to create user-facing documentation to explain our chosen semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8361) Session of ThriftServer is still alive after I exit beeline
[ https://issues.apache.org/jira/browse/SPARK-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681525#comment-14681525 ] Weizhong commented on SPARK-8361: - SparkSQLSessionManager only overrides the closeSession function, which is called by the client (beeline or others). From the Hive (0.13.1) code we know that beeline handles Ctrl+D and !quit, which close the session, but it does not add a shutdown hook, so other ways of exiting may quit the client without closing the connection. Session of ThriftServer is still alive after I exit beeline --- Key: SPARK-8361 URL: https://issues.apache.org/jira/browse/SPARK-8361 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: centos6.2 spark-1.4.0 Reporter: cen yuhai I connected to the ThriftServer through beeline, but after I exited beeline (with 'ctrl + c' or 'ctrl + z'), the session still existed in the ThriftServer Web UI (SQL tab), with no Finish Time. If I use 'ctrl + d', it has a finish time. After reviewing the code, I think the session is still alive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
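The missing piece the comment describes can be illustrated with a plain JVM shutdown hook (a sketch; {{closeSession}} here stands in for whatever handle the client actually holds):
{code}
// Ensure the server-side session is closed even when the client process is
// terminated (e.g. Ctrl+C) rather than exited cleanly with !quit or Ctrl+D.
def closeSessionOnExit(closeSession: () => Unit): Unit =
  Runtime.getRuntime.addShutdownHook(new Thread {
    override def run(): Unit = closeSession()
  })
{code}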
[jira] [Commented] (SPARK-9816) Support BinaryType in Concat
[ https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681524#comment-14681524 ] Apache Spark commented on SPARK-9816: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/8098 Support BinaryType in Concat Key: SPARK-9816 URL: https://issues.apache.org/jira/browse/SPARK-9816 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Takeshi Yamamuro Support BinaryType in catalyst Concat according to hive behaviours. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9816) Support BinaryType in Concat
[ https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9816: --- Assignee: (was: Apache Spark) Support BinaryType in Concat Key: SPARK-9816 URL: https://issues.apache.org/jira/browse/SPARK-9816 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Takeshi Yamamuro Support BinaryType in catalyst Concat according to hive behaviours. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9816) Support BinaryType in Concat
[ https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9816: --- Assignee: Apache Spark Support BinaryType in Concat Key: SPARK-9816 URL: https://issues.apache.org/jira/browse/SPARK-9816 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Takeshi Yamamuro Assignee: Apache Spark Support BinaryType in catalyst Concat according to hive behaviours. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9810) Remove individual commit messages from the squash commit message
[ https://issues.apache.org/jira/browse/SPARK-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9810: --- Target Version/s: 1.6.0 (was: 1.5.0) Fix Version/s: (was: 1.5.0) 1.6.0 Remove individual commit messages from the squash commit message Key: SPARK-9810 URL: https://issues.apache.org/jira/browse/SPARK-9810 Project: Spark Issue Type: Task Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.6.0 I took a look at the commit messages in git log -- it looks like the individual commit messages are not that useful to include, but do make the commit messages more verbose. They are usually just a bunch of extremely concise descriptions of bug fixes, merges, etc: {code} cb3f12d [xxx] add whitespace 6d874a6 [xxx] support pyspark for yarn-client 89b01f5 [yyy] Update the unit test to add more cases 275d252 [yyy] Address the comments 7cc146d [yyy] Address the comments 2624723 [yyy] Fix rebase conflict 45befaa [yyy] Update the unit test bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue {code} See mailing list discussions: http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-Removing-individual-commit-messages-from-the-squash-commit-message-td13295.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
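In script terms the change is just a filter over the squash-commit body. A rough Scala sketch (the actual merge tooling is a Python script, and the regex here is an assumption inferred from the line format shown above, not taken from that script):
{code}
// Sketch: drop lines that look like individual squashed-commit entries,
// e.g. "cb3f12d [xxx] add whitespace", from a merge commit message.
def stripIndividualCommits(body: String): String =
  body.split("\n")
    .filterNot(_.trim.matches("""[0-9a-f]{7,10}\s+\[[^\]]+\]\s+.*"""))
    .mkString("\n")
{code}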
[jira] [Commented] (SPARK-9636) Treat $SPARK_HOME as read-only
[ https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681403#comment-14681403 ] Philipp Angerer commented on SPARK-9636: OK, great :) I see why you think my proposal might be too complex, yet I still think that “log file relative to binary” is much more surprising in an environment where log files have certain dedicated places. {{/var/log/}} is something I really expect a system daemon to use for logs. {{~/.cache/logs}} is merely the best compromise in the absence of a dedicated user log directory (e.g. {{$XDG_USER_DATA_DIR}} and {{$XDG_USER_CONFIG_DIR}} are clear, but there’s no {{$XDG_USER_STATE_DIR}}). I think all this is a consequence of Spark not being a good Linux citizen: it has a {{$SPARK_HOME}} and relies on it, while there should be a way to run it split across sensible directories: {{/usr/share/spark/}} for data, {{/usr/lib/spark/}} for shared libraries, {{/usr/lib/pythonx.x/site-packages/}} for pyspark, {{/usr/bin/}} for binaries and scripts, {{/etc/spark/}} for configs, and {{/var/log/spark}} for log files. Treat $SPARK_HOME as read-only --- Key: SPARK-9636 URL: https://issues.apache.org/jira/browse/SPARK-9636 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.4.1 Environment: Linux Reporter: Philipp Angerer Priority: Minor Labels: easyfix When starting Spark scripts as a user while Spark is installed in a directory the user has no write permission on, most things work fine, except for the logs (e.g. for {{start-master.sh}}): logs are written by default to {{$SPARK_LOG_DIR}} or, if unset, to {{$SPARK_HOME/logs}}. If installed this way, Spark should, instead of throwing an error, write its logs to {{/var/log/spark/}}. That is easy to fix by testing a few log dirs in sequence for writability before using one. I suggest {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → {{$SPARK_HOME/logs/}}, as sketched below. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
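The fallback chain amounts to “first writable directory wins”. A Scala sketch under the candidate order proposed above (the real change would live in the launcher shell scripts, so this is only illustrative):
{code}
import java.io.File

// Sketch: pick the first candidate log directory that exists (or can be
// created) and is writable. Order follows the proposal in the issue.
def chooseLogDir(): Option[File] = {
  val home = sys.env.getOrElse("HOME", ".")
  val candidates = Seq(
    sys.env.get("SPARK_LOG_DIR"),
    Some("/var/log/spark"),
    Some(s"$home/.cache/spark-logs"),
    sys.env.get("SPARK_HOME").map(_ + "/logs")
  ).flatten.map(new File(_))
  candidates.find(d => (d.isDirectory || d.mkdirs()) && d.canWrite)
}
{code}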
[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681473#comment-14681473 ] Sean Owen commented on SPARK-9776: -- Yeah, I see the same. I don't know enough about HiveContext to know whether this indicates something else is going on, but the error message could at least be better. How is your hive-site.xml configured? Another instance of Derby may have already booted the database --- Key: SPARK-9776 URL: https://issues.apache.org/jira/browse/SPARK-9776 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: Mac Yosemite, spark-1.5.0 Reporter: Sudhakar Thota Attachments: SPARK-9776-FL1.rtf val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an error, though the same works in spark-1.4.1. Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9816) Support BinaryType in Concat
[ https://issues.apache.org/jira/browse/SPARK-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681531#comment-14681531 ] Apache Spark commented on SPARK-9816: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/8099 Support BinaryType in Concat Key: SPARK-9816 URL: https://issues.apache.org/jira/browse/SPARK-9816 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Takeshi Yamamuro Support BinaryType in catalyst Concat according to hive behaviours. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8757) Check missing and add user guide for MLlib Python API
[ https://issues.apache.org/jira/browse/SPARK-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-8757: --- Description: Some MLlib algorithms are missing Python user guide sections; we need to check for them and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more. * For MLlib ** Isotonic regression (Python example) ** LDA (Python example) ** Streaming k-means (Java/Python examples) ** PCA (Python example) ** SVD (Python example) ** FP-growth (Python example) * For ML ** feature *** CountVectorizerModel (user guide and examples) *** DCT (user guide and examples) *** MinMaxScaler (user guide and examples) *** StopWordsRemover (user guide and examples) *** VectorSlicer (user guide and examples) *** ElementwiseProduct (Python example) was: Some MLlib algorithms are missing Python user guide sections; we need to check for them and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more. * For MLlib ** Isotonic regression (Python example) ** LDA (Python example) ** Streaming k-means (Java/Python examples) ** PCA (Python example) ** SVD (Python example) ** FP-growth (Python example) * For ML ** feature *** CountVectorizerModel (user guide) *** DCT (user guide) *** MinMaxScaler (user guide) *** StopWordsRemover (user guide) *** VectorSlicer (user guide) *** ElementwiseProduct (Python example) Check missing and add user guide for MLlib Python API - Key: SPARK-8757 URL: https://issues.apache.org/jira/browse/SPARK-8757 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Some MLlib algorithms are missing Python user guide sections; we need to check for them and add them. The algorithms missing Python user guides are listed below. Please add to this list if you find more. * For MLlib ** Isotonic regression (Python example) ** LDA (Python example) ** Streaming k-means (Java/Python examples) ** PCA (Python example) ** SVD (Python example) ** FP-growth (Python example) * For ML ** feature *** CountVectorizerModel (user guide and examples) *** DCT (user guide and examples) *** MinMaxScaler (user guide and examples) *** StopWordsRemover (user guide and examples) *** VectorSlicer (user guide and examples) *** ElementwiseProduct (Python example) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests
[ https://issues.apache.org/jira/browse/SPARK-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9817: --- Assignee: (was: Apache Spark) Improve the container placement strategy by considering the localities of pending container requests Key: SPARK-9817 URL: https://issues.apache.org/jira/browse/SPARK-9817 Project: Spark Issue Type: Improvement Components: YARN Reporter: Saisai Shao Priority: Minor The current implementation does not reconsider the localities of pending container requests, even though the locality preferences of tasks shift from time to time. It is better to discard outdated container requests and recalculate them with the container placement strategy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests
[ https://issues.apache.org/jira/browse/SPARK-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9817: --- Assignee: Apache Spark Improve the container placement strategy by considering the localities of pending container requests Key: SPARK-9817 URL: https://issues.apache.org/jira/browse/SPARK-9817 Project: Spark Issue Type: Improvement Components: YARN Reporter: Saisai Shao Assignee: Apache Spark Priority: Minor The current implementation does not reconsider the localities of pending container requests, even though the locality preferences of tasks shift from time to time. It is better to discard outdated container requests and recalculate them with the container placement strategy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests
[ https://issues.apache.org/jira/browse/SPARK-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681556#comment-14681556 ] Apache Spark commented on SPARK-9817: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8100 Improve the container placement strategy by considering the localities of pending container requests Key: SPARK-9817 URL: https://issues.apache.org/jira/browse/SPARK-9817 Project: Spark Issue Type: Improvement Components: YARN Reporter: Saisai Shao Priority: Minor The current implementation does not reconsider the localities of pending container requests, even though the locality preferences of tasks shift from time to time. It is better to discard outdated container requests and recalculate them with the container placement strategy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
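The proposed recalculation reduces to: cancel pending requests whose preferred hosts no longer match the current task preferences, then re-request for the uncovered hosts. A simplified Scala sketch, with the YarnAllocator/AMRMClient plumbing replaced by hypothetical cancel/submit hooks and a hypothetical PendingRequest type:
{code}
// Hypothetical placeholder for a YARN container request and its localities.
case class PendingRequest(id: Long, preferredHosts: Set[String])

// Sketch: discard outdated requests and re-request for uncovered hosts.
def refreshRequests(
    pending: Seq[PendingRequest],
    currentPrefs: Set[String],
    cancel: PendingRequest => Unit,
    submit: Set[String] => Unit): Unit = {
  val (stillValid, outdated) =
    pending.partition(r => (r.preferredHosts intersect currentPrefs).nonEmpty)
  outdated.foreach(cancel)  // locality preference has shifted; drop these
  val covered = stillValid.flatMap(_.preferredHosts).toSet
  val uncovered = currentPrefs diff covered
  if (uncovered.nonEmpty) submit(uncovered)  // recalculate placement for these
}
{code}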
[jira] [Updated] (SPARK-9809) Task crashes because the internal accumulators are not properly initialized
[ https://issues.apache.org/jira/browse/SPARK-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9809: --- Description: When a stage failed and another stage was resubmitted with only part of the partitions to compute, all the tasks failed with the error message: java.util.NoSuchElementException: key not found: peakExecutionMemory. This is because the internal accumulators are not properly initialized for this stage, while other code assumes the internal accumulators always exist.
{code}
Job aborted due to stage failure: Task 4 in stage 12.0 failed 4 times, most recent failure: Lost task 4.3 in stage 12.0 (TID 4460, 10.1.2.40): java.util.NoSuchElementException: key not found: peakExecutionMemory
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
{code}
was: When a stage failed and another stage was resubmitted with only part of the partitions to compute, all the tasks failed with the error message: java.util.NoSuchElementException: key not found: peakExecutionMemory. This is because the internal accumulators are not properly initialized for this stage, while other code assumes the internal accumulators always exist.
Job aborted due to stage failure: Task 4 in stage 12.0 failed 4 times, most recent failure: Lost task 4.3 in stage 12.0 (TID 4460, 10.1.2.40): java.util.NoSuchElementException: key not found: peakExecutionMemory at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Task crashes because the internal accumulators are not properly initialized --- Key: SPARK-9809 URL: https://issues.apache.org/jira/browse/SPARK-9809 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: Carson Wang Priority: Blocker When a stage failed and another stage was resubmitted with only part of the partitions to compute, all the tasks failed with the error message: java.util.NoSuchElementException: key not found: peakExecutionMemory. This is because the internal accumulators are not properly initialized for this stage, while other code assumes the internal accumulators always exist.
{code}
Job aborted due to stage failure: Task 4 in stage 12.0 failed 4 times, most recent failure: Lost task 4.3 in stage 12.0 (TID 4460, 10.1.2.40): java.util.NoSuchElementException: key not found: peakExecutionMemory
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
{code}
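The narrow fix on the executor side is to stop assuming the accumulator map is pre-populated. A Scala sketch of a tolerant lookup (names simplified; the real call site is in ExternalSorter.writePartitionedFile):
{code}
import scala.collection.mutable

// Sketch: initialize the entry on first touch instead of Map.apply, which
// throws NoSuchElementException when a resubmitted stage never had its
// internal accumulators set up.
def recordPeakMemory(accums: mutable.Map[String, Long], bytes: Long): Unit = {
  val current = accums.getOrElseUpdate("peakExecutionMemory", 0L)
  accums("peakExecutionMemory") = math.max(current, bytes)
}
{code}
The fuller fix is for the scheduler to (re)create the internal accumulators whenever a stage is resubmitted, so the lookup never misses.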
[jira] [Commented] (SPARK-9813) Incorrect UNION ALL behavior
[ https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326 ] Simeon Simeonov commented on SPARK-9813: [~hvanhovell] Oracle requires the number of columns to be the same and the data types to be compatible. (See http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we take that approach with Spark, then: - The first case would be OK (but different from Hive, which will cause its own set of problems, as there is essentially no documentation on Spark SQL, so everyone goes to the Hive Language Manual) - The second case would still be a bug because (a) the number of columns was different and (b) a numeric column was mixed into a string column - The third case still produces an opaque and confusing exception. Incorrect UNION ALL behavior Key: SPARK-9813 URL: https://issues.apache.org/jira/browse/SPARK-9813 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.4.1 Environment: Ubuntu on AWS Reporter: Simeon Simeonov Labels: sql, union According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL: {quote} The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown. {quote} Spark SQL silently swallows an error when the tables being joined with UNION ALL have the same number of columns but different names. Reproducible example:
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note "category" vs. "cat" names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

// +--------+---+
// |category|num|
// +--------+---+
// |       A|  5|
// |       A|  5|
// +--------+---+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
When the number of columns is different, Spark can even mix in datatypes.
Reproducible example (requires a new spark-shell session):
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note test_another is missing the "category" column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

// +--------+
// |category|
// +--------+
// |       A|
// |       5|
// +--------+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:
{code}
scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks union all select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse Completed
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11
{code}
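Until the analyzer enforces the Hive rule, callers can guard the union themselves. A Scala sketch against the 1.4-era DataFrame API, assuming two registered tables as in the examples above:
{code}
import org.apache.spark.sql.DataFrame

// Sketch: fail fast with a readable message when the two sides of a
// UNION ALL disagree on column names, instead of letting positional
// resolution silently mix columns and datatypes.
def safeUnionAll(left: DataFrame, right: DataFrame): DataFrame = {
  val (l, r) = (left.schema.fieldNames.toSeq, right.schema.fieldNames.toSeq)
  require(l == r, s"UNION ALL schema mismatch: left columns $l vs right columns $r")
  left.unionAll(right)
}

// Usage, following the examples above:
// safeUnionAll(ctx.table("test_one"), ctx.table("test_another"))
{code}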
[jira] [Comment Edited] (SPARK-9813) Incorrect UNION ALL behavior
[ https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326 ] Simeon Simeonov edited comment on SPARK-9813 at 8/11/15 6:46 AM: - [~hvanhovell] Oracle requires the number of columns to be the same and the data types to be compatible. (See http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we take that approach with Spark, then: - The first case would be OK (but different from Hive, which will cause its own set of problems, as there is essentially no documentation on Spark SQL, so everyone goes to the Hive Language Manual) - The second case would still be a bug because (a) the number of columns was different and (b) a numeric column was mixed into a string column - The third case still produces an opaque and confusing exception. was (Author: simeons): [~hvanhovell] Oracle requires the number of columns to be the same and the data types to be compatible. (See http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we take that approach with Spark, then: - The first case would be OK (but different from Hive, which will cause it's own set of problems as there is essentially no documentation on Spark SQL so everyone goes to the Hive Language Manual) - The second case would still be a bug because (a) the number of columns were different and (b) a numeric column was mixed into a string column - The third case still produces an opaque and confusing exception. Incorrect UNION ALL behavior Key: SPARK-9813 URL: https://issues.apache.org/jira/browse/SPARK-9813 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.4.1 Environment: Ubuntu on AWS Reporter: Simeon Simeonov Labels: sql, union According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL: {quote} The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown. {quote} Spark SQL silently swallows an error when the tables being joined with UNION ALL have the same number of columns but different names. Reproducible example:
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note "category" vs. "cat" names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

// +--------+---+
// |category|num|
// +--------+---+
// |       A|  5|
// |       A|  5|
// +--------+---+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
When the number of columns is different, Spark can even mix in datatypes.
Reproducible example (requires a new spark-shell session):
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note test_another is missing the "category" column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

// +--------+
// |category|
// +--------+
// |       A|
// |       5|
// +--------+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:
{code}
scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks union all select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse
{code}
[jira] [Commented] (SPARK-8724) Need documentation on how to deploy or use SparkR in Spark 1.4.0+
[ https://issues.apache.org/jira/browse/SPARK-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681427#comment-14681427 ] Vincent Warmerdam commented on SPARK-8724: -- [~shivaram] [~felixcheung] Does this still need to be open, or do we want to add parts of the RStudio blog post to the documentation on the SparkR side? Need documentation on how to deploy or use SparkR in Spark 1.4.0+ - Key: SPARK-8724 URL: https://issues.apache.org/jira/browse/SPARK-8724 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Felix Cheung Priority: Minor As of now there doesn't seem to be any official documentation on how to deploy SparkR with Spark 1.4.0+. Also, cluster-manager-specific documentation (like http://spark.apache.org/docs/latest/spark-standalone.html) does not call out which modes are supported for SparkR or give details on the deployment steps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org