[jira] [Commented] (SPARK-7639) Add Python API for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557681#comment-14557681 ]

Apache Spark commented on SPARK-7639:

User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6387

Add Python API for Statistics.kernelDensity
Key: SPARK-7639
URL: https://issues.apache.org/jira/browse/SPARK-7639
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

Add Python API for org.apache.spark.mllib.stat.Statistics.kernelDensity
[jira] [Assigned] (SPARK-7639) Add Python API for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7639:
Assignee: (was: Apache Spark)

Add Python API for Statistics.kernelDensity
Key: SPARK-7639
URL: https://issues.apache.org/jira/browse/SPARK-7639
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang

Add Python API for org.apache.spark.mllib.stat.Statistics.kernelDensity
[jira] [Assigned] (SPARK-7639) Add Python API for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7639:
Assignee: Apache Spark

Add Python API for Statistics.kernelDensity
Key: SPARK-7639
URL: https://issues.apache.org/jira/browse/SPARK-7639
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Apache Spark

Add Python API for org.apache.spark.mllib.stat.Statistics.kernelDensity
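For context, a minimal sketch of the Scala-side kernel density API that the requested Python wrapper would mirror (this assumes the {{org.apache.spark.mllib.stat.KernelDensity}} builder shipping in Spark 1.4 and an existing SparkContext {{sc}}, as in spark-shell; sample and evaluation points are illustrative):

{code}
import org.apache.spark.mllib.stat.KernelDensity

// `sc` is an existing SparkContext (e.g. provided by spark-shell).
val sample = sc.parallelize(Seq(1.0, 2.0, 4.0, 5.0))

// Estimate the density of the sample at three evaluation points,
// using a Gaussian kernel with bandwidth 3.0.
val densities: Array[Double] = new KernelDensity()
  .setSample(sample)
  .setBandwidth(3.0)
  .estimate(Array(-1.0, 2.0, 5.0))
{code}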
[jira] [Updated] (SPARK-7524) add configs for keytab and principal, move originals to internal
[ https://issues.apache.org/jira/browse/SPARK-7524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Wang updated SPARK-7524:
Description: Spark now supports long-running services by renewing tokens for the NameNode, but it only accepts the keytab and principal as --k=v command-line options, which is not very convenient. I want to add spark.* configs that can be set in the properties file and as system properties.
was: Spark now supports long-running services by renewing tokens for the NameNode, but it only accepts the keytab and principal as --k=v command-line options, which is not very convenient. I want to add spark.* configs that can be set in the properties file and as system properties, and move the originals to spark.internal.*.

add configs for keytab and principal, move originals to internal
Key: SPARK-7524
URL: https://issues.apache.org/jira/browse/SPARK-7524
Project: Spark
Issue Type: Improvement
Components: YARN
Reporter: Tao Wang

Spark now supports long-running services by renewing tokens for the NameNode, but it only accepts the keytab and principal as --k=v command-line options, which is not very convenient. I want to add spark.* configs that can be set in the properties file and as system properties.
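For illustration, a minimal sketch of the proposed spark.* form (the spark.yarn.keytab and spark.yarn.principal key names follow the related SPARK-7846 below; the path and principal values are placeholders):

{code}
import org.apache.spark.SparkConf

// Carry the credentials as ordinary spark.* properties rather than dedicated
// --k=v options; key names follow SPARK-7846, values are placeholders.
val conf = new SparkConf()
  .set("spark.yarn.keytab", "/path/to/user.keytab")
  .set("spark.yarn.principal", "user@EXAMPLE.COM")
{code}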
[jira] [Assigned] (SPARK-7846) Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
[ https://issues.apache.org/jira/browse/SPARK-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7846:
Assignee: Apache Spark

Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Key: SPARK-7846
URL: https://issues.apache.org/jira/browse/SPARK-7846
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Tao Wang
Assignee: Apache Spark

The --principal and --keytab options are passed to the client, but when we start the Thrift server or spark-shell they are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these arguments in a different form, e.g. as system properties.
[jira] [Commented] (SPARK-7846) Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
[ https://issues.apache.org/jira/browse/SPARK-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557732#comment-14557732 ]

Apache Spark commented on SPARK-7846:

User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/6051

Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Key: SPARK-7846
URL: https://issues.apache.org/jira/browse/SPARK-7846
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Tao Wang

The --principal and --keytab options are passed to the client, but when we start the Thrift server or spark-shell they are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these arguments in a different form, e.g. as system properties.
[jira] [Assigned] (SPARK-7846) Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
[ https://issues.apache.org/jira/browse/SPARK-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7846:
Assignee: (was: Apache Spark)

Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Key: SPARK-7846
URL: https://issues.apache.org/jira/browse/SPARK-7846
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Tao Wang

The --principal and --keytab options are passed to the client, but when we start the Thrift server or spark-shell they are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these arguments in a different form, e.g. as system properties.
[jira] [Assigned] (SPARK-7809) MultivariateOnlineSummarizer should allow users to configure what to compute
[ https://issues.apache.org/jira/browse/SPARK-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7809:
Assignee: Apache Spark

MultivariateOnlineSummarizer should allow users to configure what to compute
Key: SPARK-7809
URL: https://issues.apache.org/jira/browse/SPARK-7809
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark

Now MultivariateOnlineSummarizer computes every summary statistic it can provide, which is okay and convenient for a small number of features. If the feature dimension is large, this becomes expensive. So we should add setters to allow users to configure what to compute.
{code}
val summarizer = new MultivariateOnlineSummarizer()
  .withMean(false)
  .withMax(false)
{code}
[jira] [Commented] (SPARK-7809) MultivariateOnlineSummarizer should allow users to configure what to compute
[ https://issues.apache.org/jira/browse/SPARK-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557695#comment-14557695 ]

Apache Spark commented on SPARK-7809:

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6388

MultivariateOnlineSummarizer should allow users to configure what to compute
Key: SPARK-7809
URL: https://issues.apache.org/jira/browse/SPARK-7809
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng

Now MultivariateOnlineSummarizer computes every summary statistic it can provide, which is okay and convenient for a small number of features. If the feature dimension is large, this becomes expensive. So we should add setters to allow users to configure what to compute.
{code}
val summarizer = new MultivariateOnlineSummarizer()
  .withMean(false)
  .withMax(false)
{code}
[jira] [Assigned] (SPARK-7809) MultivariateOnlineSummarizer should allow users to configure what to compute
[ https://issues.apache.org/jira/browse/SPARK-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7809:
Assignee: (was: Apache Spark)

MultivariateOnlineSummarizer should allow users to configure what to compute
Key: SPARK-7809
URL: https://issues.apache.org/jira/browse/SPARK-7809
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng

Now MultivariateOnlineSummarizer computes every summary statistic it can provide, which is okay and convenient for a small number of features. If the feature dimension is large, this becomes expensive. So we should add setters to allow users to configure what to compute.
{code}
val summarizer = new MultivariateOnlineSummarizer()
  .withMean(false)
  .withMax(false)
{code}
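As a point of reference, a short sketch of the summarizer as it behaves today, where every statistic is computed unconditionally (the withMean/withMax setters quoted above are only proposed and do not exist yet):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// The existing API: add samples one at a time, then read any statistic.
val summarizer = new MultivariateOnlineSummarizer()
summarizer.add(Vectors.dense(1.0, 10.0))
summarizer.add(Vectors.dense(3.0, 20.0))
println(summarizer.mean)     // per-column means: [2.0, 15.0]
println(summarizer.variance) // per-column sample variances
{code}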
[jira] [Created] (SPARK-7847) Fix dynamic partition path escaping
Cheng Lian created SPARK-7847:
Summary: Fix dynamic partition path escaping
Key: SPARK-7847
URL: https://issues.apache.org/jira/browse/SPARK-7847
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

Background: when writing dynamic partitions, partition values are converted to strings and escaped if necessary. For example, a partition column {{p}} of type {{String}} may have the value {{A/B}}; the corresponding partition directory name is then escaped into {{p=A%2fB}}. Currently there are two issues with dynamic partition path escaping. The first is that escaped strings are not unescaped when partition values are read back. This one is easy to fix. The second is more subtle. In [PR #5381|https://github.com/apache/spark/pull/5381/files#diff-c69b9e667e93b7e4693812cc72abb65fR492] we tried to use {{Path.toUri.toString}} to fix an escaping issue related to S3 credentials containing the {{/}} character. Unfortunately, {{Path.toUri.toString}} also escapes {{%}} characters in the path. So, in the dynamic partitioning case above, {{p=A%2fB}} is double-escaped into {{p=A%252fB}} ({{%}} escaped into {{%25}}). The expected behavior is to escape only the URI user-info part (S3 key and secret) and leave all other components untouched.
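The double-escaping is easy to reproduce outside Spark. A standalone sketch of the mechanism described above, using plain java.net.URI rather than Spark's code:

{code}
import java.net.URI

// Directory name written for partition value "A/B", escaped once:
val escapedOnce = "p=A%2fB"

// Java's multi-argument URI constructors always quote '%', so re-encoding an
// already-escaped path (as Path.toUri.toString effectively does) escapes it
// a second time:
val doubleEscaped = new URI(null, null, s"/table/$escapedOnce", null).toString
println(doubleEscaped) // prints /table/p=A%252fB
{code}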
[jira] [Created] (SPARK-7846) Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Tao Wang created SPARK-7846:
Summary: Use different way to pass spark.yarn.keytab and spark.yarn.principal in different modes
Key: SPARK-7846
URL: https://issues.apache.org/jira/browse/SPARK-7846
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Tao Wang

The --principal and --keytab options are passed to the client, but when we start the Thrift server or spark-shell they are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these arguments in a different form, e.g. as system properties.
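A minimal sketch of the proposed alternative: hand the values to the JVM as system properties, so the third-party option parsers in the main classes never see unrecognised --principal/--keytab flags (the values are placeholders):

{code}
// Set before the main class parses its arguments; values are placeholders.
sys.props("spark.yarn.principal") = "user@EXAMPLE.COM"
sys.props("spark.yarn.keytab") = "/path/to/user.keytab"
{code}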
[jira] [Assigned] (SPARK-7847) Fix dynamic partition path escaping
[ https://issues.apache.org/jira/browse/SPARK-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7847:
Assignee: Apache Spark (was: Cheng Lian)

Fix dynamic partition path escaping
Key: SPARK-7847
URL: https://issues.apache.org/jira/browse/SPARK-7847
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Critical

Background: when writing dynamic partitions, partition values are converted to strings and escaped if necessary. For example, a partition column {{p}} of type {{String}} may have the value {{A/B}}; the corresponding partition directory name is then escaped into {{p=A%2fB}}. Currently there are two issues with dynamic partition path escaping. The first is that escaped strings are not unescaped when partition values are read back. This one is easy to fix. The second is more subtle. In [PR #5381|https://github.com/apache/spark/pull/5381/files#diff-c69b9e667e93b7e4693812cc72abb65fR492] we tried to use {{Path.toUri.toString}} to fix an escaping issue related to S3 credentials containing the {{/}} character. Unfortunately, {{Path.toUri.toString}} also escapes {{%}} characters in the path. So, in the dynamic partitioning case above, {{p=A%2fB}} is double-escaped into {{p=A%252fB}} ({{%}} escaped into {{%25}}). The expected behavior is to escape only the URI user-info part (S3 key and secret) and leave all other components untouched.
[jira] [Assigned] (SPARK-7847) Fix dynamic partition path escaping
[ https://issues.apache.org/jira/browse/SPARK-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7847:
Assignee: Cheng Lian (was: Apache Spark)

Fix dynamic partition path escaping
Key: SPARK-7847
URL: https://issues.apache.org/jira/browse/SPARK-7847
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

Background: when writing dynamic partitions, partition values are converted to strings and escaped if necessary. For example, a partition column {{p}} of type {{String}} may have the value {{A/B}}; the corresponding partition directory name is then escaped into {{p=A%2fB}}. Currently there are two issues with dynamic partition path escaping. The first is that escaped strings are not unescaped when partition values are read back. This one is easy to fix. The second is more subtle. In [PR #5381|https://github.com/apache/spark/pull/5381/files#diff-c69b9e667e93b7e4693812cc72abb65fR492] we tried to use {{Path.toUri.toString}} to fix an escaping issue related to S3 credentials containing the {{/}} character. Unfortunately, {{Path.toUri.toString}} also escapes {{%}} characters in the path. So, in the dynamic partitioning case above, {{p=A%2fB}} is double-escaped into {{p=A%252fB}} ({{%}} escaped into {{%25}}). The expected behavior is to escape only the URI user-info part (S3 key and secret) and leave all other components untouched.
[jira] [Commented] (SPARK-7847) Fix dynamic partition path escaping
[ https://issues.apache.org/jira/browse/SPARK-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557728#comment-14557728 ]

Apache Spark commented on SPARK-7847:

User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6389

Fix dynamic partition path escaping
Key: SPARK-7847
URL: https://issues.apache.org/jira/browse/SPARK-7847
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

Background: when writing dynamic partitions, partition values are converted to strings and escaped if necessary. For example, a partition column {{p}} of type {{String}} may have the value {{A/B}}; the corresponding partition directory name is then escaped into {{p=A%2fB}}. Currently there are two issues with dynamic partition path escaping. The first is that escaped strings are not unescaped when partition values are read back. This one is easy to fix. The second is more subtle. In [PR #5381|https://github.com/apache/spark/pull/5381/files#diff-c69b9e667e93b7e4693812cc72abb65fR492] we tried to use {{Path.toUri.toString}} to fix an escaping issue related to S3 credentials containing the {{/}} character. Unfortunately, {{Path.toUri.toString}} also escapes {{%}} characters in the path. So, in the dynamic partitioning case above, {{p=A%2fB}} is double-escaped into {{p=A%252fB}} ({{%}} escaped into {{%25}}). The expected behavior is to escape only the URI user-info part (S3 key and secret) and leave all other components untouched.
[jira] [Commented] (SPARK-7535) Audit Pipeline APIs for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557804#comment-14557804 ]

Apache Spark commented on SPARK-7535:

User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6392

Audit Pipeline APIs for 1.4
Key: SPARK-7535
URL: https://issues.apache.org/jira/browse/SPARK-7535
Project: Spark
Issue Type: Sub-task
Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Xiangrui Meng

This is an umbrella for auditing the Pipeline (spark.ml) APIs. Items to check:
* Public/protected/private access
* Consistency across spark.ml
* Classes, methods, and parameters in spark.mllib but missing in spark.ml
** We should create JIRAs for each of these (under an umbrella) as to-do items for future releases.

For each algorithm or API component, create a subtask under this umbrella. Some major new items:
* new feature transformers
* tree models
* elastic-net
* ML attributes
* developer APIs (Predictor, Classifier, Regressor)
[jira] [Resolved] (SPARK-7805) Move SQLTestUtils.scala from src/main
[ https://issues.apache.org/jira/browse/SPARK-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-7805.
Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6334 [https://github.com/apache/spark/pull/6334]

Move SQLTestUtils.scala from src/main
Key: SPARK-7805
URL: https://issues.apache.org/jira/browse/SPARK-7805
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Patrick Wendell
Assignee: Yin Huai
Priority: Critical
Fix For: 1.4.0

These classes trigger binary compatibility issues when changed. In general we shouldn't be putting test code in src/main. If it's needed by multiple modules, IIRC we have a way to do that (look elsewhere in Spark).
[jira] [Updated] (SPARK-7848) Update SparkStreaming docs to include FAQ for knobs
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jay vyas updated SPARK-7848:
Summary: Update SparkStreaming docs to include FAQ for knobs (was: Update SparkStreaming docs to include knobs)

Update SparkStreaming docs to include FAQ for knobs
Key: SPARK-7848
URL: https://issues.apache.org/jira/browse/SPARK-7848
Project: Spark
Issue Type: Documentation
Components: Streaming
Reporter: jay vyas

A recent email on the mailing list detailed a bunch of great knobs to remember for Spark Streaming. Let's integrate this into the docs where appropriate. I'll paste the raw text in a comment below.
[jira] [Updated] (SPARK-7848) Update SparkStreaming docs to include FAQ or bullets for knobs.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jay vyas updated SPARK-7848:
Summary: Update SparkStreaming docs to include FAQ or bullets for knobs. (was: Update SparkStreaming docs to include FAQ for knobs)

Update SparkStreaming docs to include FAQ or bullets for knobs.
Key: SPARK-7848
URL: https://issues.apache.org/jira/browse/SPARK-7848
Project: Spark
Issue Type: Documentation
Components: Streaming
Reporter: jay vyas

A recent email on the mailing list detailed a bunch of great knobs to remember for Spark Streaming. Let's integrate this into the docs where appropriate. I'll paste the raw text in a comment below.
[jira] [Updated] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ knobs information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jay vyas updated SPARK-7848:
Summary: Update SparkStreaming docs to incorporate FAQ and/or bullets w/ knobs information. (was: Update SparkStreaming docs to include FAQ or bullets for knobs.)

Update SparkStreaming docs to incorporate FAQ and/or bullets w/ knobs information.
Key: SPARK-7848
URL: https://issues.apache.org/jira/browse/SPARK-7848
Project: Spark
Issue Type: Documentation
Components: Streaming
Reporter: jay vyas

A recent email on the mailing list detailed a bunch of great knobs to remember for Spark Streaming. Let's integrate this into the docs where appropriate. I'll paste the raw text in a comment below.
[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557820#comment-14557820 ]

Peng Cheng commented on SPARK-7442:

Adding the jar won't solve the problem; you also need to set the following parameters:
--conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
But in my 2.6 environment the added jar is ignored by the worker's classloader for an unknown reason; see http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar

Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
Key: SPARK-7442
URL: https://issues.apache.org/jira/browse/SPARK-7442
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.3.1
Environment: OS X
Reporter: Nicholas Chammas

# Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html].
# Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}.
# Fire up PySpark and try reading from S3 with something like this: {code}sc.textFile('s3n://bucket/file_*').count(){code}
# You will get an error like this: {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.io.IOException: No FileSystem for scheme: s3n{code}

{{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works. It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
[jira] [Comment Edited] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557820#comment-14557820 ]

Peng Cheng edited comment on SPARK-7442 at 5/24/15 6:55 PM:

Adding the jar won't solve the problem; you also need to set the following parameters:
--conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
But in my 2.6 environment the fs implementation in the added jar is ignored by the worker's classloader for an unknown reason; see http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar

was (Author: peng):
Adding the jar won't solve the problem; you also need to set the following parameters:
--conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
But in my 2.6 environment the added jar is ignored by the worker's classloader for an unknown reason; see http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar

Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
Key: SPARK-7442
URL: https://issues.apache.org/jira/browse/SPARK-7442
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.3.1
Environment: OS X
Reporter: Nicholas Chammas

# Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html].
# Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}.
# Fire up PySpark and try reading from S3 with something like this: {code}sc.textFile('s3n://bucket/file_*').count(){code}
# You will get an error like this: {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.io.IOException: No FileSystem for scheme: s3n{code}

{{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works. It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
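The same two settings in programmatic form, as a sketch (it assumes spark-submit supplies the master and that the jars providing these filesystem classes are actually on the classpath, which is exactly the part reported as broken above):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of the --conf flags quoted in the comment above.
val conf = new SparkConf()
  .set("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
val sc = new SparkContext(conf)
sc.textFile("s3n://bucket/file_*").count() // the read that previously failed
{code}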
[jira] [Created] (SPARK-7848) Update SparkStreaming docs to include knobs
jay vyas created SPARK-7848:
Summary: Update SparkStreaming docs to include knobs
Key: SPARK-7848
URL: https://issues.apache.org/jira/browse/SPARK-7848
Project: Spark
Issue Type: Documentation
Components: Streaming
Reporter: jay vyas

A recent email on the mailing list detailed a bunch of great knobs to remember for Spark Streaming. Let's integrate this into the docs where appropriate. I'll paste the raw text in a comment below.
[jira] [Commented] (SPARK-7848) Update SparkStreaming docs to include knobs
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557756#comment-14557756 ]

jay vyas commented on SPARK-7848:

COPIED from the ASF mailing list thread for convenience.
{noformat}
Blocks are replicated immediately, before the driver launches any jobs using them.

On Thu, May 21, 2015 at 2:05 AM, Hemant Bhanawat hemant9...@gmail.com wrote:
Honestly, given the length of my email, I didn't expect a reply. :-) Thanks for reading and replying. However, I have a follow-up question: I don't think I understand the block replication completely. Are the blocks replicated immediately after they are received by the receiver? Or are they kept on the receiver node only and moved only on shuffle? Does the replication have something to do with locality.wait?
Thanks,
Hemant

On Thu, May 21, 2015 at 2:21 AM, Tathagata Das t...@databricks.com wrote:
Correcting the ones that are incorrect or incomplete. BUT this is a good list of things to remember about Spark Streaming.

On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com wrote:
Hi, I have compiled a list (from online sources) of knobs/design considerations that need to be taken care of by applications running on Spark Streaming. Is my understanding correct? Any other important design considerations I should take care of?

A DStream is associated with a single receiver. For attaining read parallelism, multiple receivers, i.e. multiple DStreams, need to be created.

A receiver is run within an executor and occupies one core. Ensure that there are enough cores for processing after receiver slots are booked, i.e. spark.cores.max should take the receiver slots into account. The receivers are allocated to executors in a round-robin fashion.

When data is received from a stream source, the receiver creates blocks of data. A new block of data is generated every blockInterval milliseconds. N blocks of data are created during the batchInterval, where N = batchInterval/blockInterval. These blocks are distributed by the BlockManager of the current executor to the block managers of other executors. After that, the Network Input Tracker running on the driver is informed about the block locations for further processing.

An RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD, and each partition is a task in Spark. blockInterval == batchInterval would mean that a single partition is created and it is probably processed locally.

The map tasks on the blocks are processed in the executors that have the blocks (the one that received the block, and the one where the block was replicated) irrespective of the block interval, unless non-local scheduling kicks in (as you observed next).

Having a bigger blockInterval means bigger blocks. A high value of spark.locality.wait increases the chance of processing a block on the local node. A balance needs to be found between these two parameters to ensure that the bigger blocks are processed locally.

Instead of relying on batchInterval and blockInterval, you can define the number of partitions by calling dstream.repartition(n). This reshuffles the data in the RDD randomly to create n partitions. Yes, for greater parallelism, though it comes at the cost of a shuffle.

An RDD's processing is scheduled by the driver's JobScheduler as a job. At a given point of time only one job is active, so if one job is executing, the other jobs are queued.
If you have two DStreams, there will be two RDDs formed and two jobs created, which will be scheduled one after another. To avoid this, you can union the two DStreams. This ensures that a single unionRDD is formed for the two RDDs of the DStreams, and this unionRDD is then considered as a single job. However, the partitioning of the RDDs is not impacted. To further clarify, the number of jobs depends on the number of output operations (print, foreachRDD, saveAsXFiles) and the number of RDD actions in those output operations.

dstream1.union(dstream2).foreachRDD { rdd => rdd.count() } // one Spark job per batch
dstream1.union(dstream2).foreachRDD { rdd => { rdd.count(); rdd.count() } } // TWO Spark jobs per batch
dstream1.foreachRDD { rdd => rdd.count() }; dstream2.foreachRDD { rdd => rdd.count() } // TWO Spark jobs per batch

If the batch processing time is more than the batchInterval then obviously the receiver's memory will start filling up and will end up throwing exceptions (most probably BlockNotFoundException). Currently there is no way to pause the receiver. You can limit the rate of a receiver using the SparkConf config spark.streaming.receiver.maxRate.

For being fully fault-tolerant, Spark Streaming needs to enable checkpointing. Checkpointing increases the batch processing time. Incomplete. There are two types of checkpointing - data and metadata checkpointing.
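A short sketch tying two of the knobs from the thread above to code (the values are illustrative only; local[2] reserves one core for the receiver and one for processing, per the receiver-core note above):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]") // one core for the receiver, one for processing
  .setAppName("streaming-knobs-sketch")
  .set("spark.streaming.receiver.maxRate", "10000") // records/sec per receiver
val ssc = new StreamingContext(conf, Seconds(2))    // batchInterval = 2 seconds
{code}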
[jira] [Resolved] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-7845.
Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6384 [https://github.com/apache/spark/pull/6384]

Bump Hadoop 1 tests to version 1.2.1
Key: SPARK-7845
URL: https://issues.apache.org/jira/browse/SPARK-7845
Project: Spark
Issue Type: Improvement
Components: Tests
Reporter: Patrick Wendell
Assignee: Yin Huai
Priority: Critical
Fix For: 1.4.0

A small number of APIs were added to Hadoop between 1.0.4 and 1.2.1. This appears to be one cause of SPARK-7843, since some Hive code relies on the newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.1 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist.
https://github.com/apache/spark/blob/master/dev/run-tests#L43
[jira] [Commented] (SPARK-7805) Move SQLTestUtils.scala from src/main
[ https://issues.apache.org/jira/browse/SPARK-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557793#comment-14557793 ]

Apache Spark commented on SPARK-7805:

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6391

Move SQLTestUtils.scala from src/main
Key: SPARK-7805
URL: https://issues.apache.org/jira/browse/SPARK-7805
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Patrick Wendell
Assignee: Yin Huai
Priority: Critical
Fix For: 1.4.0

These classes trigger binary compatibility issues when changed. In general we shouldn't be putting test code in src/main. If it's needed by multiple modules, IIRC we have a way to do that (look elsewhere in Spark).
[jira] [Created] (SPARK-7849) Update Spark SQL Hive support documentation for 1.4
Cheng Lian created SPARK-7849:
Summary: Update Spark SQL Hive support documentation for 1.4
Key: SPARK-7849
URL: https://issues.apache.org/jira/browse/SPARK-7849
Project: Spark
Issue Type: Documentation
Components: Documentation, SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Priority: Critical

Hive support contents need to be updated for 1.4. Most importantly, after introducing the isolated classloader mechanism in 1.4, the following questions need to be clarified:
# How to enable Hive support
# What versions of Hive are supported
# How to specify the metastore version
[jira] [Updated] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-7843:
Priority: Major (was: Critical)

Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Updated] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-7843:
Summary: Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4 (was: Several thrift server failures in Spark 1.4 sbt build with hadoop 1)

Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Priority: Critical
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Updated] (SPARK-7811) Fix typo on slf4j configuration on metrics.properties.template
[ https://issues.apache.org/jira/browse/SPARK-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-7811:
Assignee: Judy Nash

Fix typo on slf4j configuration on metrics.properties.template
Key: SPARK-7811
URL: https://issues.apache.org/jira/browse/SPARK-7811
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Judy Nash
Assignee: Judy Nash
Priority: Trivial
Fix For: 1.5.0

There is a minor typo in the Slf4jSink configuration in metrics.properties.template: "slf4j" is misspelled as "sl4j" in 2 of the configuration entries. Correcting the typo so users' custom settings will be loaded correctly.
[jira] [Commented] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557844#comment-14557844 ]

Sean Owen commented on SPARK-7843:

Does https://issues.apache.org/jira/browse/SPARK-7845 resolve this then?

Several thrift server failures in Spark 1.4 sbt build with hadoop 1
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Priority: Critical
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Commented] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557848#comment-14557848 ]

Yin Huai commented on SPARK-7843:

Yeah, I think we can resolve this one. After the investigation, I feel it will be really hard to make the thrift server work with hadoop 1.0.4. [~chenghao] If you have any ideas on a workaround, feel free to comment here.

Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Resolved] (SPARK-7843) Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-7843.
Resolution: Not A Problem

Since we just bumped the hadoop version in the Hadoop 1 build and these tests work again, I am resolving this one as Not A Problem.

Several thrift server failures in Spark 1.4 sbt build with hadoop 1.0.4
Key: SPARK-7843
URL: https://issues.apache.org/jira/browse/SPARK-7843
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Attachments: 12a345adcbaee359199ddfed4f41bf0e19d66d48, HiveThriftBinaryServerSuite-spark-yhuai-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-Yins-MBP-2.out

The following tests are failing all the time (starting from https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.4-SBT/117/):
{code}
Test Result (8 failures / +8)
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-3004 regression: result set containing NULL
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4292 regression: result set iterator issue
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4309 regression: Date type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.SPARK-4407 regression: Complex type support
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test multiple session
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.UISeleniumSuite.thrift server ui test
{code}
[jira] [Resolved] (SPARK-7811) Fix typo on slf4j configuration on metrics.properties.template
[ https://issues.apache.org/jira/browse/SPARK-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-7811.
Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6362 [https://github.com/apache/spark/pull/6362]

Fix typo on slf4j configuration on metrics.properties.template
Key: SPARK-7811
URL: https://issues.apache.org/jira/browse/SPARK-7811
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Judy Nash
Priority: Trivial
Fix For: 1.5.0

There is a minor typo in the Slf4jSink configuration in metrics.properties.template: "slf4j" is misspelled as "sl4j" in 2 of the configuration entries. Correcting the typo so users' custom settings will be loaded correctly.
[jira] [Created] (SPARK-7851) SparkSQL CLI built against Hive 0.13 throws exception when used with Hive 0.12 HCat
Cheolsoo Park created SPARK-7851:
Summary: SparkSQL CLI built against Hive 0.13 throws exception when used with Hive 0.12 HCat
Key: SPARK-7851
URL: https://issues.apache.org/jira/browse/SPARK-7851
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Cheolsoo Park
Priority: Minor

I built Spark with {{Hive 0.13}} and set the following properties:
{code}
spark.sql.hive.metastore.version=0.12.0
spark.sql.hive.metastore.jars=path_to_hive_0.12_jars
{code}
But when the SparkSQL CLI starts up, I get the following error:
{code}
15/05/24 05:03:29 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: Invalid method name: 'get_functions'
	at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_functions(ThriftHiveMetastore.java:2886)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_functions(ThriftHiveMetastore.java:2872)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunctions(HiveMetaStoreClient.java:1727)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
	at com.sun.proxy.$Proxy12.getFunctions(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getFunctions(Hive.java:2670)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:674)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:662)
	at org.apache.hadoop.hive.cli.CliDriver.getCommandCompletor(CliDriver.java:540)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:175)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
{code}
What's happening is that when the SparkSQL CLI starts up, it tries to fetch permanent UDFs from the Hive metastore (due to HIVE-6330, which was introduced in Hive 0.13). But it then ends up invoking an incompatible thrift function that doesn't exist in Hive 0.12. To work around this error, I have to comment out the following line of code: https://goo.gl/wcfnH1
[jira] [Comment Edited] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557911#comment-14557911 ]

Sandy Ryza edited comment on SPARK-7699 at 5/25/15 1:26 AM:

[~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable way. I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executor requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts.

was (Author: sandyr):
[~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable. I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executor requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts.

Config spark.dynamicAllocation.initialExecutors has no effect
Key: SPARK-7699
URL: https://issues.apache.org/jira/browse/SPARK-7699
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: meiyoula

spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.initialExecutors 3
spark.dynamicAllocation.maxExecutors 4

Just run the spark-shell with the above configuration; the initial executor number is 2.
[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557911#comment-14557911 ]

Sandy Ryza commented on SPARK-7699:

[~sowen] I think the possible flaw in your argument is that it relies on initial load being defined in some reasonable way. I.e. I think the worry is that the following can happen:
* initial = 3 and min = 1
* cluster is large and uncontended
* first line of user code is a job submission that can make use of at least 3
* because the executor allocation thread starts immediately, requested executors ramps down to 1 before the user code has a chance to submit the job

Which is to say: what guarantees do we provide about initialExecutors other than that it's the number of executor requests we have before some opaque internal thing happens to adjust it down? One possible such guarantee we could provide is that we won't adjust down for some fixed number of seconds after the SparkContext starts.

Config spark.dynamicAllocation.initialExecutors has no effect
Key: SPARK-7699
URL: https://issues.apache.org/jira/browse/SPARK-7699
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: meiyoula

spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.initialExecutors 3
spark.dynamicAllocation.maxExecutors 4

Just run the spark-shell with the above configuration; the initial executor number is 2.
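The reporter's configuration, expressed programmatically as a sketch (spark.dynamicAllocation.enabled is added here since dynamic allocation must be switched on for the other three settings to matter):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.initialExecutors", "3")
  .set("spark.dynamicAllocation.maxExecutors", "4")
// Reported behavior: the shell starts with 2 executors, i.e. the
// initialExecutors setting is effectively ignored.
{code}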
[jira] [Commented] (SPARK-7850) Hive 0.12.0 profile in POM should be removed
[ https://issues.apache.org/jira/browse/SPARK-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557950#comment-14557950 ]

Apache Spark commented on SPARK-7850:

User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/6393

Hive 0.12.0 profile in POM should be removed
Key: SPARK-7850
URL: https://issues.apache.org/jira/browse/SPARK-7850
Project: Spark
Issue Type: Bug
Components: Build, Documentation
Affects Versions: 1.4.0
Reporter: Cheolsoo Park
Priority: Minor

Spark 1.4 supports multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, so {{-Phive-0.12.0}} is no longer needed.
[jira] [Assigned] (SPARK-7850) Hive 0.12.0 profile in POM should be removed
[ https://issues.apache.org/jira/browse/SPARK-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7850: --- Assignee: (was: Apache Spark) Hive 0.12.0 profile in POM should be removed Key: SPARK-7850 URL: https://issues.apache.org/jira/browse/SPARK-7850 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.4.0 Reporter: Cheolsoo Park Priority: Minor Spark 1.4 supports multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, so {{-Phive-0.12.0}} is no longer needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6784) Make sure values of partitioning columns are correctly converted based on their data types
[ https://issues.apache.org/jira/browse/SPARK-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557936#comment-14557936 ] Adrian Wang commented on SPARK-6784: Oh, I didn't notice this ticket had changed... I filed another JIRA at SPARK-7790. https://github.com/apache/spark/pull/6318 Make sure values of partitioning columns are correctly converted based on their data types -- Key: SPARK-6784 URL: https://issues.apache.org/jira/browse/SPARK-6784 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Adrian Wang Priority: Blocker We used to have problems where values of partitioning columns were not correctly cast to the desired Spark SQL values based on their data types. Let's make sure we correctly do that for both Hive's partitions and HadoopFSRelation's partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
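As background for what "correctly converted" means here, a simplified sketch (illustrative only; {{castPartitionValue}} is a hypothetical helper, not the actual Spark internal): partition values are parsed out of directory names as strings, so each one needs a cast to its column's declared type.
{code:scala}
import org.apache.spark.sql.types._

// Partition values arrive from paths such as ".../year=2015/month=5" as raw
// strings; each must be converted to the declared type of its column.
def castPartitionValue(raw: String, dataType: DataType): Any = dataType match {
  case IntegerType => raw.toInt
  case LongType    => raw.toLong
  case DoubleType  => raw.toDouble
  case BooleanType => raw.toBoolean
  case StringType  => raw
  case other       => sys.error(s"Unsupported partition column type: $other")
}

// castPartitionValue("2015", IntegerType) yields the Int 2015,
// not the String "2015".
{code}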
[jira] [Created] (SPARK-7850) Hive 0.12.0 profile in POM should be removed
Cheolsoo Park created SPARK-7850: Summary: Hive 0.12.0 profile in POM should be removed Key: SPARK-7850 URL: https://issues.apache.org/jira/browse/SPARK-7850 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.4.0 Reporter: Cheolsoo Park Priority: Minor Spark 1.4 supports multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, so {{-Phive-0.12.0}} is no longer needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7850) Hive 0.12.0 profile in POM should be removed
[ https://issues.apache.org/jira/browse/SPARK-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7850: --- Assignee: Apache Spark Hive 0.12.0 profile in POM should be removed Key: SPARK-7850 URL: https://issues.apache.org/jira/browse/SPARK-7850 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.4.0 Reporter: Cheolsoo Park Assignee: Apache Spark Priority: Minor Spark 1.4 supports multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, so {{-Phive-0.12.0}} is no longer needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7844) Broken tests in KernelDensity
Manoj Kumar created SPARK-7844: -- Summary: Broken tests in KernelDensity Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
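For reference, a self-contained sketch of the normalization being argued for (illustrative code, not the MLlib implementation): a Gaussian KDE over n samples divides by n, the number of samples, regardless of how many partitions the data has or how many points are evaluated.
{code:scala}
// Gaussian kernel density estimate at point x:
//   f(x) = (1 / (n * h * sqrt(2 * pi))) * sum_i exp(-(x - x_i)^2 / (2 * h^2))
// where n is the number of samples and h the bandwidth.
def kde(samples: Seq[Double], h: Double)(x: Double): Double = {
  val n = samples.length
  val norm = 1.0 / (n * h * math.sqrt(2 * math.Pi))
  norm * samples.map { xi =>
    val u = (x - xi) / h
    math.exp(-0.5 * u * u)
  }.sum
}

// Example: density of a three-point sample evaluated at 0.0
// kde(Seq(-1.0, 0.0, 1.0), h = 0.5)(0.0)
{code}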
[jira] [Assigned] (SPARK-7844) Broken tests in KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7844: --- Assignee: (was: Apache Spark) Broken tests in KernelDensity - Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7844) Broken tests in KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557650#comment-14557650 ] Manoj Kumar commented on SPARK-7844: ping [~josephkb] Broken tests in KernelDensity - Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7844) Broken tests in KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7844: --- Assignee: Apache Spark Broken tests in KernelDensity - Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar Assignee: Apache Spark The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7844) Broken tests in KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557651#comment-14557651 ] Apache Spark commented on SPARK-7844: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6383 Broken tests in KernelDensity - Key: SPARK-7844 URL: https://issues.apache.org/jira/browse/SPARK-7844 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar The densities in KernelDensity are scaled down by (number of parallel processes X number of points). This results in broken tests in KernelDensitySuite, which were not verifying the values properly. I think the densities should just be scaled down by the number of samples (i.e., the number of Gaussian distributions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.0
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7845: --- Description: A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 was: A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. Bump Hadoop 1 tests to version 1.2.0 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.0
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7845: --- Assignee: Yin Huai Bump Hadoop 1 tests to version 1.2.0 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Yin Huai Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7845: Summary: Bump Hadoop 1 tests to version 1.2.1 (was: Bump Hadoop 1 tests to version 1.2.0) Bump Hadoop 1 tests to version 1.2.1 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Yin Huai Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.0. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.0 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7845: --- Assignee: Yin Huai (was: Apache Spark) Bump Hadoop 1 tests to version 1.2.1 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Yin Huai Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.1. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.1 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557660#comment-14557660 ] Apache Spark commented on SPARK-7845: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6384 Bump Hadoop 1 tests to version 1.2.1 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Yin Huai Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.1. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.1 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7845) Bump Hadoop 1 tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7845: --- Assignee: Apache Spark (was: Yin Huai) Bump Hadoop 1 tests to version 1.2.1 -- Key: SPARK-7845 URL: https://issues.apache.org/jira/browse/SPARK-7845 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Apache Spark Priority: Critical A small number of APIs in Hadoop were added between 1.0.4 and 1.2.1. It appears this is one cause of SPARK-7843, since some Hive code relies on newer Hadoop APIs. My feeling is we should just bump our tested version up to 1.2.1 (both versions are extremely old). If users are still on 1.0.4 and run into some of these corner cases, we can consider doing some engineering and supporting the older versions. I'd like to bump our test version though and let this be driven by users, if they exist. https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7832) Always run SQL tests in master build.
[ https://issues.apache.org/jira/browse/SPARK-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7832: --- Assignee: Yin Huai (was: Apache Spark) Always run SQL tests in master build. - Key: SPARK-7832 URL: https://issues.apache.org/jira/browse/SPARK-7832 Project: Spark Issue Type: Task Components: Build, SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Our master build does not run Hive compatibility tests. We need to enable them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7832) Always run SQL tests in master build.
[ https://issues.apache.org/jira/browse/SPARK-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7832: --- Assignee: Apache Spark (was: Yin Huai) Always run SQL tests in master build. - Key: SPARK-7832 URL: https://issues.apache.org/jira/browse/SPARK-7832 Project: Spark Issue Type: Task Components: Build, SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Yin Huai Assignee: Apache Spark Priority: Critical Our master build does not run Hive compatibility tests. We need to enable them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7832) Always run SQL tests in master build.
[ https://issues.apache.org/jira/browse/SPARK-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557661#comment-14557661 ] Apache Spark commented on SPARK-7832: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6385 Always run SQL tests in master build. - Key: SPARK-7832 URL: https://issues.apache.org/jira/browse/SPARK-7832 Project: Spark Issue Type: Task Components: Build, SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Our master build does not run Hive compatibility tests. We need to enable them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7780: --- Assignee: (was: Apache Spark) The intercept in LogisticRegressionWithLBFGS should not be regularized -- Key: SPARK-7780 URL: https://issues.apache.org/jira/browse/SPARK-7780 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai The intercept in Logistic Regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through the `Updater`, and the `Updater` penalizes all components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7780: --- Assignee: Apache Spark The intercept in LogisticRegressionWithLBFGS should not be regularized -- Key: SPARK-7780 URL: https://issues.apache.org/jira/browse/SPARK-7780 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai Assignee: Apache Spark The intercept in Logistic Regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through the `Updater`, and the `Updater` penalizes all components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557673#comment-14557673 ] Apache Spark commented on SPARK-7780: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/6386 The intercept in LogisticRegressionWithLBFGS should not be regularized -- Key: SPARK-7780 URL: https://issues.apache.org/jira/browse/SPARK-7780 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai The intercept in Logistic Regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through the `Updater`, and the `Updater` penalizes all components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
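For clarity, the objective in question can be written as the standard L2-regularized logistic loss (notation mine, not from the Spark source), where the penalty applies to the weight vector w but leaves the intercept b unpenalized:
{noformat}
\min_{w,\,b} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i (w^\top x_i + b)}\right) \; + \; \frac{\lambda}{2} \lVert w \rVert_2^2
{noformat}
Only w appears in the penalty term, so b is free to absorb the class prior. With lambda = 0 the penalty vanishes entirely, which is why the two implementations converge to the same solution in that case, as the description notes.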
[jira] [Commented] (SPARK-6907) Create an isolated classloader for the Hive Client.
[ https://issues.apache.org/jira/browse/SPARK-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557675#comment-14557675 ] Patrick Wendell commented on SPARK-6907: Hey [~ste...@apache.org] - my guess is that most packagers will use this by simply pointing to their existing Hive jars via the relevant configs. As Michael said, the Ivy downloading is convenient, but it's not the only mechanism. Create an isolated classloader for the Hive Client. --- Key: SPARK-6907 URL: https://issues.apache.org/jira/browse/SPARK-6907 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
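As a concrete illustration of that configuration path, a sketch (the jar locations are hypothetical, and the property names are the ones introduced alongside the isolated client loader; check the SQL programming guide for the authoritative list):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Point the isolated Hive client at jars already on the machine instead of
// letting it download them via Ivy ("maven" is the download option).
val conf = new SparkConf()
  .setAppName("hive-client-from-local-jars")
  .set("spark.sql.hive.metastore.version", "0.12.0")
  .set("spark.sql.hive.metastore.jars",
    "/opt/hive-0.12.0/lib/*:/opt/hadoop/share/hadoop/common/*")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
sqlContext.sql("SHOW TABLES").show()
{code}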