[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700284#comment-14700284 ]

Sudhakar Thota commented on SPARK-9776:
---

Michael, I am confused. These are the steps I am doing to create a HiveContext; please correct me if I am wrong. I don't have another SparkContext running.

1. bin/spark-shell
2. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

I appreciate your help.

Thanks
Sudhakar Thota

Another instance of Derby may have already booted the database
---
Key: SPARK-9776
URL: https://issues.apache.org/jira/browse/SPARK-9776
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
Attachments: SPARK-9776-FL1.rtf

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an error, though the same works for spark-1.4.1.

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database
[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700285#comment-14700285 ]

Michael Armbrust commented on SPARK-9776:
---

Do not run #2. sqlContext is created automatically for you.
[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700287#comment-14700287 ]

Eugene Zhulenev commented on SPARK-9776:
---

The automatically created sqlContext is available as 'sqlContext'.
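A minimal Scala sketch of the advice above, assuming the Spark 1.5 shell, where `sc` and `sqlContext` are pre-created (in a Hive-enabled build the shell's sqlContext is already a HiveContext). Constructing a second HiveContext in the same JVM boots a second embedded Derby metastore, which is what raises XSDB6:

{code}
// Inside bin/spark-shell: reuse the pre-created context instead of
// constructing a new HiveContext (which would boot Derby a second time).
sqlContext.sql("SHOW TABLES").show()

// In a standalone application (no shell-provided context), create exactly
// one HiveContext for the lifetime of the JVM:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SingleHiveContextApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("single-hive-context"))
    val hiveContext = new HiveContext(sc) // the only instance
    hiveContext.sql("SHOW TABLES").show()
    sc.stop()
  }
}
{code}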
[jira] [Commented] (SPARK-10066) Can't create HiveContext with spark-shell or spark-sql on snapshot
[ https://issues.apache.org/jira/browse/SPARK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700293#comment-14700293 ]

Robert Beauchemin commented on SPARK-10066:
---

Yes. I've even changed it (as a test) so both /tmp and /tmp/hive are world rwx-able. Here's the listing from HDFS:

{code}
drwxrwxrwx   - hdfs      hdfs  0 2015-06-18 00:24 /tmp
drwxrwxrwx   - ambari-qa hdfs  0 2015-08-16 21:38 /tmp/hive
{code}

Can't create HiveContext with spark-shell or spark-sql on snapshot
---
Key: SPARK-10066
URL: https://issues.apache.org/jira/browse/SPARK-10066
Project: Spark
Issue Type: Bug
Components: Spark Shell, SQL
Affects Versions: 1.5.0
Environment: Centos 6.6
Reporter: Robert Beauchemin
Priority: Minor

Built the 1.5.0-preview-20150812 with the following:

{code}
./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Psparkr -DskipTests
{code}

Starting spark-shell or spark-sql returns the following error:

{code}
java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx--
  at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
  [elided]
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
{code}

It's trying to create a new HiveContext. Running pySpark or sparkR works and creates a HiveContext successfully. SqlContext can be created successfully with any shell.

I've tried changing permissions on that HDFS directory (even as far as making it world-writable) without success. I also tried changing SPARK_USER and running spark-shell as different users, without success.

This works successfully on the same machine on 1.4.1 and on earlier pre-release versions of Spark 1.5.0 (same make-distribution parms). Just trying the snapshot...
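A small diagnostic sketch in Scala, assuming the Hadoop client configuration on the classpath points at the same HDFS the shell uses; it prints the owner and permission bits that Hive's SessionState.createRootHDFSDir checks for the scratch dir:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ScratchDirCheck {
  def main(args: Array[String]): Unit = {
    // Uses fs.defaultFS from the Hadoop configuration on the classpath.
    val fs = FileSystem.get(new Configuration())
    val status = fs.getFileStatus(new Path("/tmp/hive"))
    // Hive rejects the dir when these bits are not writable for the session user.
    println(s"owner=${status.getOwner} group=${status.getGroup} perm=${status.getPermission}")
  }
}
{code}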
[jira] [Updated] (SPARK-9866) VersionsSuite is unnecessarily slow in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-9866:
---

Issue Type: Sub-task (was: Bug)
Parent: SPARK-9288

VersionsSuite is unnecessarily slow in Jenkins
---
Key: SPARK-9866
URL: https://issues.apache.org/jira/browse/SPARK-9866
Project: Spark
Issue Type: Sub-task
Components: SQL, Tests
Reporter: Josh Rosen

The VersionsSuite Hive test is unreasonably slow in Jenkins; downloading the Hive JARs and their transitive dependencies from Maven adds at least 8 minutes to the total build time. In order to cut down on build time, I think that we should make the cache directory configurable via an environment variable and should configure the Jenkins scripts to set this variable to point to a location outside of the Jenkins workspace which is re-used across builds.
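A minimal sketch of that proposal; the environment-variable name and the default path below are assumptions for illustration, not necessarily what Jenkins ends up using:

{code}
import java.io.File

// Resolve the Ivy/Maven cache for VersionsSuite from the environment so CI
// can point it at a directory that survives across builds.
def hiveCacheDir(): File = {
  val path = sys.env.getOrElse(
    "SPARK_VERSIONS_SUITE_IVY_PATH", // hypothetical variable name
    System.getProperty("java.io.tmpdir") + "/spark-versions-suite-ivy")
  val dir = new File(path)
  dir.mkdirs() // reusing an existing cache skips the ~8-minute download
  dir
}
{code}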
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700302#comment-14700302 ]

Meihua Wu commented on SPARK-8518:
---

[~yanbo] Thank you very much for the update! The loss function and gradient are different for events and censored observations, so we will need a column in the data frame to indicate whether an individual record is an event or censored. I suppose we will need to define a Param for eventCol using code gen and mix it into the AFTRegressionParams. cc [~mengxr]

Log-linear models for survival analysis
---
Key: SPARK-8518
URL: https://issues.apache.org/jira/browse/SPARK-8518
Project: Spark
Issue Type: New Feature
Components: ML
Reporter: Xiangrui Meng
Assignee: Yanbo Liang
Original Estimate: 168h
Remaining Estimate: 168h

We want to add basic log-linear models for survival analysis. The implementation should match the result from R's survival package (http://cran.r-project.org/web/packages/survival/index.html).

Design doc from [~yanboliang]: https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub
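A hedged Scala sketch of the shared-param idea; the trait and param names below are illustrative assumptions (the comment suggests generating the param via code gen), not the final AFT API:

{code}
import org.apache.spark.ml.param.{Param, Params}

// A column marking each record as an event (1.0) or censored (0.0), to be
// mixed into the AFT regression params.
private[ml] trait HasEventCol extends Params {
  final val eventCol: Param[String] = new Param[String](this, "eventCol",
    "name of the censoring-indicator column (1.0 = event, 0.0 = censored)")

  final def getEventCol: String = $(eventCol)
}
{code}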
[jira] [Created] (SPARK-10079) Make `column` and `col` functions be S4 functions
Yu Ishikawa created SPARK-10079:
---

Summary: Make `column` and `col` functions be S4 functions
Key: SPARK-10079
URL: https://issues.apache.org/jira/browse/SPARK-10079
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Yu Ishikawa

The {{column}} and {{col}} functions at {{R/pkg/R/Column.R}} are currently defined as S3 functions. I think it would be better to define them as S4 functions.
[jira] [Commented] (SPARK-9972) Add `struct`, `encode` and `decode` function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699237#comment-14699237 ]

Yu Ishikawa commented on SPARK-9972:
---

This is a quick note to explain the reason. When I tried to implement {{sort_array}}, I got the error below. I haven't inspected it yet, but the cause seems to be in {{collect}}. I'll comment on that in detail later.

{noformat}
1. Error: sort_array on a DataFrame cannot coerce class "jobj" to a data.frame
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart(muffleMessage), warning = function(c) invokeRestart(muffleWarning))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, sort_array(df$a)))[1, 1], c(1, 2, 3)) at test_sparkSQL.R:787
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(expected, actual, ...)
8: compare.numeric(expected, actual, ...)
9: all.equal(x, y, ...)
10: all.equal.numeric(x, y, ...)
11: attr.all.equal(target, current, tolerance = tolerance, scale = scale, ...)
12: mode(current)
13: collect(select(df, sort_array(df$a)))
14: collect(select(df, sort_array(df$a)))
15: .local(x, ...)
16: do.call(cbind.data.frame, list(cols, stringsAsFactors = stringsAsFactors))
17: (function (..., deparse.level = 1) data.frame(..., check.names = FALSE))(structure(list(`sort_array(a,true)` = list(environment, NA, NA)), .Names = "sort_array(a,true)"), stringsAsFactors = FALSE)
18: data.frame(..., check.names = FALSE)
19: as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
20: as.data.frame.list(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
21: eval(as.call(c(expression(data.frame), x, check.names = !optional, stringsAsFactors = stringsAsFactors)))
22: eval(expr, envir, enclos)
23: data.frame(`sort_array(a,true)` = list(environment, NA, NA), check.names = FALSE, stringsAsFactors = FALSE)
24: as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
25: as.data.frame.list(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
26: eval(as.call(c(expression(data.frame), x, check.names = !optional, stringsAsFactors = stringsAsFactors)))
27: eval(expr, envir, enclos)
28: data.frame(environment, NA, NA, check.names = FALSE, stringsAsFactors = FALSE)
29: as.data.frame(x[[i]], optional = TRUE)
30: as.data.frame.default(x[[i]], optional = TRUE)
31: stop(gettextf("cannot coerce class \"%s\" to a data.frame", deparse(class(x))), domain = NA)
32: .handleSimpleError(function (e) { e$calls <- head(sys.calls()[-seq_len(frame + 7)], -2); signalCondition(e) }, "cannot coerce class \"jobj\" to a data.frame", quote(as.data.frame.default(x[[i]], optional = TRUE)))
{noformat}

Add `struct`, `encode` and `decode` function in SparkR
---
Key: SPARK-9972
URL: https://issues.apache.org/jira/browse/SPARK-9972
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Yu Ishikawa

Support the {{struct}} function on a DataFrame in SparkR. However, I think we need to improve the {{collect}} function in SparkR in order to implement {{struct}}.

- struct
- encode
- decode
- array_contains
- sort_array
[jira] [Commented] (SPARK-10026) Implement some common Params for regression in PySpark
[ https://issues.apache.org/jira/browse/SPARK-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699240#comment-14699240 ]

Yanbo Liang commented on SPARK-10026:
---

I'm working on it.

Implement some common Params for regression in PySpark
---
Key: SPARK-10026
URL: https://issues.apache.org/jira/browse/SPARK-10026
Project: Spark
Issue Type: Sub-task
Components: ML, PySpark
Reporter: Yanbo Liang

Currently some Params are not common classes in the Python API, which means we need to write them for each class. The LinearRegression- and LogisticRegression-related Params are listed here:

* HasElasticNetParam
* HasFitIntercept
* HasStandardization
[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699280#comment-14699280 ]

Apache Spark commented on SPARK-7837:
---

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8236

NPE when save as parquet in speculative tasks
---
Key: SPARK-7837
URL: https://issues.apache.org/jira/browse/SPARK-7837
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Critical

The query is like {{df.orderBy(...).saveAsTable(...)}}. When there are no partitioning columns and there is a skewed key, I found the following exception in speculative tasks. After these failures, it seems we could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.

{code}
java.lang.NullPointerException
  at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
  at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
  at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
  at org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
  at org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
  at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
  at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
  at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
  at org.apache.spark.scheduler.Task.run(Task.scala:70)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
{code}
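The trace shows abortTask() itself failing while closing the Parquet writer (the NPE comes from a row group that was already flushed by the failed write), which is what prevents the commit protocol from completing. A hedged sketch of the defensive pattern this suggests; it is illustrative only, not the actual change in the pull request above:

{code}
// Illustrative only: swallow and log failures from close() so the abort
// path can still run the commit-protocol cleanup. `logWarning` stands in
// for Spark's internal Logging trait.
def abortTask(writer: java.io.Closeable,
              logWarning: (String, Throwable) => Unit): Unit = {
  try {
    if (writer != null) writer.close()
  } catch {
    case t: Throwable =>
      // e.g. the NullPointerException from InternalParquetRecordWriter
      logWarning("Ignoring exception while aborting task", t)
  }
}
{code}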
[jira] [Assigned] (SPARK-7837) NPE when save as parquet in speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7837:
---

Assignee: Cheng Lian (was: Apache Spark)
[jira] [Updated] (SPARK-10035) Parquet filters does not process EqualNullSafe filter.
[ https://issues.apache.org/jira/browse/SPARK-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-10035:
---

Assignee: Hyukjin Kwon

Parquet filters does not process EqualNullSafe filter.
---
Key: SPARK-10035
URL: https://issues.apache.org/jira/browse/SPARK-10035
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon
Priority: Minor

This is a follow-up issue to SPARK-9814. Data sources (after {{selectFilters()}} in {{org.apache.spark.sql.execution.datasources.DataSourceStrategy}}) pass {{EqualNullSafe}} to {{ParquetRelation}}, but {{ParquetFilters}} for {{ParquetRelation}} does not take and process this filter.
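A hedged Scala sketch of the gap, using the public org.apache.spark.sql.sources filter classes; the converter shape and string output below are illustrative, not the real ParquetFilters code (which builds Parquet FilterPredicate objects):

{code}
import org.apache.spark.sql.sources.{EqualNullSafe, EqualTo, Filter}

// Filters without a matching case are simply not pushed down and get
// re-evaluated after the scan; EqualNullSafe was falling through to None.
def toPredicateString(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)       => Some(s"$attr = $value")
  case EqualNullSafe(attr, value) => Some(s"$attr <=> $value") // the missing case
  case _                          => None
}
{code}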
[jira] [Commented] (SPARK-8847) String concatenation with column in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699219#comment-14699219 ]

Sun Rui commented on SPARK-8847:
---

The concat() expression is addressing this issue.

String concatenation with column in SparkR
---
Key: SPARK-8847
URL: https://issues.apache.org/jira/browse/SPARK-8847
Project: Spark
Issue Type: New Feature
Components: R
Reporter: Amar Gondaliya

1. String concatenation with the values of the column, i.e. {{df$newcol <- paste(a, df$column)}} type functionality.
2. String concatenation between columns, i.e. {{df$newcol <- paste(df$col1, "-", df$col2)}}
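For illustration, the same two patterns via the concat() expression in the Scala DataFrame API (`df` and its columns are assumed; SparkR exposes the same function once it is wired up):

{code}
import org.apache.spark.sql.functions.{col, concat, lit}

// 1. constant + column, like paste("a", df$column)
val withPrefix = df.withColumn("newcol", concat(lit("a"), col("column")))

// 2. column + separator + column, like paste(df$col1, "-", df$col2)
val joined = df.withColumn("newcol", concat(col("col1"), lit("-"), col("col2")))
{code}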
[jira] [Assigned] (SPARK-7837) NPE when save as parquet in speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7837:
---

Assignee: Apache Spark (was: Cheng Lian)
[jira] [Commented] (SPARK-10035) Parquet filters does not process EqualNullSafe filter.
[ https://issues.apache.org/jira/browse/SPARK-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699316#comment-14699316 ]

Cheng Lian commented on SPARK-10035:
---

Done, thanks for working on this!
[jira] [Created] (SPARK-10048) Support arbitrary nested Java array in serde
Sun Rui created SPARK-10048:
---

Summary: Support arbitrary nested Java array in serde
Key: SPARK-10048
URL: https://issues.apache.org/jira/browse/SPARK-10048
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Sun Rui
[jira] [Created] (SPARK-10050) Support collecting data of MapType in DataFrame
Sun Rui created SPARK-10050:
---

Summary: Support collecting data of MapType in DataFrame
Key: SPARK-10050
URL: https://issues.apache.org/jira/browse/SPARK-10050
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Sun Rui
[jira] [Created] (SPARK-10049) Support collecting data of ArrayType in DataFrame
Sun Rui created SPARK-10049:
---

Summary: Support collecting data of ArrayType in DataFrame
Key: SPARK-10049
URL: https://issues.apache.org/jira/browse/SPARK-10049
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Sun Rui
[jira] [Created] (SPARK-10051) Support collecting data of StructType in DataFrame
Sun Rui created SPARK-10051:
---

Summary: Support collecting data of StructType in DataFrame
Key: SPARK-10051
URL: https://issues.apache.org/jira/browse/SPARK-10051
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Sun Rui
[jira] [Commented] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699372#comment-14699372 ]

Cheng Lian commented on SPARK-10030:
---

[~joshrosen] Seems to be related to Tungsten?

Managed memory leak detected when cache table
---
Key: SPARK-10030
URL: https://issues.apache.org/jira/browse/SPARK-10030
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Reporter: wangwei
Priority: Blocker

I tested the latest spark-1.5.0 in local, standalone and yarn modes, followed the steps below, and errors occurred.

1. create table cache_test(id int, name string) stored as textfile;
2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test;
3. cache table test as select * from cache_test distribute by id;

Configuration:
spark.driver.memory    5g
spark.executor.memory  28g
spark.cores.max        21

{code}
15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:58)
  at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
  at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
  at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
  at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
  at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
  at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
  at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
  at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
  at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
  at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:88)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
{code}
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-10030:
---

Component/s: SQL
[jira] [Created] (SPARK-10052) KafkaDirectDStream should filter empty partition tasks or RDDs
SuYan created SPARK-10052:
---

Summary: KafkaDirectDStream should filter empty partition tasks or RDDs
Key: SPARK-10052
URL: https://issues.apache.org/jira/browse/SPARK-10052
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 1.4.1
Reporter: SuYan

We run Spark 1.4.0 direct streaming and found that it submits stages and tasks even when the input has 0 events.
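A hedged user-side workaround sketch (not the proposed internal change): skip per-partition work for batches whose RDD is empty. `directStream` and `processPartition` are assumed names here:

{code}
// Avoid scheduling per-partition work for empty micro-batches.
directStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {                    // RDD.isEmpty() exists since Spark 1.3
    rdd.foreachPartition(processPartition) // hypothetical per-partition handler
  }
}
{code}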
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699397#comment-14699397 ]

Aram Mkrtchyan commented on SPARK-5480:
---

We also hit the same problem almost every time when using the subgraph function before running the PageRank algorithm on a graph with 60M vertices, with Spark 1.4.0. It used to work fine with versions <= 1.3.0.

GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
---
Key: SPARK-5480
URL: https://issues.apache.org/jira/browse/SPARK-5480
Project: Spark
Issue Type: Bug
Components: GraphX
Affects Versions: 1.2.0, 1.3.1
Environment: Yarn client
Reporter: Stephane Maarek

Running the following code:

{code}
val subgraph = graph.subgraph(
  vpred = (id, article) => // working predicate
).cache()

println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")

val prGraph = subgraph.staticPageRank(5).cache

val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
  (v, title, rank) => (rank.getOrElse(0.0), title)
}

titleAndPrGraph.vertices.top(13) {
  Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
}.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1))
{code}

returns a graph with 5000 nodes and 4000 edges. Then it crashes during the PageRank with the following:

{code}
15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
  at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
  at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
  at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
  at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
  at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
  at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
  at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
{code}
[jira] [Commented] (SPARK-10068) Add links to sections in MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700342#comment-14700342 ]

Feynman Liang commented on SPARK-10068:
---

Working on this.

Add links to sections in MLlib's user guide
---
Key: SPARK-10068
URL: https://issues.apache.org/jira/browse/SPARK-10068
Project: Spark
Issue Type: Improvement
Reporter: Feynman Liang
Priority: Minor

In {{mllib-guide.md}}, the listing under {{MLlib types, algorithms and utilities}} is inconsistent about linking to the sections it references. We should provide links to every section mentioned in this listing.
[jira] [Resolved] (SPARK-9868) auto_sortmerge_join_8 fails non-deterministically in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-9868.
---

Resolution: Cannot Reproduce
Fix Version/s: 1.5.0

auto_sortmerge_join_8 fails non-deterministically in Jenkins
---
Key: SPARK-9868
URL: https://issues.apache.org/jira/browse/SPARK-9868
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Davies Liu
Priority: Blocker
Fix For: 1.5.0

https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/3219/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/auto_sortmerge_join_8/

{code}
Results do not match for auto_sortmerge_join_8:

== Parsed Logical Plan ==
'Project [unresolvedalias(count(1))]
 'Join Inner, Some(('a.key = 'b.key))
  'UnresolvedRelation [bucket_small], Some(a)
  'UnresolvedRelation [bucket_big], Some(b)

== Analyzed Logical Plan ==
_c0: bigint
Aggregate [count(1) AS _c0#53110L]
 Join Inner, Some((key#53105 = key#53108))
  MetastoreRelation default, bucket_small, Some(a)
  MetastoreRelation default, bucket_big, Some(b)

== Optimized Logical Plan ==
Aggregate [count(1) AS _c0#53110L]
 Project
  Join Inner, Some((key#53105 = key#53108))
   Project [key#53105]
    MetastoreRelation default, bucket_small, Some(a)
   Project [key#53108]
    MetastoreRelation default, bucket_big, Some(b)

== Physical Plan ==
TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)]
   TungstenProject
    SortMergeJoin [key#53105], [key#53108]
     TungstenSort [key#53105 ASC], false, 0
      TungstenExchange hashpartitioning(key#53105)
       ConvertToUnsafe
        HiveTableScan [key#53105], (MetastoreRelation default, bucket_small, Some(a))
     TungstenSort [key#53108 ASC], false, 0
      TungstenExchange hashpartitioning(key#53108)
       ConvertToUnsafe
        HiveTableScan [key#53108], (MetastoreRelation default, bucket_big, Some(b))

Code Generation: true

_c0
!== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
!76                      74

  at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
  at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
  at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
  at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1$$anonfun$apply$mcV$sp$6.apply(HiveComparisonTest.scala:397)
  at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1$$anonfun$apply$mcV$sp$6.apply(HiveComparisonTest.scala:368)
  at scala.runtime.Tuple3Zipped$$anonfun$foreach$extension$1.apply(Tuple3Zipped.scala:109)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
  at scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:107)
  at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:368)
  at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:238)
  at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:238)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
  at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.org$scalatest$BeforeAndAfter$$super$runTest(HiveCompatibilitySuite.scala:32)
{code}
[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700377#comment-14700377 ]

Sudhakar Thota commented on SPARK-9776:
---

Thanks Michael and Eugene. I have no problem with SQLContext at all; the problem is with HiveContext. I am trying to build a SQL statement for my query using Hive tables and want to save the results back into a Hive table. According to my understanding I need a HiveContext to do that; otherwise I am limited to registerTempTable instead of the saveAsTable operation. Not sure if I am entirely correct, please let me know otherwise.

Thanks
Sudhakar Thota
[jira] [Created] (SPARK-10071) QueueInputDStream Should Allow Checkpointing
Asim Jalis created SPARK-10071:
---

Summary: QueueInputDStream Should Allow Checkpointing
Key: SPARK-10071
URL: https://issues.apache.org/jira/browse/SPARK-10071
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 1.4.1
Reporter: Asim Jalis

I would like https://issues.apache.org/jira/browse/SPARK-8630 to be reverted and that issue resolved as won't fix, and for QueueInputDStream to revert to its old behavior of not throwing an exception if checkpointing is enabled.

Why? The reason is that this fix, which throws an exception if the DStream is being checkpointed, breaks the primary use case for QueueInputDStream, which is testing. For example, the Spark Streaming documentation recommends using QueueInputDStream for testing.

Why does throwing an exception when checkpointing is used break this class? If I use windowing operations or updateStateByKey, then the StreamingContext requires that I enable checkpointing; it throws an exception if I don't. But if I then enable checkpointing, this class throws an exception saying that I cannot use checkpointing with the queue stream. The end result is that I cannot use QueueInputDStream to test windowing operations and updateStateByKey; it can only be used for trivial stateless DStreams.

But would removing the exception-throwing logic make this code fragile? It should not. In the testing scenario, the RDD that is passed into the QueueInputDStream is created through parallelize, and it is checkpointable.

But what about people who are using QueueInputDStream in non-testing scenarios with non-recoverable RDDs? Perhaps a warning suffices here that checkpointing will not be able to recover state if their RDDs are non-recoverable. Then it is up to them how they resolve this situation. Since right now we have no good way of determining whether a QueueInputDStream contains recoverable RDDs or not, why not err on the side of leaving it to the user of the class not to expect recoverability, rather than forbidding checkpointing.

In conclusion: my recommendation would be to revert to the old behavior and resolve this bug as won't fix.
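A minimal Scala sketch of the blocked testing pattern described above, assuming a 1.4-style shell where `sc` is available; updateStateByKey requires a checkpoint directory, but with SPARK-8630 applied the queue stream rejects a checkpointed context:

{code}
import scala.collection.mutable
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("/tmp/ckpt")   // required by updateStateByKey...

val queue = mutable.Queue(sc.parallelize(Seq(("k", 1))))
val counts = ssc.queueStream(queue).updateStateByKey(
  (values: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + values.sum))

counts.print()
ssc.start()                   // ...but the queue stream now fails once checkpointing is enabled
{code}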
[jira] [Updated] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9906:
---

Shepherd: Joseph K. Bradley

User guide for LogisticRegressionSummary
---
Key: SPARK-9906
URL: https://issues.apache.org/jira/browse/SPARK-9906
Project: Spark
Issue Type: Documentation
Components: ML
Reporter: Feynman Liang
Assignee: Manoj Kumar

SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}}.
[jira] [Updated] (SPARK-9786) Test backpressure
[ https://issues.apache.org/jira/browse/SPARK-9786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-9786:
---

Description:
1. Build a test bench for generating different workloads and data with varying rates - DONE
2. Enable backpressure and test whether it works with different workloads - IN PROGRESS
3. Test whether it works with multiple receivers
4. Test whether it works with Kinesis
5. Test whether it works with Direct Kafka

Test backpressure
---
Key: SPARK-9786
URL: https://issues.apache.org/jira/browse/SPARK-9786
Project: Spark
Issue Type: Sub-task
Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
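For item 2, backpressure in the 1.5 line is toggled with the `spark.streaming.backpressure.enabled` configuration key; a minimal sketch of wiring it into such a test run (the app name and batch interval here are arbitrary):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-bench")
  .set("spark.streaming.backpressure.enabled", "true") // let the rate controller throttle ingestion

val ssc = new StreamingContext(conf, Seconds(1))
{code}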
[jira] [Commented] (SPARK-9662) ML 1.5 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700460#comment-14700460 ]

Joseph K. Bradley commented on SPARK-9662:
---

Perfect, thanks. Also, to confirm: are you done checking for breaking changes to Python APIs?

ML 1.5 QA: API: Python API coverage
---
Key: SPARK-9662
URL: https://issues.apache.org/jira/browse/SPARK-9662
Project: Spark
Issue Type: Sub-task
Components: Documentation, ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track:

* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below) for this list of to-do items.
[jira] [Resolved] (SPARK-9768) Add Python API for ml.feature.ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-9768.
---

Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8061
[https://github.com/apache/spark/pull/8061]

Add Python API for ml.feature.ElementwiseProduct
---
Key: SPARK-9768
URL: https://issues.apache.org/jira/browse/SPARK-9768
Project: Spark
Issue Type: Improvement
Components: ML, PySpark
Reporter: Yanbo Liang
Assignee: Yanbo Liang
Priority: Minor
Fix For: 1.5.0

Add Python API, user guide and example for ml.feature.ElementwiseProduct.
[jira] [Resolved] (SPARK-8916) Add @since tags to mllib.regression
[ https://issues.apache.org/jira/browse/SPARK-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai resolved SPARK-8916.
---

Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7518
[https://github.com/apache/spark/pull/7518]

Add @since tags to mllib.regression
---
Key: SPARK-8916
URL: https://issues.apache.org/jira/browse/SPARK-8916
Project: Spark
Issue Type: Sub-task
Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
Labels: starter
Fix For: 1.5.0
Original Estimate: 1h
Remaining Estimate: 1h
[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700526#comment-14700526 ]

Sudhakar Thota commented on SPARK-9776:
---

Thanks Michael and Eugene for your quick responses; I get the point now. I tested saveAsTable with sqlContext and it worked in the spark-shell. For the script, I still have to open the HiveContext and SparkContext.

Thanks
Sudhakar Thota
[jira] [Updated] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feynman Liang updated SPARK-9906:
---

Description:
SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-linear-methods}}

(was: SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}})
[jira] [Created] (SPARK-10077) Java package doc for spark.ml.feature
Xiangrui Meng created SPARK-10077:
---

Summary: Java package doc for spark.ml.feature
Key: SPARK-10077
URL: https://issues.apache.org/jira/browse/SPARK-10077
Project: Spark
Issue Type: Documentation
Components: Documentation, ML
Reporter: Xiangrui Meng

Should be the same as SPARK-7808 but use Java for the code example.
[jira] [Updated] (SPARK-7808) Scala package doc for spark.ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-7808:
---

Summary: Scala package doc for spark.ml.feature (was: Package doc for spark.ml.feature)

Scala package doc for spark.ml.feature
---
Key: SPARK-7808
URL: https://issues.apache.org/jira/browse/SPARK-7808
Project: Spark
Issue Type: Documentation
Components: Documentation, ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

We added several feature transformers in Spark 1.4. It would be great to add package doc for `spark.ml.feature`.
[jira] [Created] (SPARK-10076) Make MultilayerPerceptronClassifier layers and weights public
Yanbo Liang created SPARK-10076:
---

Summary: Make MultilayerPerceptronClassifier layers and weights public
Key: SPARK-10076
URL: https://issues.apache.org/jira/browse/SPARK-10076
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Yanbo Liang

Make MultilayerPerceptronClassifier layers and weights public.
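A hedged sketch of what the change enables, assuming the two members become public as requested; the training data `train` is an assumed DataFrame of label/features rows:

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val model = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3)) // input, hidden, and output layer sizes
  .fit(train)

// With public accessors, the fitted topology and weights can be inspected:
println(model.layers.mkString(","))
println(s"number of weights: ${model.weights.size}")
{code}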
[jira] [Commented] (SPARK-9856) Add expression functions into SparkR whose params are complicated
[ https://issues.apache.org/jira/browse/SPARK-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700688#comment-14700688 ] Apache Spark commented on SPARK-9856: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/8264 Add expression functions into SparkR whose params are complicated - Key: SPARK-9856 URL: https://issues.apache.org/jira/browse/SPARK-9856 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Add expression functions whose parameters are a little complicated, like {{regexp_extract(e: Column, exp: String, groupIdx: Int)}} and {{regexp_replace(e: Column, pattern: String, replacement: String)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
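For reference, the Scala signatures being mirrored; a minimal sketch against an assumed DataFrame {{df}} with a string column {{value}}:
{code}
import org.apache.spark.sql.functions.{regexp_extract, regexp_replace}
import sqlContext.implicits._ // assumes a SQLContext named sqlContext, for the $ syntax

val extracted = df.select(regexp_extract($"value", "(\\d+)-(\\d+)", 1)) // first capture group
val replaced  = df.select(regexp_replace($"value", "\\s+", "_"))        // collapse whitespace
{code}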
[jira] [Assigned] (SPARK-9856) Add expression functions into SparkR whose params are complicated
[ https://issues.apache.org/jira/browse/SPARK-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9856: --- Assignee: (was: Apache Spark) Add expression functions into SparkR whose params are complicated - Key: SPARK-9856 URL: https://issues.apache.org/jira/browse/SPARK-9856 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Add expression functions whose parameters are a little complicated, like {{regexp_extract(e: Column, exp: String, groupIdx: Int)}} and {{regexp_replace(e: Column, pattern: String, replacement: String)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9856) Add expression functions into SparkR whose params are complicated
[ https://issues.apache.org/jira/browse/SPARK-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9856: --- Assignee: Apache Spark Add expression functions into SparkR whose params are complicated - Key: SPARK-9856 URL: https://issues.apache.org/jira/browse/SPARK-9856 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Apache Spark Add expression functions whose parameters are a little complicated, like {{regexp_extract(e: Column, exp: String, groupIdx: Int)}} and {{regexp_replace(e: Column, pattern: String, replacement: String)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8520) Improve GLM's scalability on number of features
[ https://issues.apache.org/jira/browse/SPARK-8520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700357#comment-14700357 ] Meihua Wu commented on SPARK-8520: -- For 1, how about migrating to treeReduce and treeAggregate? Improve GLM's scalability on number of features --- Key: SPARK-8520 URL: https://issues.apache.org/jira/browse/SPARK-8520 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Labels: advanced MLlib's GLM implementation uses the driver to collect gradient updates. When there are many features (20 million), the driver becomes the performance bottleneck. In practice, it is common to see a problem with a large feature dimension, resulting from hashing or other feature transformations, so it is important to improve MLlib's scalability in the number of features. There are a couple of possible solutions: 1. still use the driver to collect updates, but reduce the amount of data it collects at each iteration. 2. apply 2D partitioning to the training data and store the model coefficients distributively (e.g., vector-free L-BFGS) 3. parameter server 4. ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
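A rough sketch of that suggestion, not MLlib's actual implementation: {{treeAggregate}} combines partial sums on the executors in {{depth}} rounds, so the driver receives one combined vector instead of one per partition. The per-point gradient here is a made-up placeholder:
{code}
import org.apache.spark.mllib.linalg.{DenseVector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def sumGradients(data: RDD[LabeledPoint], numFeatures: Int): DenseVector = {
  data.treeAggregate(Vectors.zeros(numFeatures).toDense)(
    seqOp = (acc, point) => {
      var i = 0
      while (i < numFeatures) {
        acc.values(i) += point.label * point.features(i) // placeholder gradient
        i += 1
      }
      acc
    },
    combOp = (a, b) => {
      var i = 0
      while (i < numFeatures) { a.values(i) += b.values(i); i += 1 }
      a
    },
    depth = 2) // two combine rounds before the result reaches the driver
}
{code}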
[jira] [Commented] (SPARK-9910) User guide for train validation split
[ https://issues.apache.org/jira/browse/SPARK-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700372#comment-14700372 ] Martin Zapletal commented on SPARK-9910: I noticed 1.5.0 should be closed by now. What is the deadline for this ticket? User guide for train validation split - Key: SPARK-9910 URL: https://issues.apache.org/jira/browse/SPARK-9910 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang SPARK-8484 adds a TrainValidationSplit transformer which needs user guide docs and example code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
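A minimal sketch of the API the guide needs to cover, assuming {{training}} is a DataFrame of (label, features):
{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8) // one 80/20 split, unlike CrossValidator's k folds
val model = tvs.fit(training) // refits the best params on the full training set
{code}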
[jira] [Resolved] (SPARK-8920) Add @since tags to mllib.linalg
[ https://issues.apache.org/jira/browse/SPARK-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8920. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7729 [https://github.com/apache/spark/pull/7729] Add @since tags to mllib.linalg --- Key: SPARK-8920 URL: https://issues.apache.org/jira/browse/SPARK-8920 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Sameer Abhyankar Priority: Minor Labels: starter Fix For: 1.5.0 Original Estimate: 4h Remaining Estimate: 4h -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
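For context, these subtasks annotate the public API with scaladoc {{@since}} tags; a generic illustration of the tag format (the method and version here are made up):
{code}
object Example {
  /**
   * Adds two integers.
   * @since 1.5.0
   */
  def add(x: Int, y: Int): Int = x + y
}
{code}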
[jira] [Updated] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity
[ https://issues.apache.org/jira/browse/SPARK-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10072: -- Priority: Blocker (was: Major) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity -- Key: SPARK-10072 URL: https://issues.apache.org/jira/browse/SPARK-10072 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls blocks from the ArrayBlockingQueue and pushes them into the BlockManager. Now if that queue fills up to capacity (the default is 10 blocks), then the insertion into the queue (done in the function updateCurrentBuffer) gets blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current state (active or stopped) while pulling from the queue. Since the block-generating thread is blocked on the lock (as the queue is full), the thread that is supposed to drain the queue gets blocked too. Ergo, deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity
Tathagata Das created SPARK-10072: - Summary: BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity Key: SPARK-10072 URL: https://issues.apache.org/jira/browse/SPARK-10072 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls blocks from the ArrayBlockingQueue and pushes them into the BlockManager. Now if that queue fills up to capacity (the default is 10 blocks), then the insertion into the queue (done in the function updateCurrentBuffer) gets blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current state (active or stopped) while pulling from the queue. Since the block-generating thread is blocked on the lock (as the queue is full), the thread that is supposed to drain the queue gets blocked too. Ergo, deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
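A distilled sketch of the reported cycle; the names are illustrative, not the actual BlockGenerator code. The producer blocks inside {{put()}} while holding the lock that the consumer needs before it can drain the queue:
{code}
import java.util.concurrent.ArrayBlockingQueue

val queue = new ArrayBlockingQueue[Array[Byte]](10)
val lock = new Object
@volatile var active = true

val producer = new Thread(new Runnable {
  def run(): Unit = while (active) {
    lock.synchronized {
      queue.put(new Array[Byte](1024)) // blocks while holding `lock` once the queue is full
    }
  }
})
val consumer = new Thread(new Runnable {
  def run(): Unit = while (active) {
    lock.synchronized {                // never acquired: the producer holds `lock` forever
      if (active) queue.take()
    }
  }
})
producer.start(); consumer.start()     // deadlocks by design once 10 blocks are queued
{code}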
[jira] [Assigned] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity
[ https://issues.apache.org/jira/browse/SPARK-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10072: Assignee: Apache Spark (was: Tathagata Das) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity -- Key: SPARK-10072 URL: https://issues.apache.org/jira/browse/SPARK-10072 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls blocks from the ArrayBlockingQueue and pushes them into the BlockManager. Now if that queue fills up to capacity (the default is 10 blocks), then the insertion into the queue (done in the function updateCurrentBuffer) gets blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current state (active or stopped) while pulling from the queue. Since the block-generating thread is blocked on the lock (as the queue is full), the thread that is supposed to drain the queue gets blocked too. Ergo, deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10023) Unify DecisionTreeParams checkpointInterval between Scala and Python APIs
[ https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700445#comment-14700445 ] Joseph K. Bradley commented on SPARK-10023: --- For this and other JIRAs, could you please note how they are inconsistent? That will help us understand whether we need a fix ASAP (for this release) or whether it can wait. Thank you! Unify DecisionTreeParams checkpointInterval between Scala and Python APIs - Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang checkpointInterval is one of the DecisionTreeParams in the Scala API, which is inconsistent with the Python API; we should unify them. Proposal: make checkpointInterval a shared param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
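A sketch of what "make checkpointInterval a shared param" could look like, modeled on the existing {{ml.param.shared}} traits; this is an assumption about the fix, not the merged patch:
{code}
import org.apache.spark.ml.param.{IntParam, Params, ParamValidators}

private[ml] trait HasCheckpointInterval extends Params {
  final val checkpointInterval: IntParam = new IntParam(this, "checkpointInterval",
    "checkpoint interval (>= 1)", ParamValidators.gtEq(1))
  final def getCheckpointInterval: Int = $(checkpointInterval)
}
// Both the Scala tree estimators and the Python wrappers would then pick the
// param up from one place instead of declaring it separately.
{code}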
[jira] [Updated] (SPARK-9550) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)
[ https://issues.apache.org/jira/browse/SPARK-9550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9550: --- Description: This ticket tracks configurations which need to be renamed, deprecated, or have their defaults changed for Spark 1.5.0. Note that subtasks / comments here do not necessarily need to reflect changes that must be performed. Rather, tasks should be added here to make sure that the relevant configurations are at least checked before we cut releases. This ticket will also help us to track configuration changes which must make it into the release notes. *Configuration renaming* - Consider renaming {{spark.shuffle.memoryFraction}} to {{spark.execution.memoryFraction}} ([discussion|https://github.com/apache/spark/pull/7770#discussion-diff-36019144]). - Rename all public-facing uses of {{unsafe}} to something less scary, such as {{tungsten}} *Defaults changes* - Codegen is now enabled by default. - Tungsten is now enabled by default. - Parquet schema merging is now disabled by default. - In-memory relation partition pruning should be enabled by default (SPARK-9554). *Deprecation* - Local execution has been removed. *Behavior Changes* - Canonical name of SQL/DataFrame functions are now lower case (e.g. sum vs SUM) - DirectOutputCommitter is not safe to use with speculation was: This ticket tracks configurations which need to be renamed, deprecated, or have their defaults changed for Spark 1.5.0. Note that subtasks / comments here do not necessarily need to reflect changes that must be performed. Rather, tasks should be added here to make sure that the relevant configurations are at least checked before we cut releases. This ticket will also help us to track configuration changes which must make it into the release notes. *Configuration renaming* - Consider renaming {{spark.shuffle.memoryFraction}} to {{spark.execution.memoryFraction}} ([discussion|https://github.com/apache/spark/pull/7770#discussion-diff-36019144]). - Rename all public-facing uses of {{unsafe}} to something less scary, such as {{tungsten}} *Defaults changes* - Codegen is now enabled by default. - Tungsten is now enabled by default. - Parquet schema merging is now disabled by default. - In-memory relation partition pruning should be enabled by default (SPARK-9554). *Deprecation* - Local execution has been removed. *Behavior Changes* - Canonical name of SQL/DataFrame functions are now lower case (e.g. sum vs SUM) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket) --- Key: SPARK-9550 URL: https://issues.apache.org/jira/browse/SPARK-9550 Project: Spark Issue Type: Task Components: Spark Core, SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Priority: Blocker This ticket tracks configurations which need to be renamed, deprecated, or have their defaults changed for Spark 1.5.0. Note that subtasks / comments here do not necessarily need to reflect changes that must be performed. Rather, tasks should be added here to make sure that the relevant configurations are at least checked before we cut releases. This ticket will also help us to track configuration changes which must make it into the release notes. *Configuration renaming* - Consider renaming {{spark.shuffle.memoryFraction}} to {{spark.execution.memoryFraction}} ([discussion|https://github.com/apache/spark/pull/7770#discussion-diff-36019144]). 
- Rename all public-facing uses of {{unsafe}} to something less scary, such as {{tungsten}} *Defaults changes* - Codegen is now enabled by default. - Tungsten is now enabled by default. - Parquet schema merging is now disabled by default. - In-memory relation partition pruning should be enabled by default (SPARK-9554). *Deprecation* - Local execution has been removed. *Behavior Changes* - Canonical name of SQL/DataFrame functions are now lower case (e.g. sum vs SUM) - DirectOutputCommitter is not safe to use with speculation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
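One of these defaults changes is directly user-visible: with Parquet schema merging now off by default, callers who relied on it can re-enable it per read. A sketch with a hypothetical path:
{code}
val df = sqlContext.read.option("mergeSchema", "true").parquet("/data/events")
{code}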
[jira] [Resolved] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9974. Resolution: Fixed Assignee: Cheng Lian Fix Version/s: 1.5.0 SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar Key: SPARK-9974 URL: https://issues.apache.org/jira/browse/SPARK-9974 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.5.0 One of the consequences of this issue is that Parquet tables created in Hive are not accessible from Spark SQL built with SBT. The Maven build is OK. This issue can be worked around by adding {{lib_managed/jars/parquet-hadoop-bundle-1.6.0.jar}} to {{--driver-class-path}}. Git commit: [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd] Build with SBT and check the assembly jar for classes in package {{parquet.hadoop.api}}:
{noformat}
$ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly
...
$ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api
org/apache/parquet/hadoop/api/
org/apache/parquet/hadoop/api/DelegatingReadSupport.class
org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
org/apache/parquet/hadoop/api/InitContext.class
org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
org/apache/parquet/hadoop/api/ReadSupport.class
org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
org/apache/parquet/hadoop/api/WriteSupport.class
{noformat}
Only classes of {{org.apache.parquet:parquet-mr:1.7.0}} are present. Note that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the {{org.apache}} namespace. Build with Maven and check the assembly jar for classes in package {{parquet.hadoop.api}}:
{noformat}
$ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 -DskipTests clean package
...
$ jar tf assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | fgrep parquet/hadoop/api
org/apache/parquet/hadoop/api/
org/apache/parquet/hadoop/api/DelegatingReadSupport.class
org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
org/apache/parquet/hadoop/api/InitContext.class
org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
org/apache/parquet/hadoop/api/ReadSupport.class
org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
org/apache/parquet/hadoop/api/WriteSupport.class
parquet/hadoop/api/
parquet/hadoop/api/DelegatingReadSupport.class
parquet/hadoop/api/DelegatingWriteSupport.class
parquet/hadoop/api/InitContext.class
parquet/hadoop/api/ReadSupport$ReadContext.class
parquet/hadoop/api/ReadSupport.class
parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
parquet/hadoop/api/WriteSupport$WriteContext.class
parquet/hadoop/api/WriteSupport.class
{noformat}
Expected classes are packaged properly.
To reproduce the Parquet table access issue, first create a Parquet table with Hive (say 0.13.1):
{noformat}
hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1;
{noformat}
Build the Spark assembly jar with the SBT command above, start {{spark-shell}}:
{noformat}
scala> sqlContext.table("parquet_test").show()
15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test
15/08/14 17:52:50 INFO audit: ugi=lian ip=unknown-ip-addr cmd=get_table : db=default tbl=parquet_test
java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:270)
  at org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328)
  at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320)
  at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303)
  at scala.Option.map(Option.scala:145)
  at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303)
  at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
  at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
  at
[jira] [Updated] (SPARK-7707) User guide and example code for KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7707: - Shepherd: Xiangrui Meng User guide and example code for KernelDensity - Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
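A minimal sketch of the API the guide will document; the sample values and bandwidth are illustrative:
{code}
import org.apache.spark.mllib.stat.KernelDensity

val sample = sc.parallelize(Seq(1.0, 1.5, 2.0, 2.2, 3.1))
val kd = new KernelDensity()
  .setSample(sample)
  .setBandwidth(0.5)
val densities = kd.estimate(Array(1.0, 2.0, 3.0)) // estimated PDF at these points
{code}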
[jira] [Assigned] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public
[ https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10076: Assignee: (was: Apache Spark) make MultilayerPerceptronClassifier layers and weights public -- Key: SPARK-10076 URL: https://issues.apache.org/jira/browse/SPARK-10076 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang make MultilayerPerceptronClassifier layers and weights public -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public
[ https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700629#comment-14700629 ] Apache Spark commented on SPARK-10076: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8263 make MultilayerPerceptronClassifier layers and weights public -- Key: SPARK-10076 URL: https://issues.apache.org/jira/browse/SPARK-10076 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang make MultilayerPerceptronClassifier layers and weights public -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public
[ https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10076: Assignee: Apache Spark make MultilayerPerceptronClassifier layers and weights public -- Key: SPARK-10076 URL: https://issues.apache.org/jira/browse/SPARK-10076 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang Assignee: Apache Spark make MultilayerPerceptronClassifier layers and weights public -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public
[ https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10076: Summary: make MultilayerPerceptronClassifier layers and weights public (was: makes MultilayerPerceptronClassifier layers and weights public ) make MultilayerPerceptronClassifier layers and weights public -- Key: SPARK-10076 URL: https://issues.apache.org/jira/browse/SPARK-10076 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang makes MultilayerPerceptronClassifier layers and weights public -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public
[ https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10076: Description: make MultilayerPerceptronClassifier layers and weights public (was: makes MultilayerPerceptronClassifier layers and weights public ) make MultilayerPerceptronClassifier layers and weights public -- Key: SPARK-10076 URL: https://issues.apache.org/jira/browse/SPARK-10076 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang make MultilayerPerceptronClassifier layers and weights public -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7808) Scala package doc for spark.ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7808. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8260 [https://github.com/apache/spark/pull/8260] Scala package doc for spark.ml.feature -- Key: SPARK-7808 URL: https://issues.apache.org/jira/browse/SPARK-7808 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.5.0 We added several feature transformers in Spark 1.4. It would be great to add package doc for `spark.ml.feature`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9898) User guide for PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9898: - Shepherd: Xiangrui Meng User guide for PrefixSpan - Key: SPARK-9898 URL: https://issues.apache.org/jira/browse/SPARK-9898 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang PrefixSpan was added by SPARK-6487 and needs an accompanying user guide/example code. This should be included in the MLlib docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
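A minimal sketch of the PrefixSpan API added by SPARK-6487; each sequence is an array of itemsets, and the data here is illustrative:
{code}
import org.apache.spark.mllib.fpm.PrefixSpan

val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2))
))
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)     // keep patterns appearing in >= 50% of sequences
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("{", ",", "}")).mkString("<", ",", ">") + ", " + fs.freq)
}
{code}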
[jira] [Updated] (SPARK-9654) Add IndexToString in Pyspark
[ https://issues.apache.org/jira/browse/SPARK-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9654: - Summary: Add IndexToString in Pyspark (was: Add StringIndexer inverse in Pyspark) Add IndexToString in Pyspark Key: SPARK-9654 URL: https://issues.apache.org/jira/browse/SPARK-9654 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: holdenk Assignee: holdenk Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
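For reference, the Scala API being mirrored in PySpark; {{df}} is an assumed DataFrame with a string column {{category}}:
{code}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)
// IndexToString inverts the mapping, recovering labels from predicted indices
val converter = new IndexToString().setInputCol("categoryIndex").setOutputCol("originalCategory")
val converted = converter.transform(indexed)
{code}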
[jira] [Resolved] (SPARK-10021) Add Python API for ml.feature.IndexToString
[ https://issues.apache.org/jira/browse/SPARK-10021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10021. --- Resolution: Duplicate Add Python API for ml.feature.IndexToString --- Key: SPARK-10021 URL: https://issues.apache.org/jira/browse/SPARK-10021 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add Python API for ml.feature.IndexToString -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7736: - Target Version/s: 1.6.0, 1.5.1 (was: 1.6.0) Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky Assignee: Marcelo Vanzin It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7736: - Target Version/s: 1.6.0 (was: 1.5.1) Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky Assignee: Marcelo Vanzin It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7736: - Fix Version/s: 1.6.0 Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky Assignee: Marcelo Vanzin Fix For: 1.6.0 It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7736: - Fix Version/s: (was: 1.6.0) Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky Assignee: Marcelo Vanzin It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9951) Example code for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9951. -- Resolution: Duplicate Example code for Multilayer Perceptron Classifier - Key: SPARK-9951 URL: https://issues.apache.org/jira/browse/SPARK-9951 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Add an example to the examples/ code folder for Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700568#comment-14700568 ] Joseph K. Bradley commented on SPARK-9951: -- Just glanced at it. I think that example will be fine. I'll close this JIRA. Thanks! Example code for Multilayer Perceptron Classifier - Key: SPARK-9951 URL: https://issues.apache.org/jira/browse/SPARK-9951 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Add an example to the examples/ code folder for Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9888) Update LDA User Guide
[ https://issues.apache.org/jira/browse/SPARK-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700332#comment-14700332 ] Apache Spark commented on SPARK-9888: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8254 Update LDA User Guide - Key: SPARK-9888 URL: https://issues.apache.org/jira/browse/SPARK-9888 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Fix For: 1.5.0 LDA has received numerous updates in 1.5, including:
* OnlineLDAOptimizer:
** Asymmetric document-topic priors
** Document-topic hyperparameter optimization
* LocalLDAModel
** predict
** logPerplexity / logLikelihood
* DistributedLDAModel:
** topDocumentsPerTopic
** topTopicsPerDoc
** Save/load
It is important to note that OnlineLDAOptimizer => LocalLDAModel and EMLDAOptimizer => DistributedLDAModel now support different features. The user guide should document these differences. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
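A sketch of the optimizer/model split the guide should call out; {{corpus}} is an assumed RDD[(Long, Vector)] of (document id, term counts):
{code}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LocalLDAModel}

val onlineModel = new LDA().setK(10).setOptimizer("online").run(corpus) // LocalLDAModel
val emModel     = new LDA().setK(10).setOptimizer("em").run(corpus)     // DistributedLDAModel

// the feature sets differ by model type:
val perplexity = onlineModel.asInstanceOf[LocalLDAModel].logPerplexity(corpus)
val topDocs    = emModel.asInstanceOf[DistributedLDAModel].topDocumentsPerTopic(5)
{code}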
[jira] [Assigned] (SPARK-9888) Update LDA User Guide
[ https://issues.apache.org/jira/browse/SPARK-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9888: --- Assignee: Apache Spark (was: Feynman Liang) Update LDA User Guide - Key: SPARK-9888 URL: https://issues.apache.org/jira/browse/SPARK-9888 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Assignee: Apache Spark Fix For: 1.5.0 LDA has received numerous updates in 1.5, including:
* OnlineLDAOptimizer:
** Asymmetric document-topic priors
** Document-topic hyperparameter optimization
* LocalLDAModel
** predict
** logPerplexity / logLikelihood
* DistributedLDAModel:
** topDocumentsPerTopic
** topTopicsPerDoc
** Save/load
It is important to note that OnlineLDAOptimizer => LocalLDAModel and EMLDAOptimizer => DistributedLDAModel now support different features. The user guide should document these differences. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9888) Update LDA User Guide
[ https://issues.apache.org/jira/browse/SPARK-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9888: --- Assignee: Feynman Liang (was: Apache Spark) Update LDA User Guide - Key: SPARK-9888 URL: https://issues.apache.org/jira/browse/SPARK-9888 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Fix For: 1.5.0 LDA has received numerous updates in 1.5, including:
* OnlineLDAOptimizer:
** Asymmetric document-topic priors
** Document-topic hyperparameter optimization
* LocalLDAModel
** predict
** logPerplexity / logLikelihood
* DistributedLDAModel:
** topDocumentsPerTopic
** topTopicsPerDoc
** Save/load
It is important to note that OnlineLDAOptimizer => LocalLDAModel and EMLDAOptimizer => DistributedLDAModel now support different features. The user guide should document these differences. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10069) Python's ReduceByKeyAndWindow DStream Keeps Growing
Asim Jalis created SPARK-10069: -- Summary: Python's ReduceByKeyAndWindow DStream Keeps Growing Key: SPARK-10069 URL: https://issues.apache.org/jira/browse/SPARK-10069 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: Asim Jalis When I use reduceByKeyAndWindow with func and invFunc (in PySpark) the size of the window keeps growing. I am appending the code that reproduces this issue. It prints out the count() of the dstream, which goes up every batch by 10 elements. Is this a bug in the Python version, relative to the Scala one, or is this expected behavior? Here is the code that reproduces this issue.
{code}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pprint import pprint

print 'Initializing ssc'
ssc = StreamingContext(SparkContext(), batchDuration=1)
ssc.checkpoint('ckpt')

ds = ssc.textFileStream('input') \
    .map(lambda event: (event, 1)) \
    .reduceByKeyAndWindow(
        func=lambda count1, count2: count1 + count2,
        invFunc=lambda count1, count2: count1 - count2,
        windowDuration=10,
        slideDuration=2)
ds.pprint()
ds.count().pprint()

print 'Starting ssc'
ssc.start()

import itertools
import time
import random
from distutils import dir_util

def batch_write(batch_data, batch_file_path):
    with open(batch_file_path, 'w') as batch_file:
        for element in batch_data:
            line = str(element) + '\n'
            batch_file.write(line)

def xrange_write(batch_size=5, batch_dir='input', batch_duration=1):
    '''Every batch_duration write a file with batch_size numbers, forever.
    Start at 0 and keep incrementing. Intended for testing Spark Streaming code.'''
    dir_util.mkpath('./input')
    for i in itertools.count():
        min = batch_size * i
        max = batch_size * (i + 1)
        batch_data = xrange(min, max)
        file_path = batch_dir + '/' + str(i)
        batch_write(batch_data, file_path)
        time.sleep(batch_duration)

print 'Feeding data to app'
xrange_write()
ssc.awaitTermination()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5901) [PySpark] pickle classes in main module
[ https://issues.apache.org/jira/browse/SPARK-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-5901. - Resolution: Invalid Target Version/s: (was: 1.5.0) cloudpickle does support serializing classes in {{__main__}}, but pickle does not. [PySpark] pickle classes in main module --- Key: SPARK-5901 URL: https://issues.apache.org/jira/browse/SPARK-5901 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Currently, cloudpickle does not support serializing class objects in the main module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10070) Remove Guava dependencies in user guides
Feynman Liang created SPARK-10070: - Summary: Remove Guava dependencies in user guides Key: SPARK-10070 URL: https://issues.apache.org/jira/browse/SPARK-10070 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Feynman Liang Many code examples in documentation use {{Lists.newArrayList}} (e.g. [ml-feature|https://github.com/apache/spark/blob/master/docs/ml-features.md]) which brings in a dependency on {{com.google.common.collect.Lists}}. We can remove this dependency by using {{Arrays.asList}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity
[ https://issues.apache.org/jira/browse/SPARK-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10072: Assignee: Tathagata Das (was: Apache Spark) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity -- Key: SPARK-10072 URL: https://issues.apache.org/jira/browse/SPARK-10072 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls blocks from the ArrayBlockingQueue and pushes them into the BlockManager. Now if that queue fills up to capacity (the default is 10 blocks), then the insertion into the queue (done in the function updateCurrentBuffer) gets blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current state (active or stopped) while pulling from the queue. Since the block-generating thread is blocked on the lock (as the queue is full), the thread that is supposed to drain the queue gets blocked too. Ergo, deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity
[ https://issues.apache.org/jira/browse/SPARK-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700426#comment-14700426 ] Apache Spark commented on SPARK-10072: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/8257 BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity -- Key: SPARK-10072 URL: https://issues.apache.org/jira/browse/SPARK-10072 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls blocks from the ArrayBlockingQueue and pushes them into the BlockManager. Now if that queue fills up to capacity (the default is 10 blocks), then the insertion into the queue (done in the function updateCurrentBuffer) gets blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current state (active or stopped) while pulling from the queue. Since the block-generating thread is blocked on the lock (as the queue is full), the thread that is supposed to drain the queue gets blocked too. Ergo, deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700451#comment-14700451 ] Apache Spark commented on SPARK-7736: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/8258 Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky Assignee: Marcelo Vanzin Fix For: 1.6.0 It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10074) Include Float in @specialized annotation
[ https://issues.apache.org/jira/browse/SPARK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10074: Assignee: (was: Apache Spark) Include Float in @specialized annotation Key: SPARK-10074 URL: https://issues.apache.org/jira/browse/SPARK-10074 Project: Spark Issue Type: Improvement Reporter: Ted Yu Priority: Minor There are several places in the Spark codebase where we use the @specialized annotation to cover Long and Double, e.g. in OpenHashMap.scala:
{code}
class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    initialCapacity: Int)
{code}
Float should be added to the @specialized annotation as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
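What the proposed change would look like on the quoted example (a sketch, not the actual patch):
{code}
import scala.reflect.ClassTag

class OpenHashMap[K: ClassTag, @specialized(Long, Int, Double, Float) V: ClassTag](
    initialCapacity: Int) {
  // body unchanged; the annotation now also generates a Float specialization
}
{code}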
[jira] [Commented] (SPARK-10074) Include Float in @specialized annotation
[ https://issues.apache.org/jira/browse/SPARK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700535#comment-14700535 ] Apache Spark commented on SPARK-10074: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/8259 Include Float in @specialized annotation Key: SPARK-10074 URL: https://issues.apache.org/jira/browse/SPARK-10074 Project: Spark Issue Type: Improvement Reporter: Ted Yu Priority: Minor There are several places in the Spark codebase where we use the @specialized annotation to cover Long and Double, e.g. in OpenHashMap.scala:
{code}
class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    initialCapacity: Int)
{code}
Float should be added to the @specialized annotation as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10074) Include Float in @specialized annotation
[ https://issues.apache.org/jira/browse/SPARK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10074: Assignee: Apache Spark Include Float in @specialized annotation Key: SPARK-10074 URL: https://issues.apache.org/jira/browse/SPARK-10074 Project: Spark Issue Type: Improvement Reporter: Ted Yu Assignee: Apache Spark Priority: Minor There are several places in the Spark codebase where we use the @specialized annotation to cover Long and Double, e.g. in OpenHashMap.scala:
{code}
class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    initialCapacity: Int)
{code}
Float should be added to the @specialized annotation as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10078) Vector-free L-BFGS
Xiangrui Meng created SPARK-10078: - Summary: Vector-free L-BFGS Key: SPARK-10078 URL: https://issues.apache.org/jira/browse/SPARK-10078 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng This is to implement a scalable version of vector-free L-BFGS (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8520) Improve GLM's scalability on number of features
[ https://issues.apache.org/jira/browse/SPARK-8520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700634#comment-14700634 ] Xiangrui Meng commented on SPARK-8520: -- No, this is for general discussion. I created one JIRA specifically for vector-free L-BFGS. Improve GLM's scalability on number of features --- Key: SPARK-8520 URL: https://issues.apache.org/jira/browse/SPARK-8520 Project: Spark Issue Type: Brainstorming Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Labels: advanced MLlib's GLM implementation uses the driver to collect gradient updates. When there are many features (20 million), the driver becomes the performance bottleneck. In practice, it is common to see a problem with a large feature dimension, resulting from hashing or other feature transformations, so it is important to improve MLlib's scalability in the number of features. There are a couple of possible solutions: 1. still use the driver to collect updates, but reduce the amount of data it collects at each iteration. 2. apply 2D partitioning to the training data and store the model coefficients distributively (e.g., vector-free L-BFGS) 3. parameter server 4. ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8520) Improve GLM's scalability on number of features
[ https://issues.apache.org/jira/browse/SPARK-8520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8520: - Issue Type: Brainstorming (was: Improvement) Improve GLM's scalability on number of features --- Key: SPARK-8520 URL: https://issues.apache.org/jira/browse/SPARK-8520 Project: Spark Issue Type: Brainstorming Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Labels: advanced MLlib's GLM implementation uses the driver to collect gradient updates. When there are many features (20 million), the driver becomes the performance bottleneck. In practice, it is common to see a problem with a large feature dimension, resulting from hashing or other feature transformations, so it is important to improve MLlib's scalability in the number of features. There are a couple of possible solutions: 1. still use the driver to collect updates, but reduce the amount of data it collects at each iteration. 2. apply 2D partitioning to the training data and store the model coefficients distributively (e.g., vector-free L-BFGS) 3. parameter server 4. ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10059) Broken test: YarnClusterSuite
[ https://issues.apache.org/jira/browse/SPARK-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10059. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 1.5.0 Broken test: YarnClusterSuite - Key: SPARK-10059 URL: https://issues.apache.org/jira/browse/SPARK-10059 Project: Spark Issue Type: Test Reporter: Davies Liu Assignee: Marcelo Vanzin Priority: Critical Fix For: 1.5.0 This test failed every time: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/116/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/_It_is_not_a_test_/history/
{code}
Error Message

java.io.IOException: ResourceManager failed to start. Final state is STOPPED

Stacktrace

sbt.ForkMain$ForkError: java.io.IOException: ResourceManager failed to start. Final state is STOPPED
  at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:302)
  at org.apache.hadoop.yarn.server.MiniYARNCluster.access$500(MiniYARNCluster.java:87)
  at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:422)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:104)
  at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
  at org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:46)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
  at org.apache.spark.deploy.yarn.YarnClusterSuite.run(YarnClusterSuite.scala:46)
  at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
  at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
  at sbt.ForkMain$Run$2.call(ForkMain.java:294)
  at sbt.ForkMain$Run$2.call(ForkMain.java:284)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: sbt.ForkMain$ForkError: ResourceManager failed to start. Final state is STOPPED
  at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:297)
  ... 18 more
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
[ https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9783: --- Sprint: (was: Spark 1.5 doc/QA sprint) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call - Key: SPARK-9783 URL: https://issues.apache.org/jira/browse/SPARK-9783 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously, this hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar to what we did in {{ParquetRelation}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
[ https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9783: --- Target Version/s: 1.6.0 (was: 1.5.0) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call - Key: SPARK-9783 URL: https://issues.apache.org/jira/browse/SPARK-9783 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously, this hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar to what we did in {{ParquetRelation}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9205) org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9205: Target Version/s: 1.6.0 (was: 1.5.0) org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11 - Key: SPARK-9205 URL: https://issues.apache.org/jira/browse/SPARK-9205 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.4.1, 1.5.0 Reporter: Tathagata Das Assignee: Andrew Or Priority: Critical https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven/AMPLAB_JENKINS_BUILD_PROFILE=scala2.11,label=centos/7/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700392#comment-14700392 ] Michael Armbrust commented on SPARK-9776: - The variable is always called {{sqlContext}}, but if you have compiled with Hive support then it will be of type {{HiveContext}}. Another instance of Derby may have already booted the database --- Key: SPARK-9776 URL: https://issues.apache.org/jira/browse/SPARK-9776 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: Mac Yosemite, spark-1.5.0 Reporter: Sudhakar Thota Attachments: SPARK-9776-FL1.rtf val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an error, though the same works for spark-1.4.1. Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
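To make that concrete, a hedged sketch of how one might verify the runtime type from the shell (not part of the original comment; the printed results assume a build with -Phive):
{code}
// In bin/spark-shell, the pre-created context is bound to `sqlContext`.
// When Spark is built with Hive support, its runtime class is HiveContext:
scala> sqlContext.getClass.getName
res0: String = org.apache.spark.sql.hive.HiveContext

scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
res1: Boolean = true
{code}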
[jira] [Resolved] (SPARK-9902) Add Java and Python examples to user guide for 1-sample KS test
[ https://issues.apache.org/jira/browse/SPARK-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9902. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8154 [https://github.com/apache/spark/pull/8154] Add Java and Python examples to user guide for 1-sample KS test --- Key: SPARK-9902 URL: https://issues.apache.org/jira/browse/SPARK-9902 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Assignee: Jose Cambronero Fix For: 1.5.0 SPARK-8598 adds 1-sample Kolmogorov-Smirnov tests, which need Java and Python code examples in {{mllib-statistics#hypothesis-testing}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
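For reference, the Scala usage that the new Java and Python snippets would mirror (a sketch assuming the MLlib 1.5 API, with made-up sample data):
{code}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Made-up sample: test the null hypothesis that the data is drawn from a
// standard normal distribution N(0, 1).
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
println(testResult) // prints the KS statistic, p-value, and conclusion
{code}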
[jira] [Closed] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-10063. --- Resolution: Won't Fix Let's not remove it for now until we have a better alternative. Remove DirectParquetOutputCommitter --- Key: SPARK-10063 URL: https://issues.apache.org/jira/browse/SPARK-10063 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical When we use DirectParquetOutputCommitter on S3 and speculation is enabled, there is a chance that we can lose data. Here is the code to reproduce the problem.
{code}
import org.apache.spark.sql.functions._

val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask",
  (i: Int, partitionId: Int, attemptNumber: Int) => {
    if (partitionId == 0 && i == 5) {
      if (attemptNumber > 0) {
        Thread.sleep(15000)
        throw new Exception("new exception")
      } else {
        Thread.sleep(1)
      }
    }
    i
  })

val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
  val context = org.apache.spark.TaskContext.get()
  val partitionId = context.partitionId
  val attemptNumber = context.attemptNumber
  iter.map(i => (i, partitionId, attemptNumber))
}.toDF("i", "partitionId", "attemptNumber")

df
  .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"),
    $"partitionId", $"attemptNumber")
  .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")

sqlContext.read.load("/home/yin/outputCommitter").count
// The result is 99 and 5 is missing from the output.
{code}
What happened is that the original task finishes first and uploads its output file to S3, then the speculative task somehow fails. Because we have to call the output stream's close method, which uploads data to S3, we actually upload the partial result generated by the failed speculative task to S3, and this file overwrites the correct file generated by the original task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9592) First and Last implemented based on AggregateExpression1 calculate values for the entire DataFrame partition, not per GroupedData group
[ https://issues.apache.org/jira/browse/SPARK-9592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-9592. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8172 [https://github.com/apache/spark/pull/8172] First and Last implemented based on AggregateExpression1 calculate values for the entire DataFrame partition, not per GroupedData group -- Key: SPARK-9592 URL: https://issues.apache.org/jira/browse/SPARK-9592 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: gaurav Priority: Minor Fix For: 1.5.0 Original Estimate: 4h Remaining Estimate: 4h In the current implementation, the First and Last aggregates calculated values over the entire DataFrame partition, and the same value was then returned for every GroupedData group in the partition. sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala Fixed so that the First and Last aggregates compute the first and last value per GroupedData group instead of over the entire DataFrame partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
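A small illustration of the intended semantics (a hedged sketch with made-up data, not from the ticket):
{code}
import org.apache.spark.sql.functions.{first, last}
import sqlContext.implicits._

// first/last must be evaluated per group, not once per DataFrame partition.
val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("k", "v")
df.groupBy("k").agg(first("v"), last("v")).show()
// Expected: group "a" yields (1, 2) and group "b" yields (3, 3); before the
// fix, every group in a partition could report the same partition-wide values.
{code}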
[jira] [Updated] (SPARK-10068) Add links to sections in MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10068: -- Assignee: Feynman Liang Add links to sections in MLlib's user guide --- Key: SPARK-10068 URL: https://issues.apache.org/jira/browse/SPARK-10068 Project: Spark Issue Type: Improvement Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor In {{mllib-guide.md}}, the listing under {{MLlib types, algorithms and utilities}} is inconsistent in linking to the sections it references. We should provide links to every section mentioned in this listing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10068) Add links to sections in MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10068: -- Shepherd: Xiangrui Meng Add links to sections in MLlib's user guide --- Key: SPARK-10068 URL: https://issues.apache.org/jira/browse/SPARK-10068 Project: Spark Issue Type: Improvement Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor In {{mllib-guide.md}}, the listing under {{MLlib types, algorithms and utilities}} is inconsistent in linking to the sections it references. We should provide links to every section mentioned in this listing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7808) Package doc for spark.ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-7808: Assignee: Xiangrui Meng Package doc for spark.ml.feature Key: SPARK-7808 URL: https://issues.apache.org/jira/browse/SPARK-7808 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We added several feature transformers in Spark 1.4. It would be great to add package doc for `spark.ml.feature`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10025) Add Python API for ml.attribute
[ https://issues.apache.org/jira/browse/SPARK-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10025. --- Resolution: Duplicate Add Python API for ml.attribute --- Key: SPARK-10025 URL: https://issues.apache.org/jira/browse/SPARK-10025 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Currently there is no Python implementation for ml.attribute, so we cannot use Attribute in an ML pipeline. Some transformers need this feature; for example, VectorSlicer can take a subarray of the original features by specifying column names, which must be contained in the column's Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7808) Package doc for spark.ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7808: --- Assignee: Apache Spark (was: Xiangrui Meng) Package doc for spark.ml.feature Key: SPARK-7808 URL: https://issues.apache.org/jira/browse/SPARK-7808 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark We added several feature transformers in Spark 1.4. It would be great to add package doc for `spark.ml.feature`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7808) Package doc for spark.ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7808: --- Assignee: Xiangrui Meng (was: Apache Spark) Package doc for spark.ml.feature Key: SPARK-7808 URL: https://issues.apache.org/jira/browse/SPARK-7808 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We added several feature transformers in Spark 1.4. It would be great to add package doc for `spark.ml.feature`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7808) Package doc for spark.ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700520#comment-14700520 ] Apache Spark commented on SPARK-7808: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/8260 Package doc for spark.ml.feature Key: SPARK-7808 URL: https://issues.apache.org/jira/browse/SPARK-7808 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We added several feature transformers in Spark 1.4. It would be great to add package doc for `spark.ml.feature`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
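As background on the mechanism (a hypothetical sketch, not the content of the actual PR): in Scala, package-level documentation is typically attached to a package object, so a package doc for {{spark.ml.feature}} could look roughly like this:
{code}
package org.apache.spark.ml

/**
 * == Feature transformers ==
 *
 * The `ml.feature` package provides common feature transformers that help
 * convert raw data or features into representations better suited for model
 * fitting (for example, the transformers added in Spark 1.4).
 */
package object feature
{code}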
[jira] [Created] (SPARK-10074) Include Float in @specialized annotation
Ted Yu created SPARK-10074: -- Summary: Include Float in @specialized annotation Key: SPARK-10074 URL: https://issues.apache.org/jira/browse/SPARK-10074 Project: Spark Issue Type: Improvement Reporter: Ted Yu Priority: Minor There are several places in the Spark codebase where we use the @specialized annotation covering Long and Double, e.g. in OpenHashMap.scala:
{code}
class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    initialCapacity: Int)
{code}
Float should be added to the @specialized annotation as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
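A self-contained illustration of what the proposal buys (the class and object names here are hypothetical, not from the ticket):
{code}
import scala.reflect.ClassTag

// Listing Float in @specialized makes the compiler emit a primitive-Float
// variant of the class, so Box[Float] avoids boxing values to java.lang.Float.
class Box[@specialized(Long, Int, Double, Float) V: ClassTag](val value: V)

object SpecializedDemo {
  def main(args: Array[String]): Unit = {
    val b = new Box[Float](1.5f) // uses the Float-specialized variant
    println(b.value)
  }
}
{code}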
[jira] [Assigned] (SPARK-9846) User guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9846: --- Assignee: Alexander Ulanov (was: Apache Spark) User guide for Multilayer Perceptron Classifier --- Key: SPARK-9846 URL: https://issues.apache.org/jira/browse/SPARK-9846 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Alexander Ulanov -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9846) User guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9846: --- Assignee: Apache Spark (was: Alexander Ulanov) User guide for Multilayer Perceptron Classifier --- Key: SPARK-9846 URL: https://issues.apache.org/jira/browse/SPARK-9846 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9846) User guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700564#comment-14700564 ] Apache Spark commented on SPARK-9846: - User 'avulanov' has created a pull request for this issue: https://github.com/apache/spark/pull/8262 User guide for Multilayer Perceptron Classifier --- Key: SPARK-9846 URL: https://issues.apache.org/jira/browse/SPARK-9846 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Alexander Ulanov -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700567#comment-14700567 ] Alexander Ulanov commented on SPARK-9951: - I've submitted a PR for the user guide. Could you say whether the example code in the PR can be used for this issue? https://github.com/apache/spark/pull/8262 Example code for Multilayer Perceptron Classifier - Key: SPARK-9951 URL: https://issues.apache.org/jira/browse/SPARK-9951 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Add an example to the examples/ code folder for Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
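In the meantime, a possible starting point for such a Scala example (a hedged sketch against the 1.5 spark.ml API; the data file, layer sizes, and other parameter values are assumptions, not taken from the PR):
{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

// Assumed LIBSVM input with 4 features and 3 classes; layers = input size,
// one hidden layer of 5 neurons, and an output layer of size #classes.
val data = MLUtils
  .loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt")
  .toDF()
val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 1234L)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(train)
val result = model.transform(test)
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision: " + evaluator.evaluate(result.select("prediction", "label")))
{code}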