[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-08-17 Thread Sudhakar Thota (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700284#comment-14700284
 ] 

Sudhakar Thota commented on SPARK-9776:
---

Michael,
I am confused. These are the steps I am following; please correct me if I am wrong 
in the statements to create a HiveContext. 
I don't have another SparkContext running.

1. bin/spark-shell 
2. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

I appreciate your help.
Thanks
Sudhakar Thota

 Another instance of Derby may have already booted the database 
 ---

 Key: SPARK-9776
 URL: https://issues.apache.org/jira/browse/SPARK-9776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
 Attachments: SPARK-9776-FL1.rtf


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an 
 error, though the same works for spark-1.4.1.
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
 database 






[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-08-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700285#comment-14700285
 ] 

Michael Armbrust commented on SPARK-9776:
-

Do not run #2.  sqlContext is created automatically for you.

 Another instance of Derby may have already booted the database 
 ---

 Key: SPARK-9776
 URL: https://issues.apache.org/jira/browse/SPARK-9776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
 Attachments: SPARK-9776-FL1.rtf


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an 
 error, though the same works for spark-1.4.1.
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
 database 






[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-08-17 Thread Eugene Zhulenev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700287#comment-14700287
 ] 

Eugene Zhulenev commented on SPARK-9776:


The automatically created SQLContext is available as 'sqlContext'.
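
The spark-shell flow then looks like the following (a minimal sketch; it assumes Spark was built with Hive support, in which case the pre-created context is already a HiveContext, so constructing a second one would try to boot another Derby metastore):

{code}
// bin/spark-shell -- sc and sqlContext are already created by the shell
sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]  // true for a -Phive build
sqlContext.sql("SHOW TABLES").show()                            // use it directly, no new HiveContext
{code}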

 Another instance of Derby may have already booted the database 
 ---

 Key: SPARK-9776
 URL: https://issues.apache.org/jira/browse/SPARK-9776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
 Attachments: SPARK-9776-FL1.rtf


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an 
 error, though the same works for spark-1.4.1.
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
 database 






[jira] [Commented] (SPARK-10066) Can't create HiveContext with spark-shell or spark-sql on snapshot

2015-08-17 Thread Robert Beauchemin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700293#comment-14700293
 ] 

Robert Beauchemin commented on SPARK-10066:
---

Yes. I've even changed it (as a test) so both /tmp and /tmp/hive are world 
rwx-able. Here's the listing from HDFS:
drwxrwxrwx   - hdfs        hdfs   0 2015-06-18 00:24 /tmp
drwxrwxrwx   - ambari-qa   hdfs   0 2015-08-16 21:38 /tmp/hive

 Can't create HiveContext with spark-shell or spark-sql on snapshot
 --

 Key: SPARK-10066
 URL: https://issues.apache.org/jira/browse/SPARK-10066
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 1.5.0
 Environment: Centos 6.6
Reporter: Robert Beauchemin
Priority: Minor

 Built the 1.5.0-preview-20150812 with the following:
 ./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
 -Phive-thriftserver -Psparkr -DskipTests
 Starting spark-shell or spark-sql returns the following error: 
 java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
 /tmp/hive on HDFS should be writable. Current permissions are: rwx------
 at 
 org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
  [elided]
 at 
 org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)   
 
 It's trying to create a new HiveContext. Running pySpark or sparkR works and 
 creates a HiveContext successfully. SqlContext can be created successfully 
 with any shell.
 I've tried changing permissions on that HDFS directory (even as far as making 
 it world-writable) without success. Tried changing SPARK_USER and also 
 running spark-shell as different users without success.
 This works successfully on the same machine on 1.4.1 and on earlier pre-release 
 versions of Spark 1.5.0 (same make-distribution params). Just trying the 
 snapshot... 






[jira] [Updated] (SPARK-9866) VersionsSuite is unnecessarily slow in Jenkins

2015-08-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9866:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-9288

 VersionsSuite is unnecessarily slow in Jenkins
 --

 Key: SPARK-9866
 URL: https://issues.apache.org/jira/browse/SPARK-9866
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Reporter: Josh Rosen

 The VersionsSuite Hive test is unreasonably slow in Jenkins; downloading the 
 Hive JARs and their transitive dependencies from Maven adds at least 8 
 minutes to the total build time.
 In order to cut down on build time, I think that we should make the cache 
 directory configurable via an environment variable and should configure the 
 Jenkins scripts to set this variable to point to a location outside of the 
 Jenkins workspace which is re-used across builds.
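 As a rough sketch of the proposed hook (the environment variable name below is only illustrative, not a final one):
 {code}
 // Hypothetical: let Jenkins point the download cache outside the workspace.
 private val ivyDownloadPath: String =
   sys.env.getOrElse("SPARK_VERSIONS_SUITE_IVY_PATH",
     new java.io.File(sys.props("java.io.tmpdir"), "hive-ivy-cache").getAbsolutePath)
 {code}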






[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-17 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700302#comment-14700302
 ] 

Meihua Wu commented on SPARK-8518:
--

[~yanbo] Thank you very much for the update!

The loss function and gradient are different for events and censored 
observations, so we will need a column in the data frame to indicate whether an 
individual record is an event or censored. I suppose we will need to define a 
Param for eventCol using code gen and mix it into AFTRegressionParams. 

cc [~mengxr]
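
For illustration, a hand-written sketch of the kind of shared Param the code gen would emit (the trait and column names here are hypothetical, not a final API):

{code}
import org.apache.spark.ml.param.{Param, Params}

// Hypothetical shared param: which column flags a record as an observed
// event versus a censored observation.
private[ml] trait HasCensorCol extends Params {
  final val censorCol: Param[String] =
    new Param[String](this, "censorCol", "censor indicator column name")
  final def getCensorCol: String = $(censorCol)
}
{code}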

 Log-linear models for survival analysis
 ---

 Key: SPARK-8518
 URL: https://issues.apache.org/jira/browse/SPARK-8518
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Yanbo Liang
   Original Estimate: 168h
  Remaining Estimate: 168h

 We want to add basic log-linear models for survival analysis. The 
 implementation should match the result from R's survival package 
 (http://cran.r-project.org/web/packages/survival/index.html).
 Design doc from [~yanboliang]: 
 https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub






[jira] [Created] (SPARK-10079) Make `column` and `col` functions be S4 functions

2015-08-17 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-10079:
---

 Summary: Make `column` and `col` functions be S4 functions
 Key: SPARK-10079
 URL: https://issues.apache.org/jira/browse/SPARK-10079
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa


The {{column}} and {{col}} functions in {{R/pkg/R/Column.R}} are currently defined 
as S3 functions. I think it would be better to define them as S4 functions.






[jira] [Commented] (SPARK-9972) Add `struct`, `encode` and `decode` function in SparkR

2015-08-17 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699237#comment-14699237
 ] 

Yu Ishikawa commented on SPARK-9972:


This is a quick note to explain the reason. When I tried to implement 
{{sort_array}}, I got the following error. I haven't inspected it closely yet, but 
the cause seems to be in {{collect}}. I'll comment on it in detail later.

{noformat}
1. Error: sort_array on a DataFrame 
cannot coerce class "jobj" to a data.frame
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, 
message = function(c) invokeRestart(muffleMessage),
   warning = function(c) invokeRestart(muffleWarning))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, sort_array(df$a)))[1, 1], c(1, 2, 3)) at 
test_sparkSQL.R:787
5: expect_that(object, equals(expected, label = expected.label, ...), info = 
info, label = label)
6: condition(object)
7: compare(expected, actual, ...)
8: compare.numeric(expected, actual, ...)
9: all.equal(x, y, ...)
10: all.equal.numeric(x, y, ...)
11: attr.all.equal(target, current, tolerance = tolerance, scale = scale, ...)
12: mode(current)
13: collect(select(df, sort_array(df$a)))
14: collect(select(df, sort_array(df$a)))
15: .local(x, ...)
16: do.call(cbind.data.frame, list(cols, stringsAsFactors = stringsAsFactors))
17: (function (..., deparse.level = 1)
   data.frame(..., check.names = FALSE))(structure(list(`sort_array(a,true)` = 
list(
   environment, NA, NA)), .Names = "sort_array(a,true)"), 
stringsAsFactors = FALSE)
18: data.frame(..., check.names = FALSE)
19: as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
20: as.data.frame.list(x[[i]], optional = TRUE, stringsAsFactors = 
stringsAsFactors)
21: eval(as.call(c(expression(data.frame), x, check.names = !optional, 
stringsAsFactors = stringsAsFactors)))
22: eval(expr, envir, enclos)
23: data.frame(`sort_array(a,true)` = list(environment, NA, NA), check.names 
= FALSE,
   stringsAsFactors = FALSE)
24: as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
25: as.data.frame.list(x[[i]], optional = TRUE, stringsAsFactors = 
stringsAsFactors)
26: eval(as.call(c(expression(data.frame), x, check.names = !optional, 
stringsAsFactors = stringsAsFactors)))
27: eval(expr, envir, enclos)
28: data.frame(environment, NA, NA, check.names = FALSE, stringsAsFactors = 
FALSE)
29: as.data.frame(x[[i]], optional = TRUE)
30: as.data.frame.default(x[[i]], optional = TRUE)
31: stop(gettextf("cannot coerce class \"%s\" to a data.frame", 
deparse(class(x))), domain = NA)
32: .handleSimpleError(function (e)
   {
   e$calls <- head(sys.calls()[-seq_len(frame + 7)], -2)
   signalCondition(e)
   }, "cannot coerce class \"jobj\" to a data.frame", 
quote(as.data.frame.default(x[[i]],
   optional = TRUE)))
{noformat}

 Add `struct`, `encode` and `decode` function in SparkR
 --

 Key: SPARK-9972
 URL: https://issues.apache.org/jira/browse/SPARK-9972
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa

 Support the {{struct}} function on a DataFrame in SparkR. However, I think we 
 need to improve the {{collect}} function in SparkR in order to implement 
 {{struct}}.
 - struct
 - encode
 - decode
 - array_contains
 - sort_array






[jira] [Commented] (SPARK-10026) Implement some common Params for regression in PySpark

2015-08-17 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699240#comment-14699240
 ] 

Yanbo Liang commented on SPARK-10026:
-

I'm working on it.

 Implement some common Params for regression in PySpark
 --

 Key: SPARK-10026
 URL: https://issues.apache.org/jira/browse/SPARK-10026
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 Currently some Params are not shared classes in the Python API, which means we 
 need to write them for each class. The LinearRegression and LogisticRegression 
 related Params are listed here:
 * HasElasticNetParam
 * HasFitIntercept
 * HasStandardization






[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699280#comment-14699280
 ] 

Apache Spark commented on SPARK-7837:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8236

 NPE when save as parquet in speculative tasks
 -

 Key: SPARK-7837
 URL: https://issues.apache.org/jira/browse/SPARK-7837
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Critical

 The query is like {{df.orderBy(...).saveAsTable(...)}}.
 When there are no partitioning columns and there is a skewed key, I found the 
 following exception in speculative tasks. After these failures, it seems we 
 could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
 {code}
 java.lang.NullPointerException
   at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
   at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   at 
 org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
   at 
 org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}






[jira] [Assigned] (SPARK-7837) NPE when save as parquet in speculative tasks

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7837:
---

Assignee: Cheng Lian  (was: Apache Spark)

 NPE when save as parquet in speculative tasks
 -

 Key: SPARK-7837
 URL: https://issues.apache.org/jira/browse/SPARK-7837
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Critical

 The query is like {{df.orderBy(...).saveAsTable(...)}}.
 When there are no partitioning columns and there is a skewed key, I found the 
 following exception in speculative tasks. After these failures, it seems we 
 could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
 {code}
 java.lang.NullPointerException
   at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
   at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   at 
 org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
   at 
 org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}






[jira] [Updated] (SPARK-10035) Parquet filters does not process EqualNullSafe filter.

2015-08-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10035:
---
Assignee: Hyukjin Kwon

 Parquet filters does not process EqualNullSafe filter.
 --

 Key: SPARK-10035
 URL: https://issues.apache.org/jira/browse/SPARK-10035
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon
Priority: Minor

 This is a follow-up issue to SPARK-9814.
 Data sources (after {{selectFilters()}} in 
 {{org.apache.spark.sql.execution.datasources.DataSourceStrategy}}) pass 
 {{EqualNullSafe}} to {{ParquetRelation}}, but {{ParquetFilters}} for 
 {{ParquetRelation}} does not handle it.
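 For context, a null-safe comparison is what produces this source filter; a small sketch (the column and value are made up, and the DataFrame is assumed to be Parquet-backed):
 {code}
 import org.apache.spark.sql.functions.lit

 // "<=>" is translated into sources.EqualNullSafe when pushed down.
 val filtered = df.filter(df("name") <=> lit("Alice"))
 filtered.explain(true)  // shows whether the predicate reaches the Parquet scan
 {code}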






[jira] [Commented] (SPARK-8847) String concatination with column in SparkR

2015-08-17 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699219#comment-14699219
 ] 

Sun Rui commented on SPARK-8847:


The concat() expression addresses this issue.
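
As a sketch of that expression using the Scala DataFrame API (the SparkR wrapper would map onto the same function; the column names are taken from the issue description):

{code}
import org.apache.spark.sql.functions.{concat, lit}

// 1. prefix a literal onto a column; 2. join two columns with a "-" separator
val withPrefix = df.withColumn("newcol", concat(lit("a"), df("column")))
val joined     = df.withColumn("newcol", concat(df("col1"), lit("-"), df("col2")))
{code}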

 String concatination with column in SparkR
 --

 Key: SPARK-8847
 URL: https://issues.apache.org/jira/browse/SPARK-8847
 Project: Spark
  Issue Type: New Feature
  Components: R
Reporter: Amar Gondaliya

 1. String concatenation with the values of the column, i.e. df$newcol 
 <- paste("a", df$column) type functionality.
 2. String concatenation between columns, i.e. df$newcol <- 
 paste(df$col1, "-", df$col2)






[jira] [Assigned] (SPARK-7837) NPE when save as parquet in speculative tasks

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7837:
---

Assignee: Apache Spark  (was: Cheng Lian)

 NPE when save as parquet in speculative tasks
 -

 Key: SPARK-7837
 URL: https://issues.apache.org/jira/browse/SPARK-7837
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Apache Spark
Priority: Critical

 The query is like {{df.orderBy(...).saveAsTable(...)}}.
 When there are no partitioning columns and there is a skewed key, I found the 
 following exception in speculative tasks. After these failures, it seems we 
 could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
 {code}
 java.lang.NullPointerException
   at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
   at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   at 
 org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
   at 
 org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}






[jira] [Commented] (SPARK-10035) Parquet filters does not process EqualNullSafe filter.

2015-08-17 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699316#comment-14699316
 ] 

Cheng Lian commented on SPARK-10035:


Done, thanks for working on this!

 Parquet filters does not process EqualNullSafe filter.
 --

 Key: SPARK-10035
 URL: https://issues.apache.org/jira/browse/SPARK-10035
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon
Priority: Minor

 This is a follow-up issue to SPARK-9814.
 Data sources (after {{selectFilters()}} in 
 {{org.apache.spark.sql.execution.datasources.DataSourceStrategy}}) pass 
 {{EqualNullSafe}} to {{ParquetRelation}}, but {{ParquetFilters}} for 
 {{ParquetRelation}} does not handle it.






[jira] [Created] (SPARK-10048) Support arbitrary nested Java array in serde

2015-08-17 Thread Sun Rui (JIRA)
Sun Rui created SPARK-10048:
---

 Summary: Support arbitrary nested Java array in serde
 Key: SPARK-10048
 URL: https://issues.apache.org/jira/browse/SPARK-10048
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Sun Rui









[jira] [Created] (SPARK-10050) Support collecting data of MapType in DataFrame

2015-08-17 Thread Sun Rui (JIRA)
Sun Rui created SPARK-10050:
---

 Summary: Support collecting data of MapType in DataFrame
 Key: SPARK-10050
 URL: https://issues.apache.org/jira/browse/SPARK-10050
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Sun Rui









[jira] [Created] (SPARK-10049) Support collecting data of ArraryType in DataFrame

2015-08-17 Thread Sun Rui (JIRA)
Sun Rui created SPARK-10049:
---

 Summary: Support collecting data of ArraryType in DataFrame
 Key: SPARK-10049
 URL: https://issues.apache.org/jira/browse/SPARK-10049
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Sun Rui









[jira] [Created] (SPARK-10051) Support collecting data of StructType in DataFrame

2015-08-17 Thread Sun Rui (JIRA)
Sun Rui created SPARK-10051:
---

 Summary: Support collecting data of StructType in DataFrame
 Key: SPARK-10051
 URL: https://issues.apache.org/jira/browse/SPARK-10051
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Sun Rui









[jira] [Commented] (SPARK-10030) Managed memory leak detected when cache table

2015-08-17 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699372#comment-14699372
 ] 

Cheng Lian commented on SPARK-10030:


[~joshrosen] Seems to be related to Tungsten?

 Managed memory leak detected when cache table
 -

 Key: SPARK-10030
 URL: https://issues.apache.org/jira/browse/SPARK-10030
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: wangwei
Priority: Blocker

 I tested the latest spark-1.5.0 in local, standalone, and yarn mode and followed 
 the steps below; the following errors occurred.
 1. create table cache_test(id int,  name string) stored as textfile ;
 2. load data local inpath 
 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
 cache_test;
 3. cache table test as select * from cache_test distribute by id;
 configuration:
 spark.driver.memory     5g
 spark.executor.memory   28g
 spark.cores.max  21
 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 
 67108864 bytes, TID = 434
 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 
 434)
 java.util.NoSuchElementException: key not found: val_54
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:88)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)






[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10030:
--
Component/s: SQL

 Managed memory leak detected when cache table
 -

 Key: SPARK-10030
 URL: https://issues.apache.org/jira/browse/SPARK-10030
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: wangwei
Priority: Blocker

 I tested the latest spark-1.5.0 in local, standalone, and yarn mode and followed 
 the steps below; the following errors occurred.
 1. create table cache_test(id int,  name string) stored as textfile ;
 2. load data local inpath 
 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
 cache_test;
 3. cache table test as select * from cache_test distribute by id;
 configuration:
 spark.driver.memory     5g
 spark.executor.memory   28g
 spark.cores.max  21
 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 
 67108864 bytes, TID = 434
 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 
 434)
 java.util.NoSuchElementException: key not found: val_54
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:88)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)






[jira] [Created] (SPARK-10052) KafKaDirectDstream should filter empty partition task or rdd

2015-08-17 Thread SuYan (JIRA)
SuYan created SPARK-10052:
-

 Summary: KafKaDirectDstream should filter empty partition task or 
rdd
 Key: SPARK-10052
 URL: https://issues.apache.org/jira/browse/SPARK-10052
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.4.1
Reporter: SuYan


We run Spark 1.4.0 direct streaming and found that it submits stages and tasks 
even when the input batch has 0 events.
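
Until the stream itself skips empty batches, a possible driver-side guard looks like this (a sketch; `directStream` is assumed to come from KafkaUtils.createDirectStream, and the per-record processing is a placeholder):

{code}
directStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {          // skip batches with no Kafka records
    rdd.foreachPartition { records =>
      records.foreach(println)   // placeholder for the real processing
    }
  }
}
{code}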






[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:

2015-08-17 Thread Aram Mkrtchyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699397#comment-14699397
 ] 

Aram Mkrtchyan commented on SPARK-5480:
---

We also have the same problem almost every time when using the subgraph function 
before running the PageRank algorithm on a graph with 60M vertices with Spark 1.4.0. 
It used to work normally with versions <= 1.3.0.

 GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: 
 ---

 Key: SPARK-5480
 URL: https://issues.apache.org/jira/browse/SPARK-5480
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0, 1.3.1
 Environment: Yarn client
Reporter: Stephane Maarek

 Running the following code:
 val subgraph = graph.subgraph (
   vpred = (id, article) => // working predicate
 ).cache()
 println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")
 val prGraph = subgraph.staticPageRank(5).cache
 val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
   (v, title, rank) => (rank.getOrElse(0.0), title)
 }
 titleAndPrGraph.vertices.top(13) {
   Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
 }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1))
 Returns a graph with 5000 nodes and 4000 edges.
 Then it crashes during the PageRank with the following:
 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 
 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 
 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
 at 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
 at 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 at 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at 
 org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
 at 
 org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at 

[jira] [Commented] (SPARK-10068) Add links to sections in MLlib's user guide

2015-08-17 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700342#comment-14700342
 ] 

Feynman Liang commented on SPARK-10068:
---

Working on this

 Add links to sections in MLlib's user guide
 ---

 Key: SPARK-10068
 URL: https://issues.apache.org/jira/browse/SPARK-10068
 Project: Spark
  Issue Type: Improvement
Reporter: Feynman Liang
Priority: Minor

 In {{mllib-guide.md}}, the listing under {{MLlib types, algorithms and 
 utilities}} is inconsistent in linking to the sections it references. We should 
 provide links to every section mentioned in this listing.






[jira] [Resolved] (SPARK-9868) auto_sortmerge_join_8 fails non-deterministically in Jenkins

2015-08-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9868.
-
   Resolution: Cannot Reproduce
Fix Version/s: 1.5.0

 auto_sortmerge_join_8 fails non-deterministically in Jenkins
 

 Key: SPARK-9868
 URL: https://issues.apache.org/jira/browse/SPARK-9868
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.5.0


 https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/3219/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/auto_sortmerge_join_8/
 {code}
 Results do not match for auto_sortmerge_join_8:
 == Parsed Logical Plan ==
 'Project [unresolvedalias(count(1))]
  'Join Inner, Some(('a.key = 'b.key))
   'UnresolvedRelation [bucket_small], Some(a)
   'UnresolvedRelation [bucket_big], Some(b)
 == Analyzed Logical Plan ==
 _c0: bigint
 Aggregate [count(1) AS _c0#53110L]
  Join Inner, Some((key#53105 = key#53108))
   MetastoreRelation default, bucket_small, Some(a)
   MetastoreRelation default, bucket_big, Some(b)
 == Optimized Logical Plan ==
 Aggregate [count(1) AS _c0#53110L]
  Project
   Join Inner, Some((key#53105 = key#53108))
Project [key#53105]
 MetastoreRelation default, bucket_small, Some(a)
Project [key#53108]
 MetastoreRelation default, bucket_big, Some(b)
 == Physical Plan ==
 TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
  TungstenExchange SinglePartition
   TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)]
TungstenProject
 SortMergeJoin [key#53105], [key#53108]
  TungstenSort [key#53105 ASC], false, 0
   TungstenExchange hashpartitioning(key#53105)
ConvertToUnsafe
 HiveTableScan [key#53105], (MetastoreRelation default, bucket_small, 
 Some(a))
  TungstenSort [key#53108 ASC], false, 0
   TungstenExchange hashpartitioning(key#53108)
ConvertToUnsafe
 HiveTableScan [key#53108], (MetastoreRelation default, bucket_big, 
 Some(b))
 Code Generation: true
 _c0
 !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
 !76  74
 
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1$$anonfun$apply$mcV$sp$6.apply(HiveComparisonTest.scala:397)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1$$anonfun$apply$mcV$sp$6.apply(HiveComparisonTest.scala:368)
   at 
 scala.runtime.Tuple3Zipped$$anonfun$foreach$extension$1.apply(Tuple3Zipped.scala:109)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
   at scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:107)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:368)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:238)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:238)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.org$scalatest$BeforeAndAfter$$super$runTest(HiveCompatibilitySuite.scala:32)
   at 

[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-08-17 Thread Sudhakar Thota (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700377#comment-14700377
 ] 

Sudhakar Thota commented on SPARK-9776:
---

Thanks Michael and Eugene.

I have no problem with SQLContext at all; the problem is with HiveContext. 

I am trying to build a SQL statement for my query using Hive tables and want to 
save the results back into a Hive table. 
According to my understanding I need a HiveContext to do that; otherwise I am 
limited to registerTempTable instead of the saveAsTable operation. I am not sure 
if I am entirely correct, please let me know otherwise.
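
For reference, a minimal sketch of that flow with the shell's pre-created context (assuming a Hive-enabled build; the table names are made up):

{code}
// In spark-shell, sqlContext is a HiveContext when Spark is built with -Phive,
// so it can both query Hive tables and persist results back to the metastore.
val result = sqlContext.sql("SELECT key, count(*) AS cnt FROM src GROUP BY key")
result.write.saveAsTable("src_counts")  // instead of registerTempTable
{code}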

Thanks
Sudhakar Thota

 Another instance of Derby may have already booted the database 
 ---

 Key: SPARK-9776
 URL: https://issues.apache.org/jira/browse/SPARK-9776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
 Attachments: SPARK-9776-FL1.rtf


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an 
 error, though the same works for spark-1.4.1.
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
 database 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10071) QueueInputDStream Should Allow Checkpointing

2015-08-17 Thread Asim Jalis (JIRA)
Asim Jalis created SPARK-10071:
--

 Summary: QueueInputDStream Should Allow Checkpointing
 Key: SPARK-10071
 URL: https://issues.apache.org/jira/browse/SPARK-10071
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.4.1
Reporter: Asim Jalis


I would like for https://issues.apache.org/jira/browse/SPARK-8630 to be 
reverted and that issue resolved as won’t fix, and for QueueInputDStream to 
revert to its old behavior of not throwing an exception if checkpointing is
enabled.

Why? The reason is that this fix which throws an exception if the DStream is 
being checkpointed breaks the primary use case for QueueInputDStream, which is 
testing. For example, the Spark Streaming documentation recommends using 
QueueInputDStream for testing.

Why does throwing an exception if checkpointing is used break this class? The 
reason is that if I use windowing operations or updateStateByKey then the 
StreamingContext requires that I enable checkpointing. It throws an exception 
if I don’t enable checkpointing. But then if I enable checkpointing this class 
throws an exception saying that I cannot use checkpointing with the queue 
stream. The end result of this is that I cannot use QueueInputDStream to test 
windowing operations and updateStateByKey. It can only be used for trivial 
stateless DStreams.
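
A sketch of the testing pattern that the exception currently blocks (the stream contents and state function below are made up):

{code}
import scala.collection.mutable
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("/tmp/checkpoint")                 // required by updateStateByKey...

val queue  = mutable.Queue(sc.parallelize(Seq("a", "b", "a")))
val counts = ssc.queueStream(queue)               // ...but queueStream now rejects checkpointing
  .map(word => (word, 1))
  .updateStateByKey((values: Seq[Int], state: Option[Int]) =>
    Some(values.sum + state.getOrElse(0)))
counts.print()
{code}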

But would removing the exception-throwing logic make this code fragile? It 
should not. In the testing scenario the RDD that is passed into the 
QueueInputDStream is created through parallelize and it is checkpointable.

But what about people who are using QueueInputDStream in non-testing scenarios 
with non-recoverable RDDs? Perhaps a warning suffices here that checkpointing 
will not be able to recover state if their RDDs are non-recoverable. Then it is 
up to them how they resolve this situation.

Since right now we have no good way of determining if a QueueInputDStream 
contains RDDs that are recoverable or not, why not err on the side of leaving 
it to the user of the class to not expect recoverability, rather than forcing 
checkpointing.

In conclusion: my recommendation would be to revert to the old behavior and to 
resolve this bug as won’t fix.






[jira] [Updated] (SPARK-9906) User guide for LogisticRegressionSummary

2015-08-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9906:
-
Shepherd: Joseph K. Bradley

 User guide for LogisticRegressionSummary
 

 Key: SPARK-9906
 URL: https://issues.apache.org/jira/browse/SPARK-9906
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Assignee: Manoj Kumar

 SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model 
 statistics to ML pipeline logistic regression models. This feature is not 
 present in mllib and should be documented within {{ml-guide}}






[jira] [Updated] (SPARK-9786) Test backpressure

2015-08-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-9786:
-
Description: 
1. Build a test bench for generating different workloads and data with varying 
rates - DONE
2. Enable backpressure (see the configuration sketch below) and test whether it 
works with different workloads - IN PROGRESS
3. Test whether it works with multiple receivers
4. Test whether it works with Kinesis 
5. Test whether it works with Direct Kafka
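
For reference when running step 2, the feature under test is toggled through configuration (a sketch; the rate value is arbitrary):

{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")   // enable backpressure
  .set("spark.streaming.receiver.maxRate", "10000")      // optional static upper bound
{code}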

 Test backpressure
 -

 Key: SPARK-9786
 URL: https://issues.apache.org/jira/browse/SPARK-9786
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 1. Build a test bench for generating different workloads and data with 
 varying rates - DONE
 2. Enable backpressure and test whether it works with different 
 workloads - IN PROGRESS
 3. Test whether it works with multiple receivers
 4. Test whether it works with Kinesis 
 5. Test whether it works with Direct Kafka






[jira] [Commented] (SPARK-9662) ML 1.5 QA: API: Python API coverage

2015-08-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700460#comment-14700460
 ] 

Joseph K. Bradley commented on SPARK-9662:
--

Perfect, thanks.  Also, to confirm: Are you done checking for breaking changes 
to Python APIs?

 ML 1.5 QA: API: Python API coverage
 ---

 Key: SPARK-9662
 URL: https://issues.apache.org/jira/browse/SPARK-9662
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions.  We need to track:
 * Inconsistency: Do class/method/parameter names match?
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc.
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release.
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python, to be added in the next release cycle.  
 Please use a *separate* JIRA (linked below) for this list of to-do items.






[jira] [Resolved] (SPARK-9768) Add Python API for ml.feature.ElementwiseProduct

2015-08-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9768.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8061
[https://github.com/apache/spark/pull/8061]

 Add Python API for ml.feature.ElementwiseProduct
 

 Key: SPARK-9768
 URL: https://issues.apache.org/jira/browse/SPARK-9768
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Assignee: Yanbo Liang
Priority: Minor
 Fix For: 1.5.0


 Add Python API, user guide and example for ml.feature.ElementwiseProduct.






[jira] [Resolved] (SPARK-8916) Add @since tags to mllib.regression

2015-08-17 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-8916.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7518
[https://github.com/apache/spark/pull/7518]

 Add @since tags to mllib.regression
 ---

 Key: SPARK-8916
 URL: https://issues.apache.org/jira/browse/SPARK-8916
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-08-17 Thread Sudhakar Thota (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700526#comment-14700526
 ] 

Sudhakar Thota commented on SPARK-9776:
---

Thanks Michael and Eugene for your quick responses; I got the point now. 

I tested saveAsTable with sqlContext and it worked in the spark-shell. 
For the standalone script, I still have to create the SparkContext and HiveContext myself. 

Thanks
Sudhakar Thota

 Another instance of Derby may have already booted the database 
 ---

 Key: SPARK-9776
 URL: https://issues.apache.org/jira/browse/SPARK-9776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
 Attachments: SPARK-9776-FL1.rtf


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an 
 error, though the same works for spark-1.4.1.
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
 database 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9906) User guide for LogisticRegressionSummary

2015-08-17 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9906:
-
Description: SPARK-9112 introduces {{LogisticRegressionSummary}} to provide 
R-like model statistics to ML pipeline logistic regression models. This feature 
is not present in mllib and should be documented within {{ml-linear-methods}}  
(was: SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like 
model statistics to ML pipeline logistic regression models. This feature is not 
present in mllib and should be documented within {{ml-guide}})

 User guide for LogisticRegressionSummary
 

 Key: SPARK-9906
 URL: https://issues.apache.org/jira/browse/SPARK-9906
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Assignee: Manoj Kumar

 SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model 
 statistics to ML pipeline logistic regression models. This feature is not 
 present in mllib and should be documented within {{ml-linear-methods}}
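
For context, a rough sketch of the kind of snippet such a guide section could show (API names are those introduced by SPARK-9112; {{training}} is an assumed DataFrame with label/features columns, not something defined in this ticket):

{code}
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}

// `training` is an assumed DataFrame with "label" and "features" columns.
val model = new LogisticRegression().setMaxIter(10).fit(training)

// R-like training statistics are attached to the fitted model.
val summary = model.summary
println(summary.objectiveHistory.mkString(", "))    // loss at each iteration

summary match {
  case b: BinaryLogisticRegressionSummary =>
    println(s"areaUnderROC = ${b.areaUnderROC}")     // binary-classification metrics
  case _ =>
}
{code}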



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10077) Java package doc for spark.ml.feature

2015-08-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10077:
-

 Summary: Java package doc for spark.ml.feature
 Key: SPARK-10077
 URL: https://issues.apache.org/jira/browse/SPARK-10077
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Xiangrui Meng


Should be the same as SPARK-7808 but use Java for the code example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7808) Scala package doc for spark.ml.feature

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7808:
-
Summary: Scala package doc for spark.ml.feature  (was: Package doc for 
spark.ml.feature)

 Scala package doc for spark.ml.feature
 --

 Key: SPARK-7808
 URL: https://issues.apache.org/jira/browse/SPARK-7808
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We added several feature transformers in Spark 1.4. It would be great to add 
 package doc for `spark.ml.feature`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10076) makes MultilayerPerceptronClassifier layers and weights public

2015-08-17 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10076:
---

 Summary: makes MultilayerPerceptronClassifier layers and weights 
public 
 Key: SPARK-10076
 URL: https://issues.apache.org/jira/browse/SPARK-10076
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang


makes MultilayerPerceptronClassifier layers and weights public 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9856) Add expression functions into SparkR whose params are complicated

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700688#comment-14700688
 ] 

Apache Spark commented on SPARK-9856:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/8264

 Add expression functions into SparkR whose params are complicated
 -

 Key: SPARK-9856
 URL: https://issues.apache.org/jira/browse/SPARK-9856
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa

 Add expression functions whose parameters are a little complicated, like 
 {{regexp_extract(e: Column, exp: String, groupIdx: Int)}} and 
 {{regexp_replace(e: Column, pattern: String, replacement: String)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9856) Add expression functions into SparkR whose params are complicated

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9856:
---

Assignee: (was: Apache Spark)

 Add expression functions into SparkR whose params are complicated
 -

 Key: SPARK-9856
 URL: https://issues.apache.org/jira/browse/SPARK-9856
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa

 Add expression functions whose parameters are a little complicated, like 
 {{regexp_extract(e: Column, exp: String, groupIdx: Int)}} and 
 {{regexp_replace(e: Column, pattern: String, replacement: String)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9856) Add expression functions into SparkR whose params are complicated

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9856:
---

Assignee: Apache Spark

 Add expression functions into SparkR whose params are complicated
 -

 Key: SPARK-9856
 URL: https://issues.apache.org/jira/browse/SPARK-9856
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Assignee: Apache Spark

 Add expression functions whose parameters are a little complicated, like 
 {{regexp_extract(e: Column, exp: String, groupIdx: Int)}} and 
 {{regexp_replace(e: Column, pattern: String, replacement: String)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8520) Improve GLM's scalability on number of features

2015-08-17 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700357#comment-14700357
 ] 

Meihua Wu commented on SPARK-8520:
--

For 1, how about migrating to treeReduce and treeAggregate? 
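
For illustration, a rough sketch of the pattern (assuming {{data: RDD[LabeledPoint]}}, a current {{weights}} array, and a hypothetical {{localGradient}} helper; treeAggregate merges partial sums on the executors over {{depth}} levels, so the driver receives a single combined gradient instead of one per partition):

{code}
// Sketch only: swap a flat aggregate for treeAggregate when the feature dimension is large.
val zero = (new Array[Double](numFeatures), 0.0)          // (gradient accumulator, loss); numFeatures assumed known
val (gradSum, lossSum) = data.treeAggregate(zero)(
  seqOp = { case ((grad, loss), point) =>
    // `localGradient` is a hypothetical helper that adds this point's gradient into `grad`
    (grad, loss + localGradient(point, weights, grad))
  },
  combOp = { case ((g1, l1), (g2, l2)) =>
    var i = 0
    while (i < g1.length) { g1(i) += g2(i); i += 1 }      // in-place merge of partial gradients
    (g1, l1 + l2)
  },
  depth = 2)                                              // increase depth for very wide models
{code}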

 Improve GLM's scalability on number of features
 ---

 Key: SPARK-8520
 URL: https://issues.apache.org/jira/browse/SPARK-8520
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
  Labels: advanced

 MLlib's GLM implementation uses the driver to collect gradient updates. When 
 there are many features (on the order of 20 million), the driver becomes the 
 performance bottleneck. In practice, it is common to see a problem with a large 
 feature dimension, resulting from hashing or other feature transformations. So it 
 is important to improve MLlib's scalability in the number of features.
 There are a couple of possible solutions:
 1. Still use the driver to collect updates, but reduce the amount of data it 
 collects at each iteration.
 2. Apply 2D partitioning to the training data and store the model 
 coefficients in a distributed fashion (e.g., vector-free L-BFGS).
 3. Parameter server.
 4. ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9910) User guide for train validation split

2015-08-17 Thread Martin Zapletal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700372#comment-14700372
 ] 

Martin Zapletal commented on SPARK-9910:


I noticed 1.5.0 should be closed by now. What is the deadline for this ticket?

 User guide for train validation split
 -

 Key: SPARK-9910
 URL: https://issues.apache.org/jira/browse/SPARK-9910
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang

 SPARK-8484 adds a TrainValidationSplit transformer which needs user guide 
 docs and example code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8920) Add @since tags to mllib.linalg

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8920.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7729
[https://github.com/apache/spark/pull/7729]

 Add @since tags to mllib.linalg
 ---

 Key: SPARK-8920
 URL: https://issues.apache.org/jira/browse/SPARK-8920
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Sameer Abhyankar
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 4h
  Remaining Estimate: 4h





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity

2015-08-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10072:
--
Priority: Blocker  (was: Major)

 BlockGenerator can deadlock when the block queue of generated blocks 
 fills up to capacity
 --

 Key: SPARK-10072
 URL: https://issues.apache.org/jira/browse/SPARK-10072
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 Generated blocks are inserted into an ArrayBlockingQueue, and another thread 
 pulls them from the ArrayBlockingQueue and pushes them into the BlockManager. 
 If that queue fills up to capacity (the default is 10 blocks), then the 
 insertion into the queue (done in the function updateCurrentBuffer) gets blocked 
 inside a synchronized block. However, the thread that pulls blocks from the 
 queue uses the same lock to check the current state (active or stopped) while 
 pulling from the queue. Since the block-generating thread is blocked on the 
 full queue while holding the lock, the thread that is supposed to drain the 
 queue gets blocked. Ergo, deadlock.
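
A minimal, self-contained illustration of the lock interaction described above (the names below are only analogues of the real BlockGenerator code, not the actual implementation):

{code}
import java.util.concurrent.ArrayBlockingQueue

object BlockQueueDeadlockSketch {
  private val queue = new ArrayBlockingQueue[Array[Byte]](10)  // same default capacity: 10 blocks
  private val lock = new Object
  @volatile private var stopped = false

  // Analogue of updateCurrentBuffer: a blocking put performed while holding `lock`.
  def updateCurrentBuffer(block: Array[Byte]): Unit = lock.synchronized {
    queue.put(block)               // blocks when the queue is full, without ever releasing `lock`
  }

  // Analogue of the thread that drains the queue into the BlockManager: needs the same `lock`.
  def keepPushingBlocks(): Unit = {
    while (lock.synchronized { !stopped }) {   // waits on `lock`, which the stuck producer holds
      val block = queue.take()                 // never reached once the producer is blocked
      // ... push `block` to BlockManager ...
    }
  }
}
{code}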



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity

2015-08-17 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-10072:
-

 Summary: BlockGenerator can deadlock when the block queue of 
generated blocks fills up to capacity
 Key: SPARK-10072
 URL: https://issues.apache.org/jira/browse/SPARK-10072
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


Generated blocks are inserted into an ArrayBlockingQueue, and another thread 
pulls them from the ArrayBlockingQueue and pushes them into the BlockManager. If 
that queue fills up to capacity (the default is 10 blocks), then the insertion 
into the queue (done in the function updateCurrentBuffer) gets blocked inside a 
synchronized block. However, the thread that pulls blocks from the queue uses 
the same lock to check the current state (active or stopped) while pulling from 
the queue. Since the block-generating thread is blocked on the full queue while 
holding the lock, the thread that is supposed to drain the queue gets blocked. 
Ergo, deadlock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10072:


Assignee: Apache Spark  (was: Tathagata Das)

 BlockGenerator can deadlock when the block queue of generated blocks 
 fills up to capacity
 --

 Key: SPARK-10072
 URL: https://issues.apache.org/jira/browse/SPARK-10072
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Apache Spark
Priority: Blocker

 Generated blocks are inserted into an ArrayBlockingQueue, and another thread 
 pulls them from the ArrayBlockingQueue and pushes them into the BlockManager. 
 If that queue fills up to capacity (the default is 10 blocks), then the 
 insertion into the queue (done in the function updateCurrentBuffer) gets blocked 
 inside a synchronized block. However, the thread that pulls blocks from the 
 queue uses the same lock to check the current state (active or stopped) while 
 pulling from the queue. Since the block-generating thread is blocked on the 
 full queue while holding the lock, the thread that is supposed to drain the 
 queue gets blocked. Ergo, deadlock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10023) Unify DecisionTreeParams checkpointInterval between the Scala and Python APIs.

2015-08-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700445#comment-14700445
 ] 

Joseph K. Bradley commented on SPARK-10023:
---

For this and other JIRAs, could you please note how they are inconsistent?  
That will help in understanding if we need a fix ASAP (for this release), or if 
it can wait.  Thank you!

 Unify DecisionTreeParams checkpointInterval between the Scala and Python APIs.
 -

 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 checkpointInterval is one of the DecisionTreeParams in the Scala API, which is 
 inconsistent with the Python API; we should unify them.
 Proposal: make checkpointInterval a shared param.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9550) Configuration renaming, defaults changes, and deprecation for 1.5.0 (master ticket)

2015-08-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9550:
---
Description: 
This ticket tracks configurations which need to be renamed, deprecated, or have 
their defaults changed for Spark 1.5.0.

Note that subtasks / comments here do not necessarily need to reflect changes 
that must be performed.  Rather, tasks should be added here to make sure that 
the relevant configurations are at least checked before we cut releases.  This 
ticket will also help us to track configuration changes which must make it into 
the release notes.

*Configuration renaming*

- Consider renaming {{spark.shuffle.memoryFraction}} to 
{{spark.execution.memoryFraction}} 
([discussion|https://github.com/apache/spark/pull/7770#discussion-diff-36019144]).
- Rename all public-facing uses of {{unsafe}} to something less scary, such as 
{{tungsten}}

*Defaults changes*
- Codegen is now enabled by default.
- Tungsten is now enabled by default.
- Parquet schema merging is now disabled by default.
- In-memory relation partition pruning should be enabled by default 
(SPARK-9554).

*Deprecation*
- Local execution has been removed.

*Behavior Changes*
- Canonical names of SQL/DataFrame functions are now lower case (e.g. sum vs SUM)

- DirectOutputCommitter is not safe to use with speculation

  was:
This ticket tracks configurations which need to be renamed, deprecated, or have 
their defaults changed for Spark 1.5.0.

Note that subtasks / comments here do not necessarily need to reflect changes 
that must be performed.  Rather, tasks should be added here to make sure that 
the relevant configurations are at least checked before we cut releases.  This 
ticket will also help us to track configuration changes which must make it into 
the release notes.

*Configuration renaming*

- Consider renaming {{spark.shuffle.memoryFraction}} to 
{{spark.execution.memoryFraction}} 
([discussion|https://github.com/apache/spark/pull/7770#discussion-diff-36019144]).
- Rename all public-facing uses of {{unsafe}} to something less scary, such as 
{{tungsten}}

*Defaults changes*
- Codegen is now enabled by default.
- Tungsten is now enabled by default.
- Parquet schema merging is now disabled by default.
- In-memory relation partition pruning should be enabled by default 
(SPARK-9554).

*Deprecation*
- Local execution has been removed.

*Behavior Changes*
- Canonical names of SQL/DataFrame functions are now lower case (e.g. sum vs SUM)


 Configuration renaming, defaults changes, and deprecation for 1.5.0 (master 
 ticket)
 ---

 Key: SPARK-9550
 URL: https://issues.apache.org/jira/browse/SPARK-9550
 Project: Spark
  Issue Type: Task
  Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Priority: Blocker

 This ticket tracks configurations which need to be renamed, deprecated, or 
 have their defaults changed for Spark 1.5.0.
 Note that subtasks / comments here do not necessarily need to reflect changes 
 that must be performed.  Rather, tasks should be added here to make sure that 
 the relevant configurations are at least checked before we cut releases.  
 This ticket will also help us to track configuration changes which must make 
 it into the release notes.
 *Configuration renaming*
 - Consider renaming {{spark.shuffle.memoryFraction}} to 
 {{spark.execution.memoryFraction}} 
 ([discussion|https://github.com/apache/spark/pull/7770#discussion-diff-36019144]).
 - Rename all public-facing uses of {{unsafe}} to something less scary, such 
 as {{tungsten}}
 *Defaults changes*
 - Codegen is now enabled by default.
 - Tungsten is now enabled by default.
 - Parquet schema merging is now disabled by default.
 - In-memory relation partition pruning should be enabled by default 
 (SPARK-9554).
 *Deprecation*
 - Local execution has been removed.
 *Behavior Changes*
 - Canonical names of SQL/DataFrame functions are now lower case (e.g. sum vs 
  SUM)
 - DirectOutputCommitter is not safe to use with speculation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar

2015-08-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9974.

   Resolution: Fixed
 Assignee: Cheng Lian
Fix Version/s: 1.5.0

 SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the 
 assembly jar
 

 Key: SPARK-9974
 URL: https://issues.apache.org/jira/browse/SPARK-9974
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.5.0


 One of the consequences of this issue is that Parquet tables created in Hive 
 are not accessible from Spark SQL built with SBT. The Maven build is OK. This 
 issue can be worked around by adding 
 {{lib_managed/jars/parquet-hadoop-bundle-1.6.0.jar}} to 
 {{--driver-class-path}}.
 Git commit: 
 [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd]
 Build with SBT and check the assembly jar for classes in package 
 {{parquet.hadoop.api}}:
 {noformat}
 $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
 clean assembly/assembly
 ...
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
 fgrep parquet/hadoop/api
 org/apache/parquet/hadoop/api/
 org/apache/parquet/hadoop/api/DelegatingReadSupport.class
 org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
 org/apache/parquet/hadoop/api/InitContext.class
 org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
 org/apache/parquet/hadoop/api/ReadSupport.class
 org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
 org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
 org/apache/parquet/hadoop/api/WriteSupport.class
 {noformat}
 Only classes from {{org.apache.parquet:parquet-mr:1.7.0}} are present. Note that 
 classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the 
 {{org.apache}} namespace.
 Build with Maven and check the assembly jar for classes in package 
 {{parquet.hadoop.api}}:
 {noformat}
 $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
 -DskipTests clean package
 ...
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
 fgrep parquet/hadoop/api
 org/apache/parquet/hadoop/api/
 org/apache/parquet/hadoop/api/DelegatingReadSupport.class
 org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
 org/apache/parquet/hadoop/api/InitContext.class
 org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
 org/apache/parquet/hadoop/api/ReadSupport.class
 org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
 org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
 org/apache/parquet/hadoop/api/WriteSupport.class
 parquet/hadoop/api/
 parquet/hadoop/api/DelegatingReadSupport.class
 parquet/hadoop/api/DelegatingWriteSupport.class
 parquet/hadoop/api/InitContext.class
 parquet/hadoop/api/ReadSupport$ReadContext.class
 parquet/hadoop/api/ReadSupport.class
 parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
 parquet/hadoop/api/WriteSupport$WriteContext.class
 parquet/hadoop/api/WriteSupport.class
 {noformat}
 Expected classes are packaged properly.
 To reproduce the Parquet table access issue, first create a Parquet table 
 with Hive (say 0.13.1):
 {noformat}
 hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1;
 {noformat}
 Build Spark assembly jar with the SBT command above, start {{spark-shell}}:
 {noformat}
 scala> sqlContext.table("parquet_test").show()
 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=parquet_test
 15/08/14 17:52:50 INFO audit: ugi=lian  ip=unknown-ip-addr  cmd=get_table 
 : db=default tbl=parquet_test
 java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at 
 org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328)
 at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320)
 at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303)
 at scala.Option.map(Option.scala:145)
 at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303)
 at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
 at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
 at 
 

[jira] [Updated] (SPARK-7707) User guide and example code for KernelDensity

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7707:
-
Shepherd: Xiangrui Meng

 User guide and example code for KernelDensity
 -

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10076:


Assignee: (was: Apache Spark)

 make MultilayerPerceptronClassifier layers and weights public 
 --

 Key: SPARK-10076
 URL: https://issues.apache.org/jira/browse/SPARK-10076
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang

 make MultilayerPerceptronClassifier layers and weights public 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700629#comment-14700629
 ] 

Apache Spark commented on SPARK-10076:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8263

 make MultilayerPerceptronClassifier layers and weights public 
 --

 Key: SPARK-10076
 URL: https://issues.apache.org/jira/browse/SPARK-10076
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang

 make MultilayerPerceptronClassifier layers and weights public 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10076:


Assignee: Apache Spark

 make MultilayerPerceptronClassifier layers and weights public 
 --

 Key: SPARK-10076
 URL: https://issues.apache.org/jira/browse/SPARK-10076
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang
Assignee: Apache Spark

 make MultilayerPerceptronClassifier layers and weights public 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public

2015-08-17 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10076:

Summary: make MultilayerPerceptronClassifier layers and weights public   
(was: makes MultilayerPerceptronClassifier layers and weights public )

 make MultilayerPerceptronClassifier layers and weights public 
 --

 Key: SPARK-10076
 URL: https://issues.apache.org/jira/browse/SPARK-10076
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang

 makes MultilayerPerceptronClassifier layers and weights public 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10076) make MultilayerPerceptronClassifier layers and weights public

2015-08-17 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10076:

Description: make MultilayerPerceptronClassifier layers and weights public  
 (was: makes MultilayerPerceptronClassifier layers and weights public )

 make MultilayerPerceptronClassifier layers and weights public 
 --

 Key: SPARK-10076
 URL: https://issues.apache.org/jira/browse/SPARK-10076
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang

 make MultilayerPerceptronClassifier layers and weights public 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7808) Scala package doc for spark.ml.feature

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7808.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8260
[https://github.com/apache/spark/pull/8260]

 Scala package doc for spark.ml.feature
 --

 Key: SPARK-7808
 URL: https://issues.apache.org/jira/browse/SPARK-7808
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.5.0


 We added several feature transformers in Spark 1.4. It would be great to add 
 package doc for `spark.ml.feature`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9898) User guide for PrefixSpan

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9898:
-
Shepherd: Xiangrui Meng

 User guide for PrefixSpan
 -

 Key: SPARK-9898
 URL: https://issues.apache.org/jira/browse/SPARK-9898
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang

 PrefixSpan was added by SPARK-6487 and needs an accompanying user 
 guide/example code. This should be included in the MLlib docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9654) Add IndexToString in Pyspark

2015-08-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9654:
-
Summary: Add IndexToString in Pyspark  (was: Add StringIndexer inverse in 
Pyspark)

 Add IndexToString in Pyspark
 

 Key: SPARK-9654
 URL: https://issues.apache.org/jira/browse/SPARK-9654
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: holdenk
Assignee: holdenk
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10021) Add Python API for ml.feature.IndexToString

2015-08-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10021.
---
Resolution: Duplicate

 Add Python API for ml.feature.IndexToString
 ---

 Key: SPARK-10021
 URL: https://issues.apache.org/jira/browse/SPARK-10021
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor

 Add Python API for ml.feature.IndexToString



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-08-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7736:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky
Assignee: Marcelo Vanzin

 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-08-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7736:
-
Target Version/s: 1.6.0  (was: 1.5.1)

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky
Assignee: Marcelo Vanzin

 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-08-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7736:
-
Fix Version/s: 1.6.0

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky
Assignee: Marcelo Vanzin
 Fix For: 1.6.0


 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-08-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7736:
-
Fix Version/s: (was: 1.6.0)

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky
Assignee: Marcelo Vanzin

 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9951.
--
Resolution: Duplicate

 Example code for Multilayer Perceptron Classifier
 -

 Key: SPARK-9951
 URL: https://issues.apache.org/jira/browse/SPARK-9951
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley

 Add an example to the examples/ code folder for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700568#comment-14700568
 ] 

Joseph K. Bradley commented on SPARK-9951:
--

Just glanced at it.  I think that example will be fine.  I'll close this JIRA.  
Thanks!

 Example code for Multilayer Perceptron Classifier
 -

 Key: SPARK-9951
 URL: https://issues.apache.org/jira/browse/SPARK-9951
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley

 Add an example to the examples/ code folder for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9888) Update LDA User Guide

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700332#comment-14700332
 ] 

Apache Spark commented on SPARK-9888:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8254

 Update LDA User Guide
 -

 Key: SPARK-9888
 URL: https://issues.apache.org/jira/browse/SPARK-9888
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang
 Fix For: 1.5.0


 LDA has received numerous updates in 1.5, including:
  * OnlineLDAOptimizer:
 * Asymmetric document-topic priors
 * Document-topic hyperparameter optimization
  * LocalLDAModel
 * predict
 * logPerplexity / logLikelihood
  * DistributedLDAModel:
 * topDocumentsPerTopic
 * topTopicsPerDoc
  * Save/load
 It is important to note that OnlineLDAOptimizer => LocalLDAModel and 
 EMLDAOptimizer => DistributedLDAModel now support different features. The user 
 guide should document these differences.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9888) Update LDA User Guide

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9888:
---

Assignee: Apache Spark  (was: Feynman Liang)

 Update LDA User Guide
 -

 Key: SPARK-9888
 URL: https://issues.apache.org/jira/browse/SPARK-9888
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Assignee: Apache Spark
 Fix For: 1.5.0


 LDA has received numerous updates in 1.5, including:
  * OnlineLDAOptimizer:
 * Asymmetric document-topic priors
 * Document-topic hyperparameter optimization
  * LocalLDAModel
 * predict
 * logPerplexity / logLikelihood
  * DistributedLDAModel:
 * topDocumentsPerTopic
 * topTopicsPerDoc
  * Save/load
 It is important to note that OnlineLDAOptimizer => LocalLDAModel and 
 EMLDAOptimizer => DistributedLDAModel now support different features. The user 
 guide should document these differences.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9888) Update LDA User Guide

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9888:
---

Assignee: Feynman Liang  (was: Apache Spark)

 Update LDA User Guide
 -

 Key: SPARK-9888
 URL: https://issues.apache.org/jira/browse/SPARK-9888
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang
 Fix For: 1.5.0


 LDA has received numerous updates in 1.5, including:
  * OnlineLDAOptimizer:
 * Asymmetric document-topic priors
 * Document-topic hyperparameter optimization
  * LocalLDAModel
 * predict
 * logPerplexity / logLikelihood
  * DistributedLDAModel:
 * topDocumentsPerTopic
 * topTopicsPerDoc
  * Save/load
 It is important to note that OnlineLDAOptimizer => LocalLDAModel and 
 EMLDAOptimizer => DistributedLDAModel now support different features. The user 
 guide should document these differences.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10069) Python's ReduceByKeyAndWindow DStream Keeps Growing

2015-08-17 Thread Asim Jalis (JIRA)
Asim Jalis created SPARK-10069:
--

 Summary: Python's ReduceByKeyAndWindow DStream Keeps Growing
 Key: SPARK-10069
 URL: https://issues.apache.org/jira/browse/SPARK-10069
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
Reporter: Asim Jalis


When I use reduceByKeyAndWindow with func and invFunc (in PySpark), the size of 
the window keeps growing. I am appending the code that reproduces this issue. 
It prints out the count() of the dstream, which goes up by 10 elements every 
batch. 

Is this a bug in the Python implementation (compared with Scala), or is this 
expected behavior?

Here is the code that reproduces this issue.

{code}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pprint import pprint

print 'Initializing ssc'
ssc = StreamingContext(SparkContext(), batchDuration=1)
ssc.checkpoint('ckpt')

ds = ssc.textFileStream('input') \
    .map(lambda event: (event,1)) \
    .reduceByKeyAndWindow(
        func=lambda count1,count2: count1+count2,
        invFunc=lambda count1,count2: count1-count2,
        windowDuration=10,
        slideDuration=2)

ds.pprint()
ds.count().pprint()

print 'Starting ssc'
ssc.start()

import itertools
import time
import random

from distutils import dir_util

def batch_write(batch_data, batch_file_path):
    with open(batch_file_path,'w') as batch_file:
        for element in batch_data:
            line = str(element) + "\n"
            batch_file.write(line)

def xrange_write(
        batch_size = 5,
        batch_dir = 'input',
        batch_duration = 1):
    '''Every batch_duration write a file with batch_size numbers,
    forever. Start at 0 and keep incrementing. Intended for testing
    Spark Streaming code.'''

    dir_util.mkpath('./input')
    for i in itertools.count():
        min = batch_size * i
        max = batch_size * (i + 1)
        batch_data = xrange(min,max)
        file_path = batch_dir + '/' + str(i)
        batch_write(batch_data, file_path)
        time.sleep(batch_duration)

print 'Feeding data to app'
xrange_write()

ssc.awaitTermination()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5901) [PySpark] pickle classes in main module

2015-08-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-5901.
-
  Resolution: Invalid
Target Version/s:   (was: 1.5.0)

cloudpickle does support serializing classes defined in __main__, but pickle does 
not support that.

 [PySpark] pickle classes in main module
 ---

 Key: SPARK-5901
 URL: https://issues.apache.org/jira/browse/SPARK-5901
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu

 Currently, cloudpickle does not support serializing class objects defined in the 
 main module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10070) Remove Guava dependencies in user guides

2015-08-17 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-10070:
-

 Summary: Remove Guava dependencies in user guides
 Key: SPARK-10070
 URL: https://issues.apache.org/jira/browse/SPARK-10070
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Feynman Liang


Many code examples in the documentation use {{Lists.newArrayList}} (e.g. 
[ml-feature|https://github.com/apache/spark/blob/master/docs/ml-features.md]), 
which brings in a dependency on {{com.google.common.collect.Lists}}.

We can remove this dependency by using {{Arrays.asList}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10072:


Assignee: Tathagata Das  (was: Apache Spark)

 BlockGenerator can deadlock when the block queue of generated blocks 
 fills up to capacity
 --

 Key: SPARK-10072
 URL: https://issues.apache.org/jira/browse/SPARK-10072
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 Generated blocks are inserted into an ArrayBlockingQueue, and another thread 
 pulls them from the ArrayBlockingQueue and pushes them into the BlockManager. 
 If that queue fills up to capacity (the default is 10 blocks), then the 
 insertion into the queue (done in the function updateCurrentBuffer) gets blocked 
 inside a synchronized block. However, the thread that pulls blocks from the 
 queue uses the same lock to check the current state (active or stopped) while 
 pulling from the queue. Since the block-generating thread is blocked on the 
 full queue while holding the lock, the thread that is supposed to drain the 
 queue gets blocked. Ergo, deadlock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10072) BlockGenerator can deadlock when the block queue of generated blocks fills up to capacity

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700426#comment-14700426
 ] 

Apache Spark commented on SPARK-10072:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8257

 BlockGenerator can deadlock when the block queue of generated blocks 
 fills up to capacity
 --

 Key: SPARK-10072
 URL: https://issues.apache.org/jira/browse/SPARK-10072
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 Generated blocks are inserted into an ArrayBlockingQueue, and another thread 
 pulls them from the ArrayBlockingQueue and pushes them into the BlockManager. 
 If that queue fills up to capacity (the default is 10 blocks), then the 
 insertion into the queue (done in the function updateCurrentBuffer) gets blocked 
 inside a synchronized block. However, the thread that pulls blocks from the 
 queue uses the same lock to check the current state (active or stopped) while 
 pulling from the queue. Since the block-generating thread is blocked on the 
 full queue while holding the lock, the thread that is supposed to drain the 
 queue gets blocked. Ergo, deadlock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700451#comment-14700451
 ] 

Apache Spark commented on SPARK-7736:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8258

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky
Assignee: Marcelo Vanzin
 Fix For: 1.6.0


 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10074) Include Float in @specialized annotation

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10074:


Assignee: (was: Apache Spark)

 Include Float in @specialized annotation
 

 Key: SPARK-10074
 URL: https://issues.apache.org/jira/browse/SPARK-10074
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Yu
Priority: Minor

 There are several places in the Spark codebase where we use the @specialized 
 annotation covering Long and Double, 
 e.g. in OpenHashMap.scala:
 {code}
 class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
 initialCapacity: Int)
 {code}
 Float should be added to @specialized annotation as well.
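
 For concreteness, a minimal sketch of what the declaration quoted above would 
 look like with Float included (declaration only; the class body is omitted):
 {code}
 import scala.reflect.ClassTag

 // Sketch of the proposed change: only the annotation's type list grows to include Float.
 class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double, Float) V: ClassTag](
     initialCapacity: Int)
 {code}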



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10074) Include Float in @specialized annotation

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700535#comment-14700535
 ] 

Apache Spark commented on SPARK-10074:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8259

 Include Float in @specialized annotation
 

 Key: SPARK-10074
 URL: https://issues.apache.org/jira/browse/SPARK-10074
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Yu
Priority: Minor

 There are several places in the Spark codebase where we use the @specialized 
 annotation covering Long and Double, 
 e.g. in OpenHashMap.scala:
 {code}
 class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
 initialCapacity: Int)
 {code}
 Float should be added to @specialized annotation as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10074) Include Float in @specialized annotation

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10074:


Assignee: Apache Spark

 Include Float in @specialized annotation
 

 Key: SPARK-10074
 URL: https://issues.apache.org/jira/browse/SPARK-10074
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Yu
Assignee: Apache Spark
Priority: Minor

 There are several places in the Spark codebase where we use the @specialized 
 annotation covering Long and Double, 
 e.g. in OpenHashMap.scala:
 {code}
 class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
 initialCapacity: Int)
 {code}
 Float should be added to @specialized annotation as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10078) Vector-free L-BFGS

2015-08-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10078:
-

 Summary: Vector-free L-BFGS
 Key: SPARK-10078
 URL: https://issues.apache.org/jira/browse/SPARK-10078
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


This is to implement a scalable version of vector-free L-BFGS 
(http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8520) Improve GLM's scalability on number of features

2015-08-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700634#comment-14700634
 ] 

Xiangrui Meng commented on SPARK-8520:
--

No, this is for general discussion. I created one JIRA specifically for 
vector-free L-BFGS.

 Improve GLM's scalability on number of features
 ---

 Key: SPARK-8520
 URL: https://issues.apache.org/jira/browse/SPARK-8520
 Project: Spark
  Issue Type: Brainstorming
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
  Labels: advanced

 MLlib's GLM implementation uses the driver to collect gradient updates. When 
 there are many features (on the order of 20 million), the driver becomes the 
 performance bottleneck. In practice, it is common to see a problem with a large 
 feature dimension, resulting from hashing or other feature transformations. So it 
 is important to improve MLlib's scalability in the number of features.
 There are a couple of possible solutions:
 1. Still use the driver to collect updates, but reduce the amount of data it 
 collects at each iteration.
 2. Apply 2D partitioning to the training data and store the model 
 coefficients in a distributed fashion (e.g., vector-free L-BFGS).
 3. Parameter server.
 4. ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8520) Improve GLM's scalability on number of features

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8520:
-
Issue Type: Brainstorming  (was: Improvement)

 Improve GLM's scalability on number of features
 ---

 Key: SPARK-8520
 URL: https://issues.apache.org/jira/browse/SPARK-8520
 Project: Spark
  Issue Type: Brainstorming
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
  Labels: advanced

 MLlib's GLM implementation uses the driver to collect gradient updates. When 
 there are many features (20 million), the driver becomes the performance 
 bottleneck. In practice, it is common to see problems with a large feature 
 dimension resulting from hashing or other feature transformations, so it is 
 important to improve MLlib's scalability in the number of features.
 There are a couple of possible solutions:
 1. Still use the driver to collect updates, but reduce the amount of data it 
 collects at each iteration.
 2. Apply 2D partitioning to the training data and store the model 
 coefficients in a distributed fashion (e.g., vector-free L-BFGS).
 3. Use a parameter server.
 4. ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10059) Broken test: YarnClusterSuite

2015-08-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10059.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.5.0

 Broken test: YarnClusterSuite
 -

 Key: SPARK-10059
 URL: https://issues.apache.org/jira/browse/SPARK-10059
 Project: Spark
  Issue Type: Test
Reporter: Davies Liu
Assignee: Marcelo Vanzin
Priority: Critical
 Fix For: 1.5.0


 This test failed every time:  
 https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/116/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/_It_is_not_a_test_/history/
 {code}
 Error Message
 java.io.IOException: ResourceManager failed to start. Final state is STOPPED
 Stacktrace
 sbt.ForkMain$ForkError: java.io.IOException: ResourceManager failed to start. 
 Final state is STOPPED
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:302)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.access$500(MiniYARNCluster.java:87)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:422)
   at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
   at 
 org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
   at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:104)
   at 
 org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:46)
   at 
 org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.run(YarnClusterSuite.scala:46)
   at 
 org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
   at 
 org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: sbt.ForkMain$ForkError: ResourceManager failed to start. Final 
 state is STOPPED
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:297)
   ... 18 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call

2015-08-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9783:
---
Sprint:   (was: Spark 1.5 doc/QA sprint)

 Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
 -

 Key: SPARK-9783
 URL: https://issues.apache.org/jira/browse/SPARK-9783
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 PR #8035 made a quick fix for SPARK-9743 by introducing an extra 
 {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously this hurts 
 performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and 
 override {{listStatus()}} to inject cached {{FileStatus}} instances, similar 
 to what we did in {{ParquetRelation}}.
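 A rough sketch of the caching idea, using Hadoop's public {{FileInputFormat.listStatus}} 
 hook for illustration; {{CachedListingTextInputFormat}} and {{CachedStatuses}} are 
 hypothetical names, not the {{SqlNewHadoopRDD}} code itself:
 {code}
 import java.util.{List => JList}
 import org.apache.hadoop.fs.FileStatus
 import org.apache.hadoop.mapreduce.JobContext
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

 // Holds FileStatus objects gathered once by the relation (an assumption here),
 // so the scan does not have to re-list the input paths.
 object CachedStatuses {
   @volatile var statuses: JList[FileStatus] = java.util.Collections.emptyList[FileStatus]()
 }

 class CachedListingTextInputFormat extends TextInputFormat {
   override def listStatus(job: JobContext): JList[FileStatus] = {
     if (!CachedStatuses.statuses.isEmpty) CachedStatuses.statuses // reuse the cached listing
     else super.listStatus(job)                                    // fall back to a real listing
   }
 }
 {code}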



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call

2015-08-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9783:
---
Target Version/s: 1.6.0  (was: 1.5.0)

 Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
 -

 Key: SPARK-9783
 URL: https://issues.apache.org/jira/browse/SPARK-9783
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 PR #8035 made a quick fix for SPARK-9743 by introducing an extra 
 {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously this hurts 
 performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and 
 override {{listStatus()}} to inject cached {{FileStatus}} instances, similar 
 to what we did in {{ParquetRelation}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9205) org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11

2015-08-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9205:

Target Version/s: 1.6.0  (was: 1.5.0)

 org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11
 -

 Key: SPARK-9205
 URL: https://issues.apache.org/jira/browse/SPARK-9205
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.4.1, 1.5.0
Reporter: Tathagata Das
Assignee: Andrew Or
Priority: Critical

 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven/AMPLAB_JENKINS_BUILD_PROFILE=scala2.11,label=centos/7/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-08-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700392#comment-14700392
 ] 

Michael Armbrust commented on SPARK-9776:
-

The variable is always called {{sqlContext}}, but if you have compiled with Hive 
support then it will be of type {{HiveContext}}.
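A quick way to check in spark-shell (a sketch of existing behavior, not new API):
{code}
// Use the pre-created context instead of constructing another HiveContext.
println(sqlContext.getClass.getName)  // org.apache.spark.sql.hive.HiveContext on a Hive-enabled build
sqlContext.sql("SHOW TABLES").show()  // runs against the Hive metastore without booting a second Derby instance
{code}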

 Another instance of Derby may have already booted the database 
 ---

 Key: SPARK-9776
 URL: https://issues.apache.org/jira/browse/SPARK-9776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Mac Yosemite, spark-1.5.0
Reporter: Sudhakar Thota
 Attachments: SPARK-9776-FL1.rtf


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
 error. Though the same works for spark-1.4.1.
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
 database 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9902) Add Java and Python examples to user guide for 1-sample KS test

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9902.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8154
[https://github.com/apache/spark/pull/8154]

 Add Java and Python examples to user guide for 1-sample KS test
 ---

 Key: SPARK-9902
 URL: https://issues.apache.org/jira/browse/SPARK-9902
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Assignee: Jose Cambronero
 Fix For: 1.5.0


 SPARK-8598 adds 1-sample Kolmogorov-Smirnov tests, which need Java and 
 Python code examples in {{mllib-statistics#hypothesis-testing}}.
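 For reference, a minimal Scala snippet of the API that the Java and Python 
 examples would mirror; the input values here are made up:
 {code}
 import org.apache.spark.mllib.stat.Statistics

 val data = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))
 // Test against a standard normal distribution (mean 0.0, stddev 1.0).
 val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
 println(testResult)  // prints the KS statistic and p-value
 {code}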



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10063) Remove DirectParquetOutputCommitter

2015-08-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-10063.
---
Resolution: Won't Fix

Let's not remove it until we have a better alternative.


 Remove DirectParquetOutputCommitter
 ---

 Key: SPARK-10063
 URL: https://issues.apache.org/jira/browse/SPARK-10063
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical

 When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
 there is a chance that we can lose data. 
 Here is the code to reproduce the problem.
 {code}
 import org.apache.spark.sql.functions._
 import sqlContext.implicits._  // for the $"col" syntax
 val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: Int, partitionId: Int, attemptNumber: Int) => {
   if (partitionId == 0 && i == 5) {
     if (attemptNumber > 0) {
       Thread.sleep(15000)
       throw new Exception("new exception")
     } else {
       Thread.sleep(1)
     }
   }
   i
 })
 val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
   val context = org.apache.spark.TaskContext.get()
   val partitionId = context.partitionId
   val attemptNumber = context.attemptNumber
   iter.map(i => (i, partitionId, attemptNumber))
 }.toDF("i", "partitionId", "attemptNumber")
 df
   .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"), 
     $"partitionId", $"attemptNumber")
   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
 sqlContext.read.load("/home/yin/outputCommitter").count
 // The result is 99, and 5 is missing from the output.
 {code}
 What happened is that the original task finishes first and uploads its output 
 file to S3, then the speculative task somehow fails. Because we have to call 
 the output stream's close method, which uploads data to S3, we actually upload 
 the partial result generated by the failed speculative task to S3, and this 
 file overwrites the correct file generated by the original task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9592) Last implemented based on AggregateExpression1 are calculating the values for entire DataFrame partition not on GroupedData partition.

2015-08-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9592.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8172
[https://github.com/apache/spark/pull/8172]

 Last implemented based on AggregateExpression1 are calculating the values for 
 entire DataFrame partition not on GroupedData partition.
 --

 Key: SPARK-9592
 URL: https://issues.apache.org/jira/browse/SPARK-9592
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: gaurav
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 4h
  Remaining Estimate: 4h

 In the current implementation, the First and Last aggregates calculated their 
 values over the entire DataFrame partition, and the same value was then 
 returned for every GroupedData group in that partition 
 (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala).
 The fix makes First and Last compute the first and last value per GroupedData 
 group instead of over the entire DataFrame.
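 A small illustration of the expected semantics after the fix (not the patch 
 itself); the data and column names are made up:
 {code}
 import org.apache.spark.sql.functions.{first, last}

 val df = sqlContext.createDataFrame(Seq(
   ("a", 1), ("a", 2), ("b", 3), ("b", 4)
 )).toDF("key", "value")

 // Expected: one row per key with that key's own first and last value,
 // e.g. (a, 1, 2) and (b, 3, 4), rather than a single partition-wide value.
 df.groupBy("key")
   .agg(first("value").as("firstValue"), last("value").as("lastValue"))
   .show()
 {code}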



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10068) Add links to sections in MLlib's user guide

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10068:
--
Assignee: Feynman Liang

 Add links to sections in MLlib's user guide
 ---

 Key: SPARK-10068
 URL: https://issues.apache.org/jira/browse/SPARK-10068
 Project: Spark
  Issue Type: Improvement
Reporter: Feynman Liang
Assignee: Feynman Liang
Priority: Minor

 In {{mllib-guide.md}}, the listing under {{MLlib types, algorithms and utilities}} 
 links to the referenced sections inconsistently. We should provide links to 
 every section mentioned in this listing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10068) Add links to sections in MLlib's user guide

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10068:
--
Shepherd: Xiangrui Meng

 Add links to sections in MLlib's user guide
 ---

 Key: SPARK-10068
 URL: https://issues.apache.org/jira/browse/SPARK-10068
 Project: Spark
  Issue Type: Improvement
Reporter: Feynman Liang
Assignee: Feynman Liang
Priority: Minor

 In {{mllib-guide.md}}, the listing under {{MLlib types, algorithms and utilities}} 
 links to the referenced sections inconsistently. We should provide links to 
 every section mentioned in this listing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7808) Package doc for spark.ml.feature

2015-08-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-7808:


Assignee: Xiangrui Meng

 Package doc for spark.ml.feature
 

 Key: SPARK-7808
 URL: https://issues.apache.org/jira/browse/SPARK-7808
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We added several feature transformers in Spark 1.4. It would be great to add 
 package doc for `spark.ml.feature`.
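 A minimal sketch of what a Scala package doc looks like; the wording is 
 illustrative, not the text that was ultimately added:
 {code}
 package org.apache.spark.ml

 /**
  * == Feature transformers ==
  *
  * The `ml.feature` package provides common feature transformers that convert
  * raw data or intermediate columns into the columns expected by ML pipelines.
  */
 package object feature
 {code}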



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10025) Add Python API for ml.attribute

2015-08-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10025.
---
Resolution: Duplicate

 Add Python API for ml.attribute
 ---

 Key: SPARK-10025
 URL: https://issues.apache.org/jira/browse/SPARK-10025
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang

 Currently there is no Python implementation of ml.attribute, so we cannot use 
 Attribute in an ML pipeline. Some transformers need this feature; for example, 
 VectorSlicer can take a subarray of the original features by specifying column 
 names, which must be present in the column's Attribute metadata. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7808) Package doc for spark.ml.feature

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7808:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 Package doc for spark.ml.feature
 

 Key: SPARK-7808
 URL: https://issues.apache.org/jira/browse/SPARK-7808
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark

 We added several feature transformers in Spark 1.4. It would be great to add 
 package doc for `spark.ml.feature`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7808) Package doc for spark.ml.feature

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7808:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

 Package doc for spark.ml.feature
 

 Key: SPARK-7808
 URL: https://issues.apache.org/jira/browse/SPARK-7808
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We added several feature transformers in Spark 1.4. It would be great to add 
 package doc for `spark.ml.feature`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7808) Package doc for spark.ml.feature

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700520#comment-14700520
 ] 

Apache Spark commented on SPARK-7808:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/8260

 Package doc for spark.ml.feature
 

 Key: SPARK-7808
 URL: https://issues.apache.org/jira/browse/SPARK-7808
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We added several feature transformers in Spark 1.4. It would be great to add 
 package doc for `spark.ml.feature`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10074) Include Float in @specialized annotation

2015-08-17 Thread Ted Yu (JIRA)
Ted Yu created SPARK-10074:
--

 Summary: Include Float in @specialized annotation
 Key: SPARK-10074
 URL: https://issues.apache.org/jira/browse/SPARK-10074
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Yu
Priority: Minor


There are several places in the Spark codebase where we use the @specialized 
annotation covering Long, Int, and Double, e.g. in OpenHashMap.scala:
{code}
class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
initialCapacity: Int)
{code}
Float should be added to the @specialized annotation as well.
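A minimal sketch of the proposed change; SpecializedMap is a hypothetical class, not the Spark one:
{code}
import scala.reflect.ClassTag

// Including Float in @specialized generates a Float-specialized subclass, so
// primitive Float values are stored without java.lang.Float boxing.
class SpecializedMap[K : ClassTag, @specialized(Long, Int, Double, Float) V: ClassTag](
    initialCapacity: Int) {
  private val values = new Array[V](initialCapacity)  // a primitive array when V is specialized
  def update(i: Int, v: V): Unit = values(i) = v
  def apply(i: Int): V = values(i)
}

object SpecializedMapDemo extends App {
  val m = new SpecializedMap[String, Float](4)
  m(0) = 1.5f      // stored as a primitive Float
  println(m(0))
}
{code}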



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9846) User guide for Multilayer Perceptron Classifier

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9846:
---

Assignee: Alexander Ulanov  (was: Apache Spark)

 User guide for Multilayer Perceptron Classifier
 ---

 Key: SPARK-9846
 URL: https://issues.apache.org/jira/browse/SPARK-9846
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Alexander Ulanov





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9846) User guide for Multilayer Perceptron Classifier

2015-08-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9846:
---

Assignee: Apache Spark  (was: Alexander Ulanov)

 User guide for Multilayer Perceptron Classifier
 ---

 Key: SPARK-9846
 URL: https://issues.apache.org/jira/browse/SPARK-9846
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9846) User guide for Multilayer Perceptron Classifier

2015-08-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700564#comment-14700564
 ] 

Apache Spark commented on SPARK-9846:
-

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/8262

 User guide for Multilayer Perceptron Classifier
 ---

 Key: SPARK-9846
 URL: https://issues.apache.org/jira/browse/SPARK-9846
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Alexander Ulanov





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-17 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700567#comment-14700567
 ] 

Alexander Ulanov commented on SPARK-9951:
-

I've submitted a PR for the user guide. Could you confirm whether the example code 
in that PR can be used for this issue? https://github.com/apache/spark/pull/8262

 Example code for Multilayer Perceptron Classifier
 -

 Key: SPARK-9951
 URL: https://issues.apache.org/jira/browse/SPARK-9951
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley

 Add an example to the examples/ code folder for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


