[jira] [Resolved] (SPARK-20478) Document LinearSVC in R programming guide
[ https://issues.apache.org/jira/browse/SPARK-20478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung resolved SPARK-20478.
----------------------------------
Resolution: Fixed
Assignee: Miao Wang
Fix Version/s: 2.2.0, 2.3.0
Target Version/s: 2.2.0, 2.3.0

https://github.com/apache/spark/pull/17797

> Document LinearSVC in R programming guide
> -----------------------------------------
>
> Key: SPARK-20478
> URL: https://issues.apache.org/jira/browse/SPARK-20478
> Project: Spark
> Issue Type: Documentation
> Components: SparkR
> Affects Versions: 2.2.0
> Reporter: Felix Cheung
> Assignee: Miao Wang
> Fix For: 2.2.0, 2.3.0
>

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20520) R streaming tests failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988222#comment-15988222 ]

Felix Cheung commented on SPARK-20520:
--------------------------------------
looks like it's just running slow

> R streaming tests failed on Windows
> -----------------------------------
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Reporter: Felix Cheung
> Assignee: Felix Cheung
> Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages
> {code}
> Failed
> -------------------------------------------------------------------------
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery (@test_streaming.R#56)
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
>
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery (@test_streaming.R#60)
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
>
> 3. Failure: print from explain, lastProgress, status, isActive (@test_streaming.R#75)
> any(grepl("\"description\" : \"MemorySink\"", capture.output(lastProgress(q isn't true.
>
> 4. Failure: Stream other format (@test_streaming.R#95)
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
>
> 5. Failure: Stream other format (@test_streaming.R#98)
> any(...) isn't true.
> {code}
> Need to investigate
[jira] [Updated] (SPARK-20520) R streaming tests failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-20520:
---------------------------------
Issue Type: Bug (was: Umbrella)
[jira] [Updated] (SPARK-20520) R streaming tests failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-20520:
---------------------------------
Summary: R streaming tests failed on Windows (was: R streaming test failed on Windows)
[jira] [Updated] (SPARK-20520) R streaming test failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-20520:
---------------------------------
Description:
Running R CMD check on SparkR 2.2 RC1 packages

{code}
Failed
-------------------------------------------------------------------------
1. Failure: read.stream, write.stream, awaitTermination, stopQuery (@test_streaming.R#56)
head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
1/1 mismatches
[1] 0 - 3 == -3

2. Failure: read.stream, write.stream, awaitTermination, stopQuery (@test_streaming.R#60)
head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
1/1 mismatches
[1] 3 - 6 == -3

3. Failure: print from explain, lastProgress, status, isActive (@test_streaming.R#75)
any(grepl("\"description\" : \"MemorySink\"", capture.output(lastProgress(q isn't true.

4. Failure: Stream other format (@test_streaming.R#95)
head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
1/1 mismatches
[1] 0 - 3 == -3

5. Failure: Stream other format (@test_streaming.R#98)
any(...) isn't true.
{code}

Need to investigate

was:
This JIRA lists tasks for the next Spark release's QA period for SparkR. The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that.

h2. API
* Audit new public APIs (from the generated html doc)
** relative to Spark Scala/Java APIs
** relative to popular R libraries

h2. Documentation and example code
* For new algorithms, create JIRAs for updating the user guide sections & examples
* Update Programming Guide
* Update website
[jira] [Assigned] (SPARK-20520) R streaming test failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung reassigned SPARK-20520:
------------------------------------
Assignee: Felix Cheung (was: Joseph K. Bradley)
[jira] [Updated] (SPARK-20520) R streaming test failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-20520:
---------------------------------
Component/s: (was: Documentation)
[jira] [Created] (SPARK-20520) R streaming test failed on Windows
Felix Cheung created SPARK-20520:
------------------------------------
Summary: R streaming test failed on Windows
Key: SPARK-20520
URL: https://issues.apache.org/jira/browse/SPARK-20520
Project: Spark
Issue Type: Umbrella
Components: Documentation, SparkR
Reporter: Felix Cheung
Assignee: Joseph K. Bradley
Priority: Critical

This JIRA lists tasks for the next Spark release's QA period for SparkR. The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that.

h2. API
* Audit new public APIs (from the generated html doc)
** relative to Spark Scala/Java APIs
** relative to popular R libraries

h2. Documentation and example code
* For new algorithms, create JIRAs for updating the user guide sections & examples
* Update Programming Guide
* Update website
[jira] [Updated] (SPARK-20192) SparkR 2.2.0 migration guide, release note
[ https://issues.apache.org/jira/browse/SPARK-20192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-20192:
---------------------------------
Summary: SparkR 2.2.0 migration guide, release note (was: SparkR 2.2.0 release note)

> SparkR 2.2.0 migration guide, release note
> ------------------------------------------
>
> Key: SPARK-20192
> URL: https://issues.apache.org/jira/browse/SPARK-20192
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, SparkR
> Affects Versions: 2.2.0
> Reporter: Felix Cheung
> Assignee: Felix Cheung
>
> From looking at changes since 2.1.0, these should be documented in the migration guide / release note for the 2.2.0 release, as they are behavior changes:
> https://github.com/apache/spark/commit/422aa67d1bb84f913b06e6d94615adb6557e2870
> https://github.com/apache/spark/pull/17483 (createExternalTable)
[jira] [Commented] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988168#comment-15988168 ]

Felix Cheung commented on SPARK-20512:
--------------------------------------
version migration section is in R programming guide

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> --------------------------------------------------------------------
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, SparkR
> Reporter: Joseph K. Bradley
> Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its migration guide, and the R vignettes. Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In this release, ...")
> * Update R vignettes
>
> Note: This task is for large changes to the guides. New features are handled in [SPARK-18330].
[jira] [Assigned] (SPARK-20192) SparkR 2.2.0 release note
[ https://issues.apache.org/jira/browse/SPARK-20192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung reassigned SPARK-20192:
------------------------------------
Assignee: Felix Cheung
[jira] [Resolved] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung resolved SPARK-20208.
----------------------------------
Resolution: Fixed

> Document R fpGrowth support in vignettes, programming guide and code example
> ----------------------------------------------------------------------------
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, SparkR
> Affects Versions: 2.2.0
> Reporter: Felix Cheung
> Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
[jira] [Assigned] (SPARK-20015) Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example
[ https://issues.apache.org/jira/browse/SPARK-20015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung reassigned SPARK-20015:
------------------------------------
Assignee: Felix Cheung

> Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example
> -----------------------------------------------------------------------------------------------------
>
> Key: SPARK-20015
> URL: https://issues.apache.org/jira/browse/SPARK-20015
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, SparkR, Structured Streaming
> Affects Versions: 2.2.0
> Reporter: Felix Cheung
> Assignee: Felix Cheung
>
[jira] [Assigned] (SPARK-20519) When the input parameter is null, may be a runtime exception occurs
[ https://issues.apache.org/jira/browse/SPARK-20519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20519:
------------------------------------
Assignee: (was: Apache Spark)

> When the input parameter is null, may be a runtime exception occurs
> -------------------------------------------------------------------
>
> Key: SPARK-20519
> URL: https://issues.apache.org/jira/browse/SPARK-20519
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 2.1.0
> Reporter: liuxian
> Priority: Minor
>
> sqlContext.tables(null)
> setCustomHostname(null)
> checkHost(null, "test")
> checkHostPort(null, "test")
>
> throws exception at runtime:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.formatDatabaseName(SessionCatalog.scala:125)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:715)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:706)
> at org.apache.spark.sql.execution.command.ShowTablesCommand$$anonfun$11.apply(tables.scala:655)
[jira] [Assigned] (SPARK-20519) When the input parameter is null, may be a runtime exception occurs
[ https://issues.apache.org/jira/browse/SPARK-20519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20519:
------------------------------------
Assignee: Apache Spark
[jira] [Commented] (SPARK-20519) When the input parameter is null, may be a runtime exception occurs
[ https://issues.apache.org/jira/browse/SPARK-20519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988111#comment-15988111 ]

Apache Spark commented on SPARK-20519:
--------------------------------------
User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/17796
[jira] [Updated] (SPARK-20519) When the input parameter is null, may be a runtime exception occurs
[ https://issues.apache.org/jira/browse/SPARK-20519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuxian updated SPARK-20519:
----------------------------
Summary: When the input parameter is null, may be a runtime exception occurs (was: When the input parameter is null, may be a runtime exeception occurs)
[jira] [Created] (SPARK-20519) When the input parameter is null, may be a runtime exeception occurs
liuxian created SPARK-20519:
---------------------------
Summary: When the input parameter is null, may be a runtime exeception occurs
Key: SPARK-20519
URL: https://issues.apache.org/jira/browse/SPARK-20519
Project: Spark
Issue Type: Improvement
Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: liuxian
Priority: Minor

sqlContext.tables(null)
setCustomHostname(null)
checkHost(null, "test")
checkHostPort(null, "test")

throws exception at runtime:
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.formatDatabaseName(SessionCatalog.scala:125)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:715)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:706)
at org.apache.spark.sql.execution.command.ShowTablesCommand$$anonfun$11.apply(tables.scala:655)
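The stack trace in the report comes from a null database name being dereferenced inside `formatDatabaseName`. As a rough illustration of the fix the issue asks for, the sketch below (in Python, with hypothetical names that mirror but do not reproduce Spark's Scala internals) validates the argument at the entry point so callers get a descriptive error instead of a bare NullPointerException:

```python
# Hypothetical sketch of the input-validation pattern suggested by the
# report: reject a null/None database name explicitly, rather than let a
# later dereference fail with an opaque NullPointerException.
def format_database_name(name):
    if name is None:
        raise ValueError("database name must not be null")
    return name.lower()

def list_tables(db=None):
    # Guarding at the public entry point gives callers such as
    # sqlContext.tables(null) a clear error message.
    db = format_database_name(db if db is not None else "default")
    return "listing tables in " + db
```

The same pattern applies to `setCustomHostname`, `checkHost`, and `checkHostPort`: check the parameter first, fail with a message naming it.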
[jira] [Resolved] (SPARK-20497) Unhelpful error messages when trying to load data from file.
[ https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Barker resolved SPARK-20497.
------------------------------------
Resolution: Not A Bug

> Unhelpful error messages when trying to load data from file.
> ------------------------------------------------------------
>
> Key: SPARK-20497
> URL: https://issues.apache.org/jira/browse/SPARK-20497
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Brandon Barker
> Priority: Minor
>
> I'm attempting to do the simple task of reproducing the results from the linear regression example in Spark. I'm using Windows 10.
> val training = spark.read.format("libsvm")
> .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
> Although the file is definitely at the specified location, I just get a java.lang.NullPointerException at this line. The documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions doesn't seem to clear things up. The associated javadocs do not seem any better.
> In my view, such a simple operation should not be troublesome, but perhaps I've missed some critical documentation - if so, I apologize.
[jira] [Resolved] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell resolved SPARK-12837.
---------------------------------------
Resolution: Fixed
Fix Version/s: (was: 2.0.0)
               2.2.1, 2.3.0

> Spark driver requires large memory space for serialized results even there are no data collected to the driver
> --------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
> Issue Type: Question
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Tien-Dung LE
> Assignee: Wenchen Fan
> Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
> Executing a sql statement with a large number of partitions requires a high memory space for the driver even there are no requests to collect data back to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
> {code}
> The error message is
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 393 tasks (1025.9 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> {code}
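The error in SPARK-12837 fires even though nothing is collect()-ed because every finished task ships a serialized result (or result metadata) back to the driver, and the driver sums those sizes against spark.driver.maxResultSize. A simplified, hypothetical sketch of that accounting (not Spark's actual TaskSetManager code) shows why merely raising the partition count can trip a small cap:

```python
# Hypothetical, simplified sketch of the driver-side accounting behind the
# "Total size of serialized results ... is bigger than
# spark.driver.maxResultSize" error: each task's serialized result counts
# toward the cap, even when the job never collects data to the driver.
def check_serialized_results(task_result_sizes_bytes, max_result_size_bytes):
    total = 0
    for num_tasks, size in enumerate(task_result_sizes_bytes, start=1):
        total += size
        if total > max_result_size_bytes:
            raise RuntimeError(
                "Total size of serialized results of %d tasks (%d bytes) is "
                "bigger than spark.driver.maxResultSize (%d bytes)"
                % (num_tasks, total, max_result_size_bytes))
    return total
```

With 1000 partitions contributing roughly 2 KB of per-task metadata each, a 1 MB cap is exceeded around the 500th task, which matches the shape of the reported failure (393 tasks over a 1024 KB limit).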
[jira] [Assigned] (SPARK-20517) Download link in history server UI is not correct
[ https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20517:
------------------------------------
Assignee: Apache Spark

> Download link in history server UI is not correct
> -------------------------------------------------
>
> Key: SPARK-20517
> URL: https://issues.apache.org/jira/browse/SPARK-20517
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.2.0
> Reporter: Saisai Shao
> Assignee: Apache Spark
> Priority: Minor
>
> The download link in history server UI is concatenated with:
> {code}
> class="btn btn-info btn-mini">Download
> {code}
> Here the {{num}} field represents the number of attempts, which does not match the REST API. In the REST API, if no attempt id exists, this field should be empty; otherwise it should actually be the {{attemptId}}.
> This will lead to the issue of "no such app", rather than correctly downloading the event log.
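The fix the issue describes amounts to building the download URL from the attempt *id* when one exists, rather than the attempt *count*. A minimal Python sketch of that URL construction (the path shape follows the history server's REST endpoints; the function name and structure are hypothetical, not Spark's actual UI code):

```python
# Hypothetical sketch of constructing the history-server event-log
# download URL: include the attempt id segment only when the attempt
# actually has one, instead of always appending the attempt count.
def event_log_download_url(app_id, attempt_id=None):
    base = "/api/v1/applications/%s" % app_id
    if attempt_id is not None:
        # Attempted application: /applications/[app-id]/[attempt-id]/logs
        return "%s/%s/logs" % (base, attempt_id)
    # No attempt id: /applications/[app-id]/logs
    return "%s/logs" % base
```

Using the attempt count here is what produced the "no such app" response, since the REST API only resolves real attempt ids.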
[jira] [Assigned] (SPARK-20517) Download link in history server UI is not correct
[ https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20517:
------------------------------------
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20497) Unhelpful error messages when trying to load data from file.
[ https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988089#comment-15988089 ]

Hyukjin Kwon commented on SPARK-20497:
--------------------------------------
I can't reproduce this on Windows as below:

- Existing path

{code}
val lines = """
  |1 1:1.0 3:2.0 5:3.0
  |0
  |0 2:4.0 4:5.0 6:6.0
  """.stripMargin
val loc = "C:\\...\\foo"
val file = new java.io.File(loc)
com.google.common.io.Files.write(lines, file, java.nio.charset.StandardCharsets.UTF_8)
spark.read.format("libsvm").load(loc).show()
{code}

prints

{code}
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(6,[0,2,4],[1.0,2...|
|  0.0|           (6,[],[])|
|  0.0|(6,[1,3,5],[4.0,5...|
+-----+--------------------+
{code}

- Non-existing path

{code}
spark.read.format("libsvm").load("/NON_EXISTS").show()
{code}

produces

{code}
org.apache.spark.sql.AnalysisException: Path does not exist: file:/NON_EXISTS;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:354)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:342)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:342)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 48 elided
{code}

It does not look like a problem within Spark. I would rather suggest to close this if anyone is unable to reproduce this within Spark.
[jira] [Commented] (SPARK-20517) Download link in history server UI is not correct
[ https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988087#comment-15988087 ] Apache Spark commented on SPARK-20517: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/17795 > Download link in history server UI is not correct > - > > Key: SPARK-20517 > URL: https://issues.apache.org/jira/browse/SPARK-20517 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > The download link in history server UI is concatenated with: > {code} >class="btn btn-info btn-mini">Download > {code} > Here the {{num}} field represents the number of attempts, which does not match the REST > API. In the REST API, if the attempt id does not exist, then the {{num}} field > should be empty; otherwise this {{num}} field should actually be > {{attemptId}}. > This will lead to a "no such app" error, rather than correctly downloading > the event log. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20518) Supplement the new blockidsuite unit tests
[ https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20518: Assignee: Apache Spark > Supplement the new blockidsuite unit tests > -- > > Key: SPARK-20518 > URL: https://issues.apache.org/jira/browse/SPARK-20518 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: caoxuewen >Assignee: Apache Spark > > Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, > TempShuffleBlockId, and TempLocalBlockId -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20518) Supplement the new blockidsuite unit tests
[ https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20518: Assignee: (was: Apache Spark) > Supplement the new blockidsuite unit tests > -- > > Key: SPARK-20518 > URL: https://issues.apache.org/jira/browse/SPARK-20518 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: caoxuewen > > Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, > TempShuffleBlockId, and TempLocalBlockId -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20518) Supplement the new blockidsuite unit tests
[ https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988085#comment-15988085 ] Apache Spark commented on SPARK-20518: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/17794 > Supplement the new blockidsuite unit tests > -- > > Key: SPARK-20518 > URL: https://issues.apache.org/jira/browse/SPARK-20518 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: caoxuewen > > Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, > TempShuffleBlockId, and TempLocalBlockId -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20518) Supplement the new blockidsuite unit tests
caoxuewen created SPARK-20518: - Summary: Supplement the new blockidsuite unit tests Key: SPARK-20518 URL: https://issues.apache.org/jira/browse/SPARK-20518 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.1.0 Reporter: caoxuewen Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, TempShuffleBlockId, and TempLocalBlockId -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20517) Download link in history server UI is not correct
Saisai Shao created SPARK-20517: --- Summary: Download link in history server UI is not correct Key: SPARK-20517 URL: https://issues.apache.org/jira/browse/SPARK-20517 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0, 2.2.0 Reporter: Saisai Shao Priority: Minor The download link in history server UI is concatenated with: {code} Download {code} Here the {{num}} field represents the number of attempts; this is equal to the REST APIs. In the REST API, if the attempt id does not exist, then the {{num}} field should be empty; otherwise this {{num}} field should actually be {{attemptId}}. This will lead to a "no such app" error, rather than correctly downloading the event log. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20517) Download link in history server UI is not correct
[ https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-20517: Description: The download link in history server UI is concatenated with: {code} Download {code} Here the {{num}} field represents the number of attempts, which does not match the REST API. In the REST API, if the attempt id does not exist, then the {{num}} field should be empty; otherwise this {{num}} field should actually be {{attemptId}}. This will lead to a "no such app" error, rather than correctly downloading the event log. was: The download link in history server UI is concatenated with: {code} Download {code} Here the {{num}} field represents the number of attempts; this is equal to the REST APIs. In the REST API, if the attempt id does not exist, then the {{num}} field should be empty; otherwise this {{num}} field should actually be {{attemptId}}. This will lead to a "no such app" error, rather than correctly downloading the event log. > Download link in history server UI is not correct > - > > Key: SPARK-20517 > URL: https://issues.apache.org/jira/browse/SPARK-20517 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > The download link in history server UI is concatenated with: > {code} >class="btn btn-info btn-mini">Download > {code} > Here the {{num}} field represents the number of attempts, which does not match the REST > API. In the REST API, if the attempt id does not exist, then the {{num}} field > should be empty; otherwise this {{num}} field should actually be > {{attemptId}}. > This will lead to a "no such app" error, rather than correctly downloading > the event log. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
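The fix described above can be sketched as a small URL builder. This is a hypothetical illustration (the actual change lives in the history server's Scala page templates, and the helper name here is made up): the path segment must be the attempt id itself when one exists, and must be omitted entirely otherwise, never the attempt count.

```python
def event_log_download_url(base, app_id, attempt_id=None):
    """Build an event-log download URL of the shape the REST API expects.

    Hypothetical sketch: with an attempt id, the segment goes between the
    application id and "logs"; without one, it is omitted, not left as a
    count of attempts.
    """
    if attempt_id is not None:
        return f"{base}/api/v1/applications/{app_id}/{attempt_id}/logs"
    return f"{base}/api/v1/applications/{app_id}/logs"
```

For example, an application with no attempt id maps to `/api/v1/applications/app-1/logs`, while attempt "2" maps to `/api/v1/applications/app-1/2/logs`.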
[jira] [Assigned] (SPARK-20484) Add documentation to ALS code
[ https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20484: Assignee: (was: Apache Spark) > Add documentation to ALS code > - > > Key: SPARK-20484 > URL: https://issues.apache.org/jira/browse/SPARK-20484 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Priority: Trivial > > The documentation (both Scaladocs and inline comments) for the ALS code (in > package {{org.apache.spark.ml.recommendation}}) can be clarified where needed > and expanded where incomplete. This is especially important for parts of the > code that are written imperatively for performance, as these parts don't > benefit from the intuitive self-documentation of Scala's higher-level > language abstractions. Specifically, I'd like to add documentation fully > explaining the key functionality of the in-block and out-block objects, their > purpose, how they relate to the overall ALS algorithm, and how they are > calculated in such a way that new maintainers can ramp up much more quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20484) Add documentation to ALS code
[ https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20484: Assignee: Apache Spark > Add documentation to ALS code > - > > Key: SPARK-20484 > URL: https://issues.apache.org/jira/browse/SPARK-20484 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Assignee: Apache Spark >Priority: Trivial > > The documentation (both Scaladocs and inline comments) for the ALS code (in > package {{org.apache.spark.ml.recommendation}}) can be clarified where needed > and expanded where incomplete. This is especially important for parts of the > code that are written imperatively for performance, as these parts don't > benefit from the intuitive self-documentation of Scala's higher-level > language abstractions. Specifically, I'd like to add documentation fully > explaining the key functionality of the in-block and out-block objects, their > purpose, how they relate to the overall ALS algorithm, and how they are > calculated in such a way that new maintainers can ramp up much more quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20484) Add documentation to ALS code
[ https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988047#comment-15988047 ] Apache Spark commented on SPARK-20484: -- User 'danielyli' has created a pull request for this issue: https://github.com/apache/spark/pull/17793 > Add documentation to ALS code > - > > Key: SPARK-20484 > URL: https://issues.apache.org/jira/browse/SPARK-20484 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Priority: Trivial > > The documentation (both Scaladocs and inline comments) for the ALS code (in > package {{org.apache.spark.ml.recommendation}}) can be clarified where needed > and expanded where incomplete. This is especially important for parts of the > code that are written imperatively for performance, as these parts don't > benefit from the intuitive self-documentation of Scala's higher-level > language abstractions. Specifically, I'd like to add documentation fully > explaining the key functionality of the in-block and out-block objects, their > purpose, how they relate to the overall ALS algorithm, and how they are > calculated in such a way that new maintainers can ramp up much more quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan
[ https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988004#comment-15988004 ] Apache Spark commented on SPARK-20496: -- User 'anabranch' has created a pull request for this issue: https://github.com/apache/spark/pull/17792 > KafkaWriter Uses Unanalyzed Logical Plan > > > Key: SPARK-20496 > URL: https://issues.apache.org/jira/browse/SPARK-20496 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bill Chambers > > Right now we use the unanalyzed logical plan for writing to Kafka; we should > use the analyzed plan. > https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987976#comment-15987976 ] Xin Wu commented on SPARK-18727: Thanks! > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987966#comment-15987966 ] Eric Liang commented on SPARK-18727: +1 for supporting ALTER TABLE REPLACE COLUMNS > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987939#comment-15987939 ] Xin Wu commented on SPARK-18727: [~ekhliang] I see. I will try to support ALTER TABLE SCHEMA. Also, this is similar to or the same as ALTER TABLE REPLACE COLUMNS, which is documented as an unsupported Hive feature in SqlBase.g4. Do we have a preference for which one to use? > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20516) Spark SQL documentation out of date?
[ https://issues.apache.org/jira/browse/SPARK-20516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987933#comment-15987933 ] Ratandeep Ratti commented on SPARK-20516: - Not sure why I cannot assign the ticket to myself. :/ > Spark SQL documentation out of date? > > > Key: SPARK-20516 > URL: https://issues.apache.org/jira/browse/SPARK-20516 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ratandeep Ratti > > I was trying out the examples on the [Spark Sql > page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It > seems that now we have to invoke {{master()}} on the SparkSession > builder, and also warehouseLocation is now a URI. > I can fix the documentation (sql-programming-guide.html) and send a PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20516) Spark SQL documentation out of date?
[ https://issues.apache.org/jira/browse/SPARK-20516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ratandeep Ratti updated SPARK-20516: Description: I was trying out the examples on the [Spark Sql page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It seems that now we have to invoke {{master()}} on the SparkSession builder, and also warehouseLocation is now a URI. I can fix the documentation (sql-programming-guide.html) and send a PR. was: I was trying out the examples on the [Spark Sql page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It seems that now we have to invoke {{master}} on the SparkSession builder, and also warehouseLocation is now a URI. I can fix the documentation (sql-programming-guide.html) and send a PR. > Spark SQL documentation out of date? > > > Key: SPARK-20516 > URL: https://issues.apache.org/jira/browse/SPARK-20516 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ratandeep Ratti > > I was trying out the examples on the [Spark Sql > page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It > seems that now we have to invoke {{master()}} on the SparkSession > builder, and also warehouseLocation is now a URI. > I can fix the documentation (sql-programming-guide.html) and send a PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20516) Spark SQL documentation out of date?
Ratandeep Ratti created SPARK-20516: --- Summary: Spark SQL documentation out of date? Key: SPARK-20516 URL: https://issues.apache.org/jira/browse/SPARK-20516 Project: Spark Issue Type: Task Components: SQL Affects Versions: 2.1.0 Reporter: Ratandeep Ratti I was trying out the examples on the [Spark Sql page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It seems that now we have to invoke {{master}} on the SparkSession builder, and also warehouseLocation is now a URI. I can fix the documentation (sql-programming-guide.html) and send a PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987929#comment-15987929 ] Xiao Li commented on SPARK-18727: - The idea of [~ekhliang] sounds good to me. > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20478) Document LinearSVC in R programming guide
[ https://issues.apache.org/jira/browse/SPARK-20478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987903#comment-15987903 ] Miao Wang commented on SPARK-20478: --- OK. I will do it. Thanks for pointing me to the place. > Document LinearSVC in R programming guide > - > > Key: SPARK-20478 > URL: https://issues.apache.org/jira/browse/SPARK-20478 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18727: --- The common case we see is users having a complete schema (e.g. output of ETL pipeline) and wanting to update/merge it in an automated job. In this case it's actually more work to alter the columns one at a time, rather than all at once. > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987874#comment-15987874 ] Xin Wu commented on SPARK-18727: [~ekhliang] First of all, I am not sure whether it is wise to introduce more non-standard SQL syntax into Spark's DDL. In addition, ALTER TABLE SCHEMA, or ALTER TABLE SET/UPDATE/MODIFY SCHEMA, depending on how we call it, requires users to put in the whole list of column definitions for what may be a small change to a single column. This is inconvenient, especially when the table is relatively wide. What do you think [~smilegator] ? > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)
[ https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20489: - Component/s: (was: Shuffle) (was: Spark Core) > Different results in local mode and yarn mode when working with dates (race > condition with SimpleDateFormat?) > - > > Key: SPARK-20489 > URL: https://issues.apache.org/jira/browse/SPARK-20489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: yarn-client mode in Zeppelin, Cloudera > Spark2-distribution >Reporter: Rick Moritz >Priority: Critical > > Running the following code (in Zeppelin, or spark-shell), I get different > results, depending on whether I am using local[*] -mode or yarn-client mode: > {code:title=test case|borderStyle=solid} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > import spark.implicits._ > val counter = 1 to 2 > val size = 1 to 3 > val sampleText = spark.createDataFrame( > sc.parallelize(size) > .map(Row(_)), > StructType(Array(StructField("id", IntegerType, nullable=false)) > ) > ) > .withColumn("loadDTS",lit("2017-04-25T10:45:02.2")) > > val rddList = counter.map( > count => sampleText > .withColumn("loadDTS2", > date_format(date_add(col("loadDTS"),count),"-MM-dd'T'HH:mm:ss.SSS")) > .drop(col("loadDTS")) > .withColumnRenamed("loadDTS2","loadDTS") > .coalesce(4) > .rdd > ) > val resultText = spark.createDataFrame( > spark.sparkContext.union(rddList), > sampleText.schema > ) > val testGrouped = resultText.groupBy("id") > val timestamps = testGrouped.agg( > max(unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) as > "timestamp" > ) > val loadDateResult = resultText.join(timestamps, "id") > val filteredresult = loadDateResult.filter($"timestamp" === > unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) > filteredresult.count > {code} > The expected result, *3* is what I obtain in local mode, but as soon as I run > fully distributed, I get *0*. 
If I increase size to {{1 to 32000}}, I do get > some results (depending on the size of counter) - none of which makes any > sense. > Up to the application of the last filter, at first glance everything looks > okay, but then something goes wrong. Potentially this is due to lingering > re-use of SimpleDateFormats, but I can't get it to happen in a > non-distributed mode. The generated execution plan is the same in each case, > as expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)
[ https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987848#comment-15987848 ] Shixiong Zhu commented on SPARK-20489: -- Could you show the results of `loadDateResult.show(false)`? My hunch is it's a time zone issue. > Different results in local mode and yarn mode when working with dates (race > condition with SimpleDateFormat?) > - > > Key: SPARK-20489 > URL: https://issues.apache.org/jira/browse/SPARK-20489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: yarn-client mode in Zeppelin, Cloudera > Spark2-distribution >Reporter: Rick Moritz >Priority: Critical > > Running the following code (in Zeppelin, or spark-shell), I get different > results, depending on whether I am using local[*] -mode or yarn-client mode: > {code:title=test case|borderStyle=solid} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > import spark.implicits._ > val counter = 1 to 2 > val size = 1 to 3 > val sampleText = spark.createDataFrame( > sc.parallelize(size) > .map(Row(_)), > StructType(Array(StructField("id", IntegerType, nullable=false)) > ) > ) > .withColumn("loadDTS",lit("2017-04-25T10:45:02.2")) > > val rddList = counter.map( > count => sampleText > .withColumn("loadDTS2", > date_format(date_add(col("loadDTS"),count),"-MM-dd'T'HH:mm:ss.SSS")) > .drop(col("loadDTS")) > .withColumnRenamed("loadDTS2","loadDTS") > .coalesce(4) > .rdd > ) > val resultText = spark.createDataFrame( > spark.sparkContext.union(rddList), > sampleText.schema > ) > val testGrouped = resultText.groupBy("id") > val timestamps = testGrouped.agg( > max(unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) as > "timestamp" > ) > val loadDateResult = resultText.join(timestamps, "id") > val filteredresult = loadDateResult.filter($"timestamp" === > unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) > filteredresult.count > {code} > The expected result, *3* is what I 
obtain in local mode, but as soon as I run > fully distributed, I get *0*. If I increase size to {{1 to 32000}}, I do get > some results (depending on the size of counter) - none of which makes any > sense. > Up to the application of the last filter, at first glance everything looks > okay, but then something goes wrong. Potentially this is due to lingering > re-use of SimpleDateFormats, but I can't get it to happen in a > non-distributed mode. The generated execution plan is the same in each case, > as expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
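If the reporter's hunch about lingering re-use of SimpleDateFormat instances is right, the standard remedy is one formatter per thread, since java.text.SimpleDateFormat keeps mutable internal state and is not thread-safe. A hedged sketch of that per-thread idiom in Python (Python's own `datetime.strptime` is thread-safe, so this only illustrates the pattern, not the Java bug):

```python
import threading
from datetime import datetime

# One slot per thread: each thread lazily creates and reuses its own
# parser state instead of sharing a single mutable instance.
_tl = threading.local()

def parse_load_dts(s):
    """Parse a loadDTS string such as 2017-04-25T10:45:02.2."""
    if not hasattr(_tl, "pattern"):
        _tl.pattern = "%Y-%m-%dT%H:%M:%S.%f"  # created once per thread
    return datetime.strptime(s, _tl.pattern)
```

In Java the same idea is usually written as `ThreadLocal.withInitial(() -> new SimpleDateFormat(...))`, or avoided entirely by using the immutable java.time formatters.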
[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18727: --- Can we add ALTER TABLE SCHEMA to update the entire schema? That would cover any edge cases. > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987830#comment-15987830 ] Xin Wu commented on SPARK-18727: [~simeons] You are right.. My PR does not include the feature that allows you to add a new field into a complex type. Such a feature could be supported by {code}ALTER TABLE CHANGE COLUMN {code}, where newType has newly added fields. I am also working on this part. > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
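The "compatible schemas" discussed throughout this SPARK-18727 thread amount to a merge rule. A hedged sketch of one such rule (a hypothetical helper, not Spark's actual implementation; schemas are modeled simply as name-to-type dicts): new columns are appended, existing columns keep their type, and a type conflict is rejected.

```python
def merge_table_schema(table, incoming):
    """Merge an incoming file schema into a table schema.

    Hypothetical rule: append columns the table does not yet have,
    leave existing columns untouched, and raise on a type conflict
    rather than silently rewriting the table.
    """
    merged = dict(table)
    for col, typ in incoming.items():
        if col in merged and merged[col] != typ:
            raise ValueError(f"incompatible type for column {col!r}: "
                             f"{merged[col]} vs {typ}")
        merged[col] = typ
    return merged
```

Under this rule, appending a file that adds a `ts` column evolves the table schema, while a file that redefines `id` with a different type is refused, which mirrors the ALTER TABLE vs. automatic-evolution trade-off debated above.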
[jira] [Updated] (SPARK-20514) Upgrade Jetty to 9.3.11.v20160721
[ https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Grover updated SPARK-20514: Summary: Upgrade Jetty to 9.3.11.v20160721 (was: Upgrade Jetty to 9.3.13.v20161014) > Upgrade Jetty to 9.3.11.v20160721 > - > > Key: SPARK-20514 > URL: https://issues.apache.org/jira/browse/SPARK-20514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Mark Grover > > Currently, we are using Jetty version 9.2.16.v20160414. > However, Hadoop 3 uses > [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38] > (the Jetty upgrade was brought in by HADOOP-10075). > Currently, when you try to build Spark with Hadoop 3, compilation fails due to > incompatibilities between the Jetty versions used by Hadoop and Spark: > {code} > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: > error: object gzip is not a member of package org.eclipse.jetty.servlets > [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler > [ERROR] ^ > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: > error: not found: type GzipHandler > [ERROR] val gzipHandler = new GzipHandler > [ERROR] ^ > [ERROR] two errors found > {code} > So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.
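The compile errors quoted above come from a package relocation: in Jetty 9.3, `GzipHandler` moved out of the `jetty-servlets` module. A sketch of the import-level change Spark would need (the exact target package should be verified against the Jetty 9.3 javadocs):

```scala
// Jetty 9.2.x location (what Spark's JettyUtils.scala currently imports):
// import org.eclipse.jetty.servlets.gzip.GzipHandler

// Jetty 9.3.x location -- the class moved into jetty-server's handler
// packages; this is the import the upgrade would need:
import org.eclipse.jetty.server.handler.gzip.GzipHandler

// Construction stays the same once the import is fixed.
val gzipHandler = new GzipHandler
```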
[jira] [Assigned] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-20515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20515: Assignee: Apache Spark > Issue with reading Hive ORC tables having char/varchar columns in Spark SQL > --- > > Key: SPARK-20515 > URL: https://issues.apache.org/jira/browse/SPARK-20515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS EMR Cluster >Reporter: Udit Mehrotra >Assignee: Apache Spark > > Reading from a Hive ORC table containing char/varchar columns fails in Spark > SQL. This is caused by the fact that Spark SQL internally replaces the > char/varchar columns with String data type. So, while reading from the table > created in Hive which has varchar/char columns, it ends up using the wrong > reader and causes a ClassCastException. > > Here is the exception: > > java.lang.ClassCastException: > org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to > org.apache.hadoop.io.Text > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrap(HiveInspectors.scala:324) > at > org.apache.spark.sql.hive.HadoopTableReader$.unwrap(TableReader.scala:333) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > While the issue has been fixed in Spark 2.1.1 and 2.2.0 with SPARK-19459, it > still needs to be fixed in Spark 2.0.
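A minimal way to trigger the mismatch described above could look like this. It is a sketch with invented table and column names, assuming a Hive-enabled `SparkSession` named `spark` on an affected 2.0.x build:

```scala
// Hypothetical reproduction sketch; names are made up for illustration.
// Create an ORC table with a varchar column through the Hive path: Hive
// records VARCHAR(10) in the metastore, but Spark SQL internally treats
// the column as StringType.
spark.sql("CREATE TABLE users_orc (name VARCHAR(10)) STORED AS ORC")
spark.sql("INSERT INTO users_orc VALUES ('alice')")

// On affected 2.0.x builds, the read path picks the String object inspector
// for what is actually a HiveVarcharWritable, producing the
// ClassCastException shown in the stack trace above.
spark.sql("SELECT name FROM users_orc").show()
```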
[jira] [Assigned] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-20515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20515: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-20515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987822#comment-15987822 ] Apache Spark commented on SPARK-20515: -- User 'umehrot2' has created a pull request for this issue: https://github.com/apache/spark/pull/17791
[jira] [Resolved] (SPARK-16333) Excessive Spark history event/json data size (5GB each)
[ https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16333. Resolution: Duplicate Let's mark this as a duplicate for now. There are probably minor increments that can be made if the size is still a problem, but probably better to track those individually. > Excessive Spark history event/json data size (5GB each) > --- > > Key: SPARK-16333 > URL: https://issues.apache.org/jira/browse/SPARK-16333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 > Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) > and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server > release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build) >Reporter: Peter Liu > Labels: performance, spark2.0.0 > > With Spark2.0.0-preview (May-24 build), the history event data (the json > file), that is generated for each Spark application run (see below), can be > as big as 5GB (instead of 14 MB for exactly the same application run and the > same input data of 1TB under Spark1.6.1) > -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959- > -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213- > -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856- > -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556- > The test is done with Sparkbench V2, SQL RDD (see github: > https://github.com/SparkTC/spark-bench)
[jira] [Commented] (SPARK-20497) Unhelpful error messages when trying to load data from file.
[ https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987797#comment-15987797 ] Brandon Barker commented on SPARK-20497: Sorry, it appears not. At least this may be useful to discuss with the spark-testing-base authors. > Unhelpful error messages when trying to load data from file. > > > Key: SPARK-20497 > URL: https://issues.apache.org/jira/browse/SPARK-20497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Brandon Barker >Priority: Minor > > I'm attempting to do the simple task of reproducing the results from the > linear regression example in Spark. I'm using Windows 10. > val training = spark.read.format("libsvm") > .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt") > Although the file is definitely at the specified location, I just get a > java.lang.NullPointerException at this line. The documentation at > http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions > doesn't seem to clear things up. The associated javadocs do not seem any > better. > In my view, such a simple operation should not be troublesome, but perhaps > I've missed some critical documentation - if so, I apologize.
[jira] [Commented] (SPARK-20497) Unhelpful error messages when trying to load data from file.
[ https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987782#comment-15987782 ] Sean Owen commented on SPARK-20497: --- OK, this doesn't look like an exception from Spark then?
[jira] [Created] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL
Udit Mehrotra created SPARK-20515: - Summary: Issue with reading Hive ORC tables having char/varchar columns in Spark SQL Key: SPARK-20515 URL: https://issues.apache.org/jira/browse/SPARK-20515 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Environment: AWS EMR Cluster Reporter: Udit Mehrotra Reading from a Hive ORC table containing char/varchar columns fails in Spark SQL. This is caused by the fact that Spark SQL internally replaces the char/varchar columns with String data type. So, while reading from the table created in Hive which has varchar/char columns, it ends up using the wrong reader and causes a ClassCastException. Here is the exception: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) at org.apache.spark.sql.hive.HiveInspectors$class.unwrap(HiveInspectors.scala:324) at org.apache.spark.sql.hive.HadoopTableReader$.unwrap(TableReader.scala:333) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) While the issue has been fixed in Spark 2.1.1 and 2.2.0 with SPARK-19459, it still needs to be fixed in Spark 2.0.
[jira] [Commented] (SPARK-20497) Unhelpful error messages when trying to load data from file.
[ https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1598#comment-1598 ] Brandon Barker commented on SPARK-20497: Thanks for the quick reply. At the moment, I'm thinking the NPE was due to an incorrectly configured SparkSession and/or SparkContext, as the SparkSession is being created by the unofficial package com.holdenkarau.spark.testing.SparkSessionProvider: com.holdenkarau spark-testing-base_${scala.version.major} Here's the NPE (line 57 is the val training = ... line mentioned above): java.lang.NullPointerException at edu.cornell.ansci.dairy.econ.util.CsvLookupAnalyzer.(CsvLookupAnalyzer.scala:57) at org.cornell.ansci.dairy.econ.util.CsvLookupAnalyzerTest$.setUp(CsvLookupAnalyzerTest.scala:90) at org.cornell.ansci.dairy.econ.util.CsvLookupAnalyzerTest.setUp(CsvLookupAnalyzerTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) When run from the 
main application where I've configured Spark, I get a much more informative error (aha, a missing "\\" after the C:, oops...) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:Users/brand/Documents/GitHub/sample_linear_regression_data.txt; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382) ... Fixing this doesn't fix the NPE above when run in the test environment, indicating it is a deep configuration issue, and not an issue with Spark, unless we could somehow get a "SparkNotConfiguredException" ;). I'll plan to investigate the testing issue further.
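The path bug found above is easy to miss: without a separator after the drive letter, `C:Users\...` is a drive-relative Windows path, not `C:\Users\...`. A sketch of the fix, assuming an active `SparkSession` named `spark`:

```scala
// Reported (broken): no backslash after "C:", so the path resolves relative
// to the process's current directory on drive C:.
val badPath = "C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt"

// Fixed: escape the separator after the drive letter too, or use forward
// slashes, which Hadoop's path handling also accepts on Windows.
val fixed  = "C:\\Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt"
val fixed2 = "C:/Users/brand/Documents/GitHub/sample_linear_regression_data.txt"

val training = spark.read.format("libsvm").load(fixed)
```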
[jira] [Updated] (SPARK-20497) Unhelpful error messages when trying to load data from file.
[ https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Barker updated SPARK-20497: --- Flags: (was: Important)
[jira] [Updated] (SPARK-20497) Unhelpful error messages when trying to load data from file.
[ https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Barker updated SPARK-20497: --- Priority: Minor (was: Major)
[jira] [Closed] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-18813. - Resolution: Done Fix Version/s: 2.2.0 > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.2.0 > > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > * Vote for & watch issues which are important to you. 
> ** MLlib, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] > ** SparkR, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? 
|| > | [1 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Blocker | *must* | *must* | *must* | > | [2 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Critical | *must* | yes, unless small | *best effort* | > | [3 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Major | *must* | optional | *best effort* | > | [4 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Minor | optional | no | maybe | > | [5 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Trivial | optional | no | maybe | > | [6 | >
[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987754#comment-15987754 ] Joseph K. Bradley commented on SPARK-18813: --- Thanks everyone for their thoughts and work during this release cycle! I agree that we'll need to keep working on those big issues which [~mlnick] mentioned. Also, this roadmap JIRA has been less active than I had imagined, even though lots of work has gone on. If you have ideas for improving it for the next cycle, please say! I'll close this for now, and we can create a new roadmap after the QA period is done. Speaking of which...here are QA JIRAs for MLlib/GraphX [SPARK-20499] and for SparkR [SPARK-20508]. Thanks again!
[jira] [Assigned] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014
[ https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20514: Assignee: (was: Apache Spark) > Upgrade Jetty to 9.3.13.v20161014 > - > > Key: SPARK-20514 > URL: https://issues.apache.org/jira/browse/SPARK-20514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Mark Grover > > Currently, we are using Jetty version 9.2.16.v20160414. > However, Hadoop 3 uses > [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38] > (the Jetty upgrade was brought in by HADOOP-10075). > Currently, when you try to build Spark with Hadoop 3, due to this > incompatibility in the Jetty versions used by Hadoop and Spark, compilation > fails with: > {code} > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: > error: object gzip is not a member of package org.eclipse.jetty.servlets > [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler > [ERROR] ^ > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: > error: not found: type GzipHandler > [ERROR] val gzipHandler = new GzipHandler > [ERROR] ^ > [ERROR] two errors found > {code} > So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014
[ https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20514: Assignee: Apache Spark > Upgrade Jetty to 9.3.13.v20161014 > - > > Key: SPARK-20514 > URL: https://issues.apache.org/jira/browse/SPARK-20514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Mark Grover >Assignee: Apache Spark > > Currently, we are using Jetty version 9.2.16.v20160414. > However, Hadoop 3 uses > [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38] > (the Jetty upgrade was brought in by HADOOP-10075). > Currently, when you try to build Spark with Hadoop 3, due to this > incompatibility in the Jetty versions used by Hadoop and Spark, compilation > fails with: > {code} > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: > error: object gzip is not a member of package org.eclipse.jetty.servlets > [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler > [ERROR] ^ > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: > error: not found: type GzipHandler > [ERROR] val gzipHandler = new GzipHandler > [ERROR] ^ > [ERROR] two errors found > {code} > So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014
[ https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987736#comment-15987736 ] Apache Spark commented on SPARK-20514: -- User 'markgrover' has created a pull request for this issue: https://github.com/apache/spark/pull/17790 > Upgrade Jetty to 9.3.13.v20161014 > - > > Key: SPARK-20514 > URL: https://issues.apache.org/jira/browse/SPARK-20514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Mark Grover > > Currently, we are using Jetty version 9.2.16.v20160414. > However, Hadoop 3 uses > [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38] > (the Jetty upgrade was brought in by HADOOP-10075). > Currently, when you try to build Spark with Hadoop 3, due to this > incompatibility in the Jetty versions used by Hadoop and Spark, compilation > fails with: > {code} > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: > error: object gzip is not a member of package org.eclipse.jetty.servlets > [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler > [ERROR] ^ > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: > error: not found: type GzipHandler > [ERROR] val gzipHandler = new GzipHandler > [ERROR] ^ > [ERROR] two errors found > {code} > So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014
[ https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Grover updated SPARK-20514: Description: Currently, we are using Jetty version 9.2.16.v20160414. However, Hadoop 3 uses [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38] (the Jetty upgrade was brought in by HADOOP-10075). Currently, when you try to build Spark with Hadoop 3, due to this incompatibility in the Jetty versions used by Hadoop and Spark, compilation fails with: {code} [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: error: object gzip is not a member of package org.eclipse.jetty.servlets [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler [ERROR] ^ [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: error: not found: type GzipHandler [ERROR] val gzipHandler = new GzipHandler [ERROR] ^ [ERROR] two errors found {code} So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3. was: Currently, we are using Jetty version 9.2.16.v20160414. However, Hadoop 3 uses [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38] (the Jetty upgrade was brought in by HADOOP-10075). Currently, when you try to build Spark with Hadoop 3, due to this incompatibility in the Jetty versions used by Hadoop and Spark, compilation fails with: {code} [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: error: object gzip is not a member of package org.eclipse.jetty.servlets [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler [ERROR] ^ [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: error: not found: type GzipHandler [ERROR] val gzipHandler = new GzipHandler [ERROR] ^ [ERROR] two errors found {code} So, it'd be good to upgrade Jetty due to this. 
> Upgrade Jetty to 9.3.13.v20161014 > - > > Key: SPARK-20514 > URL: https://issues.apache.org/jira/browse/SPARK-20514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Mark Grover > > Currently, we are using Jetty version 9.2.16.v20160414. > However, Hadoop 3 uses > [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38] > (the Jetty upgrade was brought in by HADOOP-10075). > Currently, when you try to build Spark with Hadoop 3, due to this > incompatibility in the Jetty versions used by Hadoop and Spark, compilation > fails with: > {code} > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: > error: object gzip is not a member of package org.eclipse.jetty.servlets > [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler > [ERROR] ^ > [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: > error: not found: type GzipHandler > [ERROR] val gzipHandler = new GzipHandler > [ERROR] ^ > [ERROR] two errors found > {code} > So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014
Mark Grover created SPARK-20514: --- Summary: Upgrade Jetty to 9.3.13.v20161014 Key: SPARK-20514 URL: https://issues.apache.org/jira/browse/SPARK-20514 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Mark Grover Currently, we are using Jetty version 9.2.16.v20160414. However, Hadoop 3 uses [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38] (the Jetty upgrade was brought in by HADOOP-10075). Currently, when you try to build Spark with Hadoop 3, due to this incompatibility in the Jetty versions used by Hadoop and Spark, compilation fails with: {code} [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: error: object gzip is not a member of package org.eclipse.jetty.servlets [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler [ERROR] ^ [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: error: not found: type GzipHandler [ERROR] val gzipHandler = new GzipHandler [ERROR] ^ [ERROR] two errors found {code} So, it'd be good to upgrade Jetty due to this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987706#comment-15987706 ] Barry Becker commented on SPARK-20392: -- Thanks for working on a fix. Do you have any idea which version of Spark this fix will go into? When we create our pipelines, we always add bucketizers to bin all the continuous columns before applying the classifier. If a dataset has thousands of continuous columns (and only a handful of rows) it sounds like it could still take significant time to apply those transforms even though there is very little data. At least the time seems to grow only linearly with the number of transforms. I was worried that it was quadratic. I wonder if another approach might be to have a type of bucketizer that can bin a lot of columns all at once. It would need to accept a list of arrays of split points to correspond to the columns to bin, but it might make things more efficient by replacing thousands of stages with just one. > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing Spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much too long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. 
It's something > about the number of columns that is causing the slow performance. > Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 
062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 > 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 >
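The multi-column bucketizer proposed in the comment above can be sketched outside Spark. The following is a hypothetical pure-Python illustration (not Spark code; all names are made up) of binning many columns in a single pass, given one sorted array of split points per column instead of one pipeline stage per column:

```python
import bisect

# Hypothetical illustration of the idea in the comment above: bin many
# columns at once, given one sorted array of split points per column,
# rather than one Bucketizer stage per column.

def bucketize_rows(rows, splits_per_col):
    """Map each value to the index of the bucket defined by its column's splits."""
    binned = []
    for row in rows:
        binned.append([
            # bisect_right returns the insertion point, i.e. the bucket index.
            bisect.bisect_right(splits, value)
            for value, splits in zip(row, splits_per_col)
        ])
    return binned

rows = [[0.5, 10.0], [2.5, 35.0]]
splits_per_col = [[1.0, 2.0], [20.0, 30.0]]  # two columns, one split array each
print(bucketize_rows(rows, splits_per_col))  # → [[0, 0], [2, 2]]
```

A single transformer of this shape would collapse thousands of pipeline stages into one, which is exactly the efficiency the comment is after.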
[jira] [Commented] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987690#comment-15987690 ] Joseph K. Bradley commented on SPARK-20499: --- Well, it's that time again folks. QA! I know there are a few ongoing doc PRs, but I figure the feature/API ones are done for 2.2, so we can begin QAing the API and performance. If you're able to help out with taking or shepherding tasks for this or the SparkR JIRA (linked), please go ahead and claim them! I need to catch up on some doc PRs myself first... > Spark MLlib, GraphX 2.2 QA umbrella > --- > > Key: SPARK-20499 > URL: https://issues.apache.org/jira/browse/SPARK-20499 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate: [SPARK-20508].* > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > * Major new algorithms: MinHash, RandomProjection > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987668#comment-15987668 ] Simeon Simeonov commented on SPARK-18727: - [~xwu0226] The merged PR handles the use case of new top-level columns but, in the test cases, I did not see any examples of adding new fields to (nested) struct columns, a requirement for supporting schema evolution (and closing this ticket). Do you expect you'll work on that also? > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20452) Cancel a batch Kafka query and rerun the same DataFrame may cause ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-20452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-20452. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17752 [https://github.com/apache/spark/pull/17752] > Cancel a batch Kafka query and rerun the same DataFrame may cause > ConcurrentModificationException > - > > Key: SPARK-20452 > URL: https://issues.apache.org/jira/browse/SPARK-20452 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > > Cancel a batch Kafka query and rerun the same DataFrame may cause > ConcurrentModificationException because it may launch two tasks sharing the > same group id. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted
[ https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-20461. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17761 [https://github.com/apache/spark/pull/17761] > CachedKafkaConsumer may hang forever when it's interrupted > -- > > Key: SPARK-20461 > URL: https://issues.apache.org/jira/browse/SPARK-20461 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > Fix For: 2.2.0 > > > CachedKafkaConsumer may hang forever when it's interrupted because of > KAFKA-1894 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-20047. - Resolution: Fixed Fix Version/s: 2.2.1 Target Version/s: 2.2.1 (was: 2.3.0) > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > Fix For: 2.2.1 > > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are a predefined matrix and vector which place a > set of m linear constraints on the coefficients, is very challenging, as > discussed widely in the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can use projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box optimization efficiently. > Unfortunately, for OWLQN, there is no good, efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints and without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > intercept. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
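The projected-gradient idea described in the issue can be sketched in a few lines. This is a hypothetical illustration, not Spark's implementation (which uses LBFGS-B, and, as the description notes, must additionally convert the bounds into the standardized-feature space); the data and names below are made up:

```python
import math

# Hypothetical sketch of box-constrained logistic regression via projected
# gradient descent: after each gradient step, project every coefficient
# back into its [lower, upper] box (non-negativity is the special case
# lower = 0, as in "zeroing the negative weights at each step").

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_box_constrained(X, y, lower, upper, lr=0.1, iters=500):
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(iters):
        grad = [0.0] * n_features
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(n_features):
                grad[j] += (p - yi) * xi[j]  # gradient of the logistic loss
        for j in range(n_features):
            w[j] -= lr * grad[j] / len(X)
            # Projection step: clip each coefficient into its box.
            w[j] = min(max(w[j], lower[j]), upper[j])
    return w

# Toy data: the label correlates positively with the first feature.
X = [[1.0, 0.5], [2.0, 0.1], [-1.0, 0.3], [-2.0, 0.8]]
y = [1, 1, 0, 0]
# Constrain both coefficients to be non-negative and at most 1.0.
w = fit_box_constrained(X, y, lower=[0.0, 0.0], upper=[1.0, 1.0])
```

By construction, every returned coefficient lies inside its box, which is the whole point of the projection.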
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987642#comment-15987642 ] Xin Wu commented on SPARK-18727: FYI. I have https://github.com/apache/spark/pull/16626 for ALTER TABLE ADD COLUMNS merged into 2.2. > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19525) Enable Compression of RDD Checkpoints
[ https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987640#comment-15987640 ] Apache Spark commented on SPARK-19525: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17789 > Enable Compression of RDD Checkpoints > - > > Key: SPARK-19525 > URL: https://issues.apache.org/jira/browse/SPARK-19525 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Aaditya Ramesh > > In our testing, compressing partitions while writing them to checkpoints on > HDFS using snappy helped performance significantly while also reducing the > variability of the checkpointing operation. In our tests, checkpointing time > was reduced by 3X, and variability was reduced by 2X for data sets of > compressed size approximately 1 GB. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
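A sketch of how such checkpoint compression might be enabled from user code. This is a hypothetical configuration fragment: `spark.checkpoint.compress` is the setting the linked pull request appears to introduce (an assumption, not confirmed in this thread), and running it requires a Spark deployment with an HDFS checkpoint directory:

```python
from pyspark.sql import SparkSession

# Hypothetical config fragment: "spark.checkpoint.compress" is assumed from
# the linked PR; the codec setting reuses Spark's generic I/O codec option.
spark = (SparkSession.builder
         .appName("checkpoint-compression-demo")
         .config("spark.checkpoint.compress", "true")     # compress RDD checkpoint data
         .config("spark.io.compression.codec", "snappy")  # codec used when compressing
         .getOrCreate())

sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # checkpoints land here, compressed
rdd = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))
rdd.checkpoint()  # marked for checkpointing; materialized on the next action
rdd.count()       # triggers the (now compressed) checkpoint write
```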
[jira] [Updated] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20499: -- Description: This JIRA lists tasks for the next Spark release's QA period for MLlib and GraphX. *SparkR is separate: [SPARK-20508].* The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that. h2. API * Check binary API compatibility for Scala/Java * Audit new public APIs (from the generated html doc) ** Scala ** Java compatibility ** Python coverage * Check Experimental, DeveloperApi tags h2. Algorithms and performance * Performance tests * Major new algorithms: MinHash, RandomProjection h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide sections & examples * Update Programming Guide * Update website was: This JIRA lists tasks for the next Spark release's QA period for MLlib and GraphX. *SparkR is separate: [SPARK-18329].* The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that. h2. API * Check binary API compatibility for Scala/Java * Audit new public APIs (from the generated html doc) ** Scala ** Java compatibility ** Python coverage * Check Experimental, DeveloperApi tags h2. Algorithms and performance * Performance tests * Major new algorithms: MinHash, RandomProjection h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide sections & examples * Update Programming Guide * Update website > Spark MLlib, GraphX 2.2 QA umbrella > --- > > Key: SPARK-20499 > URL: https://issues.apache.org/jira/browse/SPARK-20499 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. 
*SparkR is separate: [SPARK-20508].* > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > * Major new algorithms: MinHash, RandomProjection > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20513) Update SparkR website for 2.2
[ https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20513: -- Target Version/s: 2.2.0 > Update SparkR website for 2.2 > - > > Key: SPARK-20513 > URL: https://issues.apache.org/jira/browse/SPARK-20513 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Update the sub-project's website to include new features in this release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20513) Update SparkR website for 2.2
[ https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20513: -- Fix Version/s: (was: 2.1.0) > Update SparkR website for 2.2 > - > > Key: SPARK-20513 > URL: https://issues.apache.org/jira/browse/SPARK-20513 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Update the sub-project's website to include new features in this release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20512: -- Target Version/s: 2.2.0 (was: 2.1.0) > SparkR 2.2 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-20512 > URL: https://issues.apache.org/jira/browse/SPARK-20512 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20511: - Assignee: (was: Yanbo Liang) > SparkR 2.2 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-20511 > URL: https://issues.apache.org/jira/browse/SPARK-20511 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires").
[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20512: -- Summary: SparkR 2.2 QA: Programming guide, migration guide, vignettes updates (was: CLONE - SparkR 2.1 QA: Programming guide, migration guide, vignettes updates) > SparkR 2.2 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-20512 > URL: https://issues.apache.org/jira/browse/SPARK-20512 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330].
[jira] [Assigned] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20512: - Assignee: (was: Xiangrui Meng) > SparkR 2.2 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-20512 > URL: https://issues.apache.org/jira/browse/SPARK-20512 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330].
[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20511: -- Fix Version/s: (was: 2.1.0) > SparkR 2.2 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-20511 > URL: https://issues.apache.org/jira/browse/SPARK-20511 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires").
[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20511: -- Summary: SparkR 2.2 QA: Check for new R APIs requiring example code (was: CLONE - SparkR 2.1 QA: Check for new R APIs requiring example code) > SparkR 2.2 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-20511 > URL: https://issues.apache.org/jira/browse/SPARK-20511 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires").
[jira] [Updated] (SPARK-20513) Update SparkR website for 2.2
[ https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20513: -- Summary: Update SparkR website for 2.2 (was: CLONE - Update SparkR website for 2.1) > Update SparkR website for 2.2 > - > > Key: SPARK-20513 > URL: https://issues.apache.org/jira/browse/SPARK-20513 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > Fix For: 2.1.0 > > > Update the sub-project's website to include new features in this release.
[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20512: -- Fix Version/s: (was: 2.1.0) > SparkR 2.2 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-20512 > URL: https://issues.apache.org/jira/browse/SPARK-20512 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330].
[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20511: -- Target Version/s: 2.2.0 (was: 2.1.0) > SparkR 2.2 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-20511 > URL: https://issues.apache.org/jira/browse/SPARK-20511 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires").
[jira] [Updated] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20510: -- Target Version/s: 2.2.0 > SparkR 2.2 QA: Update user guide for new features & APIs > > > Key: SPARK-20510 > URL: https://issues.apache.org/jira/browse/SPARK-20510 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > If you would like to work on this task, please comment, and we can create & > link JIRAs for parts of this work.
[jira] [Updated] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20509: -- Fix Version/s: (was: 2.1.0) > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue.
[jira] [Updated] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20510: -- Fix Version/s: (was: 2.1.0) > SparkR 2.2 QA: Update user guide for new features & APIs > > > Key: SPARK-20510 > URL: https://issues.apache.org/jira/browse/SPARK-20510 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > If you would like to work on this task, please comment, and we can create & > link JIRAs for parts of this work.
[jira] [Assigned] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20509: - Assignee: (was: Yanbo Liang) > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue.
[jira] [Updated] (SPARK-20508) Spark R 2.2 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20508: -- Target Version/s: 2.2.0 (was: 2.1.0) > Spark R 2.2 QA umbrella > --- > > Key: SPARK-20508 > URL: https://issues.apache.org/jira/browse/SPARK-20508 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for SparkR. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Audit new public APIs (from the generated html doc) > ** relative to Spark Scala/Java APIs > ** relative to popular R libraries > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website
[jira] [Updated] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20510: -- Summary: SparkR 2.2 QA: Update user guide for new features & APIs (was: CLONE - SparkR 2.1 QA: Update user guide for new features & APIs) > SparkR 2.2 QA: Update user guide for new features & APIs > > > Key: SPARK-20510 > URL: https://issues.apache.org/jira/browse/SPARK-20510 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > If you would like to work on this task, please comment, and we can create & > link JIRAs for parts of this work.
[jira] [Updated] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20509: -- Target Version/s: 2.2.0 (was: 2.1.0) > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue.
[jira] [Updated] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20509: -- Summary: SparkR 2.2 QA: New R APIs and API docs (was: CLONE - SparkR 2.1 QA: New R APIs and API docs) > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue.
[jira] [Updated] (SPARK-20508) Spark R 2.2 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20508: -- Fix Version/s: (was: 2.1.0) > Spark R 2.2 QA umbrella > --- > > Key: SPARK-20508 > URL: https://issues.apache.org/jira/browse/SPARK-20508 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for SparkR. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Audit new public APIs (from the generated html doc) > ** relative to Spark Scala/Java APIs > ** relative to popular R libraries > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website
[jira] [Created] (SPARK-20508) Spark R 2.2 QA umbrella
Joseph K. Bradley created SPARK-20508: - Summary: Spark R 2.2 QA umbrella Key: SPARK-20508 URL: https://issues.apache.org/jira/browse/SPARK-20508 Project: Spark Issue Type: Umbrella Components: Documentation, SparkR Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical Fix For: 2.1.0 This JIRA lists tasks for the next Spark release's QA period for SparkR. The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that. h2. API * Audit new public APIs (from the generated html doc) ** relative to Spark Scala/Java APIs ** relative to popular R libraries h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide sections & examples * Update Programming Guide * Update website
[jira] [Created] (SPARK-20510) CLONE - SparkR 2.1 QA: Update user guide for new features & APIs
Joseph K. Bradley created SPARK-20510: - Summary: CLONE - SparkR 2.1 QA: Update user guide for new features & APIs Key: SPARK-20510 URL: https://issues.apache.org/jira/browse/SPARK-20510 Project: Spark Issue Type: Sub-task Components: Documentation, SparkR Reporter: Joseph K. Bradley Priority: Critical Check the user guide vs. a list of new APIs (classes, methods, data members) to see what items require updates to the user guide. For each feature missing user guide doc: * Create a JIRA for that feature, and assign it to the author of the feature * Link it to (a) the original JIRA which introduced that feature ("related to") and (b) to this JIRA ("requires"). If you would like to work on this task, please comment, and we can create & link JIRAs for parts of this work.
[jira] [Created] (SPARK-20511) CLONE - SparkR 2.1 QA: Check for new R APIs requiring example code
Joseph K. Bradley created SPARK-20511: - Summary: CLONE - SparkR 2.1 QA: Check for new R APIs requiring example code Key: SPARK-20511 URL: https://issues.apache.org/jira/browse/SPARK-20511 Project: Spark Issue Type: Sub-task Components: Documentation, SparkR Reporter: Joseph K. Bradley Assignee: Yanbo Liang Fix For: 2.1.0 Audit list of new features added to MLlib's R API, and see which major items are missing example code (in the examples folder). We do not need examples for everything, only for major items such as new algorithms. For any such items: * Create a JIRA for that feature, and assign it to the author of the feature (or yourself if interested). * Link it to (a) the original JIRA which introduced that feature ("related to") and (b) to this JIRA ("requires").
[jira] [Created] (SPARK-20513) CLONE - Update SparkR website for 2.1
Joseph K. Bradley created SPARK-20513: - Summary: CLONE - Update SparkR website for 2.1 Key: SPARK-20513 URL: https://issues.apache.org/jira/browse/SPARK-20513 Project: Spark Issue Type: Sub-task Components: Documentation, SparkR Reporter: Joseph K. Bradley Priority: Critical Update the sub-project's website to include new features in this release.
[jira] [Created] (SPARK-20512) CLONE - SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
Joseph K. Bradley created SPARK-20512: - Summary: CLONE - SparkR 2.1 QA: Programming guide, migration guide, vignettes updates Key: SPARK-20512 URL: https://issues.apache.org/jira/browse/SPARK-20512 Project: Spark Issue Type: Sub-task Components: Documentation, SparkR Reporter: Joseph K. Bradley Assignee: Xiangrui Meng Priority: Critical Fix For: 2.1.0 Before the release, we need to update the SparkR Programming Guide, its migration guide, and the R vignettes. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs and [SPARK-17692]. * Check phrasing, especially in main sections (for outdated items such as "In this release, ...") * Update R vignettes Note: This task is for large changes to the guides. New features are handled in [SPARK-18330].