[jira] [Resolved] (SPARK-20478) Document LinearSVC in R programming guide

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20478.
--
      Resolution: Fixed
        Assignee: Miao Wang
   Fix Version/s: 2.3.0
                  2.2.0
Target Version/s: 2.2.0, 2.3.0

https://github.com/apache/spark/pull/17797

> Document LinearSVC in R programming guide
> -
>
> Key: SPARK-20478
> URL: https://issues.apache.org/jira/browse/SPARK-20478
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Miao Wang
> Fix For: 2.2.0, 2.3.0
>
>







[jira] [Commented] (SPARK-20520) R streaming tests failed on Windows

2017-04-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988222#comment-15988222
 ] 

Felix Cheung commented on SPARK-20520:
--

Looks like it's just running slow.

> R streaming tests failed on Windows
> ---
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", capture.output(lastProgress(q)))) isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate
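
For readers unfamiliar with the API under test, the flow below is a rough Scala equivalent of what the failing SparkR tests exercise (read.stream, write.stream to a memory sink, then a count query over the sink table); the schema, input path, and names are illustrative assumptions, not taken from test_streaming.R.

{code}
// Rough Scala equivalent of the SparkR streaming flow under test.
// Schema, path, and names are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
val schema = new StructType().add("name", StringType).add("age", IntegerType)

// read.stream in SparkR corresponds to spark.readStream in Scala
val people = spark.readStream.schema(schema).json("/tmp/streaming-input")

// write.stream(..., "memory") registers an in-memory table named "people"
val query = people.writeStream
  .format("memory")
  .queryName("people")
  .outputMode("append")
  .start()

query.processAllAvailable() // a slow machine can blow past a fixed test timeout here
spark.sql("SELECT count(*) FROM people").show()
query.stop() // stopQuery in SparkR
{code}

The 0-vs-3 and 3-vs-6 mismatches above are what a count query returns when it runs before the sink has caught up with the input, which fits the "running slow" diagnosis in the comment.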






[jira] [Updated] (SPARK-20520) R streaming tests failed on Windows

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20520:
-
Issue Type: Bug  (was: Umbrella)

> R streaming tests failed on Windows
> ---
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", capture.output(lastProgress(q)))) isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate






[jira] [Updated] (SPARK-20520) R streaming tests failed on Windows

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20520:
-
Summary: R streaming tests failed on Windows  (was: R streaming test failed 
on Windows)

> R streaming tests failed on Windows
> ---
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", capture.output(lastProgress(q)))) isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate






[jira] [Updated] (SPARK-20520) R streaming test failed on Windows

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20520:
-
Description: 
Running R CMD check on SparkR 2.2 RC1 packages 
{code}
Failed -
1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
(@test_streaming.R#56) 
head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
1/1 mismatches
[1] 0 - 3 == -3


2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
(@test_streaming.R#60) 
head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
1/1 mismatches
[1] 3 - 6 == -3


3. Failure: print from explain, lastProgress, status, isActive 
(@test_streaming.R#75) 
any(grepl("\"description\" : \"MemorySink\"", capture.output(lastProgress(q)))) isn't true.


4. Failure: Stream other format (@test_streaming.R#95) -
head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
1/1 mismatches
[1] 0 - 3 == -3


5. Failure: Stream other format (@test_streaming.R#98) -
any(...) isn't true.

{code}

Need to investigate


  was:
This JIRA lists tasks for the next Spark release's QA period for SparkR.

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.

h2. API

* Audit new public APIs (from the generated html doc)
** relative to Spark Scala/Java APIs
** relative to popular R libraries

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide sections & 
examples
* Update Programming Guide
* Update website



> R streaming test failed on Windows
> --
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", capture.output(lastProgress(q)))) isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate






[jira] [Assigned] (SPARK-20520) R streaming test failed on Windows

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-20520:


Assignee: Felix Cheung  (was: Joseph K. Bradley)

> R streaming test failed on Windows
> --
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website






[jira] [Updated] (SPARK-20520) R streaming test failed on Windows

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20520:
-
Component/s: (was: Documentation)

> R streaming test failed on Windows
> --
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website






[jira] [Created] (SPARK-20520) R streaming test failed on Windows

2017-04-27 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-20520:


 Summary: R streaming test failed on Windows
 Key: SPARK-20520
 URL: https://issues.apache.org/jira/browse/SPARK-20520
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, SparkR
Reporter: Felix Cheung
Assignee: Joseph K. Bradley
Priority: Critical


This JIRA lists tasks for the next Spark release's QA period for SparkR.

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.

h2. API

* Audit new public APIs (from the generated html doc)
** relative to Spark Scala/Java APIs
** relative to popular R libraries

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide sections & 
examples
* Update Programming Guide
* Update website







[jira] [Updated] (SPARK-20192) SparkR 2.2.0 migration guide, release note

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20192:
-
Summary: SparkR 2.2.0 migration guide, release note  (was: SparkR 2.2.0 
release note)

> SparkR 2.2.0 migration guide, release note
> --
>
> Key: SPARK-20192
> URL: https://issues.apache.org/jira/browse/SPARK-20192
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> From looking at the changes since 2.1.0, these should be documented in the 
> migration guide / release notes for the 2.2.0 release, as they are behavior 
> changes:
> https://github.com/apache/spark/commit/422aa67d1bb84f913b06e6d94615adb6557e2870
> https://github.com/apache/spark/pull/17483 (createExternalTable)
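
As context for the second link: 2.2.0 deprecated {{Catalog.createExternalTable}} in favor of {{createTable}}. A minimal before/after sketch, assuming that deprecation is the behavior change being referenced; the table names and path are illustrative:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()

// Pre-2.2 style, deprecated in 2.2.0:
spark.catalog.createExternalTable("t_old", "/tmp/data", "parquet")

// 2.2.0 style; given an explicit path this still creates an external table:
spark.catalog.createTable("t_new", "/tmp/data", "parquet")
{code}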






[jira] [Commented] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-04-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988168#comment-15988168
 ] 

Felix Cheung commented on SPARK-20512:
--

The version migration section is in the R programming guide.

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].






[jira] [Assigned] (SPARK-20192) SparkR 2.2.0 release note

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-20192:


Assignee: Felix Cheung

> SparkR 2.2.0 release note
> -
>
> Key: SPARK-20192
> URL: https://issues.apache.org/jira/browse/SPARK-20192
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> From looking at the changes since 2.1.0, these should be documented in the 
> migration guide / release notes for the 2.2.0 release, as they are behavior 
> changes:
> https://github.com/apache/spark/commit/422aa67d1bb84f913b06e6d94615adb6557e2870
> https://github.com/apache/spark/pull/17483 (createExternalTable)






[jira] [Resolved] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20208.
--
Resolution: Fixed

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>







[jira] [Assigned] (SPARK-20015) Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

2017-04-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-20015:


Assignee: Felix Cheung

> Document R Structured Streaming (experimental) in R vignettes and R & SS 
> programming guide, R example
> -
>
> Key: SPARK-20015
> URL: https://issues.apache.org/jira/browse/SPARK-20015
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>







[jira] [Assigned] (SPARK-20519) When the input parameter is null, may be a runtime exception occurs

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20519:


Assignee: (was: Apache Spark)

> When the input parameter is null,  may be a runtime exception occurs
> 
>
> Key: SPARK-20519
> URL: https://issues.apache.org/jira/browse/SPARK-20519
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Minor
>
> sqlContext.tables(null)
> setCustomHostname(null)
> checkHost(null, "test")
> checkHostPort(null, "test")
> each throws an exception at runtime:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.formatDatabaseName(SessionCatalog.scala:125)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:715)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:706)
>   at 
> org.apache.spark.sql.execution.command.ShowTablesCommand$$anonfun$11.apply(tables.scala:655)
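
All four calls funnel a null argument into code that dereferences it. A minimal sketch of the kind of guard that turns the opaque NPE into a clear error; the body approximates {{SessionCatalog.formatDatabaseName}} and is not necessarily the fix in the linked PR:

{code}
import java.util.Locale

// Approximation of SessionCatalog.formatDatabaseName with an explicit guard;
// illustrative only, not the actual patch.
def formatDatabaseName(name: String, caseSensitive: Boolean = false): String = {
  require(name != null, "Database name cannot be null")
  if (caseSensitive) name else name.toLowerCase(Locale.ROOT)
}

// formatDatabaseName(null) now fails with an IllegalArgumentException carrying
// a message, instead of a bare java.lang.NullPointerException.
{code}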






[jira] [Assigned] (SPARK-20519) When the input parameter is null, may be a runtime exception occurs

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20519:


Assignee: Apache Spark

> When the input parameter is null,  may be a runtime exception occurs
> 
>
> Key: SPARK-20519
> URL: https://issues.apache.org/jira/browse/SPARK-20519
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Minor
>
> sqlContext.tables(null)
> setCustomHostname(null)
> checkHost(null, "test")
> checkHostPort(null, "test")
> each throws an exception at runtime:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.formatDatabaseName(SessionCatalog.scala:125)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:715)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:706)
>   at 
> org.apache.spark.sql.execution.command.ShowTablesCommand$$anonfun$11.apply(tables.scala:655)






[jira] [Commented] (SPARK-20519) When the input parameter is null, may be a runtime exception occurs

2017-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988111#comment-15988111
 ] 

Apache Spark commented on SPARK-20519:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/17796

> When the input parameter is null,  may be a runtime exception occurs
> 
>
> Key: SPARK-20519
> URL: https://issues.apache.org/jira/browse/SPARK-20519
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Minor
>
> sqlContext.tables(null)
> setCustomHostname(null)
> checkHost(null, "test")
> checkHostPort(null, "test")
> each throws an exception at runtime:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.formatDatabaseName(SessionCatalog.scala:125)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:715)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:706)
>   at 
> org.apache.spark.sql.execution.command.ShowTablesCommand$$anonfun$11.apply(tables.scala:655)






[jira] [Updated] (SPARK-20519) When the input parameter is null, may be a runtime exception occurs

2017-04-27 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-20519:

Summary: When the input parameter is null,  may be a runtime exception 
occurs  (was: When the input parameter is null,  may be a runtime exeception 
occurs)

> When the input parameter is null,  may be a runtime exception occurs
> 
>
> Key: SPARK-20519
> URL: https://issues.apache.org/jira/browse/SPARK-20519
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Minor
>
> sqlContext.tables(null)
> setCustomHostname(null)
> checkHost(null, "test")
> checkHostPort(null, "test")
> each throws an exception at runtime:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.formatDatabaseName(SessionCatalog.scala:125)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:715)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:706)
>   at 
> org.apache.spark.sql.execution.command.ShowTablesCommand$$anonfun$11.apply(tables.scala:655)






[jira] [Created] (SPARK-20519) When the input parameter is null, may be a runtime exeception occurs

2017-04-27 Thread liuxian (JIRA)
liuxian created SPARK-20519:
---

 Summary: When the input parameter is null,  may be a runtime 
exeception occurs
 Key: SPARK-20519
 URL: https://issues.apache.org/jira/browse/SPARK-20519
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: liuxian
Priority: Minor


sqlContext.tables(null)
setCustomHostname(null)
checkHost(null, "test")
checkHostPort(null, "test")

each throws an exception at runtime:
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.formatDatabaseName(SessionCatalog.scala:125)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:715)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listTables(SessionCatalog.scala:706)
at 
org.apache.spark.sql.execution.command.ShowTablesCommand$$anonfun$11.apply(tables.scala:655)






[jira] [Resolved] (SPARK-20497) Unhelpful error messages when trying to load data from file.

2017-04-27 Thread Brandon Barker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Barker resolved SPARK-20497.

Resolution: Not A Bug

> Unhelpful error messages when trying to load data from file.
> 
>
> Key: SPARK-20497
> URL: https://issues.apache.org/jira/browse/SPARK-20497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Brandon Barker
>Priority: Minor
>
> I'm attempting to do the simple task of reproducing the results from the 
> linear regression example in Spark. I'm using Windows 10.
>   val training = spark.read.format("libsvm")
>  .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
> Although the file is definitely at the specified location, I just get a 
> java.lang.NullPointerException at this line. The documentation at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions
>  doesn't seem to clear things up. The associated javadocs do not seem any 
> better.
> In my view, such a simple operation should not be troublesome, but perhaps 
> I've missed some critical documentation - if so, I apologize. 






[jira] [Resolved] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-04-27 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-12837.
---
   Resolution: Fixed
Fix Version/s: (was: 2.0.0)
               2.3.0
               2.2.1

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory, even when there are no requests to collect data back 
> to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}
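
For affected versions, the usual mitigation is to raise or disable the cap that the repro deliberately sets to 1m; a minimal sketch, assuming only the standard semantics of {{spark.driver.maxResultSize}} (0 means unlimited):

{code}
import org.apache.spark.SparkConf

// Workaround, not the fix: lift the result-size cap. This treats the symptom;
// the accounting of serialized task results is what this ticket addresses.
val conf = new SparkConf().set("spark.driver.maxResultSize", "0") // or e.g. "2g"
{code}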






[jira] [Assigned] (SPARK-20517) Download link in history server UI is not correct

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20517:


Assignee: Apache Spark

> Download link in history server UI is not correct
> -
>
> Key: SPARK-20517
> URL: https://issues.apache.org/jira/browse/SPARK-20517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> The download link in history server UI is concatenated with:
> {code}
> <a href="..." class="btn btn-info btn-mini">Download</a>
> {code}
> Here the {{num}} field represents the number of attempts, which does not match 
> the REST API. In the REST API, if the attempt id does not exist, the {{num}} 
> field should be empty; otherwise it should actually be {{attemptId}}.
> This leads to a "no such app" error rather than correctly downloading 
> the event log.
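
In code form, the direction the description points toward might look like the sketch below. The helper and its names are hypothetical, not the actual history server code, though the paths follow the documented {{/api/v1}} REST layout:

{code}
// Hypothetical link construction: key on the optional attemptId rather than
// on the number of attempts.
def downloadUrl(base: String, appId: String, attemptId: Option[String]): String =
  attemptId match {
    case Some(id) => s"$base/api/v1/applications/$appId/$id/logs"
    case None     => s"$base/api/v1/applications/$appId/logs"
  }
{code}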






[jira] [Assigned] (SPARK-20517) Download link in history server UI is not correct

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20517:


Assignee: (was: Apache Spark)

> Download link in history server UI is not correct
> -
>
> Key: SPARK-20517
> URL: https://issues.apache.org/jira/browse/SPARK-20517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> The download link in history server UI is concatenated with:
> {code}
> <a href="..." class="btn btn-info btn-mini">Download</a>
> {code}
> Here the {{num}} field represents the number of attempts, which does not match 
> the REST API. In the REST API, if the attempt id does not exist, the {{num}} 
> field should be empty; otherwise it should actually be {{attemptId}}.
> This leads to a "no such app" error rather than correctly downloading 
> the event log.






[jira] [Commented] (SPARK-20497) Unhelpful error messages when trying to load data from file.

2017-04-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988089#comment-15988089
 ] 

Hyukjin Kwon commented on SPARK-20497:
--

I can't reproduce this on Windows as below:

- Existing path

{code}
val lines =
  """
|1 1:1.0 3:2.0 5:3.0
|0
|0 2:4.0 4:5.0 6:6.0
  """.stripMargin

val loc = "C:\\...\\foo"
val file = new java.io.File(loc)
com.google.common.io.Files.write(lines, file, 
java.nio.charset.StandardCharsets.UTF_8)
spark.read.format("libsvm").load(loc).show()
{code}

prints

{code}
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(6,[0,2,4],[1.0,2...|
|  0.0|           (6,[],[])|
|  0.0|(6,[1,3,5],[4.0,5...|
+-----+--------------------+
{code}

- Non-existing path

{code}
spark.read.format("libsvm").load("/NON_EXISTS").show()
{code}

produces

{code}
org.apache.spark.sql.AnalysisException: Path does not exist: file:/NON_EXISTS;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:354)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:342)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:342)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 48 elided
{code}

It does not look like a problem within Spark. I would suggest closing this if 
no one else can reproduce it within Spark.
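
One further observation on the report below: the path literal it shows, {{C:Users\\brand\\...}}, has no separator after the drive letter. Assuming that string is what was actually passed, variants like these do resolve on Windows:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()

// Corrected variants of the reported path; both forms are accepted.
spark.read.format("libsvm")
  .load("C:\\Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
spark.read.format("libsvm")
  .load("C:/Users/brand/Documents/GitHub/sample_linear_regression_data.txt")
{code}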

> Unhelpful error messages when trying to load data from file.
> 
>
> Key: SPARK-20497
> URL: https://issues.apache.org/jira/browse/SPARK-20497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Brandon Barker
>Priority: Minor
>
> I'm attempting to do the simple task of reproducing the results from the 
> linear regression example in Spark. I'm using Windows 10.
>   val training = spark.read.format("libsvm")
>  .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
> Although the file is definitely at the specified location, I just get a 
> java.lang.NullPointerException at this line. The documentation at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions
>  doesn't seem to clear things up. The associated javadocs do not seem any 
> better.
> In my view, such a simple operation should not be troublesome, but perhaps 
> I've missed some critical documentation - if so, I apologize. 






[jira] [Commented] (SPARK-20517) Download link in history server UI is not correct

2017-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988087#comment-15988087
 ] 

Apache Spark commented on SPARK-20517:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/17795

> Download link in history server UI is not correct
> -
>
> Key: SPARK-20517
> URL: https://issues.apache.org/jira/browse/SPARK-20517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> The download link in history server UI is concatenated with:
> {code}
> <a href="..." class="btn btn-info btn-mini">Download</a>
> {code}
> Here the {{num}} field represents the number of attempts, which does not match 
> the REST API. In the REST API, if the attempt id does not exist, the {{num}} 
> field should be empty; otherwise it should actually be {{attemptId}}.
> This leads to a "no such app" error rather than correctly downloading 
> the event log.






[jira] [Assigned] (SPARK-20518) Supplement the new blockidsuite unit tests

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20518:


Assignee: Apache Spark

> Supplement the new blockidsuite unit tests
> --
>
> Key: SPARK-20518
> URL: https://issues.apache.org/jira/browse/SPARK-20518
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>
> Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, 
> TempShuffleBlockId, and TempLocalBlockId.
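
For concreteness, assertions of the kind such a suite would add, assuming the {{org.apache.spark.storage.BlockId}} case classes and their name formats; the values are arbitrary:

{code}
import java.util.UUID
import org.apache.spark.storage._

val data = ShuffleDataBlockId(4, 5, 6)
assert(data.name == "shuffle_4_5_6.data")

val index = ShuffleIndexBlockId(4, 5, 6)
assert(index.name == "shuffle_4_5_6.index")

val uuid = UUID.randomUUID()
assert(TempShuffleBlockId(uuid).name == "temp_shuffle_" + uuid)
assert(TempLocalBlockId(uuid).name == "temp_local_" + uuid)
{code}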






[jira] [Assigned] (SPARK-20518) Supplement the new blockidsuite unit tests

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20518:


Assignee: (was: Apache Spark)

> Supplement the new blockidsuite unit tests
> --
>
> Key: SPARK-20518
> URL: https://issues.apache.org/jira/browse/SPARK-20518
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
>
> Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, 
> TempShuffleBlockId, and TempLocalBlockId.






[jira] [Commented] (SPARK-20518) Supplement the new blockidsuite unit tests

2017-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988085#comment-15988085
 ] 

Apache Spark commented on SPARK-20518:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/17794

> Supplement the new blockidsuite unit tests
> --
>
> Key: SPARK-20518
> URL: https://issues.apache.org/jira/browse/SPARK-20518
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
>
> Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, 
> TempShuffleBlockId, and TempLocalBlockId.






[jira] [Created] (SPARK-20518) Supplement the new blockidsuite unit tests

2017-04-27 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20518:
-

 Summary: Supplement the new blockidsuite unit tests
 Key: SPARK-20518
 URL: https://issues.apache.org/jira/browse/SPARK-20518
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.1.0
Reporter: caoxuewen


Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, 
TempShuffleBlockId, and TempLocalBlockId.






[jira] [Created] (SPARK-20517) Download link in history server UI is not correct

2017-04-27 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-20517:
---

 Summary: Download link in history server UI is not correct
 Key: SPARK-20517
 URL: https://issues.apache.org/jira/browse/SPARK-20517
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.2.0
Reporter: Saisai Shao
Priority: Minor


The download link in history server UI is concatenated with:

{code}
  <a href="..." class="btn btn-info btn-mini">Download</a>
{code}

Here the {{num}} field represents the number of attempts, which is equal to the 
REST API. In the REST API, if the attempt id does not exist, the {{num}} field 
should be empty; otherwise it should actually be {{attemptId}}.

This leads to a "no such app" error rather than correctly downloading the event 
log.






[jira] [Updated] (SPARK-20517) Download link in history server UI is not correct

2017-04-27 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-20517:

Description: 
The download link in history server UI is concatenated with:

{code}
  <a href="..." class="btn btn-info btn-mini">Download</a>
{code}

Here the {{num}} field represents the number of attempts, which does not match 
the REST API. In the REST API, if the attempt id does not exist, the {{num}} 
field should be empty; otherwise it should actually be {{attemptId}}.

This leads to a "no such app" error rather than correctly downloading the event 
log.

  was:
The download link in history server UI is concatenated with:

{code}
  <a href="..." class="btn btn-info btn-mini">Download</a>
{code}

Here the {{num}} field represents the number of attempts, which is equal to the 
REST API. In the REST API, if the attempt id does not exist, the {{num}} field 
should be empty; otherwise it should actually be {{attemptId}}.

This leads to a "no such app" error rather than correctly downloading the event 
log.


> Download link in history server UI is not correct
> -
>
> Key: SPARK-20517
> URL: https://issues.apache.org/jira/browse/SPARK-20517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> The download link in history server UI is concatenated with:
> {code}
> <a href="..." class="btn btn-info btn-mini">Download</a>
> {code}
> Here the {{num}} field represents the number of attempts, which does not match 
> the REST API. In the REST API, if the attempt id does not exist, the {{num}} 
> field should be empty; otherwise it should actually be {{attemptId}}.
> This leads to a "no such app" error rather than correctly downloading 
> the event log.






[jira] [Assigned] (SPARK-20484) Add documentation to ALS code

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20484:


Assignee: (was: Apache Spark)

> Add documentation to ALS code
> -
>
> Key: SPARK-20484
> URL: https://issues.apache.org/jira/browse/SPARK-20484
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Trivial
>
> The documentation (both Scaladocs and inline comments) for the ALS code (in 
> package {{org.apache.spark.ml.recommendation}}) can be clarified where needed 
> and expanded where incomplete. This is especially important for parts of the 
> code that are written imperatively for performance, as these parts don't 
> benefit from the intuitive self-documentation of Scala's higher-level 
> language abstractions. Specifically, I'd like to add documentation fully 
> explaining the key functionality of the in-block and out-block objects, their 
> purpose, how they relate to the overall ALS algorithm, and how they are 
> calculated in such a way that new maintainers can ramp up much more quickly.






[jira] [Assigned] (SPARK-20484) Add documentation to ALS code

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20484:


Assignee: Apache Spark

> Add documentation to ALS code
> -
>
> Key: SPARK-20484
> URL: https://issues.apache.org/jira/browse/SPARK-20484
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Assignee: Apache Spark
>Priority: Trivial
>
> The documentation (both Scaladocs and inline comments) for the ALS code (in 
> package {{org.apache.spark.ml.recommendation}}) can be clarified where needed 
> and expanded where incomplete. This is especially important for parts of the 
> code that are written imperatively for performance, as these parts don't 
> benefit from the intuitive self-documentation of Scala's higher-level 
> language abstractions. Specifically, I'd like to add documentation fully 
> explaining the key functionality of the in-block and out-block objects, their 
> purpose, how they relate to the overall ALS algorithm, and how they are 
> calculated in such a way that new maintainers can ramp up much more quickly.






[jira] [Commented] (SPARK-20484) Add documentation to ALS code

2017-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988047#comment-15988047
 ] 

Apache Spark commented on SPARK-20484:
--

User 'danielyli' has created a pull request for this issue:
https://github.com/apache/spark/pull/17793

> Add documentation to ALS code
> -
>
> Key: SPARK-20484
> URL: https://issues.apache.org/jira/browse/SPARK-20484
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Trivial
>
> The documentation (both Scaladocs and inline comments) for the ALS code (in 
> package {{org.apache.spark.ml.recommendation}}) can be clarified where needed 
> and expanded where incomplete. This is especially important for parts of the 
> code that are written imperatively for performance, as these parts don't 
> benefit from the intuitive self-documentation of Scala's higher-level 
> language abstractions. Specifically, I'd like to add documentation fully 
> explaining the key functionality of the in-block and out-block objects, their 
> purpose, how they relate to the overall ALS algorithm, and how they are 
> calculated in such a way that new maintainers can ramp up much more quickly.






[jira] [Commented] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan

2017-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988004#comment-15988004
 ] 

Apache Spark commented on SPARK-20496:
--

User 'anabranch' has created a pull request for this issue:
https://github.com/apache/spark/pull/17792

> KafkaWriter Uses Unanalyzed Logical Plan
> 
>
> Key: SPARK-20496
> URL: https://issues.apache.org/jira/browse/SPARK-20496
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bill Chambers
>
> Right now we use the unanalyzed logical plan for writing to Kafka; we should 
> use the analyzed plan.
> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50
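
The distinction is visible from the public Dataset API; a minimal sketch, assuming only that the writer should be handed the analyzed plan rather than the raw one:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
val df = spark.range(10).toDF("value")

val unanalyzed = df.queryExecution.logical  // raw plan, what the writer was using
val analyzed   = df.queryExecution.analyzed // plan after the analyzer has run
{code}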






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987976#comment-15987976
 ] 

Xin Wu commented on SPARK-18727:


Thanks! 

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987966#comment-15987966
 ] 

Eric Liang commented on SPARK-18727:


+1 for supporting ALTER TABLE REPLACE COLUMNS

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987939#comment-15987939
 ] 

Xin Wu commented on SPARK-18727:


[~ekhliang] I see. I will try to support ALTER TABLE SCHEMA. Also, this is 
similar or identical to ALTER TABLE REPLACE COLUMNS, which is documented as an 
unsupported Hive feature in SqlBase.g4. Do we have a preference for which one to 
use?
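
For reference, the shape of the statement under discussion, sketched hypothetically through {{spark.sql}}; the table and columns are made up, and Spark rejected this syntax as unsupported at the time of this thread:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()

// REPLACE COLUMNS restates the entire column list, changed or not; this is the
// wide-table inconvenience raised elsewhere in this thread. In Spark 2.1/2.2
// this statement fails as an unsupported operation.
spark.sql("""
  ALTER TABLE events REPLACE COLUMNS (
    id BIGINT,
    ts TIMESTAMP,
    payload STRING
  )
""")
{code}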

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-20516) Spark SQL documentation out of date?

2017-04-27 Thread Ratandeep Ratti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987933#comment-15987933
 ] 

Ratandeep Ratti commented on SPARK-20516:
-

Not sure why I cannot assign the ticket to myself. :/

> Spark SQL documentation out of date?
> 
>
> Key: SPARK-20516
> URL: https://issues.apache.org/jira/browse/SPARK-20516
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ratandeep Ratti
>
> I was trying out the examples on the [Spark Sql 
> page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It 
> seems that we now have to invoke {{master()}} on the SparkSession builder, 
> and also that warehouseLocation is now a URI.
> I can fix the documentation (sql-programming-guide.html) and send a PR.
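
Concretely, the corrected example would presumably look something like this; the values are illustrative:

{code}
import org.apache.spark.sql.SparkSession

// master() is invoked explicitly, and the warehouse location is a URI.
val warehouseLocation = "file:///tmp/spark-warehouse"
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Spark SQL basic example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .getOrCreate()
{code}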






[jira] [Updated] (SPARK-20516) Spark SQL documentation out of date?

2017-04-27 Thread Ratandeep Ratti (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ratandeep Ratti updated SPARK-20516:

Description: 
I was trying out the examples on the [Spark Sql 
page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It seems 
that we now have to invoke {{master()}} on the SparkSession builder, and also 
that warehouseLocation is now a URI.

I can fix the documentation (sql-programming-guide.html) and send a PR.

  was:
I was trying out the examples on the [Spark Sql 
page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It seems 
that we now have to invoke {{master}} on the SparkSession builder, and also 
that warehouseLocation is now a URI.

I can fix the documentation (sql-programming-guide.html) and send a PR.


> Spark SQL documentation out of date?
> 
>
> Key: SPARK-20516
> URL: https://issues.apache.org/jira/browse/SPARK-20516
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ratandeep Ratti
>
> I was trying out the examples on the [Spark Sql 
> page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It 
> seems that we now have to invoke {{master()}} on the SparkSession builder, 
> and also that warehouseLocation is now a URI.
> I can fix the documentation (sql-programming-guide.html) and send a PR.






[jira] [Created] (SPARK-20516) Spark SQL documentation out of date?

2017-04-27 Thread Ratandeep Ratti (JIRA)
Ratandeep Ratti created SPARK-20516:
---

 Summary: Spark SQL documentation out of date?
 Key: SPARK-20516
 URL: https://issues.apache.org/jira/browse/SPARK-20516
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Ratandeep Ratti


I was trying out the examples on the [Spark Sql 
page|https://spark.apache.org/docs/2.1.0/sql-programming-guide.html]. It seems 
that we now have to invoke {{master}} on the SparkSession builder, and also 
that warehouseLocation is now a URI.

I can fix the documentation (sql-programming-guide.html) and send a PR.






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987929#comment-15987929
 ] 

Xiao Li commented on SPARK-18727:
-

The idea of [~ekhliang] sounds good to me. 

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-20478) Document LinearSVC in R programming guide

2017-04-27 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987903#comment-15987903
 ] 

Miao Wang commented on SPARK-20478:
---

OK, I will do it. Thanks for pointing me to the right place.

> Document LinearSVC in R programming guide
> -
>
> Key: SPARK-20478
> URL: https://issues.apache.org/jira/browse/SPARK-20478
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18727:
---

The common case we see is users having a complete schema (e.g. the output of
an ETL pipeline) and wanting to update/merge it in an automated job. In this
case it's actually more work to alter the columns one at a time rather than
all at once.
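
As a rough sketch of that automated path (the file layout is assumed; this relies on Parquet's existing {{mergeSchema}} read option rather than any new DDL):
{code}
// new files appended under the same root may carry a superset schema;
// schema merging at read time unions the per-file schemas
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/warehouse/events")

merged.printSchema() // shows the merged (evolved) schema
{code}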




> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987874#comment-15987874
 ] 

Xin Wu commented on SPARK-18727:


[~ekhliang] First of all, I am not sure whether it is wise to introduce more 
non-SQL-standard syntax into Spark's DDL.  In addition, ALTER TABLE 
SCHEMA, or ALTER TABLE SET/UPDATE/MODIFY SCHEMA, however we call it, 
requires users to spell out the whole list of column definitions for what may 
be a small change to a single column. That is inconvenient, especially when the 
table is relatively wide. What do you think, [~smilegator]? 

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)

2017-04-27 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20489:
-
Component/s: (was: Shuffle)
 (was: Spark Core)

> Different results in local mode and yarn mode when working with dates (race 
> condition with SimpleDateFormat?)
> -
>
> Key: SPARK-20489
> URL: https://issues.apache.org/jira/browse/SPARK-20489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: yarn-client mode in Zeppelin, Cloudera 
> Spark2-distribution
>Reporter: Rick Moritz
>Priority: Critical
>
> Running the following code (in Zeppelin, or spark-shell), I get different 
> results, depending on whether I am using local[*] mode or yarn-client mode:
> {code:title=test case|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import spark.implicits._
> val counter = 1 to 2
> val size = 1 to 3
> val sampleText = spark.createDataFrame(
> sc.parallelize(size)
> .map(Row(_)),
> StructType(Array(StructField("id", IntegerType, nullable=false))
> )
> )
> .withColumn("loadDTS",lit("2017-04-25T10:45:02.2"))
> 
> val rddList = counter.map(
> count => sampleText
> .withColumn("loadDTS2", 
> date_format(date_add(col("loadDTS"),count),"yyyy-MM-dd'T'HH:mm:ss.SSS"))
> .drop(col("loadDTS"))
> .withColumnRenamed("loadDTS2","loadDTS")
> .coalesce(4)
> .rdd
> )
> val resultText = spark.createDataFrame(
> spark.sparkContext.union(rddList),
> sampleText.schema
> )
> val testGrouped = resultText.groupBy("id")
> val timestamps = testGrouped.agg(
> max(unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS")) as 
> "timestamp"
> )
> val loadDateResult = resultText.join(timestamps, "id")
> val filteredresult = loadDateResult.filter($"timestamp" === 
> unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS"))
> filteredresult.count
> {code}
> The expected result, *3*, is what I obtain in local mode, but as soon as I run 
> fully distributed, I get *0*. If I increase size to {{1 to 32000}}, I do get 
> some results (depending on the size of counter), none of which makes any 
> sense.
> Up to the application of the last filter, at first glance everything looks 
> okay, but then something goes wrong. Potentially this is due to lingering 
> re-use of SimpleDateFormats, but I can't get it to happen in a 
> non-distributed mode. The generated execution plan is the same in each case, 
> as expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)

2017-04-27 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987848#comment-15987848
 ] 

Shixiong Zhu commented on SPARK-20489:
--

Could you show the results of `loadDateResult.show(false)`? My hunch is it's a 
time zone issue.
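
For reference, a tiny sketch of the SimpleDateFormat hazard the description speculates about (illustrative only; this is not a claim about where Spark would share such an instance):
{code}
import java.text.SimpleDateFormat

// SimpleDateFormat is not thread-safe: a single shared instance parsed
// from several threads can silently yield corrupted dates
val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS")
val inputs = (1 to 1000).map(i => f"2017-04-${i % 28 + 1}%02dT10:45:02.200")
val parsed = inputs.par.map(s => fmt.parse(s)) // may produce wrong results
{code}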

> Different results in local mode and yarn mode when working with dates (race 
> condition with SimpleDateFormat?)
> -
>
> Key: SPARK-20489
> URL: https://issues.apache.org/jira/browse/SPARK-20489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: yarn-client mode in Zeppelin, Cloudera 
> Spark2-distribution
>Reporter: Rick Moritz
>Priority: Critical
>
> Running the following code (in Zeppelin, or spark-shell), I get different 
> results, depending on whether I am using local[*] mode or yarn-client mode:
> {code:title=test case|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import spark.implicits._
> val counter = 1 to 2
> val size = 1 to 3
> val sampleText = spark.createDataFrame(
> sc.parallelize(size)
> .map(Row(_)),
> StructType(Array(StructField("id", IntegerType, nullable=false))
> )
> )
> .withColumn("loadDTS",lit("2017-04-25T10:45:02.2"))
> 
> val rddList = counter.map(
> count => sampleText
> .withColumn("loadDTS2", 
> date_format(date_add(col("loadDTS"),count),"yyyy-MM-dd'T'HH:mm:ss.SSS"))
> .drop(col("loadDTS"))
> .withColumnRenamed("loadDTS2","loadDTS")
> .coalesce(4)
> .rdd
> )
> val resultText = spark.createDataFrame(
> spark.sparkContext.union(rddList),
> sampleText.schema
> )
> val testGrouped = resultText.groupBy("id")
> val timestamps = testGrouped.agg(
> max(unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS")) as 
> "timestamp"
> )
> val loadDateResult = resultText.join(timestamps, "id")
> val filteredresult = loadDateResult.filter($"timestamp" === 
> unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS"))
> filteredresult.count
> {code}
> The expected result, *3*, is what I obtain in local mode, but as soon as I run 
> fully distributed, I get *0*. If I increase size to {{1 to 32000}}, I do get 
> some results (depending on the size of counter), none of which makes any 
> sense.
> Up to the application of the last filter, at first glance everything looks 
> okay, but then something goes wrong. Potentially this is due to lingering 
> re-use of SimpleDateFormats, but I can't get it to happen in a 
> non-distributed mode. The generated execution plan is the same in each case, 
> as expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18727:
---

Can we add ALTER TABLE SCHEMA to update the entire schema? That would cover
any edge cases.




> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987830#comment-15987830
 ] 

Xin Wu commented on SPARK-18727:


[~simeons] You are right. My PR does not include the feature that allows you 
to add a new field into a complex type. Such a feature could be supported by 
{code}ALTER TABLE <table> CHANGE COLUMN <column> <column> <newType>{code}, where 
<newType> has the newly added fields. 

I am also working on this part. 
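
For concreteness, one hypothetical shape of that statement, issued through {{spark.sql}} (the table and column names are made up, and this syntax did not exist in Spark at the time of writing):
{code}
// restate the struct column's type with the new field appended
spark.sql("""
  ALTER TABLE events CHANGE COLUMN payload payload
  STRUCT<id: BIGINT, name: STRING, tags: ARRAY<STRING>>
""")
{code}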

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20514) Upgrade Jetty to 9.3.11.v20160721

2017-04-27 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20514:

Summary: Upgrade Jetty to 9.3.11.v20160721  (was: Upgrade Jetty to 
9.3.13.v20161014)

> Upgrade Jetty to 9.3.11.v20160721
> -
>
> Key: SPARK-20514
> URL: https://issues.apache.org/jira/browse/SPARK-20514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> Currently, we are using Jetty version 9.2.16.v20160414.
> However, Hadoop 3 uses 
> [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
>  (the Jetty upgrade was brought in by HADOOP-10075).
> Currently, when you try to build Spark with Hadoop 3, due to the 
> incompatibility in Jetty versions used by Hadoop and Spark, compilation 
> fails with:
> {code}
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
> error: object gzip is not a member of package org.eclipse.jetty.servlets
> [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
> [ERROR]   ^
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
> error: not found: type GzipHandler
> [ERROR]   val gzipHandler = new GzipHandler
> [ERROR] ^
> [ERROR] two errors found
> {code}
> So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20515:


Assignee: Apache Spark

> Issue with reading Hive ORC tables having char/varchar columns in Spark SQL
> ---
>
> Key: SPARK-20515
> URL: https://issues.apache.org/jira/browse/SPARK-20515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR Cluster
>Reporter: Udit Mehrotra
>Assignee: Apache Spark
>
> Reading from a Hive ORC table containing char/varchar columns fails in Spark 
> SQL. This is caused by the fact that Spark SQL internally replaces 
> char/varchar columns with the String data type. So, when reading from a table 
> created in Hive that has varchar/char columns, Spark ends up using the wrong 
> reader, which causes a ClassCastException.
>  
> Here is the exception:
>  
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
> at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrap(HiveInspectors.scala:324)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$.unwrap(TableReader.scala:333)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>  
> While the issue has been fixed in Spark 2.1.1 and 2.2.0 with SPARK-19459, it 
> still needs to be fixed in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20515:


Assignee: (was: Apache Spark)

> Issue with reading Hive ORC tables having char/varchar columns in Spark SQL
> ---
>
> Key: SPARK-20515
> URL: https://issues.apache.org/jira/browse/SPARK-20515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR Cluster
>Reporter: Udit Mehrotra
>
> Reading from a Hive ORC table containing char/varchar columns fails in Spark 
> SQL. This is caused by the fact that Spark SQL internally replaces 
> char/varchar columns with the String data type. So, when reading from a table 
> created in Hive that has varchar/char columns, Spark ends up using the wrong 
> reader, which causes a ClassCastException.
>  
> Here is the exception:
>  
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
> at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrap(HiveInspectors.scala:324)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$.unwrap(TableReader.scala:333)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>  
> While the issue has been fixed in Spark 2.1.1 and 2.2.0 with SPARK-19459, it 
> still needs to be fixed in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL

2017-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987822#comment-15987822
 ] 

Apache Spark commented on SPARK-20515:
--

User 'umehrot2' has created a pull request for this issue:
https://github.com/apache/spark/pull/17791

> Issue with reading Hive ORC tables having char/varchar columns in Spark SQL
> ---
>
> Key: SPARK-20515
> URL: https://issues.apache.org/jira/browse/SPARK-20515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR Cluster
>Reporter: Udit Mehrotra
>
> Reading from a Hive ORC table containing char/varchar columns fails in Spark 
> SQL. This is caused by the fact that Spark SQL internally replaces 
> char/varchar columns with the String data type. So, when reading from a table 
> created in Hive that has varchar/char columns, Spark ends up using the wrong 
> reader, which causes a ClassCastException.
>  
> Here is the exception:
>  
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
> at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrap(HiveInspectors.scala:324)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$.unwrap(TableReader.scala:333)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>  
> While the issue has been fixed in Spark 2.1.1 and 2.2.0 with SPARK-19459, it 
> still needs to be fixed in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2017-04-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16333.

Resolution: Duplicate

Let's mark this as a duplicate for now. There are probably minor incremental 
improvements that can be made if the size is still a problem, but it's probably 
better to track those individually.

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark 2.0.0-preview (May-24 build), the history event data (the JSON 
> file) that is generated for each Spark application run (see below) can be 
> as big as 5 GB (instead of 14 MB for exactly the same application run and the 
> same 1 TB of input data under Spark 1.6.1):
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20497) Unhelpful error messages when trying to load data from file.

2017-04-27 Thread Brandon Barker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987797#comment-15987797
 ] 

Brandon Barker commented on SPARK-20497:


Sorry, it appears not. At least this may be useful to discuss with the 
spark-testing-base authors.

> Unhelpful error messages when trying to load data from file.
> 
>
> Key: SPARK-20497
> URL: https://issues.apache.org/jira/browse/SPARK-20497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Brandon Barker
>Priority: Minor
>
> I'm attempting to do the simple task of reproducing the results from the 
> linear regression example in Spark. I'm using Windows 10.
>   val training = spark.read.format("libsvm")
>  .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
> Although the file is definitely at the specified location, I just get a 
> java.lang.NullPointerException at this line. The documentation at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions
>  doesn't seem to clear things up. The associated javadocs do not seem any 
> better.
> In my view, such a simple operation should not be troublesome, but perhaps 
> I've missed some critical documentation - if so, I apologize. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20497) Unhelpful error messages when trying to load data from file.

2017-04-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987782#comment-15987782
 ] 

Sean Owen commented on SPARK-20497:
---

OK, this doesn't look like an exception from Spark then?

> Unhelpful error messages when trying to load data from file.
> 
>
> Key: SPARK-20497
> URL: https://issues.apache.org/jira/browse/SPARK-20497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Brandon Barker
>Priority: Minor
>
> I'm attempting to do the simple task of reproducing the results from the 
> linear regression example in Spark. I'm using Windows 10.
>   val training = spark.read.format("libsvm")
>  .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
> Although the file is definitely at the specified location, I just get a 
> java.lang.NullPointerException at this line. The documentation at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions
>  doesn't seem to clear things up. The associated javadocs do not seem any 
> better.
> In my view, such a simple operation should not be troublesome, but perhaps 
> I've missed some critical documentation - if so, I apologize. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20515) Issue with reading Hive ORC tables having char/varchar columns in Spark SQL

2017-04-27 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-20515:
-

 Summary: Issue with reading Hive ORC tables having char/varchar 
columns in Spark SQL
 Key: SPARK-20515
 URL: https://issues.apache.org/jira/browse/SPARK-20515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
 Environment: AWS EMR Cluster
Reporter: Udit Mehrotra


Reading from a Hive ORC table containing char/varchar columns fails in Spark 
SQL. This is caused by the fact that Spark SQL internally replaces 
char/varchar columns with the String data type. So, when reading from a table 
created in Hive that has varchar/char columns, Spark ends up using the wrong 
reader, which causes a ClassCastException.
 
Here is the exception:
 
java.lang.ClassCastException: 
org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
org.apache.hadoop.io.Text
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
at 
org.apache.spark.sql.hive.HiveInspectors$class.unwrap(HiveInspectors.scala:324)
at 
org.apache.spark.sql.hive.HadoopTableReader$.unwrap(TableReader.scala:333)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
 
While the issue has been fixed in Spark 2.1.1 and 2.2.0 with SPARK-19459, it 
still needs to be fixed in Spark 2.0.
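
A hypothetical repro under that description (the table name is invented; the Hive side is shown in comments):
{code}
// in Hive:
//   CREATE TABLE t_varchar (name VARCHAR(10)) STORED AS ORC;
//   INSERT INTO t_varchar VALUES ('spark');
// then, reading the same table through Spark SQL 2.0.x:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("SELECT name FROM t_varchar").show()
// throws: HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text
{code}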



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20497) Unhelpful error messages when trying to load data from file.

2017-04-27 Thread Brandon Barker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1598#comment-1598
 ] 

Brandon Barker commented on SPARK-20497:


Thanks for the quick reply. At the moment, I'm thinking the NPE was due to an 
incorrectly configured SparkSession and/or SparkContext, as the SparkSession is 
being created by the unofficial package 
com.holdenkarau.spark.testing.SparkSessionProvider, pulled in via the Maven 
dependency com.holdenkarau:spark-testing-base_${scala.version.major}.

Here's the NPE (line 57 is the val training = ... line mentioned above):
java.lang.NullPointerException
at 
edu.cornell.ansci.dairy.econ.util.CsvLookupAnalyzer.(CsvLookupAnalyzer.scala:57)
at 
org.cornell.ansci.dairy.econ.util.CsvLookupAnalyzerTest$.setUp(CsvLookupAnalyzerTest.scala:90)
at 
org.cornell.ansci.dairy.econ.util.CsvLookupAnalyzerTest.setUp(CsvLookupAnalyzerTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)


When run from the main application where I've configured Spark, I get a much 
more informative error (aha, a missing "\\" after the "C:", oops...):

org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/C:Users/brand/Documents/GitHub/sample_linear_regression_data.txt;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
...

Fixing this doesn't fix the NPE above when run in the test environment, 
indicating it is a deep configuration issue, and not an issue with Spark, unless 
we could somehow get a "SparkNotConfiguredException" ;). I'll plan to 
investigate the testing issue further.
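
For completeness, the corrected form of the load call from the description (only the separator after the drive letter changes):
{code}
val training = spark.read.format("libsvm")
  .load("C:\\Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
{code}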

> Unhelpful error messages when trying to load data from file.
> 
>
> Key: SPARK-20497
> URL: https://issues.apache.org/jira/browse/SPARK-20497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Brandon Barker
>
> I'm attempting to do the simple task of reproducing the results from the 
> linear regression example in Spark. I'm using Windows 10.
>   val training = spark.read.format("libsvm")
>  .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
> Although the file is definitely at the specified location, I just get a 
> java.lang.NullPointerException at this line. The documentation at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions
>  doesn't seem to clear things up. The associated javadocs do not seem any 
> better.
> In my view, such a simple operation should not be troublesome, but perhaps 
> I've missed some critical documentation - if so, I apologize. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20497) Unhelpful error messages when trying to load data from file.

2017-04-27 Thread Brandon Barker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Barker updated SPARK-20497:
---
Flags:   (was: Important)

> Unhelpful error messages when trying to load data from file.
> 
>
> Key: SPARK-20497
> URL: https://issues.apache.org/jira/browse/SPARK-20497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Brandon Barker
>Priority: Minor
>
> I'm attempting to do the simple task of reproducing the results from the 
> linear regression example in Spark. I'm using Windows 10.
>   val training = spark.read.format("libsvm")
>  .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
> Although the file is definitely at the specified location, I just get a 
> java.lang.NullPointerException at this line. The documentation at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions
>  doesn't seem to clear things up. The associated javadocs do not seem any 
> better.
> In my view, such a simple operation should not be troublesome, but perhaps 
> I've missed some critical documentation - if so, I apologize. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20497) Unhelpful error messages when trying to load data from file.

2017-04-27 Thread Brandon Barker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Barker updated SPARK-20497:
---
Priority: Minor  (was: Major)

> Unhelpful error messages when trying to load data from file.
> 
>
> Key: SPARK-20497
> URL: https://issues.apache.org/jira/browse/SPARK-20497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Brandon Barker
>Priority: Minor
>
> I'm attempting to do the simple task of reproducing the results from the 
> linear regression example in Spark. I'm using Windows 10.
>   val training = spark.read.format("libsvm")
>  .load("C:Users\\brand\\Documents\\GitHub\\sample_linear_regression_data.txt")
> Although the file is definitely at the specified location, I just get a 
> java.lang.NullPointerException at this line. The documentation at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions
>  doesn't seem to clear things up. The associated javadocs do not seem any 
> better.
> In my view, such a simple operation should not be troublesome, but perhaps 
> I've missed some critical documentation - if so, I apologize. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18813) MLlib 2.2 Roadmap

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-18813.
-
   Resolution: Done
Fix Version/s: 2.2.0

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
> Fix For: 2.2.0
>
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | [1 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Blocker | *must* | *must* | *must* |
> | [2 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Critical | *must* | yes, unless small | *best effort* |
> | [3 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Major | *must* | optional | *best effort* |
> | [4 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Minor | optional | no | maybe |
> | [5 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Trivial | optional | no | maybe |
> | [6 | 
> 

[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2017-04-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987754#comment-15987754
 ] 

Joseph K. Bradley commented on SPARK-18813:
---

Thanks to everyone for your thoughts and work during this release cycle!  I agree 
that we'll need to keep working on the big issues [~mlnick] mentioned.

Also, this roadmap JIRA has been less active than I had imagined, even though 
lots of work has gone on.  If you have ideas for improving it for the next 
cycle, please share them!

I'll close this for now, and we can create a new roadmap after the QA period is 
done.  Speaking of which, here are QA JIRAs for MLlib/GraphX [SPARK-20499] and 
for SparkR [SPARK-20508].

Thanks again!

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | [1 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Blocker | *must* | *must* | *must* |
> | [2 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Critical | *must* | yes, unless small | *best effort* |
> | [3 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Major | *must* | optional | *best effort* |
> | [4 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Minor | optional | no | maybe |
> | [5 | 
> 

[jira] [Assigned] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20514:


Assignee: (was: Apache Spark)

> Upgrade Jetty to 9.3.13.v20161014
> -
>
> Key: SPARK-20514
> URL: https://issues.apache.org/jira/browse/SPARK-20514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> Currently, we are using Jetty version 9.2.16.v20160414.
> However, Hadoop 3 uses 
> [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
>  (the Jetty upgrade was brought in by HADOOP-10075).
> Currently, when you try to build Spark with Hadoop 3, due to the 
> incompatibility in Jetty versions used by Hadoop and Spark, compilation 
> fails with:
> {code}
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
> error: object gzip is not a member of package org.eclipse.jetty.servlets
> [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
> [ERROR]   ^
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
> error: not found: type GzipHandler
> [ERROR]   val gzipHandler = new GzipHandler
> [ERROR] ^
> [ERROR] two errors found
> {code}
> So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014

2017-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20514:


Assignee: Apache Spark

> Upgrade Jetty to 9.3.13.v20161014
> -
>
> Key: SPARK-20514
> URL: https://issues.apache.org/jira/browse/SPARK-20514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Apache Spark
>
> Currently, we are using Jetty version 9.2.16.v20160414.
> However, Hadoop 3 uses 
> [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
>  (the Jetty upgrade was brought in by HADOOP-10075).
> Currently, when you try to build Spark with Hadoop 3, due to the 
> incompatibility in Jetty versions used by Hadoop and Spark, compilation 
> fails with:
> {code}
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
> error: object gzip is not a member of package org.eclipse.jetty.servlets
> [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
> [ERROR]   ^
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
> error: not found: type GzipHandler
> [ERROR]   val gzipHandler = new GzipHandler
> [ERROR] ^
> [ERROR] two errors found
> {code}
> So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014

2017-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987736#comment-15987736
 ] 

Apache Spark commented on SPARK-20514:
--

User 'markgrover' has created a pull request for this issue:
https://github.com/apache/spark/pull/17790

> Upgrade Jetty to 9.3.13.v20161014
> -
>
> Key: SPARK-20514
> URL: https://issues.apache.org/jira/browse/SPARK-20514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> Currently, we are using Jetty version 9.2.16.v20160414.
> However, Hadoop 3 uses 
> [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
>  (the Jetty upgrade was brought in by HADOOP-10075).
> Currently, when you try to build Spark with Hadoop 3, due to the 
> incompatibility in Jetty versions used by Hadoop and Spark, compilation 
> fails with:
> {code}
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
> error: object gzip is not a member of package org.eclipse.jetty.servlets
> [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
> [ERROR]   ^
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
> error: not found: type GzipHandler
> [ERROR]   val gzipHandler = new GzipHandler
> [ERROR] ^
> [ERROR] two errors found
> {code}
> So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014

2017-04-27 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20514:

Description: 
Currently, we are using Jetty version 9.2.16.v20160414.

However, Hadoop 3 uses 
[9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
 (the Jetty upgrade was brought in by HADOOP-10075).

Currently, when you try to build Spark with Hadoop 3, due to the 
incompatibility in Jetty versions used by Hadoop and Spark, compilation fails 
with:
{code}
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
error: object gzip is not a member of package org.eclipse.jetty.servlets
[ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
[ERROR]   ^
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
error: not found: type GzipHandler
[ERROR]   val gzipHandler = new GzipHandler
[ERROR] ^
[ERROR] two errors found
{code}

So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.

  was:
Currently, we are using Jetty version 9.2.16.v20160414.

However, Hadoop 3 uses 
[9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
 (the Jetty upgrade was brought in by HADOOP-10075).

Currently, when you try to build Spark with Hadoop 3, due to the 
incompatibility in Jetty versions used by Hadoop and Spark, compilation fails 
with:
{code}
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
error: object gzip is not a member of package org.eclipse.jetty.servlets
[ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
[ERROR]   ^
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
error: not found: type GzipHandler
[ERROR]   val gzipHandler = new GzipHandler
[ERROR] ^
[ERROR] two errors found
{code}

So, it'd be good to upgrade Jetty due to this.


> Upgrade Jetty to 9.3.13.v20161014
> -
>
> Key: SPARK-20514
> URL: https://issues.apache.org/jira/browse/SPARK-20514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> Currently, we are using Jetty version 9.2.16.v20160414.
> However, Hadoop 3 uses 
> [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
>  (the Jetty upgrade was brought in by HADOOP-10075).
> Currently, when you try to build Spark with Hadoop 3, due to the 
> incompatibility in Jetty versions used by Hadoop and Spark, compilation 
> fails with:
> {code}
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
> error: object gzip is not a member of package org.eclipse.jetty.servlets
> [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
> [ERROR]   ^
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
> error: not found: type GzipHandler
> [ERROR]   val gzipHandler = new GzipHandler
> [ERROR] ^
> [ERROR] two errors found
> {code}
> So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014

2017-04-27 Thread Mark Grover (JIRA)
Mark Grover created SPARK-20514:
---

 Summary: Upgrade Jetty to 9.3.13.v20161014
 Key: SPARK-20514
 URL: https://issues.apache.org/jira/browse/SPARK-20514
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Mark Grover


Currently, we are using Jetty version 9.2.16.v20160414.

However, Hadoop 3 uses 
[9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
 (the Jetty upgrade was brought in by HADOOP-10075).

Currently, when you try to build Spark with Hadoop 3, this incompatibility in 
the Jetty versions used by Hadoop and Spark makes compilation fail 
with:
{code}
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
error: object gzip is not a member of package org.eclipse.jetty.servlets
[ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
[ERROR]   ^
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
error: not found: type GzipHandler
[ERROR]   val gzipHandler = new GzipHandler
[ERROR] ^
[ERROR] two errors found
{code}

So, it'd be good to upgrade Jetty due to this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-27 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987706#comment-15987706
 ] 

Barry Becker commented on SPARK-20392:
--

Thanks for working on a fix. Do you have any idea which version of Spark this 
fix will go into?
When we create our pipelines, we always add bucketizers to bin all the 
continuous columns before applying the classifier. If a dataset has thousands 
of continuous columns (and only a handful of rows), it sounds like it could 
still take significant time to apply those transforms even though there is very 
little data. At least the time seems to grow only linearly with the number of 
transforms; I was worried that it was quadratic.
I wonder if another approach might be to have a type of bucketizer that can bin 
many columns all at once. It would need to accept a list of arrays of split 
points, one per column to bin, but it might make things more efficient by 
replacing thousands of stages with just one.
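
For illustration, a rough sketch of what such a single-stage, multi-column 
bucketizer could look like. The setInputCols/setOutputCols/setSplitsArray 
setters below are hypothetical (Bucketizer in Spark 2.1/2.2 bins only one 
column), so read this as pseudocode for the proposed API:
{code}
import org.apache.spark.ml.feature.Bucketizer

// Hypothetical API: one array of split points per input column.
val splitsPerColumn: Array[Array[Double]] = Array(
  Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity), // col_a
  Array(Double.NegativeInfinity, 5.0, Double.PositiveInfinity)        // col_b
)
val binAll = new Bucketizer()
  .setInputCols(Array("col_a", "col_b"))
  .setOutputCols(Array("col_a_bin", "col_b_bin"))
  .setSplitsArray(splitsPerColumn)
// One pipeline stage instead of one Bucketizer stage per continuous column.
{code}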


> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here are a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 

[jira] [Commented] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella

2017-04-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987690#comment-15987690
 ] 

Joseph K. Bradley commented on SPARK-20499:
---

Well, it's that time again, folks.  QA!  I know there are a few ongoing doc PRs, 
but I figure the feature/API ones are done for 2.2, so we can begin QAing the 
API and performance.  If you're able to help out with taking or shepherding 
tasks for this or the SparkR JIRA (linked), please go ahead and claim them!

I need to catch up on some doc PRs myself first...

> Spark MLlib, GraphX 2.2 QA umbrella
> ---
>
> Key: SPARK-20499
> URL: https://issues.apache.org/jira/browse/SPARK-20499
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-20508].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> * Major new algorithms: MinHash, RandomProjection
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987668#comment-15987668
 ] 

Simeon Simeonov commented on SPARK-18727:
-

[~xwu0226] The merged PR handles the use case of new top-level columns but, in 
the test cases, I did not see any examples of adding new fields to (nested) 
struct columns, a requirement for supporting schema evolution (and closing this 
ticket). Do you expect you'll work on that also?

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20452) Cancel a batch Kafka query and rerun the same DataFrame may cause ConcurrentModificationException

2017-04-27 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-20452.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17752
[https://github.com/apache/spark/pull/17752]

> Cancel a batch Kafka query and rerun the same DataFrame may cause 
> ConcurrentModificationException
> -
>
> Key: SPARK-20452
> URL: https://issues.apache.org/jira/browse/SPARK-20452
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> Cancel a batch Kafka query and rerun the same DataFrame may cause 
> ConcurrentModificationException because it may launch two tasks sharing the 
> same group id.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-27 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-20461.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17761
[https://github.com/apache/spark/pull/17761]

> CachedKafkaConsumer may hang forever when it's interrupted
> --
>
> Key: SPARK-20461
> URL: https://issues.apache.org/jira/browse/SPARK-20461
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
> Fix For: 2.2.0
>
>
> CachedKafkaConsumer may hang forever when it's interrupted because of 
> KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20047) Constrained Logistic Regression

2017-04-27 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-20047.
-
  Resolution: Fixed
   Fix Version/s: 2.2.1
Target Version/s: 2.2.1  (was: 2.3.0)

> Constrained Logistic Regression
> ---
>
> Key: SPARK-20047
> URL: https://issues.apache.org/jira/browse/SPARK-20047
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Yanbo Liang
> Fix For: 2.2.1
>
>
> For certain applications, such as stacked regressions, it is important to put 
> non-negativity constraints on the regression coefficients. Also, if the ranges 
> of coefficients are known, it makes sense to constrain the coefficient search 
> space.
> Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ 
> R^\{m×p\} and b ∈ R^\{m\} are a predefined matrix and vector that place a 
> set of m linear constraints on the coefficients, is very challenging, as 
> widely discussed in the literature. 
> However, for box constraints on the coefficients, the optimization is well 
> solved. For gradient descent, one can use projected gradient descent in the 
> primal by zeroing the negative weights at each step. For LBFGS, an extended 
> version of it, LBFGS-B, can handle large-scale box-constrained optimization 
> efficiently. Unfortunately, for OWLQN, there is no good, efficient way to 
> optimize with box constraints.
> As a result, in this work, we only implement constrained LR with box 
> constraints and without L1 regularization. 
> Note that since we standardize the data in the training phase, the 
> coefficients seen in the optimization subroutine are in the scaled space; as 
> a result, we need to convert the box constraints into the scaled space.
> Users will be able to set the lower / upper bounds of each coefficient and 
> intercept.
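
For anyone trying the feature, a sketch of how the box-constraint setters read. 
The setter names follow this feature's design but are written from memory, so 
treat them as assumptions; for binary LR the bound matrices are 1 x numFeatures. 
On the scaling note above: since feature j is divided by its standard deviation 
σ_j during training, the optimizer works on β'_j = σ_j·β_j, so a user-space 
bound l_j ≤ β_j ≤ u_j presumably becomes σ_j·l_j ≤ β'_j ≤ σ_j·u_j.
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Assumed setters: non-negative coefficients, capped at 1.0, for a
// binary model with 3 features; one intercept bound per coefficient set.
val lr = new LogisticRegression()
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 3, Array(0.0, 0.0, 0.0)))
  .setUpperBoundsOnCoefficients(Matrices.dense(1, 3, Array(1.0, 1.0, 1.0)))
  .setLowerBoundsOnIntercepts(Vectors.dense(0.0))
// With bounds set and no L1 penalty, training can use LBFGS-B as described.
{code}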



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987642#comment-15987642
 ] 

Xin Wu commented on SPARK-18727:


FYI. I have https://github.com/apache/spark/pull/16626 for ALTER TABLE ADD 
COLUMNS merged into 2.2. 
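
As a quick illustration of how the merged feature composes with appends (the 
table and column names here are invented for the example):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Extend the table schema first via the new DDL, then append data that
// carries the new top-level column.
spark.sql("ALTER TABLE events ADD COLUMNS (user_agent STRING)")
spark.table("staging_events").write.mode("append").insertInto("events")
{code}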

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987640#comment-15987640
 ] 

Apache Spark commented on SPARK-19525:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17789

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.
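
For context, a minimal sketch of how this would be switched on, assuming the 
spark.checkpoint.compress flag this issue proposes (the codec would follow 
spark.io.compression.codec):
{code}
import org.apache.spark.sql.SparkSession

// Assumed flag from this proposal: compress RDD checkpoint partitions with
// the configured I/O codec (e.g. snappy) when writing them out to HDFS.
val spark = SparkSession.builder()
  .appName("checkpoint-compression-sketch")
  .config("spark.checkpoint.compress", "true")
  .config("spark.io.compression.codec", "snappy")
  .getOrCreate()

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.checkpoint()  // marks the RDD; data is written on first materialization
rdd.count()       // materializes and writes the (compressed) checkpoint
{code}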



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20499:
--
Description: 
This JIRA lists tasks for the next Spark release's QA period for MLlib and 
GraphX.   *SparkR is separate: [SPARK-20508].*

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.

h2. API

* Check binary API compatibility for Scala/Java
* Audit new public APIs (from the generated html doc)
** Scala
** Java compatibility
** Python coverage
* Check Experimental, DeveloperApi tags

h2. Algorithms and performance

* Performance tests
* Major new algorithms: MinHash, RandomProjection

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide sections & 
examples
* Update Programming Guide
* Update website


  was:
This JIRA lists tasks for the next Spark release's QA period for MLlib and 
GraphX.   *SparkR is separate: [SPARK-18329].*

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.

h2. API

* Check binary API compatibility for Scala/Java
* Audit new public APIs (from the generated html doc)
** Scala
** Java compatibility
** Python coverage
* Check Experimental, DeveloperApi tags

h2. Algorithms and performance

* Performance tests
* Major new algorithms: MinHash, RandomProjection

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide sections & 
examples
* Update Programming Guide
* Update website



> Spark MLlib, GraphX 2.2 QA umbrella
> ---
>
> Key: SPARK-20499
> URL: https://issues.apache.org/jira/browse/SPARK-20499
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-20508].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> * Major new algorithms: MinHash, RandomProjection
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20513) Update SparkR website for 2.2

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20513:
--
Target Version/s: 2.2.0

> Update SparkR website for 2.2
> -
>
> Key: SPARK-20513
> URL: https://issues.apache.org/jira/browse/SPARK-20513
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-project's website to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20513) Update SparkR website for 2.2

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20513:
--
Fix Version/s: (was: 2.1.0)

> Update SparkR website for 2.2
> -
>
> Key: SPARK-20513
> URL: https://issues.apache.org/jira/browse/SPARK-20513
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-project's website to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20512:
--
Target Version/s: 2.2.0  (was: 2.1.0)

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20511:
-

Assignee: (was: Yanbo Liang)

> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20512:
--
Summary: SparkR 2.2 QA: Programming guide, migration guide, vignettes 
updates  (was: CLONE - SparkR 2.1 QA: Programming guide, migration guide, 
vignettes updates)

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20512:
-

Assignee: (was: Xiangrui Meng)

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20511:
--
Fix Version/s: (was: 2.1.0)

> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20511:
--
Summary: SparkR 2.2 QA: Check for new R APIs requiring example code  (was: 
CLONE - SparkR 2.1 QA: Check for new R APIs requiring example code)

> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20513) Update SparkR website for 2.2

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20513:
--
Summary: Update SparkR website for 2.2  (was: CLONE - Update SparkR website 
for 2.1)

> Update SparkR website for 2.2
> -
>
> Key: SPARK-20513
> URL: https://issues.apache.org/jira/browse/SPARK-20513
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.1.0
>
>
> Update the sub-project's website to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20512:
--
Fix Version/s: (was: 2.1.0)

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20511:
--
Target Version/s: 2.2.0  (was: 2.1.0)

> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20510:
--
Target Version/s: 2.2.0

> SparkR 2.2 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-20510
> URL: https://issues.apache.org/jira/browse/SPARK-20510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20509:
--
Fix Version/s: (was: 2.1.0)

> SparkR 2.2 QA: New R APIs and API docs
> --
>
> Key: SPARK-20509
> URL: https://issues.apache.org/jira/browse/SPARK-20509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20510:
--
Fix Version/s: (was: 2.1.0)

> SparkR 2.2 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-20510
> URL: https://issues.apache.org/jira/browse/SPARK-20510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20509:
-

Assignee: (was: Yanbo Liang)

> SparkR 2.2 QA: New R APIs and API docs
> --
>
> Key: SPARK-20509
> URL: https://issues.apache.org/jira/browse/SPARK-20509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20508) Spark R 2.2 QA umbrella

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20508:
--
Target Version/s: 2.2.0  (was: 2.1.0)

> Spark R 2.2 QA umbrella
> ---
>
> Key: SPARK-20508
> URL: https://issues.apache.org/jira/browse/SPARK-20508
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20510:
--
Summary: SparkR 2.2 QA: Update user guide for new features & APIs  (was: 
CLONE - SparkR 2.1 QA: Update user guide for new features & APIs)

> SparkR 2.2 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-20510
> URL: https://issues.apache.org/jira/browse/SPARK-20510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20509:
--
Target Version/s: 2.2.0  (was: 2.1.0)

> SparkR 2.2 QA: New R APIs and API docs
> --
>
> Key: SPARK-20509
> URL: https://issues.apache.org/jira/browse/SPARK-20509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20509:
--
Summary: SparkR 2.2 QA: New R APIs and API docs  (was: CLONE - SparkR 2.1 
QA: New R APIs and API docs)

> SparkR 2.2 QA: New R APIs and API docs
> --
>
> Key: SPARK-20509
> URL: https://issues.apache.org/jira/browse/SPARK-20509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20508) Spark R 2.2 QA umbrella

2017-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20508:
--
Fix Version/s: (was: 2.1.0)

> Spark R 2.2 QA umbrella
> ---
>
> Key: SPARK-20508
> URL: https://issues.apache.org/jira/browse/SPARK-20508
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20508) Spark R 2.2 QA umbrella

2017-04-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20508:
-

 Summary: Spark R 2.2 QA umbrella
 Key: SPARK-20508
 URL: https://issues.apache.org/jira/browse/SPARK-20508
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, SparkR
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical
 Fix For: 2.1.0


This JIRA lists tasks for the next Spark release's QA period for SparkR.

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.

h2. API

* Audit new public APIs (from the generated html doc)
** relative to Spark Scala/Java APIs
** relative to popular R libraries

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide sections & 
examples
* Update Programming Guide
* Update website




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20510) CLONE - SparkR 2.1 QA: Update user guide for new features & APIs

2017-04-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20510:
-

 Summary: CLONE - SparkR 2.1 QA: Update user guide for new features 
& APIs
 Key: SPARK-20510
 URL: https://issues.apache.org/jira/browse/SPARK-20510
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SparkR
Reporter: Joseph K. Bradley
Priority: Critical


Check the user guide vs. a list of new APIs (classes, methods, data members) to 
see what items require updates to the user guide.

For each feature missing user guide doc:
* Create a JIRA for that feature, and assign it to the author of the feature
* Link it to (a) the original JIRA which introduced that feature ("related to") 
and (b) to this JIRA ("requires").

If you would like to work on this task, please comment, and we can create & 
link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20511) CLONE - SparkR 2.1 QA: Check for new R APIs requiring example code

2017-04-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20511:
-

 Summary: CLONE - SparkR 2.1 QA: Check for new R APIs requiring 
example code
 Key: SPARK-20511
 URL: https://issues.apache.org/jira/browse/SPARK-20511
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SparkR
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
 Fix For: 2.1.0


Audit list of new features added to MLlib's R API, and see which major items 
are missing example code (in the examples folder).  We do not need examples for 
everything, only for major items such as new algorithms.

For any such items:
* Create a JIRA for that feature, and assign it to the author of the feature 
(or yourself if interested).
* Link it to (a) the original JIRA which introduced that feature ("related to") 
and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20513) CLONE - Update SparkR website for 2.1

2017-04-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20513:
-

 Summary: CLONE - Update SparkR website for 2.1
 Key: SPARK-20513
 URL: https://issues.apache.org/jira/browse/SPARK-20513
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SparkR
Reporter: Joseph K. Bradley
Priority: Critical


Update the sub-project's website to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20512) CLONE - SparkR 2.1 QA: Programming guide, migration guide, vignettes updates

2017-04-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20512:
-

 Summary: CLONE - SparkR 2.1 QA: Programming guide, migration 
guide, vignettes updates
 Key: SPARK-20512
 URL: https://issues.apache.org/jira/browse/SPARK-20512
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SparkR
Reporter: Joseph K. Bradley
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 2.1.0


Before the release, we need to update the SparkR Programming Guide, its 
migration guide, and the R vignettes.  Updates will include:
* Add migration guide subsection.
** Use the results of the QA audit JIRAs and [SPARK-17692].
* Check phrasing, especially in main sections (for outdated items such as "In 
this release, ...")
* Update R vignettes

Note: This task is for large changes to the guides.  New features are handled 
in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


