[jira] [Commented] (SPARK-18538) Concurrent Fetching DataFrameReader JDBC APIs Do Not Work

2016-11-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711221#comment-15711221
 ] 

Wenchen Fan commented on SPARK-18538:
-

Already merged to master; will resolve this ticket once we backport it to 2.1.

> Concurrent Fetching DataFrameReader JDBC APIs Do Not Work
> -
>
> Key: SPARK-18538
> URL: https://issues.apache.org/jira/browse/SPARK-18538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code}
>   def jdbc(
>   url: String,
>   table: String,
>   columnName: String,
>   lowerBound: Long,
>   upperBound: Long,
>   numPartitions: Int,
>   connectionProperties: Properties): DataFrame
> {code}
> {code}
>   def jdbc(
>   url: String,
>   table: String,
>   predicates: Array[String],
>   connectionProperties: Properties): DataFrame
> {code}
> The above two DataFrameReader JDBC APIs ignore the user-specified 
> degree-of-parallelism parameters.
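
For illustration, a minimal sketch of how the partitioned variant is invoked, 
assuming a hypothetical {{people}} table with a numeric {{id}} column and 
placeholder URL/credentials; the report is that the partitioning arguments 
below are ignored, so the read is not actually parallelized:

{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "test")       // placeholder credentials
props.setProperty("password", "test")

// Expected: fetch the table with 4 concurrent partitions split on "id".
// Reported bug: the parallelism settings are dropped.
val df = spark.read.jdbc(
  url = "jdbc:postgresql://localhost/testdb",   // placeholder URL
  table = "people",
  columnName = "id",
  lowerBound = 0L,
  upperBound = 1000L,
  numPartitions = 4,
  connectionProperties = props)
{code}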






[jira] [Commented] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-11-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711171#comment-15711171
 ] 

Takeshi Yamamuro commented on SPARK-18620:
--

I took a quick look and found that setting the max-records limit on the Kinesis 
workers is not enough, because the workers cannot limit the number of records 
inside aggregated messages 
(http://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-concepts.html#d0e5184).
For example, if we set the workers' max records to 10 and a producer aggregates 
two records into one message, the workers can actually deliver 20 records per 
callback invocation.
My hunch is that we need to control the number of records pushed into a receiver 
in 
KinesisRecordProcessor#processRecords(https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala#L68).

> Spark Streaming + Kinesis : Receiver MaxRate is violated
> 
>
> Key: SPARK-18620
> URL: https://issues.apache.org/jira/browse/SPARK-18620
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: david przybill
>Priority: Minor
>  Labels: kinesis
>
> I am calling spark-submit passing maxRate; I have a single Kinesis receiver 
> and batches of 1s:
> spark-submit  --conf spark.streaming.receiver.maxRate=10 
> However, a single batch can greatly exceed the established maxRate, e.g. I'm 
> getting 300 records.
> It looks like Kinesis is completely ignoring the 
> spark.streaming.receiver.maxRate configuration.
> If you look inside KinesisReceiver.onStart, you see:
> val kinesisClientLibConfiguration =
>   new KinesisClientLibConfiguration(checkpointAppName, streamName, 
> awsCredProvider, workerId)
>   .withKinesisEndpoint(endpointUrl)
>   .withInitialPositionInStream(initialPositionInStream)
>   .withTaskBackoffTimeMillis(500)
>   .withRegionName(regionName)
> This constructor ends up calling another constructor which has a lot of 
> default values for the configuration. One of those values is 
> DEFAULT_MAX_RECORDS, a constant set to 10,000 records.
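
For illustration, a minimal sketch of the kind of change discussed above, 
assuming the KCL's {{withMaxRecords}} setter and hypothetical application, 
stream, and region values; per the comment above, KPL-aggregated messages could 
still expand past this per-call cap:

{code}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.{InitialPositionInStream, KinesisClientLibConfiguration}

// Hypothetical stand-ins for the values KinesisReceiver.onStart has in scope.
val checkpointAppName = "my-app"
val streamName = "my-stream"
val workerId = "worker-1"

// Same builder chain as in the description, plus an explicit per-GetRecords cap.
val kinesisClientLibConfiguration =
  new KinesisClientLibConfiguration(checkpointAppName, streamName,
      new DefaultAWSCredentialsProviderChain(), workerId)
    .withKinesisEndpoint("https://kinesis.us-east-1.amazonaws.com")
    .withInitialPositionInStream(InitialPositionInStream.LATEST)
    .withTaskBackoffTimeMillis(500)
    .withRegionName("us-east-1")
    .withMaxRecords(10) // cap records per GetRecords call; aggregated (KPL)
                        // messages may still contain more user records
{code}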






[jira] [Created] (SPARK-18667) input_file_name function does not work with UDF

2016-11-30 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18667:


 Summary: input_file_name function does not work with UDF
 Key: SPARK-18667
 URL: https://issues.apache.org/jira/browse/SPARK-18667
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Hyukjin Kwon


{{input_file_name()}} returns an empty string instead of the file name when it 
is used as input to a UDF in PySpark, as below: 

with the data as below:

{code}
{"a": 1}
{code}

with the code below:

{code}
from pyspark.sql.functions import *
from pyspark.sql.types import *

def filename(path):
return path

sourceFile = udf(filename, StringType())
spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
{code}

prints as below:

{code}
+---+
|filename(input_file_name())|
+---+
|   |
+---+
{code}

but the code below:

{code}
spark.read.json("tmp.json").select(input_file_name()).show()
{code}

prints correctly as below:

{code}
++
|   input_file_name()|
++
|file:///Users/hyu...|
++
{code}

This seems to be a PySpark-specific issue.
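
For comparison, a minimal Scala sketch of the equivalent query, assuming the 
same {{tmp.json}} file and a {{spark}} session; since the report suggests only 
the PySpark UDF path loses the file name, this is the natural cross-check:

{code}
import org.apache.spark.sql.functions.{input_file_name, udf}

// Identity UDF, analogous to the Python `filename` UDF above.
val sourceFile = udf((path: String) => path)

// Cross-check: does the file name survive when routed through a Scala UDF?
spark.read.json("tmp.json")
  .select(sourceFile(input_file_name()))
  .show(false)
{code}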






[jira] [Commented] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711100#comment-15711100
 ] 

Apache Spark commented on SPARK-18665:
--

User 'cenyuhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/16097

> Spark ThriftServer jobs that are canceled are still “STARTED”
> --
>
> Key: SPARK-18665
> URL: https://issues.apache.org/jira/browse/SPARK-18665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3
>Reporter: cen yuhai
> Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, 
> 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png
>
>
> I find that some jobs are canceled, but their state is still "STARTED". I 
> think this bug was introduced by SPARK-6964






[jira] [Assigned] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18665:


Assignee: (was: Apache Spark)

> Spark ThriftServer jobs that are canceled are still “STARTED”
> --
>
> Key: SPARK-18665
> URL: https://issues.apache.org/jira/browse/SPARK-18665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3
>Reporter: cen yuhai
> Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, 
> 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png
>
>
> I find that some jobs are canceled, but their state is still "STARTED". I 
> think this bug was introduced by SPARK-6964






[jira] [Assigned] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18665:


Assignee: Apache Spark

> Spark ThriftServer jobs that are canceled are still “STARTED”
> --
>
> Key: SPARK-18665
> URL: https://issues.apache.org/jira/browse/SPARK-18665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3
>Reporter: cen yuhai
>Assignee: Apache Spark
> Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, 
> 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png
>
>
> I find that some jobs are canceled, but their state is still "STARTED". I 
> think this bug was introduced by SPARK-6964






[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2016-11-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711095#comment-15711095
 ] 

Nick Pentreath commented on SPARK-12347:


Since the PR is still WIP and this is not a blocker for 2.1, I've retargeted it 
to 2.2.

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?






[jira] [Updated] (SPARK-12347) Write script to run all MLlib examples for testing

2016-11-30 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-12347:
---
Target Version/s: 2.2.0  (was: 2.1.0)

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?






[jira] [Updated] (SPARK-18638) Upgrade sbt, zinc and maven plugins

2016-11-30 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18638:
-
Description: 
v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and 
upgrade it from 0.13.11 to 0.13.13. The release notes since the last version we 
used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
regression fixes. This jira will also update Zinc and Maven plugins.

{code}
   sbt: 0.13.11 -> 0.13.13,
   zinc: 0.3.9 -> 0.3.11,
   maven-assembly-plugin: 2.6 -> 3.0.0
   maven-compiler-plugin: 3.5.1 -> 3.6.
   maven-jar-plugin: 2.6 -> 3.0.2
   maven-javadoc-plugin: 2.10.3 -> 2.10.4
   maven-source-plugin: 2.4 -> 3.0.1
   org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12
   org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0
{code}

  was: v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, 
and upgrade it from 0.13.11 to 0.13.13. The release notes since the last 
version we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
regression fixes. 


> Upgrade sbt, zinc and maven plugins
> ---
>
> Key: SPARK-18638
> URL: https://issues.apache.org/jira/browse/SPARK-18638
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Minor
>
> v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and 
> upgrade it from 0.13.11 to 0.13.13. The release notes since the last version 
> we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
> https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
> regression fixes. This jira will also update Zinc and Maven plugins.
> {code}
>sbt: 0.13.11 -> 0.13.13,
>zinc: 0.3.9 -> 0.3.11,
>maven-assembly-plugin: 2.6 -> 3.0.0
>maven-compiler-plugin: 3.5.1 -> 3.6.
>maven-jar-plugin: 2.6 -> 3.0.2
>maven-javadoc-plugin: 2.10.3 -> 2.10.4
>maven-source-plugin: 2.4 -> 3.0.1
>org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12
>org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0
> {code}






[jira] [Updated] (SPARK-18638) Upgrade sbt, zinc and maven plugins

2016-11-30 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18638:
-
Summary: Upgrade sbt, zinc and maven plugins  (was: Upgrade sbt to 0.13.13)

> Upgrade sbt, zinc and maven plugins
> ---
>
> Key: SPARK-18638
> URL: https://issues.apache.org/jira/browse/SPARK-18638
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Minor
>
> v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and 
> upgrade it from 0.13.11 to 0.13.13. The release notes since the last version 
> we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
> https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
> regression fixes. 






[jira] [Commented] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710928#comment-15710928
 ] 

Apache Spark commented on SPARK-18617:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16096

> Close "kryo auto pick" feature for Spark Streaming
> --
>
> Key: SPARK-18617
> URL: https://issues.apache.org/jira/browse/SPARK-18617
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Assignee: Genmao Yu
> Fix For: 2.1.0
>
>
> [PR-15992|https://github.com/apache/spark/pull/15992] provided a solution to 
> fix the bug, i.e. {{receiver data can not be deserialized properly}}. As 
> [~zsxwing] said, it is a critical bug, but we should not break APIs between 
> maintenance releases. It may be a reasonable choice to disable the {{auto pick 
> kryo serializer}} feature for Spark Streaming as a first step.






[jira] [Updated] (SPARK-18666) Remove the code checking the deprecated config spark.sql.unsafe.enabled

2016-11-30 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-18666:

Description: spark.sql.unsafe.enabled has been deprecated since 1.6. There is 
still code in the Web UI that checks it. We should remove that check and clean 
up the code.  (was: spark.sql.unsafe.enabled has been deprecated since 2.0. 
There is still code in the Web UI that checks it. We should remove that check 
and clean up the code.)

> Remove the code checking the deprecated config spark.sql.unsafe.enabled
> 
>
> Key: SPARK-18666
> URL: https://issues.apache.org/jira/browse/SPARK-18666
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> spark.sql.unsafe.enabled has been deprecated since 1.6. There is still code in 
> the Web UI that checks it. We should remove that check and clean up the code.






[jira] [Assigned] (SPARK-18666) Remove the code checking the deprecated config spark.sql.unsafe.enabled

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18666:


Assignee: (was: Apache Spark)

> Remove the code checking the deprecated config spark.sql.unsafe.enabled
> 
>
> Key: SPARK-18666
> URL: https://issues.apache.org/jira/browse/SPARK-18666
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in 
> the Web UI that checks it. We should remove that check and clean up the code.






[jira] [Assigned] (SPARK-18666) Remove the code checking the deprecated config spark.sql.unsafe.enabled

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18666:


Assignee: Apache Spark

> Remove the code checking the deprecated config spark.sql.unsafe.enabled
> 
>
> Key: SPARK-18666
> URL: https://issues.apache.org/jira/browse/SPARK-18666
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in 
> the Web UI that checks it. We should remove that check and clean up the code.






[jira] [Commented] (SPARK-18666) Remove the code checking the deprecated config spark.sql.unsafe.enabled

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710891#comment-15710891
 ] 

Apache Spark commented on SPARK-18666:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16095

> Remove the code checking the deprecated config spark.sql.unsafe.enabled
> 
>
> Key: SPARK-18666
> URL: https://issues.apache.org/jira/browse/SPARK-18666
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in 
> the Web UI that checks it. We should remove that check and clean up the code.






[jira] [Created] (SPARK-18666) Remove the code checking the deprecated config spark.sql.unsafe.enabled

2016-11-30 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-18666:
---

 Summary: Remove the code checking the deprecated config 
spark.sql.unsafe.enabled
 Key: SPARK-18666
 URL: https://issues.apache.org/jira/browse/SPARK-18666
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Liang-Chi Hsieh
Priority: Minor


spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in 
the Web UI that checks it. We should remove that check and clean up the code.






[jira] [Commented] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV

2016-11-30 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710871#comment-15710871
 ] 

koert kuipers commented on SPARK-17583:
---

I see. So you are saying that in Spark 2.0.x it fails when the multiple lines 
that form a record end up in different splits? So basically it's not safe to use 
then; it just happened to work in my unit test because I had tiny part files 
that were never split up.


> Remove unused rowSeparator variable and set auto-expanding buffer as default 
> for maxCharsPerColumn option in CSV
> 
>
> Key: SPARK-17583
> URL: https://issues.apache.org/jira/browse/SPARK-17583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> This JIRA includes several changes below:
> 1. Upgrade Univocity library from 2.1.1 to 2.2.1
> This includes some performance improvements and also enables the auto-expanding 
> buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release 
> notes|https://github.com/uniVocity/univocity-parsers/releases].
> 2. Remove the {{rowSeparator}} variable existing in {{CSVOptions}}
> We have this variable in 
> [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127]
>  but it can cause confusion because it actually does not handle {{\r\n}}. For 
> example, we have an open issue about this, SPARK-17227, describing this 
> variable.
> This option is effectively unused because we rely on {{LineRecordReader}} in 
> Hadoop, which deals with both {{\n}} and {{\r\n}}.
> 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding
> We are setting 100 for the length of each column. It'd be more sensible to 
> allow auto-expanding rather than a fixed length by default.
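
As a usage note, a minimal sketch (assuming a local {{data.csv}} file and a 
{{spark}} session) of overriding the per-column limit through the CSV option; 
with the Univocity 2.2.1 upgrade above, {{-1}} disables the fixed limit, and 
the proposal is to make that behaviour the default:

{code}
// Read a CSV file while overriding the per-column character limit.
// "-1" removes the fixed limit (auto-expanding buffer).
val df = spark.read
  .option("header", "true")
  .option("maxCharsPerColumn", "-1")
  .csv("data.csv")

df.show()
{code}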






[jira] [Resolved] (SPARK-18476) SparkR Logistic Regression should support outputting the original labels.

2016-11-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18476.
-
   Resolution: Fixed
 Assignee: Miao Wang
Fix Version/s: 2.1.0

> SparkR Logistic Regression should support outputting the original labels.
> ---
>
> Key: SPARK-18476
> URL: https://issues.apache.org/jira/browse/SPARK-18476
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.0
>
>
> Similar to [SPARK-18401], as a classification algorithm, logistic regression 
> should support outputting the original labels instead of the indexed labels.






[jira] [Updated] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”

2016-11-30 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-18665:
--
Attachment: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png

> Spark ThriftServer jobs that are canceled are still “STARTED”
> --
>
> Key: SPARK-18665
> URL: https://issues.apache.org/jira/browse/SPARK-18665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3
>Reporter: cen yuhai
> Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, 
> 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png
>
>
> I find that some jobs are canceled, but their state is still "STARTED". I 
> think this bug was introduced by SPARK-6964






[jira] [Updated] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”

2016-11-30 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-18665:
--
Attachment: 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png

> Spark ThriftServer jobs that are canceled are still “STARTED”
> --
>
> Key: SPARK-18665
> URL: https://issues.apache.org/jira/browse/SPARK-18665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3
>Reporter: cen yuhai
> Attachments: 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png
>
>
> I find that some jobs are canceled, but their state is still "STARTED". I 
> think this bug was introduced by SPARK-6964






[jira] [Created] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”

2016-11-30 Thread cen yuhai (JIRA)
cen yuhai created SPARK-18665:
-

 Summary: Spark ThriftServer jobs that are canceled are still 
“STARTED”
 Key: SPARK-18665
 URL: https://issues.apache.org/jira/browse/SPARK-18665
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.3
Reporter: cen yuhai


I find that some jobs are canceled, but their state is still "STARTED". I think 
this bug was introduced by SPARK-6964






[jira] [Assigned] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18541:


Assignee: (was: Apache Spark)

> Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management 
> in pyspark SQL API
> 
>
> Key: SPARK-18541
> URL: https://issues.apache.org/jira/browse/SPARK-18541
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
> Environment: all
>Reporter: Shea Parkes
>Priority: Minor
>  Labels: newbie
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Scala SQL API, you can pass in new metadata when you alias a field.  
> That functionality is not available in the Python API.   Right now, you have 
> to painfully utilize {{SparkSession.createDataFrame}} to manipulate the 
> metadata for even a single column.  I would propose to add the following 
> method to {{pyspark.sql.Column}}:
> {code}
> def aliasWithMetadata(self, name, metadata):
> """
> Make a new Column that has the provided alias and metadata.
> Metadata will be processed with json.dumps()
> """
> _context = pyspark.SparkContext._active_spark_context
> _metadata_str = json.dumps(metadata)
> _metadata_jvm = 
> _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str)
> _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm)
> return Column(_new_java_column)
> {code}
> I can likely complete this request myself if there is any interest for it.  
> Just have to dust off my knowledge of doctest and the location of the python 
> tests.
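
For context, a minimal Scala sketch of the existing Scala-side functionality 
the proposed method would expose, {{Column.as(alias, metadata)}}, assuming a 
DataFrame {{df}} with a column {{x}}:

{code}
import org.apache.spark.sql.types.Metadata

// Build metadata from a JSON string and attach it while aliasing the column.
val metadata = Metadata.fromJson("""{"comment": "user id"}""")
val aliased = df.select(df("x").as("y", metadata))

// The metadata travels with the schema of the aliased column.
println(aliased.schema("y").metadata)
{code}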






[jira] [Assigned] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18541:


Assignee: Apache Spark

> Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management 
> in pyspark SQL API
> 
>
> Key: SPARK-18541
> URL: https://issues.apache.org/jira/browse/SPARK-18541
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
> Environment: all
>Reporter: Shea Parkes
>Assignee: Apache Spark
>Priority: Minor
>  Labels: newbie
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Scala SQL API, you can pass in new metadata when you alias a field.  
> That functionality is not available in the Python API.   Right now, you have 
> to painfully utilize {{SparkSession.createDataFrame}} to manipulate the 
> metadata for even a single column.  I would propose to add the following 
> method to {{pyspark.sql.Column}}:
> {code}
> def aliasWithMetadata(self, name, metadata):
> """
> Make a new Column that has the provided alias and metadata.
> Metadata will be processed with json.dumps()
> """
> _context = pyspark.SparkContext._active_spark_context
> _metadata_str = json.dumps(metadata)
> _metadata_jvm = 
> _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str)
> _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm)
> return Column(_new_java_column)
> {code}
> I can likely complete this request myself if there is any interest for it.  
> Just have to dust off my knowledge of doctest and the location of the python 
> tests.






[jira] [Commented] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710705#comment-15710705
 ] 

Apache Spark commented on SPARK-18541:
--

User 'shea-parkes' has created a pull request for this issue:
https://github.com/apache/spark/pull/16094

> Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management 
> in pyspark SQL API
> 
>
> Key: SPARK-18541
> URL: https://issues.apache.org/jira/browse/SPARK-18541
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
> Environment: all
>Reporter: Shea Parkes
>Priority: Minor
>  Labels: newbie
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Scala SQL API, you can pass in new metadata when you alias a field.  
> That functionality is not available in the Python API.   Right now, you have 
> to painfully utilize {{SparkSession.createDataFrame}} to manipulate the 
> metadata for even a single column.  I would propose to add the following 
> method to {{pyspark.sql.Column}}:
> {code}
> def aliasWithMetadata(self, name, metadata):
> """
> Make a new Column that has the provided alias and metadata.
> Metadata will be processed with json.dumps()
> """
> _context = pyspark.SparkContext._active_spark_context
> _metadata_str = json.dumps(metadata)
> _metadata_jvm = 
> _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str)
> _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm)
> return Column(_new_java_column)
> {code}
> I can likely complete this request myself if there is any interest for it.  
> Just have to dust off my knowledge of doctest and the location of the python 
> tests.






[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-11-30 Thread Ron Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710667#comment-15710667
 ] 

Ron Hu commented on SPARK-16026:


Hi Reynold, I previously worked on filter cardinality estimation using the old 
statistics structure.  Now I need to refactor my code using the new basic 
statistics structure we agreed on.  As I am traveling on a business trip now, I 
will resume my work on Monday after I return to the Bay Area.  Zhenhua is 
currently busy with some customer tasks this week.  He will return to work on CBO 
soon.

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-11-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710618#comment-15710618
 ] 

Reynold Xin commented on SPARK-16026:
-

[~ZenWzh] can we start working on operator cardinality estimation propagation 
based on what's in the catalog right now?


> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Commented] (SPARK-18663) Simplify CountMinSketch aggregate implementation

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710616#comment-15710616
 ] 

Apache Spark commented on SPARK-18663:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16093

> Simplify CountMinSketch aggregate implementation
> 
>
> Key: SPARK-18663
> URL: https://issues.apache.org/jira/browse/SPARK-18663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> SPARK-18429 introduced a count-min sketch aggregate function for SQL, but the 
> implementation and testing are more complicated than needed. This simplifies 
> the test cases and removes support for data types that don't have clear 
> equality semantics.






[jira] [Assigned] (SPARK-18663) Simplify CountMinSketch aggregate implementation

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18663:


Assignee: Apache Spark  (was: Reynold Xin)

> Simplify CountMinSketch aggregate implementation
> 
>
> Key: SPARK-18663
> URL: https://issues.apache.org/jira/browse/SPARK-18663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> SPARK-18429 introduced a count-min sketch aggregate function for SQL, but the 
> implementation and testing are more complicated than needed. This simplifies 
> the test cases and removes support for data types that don't have clear 
> equality semantics.






[jira] [Assigned] (SPARK-18663) Simplify CountMinSketch aggregate implementation

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18663:


Assignee: Reynold Xin  (was: Apache Spark)

> Simplify CountMinSketch aggregate implementation
> 
>
> Key: SPARK-18663
> URL: https://issues.apache.org/jira/browse/SPARK-18663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> SPARK-18429 introduced a count-min sketch aggregate function for SQL, but the 
> implementation and testing are more complicated than needed. This simplifies 
> the test cases and removes support for data types that don't have clear 
> equality semantics.






[jira] [Updated] (SPARK-18664) Don't respond to HTTP OPTIONS in HTTP-based UIs

2016-11-30 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-18664:
-
Description: 
This was flagged a while ago during a routine security scan (AWVS): the 
HTTP-based Spark services respond to an HTTP OPTIONS command.

The HTTP OPTIONS method is enabled on this web server. The OPTIONS method 
provides a list of the methods that are supported by the web server; it 
represents a request for information about the communication options available 
on the request/response chain identified by the Request-URI.

The OPTIONS method may expose sensitive information that may help a malicious 
user prepare more advanced attacks.

  was: This was flagged a while ago during a routine security scan (AWVS): the 
HTTP-based Spark services respond to an HTTP OPTIONS command.


> Don't respond to HTTP OPTIONS in HTTP-based UIs
> ---
>
> Key: SPARK-18664
> URL: https://issues.apache.org/jira/browse/SPARK-18664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: meiyoula
>Priority: Minor
>
> This was flagged a while ago during a routine security scan (AWVS): the 
> HTTP-based Spark services respond to an HTTP OPTIONS command.
> The HTTP OPTIONS method is enabled on this web server. The OPTIONS method 
> provides a list of the methods that are supported by the web server; it 
> represents a request for information about the communication options 
> available on the request/response chain identified by the Request-URI.
> The OPTIONS method may expose sensitive information that may help a 
> malicious user prepare more advanced attacks.
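
As an illustration only, a hypothetical sketch of one way a servlet-based UI 
could refuse OPTIONS (and TRACE) requests, using the standard {{javax.servlet}} 
API; this is not how Spark's UI is wired today, just the general shape of the 
behaviour the issue asks for:

{code}
import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

// Hypothetical filter: reject OPTIONS and TRACE so the server stops
// advertising its supported methods.
class DisableOptionsFilter extends Filter {
  override def init(config: FilterConfig): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse,
                        chain: FilterChain): Unit = {
    val method = req.asInstanceOf[HttpServletRequest].getMethod
    if ("OPTIONS".equalsIgnoreCase(method) || "TRACE".equalsIgnoreCase(method)) {
      res.asInstanceOf[HttpServletResponse]
        .sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED)
    } else {
      chain.doFilter(req, res)
    }
  }

  override def destroy(): Unit = {}
}
{code}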






[jira] [Commented] (SPARK-18664) Don't respond to HTTP OPTIONS in HTTP-based UIs

2016-11-30 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710591#comment-15710591
 ] 

meiyoula commented on SPARK-18664:
--

[~srowen] It is similar to SPARK-5983; should we fix it?

> Don't respond to HTTP OPTIONS in HTTP-based UIs
> ---
>
> Key: SPARK-18664
> URL: https://issues.apache.org/jira/browse/SPARK-18664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: meiyoula
>Priority: Minor
>
> This was flagged a while ago during a routine security scan (AWVS): the 
> HTTP-based Spark services respond to an HTTP OPTIONS command.






[jira] [Created] (SPARK-18664) Don't respond to HTTP OPTIONS in HTTP-based UIs

2016-11-30 Thread meiyoula (JIRA)
meiyoula created SPARK-18664:


 Summary: Don't respond to HTTP OPTIONS in HTTP-based UIs
 Key: SPARK-18664
 URL: https://issues.apache.org/jira/browse/SPARK-18664
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: meiyoula
Priority: Minor


This was flagged a while ago during a routine security scan (AWVS): the 
HTTP-based Spark services respond to an HTTP OPTIONS command.






[jira] [Created] (SPARK-18663) Simplify CountMinSketch aggregate implementation

2016-11-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18663:
---

 Summary: Simplify CountMinSketch aggregate implementation
 Key: SPARK-18663
 URL: https://issues.apache.org/jira/browse/SPARK-18663
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


SPARK-18429 introduced a count-min sketch aggregate function for SQL, but the 
implementation and testing are more complicated than needed. This simplifies the 
test cases and removes support for data types that don't have clear equality 
semantics.






[jira] [Resolved] (SPARK-18644) spark-submit fails to run python scripts with specific names

2016-11-30 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-18644.
--
Resolution: Not A Problem

> spark-submit fails to run python scripts with specific names
> 
>
> Key: SPARK-18644
> URL: https://issues.apache.org/jira/browse/SPARK-18644
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: Jussi Jousimo
>Priority: Minor
>
> I'm trying to run a simple Python script named tokenize.py with spark-submit. 
> The script only imports SparkContext:
> from pyspark import SparkContext
> And I run it with:
> spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 
> tokenize.py
> However, the script fails:
> ImportError: cannot import name SparkContext
> I have set all necessary environment variables, etc. Strangely, it seems the 
> filename is causing this error. If I rename the file to, e.g., tokenizer.py 
> and run again, it runs fine.






[jira] [Commented] (SPARK-18644) spark-submit fails to run python scripts with specific names

2016-11-30 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710507#comment-15710507
 ] 

Bryan Cutler commented on SPARK-18644:
--

Yeah, [~vanzin] is right, it's a Python thing. See the stack trace below: the 
{{inspect}} module imports {{tokenize}}, which finds your local file first.
{noformat}
Traceback (most recent call last):
  File "repo/spark/tokenize.py", line 1, in 
from pyspark import SparkContext
  File "repo/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 44, in 

  File "repo/spark/python/lib/pyspark.zip/pyspark/context.py", line 33, in 

  File "repo/spark/python/lib/pyspark.zip/pyspark/java_gateway.py", line 31, in 

  File "repo/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
18, in 
  File "/usr/lib/python2.7/pydoc.py", line 56, in 
import sys, imp, os, re, types, inspect, __builtin__, pkgutil, warnings
  File "/usr/lib/python2.7/inspect.py", line 39, in 
import tokenize
  File "repo/spark/tokenize.py", line 1, in 
  from pyspark import SparkContext
ImportError: cannot import name SparkContext
{noformat}


> spark-submit fails to run python scripts with specific names
> 
>
> Key: SPARK-18644
> URL: https://issues.apache.org/jira/browse/SPARK-18644
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: Jussi Jousimo
>Priority: Minor
>
> I'm trying to run a simple Python script named tokenize.py with spark-submit. 
> The script only imports SparkContext:
> from pyspark import SparkContext
> And I run it with:
> spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 
> tokenize.py
> However, the script fails:
> ImportError: cannot import name SparkContext
> I have set all necessary environment variables, etc. Strangely, it seems the 
> filename is causing this error. If I rename the file to, e.g., tokenizer.py 
> and run again, it runs fine.






[jira] [Updated] (SPARK-18662) Move cluster managers into their own sub-directory

2016-11-30 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-18662:
---
External issue URL: https://github.com/apache/spark/pull/16092

> Move cluster managers into their own sub-directory
> --
>
> Key: SPARK-18662
> URL: https://issues.apache.org/jira/browse/SPARK-18662
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> As we move to support Kubernetes in addition to Yarn and Mesos 
> (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the 
> cluster managers into a "resource-managers/" sub-directory. This is simply a 
> reorganization.
> Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340






[jira] [Commented] (SPARK-18662) Move cluster managers into their own sub-directory

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710391#comment-15710391
 ] 

Apache Spark commented on SPARK-18662:
--

User 'foxish' has created a pull request for this issue:
https://github.com/apache/spark/pull/16092

> Move cluster managers into their own sub-directory
> --
>
> Key: SPARK-18662
> URL: https://issues.apache.org/jira/browse/SPARK-18662
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> As we move to support Kubernetes in addition to Yarn and Mesos 
> (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the 
> cluster managers into a "resource-managers/" sub-directory. This is simply a 
> reorganization.
> Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340






[jira] [Assigned] (SPARK-18662) Move cluster managers into their own sub-directory

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18662:


Assignee: (was: Apache Spark)

> Move cluster managers into their own sub-directory
> --
>
> Key: SPARK-18662
> URL: https://issues.apache.org/jira/browse/SPARK-18662
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> As we move to support Kubernetes in addition to Yarn and Mesos 
> (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the 
> cluster managers into a "resource-managers/" sub-directory. This is simply a 
> reorganization.
> Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340






[jira] [Assigned] (SPARK-18662) Move cluster managers into their own sub-directory

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18662:


Assignee: Apache Spark

> Move cluster managers into their own sub-directory
> --
>
> Key: SPARK-18662
> URL: https://issues.apache.org/jira/browse/SPARK-18662
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Anirudh Ramanathan
>Assignee: Apache Spark
>Priority: Minor
>
> As we move to support Kubernetes in addition to Yarn and Mesos 
> (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the 
> cluster managers into a "resource-managers/" sub-directory. This is simply a 
> reorganization.
> Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340






[jira] [Commented] (SPARK-18650) race condition in FileScanRDD.scala

2016-11-30 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710358#comment-15710358
 ] 

Hyukjin Kwon commented on SPARK-18650:
--

Would it be possible to share your data or some sample data? I would like to 
reproduce this.

> race condition in FileScanRDD.scala
> ---
>
> Key: SPARK-18650
> URL: https://issues.apache.org/jira/browse/SPARK-18650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: scala 2.11
> macos 10.11.6
>Reporter: Jay Goldman
>
> I am attempting to create a DataSet from a single CSV file :
>  val ss: SparkSession = 
>  val ddr = ss.read.option("path", path)
> ... (choose between xml vs csv parsing)
>  var df = ddr.option("sep", ",")
>   .option("quote", "\"")
>   .option("escape", "\"") // want to retain backslashes (\) ...
>   .option("delimiter", ",")
>   .option("comment", "#")
>   .option("header", "true")
>   .option("format", "csv")
>ddr.csv(path)
> df.count() returns 2 times the number of lines in the CSV file - i.e., each 
> line of the input file shows up as 2 rows in df. 
> Moreover, df.distinct.count returns the correct number of rows.
> There appears to be a problem in FileScanRDD.compute. I am using spark 
> version 2.0.1 with scala 2.11. I am not going to include the entire contents 
> of FileScanRDD.scala here.
> In FileScanRDD.compute there is the following:
>  private[this] val files = split.asInstanceOf[FilePartition].files.toIterator
> If I put a breakpoint in either FileScanRDD.compute or 
> FileScanRDD.nextIterator, the resulting dataset has the correct number of rows.
> Moreover, the code in FileScanRDD.scala is:
> private def nextIterator(): Boolean = {
> updateBytesReadWithFileSize()
> if (files.hasNext) { // breakpoint here => works
>   currentFile = files.next() // breakpoint here => fails
>   
> }
> else {  }
> 
> }
> If I put a breakpoint on the files.hasNext line, all is well; however, if I 
> put a breakpoint on the files.next() line, the code will fail when I continue 
> because the files iterator has become empty (see stack trace below). 
> Disabling the breakpoint winds up creating a Dataset with each line of the 
> CSV file duplicated.
> So it appears that multiple threads are using the files iterator or the 
> underlying split value (an RDDPartition), and timing-wise on my system 2 
> workers wind up processing the same file, with the resulting DataSet having 2 
> copies of each of the input lines.
> This code is not active when parsing an XML file. 
> here is stack trace:
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/30 09:31:07 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): 
> java.util.NoSuchElementException: next on empty iterator
>   at 

[jira] [Resolved] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server

2016-11-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18655.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.1.0
>
>
> SPARK-18516 changes the event log format of Structured Streaming. We should 
> make sure our changes do not break the history server.






[jira] [Commented] (SPARK-18560) Receiver data can not be dataSerialized properly.

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710309#comment-15710309
 ] 

Apache Spark commented on SPARK-18560:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16091

> Receiver data can not be dataSerialized properly.
> -
>
> Key: SPARK-18560
> URL: https://issues.apache.org/jira/browse/SPARK-18560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Priority: Critical
>
> My spark streaming job can run correctly on Spark 1.6.1, but it can not run 
> properly on Spark 2.0.1, with following exception:
> {code}
> 16/11/22 19:20:15 ERROR executor.Executor: Exception in task 4.3 in stage 6.0 
> (TID 87)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:243)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1150)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1150)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1943)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1943)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Digging into the relevant implementation, I found that the type of the data 
> received by {{Receiver}} is erased. In Spark 2.x, the framework can choose an 
> appropriate {{Serializer}} from {{JavaSerializer}} and {{KryoSerializer}} based 
> on the type of the data. 
> At the {{Receiver}} side, the type of the data is erased to {{Object}}, so the 
> framework will choose {{JavaSerializer}}, with the following code:
> {code}
> def canUseKryo(ct: ClassTag[_]): Boolean = {
> primitiveAndPrimitiveArrayClassTags.contains(ct) || ct == stringClassTag
>   }
>   def getSerializer(ct: ClassTag[_]): Serializer = {
> if (canUseKryo(ct)) {
>   kryoSerializer
> } else {
>   defaultSerializer
> }
>   }
> {code}
> At the task side, the correct data type is known, and the framework will choose 
> {{KryoSerializer}} if possible, with the following supported types:
> {code}
> private[this] val stringClassTag: ClassTag[String] = 
> implicitly[ClassTag[String]]
> private[this] val primitiveAndPrimitiveArrayClassTags: Set[ClassTag[_]] = {
> val primitiveClassTags = Set[ClassTag[_]](
>   ClassTag.Boolean,
>   ClassTag.Byte,
>   ClassTag.Char,
>   ClassTag.Double,
>   ClassTag.Float,
>   ClassTag.Int,
>   ClassTag.Long,
>   ClassTag.Null,
>   ClassTag.Short
> )
> val arrayClassTags = primitiveClassTags.map(_.wrap)
> primitiveClassTags ++ arrayClassTags
>   }
> {code}
> In my case, the type of the data is a byte array.
> This problem stems from SPARK-13990, a patch to have Spark automatically pick 
> the "best" serializer when caching RDDs.
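
A small standalone sketch of the selection rule quoted above (a simplified
restatement for illustration, not the actual SerializerManager code): the
receiver-side ClassTag, erased to Object/Any, never qualifies for Kryo even when
the real payload is a byte array, while the task side sees the concrete type and
picks Kryo, which is the write/read mismatch behind the exception.

{code}
import scala.reflect.ClassTag

object KryoAutoPickSketch {
  // Simplified copy of the quoted rule: Kryo is auto-picked only for
  // primitives, primitive arrays, and String.
  private val kryoFriendly: Set[ClassTag[_]] = {
    val primitives = Set[ClassTag[_]](
      ClassTag.Boolean, ClassTag.Byte, ClassTag.Char, ClassTag.Double,
      ClassTag.Float, ClassTag.Int, ClassTag.Long, ClassTag.Null, ClassTag.Short)
    primitives ++ primitives.map(_.wrap)
  }

  def canUseKryo(ct: ClassTag[_]): Boolean =
    kryoFriendly.contains(ct) || ct == implicitly[ClassTag[String]]

  def main(args: Array[String]): Unit = {
    // Task side: the concrete element type is known, so Kryo is chosen.
    println(canUseKryo(implicitly[ClassTag[Array[Byte]]]))  // true
    // Receiver side: the element type has been erased, so Java serialization is chosen.
    println(canUseKryo(implicitly[ClassTag[Any]]))          // false
  }
}
{code}

Writing receiver blocks with Java serialization and reading them back with Kryo is
presumably why SPARK-18617 disables the auto-pick for Spark Streaming as a first step.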



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710308#comment-15710308
 ] 

Apache Spark commented on SPARK-18617:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16091

> Close "kryo auto pick" feature for Spark Streaming
> --
>
> Key: SPARK-18617
> URL: https://issues.apache.org/jira/browse/SPARK-18617
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Assignee: Genmao Yu
> Fix For: 2.1.0
>
>
> [PR-15992|https://github.com/apache/spark/pull/15992] provided a solution to 
> fix the bug, i.e. {{receiver data can not be deserialized properly}}. As 
> [~zsxwing] said, it is a critical bug, but we should not break APIs between 
> maintenance releases. It may be a reasonable first step to disable the {{auto 
> pick kryo serializer}} feature for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18122) Fallback to Kryo for unknown classes in ExpressionEncoder

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18122:


Assignee: (was: Apache Spark)

> Fallback to Kryo for unknown classes in ExpressionEncoder
> -
>
> Key: SPARK-18122
> URL: https://issues.apache.org/jira/browse/SPARK-18122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Michael Armbrust
>Priority: Critical
>
> In Spark 2.0 we fail to generate an encoder if any of the fields of the class 
> are not of a supported type.  One example is {{Option\[Set\[Int\]\]}}, but 
> there are many more.  We should give the user the option to fall back on 
> opaque kryo serialization in these cases for subtrees of the encoder, rather 
> than failing.
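
A hedged illustration of the current behavior and the manual workaround (the
Record class and session setup are illustrative; the proposal above would make
the fallback automatic and scoped to the unsupported subtree rather than the
whole object):

{code}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object KryoFallbackSketch {
  // A field of type Option[Set[Int]] has no supported ExpressionEncoder in 2.0.
  case class Record(id: Long, tags: Option[Set[Int]])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("kryo-fallback").getOrCreate()

    // Using the default product encoder for Record fails to generate an encoder
    // because of the Option[Set[Int]] field.

    // Manual workaround today: serialize the whole object opaquely with Kryo.
    implicit val recordEncoder: Encoder[Record] = Encoders.kryo[Record]
    val ds = spark.createDataset(Seq(Record(1L, Some(Set(1, 2)))))
    println(ds.count())  // 1, but the row is stored as a single opaque binary column

    spark.stop()
  }
}
{code}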



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18122) Fallback to Kryo for unknown classes in ExpressionEncoder

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18122:


Assignee: Apache Spark

> Fallback to Kryo for unknown classes in ExpressionEncoder
> -
>
> Key: SPARK-18122
> URL: https://issues.apache.org/jira/browse/SPARK-18122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Critical
>
> In Spark 2.0 we fail to generate an encoder if any of the fields of the class 
> are not of a supported type.  One example is {{Option\[Set\[Int\]\]}}, but 
> there are many more.  We should give the user the option to fall back on 
> opaque kryo serialization in these cases for subtrees of the encoder, rather 
> than failing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18662) Move cluster managers into their own sub-directory

2016-11-30 Thread Anirudh Ramanathan (JIRA)
Anirudh Ramanathan created SPARK-18662:
--

 Summary: Move cluster managers into their own sub-directory
 Key: SPARK-18662
 URL: https://issues.apache.org/jira/browse/SPARK-18662
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Anirudh Ramanathan
Priority: Minor


As we move to support Kubernetes in addition to Yarn and Mesos 
(https://issues.apache.org/jira/browse/SPARK-18278), we should move all the 
cluster managers into a "resource-managers/" sub-directory. This is simply a 
reorganization.

Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18122) Fallback to Kryo for unknown classes in ExpressionEncoder

2016-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-18122:
--

I'm going to reopen this.  I think the benefits outweigh the compatibility 
concerns.

> Fallback to Kryo for unknown classes in ExpressionEncoder
> -
>
> Key: SPARK-18122
> URL: https://issues.apache.org/jira/browse/SPARK-18122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Michael Armbrust
>Priority: Critical
>
> In Spark 2.0 we fail to generate an encoder if any of the fields of the class 
> are not of a supported type.  One example is {{Option\[Set\[Int\]\]}}, but 
> there are many more.  We should give the user the option to fall back on 
> opaque kryo serialization in these cases for subtrees of the encoder, rather 
> than failing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2016-11-30 Thread Sina Sohangir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sina Sohangir closed SPARK-18656.
-
Resolution: Later

> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
> requires too much memory in case of many columns
> --
>
> Key: SPARK-18656
> URL: https://issues.apache.org/jira/browse/SPARK-18656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sina Sohangir
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
> is implemented in a way that causes out-of-memory errors for cases where 
> the number of columns is high.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710233#comment-15710233
 ] 

Apache Spark commented on SPARK-18661:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/16090

> Creating a partitioned datasource table should not scan all files for table
> ---
>
> Key: SPARK-18661
> URL: https://issues.apache.org/jira/browse/SPARK-18661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> Even though in 2.1 creating a partitioned datasource table will not populate 
> the partition data by default (until the user issues MSCK REPAIR TABLE), it 
> seems we still scan the filesystem for no good reason.
> We should avoid doing this when the user specifies a schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18661:


Assignee: Apache Spark

> Creating a partitioned datasource table should not scan all files for table
> ---
>
> Key: SPARK-18661
> URL: https://issues.apache.org/jira/browse/SPARK-18661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Blocker
>
> Even though in 2.1 creating a partitioned datasource table will not populate 
> the partition data by default (until the user issues MSCK REPAIR TABLE), it 
> seems we still scan the filesystem for no good reason.
> We should avoid doing this when the user specifies a schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18661:


Assignee: (was: Apache Spark)

> Creating a partitioned datasource table should not scan all files for table
> ---
>
> Key: SPARK-18661
> URL: https://issues.apache.org/jira/browse/SPARK-18661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> Even though in 2.1 creating a partitioned datasource table will not populate 
> the partition data by default (until the user issues MSCK REPAIR TABLE), it 
> seems we still scan the filesystem for no good reason.
> We should avoid doing this when the user specifies a schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18661) Creating a partitioned datasource table should not scan all files in filesystem

2016-11-30 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18661:
--

 Summary: Creating a partitioned datasource table should not scan 
all files in filesystem
 Key: SPARK-18661
 URL: https://issues.apache.org/jira/browse/SPARK-18661
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Blocker


Even though in 2.1 creating a partitioned datasource table will not populate 
the partition data by default (until the user issues MSCK REPAIR TABLE), it 
seems we still scan the filesystem for no good reason.

We should avoid doing this when the user specifies a schema.
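
A hedged sketch of the scenario being described (table name, columns, and path are
made up, and a {{spark}} session is assumed to be in scope, as in the test snippets
elsewhere in this digest): the schema is fully specified at creation time and
partitions are only added later via MSCK REPAIR TABLE, so the CREATE itself should
be metadata-only and should not need to list the files under the location.

{code}
// Full schema given up front; creation should not scan '/data/logs'.
spark.sql("""
  CREATE TABLE logs (id BIGINT, msg STRING, day STRING)
  USING parquet
  OPTIONS (path '/data/logs')
  PARTITIONED BY (day)
""")

// Partition discovery is deferred until the user explicitly asks for it.
spark.sql("MSCK REPAIR TABLE logs")
{code}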



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table

2016-11-30 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18661:
---
Summary: Creating a partitioned datasource table should not scan all files 
for table  (was: Creating a partitioned datasource table should not scan all 
files in filesystem)

> Creating a partitioned datasource table should not scan all files for table
> ---
>
> Key: SPARK-18661
> URL: https://issues.apache.org/jira/browse/SPARK-18661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> Even though in 2.1 creating a partitioned datasource table will not populate 
> the partition data by default (until the user issues MSCK REPAIR TABLE), it 
> seems we still scan the filesystem for no good reason.
> We should avoid doing this when the user specifies a schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV

2016-11-30 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710199#comment-15710199
 ] 

Hyukjin Kwon commented on SPARK-17583:
--

For example, please refer to the discussion in 
https://issues.apache.org/jira/browse/SPARK-17227

> Remove unused rowSeparator variable and set auto-expanding buffer as default 
> for maxCharsPerColumn option in CSV
> 
>
> Key: SPARK-17583
> URL: https://issues.apache.org/jira/browse/SPARK-17583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> This JIRA includes the following changes:
> 1. Upgrade the Univocity library from 2.1.1 to 2.2.1
> This includes some performance improvements and also enables the auto-expanding 
> buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release 
> notes|https://github.com/uniVocity/univocity-parsers/releases].
> 2. Remove the {{rowSeparator}} variable existing in {{CSVOptions}}
> We have this variable in 
> [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127]
>  but it seems to cause confusion because it does not actually handle 
> {{\r\n}}. For example, we have an issue open about this, SPARK-17227, 
> describing this variable.
> This option is effectively unused because we rely on 
> {{LineRecordReader}} in Hadoop, which handles both {{\n}} and {{\r\n}}.
> 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding
> We currently set 100 as the length limit of each column. It'd be more sensible 
> to allow auto-expanding rather than a fixed length by default.
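
For context, a hedged usage sketch (the file path is made up, and {{spark}} is
assumed to be an existing session): {{maxCharsPerColumn}} is exposed as a CSV read
option, and with the auto-expanding buffer as the default it should rarely need to
be set at all; -1 requests the unlimited behavior explicitly.

{code}
// Reading a CSV with very long fields; with the new default this option can
// usually be omitted, and -1 means "no fixed per-column limit".
val df = spark.read
  .option("header", "true")
  .option("maxCharsPerColumn", "-1")
  .csv("/data/wide_columns.csv")
df.printSchema()
{code}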



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-11-30 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710198#comment-15710198
 ] 

yuhao yang commented on SPARK-18374:


I checked with some other lists of stopwords and got a list to add, but it's 
longer than I thought:
i'll
you'll
he'll
she'll
we'll
they'll
i'd
you'd
he'd
she'd
we'd
they'd
i'm
you're
he's
she's
it's
we're
they're
i've
we've
you've
they've
isn't
aren't
wasn't
weren't
haven't
hasn't
hadn't
don't
doesn't
didn't
won't
wouldn't
shan't
shouldn't
mustn't
can't
couldn't

any concern?
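
Until the shipped list is fixed, a hedged workaround sketch (column names are
illustrative) is to extend the default English list with the contraction forms
listed above:

{code}
import org.apache.spark.ml.feature.StopWordsRemover

// Append the missing contraction forms to the bundled english stop word list.
val contractions = Array("won't", "wouldn't", "isn't", "aren't", "can't",
  "couldn't", "shouldn't", "mustn't", "shan't", "hasn't", "haven't", "hadn't")
val remover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords("english") ++ contractions)
{code}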

> Incorrect words in StopWords/english.txt
> 
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: nirav patel
>
> I was just double-checking english.txt's list of stop words, as I felt it was 
> removing valid tokens like 'won'. I think the issue is that the english.txt list 
> is missing the apostrophe character and all characters after the apostrophe. So 
> "won't" became "won" in that list; "wouldn't" became "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should include both styles, i.e. both won't and wont 
> should be part of english.txt, since some tokenizers might remove special 
> characters. But 'won' obviously shouldn't be in this list.
> Here's list of snowball english stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV

2016-11-30 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710195#comment-15710195
 ] 

Hyukjin Kwon commented on SPARK-17583:
--

Ah, that is not actually related to this JIRA, because this JIRA only covers 
removing an unused option and upgrading the CSV library. 

That behavior comes from relying on LineRecordReader. The external CSV library and 
the built-in CSV source in 2.0.x support it, but they have a problem reading a 
multi-line record when it spans multiple blocks.

It would be great if this were supported; there are several issues open for it. 


> Remove unused rowSeparator variable and set auto-expanding buffer as default 
> for maxCharsPerColumn option in CSV
> 
>
> Key: SPARK-17583
> URL: https://issues.apache.org/jira/browse/SPARK-17583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> This JIRA includes the following changes:
> 1. Upgrade the Univocity library from 2.1.1 to 2.2.1
> This includes some performance improvements and also enables the auto-expanding 
> buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release 
> notes|https://github.com/uniVocity/univocity-parsers/releases].
> 2. Remove the {{rowSeparator}} variable existing in {{CSVOptions}}
> We have this variable in 
> [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127]
>  but it seems to cause confusion because it does not actually handle 
> {{\r\n}}. For example, we have an issue open about this, SPARK-17227, 
> describing this variable.
> This option is effectively unused because we rely on 
> {{LineRecordReader}} in Hadoop, which handles both {{\n}} and {{\r\n}}.
> 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding
> We currently set 100 as the length limit of each column. It'd be more sensible 
> to allow auto-expanding rather than a fixed length by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV

2016-11-30 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710172#comment-15710172
 ] 

koert kuipers commented on SPARK-17583:
---

I just ran our in-house unit tests (which pass against Spark 2.0.2) against 
Spark 2.1.0-RC1, and things break when writing out CSVs and reading them back in 
if there is a newline inside a CSV value (which gets quoted). Writing out 
works, but reading it back in breaks.

I am not saying it is a good idea to have newlines inside quoted CSV values, 
but I wanted to point out that this did used to work with Spark 2.0.2. I am 
not entirely sure why it worked, actually. Looking at the test, it actually 
writes the value with the newline out over two lines, and it reads it back in 
correctly as well. 

> Remove unused rowSeparator variable and set auto-expanding buffer as default 
> for maxCharsPerColumn option in CSV
> 
>
> Key: SPARK-17583
> URL: https://issues.apache.org/jira/browse/SPARK-17583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> This JIRA includes the following changes:
> 1. Upgrade the Univocity library from 2.1.1 to 2.2.1
> This includes some performance improvements and also enables the auto-expanding 
> buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release 
> notes|https://github.com/uniVocity/univocity-parsers/releases].
> 2. Remove the {{rowSeparator}} variable existing in {{CSVOptions}}
> We have this variable in 
> [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127]
>  but it seems to cause confusion because it does not actually handle 
> {{\r\n}}. For example, we have an issue open about this, SPARK-17227, 
> describing this variable.
> This option is effectively unused because we rely on 
> {{LineRecordReader}} in Hadoop, which handles both {{\n}} and {{\r\n}}.
> 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding
> We currently set 100 as the length limit of each column. It'd be more sensible 
> to allow auto-expanding rather than a fixed length by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2016-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17939:
-
Target Version/s: 2.1.0

> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> --
>
> Key: SPARK-17939
> URL: https://issues.apache.org/jira/browse/SPARK-17939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets 
> creates some confusion. As has been pointed out previously [1], Nullability 
> is a hint to the Catalyst optimizer, and is not meant to be a type-level 
> enforcement. Allowing null fields can also help the reader successfully parse 
> certain types of more loosely-typed data, like JSON and CSV, where null 
> values are common, rather than just failing. 
> There's already been some movement to clarify the meaning of Nullable in the 
> API, but also some requests for a (perhaps completely separate) type-level 
> implementation of Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] - 
> [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2016-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17939:
-
Target Version/s: 2.2.0  (was: 2.1.0)

> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> --
>
> Key: SPARK-17939
> URL: https://issues.apache.org/jira/browse/SPARK-17939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets 
> creates some confusion. As has been pointed out previously [1], Nullability 
> is a hint to the Catalyst optimizer, and is not meant to be a type-level 
> enforcement. Allowing null fields can also help the reader successfully parse 
> certain types of more loosely-typed data, like JSON and CSV, where null 
> values are common, rather than just failing. 
> There's already been some movement to clarify the meaning of Nullable in the 
> API, but also some requests for a (perhaps completely separate) type-level 
> implementation of Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] - 
> [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18658:


Assignee: (was: Apache Spark)

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Priority: Minor
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.
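
A hedged sketch of the proposed direction (method and file names are illustrative,
not the actual Spark writer API): hand each row to the {{FSDataOutputStream}} in
fragments so that no full line has to be assembled in memory first.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

object FragmentedRowWriterSketch {
  // Append one record as a sequence of byte-array fragments; only the record
  // terminator marks the row boundary.
  def writeRowInFragments(out: FSDataOutputStream, fragments: Iterator[Array[Byte]]): Unit = {
    fragments.foreach(f => out.write(f))
    out.write('\n'.toInt)
  }

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path("/tmp/part-00000.json"))
    try {
      writeRowInFragments(out, Iterator(
        """{"id":1,""".getBytes("UTF-8"),
        """"payload":"..."}""".getBytes("UTF-8")))
    } finally {
      out.close()
    }
  }
}
{code}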



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18658:


Assignee: Apache Spark

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Assignee: Apache Spark
>Priority: Minor
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710116#comment-15710116
 ] 

Apache Spark commented on SPARK-18658:
--

User 'NathanHowell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16089

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Priority: Minor
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18481:
--
Description: 
Remove deprecated methods for ML.

This task removed the following (deprecated) public APIs in org.apache.spark.ml:
* classification.RandomForestClassificationModel.numTrees  (This now refers to 
the Param called "numTrees")
* feature.ChiSqSelectorModel.setLabelCol
* regression.LinearRegressionSummary.model
* regression.RandomForestRegressionModel.numTrees  (This now refers to the 
Param called "numTrees")
* PipelineStage.validateParams
* Evaluator.validateParams

This task made the following changes to match existing patterns for Params:
* These methods were made final:
** classification.RandomForestClassificationModel.getNumTrees
** regression.RandomForestRegressionModel.getNumTrees
* These methods return the concrete class type, rather than an arbitrary trait. 
 This only affected Java compatibility, not Scala.
** classification.RandomForestClassificationModel.setFeatureSubsetStrategy
** regression.RandomForestRegressionModel.setFeatureSubsetStrategy


  was:
Remove deprecated methods for ML.

This task removed the following (deprecated) public APIs in org.apache.spark.ml:
* classification.RandomForestClassificationModel.numTrees  (This now refers to 
the Param called "numTrees")
* feature.ChiSqSelectorModel.setLabelCol
* regression.LinearRegressionSummary.model
* regression.RandomForestRegressionModel.numTrees  (This now refers to the 
Param called "numTrees")
* PipelineStage.validateParams

This task made the following changes to match existing patterns for Params:
* These methods were made final:
** classification.RandomForestClassificationModel.getNumTrees
** regression.RandomForestRegressionModel.getNumTrees
* These methods return the concrete class type, rather than an arbitrary trait. 
 This only affected Java compatibility, not Scala.
** classification.RandomForestClassificationModel.setFeatureSubsetStrategy
** regression.RandomForestRegressionModel.setFeatureSubsetStrategy



> ML 2.1 QA: Remove deprecated methods for ML 
> 
>
> Key: SPARK-18481
> URL: https://issues.apache.org/jira/browse/SPARK-18481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Remove deprecated methods for ML.
> This task removed the following (deprecated) public APIs in 
> org.apache.spark.ml:
> * classification.RandomForestClassificationModel.numTrees  (This now refers 
> to the Param called "numTrees")
> * feature.ChiSqSelectorModel.setLabelCol
> * regression.LinearRegressionSummary.model
> * regression.RandomForestRegressionModel.numTrees  (This now refers to the 
> Param called "numTrees")
> * PipelineStage.validateParams
> * Evaluator.validateParams
> This task made the following changes to match existing patterns for Params:
> * These methods were made final:
> ** classification.RandomForestClassificationModel.getNumTrees
> ** regression.RandomForestRegressionModel.getNumTrees
> * These methods return the concrete class type, rather than an arbitrary 
> trait.  This only affected Java compatibility, not Scala.
> ** classification.RandomForestClassificationModel.setFeatureSubsetStrategy
> ** regression.RandomForestRegressionModel.setFeatureSubsetStrategy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18660) Parquet complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

2016-11-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-18660:


 Summary: Parquet complains "Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl "
 Key: SPARK-18660
 URL: https://issues.apache.org/jira/browse/SPARK-18660
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


Parquet's record reader always complains "Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl". It looks like we always 
create a TaskAttemptContextImpl 
(https://github.com/apache/spark/blob/2f7461f31331cfc37f6cfa3586b7bbefb3af5547/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L368),
 but Parquet wants to use TaskInputOutputContext, which is a subclass of 
TaskAttemptContextImpl. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18546) UnsafeShuffleWriter corrupts encrypted shuffle files when merging

2016-11-30 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18546.

   Resolution: Fixed
Fix Version/s: 2.1.1

> UnsafeShuffleWriter corrupts encrypted shuffle files when merging
> -
>
> Key: SPARK-18546
> URL: https://issues.apache.org/jira/browse/SPARK-18546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 2.1.1
>
>
> The merging algorithm in {{UnsafeShuffleWriter}} does not consider 
> encryption, and when it tries to merge encrypted files the resulting data cannot 
> be read, since data encrypted with different initialization vectors is 
> interleaved in the same partition's data. This leads to exceptions when trying 
> to read the files during shuffle:
> {noformat}
> com.esotericsoftware.kryo.KryoException: com.ning.compress.lzf.LZFException: 
> Corrupt input data, block did not start with 2 byte signature ('ZV') followed 
> by type byte, 2-byte length)
>   at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
>   at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
>   at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
>   at 
> org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:512)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:533)
> ...
> {noformat}
> (This is our internal branch so don't worry if lines don't necessarily match.)
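
A hedged conceptual sketch of why a byte-level concatenation cannot work and what a
merge has to do instead (plain JCE code, not Spark's CryptoStreamUtils; the key, IVs
and file paths are caller-supplied): each spill file carries its own IV, so the
merged output must be produced by decrypting every input and re-encrypting under a
single IV.

{code}
import java.io.{FileInputStream, FileOutputStream}
import javax.crypto.{Cipher, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

object EncryptedMergeSketch {
  def mergeEncrypted(key: Array[Byte], spills: Seq[(String, Array[Byte])],
                     outFile: String, outIv: Array[Byte]): Unit = {
    val secret = new SecretKeySpec(key, "AES")
    val encCipher = Cipher.getInstance("AES/CTR/NoPadding")
    encCipher.init(Cipher.ENCRYPT_MODE, secret, new IvParameterSpec(outIv))
    val out = new CipherOutputStream(new FileOutputStream(outFile), encCipher)
    try {
      for ((path, iv) <- spills) {
        // Decrypt each spill with its own IV instead of copying raw ciphertext.
        val decCipher = Cipher.getInstance("AES/CTR/NoPadding")
        decCipher.init(Cipher.DECRYPT_MODE, secret, new IvParameterSpec(iv))
        val in = new CipherInputStream(new FileInputStream(path), decCipher)
        try {
          val buf = new Array[Byte](8192)
          var n = in.read(buf)
          while (n != -1) {
            out.write(buf, 0, n)  // re-encrypted under the single output IV
            n = in.read(buf)
          }
        } finally {
          in.close()
        }
      }
    } finally {
      out.close()
    }
  }
}
{code}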



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2016-11-30 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709943#comment-15709943
 ] 

Marcelo Vanzin commented on SPARK-18085:


I uploaded code for milestone 3 from the document:
https://github.com/vanzin/spark/tree/shs-ng/M3

It doesn't have a whole lot of tests, but it's supposed to be a building step 
for the next milestones. I also updated the M1 and M2 branches with 
enhancements / fixes I found while writing the M3 code.

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18659:
---
Description: 
The first three test cases fail due to a crash in hive client when dropping 
partitions that don't contain files. The last one deletes too many files due to 
a partition case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}

  was:
The following test cases fail due to a crash in hive client when dropping 
partitions that don't contain files. The last one deletes too many files due to 
a partition case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}


> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The first three test cases fail due to a crash in hive client when dropping 
> partitions that don't contain files. The last one deletes too many files due 
> to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")

[jira] [Assigned] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18659:


Assignee: (was: Apache Spark)

> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The following test cases fail due to a crash in hive client when dropping 
> partitions that don't contain files. The last one deletes too many files due 
> to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (A, B) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("qux") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a=1, b) select id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 10)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709938#comment-15709938
 ] 

Apache Spark commented on SPARK-18659:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/16088

> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The following test cases fail due to a crash in hive client when dropping 
> partitions that don't contain files. The last one deletes too many files due 
> to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (A, B) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("qux") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a=1, b) select id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 10)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18659:


Assignee: Apache Spark

> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Blocker
>
> The following test cases fail due to a crash in hive client when dropping 
> partitions that don't contain files. The last one deletes too many files due 
> to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (A, B) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("qux") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a=1, b) select id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 10)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18659:
---
Description: 
The following test cases fail due to a crash in hive client when dropping 
partitions that don't contain files. The last one deletes too many files due to 
a partition case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}

  was:
The following test cases fail due to a crash in hive client when dropping 
partitions that don't contain files. The last one crashes due to a partition 
case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}


> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The following test cases fail due to a crash in the Hive client when dropping 
> partitions that don't contain files. The last one deletes too many files due 
> to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> 

[jira] [Assigned] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18656:


Assignee: Apache Spark

> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
> requires too much memory in case of many columns
> --
>
> Key: SPARK-18656
> URL: https://issues.apache.org/jira/browse/SPARK-18656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sina Sohangir
>Assignee: Apache Spark
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
> is implemented in a way that causes out-of-memory errors when the number of 
> columns is high.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2016-11-30 Thread Sina Sohangir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709930#comment-15709930
 ] 

Sina Sohangir commented on SPARK-18656:
---

Create a PR:
https://github.com/apache/spark/pull/16087



> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
> requires too much memory in case of many columns
> --
>
> Key: SPARK-18656
> URL: https://issues.apache.org/jira/browse/SPARK-18656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sina Sohangir
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
> is implemented in a way that causes out-of-memory errors when the number of 
> columns is high.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2016-11-30 Thread Sina Sohangir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sina Sohangir updated SPARK-18656:
--
Comment: was deleted

(was: Created a PR:
https://github.com/apache/spark/pull/16087

)

> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
> requires too much memory in case of many columns
> --
>
> Key: SPARK-18656
> URL: https://issues.apache.org/jira/browse/SPARK-18656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sina Sohangir
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
> is implemented in a way that causes out-of-memory errors when the number of 
> columns is high.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2016-11-30 Thread Sina Sohangir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709930#comment-15709930
 ] 

Sina Sohangir edited comment on SPARK-18656 at 11/30/16 10:03 PM:
--

Created a PR:
https://github.com/apache/spark/pull/16087




was (Author: sina.sohangir):
Create a PR:
https://github.com/apache/spark/pull/16087



> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
> requires too much memory in case of many columns
> --
>
> Key: SPARK-18656
> URL: https://issues.apache.org/jira/browse/SPARK-18656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sina Sohangir
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
> is implemented in a way that causes out-of-memory errors when the number of 
> columns is high.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709929#comment-15709929
 ] 

Apache Spark commented on SPARK-18656:
--

User 'sinasohangirsc' has created a pull request for this issue:
https://github.com/apache/spark/pull/16087

> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
> requires too much memory in case of many columns
> --
>
> Key: SPARK-18656
> URL: https://issues.apache.org/jira/browse/SPARK-18656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sina Sohangir
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
> is implemented in a way that causes out-of-memory errors when the number of 
> columns is high.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18656:


Assignee: (was: Apache Spark)

> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
> requires too much memory in case of many columns
> --
>
> Key: SPARK-18656
> URL: https://issues.apache.org/jira/browse/SPARK-18656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sina Sohangir
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
> is implemented in a way that causes out-of-memory errors when the number of 
> columns is high.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18659:
---
Summary: Incorrect behaviors in overwrite table for datasource tables  
(was: Crash in overwrite table partitions due to hive metastore integration)

> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The following test cases fail due to a crash in the Hive client when dropping 
> partitions that don't contain files. The last one crashes due to a partition 
> case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (A, B) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("qux") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a=1, b) select id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 10)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709869#comment-15709869
 ] 

Cheng Lian commented on SPARK-18251:


One more comment about why we shouldn't allow an {{Option\[T <: Product\]}} to 
be used as a top-level Dataset type: one way to think about this more intuitively 
is to make an analogy to databases. In a database table, you cannot mark a row 
itself as null. Instead, you are only allowed to mark a field of a row as null.

Instead of using {{Option\[T <: Product\]}}, the user should resort to 
{{Tuple1\[T <: Product\]}}. Thus, you have a row consisting of a single field, 
which can be filled with either a null or a struct.
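
For what it's worth, a rough sketch of that workaround (assuming a Spark 2.x 
{{SparkSession}} with {{spark.implicits._}} in scope, and reusing the {{DataRow}} 
case class from the description):

{code}
// Sketch of the Tuple1 workaround: the row now has a single nullable struct
// field, so None becomes a null field rather than a null top-level row.
case class DataRow(id: Int, value: String)

val ds = Seq(
  Tuple1(Some(DataRow(1, "a")): Option[DataRow]),
  Tuple1(None: Option[DataRow])
).toDS()

ds.show()  // the second row renders as a null struct instead of failing at runtime
{code}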

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduced by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18251:
---
Assignee: Wenchen Fan

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduced by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-18251.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15979
[https://github.com/apache/spark/pull/15979]

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduced by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-11-30 Thread Nathan Howell (JIRA)
Nathan Howell created SPARK-18658:
-

 Summary: Writing to a text DataSource buffers one or more lines in 
memory
 Key: SPARK-18658
 URL: https://issues.apache.org/jira/browse/SPARK-18658
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.2
Reporter: Nathan Howell
Priority: Minor


The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
memory prior to writing to disk. For large rows this is inefficient. It may 
make sense to skip the {{TextOutputFormat}} record writer and go directly to 
the underlying {{FSDataOutputStream}}, allowing the writers to append arbitrary 
byte arrays (fractions of a row) instead of a full row.
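
To make the idea a bit more concrete, here is a minimal sketch of appending row 
fragments straight to an {{FSDataOutputStream}} via the Hadoop {{FileSystem}} API. 
It is illustrative only: the path and the JSON fragments are made up, and this is 
not the actual datasource writer.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Hypothetical output file; a real writer would get this from the task context.
val out = fs.create(new Path("/tmp/part-00000.json"))
try {
  // Each fragment is written as soon as it is serialized, so the complete
  // line never has to be materialized in memory.
  out.write("{\"id\":1,".getBytes("UTF-8"))
  out.write("\"value\":\"x\"}".getBytes("UTF-8"))
  out.write('\n'.toInt)
} finally {
  out.close()
}
{code}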



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18659) Crash in overwrite table partitions due to hive metastore integration

2016-11-30 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18659:
--

 Summary: Crash in overwrite table partitions due to hive metastore 
integration
 Key: SPARK-18659
 URL: https://issues.apache.org/jira/browse/SPARK-18659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Blocker


The following test cases fail due to a crash in the Hive client when dropping 
partitions that don't contain files. The last one crashes due to a partition 
case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs

2016-11-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709811#comment-15709811
 ] 

Joseph K. Bradley commented on SPARK-18318:
---

I did a quick check too and did not see anything missed.  Thanks!

> ML, Graph 2.1 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-18318
> URL: https://issues.apache.org/jira/browse/SPARK-18318
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs

2016-11-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18318.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16009
[https://github.com/apache/spark/pull/16009]

> ML, Graph 2.1 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-18318
> URL: https://issues.apache.org/jira/browse/SPARK-18318
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18657) Persist UUID across query restart

2016-11-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18657:


 Summary: Persist UUID across query restart
 Key: SPARK-18657
 URL: https://issues.apache.org/jira/browse/SPARK-18657
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Michael Armbrust
Priority: Critical


We probably also want to add an instance ID or something that changes when the 
query restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18274) Memory leak in PySpark StringIndexer

2016-11-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18274:
--
Target Version/s: 2.0.3, 2.1.1, 2.2.0  (was: 2.0.3, 2.1.0)

> Memory leak in PySpark StringIndexer
> 
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
>Reporter: Jonas Amrich
>Priority: Critical
>
> StringIndexerModel won't get collected by GC in Java even when deleted in 
> Python. It can be reproduced by this code, which fails after a couple of 
> iterations (around 7 if you set driver memory to 600MB): 
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) 
> for _ in range(int(7e5))]  # 700,000 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
> indexer = StringIndexer(inputCol='string', outputCol='index')
> indexer.fit(df)
> {code}
> Explicit call to Python GC fixes the issue - following code runs fine:
> {code}
> for i in range(50):
> indexer = StringIndexer(inputCol='string', outputCol='index')
> indexer.fit(df)
> gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling JVM 
> detach in the model's destructor. This is implemented in 
> pyspark.mllib.common.JavaModelWrapper but missing in 
> pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be 
> affected by this memory leak. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18563) mapWithState: initialState should have a timeout setting per record

2016-11-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18563:
-
Component/s: (was: Structured Streaming)
 DStreams

> mapWithState: initialState should have a timeout setting per record
> ---
>
> Key: SPARK-18563
> URL: https://issues.apache.org/jira/browse/SPARK-18563
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Reporter: Daniel Haviv
>
> When passing an initialState for mapWithState, there should be a possibility 
> to set a timeout at the record level.
> If, for example, mapWithState is configured with a 48H timeout, loading an 
> initialState will cause the state to bloat and hold 96H of data and then 
> release 48H of data at once.
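
To make the limitation concrete, a sketch of the current API (the socket source, 
checkpoint directory, and initial pairs below are illustrative assumptions): the 
{{timeout}} on {{StateSpec}} applies to every key uniformly, so entries loaded 
through {{initialState}} cannot carry their own remaining lifetime.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

object MapWithStateTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapWithState-timeout-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/mapWithState-sketch")  // hypothetical checkpoint dir

    // State restored from storage; these keys immediately get the full 48H lifetime.
    val initialState = ssc.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
    val counts = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

    def mappingFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    val spec = StateSpec
      .function(mappingFunc _)
      .initialState(initialState)
      .timeout(Minutes(48 * 60))  // one global timeout; no per-record override today

    counts.mapWithState(spec).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}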



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky

2016-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18588:
-
Target Version/s: 2.1.0

> KafkaSourceStressForDontFailOnDataLossSuite is flaky
> 
>
> Key: SPARK-18588
> URL: https://issues.apache.org/jira/browse/SPARK-18588
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Herman van Hovell
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite_name=stress+test+for+failOnDataLoss%3Dfalse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16545) Structured Streaming : foreachSink creates the Physical Plan multiple times per TriggerInterval

2016-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-16545.
--
Resolution: Later

> Structured Streaming : foreachSink creates the Physical Plan multiple times 
> per TriggerInterval 
> 
>
> Key: SPARK-16545
> URL: https://issues.apache.org/jira/browse/SPARK-16545
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: Mario Briggs
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server

2016-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18655:
-
Fix Version/s: (was: 2.1.0)

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> SPARK-18516 changes the event log format of Structured Streaming. We should 
> make sure our changes do not break the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server

2016-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18655:
-
Target Version/s: 2.1.0

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> SPARK-18516 changes the event log format of Structured Streaming. We should 
> make sure our changes do not break the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2016-11-30 Thread Sina Sohangir (JIRA)
Sina Sohangir created SPARK-18656:
-

 Summary: 
org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
requires too much memory in case of many columns
 Key: SPARK-18656
 URL: https://issues.apache.org/jira/browse/SPARK-18656
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sina Sohangir


org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
is implemented in a way that causes out-of-memory errors when the number of 
columns is high.
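
Until a fix lands, a possible column-at-a-time workaround through the public 
{{approxQuantile}} API (a sketch only; the data and column names below are made 
up). It trades extra passes over the data for keeping only one column's quantile 
summaries in memory at a time:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("quantile-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up data: two numeric columns.
val df = (1 to 100000).map(i => (i.toDouble, (i % 100).toDouble)).toDF("c1", "c2")
val probabilities = Array(0.25, 0.5, 0.75)

// One pass per column instead of building all per-column summaries at once.
val quantilesPerColumn: Map[String, Array[Double]] =
  df.columns.map(c => c -> df.stat.approxQuantile(c, probabilities, 0.01)).toMap
{code}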



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18536) Failed to save to hive table when case class with empty field

2016-11-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709551#comment-15709551
 ] 

Reynold Xin commented on SPARK-18536:
-

We need to add a PreWriteCheck for Parquet.
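
For illustration, the kind of schema validation such a check could perform before 
the Parquet writer is even constructed (a sketch only, not the actual analyzer 
rule): walk the schema and reject any struct with zero fields, since Parquet 
cannot encode an empty group.

{code}
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

// Recursively reject empty structs anywhere in the schema.
def assertNoEmptyStructs(dataType: DataType, path: String = "root"): Unit = dataType match {
  case s: StructType if s.fields.isEmpty =>
    throw new IllegalArgumentException(s"Empty struct at '$path' cannot be written to Parquet")
  case s: StructType =>
    s.fields.foreach(f => assertNoEmptyStructs(f.dataType, s"$path.${f.name}"))
  case a: ArrayType => assertNoEmptyStructs(a.elementType, s"$path.element")
  case m: MapType =>
    assertNoEmptyStructs(m.keyType, s"$path.key")
    assertNoEmptyStructs(m.valueType, s"$path.value")
  case _ => // leaf types are fine
}

// e.g. assertNoEmptyStructs(rdd.toDF.schema) would fail fast for the EmptyCTable
// example in the description instead of crashing inside the Parquet writer.
{code}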


> Failed to save to hive table when case class with empty field
> -
>
> Key: SPARK-18536
> URL: https://issues.apache.org/jira/browse/SPARK-18536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: pin_zhang
>
> {code}import scala.collection.mutable.Queue
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.streaming.Seconds
> import org.apache.spark.streaming.StreamingContext
> {code}
> 1. Test code
> {code}
> case class EmptyC()
> case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)
> object EmptyTest {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
> val ctx = new SparkContext(conf)
> val spark = 
> SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
> val seq = Seq(EmptyCTable(EmptyC(), 100L))
> val rdd = ctx.makeRDD[EmptyCTable](seq)
> val ssc = new StreamingContext(ctx, Seconds(1))
> val queue = Queue(rdd)
> val s = ssc.queueStream(queue, false);
> s.foreachRDD((rdd, time) => {
>   if (!rdd.isEmpty) {
> import spark.sqlContext.implicits._
> rdd.toDF.write.mode(SaveMode.Overwrite).saveAsTable("empty_table")
>   }
> })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> 2. Exception
> {noformat}
> Caused by: java.lang.IllegalStateException: Cannot build an empty group
>   at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
>   at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:554)
>   at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:426)
>   at org.apache.parquet.schema.Types$Builder.named(Types.java:228)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:527)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at 

[jira] [Updated] (SPARK-18536) Failed to save to hive table when case class with empty field

2016-11-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18536:

Description: 
{code}import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
{code}

1. Test code

{code}
case class EmptyC()
case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)

object EmptyTest {

  def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
val ctx = new SparkContext(conf)
val spark = 
SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
val seq = Seq(EmptyCTable(EmptyC(), 100L))
val rdd = ctx.makeRDD[EmptyCTable](seq)
val ssc = new StreamingContext(ctx, Seconds(1))

val queue = Queue(rdd)
val s = ssc.queueStream(queue, false);
s.foreachRDD((rdd, time) => {
  if (!rdd.isEmpty) {
import spark.sqlContext.implicits._
rdd.toDF.write.mode(SaveMode.Overwrite).saveAsTable("empty_table")
  }
})

ssc.start()
ssc.awaitTermination()

  }

}
{code}

2. Exception
{noformat}
Caused by: java.lang.IllegalStateException: Cannot build an empty group
at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:554)
at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:426)
at org.apache.parquet.schema.Types$Builder.named(Types.java:228)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:527)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
... 3 more
 {noformat}

  was:

import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
1. Test code
case class EmptyC()
case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)

object EmptyTest {

  def main(args: Array[String]): Unit = {
val conf = new 

[jira] [Commented] (SPARK-18653) Dataset.show() generates incorrect padding for Unicode Character

2016-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709468#comment-15709468
 ] 

Apache Spark commented on SPARK-18653:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/16086

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-18653
> URL: https://issues.apache.org/jira/browse/SPARK-18653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> The following program generates incorrect space padding for 
> {{Dataset.show()}} when a column name or column value contains Unicode characters
> Program
> {code:java}
> case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
> val ds = Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS
> ds.show
> {code}
> Output
> {code}
> +---+---++
> | 整数| 実数|   s|
> +---+---++
> |  1|1.1|文字列1|
> +---+---++
> {code}
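
A sketch of the padding idea behind the report (not Spark's code): compute the 
display width of a cell by counting East Asian wide/full-width code points as two 
columns, and pad on that width instead of {{String.length}}. The block list below 
is a rough heuristic, not an exhaustive one.

{code}
import java.lang.Character.UnicodeBlock
import java.lang.Character.UnicodeBlock._

// Approximate display width: wide CJK code points count as two columns.
def displayWidth(s: String): Int =
  s.codePoints().toArray.map { cp =>
    val block = UnicodeBlock.of(cp)
    if (block == CJK_UNIFIED_IDEOGRAPHS || block == HIRAGANA || block == KATAKANA ||
        block == HANGUL_SYLLABLES || block == CJK_SYMBOLS_AND_PUNCTUATION ||
        block == HALFWIDTH_AND_FULLWIDTH_FORMS) 2 else 1
  }.sum

displayWidth("整数")  // 4 display columns, while "整数".length is 2
displayWidth("s")     // 1
{code}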



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18653) Dataset.show() generates incorrect padding for Unicode Character

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18653:


Assignee: Apache Spark

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-18653
> URL: https://issues.apache.org/jira/browse/SPARK-18653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> The following program generates incorrect space padding for 
> {{Dataset.show()}} when a column name or column value contains Unicode characters
> Program
> {code:java}
> case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
> val ds = Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS
> ds.show
> {code}
> Output
> {code}
> +---+---++
> | 整数| 実数|   s|
> +---+---++
> |  1|1.1|文字列1|
> +---+---++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18653) Dataset.show() generates incorrect padding for Unicode Character

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18653:


Assignee: (was: Apache Spark)

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-18653
> URL: https://issues.apache.org/jira/browse/SPARK-18653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> The following program generates incorrect space padding for 
> {{Dataset.show()}} when a column name or column value contains Unicode characters
> Program
> {code:java}
> case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
> val ds = Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS
> ds.show
> {code}
> Output
> {code}
> +---+---++
> | 整数| 実数|   s|
> +---+---++
> |  1|1.1|文字列1|
> +---+---++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18655:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Blocker
> Fix For: 2.1.0
>
>
> SPARK-18516 changes the event log format of Structured Streaming. We should 
> make sure our changes do not break the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server

2016-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18655:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.1.0
>
>
> SPARK-18516 changes the event log format of Structured Streaming. We should 
> make sure our changes do not break the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


