[jira] [Commented] (SPARK-18538) Concurrent Fetching DataFrameReader JDBC APIs Do Not Work
[ https://issues.apache.org/jira/browse/SPARK-18538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711221#comment-15711221 ] Wenchen Fan commented on SPARK-18538: - Already merged to master; will resolve this ticket once we backport it to 2.1. > Concurrent Fetching DataFrameReader JDBC APIs Do Not Work > - > > Key: SPARK-18538 > URL: https://issues.apache.org/jira/browse/SPARK-18538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > {code} > def jdbc( > url: String, > table: String, > columnName: String, > lowerBound: Long, > upperBound: Long, > numPartitions: Int, > connectionProperties: Properties): DataFrame > {code} > {code} > def jdbc( > url: String, > table: String, > predicates: Array[String], > connectionProperties: Properties): DataFrame > {code} > The two DataFrameReader JDBC APIs above ignore the user-specified degree-of-parallelism parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
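[Editorial note] The column-based `jdbc` variant is supposed to split the range [lowerBound, upperBound) into `numPartitions` stride ranges, each fetched concurrently via its own WHERE predicate. A minimal Python sketch of that stride computation, simplified from (not identical to) Spark's internal `columnPartition` logic; the function name is hypothetical:

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into stride ranges, one WHERE
    predicate per partition. The first and last partitions are left
    open-ended so rows outside the bounds are not dropped."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower and upper:
            predicates.append(f"{lower} AND {upper}")
        else:
            predicates.append(upper or lower)
    return predicates
```

Each resulting predicate corresponds to one concurrently-fetched partition; the bug is that the user-supplied parallelism parameters were being ignored rather than producing these partitions.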
[jira] [Commented] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711171#comment-15711171 ] Takeshi Yamamuro commented on SPARK-18620: -- I quickly checked, and setting max records in the Kinesis workers is not enough, because the workers cannot limit the number of aggregated messages (http://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-concepts.html#d0e5184). For example, if we set the workers' max records to 10 and a producer aggregates two records into one message, the workers actually receive 20 records per callback invocation. My hunch is that we need to control the number of records pushed into the receiver in KinesisRecordProcessor#processRecords (https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala#L68). > Spark Streaming + Kinesis : Receiver MaxRate is violated > > > Key: SPARK-18620 > URL: https://issues.apache.org/jira/browse/SPARK-18620 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: david przybill >Priority: Minor > Labels: kinesis > > I am calling spark-submit passing maxRate; I have a single kinesis receiver > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, i.e. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. 
> If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor that has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, a constant set to 10,000 records.
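[Editorial note] The client-side cap the commenter proposes for processRecords amounts to chunking an over-sized callback batch so that no more than maxRate records are pushed to the receiver per interval. A pure-Python illustration of that chunking step (the helper name is hypothetical, not the actual KCL or Spark API):

```python
def chunk_records(records, max_rate):
    """Split a possibly over-sized KCL callback batch into chunks of at
    most max_rate records, so each chunk can be handed to the receiver
    in its own rate-limited interval instead of all at once."""
    if max_rate <= 0:
        raise ValueError("max_rate must be positive")
    return [records[i:i + max_rate] for i in range(0, len(records), max_rate)]
```

With maxRate=10 and an aggregated callback of 25 records, this yields batches of 10, 10, and 5 rather than a single 25-record push.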
[jira] [Created] (SPARK-18667) input_file_name function does not work with UDF
Hyukjin Kwon created SPARK-18667: Summary: input_file_name function does not work with UDF Key: SPARK-18667 URL: https://issues.apache.org/jira/browse/SPARK-18667 Project: Spark Issue Type: Bug Components: PySpark Reporter: Hyukjin Kwon {{input_file_name()}} returns an empty string instead of the file name when it is used as input to a UDF in PySpark, as below. With the data as below:
{code}
{"a": 1}
{code}
the code below:
{code}
from pyspark.sql.functions import *
from pyspark.sql.types import *

def filename(path):
    return path

sourceFile = udf(filename, StringType())
spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
{code}
prints as below:
{code}
+---------------------------+
|filename(input_file_name())|
+---------------------------+
|                           |
+---------------------------+
{code}
but the code below:
{code}
spark.read.json("tmp.json").select(input_file_name()).show()
{code}
prints correctly as below:
{code}
+--------------------+
|   input_file_name()|
+--------------------+
|file:///Users/hyu...|
+--------------------+
{code}
This seems to be a PySpark-specific issue.
[jira] [Commented] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”
[ https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711100#comment-15711100 ] Apache Spark commented on SPARK-18665: -- User 'cenyuhai' has created a pull request for this issue: https://github.com/apache/spark/pull/16097 > Spark ThriftServer jobs that are canceled are still “STARTED” > -- > > Key: SPARK-18665 > URL: https://issues.apache.org/jira/browse/SPARK-18665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3 >Reporter: cen yuhai > Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, > 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png > > > I find that some jobs are canceled, but their state is still "STARTED"; I > think this bug was introduced by SPARK-6964
[jira] [Assigned] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”
[ https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18665: Assignee: (was: Apache Spark) > Spark ThriftServer jobs that are canceled are still “STARTED” > -- > > Key: SPARK-18665 > URL: https://issues.apache.org/jira/browse/SPARK-18665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3 >Reporter: cen yuhai > Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, > 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png > > > I find that some jobs are canceled, but their state is still "STARTED"; I > think this bug was introduced by SPARK-6964
[jira] [Assigned] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”
[ https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18665: Assignee: Apache Spark > Spark ThriftServer jobs that are canceled are still “STARTED” > -- > > Key: SPARK-18665 > URL: https://issues.apache.org/jira/browse/SPARK-18665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3 >Reporter: cen yuhai >Assignee: Apache Spark > Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, > 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png > > > I find that some jobs are canceled, but their state is still "STARTED"; I > think this bug was introduced by SPARK-6964
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711095#comment-15711095 ] Nick Pentreath commented on SPARK-12347: Since the PR is still WIP and this is not a blocker for 2.1, I've retargeted to 2.2 > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}?
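[Editorial note] The design sketch in the issue (programmatic discovery plus a special-case table for examples that need arguments) could look roughly like the following. This is a sketch, not the eventual script: the directory layout, the `SPECIAL_ARGS` entries, and the `spark-submit` invocation are assumptions for illustration.

```python
import glob
import os
import subprocess

# Examples that require extra command-line arguments (hypothetical entries).
SPECIAL_ARGS = {
    "als_example.py": ["--rank", "5"],
}

def discover_examples(root):
    """List example scripts programmatically rather than from a fixed list,
    so newly added examples are picked up automatically."""
    return sorted(glob.glob(os.path.join(root, "**", "*.py"), recursive=True))

def run_examples(root, submit="spark-submit"):
    """Run every discovered example, adding special-case arguments where needed."""
    for path in discover_examples(root):
        args = SPECIAL_ARGS.get(os.path.basename(path), [])
        subprocess.run([submit, path, *args], check=True)
```

Keeping discovery separate from execution means the same discovery step can be reused per language (Python, Scala, R) as subtasks while a single top-level script drives them all.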
[jira] [Updated] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-12347: --- Target Version/s: 2.2.0 (was: 2.1.0) > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}?
[jira] [Updated] (SPARK-18638) Upgrade sbt, zinc and maven plugins
[ https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiqing Yang updated SPARK-18638: - Description: v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and upgrade it from 0.13.11 to 0.13.13. The release notes since the last version we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some regression fixes. This jira will also update Zinc and Maven plugins. {code} sbt: 0.13.11 -> 0.13.13, zinc: 0.3.9 -> 0.3.11, maven-assembly-plugin: 2.6 -> 3.0.0 maven-compiler-plugin: 3.5.1 -> 3.6. maven-jar-plugin: 2.6 -> 3.0.2 maven-javadoc-plugin: 2.10.3 -> 2.10.4 maven-source-plugin: 2.4 -> 3.0.1 org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12 org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0 {code} was: v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and upgrade it from 0.13.11 to 0.13.13. The release notes since the last version we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some regression fixes. > Upgrade sbt, zinc and maven plugins > --- > > Key: SPARK-18638 > URL: https://issues.apache.org/jira/browse/SPARK-18638 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Minor > > v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and > upgrade it from 0.13.11 to 0.13.13. The release notes since the last version > we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and > https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some > regression fixes. This jira will also update Zinc and Maven plugins. > {code} >sbt: 0.13.11 -> 0.13.13, >zinc: 0.3.9 -> 0.3.11, >maven-assembly-plugin: 2.6 -> 3.0.0 >maven-compiler-plugin: 3.5.1 -> 3.6. >maven-jar-plugin: 2.6 -> 3.0.2 >maven-javadoc-plugin: 2.10.3 -> 2.10.4 >maven-source-plugin: 2.4 -> 3.0.1 >org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12 >org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0 > {code}
[jira] [Updated] (SPARK-18638) Upgrade sbt, zinc and maven plugins
[ https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiqing Yang updated SPARK-18638: - Summary: Upgrade sbt, zinc and maven plugins (was: Upgrade sbt to 0.13.13) > Upgrade sbt, zinc and maven plugins > --- > > Key: SPARK-18638 > URL: https://issues.apache.org/jira/browse/SPARK-18638 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Minor > > v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and > upgrade it from 0.13.11 to 0.13.13. The release notes since the last version > we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and > https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some > regression fixes.
[jira] [Commented] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710928#comment-15710928 ] Apache Spark commented on SPARK-18617: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/16096 > Close "kryo auto pick" feature for Spark Streaming > -- > > Key: SPARK-18617 > URL: https://issues.apache.org/jira/browse/SPARK-18617 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Genmao Yu >Assignee: Genmao Yu > Fix For: 2.1.0 > > > [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to > fix the bug, i.e. {{receiver data can not be deserialized properly}}. As > [~zsxwing] said, it is a critical bug, but we should not break APIs between > maintenance releases. It may be a rational choice to disable the {{auto pick kryo > serializer}} feature for Spark Streaming as a first step.
[jira] [Updated] (SPARK-18666) Remove the codes checking deprecated config spark.sql.unsafe.enabled
[ https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-18666: Description: spark.sql.unsafe.enabled has been deprecated since 1.6. There is still code in the Web UI that checks it. We should remove the check and clean up the code. (was: spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in the Web UI that checks it. We should remove the check and clean up the code.) > Remove the codes checking deprecated config spark.sql.unsafe.enabled > > > Key: SPARK-18666 > URL: https://issues.apache.org/jira/browse/SPARK-18666 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Liang-Chi Hsieh >Priority: Minor > > spark.sql.unsafe.enabled has been deprecated since 1.6. There is still code in > the Web UI that checks it. We should remove the check and clean up the code.
[jira] [Assigned] (SPARK-18666) Remove the codes checking deprecated config spark.sql.unsafe.enabled
[ https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18666: Assignee: (was: Apache Spark) > Remove the codes checking deprecated config spark.sql.unsafe.enabled > > > Key: SPARK-18666 > URL: https://issues.apache.org/jira/browse/SPARK-18666 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Liang-Chi Hsieh >Priority: Minor > > spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in > the Web UI that checks it. We should remove the check and clean up the code.
[jira] [Assigned] (SPARK-18666) Remove the codes checking deprecated config spark.sql.unsafe.enabled
[ https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18666: Assignee: Apache Spark > Remove the codes checking deprecated config spark.sql.unsafe.enabled > > > Key: SPARK-18666 > URL: https://issues.apache.org/jira/browse/SPARK-18666 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Minor > > spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in > the Web UI that checks it. We should remove the check and clean up the code.
[jira] [Commented] (SPARK-18666) Remove the codes checking deprecated config spark.sql.unsafe.enabled
[ https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710891#comment-15710891 ] Apache Spark commented on SPARK-18666: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/16095 > Remove the codes checking deprecated config spark.sql.unsafe.enabled > > > Key: SPARK-18666 > URL: https://issues.apache.org/jira/browse/SPARK-18666 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Liang-Chi Hsieh >Priority: Minor > > spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in > the Web UI that checks it. We should remove the check and clean up the code.
[jira] [Created] (SPARK-18666) Remove the codes checking deprecated config spark.sql.unsafe.enabled
Liang-Chi Hsieh created SPARK-18666: --- Summary: Remove the codes checking deprecated config spark.sql.unsafe.enabled Key: SPARK-18666 URL: https://issues.apache.org/jira/browse/SPARK-18666 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Liang-Chi Hsieh Priority: Minor spark.sql.unsafe.enabled has been deprecated since 2.0. There is still code in the Web UI that checks it. We should remove the check and clean up the code.
[jira] [Commented] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
[ https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710871#comment-15710871 ] koert kuipers commented on SPARK-17583: --- I see. So you are saying that in Spark 2.0.x it fails when the multiple lines that form a record end up in different splits? So basically it's not safe to use then. It just happened to work in my unit test because I had tiny part files that were never split up. > Remove unused rowSeparator variable and set auto-expanding buffer as default > for maxCharsPerColumn option in CSV > > > Key: SPARK-17583 > URL: https://issues.apache.org/jira/browse/SPARK-17583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This JIRA includes several changes below: > 1. Upgrade Univocity library from 2.1.1 to 2.2.1 > This includes some performance improvements and also enables the auto-expanding > buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release > notes|https://github.com/uniVocity/univocity-parsers/releases]. > 2. Remove the {{rowSeparator}} variable existing in {{CSVOptions}} > We have this variable in > [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127], > but it seems to cause confusion because it actually does not handle > {{\r\n}}. For example, we have an open issue about this, SPARK-17227, > describing this variable. > This option is virtually not being used because we rely on > {{LineRecordReader}} in Hadoop, which deals with both {{\n}} and {{\r\n}}. > 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding > We are setting 100 for the length of each column. It'd be more sensible if > we allowed auto-expanding rather than a fixed length by default. 
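[Editorial note] The failure mode discussed in the comment above (a logical CSV record spanning multiple physical lines, with the lines landing in different input splits) can be illustrated without Spark: a CSV-aware parser sees one record where a line-oriented record reader sees two broken fragments. A small illustration using Python's standard csv module:

```python
import csv
import io

# One logical record whose second field contains an embedded newline.
data = 'id,comment\n1,"line one\nline two"\n'

# A CSV-aware parser sees 1 header row + 1 data record:
rows = list(csv.reader(io.StringIO(data)))

# A naive per-line reader (the behavior of a line-based record reader)
# sees 3 "lines", splitting the quoted record in two:
lines = data.splitlines()
```

If those three physical lines fall into different splits, the fragments of the quoted record are parsed independently and the record is corrupted; this is why it only "happened to work" on tiny files that were never split.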
[jira] [Resolved] (SPARK-18476) SparkR Logistic Regression should support outputting the original label
[ https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-18476. - Resolution: Fixed Assignee: Miao Wang Fix Version/s: 2.1.0 > SparkR Logistic Regression should support outputting the original label > --- > > Key: SPARK-18476 > URL: https://issues.apache.org/jira/browse/SPARK-18476 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Miao Wang >Assignee: Miao Wang > Fix For: 2.1.0 > > > Similar to [SPARK-18401], as a classification algorithm, logistic regression > should support outputting the original label instead of the indexed label.
[jira] [Updated] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”
[ https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-18665: -- Attachment: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png > Spark ThriftServer jobs that are canceled are still “STARTED” > -- > > Key: SPARK-18665 > URL: https://issues.apache.org/jira/browse/SPARK-18665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3 >Reporter: cen yuhai > Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, > 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png > > > I find that some jobs are canceled, but their state is still "STARTED"; I > think this bug was introduced by SPARK-6964
[jira] [Updated] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”
[ https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-18665: -- Attachment: 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png > Spark ThriftServer jobs that are canceled are still “STARTED” > -- > > Key: SPARK-18665 > URL: https://issues.apache.org/jira/browse/SPARK-18665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3 >Reporter: cen yuhai > Attachments: 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png > > > I find that some jobs are canceled, but their state is still "STARTED"; I > think this bug was introduced by SPARK-6964
[jira] [Created] (SPARK-18665) Spark ThriftServer jobs that are canceled are still “STARTED”
cen yuhai created SPARK-18665: - Summary: Spark ThriftServer jobs that are canceled are still “STARTED” Key: SPARK-18665 URL: https://issues.apache.org/jira/browse/SPARK-18665 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.3 Reporter: cen yuhai I find that some jobs are canceled, but their state is still "STARTED"; I think this bug was introduced by SPARK-6964
[jira] [Assigned] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18541: Assignee: (was: Apache Spark) > Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management > in pyspark SQL API > > > Key: SPARK-18541 > URL: https://issues.apache.org/jira/browse/SPARK-18541 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.2 > Environment: all >Reporter: Shea Parkes >Priority: Minor > Labels: newbie > Original Estimate: 24h > Remaining Estimate: 24h > > In the Scala SQL API, you can pass in new metadata when you alias a field. > That functionality is not available in the Python API. Right now, you have > to painfully utilize {{SparkSession.createDataFrame}} to manipulate the > metadata for even a single column. I would propose to add the following > method to {{pyspark.sql.Column}}: > {code} > def aliasWithMetadata(self, name, metadata): > """ > Make a new Column that has the provided alias and metadata. > Metadata will be processed with json.dumps() > """ > _context = pyspark.SparkContext._active_spark_context > _metadata_str = json.dumps(metadata) > _metadata_jvm = > _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str) > _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm) > return Column(_new_java_column) > {code} > I can likely complete this request myself if there is any interest for it. > Just have to dust off my knowledge of doctest and the location of the python > tests.
[jira] [Assigned] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18541: Assignee: Apache Spark > Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management > in pyspark SQL API > > > Key: SPARK-18541 > URL: https://issues.apache.org/jira/browse/SPARK-18541 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.2 > Environment: all >Reporter: Shea Parkes >Assignee: Apache Spark >Priority: Minor > Labels: newbie > Original Estimate: 24h > Remaining Estimate: 24h > > In the Scala SQL API, you can pass in new metadata when you alias a field. > That functionality is not available in the Python API. Right now, you have > to painfully utilize {{SparkSession.createDataFrame}} to manipulate the > metadata for even a single column. I would propose to add the following > method to {{pyspark.sql.Column}}: > {code} > def aliasWithMetadata(self, name, metadata): > """ > Make a new Column that has the provided alias and metadata. > Metadata will be processed with json.dumps() > """ > _context = pyspark.SparkContext._active_spark_context > _metadata_str = json.dumps(metadata) > _metadata_jvm = > _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str) > _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm) > return Column(_new_java_column) > {code} > I can likely complete this request myself if there is any interest for it. > Just have to dust off my knowledge of doctest and the location of the python > tests.
[jira] [Commented] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710705#comment-15710705 ] Apache Spark commented on SPARK-18541: -- User 'shea-parkes' has created a pull request for this issue: https://github.com/apache/spark/pull/16094 > Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management > in pyspark SQL API > > > Key: SPARK-18541 > URL: https://issues.apache.org/jira/browse/SPARK-18541 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.2 > Environment: all >Reporter: Shea Parkes >Priority: Minor > Labels: newbie > Original Estimate: 24h > Remaining Estimate: 24h > > In the Scala SQL API, you can pass in new metadata when you alias a field. > That functionality is not available in the Python API. Right now, you have > to painfully utilize {{SparkSession.createDataFrame}} to manipulate the > metadata for even a single column. I would propose to add the following > method to {{pyspark.sql.Column}}: > {code} > def aliasWithMetadata(self, name, metadata): > """ > Make a new Column that has the provided alias and metadata. > Metadata will be processed with json.dumps() > """ > _context = pyspark.SparkContext._active_spark_context > _metadata_str = json.dumps(metadata) > _metadata_jvm = > _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str) > _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm) > return Column(_new_java_column) > {code} > I can likely complete this request myself if there is any interest for it. > Just have to dust off my knowledge of doctest and the location of the python > tests.
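[Editorial note] The proposal above hinges on serializing a Python metadata dict to JSON before handing it to the JVM's `Metadata.fromJson`. That serialization step can be illustrated in plain Python with no JVM involved; the dict contents and helper name below are illustrative only:

```python
import json

def metadata_to_json(metadata):
    """Serialize a column-metadata dict to a JSON string, the way the
    proposed aliasWithMetadata would before crossing to the JVM side."""
    return json.dumps(metadata, sort_keys=True)

meta = {"comment": "user age in years", "min": 0}
payload = metadata_to_json(meta)   # JSON string for Metadata.fromJson
restored = json.loads(payload)     # round-trips back to the same dict
```

The round-trip property (dict → JSON → dict) is what makes json.dumps a safe bridge for arbitrary JSON-compatible metadata values.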
[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework
[ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710667#comment-15710667 ] Ron Hu commented on SPARK-16026: Hi Reynold, I previously worked on filter cardinality estimation using the old statistics structure. Now I need to refactor my code using the new basic statistics structure we agreed on. As I am traveling on a business trip now, I will resume my work on Monday after I return to Bay Area. Zhenhua is currently busy with some customer tasks this week. He will return to work on CBO soon. > Cost-based Optimizer framework > -- > > Key: SPARK-16026 > URL: https://issues.apache.org/jira/browse/SPARK-16026 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Attachments: Spark_CBO_Design_Spec.pdf > > > This is an umbrella ticket to implement a cost-based optimizer framework > beyond broadcast join selection. This framework can be used to implement some > useful optimizations such as join reordering. > The design should discuss how to break the work down into multiple, smaller > logical units. For example, changes to statistics class, system catalog, cost > estimation/propagation in expressions, cost estimation/propagation in > operators can be done in decoupled pull requests.
[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework
[ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710618#comment-15710618 ] Reynold Xin commented on SPARK-16026: - [~ZenWzh] can we start working on operator cardinality estimation propagation based on what's in the catalog right now? > Cost-based Optimizer framework > -- > > Key: SPARK-16026 > URL: https://issues.apache.org/jira/browse/SPARK-16026 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Attachments: Spark_CBO_Design_Spec.pdf > > > This is an umbrella ticket to implement a cost-based optimizer framework > beyond broadcast join selection. This framework can be used to implement some > useful optimizations such as join reordering. > The design should discuss how to break the work down into multiple, smaller > logical units. For example, changes to statistics class, system catalog, cost > estimation/propagation in expressions, cost estimation/propagation in > operators can be done in decoupled pull requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18663) Simplify CountMinSketch aggregate implementation
[ https://issues.apache.org/jira/browse/SPARK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710616#comment-15710616 ] Apache Spark commented on SPARK-18663: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/16093 > Simplify CountMinSketch aggregate implementation > > > Key: SPARK-18663 > URL: https://issues.apache.org/jira/browse/SPARK-18663 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > SPARK-18429 introduced count-min sketch aggregate function for SQL, but the > implementation and testing is more complicated than needed. This simplifies > the test cases and removes support for data types that don't have clear > equality semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
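For context, the aggregate in question is a count-min sketch: d hash rows of w counters whose per-row minimum gives an upper-bound frequency estimate. A minimal self-contained version (illustrative only; Spark's {{CountMinSketch}} class differs):

```python
import hashlib

# Minimal count-min sketch: depth hash rows of width counters; the minimum
# over the rows is an estimate that never under-counts (illustrative, not
# Spark's implementation).
class CountMinSketch:
    def __init__(self, width=1024, depth=5):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # derive one bucket per row from a salted hash of the item
        for row in range(self.d):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # at least 3 (count-min never under-estimates)
```

Note that the data types with unclear equality semantics mentioned above are exactly the ones a structure like this cannot handle well, since hashing requires a stable notion of equality.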
[jira] [Assigned] (SPARK-18663) Simplify CountMinSketch aggregate implementation
[ https://issues.apache.org/jira/browse/SPARK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18663: Assignee: Apache Spark (was: Reynold Xin) > Simplify CountMinSketch aggregate implementation > > > Key: SPARK-18663 > URL: https://issues.apache.org/jira/browse/SPARK-18663 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > SPARK-18429 introduced count-min sketch aggregate function for SQL, but the > implementation and testing is more complicated than needed. This simplifies > the test cases and removes support for data types that don't have clear > equality semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18663) Simplify CountMinSketch aggregate implementation
[ https://issues.apache.org/jira/browse/SPARK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18663: Assignee: Reynold Xin (was: Apache Spark) > Simplify CountMinSketch aggregate implementation > > > Key: SPARK-18663 > URL: https://issues.apache.org/jira/browse/SPARK-18663 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > SPARK-18429 introduced count-min sketch aggregate function for SQL, but the > implementation and testing is more complicated than needed. This simplifies > the test cases and removes support for data types that don't have clear > equality semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18664) Don't respond to HTTP OPTIONS in HTTP-based UIs
[ https://issues.apache.org/jira/browse/SPARK-18664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-18664: - Description: This was flagged a while ago during a routine security scan (AWVS): the HTTP-based Spark services respond to an HTTP OPTIONS command. The HTTP OPTIONS method is enabled on this web server. The OPTIONS method provides a list of the methods that are supported by the web server; it represents a request for information about the communication options available on the request/response chain identified by the Request-URI. The OPTIONS method may expose sensitive information that may help a malicious user to prepare more advanced attacks. was:This was flagged a while ago during a routine security scan (AWVS): the HTTP-based Spark services respond to an HTTP OPTIONS command. > Don't respond to HTTP OPTIONS in HTTP-based UIs > --- > > Key: SPARK-18664 > URL: https://issues.apache.org/jira/browse/SPARK-18664 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: meiyoula >Priority: Minor > > This was flagged a while ago during a routine security scan (AWVS): the > HTTP-based Spark services respond to an HTTP OPTIONS command. > The HTTP OPTIONS method is enabled on this web server. The OPTIONS method > provides a list of the methods that are supported by the web server; it > represents a request for information about the communication options > available on the request/response chain identified by the Request-URI. > The OPTIONS method may expose sensitive information that may help a > malicious user to prepare more advanced attacks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
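The hardening being requested is generic: refuse OPTIONS before it reaches any handler, so the server never advertises its supported methods. Spark's UIs sit behind Jetty servlet filters; purely as an illustration of the pattern, here is a hypothetical WSGI-style middleware doing the equivalent:

```python
# Hypothetical WSGI middleware that rejects HTTP OPTIONS with 405, so the
# server never enumerates its supported methods. Spark's UIs actually use
# Jetty filters; this only illustrates the generic pattern.
def deny_options(app):
    def wrapped(environ, start_response):
        if environ.get("REQUEST_METHOD", "").upper() == "OPTIONS":
            start_response("405 Method Not Allowed", [("Content-Length", "0")])
            return [b""]
        return app(environ, start_response)
    return wrapped

def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

app = deny_options(hello_app)

# Minimal fake WSGI invocation to show the behavior without a real server.
def call(method):
    status = {}
    body = app({"REQUEST_METHOD": method}, lambda s, h: status.setdefault("s", s))
    return status["s"], b"".join(body)

print(call("OPTIONS"))  # ('405 Method Not Allowed', b'')
print(call("GET"))      # ('200 OK', b'ok')
```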
[jira] [Commented] (SPARK-18664) Don't respond to HTTP OPTIONS in HTTP-based UIs
[ https://issues.apache.org/jira/browse/SPARK-18664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710591#comment-15710591 ] meiyoula commented on SPARK-18664: -- [~srowen] It is similar to SPARK-5983; should we fix it? > Don't respond to HTTP OPTIONS in HTTP-based UIs > --- > > Key: SPARK-18664 > URL: https://issues.apache.org/jira/browse/SPARK-18664 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: meiyoula >Priority: Minor > > This was flagged a while ago during a routine security scan (AWVS): the > HTTP-based Spark services respond to an HTTP OPTIONS command. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18664) Don't respond to HTTP OPTIONS in HTTP-based UIs
meiyoula created SPARK-18664: Summary: Don't respond to HTTP OPTIONS in HTTP-based UIs Key: SPARK-18664 URL: https://issues.apache.org/jira/browse/SPARK-18664 Project: Spark Issue Type: Improvement Components: Web UI Reporter: meiyoula Priority: Minor This was flagged a while ago during a routine security scan (AWVS): the HTTP-based Spark services respond to an HTTP OPTIONS command. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18663) Simplify CountMinSketch aggregate implementation
Reynold Xin created SPARK-18663: --- Summary: Simplify CountMinSketch aggregate implementation Key: SPARK-18663 URL: https://issues.apache.org/jira/browse/SPARK-18663 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Assignee: Reynold Xin SPARK-18429 introduced count-min sketch aggregate function for SQL, but the implementation and testing is more complicated than needed. This simplifies the test cases and removes support for data types that don't have clear equality semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18644) spark-submit fails to run python scripts with specific names
[ https://issues.apache.org/jira/browse/SPARK-18644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-18644. -- Resolution: Not A Problem > spark-submit fails to run python scripts with specific names > > > Key: SPARK-18644 > URL: https://issues.apache.org/jira/browse/SPARK-18644 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 >Reporter: Jussi Jousimo >Priority: Minor > > I'm trying to run simple python script named tokenize.py with spark-submit. > The script only imports SparkContext: > from pyspark import SparkContext > And I run it with: > spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 > tokenize.py > However, the script fails: > ImportError: cannot import name SparkContext > I have set all necessary environment variables, etc. Strangely, it seems the > filename is causing this error. If I rename the file to, e.g., tokenizer.py > and run again, it runs fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18644) spark-submit fails to run python scripts with specific names
[ https://issues.apache.org/jira/browse/SPARK-18644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710507#comment-15710507 ] Bryan Cutler commented on SPARK-18644: -- Yeah, [~vanzin] is right, it's a python thing. See the stack trace below, the {{inspect}} module imports {{tokenize}} which finds your local file first {noformat} Traceback (most recent call last): File "repo/spark/tokenize.py", line 1, in from pyspark import SparkContext File "repo/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 44, in File "repo/spark/python/lib/pyspark.zip/pyspark/context.py", line 33, in File "repo/spark/python/lib/pyspark.zip/pyspark/java_gateway.py", line 31, in File "repo/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in File "/usr/lib/python2.7/pydoc.py", line 56, in import sys, imp, os, re, types, inspect, __builtin__, pkgutil, warnings File "/usr/lib/python2.7/inspect.py", line 39, in import tokenize File "repo/spark/tokenize.py", line 1, in from pyspark import SparkContext ImportError: cannot import name SparkContext {noformat} > spark-submit fails to run python scripts with specific names > > > Key: SPARK-18644 > URL: https://issues.apache.org/jira/browse/SPARK-18644 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 >Reporter: Jussi Jousimo >Priority: Minor > > I'm trying to run simple python script named tokenize.py with spark-submit. > The script only imports SparkContext: > from pyspark import SparkContext > And I run it with: > spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 > tokenize.py > However, the script fails: > ImportError: cannot import name SparkContext > I have set all necessary environment variables, etc. Strangely, it seems the > filename is causing this error. If I rename the file to, e.g., tokenizer.py > and run again, it runs fine. 
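The mechanism in the stack trace can be reproduced without Spark: the script's directory sits first on {{sys.path}}, so a file named after a stdlib module shadows it. A hypothetical standalone demo:

```python
import os
import subprocess
import sys
import tempfile

# Reproduce the shadowing mechanism from the stack trace above, without
# Spark: a file named after a stdlib module, placed where sys.path looks
# first, wins the import.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "tokenize.py"), "w") as f:
        f.write("MARKER = 'shadowed by local file'\n")
    # With -c and cwd=d, sys.path[0] is the current directory, mimicking how
    # spark-submit runs a script from its own folder.
    result = subprocess.run(
        [sys.executable, "-c",
         "import tokenize; print(getattr(tokenize, 'MARKER', 'stdlib module'))"],
        cwd=d, capture_output=True, text=True,
    )
    print(result.stdout.strip())
```

Renaming the script (tokenizer.py, as the reporter found) or avoiding stdlib module names entirely sidesteps the problem.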
[jira] [Updated] (SPARK-18662) Move cluster managers into their own sub-directory
[ https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-18662: --- External issue URL: https://github.com/apache/spark/pull/16092 > Move cluster managers into their own sub-directory > -- > > Key: SPARK-18662 > URL: https://issues.apache.org/jira/browse/SPARK-18662 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Anirudh Ramanathan >Priority: Minor > > As we move to support Kubernetes in addition to Yarn and Mesos > (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the > cluster managers into a "resource-managers/" sub-directory. This is simply a > reorganization. > Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18662) Move cluster managers into their own sub-directory
[ https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710391#comment-15710391 ] Apache Spark commented on SPARK-18662: -- User 'foxish' has created a pull request for this issue: https://github.com/apache/spark/pull/16092 > Move cluster managers into their own sub-directory > -- > > Key: SPARK-18662 > URL: https://issues.apache.org/jira/browse/SPARK-18662 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Anirudh Ramanathan >Priority: Minor > > As we move to support Kubernetes in addition to Yarn and Mesos > (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the > cluster managers into a "resource-managers/" sub-directory. This is simply a > reorganization. > Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18662) Move cluster managers into their own sub-directory
[ https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18662: Assignee: (was: Apache Spark) > Move cluster managers into their own sub-directory > -- > > Key: SPARK-18662 > URL: https://issues.apache.org/jira/browse/SPARK-18662 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Anirudh Ramanathan >Priority: Minor > > As we move to support Kubernetes in addition to Yarn and Mesos > (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the > cluster managers into a "resource-managers/" sub-directory. This is simply a > reorganization. > Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18662) Move cluster managers into their own sub-directory
[ https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18662: Assignee: Apache Spark > Move cluster managers into their own sub-directory > -- > > Key: SPARK-18662 > URL: https://issues.apache.org/jira/browse/SPARK-18662 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Anirudh Ramanathan >Assignee: Apache Spark >Priority: Minor > > As we move to support Kubernetes in addition to Yarn and Mesos > (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the > cluster managers into a "resource-managers/" sub-directory. This is simply a > reorganization. > Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18650) race condition in FileScanRDD.scala
[ https://issues.apache.org/jira/browse/SPARK-18650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710358#comment-15710358 ] Hyukjin Kwon commented on SPARK-18650: -- Would this be possible to share your data/sample data? I would like to reproduce this. > race condition in FileScanRDD.scala > --- > > Key: SPARK-18650 > URL: https://issues.apache.org/jira/browse/SPARK-18650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: scala 2.11 > macos 10.11.6 >Reporter: Jay Goldman > > I am attempting to create a DataSet from a single CSV file : > val ss: SparkSession = > val ddr = ss.read.option("path", path) > ... (choose between xml vs csv parsing) > var df = ddr.option("sep", ",") > .option("quote", "\"") > .option("escape", "\"") // want to retain backslashes (\) ... > .option("delimiter", ",") > .option("comment", "#") > .option("header", "true") > .option("format", "csv") >ddr.csv(path) > df.count() returns 2 times the number of lines in the CSV file - i.e., each > line of the input file shows up as 2 rows in df. > moreover df.distinct.count has the correct rows. > There appears to be a problem in FileScanRDD.compute. I am using spark > version 2.0.1 with scala 2.11. I am not going to include the entire contents > of FileScanRDD.scala here. > In FileScanRDD.compute there is the following: > private[this] val files = split.asInstanceOf[FilePartition].files.toIterator > If i put a breakpoint in either FileScanRDD.compute or > FIleScanRDD.nextIterator the resulting dataset has the correct number of rows. 
> Moreover, the code in FileScanRDD.scala is: > private def nextIterator(): Boolean = { > updateBytesReadWithFileSize() > if (files.hasNext) { // breakpoint here => works > currentFile = files.next() // breakpoint here => fails > > } > else { } > > } > if I put a breakpoint on the files.hasNext line all is well; however, if I > put a breakpoint on the files.next() line the code will fail when I continue > because the files iterator has become empty (see stack trace below). > Disabling the breakpoint winds up creating a Dataset with each line of the > CSV file duplicated. > So it appears that multiple threads are using the files iterator or the > underlying split value (an RDDPartition) and timing-wise on my system 2 > workers wind up processing the same file, with the resulting DataSet having 2 > copies of each of the input lines. > This code is not active when parsing an XML file. > here is the stack trace: > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:111) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at >
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/11/30 09:31:07 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 0.0 (TID 0, localhost): > java.util.NoSuchElementException: next on empty iterator > at
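Whatever the root cause in FileScanRDD turns out to be, the hazard described (two workers pulling from one files iterator) has a standard remedy: serialize access to {{next()}}. A small hypothetical demo of the safe pattern in plain Python, unrelated to Spark's code:

```python
import threading

# The hazard described above is concurrent next() calls on one shared
# iterator. Guarding next() with a lock guarantees each element is handed
# to exactly one worker: no duplicates, no premature exhaustion.
def drain(shared_iter, lock, out):
    while True:
        with lock:  # only one thread advances the iterator at a time
            try:
                item = next(shared_iter)
            except StopIteration:
                return
        out.append(item)  # list.append is atomic under the GIL

files = iter(range(1000))  # stand-in for the partition's file list
lock, seen = threading.Lock(), []
workers = [threading.Thread(target=drain, args=(files, lock, seen)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(seen), len(set(seen)))  # 1000 1000: each "file" consumed exactly once
```

Without the lock, interleaved hasNext/next pairs can observe the same element twice or hit an exhausted iterator, matching both symptoms reported (duplicated rows and "next on empty iterator").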
[jira] [Resolved] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server
[ https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18655. -- Resolution: Fixed Fix Version/s: 2.1.0 > Ignore Structured Streaming 2.0.2 logs in history server > > > Key: SPARK-18655 > URL: https://issues.apache.org/jira/browse/SPARK-18655 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Blocker > Fix For: 2.1.0 > > > SPARK-18516 changes the event log format of Structured Streaming. We should > make sure our changes not break the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18560) Receiver data can not be deserialized properly.
[ https://issues.apache.org/jira/browse/SPARK-18560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710309#comment-15710309 ] Apache Spark commented on SPARK-18560: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16091 > Receiver data can not be deserialized properly. > - > > Key: SPARK-18560 > URL: https://issues.apache.org/jira/browse/SPARK-18560 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Genmao Yu >Priority: Critical > > My Spark Streaming job can run correctly on Spark 1.6.1, but it can not run > properly on Spark 2.0.1, with the following exception: > {code} > 16/11/22 19:20:15 ERROR executor.Executor: Exception in task 4.3 in stage 6.0 > (TID 87) > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 13994 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:243) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1150) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1150) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1943) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1943) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Digging into the relevant implementation, I found the type of data received by > {{Receiver}} is erased. In Spark 2.x, the framework can choose an appropriate > {{Serializer}} from {{JavaSerializer}} and {{KryoSerializer}} based on the > type of data. > At the {{Receiver}} side, the type of data is erased to be {{Object}}, so the > framework will choose {{JavaSerializer}}, with the following code: > {code} > def canUseKryo(ct: ClassTag[_]): Boolean = { > primitiveAndPrimitiveArrayClassTags.contains(ct) || ct == stringClassTag > } > def getSerializer(ct: ClassTag[_]): Serializer = { > if (canUseKryo(ct)) { > kryoSerializer > } else { > defaultSerializer > } > } > {code} > At the task side, we can get the correct data type, and the framework will choose > {{KryoSerializer}} if possible, with the following supported types: > {code} > private[this] val stringClassTag: ClassTag[String] = > implicitly[ClassTag[String]] > private[this] val primitiveAndPrimitiveArrayClassTags: Set[ClassTag[_]] = { > val primitiveClassTags = Set[ClassTag[_]]( > ClassTag.Boolean, > ClassTag.Byte, > ClassTag.Char, > ClassTag.Double, > ClassTag.Float, > ClassTag.Int, > ClassTag.Long, > ClassTag.Null, > ClassTag.Short > ) > val arrayClassTags = primitiveClassTags.map(_.wrap) > primitiveClassTags ++ arrayClassTags > } > {code} > In my case, the type of data is a byte array. > This problem stems from SPARK-13990, a patch to have Spark automatically pick > the "best" serializer when caching RDDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
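The quoted {{canUseKryo}}/{{getSerializer}} logic keys off the declared element type, which is the crux of the bug. Restated as a hypothetical Python sketch for illustration (the real code is the Scala above):

```python
# Hypothetical restatement of the canUseKryo/getSerializer logic quoted
# above: the serializer is picked from the *declared* type. A receiver's
# element type is erased to Object, so its blocks are written with the
# generic (Java) serializer even when the payload really is a byte array,
# while the task side, knowing the true type, tries to read with Kryo.
PRIMITIVE_LIKE = {bool, int, float, bytes, str}  # stand-in for the ClassTag set

def pick_serializer(declared_type):
    return "kryo" if declared_type in PRIMITIVE_LIKE else "java"

write_side = pick_serializer(object)  # receiver side: type erased
read_side = pick_serializer(bytes)    # task side: real type known
print(write_side, read_side)          # java kryo -> mismatched streams
```

The mismatch between the two picks is exactly why the Kryo stream hits "Encountered unregistered class ID" when deserializing Java-serialized bytes.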
[jira] [Commented] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710308#comment-15710308 ] Apache Spark commented on SPARK-18617: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16091 > Close "kryo auto pick" feature for Spark Streaming > -- > > Key: SPARK-18617 > URL: https://issues.apache.org/jira/browse/SPARK-18617 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Genmao Yu >Assignee: Genmao Yu > Fix For: 2.1.0 > > > [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to > fix the bug, i.e. {{receiver data can not be deserialized properly}}. As > [~zsxwing] said, it is a critical bug, but we should not break APIs between > maintenance releases. It may be a rational choice to close {{auto pick kryo > serializer}} for Spark Streaming in the first step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18122) Fallback to Kryo for unknown classes in ExpressionEncoder
[ https://issues.apache.org/jira/browse/SPARK-18122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18122: Assignee: (was: Apache Spark) > Fallback to Kryo for unknown classes in ExpressionEncoder > - > > Key: SPARK-18122 > URL: https://issues.apache.org/jira/browse/SPARK-18122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Michael Armbrust >Priority: Critical > > In Spark 2.0 we fail to generate an encoder if any of the fields of the class > are not of a supported type. One example is {{Option\[Set\[Int\]\]}}, but > there are many more. We should give the user the option to fall back on > opaque kryo serialization in these cases for subtrees of the encoder, rather > than failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18122) Fallback to Kryo for unknown classes in ExpressionEncoder
[ https://issues.apache.org/jira/browse/SPARK-18122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18122: Assignee: Apache Spark > Fallback to Kryo for unknown classes in ExpressionEncoder > - > > Key: SPARK-18122 > URL: https://issues.apache.org/jira/browse/SPARK-18122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > > In Spark 2.0 we fail to generate an encoder if any of the fields of the class > are not of a supported type. One example is {{Option\[Set\[Int\]\]}}, but > there are many more. We should give the user the option to fall back on > opaque kryo serialization in these cases for subtrees of the encoder, rather > than failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18662) Move cluster managers into their own sub-directory
Anirudh Ramanathan created SPARK-18662: -- Summary: Move cluster managers into their own sub-directory Key: SPARK-18662 URL: https://issues.apache.org/jira/browse/SPARK-18662 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Anirudh Ramanathan Priority: Minor As we move to support Kubernetes in addition to Yarn and Mesos (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the cluster managers into a "resource-managers/" sub-directory. This is simply a reorganization. Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18122) Fallback to Kryo for unknown classes in ExpressionEncoder
[ https://issues.apache.org/jira/browse/SPARK-18122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-18122: -- I'm going to reopen this. I think the benefits outweigh the compatibility concerns. > Fallback to Kryo for unknown classes in ExpressionEncoder > - > > Key: SPARK-18122 > URL: https://issues.apache.org/jira/browse/SPARK-18122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Michael Armbrust >Priority: Critical > > In Spark 2.0 we fail to generate an encoder if any of the fields of the class > are not of a supported type. One example is {{Option\[Set\[Int\]\]}}, but > there are many more. We should give the user the option to fall back on > opaque kryo serialization in these cases for subtrees of the encoder, rather > than failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
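The proposed behavior can be illustrated generically: encode fields of supported types structurally, and serialize only the unsupported subtree opaquely instead of failing the whole encoder. A hypothetical Python analogue, with {{pickle}} standing in for Kryo:

```python
import json
import pickle

# Hypothetical sketch of the fallback idea: structured encoding for
# supported field types, opaque serialization for just the unsupported
# subtree (pickle stands in for Kryo; none of this is Spark's code).
SUPPORTED = (int, float, str, bool)

def encode_field(value):
    if isinstance(value, SUPPORTED):
        return ("structured", json.dumps(value))
    return ("opaque", pickle.dumps(value))  # kryo-like opaque fallback

def decode_field(kind, payload):
    return json.loads(payload) if kind == "structured" else pickle.loads(payload)

row = {"id": 7, "tags": {1, 2, 3}}  # the set plays the role of Option[Set[Int]]
enc = {k: encode_field(v) for k, v in row.items()}
dec = {k: decode_field(*enc[k]) for k in enc}
print(dec == row)  # True: the unsupported field round-trips opaquely
```

The compatibility concern mentioned above maps onto this sketch directly: the opaque payloads are tied to one serializer's wire format, so they cannot be introspected or evolved the way structured fields can.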
[jira] [Closed] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
[ https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sina Sohangir closed SPARK-18656. - Resolution: Later > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > requires too much memory in case of many columns > -- > > Key: SPARK-18656 > URL: https://issues.apache.org/jira/browse/SPARK-18656 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sina Sohangir > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > is implemented in a way that causes an out-of-memory error for cases where > the number of columns is high. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table
[ https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710233#comment-15710233 ] Apache Spark commented on SPARK-18661: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/16090 > Creating a partitioned datasource table should not scan all files for table > --- > > Key: SPARK-18661 > URL: https://issues.apache.org/jira/browse/SPARK-18661 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Blocker > > Even though in 2.1 creating a partitioned datasource table will not populate > the partition data by default (until the user issues MSCK REPAIR TABLE), it > seems we still scan the filesystem for no good reason. > We should avoid doing this when the user specifies a schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table
[ https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18661: Assignee: Apache Spark > Creating a partitioned datasource table should not scan all files for table > --- > > Key: SPARK-18661 > URL: https://issues.apache.org/jira/browse/SPARK-18661 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Blocker > > Even though in 2.1 creating a partitioned datasource table will not populate > the partition data by default (until the user issues MSCK REPAIR TABLE), it > seems we still scan the filesystem for no good reason. > We should avoid doing this when the user specifies a schema.
[jira] [Assigned] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table
[ https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18661: Assignee: (was: Apache Spark) > Creating a partitioned datasource table should not scan all files for table > --- > > Key: SPARK-18661 > URL: https://issues.apache.org/jira/browse/SPARK-18661 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Blocker > > Even though in 2.1 creating a partitioned datasource table will not populate > the partition data by default (until the user issues MSCK REPAIR TABLE), it > seems we still scan the filesystem for no good reason. > We should avoid doing this when the user specifies a schema.
[jira] [Created] (SPARK-18661) Creating a partitioned datasource table should not scan all files in filesystem
Eric Liang created SPARK-18661: -- Summary: Creating a partitioned datasource table should not scan all files in filesystem Key: SPARK-18661 URL: https://issues.apache.org/jira/browse/SPARK-18661 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Eric Liang Priority: Blocker Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason. We should avoid doing this when the user specifies a schema.
[jira] [Updated] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table
[ https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18661: --- Summary: Creating a partitioned datasource table should not scan all files for table (was: Creating a partitioned datasource table should not scan all files in filesystem) > Creating a partitioned datasource table should not scan all files for table > --- > > Key: SPARK-18661 > URL: https://issues.apache.org/jira/browse/SPARK-18661 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Blocker > > Even though in 2.1 creating a partitioned datasource table will not populate > the partition data by default (until the user issues MSCK REPAIR TABLE), it > seems we still scan the filesystem for no good reason. > We should avoid doing this when the user specifies a schema.
[jira] [Commented] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
[ https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710199#comment-15710199 ] Hyukjin Kwon commented on SPARK-17583: -- For example, please refer to the discussion in https://issues.apache.org/jira/browse/SPARK-17227 > Remove unused rowSeparator variable and set auto-expanding buffer as default > for maxCharsPerColumn option in CSV > > > Key: SPARK-17583 > URL: https://issues.apache.org/jira/browse/SPARK-17583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This JIRA includes several changes below: > 1. Upgrade the Univocity library from 2.1.1 to 2.2.1 > This includes some performance improvements and also enables the > auto-expanding buffer for the {{maxCharsPerColumn}} option in CSV. Please > refer to the [release notes|https://github.com/uniVocity/univocity-parsers/releases]. > 2. Remove the {{rowSeparator}} variable in {{CSVOptions}} > We have this variable in > [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127] > but it can cause confusion because it actually does not handle {{\r\n}}. > For example, we have an open issue, SPARK-17227, describing this variable. > This option is virtually unused because we rely on {{LineRecordReader}} in > Hadoop, which handles both {{\n}} and {{\r\n}}. > 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding > We are setting 100 for the length of each column. It'd be more sensible to > allow auto-expanding rather than a fixed length by default.
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710198#comment-15710198 ] yuhao yang commented on SPARK-18374: I checked with some other lists of stopwords and got a list to add, but it's longer than I thought: i'll you'll he'll she'll we'll they'll i'd you'd he'd she'd we'd they'd i'm you're he's she's it's we're they're i've we've you've they've isn't aren't wasn't weren't haven't hasn't hadn't don't doesn't didn't won't wouldn't shan't shouldn't mustn't can't couldn't Any concerns? > Incorrect words in StopWords/english.txt > > > Key: SPARK-18374 > URL: https://issues.apache.org/jira/browse/SPARK-18374 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: nirav patel > > I was just double-checking english.txt for the list of stopwords, as I felt > it was taking out valid tokens like 'won'. I think the issue is that the > english.txt list is missing the apostrophe character and all characters after > the apostrophe. So "won't" became "won" in that list; "wouldn't" is "wouldn". > Here are some incorrect tokens in this list: > won > wouldn > ma > mightn > mustn > needn > shan > shouldn > wasn > weren > I think the ideal list should have both styles, i.e. both won't and wont > should be part of english.txt, as some tokenizers might remove special > characters. But 'won' obviously shouldn't be in this list. > Here's the list of Snowball English stop words: > http://snowball.tartarus.org/algorithms/english/stop.txt
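The truncation the reporter describes can be reproduced with a toy tokenizer. This is plain Python, not Spark's StopWordsRemover, and `naive_tokens` is a made-up helper: splitting on every non-alphanumeric character, as crude tokenizers do, is exactly what turns "won't" into "won".

```python
import re

def naive_tokens(text):
    # Split on anything that is not a letter or digit, the way a crude
    # tokenizer that strips special characters would.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# The apostrophe is treated as a separator, so contractions are
# truncated exactly as in the reported english.txt entries.
print(naive_tokens("I won't go and they wouldn't stay"))
# ['i', 'won', 't', 'go', 'and', 'they', 'wouldn', 't', 'stay']
```

This also explains why keeping both the full form ("won't") and the stripped form ("wont") in the list would cover tokenizers on either side, while a bare "won" still over-matches a legitimate word.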
[jira] [Commented] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
[ https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710195#comment-15710195 ] Hyukjin Kwon commented on SPARK-17583: -- Ah, that would not actually be related to this JIRA, because this JIRA describes removing unused options and upgrading the CSV library. The behavior comes from relying on LineRecordReader. The external CSV library and the CSV source in 2.0.x support it, but they have a problem reading multiple lines as a single record when the record spans multiple blocks. It would be great if this were supported; there are several issues open. > Remove unused rowSeparator variable and set auto-expanding buffer as default > for maxCharsPerColumn option in CSV > > > Key: SPARK-17583 > URL: https://issues.apache.org/jira/browse/SPARK-17583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This JIRA includes several changes below: > 1. Upgrade the Univocity library from 2.1.1 to 2.2.1 > This includes some performance improvements and also enables the > auto-expanding buffer for the {{maxCharsPerColumn}} option in CSV. Please > refer to the [release notes|https://github.com/uniVocity/univocity-parsers/releases]. > 2. Remove the {{rowSeparator}} variable in {{CSVOptions}} > We have this variable in > [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127] > but it can cause confusion because it actually does not handle {{\r\n}}. > For example, we have an open issue, SPARK-17227, describing this variable. > This option is virtually unused because we rely on {{LineRecordReader}} in > Hadoop, which handles both {{\n}} and {{\r\n}}. > 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding > We are setting 100 for the length of each column. It'd be more sensible to > allow auto-expanding rather than a fixed length by default.
[jira] [Commented] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
[ https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710172#comment-15710172 ] koert kuipers commented on SPARK-17583: --- I just tested our in-house unit tests (which run against Spark 2.0.2) against Spark 2.1.0-RC1, and things break when writing out CSVs and reading them back in when there is a newline inside a CSV value (which will get quoted). Writing out works, but reading it back in breaks. Now, I am not saying it's a good idea to have newlines inside quoted CSV values, but I just wanted to point out that this used to work with Spark 2.0.2. I am not entirely sure why it worked, actually. Looking at the test, it actually writes the value with the newline out over 2 lines, and it reads it back in correctly as well. > Remove unused rowSeparator variable and set auto-expanding buffer as default > for maxCharsPerColumn option in CSV > > > Key: SPARK-17583 > URL: https://issues.apache.org/jira/browse/SPARK-17583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This JIRA includes several changes below: > 1. Upgrade the Univocity library from 2.1.1 to 2.2.1 > This includes some performance improvements and also enables the > auto-expanding buffer for the {{maxCharsPerColumn}} option in CSV. Please > refer to the [release notes|https://github.com/uniVocity/univocity-parsers/releases]. > 2. Remove the {{rowSeparator}} variable in {{CSVOptions}} > We have this variable in > [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127] > but it can cause confusion because it actually does not handle {{\r\n}}. > For example, we have an open issue, SPARK-17227, describing this variable. > This option is virtually unused because we rely on {{LineRecordReader}} in > Hadoop, which handles both {{\n}} and {{\r\n}}. > 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding > We are setting 100 for the length of each column. It'd be more sensible to > allow auto-expanding rather than a fixed length by default.
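The round trip koert describes can be sketched with Python's standard-library csv module. This is an analogy, not Spark's CSV source: a value containing a newline is quoted on write, spans two physical lines in the file, and still parses back as one record, which is exactly the behavior that breaks when the reader splits records on raw physical lines.

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["id1", "line1\nline2"])  # value contains "\n"

data = buf.getvalue()
# In the file the record spans two physical lines, with the value quoted:
#   id1,"line1
#   line2"

rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['id1', 'line1\nline2']]: the quoted newline survives
```

A line-oriented record reader (like Hadoop's LineRecordReader) sees two records here, which is why a CSV parser that delegates line splitting to it cannot handle quoted newlines.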
[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17939: - Target Version/s: 2.1.0 > Spark-SQL Nullability: Optimizations vs. Enforcement Clarification > -- > > Key: SPARK-17939 > URL: https://issues.apache.org/jira/browse/SPARK-17939 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Aleksander Eskilson >Priority: Critical > > The notion of Nullability of StructFields in DataFrames and Datasets > creates some confusion. As has been pointed out previously [1], Nullability > is a hint to the Catalyst optimizer, and is not meant to be a type-level > enforcement. Allowing null fields can also help the reader successfully parse > certain types of more loosely-typed data, like JSON and CSV, where null > values are common, rather than just failing. > There's already been some movement to clarify the meaning of Nullable in the > API, but also some requests for a (perhaps completely separate) type-level > implementation of Nullable that can act as an enforcement contract. > This bug is logged here to discuss and clarify this issue. > [1] - > [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535] > [2] - https://github.com/apache/spark/pull/11785
[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17939: - Target Version/s: 2.2.0 (was: 2.1.0) > Spark-SQL Nullability: Optimizations vs. Enforcement Clarification > -- > > Key: SPARK-17939 > URL: https://issues.apache.org/jira/browse/SPARK-17939 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Aleksander Eskilson >Priority: Critical > > The notion of Nullability of StructFields in DataFrames and Datasets > creates some confusion. As has been pointed out previously [1], Nullability > is a hint to the Catalyst optimizer, and is not meant to be a type-level > enforcement. Allowing null fields can also help the reader successfully parse > certain types of more loosely-typed data, like JSON and CSV, where null > values are common, rather than just failing. > There's already been some movement to clarify the meaning of Nullable in the > API, but also some requests for a (perhaps completely separate) type-level > implementation of Nullable that can act as an enforcement contract. > This bug is logged here to discuss and clarify this issue. > [1] - > [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535] > [2] - https://github.com/apache/spark/pull/11785
[jira] [Assigned] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory
[ https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18658: Assignee: (was: Apache Spark) > Writing to a text DataSource buffers one or more lines in memory > > > Key: SPARK-18658 > URL: https://issues.apache.org/jira/browse/SPARK-18658 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Nathan Howell >Priority: Minor > > The JSON and CSV writing paths buffer entire lines (or multiple lines) in > memory prior to writing to disk. For large rows this is inefficient. It may > make sense to skip the {{TextOutputFormat}} record writer and go directly to > the underlying {{FSDataOutputStream}}, allowing the writers to append > arbitrary byte arrays (fractions of a row) instead of a full row.
[jira] [Assigned] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory
[ https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18658: Assignee: Apache Spark > Writing to a text DataSource buffers one or more lines in memory > > > Key: SPARK-18658 > URL: https://issues.apache.org/jira/browse/SPARK-18658 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Nathan Howell >Assignee: Apache Spark >Priority: Minor > > The JSON and CSV writing paths buffer entire lines (or multiple lines) in > memory prior to writing to disk. For large rows this is inefficient. It may > make sense to skip the {{TextOutputFormat}} record writer and go directly to > the underlying {{FSDataOutputStream}}, allowing the writers to append > arbitrary byte arrays (fractions of a row) instead of a full row.
[jira] [Commented] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory
[ https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710116#comment-15710116 ] Apache Spark commented on SPARK-18658: -- User 'NathanHowell' has created a pull request for this issue: https://github.com/apache/spark/pull/16089 > Writing to a text DataSource buffers one or more lines in memory > > > Key: SPARK-18658 > URL: https://issues.apache.org/jira/browse/SPARK-18658 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Nathan Howell >Priority: Minor > > The JSON and CSV writing paths buffer entire lines (or multiple lines) in > memory prior to writing to disk. For large rows this is inefficient. It may > make sense to skip the {{TextOutputFormat}} record writer and go directly to > the underlying {{FSDataOutputStream}}, allowing the writers to append > arbitrary byte arrays (fractions of a row) instead of a full row.
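The buffering difference the ticket describes can be sketched in pure Python. This is an analogy, not Spark's actual writer, and both helper names are invented: the current path materializes the whole line before writing, while the proposed path appends fragments of a row directly to the underlying stream, which is what writing straight to the `FSDataOutputStream` would allow.

```python
import io

def write_row_buffered(out, fields):
    # Current behavior: the full line is materialized before writing,
    # so the peak temporary allocation grows with the row size.
    line = ",".join(fields) + "\n"
    out.write(line.encode("utf-8"))

def write_row_streaming(out, fields):
    # Proposed behavior: append fragments of the row directly to the
    # underlying stream, never holding the whole row at once.
    for i, field in enumerate(fields):
        if i:
            out.write(b",")
        out.write(field.encode("utf-8"))
    out.write(b"\n")

a, b = io.BytesIO(), io.BytesIO()
for sink, writer in ((a, write_row_buffered), (b, write_row_streaming)):
    writer(sink, ["id", "x" * 10, "y"])
assert a.getvalue() == b.getvalue()  # identical bytes, different peak memory
```

The output bytes are identical; only the size of the intermediate buffer changes, which matters when a single row is very large.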
[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18481: -- Description: Remove deprecated methods for ML. This task removed the following (deprecated) public APIs in org.apache.spark.ml: * classification.RandomForestClassificationModel.numTrees (This now refers to the Param called "numTrees") * feature.ChiSqSelectorModel.setLabelCol * regression.LinearRegressionSummary.model * regression.RandomForestRegressionModel.numTrees (This now refers to the Param called "numTrees") * PipelineStage.validateParams * Evaluator.validateParams This task made the following changes to match existing patterns for Params: * These methods were made final: ** classification.RandomForestClassificationModel.getNumTrees ** regression.RandomForestRegressionModel.getNumTrees * These methods return the concrete class type, rather than an arbitrary trait. This only affected Java compatibility, not Scala. ** classification.RandomForestClassificationModel.setFeatureSubsetStrategy ** regression.RandomForestRegressionModel.setFeatureSubsetStrategy was: Remove deprecated methods for ML. This task removed the following (deprecated) public APIs in org.apache.spark.ml: * classification.RandomForestClassificationModel.numTrees (This now refers to the Param called "numTrees") * feature.ChiSqSelectorModel.setLabelCol * regression.LinearRegressionSummary.model * regression.RandomForestRegressionModel.numTrees (This now refers to the Param called "numTrees") * PipelineStage.validateParams This task made the following changes to match existing patterns for Params: * These methods were made final: ** classification.RandomForestClassificationModel.getNumTrees ** regression.RandomForestRegressionModel.getNumTrees * These methods return the concrete class type, rather than an arbitrary trait. This only affected Java compatibility, not Scala. 
** classification.RandomForestClassificationModel.setFeatureSubsetStrategy ** regression.RandomForestRegressionModel.setFeatureSubsetStrategy > ML 2.1 QA: Remove deprecated methods for ML > > > Key: SPARK-18481 > URL: https://issues.apache.org/jira/browse/SPARK-18481 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > Remove deprecated methods for ML. > This task removed the following (deprecated) public APIs in > org.apache.spark.ml: > * classification.RandomForestClassificationModel.numTrees (This now refers > to the Param called "numTrees") > * feature.ChiSqSelectorModel.setLabelCol > * regression.LinearRegressionSummary.model > * regression.RandomForestRegressionModel.numTrees (This now refers to the > Param called "numTrees") > * PipelineStage.validateParams > * Evaluator.validateParams > This task made the following changes to match existing patterns for Params: > * These methods were made final: > ** classification.RandomForestClassificationModel.getNumTrees > ** regression.RandomForestRegressionModel.getNumTrees > * These methods return the concrete class type, rather than an arbitrary > trait. This only affected Java compatibility, not Scala. > ** classification.RandomForestClassificationModel.setFeatureSubsetStrategy > ** regression.RandomForestRegressionModel.setFeatureSubsetStrategy
[jira] [Created] (SPARK-18660) Parquet complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Yin Huai created SPARK-18660: Summary: Parquet complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl " Key: SPARK-18660 URL: https://issues.apache.org/jira/browse/SPARK-18660 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai The Parquet record reader always complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl". It looks like we always create a TaskAttemptContextImpl (https://github.com/apache/spark/blob/2f7461f31331cfc37f6cfa3586b7bbefb3af5547/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L368), but Parquet wants to use TaskInputOutputContext, which is a subclass of TaskAttemptContextImpl.
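A toy analogy (invented Python classes, not the real Hadoop or Parquet API) for why the warning appears: Parquet's counter setup checks the runtime type of the context it is handed, and a plain TaskAttemptContextImpl fails that check, so counter initialization is skipped with the logged complaint.

```python
class TaskAttemptContextImpl:
    """Stands in for what Spark constructs for each task."""

class TaskInputOutputContextImpl(TaskAttemptContextImpl):
    """Stands in for the richer context that exposes counters."""
    def get_counter(self, name):
        return 0  # a freshly initialized counter

def init_counter(context):
    # Mirrors the runtime type check behind the warning message: the
    # richer subtype is required, the base type is rejected.
    if not isinstance(context, TaskInputOutputContextImpl):
        return "Can not initialize counter: context is " + type(context).__name__
    return context.get_counter("bytesread")

print(init_counter(TaskAttemptContextImpl()))      # warning path
print(init_counter(TaskInputOutputContextImpl()))  # counter works: 0
```

The point of the sketch is only the subtype relationship: creating the base-type context is valid, but any consumer that needs the subtype's extra capabilities has to degrade gracefully, which Parquet does by logging and disabling counters.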
[jira] [Resolved] (SPARK-18546) UnsafeShuffleWriter corrupts encrypted shuffle files when merging
[ https://issues.apache.org/jira/browse/SPARK-18546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-18546. Resolution: Fixed Fix Version/s: 2.1.1 > UnsafeShuffleWriter corrupts encrypted shuffle files when merging > - > > Key: SPARK-18546 > URL: https://issues.apache.org/jira/browse/SPARK-18546 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 2.1.1 > > > The merging algorithm in {{UnsafeShuffleWriter}} does not consider > encryption, and when it tries to merge encrypted files the result data cannot > be read, since data encrypted with different initial vectors is interleaved > in the same partition data. This leads to exceptions when trying to read the > files during shuffle: > {noformat} > com.esotericsoftware.kryo.KryoException: com.ning.compress.lzf.LZFException: > Corrupt input data, block did not start with 2 byte signature ('ZV') followed > by type byte, 2-byte length) > at com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > at com.esotericsoftware.kryo.io.Input.require(Input.java:155) > at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228) > at > org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:512) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:533) > ... 
> {noformat} > (This is our internal branch so don't worry if lines don't necessarily match.)
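The corruption mechanism can be illustrated with a toy XOR stream cipher. This is not Spark's real codec, and the keystream construction here is purely illustrative: when two spill files encrypted under different initialization vectors are concatenated byte-for-byte by the merge path, a reader that decrypts the merged file as one stream under one IV recovers only the first segment, and everything after it comes out as garbage.

```python
import hashlib

def keystream(key, iv):
    # Toy keystream from iterated hashing; illustration only, NOT a
    # real cipher and not what Spark uses.
    block = hashlib.sha256(key + iv).digest()
    while True:
        yield from block
        block = hashlib.sha256(block).digest()

def xor_stream(data, key, iv):
    # XOR encryption and decryption are the same operation.
    return bytes(b ^ k for b, k in zip(data, keystream(key, iv)))

key = b"shuffle-key"
spill1 = xor_stream(b"partition-0-part-a;", key, b"iv-one")
spill2 = xor_stream(b"partition-0-part-b;", key, b"iv-two")

# The fast merge path concatenates the raw encrypted bytes of both spills...
merged = spill1 + spill2

# ...but the reader decrypts the merged file as one stream under one IV,
# so everything past the first spill's bytes decrypts to garbage.
plain = xor_stream(merged, key, b"iv-one")
assert plain.startswith(b"partition-0-part-a;")
assert plain[len(spill1):] != b"partition-0-part-b;"
```

The fix described in the ticket amounts to making the merge aware of encryption, so each segment is decrypted and re-encrypted (or streamed through the codec) rather than spliced together as opaque bytes.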
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709943#comment-15709943 ] Marcelo Vanzin commented on SPARK-18085: I uploaded code for milestone 3 from the document: https://github.com/vanzin/spark/tree/shs-ng/M3 It doesn't have a whole lot of tests, but it's supposed to be a building step for the next milestones. I also updated the M1 and M2 branches with enhancements / fixes I found while writing the M3 code. > Better History Server scalability for many / large applications > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them.
[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18659: --- Description: The first three test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one deletes too many files due to a partition case resolution failure.
{code}
test("foo") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("bar") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (a, b) select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("baz") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (A, B) select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("qux") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 10)
  }
}
{code}
was: The following test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one deletes too many files due to a partition case resolution failure.
{code}
test("foo") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("bar") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (a, b) select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("baz") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (A, B) select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("qux") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 10)
  }
}
{code}
> Incorrect behaviors in overwrite table for datasource tables > > > Key: SPARK-18659 > URL: https://issues.apache.org/jira/browse/SPARK-18659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Blocker > > The first three test cases fail due to a crash in hive client when dropping > partitions that don't contain files. The last one deletes too many files due > to a partition case resolution failure.
> {code} > test("foo") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test select id, id, 'x' from > range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("bar") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a, b) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("baz") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B")
[jira] [Assigned] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18659: Assignee: (was: Apache Spark) > Incorrect behaviors in overwrite table for datasource tables > > > Key: SPARK-18659 > URL: https://issues.apache.org/jira/browse/SPARK-18659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Blocker > > The following test cases fail due to a crash in hive client when dropping > partitions that don't contain files. The last one deletes too many files due > to a partition case resolution failure. > {code} > test("foo") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test select id, id, 'x' from > range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("bar") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a, b) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("baz") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (A, B) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("qux") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a=1, b) select id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 10) > } > } > {code} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709938#comment-15709938 ] Apache Spark commented on SPARK-18659: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/16088 > Incorrect behaviors in overwrite table for datasource tables > > > Key: SPARK-18659 > URL: https://issues.apache.org/jira/browse/SPARK-18659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Blocker > > The following test cases fail due to a crash in hive client when dropping > partitions that don't contain files. The last one deletes too many files due > to a partition case resolution failure. > {code} > test("foo") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test select id, id, 'x' from > range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("bar") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a, b) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("baz") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (A, B) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("qux") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a=1, b) select id, > 'x' from range(1)") > 
assert(spark.sql("select * from test").count() == 10) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18659: Assignee: Apache Spark > Incorrect behaviors in overwrite table for datasource tables > > > Key: SPARK-18659 > URL: https://issues.apache.org/jira/browse/SPARK-18659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Blocker > > The following test cases fail due to a crash in hive client when dropping > partitions that don't contain files. The last one deletes too many files due > to a partition case resolution failure. > {code} > test("foo") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test select id, id, 'x' from > range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("bar") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a, b) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("baz") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (A, B) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("qux") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a=1, b) select id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 10) > } > } > {code} -- This message 
was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18659: --- Description: The following test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one deletes too many files due to a partition case resolution failure. {code} test("foo") { withTable("test") { spark.range(10) .selectExpr("id", "id as A", "'x' as B") .write.partitionBy("A", "B").mode("overwrite") .saveAsTable("test") spark.sql("insert overwrite table test select id, id, 'x' from range(1)") assert(spark.sql("select * from test").count() == 1) } } test("bar") { withTable("test") { spark.range(10) .selectExpr("id", "id as A", "'x' as B") .write.partitionBy("A", "B").mode("overwrite") .saveAsTable("test") spark.sql("insert overwrite table test partition (a, b) select id, id, 'x' from range(1)") assert(spark.sql("select * from test").count() == 1) } } test("baz") { withTable("test") { spark.range(10) .selectExpr("id", "id as A", "'x' as B") .write.partitionBy("A", "B").mode("overwrite") .saveAsTable("test") spark.sql("insert overwrite table test partition (A, B) select id, id, 'x' from range(1)") assert(spark.sql("select * from test").count() == 1) } } test("qux") { withTable("test") { spark.range(10) .selectExpr("id", "id as A", "'x' as B") .write.partitionBy("A", "B").mode("overwrite") .saveAsTable("test") spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' from range(1)") assert(spark.sql("select * from test").count() == 10) } } {code} was: The following test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one crashes due to a partition case resolution failure. 
{code} test("foo") { withTable("test") { spark.range(10) .selectExpr("id", "id as A", "'x' as B") .write.partitionBy("A", "B").mode("overwrite") .saveAsTable("test") spark.sql("insert overwrite table test select id, id, 'x' from range(1)") assert(spark.sql("select * from test").count() == 1) } } test("bar") { withTable("test") { spark.range(10) .selectExpr("id", "id as A", "'x' as B") .write.partitionBy("A", "B").mode("overwrite") .saveAsTable("test") spark.sql("insert overwrite table test partition (a, b) select id, id, 'x' from range(1)") assert(spark.sql("select * from test").count() == 1) } } test("baz") { withTable("test") { spark.range(10) .selectExpr("id", "id as A", "'x' as B") .write.partitionBy("A", "B").mode("overwrite") .saveAsTable("test") spark.sql("insert overwrite table test partition (A, B) select id, id, 'x' from range(1)") assert(spark.sql("select * from test").count() == 1) } } test("qux") { withTable("test") { spark.range(10) .selectExpr("id", "id as A", "'x' as B") .write.partitionBy("A", "B").mode("overwrite") .saveAsTable("test") spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' from range(1)") assert(spark.sql("select * from test").count() == 10) } } {code} > Incorrect behaviors in overwrite table for datasource tables > > > Key: SPARK-18659 > URL: https://issues.apache.org/jira/browse/SPARK-18659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Blocker > > The following test cases fail due to a crash in hive client when dropping > partitions that don't contain files. The last one deletes too many files due > to a partition case resolution failure. 
> {code} > test("foo") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test select id, id, 'x' from > range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("bar") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a, b) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("baz") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") >
[jira] [Assigned] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
[ https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18656: Assignee: Apache Spark > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > requires too much memory in case of many columns > -- > > Key: SPARK-18656 > URL: https://issues.apache.org/jira/browse/SPARK-18656 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sina Sohangir >Assignee: Apache Spark > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > is implemented in a way that causes an out-of-memory error when > the number of columns is high. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
[ https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709930#comment-15709930 ] Sina Sohangir commented on SPARK-18656: --- Create a PR: https://github.com/apache/spark/pull/16087 > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > requires too much memory in case of many columns > -- > > Key: SPARK-18656 > URL: https://issues.apache.org/jira/browse/SPARK-18656 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sina Sohangir > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > is implemented in a way that causes an out-of-memory error when > the number of columns is high. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
[ https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sina Sohangir updated SPARK-18656: -- Comment: was deleted (was: Created a PR: https://github.com/apache/spark/pull/16087 ) > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > requires too much memory in case of many columns > -- > > Key: SPARK-18656 > URL: https://issues.apache.org/jira/browse/SPARK-18656 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sina Sohangir > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > is implemented in a way that causes an out-of-memory error when > the number of columns is high. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
[ https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709930#comment-15709930 ] Sina Sohangir edited comment on SPARK-18656 at 11/30/16 10:03 PM: -- Created a PR: https://github.com/apache/spark/pull/16087 was (Author: sina.sohangir): Create a PR: https://github.com/apache/spark/pull/16087 > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > requires too much memory in case of many columns > -- > > Key: SPARK-18656 > URL: https://issues.apache.org/jira/browse/SPARK-18656 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sina Sohangir > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > is implemented in a way that causes an out-of-memory error when > the number of columns is high. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
[ https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709929#comment-15709929 ] Apache Spark commented on SPARK-18656: -- User 'sinasohangirsc' has created a pull request for this issue: https://github.com/apache/spark/pull/16087 > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > requires too much memory in case of many columns > -- > > Key: SPARK-18656 > URL: https://issues.apache.org/jira/browse/SPARK-18656 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sina Sohangir > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > is implemented in a way that causes an out-of-memory error when > the number of columns is high. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
[ https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18656: Assignee: (was: Apache Spark) > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > requires too much memory in case of many columns > -- > > Key: SPARK-18656 > URL: https://issues.apache.org/jira/browse/SPARK-18656 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sina Sohangir > Original Estimate: 1h > Remaining Estimate: 1h > > org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles > is implemented in a way that causes an out-of-memory error when > the number of columns is high. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18659: --- Summary: Incorrect behaviors in overwrite table for datasource tables (was: Crash in overwrite table partitions due to hive metastore integration) > Incorrect behaviors in overwrite table for datasource tables > > > Key: SPARK-18659 > URL: https://issues.apache.org/jira/browse/SPARK-18659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Blocker > > The following test cases fail due to a crash in hive client when dropping > partitions that don't contain files. The last one crashes due to a partition > case resolution failure. > {code} > test("foo") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test select id, id, 'x' from > range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("bar") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a, b) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("baz") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (A, B) select id, id, > 'x' from range(1)") > assert(spark.sql("select * from test").count() == 1) > } > } > test("qux") { > withTable("test") { > spark.range(10) > .selectExpr("id", "id as A", "'x' as B") > .write.partitionBy("A", "B").mode("overwrite") > .saveAsTable("test") > spark.sql("insert overwrite table test partition (a=1, b) select id, > 'x' from range(1)") > 
assert(spark.sql("select * from test").count() == 10) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709869#comment-15709869 ] Cheng Lian commented on SPARK-18251: One more comment about why we shouldn't allow an {{Option\[T <: Product\]}} to be used as a top-level Dataset type: one way to think about this more intuitively is to make an analogy to databases. In a database table, you cannot mark a row itself as null. Instead, you are only allowed to mark a field of a row to be null. Instead of using {{Option\[T <: Product\]}}, the user should resort to {{Tuple1\[T <: Product\]}}. Thus, you have a row consisting of a single field, which can be filled with either a null or a struct. > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class > > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: OS X >Reporter: Aniket Bhatnagar >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty. 
If it does so, the following exception is thrown: > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The bug can be reproduce by using the program: > https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
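The {{Tuple1}} workaround described in the comment above can be sketched in a few lines. This is a hypothetical illustration, not code from the ticket: it assumes a local SparkSession named {{spark}} and reuses the reporter's {{DataRow}} case class.

```scala
// Sketch only: assumes Spark 2.x on the classpath and a local master.
import org.apache.spark.sql.SparkSession

case class DataRow(id: Int, value: String)

object Tuple1Workaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tuple1-sketch")
      .getOrCreate()
    import spark.implicits._

    // Problematic shape: with a top-level Option the *row itself* could be
    // "null", which has no relational equivalent (a table row cannot be null):
    // val ds = Seq(Some(DataRow(1, "a")), None).toDS()

    // Suggested shape: wrap in Tuple1 so the Option becomes a nullable
    // struct *field* of a one-column row.
    val ds = Seq(
      Tuple1(Some(DataRow(1, "a"))),
      Tuple1(None: Option[DataRow])
    ).toDS()

    ds.show() // one struct column; the second row's field is null
    spark.stop()
  }
}
```

In other words, nullability moves from the row down to a field, matching how a database table represents missing values.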
[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-18251: --- Assignee: Wenchen Fan > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class > > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: OS X >Reporter: Aniket Bhatnagar >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty. If it does so, the following exception is thrown: > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). 
> at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The bug can be reproduce by using the program: > https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-18251. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15979 [https://github.com/apache/spark/pull/15979] > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class > > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: OS X >Reporter: Aniket Bhatnagar > Fix For: 2.2.0 > > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty. If it does so, the following exception is thrown: > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). 
> at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The bug can be reproduce by using the program: > https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory
Nathan Howell created SPARK-18658: - Summary: Writing to a text DataSource buffers one or more lines in memory Key: SPARK-18658 URL: https://issues.apache.org/jira/browse/SPARK-18658 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.2 Reporter: Nathan Howell Priority: Minor The JSON and CSV writing paths buffer entire lines (or multiple lines) in memory prior to writing to disk. For large rows this is inefficient. It may make sense to skip the {{TextOutputFormat}} record writer and go directly to the underlying {{FSDataOutputStream}}, allowing the writers to append arbitrary byte arrays (fractions of a row) instead of a full row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
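The change proposed in SPARK-18658 can be illustrated with a small sketch. This is not Spark's internal writer code: the file path and the fragment iterator are made-up assumptions, and only the standard Hadoop {{FileSystem}}/{{FSDataOutputStream}} API is used. The point is that an {{FSDataOutputStream}} accepts arbitrary byte arrays, so row fragments can be appended as they are produced instead of buffering a whole serialized row in memory.

```scala
// Illustrative sketch: writes row fragments directly to an FSDataOutputStream
// on the local filesystem rather than assembling each full row as a String.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DirectWriteSketch {
  def main(args: Array[String]): Unit = {
    val fs  = FileSystem.getLocal(new Configuration())
    val out = fs.create(new Path("/tmp/direct-write-sketch.json")) // hypothetical path
    try {
      // Fragments of one JSON row, emitted piecewise; for a very large row
      // this avoids holding the entire line in memory at once.
      val fragments: Iterator[Array[Byte]] =
        Iterator("{\"id\":", "1", ",\"payload\":\"...\"}", "\n")
          .map(_.getBytes("UTF-8"))
      fragments.foreach(b => out.write(b)) // write(byte[]) from OutputStream
    } finally {
      out.close()
    }
  }
}
```

A record writer like {{TextOutputFormat}}'s, by contrast, takes one complete value per call, which is why the JSON/CSV paths currently buffer whole lines first.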
[jira] [Created] (SPARK-18659) Crash in overwrite table partitions due to hive metastore integration
Eric Liang created SPARK-18659:
--

 Summary: Crash in overwrite table partitions due to hive metastore integration
 Key: SPARK-18659
 URL: https://issues.apache.org/jira/browse/SPARK-18659
 Project: Spark
 Issue Type: Bug
 Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Blocker

The following test cases fail due to a crash in the hive client when dropping partitions that don't contain files. The last one crashes due to a partition case resolution failure.

{code}
test("foo") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("bar") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (a, b) select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("baz") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (A, B) select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("qux") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 10)
  }
}
{code}
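The case-resolution failure in the last test (a user writes {{partition (a=1, b)}} against table columns {{A}}, {{B}}) can be illustrated with a small hypothetical helper; this is not Spark's analyzer code, just a sketch of case-insensitive matching of a partition spec against the table's partition columns.

```scala
// Hypothetical sketch: resolve user-written partition column names against the
// table's canonical partition columns, ignoring case. Returns the spec rewritten
// with canonical names, or an error for an unknown column.
def resolvePartitionSpec(
    tableCols: Seq[String],
    spec: Map[String, Option[String]]): Either[String, Map[String, Option[String]]] = {
  val unknown = spec.keys.find(k => !tableCols.exists(_.equalsIgnoreCase(k)))
  unknown match {
    case Some(bad) => Left(s"$bad is not a partition column")
    case None =>
      Right(spec.map { case (name, value) =>
        // safe: the find above guarantees a case-insensitive match exists
        tableCols.find(_.equalsIgnoreCase(name)).get -> value
      })
  }
}
```

A resolution step like this would let the {{(a=1, b)}} spec map onto columns {{A}} and {{B}} instead of failing.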
[jira] [Commented] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709811#comment-15709811 ]

Joseph K. Bradley commented on SPARK-18318:
---

I did a quick check too and did not see anything missed. Thanks!

> ML, Graph 2.1 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-18318
> URL: https://issues.apache.org/jira/browse/SPARK-18318
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, GraphX, ML, MLlib
> Reporter: Joseph K. Bradley
> Assignee: Yanbo Liang
> Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
> * Protected/public classes or methods. If access can be more private, then it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!*
>
> As you find issues, please create JIRAs and link them to this issue.
[jira] [Resolved] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-18318.
---
 Resolution: Fixed
 Fix Version/s: 2.2.0
 2.1.1

Issue resolved by pull request 16009
[https://github.com/apache/spark/pull/16009]

> ML, Graph 2.1 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-18318
> URL: https://issues.apache.org/jira/browse/SPARK-18318
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, GraphX, ML, MLlib
> Reporter: Joseph K. Bradley
> Assignee: Yanbo Liang
> Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
> * Protected/public classes or methods. If access can be more private, then it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!*
>
> As you find issues, please create JIRAs and link them to this issue.
[jira] [Created] (SPARK-18657) Persist UUID across query restart
Michael Armbrust created SPARK-18657:
-

 Summary: Persist UUID across query restart
 Key: SPARK-18657
 URL: https://issues.apache.org/jira/browse/SPARK-18657
 Project: Spark
 Issue Type: Bug
 Components: Structured Streaming
Reporter: Michael Armbrust
Priority: Critical

We probably also want to add an instance ID, or something that changes when the query restarts.
[jira] [Updated] (SPARK-18274) Memory leak in PySpark StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-18274:
--
 Target Version/s: 2.0.3, 2.1.1, 2.2.0 (was: 2.0.3, 2.1.0)

> Memory leak in PySpark StringIndexer
> 
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
> Reporter: Jonas Amrich
> Priority: Critical
>
> StringIndexerModel won't get collected by GC in Java even when deleted in Python. It can be reproduced by this code, which fails after a couple of iterations (around 7 if you set driver memory to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
>
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 7e5 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> An explicit call to the Python GC fixes the issue - the following code runs fine:
> {code}
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling jvm detach in the model's destructor. This is implemented in pyspark.mllib.common.JavaModelWrapper but missing in pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be affected by this memory leak.
[jira] [Updated] (SPARK-18563) mapWithState: initialState should have a timeout setting per record
[ https://issues.apache.org/jira/browse/SPARK-18563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-18563:
-
 Component/s: (was: Structured Streaming)
 DStreams

> mapWithState: initialState should have a timeout setting per record
> ---
>
> Key: SPARK-18563
> URL: https://issues.apache.org/jira/browse/SPARK-18563
> Project: Spark
> Issue Type: Improvement
> Components: DStreams
> Reporter: Daniel Haviv
>
> When passing an initialState for mapWithState there should be a way to set a timeout at the record level.
> If, for example, mapWithState is configured with a 48H timeout, loading an initialState will cause the state to bloat and hold 96H of data and then release 48H of data at once.
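What a per-record timeout might look like can be sketched with a hypothetical `StateEntry` wrapper. This is not the Spark API (mapWithState's `StateSpec` supports only one global timeout); the point is that if each restored entry carried its own last-update timestamp, expiry would be computed from that timestamp rather than from the reload time, avoiding the 96H bloat described above.

```scala
// Hypothetical sketch: each state entry carries the time it was last updated,
// and expiry is computed per record from that timestamp.
case class StateEntry[V](value: V, lastUpdatedMs: Long)

def isExpired[V](e: StateEntry[V], timeoutMs: Long, nowMs: Long): Boolean =
  nowMs - e.lastUpdatedMs >= timeoutMs

// With a 48H timeout, an entry restored with a 47H-old timestamp expires one
// hour later instead of receiving a fresh 48H lease at reload time.
```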
[jira] [Updated] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-18588:
-
 Target Version/s: 2.1.0

> KafkaSourceStressForDontFailOnDataLossSuite is flaky
> 
>
> Key: SPARK-18588
> URL: https://issues.apache.org/jira/browse/SPARK-18588
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Reporter: Herman van Hovell
> Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite_name=stress+test+for+failOnDataLoss%3Dfalse
[jira] [Resolved] (SPARK-16545) Structured Streaming : foreachSink creates the Physical Plan multiple times per TriggerInterval
[ https://issues.apache.org/jira/browse/SPARK-16545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-16545.
--
 Resolution: Later

> Structured Streaming : foreachSink creates the Physical Plan multiple times per TriggerInterval
> 
>
> Key: SPARK-16545
> URL: https://issues.apache.org/jira/browse/SPARK-16545
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.0.0
> Reporter: Mario Briggs
[jira] [Updated] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server
[ https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-18655:
-
 Fix Version/s: (was: 2.1.0)

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Priority: Blocker
>
> SPARK-18516 changes the event log format of Structured Streaming. We should make sure our changes do not break the history server.
[jira] [Updated] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server
[ https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-18655:
-
 Target Version/s: 2.1.0

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Priority: Blocker
>
> SPARK-18516 changes the event log format of Structured Streaming. We should make sure our changes do not break the history server.
[jira] [Created] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
Sina Sohangir created SPARK-18656:
-

 Summary: org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns
 Key: SPARK-18656
 URL: https://issues.apache.org/jira/browse/SPARK-18656
 Project: Spark
 Issue Type: Improvement
 Components: SQL
Reporter: Sina Sohangir

org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles is implemented in a way that causes out-of-memory errors when the number of columns is high.
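The shape of a lower-memory alternative can be sketched by processing columns one at a time, so only one column's summary is resident at once. This is a hypothetical illustration: the `quantile` below is a toy exact computation over a sorted column, not the approximate sketch that Spark's StatFunctions uses, and the index convention is a simplification.

```scala
// Toy exact quantile: index into the sorted column at position floor(p * n),
// clamped to the last element. Not Spark's algorithm.
def quantile(col: Seq[Double], p: Double): Double = {
  val sorted = col.sorted
  sorted(math.min(sorted.length - 1, (p * sorted.length).toInt))
}

// Process columns sequentially so only one column's data/summary is held in
// memory at a time, instead of building all column summaries simultaneously.
def quantilesPerColumn(
    cols: Map[String, Seq[Double]],
    ps: Seq[Double]): Map[String, Seq[Double]] =
  cols.map { case (name, values) => name -> ps.map(quantile(values, _)) }
```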
[jira] [Commented] (SPARK-18536) Failed to save to hive table when case class with empty field
[ https://issues.apache.org/jira/browse/SPARK-18536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709551#comment-15709551 ]

Reynold Xin commented on SPARK-18536:
-

We need to add a PreWriteCheck for Parquet.

> Failed to save to hive table when case class with empty field
> -
>
> Key: SPARK-18536
> URL: https://issues.apache.org/jira/browse/SPARK-18536
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: pin_zhang
>
> {code}
> import scala.collection.mutable.Queue
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.streaming.Seconds
> import org.apache.spark.streaming.StreamingContext
> {code}
> 1. Test code
> {code}
> case class EmptyC()
> case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)
>
> object EmptyTest {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
>     val ctx = new SparkContext(conf)
>     val spark = SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
>     val seq = Seq(EmptyCTable(EmptyC(), 100L))
>     val rdd = ctx.makeRDD[EmptyCTable](seq)
>     val ssc = new StreamingContext(ctx, Seconds(1))
>     val queue = Queue(rdd)
>     val s = ssc.queueStream(queue, false)
>     s.foreachRDD((rdd, time) => {
>       if (!rdd.isEmpty) {
>         import spark.sqlContext.implicits._
>         rdd.toDF.write.mode(SaveMode.Overwrite).saveAsTable("empty_table")
>       }
>     })
>     ssc.start()
>     ssc.awaitTermination()
>   }
> }
> {code}
> 2. Exception
> {noformat}
> Caused by: java.lang.IllegalStateException: Cannot build an empty group
> at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
> at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:554)
> at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:426)
> at org.apache.parquet.schema.Types$Builder.named(Types.java:228)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:527)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
> at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at
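A minimal sketch of the PreWriteCheck idea suggested above. Parquet cannot represent a group with zero fields, so a schema containing an empty struct (such as the empty case class {{EmptyC}}) can be rejected before the write starts. To keep the example self-contained, Spark's {{StructType}} is modeled by a tiny stand-in ADT; the names {{DType}}, {{StructT}}, and {{AtomicT}} are hypothetical, not Spark types.

```scala
// Stand-in for Spark's data type hierarchy: a type is either atomic or a
// struct with named fields.
sealed trait DType
case object AtomicT extends DType
case class StructT(fields: Seq[(String, DType)]) extends DType

// Recursively look for a struct with zero fields anywhere in the schema --
// the condition that makes Parquet fail with "Cannot build an empty group".
def hasEmptyStruct(t: DType): Boolean = t match {
  case StructT(fields) if fields.isEmpty => true
  case StructT(fields) => fields.exists { case (_, child) => hasEmptyStruct(child) }
  case AtomicT => false
}
```

A check like this, run against the write schema, would turn the late Parquet crash into an early, descriptive analysis error.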
[jira] [Updated] (SPARK-18536) Failed to save to hive table when case class with empty field
[ https://issues.apache.org/jira/browse/SPARK-18536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-18536:

 Description:
{code}
import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
{code}
1. Test code
{code}
case class EmptyC()
case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)

object EmptyTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
    val ctx = new SparkContext(conf)
    val spark = SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
    val seq = Seq(EmptyCTable(EmptyC(), 100L))
    val rdd = ctx.makeRDD[EmptyCTable](seq)
    val ssc = new StreamingContext(ctx, Seconds(1))
    val queue = Queue(rdd)
    val s = ssc.queueStream(queue, false)
    s.foreachRDD((rdd, time) => {
      if (!rdd.isEmpty) {
        import spark.sqlContext.implicits._
        rdd.toDF.write.mode(SaveMode.Overwrite).saveAsTable("empty_table")
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}
2. Exception
{noformat}
Caused by: java.lang.IllegalStateException: Cannot build an empty group
at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:554)
at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:426)
at org.apache.parquet.schema.Types$Builder.named(Types.java:228)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:527)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
... 3 more
{noformat}

 was:
import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext

1. Test code
case class EmptyC()
case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)
object EmptyTest {
  def main(args: Array[String]): Unit = {
    val conf = new
[jira] [Commented] (SPARK-18653) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-18653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709468#comment-15709468 ]

Apache Spark commented on SPARK-18653:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/16086

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-18653
> URL: https://issues.apache.org/jira/browse/SPARK-18653
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Kazuaki Ishizaki
>
> The following program generates incorrect space padding for {{Dataset.show()}} when a column name or column value contains Unicode characters.
> Program
> {code:java}
> case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
> val ds = Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS
> ds.show
> {code}
> Output
> {code}
> +---+---++
> | 整数| 実数| s|
> +---+---++
> | 1|1.1|文字列1|
> +---+---++
> {code}
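The misalignment comes from computing padding with `String.length`, which counts a wide CJK character as one column even though terminals render it as two. A heuristic sketch of width-aware padding follows; this is not the actual fix in the linked pull request, and the character ranges are a simplification of Unicode's East Asian Width property.

```scala
// Heuristic: characters in a few common wide blocks occupy two display
// columns; everything else occupies one. (A simplification of UAX #11.)
def charWidth(c: Char): Int =
  if ((c >= 0x1100 && c <= 0x115F) ||  // Hangul Jamo
      (c >= 0x2E80 && c <= 0xA4CF) ||  // CJK radicals through Yi
      (c >= 0xAC00 && c <= 0xD7A3) ||  // Hangul syllables
      (c >= 0xF900 && c <= 0xFAFF) ||  // CJK compatibility ideographs
      (c >= 0xFF00 && c <= 0xFF60)) 2  // fullwidth forms
  else 1

def displayWidth(s: String): Int = s.map(charWidth).sum

// Right-align a cell by padding to a target display width, not string length.
def padCell(s: String, width: Int): String =
  " " * math.max(0, width - displayWidth(s)) + s
```

With `displayWidth` in place of `length`, the cell "文字列1" measures 7 columns instead of 4, so the table separators line up.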
[jira] [Assigned] (SPARK-18653) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-18653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18653:

 Assignee: Apache Spark

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-18653
> URL: https://issues.apache.org/jira/browse/SPARK-18653
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Kazuaki Ishizaki
> Assignee: Apache Spark
>
> The following program generates incorrect space padding for {{Dataset.show()}} when a column name or column value contains Unicode characters.
> Program
> {code:java}
> case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
> val ds = Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS
> ds.show
> {code}
> Output
> {code}
> +---+---++
> | 整数| 実数| s|
> +---+---++
> | 1|1.1|文字列1|
> +---+---++
> {code}
[jira] [Assigned] (SPARK-18653) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-18653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18653:

 Assignee: (was: Apache Spark)

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-18653
> URL: https://issues.apache.org/jira/browse/SPARK-18653
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Kazuaki Ishizaki
>
> The following program generates incorrect space padding for {{Dataset.show()}} when a column name or column value contains Unicode characters.
> Program
> {code:java}
> case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
> val ds = Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS
> ds.show
> {code}
> Output
> {code}
> +---+---++
> | 整数| 実数| s|
> +---+---++
> | 1|1.1|文字列1|
> +---+---++
> {code}
[jira] [Assigned] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server
[ https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18655:

 Assignee: Apache Spark (was: Shixiong Zhu)

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Shixiong Zhu
> Assignee: Apache Spark
> Priority: Blocker
> Fix For: 2.1.0
>
> SPARK-18516 changes the event log format of Structured Streaming. We should make sure our changes do not break the history server.
[jira] [Assigned] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server
[ https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18655:

 Assignee: Shixiong Zhu (was: Apache Spark)

> Ignore Structured Streaming 2.0.2 logs in history server
> 
>
> Key: SPARK-18655
> URL: https://issues.apache.org/jira/browse/SPARK-18655
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Priority: Blocker
> Fix For: 2.1.0
>
> SPARK-18516 changes the event log format of Structured Streaming. We should make sure our changes do not break the history server.