[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Summary: Implement default constraint with Column for Hive table (was: Add default constraint when create hive table) > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Major > > Column-level default constraints are part of the ANSI SQL standard. > Hive 3.0+ already supports default constraints > (ref: https://issues.apache.org/jira/browse/HIVE-18726). > But Spark SQL does not implement this feature yet. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
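For context, a minimal sketch of the requested feature, following the ANSI/Hive 3 form from HIVE-18726. The exact Spark SQL grammar is an assumption here (Spark does not accept DEFAULT constraints yet), and the table and column names are invented for illustration; `spark` is the usual spark-shell SparkSession.
{code:java}
// Hypothetical example of a column-level DEFAULT constraint (ANSI SQL, supported
// by Hive 3.0+ via HIVE-18726). Spark SQL does not parse this syntax yet.
spark.sql("""
  CREATE TABLE employees (
    id INT,
    name STRING,
    status STRING DEFAULT 'active'
  ) STORED AS PARQUET
""")

// With the constraint in place, an INSERT that omits `status` would fall back
// to the declared default value instead of NULL.
spark.sql("INSERT INTO employees (id, name) VALUES (1, 'alice')")
{code}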
[jira] [Updated] (SPARK-27943) Add default constraint when create hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: Column-level default constraints are part of the ANSI SQL standard. Hive 3.0+ already supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726). But Spark SQL does not implement this feature yet. was: Column-level default constraints are part of the ANSI SQL standard. Hive 3.0+ already supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726). But Spark SQL does not implement this feature yet. > Add default constraint when create hive table > - > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Major > > Column-level default constraints are part of the ANSI SQL standard. > Hive 3.0+ already supports default constraints > (ref: https://issues.apache.org/jira/browse/HIVE-18726). > But Spark SQL does not implement this feature yet. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856370#comment-16856370 ] Gengliang Wang commented on SPARK-27798: [~viirya] I am currently working on other issues. Thanks for working on it! > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). All it does is convert a list > of TestPayload objects into binary using the defined Avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new ByteArrayOutputStream() > val writer = new GenericDatumWriter[GenericRecord](schema) > val encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the Avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug a bit more into the code and it seems like the reuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seems to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of them to mutate.
> !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
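For readers skimming the thread, a minimal, Spark-free sketch of the aliasing failure mode described above (one shared mutable result object returned for every row). This is an analogy only, not the actual {{AvroDataToCatalyst}} code.
{code:java}
// A deserializer that reuses one mutable buffer and returns it by reference:
// every "row" ends up aliasing the same object and shows the last decoded value.
case class MutableRecord(var message: String)

val reused = MutableRecord("")
val decoded = Seq("one", "two", "three", "four").map { s =>
  reused.message = s   // overwrite the shared object in place
  reused               // return the same reference every time
}
decoded.foreach(r => println(r.message))   // prints "four" four times

// Copying before publishing each row yields independent values.
val copied = Seq("one", "two", "three", "four").map { s =>
  reused.message = s
  reused.copy()
}
copied.foreach(r => println(r.message))    // prints one, two, three, four
{code}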
[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856362#comment-16856362 ] Liang-Chi Hsieh commented on SPARK-27798: - Is anyone working on this? If not, I will probably send a PR to fix it. > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). All it does is convert a list > of TestPayload objects into binary using the defined Avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new ByteArrayOutputStream() > val writer = new GenericDatumWriter[GenericRecord](schema) > val encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the Avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug a bit more into the code and it seems like the reuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seems to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of them to mutate.
> !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27952) Built-in function: lag
Zhu, Lipeng created SPARK-27952: --- Summary: Built-in function: lag Key: SPARK-27952 URL: https://issues.apache.org/jira/browse/SPARK-27952 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Zhu, Lipeng [https://www.postgresql.org/docs/8.4/functions-window.html] |{{lag(value any [, offset integer [, default any ]])}}|same type as {{value}}|returns {{value}} evaluated at the row that is {{offset}} rows before the current row within the partition; if there is no such row, instead return {{default}}. Both {{offset}} and {{default}} are evaluated with respect to the current row. If omitted, {{offset}} defaults to 1 and {{default}} to null| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
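Spark's DataFrame API already ships a {{lag}} window function whose behavior matches the PostgreSQL description quoted above; a small usage sketch for comparison (the sales data and column names are invented for illustration):
{code:java}
// Sketch of lag() semantics using Spark's existing window-function API.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 1, 10), ("a", 2, 15), ("b", 1, 7), ("b", 2, 9))
  .toDF("shop", "day", "amount")

// offset is 1 here; the last argument is the default returned when no
// preceding row exists in the partition (0 instead of null).
val w = Window.partitionBy("shop").orderBy("day")
sales.withColumn("prev_amount", lag($"amount", 1, 0).over(w)).show()
{code}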
[jira] [Updated] (SPARK-27951) Built-in function: NTH_VALUE
[ https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhu, Lipeng updated SPARK-27951: Description: |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row| [https://www.postgresql.org/docs/8.4/functions-window.html] was:|{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row| > Built-in function: NTH_VALUE > > > Key: SPARK-27951 > URL: https://issues.apache.org/jira/browse/SPARK-27951 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > |{{nth_value(value any, nth integer)}}|same type as > {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of > the window frame (counting from 1); null if no such row| > [https://www.postgresql.org/docs/8.4/functions-window.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
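For illustration, the PostgreSQL-style query this ticket asks Spark SQL to accept. Spark does not provide {{nth_value}} at the time of writing, so this is the target syntax rather than something that runs today; the {{sales}} table and columns are invented, and `spark` is the usual spark-shell SparkSession.
{code:java}
// Target semantics for nth_value(): for each partition, return the value of
// the 2nd row of the window frame (counting from 1), or null if the frame has
// fewer than 2 rows.
spark.sql("""
  SELECT shop, day, amount,
         nth_value(amount, 2) OVER (
           PARTITION BY shop ORDER BY day
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS second_amount
  FROM sales
""").show()
{code}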
[jira] [Created] (SPARK-27951) Built-in function: NTH_VALUE
Zhu, Lipeng created SPARK-27951: --- Summary: Built-in function: NTH_VALUE Key: SPARK-27951 URL: https://issues.apache.org/jira/browse/SPARK-27951 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Zhu, Lipeng |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
[ https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27949: Issue Type: Sub-task (was: Improvement) Parent: SPARK-27764 > Support ANSI SQL grammar `substring(string_expression from n1 [for n2])` > > > Key: SPARK-27949 > URL: https://issues.apache.org/jira/browse/SPARK-27949 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Priority: Minor > > Currently, the usage of function substr/substring is > substring(string_expression, n1 [, n2]). > But ANSI SQL defines the pattern for substr/substring as > substring(string_expression from n1 [for n2]). This gap makes it > inconvenient when we switch to Spark SQL. > Can we support the ANSI pattern substring(string_expression from n1 [for > n2])? > > [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
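A side-by-side sketch of the current Spark syntax and the ANSI form requested here; the literal is arbitrary, the second statement is the proposed grammar rather than something Spark parses yet, and `spark` is the usual spark-shell SparkSession.
{code:java}
// Works in Spark today: 1-based start position and length as ordinary arguments.
spark.sql("SELECT substring('Spark SQL', 7, 3)").show()        // returns "SQL"

// ANSI SQL form proposed by this ticket; Spark does not accept it yet.
spark.sql("SELECT substring('Spark SQL' FROM 7 FOR 3)").show()
{code}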
[jira] [Updated] (SPARK-27578) Support INTERVAL ... HOUR TO SECOND syntax
[ https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27578: Issue Type: Sub-task (was: Improvement) Parent: SPARK-27764 > Support INTERVAL ... HOUR TO SECOND syntax > -- > > Key: SPARK-27578 > URL: https://issues.apache.org/jira/browse/SPARK-27578 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > Currently, Spark SQL supports the following interval format: > > {code:java} > select interval '5 23:59:59.155' day to second{code} > > Can Spark SQL support the grammar below as well? Presto/Teradata already > support it. > {code:java} > select interval '23:59:59.155' hour to second > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
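For comparison, the literal Spark already parses next to the requested HOUR TO SECOND form; the second query is the Presto/Teradata-style target grammar, not yet accepted by Spark, and `spark` is the usual spark-shell SparkSession.
{code:java}
// Supported today: a DAY TO SECOND interval literal.
spark.sql("SELECT interval '5 23:59:59.155' day to second").show()

// Requested by this ticket: the same kind of value expressed as HOUR TO SECOND.
spark.sql("SELECT interval '23:59:59.155' hour to second").show()
{code}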
[jira] [Resolved] (SPARK-27948) GPU Scheduling - Use ResouceName to represent resource names
[ https://issues.apache.org/jira/browse/SPARK-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27948. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24799 > GPU Scheduling - Use ResouceName to represent resource names > > > Key: SPARK-27948 > URL: https://issues.apache.org/jira/browse/SPARK-27948 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27950) Additional DynamoDB and CloudWatch config
[ https://issues.apache.org/jira/browse/SPARK-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric S Meisel updated SPARK-27950: -- Priority: Minor (was: Major) > Additional DynamoDB and CloudWatch config > - > > Key: SPARK-27950 > URL: https://issues.apache.org/jira/browse/SPARK-27950 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Eric S Meisel >Priority: Minor > > I was researching getting Spark’s Kinesis integration running locally against > {{localstack}}. We found this issue, and it creates a complication: > [localstack/localstack#677|https://github.com/localstack/localstack/issues/677] > Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and > Cloudwatch in order for the KCL to properly use the {{localstack}} > infrastructure. We have successfully done this with the KCL (both 1.x and > 2.x), but with Spark’s integration we are unable to configure DynamoDB and > Cloudwatch’s endpoints: > [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
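To make the request concrete, here is a sketch of the kind of endpoint override the ticket wants Spark's Kinesis receiver to expose, written against the plain AWS SDK v1 client builders. The localstack URL and region are placeholders, and this is not Spark or KCL code.
{code:java}
// Building DynamoDB and CloudWatch clients that talk to a local localstack
// endpoint instead of the real AWS services. The KCL needs all three services
// (Kinesis, DynamoDB, CloudWatch) redirected for a fully local setup.
import com.amazonaws.client.builder.AwsClientBuilder
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder

val localEndpoint =
  new AwsClientBuilder.EndpointConfiguration("http://localhost:4566", "us-east-1")

val dynamo = AmazonDynamoDBClientBuilder.standard()
  .withEndpointConfiguration(localEndpoint)   // KCL lease/checkpoint table
  .build()

val cloudwatch = AmazonCloudWatchClientBuilder.standard()
  .withEndpointConfiguration(localEndpoint)   // KCL metrics
  .build()
{code}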
[jira] [Updated] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27909: -- Fix Version/s: 3.0.0 > Fix CTE substitution dependence on ResolveRelations throwing AnalysisException > -- > > Key: SPARK-27909 > URL: https://issues.apache.org/jira/browse/SPARK-27909 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > CTE substitution currently works by running all analyzer rules on plans after > each substitution. It does this to fix a recursive CTE case, but this design > requires the ResolveRelations rule to throw an AnalysisException when it > cannot resolve a table or else the CTE substitution will run again and may > possibly recurse infinitely. > Table resolution should be possible across multiple independent rules. To > accomplish this, the current ResolveRelations rule detects cases where other > rules (like ResolveDataSource) will resolve a TableIdentifier and returns the > UnresolvedRelation unmodified only in those cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27950) Additional DynamoDB and CloudWatch config
[ https://issues.apache.org/jira/browse/SPARK-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856275#comment-16856275 ] Apache Spark commented on SPARK-27950: -- User 'etspaceman' has created a pull request for this issue: https://github.com/apache/spark/pull/24801 > Additional DynamoDB and CloudWatch config > - > > Key: SPARK-27950 > URL: https://issues.apache.org/jira/browse/SPARK-27950 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Eric S Meisel >Priority: Major > > I was researching getting Spark’s Kinesis integration running locally against > {{localstack}}. We found this issue, and it creates a complication: > [localstack/localstack#677|https://github.com/localstack/localstack/issues/677] > Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and > Cloudwatch in order for the KCL to properly use the {{localstack}} > infrastructure. We have successfully done this with the KCL (both 1.x and > 2.x), but with Spark’s integration we are unable to configure DynamoDB and > Cloudwatch’s endpoints: > [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27950) Additional DynamoDB and CloudWatch config
[ https://issues.apache.org/jira/browse/SPARK-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27950: Assignee: (was: Apache Spark) > Additional DynamoDB and CloudWatch config > - > > Key: SPARK-27950 > URL: https://issues.apache.org/jira/browse/SPARK-27950 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Eric S Meisel >Priority: Major > > I was researching getting Spark’s Kinesis integration running locally against > {{localstack}}. We found this issue, and it creates a complication: > [localstack/localstack#677|https://github.com/localstack/localstack/issues/677] > Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and > Cloudwatch in order for the KCL to properly use the {{localstack}} > infrastructure. We have successfully done this with the KCL (both 1.x and > 2.x), but with Spark’s integration we are unable to configure DynamoDB and > Cloudwatch’s endpoints: > [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27950) Additional DynamoDB and CloudWatch config
[ https://issues.apache.org/jira/browse/SPARK-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27950: Assignee: Apache Spark > Additional DynamoDB and CloudWatch config > - > > Key: SPARK-27950 > URL: https://issues.apache.org/jira/browse/SPARK-27950 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Eric S Meisel >Assignee: Apache Spark >Priority: Major > > I was researching getting Spark’s Kinesis integration running locally against > {{localstack}}. We found this issue, and it creates a complication: > [localstack/localstack#677|https://github.com/localstack/localstack/issues/677] > Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and > Cloudwatch in order for the KCL to properly use the {{localstack}} > infrastructure. We have successfully done this with the KCL (both 1.x and > 2.x), but with Spark’s integration we are unable to configure DynamoDB and > Cloudwatch’s endpoints: > [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27950) Additional DynamoDB and CloudWatch config
Eric S Meisel created SPARK-27950: - Summary: Additional DynamoDB and CloudWatch config Key: SPARK-27950 URL: https://issues.apache.org/jira/browse/SPARK-27950 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Eric S Meisel I was researching getting Spark’s Kinesis integration running locally against {{localstack}}. We found this issue, and it creates a complication: [localstack/localstack#677|https://github.com/localstack/localstack/issues/677] Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and Cloudwatch in order for the KCL to properly use the {{localstack}} infrastructure. We have successfully done this with the KCL (both 1.x and 2.x), but with Spark’s integration we are unable to configure DynamoDB and Cloudwatch’s endpoints: [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
[ https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27949: -- Priority: Minor (was: Major) > Support ANSI SQL grammar `substring(string_expression from n1 [for n2])` > > > Key: SPARK-27949 > URL: https://issues.apache.org/jira/browse/SPARK-27949 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Priority: Minor > > Currently, the usage of function substr/substring is > substring(string_expression, n1 [, n2]). > But ANSI SQL defines the pattern for substr/substring as > substring(string_expression from n1 [for n2]). This gap makes it > inconvenient when we switch to Spark SQL. > Can we support the ANSI pattern substring(string_expression from n1 [for > n2])? > > [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
[ https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27949: -- Issue Type: Improvement (was: Bug) > Support ANSI SQL grammar `substring(string_expression from n1 [for n2])` > > > Key: SPARK-27949 > URL: https://issues.apache.org/jira/browse/SPARK-27949 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Priority: Major > > Currently, the usage of function substr/substring is > substring(string_expression, n1 [, n2]). > But ANSI SQL defines the pattern for substr/substring as > substring(string_expression from n1 [for n2]). This gap makes it > inconvenient when we switch to Spark SQL. > Can we support the ANSI pattern substring(string_expression from n1 [for > n2])? > > [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
[ https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27949: Assignee: Apache Spark > Support ANSI SQL grammar `substring(string_expression from n1 [for n2])` > > > Key: SPARK-27949 > URL: https://issues.apache.org/jira/browse/SPARK-27949 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Assignee: Apache Spark >Priority: Major > > Currently, the usage of function substr/substring is > substring(string_expression, n1 [, n2]). > But ANSI SQL defines the pattern for substr/substring as > substring(string_expression from n1 [for n2]). This gap makes it > inconvenient when we switch to Spark SQL. > Can we support the ANSI pattern substring(string_expression from n1 [for > n2])? > > [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
[ https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27949: Assignee: (was: Apache Spark) > Support ANSI SQL grammar `substring(string_expression from n1 [for n2])` > > > Key: SPARK-27949 > URL: https://issues.apache.org/jira/browse/SPARK-27949 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Priority: Major > > Currently, the usage of function substr/substring is > substring(string_expression, n1 [, n2]). > But ANSI SQL defines the pattern for substr/substring as > substring(string_expression from n1 [for n2]). This gap makes it > inconvenient when we switch to Spark SQL. > Can we support the ANSI pattern substring(string_expression from n1 [for > n2])? > > [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
Zhu, Lipeng created SPARK-27949: --- Summary: Support ANSI SQL grammar `substring(string_expression from n1 [for n2])` Key: SPARK-27949 URL: https://issues.apache.org/jira/browse/SPARK-27949 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Zhu, Lipeng Currently, the usage of function substr/substring is substring(string_expression, n1 [, n2]). But ANSI SQL defines the pattern for substr/substring as substring(string_expression from n1 [for n2]). This gap makes it inconvenient when we switch to Spark SQL. Can we support the ANSI pattern substring(string_expression from n1 [for n2])? [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27947: Assignee: (was: Apache Spark) > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives this warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
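A minimal, Spark-free sketch of the erasure pitfall and of a generic way to handle the map without the unchecked cast; this is an illustration, not the actual {{ParsedStatement}} code.
{code:java}
// Type arguments are erased at runtime, so `case m: Map[String, String]` would
// also match a Map[String, Int], and a later cast of the values could throw
// ClassCastException. Matching on Map[_, _] and formatting the entries
// generically avoids the cast entirely.
def render(arg: Any): String = arg match {
  case m: Map[_, _] =>
    m.map { case (k, v) => s"$k=$v" }.mkString("{", ", ", "}")
  case other => other.toString
}

println(render(Map("abc" -> 1)))     // {abc=1}, no ClassCastException
println(render(Map("k" -> "v")))     // {k=v}
{code}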
[jira] [Assigned] (SPARK-27948) GPU Scheduling - Use ResouceName to represent resource names
[ https://issues.apache.org/jira/browse/SPARK-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27948: Assignee: Xingbo Jiang (was: Apache Spark) > GPU Scheduling - Use ResouceName to represent resource names > > > Key: SPARK-27948 > URL: https://issues.apache.org/jira/browse/SPARK-27948 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27948) GPU Scheduling - Use ResouceName to represent resource names
[ https://issues.apache.org/jira/browse/SPARK-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27948: Assignee: Apache Spark (was: Xingbo Jiang) > GPU Scheduling - Use ResouceName to represent resource names > > > Key: SPARK-27948 > URL: https://issues.apache.org/jira/browse/SPARK-27948 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27947: Assignee: Apache Spark > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: Apache Spark >Priority: Minor > > In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives this warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27948) GPU Scheduling - Use ResouceName to represent resource names
Xingbo Jiang created SPARK-27948: Summary: GPU Scheduling - Use ResouceName to represent resource names Key: SPARK-27948 URL: https://issues.apache.org/jira/browse/SPARK-27948 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 3.0.0 Reporter: Xingbo Jiang Assignee: Xingbo Jiang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives this warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives this warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives the warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives the warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} {code:java} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives the warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
John Zhuge created SPARK-27947: -- Summary: ParsedStatement subclass toString may throw ClassCastException Key: SPARK-27947 URL: https://issues.apache.org/jira/browse/SPARK-27947 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} {code:java} {code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} {code:java} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} {code:java} {code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > {code:java} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives the warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27366) Spark scheduler internal changes to support GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27366. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24374 [https://github.com/apache/spark/pull/24374] > Spark scheduler internal changes to support GPU scheduling > -- > > Key: SPARK-27366 > URL: https://issues.apache.org/jira/browse/SPARK-27366 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.0.0 > > > Update Spark job scheduler to support accelerator resource requests submitted > at application level. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855948#comment-16855948 ] John Bauer edited comment on SPARK-17025 at 6/4/19 11:12 PM: - [~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a PySpark estimator/model pair which can be saved and loaded at [ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] It imputes missing values from a normal distribution, using mean and standard deviation parameters estimated from the data, so it might be useful for that too. Let me know if it helps you, was (Author: johnhbauer): [~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a PySpark estimator/model pair which can be saved and loaded at [ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing values from a normal distribution using parameters estimated from the data. Let me know if it helps you, > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Assignee: Ajay Saini >Priority: Minor > Fix For: 2.3.0 > > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855948#comment-16855948 ] John Bauer edited comment on SPARK-17025 at 6/4/19 11:10 PM: - [~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a PySpark estimator/model pair which can be saved and loaded at [ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing values from a normal distribution using parameters estimated from the data. Let me know if it helps you, was (Author: johnhbauer): [~Hadar] [~yug95] I wrote a minimal example of a PySpark estimator/model pair which can be saved and loaded at [ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing values from a normal distribution using parameters estimated from the data. Let me know if it helps you, > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Assignee: Ajay Saini >Priority: Minor > Fix For: 2.3.0 > > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
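As background for the ImputeNormal pointer above: since 2.3.0 a Python-only Transformer can be made persistable with the DefaultParamsReadable/DefaultParamsWritable mixins, without a _to_java bridge. The sketch below uses a made-up, trivial transformer purely to show the save/load round trip; it is not the ImputeNormal code.
{code:python}
# Minimal sketch of a persistable Python-only Transformer (Spark 2.3.0+).
# ColumnCopier is a made-up example; it just duplicates a column.
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


class ColumnCopier(Transformer, HasInputCol, HasOutputCol,
                   DefaultParamsReadable, DefaultParamsWritable):
    """Copies inputCol into outputCol; exists only to demonstrate save/load."""

    def __init__(self, inputCol=None, outputCol=None):
        super(ColumnCopier, self).__init__()
        if inputCol is not None:
            self._set(inputCol=inputCol)
        if outputCol is not None:
            self._set(outputCol=outputCol)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(), F.col(self.getInputCol()))


spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,)], ["x"])

copier = ColumnCopier(inputCol="x", outputCol="y")
copier.write().overwrite().save("/tmp/column_copier")  # writes params + metadata
restored = ColumnCopier.load("/tmp/column_copier")     # no _to_java needed
restored.transform(df).show()
{code}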
[jira] [Assigned] (SPARK-27724) DataSourceV2: Add RTAS logical operation
[ https://issues.apache.org/jira/browse/SPARK-27724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27724: Assignee: (was: Apache Spark) > DataSourceV2: Add RTAS logical operation > > > Key: SPARK-27724 > URL: https://issues.apache.org/jira/browse/SPARK-27724 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27724) DataSourceV2: Add RTAS logical operation
[ https://issues.apache.org/jira/browse/SPARK-27724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27724: Assignee: Apache Spark > DataSourceV2: Add RTAS logical operation > > > Key: SPARK-27724 > URL: https://issues.apache.org/jira/browse/SPARK-27724 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27909. - Resolution: Fixed Assignee: Ryan Blue > Fix CTE substitution dependence on ResolveRelations throwing AnalysisException > -- > > Key: SPARK-27909 > URL: https://issues.apache.org/jira/browse/SPARK-27909 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > CTE substitution currently works by running all analyzer rules on plans after > each substitution. It does this to fix a recursive CTE case, but this design > requires the ResolveRelations rule to throw an AnalysisException when it > cannot resolve a table or else the CTE substitution will run again and may > possibly recurse infinitely. > Table resolution should be possible across multiple independent rules. To > accomplish this, the current ResolveRelations rule detects cases where other > rules (like ResolveDataSource) will resolve a TableIdentifier and returns the > UnresolvedRelation unmodified only in those cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27909: Issue Type: Improvement (was: Bug) > Fix CTE substitution dependence on ResolveRelations throwing AnalysisException > -- > > Key: SPARK-27909 > URL: https://issues.apache.org/jira/browse/SPARK-27909 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > CTE substitution currently works by running all analyzer rules on plans after > each substitution. It does this to fix a recursive CTE case, but this design > requires the ResolveRelations rule to throw an AnalysisException when it > cannot resolve a table or else the CTE substitution will run again and may > possibly recurse infinitely. > Table resolution should be possible across multiple independent rules. To > accomplish this, the current ResolveRelations rule detects cases where other > rules (like ResolveDataSource) will resolve a TableIdentifier and returns the > UnresolvedRelation unmodified only in those cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856138#comment-16856138 ] Xiao Li commented on SPARK-27909: - We do not support recursive CTEs. I changed it to an improvement. > Fix CTE substitution dependence on ResolveRelations throwing AnalysisException > -- > > Key: SPARK-27909 > URL: https://issues.apache.org/jira/browse/SPARK-27909 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > CTE substitution currently works by running all analyzer rules on plans after > each substitution. It does this to fix a recursive CTE case, but this design > requires the ResolveRelations rule to throw an AnalysisException when it > cannot resolve a table or else the CTE substitution will run again and may > possibly recurse infinitely. > Table resolution should be possible across multiple independent rules. To > accomplish this, the current ResolveRelations rule detects cases where other > rules (like ResolveDataSource) will resolve a TableIdentifier and returns the > UnresolvedRelation unmodified only in those cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
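For anyone reading along without Catalyst context, the rule under discussion substitutes WITH-clause definitions into the plan before table resolution, which is why an unresolvable name must not silently fall through. A small, self-contained illustration (the view name and data are made up):
{code:python}
# Illustration of CTE substitution: the WITH-clause name shadows a view/table
# of the same name during analysis. Data and names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, "view")], ["id", "source"]).createOrReplaceTempView("people")

spark.sql("""
    WITH people AS (SELECT 2 AS id, 'cte' AS source)
    SELECT * FROM people
""").show()
# Returns the CTE row (2, 'cte'), not the temp view's row, because CTE
# substitution rewrites the reference before ResolveRelations runs.
{code}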
[jira] [Created] (SPARK-27946) Hive DDL to Spark DDL conversion feature for "show create table"
Xiao Li created SPARK-27946: --- Summary: Hive DDL to Spark DDL conversion feature for "show create table" Key: SPARK-27946 URL: https://issues.apache.org/jira/browse/SPARK-27946 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Xiao Li Many users migrate tables created with Hive DDL to Spark/DB. Defining the table with Spark DDL brings performance benefits. We need to add a feature to Show Create Table that allows you to generate Spark DDL for a table. For example: `SHOW CREATE TABLE customers AS SPARK`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27946) Hive DDL to Spark DDL conversion feature for "show create table"
[ https://issues.apache.org/jira/browse/SPARK-27946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27946: Issue Type: Improvement (was: Bug) > Hive DDL to Spark DDL conversion feature for "show create table" > > > Key: SPARK-27946 > URL: https://issues.apache.org/jira/browse/SPARK-27946 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Many users migrate tables created with Hive DDL to Spark/DB. Defining the > table with Spark DDL brings performance benefits. We need to add a feature to > Show Create Table that allows you to generate Spark DDL for a table. For > example: `SHOW CREATE TABLE customers AS SPARK`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27946) Hive DDL to Spark DDL conversion USING "show create table"
[ https://issues.apache.org/jira/browse/SPARK-27946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27946: Summary: Hive DDL to Spark DDL conversion USING "show create table" (was: Hive DDL to Spark DDL conversion feature for "show create table") > Hive DDL to Spark DDL conversion USING "show create table" > -- > > Key: SPARK-27946 > URL: https://issues.apache.org/jira/browse/SPARK-27946 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Many users migrate tables created with Hive DDL to Spark/DB. Defining the > table with Spark DDL brings performance benefits. We need to add a feature to > Show Create Table that allows you to generate Spark DDL for a table. For > example: `SHOW CREATE TABLE customers AS SPARK`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
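To make the proposal concrete: today SHOW CREATE TABLE on a Hive-created table returns Hive DDL, and the ticket asks for a variant that emits equivalent Spark DDL. The sketch below only illustrates the intended usage; the AS SPARK clause is the syntax proposed in this ticket, not an existing command, and `customers` is a hypothetical table.
{code:python}
# Sketch of current vs. proposed behaviour; `customers` is a hypothetical
# Hive-DDL table and `AS SPARK` is only the syntax proposed in this ticket.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Current behaviour: returns the original Hive DDL for a Hive-created table.
spark.sql("SHOW CREATE TABLE customers").show(truncate=False)

# Proposed: emit equivalent Spark DDL (CREATE TABLE ... USING ...) so the
# table can be redefined as a native Spark table.
# spark.sql("SHOW CREATE TABLE customers AS SPARK").show(truncate=False)
{code}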
[jira] [Updated] (SPARK-27900) Spark driver will not exit due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27900: Summary: Spark driver will not exit due to an oom error (was: Spark will not exit due to an oom error) > Spark driver will not exit due to an oom error > -- > > Key: SPARK-27900 > URL: https://issues.apache.org/jira/browse/SPARK-27900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Priority: Major > > This affects Spark on K8s at least as pods will run forever and makes > impossible for tools like Spark Operator to report back > job status. > A spark pi job is running: > spark-pi-driver 1/1 Running 0 1h > spark-pi2-1559309337787-exec-1 1/1 Running 0 1h > spark-pi2-1559309337787-exec-2 1/1 Running 0 1h > with the following setup: > {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" > kind: SparkApplication > metadata: > name: spark-pi > namespace: spark > spec: > type: Scala > mode: cluster > image: "skonto/spark:k8s-3.0.0-sa" > imagePullPolicy: Always > mainClass: org.apache.spark.examples.SparkPi > mainApplicationFile: > "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" > arguments: > - "100" > sparkVersion: "2.4.0" > restartPolicy: > type: Never > nodeSelector: > "spark": "autotune" > driver: > memory: "1g" > labels: > version: 2.4.0 > serviceAccount: spark-sa > executor: > instances: 2 > memory: "1g" > labels: > version: 2.4.0{quote} > At some point the driver fails but it is still running and so the pods are > still running: > 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 > (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 3.0 KiB, free 110.0 MiB) > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 1765.0 B, free 110.0 MiB) > 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: > 110.0 MiB) > 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:1180 > 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from > ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 > tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, > 14)) > 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 > tasks > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at > scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) > at > scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) > Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached > $ kubectl describe pod spark-pi2-driver -n spark > Name: spark-pi2-driver > Namespace: spark > Priority: 0 > PriorityClassName: > Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 > Start Time: Fri, 31 May 2019 16:28:59 +0300 > Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 > spark-role=driver > sparkoperator.k8s.io/app-name=spark-pi2 > sparkoperator.k8s.io/launched-by-spark-operator=true > sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 > version=2.4.0 > Annotations: > 
Status: Running > IP: 10.12.103.4 > Controlled By: SparkApplication/spark-pi2 > Containers: > spark-kubernetes-driver: > Container ID: > docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f > Image: skonto/spark:k8s-3.0.0-sa > Image ID: > docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 > Ports: 7078/TCP, 7079/TCP, 4040/TCP > Host Ports: 0/TCP, 0/TCP, 0/TCP > Args: > driver > --properties-file > /opt/spark/conf/spark.properties > --class > org.apache.spark.examples.SparkPi > spark-internal > 100 > State: Running > In the container processes are in _interruptible sleep_: > PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND > 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp > /opt/spark/conf/:/opt/spark/jars/* -Xmx500m > org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar > 287 0 185 S 2344 0% 3 0% sh > 294 287 185 R 1536 0% 3 0% top > 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file > /opt/spark/conf/spark.prope > Liveness
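A common mitigation for the behaviour described above (the driver JVM loses its dag-scheduler-event-loop thread to an OutOfMemoryError but keeps running, so the pod never terminates) is to ask the JVM to exit on OOM. This is standard JVM/Spark configuration rather than a fix from this ticket, and driver-side java options have to be supplied at submit time (spark-submit or the operator spec), as sketched below.
{code:python}
# Sketch: make the driver/executor JVMs terminate on OutOfMemoryError so the
# Kubernetes pods fail visibly instead of lingering. -XX:+ExitOnOutOfMemoryError
# is a standard HotSpot flag (8u92+). spark.driver.extraJavaOptions must be set
# at submit time; this snippet only shows the resulting configuration.
from pyspark import SparkConf

conf = (
    SparkConf()
    .setAppName("spark-pi")
    .set("spark.driver.extraJavaOptions", "-XX:+ExitOnOutOfMemoryError")
    .set("spark.executor.extraJavaOptions", "-XX:+ExitOnOutOfMemoryError")
)
print(conf.toDebugString())
{code}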
[jira] [Updated] (SPARK-27900) Spark will not exit due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27900: Description: This affects Spark on K8s at least as pods will run forever. A spark pi job is running: spark-pi-driver 1/1 Running 0 1h spark-pi2-1559309337787-exec-1 1/1 Running 0 1h spark-pi2-1559309337787-exec-2 1/1 Running 0 1h with the following setup: {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" kind: SparkApplication metadata: name: spark-pi namespace: spark spec: type: Scala mode: cluster image: "skonto/spark:k8s-3.0.0-sa" imagePullPolicy: Always mainClass: org.apache.spark.examples.SparkPi mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" arguments: - "100" sparkVersion: "2.4.0" restartPolicy: type: Never nodeSelector: "spark": "autotune" driver: memory: "1g" labels: version: 2.4.0 serviceAccount: spark-sa executor: instances: 2 memory: "1g" labels: version: 2.4.0{quote} At some point the driver fails but it is still running and so the pods are still running: 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB) 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB) 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB) 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached $ kubectl describe pod spark-pi2-driver -n spark Name: spark-pi2-driver Namespace: spark Priority: 0 PriorityClassName: Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 Start Time: Fri, 31 May 2019 16:28:59 +0300 Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 spark-role=driver sparkoperator.k8s.io/app-name=spark-pi2 sparkoperator.k8s.io/launched-by-spark-operator=true sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 version=2.4.0 Annotations: Status: Running IP: 10.12.103.4 Controlled By: SparkApplication/spark-pi2 Containers: spark-kubernetes-driver: Container ID: docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f Image: skonto/spark:k8s-3.0.0-sa Image ID: docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 100 State: Running In the container processes are in _interruptible 
sleep_: PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar 287 0 185 S 2344 0% 3 0% sh 294 287 185 R 1536 0% 3 0% top 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope Liveness checks might be a workaround but rest apis may be still working if threads in jvm still are running as in this case (I did check the spark ui and it was there). was: A spark pi job is running: spark-pi-driver 1/1 Running 0 1h spark-pi2-1559309337787-exec-1 1/1 Running 0 1h spark-pi2-1559309337787-exec-2 1/1 Running 0 1h with the following setup: {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" kind: SparkApplication metadata: name: spark-pi namespace: spark spec: type: Scala mode: cluster image: "skonto/spark:k8s-3.0.0-sa" imagePullPolicy: Always mainClass: org.apache.spark.examples.SparkPi mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" arguments: - "100" sparkVersion: "2.4.0" restartPolicy: type: Ne
[jira] [Updated] (SPARK-27900) Spark will not exit due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27900: Description: This affects Spark on K8s at least as pods will run forever and makes impossible for tools like Spark Operator to report back job status. A spark pi job is running: spark-pi-driver 1/1 Running 0 1h spark-pi2-1559309337787-exec-1 1/1 Running 0 1h spark-pi2-1559309337787-exec-2 1/1 Running 0 1h with the following setup: {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" kind: SparkApplication metadata: name: spark-pi namespace: spark spec: type: Scala mode: cluster image: "skonto/spark:k8s-3.0.0-sa" imagePullPolicy: Always mainClass: org.apache.spark.examples.SparkPi mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" arguments: - "100" sparkVersion: "2.4.0" restartPolicy: type: Never nodeSelector: "spark": "autotune" driver: memory: "1g" labels: version: 2.4.0 serviceAccount: spark-sa executor: instances: 2 memory: "1g" labels: version: 2.4.0{quote} At some point the driver fails but it is still running and so the pods are still running: 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB) 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB) 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB) 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached $ kubectl describe pod spark-pi2-driver -n spark Name: spark-pi2-driver Namespace: spark Priority: 0 PriorityClassName: Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 Start Time: Fri, 31 May 2019 16:28:59 +0300 Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 spark-role=driver sparkoperator.k8s.io/app-name=spark-pi2 sparkoperator.k8s.io/launched-by-spark-operator=true sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 version=2.4.0 Annotations: Status: Running IP: 10.12.103.4 Controlled By: SparkApplication/spark-pi2 Containers: spark-kubernetes-driver: Container ID: docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f Image: skonto/spark:k8s-3.0.0-sa Image ID: docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi 
spark-internal 100 State: Running In the container processes are in _interruptible sleep_: PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar 287 0 185 S 2344 0% 3 0% sh 294 287 185 R 1536 0% 3 0% top 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope Liveness checks might be a workaround but rest apis may be still working if threads in jvm still are running as in this case (I did check the spark ui and it was there). was: This affects Spark on K8s at least as pods will run forever. A spark pi job is running: spark-pi-driver 1/1 Running 0 1h spark-pi2-1559309337787-exec-1 1/1 Running 0 1h spark-pi2-1559309337787-exec-2 1/1 Running 0 1h with the following setup: {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" kind: SparkApplication metadata: name: spark-pi namespace: spark spec: type: Scala mode: cluster image: "skonto/spark:k8s-3.0.0-sa" imagePullPolicy: Always mainClass: org.apache.spark.examples.SparkPi mainApplicationFile: "local
[jira] [Updated] (SPARK-27900) Spark will not exit due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27900: Summary: Spark will not exit due to an oom error (was: Spark on K8s will not report container failure due to an oom error) > Spark will not exit due to an oom error > --- > > Key: SPARK-27900 > URL: https://issues.apache.org/jira/browse/SPARK-27900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Priority: Major > > A spark pi job is running: > spark-pi-driver 1/1 Running 0 1h > spark-pi2-1559309337787-exec-1 1/1 Running 0 1h > spark-pi2-1559309337787-exec-2 1/1 Running 0 1h > with the following setup: > {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" > kind: SparkApplication > metadata: > name: spark-pi > namespace: spark > spec: > type: Scala > mode: cluster > image: "skonto/spark:k8s-3.0.0-sa" > imagePullPolicy: Always > mainClass: org.apache.spark.examples.SparkPi > mainApplicationFile: > "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" > arguments: > - "100" > sparkVersion: "2.4.0" > restartPolicy: > type: Never > nodeSelector: > "spark": "autotune" > driver: > memory: "1g" > labels: > version: 2.4.0 > serviceAccount: spark-sa > executor: > instances: 2 > memory: "1g" > labels: > version: 2.4.0{quote} > At some point the driver fails but it is still running and so the pods are > still running: > 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 > (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 3.0 KiB, free 110.0 MiB) > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 1765.0 B, free 110.0 MiB) > 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: > 110.0 MiB) > 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:1180 > 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from > ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 > tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, > 14)) > 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 > tasks > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at > scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) > at > scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) > Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached > $ kubectl describe pod spark-pi2-driver -n spark > Name: spark-pi2-driver > Namespace: spark > Priority: 0 > PriorityClassName: > Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 > Start Time: Fri, 31 May 2019 16:28:59 +0300 > Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 > spark-role=driver > sparkoperator.k8s.io/app-name=spark-pi2 > sparkoperator.k8s.io/launched-by-spark-operator=true > sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 > version=2.4.0 > Annotations: > Status: Running > IP: 10.12.103.4 > Controlled By: SparkApplication/spark-pi2 > Containers: > spark-kubernetes-driver: > Container ID: > 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f > Image: skonto/spark:k8s-3.0.0-sa > Image ID: > docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 > Ports: 7078/TCP, 7079/TCP, 4040/TCP > Host Ports: 0/TCP, 0/TCP, 0/TCP > Args: > driver > --properties-file > /opt/spark/conf/spark.properties > --class > org.apache.spark.examples.SparkPi > spark-internal > 100 > State: Running > In the container processes are in _interruptible sleep_: > PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND > 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp > /opt/spark/conf/:/opt/spark/jars/* -Xmx500m > org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar > 287 0 185 S 2344 0% 3 0% sh > 294 287 185 R 1536 0% 3 0% top > 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file > /opt/spark/conf/spark.prope > Liveness checks might be a workaround but rest apis may be still working if > threads in jvm still are running as in this case (I did check the sp
[jira] [Updated] (SPARK-27900) Spark on K8s will not report container failure due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27900: Component/s: (was: Kubernetes) > Spark on K8s will not report container failure due to an oom error > -- > > Key: SPARK-27900 > URL: https://issues.apache.org/jira/browse/SPARK-27900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Priority: Major > > A spark pi job is running: > spark-pi-driver 1/1 Running 0 1h > spark-pi2-1559309337787-exec-1 1/1 Running 0 1h > spark-pi2-1559309337787-exec-2 1/1 Running 0 1h > with the following setup: > {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" > kind: SparkApplication > metadata: > name: spark-pi > namespace: spark > spec: > type: Scala > mode: cluster > image: "skonto/spark:k8s-3.0.0-sa" > imagePullPolicy: Always > mainClass: org.apache.spark.examples.SparkPi > mainApplicationFile: > "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" > arguments: > - "100" > sparkVersion: "2.4.0" > restartPolicy: > type: Never > nodeSelector: > "spark": "autotune" > driver: > memory: "1g" > labels: > version: 2.4.0 > serviceAccount: spark-sa > executor: > instances: 2 > memory: "1g" > labels: > version: 2.4.0{quote} > At some point the driver fails but it is still running and so the pods are > still running: > 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 > (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 3.0 KiB, free 110.0 MiB) > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 1765.0 B, free 110.0 MiB) > 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: > 110.0 MiB) > 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:1180 > 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from > ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 > tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, > 14)) > 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 > tasks > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at > scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) > at > scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) > Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached > $ kubectl describe pod spark-pi2-driver -n spark > Name: spark-pi2-driver > Namespace: spark > Priority: 0 > PriorityClassName: > Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 > Start Time: Fri, 31 May 2019 16:28:59 +0300 > Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 > spark-role=driver > sparkoperator.k8s.io/app-name=spark-pi2 > sparkoperator.k8s.io/launched-by-spark-operator=true > sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 > version=2.4.0 > Annotations: > Status: Running > IP: 10.12.103.4 > Controlled By: SparkApplication/spark-pi2 > Containers: > spark-kubernetes-driver: > Container ID: > 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f > Image: skonto/spark:k8s-3.0.0-sa > Image ID: > docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 > Ports: 7078/TCP, 7079/TCP, 4040/TCP > Host Ports: 0/TCP, 0/TCP, 0/TCP > Args: > driver > --properties-file > /opt/spark/conf/spark.properties > --class > org.apache.spark.examples.SparkPi > spark-internal > 100 > State: Running > In the container processes are in _interruptible sleep_: > PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND > 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp > /opt/spark/conf/:/opt/spark/jars/* -Xmx500m > org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar > 287 0 185 S 2344 0% 3 0% sh > 294 287 185 R 1536 0% 3 0% top > 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file > /opt/spark/conf/spark.prope > Liveness checks might be a workaround but rest apis may be still working if > threads in jvm still are running as in this case (I did check the spark ui > and it was there). > >
[jira] [Updated] (SPARK-27900) Spark on K8s will not report container failure due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27900: Component/s: Spark Core > Spark on K8s will not report container failure due to an oom error > -- > > Key: SPARK-27900 > URL: https://issues.apache.org/jira/browse/SPARK-27900 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Priority: Major > > A spark pi job is running: > spark-pi-driver 1/1 Running 0 1h > spark-pi2-1559309337787-exec-1 1/1 Running 0 1h > spark-pi2-1559309337787-exec-2 1/1 Running 0 1h > with the following setup: > {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" > kind: SparkApplication > metadata: > name: spark-pi > namespace: spark > spec: > type: Scala > mode: cluster > image: "skonto/spark:k8s-3.0.0-sa" > imagePullPolicy: Always > mainClass: org.apache.spark.examples.SparkPi > mainApplicationFile: > "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" > arguments: > - "100" > sparkVersion: "2.4.0" > restartPolicy: > type: Never > nodeSelector: > "spark": "autotune" > driver: > memory: "1g" > labels: > version: 2.4.0 > serviceAccount: spark-sa > executor: > instances: 2 > memory: "1g" > labels: > version: 2.4.0{quote} > At some point the driver fails but it is still running and so the pods are > still running: > 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 > (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 3.0 KiB, free 110.0 MiB) > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 1765.0 B, free 110.0 MiB) > 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: > 110.0 MiB) > 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:1180 > 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from > ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 > tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, > 14)) > 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 > tasks > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at > scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) > at > scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) > Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached > $ kubectl describe pod spark-pi2-driver -n spark > Name: spark-pi2-driver > Namespace: spark > Priority: 0 > PriorityClassName: > Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 > Start Time: Fri, 31 May 2019 16:28:59 +0300 > Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 > spark-role=driver > sparkoperator.k8s.io/app-name=spark-pi2 > sparkoperator.k8s.io/launched-by-spark-operator=true > sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 > version=2.4.0 > Annotations: > Status: Running > IP: 10.12.103.4 > Controlled By: SparkApplication/spark-pi2 > Containers: > spark-kubernetes-driver: > Container ID: > 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f > Image: skonto/spark:k8s-3.0.0-sa > Image ID: > docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 > Ports: 7078/TCP, 7079/TCP, 4040/TCP > Host Ports: 0/TCP, 0/TCP, 0/TCP > Args: > driver > --properties-file > /opt/spark/conf/spark.properties > --class > org.apache.spark.examples.SparkPi > spark-internal > 100 > State: Running > In the container processes are in _interruptible sleep_: > PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND > 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp > /opt/spark/conf/:/opt/spark/jars/* -Xmx500m > org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar > 287 0 185 S 2344 0% 3 0% sh > 294 287 185 R 1536 0% 3 0% top > 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file > /opt/spark/conf/spark.prope > Liveness checks might be a workaround but rest apis may be still working if > threads in jvm still are running as in this case (I did check the spark ui > and it was there). > >
[jira] [Assigned] (SPARK-27900) Spark on K8s will not report container failure due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27900: Assignee: (was: Apache Spark) > Spark on K8s will not report container failure due to an oom error > -- > > Key: SPARK-27900 > URL: https://issues.apache.org/jira/browse/SPARK-27900 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Priority: Major > > A spark pi job is running: > spark-pi-driver 1/1 Running 0 1h > spark-pi2-1559309337787-exec-1 1/1 Running 0 1h > spark-pi2-1559309337787-exec-2 1/1 Running 0 1h > with the following setup: > {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" > kind: SparkApplication > metadata: > name: spark-pi > namespace: spark > spec: > type: Scala > mode: cluster > image: "skonto/spark:k8s-3.0.0-sa" > imagePullPolicy: Always > mainClass: org.apache.spark.examples.SparkPi > mainApplicationFile: > "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" > arguments: > - "100" > sparkVersion: "2.4.0" > restartPolicy: > type: Never > nodeSelector: > "spark": "autotune" > driver: > memory: "1g" > labels: > version: 2.4.0 > serviceAccount: spark-sa > executor: > instances: 2 > memory: "1g" > labels: > version: 2.4.0{quote} > At some point the driver fails but it is still running and so the pods are > still running: > 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 > (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 3.0 KiB, free 110.0 MiB) > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 1765.0 B, free 110.0 MiB) > 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: > 110.0 MiB) > 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:1180 > 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from > ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 > tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, > 14)) > 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 > tasks > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at > scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) > at > scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) > Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached > $ kubectl describe pod spark-pi2-driver -n spark > Name: spark-pi2-driver > Namespace: spark > Priority: 0 > PriorityClassName: > Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 > Start Time: Fri, 31 May 2019 16:28:59 +0300 > Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 > spark-role=driver > sparkoperator.k8s.io/app-name=spark-pi2 > sparkoperator.k8s.io/launched-by-spark-operator=true > sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 > version=2.4.0 > Annotations: > Status: Running > IP: 10.12.103.4 > Controlled By: SparkApplication/spark-pi2 > Containers: > spark-kubernetes-driver: > Container ID: > 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f > Image: skonto/spark:k8s-3.0.0-sa > Image ID: > docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 > Ports: 7078/TCP, 7079/TCP, 4040/TCP > Host Ports: 0/TCP, 0/TCP, 0/TCP > Args: > driver > --properties-file > /opt/spark/conf/spark.properties > --class > org.apache.spark.examples.SparkPi > spark-internal > 100 > State: Running > In the container processes are in _interruptible sleep_: > PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND > 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp > /opt/spark/conf/:/opt/spark/jars/* -Xmx500m > org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar > 287 0 185 S 2344 0% 3 0% sh > 294 287 185 R 1536 0% 3 0% top > 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file > /opt/spark/conf/spark.prope > Liveness checks might be a workaround but rest apis may be still working if > threads in jvm still are running as in this case (I did check the spark ui > and it was there). > > -- T
[jira] [Assigned] (SPARK-27900) Spark on K8s will not report container failure due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27900: Assignee: Apache Spark > Spark on K8s will not report container failure due to an oom error > -- > > Key: SPARK-27900 > URL: https://issues.apache.org/jira/browse/SPARK-27900 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Assignee: Apache Spark >Priority: Major > > A spark pi job is running: > spark-pi-driver 1/1 Running 0 1h > spark-pi2-1559309337787-exec-1 1/1 Running 0 1h > spark-pi2-1559309337787-exec-2 1/1 Running 0 1h > with the following setup: > {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" > kind: SparkApplication > metadata: > name: spark-pi > namespace: spark > spec: > type: Scala > mode: cluster > image: "skonto/spark:k8s-3.0.0-sa" > imagePullPolicy: Always > mainClass: org.apache.spark.examples.SparkPi > mainApplicationFile: > "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" > arguments: > - "100" > sparkVersion: "2.4.0" > restartPolicy: > type: Never > nodeSelector: > "spark": "autotune" > driver: > memory: "1g" > labels: > version: 2.4.0 > serviceAccount: spark-sa > executor: > instances: 2 > memory: "1g" > labels: > version: 2.4.0{quote} > At some point the driver fails but it is still running and so the pods are > still running: > 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 > (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 3.0 KiB, free 110.0 MiB) > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 1765.0 B, free 110.0 MiB) > 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: > 110.0 MiB) > 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:1180 > 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from > ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 > tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, > 14)) > 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 > tasks > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at > scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) > at > scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) > Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached > $ kubectl describe pod spark-pi2-driver -n spark > Name: spark-pi2-driver > Namespace: spark > Priority: 0 > PriorityClassName: > Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 > Start Time: Fri, 31 May 2019 16:28:59 +0300 > Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 > spark-role=driver > sparkoperator.k8s.io/app-name=spark-pi2 > sparkoperator.k8s.io/launched-by-spark-operator=true > sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 > version=2.4.0 > Annotations: > Status: Running > IP: 10.12.103.4 > Controlled By: SparkApplication/spark-pi2 > Containers: > spark-kubernetes-driver: > Container ID: > 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f > Image: skonto/spark:k8s-3.0.0-sa > Image ID: > docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 > Ports: 7078/TCP, 7079/TCP, 4040/TCP > Host Ports: 0/TCP, 0/TCP, 0/TCP > Args: > driver > --properties-file > /opt/spark/conf/spark.properties > --class > org.apache.spark.examples.SparkPi > spark-internal > 100 > State: Running > In the container processes are in _interruptible sleep_: > PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND > 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp > /opt/spark/conf/:/opt/spark/jars/* -Xmx500m > org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar > 287 0 185 S 2344 0% 3 0% sh > 294 287 185 R 1536 0% 3 0% top > 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file > /opt/spark/conf/spark.prope > Liveness checks might be a workaround but rest apis may be still working if > threads in jvm still are running as in this case (I did check the spark ui > and it wa
[jira] [Issue Comment Deleted] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-20202: - Comment: was deleted (was: gentle ping) > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Major > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-20202: - Comment: was deleted (was: bump) > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Major > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27567) Spark Streaming consumers (from Kafka) intermittently die with 'SparkException: Couldn't find leaders for Set'
[ https://issues.apache.org/jira/browse/SPARK-27567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Goldenberg resolved SPARK-27567. --- Resolution: Not A Bug In our case, issue was caused by someone changing the IP address of the box. Ideally, Kafka or Spark could have more "telling", more intuitive error messages for these types of issues. > Spark Streaming consumers (from Kafka) intermittently die with > 'SparkException: Couldn't find leaders for Set' > -- > > Key: SPARK-27567 > URL: https://issues.apache.org/jira/browse/SPARK-27567 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.5.0 > Environment: GCP / 170~14.04.1-Ubuntu >Reporter: Dmitry Goldenberg >Priority: Major > > Some of our consumers intermittently die with the stack traces I'm including. > Once restarted they run for a while then die again. > I can't find any cohesive documentation on what this error means and how to > go about troubleshooting it. Any help would be appreciated. > *Kafka version* is 0.8.2.1 (2.10-0.8.2.1). > Some of the errors seen look like this: > {noformat} > ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 2 on > 10.150.0.54: remote Rpc client disassociated{noformat} > Main error stack trace: > {noformat} > 2019-04-23 20:36:54,323 ERROR > org.apache.spark.streaming.scheduler.JobScheduler: Error g > enerating jobs for time 1556066214000 ms > org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: > Couldn't find leaders for Set([hdfs.hbase.acme.attachments,49], > [hdfs.hbase.acme.attachmen > ts,63], [hdfs.hbase.acme.attachments,31], [hdfs.hbase.acme.attachments,9], > [hdfs.hbase.acme.attachments,25], [hdfs.hbase.acme.attachments,55], > [hdfs.hbase.acme.attachme > nts,5], [hdfs.hbase.acme.attachments,37], [hdfs.hbase.acme.attachments,7], > [hdfs.hbase.acme.attachments,47], [hdfs.hbase.acme.attachments,13], > [hdfs.hbase.acme.attachme > nts,43], [hdfs.hbase.acme.attachments,19], [hdfs.hbase.acme.attachments,15], > [hdfs.hbase.acme.attachments,23], [hdfs.hbase.acme.attachments,53], > [hdfs.hbase.acme.attach > ments,1], [hdfs.hbase.acme.attachments,27], [hdfs.hbase.acme.attachments,57], > [hdfs.hbase.acme.attachments,39], [hdfs.hbase.acme.attachments,11], > [hdfs.hbase.acme.attac > hments,29], [hdfs.hbase.acme.attachments,33], > [hdfs.hbase.acme.attachments,35], [hdfs.hbase.acme.attachments,51], > [hdfs.hbase.acme.attachments,45], [hdfs.hbase.acme.att > achments,21], [hdfs.hbase.acme.attachments,3], > [hdfs.hbase.acme.attachments,59], [hdfs.hbase.acme.attachments,41], > [hdfs.hbase.acme.attachments,17], [hdfs.hbase.acme.at > tachments,61])) > at > org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.j > ar:?] > at > org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > ~[spark-assembly-1.5.0-hadoop2.4.0.ja > r:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > ~[spark-assembly-1.5.0-hadoop2.4.0.ja > r:1.5.0] > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] 
> at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at scala.Option.orElse(Option.scala:257) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.appl
[jira] [Commented] (SPARK-27567) Spark Streaming consumers (from Kafka) intermittently die with 'SparkException: Couldn't find leaders for Set'
[ https://issues.apache.org/jira/browse/SPARK-27567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855996#comment-16855996 ] Dmitry Goldenberg commented on SPARK-27567: --- OK, it appears that the IP address of the box in question got changed and that sent Kafka (and the consumers along with it) heywire. We can probably close the ticket but this may be useful to other folks. The takeaway is, check your networking settings; any changes to the IP address or the like. Kafka doesn't respond too well to that :) > Spark Streaming consumers (from Kafka) intermittently die with > 'SparkException: Couldn't find leaders for Set' > -- > > Key: SPARK-27567 > URL: https://issues.apache.org/jira/browse/SPARK-27567 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.5.0 > Environment: GCP / 170~14.04.1-Ubuntu >Reporter: Dmitry Goldenberg >Priority: Major > > Some of our consumers intermittently die with the stack traces I'm including. > Once restarted they run for a while then die again. > I can't find any cohesive documentation on what this error means and how to > go about troubleshooting it. Any help would be appreciated. > *Kafka version* is 0.8.2.1 (2.10-0.8.2.1). > Some of the errors seen look like this: > {noformat} > ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 2 on > 10.150.0.54: remote Rpc client disassociated{noformat} > Main error stack trace: > {noformat} > 2019-04-23 20:36:54,323 ERROR > org.apache.spark.streaming.scheduler.JobScheduler: Error g > enerating jobs for time 1556066214000 ms > org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: > Couldn't find leaders for Set([hdfs.hbase.acme.attachments,49], > [hdfs.hbase.acme.attachmen > ts,63], [hdfs.hbase.acme.attachments,31], [hdfs.hbase.acme.attachments,9], > [hdfs.hbase.acme.attachments,25], [hdfs.hbase.acme.attachments,55], > [hdfs.hbase.acme.attachme > nts,5], [hdfs.hbase.acme.attachments,37], [hdfs.hbase.acme.attachments,7], > [hdfs.hbase.acme.attachments,47], [hdfs.hbase.acme.attachments,13], > [hdfs.hbase.acme.attachme > nts,43], [hdfs.hbase.acme.attachments,19], [hdfs.hbase.acme.attachments,15], > [hdfs.hbase.acme.attachments,23], [hdfs.hbase.acme.attachments,53], > [hdfs.hbase.acme.attach > ments,1], [hdfs.hbase.acme.attachments,27], [hdfs.hbase.acme.attachments,57], > [hdfs.hbase.acme.attachments,39], [hdfs.hbase.acme.attachments,11], > [hdfs.hbase.acme.attac > hments,29], [hdfs.hbase.acme.attachments,33], > [hdfs.hbase.acme.attachments,35], [hdfs.hbase.acme.attachments,51], > [hdfs.hbase.acme.attachments,45], [hdfs.hbase.acme.att > achments,21], [hdfs.hbase.acme.attachments,3], > [hdfs.hbase.acme.attachments,59], [hdfs.hbase.acme.attachments,41], > [hdfs.hbase.acme.attachments,17], [hdfs.hbase.acme.at > tachments,61])) > at > org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.j > ar:?] > at > org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] 
> at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > ~[spark-assembly-1.5.0-hadoop2.4.0.ja > r:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > ~[spark-assembly-1.5.0-hadoop2.4.0.ja > r:1.5.0] > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at scala.Option.orElse(Option.scala:257) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.MappedDStream.com
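The "Couldn't find leaders for Set(...)" failure above is raised by the Kafka 0.8 direct DStream when it cannot resolve a leader broker for every topic partition, which is exactly what happens when a broker's advertised address changes underneath a running job. For context, the sketch below shows roughly how such a consumer is wired up in PySpark 1.x/2.x with the old direct API; the broker addresses are placeholders, the topic name is taken from the report, and the spark-streaming-kafka-0-8 package must be on the classpath (this API was removed in Spark 3.0).
{code:python}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Kafka 0.8 direct API (Spark 1.3-2.4)

sc = SparkContext(appName="kafka-direct-sketch")
ssc = StreamingContext(sc, batchDuration=10)

# The broker list must match the addresses the brokers actually advertise.
# If a host's IP changes and neither this list nor Kafka's advertised.host.name
# is updated, leader lookup fails with "Couldn't find leaders for Set(...)".
kafka_params = {"metadata.broker.list": "broker1:9092,broker2:9092"}  # placeholder hosts

stream = KafkaUtils.createDirectStream(ssc, ["hdfs.hbase.acme.attachments"], kafka_params)
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
{code}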
[jira] [Resolved] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-27939. -- Resolution: Not A Problem > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor > > When I try to define a dataframe schema which has a VectorUDT field, I run > into an error when the VectorUDT field is not the last element of the > StructType list. > The following example causes the error below: > {code:java} > // from pyspark.sql import functions as F > from pyspark.sql import types as T > from pyspark.sql import Row > from pyspark.ml.linalg import VectorUDT, SparseVector > #VectorUDT should be the last structfield > train_schema = T.StructType([ > T.StructField('features', VectorUDT()), > T.StructField('SALESCLOSEPRICE', T.IntegerType()) > ]) > > train_df = spark.createDataFrame( > [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, > 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: > 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, > 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, > 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: > -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: > 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: > 1.0, 133: 1.0}), SALESCLOSEPRICE=143000), > Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, > 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: > 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, > 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: > 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, > 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: > -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, > 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: > 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), > SALESCLOSEPRICE=19), > Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: > 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: > 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, > 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, > 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: > 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, > 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), > SALESCLOSEPRICE=225000) > ], schema=train_schema) > > train_df.printSchema() > train_df.show() > {code} > Error message: > {code:java} > // Fail to execute line 17: ], schema=train_schema) Traceback (most recent > call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in > exec(code, _zcUserQueryNameSpace) File "", line 17, in > File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", > line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, > data), 
schema) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > _createFromLocal data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in > toInternal return self.dataType.toInternal(obj) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in > toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 167, > in serialize raise TypeError("cannot serialize %r of type %r" % (obj, > type(obj))) TypeError: cannot serialize 143000 of type {code} > I don't get
[jira] [Comment Edited] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855969#comment-16855969 ] Bryan Cutler edited comment on SPARK-27939 at 6/4/19 6:13 PM: -- Linked to a similar problem with Python {{Row}} class, SPARK-22232 was (Author: bryanc): Another problem with Python {{Row}} class > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor > > When I try to define a dataframe schema which has a VectorUDT field, I run > into an error when the VectorUDT field is not the last element of the > StructType list. > The following example causes the error below: > {code:java} > // from pyspark.sql import functions as F > from pyspark.sql import types as T > from pyspark.sql import Row > from pyspark.ml.linalg import VectorUDT, SparseVector > #VectorUDT should be the last structfield > train_schema = T.StructType([ > T.StructField('features', VectorUDT()), > T.StructField('SALESCLOSEPRICE', T.IntegerType()) > ]) > > train_df = spark.createDataFrame( > [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, > 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: > 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, > 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, > 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: > -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: > 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: > 1.0, 133: 1.0}), SALESCLOSEPRICE=143000), > Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, > 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: > 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, > 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: > 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, > 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: > -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, > 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: > 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), > SALESCLOSEPRICE=19), > Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: > 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: > 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, > 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, > 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: > 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, > 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), > SALESCLOSEPRICE=225000) > ], schema=train_schema) > > train_df.printSchema() > train_df.show() > {code} > Error message: > {code:java} > // Fail to execute line 17: ], schema=train_schema) Traceback (most recent > call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in > exec(code, 
_zcUserQueryNameSpace) File "", line 17, in > File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", > line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, > data), schema) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > _createFromLocal data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in > toInternal return self.dataType.toInternal(obj) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in > toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File > "/opt/spark/python/lib/
[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855969#comment-16855969 ] Bryan Cutler commented on SPARK-27939: -- Another problem with Python {{Row}} class > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor > > When I try to define a dataframe schema which has a VectorUDT field, I run > into an error when the VectorUDT field is not the last element of the > StructType list. > The following example causes the error below: > {code:java} > // from pyspark.sql import functions as F > from pyspark.sql import types as T > from pyspark.sql import Row > from pyspark.ml.linalg import VectorUDT, SparseVector > #VectorUDT should be the last structfield > train_schema = T.StructType([ > T.StructField('features', VectorUDT()), > T.StructField('SALESCLOSEPRICE', T.IntegerType()) > ]) > > train_df = spark.createDataFrame( > [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, > 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: > 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, > 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, > 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: > -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: > 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: > 1.0, 133: 1.0}), SALESCLOSEPRICE=143000), > Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, > 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: > 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, > 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: > 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, > 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: > -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, > 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: > 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), > SALESCLOSEPRICE=19), > Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: > 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: > 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, > 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, > 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: > 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, > 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), > SALESCLOSEPRICE=225000) > ], schema=train_schema) > > train_df.printSchema() > train_df.show() > {code} > Error message: > {code:java} > // Fail to execute line 17: ], schema=train_schema) Traceback (most recent > call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in > exec(code, _zcUserQueryNameSpace) File "", line 17, in > File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", > line 748, in 
createDataFrame rdd, schema = self._createFromLocal(map(prepare, > data), schema) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > _createFromLocal data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in > toInternal return self.dataType.toInternal(obj) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in > toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 167, > in serialize raise TypeError("cannot serialize %r of type %r" % (obj, > type(obj
[jira] [Comment Edited] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855966#comment-16855966 ] Bryan Cutler edited comment on SPARK-27939 at 6/4/19 6:11 PM: -- The problem is the {{Row}} class sorts the field names alphabetically, which puts capital letters first and then conflicts with your schema: {noformat} r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), SALESCLOSEPRICE=143000) In [3]: r.__fields__ Out[3]: ['SALESCLOSEPRICE', 'features']{noformat} This is by design, but it is not intuitive and has caused lots of problems. Hopefully, we can improve this for Spark 3.0.0 You can either just specify your data as tuples. for example {noformat} In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 143000)], schema=train_schema) In [6]: train_df.show() ++---+ | features|SALESCLOSEPRICE| ++---+ |(135,[0],[139900.0])| 143000| ++---+ {noformat} Or if you want to have keywords, then define your own row class like this: {noformat} In [7]: MyRow = Row('features', 'SALESCLOSEPRICE') In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000) Out[8]: Row(features=SparseVector(135, {0: 139900.0}), SALESCLOSEPRICE=143000){noformat} was (Author: bryanc): The problem is the {{Row}} class sorts the field names alphabetically, which puts capital letters first and then conflicts with your schema: {noformat} r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), SALESCLOSEPRICE=143000) In [3]: r.__fields__ Out[3]: ['SALESCLOSEPRICE', 'features']{noformat} This is by design, but it is not intuitive and has caused lots of problems. You can either just specify your data as tuples. for example {noformat} In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 143000)], schema=train_schema) In [6]: train_df.show() ++---+ | features|SALESCLOSEPRICE| ++---+ |(135,[0],[139900.0])| 143000| ++---+ {noformat} Or if you want to have keywords, then define your own row class like this: {noformat} In [7]: MyRow = Row('features', 'SALESCLOSEPRICE') In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000) Out[8]: Row(features=SparseVector(135, {0: 139900.0}), SALESCLOSEPRICE=143000){noformat} > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor > > When I try to define a dataframe schema which has a VectorUDT field, I run > into an error when the VectorUDT field is not the last element of the > StructType list. 
> The following example causes the error below: > {code:java} > // from pyspark.sql import functions as F > from pyspark.sql import types as T > from pyspark.sql import Row > from pyspark.ml.linalg import VectorUDT, SparseVector > #VectorUDT should be the last structfield > train_schema = T.StructType([ > T.StructField('features', VectorUDT()), > T.StructField('SALESCLOSEPRICE', T.IntegerType()) > ]) > > train_df = spark.createDataFrame( > [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, > 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: > 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, > 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, > 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: > -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: > 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: > 1.0, 133: 1.0}), SALESCLOSEPRICE=143000), > Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, > 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: > 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, > 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: > 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, > 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: > -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, > 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: > 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 1
[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855966#comment-16855966 ] Bryan Cutler commented on SPARK-27939: -- The problem is the {{Row}} class sorts the field names alphabetically, which puts capital letters first and then conflicts with your schema: {noformat} r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), SALESCLOSEPRICE=143000) In [3]: r.__fields__ Out[3]: ['SALESCLOSEPRICE', 'features']{noformat} This is by design, but it is not intuitive and has caused lots of problems. You can either just specify your data as tuples. for example {noformat} In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 143000)], schema=train_schema) In [6]: train_df.show() ++---+ | features|SALESCLOSEPRICE| ++---+ |(135,[0],[139900.0])| 143000| ++---+ {noformat} Or if you want to have keywords, then define your own row class like this: {noformat} In [7]: MyRow = Row('features', 'SALESCLOSEPRICE') In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000) Out[8]: Row(features=SparseVector(135, {0: 139900.0}), SALESCLOSEPRICE=143000){noformat} > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor > > When I try to define a dataframe schema which has a VectorUDT field, I run > into an error when the VectorUDT field is not the last element of the > StructType list. > The following example causes the error below: > {code:java} > // from pyspark.sql import functions as F > from pyspark.sql import types as T > from pyspark.sql import Row > from pyspark.ml.linalg import VectorUDT, SparseVector > #VectorUDT should be the last structfield > train_schema = T.StructType([ > T.StructField('features', VectorUDT()), > T.StructField('SALESCLOSEPRICE', T.IntegerType()) > ]) > > train_df = spark.createDataFrame( > [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, > 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: > 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, > 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, > 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: > -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: > 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: > 1.0, 133: 1.0}), SALESCLOSEPRICE=143000), > Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, > 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: > 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, > 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: > 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, > 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: > -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, > 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: > 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), > SALESCLOSEPRICE=19), > Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: > 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 
1949.0, 10: 0.822, 11: > 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, > 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, > 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: > 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, > 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), > SALESCLOSEPRICE=225000) > ], schema=train_schema) > > train_df.printSchema() > train_df.show() > {code} > Error message: > {code:java} > // Fail to execute line 17: ], schema=train_schema) Traceback (most recent > call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in > exec(code, _zcUserQueryNameSpace) File "", line 17, in > File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", > line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, > data), schema)
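The alphabetical-sorting behaviour described in the comments above is easy to reproduce on Spark 2.4 with a much smaller schema than the reporter's 135-feature vectors. The sketch below assumes a local SparkSession and uses a three-element SparseVector purely for illustration; both workarounds from the comment are shown.
{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql import types as T
from pyspark.ml.linalg import SparseVector, VectorUDT

spark = SparkSession.builder.master("local[*]").getOrCreate()

schema = T.StructType([
    T.StructField("features", VectorUDT()),
    T.StructField("SALESCLOSEPRICE", T.IntegerType()),
])

# A Row built from keyword arguments sorts its field names alphabetically
# (capitals sort before lowercase), so it no longer lines up with the schema.
r = Row(features=SparseVector(3, {0: 1.0}), SALESCLOSEPRICE=143000)
print(r.__fields__)  # ['SALESCLOSEPRICE', 'features']

# Workaround 1: plain tuples in schema order.
spark.createDataFrame([(SparseVector(3, {0: 1.0}), 143000)], schema=schema).show()

# Workaround 2: a Row class with an explicit field order.
MyRow = Row("features", "SALESCLOSEPRICE")
spark.createDataFrame([MyRow(SparseVector(3, {0: 1.0}), 143000)], schema=schema).show()
{code}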
[jira] [Assigned] (SPARK-27945) Make minimal changes to support columnar processing
[ https://issues.apache.org/jira/browse/SPARK-27945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27945: Assignee: Apache Spark > Make minimal changes to support columnar processing > --- > > Key: SPARK-27945 > URL: https://issues.apache.org/jira/browse/SPARK-27945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Assignee: Apache Spark >Priority: Major > > As the first step for SPARK-27396 this is to put in the minimum changes > needed to allow a plugin to support columnar processing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27945) Make minimal changes to support columnar processing
[ https://issues.apache.org/jira/browse/SPARK-27945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27945: Assignee: (was: Apache Spark) > Make minimal changes to support columnar processing > --- > > Key: SPARK-27945 > URL: https://issues.apache.org/jira/browse/SPARK-27945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > As the first step for SPARK-27396 this is to put in the minimum changes > needed to allow a plugin to support columnar processing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855948#comment-16855948 ] John Bauer commented on SPARK-17025: [~Hadar] [~yug95] I wrote a minimal example of a PySpark estimator/model pair which can be saved and loaded at [ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing values from a normal distribution using parameters estimated from the data. Let me know if it helps you, > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Assignee: Ajay Saini >Priority: Minor > Fix For: 2.3.0 > > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
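Since Spark 2.3 (the Fix Version on this issue), a pure-Python transformer can take part in pipeline persistence without a Java counterpart by mixing in {{DefaultParamsWritable}} and {{DefaultParamsReadable}} from {{pyspark.ml.util}}. The sketch below is a minimal example under that approach; the {{AddOne}} transformer and the /tmp path are made up for illustration and are not from the issue itself.
{code:python}
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class AddOne(Transformer, HasInputCol, HasOutputCol,
             DefaultParamsReadable, DefaultParamsWritable):
    """Writes inputCol + 1 into outputCol; exists only to demonstrate persistence."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(AddOne, self).__init__()
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(),
                                  F.col(self.getInputCol()) + 1)


# The transformer (or a Pipeline containing it) can now be saved and reloaded:
#   AddOne(inputCol="x", outputCol="y").save("/tmp/add_one")
#   restored = AddOne.load("/tmp/add_one")
{code}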
[jira] [Updated] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27805: - Affects Version/s: (was: 3.1.0) 2.4.3 > toPandas does not propagate SparkExceptions with arrow enabled > -- > > Key: SPARK-27805 > URL: https://issues.apache.org/jira/browse/SPARK-27805 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.3 >Reporter: David Vogelbacher >Assignee: David Vogelbacher >Priority: Major > Fix For: 3.0.0 > > > When calling {{toPandas}} with arrow enabled errors encountered during the > collect are not propagated to the python process. > There is only a very general {{EofError}} raised. > Example of behavior with arrow enabled vs. arrow disabled: > {noformat} > import traceback > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > def raise_exception(): > raise Exception("My error") > error_udf = udf(raise_exception, IntegerType()) > df = spark.range(3).toDF("i").withColumn("x", error_udf()) > try: > df.toPandas() > except: > no_arrow_exception = traceback.format_exc() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > try: > df.toPandas() > except: > arrow_exception = traceback.format_exc() > print no_arrow_exception > print arrow_exception > {noformat} > {{arrow_exception}} gives as output: > {noformat} > >>> print arrow_exception > Traceback (most recent call last): > File "", line 2, in > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2143, in toPandas > batches = self._collectAsArrow() > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2205, in _collectAsArrow > results = list(_load_from_socket(sock_info, ArrowCollectSerializer())) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 210, in load_stream > num = read_int(stream) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 810, in read_int > raise EOFError > EOFError > {noformat} > {{no_arrow_exception}} gives as output: > {noformat} > Traceback (most recent call last): > File "", line 2, in > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2166, in toPandas > pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 516, in collect > sock_info = self._jdf.collectToPython() > File > "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1286, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, > in deco > return f(*a, **kw) > File > "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o38.collectToPython. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 > in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 428, in main > process() > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 423, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 438, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 141, in dump_stream > for obj in iterator: > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 427, in _batched > for item in iterator: > File "", line 1, in > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 86, in > return lambda *a: f(*a) > File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 2, in raise_exception > Exception: My error > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-27805. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24677 [https://github.com/apache/spark/pull/24677] > toPandas does not propagate SparkExceptions with arrow enabled > -- > > Key: SPARK-27805 > URL: https://issues.apache.org/jira/browse/SPARK-27805 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: David Vogelbacher >Assignee: David Vogelbacher >Priority: Major > Fix For: 3.0.0 > > > When calling {{toPandas}} with arrow enabled errors encountered during the > collect are not propagated to the python process. > There is only a very general {{EofError}} raised. > Example of behavior with arrow enabled vs. arrow disabled: > {noformat} > import traceback > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > def raise_exception(): > raise Exception("My error") > error_udf = udf(raise_exception, IntegerType()) > df = spark.range(3).toDF("i").withColumn("x", error_udf()) > try: > df.toPandas() > except: > no_arrow_exception = traceback.format_exc() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > try: > df.toPandas() > except: > arrow_exception = traceback.format_exc() > print no_arrow_exception > print arrow_exception > {noformat} > {{arrow_exception}} gives as output: > {noformat} > >>> print arrow_exception > Traceback (most recent call last): > File "", line 2, in > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2143, in toPandas > batches = self._collectAsArrow() > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2205, in _collectAsArrow > results = list(_load_from_socket(sock_info, ArrowCollectSerializer())) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 210, in load_stream > num = read_int(stream) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 810, in read_int > raise EOFError > EOFError > {noformat} > {{no_arrow_exception}} gives as output: > {noformat} > Traceback (most recent call last): > File "", line 2, in > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2166, in toPandas > pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 516, in collect > sock_info = self._jdf.collectToPython() > File > "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1286, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, > in deco > return f(*a, **kw) > File > "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o38.collectToPython. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 > in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 428, in main > process() > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 423, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 438, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 141, in dump_stream > for obj in iterator: > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 427, in _batched > for item in iterator: > File "", line 1, in > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 86, in > return lambda *a: f(*a) > File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 2, in raise_exception > Exception: My error > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.o
[jira] [Assigned] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-27805: Assignee: David Vogelbacher > toPandas does not propagate SparkExceptions with arrow enabled > -- > > Key: SPARK-27805 > URL: https://issues.apache.org/jira/browse/SPARK-27805 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: David Vogelbacher >Assignee: David Vogelbacher >Priority: Major > > When calling {{toPandas}} with arrow enabled errors encountered during the > collect are not propagated to the python process. > There is only a very general {{EofError}} raised. > Example of behavior with arrow enabled vs. arrow disabled: > {noformat} > import traceback > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > def raise_exception(): > raise Exception("My error") > error_udf = udf(raise_exception, IntegerType()) > df = spark.range(3).toDF("i").withColumn("x", error_udf()) > try: > df.toPandas() > except: > no_arrow_exception = traceback.format_exc() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > try: > df.toPandas() > except: > arrow_exception = traceback.format_exc() > print no_arrow_exception > print arrow_exception > {noformat} > {{arrow_exception}} gives as output: > {noformat} > >>> print arrow_exception > Traceback (most recent call last): > File "", line 2, in > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2143, in toPandas > batches = self._collectAsArrow() > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2205, in _collectAsArrow > results = list(_load_from_socket(sock_info, ArrowCollectSerializer())) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 210, in load_stream > num = read_int(stream) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 810, in read_int > raise EOFError > EOFError > {noformat} > {{no_arrow_exception}} gives as output: > {noformat} > Traceback (most recent call last): > File "", line 2, in > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2166, in toPandas > pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 516, in collect > sock_info = self._jdf.collectToPython() > File > "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1286, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, > in deco > return f(*a, **kw) > File > "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o38.collectToPython. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 > in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 428, in main > process() > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 423, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 438, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 141, in dump_stream > for obj in iterator: > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 427, in _batched > for item in iterator: > File "", line 1, in > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 86, in > return lambda *a: f(*a) > File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 2, in raise_exception > Exception: My error > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
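For readers trying the repro above, the gist is a one-line config toggle around {{toPandas()}}. A condensed sketch follows, assuming pandas and pyarrow are installed; note that {{spark.sql.execution.arrow.enabled}} is the 2.4-era name used in the report, and on Spark 3.0+ the equivalent flag is expected to be {{spark.sql.execution.arrow.pyspark.enabled}}.
{code:python}
import traceback

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[*]").getOrCreate()

@udf(returnType=IntegerType())
def failing_udf():
    raise Exception("My error")

df = spark.range(3).withColumn("x", failing_udf())

# 2.4-era flag name from the report; Spark 3.0 renames it to
# spark.sql.execution.arrow.pyspark.enabled.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

try:
    df.toPandas()
except Exception:
    # Before the fix: a bare EOFError from the serializer.
    # After the fix (3.0.0): the underlying SparkException with "My error".
    traceback.print_exc()
{code}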
[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users
[ https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855896#comment-16855896 ] Hyukjin Kwon commented on SPARK-27888: -- Hm, PySpark, as far as I can tell, should work out of the box in Python 2 and 3 identically except for Python-version-specific differences; if it doesn't, it's an issue in PySpark or a difference in Python itself. It's good to note in the user guide (so I did at SPARK-27942), but I doubt it should be noted in the migration guide. We deprecate Python 2 but have not removed it yet, and the migration guide hasn't been updated for Python deprecation (or other language deprecations, as far as I can tell). > Python 2->3 migration guide for PySpark users > - > > Key: SPARK-27888 > URL: https://issues.apache.org/jira/browse/SPARK-27888 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > We might need a short Python 2->3 migration guide for PySpark users. It > doesn't need to be comprehensive given many Python 2->3 migration guides > around. We just need some pointers and list items that are specific to > PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
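One concrete example of a version-specific difference that does surface through PySpark: integer division changed meaning between Python 2 and 3, so a UDF that divides can stop matching its declared return type after a migration. A small sketch, assuming a local SparkSession; the column names are made up.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Under Python 2, 7 / 2 == 3 (an int); under Python 3 it is 3.5 (a float).
# With IntegerType declared, the Python 3 float no longer matches the schema,
# so the value typically comes back as null instead of 3.
half = udf(lambda n: n / 2, IntegerType())

df = spark.range(1).selectExpr("7 AS n")
df.withColumn("half", half("n")).show()

# Portable across both versions: make floor division explicit.
half_floor = udf(lambda n: n // 2, IntegerType())
df.withColumn("half", half_floor("n")).show()
{code}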
[jira] [Updated] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated SPARK-27396: Epic Name: Public APIs for extended Columnar Processing Support > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Epic > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *SPIP: Columnar Processing Without Arrow Formatting Guarantees.* > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > # Add to the current sql extensions mechanism so advanced users can have > access to the physical SparkPlan and manipulate it to provide columnar > processing for existing operators, including shuffle. This will allow them > to implement their own cost based optimizers to decide when processing should > be columnar and when it should not. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the users so operations that are not columnar see the > data as rows, and operations that are columnar see the data as columns. > > Not Requirements, but things that would be nice to have. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > > *Q2.* What problem is this proposal NOT designed to solve? > The goal of this is not for ML/AI but to provide APIs for accelerated > computing in Spark primarily targeting SQL/ETL like workloads. ML/AI already > have several mechanisms to get data into/out of them. These can be improved > but will be covered in a separate SPIP. > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation. > This does not cover exposing the underlying format of the data. The only way > to get at the data in a ColumnVector is through the public APIs. Exposing > the underlying format to improve efficiency will be covered in a separate > SPIP. > This is not trying to implement new ways of transferring data to external > ML/AI applications. That is covered by separate SPIPs already. > This is not trying to add in generic code generation for columnar processing. > Currently code generation for columnar processing is only supported when > translating columns to rows. We will continue to support this, but will not > extend it as a general solution. That will be covered in a separate SPIP if > we find it is helpful. For now columnar processing will be interpreted. > This is not trying to expose a way to get columnar data into Spark through > DataSource V2 or any other similar API. That would be covered by a separate > SPIP if we find it is needed. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Internal implementations of FileFormats, optionally can return a > ColumnarBatch instead of rows. 
The code generation phase knows how to take > that columnar data and iterate through it as rows for stages that wants rows, > which currently is almost everything. The limitations here are mostly > implementation specific. The current standard is to abuse Scala’s type > erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The > code generation can handle this because it is generating java code, so it > bypasses scala’s type checking and just casts the InternalRow to the desired > ColumnarBatch. This makes it difficult for others to implement the same > functionality for different processing because they can only do it through > code generation. There really is no clean separate path in the code > generation for columnar vs row based. Additionally, because it is only > supported through code generation if for any reason code generation would > fail there is no backup. This is typically fine for input formats but can be > problematic when we get into more extensive processing. > # When caching data it can optionally be cached in a columnar format if the > input is also columnar. This is similar to the first area and has the same > limitations because the ca
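For readers less familiar with the existing columnar paths that the SPIP's Q3 refers to, one place Spark already moves data in a columnar, Arrow-backed form is the Pandas UDF path in PySpark, where whole column batches cross the JVM/Python boundary rather than one row at a time. The snippet below is a small illustration of that existing behavior, not part of the SPIP itself, and assumes Spark 2.3+ with pyarrow installed.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import LongType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The function receives a whole pandas.Series (a column batch shipped via Arrow),
# not individual rows, and returns a Series of the same length.
@pandas_udf(LongType(), PandasUDFType.SCALAR)
def plus_one(col):
    return col + 1

spark.range(8).select(plus_one("id").alias("id_plus_one")).show()
{code}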
[jira] [Created] (SPARK-27945) Make minimal changes to support columnar processing
Robert Joseph Evans created SPARK-27945: --- Summary: Make minimal changes to support columnar processing Key: SPARK-27945 URL: https://issues.apache.org/jira/browse/SPARK-27945 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Robert Joseph Evans As the first step for SPARK-27396 this is to put in the minimum changes needed to allow a plugin to support columnar processing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated SPARK-27396: Issue Type: Epic (was: Improvement) > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Epic > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *SPIP: Columnar Processing Without Arrow Formatting Guarantees.* > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > # Add to the current sql extensions mechanism so advanced users can have > access to the physical SparkPlan and manipulate it to provide columnar > processing for existing operators, including shuffle. This will allow them > to implement their own cost based optimizers to decide when processing should > be columnar and when it should not. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the users so operations that are not columnar see the > data as rows, and operations that are columnar see the data as columns. > > Not Requirements, but things that would be nice to have. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > > *Q2.* What problem is this proposal NOT designed to solve? > The goal of this is not for ML/AI but to provide APIs for accelerated > computing in Spark primarily targeting SQL/ETL like workloads. ML/AI already > have several mechanisms to get data into/out of them. These can be improved > but will be covered in a separate SPIP. > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation. > This does not cover exposing the underlying format of the data. The only way > to get at the data in a ColumnVector is through the public APIs. Exposing > the underlying format to improve efficiency will be covered in a separate > SPIP. > This is not trying to implement new ways of transferring data to external > ML/AI applications. That is covered by separate SPIPs already. > This is not trying to add in generic code generation for columnar processing. > Currently code generation for columnar processing is only supported when > translating columns to rows. We will continue to support this, but will not > extend it as a general solution. That will be covered in a separate SPIP if > we find it is helpful. For now columnar processing will be interpreted. > This is not trying to expose a way to get columnar data into Spark through > DataSource V2 or any other similar API. That would be covered by a separate > SPIP if we find it is needed. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Internal implementations of FileFormats, optionally can return a > ColumnarBatch instead of rows. 
The code generation phase knows how to take > that columnar data and iterate through it as rows for stages that wants rows, > which currently is almost everything. The limitations here are mostly > implementation specific. The current standard is to abuse Scala’s type > erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The > code generation can handle this because it is generating java code, so it > bypasses scala’s type checking and just casts the InternalRow to the desired > ColumnarBatch. This makes it difficult for others to implement the same > functionality for different processing because they can only do it through > code generation. There really is no clean separate path in the code > generation for columnar vs row based. Additionally, because it is only > supported through code generation if for any reason code generation would > fail there is no backup. This is typically fine for input formats but can be > problematic when we get into more extensive processing. > # When caching data it can optionally be cached in a columnar format if the > input is also columnar. This is similar to the first area and has the same > limitations because the cache acts as an input, but i
[jira] [Updated] (SPARK-27933) Extracting common purge "behaviour" to the parent StreamExecution
[ https://issues.apache.org/jira/browse/SPARK-27933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-27933: Summary: Extracting common purge "behaviour" to the parent StreamExecution (was: Introduce StreamExecution.purge for removing entries from metadata logs) > Extracting common purge "behaviour" to the parent StreamExecution > - > > Key: SPARK-27933 > URL: https://issues.apache.org/jira/browse/SPARK-27933 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Jacek Laskowski >Priority: Minor > > Extracting the common {{purge}} "behaviour" to the parent {{StreamExecution}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users
[ https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855838#comment-16855838 ] Xiangrui Meng commented on SPARK-27888: --- It would be nice if we could find some PySpark users who have already migrated from Python 2 to 3 and have them tell us what issues users should expect and what benefits the migration brings. > Python 2->3 migration guide for PySpark users > - > > Key: SPARK-27888 > URL: https://issues.apache.org/jira/browse/SPARK-27888 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > We might need a short Python 2->3 migration guide for PySpark users. It > doesn't need to be comprehensive given many Python 2->3 migration guides > around. We just need some pointers and list items that are specific to > PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27884: - Assignee: Xiangrui Meng > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: release-notes > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users
[ https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855832#comment-16855832 ] Xiangrui Meng commented on SPARK-27888: --- This JIRA is not to inform users that Python 2 is deprecated. It provides info to help users migrate to Python 3, though I don't know if there are things specific to PySpark. I think it should stay in the user guide instead of a release note. > Python 2->3 migration guide for PySpark users > - > > Key: SPARK-27888 > URL: https://issues.apache.org/jira/browse/SPARK-27888 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > We might need a short Python 2->3 migration guide for PySpark users. It > doesn't need to be comprehensive given many Python 2->3 migration guides > around. We just need some pointers and list items that are specific to > PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
[ https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27886. --- Resolution: Done > Add Apache Spark project to https://python3statement.org/ > - > > Key: SPARK-27886 > URL: https://issues.apache.org/jira/browse/SPARK-27886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Add Spark to https://python3statement.org/ and indicate our timeline. I > reviewed the statement at https://python3statement.org/. Most projects listed > there will *drop* Python 2 before 2020/01/01 instead of deprecating it with > only [one > exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. > We certainly cannot drop Python 2 support in 2019 given we haven't > deprecated it yet. > Maybe we can add the following time line: > * 2019/10 - 2020/04: Python 2 & 3 > * 2020/04 - : Python 3 only > The switching is the next release after Spark 3.0. If we want to hold another > release, it would be Sept or Oct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27942) Note that Python 2.7 is deprecated in Spark documentation
[ https://issues.apache.org/jira/browse/SPARK-27942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27942: -- Component/s: Documentation > Note that Python 2.7 is deprecated in Spark documentation > - > > Key: SPARK-27942 > URL: https://issues.apache.org/jira/browse/SPARK-27942 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27942) Note that Python 2.7 is deprecated in Spark documentation
[ https://issues.apache.org/jira/browse/SPARK-27942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27942. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24789 > Note that Python 2.7 is deprecated in Spark documentation > - > > Key: SPARK-27942 > URL: https://issues.apache.org/jira/browse/SPARK-27942 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27772. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24747 [https://github.com/apache/spark/pull/24747] > SQLTestUtils Refactoring > > > Key: SPARK-27772 > URL: https://issues.apache.org/jira/browse/SPARK-27772 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: William Wong >Assignee: William Wong >Priority: Minor > Fix For: 3.0.0 > > > The current `SQLTestUtils` created many `withXXX` utility functions to clean > up tables/views/caches created for testing purposes. Some of those `withXXX` > functions ignore certain exceptions, like `NoSuchTableException`, in the clean > up block (i.e., the finally block). > > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > try f finally { > // If the test failed part way, we don't want to mask the failure by > failing to remove > // temp views that never got created. > try viewNames.foreach(spark.catalog.dropTempView) catch { > case _: NoSuchTableException => > } > } > } > {code} > I believe this is not the best approach, because it is hard to anticipate which > exceptions should or should not be ignored. > > Java's `try-with-resources` statement does not mask an exception thrown in the > try block with any exception caught in the 'close()' statement. An exception > caught in the 'close()' statement is added as a suppressed exception > instead. That sounds like a better approach. > > Therefore, I propose to standardise those 'withXXX' functions with the following > `tryWithFinally` function, which does something similar to Java's > try-with-resources statement. > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView)) > } > /** > * Executes the given tryBlock and then the given finallyBlock no matter > whether tryBlock throws > * an exception. If both tryBlock and finallyBlock throw exceptions, the > exception thrown > * from the finallyBlock will be added to the exception thrown from tryBlock > as a > * suppressed exception. It helps to avoid masking the exception from tryBlock > with the exception > * from finallyBlock. > */ > private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = { > var fromTryBlock: Throwable = null > try tryBlock catch { > case cause: Throwable => > fromTryBlock = cause > throw cause > } finally { > if (fromTryBlock != null) { > try finallyBlock catch { > case fromFinallyBlock: Throwable => > fromTryBlock.addSuppressed(fromFinallyBlock) > throw fromTryBlock > } > } else { > finallyBlock > } > } > } > {code} > If a feature is well written, we should not hit any exceptions in those closing > methods in test cases. The purpose of this proposal is to help developers > identify what actually breaks their tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
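For illustration only, assuming the `tryWithFinally` helper quoted above is in scope, a failing test body plus a failing cleanup surfaces as a single exception carrying the cleanup failure as a suppressed exception rather than having the cleanup mask the real failure:
{code:java}
try {
  tryWithFinally {
    throw new RuntimeException("test body failed")
  } {
    throw new IllegalStateException("cleanup failed")
  }
} catch {
  case e: Throwable =>
    // The original test failure propagates...
    assert(e.getMessage == "test body failed")
    // ...and the cleanup failure is attached instead of masking it.
    assert(e.getSuppressed.head.getMessage == "cleanup failed")
}
{code}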
[jira] [Assigned] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27772: - Assignee: William Wong > SQLTestUtils Refactoring > > > Key: SPARK-27772 > URL: https://issues.apache.org/jira/browse/SPARK-27772 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: William Wong >Assignee: William Wong >Priority: Minor > > The current `SQLTestUtils` created many `withXXX` utility functions to clean > up tables/views/caches created for testing purposes. Some of those `withXXX` > functions ignore certain exceptions, like `NoSuchTableException`, in the clean > up block (i.e., the finally block). > > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > try f finally { > // If the test failed part way, we don't want to mask the failure by > failing to remove > // temp views that never got created. > try viewNames.foreach(spark.catalog.dropTempView) catch { > case _: NoSuchTableException => > } > } > } > {code} > I believe this is not the best approach, because it is hard to anticipate which > exceptions should or should not be ignored. > > Java's `try-with-resources` statement does not mask an exception thrown in the > try block with any exception caught in the 'close()' statement. An exception > caught in the 'close()' statement is added as a suppressed exception > instead. That sounds like a better approach. > > Therefore, I propose to standardise those 'withXXX' functions with the following > `tryWithFinally` function, which does something similar to Java's > try-with-resources statement. > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView)) > } > /** > * Executes the given tryBlock and then the given finallyBlock no matter > whether tryBlock throws > * an exception. If both tryBlock and finallyBlock throw exceptions, the > exception thrown > * from the finallyBlock will be added to the exception thrown from tryBlock > as a > * suppressed exception. It helps to avoid masking the exception from tryBlock > with the exception > * from finallyBlock. > */ > private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = { > var fromTryBlock: Throwable = null > try tryBlock catch { > case cause: Throwable => > fromTryBlock = cause > throw cause > } finally { > if (fromTryBlock != null) { > try finallyBlock catch { > case fromFinallyBlock: Throwable => > fromTryBlock.addSuppressed(fromFinallyBlock) > throw fromTryBlock > } > } else { > finallyBlock > } > } > } > {code} > If a feature is well written, we should not hit any exceptions in those closing > methods in test cases. The purpose of this proposal is to help developers > identify what actually breaks their tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27567) Spark Streaming consumers (from Kafka) intermittently die with 'SparkException: Couldn't find leaders for Set'
[ https://issues.apache.org/jira/browse/SPARK-27567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Goldenberg reopened SPARK-27567: --- Issue not resolved, it appears intermittently. We have a QA box with Kafka installed and our Spark Streaming job pulling from it. Kafka has replication factor set to 1 since it's a cluster of 1 node. Just saw the error there. > Spark Streaming consumers (from Kafka) intermittently die with > 'SparkException: Couldn't find leaders for Set' > -- > > Key: SPARK-27567 > URL: https://issues.apache.org/jira/browse/SPARK-27567 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.5.0 > Environment: GCP / 170~14.04.1-Ubuntu >Reporter: Dmitry Goldenberg >Priority: Major > > Some of our consumers intermittently die with the stack traces I'm including. > Once restarted they run for a while then die again. > I can't find any cohesive documentation on what this error means and how to > go about troubleshooting it. Any help would be appreciated. > *Kafka version* is 0.8.2.1 (2.10-0.8.2.1). > Some of the errors seen look like this: > {noformat} > ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 2 on > 10.150.0.54: remote Rpc client disassociated{noformat} > Main error stack trace: > {noformat} > 2019-04-23 20:36:54,323 ERROR > org.apache.spark.streaming.scheduler.JobScheduler: Error g > enerating jobs for time 1556066214000 ms > org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: > Couldn't find leaders for Set([hdfs.hbase.acme.attachments,49], > [hdfs.hbase.acme.attachmen > ts,63], [hdfs.hbase.acme.attachments,31], [hdfs.hbase.acme.attachments,9], > [hdfs.hbase.acme.attachments,25], [hdfs.hbase.acme.attachments,55], > [hdfs.hbase.acme.attachme > nts,5], [hdfs.hbase.acme.attachments,37], [hdfs.hbase.acme.attachments,7], > [hdfs.hbase.acme.attachments,47], [hdfs.hbase.acme.attachments,13], > [hdfs.hbase.acme.attachme > nts,43], [hdfs.hbase.acme.attachments,19], [hdfs.hbase.acme.attachments,15], > [hdfs.hbase.acme.attachments,23], [hdfs.hbase.acme.attachments,53], > [hdfs.hbase.acme.attach > ments,1], [hdfs.hbase.acme.attachments,27], [hdfs.hbase.acme.attachments,57], > [hdfs.hbase.acme.attachments,39], [hdfs.hbase.acme.attachments,11], > [hdfs.hbase.acme.attac > hments,29], [hdfs.hbase.acme.attachments,33], > [hdfs.hbase.acme.attachments,35], [hdfs.hbase.acme.attachments,51], > [hdfs.hbase.acme.attachments,45], [hdfs.hbase.acme.att > achments,21], [hdfs.hbase.acme.attachments,3], > [hdfs.hbase.acme.attachments,59], [hdfs.hbase.acme.attachments,41], > [hdfs.hbase.acme.attachments,17], [hdfs.hbase.acme.at > tachments,61])) > at > org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.j > ar:?] > at > org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > ~[spark-assembly-1.5.0-hadoop2.4.0.ja > r:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > ~[spark-assembly-1.5.0-hadoop2.4.0.ja > r:1.5.0] > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] 
> at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at scala.Option.orElse(Option.scala:257) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonf
[jira] [Updated] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18569: -- Affects Version/s: 2.4.3 Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-15540) > Support R formula arithmetic > - > > Key: SPARK-18569 > URL: https://issues.apache.org/jira/browse/SPARK-18569 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.4.3 >Reporter: Felix Cheung >Priority: Major > > I think we should support arithmetic, which makes it a lot more convenient to > build models. Something like > {code} > log(y) ~ a + log(x) > {code} > And to avoid resolution confusion we should support the I() operator: > {code} > I > I(X∗Z) as is: include a new variable consisting of these variables multiplied > {code} > Such that this works: > {code} > y ~ a + I(b+c) > {code} > The term b+c is to be interpreted as the sum of b and c. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
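For context, a small sketch of what Spark ML's RFormula handles today versus the arithmetic requested here; the DataFrame and column names are made up for illustration:
{code:java}
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (1.0, 2.0, 3.0),
  (0.0, 4.0, 5.0)
)).toDF("y", "b", "c")

// Plain terms (and interactions such as "b:c") parse today.
new RFormula().setFormula("y ~ b + c").fit(df).transform(df).show()

// What this ticket asks for is arithmetic inside the formula, e.g.
// "log(y) ~ b + log(c)" or "y ~ b + I(b + c)", which is not parsed today.
{code}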
[jira] [Resolved] (SPARK-18570) Consider supporting other R formula operators
[ https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18570. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24764 [https://github.com/apache/spark/pull/24764] > Consider supporting other R formula operators > - > > Key: SPARK-18570 > URL: https://issues.apache.org/jira/browse/SPARK-18570 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung >Assignee: Ozan Cicekci >Priority: Minor > Fix For: 3.0.0 > > > Such as > {code} > ∗ > X∗Y include these variables and the interactions between them > ^ > (X + Z + W)^3 include these variables and all interactions up to three way > | > X | Z conditioning: include x given z > {code} > Other includes, %in%, ` (backtick) > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18570) Consider supporting other R formula operators
[ https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-18570: - Assignee: Ozan Cicekci > Consider supporting other R formula operators > - > > Key: SPARK-18570 > URL: https://issues.apache.org/jira/browse/SPARK-18570 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung >Assignee: Ozan Cicekci >Priority: Minor > > Such as > {code} > ∗ > X∗Y include these variables and the interactions between them > ^ > (X + Z + W)^3 include these variables and all interactions up to three way > | > X | Z conditioning: include x given z > {code} > Other includes, %in%, ` (backtick) > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27944) Unify the behavior of checking empty output column names
[ https://issues.apache.org/jira/browse/SPARK-27944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27944: Assignee: Apache Spark > Unify the behavior of checking empty output column names > > > Key: SPARK-27944 > URL: https://issues.apache.org/jira/browse/SPARK-27944 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > In some algs (LDA, DT, GBTC, SVC, LR, MLP, NB), the transform method will > check whether an output column name is empty, and if all names are empty, it > will log a warning msg and do nothing. > > It maybe better to make other algs (regression, ovr, clustering, als) also > check the output columns before transformation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27944) Unify the behavior of checking empty output column names
[ https://issues.apache.org/jira/browse/SPARK-27944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27944: Assignee: (was: Apache Spark) > Unify the behavior of checking empty output column names > > > Key: SPARK-27944 > URL: https://issues.apache.org/jira/browse/SPARK-27944 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > In some algs (LDA, DT, GBTC, SVC, LR, MLP, NB), the transform method will > check whether an output column name is empty, and if all names are empty, it > will log a warning msg and do nothing. > > It maybe better to make other algs (regression, ovr, clustering, als) also > check the output columns before transformation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27900) Spark on K8s will not report container failure due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855598#comment-16855598 ] Stavros Kontopoulos commented on SPARK-27900: - Setting Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler) in SparkSubmit in client mode caused the handler to be invoked at the driver side yet: 19/06/04 11:01:42 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[dag-scheduler-event-loop,5,main] java.lang.OutOfMemoryError: Java heap space at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:85) at org.apache.spark.scheduler.TaskSetManager.addPendingTask(TaskSetManager.scala:264) at org.apache.spark.scheduler.TaskSetManager.$anonfun$addPendingTasks$2(TaskSetManager.scala:194) at org.apache.spark.scheduler.TaskSetManager$$Lambda$1109/206130956.apply$mcVI$sp(Unknown Source) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at org.apache.spark.scheduler.TaskSetManager.$anonfun$addPendingTasks$1(TaskSetManager.scala:193) at org.apache.spark.scheduler.TaskSetManager$$Lambda$1108/329172165.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:534) at org.apache.spark.scheduler.TaskSetManager.addPendingTasks(TaskSetManager.scala:192) at org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:189) at org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:252) at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:210) at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1233) at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1084) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2126) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2118) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2107) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) 19/06/04 11:01:42 INFO SparkContext: Invoking stop() from shutdown hook 19/06/04 11:01:42 INFO SparkUI: Stopped Spark web UI at http://spark-pi2-1559645994185-driver-svc.spark.svc:4040 19/06/04 11:01:42 INFO BlockManagerInfo: Removed broadcast_0_piece0 on spark-pi2-1559645994185-driver-svc.spark.svc:7079 in memory (size: 1765.0 B, free: 110.0 MiB) the main thread though is stuck at: "main" #1 prio=5 os_prio=0 tid=0x5653d3a5e800 nid=0x1d waiting on condition [0x7f31b7ca7000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0xf0fd01a0> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187) at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:242) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:736) [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L736] |ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf) | This should be configurable imho. > Spark on K8s will not report container failure due to an oom error > -- > > Key: SPARK-27900 > URL: https://issues.apache.org/jira/browse/SPARK-27900 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Priority: Major > > A spark pi job is running: > spark-pi-driver 1/1 Running 0 1h > spark-pi2-1559309337787-exec-1 1/1 Running 0 1h > spark-pi2-1559309337787-exec-2 1/1 Running 0 1h > with the following setup: > {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" > kind:
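A minimal sketch of the kind of handler being discussed in the comment above, assuming the goal is simply to make the JVM die on an uncaught OutOfMemoryError so the container status reflects the failure; the exit code 52 is meant to mirror Spark's OOM exit code but is an assumption here, and `-XX:+ExitOnOutOfMemoryError` achieves much the same at the JVM level:
{code:java}
object ExitOnOomHandler {
  def install(): Unit = {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = e match {
        case _: OutOfMemoryError =>
          System.err.println(s"Uncaught OutOfMemoryError in thread ${t.getName}; halting JVM")
          // halt() rather than exit() so shutdown hooks cannot hang the dying JVM.
          Runtime.getRuntime.halt(52)
        case other =>
          System.err.println(s"Uncaught exception in thread ${t.getName}: $other")
      }
    })
  }
}
{code}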
[jira] [Created] (SPARK-27944) Unify the behavior of checking empty output column names
zhengruifeng created SPARK-27944: Summary: Unify the behavior of checking empty output column names Key: SPARK-27944 URL: https://issues.apache.org/jira/browse/SPARK-27944 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng In some algs (LDA, DT, GBTC, SVC, LR, MLP, NB), the transform method will check whether an output column name is empty, and if all names are empty, it will log a warning msg and do nothing. It may be better to make other algs (regression, ovr, clustering, als) also check the output columns before transformation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
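A hedged sketch of the guard described above (not the actual MLlib code; the object and method names are illustrative): warn and skip the transformation when every configured output column name is empty.
{code:java}
import org.apache.spark.sql.{DataFrame, Dataset}

object OutputColumnCheck {
  def transformWithCheck(dataset: Dataset[_], outputCols: Seq[String])
                        (doTransform: Dataset[_] => DataFrame): DataFrame = {
    if (outputCols.forall(_.isEmpty)) {
      // Mirrors the behaviour described above: warn and return the input untouched.
      System.err.println("WARN: all output column names are empty; nothing to transform")
      dataset.toDF()
    } else {
      doTransform(dataset)
    }
  }
}
{code}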
[jira] [Assigned] (SPARK-27943) Add default constraint when create hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27943: Assignee: Apache Spark > Add default constraint when create hive table > - > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Default constraint with column is ANSI standard. > Hive 3.0+ has supported default constraint > ref:https://issues.apache.org/jira/browse/HIVE-18726 > But Spark SQL implement this feature not yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27943) Add default constraint when create hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27943: Assignee: (was: Apache Spark) > Add default constraint when create hive table > - > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Major > > Default constraint with column is ANSI standard. > Hive 3.0+ has supported default constraint > ref:https://issues.apache.org/jira/browse/HIVE-18726 > But Spark SQL implement this feature not yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27895) Spark streaming - RDD filter is always refreshing providing updated filtered items
[ https://issues.apache.org/jira/browse/SPARK-27895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855547#comment-16855547 ] Ilias Karalis commented on SPARK-27895: --- The results of RDD filter change all the time until next batch comes in. I am not sure if it has to do with the filter function itself. But this function and filter method work fine , if instead of filtering the RDD, we filter a collection. inputRdd.filter... : results are not stable and change all the time until next batch comes in. inputRdd.collect().filter... : computes results just once and results are stable until next batch comes in. > Spark streaming - RDD filter is always refreshing providing updated filtered > items > -- > > Key: SPARK-27895 > URL: https://issues.apache.org/jira/browse/SPARK-27895 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.2, 2.4.3 > Environment: Intellij, running local in windows10 laptop. > >Reporter: Ilias Karalis >Priority: Major > > Spark streaming: 2.4.x > Scala: 2.11.11 > > > foreachRDD of DStream, > in case filter is used on RDD then filter is always refreshing, providing new > results continuously until new batch is processed. For the new batch, the > same occurs. > With the same code, if we do rdd.collect() and then run the filter on the > collection, we get just one time results, which remains stable until new > batch is coming in. > Filter function is based on random probability (reservoir sampling). > > {color:#80}val {color}toSampleRDD: RDD[(Long, Long)] = > inputRdd.filter(x=> chooseX(x) ) > > {color:#80}def {color}chooseX (x:(Long, Long)) : Boolean = { > {color:#80}val {color}r = scala.util.Random > {color:#80}val {color}p = r.nextFloat() > edgeTotalCounter += {color:#ff}1{color} {color:#80}if {color}(p < > (sampleLength.toFloat / edgeTotalCounter.toFloat)) { > edgeLocalRDDCounter += {color:#ff}1{color} println({color:#008000}"Edge > " {color}+x + {color:#008000}" has been selected and is number : "{color}+ > edgeLocalRDDCounter +{color:#008000}"."{color}) > {color:#80}true{color} } > {color:#80}else{color} false > \{color}} > > edgeLocalRDDCounter counts selected edges from inputRDD. > Strange is that the counter is increased 1st time from 1 to y, then filter > continues to run unexpectedly again and the counter is increased again > starting from y+1 to z. After that each time filter unexpectedly continues to > run, it provides results for which the counter starts from y+1. Each time > filter runs provides different results and filters different number of edges. > toSampleRDD always changes accordingly to new provided results. > When new batch is coming in then it starts the same behavior for the new > batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
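A spark-shell style sketch of why the behaviour reported above is expected with a nondeterministic, stateful predicate: RDD.filter is lazy and re-evaluated on every action, so each action re-runs the random predicate and the counters move again, while materializing one evaluation (collect, or cache plus a forcing action) pins the result. Only the shape follows the report; the data and predicate here are made up.
{code:java}
import scala.util.Random
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val inputRdd = spark.sparkContext.parallelize((1L to 1000L).map(i => (i, i)))

var counter = 0L  // driver-side mutable state, as in the report; only "works" in local mode
def chooseX(x: (Long, Long)): Boolean = { counter += 1; Random.nextFloat() < 0.1f }

val toSampleRDD = inputRdd.filter(x => chooseX(x))

// filter is lazy: each action recomputes it, re-running the random, stateful
// predicate, so the sample (and the counter) differ every time.
println(toSampleRDD.count())
println(toSampleRDD.count())

// Materializing one evaluation pins the result until the next batch replaces inputRdd.
val pinnedSample = toSampleRDD.collect()
println(pinnedSample.length)
{code}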
[jira] [Updated] (SPARK-27895) Spark streaming - RDD filter is always refreshing providing updated filtered items
[ https://issues.apache.org/jira/browse/SPARK-27895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilias Karalis updated SPARK-27895: -- Description: Spark streaming: 2.4.x Scala: 2.11.11 foreachRDD of DStream, in case filter is used on RDD then filter is always refreshing, providing new results continuously until new batch is processed. For the new batch, the same occurs. With the same code, if we do rdd.collect() and then run the filter on the collection, we get results just one time, and they remain stable until a new batch comes in. The filter function is based on random probability (reservoir sampling).
{code:java}
val toSampleRDD: RDD[(Long, Long)] = inputRdd.filter(x => chooseX(x))

def chooseX(x: (Long, Long)): Boolean = {
  val r = scala.util.Random
  val p = r.nextFloat()
  edgeTotalCounter += 1
  if (p < (sampleLength.toFloat / edgeTotalCounter.toFloat)) {
    edgeLocalRDDCounter += 1
    println("Edge " + x + " has been selected and is number : " + edgeLocalRDDCounter + ".")
    true
  } else false
}
{code}
edgeLocalRDDCounter counts selected edges from inputRDD. Strange is that the counter is increased the 1st time from 1 to y, then filter continues to run unexpectedly again and the counter is increased again starting from y+1 to z. After that, each time filter unexpectedly continues to run, it provides results for which the counter starts from y+1. Each time filter runs it provides different results and filters a different number of edges. toSampleRDD always changes according to the newly provided results. When a new batch comes in, the same behavior starts over for the new batch. was: Spark streaming: 2.4.x Scala: 2.11.11 foreachRDD of DStream, in case filter is used on RDD then filter is always refreshing, providing new results continuously until new batch is processed. For the new batch, the same occurs. With the same code, if we do rdd.collect() and then run the filter on the collection, we get results just one time, and they remain stable until a new batch comes in. The filter function is based on random probability (reservoir sampling).
{code:java}
val toSampleRDD: RDD[(Long, Long)] = inputRdd.filter(x => chooseX(x))

def chooseX(x: (Long, Long)): Boolean = {
  val r = scala.util.Random
  val p = r.nextFloat()
  edgeTotalCounter += 1
  if (p < (sampleLength.toFloat / edgeTotalCounter.toFloat)) {
    edgeLocalRDDCounter += 1
    println("Edge " + x + " has been selected and is number : " + edgeLocalRDDCounter + ".")
    true
  } else false
}
{code}
edgeLocalRDDCounter counts selected edges from inputRDD. Strange is that the counter is increased the 1st time from 1 to y, then filter continues to run unexpectedly again and the counter is increased again starting from y+1 to z. After that, each time filter unexpectedly continues to run, it provides results for which the counter starts from y+1. Each time filter runs it provides different results and filters a different number of edges. toSampleRDD always changes according to the newly provided results. When a new batch comes in, the same behavior starts over for the new batch. 
> Spark streaming - RDD filter is always refreshing providing updated filtered > items > -- > > Key: SPARK-27895 > URL: https://issues.apache.org/jira/browse/SPARK-27895 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.2, 2.4.3 > Environment: Intellij, running local in windows10 laptop. > >Reporter: Ilias Karalis >Priority: Major > > Spark streaming: 2.4.x > Scala: 2.11.11 > > > foreachRDD of DStream, > in case filter is used on RDD then filter is always refreshing, providing new > results continuously until new batch is processed. For the new batch, the > same occurs. > With the same code, if we do rdd.collect() and then run the filter on the > collection, we get just one time results, which remains stable until new > batch is coming in. > Filter function is based on random probability (reservoir sampling). > > {color:#80}val {color}toSampleRDD: RDD[(Long, Long)] = > inputRdd.filter(x=> chooseX(x) ) > > {color:#80}def {color}chooseX (x:(Long, Long)) : Boolean = { > {color:#80}val {col
[jira] [Created] (SPARK-27943) Add default constraint when create hive table
jiaan.geng created SPARK-27943: -- Summary: Add default constraint when create hive table Key: SPARK-27943 URL: https://issues.apache.org/jira/browse/SPARK-27943 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng Default constraints on columns are part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
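For context, roughly the Hive 3 DDL this ticket would have Spark SQL accept (syntax per HIVE-18726; the table and column names are made up, and today such a statement is rejected by Spark SQL's parser):
{code:java}
val ddl =
  """CREATE TABLE employees (
    |  id     INT,
    |  status STRING DEFAULT 'active'
    |) STORED AS PARQUET""".stripMargin

// spark.sql(ddl)  // not supported by Spark SQL yet; Hive 3.0+ accepts this DDL
{code}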
[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud
[ https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuheng Dai updated SPARK-27941: Description: Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/] This opens up the possibility to run Spark workloads in a serverless manner and remove the need to provision, maintain and manage a cluster. POC: [https://github.com/mu5358271/spark-on-fargate] While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively, it would make sense to make some of the internal Spark components more pluggable and cloud friendly so that it is easier for various cloud providers to integrate. For example, * authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file mounted on all pods. These can be decoupled so it is possible to swap in implementation using public cloud. In the POC, this is implemented by passing around AWS KMS encrypted secret and decrypting the secret at each executor, which delegate authentication and authorization to the cloud. * deployment & scheduler: adding a new cluster manager and scheduler backend requires changing a number of places in the Spark core package and rebuilding the entire project. Having a pluggable scheduler per https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add different scheduler backends backed by different cloud providers. * client-cluster communication: I am not very familiar with the network part of the code base so I might be wrong on this. My understanding is that the code base assumes that the client and the cluster are on the same network and the nodes communicate with each other via hostname/ip. For security best practice, it is advised to run the executors in a private protected network, which may be separate from the client machine's network. Since we are serverless, that means the client need to first launch the driver into the private network, and the driver in turn start the executors, potentially doubling job initialization time. This can be solved by dropping complete serverlessness and having a persistent host in the private network, or (I do not have a POC, so I am not sure if this actually works) by implementing client-cluster communication via message queues in the cloud to get around the network separation. * shuffle storage and retrieval: external shuffle in yarn relies on the existence of a persistent cluster that continues to serve shuffle files beyond the lifecycle of the executors. This assumption no longer holds in a serverless cluster with only transient containers. Pluggable remote shuffle storage per https://issues.apache.org/jira/browse/SPARK-25299 would make it easier to introduce new cloud-backed shuffle. was: Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/] This opens up the possibility to run Spark workloads in a serverless manner and remove the need to provision and maintain a cluster. POC: [https://github.com/mu5358271/spark-on-fargate] While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively. 
It would make sense to make some of the internal Spark components more pluggable and cloud friendly so that it is easier for various cloud providers to integrate. For example, * authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file mounted on all pods. These can be decoupled so it is possible to swap in implementation using public cloud. In the POC, this is implemented by passing around AWS KMS encrypted secret and decrypting the secret at each executor, which delegate authentication and authorization to the cloud. * deployment & scheduler: adding a new cluster manager and scheduler backend requires changing a number of places in the Spark core package and rebuilding the entire project. Having a pluggable scheduler per https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add different scheduler backends backed by different cloud providers. * client-cluster communication: I am not very familiar with the network part of the code base so I might be wrong on this. My understanding is that the code base assumes that the client and the cluster are on the same network and the nodes communicate with each other via hostname/ip. * shuffle storage and retrieval: > Serverless Spark in the Cloud
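As a hedged illustration of the "decrypt the shared secret at each executor" idea mentioned in the authentication bullet above (using the AWS SDK v1 KMS client; the helper name and the choice of Base64 transport are assumptions, not the POC's actual code):
{code:java}
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.util.Base64

import com.amazonaws.services.kms.AWSKMSClientBuilder
import com.amazonaws.services.kms.model.DecryptRequest

object KmsSecretSketch {
  // The KMS-encrypted auth secret is shipped to executors as Base64 text (e.g. via a
  // Spark conf); each executor asks KMS to decrypt it, so authentication and
  // authorization are delegated to the cloud's IAM policies rather than to a
  // cluster manager.
  def decrypt(encryptedBase64: String): String = {
    val kms = AWSKMSClientBuilder.defaultClient()
    val request = new DecryptRequest()
      .withCiphertextBlob(ByteBuffer.wrap(Base64.getDecoder.decode(encryptedBase64)))
    StandardCharsets.UTF_8.decode(kms.decrypt(request).getPlaintext).toString
  }
}
{code}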
[jira] [Comment Edited] (SPARK-27900) Spark on K8s will not report container failure due to an oom error
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854449#comment-16854449 ] Stavros Kontopoulos edited comment on SPARK-27900 at 6/4/19 9:13 AM: - A better approach is to try use -XX:+ExitOnOutOfMemoryError or set the SparkUncaughtExceptionHandler [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L88] in the driver (eg. SparkContext). For the executor part the related Jira is this one: https://issues.apache.org/jira/browse/SPARK-1772 [~sro...@scient.com] do you know why we dont set the handler at the driver side? was (Author: skonto): A better approach is to try use -XX:+ExitOnOutOfMemoryError or set the SparkUncaughtExceptionHandler [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L88] in the driver (eg. SparkContext). [~sro...@scient.com] do you know why we dont set the handler at the driver side? > Spark on K8s will not report container failure due to an oom error > -- > > Key: SPARK-27900 > URL: https://issues.apache.org/jira/browse/SPARK-27900 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Priority: Major > > A spark pi job is running: > spark-pi-driver 1/1 Running 0 1h > spark-pi2-1559309337787-exec-1 1/1 Running 0 1h > spark-pi2-1559309337787-exec-2 1/1 Running 0 1h > with the following setup: > {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" > kind: SparkApplication > metadata: > name: spark-pi > namespace: spark > spec: > type: Scala > mode: cluster > image: "skonto/spark:k8s-3.0.0-sa" > imagePullPolicy: Always > mainClass: org.apache.spark.examples.SparkPi > mainApplicationFile: > "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" > arguments: > - "100" > sparkVersion: "2.4.0" > restartPolicy: > type: Never > nodeSelector: > "spark": "autotune" > driver: > memory: "1g" > labels: > version: 2.4.0 > serviceAccount: spark-sa > executor: > instances: 2 > memory: "1g" > labels: > version: 2.4.0{quote} > At some point the driver fails but it is still running and so the pods are > still running: > 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 > (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 3.0 KiB, free 110.0 MiB) > 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 1765.0 B, free 110.0 MiB) > 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: > 110.0 MiB) > 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:1180 > 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from > ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 > tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, > 14)) > 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 > tasks > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at > scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) > at > scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) > at 
scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) > Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached > $ kubectl describe pod spark-pi2-driver -n spark > Name: spark-pi2-driver > Namespace: spark > Priority: 0 > PriorityClassName: > Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 > Start Time: Fri, 31 May 2019 16:28:59 +0300 > Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 > spark-role=driver > sparkoperator.k8s.io/app-name=spark-pi2 > sparkoperator.k8s.io/launched-by-spark-operator=true > sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 > version=2.4.0 > Annotations: > Status: Running > IP: 10.12.103.4 > Controlled By: SparkApplication/spark-pi2 > Containers: > spark-kubernetes-driver: > Container ID: > docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f > Image: skonto/spark:k8s-3.0.0-sa > Image ID: > docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 > Ports: 7078/TCP, 7079/TCP, 4040/TCP > Host Ports: 0/TCP, 0/TCP, 0/TCP > Args: > driver > --properties-file
[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud
[ https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuheng Dai updated SPARK-27941: Description: Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/] This opens up the possibility to run Spark workloads in a serverless manner and remove the need to provision and maintain a cluster. POC: [https://github.com/mu5358271/spark-on-fargate] While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively. It would make sense to make some of the internal Spark components more pluggable and cloud friendly so that it is easier for various cloud providers to integrate. For example, * authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file mounted on all pods. These can be decoupled so it is possible to swap in implementation using public cloud. In the POC, this is implemented by passing around AWS KMS encrypted secret and decrypting the secret at each executor, which delegate authentication and authorization to the cloud. * deployment & scheduler: adding a new cluster manager and scheduler backend requires changing a number of places in the Spark core package and rebuilding the entire project. Having a pluggable scheduler per https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add different scheduler backends backed by different cloud providers. * client-cluster communication: I am not very familiar with the network part of the code base so I might be wrong on this. My understanding is that the code base assumes that the client and the cluster are on the same network and the nodes communicate with each other via hostname/ip. * shuffle storage and retrieval: was: Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/] This opens up the possibility to run Spark workloads in a serverless manner and remove the need to provision and maintain a cluster. POC: [https://github.com/mu5358271/spark-on-fargate] While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively. It would make sense to make some of the internal Spark components more pluggable and cloud friendly so that it is easier for various cloud providers to integrate. For example, * authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file mounted on all pods. These can be decoupled so it is possible to swap in implementation using public cloud. In the POC, this is implemented by passing around AWS KMS encrypted secret and decrypting the secret at each executor, which delegate authentication and authorization to the cloud. * deployment & scheduler: adding a new cluster manager and scheduler backend requires changing a number of places in the Spark core package, and rebuilding the entire project. 
* driver-executor communication: * shuffle storage and retrieval: > Serverless Spark in the Cloud > - > > Key: SPARK-27941 > URL: https://issues.apache.org/jira/browse/SPARK-27941 > Project: Spark > Issue Type: New Feature > Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Shuheng Dai >Priority: Major > > Public cloud providers have started offering serverless container services. > For example, AWS offers Fargate [https://aws.amazon.com/fargate/] > This opens up the possibility to run Spark workloads in a serverless manner > and remove the need to provision and maintain a cluster. POC: > [https://github.com/mu5358271/spark-on-fargate] > While it might not make sense for Spark to favor any particular cloud > provider or to support a large number of cloud providers natively. It would > make sense to make some of the internal Spark components more pluggable and > cloud friendly so that it is easier for various cloud providers to integrate. > For example, > * authentication: IO and network encryption requires authentication via > securely sharing a secret, and the implementation of this is currently tied > to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file > mounted on all pods. These can be decoupled so it is possible to swap in > implementation using public cloud. In the POC, this is implemented by passing > around AWS KMS encrypted secr
[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud
[ https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuheng Dai updated SPARK-27941: Description: Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/] This opens up the possibility to run Spark workloads in a serverless manner and remove the need to provision and maintain a cluster. POC: [https://github.com/mu5358271/spark-on-fargate] While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively. It would make sense to make some of the internal Spark components more pluggable and cloud friendly so that it is easier for various cloud providers to integrate. For example, * authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file mounted on all pods. These can be decoupled so it is possible to swap in implementation using public cloud. In the POC, this is implemented by passing around AWS KMS encrypted secret and decrypting the secret at each executor, which delegate authentication and authorization to the cloud. * deployment & scheduler: adding a new cluster manager and scheduler backend requires changing a number of places in the Spark core package, and rebuilding the entire project. * driver-executor communication: * shuffle storage and retrieval: was: Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/] This opens up the possibility to run Spark workloads in a serverless manner and remove the need to provision and maintain a cluster. While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively. It would make sense to make some of the internal Spark components more pluggable and cloud friendly so that it is easier for various cloud providers to integrate. For example, * authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file mounted on all pods. These can be decoupled so it is possible to swap in implementation using public cloud. In the POC, this is implemented by passing around AWS KMS encrypted secret and delegate authentication and authorization to the cloud. * deployment & scheduler: adding a new cluster manager requires change a number of places in the Spark core package, and rebuilding the project. * driver-executor communication: * shuffle storage and retrieval: > Serverless Spark in the Cloud > - > > Key: SPARK-27941 > URL: https://issues.apache.org/jira/browse/SPARK-27941 > Project: Spark > Issue Type: New Feature > Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Shuheng Dai >Priority: Major > > Public cloud providers have started offering serverless container services. > For example, AWS offers Fargate [https://aws.amazon.com/fargate/] > This opens up the possibility to run Spark workloads in a serverless manner > and remove the need to provision and maintain a cluster. 
POC: > [https://github.com/mu5358271/spark-on-fargate] > While it might not make sense for Spark to favor any particular cloud > provider or to support a large number of cloud providers natively. It would > make sense to make some of the internal Spark components more pluggable and > cloud friendly so that it is easier for various cloud providers to integrate. > For example, > * authentication: IO and network encryption requires authentication via > securely sharing a secret, and the implementation of this is currently tied > to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file > mounted on all pods. These can be decoupled so it is possible to swap in > implementation using public cloud. In the POC, this is implemented by passing > around AWS KMS encrypted secret and decrypting the secret at each executor, > which delegate authentication and authorization to the cloud. > * deployment & scheduler: adding a new cluster manager and scheduler backend > requires changing a number of places in the Spark core package, and > rebuilding the entire project. > * driver-executor communication: > * shuffle storage and retrieval: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud
[ https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuheng Dai updated SPARK-27941: Description: Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/] This opens up the possibility to run Spark workloads in a serverless manner and remove the need to provision and maintain a cluster. While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively, it would make sense to make some of the internal Spark components more pluggable and cloud-friendly so that it is easier for various cloud providers to integrate. For example, * authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: YARN uses Hadoop UGI, Kubernetes uses a shared file mounted on all pods. These can be decoupled so it is possible to swap in an implementation using the public cloud. In the POC, this is implemented by passing around an AWS KMS-encrypted secret and delegating authentication and authorization to the cloud. * deployment & scheduler: adding a new cluster manager requires changing a number of places in the Spark core package, and rebuilding the project. * driver-executor communication: * shuffle storage and retrieval: was: Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/] This opens up the possibility to run Spark workloads in a serverless manner. Pluggable authentication Pluggable scheduler Pluggable shuffle storage and retrieval Pluggable driver > Serverless Spark in the Cloud > - > > Key: SPARK-27941 > URL: https://issues.apache.org/jira/browse/SPARK-27941 > Project: Spark > Issue Type: New Feature > Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Shuheng Dai >Priority: Major > > Public cloud providers have started offering serverless container services. > For example, AWS offers Fargate [https://aws.amazon.com/fargate/] > This opens up the possibility to run Spark workloads in a serverless manner > and remove the need to provision and maintain a cluster. > While it might not make sense for Spark to favor any particular cloud > provider or to support a large number of cloud providers natively, it would > make sense to make some of the internal Spark components more pluggable and > cloud-friendly so that it is easier for various cloud providers to integrate. > For example, > * authentication: IO and network encryption requires authentication via > securely sharing a secret, and the implementation of this is currently tied > to the cluster manager: YARN uses Hadoop UGI, Kubernetes uses a shared file > mounted on all pods. These can be decoupled so it is possible to swap in > an implementation using the public cloud. In the POC, this is implemented by passing > around an AWS KMS-encrypted secret and delegating authentication and authorization > to the cloud. > * deployment & scheduler: adding a new cluster manager requires changing a > number of places in the Spark core package, and rebuilding the project. > * driver-executor communication: > * shuffle storage and retrieval: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false
[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27873: - Fix Version/s: 2.4.4 > Csv reader, adding a corrupt record column causes error if enforceSchema=false > -- > > Key: SPARK-27873 > URL: https://issues.apache.org/jira/browse/SPARK-27873 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Marcin Mejran >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > In the Spark CSV reader, if you're using permissive mode with a column for > storing corrupt records, then you need to add a new schema column > corresponding to columnNameOfCorruptRecord. > However, if you have a header row and enforceSchema=false, the schema vs. > header validation fails because there is an extra column corresponding to > columnNameOfCorruptRecord. > Since the FAILFAST mode doesn't print informative error messages about which > rows failed to parse, there is no other way to track down broken rows than > setting a corrupt record column. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
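For reference, here is a minimal sketch of the combination being reported (the file path, column names, and schema below are made up for illustration): the schema carries the extra column named by columnNameOfCorruptRecord for permissive mode, so with a header row and enforceSchema=false the header-vs-schema validation compares one more schema column than the header provides.
{code:java}
// Illustrative reproduction sketch; /tmp/people.csv is assumed to contain a
// two-column header ("id,name") plus some malformed rows.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Data columns plus the extra column required to capture corrupt records.
val schema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("_corrupt_record", StringType) // matches columnNameOfCorruptRecord below

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .option("enforceSchema", "false") // header check now sees 3 schema columns vs. 2 header columns
  .schema(schema)
  .csv("/tmp/people.csv")

df.show(truncate = false)
{code}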
[jira] [Updated] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27884: - Labels: release-notes (was: ) > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > Labels: release-notes > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users
[ https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855420#comment-16855420 ] Hyukjin Kwon commented on SPARK-27888: -- [~mengxr], I know it's good to inform users that Python 2 is deprecated so that they can prepare to drop it in their environments too, but how about we put it in a release note rather than in the migration guide? It's deprecated, but it would still work. > Python 2->3 migration guide for PySpark users > - > > Key: SPARK-27888 > URL: https://issues.apache.org/jira/browse/SPARK-27888 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > We might need a short Python 2->3 migration guide for PySpark users. It > doesn't need to be comprehensive given the many Python 2->3 migration guides > already around. We just need some pointers and list items that are specific to > PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org