[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table

2019-06-04 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-27943:
---
Summary: Implement default constraint with Column for Hive table  (was: Add 
default constraint when create hive table)

> Implement default constraint with Column for Hive table
> ---
>
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> A column-level default constraint is part of the ANSI SQL standard.
> Hive 3.0+ supports default constraints; see 
> https://issues.apache.org/jira/browse/HIVE-18726
> Spark SQL does not implement this feature yet.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27943) Add default constraint when create hive table

2019-06-04 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-27943:
---
Description: 
A column-level default constraint is part of the ANSI SQL standard.

Hive 3.0+ supports default constraints; see 
https://issues.apache.org/jira/browse/HIVE-18726

Spark SQL does not implement this feature yet.
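For illustration, a hedged sketch of the column-level DEFAULT syntax that Hive 3.0+ 
accepts (per HIVE-18726) and that this issue asks Spark SQL to support; the table 
and column names below are made up:
{code:sql}
-- Sketch of the Hive 3.0+ column DEFAULT syntax (HIVE-18726); not accepted by
-- Spark SQL's parser today, which is the gap this issue describes.
-- Table and column names are hypothetical.
CREATE TABLE users (
  id     INT,
  status STRING DEFAULT 'active',
  score  INT    DEFAULT 0
);

-- A row that omits the column should then pick up the declared default:
INSERT INTO users (id) VALUES (1);   -- status = 'active', score = 0
{code}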

 

  was:
Default constraint with column is ANSI standard.

Hive 3.0+ has supported default constraint 
ref:https://issues.apache.org/jira/browse/HIVE-18726

But Spark SQL implement this feature not yet.


> Add default constraint when create hive table
> -
>
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> A column-level default constraint is part of the ANSI SQL standard.
> Hive 3.0+ supports default constraints; see 
> https://issues.apache.org/jira/browse/HIVE-18726
> Spark SQL does not implement this feature yet.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode

2019-06-04 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856370#comment-16856370
 ] 

Gengliang Wang commented on SPARK-27798:


[~viirya] I am currently working on other issues.
Thanks for working on it!

> from_avro can modify variables in other rows in local mode
> --
>
> Key: SPARK-27798
> URL: https://issues.apache.org/jira/browse/SPARK-27798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yosuke Mori
>Priority: Blocker
>  Labels: correctness
> Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png
>
>
> Steps to reproduce:
> Create a local Dataset (at least two distinct rows) with a binary Avro field. 
> Use the {{from_avro}} function to deserialize the binary into another column. 
> Verify that all of the rows incorrectly have the same value.
> Here's a concrete example (using Spark 2.4.3). All it does is convert a list 
> of TestPayload objects to Avro binary using the defined schema, then try to 
> deserialize them using {{from_avro}} with that same schema:
> {code:java}
> import org.apache.avro.Schema
> import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, 
> GenericRecordBuilder}
> import org.apache.avro.io.EncoderFactory
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.avro.from_avro
> import org.apache.spark.sql.functions.col
> import java.io.ByteArrayOutputStream
> object TestApp extends App {
>   // Payload container
>   case class TestEvent(payload: Array[Byte])
>   // Deserialized Payload
>   case class TestPayload(message: String)
>   // Schema for Payload
>   val simpleSchema =
> """
>   |{
>   |"type": "record",
>   |"name" : "Payload",
>   |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ]
>   |}
> """.stripMargin
>   // Convert TestPayload into avro binary
>   def generateSimpleSchemaBinary(record: TestPayload, avsc: String): 
> Array[Byte] = {
> val schema = new Schema.Parser().parse(avsc)
> val out = new ByteArrayOutputStream()
> val writer = new GenericDatumWriter[GenericRecord](schema)
> val encoder = EncoderFactory.get().binaryEncoder(out, null)
> val rootRecord = new GenericRecordBuilder(schema).set("message", 
> record.message).build()
> writer.write(rootRecord, encoder)
> encoder.flush()
> out.toByteArray
>   }
>   val spark: SparkSession = 
> SparkSession.builder().master("local[*]").getOrCreate()
>   import spark.implicits._
>   List(
> TestPayload("one"),
> TestPayload("two"),
> TestPayload("three"),
> TestPayload("four")
>   ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, 
> simpleSchema)))
> .toDS()
> .withColumn("deserializedPayload", from_avro(col("payload"), 
> simpleSchema))
> .show(truncate = false)
> }
> {code}
> And here is what this program outputs:
> {noformat}
> +----------------------+-------------------+
> |payload               |deserializedPayload|
> +----------------------+-------------------+
> |[00 06 6F 6E 65]      |[four]             |
> |[00 06 74 77 6F]      |[four]             |
> |[00 0A 74 68 72 65 65]|[four]             |
> |[00 08 66 6F 75 72]   |[four]             |
> +----------------------+-------------------+{noformat}
> Here, we can see that the avro binary is correctly generated, but the 
> deserialized version is a copy of the last row. I have not yet verified that 
> this is an issue in cluster mode as well.
>  
> I dug a bit more into the code, and it seems the reuse of {{result}} in 
> {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. 
> I set a breakpoint in {{LocalRelation}} and the entries of the {{data}} sequence 
> all seem to point to the same object in memory, so a mutation through one 
> reference mutates all of them.
> !Screen Shot 2019-05-21 at 2.39.27 PM.png!
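As a side note on the diagnosis above, a minimal standalone Scala sketch (not Spark 
code) of the aliasing pattern being described: a single mutable object reused across 
rows makes every emitted row point at the same buffer, so the last write wins, while 
a per-row copy keeps the values independent.
{code:java}
// Standalone sketch of the suspected aliasing pattern (not actual Spark code):
// one mutable `result` buffer reused across rows means every row references the
// same object, so the last decoded value ("four") shows up everywhere.
case class MutableRow(var value: String)

object AliasingDemo extends App {
  val result = MutableRow("")                      // single reusable buffer
  val inputs = Seq("one", "two", "three", "four")

  val aliased = inputs.map { v =>
    result.value = v                               // overwrite in place
    result                                         // return the SAME object each time
  }
  println(aliased.map(_.value))                    // List(four, four, four, four)

  val copied = inputs.map { v =>
    result.value = v
    result.copy()                                  // defensive per-row copy
  }
  println(copied.map(_.value))                     // List(one, two, three, four)
}
{code}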



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode

2019-06-04 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856362#comment-16856362
 ] 

Liang-Chi Hsieh commented on SPARK-27798:
-

Is anyone working on this? If not, I will probably send a PR to fix it.

> from_avro can modify variables in other rows in local mode
> --
>
> Key: SPARK-27798
> URL: https://issues.apache.org/jira/browse/SPARK-27798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yosuke Mori
>Priority: Blocker
>  Labels: correctness
> Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png
>
>
> Steps to reproduce:
> Create a local Dataset (at least two distinct rows) with a binary Avro field. 
> Use the {{from_avro}} function to deserialize the binary into another column. 
> Verify that all of the rows incorrectly have the same value.
> Here's a concrete example (using Spark 2.4.3). All it does is convert a list 
> of TestPayload objects to Avro binary using the defined schema, then try to 
> deserialize them using {{from_avro}} with that same schema:
> {code:java}
> import org.apache.avro.Schema
> import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, 
> GenericRecordBuilder}
> import org.apache.avro.io.EncoderFactory
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.avro.from_avro
> import org.apache.spark.sql.functions.col
> import java.io.ByteArrayOutputStream
> object TestApp extends App {
>   // Payload container
>   case class TestEvent(payload: Array[Byte])
>   // Deserialized Payload
>   case class TestPayload(message: String)
>   // Schema for Payload
>   val simpleSchema =
> """
>   |{
>   |"type": "record",
>   |"name" : "Payload",
>   |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ]
>   |}
> """.stripMargin
>   // Convert TestPayload into avro binary
>   def generateSimpleSchemaBinary(record: TestPayload, avsc: String): 
> Array[Byte] = {
> val schema = new Schema.Parser().parse(avsc)
> val out = new ByteArrayOutputStream()
> val writer = new GenericDatumWriter[GenericRecord](schema)
> val encoder = EncoderFactory.get().binaryEncoder(out, null)
> val rootRecord = new GenericRecordBuilder(schema).set("message", 
> record.message).build()
> writer.write(rootRecord, encoder)
> encoder.flush()
> out.toByteArray
>   }
>   val spark: SparkSession = 
> SparkSession.builder().master("local[*]").getOrCreate()
>   import spark.implicits._
>   List(
> TestPayload("one"),
> TestPayload("two"),
> TestPayload("three"),
> TestPayload("four")
>   ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, 
> simpleSchema)))
> .toDS()
> .withColumn("deserializedPayload", from_avro(col("payload"), 
> simpleSchema))
> .show(truncate = false)
> }
> {code}
> And here is what this program outputs:
> {noformat}
> +----------------------+-------------------+
> |payload               |deserializedPayload|
> +----------------------+-------------------+
> |[00 06 6F 6E 65]      |[four]             |
> |[00 06 74 77 6F]      |[four]             |
> |[00 0A 74 68 72 65 65]|[four]             |
> |[00 08 66 6F 75 72]   |[four]             |
> +----------------------+-------------------+{noformat}
> Here, we can see that the avro binary is correctly generated, but the 
> deserialized version is a copy of the last row. I have not yet verified that 
> this is an issue in cluster mode as well.
>  
> I dug a bit more into the code, and it seems the reuse of {{result}} in 
> {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. 
> I set a breakpoint in {{LocalRelation}} and the entries of the {{data}} sequence 
> all seem to point to the same object in memory, so a mutation through one 
> reference mutates all of them.
> !Screen Shot 2019-05-21 at 2.39.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27952) Built-in function: lag

2019-06-04 Thread Zhu, Lipeng (JIRA)
Zhu, Lipeng created SPARK-27952:
---

 Summary: Built-in function: lag
 Key: SPARK-27952
 URL: https://issues.apache.org/jira/browse/SPARK-27952
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Zhu, Lipeng


[https://www.postgresql.org/docs/8.4/functions-window.html]
|{{lag(value any [, offset integer [, default any]])}}|same type as {{value}}|returns {{value}} evaluated at the row that is {{offset}} rows before the current row within the partition; if there is no such row, instead return {{default}}. Both {{offset}} and {{default}} are evaluated with respect to the current row. If omitted, {{offset}} defaults to 1 and {{default}} to null|
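
For reference, a usage sketch of the PostgreSQL-style semantics documented above; the 
table and column names are hypothetical:
{code:sql}
-- Hypothetical table/columns; shows lag() with and without the offset/default arguments.
SELECT ts,
       symbol,
       price,
       lag(price) OVER (PARTITION BY symbol ORDER BY ts)         AS prev_price,    -- offset 1, default null
       lag(price, 2, 0.0) OVER (PARTITION BY symbol ORDER BY ts) AS prev2_or_zero  -- explicit offset/default
FROM trades;
{code}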



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27951) Built-in function: NTH_VALUE

2019-06-04 Thread Zhu, Lipeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu, Lipeng updated SPARK-27951:

Description: 
|{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|

[https://www.postgresql.org/docs/8.4/functions-window.html]
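
For reference, a usage sketch with the PostgreSQL semantics above; the table and 
column names are hypothetical, and the explicit frame is needed so the whole 
partition is visible to {{nth_value}}:
{code:sql}
-- Hypothetical table/columns; second-lowest price per symbol.
SELECT symbol,
       ts,
       price,
       nth_value(price, 2) OVER (
         PARTITION BY symbol
         ORDER BY price
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS second_lowest_price
FROM trades;
{code}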

  was:|{{nth_value({{value}} {{any}}, {{nth}}{{integer}})}}|{{same type as 
}}{{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of 
the window frame (counting from 1); null if no such row|


> Built-in function: NTH_VALUE
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27951) Built-in function: NTH_VALUE

2019-06-04 Thread Zhu, Lipeng (JIRA)
Zhu, Lipeng created SPARK-27951:
---

 Summary: Built-in function: NTH_VALUE
 Key: SPARK-27951
 URL: https://issues.apache.org/jira/browse/SPARK-27951
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Zhu, Lipeng


|{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`

2019-06-04 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27949:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-27764

> Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
> 
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> Currently, the usage of function substr/substring is 
> substring(string_expression, n1 [, n2]). 
> But ANSI SQL defines the pattern for substr/substring as 
> substring(string_expression from n1 [for n2]). This gap makes it inconvenient 
> to switch to Spark SQL.
> Can we support the ANSI pattern substring(string_expression from n1 [for n2])?
>  
> [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27578) Support INTERVAL ... HOUR TO SECOND syntax

2019-06-04 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27578:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-27764

> Support INTERVAL ... HOUR TO SECOND syntax
> --
>
> Key: SPARK-27578
> URL: https://issues.apache.org/jira/browse/SPARK-27578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Currently, Spark SQL supports this interval format:
>  
> {code:java}
> select interval '5 23:59:59.155' day to second{code}
>  
> Can Spark SQL support the grammar below as well? Presto and Teradata already 
> support it.
> {code:java}
> select interval '23:59:59.155' hour to second
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27948) GPU Scheduling - Use ResouceName to represent resource names

2019-06-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27948.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24799

> GPU Scheduling - Use ResouceName to represent resource names
> 
>
> Key: SPARK-27948
> URL: https://issues.apache.org/jira/browse/SPARK-27948
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27950) Additional DynamoDB and CloudWatch config

2019-06-04 Thread Eric S Meisel (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric S Meisel updated SPARK-27950:
--
Priority: Minor  (was: Major)

> Additional DynamoDB and CloudWatch config
> -
>
> Key: SPARK-27950
> URL: https://issues.apache.org/jira/browse/SPARK-27950
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Eric S Meisel
>Priority: Minor
>
> I was researching getting Spark’s Kinesis integration running locally against 
> {{localstack}}. We found this issue, and it creates a complication: 
> [localstack/localstack#677|https://github.com/localstack/localstack/issues/677]
> Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and 
> Cloudwatch in order for the KCL to properly use the {{localstack}} 
> infrastructure. We have successfully done this with the KCL (both 1.x and 
> 2.x), but with Spark’s integration we are unable to configure DynamoDB and 
> Cloudwatch’s endpoints:
> [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException

2019-06-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27909:
--
Fix Version/s: 3.0.0

> Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
> --
>
> Key: SPARK-27909
> URL: https://issues.apache.org/jira/browse/SPARK-27909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> CTE substitution currently works by running all analyzer rules on plans after 
> each substitution. It does this to fix a recursive CTE case, but this design 
> requires the ResolveRelations rule to throw an AnalysisException when it 
> cannot resolve a table or else the CTE substitution will run again and may 
> possibly recurse infinitely.
> Table resolution should be possible across multiple independent rules. To 
> accomplish this, the current ResolveRelations rule detects cases where other 
> rules (like ResolveDataSource) will resolve a TableIdentifier and returns the 
> UnresolvedRelation unmodified only in those cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27950) Additional DynamoDB and CloudWatch config

2019-06-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856275#comment-16856275
 ] 

Apache Spark commented on SPARK-27950:
--

User 'etspaceman' has created a pull request for this issue:
https://github.com/apache/spark/pull/24801

> Additional DynamoDB and CloudWatch config
> -
>
> Key: SPARK-27950
> URL: https://issues.apache.org/jira/browse/SPARK-27950
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Eric S Meisel
>Priority: Major
>
> I was researching getting Spark’s Kinesis integration running locally against 
> {{localstack}}. We found this issue, and it creates a complication: 
> [localstack/localstack#677|https://github.com/localstack/localstack/issues/677]
> Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and 
> Cloudwatch in order for the KCL to properly use the {{localstack}} 
> infrastructure. We have successfully done this with the KCL (both 1.x and 
> 2.x), but with Spark’s integration we are unable to configure DynamoDB and 
> Cloudwatch’s endpoints:
> [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27950) Additional DynamoDB and CloudWatch config

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27950:


Assignee: (was: Apache Spark)

> Additional DynamoDB and CloudWatch config
> -
>
> Key: SPARK-27950
> URL: https://issues.apache.org/jira/browse/SPARK-27950
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Eric S Meisel
>Priority: Major
>
> I was researching getting Spark’s Kinesis integration running locally against 
> {{localstack}}. We found this issue, and it creates a complication: 
> [localstack/localstack#677|https://github.com/localstack/localstack/issues/677]
> Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and 
> Cloudwatch in order for the KCL to properly use the {{localstack}} 
> infrastructure. We have successfully done this with the KCL (both 1.x and 
> 2.x), but with Spark’s integration we are unable to configure DynamoDB and 
> Cloudwatch’s endpoints:
> [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27950) Additional DynamoDB and CloudWatch config

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27950:


Assignee: Apache Spark

> Additional DynamoDB and CloudWatch config
> -
>
> Key: SPARK-27950
> URL: https://issues.apache.org/jira/browse/SPARK-27950
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Eric S Meisel
>Assignee: Apache Spark
>Priority: Major
>
> I was researching getting Spark’s Kinesis integration running locally against 
> {{localstack}}. We found this issue, and it creates a complication: 
> [localstack/localstack#677|https://github.com/localstack/localstack/issues/677]
> Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and 
> Cloudwatch in order for the KCL to properly use the {{localstack}} 
> infrastructure. We have successfully done this with the KCL (both 1.x and 
> 2.x), but with Spark’s integration we are unable to configure DynamoDB and 
> Cloudwatch’s endpoints:
> [https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27950) Additional DynamoDB and CloudWatch config

2019-06-04 Thread Eric S Meisel (JIRA)
Eric S Meisel created SPARK-27950:
-

 Summary: Additional DynamoDB and CloudWatch config
 Key: SPARK-27950
 URL: https://issues.apache.org/jira/browse/SPARK-27950
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Eric S Meisel


I was researching getting Spark’s Kinesis integration running locally against 
{{localstack}}. We found this issue, and it creates a complication: 
[localstack/localstack#677|https://github.com/localstack/localstack/issues/677]

Effectively, we need to be able to redirect calls for Kinesis, DynamoDB and 
Cloudwatch in order for the KCL to properly use the {{localstack}} 
infrastructure. We have successfully done this with the KCL (both 1.x and 2.x), 
but with Spark’s integration we are unable to configure DynamoDB and 
Cloudwatch’s endpoints:

[https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L162]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`

2019-06-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27949:
--
Priority: Minor  (was: Major)

> Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
> 
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> Currently, the usage of function substr/substring is 
> substring(string_expression, n1 [, n2]). 
> But ANSI SQL defines the pattern for substr/substring as 
> substring(string_expression from n1 [for n2]). This gap makes it inconvenient 
> to switch to Spark SQL.
> Can we support the ANSI pattern substring(string_expression from n1 [for n2])?
>  
> [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`

2019-06-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27949:
--
Issue Type: Improvement  (was: Bug)

> Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
> 
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Currently, the usage of function substr/substring is 
> substring(string_expression, n1 [, n2]). 
> But ANSI SQL defines the pattern for substr/substring as 
> substring(string_expression from n1 [for n2]). This gap makes it inconvenient 
> to switch to Spark SQL.
> Can we support the ANSI pattern substring(string_expression from n1 [for n2])?
>  
> [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27949:


Assignee: Apache Spark

> Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
> 
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the usage of function substr/substring is 
> substring(string_expression, n1 [, n2]). 
> But ANSI SQL defines the pattern for substr/substring as 
> substring(string_expression from n1 [for n2]). This gap makes it inconvenient 
> to switch to Spark SQL.
> Can we support the ANSI pattern substring(string_expression from n1 [for n2])?
>  
> [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27949:


Assignee: (was: Apache Spark)

> Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
> 
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Currently, the usage of function substr/substring is 
> substring(string_expression, n1 [, n2]). 
> But ANSI SQL defines the pattern for substr/substring as 
> substring(string_expression from n1 [for n2]). This gap makes it inconvenient 
> to switch to Spark SQL.
> Can we support the ANSI pattern substring(string_expression from n1 [for n2])?
>  
> [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`

2019-06-04 Thread Zhu, Lipeng (JIRA)
Zhu, Lipeng created SPARK-27949:
---

 Summary: Support ANSI SQL grammar `substring(string_expression 
from n1 [for n2])`
 Key: SPARK-27949
 URL: https://issues.apache.org/jira/browse/SPARK-27949
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Zhu, Lipeng


Currently, the usage of function substr/substring is 
substring(string_expression, n1 [, n2]). 

But ANSI SQL defines the pattern for substr/substring as 
substring(string_expression from n1 [for n2]). This gap makes it inconvenient 
to switch to Spark SQL.

Can we support the ANSI pattern substring(string_expression from n1 [for n2])?

 

[http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]
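
For illustration, the two forms side by side; the FROM/FOR form is the one this 
issue asks Spark SQL to accept:
{code:sql}
-- Works in Spark SQL today:
SELECT substring('Spark SQL', 1, 5);          -- 'Spark'

-- ANSI SQL-92 form requested here (not parsed by Spark SQL at the time of writing):
SELECT substring('Spark SQL' FROM 1 FOR 5);   -- 'Spark'
SELECT substring('Spark SQL' FROM 7);         -- 'SQL'
{code}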



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27947:


Assignee: (was: Apache Spark)

> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives this warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27948) GPU Scheduling - Use ResouceName to represent resource names

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27948:


Assignee: Xingbo Jiang  (was: Apache Spark)

> GPU Scheduling - Use ResouceName to represent resource names
> 
>
> Key: SPARK-27948
> URL: https://issues.apache.org/jira/browse/SPARK-27948
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27948) GPU Scheduling - Use ResouceName to represent resource names

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27948:


Assignee: Apache Spark  (was: Xingbo Jiang)

> GPU Scheduling - Use ResouceName to represent resource names
> 
>
> Key: SPARK-27948
> URL: https://issues.apache.org/jira/browse/SPARK-27948
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27947:


Assignee: Apache Spark

> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives this warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27948) GPU Scheduling - Use ResouceName to represent resource names

2019-06-04 Thread Xingbo Jiang (JIRA)
Xingbo Jiang created SPARK-27948:


 Summary: GPU Scheduling - Use ResouceName to represent resource 
names
 Key: SPARK-27948
 URL: https://issues.apache.org/jira/browse/SPARK-27948
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Xingbo Jiang
Assignee: Xingbo Jiang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives this warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
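
A small standalone Scala sketch (independent of Spark) of the erasure behaviour 
described above; it compiles with exactly that warning, and the typed case matches 
even though the map values are Ints:
{code:java}
object ErasureDemo extends App {
  val arg: Any = Map("abc" -> 1)            // really a Map[String, Int]

  arg match {
    // Type arguments are erased at runtime, so this pattern only checks "is a Map"
    // and matches despite the Int values (the compiler emits the warning quoted above).
    case m: Map[String, String] =>
      println(s"matched as Map[String, String]: $m")
      // m.values.head.length               // the ClassCastException only appears here, at use
    case _ =>
      println("no match")
  }
}
{code}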
 

  was:
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives this warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives the warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}

Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
>  TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives the warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}

Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
{code:java}

Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
>  TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives the warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27947:
--

 Summary: ParsedStatement subclass toString may throw 
ClassCastException
 Key: SPARK-27947
 URL: https://issues.apache.org/jira/browse/SPARK-27947
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
{code:java}
 {code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
{code:java}

Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
{code:java}
 {code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
>  TestStatement(Map("abc" -> 1)).toString{code}
> {code:java}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives the warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27366) Spark scheduler internal changes to support GPU scheduling

2019-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-27366.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24374
[https://github.com/apache/spark/pull/24374]

> Spark scheduler internal changes to support GPU scheduling
> --
>
> Key: SPARK-27366
> URL: https://issues.apache.org/jira/browse/SPARK-27366
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.0.0
>
>
> Update Spark job scheduler to support accelerator resource requests submitted 
> at application level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2019-06-04 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855948#comment-16855948
 ] 

John Bauer edited comment on SPARK-17025 at 6/4/19 11:12 PM:
-

[~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a 
PySpark estimator/model pair which can be saved and loaded, available at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal].
It imputes missing values from a normal distribution, using mean and standard 
deviation parameters estimated from the data, so it might be useful for that 
too.
Let me know if it helps you.


was (Author: johnhbauer):
[~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a 
PySpark estimator/model pair which can be saved and loaded at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing 
values from a normal distribution using parameters estimated from the data.  
Let me know if it helps you,

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2019-06-04 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855948#comment-16855948
 ] 

John Bauer edited comment on SPARK-17025 at 6/4/19 11:10 PM:
-

[~Hadar] [~yug95] [~ralucamaria.b...@gmail.com] I wrote a minimal example of a 
PySpark estimator/model pair which can be saved and loaded at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing 
values from a normal distribution using parameters estimated from the data.  
Let me know if it helps you,


was (Author: johnhbauer):
[~Hadar] [~yug95] I wrote a minimal example of a PySpark estimator/model pair 
which can be saved and loaded at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal] which imputes missing 
values from a normal distribution using parameters estimated from the data.  
Let me know if it helps you,

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27724) DataSourceV2: Add RTAS logical operation

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27724:


Assignee: (was: Apache Spark)

> DataSourceV2: Add RTAS logical operation
> 
>
> Key: SPARK-27724
> URL: https://issues.apache.org/jira/browse/SPARK-27724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27724) DataSourceV2: Add RTAS logical operation

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27724:


Assignee: Apache Spark

> DataSourceV2: Add RTAS logical operation
> 
>
> Key: SPARK-27724
> URL: https://issues.apache.org/jira/browse/SPARK-27724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException

2019-06-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27909.
-
Resolution: Fixed
  Assignee: Ryan Blue

> Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
> --
>
> Key: SPARK-27909
> URL: https://issues.apache.org/jira/browse/SPARK-27909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> CTE substitution currently works by running all analyzer rules on plans after 
> each substitution. It does this to fix a recursive CTE case, but this design 
> requires the ResolveRelations rule to throw an AnalysisException when it 
> cannot resolve a table or else the CTE substitution will run again and may 
> possibly recurse infinitely.
> Table resolution should be possible across multiple independent rules. To 
> accomplish this, the current ResolveRelations rule detects cases where other 
> rules (like ResolveDataSource) will resolve a TableIdentifier and returns the 
> UnresolvedRelation unmodified only in those cases.
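For illustration, a minimal sketch (not the actual Spark rule) of an analyzer 
rule that resolves the relations it knows about and returns any other 
UnresolvedRelation unmodified for later rules; `lookup` is a hypothetical 
catalog function, not a Spark API:
{code:scala}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only: resolve a table if this rule's catalog knows it; otherwise
// leave the UnresolvedRelation in place instead of throwing
// AnalysisException, so another rule can still resolve it.
class ResolveKnownRelations(lookup: TableIdentifier => Option[LogicalPlan])
    extends Rule[LogicalPlan] {

  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    case u: UnresolvedRelation =>
      lookup(u.tableIdentifier).getOrElse(u)
  }
}
{code}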



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException

2019-06-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27909:

Issue Type: Improvement  (was: Bug)

> Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
> --
>
> Key: SPARK-27909
> URL: https://issues.apache.org/jira/browse/SPARK-27909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> CTE substitution currently works by running all analyzer rules on plans after 
> each substitution. It does this to fix a recursive CTE case, but this design 
> requires the ResolveRelations rule to throw an AnalysisException when it 
> cannot resolve a table or else the CTE substitution will run again and may 
> possibly recurse infinitely.
> Table resolution should be possible across multiple independent rules. To 
> accomplish this, the current ResolveRelations rule detects cases where other 
> rules (like ResolveDataSource) will resolve a TableIdentifier and returns the 
> UnresolvedRelation unmodified only in those cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException

2019-06-04 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856138#comment-16856138
 ] 

Xiao Li commented on SPARK-27909:
-

We do not support recursive CTEs, so I have changed this to an improvement.

> Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
> --
>
> Key: SPARK-27909
> URL: https://issues.apache.org/jira/browse/SPARK-27909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> CTE substitution currently works by running all analyzer rules on plans after 
> each substitution. It does this to fix a recursive CTE case, but this design 
> requires the ResolveRelations rule to throw an AnalysisException when it 
> cannot resolve a table or else the CTE substitution will run again and may 
> possibly recurse infinitely.
> Table resolution should be possible across multiple independent rules. To 
> accomplish this, the current ResolveRelations rule detects cases where other 
> rules (like ResolveDataSource) will resolve a TableIdentifier and returns the 
> UnresolvedRelation unmodified only in those cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27946) Hive DDL to Spark DDL conversion feature for "show create table"

2019-06-04 Thread Xiao Li (JIRA)
Xiao Li created SPARK-27946:
---

 Summary: Hive DDL to Spark DDL conversion feature for "show create 
table"
 Key: SPARK-27946
 URL: https://issues.apache.org/jira/browse/SPARK-27946
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li


Many users migrate tables created with Hive DDL to Spark/DB. Defining the table 
with Spark DDL brings performance benefits. We need to add a feature to Show 
Create Table that allows you to generate Spark DDL for a table. For example: 
`SHOW CREATE TABLE customers AS SPARK`.
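A sketch of how the proposed statement might be invoked programmatically, 
assuming the syntax lands as described in this ticket (the `AS SPARK` clause 
does not exist in released versions, so on current Spark this statement fails 
with a parse error):
{code:scala}
import org.apache.spark.sql.SparkSession

object ShowCreateTableAsSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("show-create-table-as-spark")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    // Proposed syntax from this ticket: emit Spark DDL for a Hive-defined table.
    spark.sql("SHOW CREATE TABLE customers AS SPARK").show(truncate = false)

    spark.stop()
  }
}
{code}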



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27946) Hive DDL to Spark DDL conversion feature for "show create table"

2019-06-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27946:

Issue Type: Improvement  (was: Bug)

> Hive DDL to Spark DDL conversion feature for "show create table"
> 
>
> Key: SPARK-27946
> URL: https://issues.apache.org/jira/browse/SPARK-27946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Many users migrate tables created with Hive DDL to Spark/DB. Defining the 
> table with Spark DDL brings performance benefits. We need to add a feature to 
> Show Create Table that allows you to generate Spark DDL for a table. For 
> example: `SHOW CREATE TABLE customers AS SPARK`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27946) Hive DDL to Spark DDL conversion USING "show create table"

2019-06-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27946:

Summary: Hive DDL to Spark DDL conversion USING "show create table"  (was: 
Hive DDL to Spark DDL conversion feature for "show create table")

> Hive DDL to Spark DDL conversion USING "show create table"
> --
>
> Key: SPARK-27946
> URL: https://issues.apache.org/jira/browse/SPARK-27946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Many users migrate tables created with Hive DDL to Spark/DB. Defining the 
> table with Spark DDL brings performance benefits. We need to add a feature to 
> Show Create Table that allows you to generate Spark DDL for a table. For 
> example: `SHOW CREATE TABLE customers AS SPARK`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27900) Spark driver will not exit due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27900:

Summary: Spark driver will not exit due to an oom error  (was: Spark will 
not exit due to an oom error)

> Spark driver will not exit due to an oom error
> --
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This affects Spark on K8s at least as pods will run forever and makes 
> impossible for tools like Spark Operator to report back
> job status.
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
>  kind: SparkApplication
>  metadata:
>  name: spark-pi
>  namespace: spark
>  spec:
>  type: Scala
>  mode: cluster
>  image: "skonto/spark:k8s-3.0.0-sa"
>  imagePullPolicy: Always
>  mainClass: org.apache.spark.examples.SparkPi
>  mainApplicationFile: 
> "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>  arguments:
>  - "100"
>  sparkVersion: "2.4.0"
>  restartPolicy:
>  type: Never
>  nodeSelector:
>  "spark": "autotune"
>  driver:
>  memory: "1g"
>  labels:
>  version: 2.4.0
>  serviceAccount: spark-sa
>  executor:
>  instances: 2
>  memory: "1g"
>  labels:
>  version: 2.4.0{quote}
> At some point the driver fails but it is still running and so the pods are 
> still running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 3.0 KiB, free 110.0 MiB)
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 1765.0 B, free 110.0 MiB)
>  19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 
> 110.0 MiB)
>  19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1180
>  19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
> ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
> tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
> 14))
>  19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
> tasks
>  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at 
> scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
>  at 
> scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
>  Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
>  Name: spark-pi2-driver
>  Namespace: spark
>  Priority: 0
>  PriorityClassName: 
>  Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
>  Start Time: Fri, 31 May 2019 16:28:59 +0300
>  Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=spark-pi2
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
>  version=2.4.0
>  Annotations: 
>  Status: Running
>  IP: 10.12.103.4
>  Controlled By: SparkApplication/spark-pi2
>  Containers:
>  spark-kubernetes-driver:
>  Container ID: 
> docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
>  Image: skonto/spark:k8s-3.0.0-sa
>  Image ID: 
> docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
>  Ports: 7078/TCP, 7079/TCP, 4040/TCP
>  Host Ports: 0/TCP, 0/TCP, 0/TCP
>  Args:
>  driver
>  --properties-file
>  /opt/spark/conf/spark.properties
>  --class
>  org.apache.spark.examples.SparkPi
>  spark-internal
>  100
>  State: Running
> In the container processes are in _interruptible sleep_:
> PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
>  15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx500m 
> org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
>  287 0 185 S 2344 0% 3 0% sh
>  294 287 185 R 1536 0% 3 0% top
>  1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.prope
> Liveness

[jira] [Updated] (SPARK-27900) Spark will not exit due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27900:

Description: 
This affects at least Spark on K8s, as pods will run forever.

A spark pi job is running:

spark-pi-driver 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "skonto/spark:k8s-3.0.0-sa"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
  arguments:
    - "100"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  nodeSelector:
    "spark": "autotune"
  driver:
    memory: "1g"
    labels:
      version: 2.4.0
    serviceAccount: spark-sa
  executor:
    instances: 2
    memory: "1g"
    labels:
      version: 2.4.0{quote}
At some point the driver fails, but its JVM keeps running, and so the pods
keep running:

19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 3.0 KiB, free 110.0 MiB)
 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
in memory (estimated size 1765.0 B, free 110.0 MiB)
 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 
MiB)
 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
DAGScheduler.scala:1180
 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks 
are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
tasks
 Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
Java heap space
 at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
 at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
 at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
 Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached

$ kubectl describe pod spark-pi2-driver -n spark
 Name: spark-pi2-driver
 Namespace: spark
 Priority: 0
 PriorityClassName: 
 Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
 Start Time: Fri, 31 May 2019 16:28:59 +0300
 Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
 spark-role=driver
 sparkoperator.k8s.io/app-name=spark-pi2
 sparkoperator.k8s.io/launched-by-spark-operator=true
 sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
 version=2.4.0
 Annotations: 
 Status: Running
 IP: 10.12.103.4
 Controlled By: SparkApplication/spark-pi2
 Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
 Image: skonto/spark:k8s-3.0.0-sa
 Image ID: 
docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 org.apache.spark.examples.SparkPi
 spark-internal
 100
 State: Running

In the container, the processes are in _interruptible sleep_:

PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
/opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit 
--deploy-mode client --conf spar
 287 0 185 S 2344 0% 3 0% sh
 294 287 185 R 1536 0% 3 0% top
 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
/opt/spark/conf/spark.prope

Liveness checks might be a workaround, but the REST APIs may still be working
if threads in the JVM are still running, as in this case (I checked the Spark
UI and it was still up).
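For illustration, a minimal sketch of the kind of safeguard that would make the 
driver JVM exit on such a fatal error, using only standard JVM APIs (this is not 
Spark's actual handler, and the exit codes are arbitrary):
{code:scala}
// Sketch only: a JVM-wide handler that terminates the process when a thread
// dies with a fatal error such as OutOfMemoryError, so an orphaned driver pod
// stops running instead of idling forever.
object ExitOnFatalError {
  def install(): Unit = {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = {
        System.err.println(s"Uncaught exception in thread ${t.getName}: $e")
        e match {
          case _: OutOfMemoryError =>
            // halt() skips shutdown hooks, which may themselves need memory.
            Runtime.getRuntime.halt(52)
          case _ =>
            sys.exit(50)
        }
      }
    })
  }
}

// Usage: call ExitOnFatalError.install() early in the driver's main().
{code}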

 

 

  was:
A spark pi job is running:

spark-pi-driver 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
 kind: SparkApplication
 metadata:
 name: spark-pi
 namespace: spark
 spec:
 type: Scala
 mode: cluster
 image: "skonto/spark:k8s-3.0.0-sa"
 imagePullPolicy: Always
 mainClass: org.apache.spark.examples.SparkPi
 mainApplicationFile: 
"local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
 arguments:
 - "100"
 sparkVersion: "2.4.0"
 restartPolicy:
 type: Ne

[jira] [Updated] (SPARK-27900) Spark will not exit due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27900:

Description: 
This affects at least Spark on K8s, as pods will run forever, which makes it
impossible for tools like the Spark Operator to report back the job status.

A spark pi job is running:

spark-pi-driver 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "skonto/spark:k8s-3.0.0-sa"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
  arguments:
    - "100"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  nodeSelector:
    "spark": "autotune"
  driver:
    memory: "1g"
    labels:
      version: 2.4.0
    serviceAccount: spark-sa
  executor:
    instances: 2
    memory: "1g"
    labels:
      version: 2.4.0{quote}
At some point the driver fails, but its JVM keeps running, and so the pods
keep running:

19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 3.0 KiB, free 110.0 MiB)
 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
in memory (estimated size 1765.0 B, free 110.0 MiB)
 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 
MiB)
 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
DAGScheduler.scala:1180
 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks 
are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
tasks
 Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
Java heap space
 at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
 at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
 at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
 Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached

$ kubectl describe pod spark-pi2-driver -n spark
 Name: spark-pi2-driver
 Namespace: spark
 Priority: 0
 PriorityClassName: 
 Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
 Start Time: Fri, 31 May 2019 16:28:59 +0300
 Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
 spark-role=driver
 sparkoperator.k8s.io/app-name=spark-pi2
 sparkoperator.k8s.io/launched-by-spark-operator=true
 sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
 version=2.4.0
 Annotations: 
 Status: Running
 IP: 10.12.103.4
 Controlled By: SparkApplication/spark-pi2
 Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
 Image: skonto/spark:k8s-3.0.0-sa
 Image ID: 
docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 org.apache.spark.examples.SparkPi
 spark-internal
 100
 State: Running

In the container, the processes are in _interruptible sleep_:

PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
/opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit 
--deploy-mode client --conf spar
 287 0 185 S 2344 0% 3 0% sh
 294 287 185 R 1536 0% 3 0% top
 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
/opt/spark/conf/spark.prope

Liveness checks might be a workaround, but the REST APIs may still be working
if threads in the JVM are still running, as in this case (I checked the Spark
UI and it was still up).

 

 

  was:
This affects Spark on K8s at least as pods will run forever.

A spark pi job is running:

spark-pi-driver 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
 kind: SparkApplication
 metadata:
 name: spark-pi
 namespace: spark
 spec:
 type: Scala
 mode: cluster
 image: "skonto/spark:k8s-3.0.0-sa"
 imagePullPolicy: Always
 mainClass: org.apache.spark.examples.SparkPi
 mainApplicationFile: 
"local

[jira] [Updated] (SPARK-27900) Spark will not exit due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27900:

Summary: Spark will not exit due to an oom error  (was: Spark on K8s will 
not report container failure due to an oom error)

> Spark will not exit due to an oom error
> ---
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
>  kind: SparkApplication
>  metadata:
>  name: spark-pi
>  namespace: spark
>  spec:
>  type: Scala
>  mode: cluster
>  image: "skonto/spark:k8s-3.0.0-sa"
>  imagePullPolicy: Always
>  mainClass: org.apache.spark.examples.SparkPi
>  mainApplicationFile: 
> "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>  arguments:
>  - "100"
>  sparkVersion: "2.4.0"
>  restartPolicy:
>  type: Never
>  nodeSelector:
>  "spark": "autotune"
>  driver:
>  memory: "1g"
>  labels:
>  version: 2.4.0
>  serviceAccount: spark-sa
>  executor:
>  instances: 2
>  memory: "1g"
>  labels:
>  version: 2.4.0{quote}
> At some point the driver fails but it is still running and so the pods are 
> still running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 3.0 KiB, free 110.0 MiB)
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 1765.0 B, free 110.0 MiB)
>  19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 
> 110.0 MiB)
>  19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1180
>  19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
> ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
> tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
> 14))
>  19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
> tasks
>  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at 
> scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
>  at 
> scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
>  Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
>  Name: spark-pi2-driver
>  Namespace: spark
>  Priority: 0
>  PriorityClassName: 
>  Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
>  Start Time: Fri, 31 May 2019 16:28:59 +0300
>  Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=spark-pi2
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
>  version=2.4.0
>  Annotations: 
>  Status: Running
>  IP: 10.12.103.4
>  Controlled By: SparkApplication/spark-pi2
>  Containers:
>  spark-kubernetes-driver:
>  Container ID: 
> docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
>  Image: skonto/spark:k8s-3.0.0-sa
>  Image ID: 
> docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
>  Ports: 7078/TCP, 7079/TCP, 4040/TCP
>  Host Ports: 0/TCP, 0/TCP, 0/TCP
>  Args:
>  driver
>  --properties-file
>  /opt/spark/conf/spark.properties
>  --class
>  org.apache.spark.examples.SparkPi
>  spark-internal
>  100
>  State: Running
> In the container processes are in _interruptible sleep_:
> PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
>  15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx500m 
> org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
>  287 0 185 S 2344 0% 3 0% sh
>  294 287 185 R 1536 0% 3 0% top
>  1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.prope
> Liveness checks might be a workaround but rest apis may be still working if 
> threads in jvm still are running as in this case (I did check the sp

[jira] [Updated] (SPARK-27900) Spark on K8s will not report container failure due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27900:

Component/s: (was: Kubernetes)

> Spark on K8s will not report container failure due to an oom error
> --
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
>  kind: SparkApplication
>  metadata:
>  name: spark-pi
>  namespace: spark
>  spec:
>  type: Scala
>  mode: cluster
>  image: "skonto/spark:k8s-3.0.0-sa"
>  imagePullPolicy: Always
>  mainClass: org.apache.spark.examples.SparkPi
>  mainApplicationFile: 
> "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>  arguments:
>  - "100"
>  sparkVersion: "2.4.0"
>  restartPolicy:
>  type: Never
>  nodeSelector:
>  "spark": "autotune"
>  driver:
>  memory: "1g"
>  labels:
>  version: 2.4.0
>  serviceAccount: spark-sa
>  executor:
>  instances: 2
>  memory: "1g"
>  labels:
>  version: 2.4.0{quote}
> At some point the driver fails but it is still running and so the pods are 
> still running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 3.0 KiB, free 110.0 MiB)
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 1765.0 B, free 110.0 MiB)
>  19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 
> 110.0 MiB)
>  19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1180
>  19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
> ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
> tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
> 14))
>  19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
> tasks
>  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at 
> scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
>  at 
> scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
>  Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
>  Name: spark-pi2-driver
>  Namespace: spark
>  Priority: 0
>  PriorityClassName: 
>  Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
>  Start Time: Fri, 31 May 2019 16:28:59 +0300
>  Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=spark-pi2
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
>  version=2.4.0
>  Annotations: 
>  Status: Running
>  IP: 10.12.103.4
>  Controlled By: SparkApplication/spark-pi2
>  Containers:
>  spark-kubernetes-driver:
>  Container ID: 
> docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
>  Image: skonto/spark:k8s-3.0.0-sa
>  Image ID: 
> docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
>  Ports: 7078/TCP, 7079/TCP, 4040/TCP
>  Host Ports: 0/TCP, 0/TCP, 0/TCP
>  Args:
>  driver
>  --properties-file
>  /opt/spark/conf/spark.properties
>  --class
>  org.apache.spark.examples.SparkPi
>  spark-internal
>  100
>  State: Running
> In the container processes are in _interruptible sleep_:
> PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
>  15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx500m 
> org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
>  287 0 185 S 2344 0% 3 0% sh
>  294 287 185 R 1536 0% 3 0% top
>  1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.prope
> Liveness checks might be a workaround but rest apis may be still working if 
> threads in jvm still are running as in this case (I did check the spark ui 
> and it was there).
>  
>  

[jira] [Updated] (SPARK-27900) Spark on K8s will not report container failure due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27900:

Component/s: Spark Core

> Spark on K8s will not report container failure due to an oom error
> --
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
>  kind: SparkApplication
>  metadata:
>  name: spark-pi
>  namespace: spark
>  spec:
>  type: Scala
>  mode: cluster
>  image: "skonto/spark:k8s-3.0.0-sa"
>  imagePullPolicy: Always
>  mainClass: org.apache.spark.examples.SparkPi
>  mainApplicationFile: 
> "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>  arguments:
>  - "100"
>  sparkVersion: "2.4.0"
>  restartPolicy:
>  type: Never
>  nodeSelector:
>  "spark": "autotune"
>  driver:
>  memory: "1g"
>  labels:
>  version: 2.4.0
>  serviceAccount: spark-sa
>  executor:
>  instances: 2
>  memory: "1g"
>  labels:
>  version: 2.4.0{quote}
> At some point the driver fails but it is still running and so the pods are 
> still running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 3.0 KiB, free 110.0 MiB)
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 1765.0 B, free 110.0 MiB)
>  19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 
> 110.0 MiB)
>  19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1180
>  19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
> ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
> tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
> 14))
>  19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
> tasks
>  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at 
> scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
>  at 
> scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
>  Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
>  Name: spark-pi2-driver
>  Namespace: spark
>  Priority: 0
>  PriorityClassName: 
>  Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
>  Start Time: Fri, 31 May 2019 16:28:59 +0300
>  Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=spark-pi2
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
>  version=2.4.0
>  Annotations: 
>  Status: Running
>  IP: 10.12.103.4
>  Controlled By: SparkApplication/spark-pi2
>  Containers:
>  spark-kubernetes-driver:
>  Container ID: 
> docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
>  Image: skonto/spark:k8s-3.0.0-sa
>  Image ID: 
> docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
>  Ports: 7078/TCP, 7079/TCP, 4040/TCP
>  Host Ports: 0/TCP, 0/TCP, 0/TCP
>  Args:
>  driver
>  --properties-file
>  /opt/spark/conf/spark.properties
>  --class
>  org.apache.spark.examples.SparkPi
>  spark-internal
>  100
>  State: Running
> In the container processes are in _interruptible sleep_:
> PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
>  15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx500m 
> org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
>  287 0 185 S 2344 0% 3 0% sh
>  294 287 185 R 1536 0% 3 0% top
>  1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.prope
> Liveness checks might be a workaround but rest apis may be still working if 
> threads in jvm still are running as in this case (I did check the spark ui 
> and it was there).
>  
> 

[jira] [Assigned] (SPARK-27900) Spark on K8s will not report container failure due to an oom error

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27900:


Assignee: (was: Apache Spark)

> Spark on K8s will not report container failure due to an oom error
> --
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
>  kind: SparkApplication
>  metadata:
>  name: spark-pi
>  namespace: spark
>  spec:
>  type: Scala
>  mode: cluster
>  image: "skonto/spark:k8s-3.0.0-sa"
>  imagePullPolicy: Always
>  mainClass: org.apache.spark.examples.SparkPi
>  mainApplicationFile: 
> "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>  arguments:
>  - "100"
>  sparkVersion: "2.4.0"
>  restartPolicy:
>  type: Never
>  nodeSelector:
>  "spark": "autotune"
>  driver:
>  memory: "1g"
>  labels:
>  version: 2.4.0
>  serviceAccount: spark-sa
>  executor:
>  instances: 2
>  memory: "1g"
>  labels:
>  version: 2.4.0{quote}
> At some point the driver fails but it is still running and so the pods are 
> still running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 3.0 KiB, free 110.0 MiB)
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 1765.0 B, free 110.0 MiB)
>  19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 
> 110.0 MiB)
>  19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1180
>  19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
> ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
> tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
> 14))
>  19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
> tasks
>  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at 
> scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
>  at 
> scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
>  Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
>  Name: spark-pi2-driver
>  Namespace: spark
>  Priority: 0
>  PriorityClassName: 
>  Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
>  Start Time: Fri, 31 May 2019 16:28:59 +0300
>  Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=spark-pi2
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
>  version=2.4.0
>  Annotations: 
>  Status: Running
>  IP: 10.12.103.4
>  Controlled By: SparkApplication/spark-pi2
>  Containers:
>  spark-kubernetes-driver:
>  Container ID: 
> docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
>  Image: skonto/spark:k8s-3.0.0-sa
>  Image ID: 
> docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
>  Ports: 7078/TCP, 7079/TCP, 4040/TCP
>  Host Ports: 0/TCP, 0/TCP, 0/TCP
>  Args:
>  driver
>  --properties-file
>  /opt/spark/conf/spark.properties
>  --class
>  org.apache.spark.examples.SparkPi
>  spark-internal
>  100
>  State: Running
> In the container processes are in _interruptible sleep_:
> PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
>  15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx500m 
> org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
>  287 0 185 S 2344 0% 3 0% sh
>  294 287 185 R 1536 0% 3 0% top
>  1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.prope
> Liveness checks might be a workaround but rest apis may be still working if 
> threads in jvm still are running as in this case (I did check the spark ui 
> and it was there).
>  
>  



--
T

[jira] [Assigned] (SPARK-27900) Spark on K8s will not report container failure due to an oom error

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27900:


Assignee: Apache Spark

> Spark on K8s will not report container failure due to an oom error
> --
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Assignee: Apache Spark
>Priority: Major
>
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
>  kind: SparkApplication
>  metadata:
>  name: spark-pi
>  namespace: spark
>  spec:
>  type: Scala
>  mode: cluster
>  image: "skonto/spark:k8s-3.0.0-sa"
>  imagePullPolicy: Always
>  mainClass: org.apache.spark.examples.SparkPi
>  mainApplicationFile: 
> "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>  arguments:
>  - "100"
>  sparkVersion: "2.4.0"
>  restartPolicy:
>  type: Never
>  nodeSelector:
>  "spark": "autotune"
>  driver:
>  memory: "1g"
>  labels:
>  version: 2.4.0
>  serviceAccount: spark-sa
>  executor:
>  instances: 2
>  memory: "1g"
>  labels:
>  version: 2.4.0{quote}
> At some point the driver fails but it is still running and so the pods are 
> still running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 3.0 KiB, free 110.0 MiB)
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 1765.0 B, free 110.0 MiB)
>  19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 
> 110.0 MiB)
>  19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1180
>  19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
> ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
> tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
> 14))
>  19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
> tasks
>  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at 
> scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
>  at 
> scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
>  Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
>  Name: spark-pi2-driver
>  Namespace: spark
>  Priority: 0
>  PriorityClassName: 
>  Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
>  Start Time: Fri, 31 May 2019 16:28:59 +0300
>  Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=spark-pi2
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
>  version=2.4.0
>  Annotations: 
>  Status: Running
>  IP: 10.12.103.4
>  Controlled By: SparkApplication/spark-pi2
>  Containers:
>  spark-kubernetes-driver:
>  Container ID: 
> docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
>  Image: skonto/spark:k8s-3.0.0-sa
>  Image ID: 
> docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
>  Ports: 7078/TCP, 7079/TCP, 4040/TCP
>  Host Ports: 0/TCP, 0/TCP, 0/TCP
>  Args:
>  driver
>  --properties-file
>  /opt/spark/conf/spark.properties
>  --class
>  org.apache.spark.examples.SparkPi
>  spark-internal
>  100
>  State: Running
> In the container processes are in _interruptible sleep_:
> PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
>  15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx500m 
> org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
>  287 0 185 S 2344 0% 3 0% sh
>  294 287 185 R 1536 0% 3 0% top
>  1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.prope
> Liveness checks might be a workaround but rest apis may be still working if 
> threads in jvm still are running as in this case (I did check the spark ui 
> and it wa

[jira] [Issue Comment Deleted] (SPARK-20202) Remove references to org.spark-project.hive

2019-06-04 Thread t oo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

t oo updated SPARK-20202:
-
Comment: was deleted

(was: gentle ping)

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20202) Remove references to org.spark-project.hive

2019-06-04 Thread t oo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

t oo updated SPARK-20202:
-
Comment: was deleted

(was: bump)

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27567) Spark Streaming consumers (from Kafka) intermittently die with 'SparkException: Couldn't find leaders for Set'

2019-06-04 Thread Dmitry Goldenberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Goldenberg resolved SPARK-27567.
---
Resolution: Not A Bug

In our case, the issue was caused by someone changing the IP address of the box. 
Ideally, Kafka or Spark could provide more telling, more intuitive error messages 
for these types of issues.

> Spark Streaming consumers (from Kafka) intermittently die with 
> 'SparkException: Couldn't find leaders for Set'
> --
>
> Key: SPARK-27567
> URL: https://issues.apache.org/jira/browse/SPARK-27567
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.0
> Environment: GCP / 170~14.04.1-Ubuntu
>Reporter: Dmitry Goldenberg
>Priority: Major
>
> Some of our consumers intermittently die with the stack traces I'm including. 
> Once restarted they run for a while then die again.
> I can't find any cohesive documentation on what this error means and how to 
> go about troubleshooting it. Any help would be appreciated.
> *Kafka version* is 0.8.2.1 (2.10-0.8.2.1).
> Some of the errors seen look like this:
> {noformat}
> ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 2 on 
> 10.150.0.54: remote Rpc client disassociated{noformat}
> Main error stack trace:
> {noformat}
> 2019-04-23 20:36:54,323 ERROR 
> org.apache.spark.streaming.scheduler.JobScheduler: Error g
> enerating jobs for time 1556066214000 ms
> org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: 
> Couldn't find leaders for Set([hdfs.hbase.acme.attachments,49], 
> [hdfs.hbase.acme.attachmen
> ts,63], [hdfs.hbase.acme.attachments,31], [hdfs.hbase.acme.attachments,9], 
> [hdfs.hbase.acme.attachments,25], [hdfs.hbase.acme.attachments,55], 
> [hdfs.hbase.acme.attachme
> nts,5], [hdfs.hbase.acme.attachments,37], [hdfs.hbase.acme.attachments,7], 
> [hdfs.hbase.acme.attachments,47], [hdfs.hbase.acme.attachments,13], 
> [hdfs.hbase.acme.attachme
> nts,43], [hdfs.hbase.acme.attachments,19], [hdfs.hbase.acme.attachments,15], 
> [hdfs.hbase.acme.attachments,23], [hdfs.hbase.acme.attachments,53], 
> [hdfs.hbase.acme.attach
> ments,1], [hdfs.hbase.acme.attachments,27], [hdfs.hbase.acme.attachments,57], 
> [hdfs.hbase.acme.attachments,39], [hdfs.hbase.acme.attachments,11], 
> [hdfs.hbase.acme.attac
> hments,29], [hdfs.hbase.acme.attachments,33], 
> [hdfs.hbase.acme.attachments,35], [hdfs.hbase.acme.attachments,51], 
> [hdfs.hbase.acme.attachments,45], [hdfs.hbase.acme.att
> achments,21], [hdfs.hbase.acme.attachments,3], 
> [hdfs.hbase.acme.attachments,59], [hdfs.hbase.acme.attachments,41], 
> [hdfs.hbase.acme.attachments,17], [hdfs.hbase.acme.at
> tachments,61]))
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
>  ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.j
> ar:?]
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
>  ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.ja
> r:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.ja
> r:1.5.0]
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) 
> ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at scala.Option.orElse(Option.scala:257) 
> ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) 
> ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.appl

[jira] [Commented] (SPARK-27567) Spark Streaming consumers (from Kafka) intermittently die with 'SparkException: Couldn't find leaders for Set'

2019-06-04 Thread Dmitry Goldenberg (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855996#comment-16855996
 ] 

Dmitry Goldenberg commented on SPARK-27567:
---

OK, it appears that the IP address of the box in question got changed, and that 
sent Kafka (and the consumers along with it) haywire. We can probably close the 
ticket, but this may be useful to other folks. The takeaway is: check your 
networking settings for any changes to the IP address or the like. Kafka doesn't 
respond too well to that :)

> Spark Streaming consumers (from Kafka) intermittently die with 
> 'SparkException: Couldn't find leaders for Set'
> --
>
> Key: SPARK-27567
> URL: https://issues.apache.org/jira/browse/SPARK-27567
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.0
> Environment: GCP / 170~14.04.1-Ubuntu
>Reporter: Dmitry Goldenberg
>Priority: Major
>
> Some of our consumers intermittently die with the stack traces I'm including. 
> Once restarted they run for a while then die again.
> I can't find any cohesive documentation on what this error means and how to 
> go about troubleshooting it. Any help would be appreciated.
> *Kafka version* is 0.8.2.1 (2.10-0.8.2.1).
> Some of the errors seen look like this:
> {noformat}
> ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 2 on 
> 10.150.0.54: remote Rpc client disassociated{noformat}
> Main error stack trace:
> {noformat}
> 2019-04-23 20:36:54,323 ERROR 
> org.apache.spark.streaming.scheduler.JobScheduler: Error g
> enerating jobs for time 1556066214000 ms
> org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: 
> Couldn't find leaders for Set([hdfs.hbase.acme.attachments,49], 
> [hdfs.hbase.acme.attachmen
> ts,63], [hdfs.hbase.acme.attachments,31], [hdfs.hbase.acme.attachments,9], 
> [hdfs.hbase.acme.attachments,25], [hdfs.hbase.acme.attachments,55], 
> [hdfs.hbase.acme.attachme
> nts,5], [hdfs.hbase.acme.attachments,37], [hdfs.hbase.acme.attachments,7], 
> [hdfs.hbase.acme.attachments,47], [hdfs.hbase.acme.attachments,13], 
> [hdfs.hbase.acme.attachme
> nts,43], [hdfs.hbase.acme.attachments,19], [hdfs.hbase.acme.attachments,15], 
> [hdfs.hbase.acme.attachments,23], [hdfs.hbase.acme.attachments,53], 
> [hdfs.hbase.acme.attach
> ments,1], [hdfs.hbase.acme.attachments,27], [hdfs.hbase.acme.attachments,57], 
> [hdfs.hbase.acme.attachments,39], [hdfs.hbase.acme.attachments,11], 
> [hdfs.hbase.acme.attac
> hments,29], [hdfs.hbase.acme.attachments,33], 
> [hdfs.hbase.acme.attachments,35], [hdfs.hbase.acme.attachments,51], 
> [hdfs.hbase.acme.attachments,45], [hdfs.hbase.acme.att
> achments,21], [hdfs.hbase.acme.attachments,3], 
> [hdfs.hbase.acme.attachments,59], [hdfs.hbase.acme.attachments,41], 
> [hdfs.hbase.acme.attachments,17], [hdfs.hbase.acme.at
> tachments,61]))
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
>  ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.j
> ar:?]
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
>  ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.ja
> r:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.ja
> r:1.5.0]
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) 
> ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at scala.Option.orElse(Option.scala:257) 
> ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) 
> ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.MappedDStream.com

[jira] [Resolved] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27939.
--
Resolution: Not A Problem

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> SALESCLOSEPRICE=19),
>  Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: 
> 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: 
> 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, 
> 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, 
> 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: 
> 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, 
> 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), 
> SALESCLOSEPRICE=225000)
>  ], schema=train_schema)
>  
> train_df.printSchema()
> train_df.show()
> {code}
> Error  message:
> {code:java}
> // Fail to execute line 17: ], schema=train_schema) Traceback (most recent 
> call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in 
>  exec(code, _zcUserQueryNameSpace) File "", line 17, in 
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", 
> line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, 
> data), schema) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
> _createFromLocal data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
>  data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
> toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in 
> toInternal return self.dataType.toInternal(obj) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in 
> toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 167, 
> in serialize raise TypeError("cannot serialize %r of type %r" % (obj, 
> type(obj))) TypeError: cannot serialize 143000 of type {code}
> I don't get 

[jira] [Comment Edited] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855969#comment-16855969
 ] 

Bryan Cutler edited comment on SPARK-27939 at 6/4/19 6:13 PM:
--

Linked to a similar problem with the Python {{Row}} class: SPARK-22232


was (Author: bryanc):
Another problem with the Python {{Row}} class

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> SALESCLOSEPRICE=19),
>  Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: 
> 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: 
> 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, 
> 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, 
> 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: 
> 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, 
> 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), 
> SALESCLOSEPRICE=225000)
>  ], schema=train_schema)
>  
> train_df.printSchema()
> train_df.show()
> {code}
> Error  message:
> {code:java}
> // Fail to execute line 17: ], schema=train_schema) Traceback (most recent 
> call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in 
>  exec(code, _zcUserQueryNameSpace) File "", line 17, in 
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", 
> line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, 
> data), schema) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
> _createFromLocal data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
>  data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
> toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in 
> toInternal return self.dataType.toInternal(obj) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in 
> toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File 
> "/opt/spark/python/lib/

[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855969#comment-16855969
 ] 

Bryan Cutler commented on SPARK-27939:
--

Another problem with the Python {{Row}} class

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> SALESCLOSEPRICE=19),
>  Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: 
> 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: 
> 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, 
> 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, 
> 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: 
> 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, 
> 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), 
> SALESCLOSEPRICE=225000)
>  ], schema=train_schema)
>  
> train_df.printSchema()
> train_df.show()
> {code}
> Error  message:
> {code:java}
> // Fail to execute line 17: ], schema=train_schema) Traceback (most recent 
> call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in 
>  exec(code, _zcUserQueryNameSpace) File "", line 17, in 
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", 
> line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, 
> data), schema) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
> _createFromLocal data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
>  data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
> toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in 
> toInternal return self.dataType.toInternal(obj) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in 
> toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 167, 
> in serialize raise TypeError("cannot serialize %r of type %r" % (obj, 
> type(obj

[jira] [Comment Edited] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855966#comment-16855966
 ] 

Bryan Cutler edited comment on SPARK-27939 at 6/4/19 6:11 PM:
--

The problem is the {{Row}} class sorts the field names alphabetically, which 
puts capital letters first and then conflicts with your schema:
{noformat}
r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), 
SALESCLOSEPRICE=143000)

In [3]: r.__fields__
Out[3]: ['SALESCLOSEPRICE', 'features']{noformat}
This is by design, but it is not intuitive and has caused lots of problems. 
Hopefully, we can improve this for Spark 3.0.0.

You can either just specify your data as tuples, for example:
{noformat}
In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 
143000)], schema=train_schema)

In [6]: train_df.show()
+--------------------+---------------+
|            features|SALESCLOSEPRICE|
+--------------------+---------------+
|(135,[0],[139900.0])|         143000|
+--------------------+---------------+
{noformat}
Or, if you want to use keyword arguments, define your own row class like this:
{noformat}
In [7]: MyRow = Row('features', 'SALESCLOSEPRICE')

In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000)
Out[8]: Row(features=SparseVector(135, {0: 139900.0}), 
SALESCLOSEPRICE=143000){noformat}


was (Author: bryanc):
The problem is the {{Row}} class sorts the field names alphabetically, which 
puts capital letters first and then conflicts with your schema:
{noformat}
r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), 
SALESCLOSEPRICE=143000)

In [3]: r.__fields__
Out[3]: ['SALESCLOSEPRICE', 'features']{noformat}
This is by design, but it is not intuitive and has caused lots of problems.

You can either just specify your data as tuples, for example:
{noformat}
In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 
143000)], schema=train_schema)

In [6]: train_df.show()
+--------------------+---------------+
|            features|SALESCLOSEPRICE|
+--------------------+---------------+
|(135,[0],[139900.0])|         143000|
+--------------------+---------------+
{noformat}
Or, if you want to use keyword arguments, define your own row class like this:
{noformat}
In [7]: MyRow = Row('features', 'SALESCLOSEPRICE')

In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000)
Out[8]: Row(features=SparseVector(135, {0: 139900.0}), 
SALESCLOSEPRICE=143000){noformat}

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 1

[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855966#comment-16855966
 ] 

Bryan Cutler commented on SPARK-27939:
--

The problem is the {{Row}} class sorts the field names alphabetically, which 
puts capital letters first and then conflicts with your schema:
{noformat}
r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), 
SALESCLOSEPRICE=143000)

In [3]: r.__fields__
Out[3]: ['SALESCLOSEPRICE', 'features']{noformat}
This is by design, but it is not intuitive and has caused lots of problems.

You can either just specify your data as tuples, for example:
{noformat}
In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 
143000)], schema=train_schema)

In [6]: train_df.show()
+--------------------+---------------+
|            features|SALESCLOSEPRICE|
+--------------------+---------------+
|(135,[0],[139900.0])|         143000|
+--------------------+---------------+
{noformat}
Or, if you want to use keyword arguments, define your own row class like this:
{noformat}
In [7]: MyRow = Row('features', 'SALESCLOSEPRICE')

In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000)
Out[8]: Row(features=SparseVector(135, {0: 139900.0}), 
SALESCLOSEPRICE=143000){noformat}
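A third option (a hedged sketch of the same idea, not from the original report; 
{{train_schema2}} is a made-up name) is to keep keyword-style {{Row}} construction 
and order the schema fields alphabetically so that they line up with the {{Row}} 
class's sorted {{__fields__}}:
{noformat}
In [9]: train_schema2 = T.StructType([
   ...:     T.StructField('SALESCLOSEPRICE', T.IntegerType()),
   ...:     T.StructField('features', VectorUDT())])

In [10]: train_df = spark.createDataFrame(
    ...:     [Row(features=SparseVector(135, {0: 139900.0}), SALESCLOSEPRICE=143000)],
    ...:     schema=train_schema2)
{noformat}
Since the schema order now matches {{Row.__fields__}}, {{train_df.show()}} should 
display both columns without hitting the serialization error from the description.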

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> SALESCLOSEPRICE=19),
>  Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: 
> 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: 
> 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, 
> 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, 
> 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: 
> 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, 
> 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), 
> SALESCLOSEPRICE=225000)
>  ], schema=train_schema)
>  
> train_df.printSchema()
> train_df.show()
> {code}
> Error  message:
> {code:java}
> // Fail to execute line 17: ], schema=train_schema) Traceback (most recent 
> call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in 
>  exec(code, _zcUserQueryNameSpace) File "", line 17, in 
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", 
> line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, 
> data), schema) 

[jira] [Assigned] (SPARK-27945) Make minimal changes to support columnar processing

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27945:


Assignee: Apache Spark

> Make minimal changes to support columnar processing
> ---
>
> Key: SPARK-27945
> URL: https://issues.apache.org/jira/browse/SPARK-27945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Assignee: Apache Spark
>Priority: Major
>
> As the first step for SPARK-27396 this is to put in the minimum changes 
> needed to allow a plugin to support columnar processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27945) Make minimal changes to support columnar processing

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27945:


Assignee: (was: Apache Spark)

> Make minimal changes to support columnar processing
> ---
>
> Key: SPARK-27945
> URL: https://issues.apache.org/jira/browse/SPARK-27945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> As the first step for SPARK-27396 this is to put in the minimum changes 
> needed to allow a plugin to support columnar processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2019-06-04 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855948#comment-16855948
 ] 

John Bauer commented on SPARK-17025:


[~Hadar] [~yug95] I wrote a minimal example of a PySpark estimator/model pair 
which can be saved and loaded, available at 
[ImputeNormal|https://github.com/JohnHBauer/ImputeNormal]; it imputes missing 
values from a normal distribution using parameters estimated from the data. 
Let me know if it helps you.
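
For anyone who just needs the general pattern rather than the full ImputeNormal 
example, here is a minimal, hedged sketch (class and column names are made up, and 
it assumes Spark 2.3+, where {{DefaultParamsReadable}}/{{DefaultParamsWritable}} 
are available) of a pure-Python transformer that can be saved and loaded without a 
{{_to_java}} counterpart:
{code:python}
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class AddOneTransformer(Transformer, HasInputCol, HasOutputCol,
                        DefaultParamsReadable, DefaultParamsWritable):
    """Toy transformer that writes inputCol + 1 into outputCol.

    All of its state lives in Params, so the default read/write
    mixins can persist it with no JVM (_to_java) counterpart.
    """

    def __init__(self, inputCol="input", outputCol="output"):
        super(AddOneTransformer, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        # Add 1 to the input column and store it under the output column name.
        return dataset.withColumn(self.getOutputCol(),
                                  F.col(self.getInputCol()) + 1)


# Hypothetical usage: persist and reload the stage on its own.
# t = AddOneTransformer(inputCol="x", outputCol="y")
# t.write().overwrite().save("/tmp/add_one_transformer")
# t2 = AddOneTransformer.load("/tmp/add_one_transformer")
{code}
The same mixins on each custom stage are what allow a whole Pipeline containing 
Python-only stages to be persisted on recent Spark versions.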

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled

2019-06-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27805:
-
Affects Version/s: (was: 3.1.0)
   2.4.3

> toPandas does not propagate SparkExceptions with arrow enabled
> --
>
> Key: SPARK-27805
> URL: https://issues.apache.org/jira/browse/SPARK-27805
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: David Vogelbacher
>Assignee: David Vogelbacher
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling {{toPandas}} with arrow enabled errors encountered during the 
> collect are not propagated to the python process.
> There is only a very general {{EofError}} raised.
> Example of behavior with arrow enabled vs. arrow disabled:
> {noformat}
> import traceback
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> def raise_exception():
>   raise Exception("My error")
> error_udf = udf(raise_exception, IntegerType())
> df = spark.range(3).toDF("i").withColumn("x", error_udf())
> try:
> df.toPandas()
> except:
> no_arrow_exception = traceback.format_exc()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> try:
> df.toPandas()
> except:
> arrow_exception = traceback.format_exc()
> print no_arrow_exception
> print arrow_exception
> {noformat}
> {{arrow_exception}} gives as output:
> {noformat}
> >>> print arrow_exception
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2143, in toPandas
> batches = self._collectAsArrow()
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2205, in _collectAsArrow
> results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 210, in load_stream
> num = read_int(stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 810, in read_int
> raise EOFError
> EOFError
> {noformat}
> {{no_arrow_exception}} gives as output:
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2166, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 516, in collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, 
> in deco
> return f(*a, **kw)
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o38.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 428, in main
> process()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 423, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 438, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 141, in dump_stream
> for obj in iterator:
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 427, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 86, in 
> return lambda *a: f(*a)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in 
> wrapper
> return f(*args, **kwargs)
>   File "", line 2, in raise_exception
> Exception: My error
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled

2019-06-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27805.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24677
[https://github.com/apache/spark/pull/24677]

> toPandas does not propagate SparkExceptions with arrow enabled
> --
>
> Key: SPARK-27805
> URL: https://issues.apache.org/jira/browse/SPARK-27805
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: David Vogelbacher
>Assignee: David Vogelbacher
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling {{toPandas}} with arrow enabled errors encountered during the 
> collect are not propagated to the python process.
> There is only a very general {{EofError}} raised.
> Example of behavior with arrow enabled vs. arrow disabled:
> {noformat}
> import traceback
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> def raise_exception():
>   raise Exception("My error")
> error_udf = udf(raise_exception, IntegerType())
> df = spark.range(3).toDF("i").withColumn("x", error_udf())
> try:
> df.toPandas()
> except:
> no_arrow_exception = traceback.format_exc()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> try:
> df.toPandas()
> except:
> arrow_exception = traceback.format_exc()
> print no_arrow_exception
> print arrow_exception
> {noformat}
> {{arrow_exception}} gives as output:
> {noformat}
> >>> print arrow_exception
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2143, in toPandas
> batches = self._collectAsArrow()
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2205, in _collectAsArrow
> results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 210, in load_stream
> num = read_int(stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 810, in read_int
> raise EOFError
> EOFError
> {noformat}
> {{no_arrow_exception}} gives as output:
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2166, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 516, in collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, 
> in deco
> return f(*a, **kw)
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o38.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 428, in main
> process()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 423, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 438, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 141, in dump_stream
> for obj in iterator:
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 427, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 86, in 
> return lambda *a: f(*a)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in 
> wrapper
> return f(*args, **kwargs)
>   File "", line 2, in raise_exception
> Exception: My error
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled

2019-06-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-27805:


Assignee: David Vogelbacher

> toPandas does not propagate SparkExceptions with arrow enabled
> --
>
> Key: SPARK-27805
> URL: https://issues.apache.org/jira/browse/SPARK-27805
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: David Vogelbacher
>Assignee: David Vogelbacher
>Priority: Major
>
> When calling {{toPandas}} with arrow enabled errors encountered during the 
> collect are not propagated to the python process.
> There is only a very general {{EofError}} raised.
> Example of behavior with arrow enabled vs. arrow disabled:
> {noformat}
> import traceback
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> def raise_exception():
>   raise Exception("My error")
> error_udf = udf(raise_exception, IntegerType())
> df = spark.range(3).toDF("i").withColumn("x", error_udf())
> try:
> df.toPandas()
> except:
> no_arrow_exception = traceback.format_exc()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> try:
> df.toPandas()
> except:
> arrow_exception = traceback.format_exc()
> print no_arrow_exception
> print arrow_exception
> {noformat}
> {{arrow_exception}} gives as output:
> {noformat}
> >>> print arrow_exception
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2143, in toPandas
> batches = self._collectAsArrow()
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2205, in _collectAsArrow
> results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 210, in load_stream
> num = read_int(stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 810, in read_int
> raise EOFError
> EOFError
> {noformat}
> {{no_arrow_exception}} gives as output:
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2166, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 516, in collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, 
> in deco
> return f(*a, **kw)
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o38.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 428, in main
> process()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 423, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 438, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 141, in dump_stream
> for obj in iterator:
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 427, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 86, in 
> return lambda *a: f(*a)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in 
> wrapper
> return f(*args, **kwargs)
>   File "", line 2, in raise_exception
> Exception: My error
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users

2019-06-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855896#comment-16855896
 ] 

Hyukjin Kwon commented on SPARK-27888:
--

Hm, as far as I can tell PySpark should work out of the box identically in Python 
2 and 3, except for Python-version-specific differences. If it doesn't, it's 
either an issue in PySpark or a difference between the Python versions themselves.

It's good to note this in the user guide (which I did at SPARK-27942), but I doubt 
it should be noted in the migration guide.

We deprecate Python 2 but haven't removed it yet, and so far the migration guide 
hasn't been updated for Python deprecation (or, as far as I can tell, for any 
other language deprecation).
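
As a hedged illustration of the kind of Python-version-specific difference meant 
here (one that lives in Python itself, not in PySpark; the column name is made up 
and an active {{spark}} session is assumed), division semantics alone can silently 
change a UDF's result:
{code:python}
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Python 2: 7 / 2 evaluates to 3 (floor division), so the UDF yields '3'.
# Python 3: 7 / 2 evaluates to 3.5 (true division), so the UDF yields '3.5'.
halve = udf(lambda n: str(n / 2), StringType())

df = spark.range(7, 8).toDF("n")   # a single row with n = 7
df.select("n", halve("n").alias("half")).show()
{code}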

> Python 2->3 migration guide for PySpark users
> -
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It 
> doesn't need to be comprehensive given many Python 2->3 migration guides 
> around. We just need some pointers and list items that are specific to 
> PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-06-04 Thread Robert Joseph Evans (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-27396:

Epic Name: Public APIs for extended Columnar Processing Support

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *SPIP: Columnar Processing Without Arrow Formatting Guarantees.*
>  
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to
>  # Add to the current sql extensions mechanism so advanced users can have 
> access to the physical SparkPlan and manipulate it to provide columnar 
> processing for existing operators, including shuffle.  This will allow them 
> to implement their own cost based optimizers to decide when processing should 
> be columnar and when it should not.
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the users so operations that are not columnar see the 
> data as rows, and operations that are columnar see the data as columns.
>  
> Not Requirements, but things that would be nice to have.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  
> *Q2.* What problem is this proposal NOT designed to solve? 
> The goal of this is not for ML/AI but to provide APIs for accelerated 
> computing in Spark primarily targeting SQL/ETL like workloads.  ML/AI already 
> have several mechanisms to get data into/out of them. These can be improved 
> but will be covered in a separate SPIP.
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation.
> This does not cover exposing the underlying format of the data.  The only way 
> to get at the data in a ColumnVector is through the public APIs.  Exposing 
> the underlying format to improve efficiency will be covered in a separate 
> SPIP.
> This is not trying to implement new ways of transferring data to external 
> ML/AI applications.  That is covered by separate SPIPs already.
> This is not trying to add in generic code generation for columnar processing. 
>  Currently code generation for columnar processing is only supported when 
> translating columns to rows.  We will continue to support this, but will not 
> extend it as a general solution. That will be covered in a separate SPIP if 
> we find it is helpful.  For now columnar processing will be interpreted.
> This is not trying to expose a way to get columnar data into Spark through 
> DataSource V2 or any other similar API.  That would be covered by a separate 
> SPIP if we find it is needed.
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
>  # Internal implementations of FileFormats can optionally return a 
> ColumnarBatch instead of rows.  The code generation phase knows how to take 
> that columnar data and iterate through it as rows for stages that want rows, 
> which currently is almost everything.  The limitations here are mostly 
> implementation specific. The current standard is to abuse Scala’s type 
> erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The 
> code generation can handle this because it is generating Java code, so it 
> bypasses Scala’s type checking and just casts the InternalRow to the desired 
> ColumnarBatch.  This makes it difficult for others to implement the same 
> functionality for different processing because they can only do it through 
> code generation. There really is no clean separate path in the code 
> generation for columnar vs row based processing. Additionally, because it is 
> only supported through code generation, there is no fallback if code 
> generation fails for any reason.  This is typically fine for input formats 
> but can be problematic when we get into more extensive processing.
>  # When caching data it can optionally be cached in a columnar format if the 
> input is also columnar.  This is similar to the first area and has the same 
> limitations because the ca
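The type-erasure workaround described in Q3 can be illustrated with a short sketch. This is an editor's illustration under assumptions, not Spark source: the helpers asRowRDD and rowsOf are invented names, while RDD, InternalRow and ColumnarBatch are the real Spark classes involved.

{code:java}
// Rough sketch of the trick: a source builds ColumnarBatch values but advertises
// them as InternalRow, relying on generated Java code to cast them back later.
import scala.collection.JavaConverters._

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical helper: erase the element type so batches ride inside an
// RDD[InternalRow] untouched (Scala does not check element types at runtime).
def asRowRDD(batches: RDD[ColumnarBatch]): RDD[InternalRow] =
  batches.asInstanceOf[RDD[InternalRow]]

// Hypothetical consumer: undo the cast and fall back to row-at-a-time iteration,
// which is roughly what the generated code does for row-based stages.
def rowsOf(erased: RDD[InternalRow]): RDD[InternalRow] =
  erased.mapPartitions { iter =>
    iter.flatMap(row => row.asInstanceOf[ColumnarBatch].rowIterator().asScala)
  }
{code}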

[jira] [Created] (SPARK-27945) Make minimal changes to support columnar processing

2019-06-04 Thread Robert Joseph Evans (JIRA)
Robert Joseph Evans created SPARK-27945:
---

 Summary: Make minimal changes to support columnar processing
 Key: SPARK-27945
 URL: https://issues.apache.org/jira/browse/SPARK-27945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Robert Joseph Evans


As the first step for SPARK-27396 this is to put in the minimum changes needed 
to allow a plugin to support columnar processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-06-04 Thread Robert Joseph Evans (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-27396:

Issue Type: Epic  (was: Improvement)

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *SPIP: Columnar Processing Without Arrow Formatting Guarantees.*
>  
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to
>  # Add to the current sql extensions mechanism so advanced users can have 
> access to the physical SparkPlan and manipulate it to provide columnar 
> processing for existing operators, including shuffle.  This will allow them 
> to implement their own cost based optimizers to decide when processing should 
> be columnar and when it should not.
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the users so operations that are not columnar see the 
> data as rows, and operations that are columnar see the data as columns.
>  
> Not Requirements, but things that would be nice to have.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  
> *Q2.* What problem is this proposal NOT designed to solve? 
> The goal of this is not for ML/AI but to provide APIs for accelerated 
> computing in Spark primarily targeting SQL/ETL like workloads.  ML/AI already 
> have several mechanisms to get data into/out of them. These can be improved 
> but will be covered in a separate SPIP.
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation.
> This does not cover exposing the underlying format of the data.  The only way 
> to get at the data in a ColumnVector is through the public APIs.  Exposing 
> the underlying format to improve efficiency will be covered in a separate 
> SPIP.
> This is not trying to implement new ways of transferring data to external 
> ML/AI applications.  That is covered by separate SPIPs already.
> This is not trying to add in generic code generation for columnar processing. 
>  Currently code generation for columnar processing is only supported when 
> translating columns to rows.  We will continue to support this, but will not 
> extend it as a general solution. That will be covered in a separate SPIP if 
> we find it is helpful.  For now columnar processing will be interpreted.
> This is not trying to expose a way to get columnar data into Spark through 
> DataSource V2 or any other similar API.  That would be covered by a separate 
> SPIP if we find it is needed.
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
>  # Internal implementations of FileFormats can optionally return a 
> ColumnarBatch instead of rows.  The code generation phase knows how to take 
> that columnar data and iterate through it as rows for stages that want rows, 
> which currently is almost everything.  The limitations here are mostly 
> implementation specific. The current standard is to abuse Scala’s type 
> erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The 
> code generation can handle this because it is generating Java code, so it 
> bypasses Scala’s type checking and just casts the InternalRow to the desired 
> ColumnarBatch.  This makes it difficult for others to implement the same 
> functionality for different processing because they can only do it through 
> code generation. There really is no clean separate path in the code 
> generation for columnar vs row based processing. Additionally, because it is 
> only supported through code generation, there is no fallback if code 
> generation fails for any reason.  This is typically fine for input formats 
> but can be problematic when we get into more extensive processing.
>  # When caching data it can optionally be cached in a columnar format if the 
> input is also columnar.  This is similar to the first area and has the same 
> limitations because the cache acts as an input, but i

[jira] [Updated] (SPARK-27933) Extracting common purge "behaviour" to the parent StreamExecution

2019-06-04 Thread Jacek Laskowski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-27933:

Summary: Extracting common purge "behaviour" to the parent StreamExecution  
(was: Introduce StreamExecution.purge for removing entries from metadata logs)

> Extracting common purge "behaviour" to the parent StreamExecution
> -
>
> Key: SPARK-27933
> URL: https://issues.apache.org/jira/browse/SPARK-27933
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Jacek Laskowski
>Priority: Minor
>
> Extracting the common {{purge}} "behaviour" to the parent {{StreamExecution}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users

2019-06-04 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855838#comment-16855838
 ] 

Xiangrui Meng commented on SPARK-27888:
---

It would be nice if we could find some PySpark users who have already migrated 
from Python 2 to 3 and can tell us what issues users should expect and any 
benefits from the migration.

> Python 2->3 migration guide for PySpark users
> -
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It 
> doesn't need to be comprehensive given many Python 2->3 migration guides 
> around. We just need some pointers and list items that are specific to 
> PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-27884:
-

Assignee: Xiangrui Meng

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: release-notes
>
> Officially deprecate Python 2 support in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users

2019-06-04 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855832#comment-16855832
 ] 

Xiangrui Meng commented on SPARK-27888:
---

This JIRA is not to inform users that Python 2 is deprecated. It provides info 
to help users migrate to Python 3, though I don't know if there are things 
specific to PySpark. I think it should stay in the user guide instead of a 
release note.

> Python 2->3 migration guide for PySpark users
> -
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It 
> doesn't need to be comprehensive given many Python 2->3 migration guides 
> around. We just need some pointers and list items that are specific to 
> PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27886) Add Apache Spark project to https://python3statement.org/

2019-06-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-27886.
---
Resolution: Done

> Add Apache Spark project to https://python3statement.org/
> -
>
> Key: SPARK-27886
> URL: https://issues.apache.org/jira/browse/SPARK-27886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> Add Spark to https://python3statement.org/ and indicate our timeline. I 
> reviewed the statement at https://python3statement.org/. Most projects listed 
> there will *drop* Python 2 before 2020/01/01 instead of deprecating it with 
> only [one 
> exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
>  We certainly cannot drop Python 2 support in 2019 given we haven't 
> deprecated it yet.
> Maybe we can add the following timeline:
> * 2019/10 - 2020/04: Python 2 & 3
> * 2020/04 - : Python 3 only
> The switch would happen in the next release after Spark 3.0. If we want to 
> hold another release, it would be Sept or Oct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27942) Note that Python 2.7 is deprecated in Spark documentation

2019-06-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27942:
--
Component/s: Documentation

> Note that Python 2.7 is deprecated in Spark documentation
> -
>
> Key: SPARK-27942
> URL: https://issues.apache.org/jira/browse/SPARK-27942
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27942) Note that Python 2.7 is deprecated in Spark documentation

2019-06-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27942.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24789

> Note that Python 2.7 is deprecated in Spark documentation
> -
>
> Key: SPARK-27942
> URL: https://issues.apache.org/jira/browse/SPARK-27942
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27772) SQLTestUtils Refactoring

2019-06-04 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27772.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24747
[https://github.com/apache/spark/pull/24747]

> SQLTestUtils Refactoring
> 
>
> Key: SPARK-27772
> URL: https://issues.apache.org/jira/browse/SPARK-27772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: William Wong
>Assignee: William Wong
>Priority: Minor
> Fix For: 3.0.0
>
>
> The current `SQLTestUtils` defines many `withXXX` utility functions to clean 
> up tables/views/caches created for testing purposes. Some of those `withXXX` 
> functions ignore certain exceptions, like `NoSuchTableException`, in the 
> clean-up block (i.e., the finally block). 
>  
> {code:java}
> /**
>  * Drops temporary view `viewNames` after calling `f`.
>  */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   try f finally {
> // If the test failed part way, we don't want to mask the failure by 
> failing to remove
> // temp views that never got created.
> try viewNames.foreach(spark.catalog.dropTempView) catch {
>   case _: NoSuchTableException =>
> }
>   }
> }
> {code}
> I believe this is not the best approach, because it is hard to anticipate 
> which exceptions should or should not be ignored.  
>  
> Java's `try-with-resources` statement does not mask an exception thrown in the 
> try block with an exception caught in the 'close()' statement. An exception 
> caught in the 'close()' statement is added as a suppressed exception 
> instead. That sounds like a better approach.
>  
> Therefore, I propose to standardise those 'withXXX' functions with the 
> following `tryWithFinally` function, which does something similar to Java's 
> try-with-resources statement. 
> {code:java}
> /**
> * Drops temporary view `viewNames` after calling `f`.
> */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView))
> }
> /**
>  * Executes the given tryBlock and then the given finallyBlock no matter 
> whether tryBlock throws
>  * an exception. If both tryBlock and finallyBlock throw exceptions, the 
> exception thrown
>  * from the finallyBlock will be added to the exception thrown from tryBlock 
> as a
>  * suppressed exception. This helps to avoid masking the exception from 
> tryBlock with the exception
>  * from finallyBlock.
>  */
> private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = {
>   var fromTryBlock: Throwable = null
>   try tryBlock catch {
> case cause: Throwable =>
>   fromTryBlock = cause
>   throw cause
>   } finally {
> if (fromTryBlock != null) {
>   try finallyBlock catch {
> case fromFinallyBlock: Throwable =>
>   fromTryBlock.addSuppressed(fromFinallyBlock)
>   throw fromTryBlock
>   }
> } else {
>   finallyBlock
> }
>   }
> }
> {code}
> If a feature is well written, we should not hit any exception in those 
> clean-up methods in test cases. The purpose of this proposal is to help 
> developers identify what actually breaks their tests.
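As a hedged illustration (not part of the proposed patch), the sketch below assumes the tryWithFinally helper shown above is in scope and shows how a clean-up failure surfaces as a suppressed exception instead of replacing the original test failure:

{code:java}
try {
  tryWithFinally {
    throw new RuntimeException("test body failed")     // the real failure
  } {
    throw new IllegalStateException("cleanup failed")  // previously this masked it
  }
} catch {
  case e: RuntimeException =>
    println(e.getMessage)                              // prints: test body failed
    e.getSuppressed.foreach(s => println("suppressed: " + s.getMessage)) // cleanup failed
}
{code}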



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27772) SQLTestUtils Refactoring

2019-06-04 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27772:
-

Assignee: William Wong

> SQLTestUtils Refactoring
> 
>
> Key: SPARK-27772
> URL: https://issues.apache.org/jira/browse/SPARK-27772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: William Wong
>Assignee: William Wong
>Priority: Minor
>
> The current `SQLTestUtils` defines many `withXXX` utility functions to clean 
> up tables/views/caches created for testing purposes. Some of those `withXXX` 
> functions ignore certain exceptions, like `NoSuchTableException`, in the 
> clean-up block (i.e., the finally block). 
>  
> {code:java}
> /**
>  * Drops temporary view `viewNames` after calling `f`.
>  */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   try f finally {
> // If the test failed part way, we don't want to mask the failure by 
> failing to remove
> // temp views that never got created.
> try viewNames.foreach(spark.catalog.dropTempView) catch {
>   case _: NoSuchTableException =>
> }
>   }
> }
> {code}
> I believe this is not the best approach, because it is hard to anticipate 
> which exceptions should or should not be ignored.  
>  
> Java's `try-with-resources` statement does not mask an exception thrown in the 
> try block with an exception caught in the 'close()' statement. An exception 
> caught in the 'close()' statement is added as a suppressed exception 
> instead. That sounds like a better approach.
>  
> Therefore, I propose to standardise those 'withXXX' functions with the 
> following `tryWithFinally` function, which does something similar to Java's 
> try-with-resources statement. 
> {code:java}
> /**
> * Drops temporary view `viewNames` after calling `f`.
> */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView))
> }
> /**
>  * Executes the given tryBlock and then the given finallyBlock no matter 
> whether tryBlock throws
>  * an exception. If both tryBlock and finallyBlock throw exceptions, the 
> exception thrown
>  * from the finallyBlock will be added to the exception thrown from tryBlock 
> as a
>  * suppressed exception. This helps to avoid masking the exception from 
> tryBlock with the exception
>  * from finallyBlock.
>  */
> private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = {
>   var fromTryBlock: Throwable = null
>   try tryBlock catch {
> case cause: Throwable =>
>   fromTryBlock = cause
>   throw cause
>   } finally {
> if (fromTryBlock != null) {
>   try finallyBlock catch {
> case fromFinallyBlock: Throwable =>
>   fromTryBlock.addSuppressed(fromFinallyBlock)
>   throw fromTryBlock
>   }
> } else {
>   finallyBlock
> }
>   }
> }
> {code}
> If a feature is well written, we should not hit any exception in those 
> clean-up methods in test cases. The purpose of this proposal is to help 
> developers identify what actually breaks their tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27567) Spark Streaming consumers (from Kafka) intermittently die with 'SparkException: Couldn't find leaders for Set'

2019-06-04 Thread Dmitry Goldenberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Goldenberg reopened SPARK-27567:
---

The issue is not resolved; it appears intermittently. We have a QA box with 
Kafka installed and our Spark Streaming job pulling from it. Kafka has its 
replication factor set to 1 since it's a cluster of one node. We just saw the 
error there.

> Spark Streaming consumers (from Kafka) intermittently die with 
> 'SparkException: Couldn't find leaders for Set'
> --
>
> Key: SPARK-27567
> URL: https://issues.apache.org/jira/browse/SPARK-27567
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.0
> Environment: GCP / 170~14.04.1-Ubuntu
>Reporter: Dmitry Goldenberg
>Priority: Major
>
> Some of our consumers intermittently die with the stack traces I'm including. 
> Once restarted they run for a while then die again.
> I can't find any cohesive documentation on what this error means and how to 
> go about troubleshooting it. Any help would be appreciated.
> *Kafka version* is 0.8.2.1 (2.10-0.8.2.1).
> Some of the errors seen look like this:
> {noformat}
> ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 2 on 
> 10.150.0.54: remote Rpc client disassociated{noformat}
> Main error stack trace:
> {noformat}
> 2019-04-23 20:36:54,323 ERROR 
> org.apache.spark.streaming.scheduler.JobScheduler: Error g
> enerating jobs for time 1556066214000 ms
> org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: 
> Couldn't find leaders for Set([hdfs.hbase.acme.attachments,49], 
> [hdfs.hbase.acme.attachmen
> ts,63], [hdfs.hbase.acme.attachments,31], [hdfs.hbase.acme.attachments,9], 
> [hdfs.hbase.acme.attachments,25], [hdfs.hbase.acme.attachments,55], 
> [hdfs.hbase.acme.attachme
> nts,5], [hdfs.hbase.acme.attachments,37], [hdfs.hbase.acme.attachments,7], 
> [hdfs.hbase.acme.attachments,47], [hdfs.hbase.acme.attachments,13], 
> [hdfs.hbase.acme.attachme
> nts,43], [hdfs.hbase.acme.attachments,19], [hdfs.hbase.acme.attachments,15], 
> [hdfs.hbase.acme.attachments,23], [hdfs.hbase.acme.attachments,53], 
> [hdfs.hbase.acme.attach
> ments,1], [hdfs.hbase.acme.attachments,27], [hdfs.hbase.acme.attachments,57], 
> [hdfs.hbase.acme.attachments,39], [hdfs.hbase.acme.attachments,11], 
> [hdfs.hbase.acme.attac
> hments,29], [hdfs.hbase.acme.attachments,33], 
> [hdfs.hbase.acme.attachments,35], [hdfs.hbase.acme.attachments,51], 
> [hdfs.hbase.acme.attachments,45], [hdfs.hbase.acme.att
> achments,21], [hdfs.hbase.acme.attachments,3], 
> [hdfs.hbase.acme.attachments,59], [hdfs.hbase.acme.attachments,41], 
> [hdfs.hbase.acme.attachments,17], [hdfs.hbase.acme.at
> tachments,61]))
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
>  ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.j
> ar:?]
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
>  ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.ja
> r:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.ja
> r:1.5.0]
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) 
> ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at scala.Option.orElse(Option.scala:257) 
> ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) 
> ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.jar:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonf

[jira] [Updated] (SPARK-18569) Support R formula arithmetic

2019-06-04 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18569:
--
Affects Version/s: 2.4.3
   Issue Type: Improvement  (was: Sub-task)
   Parent: (was: SPARK-15540)

> Support R formula arithmetic 
> -
>
> Key: SPARK-18569
> URL: https://issues.apache.org/jira/browse/SPARK-18569
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.4.3
>Reporter: Felix Cheung
>Priority: Major
>
> I think we should support arithmetic, which makes it a lot more convenient to 
> build models. Something like
> {code}
>   log(y) ~ a + log(x)
> {code}
> And to avoid resolution confusion we should support the I() operator:
> {code}
> I
>  I(X∗Z) as is: include a new variable consisting of these variables multiplied
> {code}
> Such that this works:
> {code}
> y ~ a + I(b+c)
> {code}
> the term b+c is to be interpreted as the sum of b and c.
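As a hedged sketch of what this would enable through Spark ML's existing RFormula API (RFormula and its setters exist today, but the arithmetic and I() terms in the formula string below are exactly what is being requested and do not parse yet):

{code:java}
import org.apache.spark.ml.feature.RFormula

// Illustrative only: the formula string uses the requested, currently unsupported syntax.
val formula = new RFormula()
  .setFormula("log(y) ~ a + I(b + c)")
  .setFeaturesCol("features")
  .setLabelCol("label")

// formula.fit(trainingDF).transform(trainingDF) would then treat log(y) as the label
// and collapse b + c into a single feature term.
{code}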



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18570) Consider supporting other R formula operators

2019-06-04 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18570.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24764
[https://github.com/apache/spark/pull/24764]

> Consider supporting other R formula operators
> -
>
> Key: SPARK-18570
> URL: https://issues.apache.org/jira/browse/SPARK-18570
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Ozan Cicekci
>Priority: Minor
> Fix For: 3.0.0
>
>
> Such as
> {code}
> ∗ 
>  X∗Y include these variables and the interactions between them
> ^
>  (X + Z + W)^3 include these variables and all interactions up to three way
> |
>  X | Z conditioning: include x given z
> {code}
> Other includes, %in%, ` (backtick)
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18570) Consider supporting other R formula operators

2019-06-04 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-18570:
-

Assignee: Ozan Cicekci

> Consider supporting other R formula operators
> -
>
> Key: SPARK-18570
> URL: https://issues.apache.org/jira/browse/SPARK-18570
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Ozan Cicekci
>Priority: Minor
>
> Such as
> {code}
> ∗ 
>  X∗Y include these variables and the interactions between them
> ^
>  (X + Z + W)^3 include these variables and all interactions up to three way
> |
>  X | Z conditioning: include x given z
> {code}
> Other includes, %in%, ` (backtick)
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27944) Unify the behavior of checking empty output column names

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27944:


Assignee: Apache Spark

> Unify the behavior of checking empty output column names
> 
>
> Key: SPARK-27944
> URL: https://issues.apache.org/jira/browse/SPARK-27944
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> In some algs (LDA, DT, GBTC, SVC, LR, MLP, NB), the transform method will 
> check whether an output column name is empty, and if all names are empty, it 
> will log a warning msg and do nothing.
>  
> It may be better to make other algs (regression, ovr, clustering, als) also 
> check the output columns before transformation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27944) Unify the behavior of checking empty output column names

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27944:


Assignee: (was: Apache Spark)

> Unify the behavior of checking empty output column names
> 
>
> Key: SPARK-27944
> URL: https://issues.apache.org/jira/browse/SPARK-27944
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In some algs (LDA, DT, GBTC, SVC, LR, MLP, NB), the transform method will 
> check whether an output column name is empty, and if all names are empty, it 
> will log a warning msg and do nothing.
>  
> It may be better to make other algs (regression, ovr, clustering, als) also 
> check the output columns before transformation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27900) Spark on K8s will not report container failure due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855598#comment-16855598
 ] 

Stavros Kontopoulos commented on SPARK-27900:
-

Setting Thread.setDefaultUncaughtExceptionHandler(new 
SparkUncaughtExceptionHandler)

in SparkSubmit in client mode caused the handler to be invoked on the driver 
side:

19/06/04 11:01:42 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[dag-scheduler-event-loop,5,main]
java.lang.OutOfMemoryError: Java heap space
 at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
 at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
 at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
 at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:85)
 at 
org.apache.spark.scheduler.TaskSetManager.addPendingTask(TaskSetManager.scala:264)
 at 
org.apache.spark.scheduler.TaskSetManager.$anonfun$addPendingTasks$2(TaskSetManager.scala:194)
 at 
org.apache.spark.scheduler.TaskSetManager$$Lambda$1109/206130956.apply$mcVI$sp(Unknown
 Source)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.scheduler.TaskSetManager.$anonfun$addPendingTasks$1(TaskSetManager.scala:193)
 at 
org.apache.spark.scheduler.TaskSetManager$$Lambda$1108/329172165.apply$mcV$sp(Unknown
 Source)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:534)
 at 
org.apache.spark.scheduler.TaskSetManager.addPendingTasks(TaskSetManager.scala:192)
 at org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:189)
 at 
org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:252)
 at 
org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:210)
 at 
org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1233)
 at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1084)
 at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1028)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2126)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2118)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2107)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
19/06/04 11:01:42 INFO SparkContext: Invoking stop() from shutdown hook
19/06/04 11:01:42 INFO SparkUI: Stopped Spark web UI at 
http://spark-pi2-1559645994185-driver-svc.spark.svc:4040
19/06/04 11:01:42 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 
spark-pi2-1559645994185-driver-svc.spark.svc:7079 in memory (size: 1765.0 B, 
free: 110.0 MiB)

 

The main thread, though, is stuck at: 

"main" #1 prio=5 os_prio=0 tid=0x5653d3a5e800 nid=0x1d waiting on condition 
[0x7f31b7ca7000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0xf0fd01a0> (a 
scala.concurrent.impl.Promise$CompletionLatch)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242)
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
 at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:242)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:736)

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L736]
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)

This should be configurable imho.

 

> Spark on K8s will not report container failure due to an oom error
> --
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
>  kind:

[jira] [Created] (SPARK-27944) Unify the behavior of checking empty output column names

2019-06-04 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-27944:


 Summary: Unify the behavior of checking empty output column names
 Key: SPARK-27944
 URL: https://issues.apache.org/jira/browse/SPARK-27944
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


In some algs (LDA, DT, GBTC, SVC, LR, MLP, NB), the transform method will check 
whether an output column name is empty, and if all names are empty, it will log 
a warning msg and do nothing.

 

It may be better to make other algs (regression, ovr, clustering, als) also 
check the output columns before transformation.
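A minimal sketch of the kind of shared guard this could unify; the helper name and the column parameters are illustrative assumptions by the editor, not existing ML code:

{code:java}
// Hypothetical shared helper: decide whether transform() should be a no-op
// because every output column name is empty.
def allOutputColsEmpty(outputCols: String*): Boolean =
  outputCols.forall(_.isEmpty)

// A transformer could then guard its transform() roughly like this:
//   if (allOutputColsEmpty($(predictionCol), $(probabilityCol))) {
//     logWarning("All output column names are empty: transform() does nothing.")
//     return dataset.toDF()
//   }
{code}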



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27943) Add default constraint when create hive table

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27943:


Assignee: Apache Spark

> Add default constraint when create hive table
> -
>
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> A default constraint on a column is part of the ANSI SQL standard.
> Hive 3.0+ supports default constraints 
> (ref: https://issues.apache.org/jira/browse/HIVE-18726).
> But Spark SQL does not implement this feature yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27943) Add default constraint when create hive table

2019-06-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27943:


Assignee: (was: Apache Spark)

> Add default constraint when create hive table
> -
>
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> A default constraint on a column is part of the ANSI SQL standard.
> Hive 3.0+ supports default constraints 
> (ref: https://issues.apache.org/jira/browse/HIVE-18726).
> But Spark SQL does not implement this feature yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27895) Spark streaming - RDD filter is always refreshing providing updated filtered items

2019-06-04 Thread Ilias Karalis (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855547#comment-16855547
 ] 

Ilias Karalis commented on SPARK-27895:
---

The results of RDD filter change all the time until next batch comes in.

I am not sure if it has to do with the filter function itself. But this 
function and filter method work fine , if instead of filtering the RDD, we 
filter a collection.

inputRdd.filter... : results are not stable and change all the time until next 
batch comes in.

inputRdd.collect().filter... : computes results just once and results are 
stable until next batch comes in.
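A standalone, hypothetical sketch of the difference described above (not the reporter's code): a nondeterministic predicate is re-evaluated every time the RDD is computed, whereas filtering the collected Array evaluates it exactly once.

{code:java}
import scala.util.Random

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("filter-sketch").getOrCreate()
val sc = spark.sparkContext

val inputRdd = sc.parallelize(1L to 1000L)
val sampled = inputRdd.filter(_ => Random.nextFloat() < 0.01f)

// Each action recomputes the filter, so these two counts can differ.
println(sampled.count())
println(sampled.count())

// Filtering the collected Array runs the predicate once; the result is fixed.
val sampledOnce = inputRdd.collect().filter(_ => Random.nextFloat() < 0.01f)
println(sampledOnce.length)
println(sampledOnce.length)
{code}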

 

> Spark streaming - RDD filter is always refreshing providing updated filtered 
> items
> --
>
> Key: SPARK-27895
> URL: https://issues.apache.org/jira/browse/SPARK-27895
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.2, 2.4.3
> Environment: Intellij, running local in windows10 laptop.
>  
>Reporter: Ilias Karalis
>Priority: Major
>
> Spark streaming: 2.4.x
> Scala: 2.11.11
>  
>  
> foreachRDD of DStream,
> in case filter is used on RDD then filter is always refreshing, providing new 
> results continuously until new batch is processed. For the new batch, the 
> same occurs.
> With the same code, if we do rdd.collect() and then run the filter on the 
> collection, we get just one time results, which remains stable until new 
> batch is coming in.
> Filter function is based on random probability (reservoir sampling).
>  
> val toSampleRDD: RDD[(Long, Long)] = inputRdd.filter(x => chooseX(x))
>  
> def chooseX(x: (Long, Long)): Boolean = {
>   val r = scala.util.Random
>   val p = r.nextFloat()
>   edgeTotalCounter += 1
>   if (p < (sampleLength.toFloat / edgeTotalCounter.toFloat)) {
>     edgeLocalRDDCounter += 1
>     println("Edge " + x + " has been selected and is number : " + edgeLocalRDDCounter + ".")
>     true
>   }
>   else false
> }
>  
> edgeLocalRDDCounter counts selected edges from inputRDD.
> Strange is that the counter is increased 1st time from 1 to y, then filter 
> continues to run unexpectedly again and the counter is increased again 
> starting from y+1 to z. After that each time filter unexpectedly continues to 
> run, it provides results for which the counter starts from y+1. Each time 
> filter runs provides different results and filters different number of edges.
> toSampleRDD always changes accordingly to new provided results.
> When new batch is coming in then it starts the same behavior for the new 
> batch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27895) Spark streaming - RDD filter is always refreshing providing updated filtered items

2019-06-04 Thread Ilias Karalis (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilias Karalis updated SPARK-27895:
--
Description: 
Spark streaming: 2.4.x

Scala: 2.11.11

 

 

foreachRDD of DStream,

in case filter is used on RDD then filter is always refreshing, providing new 
results continuously until new batch is processed. For the new batch, the same 
occurs.

With the same code, if we do rdd.collect() and then run the filter on the 
collection, we get just one time results, which remains stable until new batch 
is coming in.

Filter function is based on random probability (reservoir sampling).

 

val toSampleRDD: RDD[(Long, Long)] = inputRdd.filter(x => chooseX(x))

def chooseX(x: (Long, Long)): Boolean = {
  val r = scala.util.Random
  val p = r.nextFloat()
  edgeTotalCounter += 1
  if (p < (sampleLength.toFloat / edgeTotalCounter.toFloat)) {
    edgeLocalRDDCounter += 1
    println("Edge " + x + " has been selected and is number : " + edgeLocalRDDCounter + ".")
    true
  }
  else false
}

 

edgeLocalRDDCounter counts selected edges from inputRDD.

Strange is that the counter is increased 1st time from 1 to y, then filter 
continues to run unexpectedly again and the counter is increased again starting 
from y+1 to z. After that each time filter unexpectedly continues to run, it 
provides results for which the counter starts from y+1. Each time filter runs 
provides different results and filters different number of edges.

toSampleRDD always changes accordingly to new provided results.

When new batch is coming in then it starts the same behavior for the new batch.

  was:
Spark streaming: 2.4.x

Scala: 2.11.11

 

foreachRDD of DStream,

in case filter is used on RDD then filter is always refreshing, providing new 
results continuously until new batch is processed. For the new batch, the same 
occurs.

With the same code, if we do rdd.collect() and then run the filter on the 
collection, we get just one time results, which remains stable until new batch 
is coming in.

Filter function is based on random probability (reservoir sampling).

 

val toSampleRDD: RDD[(Long, Long)] = inputRdd.filter(x => chooseX(x))

def chooseX(x: (Long, Long)): Boolean = {
  val r = scala.util.Random
  val p = r.nextFloat()
  edgeTotalCounter += 1
  if (p < (sampleLength.toFloat / edgeTotalCounter.toFloat)) {
    edgeLocalRDDCounter += 1
    println("Edge " + x + " has been selected and is number : " + edgeLocalRDDCounter + ".")
    true
  }
  else false
}

 

edgeLocalRDDCounter counts selected edges from inputRDD.

Strange is that the counter is increased 1st time from 1 to y, then filter 
continues to run unexpectedly again and the counter is increased again starting 
from y+1 to z. After that each time filter unexpectedly continues to run, it 
provides results for which the counter starts from y+1. Each time filter runs 
provides different results and filters different number of edges.

toSampleRDD always changes accordingly to new provided results.

When new batch is coming in then it starts the same behavior for the new batch.


> Spark streaming - RDD filter is always refreshing providing updated filtered 
> items
> --
>
> Key: SPARK-27895
> URL: https://issues.apache.org/jira/browse/SPARK-27895
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.2, 2.4.3
> Environment: Intellij, running local in windows10 laptop.
>  
>Reporter: Ilias Karalis
>Priority: Major
>
> Spark streaming: 2.4.x
> Scala: 2.11.11
>  
>  
> foreachRDD of DStream,
> in case filter is used on RDD then filter is always refreshing, providing new 
> results continuously until new batch is processed. For the new batch, the 
> same occurs.
> With the same code, if we do rdd.collect() and then run the filter on the 
> collection, we get just one time results, which remains stable until new 
> batch is coming in.
> Filter function is based on random probability (reservoir sampling).
>  
> val toSampleRDD: RDD[(Long, Long)] = inputRdd.filter(x => chooseX(x))
>  
> def chooseX(x: (Long, Long)): Boolean = {
>   val r = scala.util.Random

[jira] [Created] (SPARK-27943) Add default constraint when create hive table

2019-06-04 Thread jiaan.geng (JIRA)
jiaan.geng created SPARK-27943:
--

 Summary: Add default constraint when create hive table
 Key: SPARK-27943
 URL: https://issues.apache.org/jira/browse/SPARK-27943
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0, 2.3.0
Reporter: jiaan.geng


A default constraint on a column is part of the ANSI SQL standard.

Hive 3.0+ supports default constraints 
(ref: https://issues.apache.org/jira/browse/HIVE-18726).

But Spark SQL does not implement this feature yet.
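For illustration, a hedged sketch of the ANSI/Hive 3 style DDL this ticket asks Spark SQL to accept; the table and column names are made up, and today Spark's parser rejects the DEFAULT clause:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("default-constraint-sketch")
  .enableHiveSupport()
  .getOrCreate()

// The DEFAULT clause below is accepted by Hive 3.0+ but not by Spark SQL today.
spark.sql(
  """CREATE TABLE users (
    |  id      BIGINT,
    |  name    STRING,
    |  country STRING DEFAULT 'unknown'
    |) STORED AS PARQUET""".stripMargin)
{code}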



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision, maintain and manage a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively, it would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and decrypting the secret at each executor, 
which delegate authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package and rebuilding 
the entire project. Having a pluggable scheduler per 
https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
different scheduler backends backed by different cloud providers.
 * client-cluster communication: I am not very familiar with the network part 
of the code base so I might be wrong on this. My understanding is that the code 
base assumes that the client and the cluster are on the same network and the 
nodes communicate with each other via hostname/ip. For security best practice, 
it is advised to run the executors in a private protected network, which may be 
separate from the client machine's network. Since we are serverless, that means 
the client needs to first launch the driver into the private network, and the 
driver in turn starts the executors, potentially doubling job initialization 
time. This can be solved by dropping complete serverlessness and having a 
persistent host in the private network, or (I do not have a POC, so I am not 
sure if this actually works) by implementing client-cluster communication via 
message queues in the cloud to get around the network separation.
 * shuffle storage and retrieval: external shuffle in yarn relies on the 
existence of a persistent cluster that continues to serve shuffle files beyond 
the lifecycle of the executors. This assumption no longer holds in a serverless 
cluster with only transient containers. Pluggable remote shuffle storage per 
https://issues.apache.org/jira/browse/SPARK-25299 would make it easier to 
introduce new cloud-backed shuffle.

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and decrypting the secret at each executor, 
which delegate authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package and rebuilding 
the entire project. Having a pluggable scheduler per 
https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
different scheduler backends backed by different cloud providers.
 * client-cluster communication: I am not very familiar with the network part 
of the code base so I might be wrong on this. My understanding is that the code 
base assumes that the client and the cluster are on the same network and the 
nodes communicate with each other via hostname/ip. 
 * shuffle storage and retrieval: 


> Serverless Spark in the Cloud

[jira] [Comment Edited] (SPARK-27900) Spark on K8s will not report container failure due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854449#comment-16854449
 ] 

Stavros Kontopoulos edited comment on SPARK-27900 at 6/4/19 9:13 AM:
-

A better approach is to try using -XX:+ExitOnOutOfMemoryError or to set the 
SparkUncaughtExceptionHandler  
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L88]
 in the driver (e.g. SparkContext). For the executor part, the related JIRA is 
this one: https://issues.apache.org/jira/browse/SPARK-1772

[~sro...@scient.com] do you know why we don't set the handler at the driver side?

 


was (Author: skonto):
A better approach is to try use  -XX:+ExitOnOutOfMemoryError or set the 
SparkUncaughtExceptionHandler  
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L88]
 in the driver (eg. SparkContext).

[~sro...@scient.com] do you know why we dont set the handler at the driver side?

 

> Spark on K8s will not report container failure due to an oom error
> --
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
>  kind: SparkApplication
>  metadata:
>  name: spark-pi
>  namespace: spark
>  spec:
>  type: Scala
>  mode: cluster
>  image: "skonto/spark:k8s-3.0.0-sa"
>  imagePullPolicy: Always
>  mainClass: org.apache.spark.examples.SparkPi
>  mainApplicationFile: 
> "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>  arguments:
>  - "100"
>  sparkVersion: "2.4.0"
>  restartPolicy:
>  type: Never
>  nodeSelector:
>  "spark": "autotune"
>  driver:
>  memory: "1g"
>  labels:
>  version: 2.4.0
>  serviceAccount: spark-sa
>  executor:
>  instances: 2
>  memory: "1g"
>  labels:
>  version: 2.4.0{quote}
> At some point the driver fails but it is still running and so the pods are 
> still running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 3.0 KiB, free 110.0 MiB)
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 1765.0 B, free 110.0 MiB)
>  19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 
> 110.0 MiB)
>  19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1180
>  19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
> ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
> tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
> 14))
>  19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
> tasks
>  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at 
> scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
>  at 
> scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
>  Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
>  Name: spark-pi2-driver
>  Namespace: spark
>  Priority: 0
>  PriorityClassName: 
>  Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
>  Start Time: Fri, 31 May 2019 16:28:59 +0300
>  Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=spark-pi2
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
>  version=2.4.0
>  Annotations: 
>  Status: Running
>  IP: 10.12.103.4
>  Controlled By: SparkApplication/spark-pi2
>  Containers:
>  spark-kubernetes-driver:
>  Container ID: 
> docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
>  Image: skonto/spark:k8s-3.0.0-sa
>  Image ID: 
> docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
>  Ports: 7078/TCP, 7079/TCP, 4040/TCP
>  Host Ports: 0/TCP, 0/TCP, 0/TCP
>  Args:
>  driver
>  --properties-file

[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility of running Spark workloads in a serverless manner, 
removing the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively, it would make sense 
to make some of the internal Spark components more pluggable and cloud-friendly 
so that it is easier for various cloud providers to integrate. For example:
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: YARN uses Hadoop UGI, Kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in an 
implementation backed by a public cloud service. In the POC, this is implemented 
by passing around an AWS KMS encrypted secret and decrypting it at each executor, 
which delegates authentication and authorization to the cloud (see the sketch 
after this list).
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package and rebuilding 
the entire project. Having a pluggable scheduler per 
https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
different scheduler backends backed by different cloud providers.
 * client-cluster communication: I am not very familiar with the network part 
of the code base, so I might be wrong on this. My understanding is that the code 
base assumes the client and the cluster are on the same network and that the 
nodes communicate with each other via hostname/IP. 
 * shuffle storage and retrieval: 
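
As an illustration of the decoupling argued for in the authentication bullet, 
here is a minimal Scala sketch assuming the AWS SDK v2 KMS client; the 
SecretProvider trait and KmsSecretProvider name are hypothetical and not 
existing Spark interfaces, and the linked POC remains the reference for how 
this is actually wired up.

{code:scala}
import java.util.Base64

import software.amazon.awssdk.core.SdkBytes
import software.amazon.awssdk.services.kms.KmsClient
import software.amazon.awssdk.services.kms.model.DecryptRequest

// Hypothetical plug-in point: Spark does not expose such a trait today.
trait SecretProvider {
  def authSecret(): String
}

// Sketch of the POC idea: the submitter distributes a KMS-encrypted blob (for
// example through the application config), and each executor decrypts it locally,
// so access to the plaintext secret is governed by the cloud's KMS/IAM policies.
class KmsSecretProvider(encryptedSecretBase64: String) extends SecretProvider {
  override def authSecret(): String = {
    val kms = KmsClient.create() // picks up the pod's or instance's IAM credentials
    try {
      val ciphertext =
        SdkBytes.fromByteArray(Base64.getDecoder.decode(encryptedSecretBase64))
      val response = kms.decrypt(
        DecryptRequest.builder().ciphertextBlob(ciphertext).build())
      new String(response.plaintext().asByteArray(), "UTF-8")
    } finally {
      kms.close()
    }
  }
}
{code}

Since only principals that the KMS key policy allows to decrypt can recover the 
plaintext, shipping the ciphertext around does not by itself expose the shared secret.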

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and decrypting the secret at each executor, 
which delegate authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package, and rebuilding 
the entire project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 


> Serverless Spark in the Cloud
> -
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision and maintain a cluster. POC: 
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively. It would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption requires authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so it is possible to swap in 
> implementation using public cloud. In the POC, this is implemented by passing 
> around AWS KMS encrypted secr

[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively, it would make sense 
to make some of the internal Spark components more pluggable and cloud-friendly 
so that it is easier for various cloud providers to integrate. For example:
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: YARN uses Hadoop UGI, Kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in an 
implementation backed by a public cloud service. In the POC, this is implemented 
by passing around an AWS KMS encrypted secret and decrypting it at each executor, 
which delegates authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package, and rebuilding 
the entire project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster.

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively. It would make sense 
to make some of the internal Spark components more pluggable and cloud friendly 
so that it is easier for various cloud providers to integrate. For example, 
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in 
implementation using public cloud. In the POC, this is implemented by passing 
around AWS KMS encrypted secret and delegate authentication and authorization 
to the cloud.
 * deployment & scheduler: adding a new cluster manager requires change a 
number of places in the Spark core package, and rebuilding the project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 


> Serverless Spark in the Cloud
> -
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision and maintain a cluster. POC: 
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively. It would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption requires authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so it is possible to swap in 
> implementation using public cloud. In the POC, this is implemented by passing 
> around AWS KMS encrypted secret and decrypting the secret at each executor, 
> which delegate authentication and authorization to the cloud.
>  * deployment & scheduler: adding a new cluster manager and scheduler backend 
> requires changing a number of places in the Spark core package, and 
> rebuilding the entire project. 
>  * driver-executor communication: 
>  * shuffle storage and retrieval: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2019-06-04 Thread Shuheng Dai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:

Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster.

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively, it would make sense 
to make some of the internal Spark components more pluggable and cloud-friendly 
so that it is easier for various cloud providers to integrate. For example:
 * authentication: IO and network encryption requires authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: YARN uses Hadoop UGI, Kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in an 
implementation backed by a public cloud service. In the POC, this is implemented 
by passing around an AWS KMS encrypted secret and delegating authentication and 
authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager requires changing a 
number of places in the Spark core package, and rebuilding the project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner. 

Pluggable authentication

Pluggable scheduler

Pluggable shuffle storage and retrieval

Pluggable driver


> Serverless Spark in the Cloud
> -
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision and maintain a cluster.
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively. It would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption requires authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so it is possible to swap in 
> implementation using public cloud. In the POC, this is implemented by passing 
> around AWS KMS encrypted secret and delegate authentication and authorization 
> to the cloud.
>  * deployment & scheduler: adding a new cluster manager requires change a 
> number of places in the Spark core package, and rebuilding the project. 
>  * driver-executor communication: 
>  * shuffle storage and retrieval: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false

2019-06-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27873:
-
Fix Version/s: 2.4.4

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> --
>
> Key: SPARK-27873
> URL: https://issues.apache.org/jira/browse/SPARK-27873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Marcin Mejran
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> In the Spark CSV reader, if you're using PERMISSIVE mode with a column for 
> storing corrupt records, then you need to add an extra schema column 
> corresponding to columnNameOfCorruptRecord.
> However, if you have a header row and enforceSchema=false, the schema vs. 
> header validation fails because there is an extra column corresponding to 
> columnNameOfCorruptRecord.
> Since FAILFAST mode doesn't print informative error messages about which 
> rows failed to parse, there is no way other than setting a corrupt record 
> column to track down broken rows.
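
A minimal Scala sketch of the reported option combination, with a hypothetical 
local input path and made-up column names; it only reproduces the scenario 
described above, it is not a fix:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CorruptRecordRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-corrupt-record-repro")
      .master("local[*]")
      .getOrCreate()

    // Data columns plus the extra column required by columnNameOfCorruptRecord.
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("name", StringType),
      StructField("_corrupt_record", StringType)))

    // With header=true and enforceSchema=false, the CSV header is validated against
    // the supplied schema; the extra _corrupt_record column shows up as a mismatch.
    val df = spark.read
      .option("header", "true")
      .option("enforceSchema", "false")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("/tmp/example.csv") // hypothetical input path

    df.show(truncate = false)
    spark.stop()
  }
}
{code}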



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-06-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27884:
-
Labels: release-notes  (was: )

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>  Labels: release-notes
>
> Officially deprecate Python 2 support in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users

2019-06-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855420#comment-16855420
 ] 

Hyukjin Kwon commented on SPARK-27888:
--

[~mengxr], I know it's good to inform users that Python 2 is deprecated so 
that they can prepare to drop it in their environments too, but how about we 
put it in a release note rather than in the migration guide? It's deprecated, 
but it would still work.

> Python 2->3 migration guide for PySpark users
> -
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It 
> doesn't need to be comprehensive given many Python 2->3 migration guides 
> around. We just need some pointers and list items that are specific to 
> PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


