[jira] [Commented] (SPARK-26494) Support Oracle TIMESTAMP WITH LOCAL TIME ZONE type

2020-06-18 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139930#comment-17139930
 ] 

Jeff Evans commented on SPARK-26494:


The pull request was closed as stale.  Can someone please revive it, perhaps 
[~hyukjin.kwon]?

> Support Oracle TIMESTAMP WITH LOCAL TIME ZONE type
> --
>
> Key: SPARK-26494
> URL: https://issues.apache.org/jira/browse/SPARK-26494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: kun'qin 
>Priority: Minor
>
> When Spark reads an Oracle column of type TIMESTAMP(6) WITH LOCAL TIME ZONE, 
> no matching Catalyst type can be found.
> For this data type, the sqlType value passed to the getCatalystType function 
> in the JdbcUtils class is -102.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-29 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119931#comment-17119931
 ] 

Jeff Evans edited comment on SPARK-31779 at 5/29/20, 9:01 PM:
--

Thanks, using {{arrays_zip}} (along with an extra {{cast}} to a new {{struct}} 
to preserve the original field names, since {{arrays_zip}} seems to set them to 
"0", "1", etc.) seems to work.

I still find the behavior quite counterintuitive, but this workaround will 
allow me to solve the immediate requirement. 
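A minimal sketch of what that workaround might look like (not the code actually 
used), using the child2Second/child2Third example from elsewhere in this thread; 
the cast's type string and restored field names are assumptions:

{code}
import org.apache.spark.sql.functions._

// zip the per-field arrays back into an array of structs; arrays_zip names the
// struct fields "0", "1", ... for nested expressions like these
val zipped = arrays_zip(
  df("top").getField("child2").getField("child2Second"),
  df("top").getField("child2").getField("child2Third"))

// cast to an equivalent struct type purely to restore the original field names
val newTop = struct(
  df("top").getField("child1").alias("child1"),
  zipped.cast("array<struct<child2Second:bigint,child2Third:double>>").alias("child2"))

df.withColumn("top", newTop).printSchema
{code}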


was (Author: jeff.w.evans):
Thanks, using {{arrays_zip}} (along with an extra {{cast}} to preserve the 
existing field names, since {{arrays_zip}} seems to set them to "0", "1", etc.) 
seems to work.

I still find the behavior quite counterintuitive, but this workaround will 
allow me to solve the immediate requirement. 

> Redefining struct inside array incorrectly wraps child fields in array
> --
>
> Key: SPARK-31779
> URL: https://issues.apache.org/jira/browse/SPARK-31779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
> "child1": 5,
> "child2": [
>   {
> "child2First": "one",
> "child2Second": 2
>   }
> ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the 
> "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} {{struct}} that should 
> have caused the type to become wrapped in an array like this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-29 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119931#comment-17119931
 ] 

Jeff Evans commented on SPARK-31779:


Thanks, using {{arrays_zip}} (along with an extra {{cast}} to preserve the 
existing field names, since {{arrays_zip}} seems to set them to "0", "1", etc.) 
seems to work.

I still find the behavior quite counterintuitive, but this workaround will 
allow me to solve the immediate requirement. 

> Redefining struct inside array incorrectly wraps child fields in array
> --
>
> Key: SPARK-31779
> URL: https://issues.apache.org/jira/browse/SPARK-31779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
> "child1": 5,
> "child2": [
>   {
> "child2First": "one",
> "child2Second": 2
>   }
> ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the 
> "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} {{struct}} that should 
> have caused the type to become wrapped in an array like this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-28 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119200#comment-17119200
 ] 

Jeff Evans commented on SPARK-31779:


I disagree.  An entire new array has been introduced into the schema, where 
there was none before.

{code}

// use a nested JSON structure that contains a struct inside an array
val jsonData = """{
  "foo": "bar",
  "top": {
"child1": 5,
"child2": [
  {
"child2First": "one",
"child2Second": 2,
"child2Third": 14.1

  },{
"child2First": "two",
"child2Second": 3,
"child2Third": -18.3
  }
]
  }
}"""


// read into a DataFrame
val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())

// create a new definition for "top", which will remove the 
"top.child2.child2First" column, but keep the others
val newTop = struct(df("top").getField("child1").alias("child1"), 
array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"),
 
df("top").getField("child2").getField("child2Third").alias("child2Third"))).alias("child2"))

df.printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = true)
 ||-- child1: long (nullable = true)
 ||-- child2: array (nullable = true)
 |||-- element: struct (containsNull = true)
 ||||-- child2First: string (nullable = true)
 ||||-- child2Second: long (nullable = true)
 ||||-- child2Third: double (nullable = true)

df.withColumn("top", newTop).printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = false)
 ||-- child1: long (nullable = true)
 ||-- child2: array (nullable = false)
 |||-- element: struct (containsNull = false)
 ||||-- child2Second: array (nullable = true)
 |||||-- element: long (containsNull = true)
 ||||-- child2Third: array (nullable = true)
 |||||-- element: double (containsNull = true)

{code}

Given the new struct I have defined, this is what I would expect to see there, 
instead.  It is essentially the same as the output of {{df.printSchema}} except 
missing the single line for {{top.child2.child2First}}.

{code}

df.withColumn("top", newTop).printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = false)
 ||-- child1: long (nullable = true)
 ||-- child2: array (nullable = false)
 |||-- element: struct (containsNull = false)
 ||||-- child2Second: long (nullable = true)
 ||||-- child2Third: double (nullable = true)
{code}

Perhaps the real problem is the way that I've attempted to "remove" 
{{child2First}}, in which case I would greatly appreciate the necessary 
correction to achieve what I'm trying to.  Thanks.
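
If the intent is simply to drop {{child2First}} from each array element, one 
alternative (not from this thread, just a sketch assuming Spark 2.4's 
{{transform}} higher-order function is acceptable) would be to rebuild each 
element instead of extracting fields through the enclosing array:

{code}
import org.apache.spark.sql.functions._

// rebuild each element of top.child2 without child2First; extracting a field
// through the enclosing array is what produces the extra array wrapping
val fixedTop = struct(
  df("top").getField("child1").alias("child1"),
  expr("transform(top.child2, c -> " +
       "named_struct('child2Second', c.child2Second, 'child2Third', c.child2Third))")
    .alias("child2"))

df.withColumn("top", fixedTop).printSchema
{code}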

> Redefining struct inside array incorrectly wraps child fields in array
> --
>
> Key: SPARK-31779
> URL: https://issues.apache.org/jira/browse/SPARK-31779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
> "child1": 5,
> "child2": [
>   {
> "child2First": "one",
> "child2Second": 2
>   }
> ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the 
> "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} 

[jira] [Comment Edited] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-28 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119200#comment-17119200
 ] 

Jeff Evans edited comment on SPARK-31779 at 5/29/20, 1:53 AM:
--

I disagree.  An entire new array has been introduced into the schema, where 
there was none before.  This seems to happen to every remaining field in the 
struct (the type becomes wrapped in a new array).

{code}

// use a nested JSON structure that contains a struct inside an array
val jsonData = """{
  "foo": "bar",
  "top": {
"child1": 5,
"child2": [
  {
"child2First": "one",
"child2Second": 2,
"child2Third": 14.1

  },{
"child2First": "two",
"child2Second": 3,
"child2Third": -18.3
  }
]
  }
}"""


// read into a DataFrame
val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())

// create a new definition for "top", which will remove the 
"top.child2.child2First" column, but keep the others
val newTop = struct(df("top").getField("child1").alias("child1"), 
array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"),
 
df("top").getField("child2").getField("child2Third").alias("child2Third"))).alias("child2"))

df.printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = true)
 ||-- child1: long (nullable = true)
 ||-- child2: array (nullable = true)
 |||-- element: struct (containsNull = true)
 ||||-- child2First: string (nullable = true)
 ||||-- child2Second: long (nullable = true)
 ||||-- child2Third: double (nullable = true)

df.withColumn("top", newTop).printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = false)
 ||-- child1: long (nullable = true)
 ||-- child2: array (nullable = false)
 |||-- element: struct (containsNull = false)
 ||||-- child2Second: array (nullable = true)
 |||||-- element: long (containsNull = true)
 ||||-- child2Third: array (nullable = true)
 |||||-- element: double (containsNull = true)

{code}

Given the new struct I have defined, this is what I would expect to see there, 
instead.  It is essentially the same as the output of {{df.printSchema}} except 
missing the single line for {{top.child2.child2First}}.

{code}

df.withColumn("top", newTop).printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = false)
 ||-- child1: long (nullable = true)
 ||-- child2: array (nullable = false)
 |||-- element: struct (containsNull = false)
 ||||-- child2Second: long (nullable = true)
 ||||-- child2Third: double (nullable = true)
{code}

Perhaps the real problem is the way that I've attempted to "remove" 
{{child2First}}, in which case I would greatly appreciate the necessary 
correction to achieve what I'm trying to.  Thanks.


was (Author: jeff.w.evans):
I disagree.  An entire new array has been introduced into the schema, where 
there was none before.

{code}

// use a nested JSON structure that contains a struct inside an array
val jsonData = """{
  "foo": "bar",
  "top": {
"child1": 5,
"child2": [
  {
"child2First": "one",
"child2Second": 2,
"child2Third": 14.1

  },{
"child2First": "two",
"child2Second": 3,
"child2Third": -18.3
  }
]
  }
}"""


// read into a DataFrame
val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())

// create a new definition for "top", which will remove the 
"top.child2.child2First" column, but keep the others
val newTop = struct(df("top").getField("child1").alias("child1"), 
array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"),
 
df("top").getField("child2").getField("child2Third").alias("child2Third"))).alias("child2"))

df.printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = true)
 ||-- child1: long (nullable = true)
 ||-- child2: array (nullable = true)
 |||-- element: struct (containsNull = true)
 ||||-- child2First: string (nullable = true)
 ||||-- child2Second: long (nullable = true)
 ||||-- child2Third: double (nullable = true)

df.withColumn("top", newTop).printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = false)
 ||-- child1: long (nullable = true)
 ||-- child2: array (nullable = false)
 |||-- element: struct (containsNull = false)
 ||||-- child2Second: array (nullable = true)
 |||||-- element: long (containsNull = true)
 ||||-- child2Third: array (nullable = true)
 |||||-- element: double (containsNull = true)

{code}

Given the new struct I have defined, this is what I would expect to see there, 
instead.  

[jira] [Created] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-20 Thread Jeff Evans (Jira)
Jeff Evans created SPARK-31779:
--

 Summary: Redefining struct inside array incorrectly wraps child 
fields in array
 Key: SPARK-31779
 URL: https://issues.apache.org/jira/browse/SPARK-31779
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5
Reporter: Jeff Evans


It seems that redefining a {{struct}} for the purpose of removing a sub-field, 
when that {{struct}} is itself inside an {{array}}, results in the remaining 
(non-removed) {{struct}} fields themselves being incorrectly wrapped in an 
array.

For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
StackOverflow answer and discussion thread.  I have debugged this code and 
distilled it down to what I believe represents a bug in Spark itself.

Consider the following {{spark-shell}} session (version 2.4.5):

{code}
// use a nested JSON structure that contains a struct inside an array
val jsonData = """{
  "foo": "bar",
  "top": {
"child1": 5,
"child2": [
  {
"child2First": "one",
"child2Second": 2
  }
]
  }
}"""

// read into a DataFrame
val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())

// create a new definition for "top", which will remove the 
"top.child2.child2First" column

val newTop = struct(df("top").getField("child1").alias("child1"), 
array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))

// show the schema before and after swapping out the struct definition
df.schema.toDDL
// `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
df.withColumn("top", newTop).schema.toDDL
// `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
{code}

Notice in this case that the new definition for {{top.child2.child2Second}} is 
an {{ARRAY}}.  This is incorrect; it should simply be {{BIGINT}}.  
There is nothing in the definition of the {{newTop}} {{struct}} that should 
have caused the type to become wrapped in an array like this.






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-24 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023019#comment-17023019
 ] 

Jeff Evans commented on SPARK-19248:


I'm not a Spark maintainer, so can't answer definitively.  However, I would 
guess they won't change the default value.  This was deliberately added in 2.0 
with a default value of false, and usually breaking changes like this are 
introduced in new major versions (speaking in general terms).

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430
 ] 

Jeff Evans edited comment on SPARK-19248 at 1/23/20 8:06 PM:
-

After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character (and of course, in a string literal, 
the backslashes themselves also need escaped), so you would need the pattern to 
be {{'( |\\.)*'}}

By the way, this isn't Python-specific behavior.  Even if you use a Scala 
session, and use the {{expr}} expression (which I don't see in the sample 
sessions above), you will notice the same thing happening.

{code}
val df = spark.createDataFrame(Seq((0, "..   5."))).toDF("id","col")

df.selectExpr("""regexp_replace(col, "( |\.)*", "")""").show()
+-----------------------------+
|regexp_replace(col, ( |.)*, )|
+-----------------------------+
|                             |
+-----------------------------+

spark.conf.set("spark.sql.parser.escapedStringLiterals", true)

df.selectExpr("""regexp_replace(col, "( |\.)*", "")""").show()
+------------------------------+
|regexp_replace(col, ( |\.)*, )|
+------------------------------+
|                             5|
+------------------------------+
{code}


was (Author: jeff.w.evans):
After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character (and of course, in a string literal, 
the backslashes themselves also need escaped), so you would need the pattern to 
be {{'( |\\.)*'}}

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430
 ] 

Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM:
-

After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character, so you would need the pattern to be 
{{'( |\\.)*'}}


was (Author: jeff.w.evans):
After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430
 ] 

Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM:
-

After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character (and of course, in a string literal, 
the backslashes themselves also need escaped), so you would need the pattern to 
be {{'( |\\.)*'}}


was (Author: jeff.w.evans):
After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character, so you would need the pattern to be 
{{'( |\\.)*'}}

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430
 ] 

Jeff Evans commented on SPARK-19248:


After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30489) Make build delete pyspark.zip file properly

2020-01-10 Thread Jeff Evans (Jira)
Jeff Evans created SPARK-30489:
--

 Summary: Make build delete pyspark.zip file properly
 Key: SPARK-30489
 URL: https://issues.apache.org/jira/browse/SPARK-30489
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Jeff Evans


The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file 
within {{python/lib}}.  The only problem is the Ant task definition for the 
delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it 
doesn't actually get deleted by this task.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26494) Support Oracle TIMESTAMP WITH LOCAL TIME ZONE type

2020-01-10 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-26494:
---
Summary: Support Oracle TIMESTAMP WITH LOCAL TIME ZONE type  (was: 【spark 
sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be 
found,)

> Support Oracle TIMESTAMP WITH LOCAL TIME ZONE type
> --
>
> Key: SPARK-26494
> URL: https://issues.apache.org/jira/browse/SPARK-26494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: kun'qin 
>Priority: Minor
>
> When Spark reads an Oracle column of type TIMESTAMP(6) WITH LOCAL TIME ZONE, 
> no matching Catalyst type can be found.
> For this data type, the sqlType value passed to the getCatalystType function 
> in the JdbcUtils class is -102.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26494) 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found,

2020-01-10 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013093#comment-17013093
 ] 

Jeff Evans commented on SPARK-26494:


To be clear, this type represents an instant in time.  From [the 
docs|https://docs.oracle.com/database/121/SUTIL/GUID-CB5D2124-D9AE-4C71-A83D-DFE071FE3542.htm]:

{quote}The TIMESTAMP WITH LOCAL TIME ZONE data type is another variant of 
TIMESTAMP that includes a time zone offset in its value. Data stored in the 
database is normalized to the database time zone, and time zone displacement is 
not stored as part of the column data. When the data is retrieved, it is 
returned in the user's local session time zone. It is specified as 
follows:{quote}

So it's really almost the same as a {{TIMESTAMP}}, just that it does some kind 
of automatic TZ conversion (converting from the offset given by the client to 
the DB server's offset automatically).  But that conversion is orthogonal to 
Spark entirely; it should just be treated like a {{TIMESTAMP}}.
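
As an illustration only (not an actual Spark patch), a JDBC dialect that maps 
Oracle's vendor-specific sqlType -102 to Catalyst's {{TimestampType}} might look 
roughly like this:

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

object OracleTimestampLtzDialect extends JdbcDialect {
  // Oracle's vendor-specific JDBC type code for TIMESTAMP WITH LOCAL TIME ZONE
  private val TIMESTAMP_LTZ = -102

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case TIMESTAMP_LTZ => Some(TimestampType)  // treat it like a plain TIMESTAMP
      case _ => None
    }
}

// JdbcDialects.registerDialect(OracleTimestampLtzDialect)
{code}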

> 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type 
> can't be found,
> --
>
> Key: SPARK-26494
> URL: https://issues.apache.org/jira/browse/SPARK-26494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: kun'qin 
>Priority: Minor
>
> When Spark reads an Oracle column of type TIMESTAMP(6) WITH LOCAL TIME ZONE, 
> no matching Catalyst type can be found.
> For this data type, the sqlType value passed to the getCatalystType function 
> in the JdbcUtils class is -102.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26739) Standardized Join Types for DataFrames

2020-01-08 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011019#comment-17011019
 ] 

Jeff Evans commented on SPARK-26739:


[~maropu], I have an active PR for this, which is just awaiting additional 
feedback: https://github.com/apache/spark/pull/26286

> Standardized Join Types for DataFrames
> --
>
> Key: SPARK-26739
> URL: https://issues.apache.org/jira/browse/SPARK-26739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Skyler Lehan
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Currently, in the join functions on 
> [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
>  the join types are defined via a string parameter called joinType. In order 
> for a developer to know which joins are possible, they must look up the API 
> call for join. While this works fine, it can cause the developer to make a 
> typo resulting in improper joins and/or unexpected errors that aren't evident 
> at compile time. The objective of this improvement would be to allow 
> developers to use a common definition for join types (by enum or constants) 
> called JoinTypes. This would contain the possible joins and remove the 
> possibility of a typo. It would also allow Spark to alter the names of the 
> joins in the future without impacting end-users.
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this solves is extremely narrow, it would not solve anything 
> other than providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> "left_outer")
> {code}
> Where they manually type the join type. As stated above, this:
>  * Requires developers to manually type in the join
>  * Can cause possibility of typos
>  * Restricts renaming of join types as it's a literal string
>  * Does not restrict and/or compile check the join type being used, leading 
> to runtime errors
> h3. *Q4.* What is new in your approach and why do you think it will be 
> successful?
> The new approach would use constants or *more preferably an enum*, something 
> like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> JoinType.LEFT_OUTER)
> {code}
> This would provide:
>  * In code reference/definitions of the possible join types
>  ** This subsequently allows the addition of scaladoc of what each join type 
> does and how it operates
>  * Removes possibilities of a typo on the join type
>  * Provides compile time checking of the join type (only if an enum is used)
> To clarify, if JoinType is a constant, it would just fill in the joinType 
> string parameter for users. If an enum is used, it would restrict the domain 
> of possible join types to whatever is defined in the future JoinType enum. 
> The enum is preferred, however it would take longer to implement.
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This will make the join function 
> easier to wield and lead to less runtime errors. It will save time by 
> bringing join type validation at compile time. It will also provide in code 
> reference to the join types, which saves the developer time of having to look 
> up and navigate the multiple join functions to find the possible join types. 
> In addition to that, the resulting constants/enum would have documentation on 
> how that join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types 
> could be impacted if an enum is chosen as the common definition. This risk 
> can be mitigated by using string constants. The string constants would be the 
> exact same string as the string literals used today. For example:
> {code:java}
> JoinType.INNER = "inner"
> {code}
> If an enum is still the preferred way of defining the join types, new join 
> functions could be added that take in these enums and the join calls that 
> contain string parameters for joinType could be deprecated. This would give 
> developers a chance to change over to the new join types.
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> Mid-term exam would be the addition of a common definition of the join types 
> and additional join functions that take in the join type enum/constant. The 
> final exam would be working tests 

[jira] [Updated] (SPARK-30256) Allow SparkLauncher to sudo before executing spark-submit

2019-12-13 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-30256:
---
Description: It would be useful if 
{{org.apache.spark.launcher.SparkLauncher}} allowed for a "sudo as user X" 
option.  This way, multi-tenant applications that run Spark jobs could give end 
users greater security, by ensuring that the files (including, importantly, 
keytabs) can remain readable only by the end users instead of the UID that runs 
this multi-tenant application itself.  I believe that {{sudo -u <user> 
spark-submit <args>}} should work.  The builder maintained by 
{{SparkLauncher}} could simply have a {{setSudoUser}} method.  (was: It would 
be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for a "sudo as 
user X" option.  This way, multi-tenant applications that run Spark jobs could 
give end users greater security, by ensuring that the files (including, 
importantly, keytabs) can remain readable only by the end users instead of the 
UID that runs this multi-tenant application itself.  I believe that {{sudo -u 
<user> spark-submit <args>}} should work.)

> Allow SparkLauncher to sudo before executing spark-submit
> -
>
> Key: SPARK-30256
> URL: https://issues.apache.org/jira/browse/SPARK-30256
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Jeff Evans
>Priority: Minor
>
> It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for 
> a "sudo as user X" option.  This way, multi-tenant applications that run 
> Spark jobs could give end users greater security, by ensuring that the files 
> (including, importantly, keytabs) can remain readable only by the end users 
> instead of the UID that runs this multi-tenant application itself.  I believe 
> that {{sudo -u <user> spark-submit <args>}} should work.  The 
> builder maintained by {{SparkLauncher}} could simply have a {{setSudoUser}} 
> method.
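
For context, a hypothetical use of the proposed builder method might look like 
this ({{setSudoUser}} does not exist today; the name and behavior are the 
proposal, everything else is the existing {{SparkLauncher}} API, and the paths 
and class names are placeholders):

{code}
import org.apache.spark.launcher.SparkLauncher

val handle = new SparkLauncher()
  .setSudoUser("etl_user")             // proposed: prefix the child command with "sudo -u etl_user"
  .setAppResource("/path/to/app.jar")
  .setMainClass("com.example.Main")
  .setMaster("yarn")
  .startApplication()
{code}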



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30256) Allow SparkLauncher to sudo before executing spark-submit

2019-12-13 Thread Jeff Evans (Jira)
Jeff Evans created SPARK-30256:
--

 Summary: Allow SparkLauncher to sudo before executing spark-submit
 Key: SPARK-30256
 URL: https://issues.apache.org/jira/browse/SPARK-30256
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 3.0.0
Reporter: Jeff Evans


It would be useful if {{org.apache.spark.launcher.SparkLauncher}} allowed for a 
"sudo as user X" option.  This way, multi-tenant applications that run Spark 
jobs could give end users greater security, by ensuring that the files 
(including, importantly, keytabs) can remain readable only by the end users 
instead of the UID that runs this multi-tenant application itself.  I believe 
that {{sudo -u <user> spark-submit <args>}} should work.

[jira] [Updated] (SPARK-24540) Support for multiple character delimiter in Spark CSV read

2019-10-04 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-24540:
---
Summary: Support for multiple character delimiter in Spark CSV read  (was: 
Support for multiple delimiter in Spark CSV read)

> Support for multiple character delimiter in Spark CSV read
> --
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0 to read and split CSV 
> files/data only supports a single-character delimiter. If we try to provide 
> multiple characters, we observe the following error message.
> eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
>                                   .option("header", "false")
>                                   .option("delimiter", ", ")
>                                   .csv("C:\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed contains multiple delimiters and 
> presently we need to do a manual data clean up on the source/input file, 
> which doesn't work well in large applications which consumes numerous files.
> There seems to be work-around like reading data as text and using the split 
> option, but this in my opinion defeats the purpose, advantage and efficiency 
> of a direct read from CSV file.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2019-10-04 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944783#comment-16944783
 ] 

Jeff Evans commented on SPARK-24540:


I created a pull request to support this (which was linked above).  I'm not 
entirely clear on why SPARK-17967 would be a blocker, though.
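
For reference, the "read as text and split" workaround the description alludes 
to might look roughly like this (a sketch only; the file path and column names 
are made up):

{code}
import org.apache.spark.sql.functions.{col, split}

// read each line as a single string column named "value", then split on the
// multi-character delimiter manually (split takes a regex, so escape if needed)
val raw = spark.read.text("C:/test.txt")
val parsed = raw
  .select(split(col("value"), ", ").alias("fields"))
  .select(
    col("fields").getItem(0).alias("col1"),
    col("fields").getItem(1).alias("col2"))
{code}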

> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0 to read and split CSV 
> files/data only supports a single-character delimiter. If we try to provide 
> multiple characters, we observe the following error message.
> eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
>                                   .option("header", "false")
>                                   .option("delimiter", ", ")
>                                   .csv("C:\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed contains multiple delimiters and 
> presently we need to do a manual data clean up on the source/input file, 
> which doesn't work well in large applications which consumes numerous files.
> There seems to be work-around like reading data as text and using the split 
> option, but this in my opinion defeats the purpose, advantage and efficiency 
> of a direct read from CSV file.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27903) Improve parser error message for mismatched parentheses in expressions

2019-10-03 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944041#comment-16944041
 ] 

Jeff Evans commented on SPARK-27903:


I've been spending some time playing around with the grammar, and I'm not sure 
this is possible in the general case.  It should be easy enough to handle the 
case outlined in this Jira (I have a working change for that), but an "extra" 
right parenthesis is much more challenging due to the way ANTLR works, and the 
way the grammar is written.

> Improve parser error message for mismatched parentheses in expressions
> --
>
> Key: SPARK-27903
> URL: https://issues.apache.org/jira/browse/SPARK-27903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> When parentheses are mismatched in expressions in queries, the error message 
> is confusing. This is especially true for large queries, where mismatched 
> parens are tedious for human to figure out. 
> For example, the error message for 
> {code:sql} 
> SELECT ((x + y) * z FROM t; 
> {code} 
> is 
> {code:java} 
> mismatched input 'FROM' expecting ','(line 1, pos 20) 
> {code} 
> One possible way to fix is to explicitly capture such kind of mismatched 
> parens in a grammar rule and print user-friendly error message such as 
> {code:java} 
> mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, 
> pos 20) 
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`

2019-09-30 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941363#comment-16941363
 ] 

Jeff Evans commented on SPARK-24077:


Is this still relevant, given SPARK-20383 is done?

> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
> --
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>  Labels: starter
>
> The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks 
> confusing: 
> {code}
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> {code}
> {code}
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25153) Improve error messages for columns with dots/periods

2019-09-16 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930896#comment-16930896
 ] 

Jeff Evans commented on SPARK-25153:


Opened a pull request for this (see link added by the bot).  Open to 
suggestions on the exact wording of the "suggestion", of course.
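
For anyone reaching this later, a minimal illustration of the situation the 
ticket describes (column names are made up):

{code}
import spark.implicits._

val df = Seq((1, "a")).toDF("id", "col.with.dot")

// fails to resolve: Spark parses the dots as nested-field access
df.select("col.with.dot")

// resolves: backticks make the whole name a single literal column name
df.select("`col.with.dot`")
{code}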

> Improve error messages for columns with dots/periods
> 
>
> Key: SPARK-25153
> URL: https://issues.apache.org/jira/browse/SPARK-25153
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> When we fail to resolve a column name with a dot in it, and the column name 
> is present as a string literal the error message could mention using 
> backticks to have the string treated as a literal.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29070) Make SparkLauncher log full spark-submit command line

2019-09-16 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-29070:
---
Description: 
{{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
builds up a full command line to {{spark-submit}} using a builder pattern.  
When {{startApplication}} is finally called, a full command line is 
materialized out of all the options, then invoked via the {{ProcessBuilder}}.

In scenarios where another application is submitting to Spark, it would be 
extremely useful from a support and debugging standpoint to be able to see the 
full {{spark-submit}} command that is actually used (so that the same 
submission can be tested standalone, arguments tweaked, etc.).  Currently, the 
only way this gets captured is to {{stderr}} if the 
{{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is cumbersome 
in the context of an application that is wrapping Spark and already using the 
APIs.

I propose simply making {{SparkSubmit}} log the full command line it is about 
to launch, so that clients can see it directly in their log files, rather than 
having to capture and search through {{stderr}}.

  was:
{{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
builds up a full command line to {{spark-submit}} using a builder pattern.  
When {{startApplication}} is finally called, a 

In scenarios where another application is submitting to Spark, it would be 
extremely useful from a support and debugging standpoint to be able to see the 
full {{spark-submit}} command that is actually used (so that the same 
submission can be tested standalone, arguments tweaked, etc.).  Currently, the 
only way this gets captured is to {{stderr}} if the 
{{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is cumbersome 
in the context of an application that is wrapping Spark and already using the 
APIs.

I propose simply adding a getter method to {{SparkSubmit}} that allows clients 
to retrieve what the full command line will be, so they can log this however 
they wish (or do anything else with it).


> Make SparkLauncher log full spark-submit command line
> -
>
> Key: SPARK-29070
> URL: https://issues.apache.org/jira/browse/SPARK-29070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> {{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
> builds up a full command line to {{spark-submit}} using a builder pattern.  
> When {{startApplication}} is finally called, a full command line is 
> materialized out of all the options, then invoked via the {{ProcessBuilder}}.
> In scenarios where another application is submitting to Spark, it would be 
> extremely useful from a support and debugging standpoint to be able to see 
> the full {{spark-submit}} command that is actually used (so that the same 
> submission can be tested standalone, arguments tweaked, etc.).  Currently, 
> the only way this gets captured is to {{stderr}} if the 
> {{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is 
> cumbersome in the context of an application that is wrapping Spark and 
> already using the APIs.
> I propose simply making {{SparkSubmit}} log the full command line it is about 
> to launch, so that clients can see it directly in their log files, rather 
> than having to capture and search through {{stderr}}.
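
For context, a rough sketch of the launcher usage being discussed (paths, class names, and master are placeholders); today the generated command only shows up on the child process's {{stderr}}, and only when {{SPARK_PRINT_LAUNCH_COMMAND}} is set in its environment:

{code}
import scala.collection.JavaConverters._
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Ask the launched spark-submit/spark-class to print the full command it runs;
// the output goes to the child process's stderr, not the caller's logs.
val childEnv = Map("SPARK_PRINT_LAUNCH_COMMAND" -> "1").asJava

val handle: SparkAppHandle = new SparkLauncher(childEnv)
  .setSparkHome("/path/to/spark")        // placeholder
  .setAppResource("/path/to/app.jar")    // placeholder
  .setMainClass("com.example.Main")      // placeholder
  .setMaster("local[*]")
  .setConf("spark.executor.memory", "2g")
  .startApplication()
{code}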



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29070) Make SparkLauncher log full spark-submit command line

2019-09-16 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-29070:
---
Summary: Make SparkLauncher log full spark-submit command line  (was: Allow 
SparkLauncher to return full spark-submit command line)

> Make SparkLauncher log full spark-submit command line
> -
>
> Key: SPARK-29070
> URL: https://issues.apache.org/jira/browse/SPARK-29070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> {{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
> builds up a full command line to {{spark-submit}} using a builder pattern.  
> When {{startApplication}} is finally called, a full command line is 
> materialized out of all the options, then invoked via the {{ProcessBuilder}}.
> In scenarios where another application is submitting to Spark, it would be 
> extremely useful from a support and debugging standpoint to be able to see 
> the full {{spark-submit}} command that is actually used (so that the same 
> submission can be tested standalone, arguments tweaked, etc.).  Currently, 
> the only way this gets captured is to {{stderr}} if the 
> {{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is 
> cumbersome in the context of an application that is wrapping Spark and 
> already using the APIs.
> I propose simply adding a getter method to {{SparkSubmit}} that allows 
> clients to retrieve what the full command line will be, so they can log this 
> however they wish (or do anything else with it).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29070) Allow SparkLauncher to return full spark-submit command line

2019-09-12 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-29070:
---
Priority: Minor  (was: Major)

> Allow SparkLauncher to return full spark-submit command line
> 
>
> Key: SPARK-29070
> URL: https://issues.apache.org/jira/browse/SPARK-29070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> {{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
> builds up a full command line to {{spark-submit}} using a builder pattern.  
> When {{startApplication}} is finally called, a full command line is 
> materialized out of all the options, then invoked via the {{ProcessBuilder}}.
> In scenarios where another application is submitting to Spark, it would be 
> extremely useful from a support and debugging standpoint to be able to see 
> the full {{spark-submit}} command that is actually used (so that the same 
> submission can be tested standalone, arguments tweaked, etc.).  Currently, 
> the only way this gets captured is to {{stderr}} if the 
> {{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is 
> cumbersome in the context of an application that is wrapping Spark and 
> already using the APIs.
> I propose simply adding a getter method to {{SparkSubmit}} that allows 
> clients to retrieve what the full command line will be, so they can log this 
> however they wish (or do anything else with it).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29070) Allow SparkLauncher to return full spark-submit command line

2019-09-12 Thread Jeff Evans (Jira)
Jeff Evans created SPARK-29070:
--

 Summary: Allow SparkLauncher to return full spark-submit command 
line
 Key: SPARK-29070
 URL: https://issues.apache.org/jira/browse/SPARK-29070
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.4.5
Reporter: Jeff Evans


{{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
builds up a full command line to {{spark-submit}} using a builder pattern.  
When {{startApplication}} is finally called, a full command line is 
materialized out of all the options, then invoked via the {{ProcessBuilder}}.

In scenarios where another application is submitting to Spark, it would be 
extremely useful from a support and debugging standpoint to be able to see the 
full {{spark-submit}} command that is actually used (so that the same 
submission can be tested standalone, arguments tweaked, etc.).  Currently, the 
only way this gets captured is to {{stderr}} if the 
{{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is cumbersome 
in the context of an application that is wrapping Spark and already using the 
APIs.

I propose simply adding a getter method to {{SparkSubmit}} that allows clients 
to retrieve what the full command line will be, so they can log this however 
they wish (or do anything else with it).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org