[ 
https://issues.apache.org/jira/browse/SPARK-16205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Moroz updated SPARK-16205:
------------------------------
    Description: 
According to the docs, StructType corresponds only to a Python list or tuple. 
I accidentally returned a dict from a UDF whose return type was registered 
as a StructType.

Expected behavior: either (1) an exception is raised (if the return type is 
strictly checked); or (2) the dict is treated as an iterable, producing a 
struct built from the dict's keys in arbitrary order (horribly dangerous, 
but understandable).

Actual behavior: the struct was built "properly", in the sense that the 
dict's keys were matched to the struct's field names, and the corresponding 
values were used as the field values.

This is wonderful, but as far as I can tell it is completely undocumented.

{code}
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

# Setup so the snippet is self-contained:
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('x',), ('y',)], ['value'])

fields = 'abcdefgh'

# Decorator that registers func as a UDF with the given return type.
def udf(type_):
  def to_udf(func):
    return F.udf(func, type_)
  return to_udf

struct = T.StructType()
for c in fields:
  struct.add(c, T.StringType())

@udf(struct)
def f(row):
  # Return a dict, even though the declared return type is a StructType.
  return dict(zip(fields, fields.upper()))

df.select(f('value')).show()
# Output is unexpectedly meaningful: the dict's keys are matched to the
# struct's field names, and the uppercase letters appear as the values.
{code}
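Conceptually, the observed behavior is as if PySpark ordered the returned dict's values by the struct's declared field names, rather than iterating the dict directly. A minimal pure-Python sketch of that key matching (the function name here is illustrative, not a Spark internal):

```python
fields = "abcdefgh"

def dict_to_struct_values(d, field_names):
    # Match each declared struct field name against the dict's keys;
    # the dict's own iteration order is irrelevant.
    return tuple(d[name] for name in field_names)

row = dict(zip(fields, fields.upper()))
print(dict_to_struct_values(row, fields))
# -> ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H')
```

This is only a sketch of the apparent key-matching semantics; it does not claim to show what Spark does for missing or extra keys.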

  was:
According to the docs, StructType is equivalent only to python list and tuple. 
I accidentally returned a dict from a udf function that registered its return 
value as StructType.

Expected behavior: either (1) an exception is raised (if strict type is 
checked); or (2) dict is treated as an iterable, resulting in a struct being 
created in an arbitrary order from the keys of the dict (horribly dangerous, 
but I'd understand).

Actual behavior: struct was created "properly", in the sense that keys were 
matched to the field names of the struct, and values were used for values.

This is wonderful, but completely undocumented as far as I can tell.

{code}
import pyspark.sql.functions as F
import pyspark.sql.types as T

fields = 'abcdefgh'

def udf(type_):
  def to_udf(func):
    return F.udf(func, type_)
  return to_udf

struct = T.StructType()
for c in fields:
  struct.add(c, T.StringType())

@udf(struct)
def f(row):
  d = dict(zip(fields, fields))
  return d

df.select(f('value')).show()

'''
Output is unexpectedly "meaningful":
+------------------+
|PythonUDF#f(value)|
+------------------+
| [a,b,c,d,e,f,g,h]|
| [a,b,c,d,e,f,g,h]|
+------------------+
'''
{code}


> dict -> StructType conversion is undocumented
> ---------------------------------------------
>
>                 Key: SPARK-16205
>                 URL: https://issues.apache.org/jira/browse/SPARK-16205
>             Project: Spark
>          Issue Type: Documentation
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
