Actually, it's supposed to be part of the Spark 1.5 release; see
https://issues.apache.org/jira/browse/SPARK-8230
You're definitely welcome to contribute to it; let me know if you have any
questions about implementing it.
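
Once it lands, the intended usage would presumably look roughly like this (a
sketch, assuming the JIRA exposes it as pyspark.sql.functions.size and reusing
the test_df from your example below):

from pyspark.sql import functions as F

# size() counts the elements of an ArrayType (or MapType) column
test_df.select(F.size(test_df.arr)).collect()
# roughly: [Row(size(arr)=3), Row(size(arr)=5)]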

Cheng Hao


-----Original Message-----
From: pedro [mailto:ski.rodrig...@gmail.com] 
Sent: Thursday, July 16, 2015 7:31 AM
To: user@spark.apache.org
Subject: Python DataFrames: length of ArrayType

Resubmitting after fixing my subscription to the mailing list.

Based on the list of functions here:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

there doesn't seem to be a way to get the length of an array column in a
DataFrame without defining a UDF.

What I'm looking for is something like this (except that length_udf would be
pyspark.sql.functions.length or something similar):

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               IntegerType, StringType)

# UDF wrapping the Python built-in len to get the array length
length_udf = UserDefinedFunction(len, IntegerType())

test_schema = StructType([
    StructField('arr', ArrayType(IntegerType())),
    StructField('letter', StringType())
])
# sc is an existing SparkContext, sql an existing SQLContext
test_df = sql.createDataFrame(sc.parallelize([
    [[1, 2, 3], 'a'],
    [[4, 5, 6, 7, 8], 'b']
]), test_schema)

test_df.select(length_udf(test_df.arr)).collect()

Output: 
[Row(PythonUDF#len(arr)=3), Row(PythonUDF#len(arr)=5)] 
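
For reference, the same len-based UDF can also be exposed to SQL queries; a
rough sketch, assuming sql is a SQLContext and the DataFrame above is
registered as a temp table:

# Register the Python built-in len as a SQL function
sql.registerFunction('array_len', len, IntegerType())
test_df.registerTempTable('test')
sql.sql('SELECT array_len(arr) AS len FROM test').collect()
# roughly: [Row(len=3), Row(len=5)]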

Is there currently a way to accomplish this? If this doesn't exist and seems 
useful, I would be happy to contribute a PR with the function. 

Pedro Rodriguez



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-DataFrames-length-of-ArrayType-tp23869.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
