Based on the list of functions here:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

there doesn't seem to be a way to get the length of an array column in a
DataFrame without defining a UDF.

What I'm looking for is something like this, except that length_udf would be
a built-in such as pyspark.sql.functions.length (or something similar):

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

# sql is a SQLContext and sc is a SparkContext
# Current workaround: wrap Python's len in a UDF
length_udf = UserDefinedFunction(len, IntegerType())
test_schema = StructType([
        StructField('arr', ArrayType(IntegerType())),
        StructField('letter', StringType())
    ])
test_df = sql.createDataFrame(sc.parallelize([
        [[1, 2, 3], 'a'],
        [[4, 5, 6, 7, 8], 'b']
    ]), test_schema)
test_df.select(length_udf(test_df.arr)).collect()

Output:
[Row(PythonUDF#len(arr)=3), Row(PythonUDF#len(arr)=5)]
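
For reference, the same workaround can also be wrapped with
pyspark.sql.functions.udf and aliased so the result column gets a readable
name. This is just a sketch of the UDF approach above (assuming Spark 1.3+,
where functions.udf and Column.alias are available):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Same len-based workaround, wrapped with functions.udf and aliased
arr_len = udf(len, IntegerType())
test_df.select(arr_len(test_df.arr).alias('arr_len')).collect()

which should return [Row(arr_len=3), Row(arr_len=5)].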

Is there currently a way to accomplish this? If this doesn't exist and seems
useful, I would be happy to contribute a PR with the function.

Pedro Rodriguez


