You probably want to create a normal function rather than a UDF. A UDF takes your function and applies it to each element in the column, one after the other; you can think of it as working inside a loop that iterates over the column.
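For example, a working UDF version would use Python's standard re module on the plain string value of each row. A minimal sketch, assuming regexp_list is a list of (pattern, replacement) tuples and test_data is the DataFrame from your example below (the names regexp_list and replace_all are illustrative, not from your code):

    import re
    from pyspark.sql.functions import udf

    regexp_list = [('a', 'X'), ('b', 'Y')]  # illustrative patterns

    def replace_all(data):
        # plain Python regex on one string value, not on a Spark column
        for match, repl in regexp_list:
            data = re.sub(match, repl, data)
        return data

    replace_all_udf = udf(replace_all)  # returns StringType by default
    test_data.select(replace_all_udf(test_data.name)).show()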
pyspark.sql.functions.regexp_replace, on the other hand, receives a column and applies the regex to each element to create a new column. So you can do this in one of two ways. The first is to use a UDF, in which case you should not call pyspark.sql.functions.regexp_replace but standard Python regex (the re module), as in the sketch above. The second is to simply apply the column transformations one after the other in a normal function, something like:

    def my_f(target_col):
        for match, repl in regexp_list:
            target_col = regexp_replace(target_col, match, repl)
        return target_col

and then use it with:

    test_data.select(my_f(test_data.name))

The second option is the more correct one and should provide better performance, since it stays within Spark SQL expressions instead of calling into Python for every row.

From: Perttu Ranta-aho [mailto:ranta...@iki.fi]
Sent: Thursday, November 17, 2016 1:50 PM
To: user@spark.apache.org
Subject: Re: Nested UDFs

Hi,

My example was a little bogus, my real use case is to do multiple regexp replacements, so something like:

    def my_f(data):
        for match, repl in regexp_list:
            data = regexp_replace(match, repl, data)
        return data

I could achieve my goal with multiple .select(regexp_replace()) lines, but one UDF would be nicer.

-Perttu

On Thu, 17 Nov 2016 at 9:42, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:

regexp_replace is supposed to receive a column; you don't need to write a UDF for it. Instead try:

    test_data.select(regexp_replace(test_data.name, 'a', 'X'))

You would need a UDF if you wanted to do something with the string value of a single row (e.g. return data + "bla").

Assaf.

From: Perttu Ranta-aho [mailto:ranta...@iki.fi]
Sent: Thursday, November 17, 2016 9:15 AM
To: user@spark.apache.org
Subject: Nested UDFs

Hi,

Shouldn't this work?

    from pyspark.sql.functions import regexp_replace, udf

    def my_f(data):
        return regexp_replace(data, 'a', 'X')
    my_udf = udf(my_f)

    test_data = sqlContext.createDataFrame([('a',), ('b',), ('c',)], ('name',))
    test_data.select(my_udf(test_data.name)).show()

But instead of 'a' being replaced with 'X' I get an exception:

    File ".../spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1471, in regexp_replace
        jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
    AttributeError: 'NoneType' object has no attribute '_jvm'

???

-Perttu