Regexp_extract not giving correct output

Sachit Murarka Wed, 02 Dec 2020 07:34:03 -0800

Hi All,

I am using PySpark to extract a value from a column using a regex.


The following is the regex I am using:
(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)

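from pyspark.sql import functions as F

# `spark` is an existing SparkSession.
df = spark.createDataFrame(
    [("[1234] [3333] [4444] [66]",), ("abcd",)],
    ["stringValue"])

# Raw string so the backslashes reach the regex engine unchanged.
pattern = r'(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)'

result = df.withColumn('extracted value',
    F.regexp_extract(F.col('stringValue'), pattern, 1))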

I have tried with spark.sql as well. It gives empty output.

I have tested this regex and it works fine in an online regex tester,
but it does not work in Spark. I know Spark needs a Java-based regex,
hence I also tried escaping, which gave this exception:
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 21
(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)


Can you please help here?

Kind Regards,
Sachit Murarka
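
A note on the exception above: java.util.regex does not support the conditional construct (?(1)yes|no), so the "(?(" sequence is read as the start of an inline flag group, which would explain "Unknown inline modifier". One possible workaround is to express the two cases as a plain alternation. The sketch below uses Python's re module only to show the rewritten pattern; the name extract_id is hypothetical, and the group numbering (1 for the UniqueID form, 2 for the fallback form) is an assumption about what the original pattern was meant to return.

```python
import re

# Alternation in place of the conditional (?(1)...|...) construct.
# Group 1: UniqueID value in the "[OrderID: ...] [UniqueID: ...]" form.
# Group 2: value in the bracket pair matched by the fallback form.
PATTERN = re.compile(
    r'^\[OrderID:\s.*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*'
    r'|\[.*\]\s\[([a-z0-9A-Z]*)\].*'
)

def extract_id(value):
    """Return whichever capture group matched, or '' when nothing matches."""
    m = PATTERN.search(value)
    if m is None:
        return ""
    return m.group(1) or m.group(2) or ""

print(extract_id("[1234] [3333] [4444] [66]"))
print(extract_id("abcd"))
```

In Spark, the same pattern string could presumably be passed to F.regexp_extract once with group index 1 and once with index 2, and the two results combined with F.concat, since regexp_extract is documented to return an empty string when the regex or the requested group does not match. Because .* is greedy, the fallback form picks up the last bracketed value that follows "] ", which may or may not be the intended one.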
