Re: Regexp_extract not giving correct output
This means there is something wrong with your regex vs what Java supports. Do you mean "(?:" rather than "(?" around where the error is? This is not related to Spark. On Wed, Dec 2, 2020 at 9:45 AM Sachit Murarka wrote: > Hi Sean, > > Thanks for quick response! > > I have tried with string literal 'r' as a prefix that also gave an empty > result.. > spark.sql(r"select regexp_extract('[11] [22] > [33]','(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',1) > as anyid").show() > > and as I mentioned when I am using 2 backslashes it is giving an exception > as follows: > : java.util.regex.PatternSyntaxException: Unknown inline modifier near > index 21 > > (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) > > Kind Regards, > Sachit Murarka > > > On Wed, Dec 2, 2020 at 9:07 PM Sean Owen wrote: > >> As in Java/Scala, in Python you'll need to escape the backslashes with >> \\. "\[" means just "[" in a string. I think you could also prefix the >> string literal with 'r' to disable Python's handling of escapes. >> >> On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka >> wrote: >> >>> Hi All, >>> >>> I am using Pyspark to get the value from a column on basis of regex. >>> >>> Following is the regex which I am using: >>> >>> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) >>> >>> df = spark.createDataFrame([("[1234] [] [] [66]",), >>> ("abcd",)],["stringValue"]) >>> >>> result = df.withColumn('extracted value', >>> F.regexp_extract(F.col('stringValue'), >>> '(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)', >>> 1)) >>> >>> I have tried with spark.sql as well. It is giving empty output. >>> >>> I have tested this regex , it is working fine on an online regextester . >>> But it is not working in spark . I know spark needs Java based regex , >>> hence I tried escaping also , that gave exception: >>> : java.util.regex.PatternSyntaxException: Unknown inline modifier near >>> index 21 >>> >>> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) >>> >>> >>> Can you please help here? >>> >>> Kind Regards, >>> Sachit Murarka >>> >>
Re: Regexp_extract not giving correct output
Hi Sean, Thanks for quick response! I have tried with string literal 'r' as a prefix that also gave an empty result.. spark.sql(r"select regexp_extract('[11] [22] [33]','(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',1) as anyid").show() and as I mentioned when I am using 2 backslashes it is giving an exception as follows: : java.util.regex.PatternSyntaxException: Unknown inline modifier near index 21 (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) Kind Regards, Sachit Murarka On Wed, Dec 2, 2020 at 9:07 PM Sean Owen wrote: > As in Java/Scala, in Python you'll need to escape the backslashes with \\. > "\[" means just "[" in a string. I think you could also prefix the string > literal with 'r' to disable Python's handling of escapes. > > On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka > wrote: > >> Hi All, >> >> I am using Pyspark to get the value from a column on basis of regex. >> >> Following is the regex which I am using: >> >> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) >> >> df = spark.createDataFrame([("[1234] [] [] [66]",), >> ("abcd",)],["stringValue"]) >> >> result = df.withColumn('extracted value', >> F.regexp_extract(F.col('stringValue'), >> '(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)', >> 1)) >> >> I have tried with spark.sql as well. It is giving empty output. >> >> I have tested this regex , it is working fine on an online regextester . >> But it is not working in spark . I know spark needs Java based regex , >> hence I tried escaping also , that gave exception: >> : java.util.regex.PatternSyntaxException: Unknown inline modifier near >> index 21 >> >> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) >> >> >> Can you please help here? >> >> Kind Regards, >> Sachit Murarka >> >
Re: Regexp_extract not giving correct output
As in Java/Scala, in Python you'll need to escape the backslashes with \\. "\[" means just "[" in a string. I think you could also prefix the string literal with 'r' to disable Python's handling of escapes. On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka wrote: > Hi All, > > I am using Pyspark to get the value from a column on basis of regex. > > Following is the regex which I am using: > > (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) > > df = spark.createDataFrame([("[1234] [] [] [66]",), > ("abcd",)],["stringValue"]) > > result = df.withColumn('extracted value', > F.regexp_extract(F.col('stringValue'), > '(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)', > 1)) > > I have tried with spark.sql as well. It is giving empty output. > > I have tested this regex , it is working fine on an online regextester . > But it is not working in spark . I know spark needs Java based regex , > hence I tried escaping also , that gave exception: > : java.util.regex.PatternSyntaxException: Unknown inline modifier near > index 21 > > (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) > > > Can you please help here? > > Kind Regards, > Sachit Murarka >
Regexp_extract not giving correct output
Hi All, I am using Pyspark to get the value from a column on basis of regex. Following is the regex which I am using: (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) df = spark.createDataFrame([("[1234] [] [] [66]",), ("abcd",)],["stringValue"]) result = df.withColumn('extracted value', F.regexp_extract(F.col('stringValue'), '(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)', 1)) I have tried with spark.sql as well. It is giving empty output. I have tested this regex , it is working fine on an online regextester . But it is not working in spark . I know spark needs Java based regex , hence I tried escaping also , that gave exception: : java.util.regex.PatternSyntaxException: Unknown inline modifier near index 21 (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) Can you please help here? Kind Regards, Sachit Murarka