[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552155#comment-16552155 ]

Nick Nicolini commented on SPARK-16203:
---
Cool, added ticket here: https://issues.apache.org/jira/browse/SPARK-24884

I think the above is the same feature that [~mmoroz] was asking for, so IMO we should close this ticket in favor of the newer one.

> regexp_extract to return an ArrayType(StringType())
> ---
>
>            Key: SPARK-16203
>            URL: https://issues.apache.org/jira/browse/SPARK-16203
>        Project: Spark
>     Issue Type: Improvement
>     Components: PySpark
> Affects Versions: 2.0.0
>       Reporter: Max Moroz
>       Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the case
> - e.g., web log parsing) we need to parse the entire line and get all the
> groups, we'll need to call it as many times as there are groups.
> It's only a minor annoyance syntactically.
> But unless I misunderstand something, it would be very inefficient. (How
> would Spark know not to do multiple pattern-matching operations when only
> one is needed? Or does the optimizer actually check whether the patterns are
> identical and, if they are, avoid the repeated regex matching operations?)
> Would it be possible to have it return an array when the index is not
> specified (defaulting to None)?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551966#comment-16551966 ]

Herman van Hovell commented on SPARK-16203:
---
[~nnicolini] adding {{regexp_extract_all}} makes sense. Can you file a new ticket for this? BTW, there might already be one.
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551885#comment-16551885 ]

Nick Nicolini commented on SPARK-16203:
---
[~srowen] [~hvanhovell] I want to re-open this discussion. I've recently hit many cases of regexp parsing where we need to match on something that is arbitrary in length; for example, a text block that looks something like:

{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|{code}

where I need to pull out all values between "MSG:" and "|", which can occur in each instance anywhere from 1 to n times. I cannot reliably use the method shown above, and while I can write a UDF to handle this, it would be great if this were supported natively in Spark.

Perhaps we can implement something like "regexp_extract_all", as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have?
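The extraction Nick describes can be sketched with plain Python's re module; this is just the illustrative core logic one might later wrap in a Spark UDF, and the regex pattern itself is an assumption (the ticket does not specify one):

```python
import re

# The sample block from the comment, flattened to a single string
line = "AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|"

# One pass over the line: capture every value between "MSG:" and "|".
# findall returns the captured group of each non-overlapping match.
msgs = re.findall(r"MSG:([^|]*)\|", line)
# msgs -> ['ASDF', 'QWER', 'ZXCV']
```

This is exactly the array-of-strings result an eventual {{regexp_extract_all}} would return, regardless of how many MSG fields a given row contains.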
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356624#comment-15356624 ]

Max Moroz commented on SPARK-16203:
---
[~hvanhovell] UDF: yes, that's what I did - but I don't actually know whether the overhead of a UDF is less than the overhead of running the regex multiple times. Given that Dataset.flatmap() is essentially an RDD operation, I think using it would prevent Catalyst from optimizing anything?

I'm using Python, so I'm not sure what an Expression is. I thought there's a UDF (which is slow) and a UDAF (which I can't write in Python, but which isn't relevant in this case since I'm not aggregating anything).

Is there any reason not to add another DataFrame function (like regexp_extract_n) to the SQL/DataFrame interface? This (inefficient) code and its many variations shows up often in standard tutorials:

{code}
pattern = r'(\S+) (\S+) (\S+) \[(\S+) \S+\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S*)" "(.*?)"'
fields = log.select(
    regexp_extract('value', pattern, 1).alias('host'),
    regexp_extract('value', pattern, 4).alias('timestamp'),
    regexp_extract('value', pattern, 5).alias('method'),
    regexp_extract('value', pattern, 6).alias('url'),
    regexp_extract('value', pattern, 7).alias('protocol'),
    regexp_extract('value', pattern, 8).alias('status'),
    regexp_extract('value', pattern, 9).alias('size'),
    regexp_extract('value', pattern, 10).alias('referrer'),
    regexp_extract('value', pattern, 11).alias('agent'),
)
{code}
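The single-pass behavior being asked for is what a plain-Python re.match already gives: one application of the pattern yields all eleven groups at once. A minimal sketch using the same pattern (the sample log line is invented for illustration):

```python
import re

pattern = r'(\S+) (\S+) (\S+) \[(\S+) \S+\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S*)" "(.*?)"'

# Invented Common-Log-Format-style line, for illustration only
line = ('127.0.0.1 - - [01/Jul/1995:00:00:01 -0400] '
        '"GET /index.html HTTP/1.0" 200 6245 "-" "Mozilla/4.0"')

# A single regex pass yields every captured field at once
m = re.match(pattern, line)
host, method, status = m.group(1), m.group(5), m.group(8)
```

The tutorial snippet above, by contrast, hands Spark eleven independent regexp_extract expressions; unless the optimizer deduplicates identical patterns, each column means another full match of the line.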
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349737#comment-15349737 ]

Herman van Hovell commented on SPARK-16203:
---
I do agree that this is not efficient, but we cannot change the return type of {{regexp_extract}}. You could start by writing your own UDF, which can return an array of strings. Also consider using {{Dataset.explode(...)}}/{{Dataset.flatmap(...)}}. A more advanced approach would be to implement your own {{Expression}}.
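In Python, the UDF Herman suggests would just wrap logic like the following. The function name, signature, and null handling here are illustrative assumptions; registering it with pyspark.sql.functions.udf and an ArrayType(StringType()) return type is shown only as a comment, since that needs a running SparkSession:

```python
import re

def regexp_extract_all(s, pattern, group=1):
    """Return the given capture group from every non-overlapping match
    of `pattern` in `s`; None input maps to None (Spark-style null safety)."""
    if s is None:
        return None
    return [m.group(group) for m in re.finditer(pattern, s)]

# Hypothetical PySpark registration (not run here):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, StringType
# extract_all_udf = udf(lambda s: regexp_extract_all(s, r"MSG:([^|]*)\|"),
#                       ArrayType(StringType()))
```

Pairing such a UDF with explode then turns the array column back into one row per match, which covers the flatmap-style use case without leaving the DataFrame API.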
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349582#comment-15349582 ]

Max Moroz commented on SPARK-16203:
---
Hive SQL syntax allows the return value from a function to be an array; for example, split does it. I understand overloading the existing name may be confusing, but would it be inappropriate to add another function (like regexp_extract_n)?

If I'm misunderstanding something, and parsing something like a web log with the DataFrame API is already perfectly efficient, I would not think it's worth doing. But I don't think efficient parsing is currently possible (the best solution I'm aware of is regexp_replace followed by split - perhaps the optimizer manages to optimize away the unnecessary insertion of new characters, but I don't think so?).
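The regexp_replace-then-split workaround Max alludes to can be sketched in plain Python: a single substitution rewrites the whole line into a delimited record of the captured groups, and one split recovers the fields. The log line and the choice of tab as delimiter are invented for illustration, and the trick only works if the delimiter cannot appear in the data:

```python
import re

pattern = r'(\S+) (\S+) (\S+) \[(\S+) \S+\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S*)" "(.*?)"'
line = ('127.0.0.1 - - [01/Jul/1995:00:00:01 -0400] '
        '"GET /index.html HTTP/1.0" 200 6245 "-" "Mozilla/4.0"')

# Rewrite the matched line into the wanted groups joined by tabs
# (groups 2, 3 and the timezone are dropped, as in the tutorial snippet)
repl = "\t".join(rf"\g<{i}>" for i in (1, 4, 5, 6, 7, 8, 9, 10, 11))
cleaned = re.sub(pattern, repl, line)

# One split yields host, timestamp, method, url, protocol,
# status, size, referrer, agent
fields = cleaned.split("\t")
```

This does avoid re-matching the pattern per column, at the cost of materializing an intermediate string - which is exactly the "unnecessary insertion of new characters" Max is doubtful the optimizer can remove.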
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349510#comment-15349510 ]

Sean Owen commented on SPARK-16203:
---
I'm pretty certain this is for consistency with Hive, at the least, and with other DBs that define this function. It's not clear how this would interact with SQL syntax if it results in many values. I think the semantics of this particular operation are intentional.