[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15349737#comment-15349737 ]
Herman van Hovell commented on SPARK-16203:
-------------------------------------------

I do agree that this is not efficient, but we cannot change the return type of {{regexp_extract}}. You could start by writing your own UDF, which can return an array of strings. Also consider using {{Dataset.explode(...)}} or {{Dataset.flatMap(...)}}. A more advanced approach would be to implement your own {{Expression}}.

> regexp_extract to return an ArrayType(StringType())
> ---------------------------------------------------
>
>                 Key: SPARK-16203
>                 URL: https://issues.apache.org/jira/browse/SPARK-16203
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> regexp_extract only returns a single matched group. If, as is often the case
> (e.g., web log parsing), we need to parse the entire line and get all the
> groups, we'll need to call it as many times as there are groups.
> It's only a minor annoyance syntactically.
> But unless I misunderstand something, it would be very inefficient. (How
> would Spark know not to do multiple pattern matching operations, when only
> one is needed? Or does the optimizer actually check whether the patterns are
> identical, and if they are, avoid the repeated regex matching operations?)
> Would it be possible to have it return an array when the index is not
> specified (defaulting to None)?
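
For reference, the UDF approach suggested in the comment could look roughly like the following in PySpark. This is only a sketch: the log pattern, the function names ({{extract_groups}}, {{make_extract_groups_udf}}) and the column name {{line}} are made up for illustration and do not come from the issue. The regex is matched once per row and all groups come back together as an array of strings.

```python
import re

# Hypothetical pattern with three capture groups (host, method, path);
# substitute whatever pattern your log format actually needs.
LOG_PATTERN = re.compile(r'^(\S+) "(\w+) (\S+)')


def extract_groups(line):
    """Match once and return every capture group as a list of strings.

    Returns None for null input or when the pattern does not match,
    which Spark surfaces as a null array.
    """
    if line is None:
        return None
    match = LOG_PATTERN.search(line)
    return list(match.groups()) if match else None


def make_extract_groups_udf():
    """Wrap extract_groups as a Spark UDF returning ArrayType(StringType()).

    The pyspark imports live here so the plain function above can be
    tested without Spark; call this once a SparkSession is available.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(extract_groups, ArrayType(StringType()))


# Assumed usage, given a DataFrame df with a string column named "line":
#   groups_udf = make_extract_groups_udf()
#   parsed = df.select(groups_udf(df["line"]).alias("groups"))
#   parsed.select(parsed["groups"][0].alias("host"))
```

A Python UDF pays serialization overhead for every row, so whether this beats several {{regexp_extract}} calls depends on the data and the pattern; it is worth measuring before relying on it.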