[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15349510#comment-15349510
 ] 

Sean Owen commented on SPARK-16203:
-----------------------------------

I'm pretty certain this is for consistency with Hive, at the least, and other 
DBs that define this. It's not clear how this would interact with SQL syntax if 
it results in many values. I think the semantics of this particular operation 
are on purpose.

> regexp_extract to return an ArrayType(StringType())
> ---------------------------------------------------
>
>                 Key: SPARK-16203
>                 URL: https://issues.apache.org/jira/browse/SPARK-16203
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> regexp_extract only returns a single matched group. If (as if often the case 
> - e.g., web log parsing) we need to parse the entire line and get all the 
> groups, we'll need to call it as many times as there are groups.
> It's only a minor annoyance syntactically.
> But unless I misunderstand something, it would be very inefficient.  (How 
> would Spark know not to do multiple pattern matching operations, when only 
> one is needed? Or does the optimizer actually check whether the patterns are 
> identical, and if they are, avoid the repeated regex matching operations??)
> Would it be  possible to have it return an array when the index is not 
> specified (defaulting to None)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to