[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15349582#comment-15349582
 ] 

Max Moroz commented on SPARK-16203:
-----------------------------------

Hive SQL syntax allows the return value from a function to be an array; for 
example, split does this.  I understand overloading the existing name may be 
confusing, but would it be inappropriate to add another function (like 
regexp_extract_n)? 

If I'm misunderstanding something, and parsing something like a web log with 
the DataFrame API is already perfectly efficient, then I wouldn't think this 
is worth doing. But I don't think efficient parsing is currently possible: 
the best workaround I'm aware of is regexp_replace followed by split (perhaps 
the optimizer manages to optimize away the unnecessary insertion of new 
characters, but I don't think it does?).
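
To illustrate the efficiency concern in plain Python re (just a stand-in for 
the JVM regex engine Spark uses; the log line and pattern below are made up 
for illustration): a single match yields every group at once, whereas 
per-group extraction, which is what repeated regexp_extract calls amount to 
absent optimizer help, re-runs the match once per index.

```python
import re

# Hypothetical web-log pattern and line (illustrative only).
LOG_PATTERN = re.compile(r'(\S+) - - \[([^\]]+)\] "(\w+) (\S+)')
line = '127.0.0.1 - - [01/Jan/2016:00:00:00 +0000] "GET /index.html'

# One pass over the input yields all captured groups at once.
all_groups = list(LOG_PATTERN.search(line).groups())

# Per-group extraction repeats the full match for every group index.
per_group = [LOG_PATTERN.search(line).group(i) for i in range(1, 5)]

assert all_groups == per_group == [
    '127.0.0.1', '01/Jan/2016:00:00:00 +0000', 'GET', '/index.html']
```

With four groups the per-group version does four times the matching work, 
which is exactly the redundancy an array-returning variant would avoid.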

> regexp_extract to return an ArrayType(StringType())
> ---------------------------------------------------
>
>                 Key: SPARK-16203
>                 URL: https://issues.apache.org/jira/browse/SPARK-16203
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the 
> case - e.g., web log parsing) we need to parse the entire line and get all 
> the groups, we'll need to call it as many times as there are groups.
> It's only a minor annoyance syntactically.
> But unless I misunderstand something, it would be very inefficient. (How 
> would Spark know not to do multiple pattern-matching operations when only 
> one is needed? Or does the optimizer actually check whether the patterns 
> are identical and, if they are, avoid the repeated regex matching?)
> Would it be possible to have it return an array when the index is not 
> specified (defaulting to None)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
