[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2018-07-22 Thread Nick Nicolini (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552155#comment-16552155
 ] 

Nick Nicolini commented on SPARK-16203:
---

Cool, added a ticket here: https://issues.apache.org/jira/browse/SPARK-24884

I think the above is the same feature that [~mmoroz] was asking for, so IMO we 
should close this ticket in favor of the newer one.

> regexp_extract to return an ArrayType(StringType())
> ---
>
> Key: SPARK-16203
> URL: https://issues.apache.org/jira/browse/SPARK-16203
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the case 
> - e.g., web log parsing) we need to parse the entire line and get all the 
> groups, we'll need to call it as many times as there are groups.
> It's only a minor annoyance syntactically.
> But unless I misunderstand something, it would be very inefficient. (How 
> would Spark know not to do multiple pattern-matching operations when only 
> one is needed? Or does the optimizer actually check whether the patterns are 
> identical and, if they are, avoid the repeated regex matching operations?)
> Would it be possible to have it return an array when the index is not 
> specified (defaulting to None)?






[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2018-07-22 Thread Herman van Hovell (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551966#comment-16551966
 ] 

Herman van Hovell commented on SPARK-16203:
---

[~nnicolini] adding {{regexp_extract_all}} makes sense. Can you file a new 
ticket for this? BTW, there might already be one.




[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2018-07-21 Thread Nick Nicolini (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551885#comment-16551885
 ] 

Nick Nicolini commented on SPARK-16203:
---

[~srowen] [~hvanhovell] I want to re-open this discussion. I've recently hit 
many cases of regexp parsing where we need to match something of arbitrary 
length; for example, a text block that looks something like:

 
{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|
{code}
Where I need to pull out all values between "MSG:" and "|", which can occur 
anywhere from 1 to n times per instance. I cannot reliably use the method shown 
above, and while I can write a UDF to handle this, it'd be great if it were 
supported natively in Spark.
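
For reference, here is a minimal sketch of such a UDF in PySpark (a 
hypothetical DataFrame {{df}} with the raw text in a string column {{value}}; 
assumes the Spark 2.x {{udf}}/{{ArrayType}} API), returning every value found 
between "MSG:" and "|":
{code:python}
import re

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Capture everything between "MSG:" and the next "|".
MSG_PATTERN = re.compile(r'MSG:(.*?)\|')

def _extract_msgs(text):
    # Return None for null input, otherwise all captured values.
    return MSG_PATTERN.findall(text) if text is not None else None

extract_msgs = udf(_extract_msgs, ArrayType(StringType()))

# For the sample block above this yields ["ASDF", "QWER", "ZXCV"].
df.select(extract_msgs('value').alias('msgs'))
{code}
A built-in {{regexp_extract_all}} would cover the same case without the Python 
UDF overhead.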

Perhaps we can implement something like "regexp_extract_all" as 
[presto|https://prestodb.io/docs/current/functions/regexp.html] and 
[pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have?




[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2016-06-30 Thread Max Moroz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356624#comment-15356624
 ] 

Max Moroz commented on SPARK-16203:
---

[~hvanhovell] UDF: yes, that's what I did - but I don't actually know whether 
the overhead of a UDF is less than the overhead of running the regex multiple 
times. Given that Dataset.flatmap() is essentially an RDD operation, I think 
using it would prevent Catalyst from optimizing anything?

I'm using Python, so I'm not sure what Expression is. I thought there's UDF 
(which is slow), and UDAF (which I can't do in Python, but which isn't relevant 
in this case since I'm not aggregating anything).

Is there any reason not to add another function (like regexp_extract_n) to the 
SQL/DataFrame interface? This (inefficient) code and its many variations show 
up often in standard tutorials:

{code}
pattern = r'(\S+) (\S+) (\S+) \[(\S+) \S+\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S*)" "(.*?)"'
fields = log.select(
  regexp_extract('value', pattern, 1).alias('host'),
  regexp_extract('value', pattern, 4).alias('timestamp'),
  regexp_extract('value', pattern, 5).alias('method'),
  regexp_extract('value', pattern, 6).alias('url'),
  regexp_extract('value', pattern, 7).alias('protocol'),
  regexp_extract('value', pattern, 8).alias('status'),
  regexp_extract('value', pattern, 9).alias('size'),
  regexp_extract('value', pattern, 10).alias('referrer'),
  regexp_extract('value', pattern, 11).alias('agent')
)
{code}
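
For comparison, a rough single-pass sketch of what's being asked for, using an 
array-returning UDF (hypothetical helper names; assumes the same {{log}} 
DataFrame and pattern as above): the regex runs once per row, and the 
individual fields become plain array lookups.
{code:python}
import re

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

pattern = r'(\S+) (\S+) (\S+) \[(\S+) \S+\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S*)" "(.*?)"'
compiled = re.compile(pattern)

def _all_groups(line):
    # One regex evaluation per line; return all captured groups, or None.
    m = compiled.match(line) if line is not None else None
    return list(m.groups()) if m else None

all_groups = udf(_all_groups, ArrayType(StringType()))

# Group i of the regex is element i - 1 of the array.
parsed = log.select(all_groups('value').alias('g'))
fields = parsed.select(
  col('g')[0].alias('host'),
  col('g')[3].alias('timestamp'),
  col('g')[4].alias('method'),
  col('g')[5].alias('url'),
  col('g')[6].alias('protocol'),
  col('g')[7].alias('status'),
  col('g')[8].alias('size'),
  col('g')[9].alias('referrer'),
  col('g')[10].alias('agent')
)
{code}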





[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2016-06-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349737#comment-15349737
 ] 

Herman van Hovell commented on SPARK-16203:
---

I do agree that this is not efficient, but we cannot change the return type of 
{{regexp_extract}}.

You could start by writing your own UDF, which can return an array of strings. 
Also consider using {{Dataset.explode(...)/Dataset.flatmap(...)}}. A more 
advanced approach would be to implement your own {{Expression}}.
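
To make the first suggestion concrete, a small PySpark sketch (hypothetical 
DataFrame {{df}}, column {{value}}, and pattern): an array-returning UDF 
combined with {{explode}} gives one row per match.
{code:python}
import re

from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType

# UDF returning ArrayType(StringType()): every run of digits in the input.
extract_all = udf(
    lambda s: re.findall(r'\d+', s) if s is not None else None,
    ArrayType(StringType())
)

# Keep the matches as an array column, or explode into one row per match.
df.select(extract_all('value').alias('matches'))
df.select(explode(extract_all('value')).alias('match'))
{code}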





[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2016-06-25 Thread Max Moroz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349582#comment-15349582
 ] 

Max Moroz commented on SPARK-16203:
---

Hive SQL syntax allows the return value from a function to be an array; for 
example, split does it. I understand overloading the existing name may be 
confusing, but would it be inappropriate to add another function (like 
regexp_extract_n)?

If I'm misunderstanding something and parsing something like a web log with the 
DataFrame API is already perfectly efficient, then I wouldn't think it's worth 
doing. But I don't think efficient parsing is currently possible (the best 
workaround I'm aware of is regexp_replace followed by split - perhaps the 
optimizer manages to optimize away the unnecessary insertion of new characters, 
but I doubt it).
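
For what it's worth, a sketch of that regexp_replace-then-split workaround 
(hypothetical column {{value}}; a simplified three-group pattern stands in for 
a real log pattern): the match is rewritten with a delimiter between the 
groups, then split on that delimiter.
{code:python}
from pyspark.sql.functions import regexp_replace, split

# Insert a tab between the captured groups (assumes the fields themselves
# contain no tabs), then split the rewritten string on that tab.
pattern = r'^(\S+) (\S+) (\S+)$'
delimited = regexp_replace('value', pattern, '$1\t$2\t$3')
fields = df.select(split(delimited, '\t').alias('groups'))
{code}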




[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2016-06-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349510#comment-15349510
 ] 

Sean Owen commented on SPARK-16203:
---

I'm pretty certain this is for consistency with Hive, at the least, and other 
DBs that define this function. It's not clear how this would interact with SQL 
syntax if it returned many values. I think the semantics of this particular 
operation are intentional.
