[
https://issues.apache.org/jira/browse/HIVE-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767894#comment-13767894
]
Michael Haeusler commented on HIVE-4943:
----------------------------------------
An awesome feature of Hive is the rich type system with excellent support for
complex data-structures. To me, this ticket seems like a very useful extension
to the hive built-ins. It is especially helpful for those users, that use
complex data-structures. Right now, queries are often cumbersome when you
access denormalized or nested data.
E.g., let's consider a table that contains products together with their most
popular accessories (cross-sellings). The order of the cross-selling products
matter:
{code:javascript}
{
"productId": 42,
"name": "most awesome mp3 player",
"manufacturer": "acme corp",
"accessories": [
{ "productId" : 23, "name": "batteries", "manufacturer": "acme corp" },
{ "productId" : 25, "name": "extra load earphones", "manufacturer":
"noisemakers inc" }
]
}
{code}
Let's assume we want to know the average position in cross-sellings of the
manufacturer "noisemakers inc". Surprisingly, this is not possible with hive
built-ins. You could try to come up with a custom UDFSequence and a query like
this:
{code:sql}
SELECT
AVG(SEQUENCE(p.productId)) AS wrongAverage
FROM
products p
LATERAL VIEW
EXPLODE(p.accessories) pa AS accessory
WHERE
pa.accessory.manufacturer = 'noisemakers inc';
{code}
Unfortunately, the above query will give us wrong results, because Hive
executes the predicate in the where clause first. Therefore, any UDF in the
select clause has no chance to see and count all values.
Using the UDTF from this ticket seems to be the best solution:
{code:sql}
SELECT
AVG(pa.pos) AS correctAverage
FROM
products p
LATERAL VIEW
POSEXPLODE(p.accessories) pa AS pos, accessory
WHERE
pa.accessory.manufacturer = 'noisemakers inc';
{code}
> An explode function that includes the item's position in the array
> ------------------------------------------------------------------
>
> Key: HIVE-4943
> URL: https://issues.apache.org/jira/browse/HIVE-4943
> Project: Hive
> Issue Type: New Feature
> Components: Query Processor
> Affects Versions: 0.11.0
> Reporter: Niko Stahl
> Labels: patch
> Fix For: 0.11.0
>
> Attachments: HIVE-4943.1.patch, HIVE-4943.2.patch
>
> Original Estimate: 8h
> Remaining Estimate: 8h
>
> A function that explodes an array and includes an output column with the
> position of each item in the original array.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira