You are welcomed

Dudu

From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Friday, June 10, 2016 8:55 PM
To: user@hive.apache.org
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. This is wonderful explaination. I'm very thankful

On 6/10/2016 7:24 AM, Markovitz, Dudu wrote:
regexp_extract ('(,?[^,]*){0,10}',0)

(...){0,10}

The expression surrounded by brackets repeats 0 to 10 times.


(,?[…]*)

Optional comma followed by sequence (0 or more) of characters


[^,]

Any character which is not comma


regexp_extract (...,0)

0 stands for the whole expression
1 stands for the 1st expression which is surrounded by brackets (ordered by the 
opening brackets)
2 stands for the 2nd expression which is surrounded by brackets (ordered by the 
opening brackets)
3 stands for the 3rd expression which is surrounded by brackets (ordered by the 
opening brackets)
Etc.



regexp_replace (((,?[^,]*){0,10}).*','$1')

Similar to regexp_extract but this time we’re not extracting the first 10 
tokens but replacing the whole expression with the first 10 tokens.
The expression that stands for the first 10 tokens is identical to the one we 
used in regexp_extract
.* stands for any character that repeats 0 or more times which represent 
anything following the first 10 tokens
$1 stands for the 1st expression which is surrounded by brackets (ordered by 
the opening brackets)


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Friday, June 10, 2016 2:54 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. I will check. Can you please throw some light on regexp_replace 
(((,?[^,]*){0,10}).*','$1')  regexp_extract ('(,?[^,]*){0,10}',0),

On 6/9/2016 11:33 PM, Markovitz, Dudu wrote:
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count 
from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkov...@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as 
string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as 
count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with 
regexp_extract or regexp_replace

hive> select regexp_extract 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma 
separated string items.  Is there a way to retrieve only 100 string items from 
that Column. Also we need to capture number of comma separated string items. We 
are looking for more of   "substring_index" functionality, since we are using 
Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a 
way to achieve the same functionality with  "regexp_extract" and I also see 
there is UDF available not sure whether this helps us achieving same 
functionality. 
https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5


Reply via email to