----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as 
string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as 
count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with 
regexp_extract or regexp_replace

hive> select regexp_extract 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace 
('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma 
separated string items.  Is there a way to retrieve only 100 string items from 
that Column. Also we need to capture number of comma separated string items. We 
are looking for more of   "substring_index" functionality, since we are using 
Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a 
way to achieve the same functionality with  "regexp_extract" and I also see 
there is UDF available not sure whether this helps us achieving same 
functionality. 
https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5

Reply via email to