[ https://issues.apache.org/jira/browse/HIVE-6140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868867#comment-13868867 ]

Anandha L Ranganathan commented on HIVE-6140:
---------------------------------------------

Here is the system configuration:
4 cores, 8 GB RAM
File format: Text
Compression: none

1) select count(l) from letters where l = 'l ';
   around 100 seconds

2) select count(l) from letters where trim(l) = 'l';
   around 230 seconds

3) I created a GenericUDF implementation of trim:
   select count(l) from letters where gentrim(l) = 'l';
   around 220 seconds


The evaluate function takes around 1500 nanoseconds to process each record. Those nanoseconds accumulate to the 230 seconds observed when the UDF is applied to 500M records.

This is the code used in evaluate():

        if (arguments[0].get() == null) {
            return null;
        }

        // Converts the argument to Text, then allocates a new String per row
        // via toString().trim() before writing the result back into the Text.
        input = (Text) converters[0].convert(arguments[0].get());
        input.set(input.toString().trim());
        return input;
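Most of the per-record cost above likely comes from the Text -> String -> trim() -> Text round trip, which allocates a fresh String (and its backing char array) for every row. A minimal sketch of a byte-level alternative that scans the UTF-8 bytes directly and only materializes the trimmed result (class and method names here are illustrative, not Hive APIs):

```java
import java.nio.charset.StandardCharsets;

public class ByteTrim {

    // Hypothetical helper: trims leading/trailing characters <= ' '
    // (the same rule String.trim() uses) by scanning the raw bytes,
    // so no intermediate String is allocated for the untrimmed value.
    // Safe for UTF-8: continuation bytes have the high bit set and are
    // never <= 0x20, so multi-byte characters are left untouched.
    public static String trimBytes(byte[] bytes, int start, int length) {
        int from = start;
        int to = start + length;
        // Skip leading whitespace/control bytes.
        while (from < to && (bytes[from] & 0xFF) <= ' ') {
            from++;
        }
        // Skip trailing whitespace/control bytes.
        while (to > from && (bytes[to - 1] & 0xFF) <= ' ') {
            to--;
        }
        return new String(bytes, from, to - from, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "l ".getBytes(StandardCharsets.UTF_8);
        System.out.println("[" + trimBytes(data, 0, data.length) + "]");
    }
}
```

In a real GenericUDF, the same scan could run against Text.getBytes() and Text.getLength(), setting the output Text from the trimmed byte range rather than from a trimmed String.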



[~ehans]
I haven't tried the ORC file format yet. I will try it later.

> trim udf is very slow
> ---------------------
>
>                 Key: HIVE-6140
>                 URL: https://issues.apache.org/jira/browse/HIVE-6140
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Thejas M Nair
>            Assignee: Anandha L Ranganathan
>         Attachments: temp.pl
>
>
> Paraphrasing what was reported by [~cartershanklin] -
> I used the attached Perl script to generate 500 million two-character strings 
> which always included a space. I loaded it using:
> create table letters (l string); 
> load data local inpath '/home/sandbox/data.csv' overwrite into table letters;
> Then I ran this SQL script:
> select count(l) from letters where l = 'l ';
> select count(l) from letters where trim(l) = 'l';
> First query = 170 seconds
> Second query = 514 seconds



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
