[ https://issues.apache.org/jira/browse/HIVE-6140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868867#comment-13868867 ]
Anandha L Ranganathan commented on HIVE-6140:
---------------------------------------------

Here is the system configuration: 4 cores, 8 GB RAM. File format: Text; compression: NONE.

1) select count(l) from letters where l = 'l ';        -- around 100 seconds
2) select count(l) from letters where trim(l) = 'l';   -- 230 seconds
3) I created a GenericUDF for trim, and the result was: select count(l) from letters where gentrim(l) = 'l';   -- 220 seconds

The evaluate function takes around 1500 nanoseconds to process each record. Those nanoseconds accumulate to the 230 seconds we see when the UDF runs over 500M records. This is the code used in evaluate:

{code}
if (arguments[0].get() == null) {
  return null;
}
input = (Text) converters[0].convert(arguments[0].get());
input.set(input.toString().trim());
{code}

[~ehans] I haven't tried the ORC file format. I will try it later.

> trim udf is very slow
> ---------------------
>
>                 Key: HIVE-6140
>                 URL: https://issues.apache.org/jira/browse/HIVE-6140
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Thejas M Nair
>            Assignee: Anandha L Ranganathan
>         Attachments: temp.pl
>
> Paraphrasing what was reported by [~cartershanklin] -
> I used the attached Perl script to generate 500 million two-character strings
> which always included a space. I loaded it using:
> create table letters (l string);
> load data local inpath '/home/sandbox/data.csv' overwrite into table letters;
> Then I ran this SQL script:
> select count(l) from letters where l = 'l ';
> select count(l) from letters where trim(l) = 'l';
> First query = 170 seconds
> Second query = 514 seconds

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
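A plausible source of the per-record cost in the evaluate snippet above is allocation: `toString()` decodes the `Text` bytes into a new `String`, `trim()` may allocate another, and `set()` re-encodes back to UTF-8, on every one of the 500M rows. A minimal, self-contained sketch of the alternative (not Hive code; `trimBytes` is a hypothetical helper) is to strip ASCII spaces directly from the byte range, so no intermediate `String` is created. Note this handles only the space character, whereas Java's `String.trim()` strips all code points <= U+0020.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteTrim {
    // Hypothetical helper: trim leading/trailing ASCII spaces from a
    // UTF-8 byte range without materializing a java.lang.String, unlike
    // the Text.toString().trim() pattern that allocates per row.
    static byte[] trimBytes(byte[] buf, int start, int len) {
        int from = start;
        int to = start + len;
        while (from < to && buf[from] == ' ') from++;
        while (to > from && buf[to - 1] == ' ') to--;
        return Arrays.copyOfRange(buf, from, to);
    }

    public static void main(String[] args) {
        // Mirrors the test data: two-character strings containing a space.
        byte[] row = "l ".getBytes(StandardCharsets.UTF_8);
        byte[] trimmed = trimBytes(row, 0, row.length);
        System.out.println(new String(trimmed, StandardCharsets.UTF_8));
    }
}
```

In a real GenericUDF one would apply the same idea to the buffer returned by `Text.getBytes()` (honoring `Text.getLength()`), reusing a single output `Text` object across calls instead of allocating per record.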