So I am experimenting with ORC files, and I have a fast little table that holds login events. Out of curiosity, I was wondering: based on what we all know about ORC files, if I did the below, would the per-file indexing get me anything? Now, before people complain about small files, let's toss that aside for now.
Here is what I have:

    set mapred.reduce.tasks=26;
    insert into table ogintable
    select * from main_table
    where loginid != ''
    distribute by (abs(hash(substring(loginid, 0, 1))) % 26)
    sort by loginid

Basically, I am thinking that if I distribute by that expression, each letter will get its own file, and the files thus act as a mini index. Am I overthinking this? I know that if I do just the sort by, I get 3 to 4 files. With this method I get more files, and since loginid is an extremely common where-clause member, I was thinking this may be a good thing. Maybe I am wrong; figured I'd send it out to the group to get made fun of/ridiculed in public :)
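To make the "each letter gets its own file" idea concrete, here is a rough Python sketch of the bucketing expression. It is only a simulation under an assumption I have not verified against Hive itself: that hash() on a one-character ASCII string behaves like Java's String.hashCode. The helper names (java_string_hash, bucket_for) are made up for illustration and are not Hive functions.

```python
def java_string_hash(s: str) -> int:
    """Java-style String.hashCode(), wrapped to a signed 32-bit int.

    Assumption: Hive's hash() on a short ASCII string gives the same value.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_for(loginid: str, buckets: int = 26) -> int:
    """Mimic abs(hash(first character of loginid)) % 26 from the query."""
    return abs(java_string_hash(loginid[:1])) % buckets


def first_chars_per_bucket(loginids, buckets: int = 26):
    """Group the distinct first characters by the bucket they land in."""
    groups = {}
    for loginid in loginids:
        groups.setdefault(bucket_for(loginid, buckets), set()).add(loginid[:1])
    return groups


if __name__ == "__main__":
    sample = ["alice", "bob", "zed", "Alice", "ursula", "007agent", "dave"]
    for bucket, chars in sorted(first_chars_per_bucket(sample).items()):
        print(bucket, sorted(chars))
```

If that assumption holds, the 26 lowercase letters each land in a distinct bucket, but anything else (uppercase letters, digits, punctuation) shares a bucket with some lowercase letter, so "one letter per file" would only be true for all-lowercase loginids.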