So I am experimenting with ORC files, and I have a fast little table that
has login events. Out of curiosity, I was wondering: based on what we
all know about ORC files, if I did the below, would the per-file indexing get
me anything? Now, before people complain about small files, let's toss that
aside for now.

I have:
set mapred.reduce.tasks=26;
insert into table ogintable
select * from main_table where loginid != ''
distribute by (abs(hash(substring(loginid, 0, 1))) % 26)
sort by loginid;


Basically, I am thinking that if I distribute by the expression above, each
letter will get its own file, and thus each file acts as a mini index. Am I
overthinking this? I know that if I do just the sort by, I get 3 to 4 files;
with this method I get more files, and since loginid is an extremely common
where-clause member, I was thinking this may be a good thing. Maybe I am
wrong; figured I'd send it out to the group to get made fun of/ridiculed in
public :)
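
One sanity check I could run first is to count what lands in each of the 26
buckets, since taking a hash modulo 26 does not by itself guarantee one
letter per bucket. This is just a sketch that reuses the same expression and
source table as the insert above; the column aliases (bucket_id, first_chars,
num_rows) are names I made up for illustration.

```sql
-- Sketch only: shows how rows would spread across the 26 reducer buckets
-- for the same expression used in the distribute by clause.
-- bucket_id, first_chars and num_rows are made-up aliases.
select
  abs(hash(substring(loginid, 0, 1))) % 26 as bucket_id,
  count(distinct substring(loginid, 0, 1)) as first_chars,
  count(*) as num_rows
from main_table
where loginid != ''
group by abs(hash(substring(loginid, 0, 1))) % 26;
```

If any bucket_id comes back with first_chars greater than 1, those leading
characters would end up sharing a file rather than each getting its own.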
