Hello, I have the same question about DISTRIBUTE BY in Hive. According to you, DISTRIBUTE BY (DBY) does not use the hashCode of the Java Object class.
I tried to understand how ObjectInspectorUtils computes the hash used for distribution, but it depends on a lot of Hive API and I could not follow it. I want to override the partitionByHash function in Flink so that it partitions the same way DBY does in Hive. I am implementing a benchmark system for these two systems, which could be a contribution to Hive as well.

Could you tell me in detail how it works? If you do not use the hashCode of the Object class in Java, then I assume you defined your own partition function for DBY.

On Sun, Oct 25, 2015 at 12:59 AM, Philip Lee <philjj...@gmail.com> wrote:

> On Thu, Oct 22, 2015 at 7:13 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
>> > so do you think if we want the same result from Hive and Spark or the
>> > other framework, how could we try this one?
>>
>> There's a special backwards compat slow codepath that gets triggered if
>> you do
>>
>> set mapred.reduce.tasks=199; (or any number)
>>
>> This will produce the exact same hash-code as the java hashcode for
>> Strings & Integers.
>>
>> The bucket-id is determined by
>>
>> (hashCode & Integer.MAX_VALUE) % numberOfBuckets
>>
>> but this also triggers a non-stable sort on an entirely empty key, which
>> will shuffle the data so the output file's order bears no resemblance to
>> the input file's order.
>>
>> Even with that setting, the only consistent layout produced by Hive is the
>> CLUSTER BY, which will sort on the same key used for distribution & uses
>> the java hashCode if the auto-parallelism is turned off by setting a fixed
>> reducer count.
>>
>> Cheers,
>> Gopal
>
> --
> ==========================================================
> *Hae Joon Lee*
>
> Now, in Germany,
> M.S. Candidate, Interested in Distributed System, Iterative Processing
> Dept. of Computer Science, Informatik in German, TUB
> Technical University of Berlin
>
> In Korea,
> M.S. Candidate, Computer Architecture Laboratory
> Dept. of Computer Science, KAIST
>
> Rm# 4414 CS Dept. KAIST
> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>
> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
> ==========================================================
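
For reference, here is a minimal Java sketch of the bucket formula quoted in the reply, (hashCode & Integer.MAX_VALUE) % numberOfBuckets. The class name BucketIdSketch and the helper bucketFor are hypothetical names used only for illustration and are not part of Hive or Flink. Per the reply, this would match Hive's bucket assignment only on the backwards-compatible codepath with a fixed reducer count, and only for Strings and Integers; whether it matches in any other configuration is not confirmed here.

```java
public class BucketIdSketch {

    // The formula from the thread: clear the sign bit so the value is
    // non-negative, then take the remainder by the number of buckets.
    static int bucketId(int hashCode, int numberOfBuckets) {
        return (hashCode & Integer.MAX_VALUE) % numberOfBuckets;
    }

    // Hypothetical convenience wrapper: uses the plain Java hashCode of the key.
    // Assumption: only Strings and Integers are expected to agree with Hive.
    static int bucketFor(Object key, int numberOfBuckets) {
        return bucketId(key.hashCode(), numberOfBuckets);
    }

    public static void main(String[] args) {
        int numberOfBuckets = 199; // same value as mapred.reduce.tasks in the thread
        Object[] keys = {"hive", "flink", 42, -7};
        for (Object key : keys) {
            System.out.println(key + " -> bucket " + bucketFor(key, numberOfBuckets));
        }
    }
}
```

On the Flink side, one possible approach is to call a function like this from a custom Partitioner passed to partitionCustom, using the desired bucket count as the number of partitions. That wiring is an assumption and is not shown or verified here.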