I looked at the code again. When the number of HFiles to be loaded multiplied by the number of column families is large, your suggestion may produce some speedup. If you have access to a cluster, you can measure the potential savings of your approach.
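
For reference, here is a minimal sketch of the HashSet variant of that loop. It is not the actual patch: the class and method names are hypothetical, and plain strings stand in for HColumnDescriptor and LoadQueueItem so it runs without HBase on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the family check in doBulkLoad. Plain strings
// replace HColumnDescriptor and LoadQueueItem (an assumption for illustration).
class FamilyCheckSketch {

  // Returns the family names seen in the queued HFiles that the table lacks.
  // Duplicates are kept, as in the quoted loop.
  static List<String> unmatchedFamilies(List<String> tableFamilies,
                                        List<String> hfileFamilies) {
    // Build the set once; each lookup below is O(1) on average,
    // instead of O(n) for ArrayList.contains.
    Set<String> familyNames = new HashSet<String>(tableFamilies);
    List<String> unmatched = new ArrayList<String>();
    for (String familyNameInHFile : hfileFamilies) {
      if (!familyNames.contains(familyNameInHFile)) {
        unmatched.add(familyNameInHFile);
      }
    }
    return unmatched;
  }

  public static void main(String[] args) {
    List<String> tableFamilies = Arrays.asList("cf1", "cf2");
    List<String> hfileFamilies = Arrays.asList("cf1", "cf3", "cf2", "cf3");
    System.out.println(unmatchedFamilies(tableFamilies, hfileFamilies));
  }
}
```

With F families and Q queued HFiles, the list version does on the order of F * Q string comparisons in the worst case, while the set version does roughly F + Q hash operations, so the gap only matters when both numbers are large.
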
Cheers

On Thu, Aug 27, 2015 at 5:08 PM, Ted Yu <[email protected]> wrote:

> At roughly how many column families would this change show performance
> boost?
>
> Cheers
>
> > On Aug 27, 2015, at 4:56 PM, Himanshu Verma <[email protected]> wrote:
> >
> > Hi,
> >
> > I was looking at following method:
> >
> >     public void doBulkLoad(Path hfofDir, final Admin admin, Table table,
> >         RegionLocator regionLocator) throws TableNotFoundException,
> >         IOException {
> >
> > We can optimize following part of this method:
> >
> >     353 ArrayList<String> familyNames = new ArrayList<String>(families.size());
> >     354 for (HColumnDescriptor family : families) {
> >     355   familyNames.add(family.getNameAsString());
> >     356 }
> >     357 ArrayList<String> unmatchedFamilies = new ArrayList<String>();
> >     358 Iterator<LoadQueueItem> queueIter = queue.iterator();
> >     359 while (queueIter.hasNext()) {
> >     360   LoadQueueItem lqi = queueIter.next();
> >     361   String familyNameInHFile = Bytes.toString(lqi.family);
> >     362   if (!familyNames.contains(familyNameInHFile)) {
> >     363     unmatchedFamilies.add(familyNameInHFile);
> >     364   }
> >     365 }
> >
> > line 353 uses ArrayList data structure for familyNames and calls its
> > "contains" (line 362) method which is O(n). We can instead use HashSet,
> > its "contains" method is O(1).
> >
> > It should increase performance in cases having large number of column
> > families.
> >
> > This is my first time here, I can make this change if everything looks
> > fine.
> >
> > Regards,
> > Himanshu Verma
