[ https://issues.apache.org/jira/browse/HBASE-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058135#comment-16058135 ]
Densel Santhmayor commented on HBASE-18161:
-------------------------------------------

The Findbugs issues started showing up after commit 298454e8a72d27286d540edb2e9eeeb, where the spotbugs implementation replaced findbugs. I ran mvn findbugs:check on hbase-server and the following files were flagged. None of them are related in any way to the changes I've made, so I will probably skip trying to update them for now.

{noformat}
[INFO] BugInstance size is 12
[INFO] Error size is 0
[INFO] Total bugs: 12
[INFO] Repeated conditional test in org.apache.hadoop.hbase.LocalHBaseCluster.getActiveMaster() ["org.apache.hadoop.hbase.LocalHBaseCluster"] At LocalHBaseCluster.java:[lines 61-444]
[INFO] Primitive is boxed to call Long.compareTo(Long): use Long.compare(long, long) instead ["org.apache.hadoop.hbase.constraint.Constraints$1"] At Constraints.java:[lines 612-617]
[INFO] Return value of org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getRecoverableZooKeeper() ignored, but method has no side effect ["org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$GetDataAsyncCallback"] At ZkSplitLogWorkerCoordination.java:[lines 564-577]
[INFO] Possible null pointer dereference in org.apache.hadoop.hbase.mapreduce.JarFinder.zipDir(File, String, ZipOutputStream, boolean) due to return value of called method ["org.apache.hadoop.hbase.mapreduce.JarFinder"] At JarFinder.java:[lines 47-181]
[INFO] Useless object stored in variable famPaths of method org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.tryAtomicRegionLoad(ClientServiceCallable, TableName, byte[], Collection) ["org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles"] At LoadIncrementalHFiles.java:[lines 106-1309]
[INFO] Useless condition: it's known that this.numProcessing >= 0 at this point ["org.apache.hadoop.hbase.master.DeadServer"] At DeadServer.java:[lines 43-202]
[INFO] Primitive is boxed to call Integer.compareTo(Integer): use Integer.compare(int, int) instead ["org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster$1"] At BaseLoadBalancer.java:[lines 914-917]
[INFO] org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster$2.compare(Integer, Integer) incorrectly handles float value ["org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster$2"] At BaseLoadBalancer.java:[lines 929-939]
[INFO] Return value of org.apache.hadoop.hbase.client.Mutation.getRow() ignored, but method has no side effect ["org.apache.hadoop.hbase.regionserver.HRegion"] At HRegion.java:[lines 193-8218]
[INFO] Return value of org.apache.hadoop.hbase.coprocessor.RegionObserver.postAppend(ObserverContext, Append, Result) ignored, but method has no side effect ["org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$46"] At RegionCoprocessorHost.java:[lines 1229-1234]
[INFO] Primitive is boxed to call Long.compareTo(Long): use Long.compare(long, long) instead ["org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$LogsComparator"] At ReplicationSource.java:[lines 504-519]
[INFO] org.apache.hadoop.hbase.tool.Canary$RegionMonitor.run() makes inefficient use of keySet iterator instead of entrySet iterator ["org.apache.hadoop.hbase.tool.Canary$RegionMonitor"] At Canary.java:[lines 996-1227]
{noformat}

I also ran all of the above failed tests locally and they went through fine.
> Incremental Load support for Multiple-Table HFileOutputFormat
> -------------------------------------------------------------
>
>                 Key: HBASE-18161
>                 URL: https://issues.apache.org/jira/browse/HBASE-18161
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Densel Santhmayor
>            Priority: Minor
>         Attachments: MultiHFileOutputFormatSupport_HBASE_18161.patch, MultiHFileOutputFormatSupport_HBASE_18161_v2.patch, MultiHFileOutputFormatSupport_HBASE_18161_v3.patch, MultiHFileOutputFormatSupport_HBASE_18161_v4.patch, MultiHFileOutputFormatSupport_HBASE_18161_v5.patch, MultiHFileOutputFormatSupport_HBASE_18161_v6.patch, MultiHFileOutputFormatSupport_HBASE_18161_v7.patch, MultiHFileOutputFormatSupport_HBASE_18161_v8.patch
>
> h2. Introduction
>
> MapReduce currently supports the ability to write HBase records in bulk to HFiles for a single table. The file(s) can then be uploaded to the relevant RegionServers with reasonable latency. This feature is useful for making a large set of data available for queries at the same time, and it provides a way to efficiently process very large input into HBase without affecting query latencies.
>
> There is, however, no support for writing variations of the same record key to HFiles belonging to multiple HBase tables from within the same MapReduce job.
>
> h2. Goal
>
> The goal of this JIRA is to extend HFileOutputFormat2 to support writing to HFiles for different tables within the same MapReduce job, while keeping the single-table HFile features backwards-compatible.
>
> For our use case, we needed to write a record key to a smaller HBase table for quicker access, and the same record key with a date appended to a larger table for longer-term storage with chronological access. Each of these tables would have different TTL and other settings to support their respective access patterns.
> We also needed to be able to bulk write records with different subsets of very large input to multiple tables as efficiently as possible. Rather than run the MapReduce job multiple times (once for each table or record structure), it would be useful to be able to parse the input a single time and write to multiple tables simultaneously.
>
> Additionally, we'd like to maintain backwards compatibility with the existing heavily-used HFileOutputFormat2 interface, so that benefits such as locality sensitivity (which was introduced long after we implemented support for multiple tables) apply to both single-table and multi-table HFile writes.
>
> h2. Proposal
>
> * Backwards compatibility for the existing single-table support in HFileOutputFormat2 will be maintained; in this case, mappers will need to emit the table rowkey as before. However, a new class - MultiHFileOutputFormat - will provide a helper function to generate a rowkey for mappers that prefixes the desired table name to the existing rowkey, as well as configureIncrementalLoad support for multiple tables.
> * HFileOutputFormat2 will be updated in the following way:
> ** configureIncrementalLoad will now accept multiple table descriptor and region locator pairs, analogous to the single pair currently accepted by HFileOutputFormat2.
> ** Compression, Block Size, Bloom Type and Datablock settings per column family that are set in the Configuration object are now indexed and retrieved by table name AND column family.
> ** getRegionStartKeys will now support multiple region locators and calculate split points, and therefore partitions, collectively for all tables. Similarly, the eventual number of Reducers will be equal to the total number of partitions across all tables.
> ** The RecordWriter class will be able to process rowkeys either with or without the table name prepended, depending on whether configureIncrementalLoad was configured with MultiHFileOutputFormat or HFileOutputFormat2.
> * MultiHFileOutputFormat will write its output into HFiles that match the output format of HFileOutputFormat2. However, while the default use case will keep the existing directory structure (a directory per column family with the HFiles inside it), MultiHFileOutputFormat will output HFiles in the output directory with the following relative paths:
> {noformat}
> --table1
> --family1
> --HFiles
> --table2
> --family1
> --family2
> --HFiles
> {noformat}
>
> This aims to be a comprehensive solution to the original tickets - HBASE-3727 and HBASE-16261. Thanks to [~clayb] for his support.
>
> The patch will be attached shortly.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)