[ https://issues.apache.org/jira/browse/HBASE-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058135#comment-16058135 ]
Densel Santhmayor commented on HBASE-18161:
-------------------------------------------

The Findbugs issues started showing up after commit 298454e8a72d27286d540edb2e9eeeb, where the spotbugs implementation replaced findbugs. I ran mvn findbugs:check on hbase-server and the following files were flagged. None of them are related in any way to the changes I've made, so I will probably skip trying to update them for now.

{noformat}
[INFO] BugInstance size is 12
[INFO] Error size is 0
[INFO] Total bugs: 12
[INFO] Repeated conditional test in org.apache.hadoop.hbase.LocalHBaseCluster.getActiveMaster() ["org.apache.hadoop.hbase.LocalHBaseCluster"] At LocalHBaseCluster.java:[lines 61-444]
[INFO] Primitive is boxed to call Long.compareTo(Long): use Long.compare(long, long) instead ["org.apache.hadoop.hbase.constraint.Constraints$1"] At Constraints.java:[lines 612-617]
[INFO] Return value of org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getRecoverableZooKeeper() ignored, but method has no side effect ["org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$GetDataAsyncCallback"] At ZkSplitLogWorkerCoordination.java:[lines 564-577]
[INFO] Possible null pointer dereference in org.apache.hadoop.hbase.mapreduce.JarFinder.zipDir(File, String, ZipOutputStream, boolean) due to return value of called method ["org.apache.hadoop.hbase.mapreduce.JarFinder"] At JarFinder.java:[lines 47-181]
[INFO] Useless object stored in variable famPaths of method org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.tryAtomicRegionLoad(ClientServiceCallable, TableName, byte[], Collection) ["org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles"] At LoadIncrementalHFiles.java:[lines 106-1309]
[INFO] Useless condition: it's known that this.numProcessing >= 0 at this point ["org.apache.hadoop.hbase.master.DeadServer"] At DeadServer.java:[lines 43-202]
[INFO] Primitive is boxed to call Integer.compareTo(Integer): use Integer.compare(int, int) instead ["org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster$1"] At BaseLoadBalancer.java:[lines 914-917]
[INFO] org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster$2.compare(Integer, Integer) incorrectly handles float value ["org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster$2"] At BaseLoadBalancer.java:[lines 929-939]
[INFO] Return value of org.apache.hadoop.hbase.client.Mutation.getRow() ignored, but method has no side effect ["org.apache.hadoop.hbase.regionserver.HRegion"] At HRegion.java:[lines 193-8218]
[INFO] Return value of org.apache.hadoop.hbase.coprocessor.RegionObserver.postAppend(ObserverContext, Append, Result) ignored, but method has no side effect ["org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$46"] At RegionCoprocessorHost.java:[lines 1229-1234]
[INFO] Primitive is boxed to call Long.compareTo(Long): use Long.compare(long, long) instead ["org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$LogsComparator"] At ReplicationSource.java:[lines 504-519]
[INFO] org.apache.hadoop.hbase.tool.Canary$RegionMonitor.run() makes inefficient use of keySet iterator instead of entrySet iterator ["org.apache.hadoop.hbase.tool.Canary$RegionMonitor"] At Canary.java:[lines 996-1227]
{noformat}

I also ran all of the above failed tests locally and they went through fine.
> Incremental Load support for Multiple-Table HFileOutputFormat
> -------------------------------------------------------------
>
>                 Key: HBASE-18161
>                 URL: https://issues.apache.org/jira/browse/HBASE-18161
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Densel Santhmayor
>            Priority: Minor
>         Attachments: MultiHFileOutputFormatSupport_HBASE_18161.patch, MultiHFileOutputFormatSupport_HBASE_18161_v2.patch, MultiHFileOutputFormatSupport_HBASE_18161_v3.patch, MultiHFileOutputFormatSupport_HBASE_18161_v4.patch, MultiHFileOutputFormatSupport_HBASE_18161_v5.patch, MultiHFileOutputFormatSupport_HBASE_18161_v6.patch, MultiHFileOutputFormatSupport_HBASE_18161_v7.patch, MultiHFileOutputFormatSupport_HBASE_18161_v8.patch
>
> h2. Introduction
>
> MapReduce currently supports the ability to write HBase records in bulk to HFiles for a single table. The file(s) can then be uploaded to the relevant RegionServers with reasonable latency. This feature is useful for making a large set of data available for queries at the same time, and it provides a way to efficiently process very large input into HBase without affecting query latencies.
>
> There is, however, no support for writing variations of the same record key to HFiles belonging to multiple HBase tables from within the same MapReduce job.
>
> h2. Goal
>
> The goal of this JIRA is to extend HFileOutputFormat2 to support writing to HFiles for different tables within the same MapReduce job, while keeping the single-table HFile features backwards-compatible.
>
> For our use case, we needed to write a record key to a smaller HBase table for quicker access, and the same record key with a date appended to a larger table for longer-term storage with chronological access. Each of these tables would have different TTL and other settings to support their respective access patterns.
> We also needed to be able to bulk write records with different subsets of very large input to multiple tables as efficiently as possible. Rather than run the MapReduce job multiple times (once for each table or record structure), it would be useful to be able to parse the input a single time and write to multiple tables simultaneously.
>
> Additionally, we'd like to maintain backwards compatibility with the existing heavily-used HFileOutputFormat2 interface, so that benefits such as locality sensitivity (which was introduced long after we implemented support for multiple tables) apply to both single-table and multi-table HFile writes.
>
> h2. Proposal
>
> * Backwards compatibility for the existing single-table support in HFileOutputFormat2 will be maintained; in this case, mappers will need to emit the table rowkey as before. However, a new class - MultiHFileOutputFormat - will provide a helper function to generate a rowkey for mappers that prefixes the desired table name to the existing rowkey, as well as configureIncrementalLoad support for multiple tables.
> * HFileOutputFormat2 will be updated in the following way:
> ** configureIncrementalLoad will now accept multiple table descriptor and region locator pairs, analogous to the single pair currently accepted by HFileOutputFormat2.
> ** Compression, Block Size, Bloom Type and Datablock settings per column family that are set in the Configuration object are now indexed and retrieved by table name AND column family.
> ** getRegionStartKeys will now support multiple region locators and calculate split points, and therefore partitions, collectively for all tables. Similarly, the eventual number of Reducers will be equal to the total number of partitions across all tables.
> ** The RecordWriter class will be able to process rowkeys either with or without the table name prepended, depending on whether configureIncrementalLoad was configured with MultiHFileOutputFormat or HFileOutputFormat2.
> * MultiHFileOutputFormat will write its output into HFiles that match the output format of HFileOutputFormat2. However, while the default use case will keep the existing directory structure (a directory per column family with the HFiles inside it), MultiHFileOutputFormat will output HFiles in the output directory with the following relative paths:
> {noformat}
> --table1
> --family1
> --HFiles
> --table2
> --family1
> --family2
> --HFiles
> {noformat}
>
> This aims to be a comprehensive solution to the original tickets - HBASE-3727 and HBASE-16261. Thanks to [~clayb] for his support.
>
> The patch will be attached shortly.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)