RE: Get addColumn + ColumnRangeFilter

Taeyun Kim Sun, 18 Jan 2015 16:40:25 -0800

Thanks.

But in my case it is unlikely that FirstColumnName would be included in the 
range. (If it were included, it would cause a problem.)


Instead, since the number of splits is usually 1, I will add the name of the 
first split to the first Get with addColumn(). That way, most queries can be 
satisfied with a single Get.
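
A rough sketch of that plan, independent of the HBase client so it can be run on its own. It assumes the first Get asks for 'm', the parts file and split 0, and that the split count read from the parts file decides whether a second Get is needed. The class and method names are hypothetical and do not come from my actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: decides which column qualifiers each Get should request. */
final class DirectoryReadPlan {
    /** Qualifier of the meta 'file' described in this thread. */
    static final String META = "m";

    /** Qualifiers for the first Get: meta, parts file, and the first split. */
    static List<String> firstGetColumns(String fileName) {
        return List.of(META, fileName + ".p", fileName + ".0");
    }

    /**
     * Qualifiers still missing after the first Get, given the split count
     * read from the parts file. Empty when there is only one split, which
     * is the common case, so no second Get is needed.
     */
    static List<String> remainingSplitColumns(String fileName, int splitCount) {
        List<String> rest = new ArrayList<>();
        for (int i = 1; i < splitCount; i++) {
            rest.add(fileName + "." + i);
        }
        return rest;
    }
}
```

Each qualifier from firstGetColumns() would be passed to Get.addColumn(); only when remainingSplitColumns() is non-empty would a second Get (with ColumnRangeFilter or further addColumn() calls) be issued.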

Thanks again.

-----Original Message-----
From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: Saturday, January 17, 2015 6:31 AM
To: user@hbase.apache.org
Subject: Re: Get addColumn + ColumnRangeFilter

To clarify what I meant, the test passes with the following change:

      Get g = new Get(RowKey);
      byte[] minColumn = new byte[]{(byte)0};
      int cmpMin = Bytes.compareTo(FirstColumnNameBytes, 0, FirstColumnNameBytes.length,
        minColumn, 0, minColumn.length);
      byte[] maxColumn = Bytes.toBytes("~");
      int cmpMax = Bytes.compareTo(FirstColumnNameBytes, 0, FirstColumnNameBytes.length,
        maxColumn, 0, maxColumn.length);
      if (cmpMin <= 0 || cmpMax >= 0) {
        g.addColumn(ColumnFamilyNameBytes, FirstColumnNameBytes);  // should be redundant...
      }
      g.setFilter(new ColumnRangeFilter(minColumn, false,
        maxColumn, false));  // ...since this includes the first column

FYI
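
The range test in the snippet above can also be tried without HBase on the classpath. The following is a minimal sketch, assuming Java 9+ (for Arrays.compareUnsigned, which compares byte arrays lexicographically as unsigned bytes, the same ordering Bytes.compareTo uses). The class and method names are hypothetical, and both bounds are treated as exclusive to match the two `false` arguments passed to ColumnRangeFilter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-alone version of the cmpMin/cmpMax check. */
final class ColumnRangeCheck {
    /**
     * Returns true if the qualifier lies strictly between min and max,
     * i.e. an exclusive-bounds range would already select this column.
     */
    static boolean strictlyInside(byte[] qualifier, byte[] min, byte[] max) {
        return Arrays.compareUnsigned(qualifier, min) > 0
            && Arrays.compareUnsigned(qualifier, max) < 0;
    }

    /** Convenience conversion for ASCII/UTF-8 qualifier names. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With the values from this thread, "fc" is strictly inside the range from (byte)0 to "~", so addColumn() would be skipped, whereas "m" is outside the range from "Result.csv" to "Result.csv~".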

On Fri, Jan 16, 2015 at 7:23 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Thanks for the background information.
>
> For your last question, the columns given by addColumn() calls (which 
> ColumnTracker uses) are checked first.
> So yes.
>
> Relaxing this limitation may take some effort - ScanQueryMatcher could 
> take the Filter the user passes into account. But the filter may not be 
> a ColumnRangeFilter. It can be a FilterList involving ColumnRangeFilter.
> Adding such logic to ScanQueryMatcher#match() would make the code less 
> maintainable.
>
> Can you check whether the column in addColumn() is covered by the 
> ColumnRangeFilter and if so, do not call addColumn() ?
>
> Cheers
>
> On Thu, Jan 15, 2015 at 11:35 PM, Taeyun Kim 
> <taeyun....@innowireless.com>
> wrote:
>
>> It's a somewhat long story.
>> Maybe I use HBase in some weird way.
>>
>> My use case is as follows:
>>
>> I didn't want to put many small files into HDFS. (It is bad for 
>> HDFS, both for scalability and performance.)
>>
>> The small files are grouped by test log, since the files are the many 
>> facets of the analysis result of one test log. So, they could 
>> be the members of one SequenceFile.
>> But I did not find SequenceFile (or other similar formats) attractive, 
>> since I would still get many not-so-big (about 20MB, except for rare 
>> cases) SequenceFiles: the analysis result files are not so 
>> big and the test log files are continually generated.
>> So some manual file management and merging would be a must.
>>
>> So, I decided to use an HBase record as a kind of 'directory' to avoid 
>> the manual file management. (directory = file group) This way, the 
>> 'files' are automatically 'merged' into appropriately sized HFiles, 
>> and as a bonus the 'files' can be automatically deleted when their 
>> lifetime is over.
>>
>> The 'directory' has the following files.
>>
>> - 'm': meta file. (to check the version of the 'directory' format)
>> - 'Result.csv.0'
>> - 'Result.csv.1'
>> - ...
>> - 'Result.csv.p': parts file. (has the split count and each size. 'p' 
>> is for 'parts')
>> - 'AnotherResultA.csv.0'
>> - 'AnotherResultA.csv.1'
>> - ...
>> - 'AnotherResultA.csv.p'
>> - 'TestEnvironment.txt'
>>
>> Each 'file' is saved as a column.
>>
>> Result files are split for the following reasons:
>> - To handle the extreme case where a file is too big to be processed by one task.
>> - To save task process memory: the split size is actually smaller 
>> than 64MB (the size for one task) and each split is individually 
>> compressed. This way, a task process has at most one column 
>> uncompressed at a time. A task is assigned multiple 'splits'.
>>
>> For this, I've written an InputFormat class.
>>
>> Now, the InputFormat class can first Get both 'm' and a parts file to 
>> get the inputSplit information. This is not a problem. A single Get 
>> with 2 addColumn() calls is sufficient.
>> But when the whole content of a file must be read (like 
>> Files.readAllBytes()), I must Get 'm' and an unknown number of splits 
>> that fall in a name range (Result.csv.0 ~ Result.csv.7) to read the 
>> whole content with a single Get. (addColumn() + ColumnRangeFilter) 
>> But with the current HBase behavior, it seems that I have to invoke 
>> 2 Gets, or disable the version check. (Maybe not a big deal?)
>>
>> That's all.
>>
>> If you think that this record layout is not efficient, or that there 
>> is a better solution, please let me know.
>>
>> BTW, in the current implementation, when both addColumn() and 
>> ColumnRangeFilter are applied, they are effectively combined with an 
>> 'AND' operator. Right?
>>
>> -----Original Message-----
>> From: Ted Yu [mailto:yuzhih...@gmail.com]
>> Sent: Friday, January 16, 2015 3:39 PM
>> To: user@hbase.apache.org
>> Subject: Re: Get addColumn + ColumnRangeFilter
>>
>> I reproduced the failed test (testAddColumnWithColumnRangeFilter) 
>> after modifying your test case to fit master branch.
>>
>> The reason for one Cell being returned is that ExplicitColumnTracker 
>> is used by ScanQueryMatcher to first check if the column is part of 
>> the requested columns (f:fc in your case). The other columns don't 
>> pass this check, hence they're not included in the result.
>>
>> Before this part of the code is changed, can I ask why you need to call
>> g.addColumn() when g has a ColumnRangeFilter associated with it?
>>
>> Cheers
>>
>> On Thu, Jan 15, 2015 at 6:22 PM, Taeyun Kim 
>> <taeyun....@innowireless.com>
>> wrote:
>>
>> > (Sorry if this mail is a duplicate)
>> >
>> > Hi Ted,
>> >
>> > I've attached 2 unit test classes.
>> >
>> > Both have one failed test.
>> >
>> > - HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter():
>> >   Expected: 10, Actual 1
>> > - HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter():
>> >   Result is empty
>> >
>> > If the tests have problems, please let me know.
>> >
>> >
>> > -----Original Message-----
>> > From: Ted Yu [mailto:yuzhih...@gmail.com]
>> > Sent: Thursday, January 15, 2015 6:59 PM
>> > To: user@hbase.apache.org
>> > Subject: Re: Get addColumn + ColumnRangeFilter
>> >
>> > Can you write a unit test which shows this behavior?
>> >
>> > Thanks
>> >
>> >
>> >
>> > > On Jan 14, 2015, at 9:09 PM, Taeyun Kim <taeyun.kim.innowirel...@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > >
>> > >
>> > > I have a situation where both Get.addColumn() and 
>> > > Get.setFilter(new ColumnRangeFilter(…)) are needed on a single Get.
>> > >
>> > > The source code snippet is as follows:
>> > >
>> > >
>> > >
>> > >        Get g = new Get(getRowKey(lfileId));
>> > >        g.addColumn(Schema.ColumnFamilyNameBytes, MetaColumnNameBytes);
>> > >        g.setFilter(new ColumnRangeFilter(Bytes.toBytes(name), false,
>> > >            Bytes.toBytes(name + "~"), false));
>> > >        Result r = table.get(g);
>> > >
>> > >        if (r.isEmpty())
>> > >            throw new FileNotFoundException(
>> > >                String.format("%d:%d:%s", projectId, lfileId, name));
>> > >
>> > >
>> > >
>> > > When g.addColumn() is commented out, the Result is not empty, 
>> > > while with g.addColumn() the Result is empty (FileNotFoundException 
>> > > is thrown).
>> > >
>> > > Is it illegal to use both methods?
>> > >
>> > >
>> > >
>> > > BTW, the version of HBase used is 0.98. (Hortonworks HDP 2.1)
>> > >
>> > >
>> > >
>> > > Thanks.
>> >
>>
>>
>
