Re: HBase scan returns inconsistent results on multiple runs for same dataset

2017-03-02 Thread Hef
I ran the tests in the following scenarios:
1. Ran the tasks with the old client 5 times and got 5 different values for
the 'Map input records' counter, ranging from 470k to 630k.
2. Ran the tasks with the new client 5 times and got the same value every
time, 2.6m, which is much larger than any value from step 1.
3. The RegionServers were not restarted during the tests.
4. The scan criteria were the same throughout the tests.
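
For anyone repeating this kind of comparison, here is a minimal sketch of how the counters from repeated runs could be summarized. The class and method names are hypothetical, and the sample values are made up: only the rough endpoints (470k and 630k) and the 2.6m figure come from the runs above.

```java
import java.util.Arrays;

// Hypothetical helper: summarizes record counters collected from repeated runs.
final class CounterSpread {

    // Number of distinct counter values; 1 means every run agreed.
    static long distinctValues(long[] counts) {
        return Arrays.stream(counts).distinct().count();
    }

    // Fraction by which the smallest counter falls short of the largest.
    static double shortfall(long[] counts) {
        long min = Arrays.stream(counts).min().getAsLong();
        long max = Arrays.stream(counts).max().getAsLong();
        return (double) (max - min) / max;
    }

    public static void main(String[] args) {
        // Illustrative values only; the intermediate numbers are invented.
        long[] oldClient = {470_000, 520_000, 555_000, 600_000, 630_000};
        long[] newClient = {2_600_000, 2_600_000, 2_600_000, 2_600_000, 2_600_000};
        System.out.printf("old client: %d distinct, shortfall %.1f%%%n",
                distinctValues(oldClient), 100 * shortfall(oldClient));
        System.out.printf("new client: %d distinct, shortfall %.1f%%%n",
                distinctValues(newClient), 100 * shortfall(newClient));
    }
}
```

A stable scan over an unchanged dataset should report one distinct value and a shortfall of zero.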





Re: HBase scan returns inconsistent results on multiple runs for same dataset

2017-03-02 Thread Ted Yu
Since the ClientScanner cache might or might not have been empty during your
test runs, it is hard to tell whether you hit the bug described by HBASE-15378.

I would suggest upgrading to a release that includes HBASE-15378.



Re: HBase scan returns inconsistent results on multiple runs for same dataset

2017-03-02 Thread Hef
Thanks for the hint. It led me to investigate the client side, and I finally
got this problem resolved.

I reviewed the code and found that my project was using an old version of
hbase-client, 1.0.0-cdh5.6.1. After I updated it to 1.2.0-cdh5.9.0,
consistent with the version the server is running, my tasks work correctly.
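
A minimal sketch of the kind of check that would have flagged this mismatch early, assuming the version strings follow the `major.minor.patch-vendor` shape seen above. The class and method names are hypothetical, and the two strings are hard-coded from this message.

```java
// Hypothetical helper: compares client and server HBase version strings.
final class ClientVersionCheck {

    // Strips a vendor suffix such as "-cdh5.9.0", leaving e.g. "1.2.0".
    static String baseVersion(String version) {
        int dash = version.indexOf('-');
        return dash < 0 ? version : version.substring(0, dash);
    }

    // True when the major and minor components of both versions agree.
    static boolean sameMajorMinor(String client, String server) {
        String[] c = baseVersion(client).split("\\.");
        String[] s = baseVersion(server).split("\\.");
        return c.length >= 2 && s.length >= 2
                && c[0].equals(s[0]) && c[1].equals(s[1]);
    }

    public static void main(String[] args) {
        // In a real job the client string could come from
        // org.apache.hadoop.hbase.util.VersionInfo.getVersion(); the server
        // version is shown on the Master web UI.
        String oldClient = "1.0.0-cdh5.6.1";
        String server = "1.2.0-cdh5.9.0";
        if (!sameMajorMinor(oldClient, server)) {
            System.out.println("hbase-client " + oldClient
                    + " does not match server " + server);
        }
    }
}
```

Comparing only major and minor is a design choice for this sketch; a stricter check could compare the full string, vendor suffix included.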

I looked into the source of HBase 1.2.0-cdh5.9.0, and HBASE-15378 is not
patched there. I also went through all the CDH HBase release notes from 5.6
to 5.9 and found no mention of this inconsistent scan behavior.
Though the problem is resolved for now, I have no idea what the root cause
actually is, or whether it will come back without HBASE-15378 if my dataset
grows larger.





Re: HBase scan returns inconsistent results on multiple runs for same dataset

2017-03-01 Thread Sean Busbey
The place to check for JIRAs included on top of those in the ASF release is here:

http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.9.1.releasenotes.html

HBASE-15378 is not in CDH5.9.1.



Re: HBase scan returns inconsistent results on multiple runs for same dataset

2017-03-01 Thread Ted Yu
I don't see it here:

http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.9.1.CHANGES.txt?_ga=1.10311413.1914112506.1454459553



Re: HBase scan returns inconsistent results on multiple runs for same dataset

2017-03-01 Thread Hef
I'm using CDH 5.9. The documentation shows its HBase version is
hbase-1.2.0+cdh5.9.1+222
(https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_59.html).
I don't know whether HBASE-15378 is included.



Re: HBase scan returns inconsistent results on multiple runs for same dataset

2017-03-01 Thread Ted Yu
Which HBase version are you using?

Does it include HBASE-15378?



HBase scan returns inconsistent results on multiple runs for same dataset

2017-03-01 Thread Hef
Hi,
I'm encountering strange behavior in MapReduce when using HBase as the input
format. I run my MR tasks on the same table and the same dataset, with the
same Fuzzy Row Filter pattern, multiple times. The Input Records counters
are not consistent: the smallest number can be 40% less than the
largest one.

More specifically,
- The table is split into 18 regions, distributed across 3 region servers.
The TTL is set to 10 days for the records, though the dataset for MR only
includes those inserted in the last 7 days.

- The row key is defined as:
salt(1 byte) + time_of_hour(4 bytes) + uuid(36 bytes)


- The scan is created as below:

Scan scan = new Scan();
scan.setBatch(100);
scan.setCaching(1);
scan.setCacheBlocks(false);
scan.setMaxVersions(1);


And the row filter for the scan is a FuzzyRowFilter that keeps only
events of a given time_of_hour.
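
A minimal sketch of how the fuzzy key and mask for such a filter could be built from the row key layout above. It assumes time_of_hour is stored as a big-endian 4-byte int, which is a guess about the encoding, and the class and method names are hypothetical. In the FuzzyRowFilter mask, 0 marks a byte that must match and 1 marks a byte that may be anything.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper for the layout salt(1) + time_of_hour(4) + uuid(36).
final class FuzzyKeySketch {
    static final int SALT_LEN = 1;
    static final int HOUR_LEN = 4;
    static final int UUID_LEN = 36;
    static final int KEY_LEN = SALT_LEN + HOUR_LEN + UUID_LEN;

    // Template key: only the time_of_hour bytes are meaningful, the rest stay 0.
    static byte[] fuzzyKey(int timeOfHour) {
        byte[] key = new byte[KEY_LEN];
        // ByteBuffer writes big-endian by default (assumed encoding).
        ByteBuffer.wrap(key, SALT_LEN, HOUR_LEN).putInt(timeOfHour);
        return key;
    }

    // Mask: salt and uuid bytes are wildcards (1), time_of_hour bytes are fixed (0).
    static byte[] fuzzyMask() {
        byte[] mask = new byte[KEY_LEN];
        Arrays.fill(mask, (byte) 1);
        Arrays.fill(mask, SALT_LEN, SALT_LEN + HOUR_LEN, (byte) 0);
        return mask;
    }

    public static void main(String[] args) {
        byte[] key = fuzzyKey(413_000);   // arbitrary example value
        byte[] mask = fuzzyMask();
        System.out.println(Arrays.toString(key));
        System.out.println(Arrays.toString(mask));
        // With hbase-client on the classpath, the pair would be passed as
        // new FuzzyRowFilter(Arrays.asList(new Pair<>(key, mask)))
        // and attached with scan.setFilter(...).
    }
}
```

Because the salt byte is a wildcard, the filter has to be evaluated in every region, so all 18 regions take part in each scan.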

Everything looks fine, but the results are not what I expect.
The same task was run 10 times; the Input Records counters show 6 different
numbers, and the final output shows 6 different results.

Has anyone faced this problem before?
What could cause this inconsistency in HBase scan results?

Thanks