Thanks for the update. I'm forwarding your message back to the list.

Just to confirm, was the query time longer on the one with the
incorrect result? In the incorrect case I think we are just misreading the
HBase metadata during our optimization to return row counts without reading
any data. This should be really fast, and noticeably different from running
a complete query, even with a small dataset, since a complete query has to
read in your table and run an aggregation over it.
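
For reference, the two kinds of query being compared would look roughly
like the sketch below. The table name is the one mentioned earlier in this
thread; the column family `cf1` and column `c1` are placeholders, not names
from the actual table, so substitute your own.

```sql
-- Likely answered from HBase metadata by the row-count optimization,
-- without scanning the table (the case suspected of returning wrong counts).
SELECT COUNT(*) AS total_rows
FROM hbase.`customer_staged`;

-- Counting one specific column should force a real scan plus aggregation.
-- `cf1` (column family) and `c1` (column) are hypothetical names.
-- Note: COUNT on a column skips NULLs, so this only equals the row count
-- if every row has a value for that column.
SELECT COUNT(t.`cf1`.`c1`) AS total_rows
FROM hbase.`customer_staged` t;
```

If the first query returns almost instantly while the second takes
noticeably longer, that would point at the metadata shortcut rather than
the scan itself.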

This would just be a final confirmation of where the issue is occurring. I
will hopefully have time soon to get this fixed, but I'm wrapping up some
other things right now.


---------- Forwarded message ----------
From: Kumiko Yada <kumiko.y...@ds-iq.com>
Date: Thu, Jan 14, 2016 at 12:53 PM
Subject: RE: Drill query does not return all results from HBase
To: Jason Altekruse <altekruseja...@gmail.com>


Jason,



I’m sorry.  My testing was incorrect last night.  I’m not sure what I did
differently; however, your guess was correct.  When I did the one column
count, the row count was correct.  Here are the additional test results.



My company has invested in using Drill, and it’s very important for
us that this is fixed.  Let me know if I can do anything to help get this
issue fixed.  I really appreciate that you are looking into this issue!

HBase table (1 column family, 5 columns, 10000000 rows)
  COUNT(*) - row count is correct
  1 column count - row count is correct

HBase table (1 column family, 6 columns, 10000000 rows)
  COUNT(*) - row count is incorrect (returned 6724 rows)
  1 column count - row count is correct

HBase table (2 column families, 6 columns in each column family,
10000000 rows)
  COUNT(*) - row count is incorrect (returned 3362 rows)
  1 column count - row count is correct

HBase table (2 column families, 2 columns in each column family,
10000000 rows)
  COUNT(*) - row count is correct
  1 column count - row count is correct

HBase table (2 column families, 4 columns in one column family and
2 columns in the other column family, 10000000 rows)
  COUNT(*) - row count is incorrect (returned 6723 rows)
  1 column count - row count is correct

HBase table (2 column families, 1 column in one column family and
3 columns in the other column family, 10000000 rows)
  COUNT(*) - row count is correct
  1 column count - row count is correct



Thanks

Kumiko



*From:* Kumiko Yada
*Sent:* Wednesday, January 13, 2016 7:28 PM
*To:* 'Jason Altekruse' <altekruseja...@gmail.com>
*Cc:* Ki Kang <ki.k...@ds-iq.com>; Kevin Verhoeven <
kevin.verhoe...@ds-iq.com>
*Subject:* RE: Drill query does not return all results from HBase



I also ran the query to display only 1 column with no limit to try to force
a full scan, but the result was the same, just 10000 rows selected.  With
the same table (contains 6 columns), I ran the query to display the row_key,
and it displayed all records, 10,000,000 rows.



-Kumiko



*From:* Kumiko Yada
*Sent:* Wednesday, January 13, 2016 7:24 PM
*To:* 'Jason Altekruse' <altekruseja...@gmail.com>
*Cc:* Ki Kang <ki.k...@ds-iq.com>; Kevin Verhoeven <
kevin.verhoe...@ds-iq.com>
*Subject:* RE: Drill query does not return all results from HBase



Jason



I ran the query to display only 1 column for 100000 rows, and it only
returned 10000 rows.



-Kumiko



*From:* Jason Altekruse [mailto:altekruseja...@gmail.com]
*Sent:* Wednesday, January 13, 2016 6:39 PM
*To:* Kumiko Yada <kumiko.y...@ds-iq.com>
*Cc:* Ki Kang <ki.k...@ds-iq.com>; Kevin Verhoeven <
kevin.verhoe...@ds-iq.com>

*Subject:* Re: Drill query does not return all results from HBase



I know in a number of cases we have special optimizer rules that try to
skip reading the dataset altogether if we have metadata for the number of
rows and all that is requested is a count(*). I assume that this is the
case with HBase, and this may be where we aren't doing something correctly.
Can you try to run a 'sum', or other aggregate query on one of the columns
to see if a full scan of the data is operating correctly?



On Wed, Jan 13, 2016 at 6:27 PM, Kumiko Yada <kumiko.y...@ds-iq.com> wrote:

Thank you, Jason!

Let me know if you need any help on this. I will be glad to help reproduce
the issue and/or test the fix.

Thanks
Kumiko

-----Original Message-----
From: Jason Altekruse [mailto:altekruseja...@gmail.com]
Sent: Wednesday, January 13, 2016 6:24 PM
To: user <user@drill.apache.org>

Cc: Aditya Kishore <adityakish...@gmail.com>; Kevin Verhoeven <
kevin.verhoe...@ds-iq.com>
Subject: Re: Drill query does not return all results from HBase

Thanks for filing the issue. I haven't worked much with HBase, but this is
a critical wrong-results issue, so I will be taking a look at this soon if
no one else raises their hand.

On Wed, Jan 13, 2016 at 6:20 PM, Kumiko Yada <kumiko.y...@ds-iq.com> wrote:

> I opened a bug on this.  Drill is returning the correct rows when the
> HBase table contains 5 or fewer columns, but not with 6 or more columns.
>
> https://issues.apache.org/jira/browse/DRILL-4271
>
> Thanks
> Kumiko
>
> -----Original Message-----
> From: Kumiko Yada [mailto:kumiko.y...@ds-iq.com]
> Sent: Wednesday, January 13, 2016 4:52 PM
> To: user@drill.apache.org
> Cc: Aditya Kishore <adityakish...@gmail.com>; Kevin Verhoeven <
> kevin.verhoe...@ds-iq.com>
> Subject: RE: Drill query does not return all results from HBase
>
> We are using HBase 1.0.0 and CDH 5.4.  I found that the correct row
> count is returned when the HBase table contains only 1 column family
> with 1 column, but an incorrect row count is returned when the HBase
> table contains 1 column family with 6 columns.
>
> This looks like a Drill issue.  Has anyone found any workaround?
>
> Thanks
> Kumiko
>
> -----Original Message-----
> From: Abhishek Girish [mailto:abhishek.gir...@gmail.com]
> Sent: Tuesday, January 12, 2016 6:51 PM
> To: user <user@drill.apache.org>
> Cc: Aditya Kishore <adityakish...@gmail.com>
> Subject: Re: Drill query does not return all results from HBase
>
> Well, the major version didn't change if I remember it right, hence I did
> not share the info in my previous mail. I'm on HBase 1.1.1 right now
> and don't see the issue. Also, I am on a MapR setup, which might not
> be comparable with their CDH setups.
>
> On Tue, Jan 12, 2016 at 5:50 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>
> > Abhishek,
> >
> > What version of HBase did you have the problem with, and what
> > version did you upgrade to that solved the problem? I assume this
> > would be useful information to compare your setup with Kevin's and
> > Kumiko's.
> >
> > - Jason
> >
> > On Tue, Jan 12, 2016 at 10:41 AM, Abhishek Girish <abhishek.gir...@gmail.com> wrote:
> >
> > > I hit a very similar issue recently. Via HBase shell, I was able
> > > to fetch all records, whereas I was only able to see a small
> > > subset of records when queried from Drill. Each time I inserted
> > > 1000 records, only about 50 of those would show up.
> > >
> > > Although I could repro the problem consistently, it was resolved
> > > once I updated my Hadoop setup. My guess is that it was an HBase
> > > bug which got resolved. Strange as it seems, it might not have to
> > > do with Drill itself.
> > >
> > > -Abhishek
> > >
> > > On Tue, Jan 12, 2016 at 7:52 AM, Jason Altekruse <altekruseja...@gmail.com> wrote:
> > >
> > > > I'm not sure why this is happening. We have tests in our
> > > > automated suite that I believe run some pretty large queries
> > > > against HBase and verify the results.
> > > >
> > > > Aditya, do you have some time available to try to reproduce this
> > > > and diagnose the problem?
> > > >
> > > > On Wed, Jan 6, 2016 at 2:03 PM, Kumiko Yada <kumiko.y...@ds-iq.com> wrote:
> > > >
> > > > > I'm having the same issue.  Is there any workaround for this?
> > > > >
> > > > > Thanks
> > > > > Kumiko
> > > > >
> > > > > -----Original Message-----
> > > > > From: Kevin Verhoeven [mailto:kevin.verhoe...@ds-iq.com]
> > > > > Sent: Monday, December 21, 2015 10:37 AM
> > > > > To: user@drill.apache.org
> > > > > Subject: Drill query does not return all results from HBase
> > > > >
> > > > > We have a problem where a Drill query against HBase does not
> > > > > return all results. The following query should return over
> > > > > 100,000 rows, but we only get about 1,030 back.
> > > > >
> > > > > SELECT row_key FROM `hbase`.`customer_staged`
> > > > > WHERE customer_number = 800
> > > > >
> > > > > If we scan directly using the hbase shell we see over 100,000
> > > > > rows, but the same Drill query returns only a fraction of the
> > > > > expected results. We have also run a count against the table
> > > > > and Drill returns the same 1,030 number, which is far less than
> > > > > expected. What could be going wrong?
> > > > >
> > > > > We are running Drill 1.2 on Ubuntu 14.04 against CDH 5.4.3
> > > > > (HBase 1.0). We run HBase on six RegionServers, and the table
> > > > > has about 1.3 billion rows.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Kevin
> > > > >
> > > > >
> > > >
> > >
> >
>
