Bulk loading would help a little with the "all-or-nothing" problem, but it still would not be foolproof.

You could have a set of files destined for different tables, with very clear data that needs to be loaded, but if one or more files failed to load, you would have to take some steps to keep retrying.
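
As a rough illustration of that bookkeeping (the loadFile hook below is a hypothetical stand-in for whatever per-file load step you use, e.g. a CsvBulkLoadTool run), the retry loop ends up looking something like this:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class RetryingBulkLoad {

    // Hypothetical hook: plug your real per-file bulk-load step in here.
    interface FileLoader {
        boolean loadFile(String file) throws Exception;
    }

    // Keep retrying failed files for a bounded number of passes, then fail
    // loudly so the overall job is not treated as complete.
    static void loadAll(List<String> files, FileLoader loader, int maxPasses) {
        Deque<String> pending = new ArrayDeque<>(files);
        for (int pass = 0; pass < maxPasses && !pending.isEmpty(); pass++) {
            int remaining = pending.size();
            for (int i = 0; i < remaining; i++) {
                String file = pending.poll();
                try {
                    if (!loader.loadFile(file)) {
                        pending.add(file); // failed this pass, try again later
                    }
                } catch (Exception e) {
                    pending.add(file);     // failed this pass, try again later
                }
            }
        }
        if (!pending.isEmpty()) {
            throw new IllegalStateException("Files never loaded: " + pending);
        }
    }
}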

On 9/27/19 12:22 PM, Gautham Acharya wrote:
We will be reaching 100 million rows early next year, and then billions shortly after that. So, HBase will be needed to scale to that degree.

If one of the tables fails to write, we need some kind of a rollback mechanism, which is why I was considering a transaction. We cannot be in a partial state where some of the ‘views’ are written and some aren’t.

*From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
*Sent:* Friday, September 27, 2019 7:22 AM
*To:* user@phoenix.apache.org
*Subject:* Re: Materialized views in Hbase/Phoenix


For just a few million rows I would go for an RDBMS and not Phoenix/HBase.

You don't really need transactions to control completion; just write a flag (an empty COMPLETED file, for instance) as the final step in your job.
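
A minimal sketch of that flag-file step, assuming the job output lands on a path reachable through the Hadoop FileSystem API (HDFS, S3A, etc.); downstream readers only trust output directories where the marker exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompletionMarker {

    // Drop an empty COMPLETED file next to the job output as the very last step.
    public static void markComplete(String outputDir) throws Exception {
        Path marker = new Path(outputDir, "COMPLETED");
        FileSystem fs = FileSystem.get(marker.toUri(), new Configuration());
        fs.create(marker, true).close(); // zero-byte flag file; overwrite on rerun
    }

    // Readers check the flag before consuming the data.
    public static boolean isComplete(String outputDir) throws Exception {
        Path marker = new Path(outputDir, "COMPLETED");
        FileSystem fs = FileSystem.get(marker.toUri(), new Configuration());
        return fs.exists(marker);
    }
}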

On Fri, 27 Sep 2019, 15:03 Gautham Acharya, <gauth...@alleninstitute.org> wrote:

    Thanks Anil.

    So, what you’re essentially advocating is using some kind of
    Spark/compute framework job (I was going to use AWS Glue) to write
    the ‘materialized views’ as separate tables (maybe tied together
    with some kind of naming convention?).

    In this case, we’d end up with some sticky data consistency issues
    if the write job failed halfway through (some ‘materialized view’
    tables would be updated, and some wouldn’t). Can I use Phoenix
    transactions to wrap the write jobs together, to make sure that
    either all the data is updated or none of it is?
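
    (To sketch what I have in mind: the following assumes the ‘view’
    tables are created with TRANSACTIONAL=true and that
    phoenix.transactions.enabled is set on the cluster; table and
    column names are made up.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AtomicViewRefresh {

        // Rough sketch only: with transactional tables, the upserts below
        // either all become visible at commit() or none do after rollback().
        public static void refreshViews(String jdbcUrl) throws Exception {
            try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
                conn.setAutoCommit(false); // start a Phoenix transaction
                try (Statement st = conn.createStatement()) {
                    st.executeUpdate(
                        "UPSERT INTO MEAN_VIEW (ROW_ID, MEAN_VAL) "
                        + "SELECT ROW_ID, AVG(VAL) FROM MATRIX GROUP BY ROW_ID");
                    st.executeUpdate(
                        "UPSERT INTO MEDIAN_VIEW (ROW_ID, MEDIAN_VAL) "
                        + "SELECT ROW_ID, PERCENTILE_CONT(0.5) WITHIN GROUP "
                        + "(ORDER BY VAL ASC) FROM MATRIX GROUP BY ROW_ID");
                    conn.commit();   // both views become visible together
                } catch (Exception e) {
                    conn.rollback(); // neither view is touched
                    throw e;
                }
            }
        }
    }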

    --gautham

    *From:* anil gupta [mailto:anilgupt...@gmail.com]
    *Sent:* Friday, September 27, 2019 6:58 AM
    *To:* user@phoenix.apache.org
    *Subject:* Re: Materialized views in Hbase/Phoenix


    For your use case, I would suggest creating another table that
    stores the matrix. Since this data doesn't change that often, maybe
    you can write a nightly Spark/MR job to update/rebuild the matrix
    table. (If you want near real time, that is also possible with any
    streaming system.) Have you looked into bloom filters? They might
    help if you have a sparse dataset and you are using Phoenix dynamic
    columns. We use dynamic columns for a table that has up to 40k
    columns. Here is the presentation and the optimizations we made for
    that use case:
    https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
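
    A tiny sketch of how dynamic columns look through JDBC (assuming a
    base table like CREATE TABLE MATRIX (ROW_ID VARCHAR PRIMARY KEY);
    the connection URL and column names are just placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DynamicColumnsSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:localhost");
                 Statement st = conn.createStatement()) {

                // Write into a column that is not declared in the DDL;
                // the dynamic column is typed inline in the UPSERT.
                st.executeUpdate("UPSERT INTO MATRIX (ROW_ID, COL_00042 INTEGER) "
                    + "VALUES ('cell_17', 99)");
                conn.commit();

                // Read it back; dynamic columns must be redeclared with a type.
                try (ResultSet rs = st.executeQuery(
                        "SELECT COL_00042 FROM MATRIX (COL_00042 INTEGER) "
                        + "WHERE ROW_ID = 'cell_17'")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1));
                    }
                }
            }
        }
    }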

    IMO, Hive integration with HBase is not fully baked and has a lot
    of rough edges. So, it is better to stick with native Phoenix/HBase
    if you care about performance and ease of operations.

    HTH,

    Anil Gupta

    On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya
    <gauth...@alleninstitute.org>
    wrote:

        Hi,

        Currently I'm using HBase to store large, sparse matrices of
        integers with 50,000 columns and 10+ million rows.

        This matrix is used for fast, random access - we need to be able
        to fetch random row/column subsets, as well as entire columns.
        We also want to very quickly fetch aggregates (mean, median,
        etc.) on this matrix.

        The data does not change very often for these matrices (a few
        times a week at most), so pre-computing is very feasible here.
        What I would like to do is maintain a column store (storing the
        column names as row keys, along with a compressed list of all
        the row values) for the use case where we select an entire
        column. Additionally, I would like to maintain a separate table
        for each precomputed aggregate (median table, mean table, etc.).
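
        Roughly, I am picturing tables along these lines (illustrative
        Phoenix DDL only; names and types are placeholders):

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class ViewTablesSetup {
            public static void createTables(String jdbcUrl) throws Exception {
                try (Connection conn = DriverManager.getConnection(jdbcUrl);
                     Statement st = conn.createStatement()) {
                    // Column store: matrix column name as the row key, with the
                    // compressed row values kept as a single blob per column.
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS COLUMN_STORE ("
                        + "COL_NAME VARCHAR PRIMARY KEY, ROW_VALUES VARBINARY)");
                    // One small table per precomputed aggregate.
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS COL_MEAN ("
                        + "COL_NAME VARCHAR PRIMARY KEY, MEAN_VAL DOUBLE)");
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS COL_MEDIAN ("
                        + "COL_NAME VARCHAR PRIMARY KEY, MEDIAN_VAL DOUBLE)");
                }
            }
        }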

        The query time for all these use cases needs to be low latency -
        under 100ms.

        When the data does change for a certain matrix, it would be nice
        to easily update the optimized tables. Ideally, I would like the
        column store/aggregation tables to just be materialized views of
        the original matrix. It doesn't look like Apache Phoenix
        supports materialized views. It looks like Hive does, but
        unfortunately Hive doesn't normally offer low latency queries.

        Maybe Hive could create the materialized view, and we could just
        query the underlying HBase store for lower latency responses?

        What would be a good solution for this?

        --gautham




--
    Thanks & Regards,
    Anil Gupta
