Hi Willem,
Let us know how we can help as you start getting into this, in particular
with your schema design based on your query requirements.
Thanks,
James

On Mon, Jan 18, 2016 at 8:50 AM, Pariksheet Barapatre <pbarapa...@gmail.com>
wrote:

> Hi Willem,
>
> Use Phoenix bulk load. I guess your source is CSV, so the Phoenix CSV bulk
> loader (CsvBulkLoadTool) can be used.
>
> How frequently do you want to load these files? If you can wait for a
> certain interval, you can merge the files and let a MapReduce job bulk
> load them into the Phoenix table.
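>
> For example, a merged day's worth of files could be loaded with something
> like this (the table name, input path and ZooKeeper quorum are placeholders
> to adjust for your cluster):
>
>   # MapReduce-based CSV bulk load into an existing Phoenix table
>   hadoop jar phoenix-<version>-client.jar \
>     org.apache.phoenix.mapreduce.CsvBulkLoadTool \
>     --table DATA_USAGE_CDR \
>     --input /staging/cdr/2016-01-18/ \
>     --zookeeper zk1,zk2,zk3:2181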
>
> Cheers
> Pari
> On 18-Jan-2016 4:17 pm, "Willem Conradie" <willem.conra...@pbtgroup.co.za>
> wrote:
>
>> Hi Pari,
>>
>>
>>
>> My comments are inline below.
>>
>>
>>
>> Few notes from my experience :
>>
>> 1. Use bulk load rather than psql.py. Load larger (merged) files instead
>> of many small files.
>>
>> Are you referring to the native HBase bulk load or the Phoenix MapReduce
>> bulk load? Unfortunately we can’t change how the files are received from
>> the source. Do we have to pre-process and merge the files before running
>> the bulk load utility?
>>
>>
>>
>> 2. Increase HBase block cache
>>
>> 3. Turn off HBase auto compaction
>>
>> 4. Select primary key correctly
>> 5. Don't use salting. As the table will be huge, a salted Phoenix query
>> will fork many scanners. Try something like a hash on userid instead.
>> 6. Define TTL to purge data periodically
>>
>>
>>
>>
>>
>> Regards,
>>
>> Willem
>>
>>
>>
>> From: Pariksheet Barapatre [mailto:pbarapa...@gmail.com]
>> Sent: 15 January 2016 03:17 PM
>> To: user@phoenix.apache.org
>> Subject: Re: Telco HBase POC
>>
>>
>>
>> Hi Willem,
>>
>> Looking at your use case, Phoenix would be a handy client.
>>
>> Few notes from my experience :
>>
>> 1. Use bulk load rather than psql.py. Load larger (merged) files instead
>> of many small files.
>>
>> 2. Increase HBase block cache (sample hbase-site.xml settings for 2 and 3
>> are after this list)
>>
>> 3. Turn off HBase auto compaction
>>
>> 4. Select primary key correctly
>>
>> 5. Don't use salting. As the table will be huge, a salted Phoenix query
>> will fork many scanners. Try something like a hash on userid instead (a
>> rough DDL sketch follows after this list).
>>
>> 6. Define TTL to purge data periodically
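>>
>> For 2 and 3, the usual hbase-site.xml knobs look roughly like this (the
>> values are only starting points, tune them for your cluster):
>>
>>   <!-- fraction of region server heap given to the block cache -->
>>   <property>
>>     <name>hfile.block.cache.size</name>
>>     <value>0.4</value>
>>   </property>
>>   <!-- 0 disables automatic major compactions; run them manually off-peak -->
>>   <property>
>>     <name>hbase.hregion.majorcompaction</name>
>>     <value>0</value>
>>   </property>
>>
>> For 4, 5 and 6, a rough DDL sketch (table and column names are invented
>> for illustration; HASH_PREFIX would be computed by your loader, e.g.
>> hash(userid) mod N, instead of Phoenix salting):
>>
>>   CREATE TABLE DATA_USAGE_CDR (
>>     HASH_PREFIX UNSIGNED_TINYINT NOT NULL,  -- hash(userid) mod N
>>     USERID      VARCHAR          NOT NULL,
>>     EVENT_TIME  TIMESTAMP        NOT NULL,
>>     SITE        VARCHAR,
>>     BYTES_USED  UNSIGNED_LONG,
>>     CONSTRAINT PK PRIMARY KEY (HASH_PREFIX, USERID, EVENT_TIME)
>>   ) TTL=7776000;  -- ~90 days; expired rows are purged by HBase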
>>
>>
>>
>> Cheers
>>
>> Pari
>>
>>
>>
>>
>>
>> On 15 January 2016 at 17:48, Pedro Gandola <pedro.gand...@gmail.com>
>> wrote:
>>
>> Hi Willem,
>>
>> Just to share my brief experience as a Phoenix user.
>>
>> I'm using Phoenix 4.4 on top of an HBase cluster where I keep 3 billion
>> entries.
>>
>> In our use case Phoenix is doing very well, and it has saved us a lot of
>> code complexity and time. If you have already decided that HBase is the
>> way to go, then having Phoenix as a SQL layer will help a lot, not only in
>> terms of code simplicity: it will also help you create and maintain your
>> indexes and views, which can be hard and costly tasks with the plain HBase
>> API. Joining tables is just a simple SQL join :).
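>>
>> For example, a secondary index and a join are one-liners (the table and
>> column names here are made up just to show the shape of it):
>>
>>   -- secondary index maintained by Phoenix, no extra HBase code needed
>>   CREATE INDEX IDX_USAGE_BY_SITE ON DATA_USAGE_CDR (SITE) INCLUDE (BYTES_USED);
>>
>>   -- join usage rows against a small lookup table
>>   SELECT u.USERID, s.CATEGORY, SUM(u.BYTES_USED) AS TOTAL_BYTES
>>   FROM DATA_USAGE_CDR u
>>   JOIN SITE_LOOKUP s ON u.SITE = s.SITE
>>   GROUP BY u.USERID, s.CATEGORY;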
>>
>>
>>
>> And there are many more useful features that make your life with HBase
>> easier.
>>
>> In terms of performance, and depending on the SLAs that you have, you will
>> need to benchmark; however, I think your main battles are going to be with
>> HBase itself: JVM GCs, network, file system, etc.
>>
>>
>> I would say give Phoenix a try, for sure.
>>
>> Cheers
>> Pedro
>>
>>
>>
>> On Fri, Jan 15, 2016 at 9:12 AM, Willem Conradie <
>> willem.conra...@pbtgroup.co.za> wrote:
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am currently consulting at a client with the following requirements.
>>
>>
>>
>> They want to make detailed data usage CDRs available for customers to
>> verify their data usage against the websites that they visited. In short,
>> this can be seen as an itemised bill for data usage. The data is currently
>> not loaded into an RDBMS due to the volumes of data involved. The proposed
>> solution is to load the data into HBase, running on an HDP cluster, and
>> make it available for querying by the subscribers. It is critical to
>> ensure low-latency read access to the subscriber data, which will possibly
>> be exposed to 25 million subscribers. We will first run a scaled-down
>> version as a proof of concept, with the intention of it becoming an
>> operational data store. Once the solution is functioning properly for the
>> data usage CDRs, other CDR types will be added, so we need to build a
>> cost-effective, scalable solution.
>>
>>
>>
>> I am thinking of using Apache Phoenix for the following reasons:
>>
>>
>>
>> 1. Current data loading into the RDBMS is file based (CSV) via a staging
>> server using the RDBMS file load drivers.
>>
>> 2. Use the Apache Phoenix bin/psql.py script to mimic the above process
>> and load into HBase (see the example after this list).
>>
>> 3. Expected data volume: 60 000 files per day, 1 to 10 MB per file,
>> 500 million records per day, 500 GB total volume per day.
>>
>> 4. Use the Apache Phoenix client for low-latency data retrieval (see the
>> query example after this list).
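>>
>> For 2, the load step I have in mind looks roughly like this (table name,
>> host and file path are placeholders):
>>
>>   # load one staged CSV file into the Phoenix table
>>   bin/psql.py -t DATA_USAGE_CDR zookeeper-host /staging/cdr_20160115_0001.csv
>>
>> and for 4, the subscriber-facing read is essentially a simple
>> per-subscriber query, e.g. (column names are illustrative only):
>>
>>   SELECT event_time, site, bytes_used
>>   FROM DATA_USAGE_CDR
>>   WHERE userid = '27831234567'
>>     AND event_time >= TO_DATE('2016-01-01 00:00:00');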
>>
>>
>>
>> Is Apache Phoenix a suitable candidate for this specific use case?
>>
>>
>>
>> Regards,
>>
>> Willem
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Cheers,
>>
>> Pari
>>
>
