In my experience it's better to keep the number of column families low. When 
a flush occurs, it affects all column families in a table, so when the 
memstore fills you'll create an HFile per family. That said, I haven't seen 
any performance impact from having two column families.


As for the number of columns, there are two extremes: 1) "narrow" - store the 
xml as a blob in a single cell; 2) "wide" - break it out into columns, of which 
you can have thousands.


  1.  In the case where you store the XML as a blob, you always need to 
retrieve the entire document, and must deserialise it to perform operations. 
You save space by not repeating the row key, and you save space on column and 
column family qualifiers.
  2.  When you break the XML out into columns you can retrieve data at a 
per-attribute level, which might save IO by filtering out unnecessary content, 
and you don't need to break open the XML to perform operations. You incur a 
cost in repeating the row key per cell (this can add up and will affect read 
performance by limiting the number of rows that can fit into the block cache), 
as well as the extra cost of the column family and qualifier stored with each 
cell. There is also a practical limit to the number of columns, because a row 
cannot be split across regions.
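To make the two extremes concrete, here is a minimal Python sketch using only 
the standard library's xml.etree. The sample document, row key, and qualifier 
naming scheme are invented for illustration; each layout is modelled as a list 
of (row key, qualifier, value) cells, which is roughly how HBase stores them.

```python
import xml.etree.ElementTree as ET

doc = "<order id='42'><item sku='abc' qty='2'/><total>9.99</total></order>"
row_key = b"order#42"

# Extreme 1: "narrow" - the whole document is a blob in a single cell.
narrow = [(row_key, b"xml", doc.encode("utf-8"))]

# Extreme 2: "wide" - one cell per attribute or text node, qualified by path.
root = ET.fromstring(doc)
wide = []
for elem in root.iter():
    for name, value in elem.attrib.items():
        wide.append((row_key, f"{elem.tag}.{name}".encode(), value.encode()))
    if elem.text and elem.text.strip():
        wide.append((row_key, elem.tag.encode(), elem.text.strip().encode()))

# narrow is 1 cell; wide is 4 cells, each repeating the row key.
```

Note that in the wide layout every cell carries the row key again, which is 
exactly the repetition cost described in point 2.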

You may find optimal performance for your use case somewhere between the two 
extremes, and it's best to prototype and measure early.
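When prototyping, a quick back-of-envelope on key repetition helps frame what 
to measure. This is a rough sketch only: every HBase cell carries its full 
coordinates (row key, family, qualifier, timestamp, type), but the lengths 
below are illustrative numbers, not the exact on-disk KeyValue layout.

```python
def cell_key_bytes(row_key_len, family_len, qualifier_len,
                   timestamp_len=8, type_len=1):
    """Approximate per-cell key overhead: each cell repeats its coordinates."""
    return row_key_len + family_len + qualifier_len + timestamp_len + type_len

# One 10 KB XML blob in a single cell vs. the same data as 200 small cells.
row, fam = 16, 1
narrow_overhead = cell_key_bytes(row, fam, qualifier_len=3)        # one cell
wide_overhead = 200 * cell_key_bytes(row, fam, qualifier_len=12)   # 200 cells

print(narrow_overhead, wide_overhead)  # 29 7600
```

A few KB of repeated key material per row is what eats into the block cache, 
so it's worth plugging in your real key and qualifier sizes before committing 
to a wide schema.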

Cheers,
Richard


https://richardstartin.com/


________________________________
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 28 November 2016 21:57
To: user@hbase.apache.org
Subject: Re: Storing XML file in Hbase

Thanks Richard.

How would one decide on the number of column family and columns?

Is there a ballpark approach?

Cheers

Dr Mich Talebzadeh



LinkedIn 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 November 2016 at 16:04, Richard Startin <richardstar...@outlook.com>
wrote:

> Hi Mich,
>
> If you want to store the file whole, you'll need to enforce a 10MB limit
> on the file size, otherwise you will flush too often (each time the
> memstore fills up), which will slow down writes.
>
> Maybe you could deconstruct the xml by extracting columns from the xml
> using xpath?
>
> If the files are small there might be a tangible performance benefit by
> limiting the number of columns.
>
> Cheers,
> Richard
>
> Sent from my iPhone
>
> > On 28 Nov 2016, at 15:53, Dima Spivak <dimaspi...@apache.org> wrote:
> >
> > Hi Mich,
> >
> > How many files are you looking to store? How often do you need to read
> > them? What's the total size of all the files you need to serve?
> >
> > Cheers,
> > Dima
> >
> > On Mon, Nov 28, 2016 at 7:04 AM Mich Talebzadeh <
> mich.talebza...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Storing XML files in Big Data. Are there any strategies to create
> >> multiple column families or just one column family, and in that case
> >> how many columns would be optimal?
> >>
> >> thanks
> >>
> >> Dr Mich Talebzadeh
> >>
>
