In my experience it's better to keep the number of column families low. When flushes occur, they affect all column families in a table, so when the memstore fills you'll create an HFile per family. I haven't seen any performance impact from having two column families, though.
As for the number of columns, there are two extremes: 1) "narrow" - store the XML as a blob in a single cell; 2) "wide" - break it out into columns, of which you can have thousands.

1. In the case where you store the XML as a blob you always need to retrieve the entire document, and must deserialise it to perform operations. You save space by not repeating the row key, column family and column qualifier for every attribute.

2. When you break the XML out into columns you can retrieve data at a per-attribute level, which might save IO by filtering out unnecessary content, and you don't need to break open the XML to perform operations. You incur a cost in repeating the row key per tuple (this can add up and will affect read performance by limiting the number of rows that can fit into the block cache), as well as the column family and column qualifier stored with every cell.

There is a practical limit to the number of columns because a row cannot be split across regions.

You may find optimal performance for your use case somewhere between the two extremes, and it's best to prototype and measure early.

Cheers,
Richard

https://richardstartin.com/

________________________________
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 28 November 2016 21:57
To: user@hbase.apache.org
Subject: Re: Storing XML file in Hbase

Thanks Richard.

How would one decide on the number of column families and columns? Is there a ballpark approach?

Cheers

Dr Mich Talebzadeh

LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 28 November 2016 at 16:04, Richard Startin <richardstar...@outlook.com> wrote:

> Hi Mich,
>
> If you want to store the file whole, you'll need to enforce a 10MB limit
> on the file size, otherwise you will flush too often (each time the
> memstore fills up), which will slow down writes.
>
> Maybe you could deconstruct the XML by extracting columns from it
> using XPath?
>
> If the files are small there might be a tangible performance benefit in
> limiting the number of columns.
>
> Cheers,
> Richard
>
> > On 28 Nov 2016, at 15:53, Dima Spivak <dimaspi...@apache.org> wrote:
> >
> > Hi Mich,
> >
> > How many files are you looking to store? How often do you need to read
> > them? What's the total size of all the files you need to serve?
> >
> > Cheers,
> > Dima
> >
> > On Mon, Nov 28, 2016 at 7:04 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Storing XML file in Big Data. Are there any strategies to create multiple
> >> column families or just one column family, and in that case how many
> >> columns would be optimal?
> >>
> >> thanks
> >>
> >> Dr Mich Talebzadeh
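
As a rough illustration of the XPath-style decomposition and the narrow/wide trade-off discussed in this thread, here is a minimal Python sketch. It is not HBase client code. The names `flatten`, `cell_size`, `narrow_size` and `wide_size`, the qualifier `xml`, and the 20-byte fixed per-cell figure are all assumptions made for illustration. The figure approximates the uncompressed cell layout, and compression or block encoding will change what actually lands on disk.

```python
import xml.etree.ElementTree as ET

# Assumed fixed bytes per cell in the uncompressed layout: key length (4)
# + value length (4) + row length (2) + family length (1) + timestamp (8)
# + key type (1). Treat this as a rough figure, not a measured one.
FIXED_CELL_OVERHEAD = 20


def flatten(xml_text):
    """Map each leaf element and attribute to a path-like column qualifier."""
    columns = {}

    def walk(elem, path):
        for name, value in elem.attrib.items():
            columns[f"{path}/@{name}"] = value
        children = list(elem)
        if not children:
            # Leaf element: its text becomes the cell value.
            columns[path] = (elem.text or "").strip()
            return
        seen = {}
        for child in children:
            # Index repeated siblings so qualifiers stay unique.
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    root = ET.fromstring(xml_text)
    walk(root, root.tag)
    return columns


def cell_size(row_key, family, qualifier, value):
    """Estimate the bytes for one cell: fixed overhead plus every field."""
    return (FIXED_CELL_OVERHEAD + len(row_key.encode()) + len(family.encode())
            + len(qualifier.encode()) + len(value.encode()))


def narrow_size(row_key, family, xml_text):
    """One cell holding the whole document as a blob."""
    return cell_size(row_key, family, "xml", xml_text)


def wide_size(row_key, family, xml_text):
    """One cell per attribute or leaf; the row key repeats in each cell."""
    return sum(cell_size(row_key, family, q, v)
               for q, v in flatten(xml_text).items())


if __name__ == "__main__":
    doc = ('<order id="42"><customer>Ann</customer>'
           '<item>pen</item><item>ink</item></order>')
    for qualifier, value in flatten(doc).items():
        print(qualifier, "=", value)
    print("narrow bytes:", narrow_size("row-0001", "d", doc))
    print("wide bytes:  ", wide_size("row-0001", "d", doc))
```

Running the estimate over a sample of real documents with realistic row keys gives a first feel for how much the repeated key costs in the wide layout. It does not replace prototyping against an actual cluster. Mixed content (text interleaved with child elements) is ignored in this sketch.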