For MOB, please take a look at HBASE-11339.

Cheers
On Nov 27, 2014, at 3:32 PM, Aleks Laz <al-userhb...@none.at> wrote:

> Hi Wilm.
>
> On 27-11-2014 23:41, Wilm Schumacher wrote:
>> Hi Aleks ;),
>> On 27.11.2014 at 22:27, Aleks Laz wrote:
>>> Our application is an nginx/php-fpm/postgresql setup.
>>> The target design is nginx + proxy features / php-fpm / $DB / $Storage.
>>> .) Can I mix HDFS/HBase for binary data storage and data analyzing?
>> yes. hbase is perfect for that. For storage it will work (with the
>> "MOB-extension") and with map reduce you can do whatever data analyzing
>> you want. I assume you do some image processing with the data?!?!
>
> What's the plan for the "MOB-extension"?
>
> From a development point of view I can build HBase with the
> "MOB-extension", but from a sysadmin point of view a 'package' (jar, zip,
> deb, rpm, ...) is much easier to maintain.
>
> Currently there are no plans to analyse the images, but who knows what
> the future brings.
>
> We need to do some "accesslog" analysis, like piwik or awffull.
> Maybe elasticsearch is a better tool for that?
>
>>> .) What is the preferred way to use HBase with PHP?
>> The native client lib is in java. This is the best way to go. But if you
>> need only basic access from the php application, then thrift or rest
>> would be a good choice.
>> http://wiki.apache.org/hadoop/Hbase/ThriftApi
>> http://wiki.apache.org/hadoop/Hbase/Stargate
>
> Stargate is a cool name ;-)
>
>> There are language bindings for both.
>>> .) How difficult is it to use HBase with PHP?
>> That depends on what you are trying to do. If you just do a little
>> fetching, updating, inserting etc. it's pretty easy. More complicated
>> stuff I would do in java and expose through a custom api via a java
>> service.
>>> .) What's a good solution for the 37 TB, or the upcoming ~120 TB, to
>>> distribute?
>>> [ ] N servers with one 37 TB mountpoint per server?
>>> [ ] N servers with x TB mountpoints per server?
>>> [ ] other:
>> That's "not your business". hbase/hadoop does the trick for you. hbase
>> distributes the data, replicates it, etc.. You will only talk to the
>> master.
>
> Well, but at the end of the day I will need physical storage distributed
> over x servers.
>
> My question is: do I need to make sure that every server has enough
> storage for the whole data set?
>
> As far as I have understood, a hadoop client sees a 'filesystem' with
> 37 TB or 120 TB, but from the server point of view, how should I plan the
> storage/server setup for the datanodes?
>
> From the hadoophbase-capacity-planning link below and
>
> http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/
>
> #####
> ....
> Here are the recommended specifications for DataNode/TaskTrackers in a
> balanced Hadoop cluster:
>
> 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
> ...
> #####
>
> What happens when a datanode has 20 TB but the whole hadoop/HBase 2-node
> cluster has 40?
>
> I see I'm still new to the hadoop/HBase concept.
>
>>> .) Is HBase a good value for $Storage?
>> yes ;)
>>> .) Is HBase a good value for $DB?
>>> The DB size is smaller than 1 GB; I would use HBase just for the HA
>>> features of Hadoop.
>> well, the official documentation says:
>> »First, make sure you have enough data. If you have hundreds of millions
>> or billions of rows, then HBase is a good candidate. If you only have a
>> few thousand/million rows, then using a traditional RDBMS might be a
>> better choice ...«
>
> Okay, so for this part I will stay on postgresql with pgbouncer.
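To make the "basic access" route discussed above concrete, here is a minimal sketch of talking to the HBase Thrift gateway (the thrift1 API from the ThriftApi wiki page) from PHP. It assumes the Thrift PHP runtime and the classes generated from Hbase.thrift (HbaseClient, Mutation) are autoloadable, a Thrift server running on localhost:9090, and a made-up 'cams' table with a column family 'd'; exact class names and namespaces vary with the Thrift version, so treat this as a sketch, not a reference.

#####
<?php
// Minimal sketch, not production code: write and read one row through
// the HBase Thrift (thrift1) gateway. Assumes the Thrift PHP runtime
// and the classes generated from Hbase.thrift are autoloadable; table
// and column names ('cams', 'd:img') are made up for illustration.
use Thrift\Transport\TSocket;
use Thrift\Transport\TBufferedTransport;
use Thrift\Protocol\TBinaryProtocol;
use Hbase\HbaseClient;
use Hbase\Mutation;

$socket    = new TSocket('localhost', 9090);   // Thrift gateway host/port
$transport = new TBufferedTransport($socket);
$protocol  = new TBinaryProtocol($transport);
$client    = new HbaseClient($protocol);

$transport->open();

// Write: store an image blob under column family 'd'.
$mutations = array(new Mutation(array(
    'column' => 'd:img',
    'value'  => file_get_contents('cam-0001.jpg'),
)));
$client->mutateRow('cams', 'cam-0001', $mutations, array());

// Read it back; getRow returns a list of TRowResult.
$rows = $client->getRow('cams', 'cam-0001', array());
foreach ($rows as $row) {
    foreach ($row->columns as $column => $cell) {
        echo $column, ' => ', strlen($cell->value), " bytes\n";
    }
}

$transport->close();
#####

Anything heavier than this kind of get/put (custom filters, scans with logic, admin work) is more comfortable through the native java api, as suggested above.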
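On the capacity question above (20 TB per datanode vs. 40 TB in a 2-node cluster): no single datanode has to hold the whole data set; HDFS splits files into blocks and spreads the replicas across the cluster. What has to add up is the raw total: logical data times the replication factor (3 by default, via dfs.replication), plus headroom. A back-of-the-envelope check with the numbers from this thread, replication assumed at the default:

#####
<?php
// Back-of-the-envelope HDFS sizing. Assumes the default block
// replication factor of 3 (dfs.replication) and ignores headroom for
// temp data, compactions and node failure.
$logicalTb   = 37;   // application data, from this thread
$replication = 3;    // HDFS default
$perNodeTb   = 20;   // raw disk per datanode

$rawNeededTb = $logicalTb * $replication;             // 37 * 3 = 111 TB raw
$minNodes    = (int) ceil($rawNeededTb / $perNodeTb); // ceil(111 / 20) = 6

echo "raw needed: {$rawNeededTb} TB, minimum datanodes: {$minNodes}\n";

// The 2-node / 40 TB example holds only about 40 / 3 ~= 13 TB of
// logical data at replication 3 -- the cluster's raw total, not any
// single node, is what must cover data * replication.
#####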
>
>> In my experience, at around 1-10 million rows RDBMS are not really
>> usable anymore. But I only used small/cheap hardware ... and don't like
>> RDBMS ;).
>
> ;-)
>
>> Well, you will have at least 40 million rows ... and the platform is
>> growing. I think SQL isn't a choice anymore. And as you have heavy reads
>> and only a few writes, hbase is a good fit.
>
> ?! Why "40 million rows"? Do you mean the file tables?
> The DB only holds some data like user accounts, the id for a directory,
> and so on.
>
>>> .) Due to the fact that HBase is a file-system I could use
>>> /cams , for binary data
>>> /DB , for DB storage
>>> /logs , for log storage
>>> but is this wise? On 'disk' they are different RAIDs.
>> hbase is a data store, not a file-system. This was probably copy-pasted
>> from the original hadoop question ;).
>
> ;-)
>
>>> .) Should I plan a dedicated network + card for the 'cluster
>>> communication', as for most other cluster software?
>>> From what I have read it doesn't look necessary, but from a security
>>> point of view, yes.
>> http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
>> Cloudera employees say that it wouldn't hurt if you have to push a lot
>> of data to the cluster.
>
> Okay, so it is like other cluster setups.
>
>>> .) Maybe the communication with the components (hadoop, zk, ...) could
>>> be set up with TLS?
>> hbase is built on top of hadoop/hdfs. This is in the "hadoop domain".
>> hadoop can encrypt the transported data with TLS, can encrypt the data
>> on disk, you can use kerberos auth (but that stuff I never did), etc.
>> etc.. So the answer is yes.
>
> Thanks.
>
>> Last remark: You seem kind of bound to PHP. The hadoop world is written
>> in java. Of course there are a lot of ways to do stuff in other
>> languages, over interfaces etc. But the java api is the most powerful,
>> and sometimes there is no other way than to use it directly.
>
> Currently, yes, php is the main language.
> I don't know of a good solution for php similar to hadoop; does anyone
> else know one?
>
> I will take a look at
>
> https://wiki.apache.org/hadoop/PoweredBy
>
> to get some ideas for a working solution.
>
>> Best wishes,
>> Wilm
>
> Thanks for your feedback.
> I will dig deeper into this topic and start to set up the components
> step by step.
>
> BR Aleks
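As a follow-up to the REST option ("Stargate") linked earlier in the thread: basic access also works from plain PHP with nothing but ext/curl, since the gateway speaks HTTP. A minimal sketch, assuming a REST server on localhost:8080 and the same made-up 'cams' table as above; with application/octet-stream the cell value travels as raw bytes and the row key and column come from the URL.

#####
<?php
// Minimal sketch: write and read one cell through the HBase REST
// gateway (Stargate) using only ext/curl. Host, port, table and
// column names are assumptions for illustration.
$base = 'http://localhost:8080';
$cell = '/cams/cam-0001/d:img';  // /<table>/<row-key>/<family:qualifier>

// Write: PUT the raw bytes.
$ch = curl_init($base . $cell);
curl_setopt_array($ch, array(
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_HTTPHEADER     => array('Content-Type: application/octet-stream'),
    CURLOPT_POSTFIELDS     => file_get_contents('cam-0001.jpg'),
    CURLOPT_RETURNTRANSFER => true,
));
curl_exec($ch);
curl_close($ch);

// Read: GET with Accept: application/octet-stream returns the newest
// cell value as raw bytes.
$ch = curl_init($base . $cell);
curl_setopt_array($ch, array(
    CURLOPT_HTTPHEADER     => array('Accept: application/octet-stream'),
    CURLOPT_RETURNTRANSFER => true,
));
$bytes = curl_exec($ch);
curl_close($ch);

echo strlen($bytes), " bytes read back\n";
#####

For anything beyond single-cell get/put, the Thrift interface or a small java service in front of the native client is the more comfortable route, as noted in the thread.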