Hi there- You probably want to check out these chapters of the HBase ref guide:

http://hbase.apache.org/book.html#datamodel
http://hbase.apache.org/book.html#schema
http://hbase.apache.org/book.html#mapreduce

... and with respect to the "40 minutes per report", a common pattern is to
build summary tables/files as appropriate (rough sketches of what I mean are
at the bottom of this message, below your original mail).

On 2/5/12 3:37 AM, "mete" <[email protected]> wrote:

>Hello,
>
>I am thinking about using HBase for storing web log data. I like the idea
>of having HDFS underneath, so that I won't have to worry much about failure
>cases, and I can benefit from all the cool HBase features.
>
>The thing I could not figure out is how to effectively store and query the
>data. I am planning to split each kind of log record into 10-20 columns and
>then use MR jobs to query the table with full scans. (I guess I could use
>Hive or Pig for this as well, but I am not familiar with those yet.)
>I find this approach simple and easy to implement, but on the other hand it
>is an offline process, and it could take a lot of time to get a single
>report. And of course a business user would be very disappointed to see
>that he/she has to wait another 40 minutes to get the results of the query.
>
>So what I am trying to achieve is to keep this query time as small as
>possible. For this I can sacrifice write speed as well; I don't really have
>to integrate new logs on the fly, and a job that runs overnight is also
>fine.
>
>For this kind of situation, do you find HBase useful?
>
>I read about star-schema design to make queries more efficient, but that
>makes the developer's job a lot harder, because I would need to design a
>different schema for each log type, and adding a new log type would take
>time to gather requirements, develop, etc.
>
>I also thought about creating a very simple HBase schema, just a key and
>the content for each record, and then indexing this content with Lucene.
>But then it seemed like I did not need HBase in the first place, because I
>would not really be benefiting from it except for storage. Also, I could
>not be sure how big my Lucene indexes would get, and whether Lucene could
>cope with that much data. What do you think about Lucene indexes on HBase?
>
>I read about how Rackspace handles this: as far as I understood, they
>generate Lucene indexes while parsing the logs in Hadoop, and then merge
>those indexes into a system that serves the previous indexes.
>(http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)
>
>Does anyone use a similar approach, or have any ideas about this?
>
>Do you think any of these are suitable? If not, should I try a different
>approach?
>
>Thanks in advance
>Mete
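Regarding the "10-20 columns per record, full scans with MR" plan: that layout
works fine in HBase. Just to make it concrete, here is a very rough sketch of a
write against such a table with the plain Java client. The table name, column
family and row-key layout ("access_log", "d", logtype/day/host/seq) are made up
for illustration, not anything the ref guide mandates; the point is only that a
key prefixed with log type and day keeps related rows contiguous, so a job can
scan a range instead of the whole table when that is enough.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AccessLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "access_log");   // hypothetical raw log table

    // Row key: logtype/day/host/sequence, so rows for one log type and day
    // form a contiguous range that a scan can be restricted to.
    Put put = new Put(Bytes.toBytes("apache/20120205/web01/0000001"));

    byte[] cf = Bytes.toBytes("d");                   // one short column family
    put.add(cf, Bytes.toBytes("ip"),     Bytes.toBytes("10.0.0.1"));
    put.add(cf, Bytes.toBytes("url"),    Bytes.toBytes("/index.html"));
    put.add(cf, Bytes.toBytes("status"), Bytes.toBytes("200"));
    put.add(cf, Bytes.toBytes("bytes"),  Bytes.toBytes("5120"));
    // ... the rest of the 10-20 parsed fields as further qualifiers

    table.put(put);
    table.close();
  }
}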

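For the summary-table pattern itself, the usual shape is an overnight MapReduce
job that scans the raw table once, aggregates, and writes the result into a much
smaller summary table that the reports read directly (seconds instead of a
40-minute scan). The sketch below follows the TableMapper/TableReducer pattern
from the #mapreduce chapter linked above and counts hits per (day, url). The
table, family and qualifier names ("access_log", "daily_url_summary", "d:url",
"s:hits") are again invented for the example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class DailyUrlSummary {

  static class HitMapper extends TableMapper<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      // Row key layout assumed from the ingest sketch: logtype/day/host/seq
      String day = Bytes.toString(value.getRow()).split("/")[1];
      String url = Bytes.toString(
          value.getValue(Bytes.toBytes("d"), Bytes.toBytes("url")));
      ctx.write(new Text(day + "|" + url), ONE);
    }
  }

  static class SumReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      Put put = new Put(Bytes.toBytes(key.toString()));   // summary row key: day|url
      put.add(Bytes.toBytes("s"), Bytes.toBytes("hits"), Bytes.toBytes(sum));
      ctx.write(null, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "daily-url-summary");
    job.setJarByClass(DailyUrlSummary.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner caching for full-table MR scans
    scan.setCacheBlocks(false);  // don't pollute the block cache from a batch job

    TableMapReduceUtil.initTableMapperJob("access_log", scan,
        HitMapper.class, Text.class, LongWritable.class, job);
    TableMapReduceUtil.initTableReducerJob("daily_url_summary", SumReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}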
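On the "key + content in HBase, index with Lucene" question: people do combine
the two, and HBase still earns its keep, because it holds the full records and
gives you random reads by key while the Lucene index stays comparatively small,
since each document only needs to carry the row key. A very rough sketch against
the Lucene 3.x API of the day (the "rowkey" and "url" field names and the index
path are made up):

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LogIndexer {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "access_log");    // same hypothetical raw table

    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
    IndexWriter writer =
        new IndexWriter(FSDirectory.open(new File("/data/log-index")), iwc);

    Scan scan = new Scan();
    scan.setCaching(500);
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      byte[] url = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("url"));
      if (url == null) continue;

      Document doc = new Document();
      // Store only the HBase row key; a search hit becomes a Get back into HBase.
      doc.add(new Field("rowkey", Bytes.toString(r.getRow()),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
      // Index (but don't store) the searchable content.
      doc.add(new Field("url", Bytes.toString(url),
                        Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    scanner.close();
    writer.close();
    table.close();
  }
}

How large the index gets, and whether one Lucene index copes with your volume,
is exactly what the Rackspace article you linked is about (they build and merge
index shards in Hadoop), so no promises from me on that part.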