... but it depends on what you want to do. If you want full-text searching, then yes, you probably want to look at Lucene. If you want activity analysis, summaries are probably better.
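If summaries are the route, the usual shape is a nightly MapReduce job that scans the raw log table and writes pre-aggregated counts into a second, much smaller table that reports read directly. Here's a minimal sketch along the lines of the #mapreduce chapter linked below; the table names ("access_logs", "daily_summary"), the "d" column family, and the "url"/"hits" qualifiers are made-up placeholders, not anything standard:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class DailySummaryJob {

  // Emits (url, 1) for every raw log row scanned.
  static class SummaryMapper extends TableMapper<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      byte[] u = value.getValue(Bytes.toBytes("d"), Bytes.toBytes("url"));
      if (u != null) {
        url.set(u);
        ctx.write(url, ONE);
      }
    }
  }

  // Sums the counts and writes one row per url into the summary table.
  static class SummaryReducer
      extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("hits"), Bytes.toBytes(sum));
      ctx.write(null, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "daily-log-summary");
    job.setJarByClass(DailySummaryJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner batches for MR
    scan.setCacheBlocks(false);  // don't pollute the block cache

    TableMapReduceUtil.initTableMapperJob("access_logs", scan,
        SummaryMapper.class, Text.class, LongWritable.class, job);
    TableMapReduceUtil.initTableReducerJob("daily_summary",
        SummaryReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With something like that running overnight, the report becomes a single Get or short Scan against "daily_summary" instead of a 40-minute full scan of the raw data.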
On 2/5/12 1:54 PM, "Doug Meil" <[email protected]> wrote:

> Hi there-
>
> You probably want to check out these chapters of the HBase ref guide:
>
> http://hbase.apache.org/book.html#datamodel
> http://hbase.apache.org/book.html#schema
> http://hbase.apache.org/book.html#mapreduce
>
> ... and with respect to the "40 minutes per report", a common pattern is
> to create summary tables/files as appropriate.
>
>
> On 2/5/12 3:37 AM, "mete" <[email protected]> wrote:
>
>> Hello,
>>
>> I am thinking about using HBase for storing web log data. I like the
>> idea of having HDFS underneath, so that I won't have to worry much about
>> failure cases and I can benefit from all the cool HBase features.
>>
>> The thing I could not figure out is how to effectively store and query
>> the data. I am planning to split each kind of log record into 10-20
>> columns and then use MR jobs to query the table with full scans. (I
>> guess I could use Hive or Pig for this as well, but I am not familiar
>> with those yet.)
>> I find this approach simple and easy to implement, but on the other hand
>> it is an offline process, and it could take a lot of time to produce a
>> single report. And of course a business user would be very disappointed
>> to see that he/she has to wait another 40 minutes for the results of a
>> query.
>>
>> So what I am trying to achieve is to keep this query time as small as
>> possible. For this I can sacrifice write speed as well; I don't really
>> have to integrate new logs on the fly, and a job that runs overnight is
>> also fine.
>>
>> For this kind of situation, do you find HBase useful?
>>
>> I read about star-schema design for more effective queries, but that
>> makes the developer's job a lot harder, because I would need to design
>> different schemas for different log types; adding a new log type would
>> require time to gather requirements, develop, etc.
>>
>> I thought about creating a very simple HBase schema, just a key and the
>> content for each record, and then indexing this content with Lucene. But
>> then it sounded like I did not need HBase in the first place, because I
>> would not really be benefiting from it except for storage. Also, I could
>> not be sure how big my Lucene indexes would get, and whether Lucene
>> could cope with big data. What do you think about Lucene indexes on
>> HBase?
>>
>> I read about how Rackspace does things; as far as I understood, they
>> generate Lucene indexes while parsing the logs in Hadoop, and then merge
>> each index into a system that serves the previous indexes.
>> (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)
>>
>> Does anyone use a similar approach, or have any ideas about this?
>>
>> Do you think any of these are suitable, or should I try a different way?
>>
>> Thanks in advance
>> Mete
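Re: the schema question above (one key plus an opaque content blob vs. 10-20 parsed columns): one column per parsed field is what lets later MR jobs and server-side filters select only what they need, so I'd lean that way. A rough sketch of the load side; the "logs" table, the "d" family, and the source-plus-timestamp key layout are illustrative assumptions only, not a recommendation for every log type:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class LogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "logs");  // hypothetical table name

    // Key: source id + timestamp keeps one source's records contiguous
    // and lets a Scan with start/stop keys cover a time range per source.
    byte[] rowKey = Bytes.add(Bytes.toBytes("webserver-01/"),
                              Bytes.toBytes(System.currentTimeMillis()));

    Put put = new Put(rowKey);
    byte[] cf = Bytes.toBytes("d");
    // One qualifier per parsed field of the log record.
    put.add(cf, Bytes.toBytes("ip"),     Bytes.toBytes("10.0.0.42"));
    put.add(cf, Bytes.toBytes("url"),    Bytes.toBytes("/index.html"));
    put.add(cf, Bytes.toBytes("status"), Bytes.toBytes("200"));
    put.add(cf, Bytes.toBytes("agent"),  Bytes.toBytes("Mozilla/5.0"));

    table.put(put);
    table.close();
  }
}

The key layout matters more than the column split: a purely time-ordered key funnels all writes into one hot region, which is exactly what the #schema chapter above warns about.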

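And on the Lucene-over-HBase idea: the pattern that keeps HBase useful is to index but not store the log text, keeping only the HBase row key as a stored field. Lucene then holds just the inverted index, HBase stays the system of record, and the indexes stay comparatively small. A rough single-process sketch against the Lucene 3.x API of the day; table and field names are again hypothetical, and a real build would do this inside the MR job, closer to the Rackspace setup:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LogIndexer {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "logs");  // hypothetical raw-log table

    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/tmp/log-index")),
        new IndexWriterConfig(Version.LUCENE_35,
            new StandardAnalyzer(Version.LUCENE_35)));

    ResultScanner scanner = table.getScanner(new Scan());
    for (Result r : scanner) {
      byte[] content = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("content"));
      if (content == null) continue;

      Document doc = new Document();
      // Store the HBase row key so a search hit can be fetched back via Get.
      doc.add(new Field("rowkey", Bytes.toString(r.getRow()),
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      // Index (but don't store) the raw line; HBase keeps the full record.
      doc.add(new Field("content", Bytes.toString(content),
          Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    scanner.close();
    writer.close();
    table.close();
  }
}

Whether that beats plain summary tables still comes down to the split above: full-text search wants Lucene, activity analysis wants summaries.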