[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-650:
--------------------------------

    Attachment: hbase-integration_v1.patch

This patch is what I have done so far. Right now, hbase integration is functional enough that you can inject, generate, fetch http pages, parse html pages and create a basic index. (Only parse-html, protocol-http and index-basic are updated for hbase.)

Before I go into design, first, a note: Don't worry about the size of the patch :D I know that it is huge, but for simplicity I created a new package (org.apache.nutchbase) and moved code there instead of modifying it directly. So, the bulk of the patch is really just old code.

In general, if you are interested in reviewing this patch (and I hope you are :), the interesting parts are: InjectorHbase, GeneratorHbase, FetcherHbase, ParseTable, UpdateTable, IndexerHbase and anything under util.hbase.

A) Why integrate with hbase?

- All your data in a central location.
- No more segment/crawldb/linkdb merges.
- No more "missing" data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status, so we copy the fetch status into content metadata. This will no longer be necessary with hbase integration.
- A much simpler data model. Today, if you want to update a small part of a single record, you have to write an MR job that reads the relevant directory, changes the single record, removes the old directory and renames the new directory. With hbase, you can just update that record.

Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language, may be easier for people to understand and use.

B) Design

The design is actually rather straightforward.

- We store everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in hbase. I have written a small utility class that creates the "webtable" with the necessary columns.
- So now most jobs just take the name of the table as input.
- There are two main classes for interfacing with hbase. ImmutableRowPart wraps a RowResult and has helper getters (getStatus(), getContent(), etc.). RowPart is similar to ImmutableRowPart but also has setters. The idea is that RowPart also wraps a RowResult but, in addition, keeps a list of the updates done to that row. So when getSomething is called, it first checks whether Something has already been updated; if so, it returns the updated version, otherwise it returns the value from the RowResult. RowPart can also create a BatchUpdate from its list of updates.
- URLs are stored in reversed host order. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same tld/host/domain are stored closer to each other. TableUtil has methods for reversing and unreversing URLs.
- CrawlDatum statuses are simplified. Since everything is in a central location now, there is no point in having separate DB and FETCH statuses.

Jobs:

- Each job marks rows so that the next job knows which rows to read. For example, if GeneratorHbase decides that a URL should be generated, it marks the URL with a TMP_FETCH_MARK. (Marking a URL is simply creating a special metadata field.) When FetcherHbase runs, it skips over anything without this special mark.
- InjectorHbase: First, a job runs where injected URLs are marked. Then, in the next job, if a row has the mark but nothing else, InjectorHbase initializes the row. (Here, I assumed that a row already exists if it has a "status:" column.)
- GeneratorHbase: Supports max-per-host configuration and topN. Marks generated URLs with a marker.
- FetcherHbase: Very similar to the original Fetcher. Marks URLs that were fetched successfully. Skips over URLs not marked by GeneratorHbase.
- ParseTable: Similar to the original Parser. Outlinks are stored as "outlinks:<fromUrl>" -> "anchor".
- UpdateTable: Does updatedb's and invertlinks' job. Also clears any markers.
- IndexerHbase: Indexes the _entire_ table. Skips over URLs not parsed successfully.
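
To make the reversed-host-order row key from the design notes above concrete, here is a minimal sketch. It is not the code in TableUtil from the patch: the class name ReversedUrlSketch and the methods reverseHost, reverseUrl and unreverseUrl are hypothetical names chosen for illustration, and the sketch assumes a plain hostname (no IPv6 literal, no user info) and drops any fragment.

```java
import java.net.URI;

// Illustrative only: hypothetical names, not the TableUtil API from the patch.
class ReversedUrlSketch {

    // Reverses the dot-separated labels of a host: "bar.foo.com" -> "com.foo.bar".
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    // Builds "<reversed host>:<scheme>[:<port>]<path>[?<query>]".
    // Assumes the URL has a plain hostname; an empty path is written as "/".
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        StringBuilder sb = new StringBuilder(reverseHost(uri.getHost()));
        sb.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) sb.append(':').append(uri.getPort());
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        return sb.toString();
    }

    // Inverse of reverseUrl for keys produced by it (they always contain a '/').
    static String unreverseUrl(String key) {
        int slash = key.indexOf('/');
        String[] head = key.substring(0, slash).split(":"); // host, scheme, optional port
        StringBuilder sb = new StringBuilder(head[1]).append("://").append(reverseHost(head[0]));
        if (head.length > 2) sb.append(':').append(head[2]);
        return sb.append(key.substring(slash)).toString();
    }

    public static void main(String[] args) {
        String key = reverseUrl("http://bar.foo.com:8983/to/index.html?a=b");
        System.out.println(key);               // com.foo.bar:http:8983/to/index.html?a=b
        System.out.println(unreverseUrl(key)); // http://bar.foo.com:8983/to/index.html?a=b
    }
}
```

Because hbase keeps rows sorted by key, keys built this way put all pages under foo.com next to each other, which is the locality the design is after.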

Plugins:

- Plugins now have a Set<String> getColumnSet(); method. Before starting a job, we ask the relevant plugins what exactly they want to read from hbase, and we read those columns. For example, FetcherHbase reads some columns but doesn't read "modifiedTime:". However, protocol-httphbase needs this column. So the plugin adds this column to its set, and FetcherHbase reads "modifiedTime:" when protocol-httphbase is active. This way, plugins read exactly what they want, whenever they want it. For example, during parsing, CrawlDatum's fields are normally not available. With this patch, however, a parse plugin can ask for any of those fields and it will get them.
- Also, the plugin API is simpler now. Most plugins will look like a variation of this:
      public void doStuff(String url, RowPart row);
  So now a plugin can also choose to update any column it wants.

C) What's missing

- A LOT of plugins.
- No ScoringFilters at all.
- Converters from old data into hbase.
- GeneratorHbase: No byIP stuff. Does not shuffle URLs for fetching. No -adddays.
- FetcherHbase: No byIP stuff. No parsing during fetch. Shuffling is important for performance, but this can be fixed. (One solution that comes to mind is to randomly partition URLs into reducers during map, and perform the actual fetching during reduce.) Supports following redirects, but not immediately. Http headers are not stored. Since there is no parsing in the fetcher, the fetcher always stores content.
- ParseTable: No multi-parse (i.e. ParseResult).
- IndexerHbase: No way to choose a subset of URLs to index. (There is a marker in UpdateTable but I haven't put it in yet.)
- FetchSchedule: prevModifiedTime and the other prev... fields are missing, as I haven't yet figured out a way to read older versions of the same column.

Most of what's missing is stuff I didn't have time to code. It should be easy to add later on.

As always, suggestions/reviews are welcome.
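
For reviewers, the getColumnSet() mechanism described under Plugins can be sketched roughly as follows. Only the method signature Set<String> getColumnSet() comes from the description above; the interface name ColumnAwarePlugin, the helper columnsToRead and the job's own column names are assumptions made up for this illustration, not code from the patch.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical interface: only the getColumnSet() signature is taken from the description.
interface ColumnAwarePlugin {
    Set<String> getColumnSet();
}

class ColumnSetSketch {

    // Union of the columns the job itself needs and the columns requested by active plugins.
    static Set<String> columnsToRead(Set<String> jobColumns, List<ColumnAwarePlugin> activePlugins) {
        Set<String> columns = new TreeSet<>(jobColumns);
        for (ColumnAwarePlugin plugin : activePlugins) {
            columns.addAll(plugin.getColumnSet());
        }
        return columns;
    }

    public static void main(String[] args) {
        // Assumed job columns, for illustration only.
        Set<String> fetcherColumns = Set.of("status:", "fetchTime:");
        // Stand-in for a protocol plugin that needs "modifiedTime:".
        ColumnAwarePlugin httpPlugin = () -> Set.of("modifiedTime:");

        Set<String> columns = columnsToRead(fetcherColumns, List.of(httpPlugin));
        System.out.println(columns.contains("modifiedTime:")); // true
    }
}
```

The point of the union is that a job never has to know in advance which columns a plugin will want; the scan is widened only when a plugin that needs the extra column is active.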

> Hbase Integration
> -----------------
>
>                 Key: NUTCH-650
>                 URL: https://issues.apache.org/jira/browse/NUTCH-650
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>         Attachments: hbase-integration_v1.patch
>
>
> This issue will track nutch/hbase integration

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.