[ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-650:
--------------------------------

    Attachment: hbase-integration_v1.patch

This patch is what I have done so far. Right now, hbase integration is 
functional enough that you can inject, generate, fetch HTTP pages, parse HTML 
pages and create a basic index (only parse-html, protocol-http and index-basic 
are updated for hbase so far).

Before I go into the design, first, a note: don't worry about the size of the 
patch :D I know that it is huge, but for simplicity I created a new package 
(org.apache.nutchbase) and moved code there instead of modifying it directly. 
So the bulk of the patch is really just old code. In general, if you are 
interested in reviewing this patch (and I hope you are:), the interesting parts 
are: InjectorHbase, GeneratorHbase, FetcherHbase, ParseTable, UpdateTable, 
IndexerHbase and anything under util.hbase.

A) Why integrate with hbase?
  - All your data in a central location
  - No more segment/crawldb/linkdb merges.
  - No more "missing" data in a job. There are a lot of places where we copy 
data from one structure to another just so that it is available in a later job. 
For example, during parsing we don't have access to a URL's fetch status. So we 
copy fetch status into content metadata. This will no longer be necessary with 
hbase integration.
  - A much simpler data model. If you want to update a small part of a single 
record, you currently have to write an MR job that reads the relevant 
directory, changes the single record, removes the old directory and renames the 
new one. With hbase, you can just update that record (a minimal example follows 
this list). Also, hbase gives us access to Yahoo! Pig, which, I think, with its 
SQL-ish language may be easier for people to understand and use.
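
For illustration, a single-record update against the webtable with the HBase 
client API that the patch builds on (HTable/BatchUpdate) might look roughly 
like this; the row key, column and value here are just placeholders:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.io.BatchUpdate;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SingleRecordUpdate {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      HTable table = new HTable(conf, "webtable");

      // Update one column of one row in place; no MR job, no directory
      // rewriting, no renaming.
      BatchUpdate update = new BatchUpdate("com.example.www:http/");
      update.put("status:", Bytes.toBytes("fetched"));
      table.commit(update);
    }
  }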
  
B) Design
Design is actually rather straightforward. 

  - We store everything (fetch time, status, content, parsed text, outlinks, 
inlinks, etc.) in hbase. I have written a small utility class that creates 
"webtable" with necessary columns.
  - So now most jobs just take the name of the table as input. 
  - There are two main classes for interfacing with hbase. ImmutableRowPart 
wraps a RowResult and has helper getters (getStatus(), getContent(), etc.). 
RowPart is similar to ImmutableRowPart but also has setters. The idea is that 
RowPart wraps a RowResult as well, but additionally keeps a list of updates 
done to that row. So when getSomething() is called, it first checks whether 
Something has already been updated (if so, it returns the updated version); 
otherwise it returns the value from the RowResult. RowPart can also create a 
BatchUpdate from its list of updates (a rough sketch follows this list).
  - URLs are stored in reversed-host order. For example, 
http://bar.foo.com:8983/to/index.html?a=b becomes 
com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same 
tld/host/domain are stored closer to each other. TableUtil has methods for 
reversing and unreversing URLs (an example reversal appears after this list).
  - CrawlDatum statuses are simplified. Since everything is in a central 
location now, there is no point in having both a DB and a FETCH status.
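
Here is a very rough sketch of the RowPart idea, written against the current 
HBase client classes (RowResult/BatchUpdate). The generic get/put methods below 
are my own simplification, not the actual code in the patch:

  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.hbase.io.BatchUpdate;
  import org.apache.hadoop.hbase.io.Cell;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RowPart {
    private final RowResult row;                 // data read from the webtable
    private final Map<String, byte[]> updates =  // pending writes for this row
        new HashMap<String, byte[]>();

    public RowPart(RowResult row) {
      this.row = row;
    }

    // Getters check pending updates first, then fall back to the stored row.
    public byte[] get(String column) {
      if (updates.containsKey(column)) {
        return updates.get(column);
      }
      Cell cell = row.get(Bytes.toBytes(column));
      return cell == null ? null : cell.getValue();
    }

    // Setters only record the change; nothing is written to hbase yet.
    public void put(String column, byte[] value) {
      updates.put(column, value);
    }

    // Turn the accumulated updates into a single BatchUpdate for this row.
    public BatchUpdate toBatchUpdate() {
      BatchUpdate batch = new BatchUpdate(row.getRow());
      for (Map.Entry<String, byte[]> e : updates.entrySet()) {
        batch.put(e.getKey(), e.getValue());
      }
      return batch;
    }
  }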
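
The reversed-URL keys can be approximated in isolation like this (the real 
TableUtil in the patch may differ in details such as port and path handling):

  import java.net.MalformedURLException;
  import java.net.URL;

  public class ReverseUrlExample {

    // "http://bar.foo.com:8983/to/index.html?a=b"
    //   -> "com.foo.bar:http:8983/to/index.html?a=b"
    public static String reverseUrl(String urlString) throws MalformedURLException {
      URL url = new URL(urlString);

      // Reverse the host: bar.foo.com -> com.foo.bar
      String[] parts = url.getHost().split("\\.");
      StringBuilder key = new StringBuilder();
      for (int i = parts.length - 1; i >= 0; i--) {
        key.append(parts[i]);
        if (i > 0) key.append('.');
      }

      // Append protocol, port (if any) and the rest of the URL.
      key.append(':').append(url.getProtocol());
      if (url.getPort() != -1) {
        key.append(':').append(url.getPort());
      }
      key.append(url.getFile());  // path + query, e.g. /to/index.html?a=b
      return key.toString();
    }

    public static void main(String[] args) throws MalformedURLException {
      // Prints: com.foo.bar:http:8983/to/index.html?a=b
      System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
    }
  }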

Jobs:
  - Each job marks rows so that the next job knows which rows to read. For 
example, if GeneratorHbase decides that a URL should be generated, it marks the 
URL with a TMP_FETCH_MARK (marking a URL simply means creating a special 
metadata field). When FetcherHbase runs, it skips over anything without this 
special mark (a small illustration follows this list).
  - InjectorHbase: First, a job runs where injected URLs are marked. Then, in 
the next job, if a row has the mark but nothing else (here I assumed that if a 
row has a "status:" column, it already exists), InjectorHbase initializes the 
row.
  - GeneratorHbase: Supports max-per-host configuration and topN. Marks 
generated URLs with a marker.
  - FetcherHbase: Very similar to the original Fetcher. Marks URLs that were 
successfully fetched. Skips over URLs not marked by GeneratorHbase.
  - ParseTable: Similar to the original Parser. Outlinks are stored as 
"outlinks:<fromUrl>" -> "anchor".
  - UpdateTable: Does updatedb's and invertlinks' job. Also clears any markers.
  - IndexerHbase: Indexes the _entire_ table. Skips over URLs not parsed 
successfully.
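
To make the marker mechanism concrete, the generator/fetcher hand-off could be 
illustrated like this (the column name and helper methods are placeholders, not 
the patch's actual TMP_FETCH_MARK handling):

  import org.apache.hadoop.hbase.io.BatchUpdate;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.util.Bytes;

  public class MarkerExample {
    // Placeholder marker column; the real constant lives in the patch.
    private static final String TMP_FETCH_MARK = "metadata:__tmp_fetch_mark__";

    // Generator side: mark a row that was selected for fetching.
    public static void markForFetch(BatchUpdate update) {
      update.put(TMP_FETCH_MARK, Bytes.toBytes("y"));
    }

    // Fetcher side: skip anything the generator did not mark.
    public static boolean shouldFetch(RowResult row) {
      return row.get(Bytes.toBytes(TMP_FETCH_MARK)) != null;
    }
  }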

Plugins:
  - Plugins now have a
  
  Set<String> getColumnSet();
  
  method. Before starting a job, we ask relevant plugins what exactly they want 
to read from hbase and read those columns. For example, FetcherHbase reads some 
columns but doesn't read "modifiedTime:". However, protocol-httphbase needs 
this column. So the plugin adds this column to its set and FetcherHbase reads 
"modifiedTime:" when protocol-httphbase is active. This way, plugins read 
exactly what they want, whenever they want it. For example, during parse, 
CrawlDatum's fields are normally not available. However, with this patch, a 
parse plugin can ask for any of those fields and it will get them.
  
  - Also, the plugin API is simpler now. Most plugins will look like a 
variation of this:
  
  public void doStuff(String url, RowPart row);
  
  So now a plugin can also choose to update any column it wants (a small 
example combining this with getColumnSet() follows below).
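
  For instance, a hypothetical indexing-style plugin written against this API 
might look roughly like the following. The column names and method bodies are 
assumptions for illustration only, and RowPart.get() refers to the simplified 
sketch earlier in this comment:

  import java.util.HashSet;
  import java.util.Set;

  public class ExampleIndexingPlugin {

    // Tell the job which columns this plugin needs read from the webtable.
    public Set<String> getColumnSet() {
      Set<String> columns = new HashSet<String>();
      columns.add("status:");
      columns.add("parsedText:");
      return columns;
    }

    // Called with the URL and its row; the plugin may read or update any column.
    public void doStuff(String url, RowPart row) {
      byte[] text = row.get("parsedText:");
      if (text != null) {
        // ... add the text to the document being indexed, or write something
        // back to the row via row.put(...).
      }
    }
  }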
   
C) What's missing
  - A LOT of plugins.
  - No ScoringFilters at all.
  - Converters from old data into hbase
  - GeneratorHbase: No byIP stuff. Does not shuffle URLs for fetching. No 
-adddays.
  - FetcherHbase: No byIP stuff. No parsing during fetch. Shuffling is 
important for performance, but can be fixed (one solution that comes to mind is 
to randomly partition URLs into reducers during map and perform the actual 
fetching during reduce; a rough partitioner sketch is at the end of this 
section). Supports following redirects, but not immediately. HTTP headers are 
not stored. Since there is no parsing in the fetcher, the fetcher always stores 
content.
  - ParseTable: No multi-parse (i.e. ParseResult).
  - IndexerHbase: No way to choose a subset of urls to index. (There is a 
marker in UpdateTable but I haven't put it in yet)
  - FetchSchedule: prevModifiedTime, prev... stuff are missing, as I haven't 
yet figured out a way to read older versions of the same column.
  
  Most of what's missing is stuff I didn't have time to code. Should be easy to 
add later on.
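
As a rough sketch of the shuffling fix mentioned under FetcherHbase above: a 
partitioner along these lines would spread URLs across reducers. It hashes the 
full URL instead of drawing a true random number so that partitioning stays 
deterministic across task retries; the class name and key/value types are 
assumptions:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class SpreadingFetchPartitioner implements Partitioner<Text, Writable> {

    public void configure(JobConf job) {
      // No configuration needed.
    }

    // Spread URLs roughly evenly across reducers so that rows from the same
    // host (which are adjacent in the table) don't all land in one reducer.
    public int getPartition(Text url, Writable value, int numReduceTasks) {
      return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }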

As always, suggestions/reviews are welcome.

> Hbase Integration
> -----------------
>
>                 Key: NUTCH-650
>                 URL: https://issues.apache.org/jira/browse/NUTCH-650
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>         Attachments: hbase-integration_v1.patch
>
>
> This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
