[Nutch Wiki] Update of "NutchFileFormats" by LewisJohnMcgibbney

Apache Wiki Thu, 24 Sep 2015 20:14:50 -0700

The "NutchFileFormats" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchFileFormats?action=diff&rev1=7&rev2=8

  
  = CrawlDB =
  
- Content here is under construction.
+ == Description ==
+ 
+ Nutch maintains a CrawlDB containing [[http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/CrawlDatum.html|CrawlDatum]] objects.
+ 
+ == Directory Structure ==
+ {{{
+ .
+ ├── current
+ │   └── part-00000
+ │       ├── data
+ │       └── index
+ └── old
+     ├── part-00000
+     │   ├── data
+     │   └── index
+     ├── part-00001
+     │   ├── data
+     │   └── index
+     └── ...
+ }}}
+ 
+ == File Formats ==
+ 
+ {{{#!CSV ,
+ file,key datatype,value datatype,codec
+ data,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,
+ index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ }}}
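A Hadoop MapFile's data and index files are both SequenceFiles, and a SequenceFile records its key class, value class, and codec in its header. The class names in the table above can therefore be checked against a real file. The following is a minimal sketch, assuming the version 6 SequenceFile header layout and class names shorter than 128 bytes. `read_sequencefile_header` is a hypothetical helper written for illustration, not part of Nutch or Hadoop.

```python
import io


def _read_short_string(stream):
    """Read a length-prefixed UTF-8 string as written by Hadoop's Text.

    Simplification: only single-byte length prefixes (0..127) are handled,
    which is enough for the class names listed in the table.
    """
    prefix = stream.read(1)
    if len(prefix) != 1:
        raise ValueError("unexpected end of header")
    length = prefix[0]
    if length > 127:
        raise ValueError("multi-byte length prefixes are not handled in this sketch")
    raw = stream.read(length)
    if len(raw) != length:
        raise ValueError("unexpected end of header")
    return raw.decode("utf-8")


def read_sequencefile_header(header_bytes):
    """Extract key class, value class and codec from the start of a SequenceFile."""
    stream = io.BytesIO(header_bytes)
    magic = stream.read(4)
    if len(magic) != 4 or magic[:3] != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = magic[3]
    if version != 6:
        # Assumption: older header layouts are out of scope for this sketch.
        raise ValueError("only version 6 headers are handled in this sketch")
    key_class = _read_short_string(stream)
    value_class = _read_short_string(stream)
    flags = stream.read(2)
    if len(flags) != 2:
        raise ValueError("unexpected end of header")
    compressed, block_compressed = bool(flags[0]), bool(flags[1])
    # The codec class name is only present when compression is enabled.
    codec = _read_short_string(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }
```

To try it, pass the first few hundred bytes of a file, for example `read_sequencefile_header(open(path, "rb").read(512))`, where `path` points at a `data` or `index` file in your own CrawlDB.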
  
  = LinkDB =
  
- Content here is under construction.
+ == Description ==
+ 
+ Maintains an inverted link map, listing incoming links for each url.
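To illustrate what "inverted link map" means here, the sketch below turns a mapping of pages to their outgoing links into a mapping of pages to their incoming links. It is a toy in-memory version with a hypothetical function name and made-up URLs. The real LinkDB stores Inlinks objects, which also carry anchor text that this sketch leaves out.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source_url: [target_url, ...]} into {target_url: [source_url, ...]}."""
    inlinks = defaultdict(list)
    for source in sorted(outlinks):
        for target in outlinks[source]:
            # Record each linking page once per target, even if it links twice.
            if source not in inlinks[target]:
                inlinks[target].append(source)
    return dict(inlinks)


# Hypothetical crawl result: page a links to b and c, page b links to c.
example = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
}
inverted = invert_links(example)
```

Looking up `inverted["http://c.example/"]` then gives the pages that link to it, which is the question the LinkDB is built to answer.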
+ 
+ == Directory Structure ==
+ 
+ {{{
+ .
+ └── current
+     └── part-00000
+         ├── data
+         └── index
+ }}}
+ 
+ == File Formats ==
+ 
+ {{{#!CSV ,
+ file,key datatype,value datatype,codec
+ data,org.apache.hadoop.io.Text,org.apache.nutch.crawl.Inlinks,org.apache.hadoop.io.compress.DefaultCodec
+ index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ }}}
  
  = Segments =
  
+ == Description ==
+ 
- When Nutch crawls the web, each resulting segment has four subdirectories, each containing an ArrayFile (a MapFile having keys that are long integers):
+ When Nutch crawls the web, each resulting segment holds the content that was actually fetched. A segment has six subdirectories; most contain MapFiles (a sorted data file plus an index file) keyed by URL.
  
+ == Directory Structure ==
- {{{#!CSV ,
- Subdirectory,Value datatype,Variable
- fetchlist,net.nutch.pagedb.FetchListEntry,fetchList
- fetcher,net.nutch.fetcher.FetcherOutput,fetcherDb
- fetcher_content,net.nutch.fetcher.FetcherContent,rawDb
- fetcher_text,net.nutch.fetcher.FetcherText,strippedDb
- }}}
  
- Crawling is performed by net.nutch.fetcher.Fetcher which starts a number of parallel FetcherThread instances. Each thread gets a URL from the fetchList, checks robots.txt, retrieves the contents and appends the results to fetcherDb, rawDb, and strippedDb.
+ {{{
+ .
+ ├── content
+ │   ├── part-00000
+ │   │   ├── data
+ │   │   └── index
+ │   └── part-...
+ ├── crawl_fetch
+ │   ├── part-00000
+ │   │   ├── data
+ │   │   └── index
+ │   └── part-...
+ ├── crawl_generate
+ │   └── part-00000
+ ├── crawl_parse
+ │   ├── part-00000
+ │   └── part-00001
+ ├── parse_data
+ │   ├── part-00000
+ │   │   ├── data
+ │   │   └── index
+ │   └── part-...
+ └── parse_text
+     ├── part-00000
+     │   ├── data
+     │   └── index
+     └── part-...
+ }}}
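In the tree above, some parts are directories holding a data and an index file, while others are single files. As a rough sketch, the following reports which is which for a segment on the local filesystem. `describe_segment` is a hypothetical helper, and the labels "mapfile", "file", and "other" are naming choices made here, not Nutch terminology.

```python
from pathlib import Path


def describe_segment(segment_dir):
    """Map each segment subdirectory to {part name: "mapfile" | "file" | "other"}."""
    layout = {}
    subdirs = sorted(p for p in Path(segment_dir).iterdir() if p.is_dir())
    for sub in subdirs:
        parts = {}
        for part in sorted(sub.glob("part-*")):
            if part.is_dir() and (part / "data").is_file() and (part / "index").is_file():
                # A directory with both files matches the MapFile layout.
                parts[part.name] = "mapfile"
            elif part.is_file():
                # A single part file, as shown for crawl_generate in the tree.
                parts[part.name] = "file"
            else:
                parts[part.name] = "other"
        layout[sub.name] = parts
    return layout
```

Calling `describe_segment` on a segment directory returns a nested dictionary that can be compared against the tree, which is a quick way to spot missing or partially written parts.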
+ 
+ == File Formats ==
+ 
+ {{{#!CSV ,
+ Subdirectory,file,key datatype,value datatype,codec
+ content,data,org.apache.hadoop.io.Text,org.apache.nutch.protocol.Content,org.apache.hadoop.io.compress.DefaultCodec
+ content,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_fetch,data,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_fetch,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_generate,part-00000,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_parse,data,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_parse,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ parse_data,data,org.apache.hadoop.io.Text,org.apache.nutch.parse.ParseData,org.apache.hadoop.io.compress.DefaultCodec
+ parse_data,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ parse_text,data,org.apache.hadoop.io.Text,org.apache.nutch.parse.ParseText,org.apache.hadoop.io.compress.DefaultCodec
+ parse_text,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ }}}
  
  = Old File Format Documentation =
