[Nutch Wiki] Update of "NutchFileFormats" by LewisJohnMcgibbney

Apache Wiki Thu, 24 Sep 2015 19:12:55 -0700

The "NutchFileFormats" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchFileFormats?action=diff&rev1=4&rev2=5

  
  = Introduction =
  
- The page provides information on the Nutch file formats (for the Nutch 1.X 
series) from the bottom up.
+ The page provides information on the Nutch file formats (for the Nutch 1.X 
series) from the bottom up. Within the context of this document, we use the 
term '''custom types''' to refer to physical files which can be written by 
Nutch. More is explained about Writables below. '''N.B.''' Nutch uses several 
core data structures and serialization mechanisms directly from Apache Hadoop, 
so please read this document with that in mind.
  
- = Nutch Files in Detail =
+ = Nutch Custom Writables =
  
  Nutch implements its own custom serialization to store custom serialized Java 
data types and structures on file. The interface 
[[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/Writable.html|org.apache.hadoop.io.Writable]]
 must be implemented for all such data types.
  
@@ -42, +42 @@

  
  = Segments = 
  
- TODO
+ == org.apache.hadoop.io.Text ==
  
- Nutch uses Java's native UTF-8 character set, and the class net.nutch.io.UTF8 
for writing short strings to files. The UTF8 class limits the length of strings 
to 0xffff/3 or 21845 bytes. The function UTF8.write() uses 
java.io.DataOutput.writeShort() to prepend the length of the string. This is 
why the two bytes \000\003 is seen before a three letter word in a file. The 
zero byte is thus not a null termination of the previous string (strings are 
not null terminated), but the most significant byte of the 16 bit short integer 
indicating the length of the following string.
+ Nutch encodes strings as UTF-8 and uses the class 
[[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/Text.html|org.apache.hadoop.io.Text]]
 for writing them to files. The rest of this paragraph describes the older 
UTF8 class, which Text has since replaced; Text prefixes each string with a 
variable-length integer rather than a fixed 16-bit short. The UTF8 class limits 
the length of strings to 0xffff/3 or 21845 bytes. The function UTF8.write() 
uses java.io.DataOutput.writeShort() to prepend the length of the string. This 
is why the two bytes \000\003 are seen before a three-letter word in a file. 
The zero byte is thus not a null termination of the previous string (strings 
are not null terminated), but the most significant byte of the 16-bit short 
integer indicating the length of the following string.
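The two-byte length prefix described for UTF8.write() can be reproduced with 
plain java.io classes. The following is a minimal sketch, not Nutch or Hadoop 
code: the class name LengthPrefixedString and its encode() method are 
hypothetical, and it uses standard UTF-8 rather than whatever encoding details 
the original UTF8 class applied.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Nutch or Hadoop. */
class LengthPrefixedString {

    /** Encodes s as a 2-byte big-endian length followed by its UTF-8 bytes. */
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 0xffff) {
            throw new IllegalArgumentException("string too long for a 16-bit length");
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeShort(utf8.length); // most significant byte is written first
            out.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode("abc")) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // 00 03 61 62 63
    }
}
```

The leading 00 03 is the short length; the zero byte belongs to the length, 
not to any preceding string.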
  
+ == org.apache.hadoop.io.SequenceFile ==
+ 
- Nutch relies heavily on mappings (associative arrays) from keys to values. 
The class net.nutch.io.SequenceFile is a flat file of keys and values. The 
first four bytes of each such file are ASCII "SEQ" and \001 (C-a), followed by 
the Java class names of keys and values, written as UTF8 strings, e.g. 
"SEQ\001\000\004long\000\004long", for a mapping from long integers to long 
integers. After that follows the key-value pairs. Each pair is introduced by 
four bytes telling the length in bytes of the pair (excluding the eight length 
bytes) and four bytes telling the length of the key. The typical long (64 bit) 
integer is 8 bytes and a long-to-long mapping will have pairs of length 16 
bytes, e.g.
+ Nutch relies heavily on mappings (associative arrays) from keys to values. 
The class 
[[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/SequenceFile.html|SequenceFile]]
 is a flat file of keys and values. The first four bytes of each such file are 
ASCII "SEQ" and \001 (C-a), followed by the Java class names of keys and 
values, written as UTF8 strings, e.g. "SEQ\001\000\004long\000\004long", for a 
mapping from long integers to long integers. After that follow the key-value 
pairs. Each pair is introduced by four bytes giving the length in bytes of the 
pair (excluding the eight length bytes) and four bytes giving the length of 
the key. The typical long (64-bit) integer is 8 bytes, so a long-to-long 
mapping will have pairs of length 16 bytes, e.g.
  
  {{{
    00 00 00 10                                   int length of pair = 0x10 = 16 bytes
@@ -55, +57 @@

    00 00 00 00 00 0a 42 9b       long value = 0xa429b = 672411
  }}}
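The record layout above (int pair length, int key length, long key, long 
value) can be read back with java.io.DataInputStream. This is a sketch under 
stated assumptions rather than the SequenceFile reader itself: the class 
LongPairRecord is hypothetical, and the key in the sample record (1) is made 
up, since only the value bytes appear in the dump.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical reader for one long-to-long record; not Nutch or Hadoop code. */
class LongPairRecord {

    /** Returns {key, value} from: int pair length, int key length, long key, long value. */
    static long[] read(byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            int pairLength = in.readInt(); // excludes the eight length bytes
            int keyLength = in.readInt();
            if (keyLength != Long.BYTES || pairLength - keyLength != Long.BYTES) {
                throw new IllegalArgumentException("not a long-to-long record");
            }
            return new long[] {in.readLong(), in.readLong()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sample record: the key (1) is an assumption, the value bytes match the dump. */
    static byte[] sample() {
        return new byte[] {
            0, 0, 0, 0x10,                          // pair length = 16
            0, 0, 0, 0x08,                          // key length = 8
            0, 0, 0, 0, 0, 0, 0, 1,                 // key (made up)
            0, 0, 0, 0, 0, 0x0a, 0x42, (byte) 0x9b  // value = 0xa429b
        };
    }

    public static void main(String[] args) {
        long[] pair = read(sample());
        System.out.println(pair[0] + " -> " + pair[1]); // 1 -> 672411
    }
}
```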
  
+ More detail on SequenceFile characteristics, such as file headers and 
compression options, can be found in the 
[[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/SequenceFile.html|SequenceFile]]
 Javadoc.
+ 
+ == org.apache.hadoop.io.MapFile ==
+ 
- To economize the handling of large data volumes, net.nutch.io.MapFile manages 
a mapping as two separate files in a subdirectory of its own. The large "data" 
file stores all keys and values, sorted by the key. The much smaller "index" 
file points to byte offsets in the data file for a small sample of keys. Only 
the index file is read into memory.
+ To economize the handling of large data volumes, 
[[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/MapFile.html|MapFile]]
 manages a mapping as two separate files in a subdirectory of its own. The 
large "data" file stores all keys and values, sorted by the key. The much 
smaller "index" file points to byte offsets in the data file for a small sample 
of keys. Only the index file is read into memory.
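The data/index split can be illustrated in memory. The sketch below is not the 
MapFile implementation: SparseIndexSketch is a hypothetical class, sorted 
arrays stand in for the "data" file, array positions stand in for byte 
offsets, and the sampling interval is a free parameter chosen for the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of a sorted data file plus a sparse index. */
class SparseIndexSketch {
    private final long[] keys;      // stands in for the "data" file, sorted by key
    private final String[] values;
    // Stands in for the "index" file: every interval-th key and its position.
    private final TreeMap<Long, Integer> index = new TreeMap<>();

    SparseIndexSketch(long[] sortedKeys, String[] values, int interval) {
        if (sortedKeys.length != values.length || interval < 1) {
            throw new IllegalArgumentException("mismatched arrays or bad interval");
        }
        this.keys = sortedKeys;
        this.values = values;
        for (int i = 0; i < sortedKeys.length; i += interval) {
            index.put(sortedKeys[i], i);
        }
    }

    /** Finds the nearest indexed key at or below the target, then scans forward. */
    String get(long key) {
        Map.Entry<Long, Integer> start = index.floorEntry(key);
        if (start == null) {
            return null; // target is smaller than every stored key
        }
        for (int i = start.getValue(); i < keys.length && keys[i] <= key; i++) {
            if (keys[i] == key) {
                return values[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SparseIndexSketch map = new SparseIndexSketch(
            new long[] {1, 3, 5, 7, 9, 11},
            new String[] {"a", "b", "c", "d", "e", "f"},
            3);
        System.out.println(map.get(9)); // e
        System.out.println(map.get(4)); // null
    }
}
```

Only the small index needs to be held in memory; a lookup touches at most one 
sampling interval of the data.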
  
- net.nutch.io.ArrayFile is a specialization of MapFile where the keys are long 
integers.
+ 
[[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/ArrayFile.html|ArrayFile]]
 is a specialization of MapFile, specifically a dense file-based mapping from 
long integers to values. Finally, see also 
[[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/SetFile.html|SetFile]],
 which represents a file-based set of keys.
  
- The Java files in net.nutch.io.* comprise 2556 lines of source code. The 
biggest one is Sequencefile.java, which contains a Writer (112 lines), a Reader 
(138 lines), a BufferedRandomAccessFile (140 lines) and a Sorter (389 lines).
+ Additional files in the 
[[http://hadoop.apache.org/docs/current2/api/index.html?org/apache/hadoop/io/package-summary.html|org.apache.hadoop.io.*]]
 package contain the actual Writer, Reader and Sorter implementations as well.
  
  When Nutch crawls the web, each resulting segment has four subdirectories, 
each containing an ArrayFile (a MapFile having keys that are long integers):
