I've seen it now. Thanks for the attention.
oSilvio wrote:
> Very useful information, thanks!
> But in order to extract the data inside those files (like HTML pages) I
> can find no algorithm provided by Nutch, nor the process used to store
> the data. Do you know if it is possible to extract it using Lucene?
>
> Dennis Kubes-2 wrote:
>> The Nutch databases are in either SequenceFile or MapFile format, both
>> of which store key and value pairs. Their keys and values are Writable
>> implementations, which translate an object into its byte equivalent and
>> vice versa.
>>
>> The data and index files together make up the MapFile format. Data is a
>> SequenceFile; index is the index used by MapFiles for seeking to a
>> specific key.
>>
>> Please see the Hadoop wiki for more information about Sequence and Map
>> files and Writable formats.
>>
>> Dennis
>>
>> oSilvio wrote:
>>> Does somebody know how the file structure works, briefly?
>>> It seems that the data is compressed or something; it's not possible to
>>> understand what's recorded in the data or index files.
>>> Thanks
>>> Silvio

--
View this message in context: http://www.nabble.com/File-system-tp21022587p21033338.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
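
For anyone landing on this thread later, here is a minimal sketch of the Writable idea Dennis describes: an object writes its fields to a byte stream and reads them back in the same order. It uses only java.io so it runs without Hadoop on the classpath. `SimpleWritable`, `PageRecord`, and the `url`/`fetchTime` fields are hypothetical names made up for illustration and are not Nutch classes; the two method names mirror Hadoop's `Writable` interface. Against real Nutch files you would presumably go through Hadoop's `SequenceFile.Reader` or `MapFile.Reader` with the actual key and value classes, which this sketch does not attempt.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustration only: a dependency-free imitation of the Writable pattern.
class WritableSketch {

    // Same two-method shape as Hadoop's Writable, redeclared locally (assumption:
    // we only want to show the pattern, not depend on Hadoop).
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical value type; real Nutch records have different fields.
    static final class PageRecord implements SimpleWritable {
        String url = "";
        long fetchTime;

        PageRecord() {}

        PageRecord(String url, long fetchTime) {
            this.url = url;
            this.fetchTime = fetchTime;
        }

        // Object -> bytes: fields are written in a fixed order.
        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeLong(fetchTime);
        }

        // Bytes -> object: fields are read back in the same order.
        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            fetchTime = in.readLong();
        }
    }

    // Serialize any SimpleWritable into a byte array.
    static byte[] toBytes(SimpleWritable w) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Rebuild a PageRecord from bytes produced by toBytes.
    static PageRecord fromBytes(byte[] bytes) {
        PageRecord record = new PageRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        PageRecord original = new PageRecord("http://example.com/", 1229000000000L);
        byte[] raw = toBytes(original);
        PageRecord copy = fromBytes(raw);
        System.out.println(raw.length + " bytes -> " + copy.url + " @ " + copy.fetchTime);
    }
}
```

This is also why the data and index files look unreadable in a text editor: the values are binary-encoded fields rather than text, so they only make sense when read back with the class that wrote them.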