I've seen it now, thanks for the attention.


oSilvio wrote:
> 
> Very useful information, thanks!
> But in order to extract the data inside those files (like HTML pages), I
> can find neither an algorithm provided by Nutch nor a description of the
> process used to store the data. Do you know if it is possible to extract
> it using Lucene?
> 
>  
> 
> Dennis Kubes-2 wrote:
>> 
>> The Nutch databases are in either SequenceFile or MapFile format, both
>> of which store key/value pairs.  Their keys and values are Writable
>> implementations, which translate an object into its byte equivalent and
>> vice versa.
>> 
>> The data and index files together make up a MapFile.  Data is a
>> SequenceFile; index is the index that the MapFile uses for seeking to a
>> specific key.
>> 
>> Please see the Hadoop wiki for more information about SequenceFile,
>> MapFile, and the Writable formats.
>> 
>> Dennis
>> 
>> oSilvio wrote:
>>> Does anybody know how the file structure works, briefly?
>>> It seems that the data is compressed or something; it's not possible to
>>> understand what's recorded in the data or index files.
>>> Thanks
>>> Silvio
>> 
>> 
> 
> 
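
For anyone finding this thread later, here is a rough sketch of the idea
Dennis describes: values that turn themselves into bytes and back, a data
stream of key/value records, and a separate index used to seek to a key.
It uses only java.io, not the Hadoop classes. SimpleWritable, PageInfo,
writeRecords and lookup are made-up names for illustration; only the two
method signatures (write(DataOutput) and readFields(DataInput)) mirror
Hadoop's Writable interface. The real SequenceFile and MapFile formats are
more elaborate (headers, sync markers, optional compression, a sparse
index), so this will not read actual Nutch files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class WritableSketch {

    // Hypothetical stand-in with the same two methods as Hadoop's Writable.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical value type: a fetch status and a score for one URL.
    static class PageInfo implements SimpleWritable {
        int status;
        float score;

        PageInfo() {}

        PageInfo(int status, float score) {
            this.status = status;
            this.score = score;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(status);
            out.writeFloat(score);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            status = in.readInt();
            score = in.readFloat();
        }
    }

    // Writes the records in key order into one byte array (the "data" part)
    // and records the byte offset of every key (the "index" part).
    // Assumption: indexing every key is a simplification of a real index.
    static byte[] writeRecords(TreeMap<String, PageInfo> records,
                               Map<String, Integer> index) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (Map.Entry<String, PageInfo> e : records.entrySet()) {
            index.put(e.getKey(), out.size()); // offset where this record starts
            out.writeUTF(e.getKey());
            e.getValue().write(out);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Seeks to the key's offset and rebuilds the value from its bytes.
    // Returns null if the key is not in the index.
    static PageInfo lookup(byte[] data, Map<String, Integer> index, String key)
            throws IOException {
        Integer offset = index.get(key);
        if (offset == null) {
            return null;
        }
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset));
        in.readUTF(); // skip the stored key
        PageInfo value = new PageInfo();
        value.readFields(in);
        return value;
    }

    public static void main(String[] args) throws IOException {
        TreeMap<String, PageInfo> records = new TreeMap<>();
        records.put("http://a.example/", new PageInfo(1, 0.5f));
        records.put("http://b.example/", new PageInfo(2, 1.5f));

        Map<String, Integer> index = new HashMap<>();
        byte[] data = writeRecords(records, index);

        PageInfo found = lookup(data, index, "http://b.example/");
        System.out.println("status=" + found.status + " score=" + found.score);
    }
}
```

This is also why the data and index files look unreadable in a text
editor: the values are stored as raw binary fields, not as text.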

-- 
View this message in context: 
http://www.nabble.com/File-system-tp21022587p21033338.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
