Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "IndexStructure" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=18&rev2=19

  
  The index structure formed after indexing is shown below : 
  
- ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' 
||'''Comment'''||
+ ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' 
||'''Comment'''|| '''version'''||
+ || || || || || || '''1.x''' || '''2.x''' ||
- ||    boost    ||     YES     ||      Not Indexed     || various scoring 
plugins || Adds a '''score''' value field to a particular document. This is 
allocated based upon its importance within the webgraph. ||
+ ||    boost    ||     YES     ||      Not Indexed     || various scoring 
plugins || Adds a '''score''' value field to a particular document. This is 
allocated based upon its importance within the webgraph. || ?  || ? ||
- ||    digest  ||      YES     ||      Not Indexed     || 
org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' 
field to a document. Can be MD5 over content and headers or more sophisticated 
text profile of the content. ||
+ ||    digest  ||      YES     ||      Not Indexed     || 
org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' 
field to a document. Can be MD5 over content and headers or more sophisticated 
text profile of the content. ||  ?  || ? ||
- ||    lang    ||      YES     ||      Un-Tokenized    ||      
language-identifier || Add a '''lang''', language field to a document.||
+ ||    lang    ||      YES     ||      Un-Tokenized    ||      
language-identifier || Add a '''lang''', language field to a document.||  ?  || 
? ||
- ||    segment ||              YES     ||      Not Indexed     || 
org.apache.nutch.indexer.IndexerMapReduce.java || Adds the originating 
'''segment''' field to the document, used to identify the most recent segment 
in which this document was fetched. ||
+ ||    segment ||              YES     ||      Not Indexed     || 
org.apache.nutch.indexer.IndexerMapReduce.java || Adds the originating 
'''segment''' field to the document, used to identify the most recent segment 
in which this document was fetched. ||  ?  || ? ||
- ||    tstamp  ||      YES     ||      Tokenized       || /!\ NEEDS COMMENT 
/!\ || Adds a '''timestamp''' field of the most recent time this document was 
fetched ||
+ ||    tstamp  ||      YES     ||      Tokenized       || /!\ NEEDS COMMENT 
/!\ || Adds a '''timestamp''' field of the most recent time this document was 
fetched ||  ?  || ? ||
- ||    cc:license      ||      YES     ||      Indexed, Tokenized      || 
creativecommons || Adds the entire license as '''cc:license=xxx''' and 
'''attributes''' extracted of the license url||
+ ||    cc:license      ||      YES     ||      Indexed, Tokenized      || 
creativecommons || Adds the entire license as '''cc:license=xxx''' and 
'''attributes''' extracted of the license url||  ?  || ? ||
- ||    cc:meta ||      YES     ||      Indexed, Tokenized      ||      
creativecommons || Adds the license location as '''cc:meta=xxx''' ||
+ ||    cc:meta ||      YES     ||      Indexed, Tokenized      ||      
creativecommons || Adds the license location as '''cc:meta=xxx''' ||  ?  || ? ||
- ||    cc:type ||      YES     ||      Indexed,Tokenized       ||      
creativecommons || Adds the work type as '''cc:type=xxx'''||
+ ||    cc:type ||      YES     ||      Indexed,Tokenized       ||      
creativecommons || Adds the work type as '''cc:type=xxx'''||  ?  || ? ||
- ||    anchor  ||      NO      ||      Tokenized       ||      index-anchor || 
Indexing filter that indexes all inbound '''anchor text''' for a document.||
+ ||    anchor  ||      NO      ||      Tokenized       ||      index-anchor || 
Indexing filter that indexes all inbound '''anchor text''' for a document.||  ? 
 || ? ||
- ||    title   ||      YES     ||      Tokenized       ||      index-basic     
|| Adds basic searchable '''title field''' to a document. Also indexed by 
index-more ||
+ ||    title   ||      YES     ||      Tokenized       ||      index-basic     
|| Adds basic searchable '''title field''' to a document. Also indexed by 
index-more ||  ?  || ? ||
- ||    site    ||      NO      ||      Un-Tokenized    ||      index-basic || 
Adds basic searchable '''site field''' to a document. ||
+ ||    site    ||      NO      ||      Un-Tokenized    ||      index-basic || 
Adds basic searchable '''site field''' to a document. ||  ?  || ? ||
- ||    host    ||      NO      ||      Tokenized       ||      index-basic     
|| Adds basic searchable '''hostname field''' to a document. ||
+ ||    host    ||      NO      ||      Tokenized       ||      index-basic     
|| Adds basic searchable '''hostname field''' to a document. ||  ?  || ? ||
- ||    url     ||      YES     ||      Tokenized       ||      index-basic || 
Adds basic searchable '''URL field''' to a document. ||
+ ||    url     ||      YES     ||      Tokenized       ||      index-basic || 
Adds basic searchable '''URL field''' to a document. ||  ?  || ? ||
- ||    content         ||      NO      ||      Tokenized       ||      
index-basic     || Adds basic searchable '''content field''' to a document. ||
+ ||    content         ||      NO      ||      Tokenized       ||      
index-basic     || Adds basic searchable '''content field''' to a document. ||  
?  || ? ||
- ||    lastModified    ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more || Adds some time related meta info in the form of 
'''last-modified''' if present. ||
+ ||    lastModified    ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more || Adds some time related meta info in the form of 
'''last-modified''' if present. ||  ?  || ? ||
- ||    date    ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more || Index date as last-modified, or, if that's not present, uses 
fetch time. ||
+ ||    date    ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more || Index date as last-modified, or, if that's not present, uses 
fetch time. ||  ?  || ? ||
- ||    contentLength   ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more || /!\ NEEDS COMMENT /!\ ||
+ ||    contentLength   ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more || /!\ NEEDS COMMENT /!\ ||  ?  || ? ||
- ||    type    ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more      || Adds contentType, primaryType, subType (all mime-types) ||
+ ||    type    ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more      || Adds contentType, primaryType, subType (all mime-types) ||  
?  || ? ||
- ||    primaryType     ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more      ||      primaryType (mime-type) ||
+ ||    primaryType     ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more      ||      primaryType (mime-type) ||  ?  || ? ||
- ||    subType         ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more      ||      subType (mime-type) ||
+ ||    subType         ||      NO      ||      Indexed, Un-Tokenized   ||      
index-more      ||      subType (mime-type) ||  ?  || ? ||
- ||      tld             ||     YES      || Un-Tokenized / NotStored(based on 
conf) || tld || Adds a '''top level domain''' field to the document.  ||
+ ||      tld             ||     YES      || Un-Tokenized / NotStored(based on 
conf) || tld || Adds a '''top level domain''' field to the document.  ||  ?  || 
? ||
- ||      subcollection   ||    YES || Tokenized || subcollection || For 
Comprehensive description see 
src/java/org/apache/nutch/collection/'''package.html'''   ||
+ ||      subcollection   ||    YES || Tokenized || subcollection || For 
Comprehensive description see 
src/java/org/apache/nutch/collection/'''package.html'''   ||  ?  || ? ||
- ||    urlmeta ||      NO      ||      Indexed, Un-Tokenized   ||      urlmeta 
        || Adds any specified '''url metadata tags''' to the document in the 
index.||
+ ||    urlmeta ||      NO      ||      Indexed, Un-Tokenized   ||      urlmeta 
        || Adds any specified '''url metadata tags''' to the document in the 
index.||  ?  || ? ||
  ----
  Jira Issues about indexing and IndexingFilterPlugins are 
  
