Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "IndexStructure" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=18&rev2=19

  The index structure formed after indexing is shown below :
- ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''||
+ ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''|| '''version'''||
+ || || || || || || '''1.x''' || '''2.x''' ||
- || boost || YES || Not Indexed || various scoring plugins || Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. ||
+ || boost || YES || Not Indexed || various scoring plugins || Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. || ? || ? ||
- || digest || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. ||
+ || digest || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. || ? || ? ||
- || lang || YES || Un-Tokenized || language-identifier || Add a '''lang''', language field to a document.||
+ || lang || YES || Un-Tokenized || language-identifier || Add a '''lang''', language field to a document.|| ? || ? ||
- || segment || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds the originating '''segment''' field to the document, used to identify the most recent segment in which this document was fetched. ||
+ || segment || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds the originating '''segment''' field to the document, used to identify the most recent segment in which this document was fetched. || ? || ? ||
- || tstamp || YES || Tokenized || /!\ NEEDS COMMENT /!\ || Adds a '''timestamp''' field of the most recent time this document was fetched ||
+ || tstamp || YES || Tokenized || /!\ NEEDS COMMENT /!\ || Adds a '''timestamp''' field of the most recent time this document was fetched || ? || ? ||
- || cc:license || YES || Indexed, Tokenized || creativecommons || Adds the entire license as '''cc:license=xxx''' and '''attributes''' extracted of the license url||
+ || cc:license || YES || Indexed, Tokenized || creativecommons || Adds the entire license as '''cc:license=xxx''' and '''attributes''' extracted of the license url|| ? || ? ||
- || cc:meta || YES || Indexed, Tokenized || creativecommons || Adds the license location as '''cc:meta=xxx''' ||
+ || cc:meta || YES || Indexed, Tokenized || creativecommons || Adds the license location as '''cc:meta=xxx''' || ? || ? ||
- || cc:type || YES || Indexed,Tokenized || creativecommons || Adds the work type as '''cc:type=xxx'''||
+ || cc:type || YES || Indexed,Tokenized || creativecommons || Adds the work type as '''cc:type=xxx'''|| ? || ? ||
- || anchor || NO || Tokenized || index-anchor || Indexing filter that indexes all inbound '''anchor text''' for a document.||
+ || anchor || NO || Tokenized || index-anchor || Indexing filter that indexes all inbound '''anchor text''' for a document.|| ? || ? ||
- || title || YES || Tokenized || index-basic || Adds basic searchable '''title field''' to a document. Also indexed by index-more ||
+ || title || YES || Tokenized || index-basic || Adds basic searchable '''title field''' to a document. Also indexed by index-more || ? || ? ||
- || site || NO || Un-Tokenized || index-basic || Adds basic searchable '''site field''' to a document. ||
+ || site || NO || Un-Tokenized || index-basic || Adds basic searchable '''site field''' to a document. || ? || ? ||
- || host || NO || Tokenized || index-basic || Adds basic searchable '''hostname field''' to a document. ||
+ || host || NO || Tokenized || index-basic || Adds basic searchable '''hostname field''' to a document. || ? || ? ||
- || url || YES || Tokenized || index-basic || Adds basic searchable '''URL field''' to a document. ||
+ || url || YES || Tokenized || index-basic || Adds basic searchable '''URL field''' to a document. || ? || ? ||
- || content || NO || Tokenized || index-basic || Adds basic searchable '''content field''' to a document. ||
+ || content || NO || Tokenized || index-basic || Adds basic searchable '''content field''' to a document. || ? || ? ||
- || lastModified || NO || Indexed, Un-Tokenized || index-more || Adds some time related meta info in the form of '''last-modified''' if present. ||
+ || lastModified || NO || Indexed, Un-Tokenized || index-more || Adds some time related meta info in the form of '''last-modified''' if present. || ? || ? ||
- || date || NO || Indexed, Un-Tokenized || index-more || Index date as last-modified, or, if that's not present, uses fetch time. ||
+ || date || NO || Indexed, Un-Tokenized || index-more || Index date as last-modified, or, if that's not present, uses fetch time. || ? || ? ||
- || contentLength || NO || Indexed, Un-Tokenized || index-more || /!\ NEEDS COMMENT /!\ ||
+ || contentLength || NO || Indexed, Un-Tokenized || index-more || /!\ NEEDS COMMENT /!\ || ? || ? ||
- || type || NO || Indexed, Un-Tokenized || index-more || Adds contentType, primaryType, subType (all mime-types) ||
+ || type || NO || Indexed, Un-Tokenized || index-more || Adds contentType, primaryType, subType (all mime-types) || ? || ? ||
- || primaryType || NO || Indexed, Un-Tokenized || index-more || primaryType (mime-type) ||
+ || primaryType || NO || Indexed, Un-Tokenized || index-more || primaryType (mime-type) || ? || ? ||
- || subType || NO || Indexed, Un-Tokenized || index-more || subType (mime-type) ||
+ || subType || NO || Indexed, Un-Tokenized || index-more || subType (mime-type) || ? || ? ||
- || tld || YES || Un-Tokenized / NotStored(based on conf) || tld || Adds a '''top level domain''' field to the document. ||
+ || tld || YES || Un-Tokenized / NotStored(based on conf) || tld || Adds a '''top level domain''' field to the document. || ? || ? ||
- || subcollection || YES || Tokenized || subcollection || For Comprehensive description see src/java/org/apache/nutch/collection/'''package.html''' ||
+ || subcollection || YES || Tokenized || subcollection || For Comprehensive description see src/java/org/apache/nutch/collection/'''package.html''' || ? || ? ||
- || urlmeta || NO || Indexed, Un-Tokenized || urlmeta || Adds any specified '''url metadata tags''' to the document in the index.||
+ || urlmeta || NO || Indexed, Un-Tokenized || urlmeta || Adds any specified '''url metadata tags''' to the document in the index.|| ? || ? ||
  ----
  Jira Issues about indexing and IndexingFilterPlugins are