[
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Enis Soztutar updated NUTCH-439:
--------------------------------
Attachment: tld_plugin_v2.3.patch
bq. TLDScoringFilter contains a misspelled field, tldEnties, it should be
renamed to tldEntries
Done!
bq. one of the use cases for the "tld" index field that you mention is that
users may search on it. But in the latest patch this field is added with
Field.Index.NO, which makes searching on it impossible. Also, in order to
search on arbitrary Lucene fields Nutch needs a Query filter, so we would need
a TLDQueryFilter, which doesn't exist (yet?).
Well, in fact NUTCH-445 covers searching on TLDs: we would be able to
search site:lucene.apache.org, site:apache.org, or even site:org. Therefore I
think indexing tld fields and a TLDQueryFilter are not needed. I will delve
deeper into NUTCH-445 as soon as I find some time. We can move the domain
indexing functionality to index-basic so that it will be generic enough.
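To illustrate the site: examples above, here is a minimal sketch of how a host
name could be expanded into the suffixes one might want to match against. This
is not taken from the NUTCH-445 patch; the class and method names are
hypothetical and only show the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: lists a host name followed by each
// of its shorter dot-separated suffixes.
class HostSuffixes {

    static List<String> suffixes(String host) {
        List<String> result = new ArrayList<>();
        String current = host.toLowerCase();
        while (!current.isEmpty()) {
            result.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break; // reached the last label, e.g. "org"
            }
            // Drop the leftmost label and continue with the remainder.
            current = current.substring(dot + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [lucene.apache.org, apache.org, org]
        System.out.println(suffixes("lucene.apache.org"));
    }
}
```

If every suffix were indexed as a term of the site field, a query for any of
the three forms would match the same document, which is the behaviour
described above.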
bq. using domain names instead of host names - we need to discuss this further,
let's create a separate issue on this.
We can open issues case by case, since the patches are expected to have major
side effects.
> Top Level Domains Indexing / Scoring
> ------------------------------------
>
> Key: NUTCH-439
> URL: https://issues.apache.org/jira/browse/NUTCH-439
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch,
> tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch,
> tld_plugin_v2.3.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in the DNS.
> TLDs are managed by the Internet Assigned Numbers Authority (IANA). IANA
> divides TLDs into three groups: infrastructure, generic (such as "com",
> "edu"), and country code TLDs (such as "de", "tr"). Indexing the top level
> domain and optionally boosting it is needed for improving the search results
> and enhancing locality.
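
As a rough illustration of the "last part(s)" wording in the description, the
sketch below picks the last label of a host name, or the last two labels when
they appear in a caller-supplied set of multi-part suffixes. The class name,
the method, and the idea of passing the suffix set as a parameter are
assumptions made for this example; the attached patch may resolve TLDs quite
differently.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch, not the plugin's code: extracts the trailing label(s)
// of a host name.
class TldSketch {

    static String tld(String host, Set<String> multiPartSuffixes) {
        String h = host.toLowerCase();
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1); // strip a trailing root dot
        }
        int last = h.lastIndexOf('.');
        if (last < 0) {
            return h; // single-label host
        }
        // Look at the last two labels first, e.g. "co.uk".
        int prev = h.lastIndexOf('.', last - 1);
        String twoLabels = h.substring(prev + 1);
        if (multiPartSuffixes.contains(twoLabels)) {
            return twoLabels;
        }
        return h.substring(last + 1);
    }

    public static void main(String[] args) {
        // "co.uk" is an assumed example entry, not a list shipped with Nutch.
        Set<String> suffixes = Collections.singleton("co.uk");
        System.out.println(tld("lucene.apache.org", suffixes));
        System.out.println(tld("www.example.co.uk", suffixes));
    }
}
```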