Build failed in Hudson: Nutch-Nightly #220
See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/220/changes

--
[...truncated 4592 lines...]
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
Overriding previous definition of reference to plugin.deps
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlfilter-suffix
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/classes
    [javac] Note: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/classes
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/classes
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass/test
init-plugin:
deps-jar:
compile:
     [echo]
Adding fields to BasicQueryFilter
Hi,

I have started to use Nutch recently. Congratulations, it's very impressive! I look forward to discovering more about it. I have been trying to add a custom field to those used by the BasicQueryFilter and found no other way than modifying the code. What I needed was (a) that each term of the original query is matched in at least one of the fields, and (b) a phrase query over all the terms for each field. The latter could easily be done in a separate QueryFilter, but not the former, as that would require parsing the boolean query produced by the BasicQueryFilter, modifying its clauses to add my field(s), and assuming that its structure never changes, etc. Am I missing something? Is there a simpler way to do this apart from modifying the code? Would it make sense to modify the BQF so that it takes the names and weights of the fields to use from the config (e.g. search for a parameter named query.field_name.boost)? Let me know if you think that is relevant and I'll send a patch for the BQF.

Best,
Julien
--
http://www.digitalpebble.com
Open Source Solutions for Text Engineering
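The config-driven scheme Julien proposes could look something like the sketch below. The `query.<field>.boost` naming and the `parseFieldBoosts` helper are hypothetical (this is the proposal, not existing Nutch code); a real patch would read from Nutch's `Configuration` object rather than a plain `Map`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch: derive query field names and boosts from configuration
 * properties of the form query.<fieldName>.boost, as proposed for
 * BasicQueryFilter. The property naming scheme is an assumption.
 */
public class QueryFieldConfig {

  /** Returns field name -> boost for every property matching query.*.boost. */
  public static Map<String, Float> parseFieldBoosts(Map<String, String> conf) {
    Map<String, Float> boosts = new LinkedHashMap<String, Float>();
    for (Map.Entry<String, String> e : conf.entrySet()) {
      String key = e.getKey();
      if (key.startsWith("query.") && key.endsWith(".boost")
          && key.length() > "query.".length() + ".boost".length()) {
        // Strip the "query." prefix and ".boost" suffix to get the field name.
        String field = key.substring("query.".length(),
                                     key.length() - ".boost".length());
        boosts.put(field, Float.valueOf(e.getValue()));
      }
    }
    return boosts;
  }
}
```

The BQF could then iterate over this map instead of its hard-coded field list, adding one clause per configured field.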
Re: Parsing extra fields from an html page in the web. ....
In brief: you need to write an HtmlParseFilter, then an IndexingFilter and a QueryFilter, and register them through extension points. Search the USER (not dev) list; there are answers there already. BTW, this question is asked over and over; it seems to be a good subject to write up on the wiki.

Marcin

> Hi,
> We are working on an Indian Language search engine and are using nutch-0.9 as the basic framework.
>
> However, when the html pages are parsed during the fetching phase, the htmlParser which runs on the page extracts the title text, the meta tags and the outlinks.
> What do I need to do if I need to add in more fields like , ,
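The parse-to-index flow Marcin describes can be sketched as follows. The real Nutch interfaces (HtmlParseFilter, IndexingFilter) take Nutch-specific types (Parse, Document, CrawlDatum), so the Map-based "document" here is a stand-in, used only to show where a custom field travels; the `language` meta tag is a hypothetical example field.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Self-contained sketch of the two-step flow: an HtmlParseFilter
 * extracts a custom field into the parse metadata, then an
 * IndexingFilter copies it into the index document.
 */
public class CustomFieldFlow {

  /** Step 1: pull the value out of the page and stash it in parse metadata. */
  public static Map<String, String> parseFilter(String html) {
    Map<String, String> parseMeta = new HashMap<String, String>();
    // Hypothetical extraction: grab a <meta name="language" content="..."> value.
    String marker = "name=\"language\" content=\"";
    int i = html.indexOf(marker);
    if (i >= 0) {
      int start = i + marker.length();
      parseMeta.put("language", html.substring(start, html.indexOf('"', start)));
    }
    return parseMeta;
  }

  /** Step 2: copy the metadata field into the document that gets indexed. */
  public static Map<String, String> indexingFilter(Map<String, String> parseMeta) {
    Map<String, String> doc = new HashMap<String, String>();
    if (parseMeta.containsKey("language")) {
      doc.put("language", parseMeta.get("language"));
    }
    return doc;
  }
}
```

A matching QueryFilter would then map `language:xx` clauses onto that index field.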
[jira] Commented: (NUTCH-25) needs 'character encoding' detector
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530796 ]

Hudson commented on NUTCH-25:
-----------------------------

Integrated in Nutch-Nightly #219 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/])

> needs 'character encoding' detector
> -----------------------------------
>
> Key: NUTCH-25
> URL: https://issues.apache.org/jira/browse/NUTCH-25
> Project: Nutch
> Issue Type: New Feature
> Reporter: Stefan Groschupf
> Assignee: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: EncodingDetector.java, EncodingDetector_additive.java, NUTCH-25.patch, NUTCH-25_draft.patch, NUTCH-25_v2.patch, NUTCH-25_v3.patch, NUTCH-25_v4.patch, patch
>
> transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by: Jungshik Shin
>
> This is a follow-up to bug 993380 (figure out 'charset' from the meta tag).
> Although we can cover a lot of ground using the 'C-T' field in the HTTP header and the corresponding meta tag in html documents (and in case of XML, we have to use a similar but different 'parsing'), in the wild there are a lot of documents without any information about the character encoding used. Browsers like Mozilla and search engines like Google use character encoding detectors to deal with these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and we might be able to port it to Java. Unfortunately, it's not fool-proof. However, along with some other heuristics used by Mozilla and elsewhere, it should be possible to achieve a high detection rate.
> The following page has links to some other related pages: http://trainedmonkey.com/week/2004/26
> In addition to character encoding detection, we also need to detect the language of a document, which is even harder and should be a separate bug (although it's related).

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
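The easy part of the NUTCH-25 problem, reading a declared charset out of a Content-Type value, can be sketched like this. The statistical fallback for unlabelled documents (what the EncodingDetector attachment actually adds) is not reproduced here; `CharsetSniffer` is a hypothetical helper name.

```java
/**
 * Sketch: extract a declared charset from a Content-Type header value,
 * the first step before any statistical encoding detection is attempted.
 * Returns null when no charset parameter is declared.
 */
public class CharsetSniffer {

  public static String fromContentType(String contentType) {
    if (contentType == null) return null;
    int i = contentType.toLowerCase().indexOf("charset=");
    if (i < 0) return null;
    String cs = contentType.substring(i + "charset=".length()).trim();
    // Strip surrounding quotes and any trailing parameters.
    cs = cs.replaceAll("^\"|\"$", "");
    int semi = cs.indexOf(';');
    return (semi >= 0 ? cs.substring(0, semi) : cs).trim();
  }
}
```

When this returns null, a detector has to fall back on byte-frequency heuristics over the raw content, which is where the hard part of the issue lives.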
[jira] Commented: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530798 ]

Hudson commented on NUTCH-369:
------------------------------

Integrated in Nutch-Nightly #219 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/])

> StringUtil.resolveEncodingAlias is unuseful.
> --------------------------------------------
>
> Key: NUTCH-369
> URL: https://issues.apache.org/jira/browse/NUTCH-369
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.9.0
> Environment: all
> Reporter: King Kong
> Assignee: Doğacan Güney
> Priority: Minor
> Attachments: patch.diff, remover.diff
>
> We defined an encoding alias map in StringUtil, but HTML parsing still uses the original encoding.
> I found that it is nekohtml (which HtmlParser uses) that reads the charset from the meta tag.
> We can set its feature "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true so that nekohtml will use the encoding we set. Concretely:
>
> private DocumentFragment parseNeko(InputSource input) throws Exception {
>   DOMFragmentParser parser = new DOMFragmentParser();
>   // some plugins, e.g., creativecommons, need to examine html comments
>   try {
> +   parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
>     parser.setFeature("http://apache.org/xml/features/include-comments", true);
>
> BTW, it must be added at the front of the try block, because the following statement (parser.setFeature("http://apache.org/xml/features/include-comments", true);) will throw an exception.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-487) Neko HTML parser goes on default settings.
[ https://issues.apache.org/jira/browse/NUTCH-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530797 ]

Hudson commented on NUTCH-487:
------------------------------

Integrated in Nutch-Nightly #219 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/])

> Neko HTML parser goes on default settings.
> ------------------------------------------
>
> Key: NUTCH-487
> URL: https://issues.apache.org/jira/browse/NUTCH-487
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.9.0
> Environment: Linux, Java 1.5.0.
> Reporter: Marcin Okraszewski
> Fix For: 1.0.0
>
> Attachments: neko_setup.patch
>
> The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature being set throws an exception, so the whole setup block is skipped. The catch statement does nothing, so probably nobody noticed this.
> I attach a patch which fixes this. It was done on Nutch 0.9, but the SVN trunk contains the same code.
> The patch does:
> 1. Fixes the augmentations feature.
> 2. Removes the include-comments feature, because I couldn't find anything similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
> 3. Prints a warning message when an exception is caught.
> Please note that a lot of messages now go to the console (not the log4j log), because the "report-errors" feature is being set. Shouldn't it be removed?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
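The failure mode NUTCH-487 describes, where one bad feature name inside a single try block silently disables all the feature calls after it, can be illustrated with a small sketch. `FeatureSetter` is a stand-in for the real `DOMFragmentParser.setFeature` call; the fix is simply to give each feature its own try.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: set each parser feature in its own try/catch so that one
 * unrecognized feature name cannot abort the rest of the setup, which
 * is the bug NUTCH-487 fixes in HtmlParser.
 */
public class FeatureSetup {

  /** Stand-in for a parser that rejects unknown feature names. */
  interface FeatureSetter {
    void setFeature(String name, boolean value) throws Exception;
  }

  /** Sets every feature independently; returns the names that failed. */
  public static List<String> setAll(FeatureSetter parser, String[] names) {
    List<String> failed = new ArrayList<String>();
    for (String name : names) {
      try {
        parser.setFeature(name, true);
      } catch (Exception e) {
        failed.add(name); // real code would log a warning here, not stay silent
      }
    }
    return failed;
  }
}
```

With the original single-try layout, a failure on the first name would have left every following feature at its default, which is exactly why nobody noticed.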
Build failed in Hudson: Nutch-Nightly #219
See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/changes

Changes:

[dogacan] Java 5 compatibility fix for NUTCH-25. Contributed by Ned Rockson.

[dogacan] NUTCH-25 - needs 'character encoding' detector. Mostly contributed by Doug Cook. Some parts are contributed by Marcin Okraszewski and Renaud Richardet. Also fixes NUTCH-369 and NUTCH-487.

--
[...truncated 4594 lines...]
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
Overriding previous definition of reference to plugin.deps
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlfilter-suffix
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/classes
    [javac] Note: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/classes
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/classes
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass/classes
    [mkdir] Created di
[jira] Commented: (NUTCH-558) Need tool to retrieve domain statistics
[ https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530755 ]

Chris Schneider commented on NUTCH-558:
---------------------------------------

The reason that DomainStats does not use URLUtils is that (as mentioned above) we are currently using a relatively old Nutch source base (last integrated at revision 417928). There are probably other tools/resources we could use as well if we reworked the code to better fit the current Nutch/Hadoop source environment. Sorry for being so out of date.

> Need tool to retrieve domain statistics
> ---------------------------------------
>
> Key: NUTCH-558
> URL: https://issues.apache.org/jira/browse/NUTCH-558
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 0.9.0
> Reporter: Chris Schneider
> Assignee: Chris Schneider
> Attachments: DomainStats.patch
>
> Several developers have expressed interest in a tool to retrieve statistics from a crawl on a domain basis (e.g., how many pages were successfully fetched from www.apache.org vs. apache.org, where the latter total would include the former).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
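The roll-up the issue asks for (pages fetched from www.apache.org also counting toward apache.org) can be sketched with a small counter. The two-label suffix heuristic for the registered domain is a deliberate simplification; the real URLUtil/DomainStats code handles multi-part TLDs like co.uk.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: count fetched URLs per host, rolling each host up to its
 * registered domain so that subdomain totals are included in the
 * domain total, as NUTCH-558's example describes.
 */
public class DomainCounter {
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  public void add(String url) throws Exception {
    String host = new URL(url).getHost();
    bump(host);
    String[] parts = host.split("\\.");
    if (parts.length > 2) {
      // Roll up to the last two labels (simplified registered domain).
      bump(parts[parts.length - 2] + "." + parts[parts.length - 1]);
    }
  }

  private void bump(String key) {
    Integer c = counts.get(key);
    counts.put(key, c == null ? 1 : c + 1);
  }

  public int count(String domain) {
    Integer c = counts.get(domain);
    return c == null ? 0 : c;
  }
}
```

In the actual tool this aggregation runs as a MapReduce job over the crawldb rather than in memory.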
Re: query parsing
Sebastian Schick wrote:
> Hello,
> maybe I can explain my problem with the highlighting of query terms in the summary a little better now. My query is e.g. "New York lang:de". The term "New York" is highlighted correctly. But if there is an email address, e.g. [EMAIL PROTECTED], in the summary, the "de" of the email address is also highlighted. Why is "de" not deleted in the NutchAnalysis class? My problem now is that I do not understand how I can remove field values from the query.
> Regards, Sebastian

Hello,

my solution is to change the file NutchAnalysis.java in line 293:

  if (this.queryFilters.isRawField(field)) {
    result.clear();
    // result.add(queryString.substring(start, token.endColumn));

Maybe this is something which could be made configurable? Or is it already?

Regards,
Sebastian
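The effect Sebastian gets by clearing the result for raw fields amounts to stripping raw-field clauses (like lang:de) from the query before it is used for highlighting. A standalone sketch of that idea, with a plain set standing in for queryFilters.isRawField() and a whitespace split standing in for the real NutchAnalysis tokenizer:

```java
import java.util.Set;

/**
 * Sketch: drop raw-field clauses (e.g. "lang:de") from a query string
 * so their values cannot be highlighted in summaries. The field set
 * is a stand-in for queryFilters.isRawField().
 */
public class RawFieldStripper {

  public static String strip(String query, Set<String> rawFields) {
    StringBuilder out = new StringBuilder();
    for (String term : query.split("\\s+")) {
      int colon = term.indexOf(':');
      if (colon > 0 && rawFields.contains(term.substring(0, colon))) {
        continue; // drop the raw-field clause entirely
      }
      if (out.length() > 0) out.append(' ');
      out.append(term);
    }
    return out.toString();
  }
}
```

With "lang" registered as a raw field, "New York lang:de" reduces to "New York", so a "de" inside an email address in the summary is no longer a highlight target.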
query parsing
Hello,

maybe I can explain my problem with the highlighting of query terms in the summary a little better now. My query is e.g. "New York lang:de". The term "New York" is highlighted correctly. But if there is an email address, e.g. [EMAIL PROTECTED], in the summary, the "de" of the email address is also highlighted. Why is "de" not deleted in the NutchAnalysis class? My problem now is that I do not understand how I can remove field values from the query.

Regards,
Sebastian
Parsing extra fields from an html page in the web.....
Hi,

We are working on an Indian Language search engine and are using nutch-0.9 as the basic framework.

However, when the html pages are parsed during the fetching phase, the htmlParser which runs on the page extracts the title text, the meta tags and the outlinks.
What do I need to do if I need to add in more fields like , ,
[jira] Updated: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server
[ https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susam Pal updated NUTCH-559:
----------------------------

Attachment: NUTCH-559v0.2.patch

Uploading a revised (v0.2) patch which accommodates most of the suggestions by Doğacan. A few points I want to discuss:

* Extending the authentication to work for more than one host was on my mind, but I found too many possible cases. So I was planning to have a different configuration file where all the authentication rules can be specified to override the corresponding 'conf/nutch-site.xml' properties. The different possible cases are:
** Different credentials for different domains or sub-domains, say, example.com, ad.example.com, example.net, etc.
** Different credentials for different hosts.
** Different credentials for different realms.
* I removed the cookie-related code earlier because I didn't find it to work (even before merging my work). However, I have brought it back in the revised patch. We can discuss this more if required.
* I have restored most of the original response-reading code except for 'calculateTryToRead'. This method does not check the 'Content-Length' limit. The content-length limit check present in this patch is similar to that of 'protocol-http', which is simpler and correct.

If the idea of having a separate authentication configuration file looks good, I can work on it when I get some free time.

> NTLM, Basic and Digest Authentication schemes for web/proxy server
> ------------------------------------------------------------------
>
> Key: NUTCH-559
> URL: https://issues.apache.org/jira/browse/NUTCH-559
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.0.0
> Reporter: Susam Pal
> Attachments: NUTCH-559v0.1.patch, NUTCH-559v0.2.patch
>
> Added basic, digest and NTLM authentication schemes to protocol-httpclient. The authentication schemes can be configured for the proxy server as well as the web servers of a domain. HTTP authentication can take place over HTTP/1.0, HTTP/1.1 and HTTPS.
> The authentication guide can be found here: [http://wiki.apache.org/nutch/HttpAuthenticationSchemes].

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
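The per-host / per-realm credential cases Susam lists could be resolved with a most-specific-match lookup along these lines. The scope keys, the "*" wildcard, and the fallback order (host+realm, then host alone, then default) are assumptions about the proposed separate configuration file, not existing protocol-httpclient behaviour.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: credential lookup for a proposed per-host/per-realm
 * authentication configuration. The most specific scope wins:
 * host+realm, then host with any realm, then the global default.
 */
public class AuthScopeLookup {
  private final Map<String, String> creds = new HashMap<String, String>();

  /** Registers credentials for a host/realm scope; "*" acts as a wildcard. */
  public void put(String host, String realm, String userPass) {
    creds.put(host + "|" + realm, userPass);
  }

  /** Returns the most specific credentials, or null if none match. */
  public String lookup(String host, String realm) {
    String c = creds.get(host + "|" + realm);
    if (c == null) c = creds.get(host + "|*");
    if (c == null) c = creds.get("*|*");
    return c;
  }
}
```

The fetcher would consult such a lookup once per request before choosing an authentication scheme, with nutch-site.xml supplying the default scope.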
[jira] Commented: (NUTCH-558) Need tool to retrieve domain statistics
[ https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530656 ]

Enis Soztutar commented on NUTCH-558:
-------------------------------------

I wonder why you do not use the URLUtils introduced in NUTCH-439. Also, there is a similar tool (not committed) in this patch which extracts url/domain/tld statistics from the crawldb, but it lacks filtering.

> Need tool to retrieve domain statistics
> ---------------------------------------
>
> Key: NUTCH-558
> URL: https://issues.apache.org/jira/browse/NUTCH-558
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 0.9.0
> Reporter: Chris Schneider
> Assignee: Chris Schneider
> Attachments: DomainStats.patch
>
> Several developers have expressed interest in a tool to retrieve statistics from a crawl on a domain basis (e.g., how many pages were successfully fetched from www.apache.org vs. apache.org, where the latter total would include the former).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.