Build failed in Hudson: Nutch-Nightly #220
See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/220/changes

--
[...truncated 4592 lines...]
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
Overriding previous definition of reference to plugin.deps
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlfilter-suffix
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/classes
    [javac] Note: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/classes
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/classes
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass/test
init-plugin:
deps-jar:
compile:
     [echo]
Adding fields to BasicQueryFilter
Hi,

I have started to use Nutch recently. Congratulations, it's very impressive! I look forward to discovering more about it. I have been trying to add a custom field to those used by the BasicQueryFilter and found no other way than modifying the code. What I needed was (a) that each term of the original query is matched in at least one of the fields, and (b) a phrase query over all the terms for each field. The latter could easily be done in a separate QueryFilter, but not the former, as that would require parsing the boolean query produced by the BasicQueryFilter, modifying its clauses to add my field(s), and assuming that its structure never changes, etc. Am I missing something? Is there a simpler way to do this apart from modifying the code? Would it make sense to modify the BQF so that it takes the names and weights of the fields to use from the config (e.g. search for a parameter named query.field_name.boost)? Let me know if you think that is relevant and I'll send a patch for the BQF.

Best,
Julien
--
http://www.digitalpebble.com
Open Source Solutions for Text Engineering
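The config-driven scheme Julien proposes could look something like the sketch below. The `query.<field>.boost` naming and the `parseFieldBoosts` helper are hypothetical (this is the proposal, not existing Nutch code); a real patch would read from Nutch's `Configuration` object rather than a plain `Map`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch: derive query field names and boosts from configuration
 * properties of the form query.<fieldName>.boost, as proposed for
 * BasicQueryFilter. The property naming scheme is an assumption.
 */
public class QueryFieldConfig {

  /** Returns field name -> boost for every property matching query.*.boost. */
  public static Map<String, Float> parseFieldBoosts(Map<String, String> conf) {
    Map<String, Float> boosts = new LinkedHashMap<String, Float>();
    for (Map.Entry<String, String> e : conf.entrySet()) {
      String key = e.getKey();
      if (key.startsWith("query.") && key.endsWith(".boost")
          && key.length() > "query.".length() + ".boost".length()) {
        // Strip the "query." prefix and ".boost" suffix to get the field name.
        String field = key.substring("query.".length(),
                                     key.length() - ".boost".length());
        boosts.put(field, Float.valueOf(e.getValue()));
      }
    }
    return boosts;
  }
}
```

The BQF could then iterate over this map instead of its hard-coded field list, adding one clause per configured field.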
Re: Parsing extra fields from an html page in the web. ....
In brief: you need to write an HtmlParseFilter, then an IndexingFilter and a QueryFilter, and register them through extension points. Search the USER (not dev) list; there are answers there already. BTW, this question is asked over and over; it seems to be a good subject to write up on the wiki.

Marcin

> Hi,
> We are working on an Indian Language search engine and are using nutch-0.9 as the basic framework.
>
> However, when the html pages are parsed during the fetching phase, the htmlParser which runs on the page extracts the title text, the meta tags and the outlinks.
> What do I need to do if I need to add in more fields like , ,
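The parse-to-index flow Marcin describes can be sketched as follows. The real Nutch interfaces (HtmlParseFilter, IndexingFilter) take Nutch-specific types (Parse, Document, CrawlDatum), so the Map-based "document" here is a stand-in, used only to show where a custom field travels; the `language` meta tag is a hypothetical example field.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Self-contained sketch of the two-step flow: an HtmlParseFilter
 * extracts a custom field into the parse metadata, then an
 * IndexingFilter copies it into the index document.
 */
public class CustomFieldFlow {

  /** Step 1: pull the value out of the page and stash it in parse metadata. */
  public static Map<String, String> parseFilter(String html) {
    Map<String, String> parseMeta = new HashMap<String, String>();
    // Hypothetical extraction: grab a <meta name="language" content="..."> value.
    String marker = "name=\"language\" content=\"";
    int i = html.indexOf(marker);
    if (i >= 0) {
      int start = i + marker.length();
      parseMeta.put("language", html.substring(start, html.indexOf('"', start)));
    }
    return parseMeta;
  }

  /** Step 2: copy the metadata field into the document that gets indexed. */
  public static Map<String, String> indexingFilter(Map<String, String> parseMeta) {
    Map<String, String> doc = new HashMap<String, String>();
    if (parseMeta.containsKey("language")) {
      doc.put("language", parseMeta.get("language"));
    }
    return doc;
  }
}
```

A matching QueryFilter would then map `language:xx` clauses onto that index field.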
[jira] Commented: (NUTCH-25) needs 'character encoding' detector
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530796 ]

Hudson commented on NUTCH-25:
-----------------------------

Integrated in Nutch-Nightly #219 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/])

> needs 'character encoding' detector
> -----------------------------------
>
> Key: NUTCH-25
> URL: https://issues.apache.org/jira/browse/NUTCH-25
> Project: Nutch
> Issue Type: New Feature
> Reporter: Stefan Groschupf
> Assignee: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: EncodingDetector.java, EncodingDetector_additive.java, NUTCH-25.patch, NUTCH-25_draft.patch, NUTCH-25_v2.patch, NUTCH-25_v3.patch, NUTCH-25_v4.patch, patch
>
> transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by: Jungshik Shin
>
> This is a follow-up to bug 993380 (figure out 'charset' from the meta tag).
> Although we can cover a lot of ground using the 'C-T' field in the HTTP header and the corresponding meta tag in html documents (and in case of XML, we have to use a similar but different 'parsing'), in the wild there are a lot of documents without any information about the character encoding used. Browsers like Mozilla and search engines like Google use character encoding detectors to deal with these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and we might be able to port it to Java. Unfortunately, it's not fool-proof. However, along with some other heuristics used by Mozilla and elsewhere, it should be possible to achieve a high detection rate.
> The following page has links to some other related pages: http://trainedmonkey.com/week/2004/26
> In addition to character encoding detection, we also need to detect the language of a document, which is even harder and should be a separate bug (although it's related).

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
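The easy part of the NUTCH-25 problem, reading a declared charset out of a Content-Type value, can be sketched like this. The statistical fallback for unlabelled documents (what the EncodingDetector attachment actually adds) is not reproduced here; `CharsetSniffer` is a hypothetical helper name.

```java
/**
 * Sketch: extract a declared charset from a Content-Type header value,
 * the first step before any statistical encoding detection is attempted.
 * Returns null when no charset parameter is declared.
 */
public class CharsetSniffer {

  public static String fromContentType(String contentType) {
    if (contentType == null) return null;
    int i = contentType.toLowerCase().indexOf("charset=");
    if (i < 0) return null;
    String cs = contentType.substring(i + "charset=".length()).trim();
    // Strip surrounding quotes and any trailing parameters.
    cs = cs.replaceAll("^\"|\"$", "");
    int semi = cs.indexOf(';');
    return (semi >= 0 ? cs.substring(0, semi) : cs).trim();
  }
}
```

When this returns null, a detector has to fall back on byte-frequency heuristics over the raw content, which is where the hard part of the issue lives.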
[jira] Commented: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530798 ]

Hudson commented on NUTCH-369:
------------------------------

Integrated in Nutch-Nightly #219 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/])

> StringUtil.resolveEncodingAlias is unuseful.
> --------------------------------------------
>
> Key: NUTCH-369
> URL: https://issues.apache.org/jira/browse/NUTCH-369
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.9.0
> Environment: all
> Reporter: King Kong
> Assignee: Doğacan Güney
> Priority: Minor
> Attachments: patch.diff, remover.diff
>
> We defined an encoding alias map in StringUtil, but HTML parsing still uses the original encoding.
> I found that it is nekohtml (which HtmlParser uses) that reads the charset from the meta tag.
> We can set its feature "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true so that nekohtml will use the encoding we set. Concretely:
>
> private DocumentFragment parseNeko(InputSource input) throws Exception {
>   DOMFragmentParser parser = new DOMFragmentParser();
>   // some plugins, e.g., creativecommons, need to examine html comments
>   try {
> +   parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
>     parser.setFeature("http://apache.org/xml/features/include-comments", true);
>
> BTW, it must be added at the front of the try block, because the following statement (parser.setFeature("http://apache.org/xml/features/include-comments", true);) will throw an exception.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-487) Neko HTML parser goes on default settings.
[ https://issues.apache.org/jira/browse/NUTCH-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530797 ]

Hudson commented on NUTCH-487:
------------------------------

Integrated in Nutch-Nightly #219 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/])

> Neko HTML parser goes on default settings.
> ------------------------------------------
>
> Key: NUTCH-487
> URL: https://issues.apache.org/jira/browse/NUTCH-487
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.9.0
> Environment: Linux, Java 1.5.0.
> Reporter: Marcin Okraszewski
> Fix For: 1.0.0
>
> Attachments: neko_setup.patch
>
> The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature being set throws an exception, so the whole setup block is skipped. The catch statement does nothing, so probably nobody noticed this.
> I attach a patch which fixes this. It was done on Nutch 0.9, but the SVN trunk contains the same code.
> The patch does:
> 1. Fixes the augmentations feature.
> 2. Removes the include-comments feature, because I couldn't find anything similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
> 3. Prints a warning message when an exception is caught.
> Please note that a lot of messages now go to the console (not the log4j log), because the "report-errors" feature is being set. Shouldn't it be removed?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
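The failure mode NUTCH-487 describes, where one bad feature name inside a single try block silently disables all the feature calls after it, can be illustrated with a small sketch. `FeatureSetter` is a stand-in for the real `DOMFragmentParser.setFeature` call; the fix is simply to give each feature its own try.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: set each parser feature in its own try/catch so that one
 * unrecognized feature name cannot abort the rest of the setup, which
 * is the bug NUTCH-487 fixes in HtmlParser.
 */
public class FeatureSetup {

  /** Stand-in for a parser that rejects unknown feature names. */
  interface FeatureSetter {
    void setFeature(String name, boolean value) throws Exception;
  }

  /** Sets every feature independently; returns the names that failed. */
  public static List<String> setAll(FeatureSetter parser, String[] names) {
    List<String> failed = new ArrayList<String>();
    for (String name : names) {
      try {
        parser.setFeature(name, true);
      } catch (Exception e) {
        failed.add(name); // real code would log a warning here, not stay silent
      }
    }
    return failed;
  }
}
```

With the original single-try layout, a failure on the first name would have left every following feature at its default, which is exactly why nobody noticed.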
Build failed in Hudson: Nutch-Nightly #219
See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/changes

Changes:

[dogacan] Java 5 compatibility fix for NUTCH-25. Contributed by Ned Rockson.

[dogacan] NUTCH-25 - needs 'character encoding' detector. Mostly contributed by Doug Cook. Some parts are contributed by Marcin Okraszewski and Renaud Richardet. Also fixes NUTCH-369 and NUTCH-487.

--
[...truncated 4594 lines...]
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
Overriding previous definition of reference to plugin.deps
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-regex
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlfilter-suffix
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/classes
    [javac] Note: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-suffix
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/classes
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-validator
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/classes
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/test
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/classes
jar:
      [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar
deps-test:
deploy:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
copy-generated-lib:
     [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlnormalizer-basic
init:
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass
    [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlnormalizer-pass/classes
    [mkdir] Created di
[jira] Commented: (NUTCH-558) Need tool to retrieve domain statistics
[ https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530755 ]

Chris Schneider commented on NUTCH-558:
---------------------------------------

The reason that DomainStats does not use URLUtils is that (as mentioned above) we are currently using a relatively old Nutch source base (last integrated at revision 417928). There are probably other tools/resources we could use as well if we reworked the code to better fit the current Nutch/Hadoop source environment. Sorry for being so out of date.

> Need tool to retrieve domain statistics
> ---------------------------------------
>
> Key: NUTCH-558
> URL: https://issues.apache.org/jira/browse/NUTCH-558
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 0.9.0
> Reporter: Chris Schneider
> Assignee: Chris Schneider
> Attachments: DomainStats.patch
>
> Several developers have expressed interest in a tool to retrieve statistics from a crawl on a domain basis (e.g., how many pages were successfully fetched from www.apache.org vs. apache.org, where the latter total would include the former).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
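The roll-up the issue asks for (pages fetched from www.apache.org also counting toward apache.org) can be sketched with a small counter. The two-label suffix heuristic for the registered domain is a deliberate simplification; the real URLUtil/DomainStats code handles multi-part TLDs like co.uk.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: count fetched URLs per host, rolling each host up to its
 * registered domain so that subdomain totals are included in the
 * domain total, as NUTCH-558's example describes.
 */
public class DomainCounter {
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  public void add(String url) throws Exception {
    String host = new URL(url).getHost();
    bump(host);
    String[] parts = host.split("\\.");
    if (parts.length > 2) {
      // Roll up to the last two labels (simplified registered domain).
      bump(parts[parts.length - 2] + "." + parts[parts.length - 1]);
    }
  }

  private void bump(String key) {
    Integer c = counts.get(key);
    counts.put(key, c == null ? 1 : c + 1);
  }

  public int count(String domain) {
    Integer c = counts.get(domain);
    return c == null ? 0 : c;
  }
}
```

In the actual tool this aggregation runs as a MapReduce job over the crawldb rather than in memory.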
Re: query parsing
Sebastian Schick wrote:
> Hello,
> maybe I can explain my problem with the highlighting of query terms in the summary a little better now. My query is e.g. "New York lang:de". The term "New York" is highlighted correctly. But if there is an email address, e.g. [EMAIL PROTECTED], in the summary, the "de" of the email address is also highlighted. Why is "de" not deleted in the NutchAnalysis class? My problem now is that I do not understand how I can remove field values from the query.
> Regards, Sebastian

Hello,

my solution is to change the file NutchAnalysis.java in line 293:

  if (this.queryFilters.isRawField(field)) {
    result.clear();
    // result.add(queryString.substring(start, token.endColumn));

Maybe this is something which could be made configurable? Or is it already?

Regards,
Sebastian
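The effect Sebastian gets by clearing the result for raw fields amounts to stripping raw-field clauses (like lang:de) from the query before it is used for highlighting. A standalone sketch of that idea, with a plain set standing in for queryFilters.isRawField() and a whitespace split standing in for the real NutchAnalysis tokenizer:

```java
import java.util.Set;

/**
 * Sketch: drop raw-field clauses (e.g. "lang:de") from a query string
 * so their values cannot be highlighted in summaries. The field set
 * is a stand-in for queryFilters.isRawField().
 */
public class RawFieldStripper {

  public static String strip(String query, Set<String> rawFields) {
    StringBuilder out = new StringBuilder();
    for (String term : query.split("\\s+")) {
      int colon = term.indexOf(':');
      if (colon > 0 && rawFields.contains(term.substring(0, colon))) {
        continue; // drop the raw-field clause entirely
      }
      if (out.length() > 0) out.append(' ');
      out.append(term);
    }
    return out.toString();
  }
}
```

With "lang" registered as a raw field, "New York lang:de" reduces to "New York", so a "de" inside an email address in the summary is no longer a highlight target.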
query parsing
Hello,

maybe I can explain my problem with the highlighting of query terms in the summary a little better now. My query is e.g. "New York lang:de". The term "New York" is highlighted correctly. But if there is an email address, e.g. [EMAIL PROTECTED], in the summary, the "de" of the email address is also highlighted. Why is "de" not deleted in the NutchAnalysis class? My problem now is that I do not understand how I can remove field values from the query.

Regards,
Sebastian
Parsing extra fields from an html page in the web.....
Hi,

We are working on an Indian Language search engine and are using nutch-0.9 as the basic framework.

However, when the html pages are parsed during the fetching phase, the htmlParser which runs on the page extracts the title text, the meta tags and the outlinks.
What do I need to do if I need to add in more fields like , ,
[jira] Updated: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server
[ https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susam Pal updated NUTCH-559:
----------------------------

Attachment: NUTCH-559v0.2.patch

Uploading a revised (v0.2) patch which accommodates most of the suggestions by Doğacan. A few points I want to discuss:

* Extending the authentication to work for more than one host was on my mind, but I found too many possible cases. So I was planning to have a different configuration file where all the authentication rules can be specified to override the corresponding 'conf/nutch-site.xml' properties. The different possible cases are:
** Different credentials for different domains or sub-domains, say, example.com, ad.example.com, example.net, etc.
** Different credentials for different hosts.
** Different credentials for different realms.
* I removed the cookie-related code earlier because I didn't find it to work (even before merging my work). However, I have brought it back in the revised patch. We can discuss this more if required.
* I have restored most of the original response-reading code except for 'calculateTryToRead'. This method does not check the 'Content-Length' limit. The content-length limit check present in this patch is similar to that of 'protocol-http', which is simpler and correct.

If the idea of having a separate authentication configuration file looks good, I can work on it when I get some free time.

> NTLM, Basic and Digest Authentication schemes for web/proxy server
> ------------------------------------------------------------------
>
> Key: NUTCH-559
> URL: https://issues.apache.org/jira/browse/NUTCH-559
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.0.0
> Reporter: Susam Pal
> Attachments: NUTCH-559v0.1.patch, NUTCH-559v0.2.patch
>
> Added basic, digest and NTLM authentication schemes to protocol-httpclient. The authentication schemes can be configured for the proxy server as well as the web servers of a domain. HTTP authentication can take place over HTTP/1.0, HTTP/1.1 and HTTPS.
> The authentication guide can be found here: [http://wiki.apache.org/nutch/HttpAuthenticationSchemes].

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
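The per-host / per-realm credential cases Susam lists could be resolved with a most-specific-match lookup along these lines. The scope keys, the "*" wildcard, and the fallback order (host+realm, then host alone, then default) are assumptions about the proposed separate configuration file, not existing protocol-httpclient behaviour.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: credential lookup for a proposed per-host/per-realm
 * authentication configuration. The most specific scope wins:
 * host+realm, then host with any realm, then the global default.
 */
public class AuthScopeLookup {
  private final Map<String, String> creds = new HashMap<String, String>();

  /** Registers credentials for a host/realm scope; "*" acts as a wildcard. */
  public void put(String host, String realm, String userPass) {
    creds.put(host + "|" + realm, userPass);
  }

  /** Returns the most specific credentials, or null if none match. */
  public String lookup(String host, String realm) {
    String c = creds.get(host + "|" + realm);
    if (c == null) c = creds.get(host + "|*");
    if (c == null) c = creds.get("*|*");
    return c;
  }
}
```

The fetcher would consult such a lookup once per request before choosing an authentication scheme, with nutch-site.xml supplying the default scope.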
[jira] Commented: (NUTCH-558) Need tool to retrieve domain statistics
[ https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530656 ]

Enis Soztutar commented on NUTCH-558:
-------------------------------------

I wonder why you do not use the URLUtils introduced in NUTCH-439. Also, there is a similar tool (not committed) in this patch which extracts url/domain/tld statistics from the crawldb, but it lacks filtering.

> Need tool to retrieve domain statistics
> ---------------------------------------
>
> Key: NUTCH-558
> URL: https://issues.apache.org/jira/browse/NUTCH-558
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 0.9.0
> Reporter: Chris Schneider
> Assignee: Chris Schneider
> Attachments: DomainStats.patch
>
> Several developers have expressed interest in a tool to retrieve statistics from a crawl on a domain basis (e.g., how many pages were successfully fetched from www.apache.org vs. apache.org, where the latter total would include the former).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.