Re: failed to subscribe 'nutch-user' maillist

2007-06-30 Thread Susam Pal
From: Oscar [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: subscribe This is how you are trying to subscribe. This is incorrect. You should send a mail to the following email address to subscribe to the mailing list. [EMAIL PROTECTED] Regards, Susam Pal http://susam.in/ On 6/30/07, Oscar

Re: Build failed in Hudson: Nutch-Nightly #203

2007-09-11 Thread Susam Pal
Is it that the interface 'org.apache.nutch.net.URLFilter' was compiled with JDK 1.5 earlier? I have seen this problem happening with a beta version of JDK 1.6. Are you using the latest version, JDK 1.6 Update 2? Regards, Susam Pal http://susam.in/ On 9/11/07, Doğacan Güney [EMAIL PROTECTED

protocol-httpclient Authentication schemes

2007-09-14 Thread Susam Pal
suggestions? Regards, Susam Pal http://susam.in/

Re: Two suggestions

2007-10-06 Thread Susam Pal
you tried parse-pdf? Regards, Susam Pal http://susam.in/

Re: Choices in Nutch Web interface?

2007-10-10 Thread Susam Pal
mailing list. Regards, Susam Pal http://susam.in/ On 10/10/07, Christopher Bader [EMAIL PROTECTED] wrote: I ran Nutch on a subset of Wikipedia, and it works. But for each search it always gives exactly two choices. How do I configure it so that it gives (a) N choices, for arbitrary N, or (b

Re: [jira] Created: (NUTCH-599) nutch crawl and index problem

2008-01-07 Thread Susam Pal
it is not a bug in Nutch 0.9. This looks like a configuration problem at your end. Please discuss this properly in [EMAIL PROTECTED] instead of submitting it as a bug in Nutch. Regards, Susam Pal On Jan 8, 2008 7:16 AM, sudarat (JIRA) [EMAIL PROTECTED] wrote: nutch crawl and index problem

Re: [jira] Created: (NUTCH-599) nutch crawl and index problem

2008-01-07 Thread Susam Pal
I wanted to send this as a private reply but sent it to the list instead. Sorry for the inconvenience. On Jan 8, 2008 10:21 AM, Susam Pal [EMAIL PROTECTED] wrote: I replied to this query of yours yesterday in [EMAIL PROTECTED] If you haven't received the reply, probably you have

Re: nutch latest build - inject operation failing

2008-02-14 Thread Susam Pal
doesn't occur in Linux. I am not well acquainted with the Hadoop code yet. Could someone throw light on what might be going wrong? Regards, Susam Pal On 2/7/08, DS jha [EMAIL PROTECTED] wrote: Hi - Looks like latest trunk version of nutch is failing with the following exception when trying

Re: nutch latest build - inject operation failing

2008-02-14 Thread Susam Pal
this failed with the same error. Right now I don't have a Windows system with me. I will try setting it as /cygdrive/d/tmp/ tomorrow when I again have access to a Windows system and then I'll update the mailing list with the observations. Thanks for the suggestion. Regards, Susam Pal On Thu, Feb

Re: nutch latest build - inject operation failing

2008-02-15 Thread Susam Pal
) at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165) Regards, Susam Pal On Thu, Feb 14, 2008 at 10:07 PM, Susam Pal [EMAIL PROTECTED] wrote: What I did try was setting hadoop.tmp.dir to /opt/tmp. I found the behavior

Re: Problem in running Nutch where proxy authentication is required.

2008-03-14 Thread Susam Pal
I still can't see any DEBUG logs in your log file. Did you go through my earlier mail? Regards, Susam Pal On Wed, Mar 12, 2008 at 9:39 PM, [EMAIL PROTECTED] wrote: Hi All, I am facing a problem in running nutch where the proxy authentication is required to crawl the site (e.g. google.com

Why is Nutch not involved in Google Summer of Code - 2008?

2008-03-22 Thread Susam Pal
valuable work can be done. What do you say? Regards, Susam Pal

Re: Pending Commits for Nutch Issues

2008-12-02 Thread Susam Pal
I agree with John too. Probably you meant $0.02, since 0.02 cents is too little. It is usually 2 cents. :-P Regards, Susam Pal On Tue, Dec 2, 2008 at 6:09 PM, John Martyniak [EMAIL PROTECTED] wrote: Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr integration would

crawl-tool.xml mentions nutch-site.xml for overriding but it is not possible

2009-04-06 Thread Susam Pal
'conf/crawl-tool.xml' ? Regards, Susam Pal

Re: crawl-tool.xml mentions nutch-site.xml for overriding but it is not possible

2009-05-09 Thread Susam Pal
On Tue, Apr 7, 2009 at 1:07 AM, Susam Pal susam@gmail.com wrote: The inline documentation of 'conf/crawl-tool.xml' mentions: <!-- Do not modify this file directly.  Instead, copy entries that you --> <!-- wish to modify from this file into nutch-site.xml and change them

Re: How can I get started with Nutch 1.0

2009-06-01 Thread Susam Pal
are available at: http://lucene.apache.org/nutch/version_control.html Regards, Susam Pal

Re: Crawling authenticated websites !

2010-03-18 Thread Susam Pal
included it in CC. This feature is not present in Nutch. We have recorded the summary of some old discussions regarding this here: http://wiki.apache.org/nutch/HttpPostAuthentication But this was never implemented. Regards, Susam Pal

[jira] Updated: (NUTCH-44) too many search results

2007-09-08 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-44: --- Attachment: NUTCH-44.patch Attached a patch. To apply:- patch -p0 < NUTCH-44.patch ant war cp build

[jira] Updated: (NUTCH-44) too many search results

2007-09-08 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-44: --- Attachment: (was: NUTCH-44.patch) too many search results --- Key

[jira] Updated: (NUTCH-44) too many search results

2007-09-08 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-44: --- Attachment: NUTCH-44.patch Updated my previous patch to fix the issue in opensearch too. To apply:- patch

[jira] Updated: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2007-09-09 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-281: Attachment: NUTCH-281.patch Uploading a patch. Put the base tag outside comments and now the relative

[jira] Created: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-18 Thread Susam Pal (JIRA)
Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Susam Pal 'protocol-http11' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes

[jira] Updated: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-18 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-557: Attachment: protocol-http11v0.1.patch I have generated this patch against Nutch trunk. To apply:- patch

[jira] Updated: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-18 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-557: Priority: Minor (was: Major) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

[jira] Commented: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-19 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528854 ] Susam Pal commented on NUTCH-557: - No, there isn't any significant difference in performance. Here's a list

[jira] Created: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server

2007-09-24 Thread Susam Pal (JIRA)
Components: fetcher Affects Versions: 1.0.0 Reporter: Susam Pal Priority: Minor Added basic, digest and NTLM authentication schemes to protocol-httpclient. The authentication schemes can be configured for proxy server as well as web servers of a domain

[jira] Updated: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server

2007-09-24 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-559: Attachment: NUTCH-559v0.1.patch I have generated this patch against Nutch trunk. It will add support

[jira] Closed: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-24 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal closed NUTCH-557. --- Resolution: Won't Fix As per the discussion, 'protocol-http11' has been turned into a patch for 'protocol

[jira] Issue Comment Edited: (NUTCH-539) HttpClient plugin does not work with BasicAuthentication

2007-09-25 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530175 ] susam edited comment on NUTCH-539 at 9/25/07 10:54 AM: --- 1. There is a bug in the patch. The

[jira] Updated: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server

2007-09-25 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-559: Priority: Major (was: Minor) Apart from adding the authentication features, this patch would fix three

[jira] Commented: (NUTCH-560) protocol-httpclient reading more bytes than http.content.limit

2007-09-26 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530519 ] Susam Pal commented on NUTCH-560: - I analysed 'protocol-http' and it behaves almost in the same manner. While

[jira] Updated: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server

2007-11-01 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-559: Attachment: NUTCH-559v0.4.patch Uploading a revised (v0.4) patch that has all authentication configuration

[jira] Updated: (NUTCH-559) NTLM, Basic and Digest Authentication schemes for web/proxy server

2007-11-28 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-559: Attachment: NUTCH-559v0.5.patch Uploading a revised (v0.5) patch with some test cases. Added a 'scheme

[jira] Updated: (NUTCH-601) Recrawling on existing crawl directory using force option

2008-02-04 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-601: Attachment: NUTCH-601v0.2.patch Attached a revised patch (NUTCH-601v0.2.patch), which removes the old

[jira] Updated: (NUTCH-601) Recrawling on existing crawl directory using force option

2008-02-04 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-601: Attachment: NUTCH-601v0.1.patch Patch attached. Recrawling on existing crawl directory using force option

[jira] Created: (NUTCH-601) Recrawling on existing crawl directory using force option

2008-02-04 Thread Susam Pal (JIRA)
Versions: 1.0.0 Reporter: Susam Pal Priority: Minor Added a '-force' option to the 'bin/nutch crawl' command line. With this option, one can crawl and recrawl in the following manner: {code} bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 bin/nutch crawl urls -dir

[jira] Commented: (NUTCH-601) Recrawling on existing crawl directory using force option

2008-02-05 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565848#action_12565848 ] Susam Pal commented on NUTCH-601: - The 'if (newIndex != index)' condition is just a check

[jira] Updated: (NUTCH-601) Recrawling on existing crawl directory using force option

2008-02-15 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-601: Attachment: NUTCH-601v1.0.patch Attached another patch (NUTCH-601v1.0.patch) that always deletes the old

[jira] Updated: (NUTCH-601) Recrawling on existing crawl directory using force option

2008-02-15 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-601: Attachment: NUTCH-601v0.3.patch Attached a revised patch (NUTCH-601v0.3.patch) that makes the code simpler

[jira] Updated: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl

2008-02-15 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-612: Attachment: NUTCH-612v0.1.patch Attached patch to fix the bug. This modifies Crawl.java and Generator.java

[jira] Created: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl

2008-02-15 Thread Susam Pal (JIRA)
Components: generator Affects Versions: 1.0.0 Reporter: Susam Pal Fix For: 1.0.0 When a crawl is done using the 'bin/nutch crawl' command, no filtering is done in Generator even if 'crawl.generate.filter' is set to true in the configuration file. The problem

[jira] Created: (NUTCH-735) crawl-tool.xml must be read before nutch-site.xml when invoked using crawl command

2009-05-09 Thread Susam Pal (JIRA)
Issue Type: Bug Components: web gui Affects Versions: 1.0.0 Reporter: Susam Pal Priority: Minor The inline documentation of 'conf/crawl-tool.xml' mentions: {code:xml} <!-- Do not modify this file directly. Instead, copy entries that you --> <!-- wish

[jira] Updated: (NUTCH-735) crawl-tool.xml must be read before nutch-site.xml when invoked using crawl command

2009-05-09 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-735: Attachment: NUTCH-735v0.1.patch Attached patch. crawl-tool.xml must be read before nutch-site.xml when