Use the normalizer.
If you look at regex-normalizer.xml (in the conf directory), it
should already have a rule to remove JSESSIONIDs.
This one is a bit unusual as it contains a "-", so we may need
to write another rule -- should be simple.
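For illustration, a rule for a hyphenated session-id parameter might look like this in the normalizer config (the pattern and parameter name here are assumptions for the example, not the rule that ships with Nutch):

```xml
<!-- Illustrative regex-normalize.xml entry; the pattern below is a sketch,
     assuming a hypothetical "jsession-id" query parameter with a hyphen. -->
<regex>
  <pattern>(?i)([;&amp;?]jsession-id=[^&amp;#]*)</pattern>
  <substitution></substitution>
</regex>
```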
From: [EMAIL PROTECTED]
Hi,
Can anyone help? I know it is a bit of a newbie
question, but I'm stuck with it.
Shri
- Original Message -
From: LocalSearch.HK
To: [EMAIL PROTECTED]
Sent: Thursday, February 24, 2005 11:27 PM
Subject: [Nutch-general] Question about normalizing urls
Hi,
I think the Nutch page is not up to date. Nutch does have plugins for
parsing non-HTML content like Word, RTF, and PDF. A few people had
reported an issue of the parsing stage hanging when PDF files are
being parsed. I had faced this issue and it is a random occurrence. If
you don't find anything ...
On the implementation side, you might want to look at index-basic and
query-basic to see how the indexing and querying are done. If
you would like to add more metadata for the documents being indexed,
you can extend these or write specific plugins which add specific
pieces of metadata.
Kashif
In the regex-urlfilter.txt file, only allow .pdf:

+\.pdf$
-.

This will only allow files ending in .pdf and ignore everything else.
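For intuition, the URL filter applies its rules top to bottom and the first matching rule decides whether a URL is kept. A small Python sketch of that semantics, mirroring the two lines above (this is an illustration of the rule ordering, not Nutch's actual implementation):

```python
import re

# Rules mirror the regex-urlfilter.txt entries above; first match wins.
RULES = [
    ("+", re.compile(r"\.pdf$")),  # accept URLs ending in .pdf
    ("-", re.compile(r".")),       # reject everything else
]

def accepts(url):
    """Return True if the first rule matching the URL is an accept (+) rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: URL is filtered out
```

Swapping the two rules would reject everything, since "-." matches every URL first; rule order matters.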
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kashif Khadim
Sent: Saturday, February 26, 2005 6:26 AM
To: nutch-developers@li
Hi,
I just want to index PDF files from my website using an intranet crawl. I don't want HTML or other files. How can I do this?
Thanks.
Kashif.
Bugs item #1152281, was opened at 2005-02-26 12:32
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1152281&group_id=59548
Category: None
Group: None
Status: Open
Resolution: None