Use the normalizer.
If you look at regex-normalizer.xml (in the conf directory), it
should already have a rule to remove JSESSIONIDs.
This one is a bit unusual as it contains a "-", so we may need
to write another rule -- should be simple.
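For illustration, a rule for a hyphenated session-id parameter might look like this in the normalizer config (the pattern and parameter name here are assumptions for the example, not the rule that ships with Nutch):

```xml
<!-- Illustrative regex-normalize.xml entry; the pattern below is a sketch,
     assuming a hypothetical "jsession-id" query parameter with a hyphen. -->
<regex>
  <pattern>(?i)([;&amp;?]jsession-id=[^&amp;#]*)</pattern>
  <substitution></substitution>
</regex>
```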
From: [EMAIL PROTECTED]
Hi,
Can anyone help? I know it is a bit of a newbie
question, but I'm stuck with it.
Shri
- Original Message -
From: LocalSearch.HK
To: [EMAIL PROTECTED]
Sent: Thursday, February 24, 2005 11:27 PM
Subject: [Nutch-general] Question about normalizing urls
Hi,
I think the Nutch page is not up to date. Nutch does have plugins for
parsing non-HTML content like Word, RTF, and PDF. A few people had
reported an issue of the parsing stage hanging when PDF files are
being parsed. I had faced this issue and it is a random occurrence. If
you don't find anything ...
On the implementation side, you might want to look at index-basic and
query-basic to see how the indexing and querying are done. If
you would like to add more metadata for the documents being indexed,
you can extend these or write specific plugins which add specific
pieces of metadata.
Kashif
In the regex-urlfilter.txt file, only allow .pdf:

+\.pdf$
-.

This will only allow files ending in .pdf and ignore everything else.
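For intuition, the URL filter applies its rules top to bottom and the first matching rule decides whether a URL is kept. A small Python sketch of that semantics, mirroring the two lines above (this is an illustration of the rule ordering, not Nutch's actual implementation):

```python
import re

# Rules mirror the regex-urlfilter.txt entries above; first match wins.
RULES = [
    ("+", re.compile(r"\.pdf$")),  # accept URLs ending in .pdf
    ("-", re.compile(r".")),       # reject everything else
]

def accepts(url):
    """Return True if the first rule matching the URL is an accept (+) rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: URL is filtered out
```

Swapping the two rules would reject everything, since "-." matches every URL first; rule order matters.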
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kashif Khadim
Sent: Saturday, February 26, 2005 6:26 AM
To: nutch-developers@li
Hi,
I just want to index PDF files from my website using an intranet crawl. I don't want HTML or other files. How can I do this?
Thanks.
Kashif.
Bugs item #1152281, was opened at 2005-02-26 12:32
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1152281&group_id=59548
Category: None
Group: None
Status: Open
Resolution: None