Hi Briggs, hi Ronny, hi all,

First of all, thanks for your help!!
I tried both methods from you, Ronny, and additionally the one from Briggs. They both have the same effect: if I exclude all extensions that should not be indexed, as you both described (so the only extension that remains is pdf), the crawler does not even parse a single site. The crawl just ends with no page indexed.

Let me try to describe my problem again: the crawler should really parse all sites starting from the seed URL, no matter which extension. But what the crawler should then do is index only the PDF documents. All other pages that are not PDF documents should not be listed, and also not fetched and therefore not saved. So the only final results listed when I do a search in the Nutch search engine should be URLs of PDFs, nothing else.

Hope I made my problem a bit clearer ;)

greetz
martin

Quoting Briggs <[EMAIL PROTECTED]>:

> Ronny, your way is probably better. See, I was only dealing with the
> fetched properties. But, in your case, you don't fetch it, which gets rid
> of all that wasted bandwidth.
>
> For dealing with types that can be identified via the file extension, this
> would probably work better.
>
>
> On 6/7/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> >
> > Hi.
> >
> > Configure crawl-urlfilter.txt.
> > You want to add something like +\.pdf$. I guess another way would be
> > to exclude all the others.
> >
> > Try expanding the line below with html, doc, xls, ppt, etc.:
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
> >
> > Or try including
> > +\.pdf$
> > #
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
> > followed by
> > -.
> >
> > Haven't tried it myself, but experiment some and I guess you'll figure it
> > out pretty soon.
> >
> > Regards,
> > Ronny
> >
> > -----Original Message-----
> > From: Martin Kammerlander [mailto:[EMAIL PROTECTED]
> > Sent: 6 June 2007 20:30
> > To: [EMAIL PROTECTED]
> > Subject: indexing only special documents
> >
> > hi!
> >
> > I have a question. If I have, for example, some seed URLs and do a crawl
> > based on those seeds, and I then want to index only the pages that
> > contain, for example, PDF documents, how can I do that?
> >
> > cheers
> > martin
>
> --
> "Conscious decisions by conscious minds are what make reality real"
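For reference, a minimal sketch of the crawl-urlfilter.txt setup Ronny describes, assuming the stock Nutch regex URL filter behaviour (rules are tried top to bottom and the first matching +/- prefix decides whether a URL is accepted; the extension list is only an example):

    # crawl-urlfilter.txt (sketch of the "include pdf, exclude the rest" variant)

    # accept anything ending in .pdf
    +\.pdf$

    # reject the listed extensions
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$

    # reject everything else
    -.

The catch is that this filter is applied before fetching, so the final "-." also rejects the plain HTML pages whose links lead to the PDFs; the crawler then never gets past the seed URL, which matches the behaviour Martin reports. Keeping a catch-all "+." at the end instead would let the crawl follow HTML pages again, but then restricting the search results to PDFs would have to happen at a later stage, which is exactly Martin's remaining question.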
