Hi Briggs, hi Ronny, hi all,

First of all, thanks for your help!!
I tried both methods from you, Ronny, and additionally the one from Briggs. They both have the same effect: if I exclude all extensions that should not be indexed, as you both described (so the only extension that remains is pdf), the crawler does not even parse a single site. The crawl just ends with no page indexed.

Let me try to describe my problem again: the crawler should really parse all sites starting from the seed URL, no matter which extension. But what the crawler should then do is index only the PDF documents. All other pages that are not PDF documents should not be listed, and also not fetched and therefore not saved. So the only final results listed when I do a search in the Nutch search engine should be URLs of PDFs, nothing else.

Hope I made my problem a bit clearer ;)

greetz
martin

Quoting Briggs <[EMAIL PROTECTED]>:

> Ronny, your way is probably better. See, I was only dealing with the
> fetched properties. But, in your case, you don't fetch it, which gets rid
> of all that wasted bandwidth.
>
> For dealing with types that can be identified via the file extension, this
> would probably work better.
>
>
> On 6/7/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> >
> > Hi.
> >
> > Configure crawl-urlfilter.txt.
> > You want to add something like +\.pdf$. I guess another way would be
> > to exclude all the others.
> >
> > Try expanding the line below with html, doc, xls, ppt, etc.:
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
> >
> > Or try including
> > +\.pdf$
> > #
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
> > followed by
> > -.
> >
> > Haven't tried it myself, but experiment some and I guess you'll figure it
> > out pretty soon.
> >
> > Regards,
> > Ronny
> >
> > -----Original Message-----
> > From: Martin Kammerlander [mailto:[EMAIL PROTECTED]
> > Sent: 6 June 2007 20:30
> > To: [EMAIL PROTECTED]
> > Subject: indexing only special documents
> >
> > hi!
> >
> > I have a question. If I have, for example, some seed URLs and do a crawl
> > based on those seeds, and I then want to index only the pages that
> > contain, for example, PDF documents, how can I do that?
> >
> > cheers
> > martin
>
> --
> "Conscious decisions by conscious minds are what make reality real"
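For reference, a minimal sketch of the crawl-urlfilter.txt setup Ronny describes, assuming the stock Nutch regex URL filter behaviour (rules are tried top to bottom and the first matching +/- prefix decides whether a URL is accepted; the extension list is only an example):

    # crawl-urlfilter.txt (sketch of the "include pdf, exclude the rest" variant)

    # accept anything ending in .pdf
    +\.pdf$

    # reject the listed extensions
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$

    # reject everything else
    -.

The catch is that this filter is applied before fetching, so the final "-." also rejects the plain HTML pages whose links lead to the PDFs; the crawler then never gets past the seed URL, which matches the behaviour Martin reports. Keeping a catch-all "+." at the end instead would let the crawl follow HTML pages again, but then restricting the search results to PDFs would have to happen at a later stage, which is exactly Martin's remaining question.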
