Hi,
Looking at your website, I can see two different kinds of results:
- The first hit, for example,
http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf, has no summary, and
its cache link produces the error.
- The third hit belongs to the second group: these hits have a summary, and
their cache link works fine.
So it looks like Nutch can't access the content of the first group of hits. Maybe
the parse-pdf plugin can't handle this PDF. That can happen, and it would also
explain why the title of the first-group hits is the URL and not the title
stored inside the PDF document.
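To make that rule of thumb concrete, here is a minimal Python sketch. It assumes you have each hit's title, URL, and summary as plain strings; the function name and the second sample hit are hypothetical and only there for illustration.

```python
def looks_unparsed(title: str, url: str, summary: str) -> bool:
    """Heuristic from the observation above: the hit shows its URL as the
    title and has no summary, which suggests the content was never parsed."""
    return title.strip() == url.strip() and not summary.strip()


# Sample hits. The first mirrors the PDF discussed above; the second is a
# made-up example of a hit that was parsed normally.
hits = [
    {
        "url": "http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf",
        "title": "http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf",
        "summary": "",
    },
    {
        "url": "http://example.org/parsed.pdf",  # hypothetical
        "title": "A parsed document",            # hypothetical
        "summary": "some extracted text",        # hypothetical
    },
]

# URLs worth re-crawling on their own to inspect the log.
suspects = [h["url"] for h in hits if looks_unparsed(h["title"], h["url"], h["summary"])]
```

This only flags candidates; the log file is still what tells you whether the parser really failed.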
If I were you, I would crawl only the first hit (
http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf ) and look at the log
file. If parse-pdf can't handle this document, you will see a big ERROR
message.
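If the log is long, a small filter can help. This is only a sketch: the log file location depends on your Nutch setup, so the path is passed in by you, and the exact wording of the error line is not something I am assuming here. It simply keeps ERROR/WARN/FATAL lines that mention the URL or the word "parse". Both function names are hypothetical.

```python
import re
from typing import Iterable, List

# Log levels worth looking at when hunting for a failed parse.
_LEVEL = re.compile(r"\b(ERROR|WARN|FATAL)\b")


def find_parse_errors(lines: Iterable[str], url: str) -> List[str]:
    """Return log lines at ERROR/WARN/FATAL level that mention either the
    given URL or the word 'parse' (case-insensitive)."""
    matches = []
    for line in lines:
        if _LEVEL.search(line) and (url in line or "parse" in line.lower()):
            matches.append(line.rstrip("\n"))
    return matches


def scan_log_file(path: str, url: str) -> List[str]:
    """Convenience wrapper: read the log at `path` (a location you supply)
    and filter it with find_parse_errors."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_parse_errors(handle, url)
```

You would call scan_log_file with the path to your own log and the PDF URL above, then read whatever lines come back.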
Hope it helps.
Alvaro C.
2006/9/14, Jacob Brunson <[EMAIL PROTECTED]>:
>
> I don't know if I completely understand your email.
> What do you mean by "cache"?
So if you go with the standard search results page, there is a link to
a cached copy of the page. If the page was HTML, there are no
problems; however, if the page was binary, it returns an HTTP 500
Internal Server Error.
You can see this if you click on the "cached" link of any of the pdf
documents in the search results on my search engine:
http://ldssearch.com/search.jsp?lang=en&query=pdf
>
> steven shingler wrote:
> > Hi all,
> >
> > I'm trying to find out which filetypes nutch will cache.
> >
> > For example, it caches HTML, but not PDF.
> >
> > Is there any documentation on how different filetypes are handled?
> >
> > Is it possible to configure nutch to cache pdfs etc?
> >
> > Any advice very gratefully received.
> > Thanks,
> > Steve
> >
> >
>
--
http://JacobBrunson.com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general