That's precisely my point. I think the modification should support regular
expressions to specify passwords; this would be a good addition to
Nutch.

----- Original Message -----
From: "Tejas Patil" <tejas.patil...@gmail.com>
To: user@nutch.apache.org
Sent: Wednesday, February 13, 2013 16:54:58
Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Absolutely. Normally crawlers are expected to gather pages which are
publicly accessible. On the internet or an intranet, if a PDF file is
protected, then it is expected that it's only for a small subset of users
who know the password, and so it should not pop up in search results. From
an information security perspective, it's fair if the crawler doesn't parse
these files.

Also, the percentage of such files relative to normal pages is small. The
scenario of people crawling where a majority of PDF files are protected
is rare. If that happens, it makes sense to assume that they know the files
and their corresponding passwords beforehand. If the password is common,
say "xyx.com/docs/pages/abc/*" has the same password for all PDF files, then
a facility to provide a pattern would be convenient instead of listing
every URL of that host.
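
To make the pattern idea concrete, here is a minimal sketch of a lookup that
maps URL regular expressions to passwords. It is only an illustration: the
class PdfPasswordRules, its methods, and the password value are hypothetical
and are not part of Nutch or Tika. It assumes that the first matching rule
should win, so more specific patterns need to be added first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch or Tika class): maps URL regexes to PDF passwords.
class PdfPasswordRules {

    // One rule: a compiled URL pattern and the password that goes with it.
    private static final class Rule {
        final Pattern urlPattern;
        final String password;

        Rule(Pattern urlPattern, String password) {
            this.urlPattern = urlPattern;
            this.password = password;
        }
    }

    // Rules are kept in insertion order, so earlier rules take precedence.
    private final List<Rule> rules = new ArrayList<>();

    public void add(String urlRegex, String password) {
        rules.add(new Rule(Pattern.compile(urlRegex), password));
    }

    // Returns the password of the first rule whose regex matches the whole URL,
    // or an empty Optional when no rule applies.
    public Optional<String> lookup(String url) {
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                return Optional.of(rule.password);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        PdfPasswordRules rules = new PdfPasswordRules();
        // Assumed example rule: one shared password for every file under this path.
        rules.add("https?://xyx\\.com/docs/pages/abc/.*", "shared-secret");
        System.out.println(rules.lookup("http://xyx.com/docs/pages/abc/report.pdf"));
        System.out.println(rules.lookup("http://xyx.com/other/report.pdf"));
    }
}
```

A real change to the parsing plugin would presumably load such rules from
configuration and hand the matched password to the parser; that wiring is
left out here.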

Thanks,
Tejas Patil


On Wed, Feb 13, 2013 at 12:57 PM, Jorge Luis Betancourt Gonzalez <
jlbetanco...@uci.cu> wrote:

> I get that, but it is really tedious work to list passwords for each PDF
> file that will be crawled, don't you think?
>
> ----- Original Message -----
> From: "Tejas Patil" <tejas.patil...@gmail.com>
> To: user@nutch.apache.org
> Sent: Wednesday, February 13, 2013 14:03:21
> Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
>
> There can be PDF files with the same name on different hosts, so using the
> URL would be better than using the name. All this info can be in an XML
> file which will be read by the PDF plugin.
>
> Thanks,
> Tejas Patil
>
>
> On Wed, Feb 13, 2013 at 10:35 AM, Jorge Luis Betancourt Gonzalez <
> jlbetanco...@uci.cu> wrote:
>
> > What would be a good way of specifying which password goes with which
> > PDF file: by full URI, by filename, or something else?
> >
> > ----- Original Message -----
> > From: "Julien Nioche" <lists.digitalpeb...@gmail.com>
> > To: user@nutch.apache.org, "John Dhabolt" <myco...@yahoo.com>
> > Sent: Wednesday, February 13, 2013 13:04:27
> > Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
> >
> > Hi John,
> >
> > Currently not, but it should be relatively straightforward to modify
> > parse-tika to do so, and it would be a nice contribution to Nutch.
> >
> > Julien
> >
> > On 13 February 2013 13:53, John Dhabolt <myco...@yahoo.com> wrote:
> >
> > > Hi,
> > >
> > > We have PDFs we need to crawl that have a password associated. I don't
> > > see a way to pass this password to Tika. Apparently prior to Tika 1.1
> > > the password would have been passed in Tika metadata. In Tika 1.1 and
> > > greater, they've added a new ParseContext object, PasswordProvider,
> > > which adds a getPassword method. Is either of these methods available
> > > to Nutch 1.6 through a property setting?
> > >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
