Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread John Dhabolt
Just to touch on our use case: the search is for a public financial website 
with PDFs "locked" (password protected) so that users cannot change the 
financial information in the PDF. So they're not password protected to limit 
access, simply to limit the ability to modify them with tools like Acrobat. Since 
these are created by the financial company, they all share a single password, 
so this is a fairly simple use case.

Thanks for the discussion!

John




Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Jorge Luis Betancourt Gonzalez
That's precisely my point: I think the modification should support regular 
expressions for specifying passwords. I think this would be a good addition to 
Nutch.
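
As a rough illustration of that idea (purely a sketch; no such class exists in Nutch today, and all names below are made up), the lookup inside parse-tika could be as simple as a list of regex-to-password rules checked against the page URL:

  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.regex.Pattern;

  // Hypothetical helper: maps URL regexes to PDF passwords so the parser can
  // pick the right password per document. Illustrative only.
  public class PdfPasswordRules {
    private final Map<Pattern, String> rules = new LinkedHashMap<Pattern, String>();

    public void addRule(String urlRegex, String password) {
      rules.put(Pattern.compile(urlRegex), password);
    }

    // Returns the password of the first matching rule, or null if none match.
    public String passwordFor(String url) {
      for (Map.Entry<Pattern, String> e : rules.entrySet()) {
        if (e.getKey().matcher(url).find()) {
          return e.getValue();
        }
      }
      return null;
    }
  }

A wildcard case like the "xyx.com/docs/pages/abc/*" example mentioned elsewhere in this thread would then be one rule, e.g. addRule("^http://xyx\\.com/docs/pages/abc/.*\\.pdf$", "secret").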



Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Tejas Patil
Absolutely. Normally crawlers are expected to gather pages which are
publicly accessible. On the internet or an intranet, if a PDF file is protected,
then it is expected that it's only for a small subset of users who know the
password, and so it should not pop up in search results. From an information
security perspective, it's fair if the crawler doesn't parse these files.

Also, the percentage of such files relative to normal pages is small. The
scenario where a majority of the PDF files being crawled are protected
is rare. If that happens, it makes sense to assume that the people crawling know the files
and their corresponding passwords beforehand. If the password is common,
say "xyx.com/docs/pages/abc/*" has the same password for all PDF files, then
a facility to provide a pattern would be more convenient than listing
every url of that host.

Thanks,
Tejas Patil




Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Jorge Luis Betancourt Gonzalez
I get this, but it's really tedious work to list passwords for each PDF file that 
will be crawled, don't you think?



Re: nutch cannot retrieve title and inlinks of a domain

2013-02-13 Thread alxsss
Hi,

I noticed that for other urls in the seed, inlinks are saved as ol. I checked 
the code and figured out that this is done by the part that saves anchors. 
So, in my case inlinks are saved as anchors in the field ol in hbase. But, for 
one of the urls, title and inlinks are not retrieved, although its parse 
status is marked success/ok (1/0), args=[]. 

Alex.

 

 

 



RE: Nutch identifier while indexing.

2013-02-13 Thread Markus Jelsma
You can use the subcollection indexing filter to set a value for URLs that 
match a string. With it you can distinguish them even if they are on the same host 
and domain.
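
For reference, a rough sketch of how that could look with the subcollection plugin's conf/subcollections.xml (double-check the exact file format against the plugin shipped with your Nutch version). Whitelist entries are URL prefixes, so the ?site= query string can be used to tell the sites apart, provided URL normalization keeps the query string:

  <?xml version="1.0"?>
  <subcollections>
    <subcollection>
      <name>siteA</name>
      <id>siteA</id>
      <whitelist>http://www.myDomain.com/index.aspx?site=1</whitelist>
      <blacklist></blacklist>
    </subcollection>
    <subcollection>
      <name>siteB</name>
      <id>siteB</id>
      <whitelist>http://www.myDomain.com/index.aspx?site=2</whitelist>
      <blacklist></blacklist>
    </subcollection>
    <subcollection>
      <name>siteC</name>
      <id>siteC</id>
      <whitelist>http://www.myDomain.com/index.aspx?site=3</whitelist>
      <blacklist></blacklist>
    </subcollection>
  </subcollections>

The plugin also has to be enabled in plugin.includes; the matching subcollection name is then indexed into its own field, which you can filter or facet on in Solr.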
 


Re: nutch cannot retrieve title and inlinks of a domain

2013-02-13 Thread kiran chitturi
Hi Alex,

Inlinks do not work for me right now for the same domain [0]. I am
using Nutch 2.x and HBase. Do the inlinks get saved for you for some of
the crawl seeds?

Surprising that the title does not get saved. Did you try using parsechecker?


[0] - http://www.mail-archive.com/user@nutch.apache.org/msg08627.html
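
The parsechecker tool mentioned above can be run directly against the problem URL to see exactly what the parser extracts, for example (flags may vary slightly between Nutch versions; the URL below is a placeholder):

  bin/nutch parsechecker -dumpText http://www.example.com/some-page/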





-- 
Kiran Chitturi


Re: Nutch identifier while indexing.

2013-02-13 Thread alxsss
The only suggestion I have is that you can index the site param at the end 
of the urls as a separate field and then do a facet search in Solr on that field's 
values.

Alex.
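
A bare-bones sketch of that approach, written against the Nutch 1.x IndexingFilter interface (the 2.x interface differs, the class and field names here are just examples, and the plugin wiring via plugin.xml and plugin.includes is omitted):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  // Copies the ?site= query parameter into a separate "site" index field so
  // Solr can facet or filter on it. URL normalization must keep the query
  // string for this to work.
  public class SiteParamIndexingFilter implements IndexingFilter {
    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      String u = url.toString();
      int q = u.indexOf("?site=");
      if (q >= 0) {
        String site = u.substring(q + "?site=".length());
        int amp = site.indexOf('&');
        if (amp >= 0) {
          site = site.substring(0, amp);
        }
        doc.add("site", site);
      }
      return doc;
    }

    public void setConf(Configuration conf) { this.conf = conf; }

    public Configuration getConf() { return conf; }
  }

In Solr the field can then be used for filtering (fq=site:1) or faceting (facet.field=site).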

 

 

 



nutch cannot retrieve title and inlinks of a domain

2013-02-13 Thread alxsss
Hello,

I noticed that nutch cannot retrieve title and inlinks of one of the domains in 
the seed list. However, if I run identical code from the server where this 
domain is hosted, then it correctly parses it. The surprising thing is that in 
both cases this url has

status: 2 (status_fetched)
parseStatus:success/ok (1/0), args=[]


I used nutch-2.1 with hbase-0.92.1 and nutch 1.4.


Any ideas why this happens?

Thanks.

Alex. 


Re: Nutch identifier while indexing.

2013-02-13 Thread mbehlok
I wish it were that simple:

SiteA = www.myDomain.com/index.aspx?site=1

SiteB = www.myDomain.com/index.aspx?site=2

SiteC = www.myDomain.com/index.aspx?site=3



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285p4040323.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch identifier while indexing.

2013-02-13 Thread alxsss
Are you saying that your sites have the form siteA.mydomain.com, 
siteB.mydomain.com, siteC.mydomain.com?

Alex.

 

 

 



Nutch identifier while indexing.

2013-02-13 Thread mbehlok
Hello, I am indexing 3 sites:

SiteA
SiteB
SiteC

I want to index these sites in a way that, when searching them in Solr, I can
query each of these sites separately. So one could say... that's
easy, just filter them by host... WRONG... The sites are hosted on the same
host but have different starting points. That is, starting the crawl from
different root urls (SiteA, SiteB, SiteC) produces different results. My
idea is to somehow specify an identifier in schema.xml that
tells Solr which root url produced that crawl. Any ideas on
how to implement this? Any variations?

Mitch 
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Tejas Patil
There can be PDF files with the same name at different hosts, so using the URL
would be better than using the name. All this info can be in an XML file
which will be read by the PDF plugin.

Thanks,
Tejas Patil
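
Purely as a sketch of what such a file might contain (the element names here are invented; no such config file exists in Nutch today, and the actual format would be defined by whoever implements the change):

  <!-- hypothetical conf/pdf-passwords.xml -->
  <pdf-passwords>
    <entry>
      <url>http://www.example.com/docs/annual-report.pdf</url>
      <password>changeme</password>
    </entry>
    <entry>
      <!-- a pattern entry, as suggested elsewhere in the thread -->
      <urlPattern>http://www\.example\.com/docs/pages/abc/.*\.pdf</urlPattern>
      <password>shared-password</password>
    </entry>
  </pdf-passwords>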




Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Jorge Luis Betancourt Gonzalez
What would be a good way of specifying which password goes with which PDF 
file? By full URI, by filename, or something else?



Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Julien Nioche
Hi John,

Currently not, but it should be relatively straightforward to modify
parse-tika to do so, and it would be a nice contribution to Nutch.

Julien

On 13 February 2013 13:53, John Dhabolt  wrote:

> Hi,
>
> We have PDFs we need to crawl that have a password associated. I don't see
> a way to pass this password to Tika. Apparently prior to Tika 1.1 the
> password would have been passed in Tika metadata. In Tika 1.1 and greater,
> they've added a new ParseContext object, PasswordProvider, which adds a
> getPassword method. Are either of these methods available to Nutch 1.6
> through a property setting?
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
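
For anyone picking this up, a minimal sketch of the Tika 1.1+ side of such a change. The per-URL password lookup is the part that would still need to be added to parse-tika; the snippet below only shows how a known password is handed to Tika through the ParseContext:

  import java.io.InputStream;

  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.parser.PasswordProvider;
  import org.apache.tika.sax.BodyContentHandler;

  public class EncryptedPdfParseSketch {

    // Parses the given stream, supplying the password when Tika hits an
    // encrypted document, and returns the extracted text.
    public static String parseWithPassword(InputStream content, final String password)
        throws Exception {
      AutoDetectParser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();
      context.set(PasswordProvider.class, new PasswordProvider() {
        public String getPassword(Metadata md) {
          return password;
        }
      });
      parser.parse(content, handler, metadata, context);
      return handler.toString();
    }
  }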


Re: Content Truncation in Nutch 2.1/MySQL

2013-02-13 Thread Ward Loving
Hi Lewis:

Well, I've done some additional testing and the truncation issue seems to
be isolated to the particular web server/site that I'm trying to process.
 When I run the process against other sites, I'm not seeing the same issue.
 I guess for processing that site I'll have to go with Plan B.

Thanks for your help.

Ward


On Sun, Feb 10, 2013 at 8:19 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> No content should be truncated if you set http.content.limit to -1 and
> leave the default settings on. It is as simple as that.
> Have you recompiled Nutch with some changes you made before continuing
> crawling?
>
> On Fri, Feb 8, 2013 at 9:01 PM, Ward Loving  wrote:
>
> > Well,
> >
> > I spoke too soon.  I ran a crawl overnight and I'm seeing all kinds of
> > truncation happening again.   I can hardly find a content field in my
> > database that hasn't been truncated.  I'm seeing a ton of these warning
> > messages in the log:
> >
> > 2013-02-08 19:40:36,861 WARN  parse.ParserJob -
> > http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped.
> > Content of size 30220 was truncated to 29919
> > 2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
> > http://www.episcopalchurch.org/parish/varina-church-richmond-va
> > 2013-02-08 19:40:36,861 WARN  parse.ParserJob -
> > http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped.
> > Content of size 29559 was truncated to 28471
> > 2013-02-08 19:40:36,861 INFO  parse.ParserJob - Parsing
> > http://www.episcopalchurch.org/parish/vauters-church-champlain-va
> >
> > This is sort of bizarre.  I spot checked 5 pages when I first started the
> > process yesterday morning and all the content in the content fields was
> > complete.  Now I'm running it again and nothing is, but I don't see any
> > warning messages that anything is amiss with the first couple of pages I
> > fetched.  I've tried updating the following setting to false but it
> > doesn't seem to help:
> >
> > <property>
> >   <name>parser.skip.truncated</name>
> >   <value>false</value>
> >   <description>Boolean value for whether we should skip parsing for
> > truncated documents. By default this
> >   property is activated due to extremely high levels of CPU which parsing
> > can sometimes take.
> >   </description>
> > </property>
> >
> >
> >
> >
> >
> >
> > On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving  wrote:
> >
> > > Yep, looks like it.  The configuration is tricky no doubt.  In my case,
> > > however, I think I had actually fixed the config, I just couldn't tell
> > that
> > > I had resolved the issue.  I was looking at stale data.
> > >
> > >
> > > On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <
> > > lewis.mcgibb...@gmail.com> wrote:
> > >
> > >> So the problem for you is resolved?
> > >> The main (typical) problem here is in the underlying gora-sql library
> > and
> > >> some rather difficult to master gora-sql-mapping.xml constraints.
> > >> Hope all is resolved
> > >> Lewis
> > >>
> > >> On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving  wrote:
> > >>
> > >> > Alright...very good news.  I guess something I did fixed the issue.
> > >>  Once I
> > >> > dropped my webpage table and restarted the process, I'm now getting
> > >> > complete pages.  The actual load of the data to that field can
> happen
> > >> > somewhat later than the fetch entry in the logs.  Easy to see when
> > >> > inserting data the first time around.  Not as simple to detect when
> > >> you've
> > >> > loaded data previously. Thanks for your assistance.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <
> > >> > lewis.mcgibb...@gmail.com> wrote:
> > >> >
> > >> > > It will produce more output in the fetcher part of your hadoop.log,
> > >> > > not from the parsechecker tool itself; that is why you are seeing
> > >> > > nothing more.
> > >> > > Are you still having problems with the truncation aspect?
> > >> > > Lewis
> > >> > >
> > >> > > On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving 
> > >> wrote:
> > >> > >
> > >> > > > Lewis:
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Ward Loving
> > >> > Senior Technical Consultant
> > >> > Appirio, Inc.
> > >> > www.appirio.com
> > >> > (706) 225-9475
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> *Lewis*
> > >>
> > >
> > >
> > >
> > > --
> > > Ward Loving
> > > Senior Technical Consultant
> > > Appirio, Inc.
> > > www.appirio.com
> > > (706) 225-9475
> > >
> >
> >
> >
> > --
> > Ward Loving
> > Senior Technical Consultant
> > Appirio, Inc.
> > www.appirio.com
> > (706) 225-9475
> >
>
>
>
> --
> *Lewis*
>



-- 
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475
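
For reference, the http.content.limit property Lewis mentions above goes in nutch-site.xml; a negative value disables truncation of fetched HTTP content (there are corresponding file.content.limit and ftp.content.limit properties for the other protocols):

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content using the http
    protocol, in bytes. A negative value disables truncation.</description>
  </property>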


Slow parse on hadoop

2013-02-13 Thread Žygimantas
Hi,

I have Nutch running on a Hadoop cluster. Inject, generate, and fetch are working 
fine; they are executed on multiple nodes. We seem to get only one mapper for 
the parse job, so the parse step only runs on one node, and it takes a minute or 
so to parse one page. Please see the log below (1 min 41 sec to parse the thetimes.co.uk page).

2013-02-13 13:46:02,658 INFO org.apache.nutch.parse.ParserJob: Parsing http://www.thetimes.co.uk/tto/news/
2013-02-13 13:47:43,415 INFO org.apache.nutch.parse.ParserJob: Parsing http://online.wsj.com/home-page

I am using the parse-html plugin to do the job, with Cassandra as the DB. When running 
locally all is fine.
Running parse with this:
hadoop jar apache-nutch-2.1-SNAPSHOT.job org.apache.nutch.parse.ParserJob $id


Also including the log from the jobtracker:
Hadoop job_201302131311_0006
Job Name: parse
Job-ACLs: All users are allowed
Status: Succeeded
Started at: Wed Feb 13 13:44:06 GMT 2013
Finished at: Wed Feb 13 14:06:30 GMT 2013
Finished in: 22mins, 23sec



Counter                                                              Map            Reduce  Total
ParserStatus
  success                                                            13             0       13
  notparsed                                                          1              0       1
Job Counters
  SLOTS_MILLIS_MAPS                                                  0              0       1,335,834
  Total time spent by all reduces waiting after reserving slots (ms) 0              0       0
  Total time spent by all maps waiting after reserving slots (ms)    0              0       0
  Launched map tasks                                                 0              0       1
  SLOTS_MILLIS_REDUCES                                               0              0       0
File Output Format Counters
  Bytes Written                                                      0              0       0
File Input Format Counters
  Bytes Read                                                         0              0       0
FileSystemCounters
  HDFS_BYTES_READ                                                    689            0       689
  FILE_BYTES_WRITTEN                                                 32,142         0       32,142
Map-Reduce Framework
  Map input records                                                  138            0       138
  Physical memory (bytes) snapshot                                   417,538,048    0       417,538,048
  Spilled Records                                                    0              0       0
  Total committed heap usage (bytes)                                 186,449,920    0       186,449,920
  CPU time spent (ms)                                                1,379,340      0       1,379,340
  Virtual memory (bytes) snapshot                                    1,163,165,696  0       1,163,165,696
  SPLIT_RAW_BYTES                                                    689            0       689
  Map output records                                                 14             0       14