Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
Just to touch on our use case: the search is for a public financial website whose PDFs are "locked" (password protected) so that users cannot change the financial information in them. They are not password protected to limit access, only to limit modification with tools like Acrobat. Since the PDFs are all created by the financial company, they also share a single password, so this is a fairly simple use case.

Thanks for the discussion!

John

From: Jorge Luis Betancourt Gonzalez
To: user@nutch.apache.org
Sent: Wednesday, February 13, 2013 6:13 PM
Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

That's precisely my point. I think the modification should support regular expressions to specify passwords; this would be a good addition to Nutch.
Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
That's precisely my point. I think the modification should support regular expressions to specify passwords; this would be a good addition to Nutch.

- Original message -
From: "Tejas Patil"
To: user@nutch.apache.org
Sent: Wednesday, February 13, 2013 16:54:58
Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Absolutely. Normally crawlers are expected to gather pages which are publicly accessible. On the internet or an intranet, if a PDF file is protected, it is expected to be only for a small subset of users who know the password, and so it should not pop up in search results. From an information security perspective, it's fair if the crawler doesn't parse these files. Also, the percentage of such files relative to normal pages is small. The scenario where a majority of the PDF files being crawled are protected is rare. If that happens, it makes sense to assume that the people crawling know the files and their corresponding passwords beforehand. If the password is common, say "xyx.com/docs/pages/abc/*" has the same password for all PDF files, then a facility to provide a pattern would be convenient instead of listing every URL of that host.

Thanks,
Tejas Patil
Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
Absolutely. Normally crawlers are expected to gather pages which are publicly accessible. On the internet or an intranet, if a PDF file is protected, it is expected to be only for a small subset of users who know the password, and so it should not pop up in search results. From an information security perspective, it's fair if the crawler doesn't parse these files. Also, the percentage of such files relative to normal pages is small. The scenario where a majority of the PDF files being crawled are protected is rare. If that happens, it makes sense to assume that the people crawling know the files and their corresponding passwords beforehand. If the password is common, say "xyx.com/docs/pages/abc/*" has the same password for all PDF files, then a facility to provide a pattern would be convenient instead of listing every URL of that host.

Thanks,
Tejas Patil

On Wed, Feb 13, 2013 at 12:57 PM, Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu> wrote:
> I got this, but it's really tedious work to list passwords for each PDF file
> that will be crawled, don't you think?
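The pattern idea discussed above can be sketched in a few lines. This is only an illustration of the lookup logic, not code from Nutch or parse-tika: the rule list, the `password_for` helper and the password value are all hypothetical, and the regular expression is one possible rendering of the "xyx.com/docs/pages/abc/*" example.

```python
import re

# Hypothetical rules (assumption, not an existing Nutch format):
# each entry pairs a URL pattern with the password shared by all matching PDFs.
PASSWORD_RULES = [
    (re.compile(r"^https?://xyx\.com/docs/pages/abc/.*\.pdf$"), "shared-secret"),
]


def password_for(url, rules=PASSWORD_RULES):
    """Return the password of the first rule whose pattern matches url, else None."""
    for pattern, password in rules:
        if pattern.match(url):
            return password
    return None
```

With one rule per protected directory, a single entry covers every PDF under that path, which avoids listing each URL of the host individually.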
Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
I got this, but it's really tedious work to list passwords for each PDF file that will be crawled, don't you think?

- Original message -
From: "Tejas Patil"
To: user@nutch.apache.org
Sent: Wednesday, February 13, 2013 14:03:21
Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

There can be PDF files of the same name at different hosts, so using the URL would be better than using the name. All this info can be in an XML file which will be read by the PDF plugin.

Thanks,
Tejas Patil
Re: nutch cannot retrieve title and inlinks of a domain
Hi,

I noticed that for other URLs in the seed list, inlinks are saved as ol. I checked the code and figured out that this is done by the part that saves anchors. So, in my case, inlinks are saved as anchors in the field ol in HBase. But for one of the URLs, the title and inlinks are not retrieved, although its parse status is marked success/ok (1/0), args=[].

Alex.

-Original Message-
From: kiran chitturi
To: user
Sent: Wed, Feb 13, 2013 12:40 pm
Subject: Re: nutch cannot retrieve title and inlinks of a domain

Hi Alex,

Inlinks currently do not work for me for the same domain [0]. I am using Nutch-2.x and HBase. Do the inlinks get saved for you for some of the crawl seeds? It is surprising that the title does not get saved. Did you try using parsechecker?

[0] - http://www.mail-archive.com/user@nutch.apache.org/msg08627.html
RE: Nutch identifier while indexing.
You can use the subcollection indexing filter to set a value for URLs that match a string. With it you can distinguish the sites even if they are on the same host and domain.

-Original message-
> From: mbehlok
> Sent: Wed 13-Feb-2013 21:20
> To: user@nutch.apache.org
> Subject: Re: Nutch identifier while indexing.
>
> wish it was that simple:
>
> SiteA = www.myDomain.com/index.aspx?site=1
> SiteB = www.myDomain.com/index.aspx?site=2
> SiteC = www.myDomain.com/index.aspx?site=3
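The matching idea behind this suggestion (label a document when its URL contains a given string) can be sketched as follows. This is not the subcollection plugin's code or its configuration format; the rule list and the `subcollection_for` helper are hypothetical and only illustrate substring matching on the three example URLs.

```python
# Hypothetical rules (assumption): (substring to look for in the URL, label to index).
RULES = [
    ("index.aspx?site=1", "SiteA"),
    ("index.aspx?site=2", "SiteB"),
    ("index.aspx?site=3", "SiteC"),
]


def subcollection_for(url, rules=RULES):
    """Return the label of the first rule whose substring occurs in url, else None."""
    for needle, label in rules:
        if needle in url:
            return label
    return None
```

Plain substring matching is deliberately crude: "site=1" would also match a URL ending in "site=10", so real rules need to be chosen with that in mind.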
Re: nutch cannot retrieve title and inlinks of a domain
Hi Alex,

Inlinks currently do not work for me for the same domain [0]. I am using Nutch-2.x and HBase. Do the inlinks get saved for you for some of the crawl seeds? It is surprising that the title does not get saved. Did you try using parsechecker?

[0] - http://www.mail-archive.com/user@nutch.apache.org/msg08627.html

On Wed, Feb 13, 2013 at 3:26 PM, wrote:
> Hello,
>
> I noticed that Nutch cannot retrieve the title and inlinks of one of the
> domains in the seed list. However, if I run identical code from the server
> where this domain is hosted, then it parses it correctly. The surprising
> thing is that in both cases this URL has
>
> status: 2 (status_fetched)
> parseStatus: success/ok (1/0), args=[]
>
> I used nutch-2.1 with hbase-0.92.1 and nutch 1.4.
>
> Any ideas why this happens?
>
> Thanks.
>
> Alex.

--
Kiran Chitturi
Re: Nutch identifier while indexing.
The only suggestion I have is that you can index the site param at the end of the URLs as a separate field and do a facet search in Solr on that param's values.

Alex.

-Original Message-
From: mbehlok
To: user
Sent: Wed, Feb 13, 2013 12:20 pm
Subject: Re: Nutch identifier while indexing.

wish it was that simple:

SiteA = www.myDomain.com/index.aspx?site=1
SiteB = www.myDomain.com/index.aspx?site=2
SiteC = www.myDomain.com/index.aspx?site=3
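Extracting the site param so that it can be stored in its own field might look like the sketch below. It only shows the URL handling; the `site_id` helper is a hypothetical name, and wiring the value into a Nutch indexing filter and a Solr field is left out.

```python
from urllib.parse import urlparse, parse_qs


def site_id(url):
    """Return the value of the 'site' query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("site")
    return values[0] if values else None
```

For the three URLs in the thread this yields "1", "2" and "3", which is the value a facet or filter query would then be run against.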
nutch cannot retrieve title and inlinks of a domain
Hello,

I noticed that Nutch cannot retrieve the title and inlinks of one of the domains in the seed list. However, if I run identical code from the server where this domain is hosted, then it parses it correctly. The surprising thing is that in both cases this URL has

status: 2 (status_fetched)
parseStatus: success/ok (1/0), args=[]

I used nutch-2.1 with hbase-0.92.1 and nutch 1.4.

Any ideas why this happens?

Thanks.

Alex.
Re: Nutch identifier while indexing.
Wish it was that simple:

SiteA = www.myDomain.com/index.aspx?site=1
SiteB = www.myDomain.com/index.aspx?site=2
SiteC = www.myDomain.com/index.aspx?site=3

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285p4040323.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch identifier while indexing.
Are you saying that your sites have the form siteA.mydomain.com, siteB.mydomain.com, siteC.mydomain.com?

Alex.

-Original Message-
From: mbehlok
To: user
Sent: Wed, Feb 13, 2013 11:05 am
Subject: Nutch identifier while indexing.

Hello,

I am indexing 3 sites: SiteA, SiteB, SiteC.

I want to index these sites in such a way that, when searching in Solr, I can query each of them separately. One could say that's easy, just filter them by host. Wrong: the sites are hosted on the same host but have different starting points. That is, starting the crawl from different root URLs (SiteA, SiteB, SiteC) produces different results. My imagination tells me to somehow specify an identifier in schema.xml that passes to Solr the root URL that produced that crawl. Any ideas on how to implement this? Any variations?

Mitch
Nutch identifier while indexing.
Hello,

I am indexing 3 sites: SiteA, SiteB, SiteC.

I want to index these sites in such a way that, when searching in Solr, I can query each of them separately. One could say that's easy, just filter them by host. Wrong: the sites are hosted on the same host but have different starting points. That is, starting the crawl from different root URLs (SiteA, SiteB, SiteC) produces different results. My imagination tells me to somehow specify an identifier in schema.xml that passes to Solr the root URL that produced that crawl. Any ideas on how to implement this? Any variations?

Mitch

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
There can be PDF files of the same name at different hosts, so using the URL would be better than using the name. All this info can be in an XML file which will be read by the PDF plugin.

Thanks,
Tejas Patil

On Wed, Feb 13, 2013 at 10:35 AM, Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu> wrote:
> What would be a good way of specifying which password goes with which PDF
> file? By full URI, by filename, or something else?
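A URL-keyed XML file of the kind suggested here could be read as sketched below. The element and attribute names (`pdf-passwords`, `entry`, `url`, `password`), the example hosts and the `load_passwords` helper are all made up for illustration; no such file format exists in Nutch as far as this thread shows.

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout (assumption). Two hosts serve a PDF with the same
# file name, which is why the full URL is used as the key.
SAMPLE = """
<pdf-passwords>
  <entry url="http://hosta.example/docs/report.pdf" password="alpha"/>
  <entry url="http://hostb.example/docs/report.pdf" password="beta"/>
</pdf-passwords>
"""


def load_passwords(xml_text):
    """Parse the XML text into a dict mapping each full URL to its password."""
    root = ET.fromstring(xml_text)
    return {entry.get("url"): entry.get("password") for entry in root.iter("entry")}
```

Keying on the file name alone would make the two `report.pdf` entries collide, which is the problem the full URL avoids.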
Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
What would be a good way of specifying which password goes with which PDF file? By full URI, by filename, or something else?

- Original message -
From: "Julien Nioche"
To: user@nutch.apache.org, "John Dhabolt"
Sent: Wednesday, February 13, 2013 13:04:27
Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Hi John,

Currently not, but it should be relatively straightforward to modify parse-tika to do so, and it would be a nice contribution to Nutch.

Julien
Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
Hi John,

Currently not, but it should be relatively straightforward to modify parse-tika to do so, and it would be a nice contribution to Nutch.

Julien

On 13 February 2013 13:53, John Dhabolt wrote:
> Hi,
>
> We have PDFs we need to crawl that have a password associated. I don't see
> a way to pass this password to Tika. Apparently prior to Tika 1.1 the
> password would have been passed in Tika metadata. In Tika 1.1 and greater,
> they've added a new ParseContext object, PasswordProvider, which adds a
> getPassword method. Are either of these methods available to Nutch 1.6
> through a property setting?

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Content Truncation in Nutch 2.1/MySQL
Hi Lewis:

Well, I've done some additional testing and the truncation issue seems to be isolated to the particular web server/site that I'm trying to process. When I run the process against other sites, I'm not seeing the same issue. I guess for processing that site I'll have to go with Plan B. Thanks for your help.

Ward

On Sun, Feb 10, 2013 at 8:19 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> No content should be truncated if you set http.content.limit to -1 and
> leave the default settings on. It is as simple as that.
> Have you recompiled Nutch with some changes you made before continuing
> crawling?
>
> On Fri, Feb 8, 2013 at 9:01 PM, Ward Loving wrote:
> > Well,
> >
> > I spoke too soon. I ran a crawl overnight and I'm seeing all kinds of
> > truncation happening again. I can hardly find a content field in my
> > database that hasn't been truncated. I'm seeing a ton of these warning
> > messages in the log:
> >
> > 2013-02-08 19:40:36,861 WARN parse.ParserJob - http://www.episcopalchurch.org/parish/university-texas-austin-tx skipped. Content of size 30220 was truncated to 29919
> > 2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing http://www.episcopalchurch.org/parish/varina-church-richmond-va
> > 2013-02-08 19:40:36,861 WARN parse.ParserJob - http://www.episcopalchurch.org/parish/varina-church-richmond-va skipped. Content of size 29559 was truncated to 28471
> > 2013-02-08 19:40:36,861 INFO parse.ParserJob - Parsing http://www.episcopalchurch.org/parish/vauters-church-champlain-va
> >
> > This is sort of bizarre. I spot checked 5 pages when I first started the
> > process yesterday morning and all the content in the content fields was
> > complete. Now I'm running it again and nothing is, but I don't see the
> > warning messages that anything is amiss with the data with the first couple
> > of pages I fetched. I've tried updating the following setting to false, but
> > it doesn't seem to help:
> >
> > <property>
> >   <name>parser.skip.truncated</name>
> >   <value>false</value>
> >   <description>Boolean value for whether we should skip parsing for
> >   truncated documents. By default this property is activated due to
> >   extremely high levels of CPU which parsing can sometimes take.
> >   </description>
> > </property>
> >
> > On Thu, Feb 7, 2013 at 5:24 PM, Ward Loving wrote:
> > > Yep, looks like it. The configuration is tricky, no doubt. In my case,
> > > however, I think I had actually fixed the config; I just couldn't tell
> > > that I had resolved the issue. I was looking at stale data.
> > >
> > > On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> > > > So the problem for you is resolved?
> > > > The main (typical) problem here is in the underlying gora-sql library
> > > > and some rather difficult to master gora-sql-mapping.xml constraints.
> > > > Hope all is resolved.
> > > > Lewis
> > > >
> > > > On Thu, Feb 7, 2013 at 1:57 PM, Ward Loving wrote:
> > > > > Alright...very good news. I guess something I did fixed the issue.
> > > > > Once I dropped my webpage table and restarted the process, I'm now
> > > > > getting complete pages. The actual load of the data to that field
> > > > > can happen somewhat later than the fetch entry in the logs. Easy to
> > > > > see when inserting data the first time around. Not as simple to
> > > > > detect when you've loaded data previously. Thanks for your assistance.
> > > > >
> > > > > On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> > > > > > It will produce more output on the fetcher part of your
> > > > > > hadoop.log, not on the parsechecker tool itself; that is why you
> > > > > > are seeing nothing more.
> > > > > > Are you still having problems with the truncation aspect?
> > > > > > Lewis
> > > > > >
> > > > > > On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving wrote:
> > > > > > > Lewis:

--
Ward Loving
Senior Technical Consultant
Appirio, Inc.
www.appirio.com
(706) 225-9475
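To see how widespread the truncation is, the ParserJob warnings quoted in this thread can be pulled out of hadoop.log with a small script. This is a sketch that assumes the warning text looks exactly like the lines quoted above; the regular expression and the `truncated_pages` helper are written for this illustration and are not part of Nutch.

```python
import re

# Matches the warning text quoted in the thread (assumed format):
# "<url> skipped. Content of size <N> was truncated to <M>"
TRUNCATED = re.compile(
    r"(?P<url>https?://\S+)\s+skipped\.\s+"
    r"Content of size (?P<size>\d+) was truncated to (?P<kept>\d+)"
)


def truncated_pages(log_text):
    """Return (url, original_size, kept_size) for every truncation warning found."""
    return [
        (m.group("url"), int(m.group("size")), int(m.group("kept")))
        for m in TRUNCATED.finditer(log_text)
    ]
```

Comparing the two sizes per URL shows how many bytes were lost on each page, which helps tell a site-specific problem apart from a global content limit.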
Slow parse on hadoop
Hi,

I have Nutch running on a Hadoop cluster. Inject, generate, and fetch are working fine; they are executed on multiple nodes. We seem to get only one mapper for the parse job, so the parse step runs on only one node, and it takes a minute or so to parse one page. Please see the log below (1 min 41 s to parse thetimes).

2013-02-13 13:46:02,658 INFO org.apache.nutch.parse.ParserJob: Parsing http://www.thetimes.co.uk/tto/news/
2013-02-13 13:47:43,415 INFO org.apache.nutch.parse.ParserJob: Parsing http://online.wsj.com/home-page

I am using the parse-html plugin to do the job, with Cassandra as the DB. When running locally all is fine. I am running parse with this:

hadoop jar apache-nutch-2.1-SNAPSHOT.job org.apache.nutch.parse.ParserJob $id

I am also including the log from the jobtracker:

Hadoop job_201302131311_0006 on
Job Name: parse
Job-ACLs: All users are allowed
Status: Succeeded
Started at: Wed Feb 13 13:44:06 GMT 2013
Finished at: Wed Feb 13 14:06:30 GMT 2013
Finished in: 22mins, 23sec

Group | Counter | Map | Reduce | Total
ParserStatus | success | 13 | 0 | 13
ParserStatus | notparsed | 1 | 0 | 1
Job Counters | SLOTS_MILLIS_MAPS | 0 | 0 | 1,335,834
Job Counters | Total time spent by all reduces waiting after reserving slots (ms) | 0 | 0 | 0
Job Counters | Total time spent by all maps waiting after reserving slots (ms) | 0 | 0 | 0
Job Counters | Launched map tasks | 0 | 0 | 1
Job Counters | SLOTS_MILLIS_REDUCES | 0 | 0 | 0
File Output Format Counters | Bytes Written | 0 | 0 | 0
File Input Format Counters | Bytes Read | 0 | 0 | 0
FileSystemCounters | HDFS_BYTES_READ | 689 | 0 | 689
FileSystemCounters | FILE_BYTES_WRITTEN | 32,142 | 0 | 32,142
Map-Reduce Framework | Map input records | 138 | 0 | 138
Map-Reduce Framework | Physical memory (bytes) snapshot | 417,538,048 | 0 | 417,538,048
Map-Reduce Framework | Spilled Records | 0 | 0 | 0
Map-Reduce Framework | Total committed heap usage (bytes) | 186,449,920 | 0 | 186,449,920
Map-Reduce Framework | CPU time spent (ms) | 1,379,340 | 0 | 1,379,340
Map-Reduce Framework | Virtual memory (bytes) snapshot | 1,163,165,696 | 0 | 1,163,165,696
Map-Reduce Framework | SPLIT_RAW_BYTES | 689 | 0 | 689
Map-Reduce Framework | Map output records | 14 | 0 | 14
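The per-page parse time quoted in this message can be checked directly from the two log timestamps. The sketch below assumes the log lines start with a timestamp in the "YYYY-MM-DD HH:MM:SS,mmm" layout shown above; `parse_ts` and `seconds_between` are hypothetical helper names.

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(first_line, second_line):
    """Elapsed seconds between the timestamps of two log lines."""
    return (parse_ts(second_line) - parse_ts(first_line)).total_seconds()


start = "2013-02-13 13:46:02,658 INFO org.apache.nutch.parse.ParserJob: Parsing http://www.thetimes.co.uk/tto/news/"
end = "2013-02-13 13:47:43,415 INFO org.apache.nutch.parse.ParserJob: Parsing http://online.wsj.com/home-page"
elapsed = seconds_between(start, end)  # 100.757 seconds, i.e. about 1 min 41 s
```

Running the same subtraction over consecutive "Parsing" lines in the full log would show whether every page is this slow or only a few.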