Re: Crawl in Nutch2.2
I have http.content.limit and parser.html.impl with their default values...

Benjamin

On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:
> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any exception or error messages?
> Also you might have a look at these configs in nutch-site.xml (default values are in nutch-default.xml):
> http.content.limit and parser.html.impl
>
> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <bs4mailingl...@gmail.com> wrote:
>> Hello
>>
>> I installed Nutch 2.2 on my linux machine.
>>
>> I defined the seed directory with one file containing:
>> http://en.wikipedia.org/
>> http://edition.cnn.com/
>>
>> I ran the following:
>> sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
>>
>> After this step, the call
>> -bash-4.1$ sh bin/nutch readdb -stats
>> returns
>> TOTAL urls: 2
>> status 0 (null): 2
>> avg score: 1.0
>>
>> Then, I ran the following:
>> bin/nutch generate -topN 10
>> bin/nutch fetch -all
>> bin/nutch parse -all
>> bin/nutch updatedb
>> bin/nutch generate -topN 1000
>> bin/nutch fetch -all
>> bin/nutch parse -all
>> bin/nutch updatedb
>>
>> However, the stats call after these steps is still:
>> -bash-4.1$ sh bin/nutch readdb -stats
>> status 5 (status_redir_perm): 1
>> max score: 2.0
>> TOTAL urls: 3
>> avg score: 1.334
>>
>> Only 3 urls?!
>> What do I miss?
>>
>> thanks
>>
>> Benjamin

--
Kiran Chitturi
<http://www.linkedin.com/in/kiranchitturi>
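A quick way to double-check the two configs Tejas mentions and to look for fetch or parse errors (a minimal sketch, assuming the stock conf/ and logs/ layout of a local 2.x install):

  grep -A 3 'http.content.limit' conf/nutch-default.xml conf/nutch-site.xml
  grep -A 3 'parser.html.impl' conf/nutch-default.xml conf/nutch-site.xml
  grep -iE 'error|exception' logs/hadoop.log | tail -n 50

If http.content.limit is still at its default of 65536 bytes, large pages such as the Wikipedia and CNN front pages are truncated before parsing, which can leave very few outlinks for the next generate cycle.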
Re: [VOTE] Apache Nutch 2.2 Release Candidate
+1. This is an important release for the 2.x series. Tejas, I also got it 2 days late. On Fri, May 31, 2013 at 7:17 PM, lewis john mcgibbney wrote: > Good Friday Everyone, > > Glad to get to a stage where we can VOTE on the release of the Apache > Nutch 2.2 artifacts. > > We solved a stack of issues: > http://s.apache.org/LPB > > SVN source tag: > http://svn.apache.org/repos/asf/nutch/tags/release-2.2/ > > Staging repo: > https://repository.apache.org/content/repositories/orgapachenutch-044/ > > Release artifacts: > http://people.apache.org/~lewismc/nutch/nutch2.2/ > > PGP release keys (signed using 4E21557F): > http://nutch.apache.org/dist/KEYS > > Vote will be open for at least 72 hours, however given this weather > I suppose we can all be forgiven if it is not done over the weekend :0) > > I would like to say a huge thanks to all contributors and committers from far > and wide who helped with this release. It is another milestone for us to > get here yet again. > > Have a great weekend > > Lewis > > [ ] +1, let's get it released!!! > [ ] +/-0, fine, but consider fixing a few issues before... > [ ] -1, nope, because... (and please explain why) > > p.s. here's my +1 > > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Error thrown during solrIndex in Nutch2.1
Hi Sznajder, It looks like the index is read-only and you do not have write access; we can see the "Permission denied" status in the stack trace below. On Mon, Jun 3, 2013 at 9:13 AM, Sznajder ForMailingList <bs4mailingl...@gmail.com> wrote:
> Indeed, from the Solr side, I got
>
> Jun 3, 2013 4:05:49 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={version=2&wt=javabin} status=500 QTime=1007
> Jun 3, 2013 4:05:49 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@solr/./data/index/write.lock: java.io.FileNotFoundException: solr/./data/index/write.lock (Permission denied)
> at org.apache.lucene.store.Lock.obtain(Lock.java:84)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1098)
> at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:84)
> at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:101)
> at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:171)
> at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:219)
> at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
> at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
> at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
> at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> Caused by: java.io.FileNotFoundException: solr/./data/index/write.lock (Permission denied)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:229)
> at
org.apache.lucene.store.NativeFSLock.obtain(NativeFSLockFactory.java:203) > at org.apache.lucene.store.Lock.obtain(Lock.java:95) > ... 31 more > > > > On Mon, Jun 3, 2013 at 4:10 PM, kiran chitturi >wrote: > > > Hi Sznajder, > > > > It is hard to know the Solr errors from Nutch side. Please look at Solr > > logs to see what happened. > > > > On Mon, Jun 3, 2013 at 9:04 AM, Sznajder ForMailingList < > > bs4mailingl...@gmail.com> wrote: > > > > > Hi > > > > > > I ran the SolIndex from Nutch2.1 and I am getting the following error > > > (copied from the hadoop.log file): > > > > > > Any hint is welcome.. > > > > > > Benjamin > > > > > > 2013-06-03 14:51:10,428 INFO indexer.IndexingFilters - Adding > > > org.apache.nutch.indexer.anchor.AnchorInd
Re: Error thrown during solrIndex in Nutch2.1
> *Lock obtain timed out: NativeFSLock@solr/./data/index/write.lock: > java.io.FileNotFoundException: solr/./data/index/write.lock (Permission > denied) org.apache.lucene.store.LockObtainFailedException: Lock obtain > timed out: NativeFSLock@solr/./data/index/write.lock: > java.io.FileNotFoundException: solr/./data/index/write.lock (Permission > denied) at org.apache.lucene.store.Lock.obtain(Lock.java:84) * at > org.apache.lucene.index.IndexWriter.(IndexWriter.java:1098)at > org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:84) > On the second note, you should check this too -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
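For reference, a minimal sketch of checking and fixing the ownership, assuming the index lives under solr/data/index as in the trace and that Solr/Jetty runs as a dedicated user (the paths and the user name below are assumptions; adjust them to the actual setup):

  ls -l solr/data/index/          # who owns the index files and write.lock?
  ps -ef | grep -i jetty          # which user is Solr actually running as?
  sudo chown -R solr:solr solr/data    # 'solr:solr' is a hypothetical user/group

After that, restarting Solr should let it obtain the write.lock again.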
Re: Error thrown during solrIndex in Nutch2.1
t > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) > > request: http://ir-hadoop0:8983/solr/update?wt=javabin&version=2 > at > > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430) > at > > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) > at > > org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) > at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) > at > org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:91) > at > > org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:53) > at > > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651) > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Nutch not crawling fully
> fetch of http://www.igate.com/ failed with: Http code=407, url= http://www.igate.com

Hi Suresh, The URL was never successfully fetched: the fetch failed with HTTP 407 (proxy authentication required), which is why it still shows up with db_unfetched status.

> dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch readdb mondaycrawl/crawldb/ -stats
> CrawlDb statistics start: mondaycrawl/crawldb/
> Statistics for CrawlDb: mondaycrawl/crawldb/
> TOTAL urls: 1
> retry 1: 1
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> status 1 (db_unfetched): 1
> CrawlDb statistics: done

--
Kiran Chitturi
<http://www.linkedin.com/in/kiranchitturi>
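For what it is worth, HTTP 407 normally means the crawler sits behind a proxy that requires authentication. Nutch has http.proxy.host and http.proxy.port settings that can be overridden in conf/nutch-site.xml; a quick way to see them and their defaults (assuming a standard conf/ layout):

  grep -B 2 -A 4 'http.proxy' conf/nutch-default.xml

If the proxy also demands credentials, that usually means switching from the default protocol-http plugin to protocol-httpclient and configuring its authentication; check nutch-default.xml and the plugin documentation for the exact property names rather than relying on this sketch.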
Re: Error in resolving some dependencies
Hi Tejas, I didn't get time to test it yet. My thesis writing is keeping me too busy. I will work on Nutch once I get some free time after my defense :) You guys are doing a great job :) On Fri, May 31, 2013 at 5:57 PM, Tejas Patil wrote: > Hi Kiran, > > Happy to know :) > Have you faced any problems with it ? I am in middle of editing the wiki > page and your comments might help me do that. > > Thanks, > Tejas > > > On Fri, May 31, 2013 at 2:55 PM, kiran chitturi > wrote: > > > Thank you Tejas. So happy to get rid of the whole Eclipse setup and have > it > > easy :) > > > > On Fri, May 31, 2013 at 6:40 AM, Tejas Patil > >wrote: > > > > > I have created NUTCH-1577 [0] for tracking this and uploaded patches > for > > > trunk and 2.x. It would be great if someone could try it and give some > > > comments. > > > > > > [0] : https://issues.apache.org/jira/browse/NUTCH-1577 > > > > > > > > > On Fri, May 31, 2013 at 2:18 AM, Tejas Patil > > >wrote: > > > > > > > I think that the current process of setting up Nutch as a project in > > > > eclipse is cumbersome. Especially this one is hellishly boring: > > > > > > > > *In addition, we must manually add EVERY individual plugin src/java > and > > > > src/test folder, although this takes some time it is absolutely > > essential > > > > that this is done.* > > > > > > > > I am thinking of automating this in some way. Adding a eclipse target > > to > > > > create an eclipse project would be worth saving time of everybody who > > > wants > > > > to use Nutch with eclipse. > > > > > > > > Thanks, > > > > Tejas > > > > > > > > > > > > On Wed, May 29, 2013 at 11:20 AM, Adriana Farina < > > > > adriana.farin...@gmail.com> wrote: > > > > > > > >> Hi Kiran, > > > >> > > > >> thank you very much! > > > >> > > > >> I'll try the solution you suggest. > > > >> > > > >> Many thanks to both of you! > > > >> > > > >> > > > >> 2013/5/29 kiran chitturi > > > >> > > > >> > Hi Adriana, > > > >> > > > > >> > I think I noticed something like this in 2.x series. I have noted > > the > > > >> > dependencies here [1] while ago but I also remember the missing > Tika > > > >> > dependency in 2.x. I usually download the missing jar files from > [2] > > > and > > > >> > add it to the project through 'external Jars'. > > > >> > > > > >> > This is not a permanent solution and I am not sure why Eclipse > > throws > > > >> away > > > >> > the missing dependencies. Adding the additional jars is a > temporary > > > fix > > > >> to > > > >> > this problem. > > > >> > > > > >> > Hope this helps. > > > >> > > > > >> > [1] > > > >> > https://wiki.apache.org/nutch/RunNutchInEclipse#Missing_dependencies > > > >> > [2] > > > >> > > > > >> > > > > >> > > > > > > http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.tika/tika-app/1.3 > > > >> > > > > >> > On Wed, May 29, 2013 at 11:51 AM, Adriana Farina < > > > >> > adriana.farin...@gmail.com > > > >> > > wrote: > > > >> > > > > >> > > Hi Feng, > > > >> > > > > > >> > > yes, I use the IvyDE managed dependencies and I configured it as > > > >> > described > > > >> > > in the tutorial, but I still get the import errors. > > > >> > > > > > >> > > 2013/5/29 feng lu > > > >> > > > > > >> > > > It may not find the Tina package ,do you > > > >> > > > use ivyde managed dependencies. 
Such as the tutorial mentioned > > > that > > > >> > > > Remaining in the Libraries tab Add Library > IvyDE Managed > > > >> > Dependencies > > > > >> > > > browse to trunk/ivy/ivy.xml > ensure ALL configuration boxes > are > > > >> > included > > > >> > > > > > > >> > > > On May 29, 2013 9:37 PM, "Adriana Farina" < > > > >> adriana.farin...@gmail.com> > > > >> > > > wrote: > > > >> > > > > > > > >> > > > > Hello, > > > >> > > > > > > > >> > > > > I'm using Nutch 2.1. I follow the guide > > > >> > > > > http://wiki.apache.org/nutch/RunNutchInEclipse to import it > > in > > > >> > > Eclipse. > > > >> > > > The > > > >> > > > > strange thing is that it cannot resolve some import. For > > > example, > > > >> in > > > >> > > > > TikaParser.java, I get the error "The import > > > >> > > org.apache.tika.parser.html > > > >> > > > > cannot be resolved". > > > >> > > > > > > > >> > > > > Am I missing something? Where am I wrong? > > > >> > > > > > > > >> > > > > Thank you very much! > > > >> > > > > > > > >> > > > > > > > >> > > > > -- > > > >> > > > > Adriana Farina > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > -- > > > >> > > Adriana Farina > > > >> > > > > > >> > > > > >> > > > > >> > > > > >> > -- > > > >> > Kiran Chitturi > > > >> > > > > >> > <http://www.linkedin.com/in/kiranchitturi> > > > >> > > > > >> > > > >> > > > >> > > > >> -- > > > >> Adriana Farina > > > >> > > > > > > > > > > > > > > > > > > > -- > > Kiran Chitturi > > > > <http://www.linkedin.com/in/kiranchitturi> > > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
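For anyone else wanting to try the patch: once it is applied, the manual per-plugin classpath setup should reduce to roughly the following (a rough sketch; the directory and patch file names are illustrative, and the exact ant target name comes from the NUTCH-1577 patch, so check build.xml after applying it):

  cd nutch-2.x
  patch -p0 < NUTCH-1577-2.x.patch
  ant eclipse

and then import the directory into Eclipse as an existing project.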
Re: Error in resolving some dependencies
Thank you Tejas. So happy to get rid of the whole Eclipse setup and have it easy :) On Fri, May 31, 2013 at 6:40 AM, Tejas Patil wrote: > I have created NUTCH-1577 [0] for tracking this and uploaded patches for > trunk and 2.x. It would be great if someone could try it and give some > comments. > > [0] : https://issues.apache.org/jira/browse/NUTCH-1577 > > > On Fri, May 31, 2013 at 2:18 AM, Tejas Patil >wrote: > > > I think that the current process of setting up Nutch as a project in > > eclipse is cumbersome. Especially this one is hellishly boring: > > > > *In addition, we must manually add EVERY individual plugin src/java and > > src/test folder, although this takes some time it is absolutely essential > > that this is done.* > > > > I am thinking of automating this in some way. Adding a eclipse target to > > create an eclipse project would be worth saving time of everybody who > wants > > to use Nutch with eclipse. > > > > Thanks, > > Tejas > > > > > > On Wed, May 29, 2013 at 11:20 AM, Adriana Farina < > > adriana.farin...@gmail.com> wrote: > > > >> Hi Kiran, > >> > >> thank you very much! > >> > >> I'll try the solution you suggest. > >> > >> Many thanks to both of you! > >> > >> > >> 2013/5/29 kiran chitturi > >> > >> > Hi Adriana, > >> > > >> > I think I noticed something like this in 2.x series. I have noted the > >> > dependencies here [1] while ago but I also remember the missing Tika > >> > dependency in 2.x. I usually download the missing jar files from [2] > and > >> > add it to the project through 'external Jars'. > >> > > >> > This is not a permanent solution and I am not sure why Eclipse throws > >> away > >> > the missing dependencies. Adding the additional jars is a temporary > fix > >> to > >> > this problem. > >> > > >> > Hope this helps. > >> > > >> > [1] > >> https://wiki.apache.org/nutch/RunNutchInEclipse#Missing_dependencies > >> > [2] > >> > > >> > > >> > http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.tika/tika-app/1.3 > >> > > >> > On Wed, May 29, 2013 at 11:51 AM, Adriana Farina < > >> > adriana.farin...@gmail.com > >> > > wrote: > >> > > >> > > Hi Feng, > >> > > > >> > > yes, I use the IvyDE managed dependencies and I configured it as > >> > described > >> > > in the tutorial, but I still get the import errors. > >> > > > >> > > 2013/5/29 feng lu > >> > > > >> > > > It may not find the Tina package ,do you > >> > > > use ivyde managed dependencies. Such as the tutorial mentioned > that > >> > > > Remaining in the Libraries tab Add Library > IvyDE Managed > >> > Dependencies > > >> > > > browse to trunk/ivy/ivy.xml > ensure ALL configuration boxes are > >> > included > >> > > > > >> > > > On May 29, 2013 9:37 PM, "Adriana Farina" < > >> adriana.farin...@gmail.com> > >> > > > wrote: > >> > > > > > >> > > > > Hello, > >> > > > > > >> > > > > I'm using Nutch 2.1. I follow the guide > >> > > > > http://wiki.apache.org/nutch/RunNutchInEclipse to import it in > >> > > Eclipse. > >> > > > The > >> > > > > strange thing is that it cannot resolve some import. For > example, > >> in > >> > > > > TikaParser.java, I get the error "The import > >> > > org.apache.tika.parser.html > >> > > > > cannot be resolved". > >> > > > > > >> > > > > Am I missing something? Where am I wrong? > >> > > > > > >> > > > > Thank you very much! 
> >> > > > > > >> > > > > > >> > > > > -- > >> > > > > Adriana Farina > >> > > > > >> > > > >> > > > >> > > > >> > > -- > >> > > Adriana Farina > >> > > > >> > > >> > > >> > > >> > -- > >> > Kiran Chitturi > >> > > >> > <http://www.linkedin.com/in/kiranchitturi> > >> > > >> > >> > >> > >> -- > >> Adriana Farina > >> > > > > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Error in resolving some dependencies
Hi Adriana, I think I noticed something like this in the 2.x series. I noted the dependencies here [1] a while ago, but I also remember the missing Tika dependency in 2.x. I usually download the missing jar files from [2] and add them to the project through 'external JARs'. This is not a permanent solution and I am not sure why Eclipse throws away the missing dependencies. Adding the additional jars is a temporary fix to this problem. Hope this helps. [1] https://wiki.apache.org/nutch/RunNutchInEclipse#Missing_dependencies [2] http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.tika/tika-app/1.3 On Wed, May 29, 2013 at 11:51 AM, Adriana Farina wrote: > Hi Feng, > > yes, I use the IvyDE managed dependencies and I configured it as described > in the tutorial, but I still get the import errors. > > 2013/5/29 feng lu > > > It may not find the Tika package, do you > > use ivyde managed dependencies. Such as the tutorial mentioned that > > Remaining in the Libraries tab Add Library > IvyDE Managed Dependencies > > > browse to trunk/ivy/ivy.xml > ensure ALL configuration boxes are included > > > > On May 29, 2013 9:37 PM, "Adriana Farina" > > wrote: > > > > > > Hello, > > > > > > I'm using Nutch 2.1. I followed the guide > > > http://wiki.apache.org/nutch/RunNutchInEclipse to import it into > Eclipse. > > The > > > strange thing is that it cannot resolve some imports. For example, in > > > TikaParser.java, I get the error "The import > org.apache.tika.parser.html > > > cannot be resolved". > > > > > > Am I missing something? Where am I wrong? > > > > > > Thank you very much! > > > > > > > > > -- > > > Adriana Farina > > > > > > -- > Adriana Farina > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
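As a concrete example of that workaround, the missing jars can be pulled straight from Maven Central and then added via Project > Properties > Java Build Path > Libraries > Add External JARs (the version below is only illustrative; match it to whatever ivy/ivy.xml asks for):

  wget https://repo1.maven.org/maven2/org/apache/tika/tika-core/1.3/tika-core-1.3.jar
  wget https://repo1.maven.org/maven2/org/apache/tika/tika-parsers/1.3/tika-parsers-1.3.jar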
Re: Unfetched urls not being generated for fetching.
Yes, my guess was right. The protocolStatus says that these files have HTTP 404 status, just that unfetched status is not updated in Nutch. I also faced a similar problem [1]. Please open a jira and report any findings. [1] http://find.searchhub.org/document/6e4464919811d20f#c2a5de6e93942ada On Fri, May 24, 2013 at 10:03 AM, Bai Shen wrote: > I'm trying to check hbase for urls that have unfetched status but my query > isn't working correctly. No matter what I don't get a match. > > scan 'webpage', {COLUMNS=>['f:bas', 'f:st'], > FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'), > Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'), > Bytes.toBytes('1'))} > > > I did manage to find one entry with an unfetched status. It apparently has > no base url, so I'm assuming that's why it's not fetched. I'm not sure how > that happened. It also says protocolStatus is NOTFOUND. > > > On Fri, May 24, 2013 at 9:48 AM, kiran chitturi > wrote: > > > I have seen this happen in Nutch 2.x. > > > > I would suggest you to check your regex file to see the conditions and > use > > hbase to get the urls that have unfetched status. > > > > Also, try to check the protocol status of each unfetched url in HBase, > most > > probably it is either 404 or status other than 200. > > > > Hope this helps. > > > > On Fri, May 24, 2013 at 8:13 AM, Bai Shen > wrote: > > > > > I'm running Nutch 2.1 using HBase. > > > > > > When I run readdb -stats I show that there are 15k unfetched urls. > > > However, when I run generate -topN 1000 I get no urls to be fetched. > Up > > > until now it's been pulling a full thousand urls for each cycle. > > > > > > Any ideas? I'm not sure what to check. > > > > > > Thanks. > > > > > > > > > > > -- > > Kiran Chitturi > > > > <http://www.linkedin.com/in/kiranchitturi> > > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
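To confirm this for a single page without writing an HBase filter, Nutch 2.x can dump one row of the web table (a hedged sketch; run 'bin/nutch readdb' with no arguments first to confirm the exact flags in your build, and the URL is a placeholder):

  bin/nutch readdb -url http://www.example.com/some/missing/page

The output should show the protocolStatus (e.g. NOTFOUND) and the fetch status for that URL.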
Re: Unfetched urls not being generated for fetching.
I have seen this happen in Nutch 2.x. I would suggest you to check your regex file to see the conditions and use hbase to get the urls that have unfetched status. Also, try to check the protocol status of each unfetched url in HBase, most probably it is either 404 or status other than 200. Hope this helps. On Fri, May 24, 2013 at 8:13 AM, Bai Shen wrote: > I'm running Nutch 2.1 using HBase. > > When I run readdb -stats I show that there are 15k unfetched urls. > However, when I run generate -topN 1000 I get no urls to be fetched. Up > until now it's been pulling a full thousand urls for each cycle. > > Any ideas? I'm not sure what to check. > > Thanks. > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
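A low-tech way to pull that out of HBase without fighting the filter syntax is to scan only the status columns and grep the result (a rough sketch, assuming the default 'webpage' table and the 'f' column family used by the Gora HBase mapping):

  echo "scan 'webpage', {COLUMNS => ['f:st', 'f:prot']}" | hbase shell > webpage-status.txt
  grep 'Http code=' webpage-status.txt

This is slow on a big table, but it sidesteps guessing how the status field is serialized inside a SingleColumnValueFilter.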
Re: Error when running Nutch, please help
on.java:1121) > > at > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) > > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261) > > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332) > > at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368) > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > > at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341) > > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110 > > at java.net.URI.checkPath(URI.java:1788) > > at java.net.URI.(URI.java:734) > > at org.apache.hadoop.fs.Path.initialize(Path.java:145) > > ... 30 more > > > > All I did was following the totur as follows: > > 1. download nutch bin from: > http://mirror.esocc.com/apache/nutch/1.6/apache-nutch-1.6-bin.zip > > 2. unzip and step into the dir: apache-nutch-1.6 > > 3. in my home dir i setup JAVA_HOME in .bash_profile like: > > JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home > > export JAVA_HOME > > 4. change the content in conf/nutch-site.xml to follows: > > > > > > http.agent.name > > NutchSpider > > > > > > > > 5. under dir: apache-nutch-1.6, excute: > > mkdir -p urls > > cd urls > > touch seed.txt > > 6. edit seed.txt with content: > > http://nutch.apache.org/ > > 7. then edit file conf/regex-urlfilter.txt and replace > > # accept anything else > > +. > > with > > +^http://([a-z0-9]*\.)*nutch.apache.org/ > > 8. finally, i run comand under dir :apache-nutch-1.6 as follows: > > MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt > TestCrawl http://localhost:8983/solr/ 2 > > > > 9. at the end show the error message as mentioned before. > > > > > > please help me to solve this problem, thanks very much. > > > > my java version: > > MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ java -version > > java version "1.6.0_43" > > Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203) > > Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode) > > > > Max OS X version 10.7.5 > > > > > > > > Best Regards. > > -- > > Maohua Liu > > Email: carya@gmail.com > > > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: need legends for fetch reduce jobtracker ouput
Yes Lewis. It would be the best way for the permissions right now. I will add Tejas once he shares his wiki uid. On Tue, Apr 23, 2013 at 1:07 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > I agree. > I can sort this tomorrow. > @Kiran, > Are we still working to addition of documentation contributers via > contributers and admin group since the most recent lockdown? > Tejas should be added to both groups. > @Tejas please drop one of us your wiki uid whenever it suits. > Lewis > > On Monday, April 22, 2013, Tejas Patil wrote: > > Hi Lewis, > > > > Thanks !! > > I have huge respect for those who engineered the Fetcher class (esp. of > > 1.x) as its simply *awesome* and complex piece of code. > > I can polish my post more so that it comes to the "wiki" quality. I don't > > have access to wiki. Can you provide me the same ? > > > > Thanks, > > Tejas > > > > > > On Mon, Apr 22, 2013 at 8:09 PM, Lewis John Mcgibbney < > > lewis.mcgibb...@gmail.com> wrote: > > > >> hi Tejas, > >> this is a real excellent reply and very useful. > >> it would be really great if we could somehow have this kind of low level > >> information readily available on the Nutch wiki. > >> > >> On Monday, April 22, 2013, Tejas Patil > wrote: > >> > Fetcher threads try to get a fetch item (url) from a queue of all the > >> fetch > >> > items (this queue is actually a queue of queues. For details see [0]). > If > >> a > >> > thread doesnt get a fetch-item, it spinwaits for 500ms before polling > the > >> > queue again. > >> > The '*spinWaiting*' count tells us how many threads are in their > >> > spinwaiting state at a given instance. > >> > > >> > The '*active*' count tells us how many threads are currently > performing > >> the > >> > activities related to the fetch of a fetch-item. This involves sending > >> > requests to the server, getting the bytes from the server, parsing, > >> storing > >> > etc.. > >> > > >> > '*pages*' is a count for total pages fetched till a given point. > >> > '*errors*' is a count for total errors seen. > >> > > >> > *Next comes pages/s:* > >> > First number comes from this: > >> > float)pages.get())*10)/elapsed)/10.0 > >> > > >> > second one comes from this: > >> > (actualPages*10)/10.0 > >> > > >> > actualPages holds the count of pages processed in the last 5 secs > (when > >> the > >> > calculation is done). > >> > > >> > First number can be seen as the overall speed for that execution. The > >> > second number can be regarded as the instanteous speed as it just uses > >> the > >> > #pages in last 5 secs when this calculation is done. See lines 818-830 > in > >> > [0]. > >> > > >> > *Next comes the kb/s* values which are computed as follows: > >> > (((float)bytes.get())*8)/1024)/elapsed > >> > ((float)actualBytes)*8)/1024 > >> > > >> > This is similar to that of pages/sec. See lines 818-830 in [0]. > >> > > >> > '*URLs*' indicates how many urls are pending and '*queues*' indicate > the > >> > number of queues present. Queues are formed on the basis on hostname > or > >> ip > >> > depending on the configuration set. > >> > > >> > See FetcherReducer.java [0] for more details. 
> >> > > >> > [0] : > >> > > >> > >> > > http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java?view=markup > >> > > >> > > >> > On Mon, Apr 22, 2013 at 6:09 PM, kaveh minooie > wrote: > >> > > >> >> could someone please tell me one more time, in this line: > >> >> 0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s, > 2632 > >> >> 7346 kb/s, 989 URLs in 5 queues > reduce > >> >> > >> >> what are the two numbers before pages/s and two numbers before kb/s? > >> >> > >> >> thanks, > >> >> > >> > > >> > >> -- > >> *Lewis* > >> > > > > -- > *Lewis* > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
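As a worked illustration of the first formula: in the status line quoted above, 53852 pages at an overall rate of 4.1 pages/s implies an elapsed time of roughly 53852 / 4.1, i.e. about 13,000 seconds (a bit over 3.5 hours). The second figure is recomputed from only the last few seconds of activity, which is why it fluctuates far more than the first.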
Re: rewriting urls that are index
If you are using a Solr version in the 4.x series, then you could update the fields [1] once the data is indexed. This is not the Nutch way of doing it, but it is something that came to mind and can work right away. [1] http://wiki.apache.org/solr/UpdateJSON#Atomic_Updates On Mon, Apr 22, 2013 at 9:56 AM, Niels Boldt wrote: > Hi, > > We are crawling a site using nutch 1.6 and indexing into solr. > > However, we need to rewrite the urls that are indexed in the following way > > For instance, nutch crawls a page http://www.example.com/article=xxx but > when moving data to the index we would like to use the url > > http://www.example.com/kb#article=xxx <http://www.example.com/article=xxx> > > Instead. So when we get data from solr it will show links to > http://www.example.com/kb#article=xxx > <http://www.example.com/article=xxx> instead > of http://www.example.com/article=xxx > > Is that possible to do by creating a plugin that extends the UrlNormalizer, > eg > > http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html > > Or is it better to add a new indexed property that we use. > > Best Regards > Niels > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
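For illustration, a Solr 4 atomic update that rewrites the stored url field of an already-indexed document could look like the following (a hedged sketch: it assumes a default single-core Solr at localhost:8983, that 'id' is the unique key holding the crawled URL, and that the fields in the schema are stored, which atomic updates require):

  curl 'http://localhost:8983/solr/update?commit=true' \
    -H 'Content-Type: application/json' \
    -d '[{"id": "http://www.example.com/article=xxx",
          "url": {"set": "http://www.example.com/kb#article=xxx"}}]'

A URLNormalizer plugin rewrites the URL before it ever reaches the index, which is cleaner in the long run, but the atomic-update route works on data that is already in Solr.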
Re: [Exception in thread "main" java.io.IOException: Job failed!]
Sorry, here is the link http://wiki.apache.org/nutch/NutchTutorial On Sat, Apr 20, 2013 at 4:11 PM, kiran chitturi wrote: > Also, please go through the tutorial here [1]. We updated it with more > info on commands and everything > > > [1] > > > On Sat, Apr 20, 2013 at 3:00 PM, micklai wrote: > >> Hi, >> >> Env: >> System ubuntu 12.04 >> Tomcat: 7.0.39 >> Solr: 3.6.2 >> Nutch: 1.6 >> >> Try to deploy nutch 1.6 with solr 3.6.2 but failed with running the >> command >> below: >> bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth >> 2 >> -threads 5 -topN 5 >> >> For the details: >> = >> crawl started in: crawl >> rootUrlDir = urls >> threads = 5 >> depth = 2 >> solrUrl=http://localhost:8080/solr/ >> topN = 5 >> Injector: starting at 2013-04-21 02:21:12 >> Injector: crawlDb: crawl/crawldb >> Injector: urlDir: urls >> Injector: Converting injected urls to crawl db entries. >> Injector: total number of urls rejected by filters: 0 >> Injector: total number of urls injected after normalization and >> filtering: 1 >> Injector: Merging injected urls into crawl db. >> Injector: finished at 2013-04-21 02:21:27, elapsed: 00:00:14 >> Generator: starting at 2013-04-21 02:21:27 >> Generator: Selecting best-scoring urls due for fetch. >> Generator: filtering: true >> Generator: normalizing: true >> Generator: topN: 5 >> Generator: jobtracker is 'local', generating exactly one partition. >> Generator: Partitioning selected urls for politeness. >> Generator: segment: crawl/segments/20130421022135 >> Generator: finished at 2013-04-21 02:21:42, elapsed: 00:00:15 >> Fetcher: Your 'http.agent.name' value should be listed first in >> 'http.robots.agents' property. >> Fetcher: starting at 2013-04-21 02:21:42 >> Fetcher: segment: crawl/segments/20130421022135 >> Using queue mode : byHost >> Fetcher: threads: 5 >> Fetcher: time-out divisor: 2 >> QueueFeeder finished: total 1 records + hit by time limit :0 >> Using queue mode : byHost >> Using queue mode : byHost >> Using queue mode : byHost >> Using queue mode : byHost >> Using queue mode : byHost >> Fetcher: throughput threshold: -1 >> Fetcher: throughput threshold retries: 5 >> fetching http://www.163.com/ >> -finishing thread FetcherThread, activeThreads=4 >> -finishing thread FetcherThread, activeThreads=3 >> -finishing thread FetcherThread, activeThreads=2 >> -finishing thread FetcherThread, activeThreads=1 >> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0 >> -finishing thread FetcherThread, activeThreads=0 >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 >> -activeThreads=0 >> Fetcher: finished at 2013-04-21 02:21:49, elapsed: 00:00:07 >> ParseSegment: starting at 2013-04-21 02:21:49 >> ParseSegment: segment: crawl/segments/20130421022135 >> Parsed (24ms):http://www.163.com/ >> ParseSegment: finished at 2013-04-21 02:21:56, elapsed: 00:00:07 >> CrawlDb update: starting at 2013-04-21 02:21:56 >> CrawlDb update: db: crawl/crawldb >> CrawlDb update: segments: [crawl/segments/20130421022135] >> CrawlDb update: additions allowed: true >> CrawlDb update: URL normalizing: true >> CrawlDb update: URL filtering: true >> CrawlDb update: 404 purging: false >> CrawlDb update: Merging segment data into db. >> CrawlDb update: finished at 2013-04-21 02:22:09, elapsed: 00:00:13 >> Generator: starting at 2013-04-21 02:22:09 >> Generator: Selecting best-scoring urls due for fetch. >> Generator: filtering: true >> Generator: normalizing: true >> Generator: topN: 5 >> Generator: jobtracker is 'local', generating exactly one partition. 
>> Generator: Partitioning selected urls for politeness. >> Generator: segment: crawl/segments/20130421022217 >> Generator: finished at 2013-04-21 02:22:25, elapsed: 00:00:15 >> Fetcher: Your 'http.agent.name' value should be listed first in >> 'http.robots.agents' property. >> Fetcher: starting at 2013-04-21 02:22:25 >> Fetcher: segment: crawl/segments/20130421022217 >> Using queue mode : byHost >> Fetcher: threads: 5 >> Fetcher: time-out divisor: 2 >> QueueFeeder finished: total 5 records + hit by time limit :0 >> Using queue mode : byHost >> Using queue mode : byHost >> fetching http://m.163.com/ >> Using queue mode : byHost >> Using queue mode : byHost >> fetching http://3g.163.com/li
Re: [Exception in thread "main" java.io.IOException: Job failed!]
lSize=1 > * queue: http://m.163.com > maxThreads= 1 > inProgress= 0 > crawlDelay= 5000 > minCrawlDelay = 0 > nextFetchTime = 1366482150482 > now = 1366482147348 > 0. http://m.163.com/newsapp/ > -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1 > * queue: http://m.163.com > maxThreads= 1 > inProgress= 0 > crawlDelay= 5000 > minCrawlDelay = 0 > nextFetchTime = 1366482150482 > now = 1366482148350 > 0. http://m.163.com/newsapp/ > -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1 > * queue: http://m.163.com > maxThreads= 1 > inProgress= 0 > crawlDelay= 5000 > minCrawlDelay = 0 > nextFetchTime = 1366482150482 > now = 1366482149352 > 0. http://m.163.com/newsapp/ > -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=1 > * queue: http://m.163.com > maxThreads= 1 > inProgress= 0 > crawlDelay= 5000 > minCrawlDelay = 0 > nextFetchTime = 1366482150482 > now = 1366482150354 > 0. http://m.163.com/newsapp/ > fetching http://m.163.com/newsapp/ > -finishing thread FetcherThread, activeThreads=4 > -finishing thread FetcherThread, activeThreads=3 > -finishing thread FetcherThread, activeThreads=2 > -finishing thread FetcherThread, activeThreads=1 > -finishing thread FetcherThread, activeThreads=0 > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 > -activeThreads=0 > Fetcher: finished at 2013-04-21 02:22:38, elapsed: 00:00:13 > ParseSegment: starting at 2013-04-21 02:22:38 > ParseSegment: segment: crawl/segments/20130421022217 > Parsed (4ms):http://caipiao.163.com/mobile/client_cp.jsp > Parsed (3ms):http://m.163.com/newsapp/ > Parsed (1ms):http://music.163.com/ > ParseSegment: finished at 2013-04-21 02:22:45, elapsed: 00:00:07 > CrawlDb update: starting at 2013-04-21 02:22:45 > CrawlDb update: db: crawl/crawldb > CrawlDb update: segments: [crawl/segments/20130421022217] > CrawlDb update: additions allowed: true > CrawlDb update: URL normalizing: true > CrawlDb update: URL filtering: true > *CrawlDb update: 404 purging: false* > CrawlDb update: Merging segment data into db. > CrawlDb update: finished at 2013-04-21 02:22:58, elapsed: 00:00:13 > LinkDb: starting at 2013-04-21 02:22:58 > LinkDb: linkdb: crawl/linkdb > LinkDb: URL normalize: true > LinkDb: URL filter: true > LinkDb: internal links will be ignored. 
> LinkDb: adding segment: > file:/home/lailx/search_engine/nutch/crawl/segments/20130421022135 > LinkDb: adding segment: > file:/home/lailx/search_engine/nutch/crawl/segments/20130421022217 > LinkDb: finished at 2013-04-21 02:23:08, elapsed: 00:00:10 > SolrIndexer: starting at 2013-04-21 02:23:08 > *SolrIndexer: deleting gone documents: false > SolrIndexer: URL filtering: false > SolrIndexer: URL normalizing: false* > Indexing 4 documents > *java.io.IOException: Job failed!* > SolrDeleteDuplicates: starting at 2013-04-21 02:23:39 > SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/ > *Exception in thread "main" java.io.IOException: Job failed!* > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265) > at > > org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373) > at > > org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353) > at org.apache.nutch.crawl.Crawl.run(Crawl.java:153) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) > = > But nutch works well, it works with the command below: > bin/nutch crawl urls -dir crawl -depth 2 -topN 5 > > And I also had modified the [Solr_Home]/conf/schema.xml to make nutch > integrated with solr, solr also works well by accessing > "localhost:8080/solr". > > Hope you can help in this problem, wait for your reply. > Thanks. > > Br, > Mick > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Exception-in-thread-main-java-io-IOException-Job-failed-tp4057620.html > Sent from the Nutch - User mailing list archive at Nabble.com. > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
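The bare "Job failed!" does not say why indexing failed; the underlying SolrException (often a schema mismatch between the schema.xml shipped with Nutch and the one installed in Solr) usually lands in Nutch's own log. A quick, hedged way to dig it out, assuming the default logs/hadoop.log location:

  grep -B 2 -A 20 'SolrException' logs/hadoop.log
  grep -iE 'error|exception' logs/hadoop.log | tail -n 40

Whatever field or type Solr complains about there is what needs fixing in the Solr schema.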
Re: Question about Nutch and Hadoop
This might be of help to you [1]. [1] http://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search On Wed, Apr 17, 2013 at 11:07 AM, Maximiliano Marin < conta...@maximilianomarin.com> wrote: > Hi All: > I have a new question related to this topic. Once nutch indexed all > the items, I want to use Solr for querying for via web. How can I > search into hadoop filesystem? > > Thank you in advance, > > Regards, > Atte, > Maximiliano Marin Bustos > MCTS: Windows Server 2008 R2, Virtualization > MCTS: SQL Server 2008, Implementation and Maintenance > Web: http://maximilianomarin.com > Celular: (+56 9) 780 688 91 > > > 2013/4/16 Maximiliano Marin : > > Thank You Alexander. Now I'm following your tutorial. If I have any > > doubt I will send it to the users group. > > I think the Lewis's idea is great. You should share your blog post in > > the Nutch Wiki. > > > > Regards. > > > > M. > > Atte, > > Maximiliano Marin Bustos > > MCTS: Windows Server 2008 R2, Virtualization > > MCTS: SQL Server 2008, Implementation and Maintenance > > Web: http://maximilianomarin.com > > Celular: (+56 9) 780 688 91 > > > > > > 2013/4/16 Lewis John Mcgibbney : > >> Hi Alexander, > >> Please feel free to sign up to our wiki (please provide one of the dev > team > >> with your uid) and link to your documentation. > >> Best > >> Lewis > >> > >> On Monday, April 15, 2013, Alexander Chepurnoy > wrote: > >>> You can find those files under Hadoop folder. Working with Hadoop+Nutch > >> is about to work with Hadoop with Nutch job file. Default Nutch > >> documentation is not that good. Please refer Hadoop installation / > >> configuration guide instead, and blogpost (e.g. my blogpost on how to > >> install it > >> > http://chepurnoy.org/blog/2012/11/how-to-install-hadoop-1-dot-0-3-on-cluster-plus-nutch/ > ). > >>> > >>> Best regards, Alexander > >>> > >>> --- Re, 16.4.13, Maximiliano Marin > пишет: > >>> > >>> > >>>> This is my first message here. Regards to all of you. > >>>> I am trying nutch for indexing documents along a hadoop > >>>> implementation. I am reading this link [1] but there > >>>> are many > >>>> elements that I can't find in my nutch directory. > >>>> Can you tell me please where can I find conf/mapred-site.xml > >>>> and any > >>>> other hadoop related config file? > >>>> > >>>> Thank you in advance. > >>>> > >>>> > >>>> > >>>> [1]http://wiki.apache.org/nutch/NutchHadoopTutorial > >>>> > >>>> > >>>> Atte, > >>>> Maximiliano Marin Bustos > >>>> MCTS: Windows Server 2008 R2, Virtualization > >>>> MCTS: SQL Server 2008, Implementation and Maintenance > >>>> Web: http://maximilianomarin.com > >>>> Celular: 78068891 > >>>> > >>> > >> > >> -- > >> *Lewis* > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
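To be explicit about the search side: once bin/nutch solrindex has pushed documents into Solr, queries go to Solr over HTTP (or through a client such as SolrJ) rather than to the Hadoop filesystem; the segments in HDFS are only Nutch's working data. A minimal hedged query, assuming a default single-core Solr on localhost:

  curl 'http://localhost:8983/solr/select?q=content:hadoop&fl=url,title&wt=json'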
Re: Only recrawl the pages with http code=500
Hi Sheng, I haven't tried this but I have read something similar in this mailing list. May be, you can do a test with separate nutch crawl and see how it works. Are you using 1.x or 2.x ? On Wed, Apr 10, 2013 at 11:17 PM, Tianwei Sheng wrote: > Hi, Kiran, > > Yeah, that's what I want. We also used pig, I can just write a pig script > to get those urls and inject them again to the table. > > Btw, are you sure that reinjecting an url into an existing table with the > same row key there will force nutch to recrawl it? Where I can find the > document or code for this? > > > On Wed, Apr 10, 2013 at 9:25 AM, kiran chitturi > wrote: > > > In addition to feng lu suggestions, > > > > You can also try to reinject the records. A hbase query with a filter of > > http status code 500 will give you the list of urls with stauts code 500. > > > > Then you can simply reinject them, which will ask for nutch to crawl them > > again if I am correct. > > > > > > On Wed, Apr 10, 2013 at 12:08 PM, feng lu wrote: > > > > > you can set fetcher.server.delay and fetcher.server.min.delay > properties > > > too bigger, maybe the crawl successful rate will be higher. the failed > > page > > > will be re-fetched when fetch time has come. you can refer to this > > > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ > > > > > > > > > On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng < > tianwei.sh...@gmail.com > > > >wrote: > > > > > > > Hi, all, > > > > > > > > I used nutch 2.1 + HBase to crawling one website. It seems that the > > > remote > > > > website may have some rate limit and will give me http code=500 > > > > occasionally, I knew that I probably need to tune the crawl > parameters, > > > > such as several delay, etc. But given that I have crawled lots of > > pages > > > > successfully and only may have 10% of such failed pages, Is it a way > to > > > > only fetch those failed pages incrementally. > > > > > > > > For interrupted jobs, I used the following command to resume, > > > > > > > > ./bin/nutch fetch 1364930286-844556485 -resume > > > > > > > > it will successfully resume the job and crawled those unfetched pages > > > from > > > > previous failed job. I checked the code, in FetcherJob.java, it has: > > > > > > > > {{{ > > > > if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) > { > > > > if (LOG.isDebugEnabled()) { > > > > LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; > > > already > > > > fetched"); > > > > } > > > > return; > > > > } > > > > }}} > > > > > > > > For those failed urls in hbase table, the row has: > > > > {{{ > > > > f:prot > > > > timestamp=1365478335194, value= \x02nHttp code=500, url= > > > > mk:_ftcmrk_ > > > > timestamp=1365478335194, value=1364930286-844556485 > > > > }}} > > > > > > > > > > > > It seems that the code only will check _ftcmrk_ regardless of if > there > > > is a > > > > "f:cnt" or not. > > > > > > > > > > > > So the questions, does the nutch has some option for method for me to > > > only > > > > fetch those failed pages? > > > > > > > > Thanks a lot. > > > > > > > > Tianwei > > > > > > > > > > > > > > > > -- > > > Don't Grow Old, Grow Up... :-) > > > > > > > > > > > -- > > Kiran Chitturi > > > > <http://www.linkedin.com/in/kiranchitturi> > > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Only recrawl the pages with http code=500
Hi Alex, I see two ways of doing it. I don't have the scripts right now but I will give pointers 1) You can use HBase shell or any HBase client and query HBase by using the filters. In this case, ValueFilter can be used to check exact value in the column [0] 2) Second is to write a pig script, which reads the data from HBase (Read the fields url and http status code) and then filter the records based on the status code, store the output wherever you want. [1] I would suggest to go with Pig in case if you are operating on a cluster and it is easy to write pig scripts for these kind of jobs than doing a mapreduce. [0] - http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ValueFilter.html [1] http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#FILTER Hope this helps. On Wed, Apr 10, 2013 at 1:24 PM, wrote: > Hi, > > == > > A hbase query with a filter of > http status code 500 will give you the list of urls with stauts code 500. > == > Could you please let me know how to do this? I was trying to get an answer > to this kind of selection in hbase mailing list without success. > > Thanks. > Alex. > > > > > > > > -Original Message- > From: kiran chitturi > To: user > Sent: Wed, Apr 10, 2013 9:25 am > Subject: Re: Only recrawl the pages with http code=500 > > > In addition to feng lu suggestions, > > You can also try to reinject the records. A hbase query with a filter of > http status code 500 will give you the list of urls with stauts code 500. > > Then you can simply reinject them, which will ask for nutch to crawl them > again if I am correct. > > > On Wed, Apr 10, 2013 at 12:08 PM, feng lu wrote: > > > you can set fetcher.server.delay and fetcher.server.min.delay properties > > too bigger, maybe the crawl successful rate will be higher. the failed > page > > will be re-fetched when fetch time has come. you can refer to this > > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ > > > > > > On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng > >wrote: > > > > > Hi, all, > > > > > > I used nutch 2.1 + HBase to crawling one website. It seems that the > > remote > > > website may have some rate limit and will give me http code=500 > > > occasionally, I knew that I probably need to tune the crawl parameters, > > > such as several delay, etc. But given that I have crawled lots of > pages > > > successfully and only may have 10% of such failed pages, Is it a way to > > > only fetch those failed pages incrementally. > > > > > > For interrupted jobs, I used the following command to resume, > > > > > > ./bin/nutch fetch 1364930286-844556485 -resume > > > > > > it will successfully resume the job and crawled those unfetched pages > > from > > > previous failed job. I checked the code, in FetcherJob.java, it has: > > > > > > {{{ > > > if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) { > > > if (LOG.isDebugEnabled()) { > > > LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; > > already > > > fetched"); > > > } > > > return; > > > } > > > }}} > > > > > > For those failed urls in hbase table, the row has: > > > {{{ > > > f:prot > > > timestamp=1365478335194, value= \x02nHttp code=500, url= > > > mk:_ftcmrk_ > > > timestamp=1365478335194, value=1364930286-844556485 > > > }}} > > > > > > > > > It seems that the code only will check _ftcmrk_ regardless of if there > > is a > > > "f:cnt" or not. > > > > > > > > > So the questions, does the nutch has some option for method for me to > > only > > > fetch those failed pages? > > > > > > Thanks a lot. 
> > > > > > Tianwei > > > > > > > > > > > -- > > Don't Grow Old, Grow Up... :-) > > > > > > -- > Kiran Chitturi > > <http://www.linkedin.com/in/kiranchitturi> > > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
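As a concrete (untested) sketch of option 1 from the HBase shell, assuming the default 'webpage' table and that the protocol status text sits in f:prot as shown in the earlier mail:

  echo "scan 'webpage', {COLUMNS => ['f:prot']}" | hbase shell > prot.txt
  grep 'Http code=500' prot.txt

The row keys on the matching lines are the (reversed) URLs to re-fetch; a ValueFilter or a Pig FILTER does the same thing server-side and scales better on a large table.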
Re: Only recrawl the pages with http code=500
In addition to feng lu suggestions, You can also try to reinject the records. A hbase query with a filter of http status code 500 will give you the list of urls with stauts code 500. Then you can simply reinject them, which will ask for nutch to crawl them again if I am correct. On Wed, Apr 10, 2013 at 12:08 PM, feng lu wrote: > you can set fetcher.server.delay and fetcher.server.min.delay properties > too bigger, maybe the crawl successful rate will be higher. the failed page > will be re-fetched when fetch time has come. you can refer to this > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ > > > On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng >wrote: > > > Hi, all, > > > > I used nutch 2.1 + HBase to crawling one website. It seems that the > remote > > website may have some rate limit and will give me http code=500 > > occasionally, I knew that I probably need to tune the crawl parameters, > > such as several delay, etc. But given that I have crawled lots of pages > > successfully and only may have 10% of such failed pages, Is it a way to > > only fetch those failed pages incrementally. > > > > For interrupted jobs, I used the following command to resume, > > > > ./bin/nutch fetch 1364930286-844556485 -resume > > > > it will successfully resume the job and crawled those unfetched pages > from > > previous failed job. I checked the code, in FetcherJob.java, it has: > > > > {{{ > > if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) { > > if (LOG.isDebugEnabled()) { > > LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; > already > > fetched"); > > } > > return; > > } > > }}} > > > > For those failed urls in hbase table, the row has: > > {{{ > > f:prot > > timestamp=1365478335194, value= \x02nHttp code=500, url= > > mk:_ftcmrk_ > > timestamp=1365478335194, value=1364930286-844556485 > > }}} > > > > > > It seems that the code only will check _ftcmrk_ regardless of if there > is a > > "f:cnt" or not. > > > > > > So the questions, does the nutch has some option for method for me to > only > > fetch those failed pages? > > > > Thanks a lot. > > > > Tianwei > > > > > > -- > Don't Grow Old, Grow Up... :-) > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
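A hedged sketch of the re-inject route (the seed directory and file names are made up, and whether an existing row really gets re-fetched depends on its stored fetch time and status, so verify on one or two URLs first):

  mkdir -p failed_urls
  cp urls-with-500.txt failed_urls/seed.txt    # one failed URL per line
  bin/nutch inject failed_urls/
  bin/nutch generate -topN 1000
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb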
Re: Nutch 2.1 n00b trying to install and crawl for the first time
Hi Yves, Which database are you using ? Please take a look here [1] and see if they can be of help to you. [1] - http://wiki.apache.org/nutch/#Nutch_2.X_tutorial.28s.29 On Wed, Mar 27, 2013 at 12:07 AM, Yves S. Garret wrote: > Hi all, I'm running CentOS 6.4, nutch 2.1 and java 1.7.0 openjdk. I'm > trying to setup Nutch to > work just on my laptop and play with it. When I try to run Nutch, this is > what I see: > > http://bin.cakephp.org/view/1253503188 > > This is what I did to set my JAVA_HOME: > > $ JAVA_HOME="/usr/lib/jvm/jre-1.7.0-openjdk" > $ echo $JAVA_HOME > /usr/lib/jvm/jre-1.7.0-openjdk > $ export JAVA_HOME > $ echo $JAVA_HOME > /usr/lib/jvm/jre-1.7.0-openjdk > > What am I doing wrong? Something obvious that I'm messing up? > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Nutch 2.1 metadata
Hi Jaap, There is a patch [0] for the index-metatags plugin to work with Nutch 2.1. You do not need to wait until indexing to check metatags; you can use './bin/nutch indexchecker' [1]. [0] https://issues.apache.org/jira/browse/NUTCH-1478 [1] http://wiki.apache.org/nutch/bin/nutch%20indexchecker On Sun, Dec 30, 2012 at 4:55 PM, J. Gobel wrote: > Hi there, > > Is there an example available on how to index metatags for nutch 2.1 & > solr4 ? > > I tried the tutorial found on : http://wiki.apache.org/nutch/IndexMetatags > > I have set up my Nutch 2.1 and Solr 4 according to > http://nlp.solutions.asia/?p=180 > > When I go to my Solr server and perform a query I dont see any metatags. > > Thanks in advance, > > Jaap > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
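For reference, a hedged example of the indexchecker call mentioned above; the URL is only a placeholder, and the metatags plugins plus index.parse.md still have to be configured in nutch-site.xml for the metatag fields to show up.

# Fetches, parses and runs the indexing filters on a single page, then prints
# the fields that would be sent to Solr, so metatag extraction can be verified
# without running a full crawl.
bin/nutch indexchecker http://example.com/some-page.html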
Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer
> IMHO it's not that the developers *should* focus on this or that. I see it > more as an evolutionary process where things get improved because they are > used in the first place or get derelict and abandoned if there is no > interest from users. If as you say people prefer to have a SQL backend > instead of the sequential HDFS data structures then there will be more > contributions and as a result 2.x will be improved. > > Julien > Yes, this is a great point for Nutch 2.x. I think there are a lot of potential users for a SQL store in 2.x, since 2.x started supporting different storage backends for Nutch. This was my initial idea when I started out, but I settled on 1.6 in the end :) > > -- > * > *Open Source Solutions for Text Engineering > > http://digitalpebble.blogspot.com/ > http://www.digitalpebble.com > http://twitter.com/digitalpebble > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Parse benchmark/performance
Thank you Ye for updating us with your findings. It is best to use the latest version of Nutch since there are updates and fixes for each version On Sun, Mar 17, 2013 at 3:48 AM, ytthet wrote: > Hi Folks, > > I found out where the issue was. Just thought it might be useful for > others. > > The performance issue I was facing in parse was due to the regular > expression URL filter and funny URL. "regex-URLfilter" plugin. One of the > regular expression was taking long... very long to process for some funny > URL. > > Removing the content "-.*(/[^/]+)/[^/]+\1/[^/]+\1/" from > regex-urlfilter.txt > in the conf saved tons of time on parsing. > > Following thread discussed the similar matter. > http://lucene.472066.n3.nabble.com/Reduce-Error-during-fetch-td609736.html > https://issues.apache.org/jira/browse/NUTCH-233 > > Cheers, > > Ye > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Parse-benchmark-performance-tp4045827p4048185.html > Sent from the Nutch - User mailing list archive at Nabble.com. > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
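A small sketch of the fix described in the quoted message, assuming a default conf/ layout; the slow rule is located as a fixed string and then commented out (or removed) by hand.

# '--' stops option parsing, since the pattern itself starts with '-'.
grep -nF -- '-.*(/[^/]+)/[^/]+\1/[^/]+\1/' conf/regex-urlfilter.txt
# Prefix the matching line with '#' (comment) or delete it, then re-run the parse.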
Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer
Congrats Feng. Welcome onboard. On Tue, Mar 12, 2013 at 6:43 PM, lewis john mcgibbney wrote: > Hi Everyone, > > On behalf of the Nutch PMC I would like to announce and welcome Feng Lu on > board as PMC and Committer on the project. > Amongst others, Feng has been an important part of the Nutch development > over the last while and we would like to welcome him. > > @Feng, > Please feel free to say a bit about yourself, your involvement and use case > for Nutch or anything else. > > Thank you very much. > Lewis > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Run Nutch Crawl in Eclipse
What error exactly are you getting ? Which version of software are you using ? Are you able to build the software and not crawl ? or Are you not able to build in Eclipse ? On Fri, Mar 15, 2013 at 7:23 PM, Mustafa_elkhiat wrote: > On 03/16/2013 12:40 AM, kiran chitturi wrote: > >> Hi Mustafa, >> >> Did you look at this page [0] ?There is a section at the bottom that gives >> instructions on setting up the 'plugin.folders' property. >> >> >> >> >> [0] - >> http://wiki.apache.org/nutch/**RunNutchInEclipse<http://wiki.apache.org/nutch/RunNutchInEclipse> >> >> >> On Fri, Mar 15, 2013 at 5:55 PM, Mustafa_elkhiat >> wrote: >> >> Hi Andy >>> i face the same problem .Please help and tell me how i set >>> "plugin.folders" property correctly >>> Mustafa >>> >>> >> >> thank you kiran > but problem not solved is there any way to solve this problem? > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Run Nutch Crawl in Eclipse
Hi Mustafa, Did you look at this page [0]? There is a section at the bottom that gives instructions on setting up the 'plugin.folders' property. [0] - http://wiki.apache.org/nutch/RunNutchInEclipse On Fri, Mar 15, 2013 at 5:55 PM, Mustafa_elkhiat wrote: > Hi Andy > i face the same problem .Please help and tell me how i set > "plugin.folders" property correctly > Mustafa > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
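A quick, hedged sketch of what to check for the 'plugin.folders' problem; the path below is only a placeholder for wherever the built plugins live in your checkout.

# The property the RunNutchInEclipse page asks for, to be placed inside
# <configuration> of conf/nutch-site.xml:
#   <property>
#     <name>plugin.folders</name>
#     <value>/path/to/your/nutch/build/plugins</value>
#   </property>
# Verify that it is actually set and that the directory exists:
grep -A 2 'plugin.folders' conf/nutch-site.xml
ls /path/to/your/nutch/build/plugins | head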
Re: How to Continue to Crawl with Nutch Even An Error Occurs?
Hi! You might find the below two threads useful. [0] [1] [0] - http://lucene.472066.n3.nabble.com/Continue-to-Crawl-even-when-an-Error-Occured-td4047230.html [1] - http://lucene.472066.n3.nabble.com/Session-failed-during-parsing-IOException-because-of-OOM-td4046057.html On Tue, Mar 12, 2013 at 1:44 PM, kamaci wrote: > When I crawl with Nutch and error occurs (i.e. when one of threads doesn't > come within a time) it stops crawling and exits. > > Is there any configuration to continue crawling even a such kind of error > occurs at Nutch? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-to-Continue-to-Crawl-with-Nutch-Even-An-Error-Occurs-tp4046700.html > Sent from the Nutch - User mailing list archive at Nabble.com. > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Nutch : Wiki Section updates
Hi! I have noticed that certain sections of the Nutch wiki are not up to date. I am planning to update these pages with pointers to mailing list discussions that give valuable information, and also to the relevant JIRAs. First, I have created a wiki account but I am not able to see the 'edit' button on any page. Can someone point me in the right direction? Second, does anyone have suggestions on improving/updating certain pages? Is anyone willing to update the 'Tasklist' and 'Features' sections in the wiki? Third, do we have any updates on the public servers running Nutch? We have dead links here [0] and this needs a major update. I am willing to start on this and update it in my free time. It would be great if someone could proofread it to check that I did not write something incorrect, as I am new to this. Please let me know what I need to know before starting the work, and please share your suggestions. [0] - http://wiki.apache.org/nutch/PublicServers Thank you -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Iterative Crawling
In addition to Lewis suggestions, please try giving bigger value to topN, if configuration files are defined right way, you will see more crawls. On Thu, Mar 14, 2013 at 12:30 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > You can use the parsechecker from the nutch script to see what outlinks you > should be picking up. > Once you know how the crawler is configured then you can begin to assert > why outlinks are not either being parsed out, or subsequently being > fetched. > hth > > On Wed, Mar 13, 2013 at 6:13 PM, Dat Tran wrote: > > > Thank for your reply. After configure urlfilter, i execute this command > to > > crawl > > bin/nutch crawl urls -topN 10 -depth 3 > > (urls is the directory where seed list located ). > > But it crawls, fetchs and parses only links which are defined in seed > > list, > > not the outlinks. > > > > > > > > -- > > View this message in context: > > > http://lucene.472066.n3.nabble.com/Iterative-Crawling-tp4046501p4047209.html > > Sent from the Nutch - User mailing list archive at Nabble.com. > > > > > > -- > *Lewis* > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
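As an illustration of the topN suggestion (a sketch only; the 'urls' seed directory follows the command quoted above):

# Same crawl, but with a larger topN so more of the discovered outlinks are
# actually selected for fetching in each of the 3 rounds.
bin/nutch crawl urls -depth 3 -topN 1000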
Re: Mapping nested json objects to map data type
Sorry about the spam, I want to post this in pig mailing list :) On Wed, Mar 13, 2013 at 11:36 PM, kiran chitturi wrote: > Hi! > > I am using Pig 0.10 version and I have a question about mapping nested > JSON objects from Hbase. > > *For example: * > > The below commands loads the field family from Hbase. > > fields = load 'hbase://documents' using > org.apache.pig.backend.hadoop.hbase.HBaseStorage('field:*','-loadKey true > -limit 5') as (rowkey, metadata:map[]); > > The metadata field looks like below after the above command. ( I used > 'illustrate fields' to get this) > > {fields_j={"tika.Content-Encoding":"ISO-8859-1","distanceToCentroid":0.5761632290266712,"tika.Content-Type":"text/plain; > charset=ISO-8859-1","clusterId":118,"tika.parsing":"ok"}} > > Map data type worked as I wanted so far. Now, I would like the value for > 'fields_j' key to be also a Map data type. I think it is being assigned as > 'byteArray' by default. > > Is there any way by which I can convert this in to a map data type ? That > would be helpful for me to process more. > > I tried to write python UDF but jython only supports python 2.5, I am not > sure how to convert this string in to a dictionary in python. > > Did anyone encounter this type of issue before ? > > Sorry for the long question, I want to explain my problem clearly. > > Please let me know your suggestions. > > Regards, > -- > Kiran Chitturi > > <http://www.linkedin.com/in/kiranchitturi> > > > -- Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
Re: Parse benchmark/performance
Hi Roland, 1.x saves the crawled pages in the form of segments in the folder. In the post here [1], Markus and Julien gave pointers on the speed differences between 1.x and 2.x. There is a issue with loading pages with Gora as backend [2]. [1] - http://lucene.472066.n3.nabble.com/Differences-between-2-1-and-1-6-td4042856.html [2] - https://issues.apache.org/jira/browse/GORA-119 On Mon, Mar 11, 2013 at 10:57 AM, Roland von Herget < roland.von.her...@gmail.com> wrote: > Hi Ye, > > I'm running with 5-40 threads, depending on time of day (more fetchers > during night), my jvm is configured with a max of 30GB heap, but it > normally uses only 2GB. > > What does 1.x do with the crawled pages? (where are they saved?) Maybe > the problem is not the parsing of the pages, but loading them from > disk/db backend? (I had such an issue with 2.x) > > --Roland > > On Mon, Mar 11, 2013 at 3:03 PM, Ye T Thet wrote: > > Thanks Roland, > > > > I think there is quite a difference between 2.x and 1.x. According yours > > and few feedbacks, it might be my memory assignment. I am starting with > 3.7 > > GiB memory to get as minimal memory as I can go by. I might need to > > increase the spec for the box though. > > > > May I know how much memory do you use for your crawler? And the number of > > concurrent thread? At least I would be able to compare if I am too harsh > on > > my crawler. > > > > Thanks, > > > > Ye > > > > On Sun, Mar 10, 2013 at 6:31 PM, Roland von Herget < > > roland.von.her...@gmail.com> wrote: > > > >> Hi Ye, > >> > >> just a small note: i'm running 2.1 with combined fetcher/parser job in > >> local mode. I fetch about 20 pages/s and parsing was never a bottelneck. > >> So, i didn't know anything about 1.x, but this seems to be pretty slow. > >> > >> --Roland > >> Am 10.03.2013 09:53 schrieb "Ye T Thet" : > >> > >> > Thanks Feng Lu, > >> > > >> > I am running in local mode. So no leverage on map reduce model > actually. > >> My > >> > assumption is that running a single box with the 4x computing power is > >> more > >> > efficient than a cluster with 4 box with 1x computing power. Counting > 1 > >> > map/reduce task has its own over head. I guess my assumption is wrong > >> then? > >> > > >> > My question is leaning a bit towards map reduce fundamental now, I > guess, > >> > is it true to say what I can gain performance (speed) by splitting > tasks > >> to > >> > multiple Map Reduce while using the same computer power let's say 4x. > >> > > >> > Example: 8 map 8 reduce on 4x computing power is more efficient with 1 > >> map > >> > 1 reduce with 4x computing power? > >> > > >> > My guess is that 48hr to parse 100k urls does not sound efficient. > >> > Unfortunately 100k is just the beginning for me. :( I am looking at 10 > >> > Millions per fetch cycle. I am looking for ideas and pointer on how to > >> gain > >> > speed. May be using/tweaking Map Reduce would the the answer? > >> > > >> > If you have done similar cases, what is the ideal Map Reduce setting > per > >> > slave? I can post more details if it would help. > >> > > >> > Any input would be greatly appreciated. > >> > > >> > Cheers, > >> > > >> > Ye > >> > > >> > On Sun, Mar 10, 2013 at 11:17 AM, feng lu > wrote: > >> > > >> > > Hi Ye > >> > > > >> > > Do you run nutch in local mode? > >> > > > >> > > 48hr to parse 100k urls, maybe each url spend 1.7s/page. 
i think the > >> > mainly > >> > > time spend on ParseSegment include parse html and write parse data > to > >> > DFS, > >> > > that include text to parse_text, data to parse_data and links to > >> > > crawl_parse. > >> > > > >> > > Maybe you can run the nutch in cluster using deploy mode. It will > make > >> > full > >> > > use of the MR distributed computing capabilities > >> > > > >> > > > >> > > > >> > > > >> > > On Sat, Mar 9, 2013 at 11:09 AM, Ye T Thet > >> > wrote: > >> > > > >> > > > Hi, > >> > > > > >> > > > It is *NOT* about t
Re: Session failed during parsing: IOException because of OOM
Great! Just thought i would point out in case you missed :) On Sun, Mar 10, 2013 at 11:30 PM, Kristopher Kane wrote: > Right, I'm running the script this time around based on your first reply. > > -Kris > > > On Sun, Mar 10, 2013 at 11:13 PM, kiran chitturi > wrote: > > > Hi Kris, > > > > It was discussed several times in this thread that crawl command should > be > > deprecated and instead, the crawl script present in bin directory > > (./bin/crawl) should be used. [0] > > > > The crawl script does a step by step procedure unlike crawl command. It > is > > recommended to use crawl script. > > > > [0] - https://issues.apache.org/jira/browse/NUTCH-1087 > > > > > > On Sun, Mar 10, 2013 at 10:24 PM, Kristopher Kane > >wrote: > > > > > Thanks for the reply. I'm using 1.6 on Centos 6.3 with Oracle Java 6 > and > > > using all of the built-in Hadoop capability. Haven't learned how to > run > > it > > > on my 'real' hadoop cluster yet... > > > > > > Invocation: bin/nutch crawl urls -solr > http://localhost:8983/solr/-depth > > > 5 -topN 5000 > > > > > > Hadoop trace: > > > > > > 2013-03-09 23:07:07,662 WARN mapred.LocalJobRunner - job_local_0016 > > > java.lang.OutOfMemoryError: unable to create new native thread > > > at java.lang.Thread.start0(Native Method) > > > at java.lang.Thread.start(Unknown Source) > > > at java.util.concurrent.ThreadPoolExecutor.addThread(Unknown > > > Source) > > > at > > > > java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(Unknown > > > Source) > > > at java.util.concurrent.ThreadPoolExecutor.execute(Unknown > > Source) > > > at java.util.concurrent.AbstractExecutorService.submit(Unknown > > > Source) > > > at > org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159) > > > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93) > > > at > org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97) > > > at > org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44) > > > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > > > at > > org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) > > > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) > > > at > > > > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) > > > > > > I was running it in a small vm with 2 GB or memory. After I posted, I > ran > > > the crawler again with 6 GB of memory. > > > > > > I'll try what you suggested and bypass the inject. > > > > > > Thanks, > > > > > > -Kris > > > > > > > > > > > > On Sat, Mar 9, 2013 at 11:36 PM, kiran chitturi > > > wrote: > > > > > > > Hi Kris, > > > > > > > > Which version are you using ? > > > > > > > > At which step did the exception happen ? Is it after fetch stage or > > parse > > > > stage? > > > > > > > > Are you using the crawl script(./bin/crawl) or crawl command > > (./bin/nutch > > > > crawl) to do the crawl ? > > > > > > > > You can use the crawl script located at (./bin/crawl) by removing the > > > > inject step since you would not need injecting the seeds again. > > > > > > > > Please let us know if you have any more questions. > > > > > > > > > > > > > > > > > > > > On Sat, Mar 9, 2013 at 11:22 PM, Kristopher Kane < > kkane.l...@gmail.com > > > > >wrote: > > > > > > > > > I had a long running session going and would like to try and pick > up > > > > where > > > > > it left off if possible. In the terminal, Nutch was at a parsing > > stage > > > > > then hit OOM. Is there anyway to start that near where it left > off? 
> > > > > > > > > > -Kris > > > > > > > > > > > > > > > > > > > > > -- > > > > Kiran Chitturi > > > > > > > > > > > > > > > -- > > Kiran Chitturi > > > -- Kiran Chitturi
Re: Session failed during parsing: IOException because of OOM
Hi Kris, It was discussed several times in this thread that crawl command should be deprecated and instead, the crawl script present in bin directory (./bin/crawl) should be used. [0] The crawl script does a step by step procedure unlike crawl command. It is recommended to use crawl script. [0] - https://issues.apache.org/jira/browse/NUTCH-1087 On Sun, Mar 10, 2013 at 10:24 PM, Kristopher Kane wrote: > Thanks for the reply. I'm using 1.6 on Centos 6.3 with Oracle Java 6 and > using all of the built-in Hadoop capability. Haven't learned how to run it > on my 'real' hadoop cluster yet... > > Invocation: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth > 5 -topN 5000 > > Hadoop trace: > > 2013-03-09 23:07:07,662 WARN mapred.LocalJobRunner - job_local_0016 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Unknown Source) > at java.util.concurrent.ThreadPoolExecutor.addThread(Unknown > Source) > at > java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(Unknown > Source) > at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) > at java.util.concurrent.AbstractExecutorService.submit(Unknown > Source) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93) > at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97) > at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) > > I was running it in a small vm with 2 GB or memory. After I posted, I ran > the crawler again with 6 GB of memory. > > I'll try what you suggested and bypass the inject. > > Thanks, > > -Kris > > > > On Sat, Mar 9, 2013 at 11:36 PM, kiran chitturi > wrote: > > > Hi Kris, > > > > Which version are you using ? > > > > At which step did the exception happen ? Is it after fetch stage or parse > > stage? > > > > Are you using the crawl script(./bin/crawl) or crawl command (./bin/nutch > > crawl) to do the crawl ? > > > > You can use the crawl script located at (./bin/crawl) by removing the > > inject step since you would not need injecting the seeds again. > > > > Please let us know if you have any more questions. > > > > > > > > > > On Sat, Mar 9, 2013 at 11:22 PM, Kristopher Kane > >wrote: > > > > > I had a long running session going and would like to try and pick up > > where > > > it left off if possible. In the terminal, Nutch was at a parsing stage > > > then hit OOM. Is there anyway to start that near where it left off? > > > > > > -Kris > > > > > > > > > > > -- > > Kiran Chitturi > > > -- Kiran Chitturi
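A hedged sketch of the crawl script invocation for 1.x; the argument order below (seed dir, crawl dir, Solr URL, number of rounds) is how I recall it, so check the usage line printed by the script itself, and the Solr URL here just mirrors the one in the quoted message.

# Runs inject once, then generate/fetch/parse/updatedb and indexing for each round.
bin/crawl urls crawl http://localhost:8983/solr/ 5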
Re: Session failed during parsing: IOException because of OOM
Hi Kris, Which version are you using ? At which step did the exception happen ? Is it after fetch stage or parse stage? Are you using the crawl script(./bin/crawl) or crawl command (./bin/nutch crawl) to do the crawl ? You can use the crawl script located at (./bin/crawl) by removing the inject step since you would not need injecting the seeds again. Please let us know if you have any more questions. On Sat, Mar 9, 2013 at 11:22 PM, Kristopher Kane wrote: > I had a long running session going and would like to try and pick up where > it left off if possible. In the terminal, Nutch was at a parsing stage > then hit OOM. Is there anyway to start that near where it left off? > > -Kris > -- Kiran Chitturi
Re: [ANNOUNCEMENT] Welcome Kiran Chitturi as Apache Nutch PMC and Committer
Thanks a lot, guys, for inviting me and for the wishes. I am a graduate student at Virginia Tech doing my Masters in Computer Science. I have been using Apache Nutch for the last year as part of my assistantship with our University Library. The Digital Libraries and Archives division of our libraries was using the Google Mini Search Engine for their website, which hosts 600k files, but Google Mini was no longer supported and we wanted to try building a search engine using open source technologies. That is when I started my journey with Nutch, and we were able to successfully achieve our goals using Nutch and Solr. The library was pleased with the project and is now more interested in working with open source software whenever possible. I have enjoyed working with the Nutch community and it has been a great learning experience for me. I would like to keep learning and contributing back even after my graduation. A few things I have in mind right now, other than committing patches, are improving our documentation (wiki), helping users as best I can, and starting the Apache Wicket UI work for 2.x in Nutch soon. Regards, Kiran. On Sat, Mar 9, 2013 at 4:06 PM, Tejas Patil wrote: > Welcome aboard Kiran :) > > > On Sat, Mar 9, 2013 at 12:56 PM, lewis john mcgibbney > wrote: > >> Hi All, >> >> Over the last while we have been aware of Kiran's ongoing contribution to >> the Nutch community. >> It is with great pleasure that we invite Kiran to join the Nutch PMC and >> also take up Committer role. >> @Kiran, please feel free to say a bit about yourself and introduce what >> brought you to Apache Nutch. >> Have a great weekend. >> Best >> Lewis > > > -- Kiran Chitturi
Re: Parse statistics in Nutch
Thanks Lewis. I will give a try at this On Tue, Mar 5, 2013 at 12:59 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > There are a few things you can do Kiran. > My preference is to use custom counters for successfully and unsuccessfully > parsed docs within the ParserJob or equivalent. I would be surprised if > this is not already there however. > It is not much trouble to add counters to something like this. We already > do it in InjectorJob for instance to make explicit the number of filtered > URLs and the number of URLs injected post filtering and normalization. > > On Tuesday, March 5, 2013, kiran chitturi > wrote: > > Hi! > > > > We already get statistics for fetcher using (readdb -stats) but can we > also > > include parse Statistics in the statistics. > > > > It will be very helpful in knowing how many documents are successfully > > parsed and we could use different methods to reparse if we see lot of > > failing documents. > > > > Only way i know to get how many documents are parsed is to check Solr on > > how many documents are indexed. > > > > What do you guys think of this ? > > > > -- > > Kiran Chitturi > > > > -- > *Lewis* > -- Kiran Chitturi
Re: Find which URL created exception
Hi! Looking at 'logs/hadoop.log' will give you more information on why the job has failed. To check if a single URL can be crawled, please use parseChecker tool [0] [0] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker I have checked using parseChecker and it worked for me. On Tue, Mar 5, 2013 at 11:38 AM, raviksingh wrote: > Hi, > I am new to nutch. I am using nutch with MySQL. > While trying to crawl http://piwik.org/xmlrpc.php > <http://piwik.org/xmlrpc.php> > nutch throws exception : > > Parsing http://piwik.org/xmlrpc.php > Call completed > java.lang.RuntimeException: job failed: name=update-table, jobid=null > at > org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54) > at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:98) > at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68) > at org.apache.nutch.crawl.Crawler.run(Crawler.java:181) > at org.apache.nutch.crawl.Crawler.run(Crawler.java:250) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at ravi.crawler.MyCrawl.crawl(MyCrawl.java:13) > at ravi.crawler.Crawler.AttachCrawl(Crawler.java:88) > at scheduler.MyTask.run(MyTask.java:15) > at java.util.TimerThread.mainLoop(Unknown Source) > at java.util.TimerThread.run(Unknown Source) > > > > Please check the link as it looks like a service. > > How can I either resolve this . > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Find-which-URL-created-exception-tp4044914.html > Sent from the Nutch - User mailing list archive at Nabble.com. > -- Kiran Chitturi
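For completeness, a sketch of the parsechecker run suggested above, pointed at the URL from the report:

# Fetches and parses the single URL with the configured parser plugins and
# prints the parse status, metadata and outlinks; -dumpText also prints the
# extracted text.
bin/nutch parsechecker -dumpText http://piwik.org/xmlrpc.php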
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Tejas, I have a total of 364k files fetched in my last crawl and i used a topN of 2000 and 2 threads per queue. The gap i have noticed is between 5-8 minutes. I had a total of 180 rounds in my crawl ( i had some big crawls at the beginning with topN of 10k but after it crashed i changed topN to 2k). Due to my hardware limitations and local mode, i think using smaller number of rounds saved me quite some time. The downside might be having lot more segments to go through but i am writing scripts for automating the index and reparse tasks. On Mon, Mar 4, 2013 at 11:18 PM, Tejas Patil wrote: > Hi Kiran, > > Is the 6 mins consistent across those 5 rounds ? With 10k files is takes > ~60 minutes for writing segments. > With 2k file, it took 6 min gap. You will need 5 such small rounds to get > total 10k, so total gap time would be (5 * 6) = 30 mins. Thats half of the > time taken for the crawl with 10k !! So in a way, you saved 30 mins by > running small crawls. Something does seem right with the math here. > > Thanks, > Tejas Patil > > On Mon, Mar 4, 2013 at 12:45 PM, kiran chitturi > wrote: > > > Thanks Sebastian for the details. This was the bottleneck i had when i am > > fetching 10k files. Now i switched to 2k and i have a 6 mins gap now. It > > took me some time finding right configuration in the local node. > > > > > > > > On Mon, Mar 4, 2013 at 3:33 PM, Sebastian Nagel > > wrote: > > > > > After all documents are fetched (and ev. parsed) the segment has to be > > > written: > > > finish sorting the data and copy it from local temp dir > (hadoop.tmp.dir) > > > to the > > > segment directory. If IO is a bottleneck this may take a while. Also > > looks > > > like > > > you have a lot of content! > > > > > > On 03/04/2013 06:03 AM, kiran chitturi wrote: > > > > Thanks for your suggestion guys! The big crawl is fetching large > amount > > > of > > > > big PDF files. > > > > > > > > For something like below, the fetcher took a lot of time to finish > up, > > > even > > > > though the files are fetched. It shows more than one hour of time. > > > > > > > >> > > > >> 2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0, > > > >> spinWaiting=0, fetchQueues.totalSize=0 > > > >> 2013-03-01* 19:45:43,217 *INFO fetcher.Fetcher - -activeThreads=0 > > > >> 2013-03-01* 20:57:55,288* INFO fetcher.Fetcher - Fetcher: finished > at > > > >> 2013-03-01 20:57:55, elapsed: 01:34:09 > > > > > > > > > > > > Does fetching a lot of files causes this issue ? Should i stick to > one > > > > thread per local mode or use pseudo distributed mode to improve > > > performance > > > > ? > > > > > > > > What is an acceptable time fetcher should finish up after fetching > the > > > > files ? What exactly happens in this step ? > > > > > > > > Thanks again! > > > > Kiran. > > > > > > > > > > > > > > > > On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma < > > > markus.jel...@openindex.io>wrote: > > > > > > > >> The default heap size of 1G is just enough for a parsing fetcher > with > > 10 > > > >> threads. The only problem that may rise is too large and complicated > > PDF > > > >> files or very large HTML files. If you generate fetch lists of a > > > reasonable > > > >> size there won't be a problem most of the time. And if you want to > > > crawl a > > > >> lot, then just generate more small segments. > > > >> > > > >> If there is a bug it's most likely to be the parser eating memory > and > > > not > > > >> releasing it. 
> > > >> > > > >> -Original message- > > > >>> From:Tejas Patil > > > >>> Sent: Sun 03-Mar-2013 22:19 > > > >>> To: user@nutch.apache.org > > > >>> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to > create > > > >> new native thread > > > >>> > > > >>> I agree with Sebastian. It was a crawl in local mode and not over a > > > >>> cluster. The intended crawl volume is huge and if we dont override > > the > > > >>> default heap size to some decent value, there is high possibility > of > > > >> facing >
Re: Nutch 1.6 : How to reparse Nutch segments ?
Thanks Tejas. Deleting the 'crawl_parse' directory worked for me today. On Mon, Mar 4, 2013 at 11:15 PM, Tejas Patil wrote: > Yes. After I deleted that directory, parse operation ran successfully. Even > if its an empty directory, parse wont proceed normally. > > > On Mon, Mar 4, 2013 at 8:07 PM, kiran chitturi >wrote: > > > Thanks Tejas for the information. > > > > Did you try deleting 'crawl_parse' directory ? Since, the code checks for > > that directory, i will try deleting and reparsing. > > > > > > > > On Mon, Mar 4, 2013 at 10:49 PM, Tejas Patil > >wrote: > > > > > The code [0] checks if there is already a "crawl_parse" directory in > the > > > segment [lines 88-89]. > > > > > > 88 if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME))) 89 throw > new > > > IOException("Segment already parsed!"); > > > I am not sure what you guys meant by deleting the subsection of the > > > directories. Did you mean deletion of the contents inside the old > > > crawl_parse directory ? I tried that locally and it didn't work. > > > > > > [0] : > > > > > > > > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup > > > > > > > > > On Mon, Mar 4, 2013 at 4:20 PM, kiran chitturi < > > chitturikira...@gmail.com > > > >wrote: > > > > > > > It took me close to 2 days to fetch 400k pages on my not so fast > single > > > > machine. I do not want to refetch unless it very crucial. > > > > > > > > I will check and see if deleting any sub-directories is helpful > > > > > > > > Thanks! > > > > > > > > > > > > On Mon, Mar 4, 2013 at 5:54 PM, Lewis John Mcgibbney < > > > > lewis.mcgibb...@gmail.com> wrote: > > > > > > > > > This makes perfect sense Kiran. It is something I've encountered in > > the > > > > > past and as my segments were not production critical I was easily > > able > > > to > > > > > delete and re-fetch them then parse out the stuff I wanted to. > > > > > As I said, I think this is the only way to get I'm afraid. > > > > > > > > > > On Mon, Mar 4, 2013 at 2:25 PM, kiran chitturi < > > > > chitturikira...@gmail.com > > > > > >wrote: > > > > > > > > > > > Yeah. I used parse-(tika|metatags) first in the configuration and > > > now i > > > > > > want to use parse-(html|tika|metatags). This is due to the > > > > parse-metatags > > > > > > new patch upgrade. > > > > > > > > > > > > Thanks for the suggestions. It would be very helpful for > reparsing > > > > > segments > > > > > > for 1.x like 2.x has. > > > > > > > > > > > > Regards, > > > > > > Kiran. > > > > > > > > > > > > > > > > > > On Mon, Mar 4, 2013 at 4:51 PM, Lewis John Mcgibbney < > > > > > > lewis.mcgibb...@gmail.com> wrote: > > > > > > > > > > > > > Please don't go ahead and delete the parse directories just yet > > > > before > > > > > > you > > > > > > > hear back from others. > > > > > > > My suggestion would be to try and delete a subsection of the > > > > > directories > > > > > > > and see if this is possible. > > > > > > > Have you changed some configuration and now want to parse out > > some > > > > more > > > > > > > content/structure? > > > > > > > > > > > > > > > > > > > > > On Mon, Mar 4, 2013 at 1:33 PM, kiran chitturi < > > > > > > chitturikira...@gmail.com > > > > > > > >wrote: > > > > > > > > > > > > > > > Hi! > > > > > > > > > > > > > > > > I am trying to reparse Nutch segments and it says 'Segment > > > already > > > > > > > parsed' > > > > > > > > when i try to parse. 
> > > > > > > > > > > > > > > > Is there any option of attribute as '-reparse' like 2.x > series > > > has > > > > ? > > > > > > > > > > > > > > > > Should i delete some directories so that i can reparse ? > > > > > > > > > > > > > > > > Please give me suggestions on how to reparse segments that > are > > > > > already > > > > > > > > parsed. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > -- > > > > > > > > Kiran Chitturi > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > *Lewis* > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Kiran Chitturi > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > *Lewis* > > > > > > > > > > > > > > > > > > > > > -- > > > > Kiran Chitturi > > > > > > > > > > > > > > > -- > > Kiran Chitturi > > > -- Kiran Chitturi
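A sketch of the workaround that worked here, assuming a 1.x layout; the segment name is a placeholder, and clearing the other parse_* directories as well is an extra precaution rather than something the thread confirmed.

SEG=crawl/segments/20130301123456   # placeholder segment name
# The parse job only refuses to run when crawl_parse exists, so remove it;
# clearing parse_data/parse_text avoids mixing old and new parse output.
rm -r "$SEG/crawl_parse"
rm -rf "$SEG/parse_data" "$SEG/parse_text"
bin/nutch parse "$SEG"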
Re: Nutch 1.6 : How to reparse Nutch segments ?
Thanks Tejas for the information. Did you try deleting 'crawl_parse' directory ? Since, the code checks for that directory, i will try deleting and reparsing. On Mon, Mar 4, 2013 at 10:49 PM, Tejas Patil wrote: > The code [0] checks if there is already a "crawl_parse" directory in the > segment [lines 88-89]. > > 88 if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME))) 89 throw new > IOException("Segment already parsed!"); > I am not sure what you guys meant by deleting the subsection of the > directories. Did you mean deletion of the contents inside the old > crawl_parse directory ? I tried that locally and it didn't work. > > [0] : > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup > > > On Mon, Mar 4, 2013 at 4:20 PM, kiran chitturi >wrote: > > > It took me close to 2 days to fetch 400k pages on my not so fast single > > machine. I do not want to refetch unless it very crucial. > > > > I will check and see if deleting any sub-directories is helpful > > > > Thanks! > > > > > > On Mon, Mar 4, 2013 at 5:54 PM, Lewis John Mcgibbney < > > lewis.mcgibb...@gmail.com> wrote: > > > > > This makes perfect sense Kiran. It is something I've encountered in the > > > past and as my segments were not production critical I was easily able > to > > > delete and re-fetch them then parse out the stuff I wanted to. > > > As I said, I think this is the only way to get I'm afraid. > > > > > > On Mon, Mar 4, 2013 at 2:25 PM, kiran chitturi < > > chitturikira...@gmail.com > > > >wrote: > > > > > > > Yeah. I used parse-(tika|metatags) first in the configuration and > now i > > > > want to use parse-(html|tika|metatags). This is due to the > > parse-metatags > > > > new patch upgrade. > > > > > > > > Thanks for the suggestions. It would be very helpful for reparsing > > > segments > > > > for 1.x like 2.x has. > > > > > > > > Regards, > > > > Kiran. > > > > > > > > > > > > On Mon, Mar 4, 2013 at 4:51 PM, Lewis John Mcgibbney < > > > > lewis.mcgibb...@gmail.com> wrote: > > > > > > > > > Please don't go ahead and delete the parse directories just yet > > before > > > > you > > > > > hear back from others. > > > > > My suggestion would be to try and delete a subsection of the > > > directories > > > > > and see if this is possible. > > > > > Have you changed some configuration and now want to parse out some > > more > > > > > content/structure? > > > > > > > > > > > > > > > On Mon, Mar 4, 2013 at 1:33 PM, kiran chitturi < > > > > chitturikira...@gmail.com > > > > > >wrote: > > > > > > > > > > > Hi! > > > > > > > > > > > > I am trying to reparse Nutch segments and it says 'Segment > already > > > > > parsed' > > > > > > when i try to parse. > > > > > > > > > > > > Is there any option of attribute as '-reparse' like 2.x series > has > > ? > > > > > > > > > > > > Should i delete some directories so that i can reparse ? > > > > > > > > > > > > Please give me suggestions on how to reparse segments that are > > > already > > > > > > parsed. > > > > > > > > > > > > Thanks, > > > > > > -- > > > > > > Kiran Chitturi > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > *Lewis* > > > > > > > > > > > > > > > > > > > > > -- > > > > Kiran Chitturi > > > > > > > > > > > > > > > > -- > > > *Lewis* > > > > > > > > > > > -- > > Kiran Chitturi > > > -- Kiran Chitturi
Re: Nutch 1.6 : How to reparse Nutch segments ?
It took me close to 2 days to fetch 400k pages on my not so fast single machine. I do not want to refetch unless it very crucial. I will check and see if deleting any sub-directories is helpful Thanks! On Mon, Mar 4, 2013 at 5:54 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > This makes perfect sense Kiran. It is something I've encountered in the > past and as my segments were not production critical I was easily able to > delete and re-fetch them then parse out the stuff I wanted to. > As I said, I think this is the only way to get I'm afraid. > > On Mon, Mar 4, 2013 at 2:25 PM, kiran chitturi >wrote: > > > Yeah. I used parse-(tika|metatags) first in the configuration and now i > > want to use parse-(html|tika|metatags). This is due to the parse-metatags > > new patch upgrade. > > > > Thanks for the suggestions. It would be very helpful for reparsing > segments > > for 1.x like 2.x has. > > > > Regards, > > Kiran. > > > > > > On Mon, Mar 4, 2013 at 4:51 PM, Lewis John Mcgibbney < > > lewis.mcgibb...@gmail.com> wrote: > > > > > Please don't go ahead and delete the parse directories just yet before > > you > > > hear back from others. > > > My suggestion would be to try and delete a subsection of the > directories > > > and see if this is possible. > > > Have you changed some configuration and now want to parse out some more > > > content/structure? > > > > > > > > > On Mon, Mar 4, 2013 at 1:33 PM, kiran chitturi < > > chitturikira...@gmail.com > > > >wrote: > > > > > > > Hi! > > > > > > > > I am trying to reparse Nutch segments and it says 'Segment already > > > parsed' > > > > when i try to parse. > > > > > > > > Is there any option of attribute as '-reparse' like 2.x series has ? > > > > > > > > Should i delete some directories so that i can reparse ? > > > > > > > > Please give me suggestions on how to reparse segments that are > already > > > > parsed. > > > > > > > > Thanks, > > > > -- > > > > Kiran Chitturi > > > > > > > > > > > > > > > > -- > > > *Lewis* > > > > > > > > > > > -- > > Kiran Chitturi > > > > > > -- > *Lewis* > -- Kiran Chitturi
Re: Nutch 1.6 : How to reparse Nutch segments ?
Yeah. I used parse-(tika|metatags) first in the configuration and now i want to use parse-(html|tika|metatags). This is due to the parse-metatags new patch upgrade. Thanks for the suggestions. It would be very helpful for reparsing segments for 1.x like 2.x has. Regards, Kiran. On Mon, Mar 4, 2013 at 4:51 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Please don't go ahead and delete the parse directories just yet before you > hear back from others. > My suggestion would be to try and delete a subsection of the directories > and see if this is possible. > Have you changed some configuration and now want to parse out some more > content/structure? > > > On Mon, Mar 4, 2013 at 1:33 PM, kiran chitturi >wrote: > > > Hi! > > > > I am trying to reparse Nutch segments and it says 'Segment already > parsed' > > when i try to parse. > > > > Is there any option of attribute as '-reparse' like 2.x series has ? > > > > Should i delete some directories so that i can reparse ? > > > > Please give me suggestions on how to reparse segments that are already > > parsed. > > > > Thanks, > > -- > > Kiran Chitturi > > > > > > -- > *Lewis* > -- Kiran Chitturi
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Thanks Sebastian for the details. This was the bottleneck i had when i am fetching 10k files. Now i switched to 2k and i have a 6 mins gap now. It took me some time finding right configuration in the local node. On Mon, Mar 4, 2013 at 3:33 PM, Sebastian Nagel wrote: > After all documents are fetched (and ev. parsed) the segment has to be > written: > finish sorting the data and copy it from local temp dir (hadoop.tmp.dir) > to the > segment directory. If IO is a bottleneck this may take a while. Also looks > like > you have a lot of content! > > On 03/04/2013 06:03 AM, kiran chitturi wrote: > > Thanks for your suggestion guys! The big crawl is fetching large amount > of > > big PDF files. > > > > For something like below, the fetcher took a lot of time to finish up, > even > > though the files are fetched. It shows more than one hour of time. > > > >> > >> 2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0, > >> spinWaiting=0, fetchQueues.totalSize=0 > >> 2013-03-01* 19:45:43,217 *INFO fetcher.Fetcher - -activeThreads=0 > >> 2013-03-01* 20:57:55,288* INFO fetcher.Fetcher - Fetcher: finished at > >> 2013-03-01 20:57:55, elapsed: 01:34:09 > > > > > > Does fetching a lot of files causes this issue ? Should i stick to one > > thread per local mode or use pseudo distributed mode to improve > performance > > ? > > > > What is an acceptable time fetcher should finish up after fetching the > > files ? What exactly happens in this step ? > > > > Thanks again! > > Kiran. > > > > > > > > On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma < > markus.jel...@openindex.io>wrote: > > > >> The default heap size of 1G is just enough for a parsing fetcher with 10 > >> threads. The only problem that may rise is too large and complicated PDF > >> files or very large HTML files. If you generate fetch lists of a > reasonable > >> size there won't be a problem most of the time. And if you want to > crawl a > >> lot, then just generate more small segments. > >> > >> If there is a bug it's most likely to be the parser eating memory and > not > >> releasing it. > >> > >> -Original message- > >>> From:Tejas Patil > >>> Sent: Sun 03-Mar-2013 22:19 > >>> To: user@nutch.apache.org > >>> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create > >> new native thread > >>> > >>> I agree with Sebastian. It was a crawl in local mode and not over a > >>> cluster. The intended crawl volume is huge and if we dont override the > >>> default heap size to some decent value, there is high possibility of > >> facing > >>> an OOM. > >>> > >>> > >>> On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi < > >> chitturikira...@gmail.com>wrote: > >>> > >>>>> If you find the time you should trace the process. > >>>>> Seems to be either a misconfiguration or even a bug. > >>>>> > >>>>> I will try to track this down soon with the previous configuration. > >> Right > >>>> now, i am just trying to get data crawled by Monday. > >>>> > >>>> Kiran. > >>>> > >>>> > >>>>>>> Luckily, you should be able to retry via "bin/nutch parse ..." > >>>>>>> Then trace the system and the Java process to catch the reason. > >>>>>>> > >>>>>>> Sebastian > >>>>>>> > >>>>>>> On 03/02/2013 08:13 PM, kiran chitturi wrote: > >>>>>>>> Sorry, i am looking to crawl 400k documents with the crawl. I > >> said > >>>> 400 > >>>>> in > >>>>>>>> my last message. > >>>>>>>> > >>>>>>>> > >>>>>>>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi < > >>>>>>> chitturikira...@gmail.com>wrote: > >>>>>>>> > >>>>>>>>> Hi! 
> >>>>>>>>> > >>>>>>>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 > >> 2.8GHz. > >>>>>>>>> > >>>>>>>>> Last night i started a crawl on local mode for 5 seeds with the > >>>> confi
Re: Nutch 2.1 crawling step by step and crawling command differences
Hi Adriana, I do not know the solution for your problem but in general crawl command is deprecated and using crawl script (step by step) is encouraged. Please check [0] for more details [0] - https://issues.apache.org/jira/browse/NUTCH-1087 On Mon, Mar 4, 2013 at 11:23 AM, Adriana Farina wrote: > Hello, > > I'm using Nutch 2.1 in distributed mode with Hadoop 1.0.4 and HBase 0.90.4 > as database. > > When I launch the crawling job step by step everything works fine, but when > I launch the crawl command (either through the command "hadoop jar > apache-nutch-2.1.job org.apache.nutch.crawl.Crawler urls /urls.txt -depth 3 > -topN 5" or through the command "bin/nutch crawl urls /urls.txt -depth 3 > -topN 5" inside the folder nutch/runtime/deploy) it doesn't fetch anything. > The crawling job runs without problem until the end and it doesn't output > any exception. > However, if I look inside the webpage table created in HBase it is like the > nutch job executes only the inject phase. > > I've dug in the source code of nutch but I'm not able to figure out what > can be the problem. At first I thought that it could be due to the batch > id, since in the "step-by-step mode" I pass it explicitly to the fetcher > and the parser, but this does not exeplain why when I run the crawl command > it does not seem to run the generator. > > Can somebody help me please? > > Thank you! > > > -- > Adriana Farina > -- Kiran Chitturi
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Thanks for your suggestion guys! The big crawl is fetching large amount of big PDF files. For something like below, the fetcher took a lot of time to finish up, even though the files are fetched. It shows more than one hour of time. > > 2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0, > spinWaiting=0, fetchQueues.totalSize=0 > 2013-03-01* 19:45:43,217 *INFO fetcher.Fetcher - -activeThreads=0 > 2013-03-01* 20:57:55,288* INFO fetcher.Fetcher - Fetcher: finished at > 2013-03-01 20:57:55, elapsed: 01:34:09 Does fetching a lot of files causes this issue ? Should i stick to one thread per local mode or use pseudo distributed mode to improve performance ? What is an acceptable time fetcher should finish up after fetching the files ? What exactly happens in this step ? Thanks again! Kiran. On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma wrote: > The default heap size of 1G is just enough for a parsing fetcher with 10 > threads. The only problem that may rise is too large and complicated PDF > files or very large HTML files. If you generate fetch lists of a reasonable > size there won't be a problem most of the time. And if you want to crawl a > lot, then just generate more small segments. > > If there is a bug it's most likely to be the parser eating memory and not > releasing it. > > -Original message- > > From:Tejas Patil > > Sent: Sun 03-Mar-2013 22:19 > > To: user@nutch.apache.org > > Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create > new native thread > > > > I agree with Sebastian. It was a crawl in local mode and not over a > > cluster. The intended crawl volume is huge and if we dont override the > > default heap size to some decent value, there is high possibility of > facing > > an OOM. > > > > > > On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi < > chitturikira...@gmail.com>wrote: > > > > > > If you find the time you should trace the process. > > > > Seems to be either a misconfiguration or even a bug. > > > > > > > > I will try to track this down soon with the previous configuration. > Right > > > now, i am just trying to get data crawled by Monday. > > > > > > Kiran. > > > > > > > > > > >> Luckily, you should be able to retry via "bin/nutch parse ..." > > > > >> Then trace the system and the Java process to catch the reason. > > > > >> > > > > >> Sebastian > > > > >> > > > > >> On 03/02/2013 08:13 PM, kiran chitturi wrote: > > > > >>> Sorry, i am looking to crawl 400k documents with the crawl. I > said > > > 400 > > > > in > > > > >>> my last message. > > > > >>> > > > > >>> > > > > >>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi < > > > > >> chitturikira...@gmail.com>wrote: > > > > >>> > > > > >>>> Hi! > > > > >>>> > > > > >>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 > 2.8GHz. > > > > >>>> > > > > >>>> Last night i started a crawl on local mode for 5 seeds with the > > > config > > > > >>>> given below. If the crawl goes well, it should fetch a total of > 400 > > > > >>>> documents. The crawling is done on a single host that we own. 
> > > > >>>> > > > > >>>> Config > > > > >>>> - > > > > >>>> > > > > >>>> fetcher.threads.per.queue - 2 > > > > >>>> fetcher.server.delay - 1 > > > > >>>> fetcher.throughput.threshold.pages - -1 > > > > >>>> > > > > >>>> crawl script settings > > > > >>>> > > > > >>>> timeLimitFetch- 30 > > > > >>>> numThreads - 5 > > > > >>>> topN - 1 > > > > >>>> mapred.child.java.opts=-Xmx1000m > > > > >>>> > > > > >>>> > > > > >>>> I have noticed today that the crawl has stopped due to an error > and > > > i > > > > >> have > > > > >>>> found the below error in logs. > > > > >>>> > > > > >>>> 2013-03-01 21:45:03,767 I
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
> If you find the time you should trace the process. > Seems to be either a misconfiguration or even a bug. > > I will try to track this down soon with the previous configuration. Right now, i am just trying to get data crawled by Monday. Kiran. > >> Luckily, you should be able to retry via "bin/nutch parse ..." > >> Then trace the system and the Java process to catch the reason. > >> > >> Sebastian > >> > >> On 03/02/2013 08:13 PM, kiran chitturi wrote: > >>> Sorry, i am looking to crawl 400k documents with the crawl. I said 400 > in > >>> my last message. > >>> > >>> > >>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi < > >> chitturikira...@gmail.com>wrote: > >>> > >>>> Hi! > >>>> > >>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 2.8GHz. > >>>> > >>>> Last night i started a crawl on local mode for 5 seeds with the config > >>>> given below. If the crawl goes well, it should fetch a total of 400 > >>>> documents. The crawling is done on a single host that we own. > >>>> > >>>> Config > >>>> - > >>>> > >>>> fetcher.threads.per.queue - 2 > >>>> fetcher.server.delay - 1 > >>>> fetcher.throughput.threshold.pages - -1 > >>>> > >>>> crawl script settings > >>>> > >>>> timeLimitFetch- 30 > >>>> numThreads - 5 > >>>> topN - 1 > >>>> mapred.child.java.opts=-Xmx1000m > >>>> > >>>> > >>>> I have noticed today that the crawl has stopped due to an error and i > >> have > >>>> found the below error in logs. > >>>> > >>>> 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms): > >>>>> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm > >>>>> 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001 > >>>>> java.lang.OutOfMemoryError: unable to create new native thread > >>>>> at java.lang.Thread.start0(Native Method) > >>>>> at java.lang.Thread.start(Thread.java:658) > >>>>> at > >>>>> > >> > java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681) > >>>>> at > >>>>> > >> > java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727) > >>>>> at > >>>>> > >> > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655) > >>>>> at > >>>>> > >> > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92) > >>>>> at > >> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159) > >>>>> at > org.apache.nutch.parse.ParseUtil.parse(PaifrseUtil.java:93) > >>>>> at > >> org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97) > >>>>> at > >> org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44) > >>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > >>>>> at > >> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) > >>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) > >>>>> at > >>>>> > >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) > >>>>> (END) > >>>> > >>>> > >>>> > >>>> Did anyone run in to the same issue ? I am not sure why the new native > >>>> thread is not being created. The link here says [0] that it might due > to > >>>> the limitation of number of processes in my OS. Will increase them > solve > >>>> the issue ? > >>>> > >>>> > >>>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html > >>>> > >>>> Thanks! > >>>> > >>>> -- > >>>> Kiran Chitturi > >>>> > >>> > >>> > >>> > >> > >> > > > > > > -- Kiran Chitturi
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Thanks Sebastian for the suggestions. I came over this by using low value for topN(2000) than 1. I decided to use lower value for topN with more rounds. On Sun, Mar 3, 2013 at 3:41 PM, Sebastian Nagel wrote: > Hi Kiran, > > there are many possible reasons for the problem. Beside the limits on the > number of processes > the stack size in the Java VM and the system (see java -Xss and ulimit -s). > > I think in local mode there should be only one mapper and consequently only > one thread spent for parsing. So the number of processes/threads is hardly > the > problem suggested that you don't run any other number crunching tasks in > parallel > on your desktop. > > Luckily, you should be able to retry via "bin/nutch parse ..." > Then trace the system and the Java process to catch the reason. > > Sebastian > > On 03/02/2013 08:13 PM, kiran chitturi wrote: > > Sorry, i am looking to crawl 400k documents with the crawl. I said 400 in > > my last message. > > > > > > On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi < > chitturikira...@gmail.com>wrote: > > > >> Hi! > >> > >> I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 2.8GHz. > >> > >> Last night i started a crawl on local mode for 5 seeds with the config > >> given below. If the crawl goes well, it should fetch a total of 400 > >> documents. The crawling is done on a single host that we own. > >> > >> Config > >> - > >> > >> fetcher.threads.per.queue - 2 > >> fetcher.server.delay - 1 > >> fetcher.throughput.threshold.pages - -1 > >> > >> crawl script settings > >> > >> timeLimitFetch- 30 > >> numThreads - 5 > >> topN - 1 > >> mapred.child.java.opts=-Xmx1000m > >> > >> > >> I have noticed today that the crawl has stopped due to an error and i > have > >> found the below error in logs. > >> > >> 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms): > >>> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm > >>> 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001 > >>> java.lang.OutOfMemoryError: unable to create new native thread > >>> at java.lang.Thread.start0(Native Method) > >>> at java.lang.Thread.start(Thread.java:658) > >>> at > >>> > java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681) > >>> at > >>> > java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727) > >>> at > >>> > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655) > >>> at > >>> > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92) > >>> at > org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159) > >>> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93) > >>> at > org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97) > >>> at > org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44) > >>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > >>> at > org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) > >>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) > >>> at > >>> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) > >>> (END) > >> > >> > >> > >> Did anyone run in to the same issue ? I am not sure why the new native > >> thread is not being created. The link here says [0] that it might due to > >> the limitation of number of processes in my OS. Will increase them solve > >> the issue ? > >> > >> > >> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html > >> > >> Thanks! 
> >> > >> -- > >> Kiran Chitturi > >> > > > > > > > > -- Kiran Chitturi
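A minimal sketch of the checks Sebastian suggests, for a Nutch 1.x local-mode run (the limit values and the segment name are placeholders, and it assumes the bin/nutch script picks up NUTCH_OPTS from the environment):
# inspect the current per-process limits (thread stack size, max user processes)
ulimit -s
ulimit -u
# optionally raise the process limit for this shell before retrying
ulimit -u 2048
# a smaller per-thread stack lets more native threads fit into the same address space
export NUTCH_OPTS="-Xss512k"
# retry parsing the failed segment (hypothetical segment name)
bin/nutch parse crawl/segments/20130301214500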
Re: help with nutch-site configuration
Hi Amit, I do not exactly understand your question. Do you want to know why half of the URLs are not fetched? You need to take a look at (readdb -stats) to get the statistics, then take a dump of the content, check the URLs which were not fetched and see what the protocolStatus of those URLs is. I previously noticed an inconsistency between fetchStatus and protocolStatus. AFAIK, the successfully parsed pages are sent to Solr. If you want to check more, you can check the parse status in the dump and the logs for any parse errors. HTH On Sun, Mar 3, 2013 at 12:22 PM, Amit Sela wrote: > My use case is crawling over ~12MM URLs with depth 1, and indexing them > with Solr. > I use nutch 1.6 and Solr 3.6.2. > I also use metatags plugin to fetch the URL's keywords and description. > > However, I seem to have issues with fetching and indexing into Solr. > Running on a sample of ~120K URLs, results in fetching about half of them > and indexing ~20K... > After trying some configurations that did help but got me to the mentioned > numbers (it was lower before) I'm kinda lost in what's next. > > If anyone works with this use case and can help I'd appreciate. > > These are my current configurations: > > http.agent.name > MyNutchSpider > plugin.includes > > protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass) > metatags.names > keywords;Keywords;description;Description > index.parse.md > > metatag.keywords,metatag.Keywords,metatag.description,metatag.Description > db.update.additions.allowed > false > generate.count.mode > domain > partition.url.mode > byDomain > fetcher.queue.mode > byDomain > http.redirect.max > 30 > http.content.limit > 262144 > db.injector.update > true > parse.filter.urls > true > parse.normalize.urls > true > > Thanks! > -- Kiran Chitturi
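As a rough illustration of those checks on a Nutch 1.x crawl (the crawldb path and output directory are placeholders):
# overall status counts
bin/nutch readdb crawl/crawldb -stats
# dump the whole crawldb and pull out the entries that were never fetched
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -B1 "db_unfetched" crawldb-dump/part-* | less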
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Sorry, i am looking to crawl 400k documents with the crawl. I said 400 in my last message. On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi wrote: > Hi! > > I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 2.8GHz. > > Last night i started a crawl on local mode for 5 seeds with the config > given below. If the crawl goes well, it should fetch a total of 400 > documents. The crawling is done on a single host that we own. > > Config > - > > fetcher.threads.per.queue - 2 > fetcher.server.delay - 1 > fetcher.throughput.threshold.pages - -1 > > crawl script settings > > timeLimitFetch- 30 > numThreads - 5 > topN - 1 > mapred.child.java.opts=-Xmx1000m > > > I have noticed today that the crawl has stopped due to an error and i have > found the below error in logs. > > 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms): >> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm >> 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001 >> java.lang.OutOfMemoryError: unable to create new native thread >> at java.lang.Thread.start0(Native Method) >> at java.lang.Thread.start(Thread.java:658) >> at >> java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681) >> at >> java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727) >> at >> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655) >> at >> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92) >> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159) >> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93) >> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97) >> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44) >> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) >> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) >> at >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) >> (END) > > > > Did anyone run in to the same issue ? I am not sure why the new native > thread is not being created. The link here says [0] that it might due to > the limitation of number of processes in my OS. Will increase them solve > the issue ? > > > [0] - http://ww2.cs.fsu.edu/~czhang/errors.html > > Thanks! > > -- > Kiran Chitturi > -- Kiran Chitturi
Re: Problem compiling FeedParser plugin with Nutch 2.1 source
Lewis, On the same note, the following plugins needs to be ported when i tried to build 2.x with Eclipse i) Feed ii) parse-swf iii) parse-ext iv) parse-zip v) parse-metatags ( I wrote patch for this earlier, NUTCH-1478) The above plugins need to be ported to build 2.x successfully with plugins. On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > honestly, I think we should get this fixed. > Can someone please explain to me why we don't build every plugin within > Nutch 2.x? > I think we should. > > > On Thu, Feb 28, 2013 at 12:58 PM, kiran chitturi > wrote: > > > This is a problem with the feed plugin. It is not yet ported to 2.x. > > > > The FeedIndexingFilter Class extends the IndexingFilter whose interface > and > > method changed from 1.x to 2.x > > > > I fixed a similar one in Parse-metaTags which extends the ParseFilter > > interface. > > > > [Nutch-874] was opened related to these issues but we do not know still > > what plugins need to be ported due to the API changes. > > > > > > > https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > > > > > > > > On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney < > > lewis.mcgibb...@gmail.com> wrote: > > > > > This shouldn't be happening but we are aware (the Jira instance > reflects > > > this) that there are some existing compatibility issues with Nutch 2.x > > > HEAD. > > > IIRC Kiran had a patch integrated which dealt with some of these > issues. > > > What I have to ask is what JDK are you using? I use 1.6.0_25 (I really > > need > > > to upgrade) on my laptop and we run the Apache Nutch nightly builds for > > > both 1.x trunk and 2.x branch on the latest 1.7 version of Java. > > > Unless I have broken my code whilst writing some patches, my code > > compiles > > > flawlessly locally and as a project we do not have regular compiler > > issues > > > with our development nightly builds. > > > > > > On Wed, Feb 27, 2013 at 10:15 PM, Anand Bhagwat > > >wrote: > > > > > > > Hi, > > > > I want to use FeedParser plugin which comes as part of Nutch 2.1 > > > > distribution. When I am trying to build it its giving compilation > > > errors. > > > > I think its using some classes from Nutch 1.6 which are not > available. > > > Any > > > > suggestions as to how I can resolve this issue? 
> > > > > > > > *[javac] > > > > > > > > > > > > > > /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:28: > > > > cannot find symbol > > > > [javac] symbol : class CrawlDatum > > > > [javac] location: package org.apache.nutch.crawl > > > > [javac] import org.apache.nutch.crawl.CrawlDatum; > > > > [javac] ^ > > > > [javac] > > > > > > > > > > > > > > /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:29: > > > > cannot find symbol > > > > [javac] symbol : class Inlinks > > > > [javac] location: package org.apache.nutch.crawl > > > > [javac] import org.apache.nutch.crawl.Inlinks; > > > > [javac] ^ > > > > [javac] > > > > > > > > > > > > > > /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:36: > > > > cannot find symbol > > > > [javac] symbol : class ParseData > > > > [javac] location: package org.apache.nutch.parse > > > > [javac] import org.apache.nutch.parse.ParseData; > > > > [javac] ^* > > > > > > > > Thanks, > > > > Anand. > > > > > > > > > > > > > > > > -- > > > *Lewis* > > > > > > > > > > > -- > > Kiran Chitturi > > > > > > -- > *Lewis* > -- Kiran Chitturi
Re: Problem compiling FeedParser plugin with Nutch 2.1 source
I worked on this for a while, i was able to fix FeedIndexingFilter but FeedParser also needs some more time and i will work on this some other time. FeedParser might need to be mostly rewritten for 2.x. On Thu, Feb 28, 2013 at 3:58 PM, kiran chitturi wrote: > This is a problem with the feed plugin. It is not yet ported to 2.x. > > The FeedIndexingFilter Class extends the IndexingFilter whose interface > and method changed from 1.x to 2.x > > I fixed a similar one in Parse-metaTags which extends the ParseFilter > interface. > > [Nutch-874] was opened related to these issues but we do not know still > what plugins need to be ported due to the API changes. > > > https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > > > > On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney < > lewis.mcgibb...@gmail.com> wrote: > >> This shouldn't be happening but we are aware (the Jira instance reflects >> this) that there are some existing compatibility issues with Nutch 2.x >> HEAD. >> IIRC Kiran had a patch integrated which dealt with some of these issues. >> What I have to ask is what JDK are you using? I use 1.6.0_25 (I really >> need >> to upgrade) on my laptop and we run the Apache Nutch nightly builds for >> both 1.x trunk and 2.x branch on the latest 1.7 version of Java. >> Unless I have broken my code whilst writing some patches, my code compiles >> flawlessly locally and as a project we do not have regular compiler issues >> with our development nightly builds. >> >> On Wed, Feb 27, 2013 at 10:15 PM, Anand Bhagwat > >wrote: >> >> > Hi, >> > I want to use FeedParser plugin which comes as part of Nutch 2.1 >> > distribution. When I am trying to build it its giving compilation >> errors. >> > I think its using some classes from Nutch 1.6 which are not available. >> Any >> > suggestions as to how I can resolve this issue? >> > >> > *[javac] >> > >> > >> /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:28: >> > cannot find symbol >> > [javac] symbol : class CrawlDatum >> > [javac] location: package org.apache.nutch.crawl >> > [javac] import org.apache.nutch.crawl.CrawlDatum; >> > [javac] ^ >> > [javac] >> > >> > >> /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:29: >> > cannot find symbol >> > [javac] symbol : class Inlinks >> > [javac] location: package org.apache.nutch.crawl >> > [javac] import org.apache.nutch.crawl.Inlinks; >> > [javac] ^ >> > [javac] >> > >> > >> /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:36: >> > cannot find symbol >> > [javac] symbol : class ParseData >> > [javac] location: package org.apache.nutch.parse >> > [javac] import org.apache.nutch.parse.ParseData; >> > [javac] ^* >> > >> > Thanks, >> > Anand. >> > >> >> >> >> -- >> *Lewis* >> > > > > -- > Kiran Chitturi > -- Kiran Chitturi
Re: Problem compiling FeedParser plugin with Nutch 2.1 source
This is a problem with the feed plugin. It is not yet ported to 2.x. The FeedIndexingFilter Class extends the IndexingFilter whose interface and method changed from 1.x to 2.x I fixed a similar one in Parse-metaTags which extends the ParseFilter interface. [Nutch-874] was opened related to these issues but we do not know still what plugins need to be ported due to the API changes. https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > This shouldn't be happening but we are aware (the Jira instance reflects > this) that there are some existing compatibility issues with Nutch 2.x > HEAD. > IIRC Kiran had a patch integrated which dealt with some of these issues. > What I have to ask is what JDK are you using? I use 1.6.0_25 (I really need > to upgrade) on my laptop and we run the Apache Nutch nightly builds for > both 1.x trunk and 2.x branch on the latest 1.7 version of Java. > Unless I have broken my code whilst writing some patches, my code compiles > flawlessly locally and as a project we do not have regular compiler issues > with our development nightly builds. > > On Wed, Feb 27, 2013 at 10:15 PM, Anand Bhagwat >wrote: > > > Hi, > > I want to use FeedParser plugin which comes as part of Nutch 2.1 > > distribution. When I am trying to build it its giving compilation > errors. > > I think its using some classes from Nutch 1.6 which are not available. > Any > > suggestions as to how I can resolve this issue? > > > > *[javac] > > > > > /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:28: > > cannot find symbol > > [javac] symbol : class CrawlDatum > > [javac] location: package org.apache.nutch.crawl > > [javac] import org.apache.nutch.crawl.CrawlDatum; > > [javac] ^ > > [javac] > > > > > /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:29: > > cannot find symbol > > [javac] symbol : class Inlinks > > [javac] location: package org.apache.nutch.crawl > > [javac] import org.apache.nutch.crawl.Inlinks; > > [javac] ^ > > [javac] > > > > > /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:36: > > cannot find symbol > > [javac] symbol : class ParseData > > [javac] location: package org.apache.nutch.parse > > [javac] import org.apache.nutch.parse.ParseData; > > [javac] ^* > > > > Thanks, > > Anand. > > > > > > -- > *Lewis* > -- Kiran Chitturi
Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr
This looks odd. From what i know, the successfully parsed documents are sent to Solr. Did you check the logs for any exceptions ? What command are you using to index ? On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela wrote: > Hi everyone, > > I'm running with nutch 1.6 and Solr 3.6.2. > I'm trying to crawl only the seed list (depth 1) and it seems that the > process ends with only ~255 of the URLs indexed in Solr. > > Seed list is about 120K. > Fetcher map input is 117K where success is 62K and temp_moved 45K. > Parse shows success of 62K. > CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K > and db_fetched=22K. > > And finally IndexerStatus shows 20K documents added. > What am I missing ? > > Thanks! > > my nutch-site.xml includes: > - > plugin.includes > > protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i > metatags.names > keywords;Keywords;description;Description > index.parse.md > > metatag.keywords,metatag.Keywords,metatag.description,metatag.Description > db.update.additions.allowed > false > generate.count.mode > domain > partition.url.mode > byDomain > file.content.limit > 262144 > http.content.limit > 262144 > parse.filter.urls > true > parse.normalize.urls > true > -- Kiran Chitturi
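For reference, a hedged sketch of how the indexing step is usually run and checked on Nutch 1.6 (the Solr URL and paths are placeholders, not taken from the thread):
# send all parsed segments to Solr
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
# then look for indexing or parse exceptions
grep -i -A5 "exception" logs/hadoop.log | less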
Re: nutch-2.1 with hbase - any good tool for querying results?
I found apache pig [1] convenient to use with Hbase for querying and filtering. 1 - http://pig.apache.org/ On Tue, Feb 26, 2013 at 12:18 PM, adfel70 wrote: > Anybody using a good tool for performing queries on the crawl results > directly from hbase? > some of the queries I want to make are: get all the url that failed > fetching, get all the urls that failed parsing. > > querying hbasedirectly seems more convenient then running readdb, waiting > for results, than parsing the readdb output to get the required > information. > > thanks. > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/nutch-2-1-with-hbase-any-good-tool-for-querying-results-tp4043109.html > Sent from the Nutch - User mailing list archive at Nabble.com. > -- Kiran Chitturi
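A minimal sketch of that approach, assuming the default 2.x HBase schema where the web table is named 'webpage' and the fetch status lives in the 'f:st' column (adjust the names to your setup):
# load row key plus fetch status from HBase via Pig
pig -e "pages = LOAD 'hbase://webpage' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:st', '-loadKey true') AS (url:chararray, status:bytearray); DUMP pages;"
From there a FILTER on the status column gives lists such as the failed-fetch or failed-parse URLs mentioned in the question.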
Re: Eclipse Error
Let's keep the discussion in the User mailing list. I would suggest you to follow the instructions here to set up Nutch in Eclipse [1] JDK 1.6 + or 1.7 + will be good enough. I would also suggest to keep your JRE compatible with the JDK. [1] - http://wiki.apache.org/nutch/RunNutchInEclipse On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote: > Kiran, > > Do you think I need a JDK 7? > > ** ** > > *De:* kiran chitturi [mailto:chitturikira...@gmail.com] > *Enviada em:* terça-feira, 26 de fevereiro de 2013 11:57 > > *Para:* d...@nutch.apache.org > *Assunto:* Re: Eclipse Error > > ** ** > > I think Nutch requires atleast Java 1.6. > > ** ** > > On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes > wrote: > > What version of JDK fit with Nutch trunk? > > Anybody knows? > > ** ** > > 2013/2/25 Danilo Fernandes > > Feng Lu, thanks for the fast reply. > > > > But, I’m using the JavaSE-1.6 (jre6) and always get this error. > > > > *De:* feng lu [mailto:amuseme...@gmail.com] > *Enviada em:* segunda-feira, 25 de fevereiro de 2013 22:35 > *Para:* d...@nutch.apache.org > *Assunto:* Re: Eclipse Error > > > > Hi Danilo > > > > "Unsupported maj.minor version 51.0" means that you compiled your classes > under a specific JDK, but then try to run them under older version of JDK. > So, you can't run classes compiled with JDK 6.0 under JDK 5.0. The same > with classes compiled under JDK 7.0 when you try to run them under JDK 6.0. > > > > > On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes > wrote: > > *Hi, I want do some changes in Nutch to get a HTML and take some data > from them. > > My problem starts when I’m compiling the code in Eclipse. > > I always receive the follow error message.* > > Buildfile: *C:\Users\Danilo\workspace\Nutch\build.xml* > > [*taskdef*] Could not load definitions from resource > org/sonar/ant/antlib.xml. It could not be found. > > *ivy-probe-antlib*: > > *ivy-download*: > > [*taskdef*] Could not load definitions from resource > org/sonar/ant/antlib.xml. It could not be found. 
> > *ivy-download-unchecked*: > > *ivy-init-antlib*: > > *ivy-init*: > > *init*: > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes** > ** > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release** > ** > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test > > [*mkdir*] Created dir: > C:\Users\Danilo\workspace\Nutch\build\test\classes > > [*copy*] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf > > [*copy*] Copying > C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template to > C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt > > [*copy*] Copying > C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template to > C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml > > [*copy*] Copying > C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml.template to > C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml > > [*copy*] Copying > C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt.template to > C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt > > [*copy*] Copying > C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml.template to > C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml > > [*copy*] Copying > C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt.template to > C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt > > [*copy*] Copying > C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml.template to > C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml > > [*copy*] Copying > C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt.template to > C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt > > *clean-lib*: > > *resolve-default*: > > [*ivy:resolve*] :: Ivy 2.2.0 - 20100923230623 :: > http://ant.apache.org/ivy/ :: > > [*ivy:resolve*] :: loading settings :: file = > C:\Users\Danilo\workspace\Nutch\ivy\ivysettings.xml > > [*ivy:resolve*] :: problems summary :: > > [*ivy:resolve*] ERRORS > > [*ivy:resolve*] unknown resolver main > > [*ivy:resolve*] unknown resolver main > > [*ivy:resolve*] unknown resolver main > > [*ivy:resolve*] un
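A quick hedged sanity check for the version mismatch described above (the compiler that built the classes must not be newer than the runtime Eclipse uses):
# both should report the same major version, e.g. 1.6 or 1.7
javac -version
java -version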
Re: Handling Content-Type Parameter in Nutch and Solr
Hi Raja, Which Nutch version are you using ? Can you check again with parseChecker [1] tool ? [1] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker On Mon, Feb 25, 2013 at 9:32 AM, Raja Kulasekaran wrote: > Hi, > > I am unable to get the value of ContentType as well as > metatag.Content-Type. > > Can you please suggest me the correct way to get this value ? > > Raja > -- Kiran Chitturi
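For reference, a hedged example of running that tool against a single page (the URL is a placeholder):
# fetch and parse one URL in isolation, printing the parse text and metadata
bin/nutch parsechecker -dumpText http://www.example.com/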
Re: Nutch stable version
Hi Amit, Nutch 2.1 with HBase is more stable than using MySQL as the backend. Please check the link here [0] on how to use HBase as the backend. [0] - http://wiki.apache.org/nutch/Nutch2Tutorial On Mon, Feb 18, 2013 at 8:07 AM, Amit Sela wrote: > Hi all, > > I installed Nutch 2.1 with Gora and MySQL and I tried running the inject > job i got the following exception: > > org.apache.gora.util.GoraException: java.io.IOException: > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length > too big for column 'text' (max = 16383); use BLOB or TEXT instead > > Then I found out it's a known BUG > NUTCH-970<https://issues.apache.org/jira/browse/NUTCH-970> > > So what version should I use for a stable crawler to parse about 12MM urls > ? > I want to try it first on my laptop (with much less urls to parse...) and > then deploy on an existing Hadoop cluster. > > Any suggestions ? > > Thanks, > > Amit. > -- Kiran Chitturi
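A hedged sketch of the switch described in that tutorial, assuming a 2.x source checkout (property names are taken from the Nutch2Tutorial; double-check them against your version):
# point Gora at the HBase store
echo "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore" >> conf/gora.properties
# also set storage.data.store.class to org.apache.gora.hbase.store.HBaseStore in conf/nutch-site.xml,
# enable the gora-hbase dependency in ivy/ivy.xml, then rebuild
ant runtime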
Re: Dump of WebDB in 2.x
Hi Lewis, We crawl one of our library websites and the table dump was always presentable without any issues but i am not sure if we have any special characters within our content. I can check and tell you more on monday when i go back to work. I use Nutch-2.x with Hbase. Kiran. On Sat, Feb 16, 2013 at 3:01 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi, > I wonder if someone using 2.x can tell me the following. Regardless of > which backend you use for storing the webdb or hostdb, can you please > confirm what your db dump looks like e.g. can it be read, is it presentable > visually, etc. > I am struggling to open a dump of my webdb in my gedit text editor as there > are lots of non UTF-8 chars in there. > I wonder if this behaviour is consistent across all gora backends or if it > is specific to gora-cassandra. > Someone using HBase or Accumulo would be great... or of course any of the > SQL db's. > Thank you very much. > Lewis > > -- > *Lewis* > -- Kiran Chitturi
Re: nutch cannot retrive title and inlinks of a domain
Hi Alex, Inlinks do not work for me at the moment for the same domain [0]. I am using Nutch-2.x and HBase. Do the inlinks get saved for you for some of the crawl seeds? Surprisingly, the title does not get saved. Did you try using parsechecker? [0] - http://www.mail-archive.com/user@nutch.apache.org/msg08627.html On Wed, Feb 13, 2013 at 3:26 PM, wrote: > Hello, > > I noticed that nutch cannot retrieve title and inlinks of one of the > domains in the seed list. However, if I run identical code from the server > where this domain is hosted then it correctly parses it. The surprising > thing is that in both cases this urls has > > status: 2 (status_fetched) > parseStatus:success/ok (1/0), args=[] > > > I used nutch-2.1 with hbase-0.92.1 and nutch 1.4. > > > Any ideas why this happens? > > Thanks. > > Alex. > -- Kiran Chitturi
Re: 2.x : Links with 404 status are not being updated from db_unfetched to db_gone
On Mon, Feb 4, 2013 at 7:18 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Kiran, > > You are using 2.x still? > > Yes, I am using the 2.x version of Nutch. HttpBase [0] suggests that upon receipt of a 404 response code the > ProtocolStatus is marked to ProtocolStatusCodes.NOTFOUND which appears > to be 14! [1]. > What are you expecting to happen here? > > Yes, the ProtocolStatus is changed to NOTFOUND, but I am talking about the fetch status, which is still 1 (db_unfetched) instead of being set to 3 (db_gone). We can see in this log file ( https://raw.github.com/salvager/NutchDev/master/runtime/local/table_fields/part-r-0) that URLs with protocolStatus NOTFOUND have a fetch status of 1 (db_unfetched). Shouldn't they be changed from status 1 to status 3? The second column in the log file is fetchStatus and the third column is protocolStatus. Because of this, when I do (readdb -stats) there is an inconsistency. I am not sure if it is a problem only for me or for anyone else. I have done the crawl from scratch 3-4 times. > > > PS : I have made patch which dumps only particular fields through command > line (Example: ./bin/nutch readdb -dump table_fields -fields > "status,protocolStatus"). baseUrl is dumped by default along with other > fields requested. I can upload if anyone is interested. > > Please file an issue and attach your patch. Any potential addition to > the codebase is welcomed., > Sure. Will do! > > [0] > http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java > [1] > http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusCodes.java > > -- > Lewis > -- Kiran Chitturi
2.x : Links with 404 status are not being updated from db_unfetched to db_gone
Hi! I did a crawl on a single seed for 30 rounds and it has crawled around 16k URLs. I have checked (readdb -stats) and it showed 2116 URLs as unfetched. I ran the fetcher again with the option 'all' but it does not fetch anything and the unfetched list remains the same. I have dumped only the fields (baseURL, status, protocolStatus); the dump can be found at ( https://raw.github.com/salvager/NutchDev/master/runtime/local/table_fields/part-r-0 ). The file clearly shows that URLs with status 1 have the protocolStatus NOTFOUND. Those URLs are never moved to status db_gone, which is status 3 if I am correct. Did anyone have a similar problem? Any ideas on how to fix it? PS : I have made a patch which dumps only particular fields through the command line (Example: ./bin/nutch readdb -dump table_fields -fields "status,protocolStatus"). baseUrl is dumped by default along with the other fields requested. I can upload it if anyone is interested. Thanks, -- Kiran Chitturi
Re: Mysql don't save Markers properly
gora-sql has a few bugs. It is recommended to use HBase with Nutch. I had problems with fetching and parsing data. On Fri, Feb 1, 2013 at 2:31 AM, feng lu wrote: > Hi vetus. > > I found the same problem when i run the crawl processing > inject->generate->parse->updatedb. the mysql db output is: > > mysql> SELECT convert(markers using utf8),baseUrl FROM `webpage` WHERE 1; > > > dist0_injmrk_y_updmrk_*1359699678-1110220041__prsmrk__*1359699678-1110220041_gnmrk_*1359699678-1110220041 _ftcmrk_*1359699678-1110220041 | http://www.apache.org/ | > > the generate and fetch mark is still in the db. > > But when i use HBase as the back-end DB, with the same crawled url and same > crawl process. > > In HBase , after runing the updatedb command, the Generate and Fetch mark > are all remove. > > So maybe it's a bug in Gora-sql model. > > > On Thu, Jan 31, 2013 at 5:42 PM, amuseme wrote: > > Hi vetus. > > > > Why updater don't delete the values from de database. I see in > > DbUpdateReducer class WebPage has already remove the Generate and Fetcher > > markers if they exists. > > > > > > > > - > > Don't Grow Old, Grow Up. > > -- > > View this message in context: > > > http://lucene.472066.n3.nabble.com/Mysql-don-t-save-Markers-properly-tp4037310p4037651.html > > Sent from the Nutch - User mailing list archive at Nabble.com. > > > > > > -- > Don't Grow Old, Grow Up... :-) > -- Kiran Chitturi
Re: Nutch 2.0 updatedb and gora query
Hi Lewis, I am using gora 0.2.1 and hbase 0.90.5. I started from scratch and did a step by step crawling (inject, generate, fetch, parse, dbUpdate). I am starting from a single seed. The first four phases went well so far and metadata, outlinks, fetch, parse fields are extracted and saved in hbase. 67 outlinks are present for the first seed. When i did the updateDB command, all the 67 records are added with fields (f:ts, f:st, f:fi, s:s, mk:dist, mkdt:_csh_). Before starting the crawl, i have the internal links property turned to true but i could not see any inlinks in the 67 records. I did the generate, fetch for 67 records. Now, in the hbase more fields are added along with one outlink field which is same as baseURL for that record. Once the records are parsed, more outlinks and fields are added. After the updatedb command now, 902 more records are added with fields (f:ts, f:st, f:fi, s:s, mk:dist, mkdt:_csh_). I am not sure if an inlink is added as an outlink but as far as i saw inlinks are not at all saved with in the records at any phase. They are somehow missed during the dbUpdateReducer phase, where as the other fields are getting added. When i was debugging with eclipse, i saw in dbUpdateReduce job that inlinks are added to the page and i am able to print them to stdout. This look like a bug, i am not sure if is with gora or nutch. Kiran On Wed, Jan 30, 2013 at 2:44 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Kiran, > > On Wed, Jan 30, 2013 at 11:10 AM, kiran chitturi > wrote: > > > I have checked the database after the dbupdate job is ran and i could > see > > only markers, signature and fetch fields. > > > > Which Gora artifacts are you using? > We've recently fixed a bug in gora-cassandra [0] as the state for map > values was not being correctly recorded, this prevented us from writing the > values during the dbUpdaterJob. > I was not aware (and no-one flagged it up during either the Gora 0.2.1 or > Nutch 2.1 RC testing) that there was a problem with similar fields being > written to HBase. > > > > > > The initial seed which was crawled and parsed, has only outlinks. I > notice > > one of the outlink is actually the inlink. > > > > Can you reproduce? Is there any way of being more verbose here. This is > starting to sound like a bug. Unfortunately, I am not 100% on the HBase > module either! > > > > > > Aren't inlinks supposed to be saved during the dbUpdatedJob ? > > > Yes, specifically in the dbUpdaterReducerJob [1] > > > > When i tried > > to debug, i could see in eclipse and in the dbUpdateReducer job that the > > inlinks are being saved to the page object along with fetch fields, > markers > > but i did not understood where the data is going from there. > > > > We need to narrow this down and document it fully then. > I cannot look into this for a couple hours Kiran, > Lewis > > [0] https://issues.apache.org/jira/browse/GORA-182 > [1] http://wiki.apache.org/nutch/Nutch2Crawling#DbUpdate > [2] > > http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/util/WebPageWritable.java > -- Kiran Chitturi
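One hedged way to check directly whether anything reached the inlinks family, assuming the default 2.x HBase mapping where inlinks and outlinks live in the 'il' and 'ol' column families of the 'webpage' table:
# show a few rows restricted to the inlink and outlink families
echo "scan 'webpage', {COLUMNS => ['il', 'ol'], LIMIT => 5}" | hbase shell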
Re: GeneratorJob and InjectorJob questions in Nutch 2.x
Yes. I have noticed sometimes when i want a new crawl and there are already records present in the database, the crawl does not go as expected. I generally drop the table (hbase) and run the crawl again. Also, please use crawl script instead of nutch script to start crawls [0] [0] - https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554862#comment-13554862 On Wed, Jan 30, 2013 at 3:03 PM, Weilei Zhang wrote: > It seems that I understand this problem now: this comes from the prior > fetch(es). > I need to find some way to reset the database if I want to execute a > fresh crawl, right? > Sorry if this is too basic a question. This is only my 4th day into > Nutch/Hadoop/Hbase though I have been a Java programmer for a while. > Thanks > -Weilei > > > On Wed, Jan 30, 2013 at 11:52 AM, Weilei Zhang wrote: > > Hi > > I am trying to use Nutch 2.x and have one question regarding Generator > > and Injector: > > Basically, I only have link as root to crawl and I see (by > > instrumenting the code) that this one link was written to Context in > > the last step of InjectorJob and that is the only link written to > > Context from GeneratorJob. However, I saw multiple links sent to map > > function in the first steps of GeneratorJob ( I instrumented setup > > function). Those links seem to include all URLs referenced from the > > original link. My question is where does fetch/parse happen? From the > > Crawler code, it is straightforward to me that Injector is immediately > > followed by Generator; I tried to scrub the code down to do the job > > but failed. > > > > I ran crawl in the following way: > >>/nutch crawl urlsDir > > > > There is only one link under a file in urlsDir. > >>cat urlsDir/* > > http://www.bmw.com > > > > The following is excerpt from the Generator map function > > instrumentation output. Those are reversedURL. 
> > al.com.bmw.www:http/ > > al.com.bmw.www:http/al/en > > am.bmw.www:http/ > > am.bmw.www:http/am/en > > ao.co.bmw:http/ > > ao.co.bmw:http/ao/pt > > ar.com.bmw.www:http/ > > ar.com.bmw.www:http/ar/es/ > > at.bmw.www:http/ > > at.bmw.www:http/at/de/general/configurations_center/configure.html > > at.bmw.www:http/de/index.html > > > at.bmw.www:http/de/topics/services-angebote/connecteddrivedienste/connecteddrive-antrag/ueberblick.html > > au.com.bmw.www:http/ > > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/compare.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/configurator.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/driveawayprice.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/financecalculator.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/highlights/ > > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/introduction.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requestebrochure.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requesttestdrive.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/compare.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/configurator.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/driveawayprice.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/financecalculator.html > > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/highlights/ > > > > > > Thanks for any hints! > > -- > > Best Regards > > -Weilei > > > > -- > Best Regards > -Weilei > -- Kiran Chitturi
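For completeness, a hedged sketch of the reset mentioned above for an HBase-backed 2.x crawl (this irreversibly deletes the crawl data; 'webpage' is the default table name and will differ if a crawlId prefix is used):
# drop the Nutch web table so the next inject starts from a clean state
printf "disable 'webpage'\ndrop 'webpage'\n" | hbase shell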
Re: GeneratorJob and InjectorJob questions in Nutch 2.x
The steps occur in this order 1) Inject 2) Generate 3) Fetcher 4) Parse 5) dbUpdate I would suggest you to clean the database and start again. Please let me know your results. On Wed, Jan 30, 2013 at 2:52 PM, Weilei Zhang wrote: > Hi > I am trying to use Nutch 2.x and have one question regarding Generator > and Injector: > Basically, I only have link as root to crawl and I see (by > instrumenting the code) that this one link was written to Context in > the last step of InjectorJob and that is the only link written to > Context from GeneratorJob. However, I saw multiple links sent to map > function in the first steps of GeneratorJob ( I instrumented setup > function). Those links seem to include all URLs referenced from the > original link. My question is where does fetch/parse happen? From the > Crawler code, it is straightforward to me that Injector is immediately > followed by Generator; I tried to scrub the code down to do the job > but failed. > > I ran crawl in the following way: > >/nutch crawl urlsDir > > There is only one link under a file in urlsDir. > >cat urlsDir/* > http://www.bmw.com > > The following is excerpt from the Generator map function > instrumentation output. Those are reversedURL. > al.com.bmw.www:http/ > al.com.bmw.www:http/al/en > am.bmw.www:http/ > am.bmw.www:http/am/en > ao.co.bmw:http/ > ao.co.bmw:http/ao/pt > ar.com.bmw.www:http/ > ar.com.bmw.www:http/ar/es/ > at.bmw.www:http/ > at.bmw.www:http/at/de/general/configurations_center/configure.html > at.bmw.www:http/de/index.html > > at.bmw.www:http/de/topics/services-angebote/connecteddrivedienste/connecteddrive-antrag/ueberblick.html > au.com.bmw.www:http/ > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/compare.html > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/configurator.html > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/driveawayprice.html > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/financecalculator.html > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/highlights/ > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/introduction.html > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requestebrochure.html > > au.com.bmw.www:http/com/en/newvehicles/1series/5door/2011/showroom/requesttestdrive.html > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/compare.html > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/configurator.html > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/driveawayprice.html > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/financecalculator.html > > au.com.bmw.www:http/com/en/newvehicles/1series/convertible/2011/showroom/highlights/ > > > Thanks for any hints! > -- > Best Regards > -Weilei > -- Kiran Chitturi
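A minimal sketch of one manual round of that cycle on 2.x (the seed directory and topN are placeholders; on some 2.x versions fetch, parse and updatedb take a batch id instead of -all):
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb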
Re: Nutch 2.0 updatedb and gora query
I have checked the database after the dbupdate job has run and I could see only the markers, signature and fetch fields. The initial seed which was crawled and parsed has only outlinks. I notice that one of the outlinks is actually an inlink. Aren't inlinks supposed to be saved during the dbUpdatedJob ? When I tried to debug, I could see in Eclipse and in the dbUpdateReducer job that the inlinks are being saved to the page object along with the fetch fields and markers, but I did not understand where the data goes from there. Is the data written to Hbase during the dbUpdateReducer job ? Thanks, Kiran. On Wed, Jan 30, 2013 at 1:43 PM, wrote: > I see that inlinks are saved as ol in hbase. > > Alex. > > > > > > > > -----Original Message- > From: kiran chitturi > To: user > Sent: Wed, Jan 30, 2013 9:31 am > Subject: Re: Nutch 2.0 updatedb and gora query > > > Link to the reference ( > > http://lucene.472066.n3.nabble.com/Inlinks-not-being-saved-in-the-database-td4037067.html > ) > and jira (https://issues.apache.org/jira/browse/NUTCH-1524) > > > On Wed, Jan 30, 2013 at 12:25 PM, kiran chitturi > wrote: > > > Hi, > > > > I have posted a similar issue in dev list [0]. The problem comes with > > inlinks not being saved to database even though they are added to the > > webpage object. > > > > I am curious about what happens after the fields are saved in the webpage > > object. How are they sent to Gora ? Which class is used to communicate > with > > Gora ? > > > > I have seen Storage Utils class but i want to know if its the only class > > that is used to communicate with databases. > > > > Please let me know your suggestions. I feel, the inlinks are not being > > saved due to small problem in the code. > > > > > > > > [0] - > > http://mail-archives.apache.org/mod_mbox/nutch-dev/201301.mbox/browser > > > > Thanks, > > -- > > Kiran Chitturi > > > > > > -- > Kiran Chitturi
Re: Nutch 2.0 updatedb and gora query
Link to the reference ( http://lucene.472066.n3.nabble.com/Inlinks-not-being-saved-in-the-database-td4037067.html) and jira (https://issues.apache.org/jira/browse/NUTCH-1524) On Wed, Jan 30, 2013 at 12:25 PM, kiran chitturi wrote: > Hi, > > I have posted a similar issue in dev list [0]. The problem comes with > inlinks not being saved to database even though they are added to the > webpage object. > > I am curious about what happens after the fields are saved in the webpage > object. How are they sent to Gora ? Which class is used to communicate with > Gora ? > > I have seen Storage Utils class but i want to know if its the only class > that is used to communicate with databases. > > Please let me know your suggestions. I feel, the inlinks are not being > saved due to small problem in the code. > > > > [0] - > http://mail-archives.apache.org/mod_mbox/nutch-dev/201301.mbox/browser > > Thanks, > -- > Kiran Chitturi > -- Kiran Chitturi
Re: Nutch 2.x : No Inlinks found
I think i figured this out. This might be due to the default property of 'db.ignore.internal.links' set to true. Please let me know if i am wrong. Thanks, Kiran. On Thu, Jan 24, 2013 at 12:00 PM, kiran chitturi wrote: > Hi! > > I am working with Nutch 2.x and i have crawled 16k documents from giving > single url as a seed. > > I am just checking the hbase database and i found that there are no > inlinks for any webpage while there are outlinks present. > > Is this an issue currently or Is it a problem with my crawling ? > > Please let me know your suggestions. > > Regards, > -- > Kiran Chitturi > -- Kiran Chitturi
Re: Nutch 2.x : readdb command dump
On Thu, Jan 17, 2013 at 12:39 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Kiran, > > On Wednesday, January 16, 2013, kiran chitturi > wrote: > > > > Can i make changes to this parameter ALL_FIELDS and then try to dump the > > fields based on the user input ? This command might look like > './bin/nutch > > readdb -dump baseUrl $OUTPUT'. > > I assume by $OUTPUT you mean the field to pass as a param for the mapper > job? Sorry, i was not clear earlier. The command i wrote was little confusing and not very explanatory. I mean like a command like this ./bin/nutch readdb -dump -field $FIELD_NAME $OUTPUT_DIR. $FIELD_NAME is the field name to be dumped, baseUrl is default field that can be dumped along with any other field requested since it is the key to distinguish between different records. $OUTPUT_DIR is the directory to dump the requested fields from the database. So, my question is whether we can set a single/multiple fields in the query rather than all the fields like in line in 319 in [0] [0] - http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup Thanks, Kiran. > > I might need to go and take a look at Gora API as you suggested, if i > want > > a command like > > './bin/nutch readdb -dump baseUrl -condition parseStatus 2 $OUTPUT' to > dump > > baseUrl's based on the field values. > > You would be good to head to user@gora as this kind of querying is a key > part of Gora functionality. > > > > > > Do you think something like this is meaningful to implement in Nutch 2.x > ? > > Most certainly, anything that gives us a mechanism to obtain fine grained > querying of the webdb can only be a good thing right? > > > > > I feel, its a great thing if nutch can do this instead of doing out of > box > > work with database since we can different kind of databases using Gora. > > +1 > > > > > Please let me know your suggestions. > > > > Thanks, > > Kiran. > > > > [0] > > > > http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup > > > > On Wed, Jan 16, 2013 at 3:54 PM, Lewis John Mcgibbney < > > lewis.mcgibb...@gmail.com> wrote: > > > >> Hi Kiran, > >> > >> For this I think you are looking at diving further into the Gora API and > >> codebase. > >> As you can see around line 232 [0], the Query is set and executed based > on > >> the key. > >> What you wish to do would possible encompass setting fields via the Gora > >> Query API. There are some other useful methods in there which you could > use > >> for your specific requirements. > >> If you find something which you think we could integrate into the > >> WebTableReader in a more widely applicable manner then by all means > please > >> log a Jira, however I think that writing your own custom class to cut of > >> all of the stuff you don't need from the existing WebTableReader may be > the > >> best route to take. > >> Of course this may be wrong for me to say... 
> >> > >> Lewis > >> > >> [0] > >> > >> > > http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup > >> [1] > >> > >> > > http://svn.apache.org/repos/asf/gora/trunk/gora-core/src/main/java/org/apache/gora/query/Query.java > >> > >> On Wed, Jan 16, 2013 at 9:35 AM, kiran chitturi > >> wrote: > >> > >> > If i want to fetch the list of urls based on the value of a field in > the > >> > database (like parseStatus, protocolStatus), are there any direct > tricks > >> or > >> > commands for it rather than dumping the webpage (without content and > >> text) > >> > and searching inside. > >> > > >> > For example a command like './bin/nutch readdb -dump $FIELD_NAME > >> > $FIELD_VALUE $LOCATION', might be quite useful when trying to look in > to > >> > the database after reading stats of the crawl and trying to figure out > >> > which urls are under (status_redir_temp, status_redir_perm, > status_retry, > >> > status_gone, status_unfetched, status_fetched). > >> > > >> > Are there any tips/tricks when trying to deal with large data and > trying > >> to > >> > dump urls based on parseStatus ? > >> > > >> > The documentation here (http://wiki.apache.org/nutch/bin/nutch_readdb > ) > >> > might not apply to 2.x series. > >> > > >> > A page with commands and examples will be very helpful. Can we try to > >> > create all new documentation separating 2.x and 1.x series ? > >> > > >> > > >> > Thanks, > >> > > >> > -- > >> > Kiran Chitturi > >> > > >> > >> > >> > >> -- > >> *Lewis* > >> > > > > > > > > -- > > Kiran Chitturi > > > > -- > *Lewis* > -- Kiran Chitturi
Re: Nutch 2.x : readdb command dump
Hi Lewis, Thanks for your suggestions. I am looking at WebTableReader to make changes, particularly at line 319 [0]. There the query fields are set and the parameter ALL_FIELDS from webpage is passed. Can i make changes to this parameter ALL_FIELDS and then try to dump the fields based on the user input ? This command might look like './bin/nutch readdb -dump baseUrl $OUTPUT'. I might need to go and take a look at Gora API as you suggested, if i want a command like './bin/nutch readdb -dump baseUrl -condition parseStatus 2 $OUTPUT' to dump baseUrl's based on the field values. Do you think something like this is meaningful to implement in Nutch 2.x ? I feel, its a great thing if nutch can do this instead of doing out of box work with database since we can different kind of databases using Gora. Please let me know your suggestions. Thanks, Kiran. [0] http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup On Wed, Jan 16, 2013 at 3:54 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Kiran, > > For this I think you are looking at diving further into the Gora API and > codebase. > As you can see around line 232 [0], the Query is set and executed based on > the key. > What you wish to do would possible encompass setting fields via the Gora > Query API. There are some other useful methods in there which you could use > for your specific requirements. > If you find something which you think we could integrate into the > WebTableReader in a more widely applicable manner then by all means please > log a Jira, however I think that writing your own custom class to cut of > all of the stuff you don't need from the existing WebTableReader may be the > best route to take. > Of course this may be wrong for me to say... > > Lewis > > [0] > > http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup > [1] > > http://svn.apache.org/repos/asf/gora/trunk/gora-core/src/main/java/org/apache/gora/query/Query.java > > On Wed, Jan 16, 2013 at 9:35 AM, kiran chitturi > wrote: > > > If i want to fetch the list of urls based on the value of a field in the > > database (like parseStatus, protocolStatus), are there any direct tricks > or > > commands for it rather than dumping the webpage (without content and > text) > > and searching inside. > > > > For example a command like './bin/nutch readdb -dump $FIELD_NAME > > $FIELD_VALUE $LOCATION', might be quite useful when trying to look in to > > the database after reading stats of the crawl and trying to figure out > > which urls are under (status_redir_temp, status_redir_perm, status_retry, > > status_gone, status_unfetched, status_fetched). > > > > Are there any tips/tricks when trying to deal with large data and trying > to > > dump urls based on parseStatus ? > > > > The documentation here (http://wiki.apache.org/nutch/bin/nutch_readdb) > > might not apply to 2.x series. > > > > A page with commands and examples will be very helpful. Can we try to > > create all new documentation separating 2.x and 1.x series ? > > > > > > Thanks, > > > > -- > > Kiran Chitturi > > > > > > -- > *Lewis* > -- Kiran Chitturi
Re: nutch 2 tutorial
Hi Michael, The Nutch2Tutorial [1] only covers configuring HBase with Nutch. The 'readdb' command needs a parameter to work with. Please check [2] for the steps to crawl using Nutch 2 and HBase. There is also a patch in issue [3] for using a script for crawling with Nutch 2. [1] - http://wiki.apache.org/nutch/Nutch2Tutorial [2] - http://sujitpal.blogspot.com/2011/01/exploring-nutch-20-hbase-storage.html [3] - https://issues.apache.org/jira/browse/NUTCH-1087 Hope this helps! Regards, Kiran. On Mon, Jan 7, 2013 at 11:52 AM, Michael Gang wrote: > Hi all, > > I am trying to follow the tutorial of nutch2 at > http://wiki.apache.org/nutch/Nutch2Tutorial > but after inject the tutorial ends and i don't know how to continue from > there. > > When i try to run > > nutch readdb > > > I get an error > > :bin/nutch readdb > Usage: WebTableReader (-stats | -url [url] | -dump [-regex > regex]) > [-crawlId ] [-content] [-headers] [-links] > [-text] > -crawlId - the id to prefix the schemas to operate on, > (default: storage.crawl.id) > -stats [-sort] - print overall statistics to System.out > [-sort]- list status sorted by host > -url - print information on to System.out > -dump [-regex regex] - dump the webtable to a text file in > > -content - dump also raw content > -headers - dump protocol headers > -links - dump links > -text - dump extracted text > [-regex] - filter on the URL of the webtable entry > > I am asking myself how i can configure nutch that it will crawl a certain > page and all his children pages. > I see that this is the topic in the tutorial > http://wiki.apache.org/nutch/NutchTutorial > but i am not sure from which point to continue, as in nutch2 i am working > against hbase and not against a directory. > > Thanks, > David > -- Kiran Chitturi
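For example, a couple of hedged invocations that satisfy that usage message (the output directory is a placeholder):
# overall crawl statistics from the web table
bin/nutch readdb -stats
# dump the web table, including the extracted text, to a local directory
bin/nutch readdb -dump webpage-dump -text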
Re: Nutch Admin Interface (looking for work)
Thank you Chris for your support. There are example GUI wireframes here [1] in the wiki. I am going through the book 'Wicket in Action', i am not sure how accurate it is with the current version but it should get me familiar with Wicket. I will keep you posted with my learnings. [1] - http://wiki.apache.org/nutch/NutchAdministrationUserInterface#Look_and_feel_Admin_Gui : Regards, Kiran. On Fri, Jan 4, 2013 at 10:48 AM, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hey Kiran, > > On 1/4/13 12:26 AM, "kiran chitturi" wrote: > > >Hi Chris, > > > >Sounds great. > > > >I noticed that the old GUI code was written 3 years ago (if i am correct) > >for 1.x and might be bit outdated. > > Yep it's definitely outdated. We should redo it in Wicket. :) > > > > >Starting with new one based on Wicket with you would be very interesting > >and good learning for me. > > > >Do you think the wireframes that were developed for the GUI while ago [1] > >would be useful ? Can they be developed with Wicket or should we go and > >try > >to design new ones ? > > The wireframes would definitely be useful -- do we have them? > > > > >If we can decide on Wicket, then i can start looking around and playing > >around with Wicket :) > > +1 to Wicket -- let's do it. You have my support. > > Cheers, > Chris > > > > >Many Thanks, > >Kiran. > > > > > >On Fri, Jan 4, 2013 at 1:45 AM, Mattmann, Chris A (388J) < > >chris.a.mattm...@jpl.nasa.gov> wrote: > > > >> Thanks Kiran I will check out both issue pointers below. > >> > >> I would love to explore maybe an Apache Wicket [1] based UI, along with > >> CXF or Restlet web services for the GUI. > >> > >> We already have Restlet thanks to Andrzej and I'm a big Wicket fan and > >> would love to help. > >> > >> Let me know what you think -- it's way better than straight JSPs and > >>maybe > >> will help you to focus on just the HTML or design or JS portion. > >> > >> Cheers, > >> Chris > >> > >> [1] http://wicket.apache.org/ > >> > >> On 1/3/13 9:15 PM, "kiran chitturi" wrote: > >> > >> >Hi Chris, > >> > > >> >Thank you for the pointers. > >> > > >> >I have submitted jira-1467 [1] and 1478 [2]. Both of them are related > >>to > >> >the work i am doing in our library. > >> > > >> >Regarding the GUI, i was hoping if i could contribute to the > >>development > >> >that's already been done. I am using 2.x right now, and would be > >> >interested > >> >in working in that branch. > >> > > >> >I will look at 2.x brach and check how much work is done. I did not > >>have > >> >much experience with JSP before, but i want to look in to the code and > >>see > >> >if i could learn and help at the same time. > >> > > >> >[1] https://issues.apache.org/jira/browse/NUTCH-1467 > >> >[2] https://issues.apache.org/jira/browse/NUTCH-1478 > >> > > >> >Regards, > >> >Kiran. > >> > > >> >On Thu, Jan 3, 2013 at 8:14 PM, Mattmann, Chris A (388J) < > >> >chris.a.mattm...@jpl.nasa.gov> wrote: > >> > > >> >> Thanks Kiran...appreciate it. Can you point to the JIRAs and patches > >>you > >> >> have submitted? I'd be happy to look at them. > >> >> > >> >> The GUI doesn't exist right now and I think that wiki is a bit old > >>with > >> >> that page. If you'd like to work on a GUI or implement one we'd > >> >>appreciate > >> >> it. We currently have 2 main line branches of development: > >> >> > >> >> 2.x -- Gora based, uses swappable backend and Web Page data model. > >> >> 1.x -- based on Sequence File storage, doesn't use ORM. 
> >> >> > >> >> If you could clarify which branch you'd like to work on too that > >>would > >> >> also help. > >> >> Or both! > >> >> > >> >> Thanks! > >> >> > >> >> Cheers, > >> >> Chris > >> >> > >> >> On 1/2/13 7:42 PM, "kiran chitturi" > >>wrote: > >> >> > >> >> >Hi Chris, > >> >&
Re: Nutch Admin Interface (looking for work)
Hi Chris, Sounds great. I noticed that the old GUI code was written 3 years ago (if i am correct) for 1.x and might be bit outdated. Starting with new one based on Wicket with you would be very interesting and good learning for me. Do you think the wireframes that were developed for the GUI while ago [1] would be useful ? Can they be developed with Wicket or should we go and try to design new ones ? If we can decide on Wicket, then i can start looking around and playing around with Wicket :) Many Thanks, Kiran. On Fri, Jan 4, 2013 at 1:45 AM, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Thanks Kiran I will check out both issue pointers below. > > I would love to explore maybe an Apache Wicket [1] based UI, along with > CXF or Restlet web services for the GUI. > > We already have Restlet thanks to Andrzej and I'm a big Wicket fan and > would love to help. > > Let me know what you think -- it's way better than straight JSPs and maybe > will help you to focus on just the HTML or design or JS portion. > > Cheers, > Chris > > [1] http://wicket.apache.org/ > > On 1/3/13 9:15 PM, "kiran chitturi" wrote: > > >Hi Chris, > > > >Thank you for the pointers. > > > >I have submitted jira-1467 [1] and 1478 [2]. Both of them are related to > >the work i am doing in our library. > > > >Regarding the GUI, i was hoping if i could contribute to the development > >that's already been done. I am using 2.x right now, and would be > >interested > >in working in that branch. > > > >I will look at 2.x brach and check how much work is done. I did not have > >much experience with JSP before, but i want to look in to the code and see > >if i could learn and help at the same time. > > > >[1] https://issues.apache.org/jira/browse/NUTCH-1467 > >[2] https://issues.apache.org/jira/browse/NUTCH-1478 > > > >Regards, > >Kiran. > > > >On Thu, Jan 3, 2013 at 8:14 PM, Mattmann, Chris A (388J) < > >chris.a.mattm...@jpl.nasa.gov> wrote: > > > >> Thanks Kiran...appreciate it. Can you point to the JIRAs and patches you > >> have submitted? I'd be happy to look at them. > >> > >> The GUI doesn't exist right now and I think that wiki is a bit old with > >> that page. If you'd like to work on a GUI or implement one we'd > >>appreciate > >> it. We currently have 2 main line branches of development: > >> > >> 2.x -- Gora based, uses swappable backend and Web Page data model. > >> 1.x -- based on Sequence File storage, doesn't use ORM. > >> > >> If you could clarify which branch you'd like to work on too that would > >> also help. > >> Or both! > >> > >> Thanks! > >> > >> Cheers, > >> Chris > >> > >> On 1/2/13 7:42 PM, "kiran chitturi" wrote: > >> > >> >Hi Chris, > >> > > >> >Thank you for replying and the pointers. > >> > > >> >Just to let you know, i have already been working with Nutch for the > >>last > >> >4 > >> >months. I did some JIRA's and few patches. > >> > > >> >I am wondering the intial code for Nutch GUI is out, i have read about > >>the > >> >proposal here [1]. > >> > > >> >I would be working with Nutch for atleast another five months and will > >>be > >> >working around. > >> > > >> >I am particularly interested in Admin GUI since that would be helpful > >>on > >> >what i am working too. > >> > > >> >Please let me know if you have any suggestions > >> > > >> >[1] - http://wiki.apache.org/nutch/NutchAdministrationUserInterface > >> > > >> >Regards, > >> >Kiran. 
Re: Nutch Admin Interface (looking for work)
Hi Chris, Thank you for the pointers. I have submitted jira-1467 [1] and 1478 [2]. Both of them are related to the work i am doing in our library. Regarding the GUI, i was hoping if i could contribute to the development that's already been done. I am using 2.x right now, and would be interested in working in that branch. I will look at 2.x brach and check how much work is done. I did not have much experience with JSP before, but i want to look in to the code and see if i could learn and help at the same time. [1] https://issues.apache.org/jira/browse/NUTCH-1467 [2] https://issues.apache.org/jira/browse/NUTCH-1478 Regards, Kiran. On Thu, Jan 3, 2013 at 8:14 PM, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Thanks Kiran...appreciate it. Can you point to the JIRAs and patches you > have submitted? I'd be happy to look at them. > > The GUI doesn't exist right now and I think that wiki is a bit old with > that page. If you'd like to work on a GUI or implement one we'd appreciate > it. We currently have 2 main line branches of development: > > 2.x -- Gora based, uses swappable backend and Web Page data model. > 1.x -- based on Sequence File storage, doesn't use ORM. > > If you could clarify which branch you'd like to work on too that would > also help. > Or both! > > Thanks! > > Cheers, > Chris > > On 1/2/13 7:42 PM, "kiran chitturi" wrote: > > >Hi Chris, > > > >Thank you for replying and the pointers. > > > >Just to let you know, i have already been working with Nutch for the last > >4 > >months. I did some JIRA's and few patches. > > > >I am wondering the intial code for Nutch GUI is out, i have read about the > >proposal here [1]. > > > >I would be working with Nutch for atleast another five months and will be > >working around. > > > >I am particularly interested in Admin GUI since that would be helpful on > >what i am working too. > > > >Please let me know if you have any suggestions > > > >[1] - http://wiki.apache.org/nutch/NutchAdministrationUserInterface > > > >Regards, > >Kiran. > > > > > >On Wed, Jan 2, 2013 at 7:18 PM, Mattmann, Chris A (388J) < > >chris.a.mattm...@jpl.nasa.gov> wrote: > > > >> Hi Kiran, > >> > >> Thanks! We'd be happy to have you help out. > >> > >> Please start out by reviewing our Nutch JIRA [1] and seeing what's > >>already > >> been done, and then the Nutch wiki [2]. Then feel free to file some > >>Nutch > >> JIRAs and break down the problem into small, easily revertible patches > >>and > >> proceed. > >> > >> The Nutch PMC and devs will be happy to help guide you and help. Feel > >>free > >> to contact us on d...@nutch.apache.org if you have any > questions/concerns > >> while proceeding. > >> > >> Thanks! > >> > >> Cheers, > >> Chris > >> > >> [1] http://issues.apache.org/jira/browse/NUTCH > >> [2] http://wiki.apache.org/nutch/ > >> > >> On 1/1/13 1:40 PM, "kiran chitturi" wrote: > >> > >> >Hi, > >> > > >> >Happy new year to Nutch folks :) > >> > > >> >I noticed today there was a proposal about user interface (admin GUI) > >>for > >> >Nutch. > >> > > >> >I would like to check if i can be of any help. I am very much > >>interested > >> >in > >> >developing one and it would be lot helpful to new folks. > >> > > >> >Please let me know who to contact or where to check the progress. > >> > > >> >Thank you, > >> > > >> >-- > >> >Kiran Chitturi > >> > >> > > > > > >-- > >Kiran Chitturi > > -- Kiran Chitturi
Re: Nutch Admin Interface (looking for work)
Hi Chris, Thank you for replying and the pointers. Just to let you know, i have already been working with Nutch for the last 4 months. I did some JIRA's and few patches. I am wondering the intial code for Nutch GUI is out, i have read about the proposal here [1]. I would be working with Nutch for atleast another five months and will be working around. I am particularly interested in Admin GUI since that would be helpful on what i am working too. Please let me know if you have any suggestions [1] - http://wiki.apache.org/nutch/NutchAdministrationUserInterface Regards, Kiran. On Wed, Jan 2, 2013 at 7:18 PM, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hi Kiran, > > Thanks! We'd be happy to have you help out. > > Please start out by reviewing our Nutch JIRA [1] and seeing what's already > been done, and then the Nutch wiki [2]. Then feel free to file some Nutch > JIRAs and break down the problem into small, easily revertible patches and > proceed. > > The Nutch PMC and devs will be happy to help guide you and help. Feel free > to contact us on d...@nutch.apache.org if you have any questions/concerns > while proceeding. > > Thanks! > > Cheers, > Chris > > [1] http://issues.apache.org/jira/browse/NUTCH > [2] http://wiki.apache.org/nutch/ > > On 1/1/13 1:40 PM, "kiran chitturi" wrote: > > >Hi, > > > >Happy new year to Nutch folks :) > > > >I noticed today there was a proposal about user interface (admin GUI) for > >Nutch. > > > >I would like to check if i can be of any help. I am very much interested > >in > >developing one and it would be lot helpful to new folks. > > > >Please let me know who to contact or where to check the progress. > > > >Thank you, > > > >-- > >Kiran Chitturi > > -- Kiran Chitturi
Re: Nutch 2.1 crash
Hi, There are multiple issues currently when dealing with mysql as backend for Nutch 2.x series. Also, its not recommended to use the 'crawl' command anymore. Please check here (https://issues.apache.org/jira/browse/NUTCH-1087). Best, Kiran. On Wed, Dec 12, 2012 at 9:47 AM, 高睿 wrote: > Hi, > > I found an exception when I running nutch 2.1 with mysql. The command line > is: bin/nutch crawl urls -depth 1 -topN 5 > Here's the reproduce steps for the issue: > 1. start nutch > 2. stop it during it executing > 3. start nutch again > The problem can be recovered by clean up the table 'webpage'. > > = Error in the console > = > Skipping > http://blog.foofactory.fi/2007/03/perfomance-history-for-nutch.html; > different batch id (null) > Exception in thread "main" java.lang.RuntimeException: job failed: > name=parse, jobid=job_local_0004 > at > org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54) > at org.apache.nutch.parse.ParserJob.run(ParserJob.java:251) > at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68) > at org.apache.nutch.crawl.Crawler.run(Crawler.java:171) > at org.apache.nutch.crawl.Crawler.run(Crawler.java:250) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.nutch.crawl.Crawler.main(Crawler.java:257) > > = Error in the logs/hadoop.log > = > 2012-12-12 22:26:33,379 INFO parse.ParserJob - Skipping > http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html; > different batch id (null) > 2012-12-12 22:26:33,379 INFO parse.ParserJob - Skipping > http://blog.foofactory.fi/2007/03/perfomance-history-for-nutch.html; > different batch id (null) > 2012-12-12 22:26:33,380 WARN mapred.FileOutputCommitter - Output path is > null in cleanup > 2012-12-12 22:26:33,381 WARN mapred.LocalJobRunner - job_local_0004 > java.io.IOException: java.io.EOFException > at org.apache.gora.sql.query.SqlResult.nextInner(SqlResult.java:58) > at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:112) > at > org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:111) > at > org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532) > at > org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) > Caused by: java.io.EOFException > at > org.apache.avro.io.BinaryDecoder$InputStreamByteSource.readRaw(BinaryDecoder.java:818) > at > org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:340) > at > org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:265) > at > org.apache.gora.mapreduce.FakeResolvingDecoder.readString(FakeResolvingDecoder.java:131) > at > org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:280) > at > org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:191) > at > org.apache.gora.avro.PersistentDatumReader.readMap(PersistentDatumReader.java:182) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83) > at > org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:102) > at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:259) > at org.apache.gora.sql.store.SqlStore.readField(SqlStore.java:565) > at org.apache.gora.sql.store.SqlStore.readObject(SqlStore.java:486) > at 
org.apache.gora.sql.query.SqlResult.nextInner(SqlResult.java:54) > ... 8 more > > Thanks. > > Regards, > Rui > -- Kiran Chitturi
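A minimal sketch of the recovery step described above (clearing the 'webpage' table so the next run starts from a clean state, then re-seeding). The database name and credentials are placeholders, not from this thread; use whatever your gora.properties points at:

  # wipe the stale rows left by the interrupted run, then re-seed
  mysql -u nutch -p nutchdb -e "TRUNCATE TABLE webpage;"
  bin/nutch inject urls/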
Re: upgrade nutch 1.4 to 2.x
I am not sure how the migration works. There have been changes in the architecture from 1.4 to 2.x. On Thu, Dec 6, 2012 at 1:32 PM, kaveh minooie wrote: > Hi everybody > > I was wondering if it was possible to upgrade (by that I mean the crawldb > and linkdb) from 1.4 to the 2.x version? Has anyone done this? > > -- > Kaveh Minooie > > www.plutoz.com > -- Kiran Chitturi
Re: Best practices for running Nutch
Hi I'm not sure this applies to you because i don't know what you mean by > `running crawler`; never run the fetcher for longer than an hour or so. > Thank you for the reply. I have seen the fetcher run much longer, like 3-5 hours or more. Based on both of your suggestions, I will try to switch my database and use the crawl script instead of the Crawl class. I will start with a new script and see whether it changes the performance. Thank you, Kiran. -- Kiran Chitturi
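One hedged way to act on the "never run the fetcher for longer than an hour or so" advice is to pass a time limit to the fetch job. fetcher.timelimit.mins is defined in nutch-default.xml; whether the FetcherJob in your 2.x build honors it should be verified before relying on it:

  # cap a single fetch round at roughly one hour
  bin/nutch fetch -D fetcher.timelimit.mins=60 -all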
Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?
Hi Erol, It looks like the error is from Nutch side and i would suggest you to check your database for entries and see how the documents, fields are saved or you can dump the database and see the values of the fields and check if there are any multiple values in there. Looks like the document Id is an array in Nutch document than a string. Hope this helps. On Mon, Nov 12, 2012 at 12:45 PM, Erol Akarsu wrote: > I am trying to crawl with nutch and index on solr. Crawling went fine > > But when I try to index with SOLR, then I am getting error in my tomcat log > file "SEVERE: org.apache.solr.common.SolrException: Document contains > multiple values for uniqueKey field: > id=[fi.foofactory.blog:http/2007/03/twice-speed-half-size.html, > http://blog.foofactory.fi/2007/03/twice-speed-half-size.html, > ]" > > bin/nutch crawl urls/ -depth 2 > > eakarsu@ubuntu:~/apache-nutch-2.1/runtime/local$ bin/nutch solrindex > http://localhost:8983/solr40/ -reindex > SolrIndexerJob: starting > Adding 31 documents > SolrIndexerJob: java.lang.RuntimeException: job failed: name=solr-index, > jobid=job_local_0001 > at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54) > at > org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:46) > at > > org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:54) > at > org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:75) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at > org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:84) > > > Nov 12, 2012 10:42:59 AM org.apache.solr.common.SolrException log > SEVERE: org.apache.solr.common.SolrException: Document contains multiple > values for uniqueKey field: > id=[fi.foofactory.blog:http/2007/03/twice-speed-half-size.html, > http://blog.foofactory.fi/2007/03/twice-speed-half-size.html, > ] > at > > org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:91) > at > > org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:445) > at > > org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325) > at > > org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) > at > > org.apache.solr.update.processor.SignatureUpdateProcessorFactory$SignatureUpdateProcessor.processAdd(SignatureUpdateProcessorFactory.java:181) > at > org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:230) > at > org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157) > at > > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) > at > > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) > at > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) > at > > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) > at > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) > at > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) > at > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) > at > > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) > at > > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) > at > 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > at > > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) > at > > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) > at > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) > at > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) > at > > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602) > at > org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) > at java.lang.Thread.run(Thread.java:662) > -- Kiran Chitturi
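A hedged sketch of the inspection suggested above: dump the stored pages and search for the offending key instead of querying the backend directly. The flags follow the 2.x WebTableReader usage string and may differ slightly between revisions:

  # dump the webpage store to a local directory, then look for the duplicate id
  bin/nutch readdb -dump /tmp/webpage-dump -text
  grep -r "twice-speed-half-size" /tmp/webpage-dump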
Nutch html parse error (java.io.IOException: Pushback buffer overflow )
Hi, I was trying to parse HTML files and it threw this error for one particular HTML file. I believe we use TagSoup for parsing, and someone has already fixed this in the TagSoup code used by Tika ( https://github.com/jukka/tagsoup/commit/9cfe7b48745173faafa419f540538a0b6309b699 ). Can someone tell me if this revision is included in the Tika version that ships with Nutch 2.x? Should I use the latest tika-dev to pick up this fix and change the libraries in ivy.xml? Thank you, -- Kiran Chitturi
Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
I have just tested with Hbase as gora-backend and all the pdf files are parsed in the first attempt. This is an sql backend problem as Julien has noted before. Thanks, Kiran. On Thu, Nov 1, 2012 at 1:49 PM, wrote: > Hi, > > I think in order to be sure that this is gora-sql problem, you need to do > the same crawling with nutch/hbase. It must not take much time if you run > it in local mode. Simply install hbase and follow quick start tutorial. > > Alex. > > > > > > > > -Original Message- > From: kiran chitturi > To: user > Sent: Thu, Nov 1, 2012 9:29 am > Subject: Re: bin/nutch parsechecker -dumpText works but bin/nutch parse > fails > > > Hi, > > I have created an issue (https://issues.apache.org/jira/browse/NUTCH-1487 > ). > > Do you think this is because of the SQL backend ? Its failing for PDF files > but working for HTML files. > > Can the problem be due to some bug in the tika.parser code (since tika > plugin handles the PDF parsing) ? > > I am interesting in fixing this problem, if i can find out where the issue > starts. > > Does anyone have inputs for this ? > > Thanks, > Kiran. > > > > On Thu, Nov 1, 2012 at 10:15 AM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > > > Hi > > > > Yes please do open an issue. The docs should be parsed in one go and I > > suspect (yet another) issue with the SQL backend > > > > Thanks > > > > J > > > > On 1 November 2012 13:48, kiran chitturi > > wrote: > > > > > Thank you alxsss for the suggestion. It displays the actualSize and > > > inHeaderSize for every file and two more lines in logs but it did not > > much > > > information even when i set parserJob to Debug. > > > > > > I had the same problem when i re-compiled everything today. I have to > run > > > the parse command multiple times to get all the files parsed. > > > > > > I am using SQL with GORA. Its mysql database. > > > > > > For now, atleast the files are getting parsed, do i need to open a > issue > > > for this ? > > > > > > Thank you, > > > > > > Regards, > > > Kiran. > > > > > > > > > On Wed, Oct 31, 2012 at 4:36 PM, Julien Nioche < > > > lists.digitalpeb...@gmail.com> wrote: > > > > > > > Hi Kiran > > > > > > > > Interesting. Which backend are you using with GORA? The SQL one? > Could > > > be a > > > > problem at that level > > > > > > > > Julien > > > > > > > > On 31 October 2012 17:01, kiran chitturi > > > > wrote: > > > > > > > > > Hi Julien, > > > > > > > > > > I have just noticed something when running the parse. > > > > > > > > > > First when i ran the parse command 'sh bin/nutch parse > > > > > 1351188762-1772522488', the parsing of all the PDF files has > failed. > > > > > > > > > > When i ran the command again one pdf file got parsed. Next time, > > > another > > > > > pdf file got parsed. > > > > > > > > > > When i ran the parse command the number of times the total number > of > > > pdf > > > > > files, all the pdf files got parsed. > > > > > > > > > > In my case, i ran it 17 times and all the pdf files are parsed. > > Before > > > > > that, not everything is parsed. > > > > > > > > > > This sounds strange, do you think it is some configuration problem > ? > > > > > > > > > > I have tried this 2 times and same thing happened two times for me > . > > > > > > > > > > I am not sure why this is happening. > > > > > > > > > > Thanks for your help. > > > > > > > > > > Regards, > > > > > Kiran. 
> > > > > > > > > > > > > > > On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche < > > > > > lists.digitalpeb...@gmail.com> wrote: > > > > > > > > > > > Hi > > > > > > > > > > > > > > > > > > > Sorry about that. I did not notice the parsecodes are actually > > > nutch > > > > > and > > > > > > > not tika. > > > > > > > > > > > > > > no problems! > >
Re: Getting a NullPointerException in Nutch 2.1
...batch id (null) > Skipping > http://nutch.apache.org/tutorial.html; > different batch id (null) > Skipping > http://nutch.apache.org/version_control.html; > different batch id (null) > Skipping > http://nutch.apache.org/wiki.html; > different batch id (null) > Exception in thread "main" java.lang.NullPointerException > at java.util.Hashtable.put(Hashtable.java:411) > at java.util.Properties.setProperty(Properties.java:160) > at org.apache.hadoop.conf.Configuration.set(Configuration.java:438) > at org.apache.nutch.indexer.IndexerJob.createIndexJob(IndexerJob.java:128) > at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:44) > at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68) > at org.apache.nutch.crawl.Crawler.run(Crawler.java:192) > at org.apache.nutch.crawl.Crawler.run(Crawler.java:250) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.nutch.crawl.Crawler.main(Crawler.java:257) > ==== > >I'm using HBase 90.6 because the latest didn't work for me. Also, I'm > using Solr 3.6.1 instead of Solr 4.0 for the same problem. > > I was wondering what versions of Nutch, HBase and Solr other users who > have gotten Nutch to work are using? I'm getting the feeling that only > the right version combinations of all the parts work. > >cocofan > -- Kiran Chitturi
Re: error at parse command in Nutch 2.x : java.sql.BatchUpdateException: data exception: string data, right truncation
Hi Lewis, I am not sure. I do not remember how I overcame this error. For MySQL with Nutch 2.x, I followed this article (which worked most of the time) and I changed the type of the field 'content' to 'LONGBLOB' to overcome the data size error. Apart from that, I do not think I changed much. Sorry, I was not of much help. Regards, Kiran. On Thu, Nov 1, 2012 at 1:16 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Kiran, > > Did you ever get anywhere with this one? > > Lewis > > On Tue, Oct 16, 2012 at 10:30 PM, kiran chitturi > wrote: > > Hi, > > > > I am using Nutch 2.x series with updated tika dependencies with hsql > > database. > > > > I have did the commands 'inject,generate,fetch' and after that when i run > > the 'parse' command i get error below. > > > > Is it because of some size limit ? If so, how can i change the settings ? > > > > java.io.IOException: java.sql.BatchUpdateException: data exception: > string > >> data, right truncation > >> at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) > >> at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185) > >> at > >> > org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) > >> at > >> > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651) > >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766) > >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > >> at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) > >> Caused by: java.sql.BatchUpdateException: data exception: string data, > >> right truncation > >> at org.hsqldb.jdbc.JDBCPreparedStatement.executeBatch(Unknown Source) > >> at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) > >> ... 6 more > > > > > > Thanks for the help. > > > > Regards, > > -- > > Kiran Chitturi > > > > -- > Lewis > -- Kiran Chitturi
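The schema change mentioned above, written out as a one-liner for reference. The database name and credentials are placeholders; 'webpage' and 'content' are the table and column named in the message:

  # widen the content column so large fetched documents fit
  mysql -u nutch -p nutchdb -e "ALTER TABLE webpage MODIFY content LONGBLOB;"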
Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
Hi, I have created an issue (https://issues.apache.org/jira/browse/NUTCH-1487). Do you think this is because of the SQL backend ? Its failing for PDF files but working for HTML files. Can the problem be due to some bug in the tika.parser code (since tika plugin handles the PDF parsing) ? I am interesting in fixing this problem, if i can find out where the issue starts. Does anyone have inputs for this ? Thanks, Kiran. On Thu, Nov 1, 2012 at 10:15 AM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Hi > > Yes please do open an issue. The docs should be parsed in one go and I > suspect (yet another) issue with the SQL backend > > Thanks > > J > > On 1 November 2012 13:48, kiran chitturi > wrote: > > > Thank you alxsss for the suggestion. It displays the actualSize and > > inHeaderSize for every file and two more lines in logs but it did not > much > > information even when i set parserJob to Debug. > > > > I had the same problem when i re-compiled everything today. I have to run > > the parse command multiple times to get all the files parsed. > > > > I am using SQL with GORA. Its mysql database. > > > > For now, atleast the files are getting parsed, do i need to open a issue > > for this ? > > > > Thank you, > > > > Regards, > > Kiran. > > > > > > On Wed, Oct 31, 2012 at 4:36 PM, Julien Nioche < > > lists.digitalpeb...@gmail.com> wrote: > > > > > Hi Kiran > > > > > > Interesting. Which backend are you using with GORA? The SQL one? Could > > be a > > > problem at that level > > > > > > Julien > > > > > > On 31 October 2012 17:01, kiran chitturi > > > wrote: > > > > > > > Hi Julien, > > > > > > > > I have just noticed something when running the parse. > > > > > > > > First when i ran the parse command 'sh bin/nutch parse > > > > 1351188762-1772522488', the parsing of all the PDF files has failed. > > > > > > > > When i ran the command again one pdf file got parsed. Next time, > > another > > > > pdf file got parsed. > > > > > > > > When i ran the parse command the number of times the total number of > > pdf > > > > files, all the pdf files got parsed. > > > > > > > > In my case, i ran it 17 times and all the pdf files are parsed. > Before > > > > that, not everything is parsed. > > > > > > > > This sounds strange, do you think it is some configuration problem ? > > > > > > > > I have tried this 2 times and same thing happened two times for me . > > > > > > > > I am not sure why this is happening. > > > > > > > > Thanks for your help. > > > > > > > > Regards, > > > > Kiran. > > > > > > > > > > > > On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche < > > > > lists.digitalpeb...@gmail.com> wrote: > > > > > > > > > Hi > > > > > > > > > > > > > > > > Sorry about that. I did not notice the parsecodes are actually > > nutch > > > > and > > > > > > not tika. > > > > > > > > > > > > no problems! > > > > > > > > > > > > > > > > The setup is local on Mac desktop and i am using through command > > line > > > > and > > > > > > remote debugging through eclipse ( > > > > > > > > > > > > > > > > > > > > > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse > > > > > > ). > > > > > > > > > > > > > > > > OK > > > > > > > > > > > > > > > > > I have set both http.content.limit and file.content.limit to -1. > > The > > > > logs > > > > > > just say 'WARN parse.ParseUtil - Unable to successfully parse > > > content > > > > > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of > > > type > > > > > > application/pdf'. 
> > > > > > > > > > > > > > > > you set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml right? > > (not > > > > > in $NUTCH_HOME/conf/nutch-site.xml unless you call 'ant clean > > runtime') > > > > > &g
Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
Thank you alxsss for the suggestion. It displays the actualSize and inHeaderSize for every file and two more lines in logs but it did not much information even when i set parserJob to Debug. I had the same problem when i re-compiled everything today. I have to run the parse command multiple times to get all the files parsed. I am using SQL with GORA. Its mysql database. For now, atleast the files are getting parsed, do i need to open a issue for this ? Thank you, Regards, Kiran. On Wed, Oct 31, 2012 at 4:36 PM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Hi Kiran > > Interesting. Which backend are you using with GORA? The SQL one? Could be a > problem at that level > > Julien > > On 31 October 2012 17:01, kiran chitturi > wrote: > > > Hi Julien, > > > > I have just noticed something when running the parse. > > > > First when i ran the parse command 'sh bin/nutch parse > > 1351188762-1772522488', the parsing of all the PDF files has failed. > > > > When i ran the command again one pdf file got parsed. Next time, another > > pdf file got parsed. > > > > When i ran the parse command the number of times the total number of pdf > > files, all the pdf files got parsed. > > > > In my case, i ran it 17 times and all the pdf files are parsed. Before > > that, not everything is parsed. > > > > This sounds strange, do you think it is some configuration problem ? > > > > I have tried this 2 times and same thing happened two times for me . > > > > I am not sure why this is happening. > > > > Thanks for your help. > > > > Regards, > > Kiran. > > > > > > On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche < > > lists.digitalpeb...@gmail.com> wrote: > > > > > Hi > > > > > > > > > > Sorry about that. I did not notice the parsecodes are actually nutch > > and > > > > not tika. > > > > > > > > no problems! > > > > > > > > > > The setup is local on Mac desktop and i am using through command line > > and > > > > remote debugging through eclipse ( > > > > > > > > > > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse > > > > ). > > > > > > > > > > OK > > > > > > > > > > > I have set both http.content.limit and file.content.limit to -1. The > > logs > > > > just say 'WARN parse.ParseUtil - Unable to successfully parse > content > > > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of > type > > > > application/pdf'. > > > > > > > > > > you set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml right? (not > > > in $NUTCH_HOME/conf/nutch-site.xml unless you call 'ant clean runtime') > > > > > > > > > > > > > > All the html's are getting parsed and when i crawl this page ( > > > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the html's and > > > some > > > > of the pdf files get parsed. Like, half of the pdf files get parsed > and > > > the > > > > other half don't get parsed. > > > > > > > > > > do the ones that are not parsed have something in common? length? > > > > > > > > > > I am not sure about what causing the problem as you said parsechecker > > is > > > > actually work. I want the parser to crawl the full-text of the pdf > and > > > the > > > > metadata, title. > > > > > > > > > > OK > > > > > > > > > > > > > > The metatags are also getting crawled for failed pdf parsing. > > > > > > > > > > They would be discarded because of the failure even if they > > > were successfully extracted indeed. The current mechanism does not > cater > > > for semi-failures > > > > > > J. 
> > > > > > -- > > > * > > > *Open Source Solutions for Text Engineering > > > > > > http://digitalpebble.blogspot.com/ > > > http://www.digitalpebble.com > > > http://twitter.com/digitalpebble > > > > > > > > > > > -- > > Kiran Chitturi > > > > > > -- > * > *Open Source Solutions for Text Engineering > > http://digitalpebble.blogspot.com/ > http://www.digitalpebble.com > http://twitter.com/digitalpebble > -- Kiran Chitturi
Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
Hi Julien, I have just noticed something when running the parse. First when i ran the parse command 'sh bin/nutch parse 1351188762-1772522488', the parsing of all the PDF files has failed. When i ran the command again one pdf file got parsed. Next time, another pdf file got parsed. When i ran the parse command the number of times the total number of pdf files, all the pdf files got parsed. In my case, i ran it 17 times and all the pdf files are parsed. Before that, not everything is parsed. This sounds strange, do you think it is some configuration problem ? I have tried this 2 times and same thing happened two times for me . I am not sure why this is happening. Thanks for your help. Regards, Kiran. On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Hi > > > > Sorry about that. I did not notice the parsecodes are actually nutch and > > not tika. > > > > no problems! > > > > The setup is local on Mac desktop and i am using through command line and > > remote debugging through eclipse ( > > > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse > > ). > > > > OK > > > > > I have set both http.content.limit and file.content.limit to -1. The logs > > just say 'WARN parse.ParseUtil - Unable to successfully parse content > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type > > application/pdf'. > > > > you set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml right? (not > in $NUTCH_HOME/conf/nutch-site.xml unless you call 'ant clean runtime') > > > > > > All the html's are getting parsed and when i crawl this page ( > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the html's and > some > > of the pdf files get parsed. Like, half of the pdf files get parsed and > the > > other half don't get parsed. > > > > do the ones that are not parsed have something in common? length? > > > > I am not sure about what causing the problem as you said parsechecker is > > actually work. I want the parser to crawl the full-text of the pdf and > the > > metadata, title. > > > > OK > > > > > > The metatags are also getting crawled for failed pdf parsing. > > > > They would be discarded because of the failure even if they > were successfully extracted indeed. The current mechanism does not cater > for semi-failures > > J. > > -- > * > *Open Source Solutions for Text Engineering > > http://digitalpebble.blogspot.com/ > http://www.digitalpebble.com > http://twitter.com/digitalpebble > -- Kiran Chitturi
Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
Hi Julien, Sorry about that. I did not notice the parsecodes are actually nutch and not tika. The setup is local on Mac desktop and i am using through command line and remote debugging through eclipse ( http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse ). I have set both http.content.limit and file.content.limit to -1. The logs just say 'WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type application/pdf'. All the html's are getting parsed and when i crawl this page ( http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the html's and some of the pdf files get parsed. Like, half of the pdf files get parsed and the other half don't get parsed. I am not sure about what causing the problem as you said parsechecker is actually work. I want the parser to crawl the full-text of the pdf and the metadata, title. The metatags are also getting crawled for failed pdf parsing. Please let me know if any additional information is need. Thanks for the help. Regards. On Wed, Oct 31, 2012 at 9:59 AM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > hi Kiran > > > > Does anyone know why i am having this conflict ? I feel thats because of > > the Tika parser parsecodes (Major Code and Minor code) but i have not > been > > able to figure out why this happened. > > > as explained earlier you are confusing cause and consequences here. the > parsing does not fail because of the codes but the codes indicate that it > fails > > there is no point in bothering people in the Tika list as the codes are not > related to tika but are 100% Nutch > > please give more info about your setup : local? psuedo -distributed? > running from the command line? Have you checked that the content limit is > really taken into account? What messages are you getting in the logs? > etc > > Thanks > > Julien > > > On 31 October 2012 13:53, kiran chitturi > wrote: > > > Hi, > > > > I have mailed the list previously about Tika parse Codes (major code and > > minor code) and as Julien pointed out here > > http://www.mail-archive.com/user%40nutch.apache.org/msg07950.html 'sh > > bin/nutch parsechecker -dumpText > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf' works but > > when i do 'sh bin/nutch parse that includes the above pdf file > > then i see this message in the logs > > > > 'WARN parse.ParseUtil - Unable to successfully parse content > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type > > application/pdf' > > > > Does anyone know why i am having this conflict ? I feel thats because of > > the Tika parser parsecodes (Major Code and Minor code) but i have not > been > > able to figure out why this happened. > > > > Did anyone encounter this problem before ? I am also gonna post in tika > > mailing list about what the codes mean ? > > > > > > Regards, > > > > -- > > Kiran Chitturi > > > > > > -- > * > *Open Source Solutions for Text Engineering > > http://digitalpebble.blogspot.com/ > http://www.digitalpebble.com > http://twitter.com/digitalpebble > -- Kiran Chitturi
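Following up on Julien's note in the quoted thread: if the limits are edited in $NUTCH_HOME/conf/nutch-site.xml, the runtime has to be rebuilt before the change reaches runtime/local/conf. A minimal sketch, assuming a source checkout laid out as in the thread:

  cd $NUTCH_HOME
  ant clean runtime    # copies conf/ into runtime/local/conf/
  cd runtime/local
  bin/nutch parsechecker -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf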
Re: Nutch 2.x parse MajorCode, MinorCode
Hi Julien, The parsechecker works fine for me too but this fails when i do the complete crawl and try to save it in the database. I do not know where its failing. I can check back if you want me to. Thanks! Kiran On Tue, Oct 30, 2012 at 11:06 AM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > *./nutch parsechecker -D http.agent.name="tralala" -D > http.content.limit=-1 > -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf* > > works absolutely fine in both the trunk and 2.x branch. try from the > runtime/local/bin directory and check the logs for more details > > On 30 October 2012 13:54, kiran chitturi > wrote: > > > Interestingly, the tika jar i have downloaded separately is able to parse > > all the text from the pdf files while the nutch tika parser is failing > for > > some of the files. I have set the content.limit to -1. > > > > The error message is '2012-10-30 09:30:37,382 WARN parse.ParseUtil - > > Unable to successfully parse content > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf of type > > application/pdf' > > > > for the failed pdf files. I could see some title and text when i am > > debugging in Eclipse but i could see it failing due to the parseCodes. > > > > Thank you. > > Kiran > > > > On Tue, Oct 30, 2012 at 8:58 AM, kiran chitturi > > wrote: > > > > > Hi > > > > > > I did not sent the content limit to -1 but i have set it high enough to > > be > > > able to go through the documents that i am parsing. I could see some > > title > > > and text but i am not sure how much it is able to do. I am gonna try > > using > > > tika separately and try to process the documents. If all of it goes > > through > > > tika-1.2 separately then i have to try to debug where i am getting the > > > error here. > > > > > > Many Thanks, > > > Kiran. > > > > > > > > > On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche < > > > lists.digitalpeb...@gmail.com> wrote: > > > > > >> Hi > > >> > > >> Look at the code for the class ParseStatusCodes. This simply indicates > > >> that > > >> the parsing failed and is not the cause for the failing itself. Do you > > get > > >> the entire text for the document or just what the parser managed to > > >> process > > >> until it failed? Did you set the content limit to -1? > > >> > > >> Thanks > > >> > > >> Julien > > >> > > >> > > >> On 29 October 2012 19:17, kiran chitturi > > >> wrote: > > >> > > >> > Hi! > > >> > > > >> > I am debugging nutch with eclipse and i have found out that some pdf > > >> files > > >> > which are not succesfully parsed have majorCode as 2 and minorCode > as > > >> 200 > > >> > and files which are succesfully parsed have majorCode 1 and > minorCode > > 0. > > >> > > > >> > Can someone please explain me or point to what these codes mean ? > > >> > > > >> > Actually, the title, text and everything is parsed in the failed > > parses > > >> but > > >> > somehow because of the codes it not saving the fields and returning > as > > >> > failed parsing. > > >> > > > >> > Thanks for your help. 
> > >> > > > >> > Regards, > > >> > -- > > >> > Kiran Chitturi > > >> > > > >> > > >> > > >> > > >> -- > > >> * > > >> *Open Source Solutions for Text Engineering > > >> > > >> http://digitalpebble.blogspot.com/ > > >> http://www.digitalpebble.com > > >> http://twitter.com/digitalpebble > > >> > > > > > > > > > > > > -- > > > Kiran Chitturi > > > > > > > > > > > > -- > > Kiran Chitturi > > > > > > -- > * > *Open Source Solutions for Text Engineering > > http://digitalpebble.blogspot.com/ > http://www.digitalpebble.com > http://twitter.com/digitalpebble > -- Kiran Chitturi
Re: Nutch 2.x parse MajorCode, MinorCode
Interestingly, the tika jar i have downloaded separately is able to parse all the text from the pdf files while the nutch tika parser is failing for some of the files. I have set the content.limit to -1. The error message is '2012-10-30 09:30:37,382 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf of type application/pdf' for the failed pdf files. I could see some title and text when i am debugging in Eclipse but i could see it failing due to the parseCodes. Thank you. Kiran On Tue, Oct 30, 2012 at 8:58 AM, kiran chitturi wrote: > Hi > > I did not sent the content limit to -1 but i have set it high enough to be > able to go through the documents that i am parsing. I could see some title > and text but i am not sure how much it is able to do. I am gonna try using > tika separately and try to process the documents. If all of it goes through > tika-1.2 separately then i have to try to debug where i am getting the > error here. > > Many Thanks, > Kiran. > > > On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > >> Hi >> >> Look at the code for the class ParseStatusCodes. This simply indicates >> that >> the parsing failed and is not the cause for the failing itself. Do you get >> the entire text for the document or just what the parser managed to >> process >> until it failed? Did you set the content limit to -1? >> >> Thanks >> >> Julien >> >> >> On 29 October 2012 19:17, kiran chitturi >> wrote: >> >> > Hi! >> > >> > I am debugging nutch with eclipse and i have found out that some pdf >> files >> > which are not succesfully parsed have majorCode as 2 and minorCode as >> 200 >> > and files which are succesfully parsed have majorCode 1 and minorCode 0. >> > >> > Can someone please explain me or point to what these codes mean ? >> > >> > Actually, the title, text and everything is parsed in the failed parses >> but >> > somehow because of the codes it not saving the fields and returning as >> > failed parsing. >> > >> > Thanks for your help. >> > >> > Regards, >> > -- >> > Kiran Chitturi >> > >> >> >> >> -- >> * >> *Open Source Solutions for Text Engineering >> >> http://digitalpebble.blogspot.com/ >> http://www.digitalpebble.com >> http://twitter.com/digitalpebble >> > > > > -- > Kiran Chitturi > > -- Kiran Chitturi
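A rough sketch of the "try Tika separately" comparison described above, using the tika-app command-line jar; the jar name and version are whatever was downloaded, and the URL is one of the failing documents from this thread:

  java -jar tika-app-1.2.jar --text http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf
  java -jar tika-app-1.2.jar --metadata http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf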
Re: Nutch 2.x parse MajorCode, MinorCode
Hi I did not sent the content limit to -1 but i have set it high enough to be able to go through the documents that i am parsing. I could see some title and text but i am not sure how much it is able to do. I am gonna try using tika separately and try to process the documents. If all of it goes through tika-1.2 separately then i have to try to debug where i am getting the error here. Many Thanks, Kiran. On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Hi > > Look at the code for the class ParseStatusCodes. This simply indicates that > the parsing failed and is not the cause for the failing itself. Do you get > the entire text for the document or just what the parser managed to process > until it failed? Did you set the content limit to -1? > > Thanks > > Julien > > > On 29 October 2012 19:17, kiran chitturi > wrote: > > > Hi! > > > > I am debugging nutch with eclipse and i have found out that some pdf > files > > which are not succesfully parsed have majorCode as 2 and minorCode as 200 > > and files which are succesfully parsed have majorCode 1 and minorCode 0. > > > > Can someone please explain me or point to what these codes mean ? > > > > Actually, the title, text and everything is parsed in the failed parses > but > > somehow because of the codes it not saving the fields and returning as > > failed parsing. > > > > Thanks for your help. > > > > Regards, > > -- > > Kiran Chitturi > > > > > > -- > * > *Open Source Solutions for Text Engineering > > http://digitalpebble.blogspot.com/ > http://www.digitalpebble.com > http://twitter.com/digitalpebble > -- Kiran Chitturi
Re: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf
Thank you very much. This has worked great and resolved the issue of finding parser. One interesting thing is out of 10 pdf files, it has crawled 2 files and said unsuccessful for other pdf files. This has happened like 10 times for now. I really need to debug and put more error messages than just 'unable to succesfully parse content ..' Thanks again, Kiran. On Fri, Oct 26, 2012 at 4:16 AM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > > > > Is there anything wrong with my eclipse configuration? I am looking to > > debug some things in nutch, so i am working with eclipse and nutch. > > > easier to follow the steps in Remote Debugging in Eclipse from > http://wiki.apache.org/nutch/RunNutchInEclipse > > it will save you all sorts of classpath issues etc... note that this works > in local mode only > > HTH > > Julien > > > On 25 October 2012 19:44, kiran chitturi > wrote: > > > Hi, > > > > i have built Nutch 2.x in eclipse using this tutorial ( > > http://wiki.apache.org/nutch/RunNutchInEclipse) and with some > > modifications. > > > > Its able to parse html files successfully but when it comes to pdf files > it > > says 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't retrieve Tika > > parser for mime-type application/pdf > > > > Is there anything wrong with my eclipse configuration? I am looking to > > debug some things in nutch, so i am working with eclipse and nutch. > > > > Do i need to point any libraries for eclipseto recognize tika parsers for > > application/pdf type ? > > > > What exactly is the reason for this type of error to appear for only pdf > > files and not html files ? I am using recent nutch 2.x which has tika > > upgraded to 1.2 > > > > I would like some help here and would like to know if anyone has > > encountered similar problem with eclipse, nutch 2.x and parsing > > application/pdf files ? > > > > Many Thanks, > > -- > > Kiran Chitturi > > > > > > -- > * > *Open Source Solutions for Text Engineering > > http://digitalpebble.blogspot.com/ > http://www.digitalpebble.com > http://twitter.com/digitalpebble > -- Kiran Chitturi
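For reference, one way to set up the remote debugging mentioned above when launching jobs from the command line: start the JVM with a JDWP agent and attach Eclipse to the port. This assumes the bin/nutch script appends to, rather than overwrites, NUTCH_OPTS from the environment; check the script before relying on it:

  export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
  bin/nutch parse -all   # then attach the Eclipse remote debugger to port 8000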
Re: Nutch 2.x, MySQL and readhostdb command.
I get the same error for 'sh bin/nutch readhostdb localhost'. 'updatehostdb' also gives similar error. On Fri, Oct 19, 2012 at 5:26 AM, wrote: > Could somebody confirm if the bin/nutch readhostdb command works with > MySQL. I am trying to figure out if it is broke or I don't know how to use > it. > > Thanks > > James > > > bin/nutch readhostdb localhost > > HostDBReader: org.apache.gora.util.GoraException: java.io.IOException: > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key > was too long; max key length is 767 bytes > at > org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167) > at > org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135) > at > org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75) > at org.apache.nutch.host.HostDbReader.read(HostDbReader.java:44) > at org.apache.nutch.host.HostDbReader.run(HostDbReader.java:82) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.nutch.host.HostDbReader.main(HostDbReader.java:68) > > -- Kiran Chitturi
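The "max key length is 767 bytes" message is InnoDB's per-index limit, typically hit because a utf8 varchar key needs three bytes per character. A commonly used, hedged workaround when pairing Nutch 2.x with MySQL is to create the database with a single-byte character set so the key column fits under the limit; the database name and credentials below are placeholders:

  mysql -u root -p -e "CREATE DATABASE nutchdb DEFAULT CHARACTER SET latin1;"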
Re: Nutch 2.x : ParseUtil failing for some pdf files
Hi James, I have increased the limit in nutch-site.xml ( https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have created the webpage table based on the fields here ( http://nlp.solutions.asia/?p=180). The database still shows the parseStatus as 'org.apache.nutch.parse.ParseException: Unable to successfully parse content', and the text field is 'null' for those rows. This is a screenshot <https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.27.13%20PM.png> of the MySQL database that I have. Can you please tell me how I can overcome this problem? This is a screenshot <https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.36.43%20PM.png> of my webpage table. Many thanks for your help. Regards, Kiran. On Wed, Oct 17, 2012 at 6:20 AM, wrote: > Hi Kiran, > > I agree with Julien it is probably trimmed content. > > I regularly parse PDFs with Nutch 2.x with MySQL as the backend without > problem (even without the patch). > > The differences in my set up from the standard set up that may be > applicable: > > 1) In nutch-site.xml the file.content.limit and http.content.limit are set > to 600. > 2) I have a custom create webpage table sql script that creates fields > that can hold more. The default table fields are not sufficiently large in > most real world situations. http://nlp.solutions.asia/?p=180 > > I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it > successfully parsed all except one of the PDFs, v29n3.pdf. That PDF is > almost 20 megs much larger than the limit in nutch-default.xml and even > larger than that configured in my nutch-site.xml. Interestingly that PDF is > also completely pictures (what looks like text is actually pictures of > text) so there may be no real text to parse. > > James > > > From: Julien Nioche [lists.digitalpeb...@gmail.com] > Sent: Wednesday, October 17, 2012 4:17 PM > To: user@nutch.apache.org > Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files > > trimmed content? > > On 16 October 2012 22:47, kiran chitturi > wrote: > > > Hi, > > > > I am running Nutch 2.x with patch here at > > https://issues.apache.org/jira/browse/NUTCH-1433 and connected to a > mysql > > database. > > > > After the {inject, generate, fetch} commands when i issue the command (sh > > bin/nutch parse 1350396627-126726428) the parserJob was success but when > i > > look inside the database only one pdf file is parsed out of 10. > > > > When i look in to hadoop.log it shows the statement '2012-10-16 > > 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type > > application/pdf' like this. > > > > The logs of successfully parsed and failed ones are below. The logs below > > show that pdf file '.../agosto.pdf' is parsed and the file > > '/authors.pdf' is not parsed. > > > > The same thing happened for all other pdf files, the parse failed. When i > > do the 'sh bin/nutch parsechecker {url}' it worked with the failed pdf > > files and it does not show any errors. 
> > > > > > 2012-10-16 16:04:28,150 INFO parse.ParserJob - Parsing > > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf > > > 2012-10-16 16:04:28,151 INFO parse.ParserFactory - The parsing > plugins: > > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the > > > plugin.includes system property, and all claim to support the content > > type > > > application/pdf, but they are not mapp > > > ed to it in the parse-plugins.xml file > > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : > > > content-type application/pdf > > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : > > > dcterms:modified 2010-11-02T20:51:27Z > > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : > > > meta:creation-date2010-10-20T21:12:47Z > > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : > > > meta:save-date2010-11-02T20:51:27Z > > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : > > > last-modified 2010-11-02T20:51:27Z > > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : > > > dc:creatorDenise E. Agosto > > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : > > > dcterms:crea
Nutch 2.x : ParseUtil failing for some pdf files
> modified 2010-11-02T20:51:57Z > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : > xmptpg:npages 1 > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : > created Wed Oct 20 17:00:15 EDT 2010 > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : > producer Adobe Acrobat 9.4 Paper Capture Plug-in > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : > last-save-date 2010-11-02T20:51:57Z > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : > dc:title ALAN v29n3 - INSTRUCTIONS FOR AUTHORS > 2012-10-16 16:04:30,682 WARN parse.ParseUtil - Unable to successfully > parse content > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type > application/pdf > 2012-10-16 16:04:30,692 INFO parse.ParserJob - Parsing > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf > Is there any way I can get more detailed logs to tell whether the error is specific to the file or comes from the parser itself? Thank you, -- Kiran Chitturi
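One way to get more detail than the generic ParseUtil warning is to raise the log level for the parse code in log4j.properties and re-run the job, as was done with ParserJob earlier in this thread. The exact logger names below are an assumption; adjust them to the classes of interest:

  echo "log4j.logger.org.apache.nutch.parse=DEBUG" >> runtime/local/conf/log4j.properties
  echo "log4j.logger.org.apache.nutch.parse.tika=DEBUG" >> runtime/local/conf/log4j.properties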
Re: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
type > > > application/pdf, but they are not mapped to it in the > parse-plugins.xml > > > file > > > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika > parser > > > for mime-type application/pdf > > > 2012-10-15 15:43:36,969 WARN parse.ParseUtil - Unable to successfully > > > parse content > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of > > > type application/pdf > > > > > > The config file nutch-site.xml is as below: > > > > > > > > > > > > > > > > > > > > > > http.agent.name > > > My Nutch Spider > > > > > > > > > > > > plugin.folders > > > > /Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins > > > > > > > > > > > > > > > plugin.includes > > > > > > > protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic) > > > > > > > > > > > > > > > metatags.names > > > * > > > Names of the metatags to extract, separated by;. > > > Use '*' to extract all metatags. Prefixes the names with 'metatag.' > > > in the parse-metadata. For instance to index description and > keywords, > > > you need to activate the plugin index-metadata and set the value of > the > > > parameter 'index.parse.md' to > 'metatag.description;metatag.keywords'. > > > > > > > > > > > > index.parse.md > > > > > > > dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion > > > > > > > > > Comma-separated list of keys to be taken from the parse metadata to > > > generate fields. > > > Can be used e.g. for 'description' or 'keywords' provided that these > > > values are generated > > > by a parser (see parse-metatags plugin) > > > > > > > > > > > > http.content.limit > > > -1 > > > > > > > > > > > > Are there any configuration settings that i need to do to work with pdf > > files ? I have parsed them before and crawled but i am not sure which is > > causing the error now. > > > > Can someone please point the cause of the errors above ? > > > > Many Thanks, > > -- > > Kiran Chitturi > > > -- Kiran Chitturi
Re: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
> > > > > > > > > > > > > http.agent.name > > > > > My Nutch Spider > > > > > > > > > > > > > > > > > > > > plugin.folders > > > > > > > > > /Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins > > > > > > > > > > > > > > > > > > > > > > > > > plugin.includes > > > > > > > > > > > > > > protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic) > > > > > > > > > > > > > > > > > > > > > > > > > metatags.names > > > > > * > > > > > Names of the metatags to extract, separated by;. > > > > > Use '*' to extract all metatags. Prefixes the names with > 'metatag.' > > > > > in the parse-metadata. For instance to index description and > > > keywords, > > > > > you need to activate the plugin index-metadata and set the value > of > > > the > > > > > parameter 'index.parse.md' to > > > 'metatag.description;metatag.keywords'. > > > > > > > > > > > > > > > > > > > > index.parse.md > > > > > > > > > > > > > > dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion > > > > > > > > > > > > > > > Comma-separated list of keys to be taken from the parse metadata > to > > > > > generate fields. > > > > > Can be used e.g. for 'description' or 'keywords' provided that > these > > > > > values are generated > > > > > by a parser (see parse-metatags plugin) > > > > > > > > > > > > > > > > > > > > http.content.limit > > > > > -1 > > > > > > > > > > > > > > > > > > > > Are there any configuration settings that i need to do to work > with pdf > > > > files ? I have parsed them before and crawled but i am not sure > which is > > > > causing the error now. > > > > > > > > Can someone please point the cause of the errors above ? > > > > > > > > Many Thanks, > > > > -- > > > > Kiran Chitturi > > > > > > > > > > > > > > > -- > > Kiran Chitturi > > > -- Kiran Chitturi
Re: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
What configuration did you use in nutch-site.xml? Does it work for you with the 2.x version?

Thanks,
Kiran.

On Mon, Oct 15, 2012 at 5:20 PM, Markus Jelsma wrote:

> Hi,
>
> It complains about not finding a Tika parser for the content type - did
> you modify parse-plugins.xml? I can run it with a vanilla 1.4, but it
> fails because of PDFBox. I can parse it successfully with trunk. 1.5 is
> not going to work, not because it cannot find the TikaParser for PDFs
> but because PDFBox cannot handle it.
>
> Cheers,
>
> -----Original message-----
> > From: kiran chitturi
> > Sent: Mon 15-Oct-2012 21:58
> > To: user@nutch.apache.org
> > Subject: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
> >
> > Hi,
> >
> > I am trying to parse PDF files using Nutch, and it fails every time with
> > the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
> > Unable to successfully parse content' in both the Nutch 1.5 and 2.x series
> > when I run the command 'sh bin/nutch parsechecker
> > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
> >
> > The hadoop.log looks like this:
> >
> > 2012-10-15 15:43:32,323 INFO http.Http - http.proxy.host = null
> > 2012-10-15 15:43:32,323 INFO http.Http - http.proxy.port = 8080
> > 2012-10-15 15:43:32,323 INFO http.Http - http.timeout = 1
> > 2012-10-15 15:43:32,323 INFO http.Http - http.content.limit = -1
> > 2012-10-15 15:43:32,323 INFO http.Http - http.agent = My Nutch Spider/Nutch-2.2-SNAPSHOT
> > 2012-10-15 15:43:32,323 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> > 2012-10-15 15:43:32,323 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > 2012-10-15 15:43:36,851 INFO parse.ParserChecker - parsing: http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
> > 2012-10-15 15:43:36,851 INFO parse.ParserChecker - contentType: application/pdf
> > 2012-10-15 15:43:36,858 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > 2012-10-15 15:43:36,904 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
> > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
> > 2012-10-15 15:43:36,969 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of type application/pdf
> >
> > The config file nutch-site.xml is as below:
> >
> > <configuration>
> >
> >   <property>
> >     <name>http.agent.name</name>
> >     <value>My Nutch Spider</value>
> >   </property>
> >
> >   <property>
> >     <name>plugin.folders</name>
> >     <value>/Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins</value>
> >   </property>
> >
> >   <property>
> >     <name>plugin.includes</name>
> >     <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   </property>
> >
> >   <property>
> >     <name>metatags.names</name>
> >     <value>*</value>
> >     <description>Names of the metatags to extract, separated by ';'.
> >     Use '*' to extract all metatags. Prefixes the names with 'metatag.'
> >     in the parse-metadata. For instance, to index description and keywords,
> >     you need to activate the plugin index-metadata and set the value of the
> >     parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.</description>
> >   </property>
> >
> >   <property>
> >     <name>index.parse.md</name>
> >     <value>dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion</value>
> >     <description>Comma-separated list of keys to be taken from the parse metadata
> >     to generate fields. Can be used e.g. for 'description' or 'keywords' provided
> >     that these values are generated by a parser (see parse-metatags plugin).</description>
> >   </property>
> >
> >   <property>
> >     <name>http.content.limit</name>
> >     <value>-1</value>
> >   </property>
> >
> > </configuration>
> >
> > Are there any configuration settings that I need to set to work with PDF
> > files? I have parsed and crawled them before, but I am not sure what is
> > causing the error now.
> >
> > Can someone please point out the cause of the errors above?
> >
> > Many Thanks,
> > Kiran Chitturi

--
Kiran Chitturi
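A quick way to test Markus's point that the failure sits in PDFBox rather than in the plugin mapping is to feed the same PDF to the standalone Tika runnable jar, outside Nutch. This is only a sketch under assumptions: the tika-app jar version below is a placeholder, and --text is tika-app's plain-text extraction switch.

    # Download the PDF that Nutch fails on.
    curl -s -O http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf

    # Ask standalone Tika (which delegates PDFs to PDFBox) to extract its text.
    # If this also fails, the problem is in Tika/PDFBox itself, not in Nutch's
    # parse-plugins.xml mapping. The jar version is a placeholder - use whatever
    # matches the Tika bundled with your Nutch build.
    java -jar tika-app-1.2.jar --text katsioloudis.pdf > katsioloudis.txt

If the standalone extraction succeeds but Nutch still fails, the mapping/configuration side is the more likely culprit.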