Hi Lewis, I have never used a patch before, but after searching a bit I managed to apply the patch in Cygwin. (I had to reinstall Cygwin with the patch tool, as the patch command was not present in the previous install.)
I applied the patch, skipping the pom.xml file, and it worked. I can now copy all the crawled URLs to MongoDB. I can get the HTML content of crawled URLs from the readseg -dump command in Nutch 1.6, so I guess it will be possible to get the full HTML along with just the text part?

lewis john mcgibbney wrote
> Hi Peter
>
> On Saturday, February 16, 2013, peterbarretto <peterbarretto08@gmail.>
>> Where do I make the pom.xml changes? I can't find the file.
>
> What are you talking about? I made a patch which pulls everything for you.
> There should be no changes required.
>
>> I haven't built the patch changes as I can't find the pom.xml file.
>
> The Maven project file is in the root project. We do not build Nutch with
> Maven. Currently for development we use Ant tasks and Ivy for
> dependencies.
>
>> lewis john mcgibbney wrote
>>> https://issues.apache.org/jira/browse/NUTCH-1528
>>>
>>> This is the mongodb indexer patch ported to trunk.
>>>
>>> Can I mention that there is usually no time line on these things e.g.
>>> feature requests.
>>> I'm sure you can appreciate that we are all extremely busy at work with an
>>> array of other things so if it takes a bit of time, then that's OK. The
>>> world goes on and keeps spinning. Even if we are getting bombarded by
>>> meteorites in Russia!!!
>>>
>>> Please check the patch out and comment accordingly.
>>>
>>> Regarding your issue with regards to the full page content, I am not sure
>>> if this is currently available in Nutch trunk without you writing some
>>> code.
>>> Full html markup is certainly stored in 2.x... but I don't know whether you
>>> are prepared to move to 2.x for your operations?
>>>
>>> hth
>>> Lewis
>>>
>>> On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@> wrote:
>>>
>>>> Hi Lewis,
>>>>
>>>> Is this patch done??
>>>>
>>>> lewis john mcgibbney wrote
>>>> > Hi,
>>>> > Once I get access to my office I am going to build the patches from trunk.
>>>> > Is it trunk that you are using?
>>>> > Thanks
>>>> > Lewis
>>>> >
>>>> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@> wrote:
>>>> >
>>>> >> Hi Lewis,
>>>> >>
>>>> >> I managed to get the code working by adding the below function to
>>>> >> MongodbWriter.java in the public class MongodbWriter implements
>>>> >> NutchIndexWriter :-
>>>> >>
>>>> >> public void delete(String key) throws IOException{
>>>> >>     return;
>>>> >> }
>>>> >>
>>>> >> And the crawled data was getting stored in mongodb.
>>>> >> The only issue was that it was storing only the text of the page and
>>>> >> not the full html content of the page.
>>>> >> How do I store the full html content of the page also?
>>>> >> Hope to see the patches soon.
>>>> >> Thanks
>>>> >>
>>>> >> lewis john mcgibbney wrote
>>>> >> > Certainly.
>>>> >> > I am currently reviewing the code and will hopefully have patches for
>>>> >> > Nutch trunk cooked up for tomorrow.
>>>> >> > I'll update this thread likewise.
>>>> >> > Thanks
>>>> >> > Lewis
>>>> >> >
>>>> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote:
>>>> >> >> Hi Lewis,
>>>> >> >>
>>>> >> >> I am new to Java and I don't know how to inherit all public methods
>>>> >> >> from NutchIndexWriter.
>>>> >> >> Can you help me with that? Then I can rebuild and check if it works.
>>>> >> >>
>>>> >> >> lewis john mcgibbney wrote
>>>> >> >>> As you will see the code has not been amended in a year or so.
>>>> >> >>> The positive side is that you only seem to be getting one issue with
>>>> >> >>> javac
>>>> >> >>>
>>>> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:
>>>> >> >>>
>>>> >> >>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>>>> >> >>>> error: MongodbWriter is not abstract and does not override abstract
>>>> >> >>>> method delete(String) in NutchIndexWriter
>>>> >> >>>> [javac] public class MongodbWriter implements NutchIndexWriter{
>>>> >> >>>
>>>> >> >>> Sort this error out by inheriting all public methods from
>>>> >> >>> NutchIndexWriter for starts. I take it you are not developing from
>>>> >> >>> within Eclipse? As this would have been flagged up immediately. This
>>>> >> >>> should at least enable you to compile the code.
>>>> >> >>>
>>>> >> >>>> I have already crawled some urls now and I need to move those to
>>>> >> >>>> mongodb.
>>>> >> >>>> Is

>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> *Lewis*

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4041066.html
Sent from the Nutch - User mailing list archive at Nabble.com.
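
For anyone landing on this thread with the same javac error: it appears whenever a non-abstract class implements an interface without defining every method the interface declares. Below is a minimal Java sketch of the no-op delete fix discussed above. It assumes a hypothetical stand-in interface, IndexWriterSketch, because the full method list of the real NutchIndexWriter is not shown in this thread; InMemoryWriter and DeleteStubDemo are likewise made-up names for illustration only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for NutchIndexWriter (assumption: the real
// interface declares more methods than the two shown here).
interface IndexWriterSketch {
    void write(String doc) throws IOException;
    void delete(String key) throws IOException;
}

// Hypothetical writer that only collects documents in memory.
// Removing delete() below reproduces the "is not abstract and does
// not override abstract method delete(String)" compile error.
class InMemoryWriter implements IndexWriterSketch {
    final List<String> docs = new ArrayList<>();

    // No I/O happens here, so the throws clause is omitted, which Java allows.
    @Override
    public void write(String doc) {
        docs.add(doc);
    }

    // No-op stub: satisfies the interface so the class compiles,
    // but deletion requests are silently ignored.
    @Override
    public void delete(String key) {
    }
}

class DeleteStubDemo {
    public static void main(String[] args) {
        InMemoryWriter writer = new InMemoryWriter();
        writer.write("<html>page one</html>");
        writer.delete("http://example.com/"); // accepted, does nothing
        System.out.println(writer.docs.size()); // prints 1
    }
}
```

A no-op stub like this is enough to get the class compiling, but it also means deletions never reach the store, so it may be worth replacing with a real implementation later.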

