Hi Lewis,

I have never used a patch before, but after searching a bit I managed to apply
the patch in Cygwin. (I had to reinstall Cygwin with the patch tool, as the
patch command was not present in the previous install.)

I applied the patch, skipping the pom.xml file, and it worked.
I can now copy all the crawled URLs to MongoDB.
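For anyone else trying this, the way I applied it was roughly as follows. This is a self-contained toy example of the GNU/Cygwin patch tool; the file names and the -p0 strip level are just illustrations, and the right strip level for the real NUTCH-1528 patch depends on how the diff was generated:

```shell
# Toy demonstration of applying a unified diff with the `patch` tool.
mkdir -p patch-demo && cd patch-demo
printf 'hello\n' > greeting.txt

# A minimal unified diff changing "hello" to "hello world".
cat > fix.patch <<'EOF'
--- greeting.txt
+++ greeting.txt
@@ -1 +1 @@
-hello
+hello world
EOF

patch -p0 --dry-run < fix.patch   # preview which hunks would apply
patch -p0 < fix.patch             # actually apply the patch
cat greeting.txt                  # now reads: hello world
```

For the real patch, running from the Nutch source root with --dry-run first shows whether any hunk (e.g. the pom.xml one) fails to apply and needs to be skipped.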

I can get the HTML content of crawled URLs from the readseg -dump command in
Nutch 1.6, so I guess it should be possible to get the full HTML along with
just the text part?
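For reference, the readseg invocation I mean looks something like this (the segment directory name below is a placeholder for a real timestamped segment; the -no* flags suppress everything except the raw fetched content):

```shell
# Dump only the raw fetched content (the full HTML) of a segment
# to a plain-text file under segdump/.
bin/nutch readseg -dump crawl/segments/20130216120000 segdump \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
less segdump/dump
```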

lewis john mcgibbney wrote
> Hi Peter
> 
> On Saturday, February 16, 2013, peterbarretto
> <peterbarretto08@gmail.>
> Where do I make the pom.xml changes? I can't find the file.
> 
> What are you talking about? I made a patch which pulls everything for you.
> There should be no changes required.
> 
>> I havent built the patch changes as i cant find pom.xml file.
> 
> The Maven project file is in the root project. We do not build Nutch with
> Maven. Currently for development we use Ant tasks and Ivy for
> dependencies.
> 
>>
>>
>> lewis john mcgibbney wrote
>>> https://issues.apache.org/jira/browse/NUTCH-1528
>>>
>>> This is the mongodb indexer patch ported to trunk.
>>>
>>> Can I mention that there is usually no timeline on these things, e.g.
>>> feature requests.
>>> I'm sure you can appreciate that we are all extremely busy at work with
>>> an array of other things, so if it takes a bit of time, then that's OK.
>>> The world goes on and keeps spinning. Even if we are getting bombarded
>>> by meteorites in Russia!!!
>>>
>>> Please check out the patch and comment accordingly.
>>>
>>> Regarding your issue with the full page content, I am not sure
>>> if this is currently available in Nutch trunk without you writing some
>>> code.
>>> Full HTML markup is certainly stored in 2.x... but I don't know whether
>>> you are prepared to move to 2.x for your operations?
>>>
>>> hth
>>> Lewis
>>>
>>> On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <
>>
>>> peterbarretto08@
>>
>>> >wrote:
>>>
>>>> Hi Lewis,
>>>>
>>>> Is this patch done??
>>>>
>>>>
>>>> lewis john mcgibbney wrote
>>>> > Hi,
>>>> > Once I get access to my office I am going to build the patches from
>>>> trunk.
>>>> > Is it trunk that you are using?
>>>> > Thanks
>>>> > Lewis
>>>> >
>>>> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <
>>>>
>>>> > peterbarretto08@
>>>>
>>>> > >wrote:
>>>> >
>>>> >> Hi Lewis,
>>>> >>
>>>> >> I managed to get the code working by adding the method below to
>>>> >> MongodbWriter.java, inside the class declared as
>>>> >> public class MongodbWriter implements NutchIndexWriter:
>>>> >>
>>>> >>     public void delete(String key) throws IOException {
>>>> >>         return;
>>>> >>     }
>>>> >>
>>>> >> And the crawled data was getting stored in MongoDB.
>>>> >> The only issue was that it was storing only the text of the page,
>>>> >> not the full HTML content of the page.
>>>> >> How do I store the full HTML content of the page as well?
>>>> >> Hope to see the patches soon.
>>>> >> Thanks
>>>> >>
>>>> >>
>>>> >>
>>>> >> lewis john mcgibbney wrote
>>>> >> > Certainly.
>>>> >> > I am currently reviewing the code and will hopefully have patches
>>>> for
>>>> >> > Nutch trunk cooked up for tomorrow.
>>>> >> > I'll update this thread likewise.
>>>> >> > Thanks
>>>> >> > Lewis
>>>> >> >
>>>> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto
>>>> >> > <
>>>> >>
>>>> >> > peterbarretto08@
>>>> >>
>>>> >> > > wrote:
>>>> >> >> Hi Lewis,
>>>> >> >>
>>>> >> >> I am new to Java and I don't know how to inherit all public
>>>> >> >> methods from NutchIndexWriter.
>>>> >> >> Can you help me with that? Then I can rebuild and check if it
>>>> >> >> works.
>>>> >> >>
>>>> >> >>
>>>> >> >> lewis john mcgibbney wrote
>>>> >> >>> As you will see the code has not been amended in a year or so.
>>>> >> >>> The positive side is that you only seem to be getting one issue
>>>> with
>>>> >> >>> javac
>>>> >> >>>
>>>> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <
>>>> >> >>
>>>> >> >>> peterbarretto08@
>>>> >> >>
>>>> >> >>> >wrote:
>>>> >> >>>
>>>> >> >>>>
>>>> >> >>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>>>> >> >>>> error: MongodbWriter is not abstract and does not override
>>>> >> >>>> abstract method delete(String) in NutchIndexWriter
>>>> >> >>>>     [javac] public class MongodbWriter implements NutchIndexWriter{
>>>> >> >>>>
>>>> >> >>> Sort this error out by inheriting all public methods from
>>>> >> >>> NutchIndexWriter for starts. I take it you are not developing
>>>> >> >>> from within Eclipse? As this would have been flagged up
>>>> >> >>> immediately. This should at least enable you to compile the code.
>>>> >> >>>
>>>> >> >>>
>>>> >> >>>>
>>>> >> >>>> I have already crawled some URLs now and I need to move those
>>>> >> >>>> to MongoDB.
>>>> >> >>>> Is
>> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
> 
> -- 
> *Lewis*





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4041066.html
Sent from the Nutch - User mailing list archive at Nabble.com.
