Re: How to get page content of crawled pages

Lewis John Mcgibbney Sun, 17 Feb 2013 08:40:45 -0800

Hi Peter

On Saturday, February 16, 2013, peterbarretto <peterbarretto08@gmail.> wrote:
> Where do I make the pom.xml changes? I can't find the file.


What are you talking about? I made a patch which pulls everything for you.
There should be no changes required.

> I haven't built the patch changes as I can't find the pom.xml file.

The Maven project file is in the root of the project. We do not build Nutch with
Maven. Currently, for development, we use Ant tasks, with Ivy for dependencies.

>
>
> lewis john mcgibbney wrote
>> https://issues.apache.org/jira/browse/NUTCH-1528
>>
>> This is the mongodb indexer patch ported to trunk.
>>
>> Can I mention that there is usually no time line on these things e.g.
>> feature requests.
>> I'm sure you can appreciate that we are all extremely busy at work with an
>> array of other things, so if it takes a bit of time, then that's OK. The
>> world goes on and keeps spinning. Even if we are getting bombarded by
>> meteorites in Russia!!!
>>
>> Please check out the patch and comment accordingly.
>>
>> Regarding your issue with the full page content, I am not sure
>> if this is currently available in Nutch trunk without you writing some
>> code.
>> Full HTML markup is certainly stored in 2.x... but I don't know whether
>> you are prepared to move to 2.x for your operations?
>>
>> hth
>> Lewis
>>
>> On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@> wrote:
>>
>>> Hi Lewis,
>>>
>>> Is this patch done??
>>>
>>>
>>> lewis john mcgibbney wrote
>>> > Hi,
>>> > Once I get access to my office I am going to build the patches from
>>> > trunk.
>>> > Is it trunk that you are using?
>>> > Thanks
>>> > Lewis
>>> >
>>> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@> wrote:
>>> >
>>> >> Hi Lewis,
>>> >>
>>> >> I managed to get the code working by adding the below function to
>>> >> MongodbWriter.java in the public class MongodbWriter  implements
>>> >> NutchIndexWriter :-
>>> >>
>>> >>         public void delete(String key) throws IOException {
>>> >>             return;
>>> >>         }
>>> >>
>>> >> And the crawled data was getting stored in mongodb.
>>> >> The only issue was that it was storing only the text of the page and not
>>> >> the full HTML content of the page.
>>> >> How do I also store the full HTML content of the page?
>>> >> Hope to see the patches soon.
>>> >> Thanks
>>> >>
>>> >>
>>> >>
>>> >> lewis john mcgibbney wrote
>>> >> > Certainly.
>>> >> > I am currently reviewing the code and will hopefully have patches for
>>> >> > Nutch trunk cooked up for tomorrow.
>>> >> > I'll update this thread likewise.
>>> >> > Thanks
>>> >> > Lewis
>>> >> >
>>> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote:
>>> >> >> Hi Lewis,
>>> >> >>
>>> >> >> I am new to Java and I don't know how to inherit all public methods from
>>> >> >> NutchIndexWriter.
>>> >> >> Can you help me with that? Then I can rebuild and check if it works.
>>> >> >>
>>> >> >>
>>> >> >> lewis john mcgibbney wrote
>>> >> >>> As you will see the code has not been amended in a year or so.
>>> >> >>> The positive side is that you only seem to be getting one issue with
>>> >> >>> javac
>>> >> >>>
>>> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:
>>> >> >>>
>>> >> >>>>
>>> >> >>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>>> >> >>>> error: MongodbWriter is not abstract and does not override abstract
>>> >> >>>> method delete(String) in NutchIndexWriter
>>> >> >>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>>> >> >>>>
>>> >> >>> Sort this error out by inheriting all public methods from
>>> >> >>> NutchIndexWriter for starters. I take it you are not developing from
>>> >> >>> within Eclipse? As this would have been flagged up immediately. This
>>> >> >>> should at least enable you to compile the code.
>>> >> >>>
>>> >> >>>
>>> >> >>>>
>>> >> >>>> I have already crawled some urls now and i need to move those to
>>> >> >>>> mongodb.
>>> >> >>>> Is
> View this message in context:
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

-- 
*Lewis*
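
For readers following the javac error quoted in this thread ("is not abstract
and does not override abstract method delete(String)"), here is a minimal,
self-contained sketch of the general Java rule behind it: a non-abstract class
must provide a body for every method its interface declares. The names
SimpleIndexWriter, InMemoryWriter and WriterSketch are hypothetical stand-ins
made up for illustration. They are not the real NutchIndexWriter API, whose
exact method list should be checked against the Nutch source you are building.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class WriterSketch {

    // Hypothetical stand-in for an index-writer interface.
    // This is NOT the real NutchIndexWriter; it only mirrors the situation.
    interface SimpleIndexWriter {
        void write(String key, String text) throws IOException;
        void delete(String key) throws IOException;
        void close() throws IOException;
    }

    // A concrete class must implement every interface method. Leaving out
    // delete(String) here would reproduce the compile error from the thread.
    static class InMemoryWriter implements SimpleIndexWriter {
        private final Map<String, String> docs = new LinkedHashMap<>();

        @Override
        public void write(String key, String text) {
            docs.put(key, text);
        }

        // No-op stub, like the workaround described in the thread. An
        // implementing method may drop the "throws" clause of the interface.
        @Override
        public void delete(String key) {
            // Intentionally does nothing: deletions are silently ignored.
        }

        @Override
        public void close() {
            // Nothing to release in this in-memory sketch.
        }

        int size() {
            return docs.size();
        }
    }

    public static void main(String[] args) {
        InMemoryWriter writer = new InMemoryWriter();
        writer.write("http://example.com/", "page text");
        writer.delete("http://example.com/"); // ignored by the stub
        writer.close();
        System.out.println("stored documents: " + writer.size());
    }
}
```

An empty delete is enough to get the class compiling, but it also means
deletions never reach the backing store, so it is best treated as a stopgap.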
