Re: [Nutch-general] Source of Outlink and how to get Outlinks in 0.9

Briggs Wed, 18 Apr 2007 14:50:39 -0700

I am adding more info to my post from what I have been looking into...

So, I have found the LinkDbReader, and it seems to be able to dump the
link database out as text. Unfortunately, it only dumps to a file, so I
would need to parse that file myself (or I might have missed
something). If this is the correct class, that will have to work. Here
is a snippet of the LinkDbReader output for a page that I crawled on
one of my test machines, which has the Apache documentation installed:


<snippet>
http://httpd.apache.org/        Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: HTTP Server

http://httpd.apache.org/docs-project/   Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Documentation
 fromUrl: http://nutchdev-1/manual/ anchor:

http://www.apache.org/  Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Apache

http://www.apache.org/foundation/preFAQ.html    Inlinks:
 fromUrl: http://nutchdev-1/ anchor: Apache web server

http://www.apache.org/licenses/LICENSE-2.0      Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Apache License, Version 2.0

</snippet>

So, am I to assume that each entry starts with the outlink (the target
URL), and that the Inlinks listed under it are the pages where that
link was found?  I'll just have to figure out the format here so I can
parse it.  I'll probably write a wrapper that exports to XML or
something similar to make transforming this easier.
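
For what it's worth, here is a rough sketch of the kind of parser I have
in mind. It assumes the dump always looks like the snippet above (a
target URL followed by "Inlinks:", then indented "fromUrl: ... anchor: ..."
lines), which I have only inferred from this one output. The class name
LinkDbDumpParser and the method outlinksBySource are made up for
illustration and are not part of Nutch. It inverts the dump so that each
source page maps to the set of URLs it links to.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Nutch) for the text dump shown above. */
class LinkDbDumpParser {

    // Markers inferred from the sample output only; treat them as assumptions.
    private static final String INLINKS = "Inlinks:";
    private static final String FROM = "fromUrl: ";
    private static final String ANCHOR = " anchor:";

    /** Maps each source page to the set of target URLs found on it. */
    static Map<String, Set<String>> outlinksBySource(List<String> dumpLines) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        String target = null; // URL of the entry currently being read
        for (String raw : dumpLines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines separate entries
            }
            if (line.endsWith(INLINKS)) {
                // Header line: "<target URL>   Inlinks:"
                target = line.substring(0, line.length() - INLINKS.length()).trim();
            } else if (target != null && line.startsWith(FROM)) {
                // Detail line: "fromUrl: <source URL> anchor: <text, may be empty>"
                int anchorAt = line.indexOf(ANCHOR, FROM.length());
                String source = (anchorAt < 0
                        ? line.substring(FROM.length())
                        : line.substring(FROM.length(), anchorAt)).trim();
                result.computeIfAbsent(source, k -> new LinkedHashSet<>()).add(target);
            }
        }
        return result;
    }

    /** Usage: java LinkDbDumpParser <path to the dumped text file> */
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (Map.Entry<String, Set<String>> e : outlinksBySource(lines).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Using a set per source page drops repeated links, such as the two
entries from http://nutchdev-1/manual/ to the docs-project page that
differ only in anchor text. The anchor text itself is thrown away here,
so a real wrapper would need to keep it if it matters.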

Anyway, am I on the right track?

Briggs.




On 4/18/07, Briggs <[EMAIL PROTECTED]> wrote:
> Is it possible to determine from which domain(s) an outlink was
> located?  The only way I know how is to limit the crawl to a single
> domain (so, I would know where the outlink came from). Also, I am
> having difficulty trying to figure out how in 0.9 (probably the same
> in 0.8) to easily get the outlinks for my segments.  In nutch 0.7.* we
> used to do something like:
>
> <snippet>
>
> segmentReader = createSegmentReader(segment);
>
> final FetcherOutput fetcherOutput = new FetcherOutput();
> final Content content                   = new Content();
> final ParseData indexParseData   = new ParseData();
> final ParseText parseText            = new ParseText();
>
> while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) 
> {
>     extractOutlinksFromParseData(indexParseData, outlinks);
> }
>
> </snippet>
>
> <snippet>
> private void extractOutlinksFromParseData(final ParseData indexParseData,
>                                           final Set<String> outlinks) {
>     for (final Outlink outlink : indexParseData.getOutlinks()) {
>         if (null != outlink && outlink.getToUrl() != null) {
>             outlinks.add(outlink.getToUrl());
>         }
>     }
> }
> </snippet>
>
> I am finally making the plunge and attempting to get this thing (my
> application) up to date with the latest and greatest!
>
> Thanks for your time!  And once I really get through this code I
> promise to start posting answers.
>
> Briggs.
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>


-- 
"Conscious decisions by conscious minds are what make reality real"
