I am adding more info to my post based on what I have been looking into. I found the LinkDbReader, and it seems to be able to dump the link database out as text. Unfortunately, it only dumps to a file, so I would need to parse that file (unless I have missed something). If this is the correct class, that will have to do. Here is a snippet of the LinkDbReader output from a crawl of one of my test machines, which has the Apache documentation installed:

<snippet>
http://httpd.apache.org/ Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: HTTP Server

http://httpd.apache.org/docs-project/ Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Documentation
 fromUrl: http://nutchdev-1/manual/ anchor:

http://www.apache.org/ Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Apache

http://www.apache.org/foundation/preFAQ.html Inlinks:
 fromUrl: http://nutchdev-1/ anchor: Apache web server

http://www.apache.org/licenses/LICENSE-2.0 Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Apache License, Version 2.0
</snippet>

So, am I right to assume that each record starts with the outlink (the target URL), and that the Inlinks entries under it list the pages where that link was found? I'll just have to figure out the format so I can parse it. I'll probably write a wrapper that exports to XML or something similar to make transforming this easier.

Anyway, am I on the right track?

Briggs.

On 4/18/07, Briggs <[EMAIL PROTECTED]> wrote:
> Is it possible to determine from which domain(s) an outlink was
> located? The only way I know how is to limit the crawl to a single
> domain (so, I would know where the outlink came from). Also, I am
> having difficulty trying to figure out how in 0.9 (probably the same
> in 0.8) to easily get the outlinks for my segments.
> In nutch 0.7.* we used to do something like:
>
> <snippet>
>
> segmentReader = createSegmentReader(segment);
>
> final FetcherOutput fetcherOutput = new FetcherOutput();
> final Content content = new Content();
> final ParseData indexParseData = new ParseData();
> final ParseText parseText = new ParseText();
>
> while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
>     extractOutlinksFromParseData(indexParseData, outlinks);
> }
>
> </snippet>
>
> <snippet>
> private void extractOutlinksFromParseData(final ParseData indexParseData,
>         final Set<String> outlinks) {
>     for (final Outlink outlink : indexParseData.getOutlinks()) {
>         if (null != outlink && outlink.getToUrl() != null) {
>             outlinks.add(outlink.getToUrl());
>         }
>     }
> }
> </snippet>
>
> I am finally taking the plunge and attempting to get this thing (my
> application) up to date with the latest and greatest!
>
> Thanks for your time! And once I really get through this code I
> promise to start posting answers.
>
> Briggs.
>
> --
> "Conscious decisions by conscious minds are what make reality real"

--
"Conscious decisions by conscious minds are what make reality real"
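
For the parsing step mentioned above, here is a minimal sketch in plain Java. It assumes the dump has the layout seen in the snippet: a target URL followed by "Inlinks:", then one "fromUrl: ... anchor: ..." entry per inlink. The class name LinkDbDumpParser and its Inlink type are hypothetical helpers, not part of Nutch, and the real dump format should be checked against your own output before relying on this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Nutch class): parses a LinkDbReader text dump. */
final class LinkDbDumpParser {

    /** One "fromUrl: ... anchor: ..." entry from the dump. */
    static final class Inlink {
        final String fromUrl;
        final String anchor;

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    // Markers as they appear in the dump shown in the post (assumed stable).
    private static final String RECORD_SUFFIX = "Inlinks:";
    private static final String FROM_PREFIX = "fromUrl:";
    private static final String ANCHOR_MARKER = " anchor:";

    /** Maps each target URL to the inlinks listed under it, in dump order. */
    static Map<String, List<Inlink>> parse(List<String> lines) {
        Map<String, List<Inlink>> result = new LinkedHashMap<>();
        List<Inlink> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank separator between records
            }
            if (line.startsWith(FROM_PREFIX)) {
                if (current == null) {
                    continue; // inlink entry before any record header: skip it
                }
                String rest = line.substring(FROM_PREFIX.length());
                int idx = rest.indexOf(ANCHOR_MARKER);
                String from = (idx >= 0 ? rest.substring(0, idx) : rest).trim();
                // The anchor text may be empty, as in the docs-project record.
                String anchor = idx >= 0
                        ? rest.substring(idx + ANCHOR_MARKER.length()).trim()
                        : "";
                current.add(new Inlink(from, anchor));
            } else if (line.endsWith(RECORD_SUFFIX)) {
                // Record header: "<target URL> Inlinks:"
                String target = line
                        .substring(0, line.length() - RECORD_SUFFIX.length())
                        .trim();
                current = result.computeIfAbsent(target, k -> new ArrayList<>());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
                "http://httpd.apache.org/ Inlinks:",
                " fromUrl: http://nutchdev-1/manual/ anchor: HTTP Server");
        for (Map.Entry<String, List<Inlink>> e : parse(dump).entrySet()) {
            for (Inlink in : e.getValue()) {
                System.out.println(in.fromUrl + " -> " + e.getKey()
                        + " [" + in.anchor + "]");
            }
        }
    }
}
```

The returned map is keyed by the link target, so answering "which pages link to this URL" is a single lookup, and an XML export would be a loop over the entries. Reading the dump with Files.readAllLines and passing the result to parse should be enough for small crawls.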
