Is it possible to determine which domain(s) an outlink was found
on? The only way I know of is to limit the crawl to a single
domain (so I would know where the outlink came from). Also, I am
having difficulty figuring out how, in 0.9 (probably the same
in 0.8), to easily get the outlinks for my segments. In Nutch 0.7.* we
used to do something like:
<snippet>
// createSegmentReader(...) and extractOutlinksFromParseData(...) are our own helpers
segmentReader = createSegmentReader(segment);
final FetcherOutput fetcherOutput = new FetcherOutput();
final Content content = new Content();
final ParseData indexParseData = new ParseData();
final ParseText parseText = new ParseText();
// walk every record in the segment and collect its outlinks
while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
    extractOutlinksFromParseData(indexParseData, outlinks);
}
</snippet>
<snippet>
private void extractOutlinksFromParseData(final ParseData indexParseData,
                                          final Set<String> outlinks) {
    for (final Outlink outlink : indexParseData.getOutlinks()) {
        if (null != outlink && outlink.getToUrl() != null) {
            outlinks.add(outlink.getToUrl());
        }
    }
}
</snippet>
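To make the first question concrete, here is a rough sketch of the result I am after, in plain Java with no Nutch classes. It assumes I can somehow get each fetched page's URL together with its outlink URLs (how to do that in 0.9 is exactly what I am asking). The names OutlinkSources, hostOf and sourceHostsByOutlink are made up for illustration, and "domain" is approximated by the host part of the page URL.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch: maps each outlink URL to the
// set of hosts whose pages contained it.
final class OutlinkSources {

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(final String url) {
        try {
            final String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /**
     * Input: source page URL -> outlink URLs found on that page.
     * Output: outlink URL -> hosts of the pages that link to it.
     */
    static Map<String, Set<String>> sourceHostsByOutlink(
            final Map<String, List<String>> outlinksByPage) {
        final Map<String, Set<String>> result = new TreeMap<>();
        for (final Map.Entry<String, List<String>> page : outlinksByPage.entrySet()) {
            final String host = hostOf(page.getKey());
            if (host == null) {
                continue; // skip pages whose URL has no usable host
            }
            for (final String toUrl : page.getValue()) {
                if (toUrl != null) {
                    result.computeIfAbsent(toUrl, k -> new TreeSet<>()).add(host);
                }
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        // Made-up sample data standing in for what a segment would provide.
        final Map<String, List<String>> pages = new HashMap<>();
        pages.put("http://a.example.com/index.html",
                Arrays.asList("http://x.example.org/", "http://y.example.org/"));
        pages.put("http://b.example.net/page",
                Arrays.asList("http://x.example.org/"));
        System.out.println(sourceHostsByOutlink(pages));
    }
}
```

The point is only that the source page URL has to travel with each outlink; in the 0.7 loop above we throw that information away by adding just getToUrl() to the set.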
I am finally taking the plunge and attempting to get this thing (my
application) up to date with the latest and greatest!
Thanks for your time! And once I really get through this code I
promise to start posting answers.
Briggs.
--
"Conscious decisions by conscious minds are what make reality real"