parse-(html) does follow links of full html page, parse-(tika) does not follow any
links and stops at level 1
--------------------------------------------------------------------------------------------------------
Key: NUTCH-817
URL: https://issues.apache.org/jira/browse/NUTCH-817
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.1
Environment: Suse linux 11.1, java version "1.6.0_13"
Reporter: matthew a. grisius
Submitted per Julien Nioche. I did not see where to attach a file, so I pasted
it here. BTW: the Tika command line returns an empty HTML body for this file.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
<!--NewPage-->
<HTML>
<HEAD>
<!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->
<TITLE>
Matrix Application Development Kit
</TITLE>
<SCRIPT type="text/javascript">
targetPage = "" + window.location.search;
if (targetPage != "" && targetPage != "undefined")
targetPage = targetPage.substring(1);
function loadFrames() {
if (targetPage != "" && targetPage != "undefined")
top.classFrame.location = top.targetPage;
}
</SCRIPT>
<NOSCRIPT>
</NOSCRIPT>
</HEAD>
<FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
<FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
<FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
<FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and
interfaces (except non-static nested types)">
</FRAMESET>
<FRAME src="overview-summary.html" name="classFrame" title="Package, class and
interface descriptions" scrolling="yes">
<NOFRAMES>
<H2>
Frame Alert</H2>
<P>
This document is designed to be viewed using the frames feature. If you see
this message, you are using a non-frame-capable web client.
<BR>
Link to<A HREF="overview-summary.html">Non-frame version.</A>
</NOFRAMES>
</FRAMESET>
</HTML>
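
For reference, the outlinks a parser would be expected to find in the page above are the three FRAME src targets plus the NOFRAMES anchor. The following is only a minimal sketch in plain Java to list them. It is not Nutch or Tika code: the class name FrameLinkSketch, the method extractLinks, and the regex are assumptions made up for illustration, and a regex is not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Tika.
public class FrameLinkSketch {

    // Matches src="..." on FRAME tags and href="..." on A tags, ignoring case.
    // FRAMESET is not matched because of the word boundary after "frame".
    private static final Pattern LINK = Pattern.compile(
        "<(?:frame\\b[^>]*?\\bsrc|a\\b[^>]*?\\bhref)\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the link targets in document order, duplicates included.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the frameset from this report.
        String html =
            "<FRAMESET cols=\"20%,80%\">\n"
            + "<FRAME src=\"overview-frame.html\" name=\"packageListFrame\">\n"
            + "<FRAME src=\"allclasses-frame.html\" name=\"packageFrame\">\n"
            + "<FRAME src=\"overview-summary.html\" name=\"classFrame\">\n"
            + "<NOFRAMES>\n"
            + "Link to<A HREF=\"overview-summary.html\">Non-frame version.</A>\n"
            + "</NOFRAMES>\n"
            + "</FRAMESET>\n";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

A parser that returns an empty body for this page yields none of these targets, which would be consistent with the crawl stopping at level 1.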
--
This message is automatically generated by JIRA.