Hello Juan,
Thursday, May 18, 2006, 10:18:36 AM, you wrote:
I don't think using an HTML meta tag this way is a good idea. It will
lead to invalid HTML.
The Google AdSense bot uses HTML comments (if present) to determine
which content to use for targeting. Nutch could use the same approach.
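
To make the comment-based idea concrete, here is a minimal sketch in Python of how a crawler could drop marked sections before indexing. The marker strings and the function name are made up for this illustration; they are not an existing Nutch feature, and they are not the markers AdSense itself uses. Because the markers are ordinary HTML comments, browsers ignore them and the page stays valid.

```python
import re

# Hypothetical marker comments, chosen only for this sketch. Neither Nutch
# nor AdSense is claimed to recognise these exact strings.
START = "<!-- noindex-start -->"
END = "<!-- noindex-end -->"

# Non-greedy match from a START marker to the nearest END marker.
# DOTALL lets an excluded region span several lines.
_EXCLUDED = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)


def strip_excluded_sections(html: str) -> str:
    """Return html with every START...END region (markers included) removed."""
    return _EXCLUDED.sub("", html)


page = (
    "<body><p>Article text</p>"
    "<!-- noindex-start --><ul><li>Navigation</li></ul><!-- noindex-end -->"
    "<p>More article text</p></body>"
)
indexable = strip_excluded_sections(page)
```

A regex is enough here because the markers are fixed strings. An unmatched start marker is simply left in place, so a real implementation would have to decide how to treat that case.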
Hello,
I propose an idea. You could use a special tag, like meta, in the body. This
tag is not shown by the browser and does not need an HTML comment.
HELLO
HELLO NO INDEX
"Nutch N
On 5/16/06, Alexander E Genaud <[EMAIL PROTECTED]> wrote:
Hello,
As far as I understand, /robots.txt designates which files may and may
not be indexed by Nutch and other crawlers. However, is there a
method by which a site may exclude only sections of a document?
The benefit is most evident i
Thanks for getting back to me Jérôme,
Would you suggest I jump into the Tokenizer? Would we need to
differentiate indexing, summaries, and/or anchors (as Google claims to
do)? Should I target 0.7.2 or 0.8-dev?
As an aside, perhaps we should add a modified-date field (as NutchWax and
others do).
Ale
As far as I understand, /robots.txt designates which files may and may
not be indexed by Nutch and other crawlers. However, is there a
method by which a site may exclude only sections of a document?
Some methods I've seen include:
If there is no such feature and this is deemed useful, I would
Hello,
As far as I understand, /robots.txt designates which files may and may
not be indexed by Nutch and other crawlers. However, is there a
method by which a site may exclude only sections of a document?
The benefit is most evident in the search hit result description
(snippets) which will o