Hi,
I've found a lot of garbage produced by the language identifier, most likely
caused by it relying on the HTTP header as the first hint for the language.
Instead of a nice tight list of ISO codes I've got an index full of garbage,
making me unable to select a language. The lang field now
Hi,
Is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have
thought that this would have been dealt with in Tika, however I have seen no
mention of anyone having problems extracting this from web documents when
fetching with Nutch or even discussing it.
For example say I had
Hi,
I’m new to Nutch and I want to detect the language of every hit returned in
the websites. I found out there is a plugin called Language-Identifier but
it didn’t work!
I tried to edit nutch-site.xml as it says here
Hi Markus,
I think this is a good shout, and it is not hard to understand the points
you make. Quite clearly, good practice relating to the inclusion of accurate
and useful language information (as well as other types of information) in
HTTP headers is not a reality and it wouldn't be suitable
Hi,
I am writing my own plugin, which includes a lib in
src/plugin/myownplugin.
In Eclipse, I have no problem running the plugin, which relies upon the jar
files under src/plugin/myownplugin/lib.
In Linux, however, the plugin can’t access the jar files at run-time
even
You simply need to write an HTMLParser; it receives the DOM representation
of the page from parse-tika (or parse-html). See JIRA for the entry on the
metatag parser for an example and discussion. There is usually no need to
modify parse-html or parse-tika at all
Julien
On 17 July 2011 16:23, lewis
Hello,
How is encoding detection done? At what stage? Fetching? or Parsing?
Because some pages that I am working with are full of errors, such as
declaring the encoding meta, wrong, or double, etc.
Could it be possible to tell nutch use this encoding for this site?
Best Regards,
C.B.
Would adding the plugin to src/plugin/build.xml help?
This is also a problem I am having. I am confused about what to specify and
where due to the structure of Nutch and its runtime directory after
building.
Any help or discussion would be great.
Thanks
2011/7/17 jeffersonzhou
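On the build.xml question above: in Nutch 1.x, plugin directories under src/plugin are wired into the build by entries in src/plugin/build.xml, so a custom plugin usually does need a line there. A sketch follows; the myownplugin directory name is taken from the earlier message, and the exact target list varies by Nutch version, so treat this as an illustration rather than a drop-in file:

```xml
<!-- src/plugin/build.xml (excerpt, illustrative): register the custom
     plugin alongside the bundled ones inside the deploy target -->
<target name="deploy">
  <!-- existing bundled-plugin entries stay as they are -->
  <ant dir="myownplugin" target="deploy"/>
</target>
```

The plugin also has to be enabled at run time via the plugin.includes regular expression in nutch-site.xml, or Nutch will not load it even after a successful build.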
The WebGraph and LinkRank classes work together. The WebGraph is where
links from either the same domain or the same host can be ignored (or
allowed). The configuration parameters:
link.ignore.internal.host = true|false
link.ignore.internal.domain = true|false
can be used to change that
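Set in nutch-site.xml, the two properties above would look something like this (the true/false values here are just example choices):

```xml
<!-- nutch-site.xml: control whether WebGraph ignores internal links -->
<property>
  <name>link.ignore.internal.host</name>
  <value>true</value> <!-- ignore links between pages on the same host -->
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>true</value> <!-- ignore links between pages on the same domain -->
</property>
```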
A couple of comments:
1) I went and did some basic edits, please revert my changes if I
didn't grok what you meant! I am new to Nutch and don't have your expertise.
2) In the first paragraph, I think we should remove the mention that waycool
did the original tutorial. If that is
Hi,
What the optimal parameters would be requires some experimentation.
But with the right db.fetch.interval.max between two fetches (in
nutch-default.xml) and a scheduled daily crawl you would be able to crawl
through all of the pages eventually. Here you may like to restrict the crawls to the
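For reference, the interval mentioned above is a plain property (conventionally overridden in nutch-site.xml rather than edited in nutch-default.xml); the value is in seconds, and the one-week figure below is only an example:

```xml
<!-- nutch-site.xml: cap the re-fetch interval (value in seconds) -->
<property>
  <name>db.fetch.interval.max</name>
  <value>604800</value> <!-- example: re-fetch at most after one week -->
</property>
```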