Garbage with languageidentifier

2011-07-17 Thread Markus Jelsma
Hi, I've found a lot of garbage produced by the language identifier, most likely caused by it relying on the HTTP header as the first hint for the language. Instead of a nice tight list of ISO codes I've got an index full of garbage, making it impossible to select a language. The lang field now

Extracting triples tags or hash tags from html

2011-07-17 Thread lewis john mcgibbney
Hi, Is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have thought that this would have been dealt with in Tika; however, I have seen no mention of anyone having problems extracting this from web documents when fetching with Nutch, or even discussing it. For example, say I had

Language-Identifier plugin

2011-07-17 Thread Malik
Hi, I'm new to Nutch and I want to detect the language of every hit returned from the websites. I found out there is a plugin called Language-Identifier, but it didn't work! I tried to edit nutch-site.xml as it says here
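
A minimal sketch of the kind of nutch-site.xml override that enables the plugin. The exact plugin list should be copied from plugin.includes in your own nutch-default.xml (the value below is only illustrative) with language-identifier appended:

  <!-- nutch-site.xml: make sure language-identifier is part of plugin.includes -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  </property>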

Re: Garbage with languageidentifier

2011-07-17 Thread lewis john mcgibbney
Hi Markus, I think this is a good shout, and it is not hard to understand the points you make. Quite clearly, good practice relating to the inclusion of accurate and useful language information (as well as other types of information) in HTTP headers is not a reality and it wouldn't be suitable

How to access to jar files in the lib folder of a user-defined plugin

2011-07-17 Thread jeffersonzhou
Hi, I am writing my own plugin, which includes a lib directory in src/plugin/myownplugin. In Eclipse, I have no problem running the plugin, which relies upon the jar files under src/plugin/myownplugin/lib. On Linux, however, the plugin can't access the jar files at run time, even
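
One common cause, sketched below: jars dropped into the plugin's lib/ directory also have to be declared in the plugin descriptor, otherwise the plugin classloader never sees them outside the IDE. The jar names here are hypothetical; the structure follows the plugin.xml files of the bundled plugins:

  <!-- src/plugin/myownplugin/plugin.xml -->
  <runtime>
    <library name="myownplugin.jar">
      <export name="*"/>
    </library>
    <!-- one entry per third-party jar shipped with the plugin -->
    <library name="some-dependency.jar"/>
  </runtime>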

Re: Extracting triples tags or hash tags from html

2011-07-17 Thread Julien Nioche
You simply need to write an HtmlParseFilter; these receive the DOM representation of the page from parse-tika (or parse-html). See JIRA for the entry on the metatag parser for an example and discussion. There is usually no need to modify parse-html or tika at all. Julien
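
A minimal sketch of such a filter, assuming the Nutch 1.x HtmlParseFilter extension point; the class name, the hashtag regex and the "hashtag" metadata key are made up for illustration:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;
  import org.w3c.dom.Node;

  /** Pulls #hashtags out of the page text and stores them in the parse
      metadata, where a custom indexing filter can later pick them up. */
  public class HashTagParseFilter implements HtmlParseFilter {

    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
      // parse-html / parse-tika hand over the DOM of the page in 'doc'
      StringBuilder text = new StringBuilder();
      collectText(doc, text);
      Parse parse = parseResult.get(content.getUrl());
      Matcher m = HASHTAG.matcher(text);
      while (m.find()) {
        parse.getData().getParseMeta().add("hashtag", m.group(1));
      }
      return parseResult;
    }

    /** Recursively concatenates all text nodes under the given node. */
    private void collectText(Node node, StringBuilder sb) {
      if (node.getNodeType() == Node.TEXT_NODE) {
        sb.append(node.getNodeValue()).append(' ');
      }
      for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
        collectText(c, sb);
      }
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

The filter is wired in through the plugin's own plugin.xml (extension point org.apache.nutch.parse.HtmlParseFilter) and by adding the plugin to plugin.includes.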

custom encoding - or encoding detection does not work

2011-07-17 Thread Cam Bazz
Hello, How is encoding detection done? At what stage, fetching or parsing? Some pages that I am working with are full of errors, such as declaring the encoding meta tag wrongly, or twice, etc. Would it be possible to tell Nutch to use a given encoding for a given site? Best Regards, C.B.
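
Detection happens at parse time, not at fetch time; the fetcher only stores the raw bytes and the HTTP headers. For the fallback case there is a global setting; a sketch of what the nutch-site.xml override could look like is below. Note that parser.character.encoding.default is only used when the charset cannot be detected, and it is global, not per-site; a per-site rule would need a custom parse filter:

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>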

Re: How to access to jar files in the lib folder of a user-defined plugin

2011-07-17 Thread Markus Jelsma
Would adding the plugin to src/plugin/build.xml help? This is also a problem I am having. I am confused about what to specify and where, due to the structure of Nutch and its runtime directory after building. Any help or discussion would be great. Thanks
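
For reference, a sketch of what the registration could look like; the existing src/plugin/build.xml entries follow this pattern, and a matching line is usually needed in the file's other targets (clean, test) as well. "myownplugin" is the hypothetical directory name from the original question:

  <target name="deploy">
    <!-- existing plugin entries stay as they are -->
    <ant dir="myownplugin" target="deploy"/>
  </target>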

Re: LinkRank scores

2011-07-17 Thread Dennis Kubes
The WebGraph and LinkRank classes work together. The WebGraph is where links from either the same domain or the same host can be ignored (or allowed). The configuration parameters link.ignore.internal.host = true|false and link.ignore.internal.domain = true|false can be used to change that
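
A sketch of the corresponding nutch-site.xml override, assuming (as the thread implies) that internal links are ignored by default and you want them counted instead:

  <property>
    <name>link.ignore.internal.host</name>
    <value>false</value>
  </property>
  <property>
    <name>link.ignore.internal.domain</name>
    <value>false</value>
  </property>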

Re: The correct tutorial on the home page?

2011-07-17 Thread Eric Pugh
A couple of comments: 1) I went and did some basic edits; please revert my changes if I didn't grok what you meant! I am new to Nutch and don't have your expertise. 2) In the first paragraph, I think we should remove the mention that waycool did the original tutorial. If that is

Re: some questions about the crawling with Nutch

2011-07-17 Thread tamanjit.bin...@yahoo.co.in
Hi, What the optimal parameters would be requires some experimentation. But with the right db.fetch.interval.max between two fetches (in nutch-default.xml) and a scheduled daily crawl, you would be able to crawl through all of the pages eventually. Here you may like to restrict the crawls to the
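
A sketch of what such an override could look like; overrides normally go into nutch-site.xml rather than nutch-default.xml, and the values (in seconds: one week and thirty days here) are purely illustrative. db.fetch.interval.default controls the nominal re-fetch interval, db.fetch.interval.max caps it:

  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <value>2592000</value>
  </property>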