Re: Differences between 2.1 and 1.6

2013-02-25 Thread Lewis John Mcgibbney
Hi Markus, This is very useful, thank you. Lewis On Mon, Feb 25, 2013 at 3:08 PM, Markus Jelsma wrote: > Something seems to be missing here. It's clear that 1.x has more features > and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a > lot better if you are going to crawl on a

RE: Differences between 2.1 and 1.6

2013-02-25 Thread Markus Jelsma
Something seems to be missing here. It's clear that 1.x has more features and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better if you are going to crawl on a very large scale, but I still haven't seen any numbers to support this assumption. Nutch 1.x can easily deal

Re: Differences between 2.1 and 1.6

2013-02-25 Thread Lewis John Mcgibbney
Hi Danilo, You can check out the architecture changes here http://wiki.apache.org/nutch/#Nutch_2.x Nutch trunk (1.7-SNAPSHOT) is here http://svn.apache.org/repos/asf/nutch/trunk/ 2.x is here http://svn.apache.org/repos/asf/nutch/branches/2.x/ On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes <

Re: Differences between 2.1 and 1.6

2013-02-25 Thread Tejas Patil
Hi Danilo, On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes < dan...@kelsorfernandes.com.br> wrote: > Hi everyone, > > Can somebody tell me about the differences between 2.1 and 1.6? > [1] and [2] would be informative reads. > > Is the SVN trunk 1.* or 2.*? > Trunk [3] is 1.x. 2.X can be found h

Differences between 2.1 and 1.6

2013-02-25 Thread Danilo Fernandes
Hi everyone, Can somebody tell me about the differences between 2.1 and 1.6? Is the SVN trunk 1.* or 2.*? Thanks, Danilo Fernandes

Re: Nutch 2.1 - Image / Video Search

2013-02-25 Thread J. Delgado
If you're interested in pure image search you may want to use Nutch for crawling but something like imgseek (http://www.imgseek.net/isk-daemon) for indexing and search. -J On Monday, February 25, 2013, Jorge Luis Betancourt Gonzalez wrote: > Hi: > > Like Raja said, it's possible. The thing is

Nutch 2.1 MySQL setup character encoding

2013-02-25 Thread jazz
Now with correct headings (I started this mail from an old mail with an old thread in it...) Hi, How do I set up Nutch to crawl correctly using the UTF-8 character set? This does n

RE: Nutch status info on each domain individually

2013-02-25 Thread Markus Jelsma
Well, you can always use the DomainStatistics utilities to get the raw numbers on hosts, domains and TLDs, but this won't tell you whether a domain has been fully crawled because the crawling frontier can always change. You can be sure that everything (disregarding URL filters) has been crawled if

Nutch 2.1 MySQL setup character encoding

2013-02-25 Thread jazz
Hi, How do I set up Nutch to crawl correctly using the UTF-8 character set? This does not work: http://nlp.solutions.asia/?p=180 I am using Nutch 2.1, Solr 4.0 and MySQL 5.5.30. This is the error during the parser job: Caused by: java.sql.SQLException: Incorrect string value: '\xEF\xBB\xBF Ir..
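A side note on the bytes quoted in that error: `\xEF\xBB\xBF` is the UTF-8 encoding of the byte order mark (U+FEFF). The sketch below only shows what those bytes decode to and how Python's `utf-8-sig` codec drops a leading BOM. It is an illustration of the byte sequence, not a confirmed fix for the MySQL error, and the sample value is just the fragment visible in the message.

```python
# The fragment quoted in the error message, as raw bytes.
raw = b"\xef\xbb\xbf Ir"

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec removes a single leading BOM, if present.
stripped = raw.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```

Whether removing the BOM before storage is enough depends on the table and connection character sets, which the message does not show.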

Re: Nutch status info on each domain individually

2013-02-25 Thread Tejas Patil
I can't think of any existing Nutch utility which can be used here. Maybe dumping the crawldb and then grepping over it would be reasonable if the number of hosts is large and the crawldb is small. This will be a bad idea if this has to be done after every Nutch cycle on a large crawldb. If you are r
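The "dump the crawldb and grep over it" idea can be sketched roughly as follows. This is a minimal sketch under assumptions: it expects a plain-text dump in which each record starts with a line holding the URL followed by a tab, and a later line of the form `Status: N (db_name)`. That layout, the function name `status_by_host` and the sample records are illustrative and should be checked against a real dump.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Assumed status line layout, e.g. "Status: 2 (db_fetched)".
STATUS_RE = re.compile(r"^Status:\s+\d+\s+\((\w+)\)")

def status_by_host(dump_lines):
    """Count crawldb statuses per host from a text dump (assumed format)."""
    result = defaultdict(Counter)
    host = None
    for line in dump_lines:
        if line.startswith(("http://", "https://")):
            # Record header: the URL is everything before the first tab.
            host = urlsplit(line.split("\t", 1)[0].strip()).hostname
            continue
        match = STATUS_RE.match(line)
        if match and host:
            result[host][match.group(1)] += 1
    return result

# Hypothetical sample records, not taken from a real crawl.
sample = [
    "http://example.com/\tVersion: 7",
    "Status: 2 (db_fetched)",
    "http://example.com/about\tVersion: 7",
    "Status: 1 (db_unfetched)",
    "http://example.org/\tVersion: 7",
    "Status: 2 (db_fetched)",
]

stats = status_by_host(sample)
for name, counts in sorted(stats.items()):
    print(name, dict(counts))
```

As the message notes, a scan like this reads the whole dump, so it is only practical when the crawldb is small.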

Nutch status info on each domain individually

2013-02-25 Thread imehesz
Hello, I can finally run Nutch (+Solr) with Java. My only question left is, how can I check whether a particular domain has been crawled? Let's say I have 300 sites to crawl and index. So far my work-around was to execute a simple Solr query for each domain URL, and see if the indexing timestamp i

Re: regex-urlfilter file for multiple domains

2013-02-25 Thread Tejas Patil
Hey Danilo, On Mon, Feb 25, 2013 at 7:09 AM, Danilo Fernandes < dan...@kelsorfernandes.com.br> wrote: > Hello, > > > I started with crawling a site and I didn't have any problems. But I need to > define criteria for each domain. > > > > How can I create a different regex-urlfilter for each of them? >
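For readers unfamiliar with the file being discussed: regex-urlfilter.txt holds one rule per line, each a `+` (accept) or `-` (reject) followed by a regular expression, and the first rule that matches a URL decides its fate. The sketch below emulates that first-match behaviour to show how per-domain criteria can live in a single rule list. It is not necessarily what the reply goes on to recommend. The domains, the rules and the helper names `parse_rules` and `accepts` are hypothetical, and Python's `re` stands in for Java regular expressions, which differ in some details.

```python
import re

# Hypothetical rule set: one shared rule plus per-domain rules.
RULES_TEXT = r"""
# skip image files on any host
-\.(gif|jpg|png)$
# site A: only the /docs/ subtree
+^https?://a\.example\.com/docs/
# site B: everything else on the host
+^https?://b\.example\.org/
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

rules = parse_rules(RULES_TEXT)
for url in ("http://a.example.com/docs/intro.html",
            "http://a.example.com/blog/",
            "http://b.example.org/logo.png"):
    print(url, accepts(rules, url))
```

Because order matters, the shared reject rule sits above the per-domain accept rules; moving it to the bottom would let image URLs on site B through.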

Re: Handling Content-Type Parameter in Nutch and Solr

2013-02-25 Thread Raja Kulasekaran
Hi, Below I have updated both Content as well as Parse Metadata. Can you suggest the rule for "contentType" as well as metatag.content-Type? Is this from the header of the file, as my HTML file only has a description field? __DUMP__ parsing: http://localhost/def.html contentType: text/html

Re: Handling Content-Type Parameter in Nutch and Solr

2013-02-25 Thread kiran chitturi
Hi Raja, Which Nutch version are you using? Can you check again with the parseChecker [1] tool? [1] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker On Mon, Feb 25, 2013 at 9:32 AM, Raja Kulasekaran wrote: > Hi, > > I am unable to get the value of ContentType as well as > metatag.Conten

Re: Nutch 2.1 - Image / Video Search

2013-02-25 Thread Jorge Luis Betancourt Gonzalez
Hi: Like Raja said, it's possible. The thing is that out of the box, Nutch is only able to index the metadata of the file; you can always write some plugins to implement any logic you desire. - Original message - From: "Raja Kulasekaran" To: user@nutch.apache.org Sent: Sunday, 24 d

Re: Nutch + Eclipse

2013-02-25 Thread Julien Nioche
You are welcome. We should probably rename the pom.xml file to something else so that people don't assume that Nutch can be built with Maven. On 25 February 2013 09:06, feng lu wrote: > So it was like this! Thank you for correcting my mistakes. > > I see this issue https://issues.apache.org/ji

Re: Nutch + Eclipse

2013-02-25 Thread feng lu
So it was like this! Thank you for correcting my mistakes. I see this issue https://issues.apache.org/jira/browse/NUTCH-1371 Thanks, Julien On Mon, Feb 25, 2013 at 4:50 PM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > > nutch can use maven to manage the project. > > > That's incorr

Re: Nutch + Eclipse

2013-02-25 Thread Julien Nioche
> nutch can use maven to manage the project. That's incorrect. Nutch is built with ANT+IVY. There is indeed a pom.xml used to publish the artefacts with Maven but it can't be used for building Nutch properly. There is a Jira issue with a proposal to move to ANT+Maven but even this does not mean