Re: Differences between 2.1 and 1.6

2013-02-25 Thread Lewis John Mcgibbney
Hi Markus,
This is very useful, thank you.
Lewis

On Mon, Feb 25, 2013 at 3:08 PM, Markus Jelsma
wrote:

> Something seems to be missing here. It's clear that 1.x has more features
> and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a
> lot better if you are going to crawl on a very large scale, but I still
> haven't seen any numbers to support this assumption. Nutch 1.x can easily
> deal with many millions of records, and with billions if you throw some
> hardware at it.
>
> Most users are not going to crawl millions of records. In that case I
> personally choose 1.x. I prefer stability and predictability over some
> performance you are not likely to need anyway.
>
> Besides our large 1.x research cluster we still use 1.x in production for
> all our customers, running locally on a 2 core 512MB RAM VPS with a crawldb
> of over 5 million records and it runs fine, fast and keeps up with newly
> discovered URLs. The only significant improvements were a better scoring
> filter and integrating indexing in the fetcher.
>
> -Original message-
> > From:Lewis John Mcgibbney 
> > Sent: Mon 25-Feb-2013 23:37
> > To: user@nutch.apache.org
> > Subject: Re: Differences between 2.1 and 1.6
> >
> > Hi Danilo,
> >
> > You can check out the architecture changes here
> > http://wiki.apache.org/nutch/#Nutch_2.x
> >
> > Nutch trunk (1.7-SNAPSHOT) is here
> > http://svn.apache.org/repos/asf/nutch/trunk/
> >
> > 2.x is here
> > http://svn.apache.org/repos/asf/nutch/branches/2.x/
> >
> > On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes <
> > dan...@kelsorfernandes.com.br> wrote:
> >
> > > Hi everyone,
> > >
> > > Can somebody tell me about the differences between 2.1 and 1.6?
> > >
> > > Is the SVN trunk 1.* or 2.*?
> > >
> > > Thanks,
> > > Danilo Fernandes
> > >
> > >
> >
> >
> > --
> > *Lewis*
> >
>



-- 
*Lewis*


RE: Differences between 2.1 and 1.6

2013-02-25 Thread Markus Jelsma
Something seems to be missing here. It's clear that 1.x has more features and 
is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better 
if you are going to crawl on a very large scale, but I still haven't seen any 
numbers to support this assumption. Nutch 1.x can easily deal with many 
millions of records, and with billions if you throw some hardware at it. 

Most users are not going to crawl millions of records. In that case I 
personally choose 1.x. I prefer stability and predictability over some 
performance you are not likely to need anyway. 

Besides our large 1.x research cluster we still use 1.x in production for all 
our customers, running locally on a 2 core 512MB RAM VPS with a crawldb of over 
5 million records and it runs fine, fast and keeps up with newly discovered 
URLs. The only significant improvements were a better scoring filter and 
integrating indexing in the fetcher.
 
-Original message-
> From:Lewis John Mcgibbney 
> Sent: Mon 25-Feb-2013 23:37
> To: user@nutch.apache.org
> Subject: Re: Differences between 2.1 and 1.6
> 
> Hi Danilo,
> 
> You can check out the architecture changes here
> http://wiki.apache.org/nutch/#Nutch_2.x
> 
> Nutch trunk (1.7-SNAPSHOT) is here
> http://svn.apache.org/repos/asf/nutch/trunk/
> 
> 2.x is here
> http://svn.apache.org/repos/asf/nutch/branches/2.x/
> 
> On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes <
> dan...@kelsorfernandes.com.br> wrote:
> 
> > Hi everyone,
> >
> > Can somebody tell me about the differences between 2.1 and 1.6?
> >
> > Is the SVN trunk 1.* or 2.*?
> >
> > Thanks,
> > Danilo Fernandes
> >
> >
> 
> 
> -- 
> *Lewis*
> 


Re: Differences between 2.1 and 1.6

2013-02-25 Thread Lewis John Mcgibbney
Hi Danilo,

You can check out the architecture changes here
http://wiki.apache.org/nutch/#Nutch_2.x

Nutch trunk (1.7-SNAPSHOT) is here
http://svn.apache.org/repos/asf/nutch/trunk/

2.x is here
http://svn.apache.org/repos/asf/nutch/branches/2.x/

On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes <
dan...@kelsorfernandes.com.br> wrote:

> Hi everyone,
>
> Can somebody tell me about the differences between 2.1 and 1.6?
>
> Is the SVN trunk 1.* or 2.*?
>
> Thanks,
> Danilo Fernandes
>
>


-- 
*Lewis*


Re: Differences between 2.1 and 1.6

2013-02-25 Thread Tejas Patil
Hi Danilo,

On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes <
dan...@kelsorfernandes.com.br> wrote:

> Hi everyone,
>
> Can somebody tell me about the differences between 2.1 and 1.6?
>

[1] and [2] would be informative reads.

>
> Is the SVN trunk 1.* or 2.*?
>

Trunk [3] is 1.x.  2.X can be found here [4]

>
> Thanks,
> Danilo Fernandes
>
>
[1] : http://digitalpebble.blogspot.com/2012/07/nutch-20-is-out-at-last.html
[2] :
http://lucene.472066.n3.nabble.com/differences-between-nutch-1-and-nutch-2-td4031548.html
[3] : http://svn.apache.org/repos/asf/nutch/trunk/
[4] : http://svn.apache.org/repos/asf/nutch/branches/2.x/

Thanks,
Tejas Patil


Differences between 2.1 and 1.6

2013-02-25 Thread Danilo Fernandes
Hi everyone,

Can somebody tell me about the differences between 2.1 and 1.6?

Is the SVN trunk 1.* or 2.*?

Thanks,
Danilo Fernandes



Re: Nutch 2.1 - Image / Video Search

2013-02-25 Thread J. Delgado
If you're interested in pure image search you may want to use Nutch for
crawling but something like imgseek (http://www.imgseek.net/isk-daemon) for
indexing and search.

-J

On Monday, February 25, 2013, Jorge Luis Betancourt Gonzalez wrote:

> Hi:
>
> Like Raja said, it's possible. The thing is that, out of the box, Nutch is
> only able to index the metadata of the file; you can always write some
> plugins to implement any logic you desire.
>
> - Original message -
> From: "Raja Kulasekaran"
> To: user@nutch.apache.org
> Sent: Sunday, February 24, 2013 13:31:28
> Subject: Nutch 2.1 - Image / Video Search
>
> Hi,
>
> Is it possible to crawl images as well as videos with the latest Nutch
> version? I am using Nutch 1.6. I would like to know whether I can go ahead
> and use Nutch 1.6; otherwise, please suggest the appropriate version.
>
> Raja
>


-- 
Sent from Gmail Mobile


Nutch 2.1 MySQL setup character encoding

2013-02-25 Thread jazz
Now with correct headings (started this mail from an old mail with an old 
thread in it...)



Hi,

How do I set up Nutch to crawl correctly using the UTF-8 character set?

This does not work: http://nlp.solutions.asia/?p=180

I am using nutch 2.1, Solr 4.0 and MySQL 5.5.30. This is the error during the 
parser job:

Caused by: java.sql.SQLException: Incorrect string value: '\xEF\xBB\xBF Ir...' 
for column 'text' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)

The problem seems to be that the JDBC connection is not using UTF-8. How 
do I change that in Nutch? This property is set but does not seem to affect 
the JDBC connection:

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>

Thanks for your help,

Bart
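
A note on the error message above: the bytes \xEF\xBB\xBF at the start of the rejected value are the UTF-8 byte order mark (BOM), which suggests the parsed text itself is UTF-8 and starts with a BOM. The following is a minimal Python sketch, not Nutch code, that shows how such a prefix can be recognised and stripped. The sample bytes are copied from the error message; the helper name strip_utf8_bom is made up for the illustration.

```python
import codecs

# Bytes copied from the start of the rejected value in the SQLException above.
raw = b"\xef\xbb\xbf Ir"


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


has_bom = raw.startswith(codecs.BOM_UTF8)
text = strip_utf8_bom(raw).decode("utf-8")
print(has_bom, repr(text))
```

Whether stripping the BOM is enough depends on the character set of the MySQL column and connection, which this sketch does not touch.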

RE: Nutch status info on each domain individually

2013-02-25 Thread Markus Jelsma
Well, you can always use the DomainStatistics utility to get the raw numbers on 
hosts, domains and TLDs, but this won't tell you whether a domain has been 
fully crawled because the crawling frontier can always change.

You can be sure that everything (disregarding url filters) has been crawled if 
no more records are selected before fetched records are eligible again for 
refetch (default interval).

NUTCH-1325 does a better job in providing stats for hosts than the current 
DomainStatistics but it's uncommitted. It'll work though.

https://issues.apache.org/jira/browse/NUTCH-1325
 
-Original message-
> From:Tejas Patil 
> Sent: Mon 25-Feb-2013 20:46
> To: user@nutch.apache.org
> Subject: Re: Nutch status info on each domain individually
> 
> I can't think of any existing Nutch utility which can be used here. Maybe
> dumping the crawldb and then grepping over it would be reasonable if the
> number of hosts is large and the crawldb is small. This will be a bad idea
> if this has to be done after every Nutch cycle on a large crawldb.
> 
> If you are ready to write some small code, then it can become easy:
> 1. Write some code to query the index so that you do not have to do that
> manually. OR
> 2. Write a MapReduce job to read the crawldb wherein the mapper emits the
> hosts of the URLs.
> 
> #1 is the better deal in terms of execution time.
> 
> Thanks,
> Tejas Patil
> 
> 
> On Mon, Feb 25, 2013 at 11:28 AM, imehesz  wrote:
> 
> > hello,
> >
> > I can finally run Nutch (+Solr) with Java. My only remaining question is:
> > how can I check whether a particular domain has been crawled?
> >
> > Let's say I have 300 sites to crawl and index.
> > So far my work-around was to execute a simple Solr query for each domain
> > URL, and see if the indexing timestamp in the Solr DB is greater than the
> > Nutch crawling start date-time. It works, but I'm curious if there is a
> > better way to do this.
> >
> > thanks,
> > --iM
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Nutch-status-info-on-each-domain-individually-tp4042815.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> 


Nutch 2.1 MySQL setup character encoding

2013-02-25 Thread jazz
Hi,

How do I set up Nutch to crawl correctly using the UTF-8 character set?

This does not work: http://nlp.solutions.asia/?p=180

I am using nutch 2.1, Solr 4.0 and MySQL 5.5.30. This is the error during the 
parser job:

Caused by: java.sql.SQLException: Incorrect string value: '\xEF\xBB\xBF Ir...' 
for column 'text' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)

The problem seems to be that the JDBC connection is not using UTF-8. How 
do I change that in Nutch? This property is set but does not seem to affect 
the JDBC connection:

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>

Thanks for your help,

Bart

Re: Nutch status info on each domain individually

2013-02-25 Thread Tejas Patil
I can't think of any existing Nutch utility which can be used here. Maybe
dumping the crawldb and then grepping over it would be reasonable if the
number of hosts is large and the crawldb is small. This will be a bad idea
if this has to be done after every Nutch cycle on a large crawldb.

If you are ready to write some small code, then it can become easy:
1. Write some code to query the index so that you do not have to do that
manually. OR
2. Write a MapReduce job to read the crawldb wherein the mapper emits the
hosts of the URLs.

#1 is the better deal in terms of execution time.

Thanks,
Tejas Patil
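
The dump-and-grep idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the crawldb has already been dumped to text (for example with bin/nutch readdb <crawldb> -dump <dir>), each record starts with a line holding the URL followed by a tab, and a line such as "Status: 2 (db_fetched)" follows it. The sample text and the function name count_status_per_host are made up for the example.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Made-up sample in the assumed dump layout: URL<TAB>..., then a Status line.
SAMPLE_DUMP = """\
http://domain1.com/sale/1\tVersion: 7
Status: 2 (db_fetched)

http://domain1.com/sale/2\tVersion: 7
Status: 1 (db_unfetched)

http://domain2.com/flytickets\tVersion: 7
Status: 2 (db_fetched)
"""


def count_status_per_host(dump_text):
    """Count crawldb statuses per host from a text dump."""
    counts = defaultdict(Counter)
    host = None
    for line in dump_text.splitlines():
        if line.startswith(("http://", "https://")):
            # Record header: the URL is everything before the first tab.
            url = line.split("\t", 1)[0]
            host = urlsplit(url).hostname
        elif line.startswith("Status:") and host is not None:
            # "Status: 2 (db_fetched)" -> "db_fetched"
            status = line[line.index("(") + 1:line.rindex(")")]
            counts[host][status] += 1
    return counts


counts = count_status_per_host(SAMPLE_DUMP)
for host, statuses in sorted(counts.items()):
    print(host, dict(statuses))
```

A host with no db_unfetched entries has nothing pending in that particular dump; newly discovered URLs can still appear in the next cycle.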


On Mon, Feb 25, 2013 at 11:28 AM, imehesz  wrote:

> hello,
>
> I can finally run Nutch (+Solr) with Java. My only remaining question is:
> how can I check whether a particular domain has been crawled?
>
> Let's say I have 300 sites to crawl and index.
> So far my work-around was to execute a simple Solr query for each domain
> URL, and see if the indexing timestamp in the Solr DB is greater than the
> Nutch crawling start date-time. It works, but I'm curious if there is a
> better way to do this.
>
> thanks,
> --iM
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-status-info-on-each-domain-individually-tp4042815.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>


Nutch status info on each domain individually

2013-02-25 Thread imehesz
hello,

I can finally run Nutch (+Solr) with Java. My only remaining question is:
how can I check whether a particular domain has been crawled?

Let's say I have 300 sites to crawl and index.
So far my work-around was to execute a simple Solr query for each domain
URL, and see if the indexing timestamp in the Solr DB is greater than the
Nutch crawling start date-time. It works, but I'm curious if there is a
better way to do this. 

thanks,
--iM
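
The Solr work-around described above could be scripted roughly as follows. This is a sketch under assumptions: the Solr base URL is hypothetical, and the field names host and tstamp are assumed to match the schema in use. It only builds the query URL for the most recently indexed document of one host; sending the request and comparing the timestamp with the crawl start time is left out.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust to the actual installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def latest_doc_query(host, base_url=SOLR_SELECT_URL):
    """Build a Solr query URL for the newest indexed document of a host."""
    params = {
        "q": 'host:"%s"' % host,   # assumed field name: host
        "sort": "tstamp desc",     # assumed field name: tstamp
        "rows": 1,
        "fl": "url,tstamp",
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


query_url = latest_doc_query("domain1.com")
print(query_url)
```

Looping this over the 300 hosts gives one small request per site instead of a manual check.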



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-status-info-on-each-domain-individually-tp4042815.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: regex-urlfilter file for multiple domains

2013-02-25 Thread Tejas Patil
Hey Danilo,

On Mon, Feb 25, 2013 at 7:09 AM, Danilo Fernandes <
dan...@kelsorfernandes.com.br> wrote:

> Hello,
>
> I started by crawling one site and I didn't have any problems. But I need
> to define criteria for each domain.
>
> How can I create different regex-urlfilter rules for each of them?
>
> Actually the idea is to catch some pages of each site and not all. Each one
> has a different structure and I need to cover all of them.
>
> Like:
>
> Domain1.com/sale = I want to catch.
> Domain1.com/cars = I don't.
>
> Regex: -Domain1.com/[^s].*
>
> Domain2.com/flytickets = I want to catch.
> Domain2.com/contatPage = I don't.
>
> Regex: -Domain2.com/[^f].*
>
> Is it possible?
>

Yes.
You can do any of the following:
1. Have only accept rules, with a "-." at the end to omit URLs which don't
match.
2. Have only reject rules, with a "+." at the end to accept URLs which don't
get rejected.
3. A combination of both.

Say you go by #1. Then for the given example, it would be something like:
--
+Domain1.com/sale.*
+Domain2.com/flytickets.*
-.
--
hth

>
>
> Thanks again.
>
>
>
> Danilo Fernandes
>
>

thanks,
Tejas Patil
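
To see how the accept-rules-plus-catch-all list above behaves, here is a small Python simulation. It is not Nutch code: it assumes that rules are tried top to bottom, that the first pattern found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is dropped. The lowercase host names and the sample URLs are invented for the example.

```python
import re

# Rule list modelled on the example above (host names lowercased).
RULES_TEXT = """\
+domain1.com/sale.*
+domain2.com/flytickets.*
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)
for url in ("http://domain1.com/sale/item42",
            "http://domain1.com/cars",
            "http://domain2.com/flytickets",
            "http://domain2.com/contactPage"):
    print(url, "->", "keep" if accepts(rules, url) else "skip")
```

The unescaped dots in the patterns match any character; writing domain1\.com is stricter.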


Re: Handling Content-Type Parameter in Nutch and Solr

2013-02-25 Thread Raja Kulasekaran
Hi,

Below I have updated both Content as well as Parse Metadata.

Can you suggest the rule for "contentType" as well as
metatag.content-Type? Is this taken from the header of the file? My HTML file
only has a description field.

__DUMP__

parsing: http://localhost/def.html
contentType: text/html
signature: d677f21eaccf7cc5ff4cb8484d9a8965
-
Url
---
http://localhost/def.html
-
ParseData
-
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: ETag="40c43-2f-4d68fe096f698" Date=Mon, 25 Feb 2013
17:40:04 GMT Content-Length=47 Last-Modified=Mon, 25 Feb 2013 17:29:03 GMT
Content-Type=text/html; charset=UTF-8 Connection=close Accept-Ranges=bytes
Server=Apache/2.2.22 (Fedora)
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
metatag.description=¨text/html¨
-
ParseText
-

__END_DUMP


On Mon, Feb 25, 2013 at 8:29 PM, kiran chitturi wrote:

> Hi Raja,
>
> Which Nutch version are you using? Can you check again with the parseChecker
> [1] tool?
>
> [1] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker
>
>
>
> On Mon, Feb 25, 2013 at 9:32 AM, Raja Kulasekaran 
> wrote:
>
> > Hi,
> >
> > I am unable to get the value of ContentType as well as
> > metatag.Content-Type.
> >
> > Can you please suggest the correct way to get this value?
> >
> > Raja
> >
>
>
>
> --
> Kiran Chitturi
>


Re: Handling Content-Type Parameter in Nutch and Solr

2013-02-25 Thread kiran chitturi
Hi Raja,

Which Nutch version are you using? Can you check again with the parseChecker
[1] tool?

[1] - http://wiki.apache.org/nutch/bin/nutch%20parsechecker



On Mon, Feb 25, 2013 at 9:32 AM, Raja Kulasekaran  wrote:

> Hi,
>
> I am unable to get the value of ContentType as well as
> metatag.Content-Type.
>
> Can you please suggest the correct way to get this value?
>
> Raja
>



-- 
Kiran Chitturi


Re: Nutch 2.1 - Image / Video Search

2013-02-25 Thread Jorge Luis Betancourt Gonzalez
Hi:

Like Raja said, it's possible. The thing is that, out of the box, Nutch is only 
able to index the metadata of the file; you can always write some plugins to 
implement any logic you desire.

- Original message -
From: "Raja Kulasekaran" 
To: user@nutch.apache.org
Sent: Sunday, February 24, 2013 13:31:28
Subject: Nutch 2.1 - Image / Video Search

Hi,

Is it possible to crawl images as well as videos with the latest Nutch
version? I am using Nutch 1.6. I would like to know whether I can go ahead
and use Nutch 1.6; otherwise, please suggest the appropriate version.

Raja


Re: Nutch + Eclipse

2013-02-25 Thread Julien Nioche
You are welcome. We should probably rename the pom.xml file to something
else so that people don't assume that Nutch can be built with Maven.

On 25 February 2013 09:06, feng lu  wrote:

> So it was like this! Thank you for correcting my mistakes.
>
> I see this issue: https://issues.apache.org/jira/browse/NUTCH-1371
>
>
> thanks Julien
>
>
>
> On Mon, Feb 25, 2013 at 4:50 PM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
>
> > > nutch can use maven to manage the project.
> >
> >
> > That's incorrect. Nutch is built with ANT+IVY. There is indeed a pom.xml
> > used to publish the artefacts with Maven, but it can't be used for
> > building Nutch properly. There is a Jira issue with a proposal to move to
> > ANT+Maven, but even this does not mean you can build Nutch with Maven only.
> >
> >
> >
> > >
> > > [0] http://wiki.apache.org/nutch/RunNutchInEclipse
> > >
> > >
> > > On Mon, Feb 25, 2013 at 10:26 AM, Danilo Fernandes <
> > > dan...@kelsorfernandes.com.br> wrote:
> > >
> > > > Hello,
> > > >
> > > > I’m new to Nutch and I want to make some changes to get the HTML and
> > > > scrape data from it.
> > > >
> > > > My problem starts with Eclipse. I can’t run the code with Ant.
> > > >
> > > > When I create a trunk I always receive this error list:
> > > >
> > > > 
> > > >
> > > > I tried some things, but nothing happened.
> > > >
> > > > Ideas?
> > > >
> > > > PS: I tried to subscribe to the developer list, but without success…
> > > >
> > > > Can somebody help me?
> > > >
> > > > Thanks very much!
> > > >
> > > > Danilo Fernandes
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
> > >
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nutch + Eclipse

2013-02-25 Thread feng lu
So it was like this! Thank you for correcting my mistakes.

I see this issue: https://issues.apache.org/jira/browse/NUTCH-1371


thanks Julien



On Mon, Feb 25, 2013 at 4:50 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> > nutch can use maven to manage the project.
>
>
> That's incorrect. Nutch is built with ANT+IVY. There is indeed a pom.xml
> used to publish the artefacts with Maven, but it can't be used for building
> Nutch properly. There is a Jira issue with a proposal to move to ANT+Maven,
> but even this does not mean you can build Nutch with Maven only.
>
>
>
> >
> > [0] http://wiki.apache.org/nutch/RunNutchInEclipse
> >
> >
> > On Mon, Feb 25, 2013 at 10:26 AM, Danilo Fernandes <
> > dan...@kelsorfernandes.com.br> wrote:
> >
> > > Hello,
> > >
> > > I’m new to Nutch and I want to make some changes to get the HTML and
> > > scrape data from it.
> > >
> > > My problem starts with Eclipse. I can’t run the code with Ant.
> > >
> > > When I create a trunk I always receive this error list:
> > >
> > > 
> > >
> > > I tried some things, but nothing happened.
> > >
> > > Ideas?
> > >
> > > PS: I tried to subscribe to the developer list, but without success…
> > >
> > > Can somebody help me?
> > >
> > > Thanks very much!
> > >
> > > Danilo Fernandes
> > >
> > >
> >
> >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Don't Grow Old, Grow Up... :-)


Re: Nutch + Eclipse

2013-02-25 Thread Julien Nioche
> nutch can use maven to manage the project.


That's incorrect. Nutch is built with ANT+IVY. There is indeed a pom.xml
used to publish the artefacts with Maven, but it can't be used for building
Nutch properly. There is a Jira issue with a proposal to move to ANT+Maven,
but even this does not mean you can build Nutch with Maven only.



>
> [0] http://wiki.apache.org/nutch/RunNutchInEclipse
>
>
> On Mon, Feb 25, 2013 at 10:26 AM, Danilo Fernandes <
> dan...@kelsorfernandes.com.br> wrote:
>
> > Hello,
> >
> > I’m new to Nutch and I want to make some changes to get the HTML and
> > scrape data from it.
> >
> > My problem starts with Eclipse. I can’t run the code with Ant.
> >
> > When I create a trunk I always receive this error list:
> >
> > 
> >
> > I tried some things, but nothing happened.
> >
> > Ideas?
> >
> > PS: I tried to subscribe to the developer list, but without success…
> >
> > Can somebody help me?
> >
> > Thanks very much!
> >
> > Danilo Fernandes
> >
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble