Re: Nutch Incremental Crawl

2013-02-26 Thread feng lu
Hi David

Maybe what you want is an adaptive re-fetch algorithm; see [0].

 [0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
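For reference, switching Nutch to the adaptive schedule is a configuration change; a hedged sketch for conf/nutch-site.xml (the property names come from nutch-default.xml, the values here are illustrative, not recommendations):

```xml
<!-- Use AdaptiveFetchSchedule instead of the default fixed schedule. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
  <description>Grow the interval by this rate when a page is unmodified.</description>
</property>
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
  <description>Shrink the interval by this rate when a page has changed.</description>
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>3600</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>2592000</value>
</property>
```

With these settings the per-URL interval drifts between the min and max bounds depending on whether each re-fetch finds the page modified.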


On Wed, Feb 27, 2013 at 1:20 PM, David Philip
wrote:

> Hi all,
>
>   Thank you very much for the replies. Very useful information to
> understand how incremental crawling can be achieved.
>
> Dear Markus:
> Can you please tell me how I can override this fetch interval, in case I
> need to fetch the page before the time interval has passed?
>
>
>
> Thanks very much
> - David
>
>
>
>
>
> On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
> wrote:
>
> > If you want records to be fetched at a fixed interval, it's easier to
> > inject them with a fixed fetch interval.
> >
> > nutch.fixedFetchInterval=86400
> >
> >
> >
> > -Original message-
> > > From:kemical 
> > > Sent: Thu 14-Feb-2013 10:15
> > > To: user@nutch.apache.org
> > > Subject: Re: Nutch Incremental Crawl
> > >
> > > Hi David,
> > >
> > > You can also consider setting a shorter fetch interval with nutch
> > > inject. This way you'll set a higher score (so the url is always taken
> > > in priority when you generate a segment) and a fetch interval of 1 day.
> > >
> > > If you have a case similar to mine, you'll often want some homepages
> > > fetched each day but not their inlinks. What you can do is inject all
> > > your seed urls again (assuming those urls are only homepages).
> > >
> > > # Change the nutch option so existing urls can be injected again, in
> > > # conf/nutch-default.xml or conf/nutch-site.xml:
> > > db.injector.update=true
> > >
> > > # Add metadata to update the score / fetch interval. The following line
> > > # appends the new score and new interval to each line of your seed url
> > > # files (86400 s = 1 day):
> > > perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=86400\n/' [your_seed_url_dir]/*
> > >
> > > # Run the inject command:
> > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
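As a side note, the same tagging can be done with awk instead of an in-place perl edit; a self-contained sketch (URLs, paths, and the metadata values are placeholders):

```shell
# Sketch: tag each seed URL with injector metadata (tab-separated
# key=value pairs appended to the line). Paths and values are illustrative.
mkdir -p urls
printf 'http://example.com/\nhttp://example.org/\n' > urls/seed.txt

# Append score and re-fetch interval (86400 s = 1 day) to every line:
awk '{print $0 "\tnutch.score=100\tnutch.fetchInterval=86400"}' \
    urls/seed.txt > urls/seed.tagged.txt

cat urls/seed.tagged.txt
# then: bin/nutch inject crawl/crawldb urls  (with db.injector.update=true)
```

Writing to a second file keeps the original seed list intact, so the tagging can be re-run with different values.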
> > >
> > > Now, the following crawls will take your urls in top priority and
> > > crawl them once a day. I've used my situation to illustrate the
> > > concept, but I guess you can tweak the params to fit your needs.
> > >
> > > This way is useful when you want a regular fetch on some urls; if it
> > > occurs only rarely, I guess freegen is the right choice.
> > >
> > > Best,
> > > Mike
> > >
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> >
>



-- 
Don't Grow Old, Grow Up... :-)


Re: Nutch Incremental Crawl

2013-02-26 Thread David Philip
Hi all,

  Thank you very much for the replies. Very useful information to
understand how incremental crawling can be achieved.

Dear Markus:
Can you please tell me how I can override this fetch interval, in case I
need to fetch the page before the time interval has passed?



Thanks very much
- David





On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
wrote:

> If you want records to be fetched at a fixed interval, it's easier to
> inject them with a fixed fetch interval.
>
> nutch.fixedFetchInterval=86400
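If every record should use the same interval anyway, the default interval can also be lowered globally; a hedged sketch for conf/nutch-site.xml (the property name is from nutch-default.xml, the value is illustrative):

```xml
<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds between re-fetches of the same page; 86400 = 1 day -->
  <value>86400</value>
</property>
```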
>
>
>
> -Original message-
> > From:kemical 
> > Sent: Thu 14-Feb-2013 10:15
> > To: user@nutch.apache.org
> > Subject: Re: Nutch Incremental Crawl
> >
> > Hi David,
> >
> > You can also consider setting a shorter fetch interval with nutch
> > inject. This way you'll set a higher score (so the url is always taken
> > in priority when you generate a segment) and a fetch interval of 1 day.
> >
> > If you have a case similar to mine, you'll often want some homepages
> > fetched each day but not their inlinks. What you can do is inject all
> > your seed urls again (assuming those urls are only homepages).
> >
> > # Change the nutch option so existing urls can be injected again, in
> > # conf/nutch-default.xml or conf/nutch-site.xml:
> > db.injector.update=true
> >
> > # Add metadata to update the score / fetch interval. The following line
> > # appends the new score and new interval to each line of your seed url
> > # files (86400 s = 1 day):
> > perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=86400\n/' [your_seed_url_dir]/*
> >
> > # Run the inject command:
> > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> >
> > Now, the following crawls will take your urls in top priority and crawl
> > them once a day. I've used my situation to illustrate the concept, but I
> > guess you can tweak the params to fit your needs.
> >
> > This way is useful when you want a regular fetch on some urls; if it
> > occurs only rarely, I guess freegen is the right choice.
> >
> > Best,
> > Mike
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>


Re: migrating from 1.x to 2.x

2013-02-26 Thread Lewis John Mcgibbney
Hi kaveh,
Size of the crawl database is not an issue with regard to migration between
Nutch versions; it is the compatibility issue that you need to be concerned
about.
There are no tools currently available in Nutch (as far as I know) to read
URLs from HDFS and import/inject your crawl data into your HBase cluster.
This is mostly due to the direction in which Nutch is moving, which is to
do just crawling, at scale, quickly. We don't have an immediate necessity
or passion to maintain legacy tools within the codebase, and have been
trying to reduce this aspect of it. This however doesn't help, as there was
never a tool for this specific purpose anyway (as far as I know).
It is however becoming something I am getting interested in (the notion of
obtaining lots of data from various data stores and bootstrapping Nutch
with it). I would really like to read the data with Gora and map it
somewhere. I am interested in the Nutch inject code and would be interested
in extending it / writing new code to solve this issue.

On Tue, Feb 26, 2013 at 5:03 PM, kaveh minooie  wrote:

> me again,
>
> is there any way that I can import my existing crawldb from a Nutch 1.4,
> which has about 2.5 B (with a B) links in it and currently resides in an
> HDFS file system, into the webpage table in HBase?
>
>
> and what happened to linkdb in nutch 2.x?
>
> thanks,
>



-- 
*Lewis*


migrating from 1.x to 2.x

2013-02-26 Thread kaveh minooie

me again,

is there any way that I can import my existing crawldb from a Nutch 1.4,
which has about 2.5 B (with a B) links in it and currently resides in an
HDFS file system, into the webpage table in HBase?



and what happened to linkdb in nutch 2.x?

thanks,


Re: Eclipse Error

2013-02-26 Thread Lewis John Mcgibbney
We compile and test Apache Nutch on Solaris with the Java 7 (latest) JDK
and all is good. I run Apache Nutch on some CI ubuntu servers. I do not run
on Windows. This may be a problem, or it could be your development
environment, or it could be something else.
I have not engaged in this conversation before, so I can only assume it is
something to do with your local environment.


On Tue, Feb 26, 2013 at 11:05 AM, Danilo Fernandes <
dan...@kelsorfernandes.com.br> wrote:

>
>
> Thanks for the reply Tejas, but I tried running it with ant many times and
> it makes no difference.
>
> About turning on the verbose level for ivy, I'm a novice with Eclipse and
> plugins.
>
> Can you help me do that?
>
> On Tue, 26
> Feb 2013 10:46:43 -0800, Tejas Patil wrote:
>
> > Hi Lewis,
> >
> > The OP is
> not able to build nutch in Eclipse. So far people have been
> > suspecting
> over this part of the log:
> >
> >
> **C:\Users\Danilo\workspace\Nutch\build.xml:96:
> >
> java.lang.UnsupportedClassVersionError: com/sun/tools/javac/Main :
> >
> Unsupported major.minor version 51.0*
> > *
> >
> > It turns out that the
> java version is fine. (v 1.6). I am not sure but this
> > problem might be
> related to ivy as per this error:
> >
> > *> > [*ivy:resolve*] unknown
> resolver main*
> > *> > [*ivy:resolve*] :: USE VERBOSE OR DEBUG
> MESSAGE LEVEL FOR MORE
> >
> >>> DETAILS
> >
> > *
> > I had faced this
> error before but it was sporadic and used to go away after
> > invoking
> ant again. OP faces it consistently.
> >
> > @Danilo: Maybe turning on
> verbose level for ivy might shed some light.
> >
> > Thanks,
> > Tejas
> Patil
> >
> > On Tue, Feb 26, 2013 at 10:32 AM, Lewis John Mcgibbney <
> >
> lewis.mcgibb...@gmail.com [15]> wrote:
> >
> >> What is the problem? There
> is a community here that can help... if we know what is wrong! On Tue,
> Feb 26, 2013 at 7:44 AM, Danilo Fernandes <
> dan...@kelsorfernandes.com.br [14]> wrote:
> >>
> >>> I tried both and no
> one function! :( -Mensagem original- De: kiran chitturi
> [mailto:chitturikira...@gmail.com [8]] Enviada em: terça-feira, 26 de
> fevereiro de 2013 12:32 Para: user@nutch.apache.org [9] Cc:
> ferna...@gmail.com [10] Assunto: Re: Eclipse Error Let's keep the
> discussion in the User mailing list. I would suggest you to follow the
> instructions here to set up Nutch in Eclipse [1] JDK 1.6 + or 1.7 + will
> be good enough. I would also suggest to keep your JRE compatible with
> the JDK. [1] - http://wiki.apache.org/nutch/RunNutchInEclipse [11] On
> Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote:
> >>>
> 
> Kiran, Do you think I need a JDK 7? ** ** *De:* kiran chitturi
> [mailto:chitturikira...@gmail.com [1]] *Enviada em:* terça-feira, 26 de
> fevereiro de 2013 11:57 *Para:* d...@nutch.apache.org [2] *Assunto:* Re:
> Eclipse Error ** ** I think Nutch requires atleast Java 1.6. **
> ** On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes wrote: What
> version of JDK fit with Nutch trunk? Anybody knows? ** ** 2013/2/25
> Danilo Fernandes  Feng Lu, thanks for the fast reply.  But,
> I'm using the JavaSE-1.6 (jre6) and always get this error. 
> *De:* feng lu [mailto:amuseme...@gmail.com [5]] *Enviada em:*
> segunda-feira, 25 de fevereiro de 2013 22:35 *Para:*
> d...@nutch.apache.org [6] *Assunto:* Re: Eclipse Error  Hi
> Danilo  "Unsupported maj.minor version 51.0" means that you
> compiled your classes under a specific JDK, but then try to run them
> under older
> >>> version of JDK.
> >>>
>  So, you can't run classes
> compiled with JDK 6.0 under JDK 5.0. The same with classes compiled
> under JDK 7.0 when you try to run them under
> >>> JDK 6.0.   On
> Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes wrote: *Hi, I want do
> some changes in Nutch to get a HTML and take some data from them. My
> problem starts when I'm compiling the code in Eclipse. I alw
> >>>
>  d
> not load definitions from resource org/sonar/ant/antlib.xml. It could
> not be found. *ivy-probe-antlib*: *ivy-download*:
> [*taskdef*] Could not load definitions from resource
> org/sonar/ant/antlib.xml. It could not be found.
> *ivy-download-unchecked*: *ivy-init-antlib*: *ivy-init*:
> *init*: [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes** **
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release** **
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test\classes
> [*copy*] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf [*copy*]
> Copying C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
> to C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt [*copy*]
> Copying C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template to
> C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml [*copy*] Copying
> C:\Us

Re: Eclipse Error

2013-02-26 Thread Danilo Fernandes
  

Thanks for the reply Tejas, but I tried running it with ant many times and
it makes no difference.

About turning on the verbose level for ivy, I'm a novice with Eclipse and
plugins.

Can you help me do that?

On Tue, 26
Feb 2013 10:46:43 -0800, Tejas Patil wrote: 

> Hi Lewis,
> 
> The OP is
not able to build nutch in Eclipse. So far people have been
> suspecting
over this part of the log:
> 
>
**C:\Users\Danilo\workspace\Nutch\build.xml:96:
>
java.lang.UnsupportedClassVersionError: com/sun/tools/javac/Main :
>
Unsupported major.minor version 51.0*
> *
> 
> It turns out that the
java version is fine. (v 1.6). I am not sure but this
> problem might be
related to ivy as per this error:
> 
> *> > [*ivy:resolve*] unknown
resolver main*
> *> > [*ivy:resolve*] :: USE VERBOSE OR DEBUG
MESSAGE LEVEL FOR MORE
> 
>>> DETAILS
> 
> *
> I had faced this
error before but it was sporadic and used to go away after
> invoking
ant again. OP faces it consistently.
> 
> @Danilo: Maybe turning on
verbose level for ivy might shed some light.
> 
> Thanks,
> Tejas
Patil
> 
> On Tue, Feb 26, 2013 at 10:32 AM, Lewis John Mcgibbney <
>
lewis.mcgibb...@gmail.com [15]> wrote:
> 
>> What is the problem? There
is a community here that can help... if we know what is wrong! On Tue,
Feb 26, 2013 at 7:44 AM, Danilo Fernandes <
dan...@kelsorfernandes.com.br [14]> wrote: 
>> 
>>> I tried both and no
one function! :( -Mensagem original- De: kiran chitturi
[mailto:chitturikira...@gmail.com [8]] Enviada em: terça-feira, 26 de
fevereiro de 2013 12:32 Para: user@nutch.apache.org [9] Cc:
ferna...@gmail.com [10] Assunto: Re: Eclipse Error Let's keep the
discussion in the User mailing list. I would suggest you to follow the
instructions here to set up Nutch in Eclipse [1] JDK 1.6 + or 1.7 + will
be good enough. I would also suggest to keep your JRE compatible with
the JDK. [1] - http://wiki.apache.org/nutch/RunNutchInEclipse [11] On
Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes wrote: 
>>> 

Kiran, Do you think I need a JDK 7? ** ** *De:* kiran chitturi
[mailto:chitturikira...@gmail.com [1]] *Enviada em:* terça-feira, 26 de
fevereiro de 2013 11:57 *Para:* d...@nutch.apache.org [2] *Assunto:* Re:
Eclipse Error ** ** I think Nutch requires atleast Java 1.6. **
** On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes wrote: What
version of JDK fit with Nutch trunk? Anybody knows? ** ** 2013/2/25
Danilo Fernandes  Feng Lu, thanks for the fast reply.  But,
I'm using the JavaSE-1.6 (jre6) and always get this error. 
*De:* feng lu [mailto:amuseme...@gmail.com [5]] *Enviada em:*
segunda-feira, 25 de fevereiro de 2013 22:35 *Para:*
d...@nutch.apache.org [6] *Assunto:* Re: Eclipse Error  Hi
Danilo  "Unsupported maj.minor version 51.0" means that you
compiled your classes under a specific JDK, but then try to run them
under older
>>> version of JDK. 
>>> 
 So, you can't run classes
compiled with JDK 6.0 under JDK 5.0. The same with classes compiled
under JDK 7.0 when you try to run them under
>>> JDK 6.0.   On
Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes wrote: *Hi, I want do
some changes in Nutch to get a HTML and take some data from them. My
problem starts when I'm compiling the code in Eclipse. I alw
>>> 
 d
not load definitions from resource org/sonar/ant/antlib.xml. It could
not be found. *ivy-probe-antlib*: *ivy-download*:
[*taskdef*] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found.
*ivy-download-unchecked*: *ivy-init-antlib*: *ivy-init*:
*init*: [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
[*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes** **
[*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release** **
[*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test
[*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\test\classes
[*copy*] Copying 8 files to C:\Users\Danilo\workspace\Nutch\conf [*copy*]
Copying C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
to C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt [*copy*]
Copying C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template to
C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml [*copy*] Copying
C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml.template to
C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml [*copy*] Copying
C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt.template to
C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt [*copy*] Copying
C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml.template to
C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml [*copy*] Copying
C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt.template to
C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt [*copy*] Copying
C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml.template to
C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml [*copy*] Copying
C:\Users\Danilo\wor

Re: Nutch 2.1 - Image / Video Search

2013-02-26 Thread Tejas Patil
Hey Joaquin,

That seems to be an interesting tool to me. Have you integrated it with
Nutch? Just curious :)

Thanks,
Tejas Patil


On Mon, Feb 25, 2013 at 1:38 PM, J. Delgado wrote:

> If you're interested in pure image search you may want to use Nutch for
> crawling but something like imgseek (http://www.imgseek.net/isk-daemon)
> for indexing and search.
>
> -J
>
> On Monday, February 25, 2013, Jorge Luis Betancourt Gonzalez wrote:
>
> > Hi:
> >
> > Like Raja said, it's possible; the thing is that out of the box, Nutch is
> > only able to index the metadata of the file. You can always write some
> > plugins to implement any logic you desire.
> >
> > - Original message -
> > From: "Raja Kulasekaran" >
> > To: user@nutch.apache.org 
> > Sent: Sunday, 24 February 2013 13:31:28
> > Subject: Nutch 2.1 - Image / Video Search
> >
> > Hi,
> >
> > Is it possible to crawl images as well as videos with the latest Nutch
> > version? I am using Nutch 1.6. I would like to know whether I can go
> > ahead and use Nutch 1.6, or please suggest the appropriate version.
> >
> > Raja
> >
>
>
> --
> Sent from Gmail Mobile
>


Re: Eclipse Error

2013-02-26 Thread Tejas Patil
Hi Lewis,

The OP is not able to build Nutch in Eclipse. So far people have been
suspecting this part of the log:

**C:\Users\Danilo\workspace\Nutch\build.xml:96:
java.lang.UnsupportedClassVersionError: com/sun/tools/javac/Main :
Unsupported major.minor version 51.0*
*

It turns out that the Java version is fine (v1.6). I am not sure, but this
problem might be related to ivy, as per this error:

*> > [*ivy:resolve*]   unknown resolver main*
*> > [*ivy:resolve*] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE
> > DETAILS
*
I had faced this error before, but it was sporadic and used to go away after
invoking ant again. The OP faces it consistently.

@Danilo: Maybe turning on the verbose level for ivy might shed some light.

Thanks,
Tejas Patil


On Tue, Feb 26, 2013 at 10:32 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> What is the problem? There is a community here that can help... if we know
> what is wrong!
>
> On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes <
> dan...@kelsorfernandes.com.br> wrote:
>
> > I tried both and neither one works! :(
> >
> > -Mensagem original-
> > De: kiran chitturi [mailto:chitturikira...@gmail.com]
> > Enviada em: terça-feira, 26 de fevereiro de 2013 12:32
> > Para: user@nutch.apache.org
> > Cc: ferna...@gmail.com
> > Assunto: Re: Eclipse Error
> >
> > Let's keep the discussion in the User mailing list.
> >
> > I would suggest you to follow the instructions here to set up Nutch in
> > Eclipse [1]
> >
> > JDK 1.6 + or 1.7 + will be good enough. I would also suggest to keep your
> > JRE compatible with the JDK.
> >
> > [1] - http://wiki.apache.org/nutch/RunNutchInEclipse
> >
> >
> >
> >
> > On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes
> > wrote:
> >
> > > Kiran,
> > >
> > > Do you think I need a JDK 7?
> > >
> > > ** **
> > >
> > > *De:* kiran chitturi [mailto:chitturikira...@gmail.com]
> > > *Enviada em:* terça-feira, 26 de fevereiro de 2013 11:57
> > >
> > > *Para:* d...@nutch.apache.org
> > > *Assunto:* Re: Eclipse Error
> > >
> > > ** **
> > >
> > > I think Nutch requires at least Java 1.6.
> > >
> > > ** **
> > >
> > > On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes 
> > > wrote:
> > >
> > > What version of JDK fit with Nutch trunk?
> > >
> > > Anybody knows?
> > >
> > > ** **
> > >
> > > 2013/2/25 Danilo Fernandes 
> > >
> > > Feng Lu, thanks for the fast reply.
> > >
> > >  
> > >
> > > But, I’m using the JavaSE-1.6 (jre6) and always get this error.
> > >
> > >  
> > >
> > > *De:* feng lu [mailto:amuseme...@gmail.com] *Enviada em:*
> > > segunda-feira, 25 de fevereiro de 2013 22:35
> > > *Para:* d...@nutch.apache.org
> > > *Assunto:* Re: Eclipse Error
> > >
> > >  
> > >
> > > Hi Danilo
> > >
> > >  
> > >
> > > "Unsupported major.minor version 51.0" means that you compiled your
> > > classes under a specific JDK, but then try to run them under an older
> > > version of JDK. So, you can't run classes compiled with JDK 6.0 under
> > > JDK 5.0. The same with classes compiled under JDK 7.0 when you try to
> > > run them under JDK 6.0.
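As a side note, the target JDK of a compiled class can be checked directly: bytes 6-7 of a .class file hold the major version (50 = Java 6, 51 = Java 7). A hedged sketch (the class-file header below is fabricated for illustration; normally you would point this at a real .class file, or use `javap -verbose`):

```shell
# The first 8 bytes of a .class file are: magic (CA FE BA BE),
# minor version (2 bytes), major version (2 bytes).
# Fake a header with major version 51 (0x33 = Java 7) for illustration:
printf '\312\376\272\276\000\000\000\063' > Sample.class

# Read bytes 6-7 as unsigned ints and combine them into the major version:
major=$(od -An -j6 -N2 -tu1 Sample.class | awk '{print $1 * 256 + $2}')
echo "major version: $major"   # prints: major version: 51
```

A major version of 51 means the class was compiled for Java 7 and will throw UnsupportedClassVersionError on a Java 6 JRE.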
> > > 
> > >
> > >  
> > >
> > > On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes 
> > > wrote:
> > >
> > > *Hi, I want to do some changes in Nutch to get the HTML and take some
> > > data from it.
> > >
> > > My problem starts when I'm compiling the code in Eclipse.
> > >
> > > I always receive the following error message.*
> > >
> > > Buildfile: *C:\Users\Danilo\workspace\Nutch\build.xml*
> > >
> > >   [*taskdef*] Could not load definitions from resource
> > > org/sonar/ant/antlib.xml. It could not be found.
> > >
> > > *ivy-probe-antlib*:
> > >
> > > *ivy-download*:
> > >
> > >   [*taskdef*] Could not load definitions from resource
> > > org/sonar/ant/antlib.xml. It could not be found.
> > >
> > > *ivy-download-unchecked*:
> > >
> > > *ivy-init-antlib*:
> > >
> > > *ivy-init*:
> > >
> > > *init*:
> > >
> > > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
> > >
> > > [*mkdir*] Created dir:
> > > C:\Users\Danilo\workspace\Nutch\build\classes**
> > > **
> > >
> > > [*mkdir*] Created dir:
> > > C:\Users\Danilo\workspace\Nutch\build\release**
> > > **
> > >
> > > [*mkdir*] Created dir:
> > > C:\Users\Danilo\workspace\Nutch\build\test
> > >
> > > [*mkdir*] Created dir:
> > > C:\Users\Danilo\workspace\Nutch\build\test\classes
> > >
> > >  [*copy*] Copying 8 files to
> > > C:\Users\Danilo\workspace\Nutch\conf
> > >
> > >  [*copy*] Copying
> > > C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
> > > to
> > > C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
> > >
> > >  [*copy*] Copying
> > > C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template to
> > > C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
> > >
> > >  [*copy*] Copying
> > > C:\Users\Danilo\workspace\Nutch\conf\nutch-

Re: Eclipse Error

2013-02-26 Thread Lewis John Mcgibbney
What is the problem? There is a community here that can help... if we know
what is wrong!

On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes <
dan...@kelsorfernandes.com.br> wrote:

> I tried both and neither one works! :(
>
> -Mensagem original-
> De: kiran chitturi [mailto:chitturikira...@gmail.com]
> Enviada em: terça-feira, 26 de fevereiro de 2013 12:32
> Para: user@nutch.apache.org
> Cc: ferna...@gmail.com
> Assunto: Re: Eclipse Error
>
> Let's keep the discussion in the User mailing list.
>
> I would suggest you to follow the instructions here to set up Nutch in
> Eclipse [1]
>
> JDK 1.6 + or 1.7 + will be good enough. I would also suggest to keep your
> JRE compatible with the JDK.
>
> [1] - http://wiki.apache.org/nutch/RunNutchInEclipse
>
>
>
>
> On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes
> wrote:
>
> > Kiran,
> >
> > Do you think I need a JDK 7?
> >
> > ** **
> >
> > *De:* kiran chitturi [mailto:chitturikira...@gmail.com]
> > *Enviada em:* terça-feira, 26 de fevereiro de 2013 11:57
> >
> > *Para:* d...@nutch.apache.org
> > *Assunto:* Re: Eclipse Error
> >
> > ** **
> >
> > I think Nutch requires at least Java 1.6.
> >
> > ** **
> >
> > On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes 
> > wrote:
> >
> > What version of JDK fit with Nutch trunk?
> >
> > Anybody knows?
> >
> > ** **
> >
> > 2013/2/25 Danilo Fernandes 
> >
> > Feng Lu, thanks for the fast reply.
> >
> >  
> >
> > But, I’m using the JavaSE-1.6 (jre6) and always get this error.
> >
> >  
> >
> > *De:* feng lu [mailto:amuseme...@gmail.com] *Enviada em:*
> > segunda-feira, 25 de fevereiro de 2013 22:35
> > *Para:* d...@nutch.apache.org
> > *Assunto:* Re: Eclipse Error
> >
> >  
> >
> > Hi Danilo
> >
> >  
> >
> > "Unsupported major.minor version 51.0" means that you compiled your
> > classes under a specific JDK, but then try to run them under an older
> > version of JDK. So, you can't run classes compiled with JDK 6.0 under
> > JDK 5.0. The same with classes compiled under JDK 7.0 when you try to
> > run them under JDK 6.0.
> > 
> >
> >  
> >
> > On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes 
> > wrote:
> >
> > *Hi, I want to do some changes in Nutch to get the HTML and take some
> > data from it.
> >
> > My problem starts when I'm compiling the code in Eclipse.
> >
> > I always receive the following error message.*
> >
> > Buildfile: *C:\Users\Danilo\workspace\Nutch\build.xml*
> >
> >   [*taskdef*] Could not load definitions from resource
> > org/sonar/ant/antlib.xml. It could not be found.
> >
> > *ivy-probe-antlib*:
> >
> > *ivy-download*:
> >
> >   [*taskdef*] Could not load definitions from resource
> > org/sonar/ant/antlib.xml. It could not be found.
> >
> > *ivy-download-unchecked*:
> >
> > *ivy-init-antlib*:
> >
> > *ivy-init*:
> >
> > *init*:
> >
> > [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
> >
> > [*mkdir*] Created dir:
> > C:\Users\Danilo\workspace\Nutch\build\classes**
> > **
> >
> > [*mkdir*] Created dir:
> > C:\Users\Danilo\workspace\Nutch\build\release**
> > **
> >
> > [*mkdir*] Created dir:
> > C:\Users\Danilo\workspace\Nutch\build\test
> >
> > [*mkdir*] Created dir:
> > C:\Users\Danilo\workspace\Nutch\build\test\classes
> >
> >  [*copy*] Copying 8 files to
> > C:\Users\Danilo\workspace\Nutch\conf
> >
> >  [*copy*] Copying
> > C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template
> > to
> > C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
> >
> >  [*copy*] Copying
> > C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template to
> > C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
> >
> >  [*copy*] Copying
> > C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml.template to
> > C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml
> >
> >  [*copy*] Copying
> > C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt.template to
> > C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt
> >
> >  [*copy*] Copying
> > C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml.template to
> > C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml
> >
> >  [*copy*] Copying
> > C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt.template to
> > C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt
> >
> >  [*copy*] Copying
> > C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml.template to
> > C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml
> >
> >  [*copy*] Copying
> > C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt.template to
> > C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt
> >
> > *clean-lib*:
> >
> > *resolve-default*:
> >
> > [*ivy:resolve*] :: Ivy 2.2.0 - 20100923230623 ::
> > http://ant.apache.org/ivy/ ::
> >
> > [*ivy:resolve*] :: loading settings :: file =
> > C:\Users\Danilo\workspa

Re: nutch-2.1 with hbase - any good tool for querying results?

2013-02-26 Thread Lewis John Mcgibbney
We will be working on better support (gora-pig adapter) for this
functionality in Apache Gora > 0.3.
For now Kiran's suggestion is by far the best.
Thank you
Lewis

On Tue, Feb 26, 2013 at 10:17 AM, kiran chitturi
wrote:

> I found apache pig [1] convenient to use with Hbase for querying and
> filtering.
>
> 1 - http://pig.apache.org/
>
>
>
>
> On Tue, Feb 26, 2013 at 12:18 PM, adfel70  wrote:
>
> > Is anybody using a good tool for performing queries on the crawl results
> > directly from hbase?
> > Some of the queries I want to make are: get all the urls that failed
> > fetching, get all the urls that failed parsing.
> >
> > Querying HBase directly seems more convenient than running readdb,
> > waiting for results, then parsing the readdb output to get the required
> > information.
> >
> > thanks.
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/nutch-2-1-with-hbase-any-good-tool-for-querying-results-tp4043109.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> Kiran Chitturi
>



-- 
*Lewis*


Re: nutch-2.1 with hbase - any good tool for querying results?

2013-02-26 Thread kiran chitturi
I found apache pig [1] convenient to use with Hbase for querying and
filtering.

1 - http://pig.apache.org/
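For the record, a minimal Pig sketch along these lines, assuming Nutch 2.x with the HBase store; the table name ('webpage') and column ('f:st' for fetch status) are assumptions and should be checked against your gora-hbase-mapping.xml:

```pig
-- Load the row key and fetch-status column from the (assumed) webpage table
pages = LOAD 'hbase://webpage'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:st', '-loadKey true')
        AS (url:chararray, status:bytearray);

-- Keep only rows that have a recorded fetch status; the exact failure
-- status codes are an assumption and depend on the Nutch schema version.
fetched = FILTER pages BY status is not null;
DUMP fetched;
```

From there, grouping or filtering on the decoded status values replaces the readdb dump-and-grep workflow.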




On Tue, Feb 26, 2013 at 12:18 PM, adfel70  wrote:

> Is anybody using a good tool for performing queries on the crawl results
> directly from hbase?
> Some of the queries I want to make are: get all the urls that failed
> fetching, get all the urls that failed parsing.
>
> Querying HBase directly seems more convenient than running readdb,
> waiting for results, then parsing the readdb output to get the required
> information.
>
> thanks.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/nutch-2-1-with-hbase-any-good-tool-for-querying-results-tp4043109.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Kiran Chitturi


nutch-2.1 with hbase - any good tool for querying results?

2013-02-26 Thread adfel70
Is anybody using a good tool for performing queries on the crawl results
directly from hbase?
Some of the queries I want to make are: get all the urls that failed
fetching, get all the urls that failed parsing.

Querying HBase directly seems more convenient than running readdb, waiting
for results, then parsing the readdb output to get the required information.

thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/nutch-2-1-with-hbase-any-good-tool-for-querying-results-tp4043109.html
Sent from the Nutch - User mailing list archive at Nabble.com.


RES: Eclipse Error

2013-02-26 Thread Danilo Fernandes
I tried both and neither one works! :(

-Original message-
From: kiran chitturi [mailto:chitturikira...@gmail.com]
Sent: Tuesday, 26 February 2013 12:32
To: user@nutch.apache.org
Cc: ferna...@gmail.com
Subject: Re: Eclipse Error

Let's keep the discussion in the User mailing list.

I would suggest you follow the instructions here to set up Nutch in
Eclipse [1].

JDK 1.6+ or 1.7+ will be good enough. I would also suggest keeping your
JRE compatible with the JDK.

[1] - http://wiki.apache.org/nutch/RunNutchInEclipse




On Tue, Feb 26, 2013 at 10:05 AM, Danilo Fernandes
wrote:

> Kiran,
>
> Do you think I need a JDK 7?
>
> ** **
>
> *From:* kiran chitturi [mailto:chitturikira...@gmail.com]
> *Sent:* Tuesday, 26 February 2013 11:57
>
> *To:* d...@nutch.apache.org
> *Subject:* Re: Eclipse Error
>
> ** **
>
> I think Nutch requires at least Java 1.6.
>
> ** **
>
> On Tue, Feb 26, 2013 at 5:33 AM, Danilo Fernandes 
> wrote:
>
> What version of JDK fit with Nutch trunk?
>
> Anybody knows?
>
> ** **
>
> 2013/2/25 Danilo Fernandes 
>
> Feng Lu, thanks for the fast reply.
>
>  
>
> But, I’m using the JavaSE-1.6 (jre6) and always get this error.
>
>  
>
> *From:* feng lu [mailto:amuseme...@gmail.com] *Sent:*
> Monday, 25 February 2013 22:35
> *To:* d...@nutch.apache.org
> *Subject:* Re: Eclipse Error
>
>  
>
> Hi Danilo
>
>  
>
> "Unsupported major.minor version 51.0" means that you compiled your
> classes under a specific JDK, but then try to run them under an older
> version of JDK. So, you can't run classes compiled with JDK 6.0 under
> JDK 5.0. The same with classes compiled under JDK 7.0 when you try to
> run them under JDK 6.0.
> 
>
>  
>
> On Tue, Feb 26, 2013 at 9:12 AM, Danilo Fernandes 
> wrote:
>
> *Hi, I want to do some changes in Nutch to get the HTML and take some
> data from it.
>
> My problem starts when I'm compiling the code in Eclipse.
>
> I always receive the following error message.*
>
> Buildfile: *C:\Users\Danilo\workspace\Nutch\build.xml*
>
>   [*taskdef*] Could not load definitions from resource 
> org/sonar/ant/antlib.xml. It could not be found.
>
> *ivy-probe-antlib*:
>
> *ivy-download*:
>
>   [*taskdef*] Could not load definitions from resource 
> org/sonar/ant/antlib.xml. It could not be found.
>
> *ivy-download-unchecked*:
>
> *ivy-init-antlib*:
>
> *ivy-init*:
>
> *init*:
>
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build
>
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\classes
>
> [*mkdir*] Created dir: C:\Users\Danilo\workspace\Nutch\build\release
>
> [*mkdir*] Created dir: 
> C:\Users\Danilo\workspace\Nutch\build\test
>
> [*mkdir*] Created dir:
> C:\Users\Danilo\workspace\Nutch\build\test\classes
>
>  [*copy*] Copying 8 files to 
> C:\Users\Danilo\workspace\Nutch\conf
>
>  [*copy*] Copying
> C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt.template 
> to
> C:\Users\Danilo\workspace\Nutch\conf\automaton-urlfilter.txt
>
>  [*copy*] Copying
> C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml.template to
> C:\Users\Danilo\workspace\Nutch\conf\httpclient-auth.xml
>
>  [*copy*] Copying
> C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml.template to
> C:\Users\Danilo\workspace\Nutch\conf\nutch-site.xml
>
>  [*copy*] Copying
> C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt.template to
> C:\Users\Danilo\workspace\Nutch\conf\prefix-urlfilter.txt
>
>  [*copy*] Copying
> C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml.template to
> C:\Users\Danilo\workspace\Nutch\conf\regex-normalize.xml
>
>  [*copy*] Copying
> C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt.template to
> C:\Users\Danilo\workspace\Nutch\conf\regex-urlfilter.txt
>
>  [*copy*] Copying
> C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml.template to
> C:\Users\Danilo\workspace\Nutch\conf\subcollections.xml
>
>  [*copy*] Copying
> C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt.template to
> C:\Users\Danilo\workspace\Nutch\conf\suffix-urlfilter.txt
>
> *clean-lib*:
>
> *resolve-default*:
>
> [*ivy:resolve*] :: Ivy 2.2.0 - 20100923230623 ::
> http://ant.apache.org/ivy/ ::
>
> [*ivy:resolve*] :: loading settings :: file =
> C:\Users\Danilo\workspace\Nutch\ivy\ivysettings.xml
>
> [*ivy:resolve*] :: problems summary ::
>
> [*ivy:resolve*]  ERRORS
>
> [*ivy:resolve*]   unknown resolver main
>
> [*ivy:resolve*]   unknown resolver main
>
> [*ivy:resolve*]   unknown resolver main
>
> [*ivy:resolve*]   unknown resolver main
>
> [*ivy:resolve*]   unknown resolver main
>
> [*ivy:resolve*]   unknown resolver main
>
> [*ivy:resolve*]   unknown resolver main
>
> [*ivy:resolve*]   unkn
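Feng Lu's diagnosis above can be checked directly: the first eight bytes of a compiled .class file hold the magic number and the minor/major version. The sketch below is not part of Nutch; the class and method names are mine.

```java
import java.nio.ByteBuffer;

public class ClassVersion {
    // Returns {major, minor} decoded from the first 8 bytes of a .class file.
    // Major 50 = Java 6, 51 = Java 7, 52 = Java 8.
    static int[] header(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // class files are big-endian
        int magic = buf.getInt();
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        int minor = buf.getShort() & 0xFFFF;
        int major = buf.getShort() & 0xFFFF;
        return new int[] { major, minor };
    }

    public static void main(String[] args) {
        // First 8 bytes of a class compiled by JDK 7 (major version 51).
        byte[] jdk7 = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 51 };
        int[] v = header(jdk7);
        System.out.println("major=" + v[0] + " minor=" + v[1]); // prints major=51 minor=0
    }
}
```

Running this against a class from the Nutch build tells you which JDK produced it, and therefore which JRE is too old to load it.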

Re: Eclipse Error

2013-02-26 Thread kiran chitturi
Let's keep the discussion in the User mailing list.

I would suggest you follow the instructions here to set up Nutch in
Eclipse [1].

JDK 1.6+ or 1.7+ will be good enough. I would also suggest keeping your
JRE compatible with the JDK.

[1] - http://wiki.apache.org/nutch/RunNutchInEclipse





Re: Crawling URLs with query string while limiting only web pages

2013-02-26 Thread Ye T Thet
Feng Lu: Thanks for the tip. I will definitely try the approach. Appreciate
your help.

Tejas: I am using the grouping approach, filtering out some keywords from
the fetch log. So far I have observed that 20% of the fetch list is filled
with not-so-important URLs. I hope an optimized filter can do some good
for my crawler's performance.

Thanks for your directions.

Cheers,

Ye



On Mon, Feb 25, 2013 at 3:31 AM, Tejas Patil wrote:

> @Ye, you need not look at each url; random sampling will be better. It
> won't be accurate, but it is the practical thing to do. Even while going
> through the logs, extract the urls and sort them so that all of those
> belonging to the same host lie in the same group.
>
> @feng lu: +1. Good trick for removing the bad urls using normalization.
> The main problem for the OP would still be coming up with such rules by
> manually observing the logs.
>
> Thanks,
> Tejas Patil
>
>
> On Sun, Feb 24, 2013 at 7:16 AM, feng lu  wrote:
>
> > Hi Ye
> >
> > Can you add this pattern to the regex-normalize.xml configuration file
> > for the RegexURLNormalizer class?
> >
> > <regex>
> >   <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\?|&amp;|#|$)</pattern>
> >   <substitution>$4</substitution>
> > </regex>
> >
> > It removes parameters such as view, zoom, and session ids from urls.
> >
> > e.g. site1.com/article1/?view=printerfriendly
> > e.g. site1.com/article1/?zoom=large
> > e.g. site1.com/article1/?zoom=extralarge
> >
> > to
> >
> > e.g. site1.com/article1
> >
> >
> >
> >
> >
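The normalization rule quoted above can be tried out with plain java.util.regex; the `$4` substitution keeps only the terminator (`?`, `&`, `#`, or end of string) and drops the matched parameter. A small sketch (not Nutch code; note that a separate rule in regex-normalize.xml strips the trailing `?` left behind):

```java
import java.util.regex.Pattern;

public class NormalizeDemo {
    public static void main(String[] args) {
        // Same pattern as the regex-normalize.xml rule; group 4 is the terminator.
        Pattern p = Pattern.compile(
            "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\\?|&|#|$)");
        String url = "site1.com/article1/?view=printerfriendly";
        // Replacing the whole match with "$4" deletes "view=printerfriendly",
        // leaving "site1.com/article1/?"; another rule removes the trailing "?".
        System.out.println(p.matcher(url).replaceAll("$4"));
    }
}
```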
> > On Sun, Feb 24, 2013 at 9:48 PM, Ye T Thet 
> wrote:
> >
> > > Tejas,
> > >
> > > Thanks for your pointers. They are really helpful. As of now my
> > > approach follows your directions 1, 2 and 3. Since my sites are around
> > > 10k in number, I hope it will be manageable for the near future.
> > >
> > > I might need to apply your directions 4 and 5 in the future as well,
> > > but I believe it might be out of my league to get them right.
> > >
> > > Some extra information on my approach: most of my target sites use a
> > > CMS, and quite a number of them do NOT use pretty URLs. I have been
> > > grepping the log to identify the patterns of redundant or
> > > non-important URLs and adding regex rules to regex-urlfilter. 2
> > > million URLs is quite hard to process for one man though. Phew!
> > >
> > > I will share if I find an approach that could benefit us all.
> > >
> > > Regards,
> > >
> > > Ye
> > >
> > > On Sat, Feb 23, 2013 at 12:22 PM, Tejas Patil <
> tejas.patil...@gmail.com
> > > >wrote:
> > >
> > > > one correction in red below.
> > > >
> > > > On Fri, Feb 22, 2013 at 8:20 PM, Tejas Patil <
> tejas.patil...@gmail.com
> > > > >wrote:
> > > >
> > > > > I think that what you have done till now is logical. Typically in
> > > > > nutch crawls people don't want urls with query strings, but
> > > > > nowadays things have changed. For instance, category #2 you
> > > > > pointed out may capture some vital pages. I once ran into a
> > > > > similar issue. A crawler can't be made intelligent beyond a
> > > > > certain point, and I had to go through crawl logs to check which
> > > > > urls were being fetched and later refine my regex rules.
> > > > >
> > > > > Some things that I had considered doing:
> > > > > 1. Start off with rules which are less restrictive and observe the
> > > > > logs to see which urls are visited. This will give you an idea
> > > > > about the bad urls and the good ones. As you have already crawled
> > > > > for 10 days, you are (just !!) left with studying the logs.
> > > > > 2. After #1 is done, launch crawls with accept rules for the good
> > > > > urls and put a "-." at the end to avoid the bad urls.
> > > > > 3. Having a huge list of regexes is a bad thing, because comparing
> > > > > urls against regexes is a costly operation and is done for every
> > > > > url. A url getting a match early saves this time. So put patterns
> > > > > which capture a huge set of urls at the top of the regex urlfilter
> > > > > file.
> > > > > 4. Sometimes you don't want the parser to extract urls from
> > > > > certain areas of the page, as you know they are not going to yield
> > > > > anything good for you. Let's say that the "print" or "zoom" urls
> > > > > come from some specific tags of the html source. It's better not
> > > > > to parse those things and thus not have those urls in the first
> > > > > place. The profit here is that the regex rules to be defined are
> > > > > reduced.
> > > > > 5. An improvement over *#4* is that if you know the nature of the
> > > > > pages being crawled, you can tweak the parsers to extract urls
> > > > > from specific tags only. This reduces noise and gives a much
> > > > > cleaner fetch list.
> > > > >
> > > > > As far as I feel, this problem won't have an automated solution
> > > > > like modifying some config/setting. There is a decent amount of
> > > > > human
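Points 2 and 3 of the advice above might look like this in conf/regex-urlfilter.txt. This is only a sketch; the accept patterns are hypothetical placeholders for the "good url" rules discovered from the logs:

```text
# Broad accept rules first: filters are tried in order and the first match
# wins, so patterns that capture many urls should come early.
+^https?://([a-z0-9-]+\.)*site1\.com/article
+^https?://([a-z0-9-]+\.)*site2\.com/news
# Catch-all reject last: anything not accepted above is dropped.
-.
```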

RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Danilo Fernandes
  

Very nice. Thanks.

My problem now is getting Nutch to run inside Eclipse. I'm having some
problems with JDKs.

When I run Ant, I always receive a message about different versions of
JDK/JRE.

Do you have any idea about that?

On Tue, 26 Feb 2013 10:42:21 +, Markus Jelsma wrote:


> No, there is no feature for that. You would have to patch it up
> yourself. It shouldn't be very hard.
> 



RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Markus Jelsma
No, there is no feature for that. You would have to patch it up yourself. It 
shouldn't be very hard. 
 
-Original message-
> From:Danilo Fernandes 
> Sent: Tue 26-Feb-2013 11:37
> To: user@nutch.apache.org
> Subject: RE: regex-urlfilter file for multiple domains
> 
>   
> 
> Yes, my first option is different files for different domains.
> The point is how I can link the files with each domain. Do I need to
> make some changes in the Nutch code, or does the project have a feature
> for that?
> 
> On Tue, 26 Feb 2013 10:33:37 +, Markus Jelsma wrote: 
> 
> > Yes, it will support that until you run out of memory. But having a
> > million expressions is not going to work nicely. If you have a lot of
> > expressions but can divide them into domains, I would patch the filter
> > so it will only execute the filters that are for a specific domain.
> > 
> >
> > -Original message-
> >
> > >> From: Danilo Fernandes
> > >> Sent: Tue 26-Feb-2013 11:31
> > >> To: user@nutch.apache.org
> > >> Subject: RE: regex-urlfilter file for multiple domains
> > >>
> > >> Tejas, do you have any idea about how many rules I can use in the
> > >> file? I will probably work with 1M regexes for different URLs. Will
> > >> Nutch support that?
> 
>  
> 
> 
> 


RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Danilo Fernandes
  

Yes, my first option is different files for different domains.
The point is how I can link the files with each domain. Do I need to
make some changes in the Nutch code, or does the project have a feature
for that?

On Tue, 26 Feb 2013 10:33:37 +, Markus Jelsma wrote: 

> Yes, it will support that until you run out of memory. But having a
> million expressions is not going to work nicely. If you have a lot of
> expressions but can divide them into domains, I would patch the filter
> so it will only execute the filters that are for a specific domain.
> 
>
> -Original message-
>
> >> From: Danilo Fernandes
> >> Sent: Tue 26-Feb-2013 11:31
> >> To: user@nutch.apache.org
> >> Subject: RE: regex-urlfilter file for multiple domains
> >>
> >> Tejas, do you have any idea about how many rules I can use in the file?
> >> I will probably work with 1M regexes for different URLs. Will Nutch
> >> support that?

 




RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Markus Jelsma
Yes, it will support that until you run out of memory. But having a million
expressions is not going to work nicely. If you have a lot of expressions but
can divide them into domains, I would patch the filter so it will only execute
the filters that are for a specific domain.
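The per-domain idea suggested above can be sketched roughly as follows. This is hypothetical code, not Nutch's actual URLFilter API: regexes are bucketed by host, so each url is only matched against the rules for its own domain.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class PerDomainFilter {
    // Accept rules, keyed by the host they apply to.
    private final Map<String, List<Pattern>> rulesByHost = new HashMap<>();

    void addRule(String host, String regex) {
        rulesByHost.computeIfAbsent(host, h -> new ArrayList<>())
                   .add(Pattern.compile(regex));
    }

    /** Returns the url if an accept rule for its host matches, else null. */
    String filter(String url) {
        String host = URI.create(url).getHost();
        for (Pattern p : rulesByHost.getOrDefault(host, Collections.emptyList())) {
            if (p.matcher(url).find()) {
                return url;
            }
        }
        return null; // no rule for this host matched: drop the url
    }

    public static void main(String[] args) {
        PerDomainFilter f = new PerDomainFilter();
        f.addRule("site1.com", "/article");
        System.out.println(f.filter("http://site1.com/article1")); // accepted
        System.out.println(f.filter("http://site2.com/article1")); // null: no rules for site2.com
    }
}
```

With 1M expressions spread over, say, 10K domains, each url is now checked against roughly 100 patterns instead of all 1M.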
 
-Original message-
> From:Danilo Fernandes 
> Sent: Tue 26-Feb-2013 11:31
> To: user@nutch.apache.org
> Subject: RE: regex-urlfilter file for multiple domains
> 
> Tejas, do you have any idea about how many rules I can use in the file?
> 
> I will probably work with 1M regexes for different URLs.
> 
> Will Nutch support that?


RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Danilo Fernandes
Tejas, do you have any idea about how many rules I can use in the file?

I will probably work with 1M regexes for different URLs.

Will Nutch support that?

Re: Only a small portion of URLs is indexed in Solr at the end of the crawl

2013-02-26 Thread Stefan Scheffler

Am 26.02.2013 10:19, schrieb Amit Sela:

Hi all,

I'm running nutch 1.6 and solr 3.6.2 and I'm crawling with depth 1 topN
100 and 'db.update.additions.allowed' false.
The idea is to fetch, parse and index only the URLs in the seed list.

I seed ~120K URLs but in solr I see only ~20K indexed.

The fetch job counters show:

moved 49,937 -> permanent redirects, I think (not crawled by default; there is
a nutch property which allows following them)
robots_denied 1,149 -> forbidden by the robots.txt of the seed url
robots_denied_maxcrawldelay 267 -> the crawl delay in the robots.txt of the
seed url exceeds the configured maximum
hitByTimeLimit 6,072 -> the fetcher time limit was reached before these urls
could be fetched
exception 4,479 -> other errors
notmodified 2
access_denied 4 -> login needed
temp_moved 4,658 -> temporary redirects (not crawled by default; there is a
nutch property which allows following them)
success 23,033 -> your ~20K, which are indexed
notfound 1,658 -> 404
By the way, if you crawl with a depth of 1, you don't need to specify a topN,
because you will always crawl just the seed urls.
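Summing the fetch counters quoted above shows where most of the seeds went. They account for about 91K of the ~120K seeded urls; the remainder presumably never reached the fetcher, e.g. dropped by url filters/normalizers or deduplication before generation (an assumption, not stated in the thread):

```java
public class CounterSum {
    public static void main(String[] args) {
        // The ten fetch-status counts reported for the ~120K seeded urls.
        int[] counts = { 49937, 1149, 267, 6072, 4479, 2, 4, 4658, 23033, 1658 };
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        System.out.println(total); // prints 91259
    }
}
```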


and the ParserStatus success count is 22844

What happened to all the URLs? They are all active URLs, not some old
list...

Thanks,

Amit.




--
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheff...@avantgarde-labs.de



Only a small portion of URLs is indexed in Solr at the end of the crawl

2013-02-26 Thread Amit Sela
Hi all,

I'm running nutch 1.6 and solr 3.6.2 and I'm crawling with depth 1 topN
100 and 'db.update.additions.allowed' false.
The idea is to fetch, parse and index only the URLs in the seed list.

I seed ~120K URLs but in solr I see only ~20K indexed.

The fetch job counters show:

moved 49,937
robots_denied 1,149
robots_denied_maxcrawldelay 267
hitByTimeLimit 6,072
exception 4,479
notmodified 2
access_denied 4
temp_moved 4,658
success 23,033
notfound 1,658

and the ParserStatus success count is 22844

What happened to all the URLs? They are all active URLs, not some old
list...

Thanks,

Amit.