Re: Newbie

2015-02-08 Thread Mattmann, Chris A (3980)
Thanks Trevor. Moving user-owner@n.a.o to BCC since
I think you meant to ask this on the user@n.a.o list.

I think the best bet is to check out the Nutch wiki,
which has several tutorials and other info on how to get
started. Also, we would welcome you to join the dev
and user lists (by sending blank emails to dev-subscr...@nutch.apache.org
and to user-subscr...@nutch.apache.org) for discussion
and questions about the project.

Here is the link to the Nutch wiki:

http://wiki.apache.org/nutch/

Thank you!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Trevor Oakley 
Reply-To: "tre...@merrows.co.uk" 
Date: Sunday, February 8, 2015 at 10:56 AM
To: "user-ow...@nutch.apache.org" 
Subject: Newbie

>I am new to Nutch - is there a learning resource anywhere?
> 
>Thanks
>Trevor
> 
> 



[Nutch-newbie] Installation error

2013-05-17 Thread Shah, Nishant
Hi everyone,

This is my first post so apologies if this is not the correct question to ask.

I have followed the wiki tutorial and I am getting the below error. I am 
running in the local mode and don't have hadoop installed. Can you please help 
as I have no clue what's going wrong.

Thanks.
Nishant

The Error:
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: 
/home/local/ANT/nishans/nutch-1.6/apache-nutch-1.6/logs/hadoop.log (No such 
file or directory)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:207)
at java.io.FileOutputStream.(FileOutputStream.java:131)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
at 
org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
at 
org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
at 
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
at 
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
at org.apache.log4j.LogManager.(LogManager.java:125)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254)
at org.apache.nutch.crawl.Injector.(Injector.java:53)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
Injector: starting at 2013-05-17 17:22:22
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
at org.apache.nutch.crawl.Injector.run(Injector.java:318)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Injector.main(Injector.java:308)

nishans@ua41f725d6547517ff08c:~/nutch-1.6/apache-nutch-1.6$ clear

nishans@ua41f725d6547517ff08c:~/nutch-1.6/apache-nutch-1.6$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: 
/home/local/ANT/nishans/nutch-1.6/apache-nutch-1.6/logs/hadoop.log (No such 
file or directory)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:207)
at java.io.FileOutputStream.(FileOutputStream.java:131)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
at 
org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
at 
org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
at 
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
at 
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
at org.apache.log4j.LogManager.(LogManager.java:125)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254)
at org.apache.nutch.crawl.Injector.(Injector.java:53)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
Injector: starting at 2013-05-17 17:30:07
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
at org.apache.nutch.crawl.Injector.run(Injector.java:318)

Newbie: No search result

2011-05-04 Thread Roberto

Hello everyone, I'm a newbie to nutch... sorry if the question is silly...
I've installed Nutch according to the steps of the official tutorial. 
Everything seems ok, and the crawl completes (just with some error on 
specific pages), but I cannot get any result through the browser search. 
My catalina.out log says (if relevant):


03-May-2011 17:28:56 org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory nutch
2011-05-03 17:28:56,404 INFO  NutchBean - creating new bean
2011-05-03 17:28:56,564 WARN  SearchBean - Neither 
file:/var/lib/tomcat6/crawl/index nor file:/var/lib/tomcat6/crawl/indexes found!

Thank you in advance for any hint...


Newbie Question, hadoop error?

2016-06-13 Thread Jamal, Sarfaraz
Hi Guys,

I am attempting to run nutch using cygwin, and I am having the following 
problem:
Ps. I added Hadoop-core to the lib folder already -

I appreciate any insight or comment you guys may have -

$ bin/crawl -i urls/ TestCrawl/  2
Injecting seed URLs
/cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb urls/
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.commons.cli.OptionBuilder.withArgPattern(Ljava/lang/String;I)Lorg/apache/commons/cli/OptionBuilder;
at 
org.apache.hadoop.util.GenericOptionsParser.buildGeneralOptions(GenericOptionsParser.java:207)
at 
org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:370)
at 
org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:153)
at 
org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
   at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Error running:
  /cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb urls/
Failed with exit value 1.



Thanks!,

Sas


Few questions from a newbie

2011-01-24 Thread .: Abhishek :.
Hi all,

 I am very new to Nutch and Lucene as well. I have a few questions about
Nutch; I know they are very basic, but I could not get clear-cut answers
out of googling for this. The questions are,

   - If I have to crawl just 5-6 web sites or URLs, should I use intranet
   crawl or whole-web crawl?
   - How do I set up recrawls for these same web sites after the first crawl?
   - If I have to search the results via my own Java code, which jar
   files, APIs or samples should I be looking into?
   - Is there a book on Nutch?

Thanks a bunch for your patience. I appreciate your time.

./Abishek


RE: Newbie: No search result

2011-05-04 Thread McGibbney, Lewis John
Hi Roberto,

By the looks of it this has to do with correctly defining your searcher.dir 
property in nutch-site.xml

If you have set this property previously with 'file:/path/to/index' then remove 
the 'file:' and just try 'path/to/index'

How are you running Nutch? Although in this case catalina.out provides you with 
an indication of what is wrong, I think it would be best for you to have a look 
at hadoop.log instead of catalina.out... as this will provide a more 
comprehensive log of Nutch activity.

As for your error on specific pages, if you are using <=1.2 then you can add 
parsing plug-ins under the plugin.includes property in nutch-site.xml; this 
may provide better results. If you are using branch-1.3 or trunk, most of these 
plug-ins have been removed in a move to clean up the code, as much parsing can 
now be done with Tika (amongst others).

Lewis

From: Roberto [rmez...@infinito.it]
Sent: 04 May 2011 11:36
To: user@nutch.apache.org
Subject: Newbie: No search result

Hello everyone, I'm a newbie to nutch... sorry if the question is silly...
I've installed Nutch according to the steps of the official tutorial.
Everything seems ok, and the crawl completes (just with some error on
specific pages), but I cannot get any result through the browser search.
My catalina.out log says (if relevant):

03-May-2011 17:28:56 org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory nutch
2011-05-03 17:28:56,404 INFO  NutchBean - creating new bean
2011-05-03 17:28:56,564 WARN  SearchBean - Neither
file:/var/lib/tomcat6/crawl/index nor file:/var/lib/tomcat6/crawl/indexes found!

Thank you in advance for any hint...



Re: Newbie: No search result

2011-05-04 Thread Roberto

Thank you very much Lewis! It seems ok now...

The property "searcher.dir" was not set at all (The tutorial do not 
mention it...).
I edited  /var/lib/tomcat6/webapps/nutch/WEB-INF/classes/nutch-site.xml  
this way:





<property>
  <name>searcher.dir</name>
  <value>file:/var/apache-nutch-1.1-bin/crawl.test/</value>
</property>




the "file:" prefix was necessary, otherwise Tomcat adds 
"/var/lib/tomcat6" at the beginning of the path


Now Nutch web search works, but just for one of the two sites configured. 
Moreover, I have to distinguish searches for each of the two sites and for each of the 
two languages configured for each site (Italian and English). I will try 
to discover how to do independent searches (by site and by language). 
If you have some link I'll appreciate it a lot...

Thank you again,
Roberto


RE: Newbie: No search result

2011-05-04 Thread McGibbney, Lewis John
>Now nutch web search works, but just for one of two sites configured

Just to clarify, are you saying that the pages you configured have been 
fetched, processed and indexed but do not feature when you submit a query or 
that Nutch is failing to fetch one site when you are crawling?

>moreover I have to distinguish searches of each of two sites for each of
>two languages configured for each site (italian and english). I will try
>to discover how to do independent searches (by site and by language)

There have been various discussions on this over recent months; hopefully some 
of the threads on the user archives [1] may be able to help you with this. I 
have very little experience crawling and working with any data other than web 
data in English, so unfortunately I cannot comment specifically.

>If you have some link I'll appreciate a lot...

[1] http://www.mail-archive.com/user@nutch.apache.org/

HTH





Re: Newbie: No search result

2011-05-04 Thread Roberto
Ok, everything seems to work now. I've just created four separate 
'conf' and 'url' files (two sites with two language versions each) and 
four Tomcat Nutch instances, following this guide:

http://wiki.apache.org/nutch/GettingNutchRunningWithDebian

Thank you again for your help!


Re-Crawling Basic Syntax - newbie

2015-09-30 Thread Muhamad Muchlis
Hi,

I have a manual script for my first crawl; can anyone explain these commands
step by step:

*Initialize the crawldb*
bin/nutch inject urls/
*Generate URLs from crawldb*
bin/nutch generate -topN 80
*Fetch generated URLs*
bin/nutch fetch -all
*Parse fetched URLs*
bin/nutch parse -all
*Update database from parsed URLs*
bin/nutch updatedb -all
*Index parsed URLs*
bin/nutch index -all

Can anyone help me with a re-crawling script?
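A re-crawl is usually just the generate/fetch/parse/updatedb cycle above run again in a loop. A minimal sketch, assuming the same local setup as the commands above (the batch size and number of rounds are made-up values):

# hypothetical re-crawl loop; adjust -topN and the round count to your needs
for round in 1 2 3; do
  bin/nutch generate -topN 80
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb -all
done
bin/nutch index -all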



Thanks


Regard's

Muchlis


Re: Newbie Question, hadoop error?

2016-06-15 Thread Lewis John Mcgibbney
Hi Sas,
See response inline :)

On Wed, Jun 15, 2016 at 5:36 AM,  wrote:

> From: "Jamal, Sarfaraz" 
> To: "'user@nutch.apache.org'" 
> Cc:
> Date: Mon, 13 Jun 2016 17:36:44 -0400
> Subject: Newbie Question, hadoop error?
> Hi Guys,
>
> I am attempting to run nutch using cygwin,


Is this the Nutch 1.11 binary distribution you mean?


> and I am having the following problem:
> Ps. I added Hadoop-core to the lib folder already -
>
> I appreciate any insight or comment you guys may have -
>
> $ bin/crawl -i urls/ TestCrawl/  2
> Injecting seed URLs
> /cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb urls/
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.commons.cli.OptionBuilder.withArgPattern(Ljava/lang/String;I)Lorg/apache/commons/cli/OptionBuilder;
> at
> org.apache.hadoop.util.GenericOptionsParser.buildGeneralOptions(GenericOptionsParser.java:207)
> at
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:370)
> at
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:153)
> at
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:138)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
>at org.apache.nutch.crawl.Injector.main(Injector.java:369)
> Error running:
>   /cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb urls/
> Failed with exit value 1.


There are a few issues above.
1) You should change the data structures' parent directory from 'TestCrawl/'
to 'TestCrawl', i.e. remove the trailing forward slash. This will prevent
you from generating the CrawlDB in 'TestCrawl//crawldb' and will generate
it in 'TestCrawl/crawldb' instead (see the example below).
2) The presence of NoSuchMethodError would indicate that the
$NUTCH_HOME/lib directory is not on the JVM classpath. Please make sure
that it is.
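
For point 1, the corrected invocation would look roughly like this (same paths as in the original report, just without the trailing slash):

$ bin/crawl -i urls/ TestCrawl 2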

Lewis


Newbie Nutch/Solr Question(s)

2016-07-15 Thread Jamal, Sarfaraz
Hi Guys,

I have Nutch 'working', relatively speaking, and I am now ready to index it to Solr.

I already have a solr environment up and running and now wish to index a few 
websites.

I have read through the documentation and I believe I have to do something like 
this:

Instead of this:
"cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml 
${APACHE_SOLR_HOME}/example/solr/collection1/conf/"

I really just need to take all field types and all field names from the 
schema.xml file and add them to my existing ' managed-schema'

Correct?

Thanks!!

Sas


Re: Few questions from a newbie

2011-01-24 Thread Amna Waqar
1. To crawl just 5 to 6 websites, you can use both cases, but intranet crawl
gives you more control and speed.
2. After the first crawl, the recrawl interval for the same sites is 30 days by
default (db.fetcher.interval); you can change it according to your own
convenience (a config sketch follows below).
3. I've no idea about the third question
since I'm also a newbie.
Best of luck with Nutch learning.
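
For point 2, a sketch of overriding the interval in conf/nutch-site.xml (note: in Nutch 1.x the property is actually named db.fetch.interval.default and takes seconds; the 7-day value below is only an illustration):

<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value>
  <description>Sketch: re-fetch pages every 7 days instead of the default 30.</description>
</property>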


On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :.  wrote:

> Hi all,
>
>  I am very new to Nutch and Lucene as well. I am having few questions about
> Nutch, I know they are very much basic but I could not get clear cut
> answers
> out of googling for this. The questions are,
>
>   - If I have to crawl just 5-6 web sites or URL's should I use intranet
>   crawl or whole web crawl.
>   - How do I set recrawl's for these same web sites after the first crawl.
>   - If I have to start search the results via my own java code which jar
>   files or api's or samples should I be looking into.
>   - Is there a book on Nutch?
>
> Thanks a bunch for your patience. I appreciate your time.
>
> ./Abishek
>


Re: Few questions from a newbie

2011-01-24 Thread Charan K
Refer to NutchBean.java for the third question. You can run that from the command line 
to test the index.

 If you use SOLR indexing, it is going to be much simpler; they have a Solr 
Java client.

Sent from my iPhone

On Jan 24, 2011, at 8:07 PM, Amna Waqar  wrote:

> 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl
> gives u more control and speed
> 2.After the first crawl,the recrawling the same sites time is 30 days by
> default in db.fetcher.interval,you can change it according to ur own
> convenience.
> 3.I ve no idea about the third question
> cz  i m also a newbie
> Best of luck with nutch learning
> 
> 
> On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :.  wrote:
> 
>> Hi all,
>> 
>> I am very new to Nutch and Lucene as well. I am having few questions about
>> Nutch, I know they are very much basic but I could not get clear cut
>> answers
>> out of googling for this. The questions are,
>> 
>>  - If I have to crawl just 5-6 web sites or URL's should I use intranet
>>  crawl or whole web crawl.
>>  - How do I set recrawl's for these same web sites after the first crawl.
>>  - If I have to start search the results via my own java code which jar
>>  files or api's or samples should I be looking into.
>>  - Is there a book on Nutch?
>> 
>> Thanks a bunch for your patience. I appreciate your time.
>> 
>> ./Abishek
>> 


Re: Few questions from a newbie

2011-01-24 Thread alxsss
How to use solr to index nutch segments?
What is the meaning of db.fetcher.interval? Does this mean that if I run the 
same crawl command before 30 days it will do nothing?

Thanks.
Alex.

-Original Message-
From: Charan K 
To: user 
Cc: user 
Sent: Mon, Jan 24, 2011 8:24 pm
Subject: Re: Few questions from a newbie


Refer NutchBean.java for the their question. You can run than from command line 

to test the index.



 If you use SOLR indexing, it is going to be much simpler, they have a solr 
java 

client.. 



Sent from my iPhone



On Jan 24, 2011, at 8:07 PM, Amna Waqar  wrote:



> 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl

> gives u more control and speed

> 2.After the first crawl,the recrawling the same sites time is 30 days by

> default in db.fetcher.interval,you can change it according to ur own

> convenience.

> 3.I ve no idea about the third question

> cz  i m also a newbie

> Best of luck with nutch learning

> 

> 

> On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :.  wrote:

> 

>> Hi all,

>> 

>> I am very new to Nutch and Lucene as well. I am having few questions about

>> Nutch, I know they are very much basic but I could not get clear cut

>> answers

>> out of googling for this. The questions are,

>> 

>>  - If I have to crawl just 5-6 web sites or URL's should I use intranet

>>  crawl or whole web crawl.

>>  - How do I set recrawl's for these same web sites after the first crawl.

>>  - If I have to start search the results via my own java code which jar

>>  files or api's or samples should I be looking into.

>>  - Is there a book on Nutch?

>> 

>> Thanks a bunch for your patience. I appreciate your time.

>> 

>> ./Abishek

>> 




 


RE: Few questions from a newbie

2011-01-24 Thread Chris Woolum
To use solr:
 
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb 
crawl/segments/*
 
assuming the crawl dir is crawl



From: alx...@aim.com [mailto:alx...@aim.com]
Sent: Mon 1/24/2011 9:23 PM
To: user@nutch.apache.org
Subject: Re: Few questions from a newbie



How to use solr to index nutch segments?
What is the meaning of db.fetcher.interval? Does this mean that if I run the 
same crawl command before 30 days it will do nothing?

Thanks.
Alex.










-Original Message-
From: Charan K 
To: user 
Cc: user 
Sent: Mon, Jan 24, 2011 8:24 pm
Subject: Re: Few questions from a newbie


Refer NutchBean.java for the their question. You can run than from command line

to test the index.



 If you use SOLR indexing, it is going to be much simpler, they have a solr java

client..



Sent from my iPhone



On Jan 24, 2011, at 8:07 PM, Amna Waqar  wrote:



> 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl

> gives u more control and speed

> 2.After the first crawl,the recrawling the same sites time is 30 days by

> default in db.fetcher.interval,you can change it according to ur own

> convenience.

> 3.I ve no idea about the third question

> cz  i m also a newbie

> Best of luck with nutch learning

>

>

> On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :.  wrote:

>

>> Hi all,

>>

>> I am very new to Nutch and Lucene as well. I am having few questions about

>> Nutch, I know they are very much basic but I could not get clear cut

>> answers

>> out of googling for this. The questions are,

>>

>>  - If I have to crawl just 5-6 web sites or URL's should I use intranet

>>  crawl or whole web crawl.

>>  - How do I set recrawl's for these same web sites after the first crawl.

>>  - If I have to start search the results via my own java code which jar

>>  files or api's or samples should I be looking into.

>>  - Is there a book on Nutch?

>>

>> Thanks a bunch for your patience. I appreciate your time.

>>

>> ./Abishek

>>









Re: Few questions from a newbie

2011-01-24 Thread charan kumar
db.fetcher.interval: it means that URLs which were fetched in the last 30
days will not be fetched again, i.e. a URL is eligible for refetch only
30 days after the last crawl.


On Mon, Jan 24, 2011 at 9:23 PM,  wrote:

> How to use solr to index nutch segments?
> What is the meaning of db.fetcher.interval? Does this mean that if I run
> the same crawl command before 30 days it will do nothing?
>
> Thanks.
> Alex.
>
>
>
>
>
>
>
>
>
>
> -Original Message-
> From: Charan K 
> To: user 
> Cc: user 
> Sent: Mon, Jan 24, 2011 8:24 pm
> Subject: Re: Few questions from a newbie
>
>
> Refer NutchBean.java for the their question. You can run than from command
> line
>
> to test the index.
>
>
>
>  If you use SOLR indexing, it is going to be much simpler, they have a solr
> java
>
> client..
>
>
>
> Sent from my iPhone
>
>
>
> On Jan 24, 2011, at 8:07 PM, Amna Waqar  wrote:
>
>
>
> > 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl
>
> > gives u more control and speed
>
> > 2.After the first crawl,the recrawling the same sites time is 30 days by
>
> > default in db.fetcher.interval,you can change it according to ur own
>
> > convenience.
>
> > 3.I ve no idea about the third question
>
> > cz  i m also a newbie
>
> > Best of luck with nutch learning
>
> >
>
> >
>
> > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. 
> wrote:
>
> >
>
> >> Hi all,
>
> >>
>
> >> I am very new to Nutch and Lucene as well. I am having few questions
> about
>
> >> Nutch, I know they are very much basic but I could not get clear cut
>
> >> answers
>
> >> out of googling for this. The questions are,
>
> >>
>
> >>  - If I have to crawl just 5-6 web sites or URL's should I use intranet
>
> >>  crawl or whole web crawl.
>
> >>  - How do I set recrawl's for these same web sites after the first
> crawl.
>
> >>  - If I have to start search the results via my own java code which jar
>
> >>  files or api's or samples should I be looking into.
>
> >>  - Is there a book on Nutch?
>
> >>
>
> >> Thanks a bunch for your patience. I appreciate your time.
>
> >>
>
> >> ./Abishek
>
> >>
>
>
>
>
>
>


Re: Few questions from a newbie

2011-01-25 Thread .: Abhishek :.
Thanks Chris, Charan and Alex.

I am looking into the crawl statistics now. And I see fields like
db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what do
they mean?

And I also see that db_unfetched is much higher than db_fetched. Does
it mean most of the pages were not crawled at all due to some issues?

Thanks again for your time!


On Tue, Jan 25, 2011 at 2:33 PM, charan kumar wrote:

> db.fetcher.interval : It means that URLS which were fetched in the last 30
> days  will not be fetched. Or A URL is eligible for refetch only
> after 30 days of last crawl.
>
>
> On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
>
> > How to use solr to index nutch segments?
> > What is the meaning of db.fetcher.interval? Does this mean that if I run
> > the same crawl command before 30 days it will do nothing?
> >
> > Thanks.
> > Alex.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > -Original Message-
> > From: Charan K 
> > To: user 
> > Cc: user 
> > Sent: Mon, Jan 24, 2011 8:24 pm
> > Subject: Re: Few questions from a newbie
> >
> >
> > Refer NutchBean.java for the their question. You can run than from
> command
> > line
> >
> > to test the index.
> >
> >
> >
> >  If you use SOLR indexing, it is going to be much simpler, they have a
> solr
> > java
> >
> > client..
> >
> >
> >
> > Sent from my iPhone
> >
> >
> >
> > On Jan 24, 2011, at 8:07 PM, Amna Waqar  wrote:
> >
> >
> >
> > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl
> >
> > > gives u more control and speed
> >
> > > 2.After the first crawl,the recrawling the same sites time is 30 days
> by
> >
> > > default in db.fetcher.interval,you can change it according to ur own
> >
> > > convenience.
> >
> > > 3.I ve no idea about the third question
> >
> > > cz  i m also a newbie
> >
> > > Best of luck with nutch learning
> >
> > >
> >
> > >
> >
> > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. 
> > wrote:
> >
> > >
> >
> > >> Hi all,
> >
> > >>
> >
> > >> I am very new to Nutch and Lucene as well. I am having few questions
> > about
> >
> > >> Nutch, I know they are very much basic but I could not get clear cut
> >
> > >> answers
> >
> > >> out of googling for this. The questions are,
> >
> > >>
> >
> > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> intranet
> >
> > >>  crawl or whole web crawl.
> >
> > >>  - How do I set recrawl's for these same web sites after the first
> > crawl.
> >
> > >>  - If I have to start search the results via my own java code which
> jar
> >
> > >>  files or api's or samples should I be looking into.
> >
> > >>  - Is there a book on Nutch?
> >
> > >>
> >
> > >> Thanks a bunch for your patience. I appreciate your time.
> >
> > >>
> >
> > >> ./Abishek
> >
> > >>
> >
> >
> >
> >
> >
> >
>


Re: Few questions from a newbie

2011-01-25 Thread Markus Jelsma
These values come from the CrawlDB and have the following meaning.

db_unfetched
This is the number of URL's that are to be crawled when the next batch is 
started. This number is usually limited with the generate.max.per.host 
setting. So, if there are 5000 unfetched and generate.max.per.host is set to 
1000, the next batch will fetch only 1000. Watch, the number of unfetched will 
usually not be 5000-1000 because new URL's have been discovered and added to 
the CrawlDB.

db_fetched
These URLs have been fetched. Their next fetch will be after db.fetcher.interval. 
But this is not always the case: the adaptive schedule algorithm can 
tune this number depending on several settings. With these you can tune the 
interval when a page is modified or not modified.

db_gone
HTTP 404 Not Found

db_redir-temp
HTTP 307 Temporary Redirect

db_redir_perm
HTTP 301 Moved Permanently

Code:
http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup

Configuration:
http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
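
A sketch of the generate.max.per.host cap mentioned above, as it would look in conf/nutch-site.xml (1000 mirrors the example; the default, -1, means no limit):

<property>
  <name>generate.max.per.host</name>
  <value>1000</value>
</property>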

> Thanks Chris, Charan and Alex.
> 
> I am looking into the crawl statistics now. And I see fields like
> db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what do
> they mean?
> 
> And, I also see the db_unfetched is way too high than the db_fetched. Does
> it mean most of the pages did not crawl at all due to some issues?
> 
> Thanks again for your time!
> 
> On Tue, Jan 25, 2011 at 2:33 PM, charan kumar wrote:
> > db.fetcher.interval : It means that URLS which were fetched in the last
> > 30 days  will not be fetched. Or A URL is eligible for refetch
> > only after 30 days of last crawl.
> > 
> > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > How to use solr to index nutch segments?
> > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > run the same crawl command before 30 days it will do nothing?
> > > 
> > > Thanks.
> > > Alex.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -Original Message-
> > > From: Charan K 
> > > To: user 
> > > Cc: user 
> > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > Subject: Re: Few questions from a newbie
> > > 
> > > 
> > > Refer NutchBean.java for the their question. You can run than from
> > 
> > command
> > 
> > > line
> > > 
> > > to test the index.
> > > 
> > >  If you use SOLR indexing, it is going to be much simpler, they have a
> > 
> > solr
> > 
> > > java
> > > 
> > > client..
> > > 
> > > 
> > > 
> > > Sent from my iPhone
> > > 
> > > On Jan 24, 2011, at 8:07 PM, Amna Waqar  wrote:
> > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > crawl
> > > > 
> > > > gives u more control and speed
> > > > 
> > > > 2.After the first crawl,the recrawling the same sites time is 30 days
> > 
> > by
> > 
> > > > default in db.fetcher.interval,you can change it according to ur own
> > > > 
> > > > convenience.
> > > > 
> > > > 3.I ve no idea about the third question
> > > > 
> > > > cz  i m also a newbie
> > > > 
> > > > Best of luck with nutch learning
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. 
> > > 
> > > wrote:
> > > >> Hi all,
> > > >> 
> > > >> 
> > > >> 
> > > >> I am very new to Nutch and Lucene as well. I am having few questions
> > > 
> > > about
> > > 
> > > >> Nutch, I know they are very much basic but I could not get clear cut
> > > >> 
> > > >> answers
> > > >> 
> > > >> out of googling for this. The questions are,
> > > >> 
> > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > 
> > intranet
> > 
> > > >>  crawl or whole web crawl.
> > > >>  
> > > >>  - How do I set recrawl's for these same web sites after the first
> > > 
> > > crawl.
> > > 
> > > >>  - If I have to start search the results via my own java code which
> > 
> > jar
> > 
> > > >>  files or api's or samples should I be looking into.
> > > >>  
> > > >>  - Is there a book on Nutch?
> > > >> 
> > > >> Thanks a bunch for your patience. I appreciate your time.
> > > >> 
> > > >> 
> > > >> 
> > > >> ./Abishek


Re: Few questions from a newbie

2011-01-25 Thread .: Abhishek :.
Thanks a bunch Markus.

By the way, is there some book or material on Nutch which would help me
understand it better? I come from an application development background
and all the crawl and search stuff is *very* new to me :)


On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
wrote:

> These values come from the CrawlDB and have the following meaning.
>
> db_unfetched
> This is the number of URL's that are to be crawled when the next batch is
> started. This number is usually limited with the generate.max.per.host
> setting. So, if there are 5000 unfetched and generate.max.per.host is set
> to
> 1000, the next batch will fetch only 1000. Watch, the number of unfetched
> will
> usually not be 5000-1000 because new URL's have been discovered and added
> to
> the CrawlDB.
>
> db_fetched
> These URL's have been fetched. Their next fetch will be
> db.fetcher.interval.
> But, this is not always the case. There the adaprive schedule algorithm can
> tune this number depending on several settings. With these you can tune the
> interval when a page is modified or not modified.
>
> db_gone
> HTTP 404 Not Found
>
> db_redir-temp
> HTTP 307 Temporary Redirect
>
> db_redir_perm
> HTTP 301 Moved Permanently
>
> Code:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
> Configuration:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-
> default.xml?view=markup<http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-%0Adefault.xml?view=markup>
>
> > Thanks Chris, Charan and Alex.
> >
> > I am looking into the crawl statistics now. And I see fields like
> > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what
> do
> > they mean?
> >
> > And, I also see the db_unfetched is way too high than the db_fetched.
> Does
> > it mean most of the pages did not crawl at all due to some issues?
> >
> > Thanks again for your time!
> >
> > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar  >wrote:
> > > db.fetcher.interval : It means that URLS which were fetched in the last
> > > 30 days  will not be fetched. Or A URL is eligible for refetch
> > > only after 30 days of last crawl.
> > >
> > > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > > How to use solr to index nutch segments?
> > > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > > run the same crawl command before 30 days it will do nothing?
> > > >
> > > > Thanks.
> > > > Alex.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -Original Message-
> > > > From: Charan K 
> > > > To: user 
> > > > Cc: user 
> > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > Subject: Re: Few questions from a newbie
> > > >
> > > >
> > > > Refer NutchBean.java for the their question. You can run than from
> > >
> > > command
> > >
> > > > line
> > > >
> > > > to test the index.
> > > >
> > > >  If you use SOLR indexing, it is going to be much simpler, they have
> a
> > >
> > > solr
> > >
> > > > java
> > > >
> > > > client..
> > > >
> > > >
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar 
> wrote:
> > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > crawl
> > > > >
> > > > > gives u more control and speed
> > > > >
> > > > > 2.After the first crawl,the recrawling the same sites time is 30
> days
> > >
> > > by
> > >
> > > > > default in db.fetcher.interval,you can change it according to ur
> own
> > > > >
> > > > > convenience.
> > > > >
> > > > > 3.I ve no idea about the third question
> > > > >
> > > > > cz  i m also a newbie
> > > > >
> > > > > Best of luck with nutch learning
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :.  >
> > > >
> > > > wrote:
> > > > >> Hi all,
> > > > >>
> > > > >>
> > > > >>
> > > > >> I am very new to Nutch and Lucene as well. I am having few
> questions
> > > >
> > > > about
> > > >
> > > > >> Nutch, I know they are very much basic but I could not get clear
> cut
> > > > >>
> > > > >> answers
> > > > >>
> > > > >> out of googling for this. The questions are,
> > > > >>
> > > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > >
> > > intranet
> > >
> > > > >>  crawl or whole web crawl.
> > > > >>
> > > > >>  - How do I set recrawl's for these same web sites after the first
> > > >
> > > > crawl.
> > > >
> > > > >>  - If I have to start search the results via my own java code
> which
> > >
> > > jar
> > >
> > > > >>  files or api's or samples should I be looking into.
> > > > >>
> > > > >>  - Is there a book on Nutch?
> > > > >>
> > > > >> Thanks a bunch for your patience. I appreciate your time.
> > > > >>
> > > > >>
> > > > >>
> > > > >> ./Abishek
>


Re: Few questions from a newbie

2011-01-26 Thread Julien Nioche
Tom White's book on Hadoop is a must have for anyone wanting to understand
how Nutch and Hadoop work. There is a section in it specifically about Nutch
written by Andrzej as well


On 26 January 2011 03:02, .: Abhishek :.  wrote:

> Thanks a bunch Markus.
>
> By the way, is there some book or material on Nutch which would help me
> understanding it better? I  come from an application development background
> and all the crawl n search stuff is *very* new to me :)
>
>
> On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
> wrote:
>
> > These values come from the CrawlDB and have the following meaning.
> >
> > db_unfetched
> > This is the number of URL's that are to be crawled when the next batch is
> > started. This number is usually limited with the generate.max.per.host
> > setting. So, if there are 5000 unfetched and generate.max.per.host is set
> > to
> > 1000, the next batch will fetch only 1000. Watch, the number of unfetched
> > will
> > usually not be 5000-1000 because new URL's have been discovered and added
> > to
> > the CrawlDB.
> >
> > db_fetched
> > These URL's have been fetched. Their next fetch will be
> > db.fetcher.interval.
> > But, this is not always the case. There the adaprive schedule algorithm
> can
> > tune this number depending on several settings. With these you can tune
> the
> > interval when a page is modified or not modified.
> >
> > db_gone
> > HTTP 404 Not Found
> >
> > db_redir-temp
> > HTTP 307 Temporary Redirect
> >
> > db_redir_perm
> > HTTP 301 Moved Permanently
> >
> > Code:
> >
> >
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> >
> > Configuration:
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-
> > default.xml?view=markup<
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-%0Adefault.xml?view=markup
> >
> >
> > > Thanks Chris, Charan and Alex.
> > >
> > > I am looking into the crawl statistics now. And I see fields like
> > > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm,
> what
> > do
> > > they mean?
> > >
> > > And, I also see the db_unfetched is way too high than the db_fetched.
> > Does
> > > it mean most of the pages did not crawl at all due to some issues?
> > >
> > > Thanks again for your time!
> > >
> > > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar  > >wrote:
> > > > db.fetcher.interval : It means that URLS which were fetched in the
> last
> > > > 30 days  will not be fetched. Or A URL is eligible for
> refetch
> > > > only after 30 days of last crawl.
> > > >
> > > > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > > > How to use solr to index nutch segments?
> > > > > What is the meaning of db.fetcher.interval? Does this mean that if
> I
> > > > > run the same crawl command before 30 days it will do nothing?
> > > > >
> > > > > Thanks.
> > > > > Alex.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -Original Message-
> > > > > From: Charan K 
> > > > > To: user 
> > > > > Cc: user 
> > > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > > Subject: Re: Few questions from a newbie
> > > > >
> > > > >
> > > > > Refer NutchBean.java for the their question. You can run than from
> > > >
> > > > command
> > > >
> > > > > line
> > > > >
> > > > > to test the index.
> > > > >
> > > > >  If you use SOLR indexing, it is going to be much simpler, they
> have
> > a
> > > >
> > > > solr
> > > >
> > > > > java
> > > > >
> > > > > client..
> > > > >
> > > > >
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar 
> > wrote:
> > > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > > crawl
> > > > > >
> > &

Re: Few questions from a newbie

2011-01-26 Thread .: Abhishek :.
Thanks Julien. I will get the book :)

On Wed, Jan 26, 2011 at 5:09 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Tom White's book on Hadoop is a must have for anyone wanting to understand
> how Nutch and Hadoop work. There is a section in it specifically about
> Nutch
> written by Andrzej as well
>
>
> On 26 January 2011 03:02, .: Abhishek :.  wrote:
>
> > Thanks a bunch Markus.
> >
> > By the way, is there some book or material on Nutch which would help me
> > understanding it better? I  come from an application development
> background
> > and all the crawl n search stuff is *very* new to me :)
> >
> >
> > On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
> > wrote:
> >
> > > These values come from the CrawlDB and have the following meaning.
> > >
> > > db_unfetched
> > > This is the number of URL's that are to be crawled when the next batch
> is
> > > started. This number is usually limited with the generate.max.per.host
> > > setting. So, if there are 5000 unfetched and generate.max.per.host is
> set
> > > to
> > > 1000, the next batch will fetch only 1000. Watch, the number of
> unfetched
> > > will
> > > usually not be 5000-1000 because new URL's have been discovered and
> added
> > > to
> > > the CrawlDB.
> > >
> > > db_fetched
> > > These URL's have been fetched. Their next fetch will be
> > > db.fetcher.interval.
> > > But, this is not always the case. There the adaprive schedule algorithm
> > can
> > > tune this number depending on several settings. With these you can tune
> > the
> > > interval when a page is modified or not modified.
> > >
> > > db_gone
> > > HTTP 404 Not Found
> > >
> > > db_redir-temp
> > > HTTP 307 Temporary Redirect
> > >
> > > db_redir_perm
> > > HTTP 301 Moved Permanently
> > >
> > > Code:
> > >
> > >
> >
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> > >
> > > Configuration:
> > > http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-
> > > default.xml?view=markup<
> >
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-%0Adefault.xml?view=markup
> > >
> > >
> > > > Thanks Chris, Charan and Alex.
> > > >
> > > > I am looking into the crawl statistics now. And I see fields like
> > > > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm,
> > what
> > > do
> > > > they mean?
> > > >
> > > > And, I also see the db_unfetched is way too high than the db_fetched.
> > > Does
> > > > it mean most of the pages did not crawl at all due to some issues?
> > > >
> > > > Thanks again for your time!
> > > >
> > > > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <
> charan.ku...@gmail.com
> > > >wrote:
> > > > > db.fetcher.interval : It means that URLS which were fetched in the
> > last
> > > > > 30 days  will not be fetched. Or A URL is eligible for
> > refetch
> > > > > only after 30 days of last crawl.
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > > > > How to use solr to index nutch segments?
> > > > > > What is the meaning of db.fetcher.interval? Does this mean that
> if
> > I
> > > > > > run the same crawl command before 30 days it will do nothing?
> > > > > >
> > > > > > Thanks.
> > > > > > Alex.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > -Original Message-
> > > > > > From: Charan K 
> > > > > > To: user 
> > > > > > Cc: user 
> > > > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > > > Subject: Re: Few questions from a newbie
> > > > > >
> > > > > >
> > > > > > Refer NutchBean.java for the their question. You can run than
> from
> > > > >
> > > > > command
> > > > >
> > > > 

RE: Few questions from a newbie

2011-01-26 Thread McGibbney, Lewis John
I can only speak for myself, but I think that reading up on 'search', e.g. 
Lucene, is really the first stop prior to engaging with the crawling stuff. 
There are publications out there dealing with building search applications, but 
these only contain small sections on web crawlers and their code examples are fairly 
dated now.

Hope this helps


From: .: Abhishek :. [ab1s...@gmail.com]
Sent: 26 January 2011 03:02
To: markus.jel...@openindex.io
Cc: user@nutch.apache.org
Subject: Re: Few questions from a newbie

Thanks a bunch Markus.

By the way, is there some book or material on Nutch which would help me
understanding it better? I  come from an application development background
and all the crawl n search stuff is *very* new to me :)


On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
wrote:

> These values come from the CrawlDB and have the following meaning.
>
> db_unfetched
> This is the number of URL's that are to be crawled when the next batch is
> started. This number is usually limited with the generate.max.per.host
> setting. So, if there are 5000 unfetched and generate.max.per.host is set
> to
> 1000, the next batch will fetch only 1000. Watch, the number of unfetched
> will
> usually not be 5000-1000 because new URL's have been discovered and added
> to
> the CrawlDB.
>
> db_fetched
> These URL's have been fetched. Their next fetch will be
> db.fetcher.interval.
> But, this is not always the case. There the adaprive schedule algorithm can
> tune this number depending on several settings. With these you can tune the
> interval when a page is modified or not modified.
>
> db_gone
> HTTP 404 Not Found
>
> db_redir-temp
> HTTP 307 Temporary Redirect
>
> db_redir_perm
> HTTP 301 Moved Permanently
>
> Code:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
> Configuration:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-
> default.xml?view=markup<http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-%0Adefault.xml?view=markup>
>
> > Thanks Chris, Charan and Alex.
> >
> > I am looking into the crawl statistics now. And I see fields like
> > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what
> do
> > they mean?
> >
> > And, I also see the db_unfetched is way too high than the db_fetched.
> Does
> > it mean most of the pages did not crawl at all due to some issues?
> >
> > Thanks again for your time!
> >
> > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar  >wrote:
> > > db.fetcher.interval : It means that URLS which were fetched in the last
> > > 30 days  will not be fetched. Or A URL is eligible for refetch
> > > only after 30 days of last crawl.
> > >
> > > On Mon, Jan 24, 2011 at 9:23 PM,  wrote:
> > > > How to use solr to index nutch segments?
> > > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > > run the same crawl command before 30 days it will do nothing?
> > > >
> > > > Thanks.
> > > > Alex.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -Original Message-
> > > > From: Charan K 
> > > > To: user 
> > > > Cc: user 
> > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > Subject: Re: Few questions from a newbie
> > > >
> > > >
> > > > Refer NutchBean.java for the their question. You can run than from
> > >
> > > command
> > >
> > > > line
> > > >
> > > > to test the index.
> > > >
> > > >  If you use SOLR indexing, it is going to be much simpler, they have
> a
> > >
> > > solr
> > >
> > > > java
> > > >
> > > > client..
> > > >
> > > >
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar 
> wrote:
> > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > crawl
> > > > >
> > > > > gives u more control and speed
> > > > >
> > > > > 2.After the first crawl,the recrawling the same sites time is 30
> days
> > >
> > > by
> > >
> > > > > default in db.fetcher.interval,you can change it according to ur

Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Hi list,

I have given the set of urls as

http://is.gd/Jt32Cf
http://is.gd/hS3lEJ
http://is.gd/Jy1Im3
http://is.gd/QoJ8xy
http://is.gd/e4ct89
http://is.gd/WAOVmd
http://is.gd/lhkA69
http://is.gd/3OilLD
. 43 such urls

And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth 3

*arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
*CrawlDb statistics start: crawl/crawldb*
*Statistics for CrawlDb: crawl/crawldb*
*TOTAL urls: 43*
*retry 0: 43*
*min score: 1.0*
*avg score: 1.0*
*max score: 1.0*
*status 3 (db_gone): 1*
*status 4 (db_redir_temp): 1*
*status 5 (db_redir_perm): 41*
*CrawlDb statistics: done*

When I am trying to read the content from the segments, the content block is
empty for every record.

Can you please tell me where I can get the content of these urls.
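
(For context, "reading the content from the segments" here usually means a dump like the following sketch; the segment name is hypothetical and will differ locally:)

bin/nutch readseg -dump crawl/segments/20110126120000 segment_dump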

Thanks and regards,*
*Arjun Kumar Reddy


Re: Few questions from a newbie

2011-01-26 Thread Estrada Groups
You probably have to literally click on each URL to get the URL it's 
referencing. Those are URL shorteners  and probably won't play nicely with a 
crawler because of the redirection.

Adam

Sent from my iPhone

On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy  
wrote:

> Hi list,
> 
> I have given the set of urls as
> 
> http://is.gd/Jt32Cf
> http://is.gd/hS3lEJ
> http://is.gd/Jy1Im3
> http://is.gd/QoJ8xy
> http://is.gd/e4ct89
> http://is.gd/WAOVmd
> http://is.gd/lhkA69
> http://is.gd/3OilLD
> . 43 such urls
> 
> And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth 3
> 
> *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> *CrawlDb statistics start: crawl/crawldb*
> *Statistics for CrawlDb: crawl/crawldb*
> *TOTAL urls: 43*
> *retry 0: 43*
> *min score: 1.0*
> *avg score: 1.0*
> *max score: 1.0*
> *status 3 (db_gone): 1*
> *status 4 (db_redir_temp): 1*
> *status 5 (db_redir_perm): 41*
> *CrawlDb statistics: done*
> 
> When I am trying to read the content from the segments, the content block is
> empty for every record.
> 
> Can you please tell me where I can get the content of these urls.
> 
> Thanks and regards,*
> *Arjun Kumar Reddy


Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
I am developing an application based on twitter feeds...so 90% of the url's
will be short urls.
So, it is difficult for me to manually convert all these urls to actual
urls. Do we have any other solution for this?


Thanks and regards,
Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
estrada.adam.gro...@gmail.com> wrote:

> You probably have to literally click on each URL to get the URL it's
> referencing. Those are URL shorteners  and probably won't play nicely with a
> crawler because of the redirection.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> charjunkumar.re...@iiitb.net> wrote:
>
> > Hi list,
> >
> > I have given the set of urls as
> >
> > http://is.gd/Jt32Cf
> > http://is.gd/hS3lEJ
> > http://is.gd/Jy1Im3
> > http://is.gd/QoJ8xy
> > http://is.gd/e4ct89
> > http://is.gd/WAOVmd
> > http://is.gd/lhkA69
> > http://is.gd/3OilLD
> > . 43 such urls
> >
> > And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth
> 3
> >
> > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > *CrawlDb statistics start: crawl/crawldb*
> > *Statistics for CrawlDb: crawl/crawldb*
> > *TOTAL urls: 43*
> > *retry 0: 43*
> > *min score: 1.0*
> > *avg score: 1.0*
> > *max score: 1.0*
> > *status 3 (db_gone): 1*
> > *status 4 (db_redir_temp): 1*
> > *status 5 (db_redir_perm): 41*
> > *CrawlDb statistics: done*
> >
> > When I am trying to read the content from the segments, the content block
> is
> > empty for every record.
> >
> > Can you please tell me where I can get the content of these urls.
> >
> > Thanks and regards,*
> > *Arjun Kumar Reddy
>


Re: Few questions from a newbie

2011-01-26 Thread Churchill Nanje Mambe
Hello,
you have to use the short-URL APIs and get the long URLs... it's a bit
complex, as you have to determine whether the URL is short, then determine the
URL-shortening service used, e.g. tinyurl.com, bit.ly or goo.gl, and then you
use their respective API and send in the URL and they will return the long
URL... I used this before, but it was a simple PHP-based aggregator and not
Nutch.
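
(An alternative sketch: most shorteners answer with a plain HTTP redirect, so the long URL can often be recovered by simply following it, e.g. with curl; the is.gd URL is one from earlier in this thread:)

curl -sIL -o /dev/null -w '%{url_effective}\n' http://is.gd/Jt32Cf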


Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Yea Hi Mambe,

Thanks for the feedback. I have mentioned the details of my application in
the above post.
I have tried doing this crawling job using php-multi curl and I am getting
results which are good enough but the problem I am facing is that it is
taking hell lot of time to get the contents of the urls. I have done this
without using any API or conversions.

So, in order to crawl in lesser time limits and also helps me to scale my
application, I have chosen Nutch crawler.

Thanks and regards,*
*Ch. Arjun Kumar Reddy

On Wed, Jan 26, 2011 at 9:19 PM, Churchill Nanje Mambe <
mambena...@afrovisiongroup.com> wrote:

> hello
>  you have to use the short url APIs and get the long URLs... its abit
> complex as you have to determine the url if its short, then determine the
> url shortening service used eg: tinyurl.com bit.ly or goo.gl and then you
> use their respective api and send in the url and they will return the long
> url... I used this before but it was a simple php based aggregator and not
> nutch
>


Re: Few questions from a newbie

2011-01-26 Thread Churchill Nanje Mambe
even if the url being crawled is shortened, it will still lead nutch to the
actual link and nutch will fetch it

Churchill Nanje Mambe
237 77545907,
AfroVisioN Founder, President,CEO
www.camerborn.com/mambenanje
http://www.afrovisiongroup.com | http://mambenanje.blogspot.com
skypeID: mambenanje
www.twitter.com/mambenanje



On Wed, Jan 26, 2011 at 4:56 PM, Arjun Kumar Reddy <
charjunkumar.re...@iiitb.net> wrote:

> Yea Hi Mambe,
>
> Thanks for the feedback. I have mentioned the details of my application in
> the above post.
> I have tried doing this crawling job using php-multi curl and I am getting
> results which are good enough but the problem I am facing is that it is
> taking hell lot of time to get the contents of the urls. I have done this
> without using any API or conversions.
>
> So, in order to crawl in lesser time limits and also helps me to scale my
> application, I have chosen Nutch crawler.
>
> Thanks and regards,*
> *Ch. Arjun Kumar Reddy
>
> On Wed, Jan 26, 2011 at 9:19 PM, Churchill Nanje Mambe <
> mambena...@afrovisiongroup.com> wrote:
>
> > hello
> >  you have to use the short url APIs and get the long URLs... its abit
> > complex as you have to determine the url if its short, then determine the
> > url shortening service used eg: tinyurl.com bit.ly or goo.gl and then
> you
> > use their respective api and send in the url and they will return the
> long
> > url... I used this before but it was a simple php based aggregator and
> not
> > nutch
> >
>



Re: Few questions from a newbie

2011-01-26 Thread alxsss
you can put fetch external and internal links to false and increase depth.
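
(Presumably this refers to the link-following properties in conf/nutch-site.xml; a sketch, assuming Nutch 1.x property names, which the reply itself does not spell out:)

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
</property>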
 

 


 

 

-Original Message-
From: Churchill Nanje Mambe 
To: user 
Sent: Wed, Jan 26, 2011 8:03 am
Subject: Re: Few questions from a newbie


even if the url being crawled is shortened, it will still lead nutch to the

actual link and nutch will fetch it




 


Newbie trouble - Hbase class not found

2016-05-07 Thread diego gullo
I am trying Nutch for the first time. I created an automated Docker setup
to load
Nutch 2 + HBase (I had tried Cassandra but could not get it to work, so I
thought I would start with HBase to give it a try).

The project is available at https://github.com/bizmate/nutch
and with docker compose you can start the containers with a running
instance of Nutch  exposed on 8899 and Hbase.

In gora.properties I have already enabled HBase:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
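
For context, a minimal sketch of the Nutch-side setting that normally accompanies
this (see the Nutch2Tutorial linked further down; the property is shown as a
comment, and the rebuild step assumes the source distribution):

# conf/nutch-site.xml usually carries the matching store class, e.g.:
#   storage.data.store.class = org.apache.gora.hbase.store.HBaseStore
# and the local runtime has to be rebuilt after any Ivy/gora change:
ant runtime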


But I get an HBase class-not-found error when I run this command:

root@87b87f55835e:/opt/nutch# bin/nutch inject urls.txt
InjectorJob: starting at 2016-05-07 08:37:49
InjectorJob: Injecting urlDir: urls.txt
InjectorJob: java.lang.ClassNotFoundException: org.apache.gora.hbase.store.HBaseStore
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:89)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:73)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)

Suggestions?


RE: Newbie Nutch/Solr Question(s)

2016-07-18 Thread Markus Jelsma
Hi Jamal - don't use managed schema with Solr 6.0 and/or 6.1. Just copy over 
the schema Nutch provides and you are good to go.
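
For example, a minimal sketch of that for a core named "nutch" (core name and
paths are assumptions; adjust to your install):

cp $NUTCH_RUNTIME_HOME/conf/schema.xml $APACHE_SOLR_HOME/server/solr/nutch/conf/schema.xml
# in the same core's solrconfig.xml, switch from the managed schema to the classic one:
#   <schemaFactory class="ClassicIndexSchemaFactory"/>
$APACHE_SOLR_HOME/bin/solr restart
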
Markus

 
 
-Original message-
> From:Jamal, Sarfaraz 
> Sent: Friday 15th July 2016 15:47
> To: user@nutch.apache.org
> Subject: Newbie Nutch/Solr Question(s)
> 
> Hi Guy,
> 
> I have nutch 'working' relatively, and I am now ready to index it to solr.
> 
> I already have a solr environment up and running and now wish to index a few 
> websites.
> 
> I have read through the documentation and I believe I have to do something 
> like this:
> 
> Instead of this:
> "cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml 
> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/"
> 
> I really just need to take all field types and all field names from the 
> schema.xml file and add them to my existing ' managed-schema'
> 
> Correct?
> 
> Thanks!!
> 
> Sas
> 


Re: Re: Few questions from a newbie

2011-01-26 Thread Mike Zuehlke
Hi Arjun,

nutch handles redirect by itself - like the return codes 301 and 302.

Did you check how many redirects you have to follow until you get an
HTTP 200 (OK)?
I think four redirects are needed to get the given URL's content, so
you have to increase the depth for your crawling.
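
One quick way to check is to let curl follow the redirects and count them, using
one of the shortened links from the thread below as an example:

curl -sIL -o /dev/null -w 'redirects: %{num_redirects}\nfinal url: %{url_effective}\n' http://is.gd/Jt32Cf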

Regards
Mike




From:   Arjun Kumar Reddy 
To:     user@nutch.apache.org
Date:   26.01.2011 15:43
Subject:Re: Few questions from a newbie



I am developing an application based on Twitter feeds, so 90% of the URLs
will be short URLs.
It is difficult for me to manually convert all these URLs to actual
URLs. Do we have any other solution for this?


Thanks and regards,
Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
estrada.adam.gro...@gmail.com> wrote:

> You probably have to literally click on each URL to get the URL it's
> referencing. Those are URL shorteners  and probably won't play nicely 
with a
> crawler because of the redirection.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> charjunkumar.re...@iiitb.net> wrote:
>
> > Hi list,
> >
> > I have given the set of urls as
> >
> > http://is.gd/Jt32Cf
> > http://is.gd/hS3lEJ
> > http://is.gd/Jy1Im3
> > http://is.gd/QoJ8xy
> > http://is.gd/e4ct89
> > http://is.gd/WAOVmd
> > http://is.gd/lhkA69
> > http://is.gd/3OilLD
> > . 43 such urls
> >
> > And I have run the crawl command bin/nutch crawl urls/ -dir crawl 
-depth
> 3
> >
> > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > *CrawlDb statistics start: crawl/crawldb*
> > *Statistics for CrawlDb: crawl/crawldb*
> > *TOTAL urls: 43*
> > *retry 0: 43*
> > *min score: 1.0*
> > *avg score: 1.0*
> > *max score: 1.0*
> > *status 3 (db_gone): 1*
> > *status 4 (db_redir_temp): 1*
> > *status 5 (db_redir_perm): 41*
> > *CrawlDb statistics: done*
> >
> > When I am trying to read the content from the segments, the content 
block
> is
> > empty for every record.
> >
> > Can you please tell me where I can get the content of these urls.
> >
> > Thanks and regards,*
> > *Arjun Kumar Reddy
>







Re: Re: Few questions from a newbie

2011-01-26 Thread Arjun Kumar Reddy
Hi Mike,

Actually, in my application I am working on Twitter feeds, where I am
filtering the tweets that contain links and storing the contents of
those links. I am maintaining all such links in the urls file and giving it as
input to the Nutch crawler. Here, I am not bothered about the inlinks or
outlinks of any particular link.

So, at first I gave the depth as 1 and later increased it to 3. If I
increase the depth, can I prevent the unwanted crawls? Or is there any other
solution for this?

I have also changed the number-of-redirects configuration parameter to 4 in the
nutch-default.xml file.
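
As a side note, such overrides normally go into conf/nutch-site.xml rather than
nutch-default.xml; a minimal sketch (the property name is the usual one for the
redirect limit and should be checked against your nutch-default.xml):

# conf/nutch-site.xml override (shown as name = value):
#   http.redirect.max = 4    # follow up to four redirects when fetching a page
# then re-run the crawl:
bin/nutch crawl urls/ -dir crawl -depth 3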

Thanks and regards,
Ch. Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 8:28 PM, Mike Zuehlke wrote:

> Hi Arjun,
>
> nutch handles redirect by itself - like the return codes 301 and 302.
>
> Did you check how much redirects you have to follow until you get
> HTTP_ACCESS (200).
> I think there are four redirects needed to get the given url content. So
> you have to increase the depth for your crawling.
>
> Regards
> Mike
>
>
>
>
> From:   Arjun Kumar Reddy 
> To:     user@nutch.apache.org
> Date:   26.01.2011 15:43
> Subject:Re: Few questions from a newbie
>
>
>
> I am developing an application based on twitter feeds...so 90% of the
> url's
> will be short urls.
> So, it is difficult for me to manually convert all these urls to actual
> urls. Do we have any other solution for this?
>
>
> Thanks and regards,
> Arjun Kumar Reddy
>
>
> On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
> estrada.adam.gro...@gmail.com> wrote:
>
> > You probably have to literally click on each URL to get the URL it's
> > referencing. Those are URL shorteners  and probably won't play nicely
> with a
> > crawler because of the redirection.
> >
> > Adam
> >
> > Sent from my iPhone
> >
> > On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> > charjunkumar.re...@iiitb.net> wrote:
> >
> > > Hi list,
> > >
> > > I have given the set of urls as
> > >
> > > http://is.gd/Jt32Cf
> > > http://is.gd/hS3lEJ
> > > http://is.gd/Jy1Im3
> > > http://is.gd/QoJ8xy
> > > http://is.gd/e4ct89
> > > http://is.gd/WAOVmd
> > > http://is.gd/lhkA69
> > > http://is.gd/3OilLD
> > > . 43 such urls
> > >
> > > And I have run the crawl command bin/nutch crawl urls/ -dir crawl
> -depth
> > 3
> > >
> > > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > > *CrawlDb statistics start: crawl/crawldb*
> > > *Statistics for CrawlDb: crawl/crawldb*
> > > *TOTAL urls: 43*
> > > *retry 0: 43*
> > > *min score: 1.0*
> > > *avg score: 1.0*
> > > *max score: 1.0*
> > > *status 3 (db_gone): 1*
> > > *status 4 (db_redir_temp): 1*
> > > *status 5 (db_redir_perm): 41*
> > > *CrawlDb statistics: done*
> > >
> > > When I am trying to read the content from the segments, the content
> block
> > is
> > > empty for every record.
> > >
> > > Can you please tell me where I can get the content of these urls.
> > >
> > > Thanks and regards,*
> > > *Arjun Kumar Reddy
> >
>
>
>
>
>
>


Another question from a meta tag newbie

2011-01-31 Thread Joshua J Pavel


I've been crawling the user groups, and I feel like Nutch can do this by
default, but I just can't seem to crack it.

I want to grab meta tags from indexed pages and insert them in the
database. Specifically, I'll have some meta tags that identify the type of
content on the page, so that I can group results as either video, photo,
news, etc.

I looked into 655 and 855, but I believe those are for adding metadata, not
utilizing metadata already in the page.

What I expect is that when I do a dump, I'd have the fields visible in the
Metadata:

http://test.site.com/index.html   Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Mar 02 20:22:33 UTC 2011
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.013783041
Signature: 26df10bef4cf4cebe3f1041ba121068d
Metadata: _pst_: success(1), lastModified=0, MYFIELD=MYVALUE

I think Nutch-779 may be what I need, and as I'm running version 1.2, I
should have this capability.  I'm filling in db.parsemeta.to.crawldb, but
is there something else I need to do?  Or is it populating it, and dumping
the database doesn't show me those values?
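
For what it's worth, a minimal sketch of how this is usually wired up and checked,
using the MYFIELD placeholder from the dump above (the parser must first put that
key into the parse metadata, e.g. via a parse filter plugin):

# conf/nutch-site.xml (shown as name = value):
#   db.parsemeta.to.crawldb = myfield    # comma-separated parse-metadata keys to copy
# then dump the CrawlDb and look for the key in the Metadata lines:
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -i myfield crawldb-dump/part-00000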

Re: Newbie trouble - Hbase class not found

2016-05-09 Thread Lewis John Mcgibbney
Hi Diego,

On Mon, May 9, 2016 at 2:32 AM,  wrote:

>
> From: diego gullo 
> To: user@nutch.apache.org
> Cc:
> Date: Sat, 7 May 2016 09:41:00 +0100
> Subject: Newbie trouble - Hbase class not found
> I am trying Nutch for the first time. I created an automated docker setup
> to load
> Nutch 2 + Hbase (i had tried cassandra but could not get it to work so i
> thought i start with Hbase to give it a try)
>

I would suggest you use the official Nutch containers, which can be found
at
https://github.com/apache/nutch/tree/2.x/docker/hbase


>
> The project is available at https://github.com/bizmate/nutch
>

Nice, thanks for posting


> and with docker compose you can start the containers with a running
> instance of Nutch  exposed on 8899 and Hbase.
>

Cool.


>
> in gora.properties i already enabled hbase
>
> gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
>
>
> But i get Hbase class not found error when I run this command.
>
> root@87b87f55835e:/opt/nutch# bin/nutch inject urls.txt
>
> InjectorJob: starting at 2016-05-07 08:37:49
>
> InjectorJob: Injecting urlDir: urls.txt
>
> *InjectorJob: java.lang.ClassNotFoundException:
> org.apache.gora.hbase.store.HBaseStore*
>
>
[snip]


>
> Suggestions?
>
>
Yes, you've not enabled the gora-hbase dependency download from within
ivy/ivy.xml
https://github.com/apache/nutch/blob/2.x/ivy/ivy.xml#L114-L117
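
Concretely, that means removing the comment markers around the gora-hbase
dependency element in ivy/ivy.xml and rebuilding, roughly (a sketch; the exact
dependency line is in the file linked above):

# after uncommenting the gora-hbase dependency in ivy/ivy.xml,
# rebuild so Ivy pulls the HBase store jars into runtime/local/lib:
ant clean runtime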

Please refer to the tutorial for further advice
http://wiki.apache.org/nutch/Nutch2Tutorial
Thanks

-- 
*Lewis*


Re: Newbie trouble - Hbase class not found

2016-05-09 Thread diego gullo
Hi Lewis

thanks a lot for the reply. Regarding the Ivy config, I don't think this is
the problem.

I have the HBase config enabled here:

https://github.com/bizmate/nutch/blob/master/docker/nutch/ivy.xml#L115

This file is mounted in the image through docker compose.

https://github.com/bizmate/nutch/blob/master/docker-compose.yml#L14

The difference from the file you pointed to is the revision. I am unsure if it
makes a difference.

However, you suggested using some specific Dockerfiles. I am not 100% sure
whether these have already been used to publish images on Docker Hub. In fact, I
would not mind setting this up, although if the images are already pushed to
Docker Hub I would rather not spend time on something that is already
there.

These are all the nutch images i see already available on the hub.

https://hub.docker.com/search/?isAutomated=0&isOfficial=0&page=1&pullCount=0&q=nutch&starCount=0

If the official image is not there, I would be happy to contribute, ask
Docker for the 'nutch' namespace, and set up the automated build.

If interested, please let me know. I can get on IRC if preferred.



On 9 May 2016 at 16:12, Lewis John Mcgibbney 
wrote:

> Hi Diego,
>
> On Mon, May 9, 2016 at 2:32 AM,  wrote:
>
> >
> > From: diego gullo 
> > To: user@nutch.apache.org
> > Cc:
> > Date: Sat, 7 May 2016 09:41:00 +0100
> > Subject: Newbie trouble - Hbase class not found
> > I am trying Nutch for the first time. I created an automated docker setup
> > to load
> > Nutch 2 + Hbase (i had tried cassandra but could not get it to work so i
> > thought i start with Hbase to give it a try)
> >
>
> I would suggest you to use the official Nutch containers which can be found
> at
> https://github.com/apache/nutch/tree/2.x/docker/hbase
>
>
> >
> > The project is available at https://github.com/bizmate/nutch
> >
>
> Nice, thanks for posting
>
>
> > and with docker compose you can start the containers with a running
> > instance of Nutch  exposed on 8899 and Hbase.
> >
>
> Cool.
>
>
> >
> > in gora.properties i already enabled hbase
> >
> > gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
> >
> >
> > But i get Hbase class not found error when I run this command.
> >
> > root@87b87f55835e:/opt/nutch# bin/nutch inject urls.txt
> >
> > InjectorJob: starting at 2016-05-07 08:37:49
> >
> > InjectorJob: Injecting urlDir: urls.txt
> >
> > *InjectorJob: java.lang.ClassNotFoundException:
> > org.apache.gora.hbase.store.HBaseStore*
> >
> >
> [snip]
>
>
> >
> > Suggestions?
> >
> >
> Yes, you've not enabled the gora-hbase dependency download from within
> ivy/ivy.xml
> https://github.com/apache/nutch/blob/2.x/ivy/ivy.xml#L114-L117
>
> Please refer to the tutorial for further advice
> http://wiki.apache.org/nutch/Nutch2Tutorial
> Thanks
>
> --
> *Lewis*
>



-- 
www.bizmate.biz


Re: Newbie trouble - Hbase class not found

2016-05-15 Thread diego gullo
Hi Lewis

I have changed the build for the Docker containers and over the weekend sent
the PR for the logs folder. The original problem I had still persists.

To reproduce


   1. Check out https://github.com/bizmate/nutch
   2. Run docker-compose up -d - this will pull the Docker image based on
   the official Dockerfile and mount it with the configurations suggested in
   the documentation available on the Nutch site, i.e. the Ivy, gora and
   nutch-site configs, all available at
   https://github.com/bizmate/nutch/tree/master/docker/. This includes the
   suggestion from your previous email:
   https://github.com/bizmate/nutch/blob/master/docker/nutch/ivy.xml#L117
   3. Access the container: docker exec -it nutch bash
   4. su hduser
   5. Run inject; it still says class not found


hduser@458c70ec85a2:/opt/nutch$ bin/nutch inject urls/seed.txt
InjectorJob: starting at 2016-05-15 18:40:31
InjectorJob: Injecting urlDir: urls/seed.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:114)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:267)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:290)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:299)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.hbase.HBaseConfiguration
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 10 more

This is also despite setting HBASE_HOME and HADOOP_CLASSPATH as suggested
here:
http://stackoverflow.com/questions/26364057/exception-in-thread-main-java-lang-noclassdeffounderror-org-apache-hadoop-hba
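
A workaround that is sometimes suggested for this particular NoClassDefFoundError
is to make the HBase client jars visible to the nutch script directly; a minimal
sketch, assuming HBase is installed under /opt/hbase inside the container:

# either rebuild so Ivy places the HBase jars into runtime/local/lib ...
ant clean runtime
# ... or, as a stop-gap, copy the HBase client jars next to Nutch's own jars:
cp /opt/hbase/lib/hbase-*.jar /opt/nutch/lib/
bin/nutch inject urls/seed.txt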



On 10 May 2016 at 07:28, diego gullo  wrote:

> Hi Lewis
>
> thanks a lot for the reply. Regarding the ivy config I dont think this is
> the problem.
>
> I have the hbase config enable here
>
> https://github.com/bizmate/nutch/blob/master/docker/nutch/ivy.xml#L115
>
> This file is mounted in the image through docker compose.
>
> https://github.com/bizmate/nutch/blob/master/docker-compose.yml#L14
>
> The difference with the file you pointed to is the revision. Unsure if it
> makes a difference.
>
> However, you suggested to use some specific Dockerfiles. I am not 100%
> sure if these have already been used to publish images on Docker hub.
> Infact I would not mind to set this up although if the images are already
> pushed to docker hub I would rather not spend time on something that is
> already there.
>
> These are all the nutch images i see already available on the hub.
>
>
> https://hub.docker.com/search/?isAutomated=0&isOfficial=0&page=1&pullCount=0&q=nutch&starCount=0
>
> If the official image is not there I would be happy to contribute and ask
> docker for the 'nutch' name space and set the automated build.
>
> If interested pls let me know. I can IRC if preferred.
>
>
>
> On 9 May 2016 at 16:12, Lewis John Mcgibbney 
> wrote:
>
>> Hi Diego,
>>
>> On Mon, May 9, 2016 at 2:32 AM, 
>> wrote:
>>
>> >
>> > From: diego gullo 
>> > To: user@nutch.apache.org
>> > Cc:
>> > Date: Sat, 7 May 2016 09:41:00 +0100
>> > Subject: Newbie trouble - Hbase class not found
>> > I am trying Nutch for the first time. I created an automated docker
>> setup
>> > to load
>> > Nutch 2 + Hbase (i had tried cassandra but could not get it to work so i
>> > thought i start with Hbase to give it a try)
>> >
>>
>> I would suggest you to use the official Nutch containers which can be
>> found
>> at
>> https://github.com/apache/nutch/tree/2.x/docker/hbase
>>
>>
>> >
>> > The project is available at https://github.com/bizmate/nutch
>> >
>>
>> Nice, thanks for posting
>>
>>
>> > and with docker compose you can start the containers with a running
>> > instance of Nutch  exposed on 8899 

Re: Newbie trouble - Hbase class not found

2016-05-16 Thread Lewis John Mcgibbney
Hi Diego,

The PR at https://github.com/apache/nutch/pull/111 will solve your issue.
Thanks

On Mon, May 16, 2016 at 11:40 AM,  wrote:

>
> From: diego gullo 
> To: user@nutch.apache.org
> Cc:
> Date: Sun, 15 May 2016 20:04:05 +0100
> Subject: Re: Newbie trouble - Hbase class not found
> Hi Lewis
>
> I have changed the build for the docker containers and in the weekend sent
> the PR for the logs folder. The original problem I had is still persistent.
>
> To reproduce
>
>
>1. Check out https://github.com/bizmate/nutch
>2. run *docker-compose up -d* - this will pull the docker image based on
>the Official docker file and mount it with the configurations suggested
> in
>the documentation available on the nutch site. i.e. Ivy, gora and
>nutch-site configs all available at
>https://github.com/bizmate/nutch/tree/master/docker/. This includes the
>suggestion from your previous email.
>https://github.com/bizmate/nutch/blob/master/docker/nutch/ivy.xml#L117
>3. access the container docker exec -it nutch bash
>4. su hdbase
>5. Run inject, still says class not found
>
>
> hduser@458c70ec85a2:/opt/nutch$ bin/nutch inject urls/seed.txt
> InjectorJob: starting at 2016-05-15 18:40:31
> InjectorJob: Injecting urlDir: urls/seed.txt
> Exception in thread "main" *java.lang.NoClassDefFoundError:
> org/apache/hadoop/hbase/HBaseConfiguration*
> at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:114)
> at
>
> org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
> at
>
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
> at
>
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
> at
> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
> at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
> at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:267)
> at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:299)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.hbase.HBaseConfiguration
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 10 more
>
> this is also despite setting HBASE_HOME and HADOOP_CLASSPATH  as suggested
> here -
>
> http://stackoverflow.com/questions/26364057/exception-in-thread-main-java-lang-noclassdeffounderror-org-apache-hadoop-hba
>
>


RE: [E] Re: Newbie Question, hadoop error?

2016-06-16 Thread Jamal, Sarfaraz
Hi Lewis! Thanks for the response.

I have a question:
"The presence of NoSuchMethodError would indicate that the $NUTCH_HOME/lib 
directory is not on the JVM classpath. Please make sure that it is."

So far I have only set one environment variable, which is JAVA_HOME.

What is the JVM Classpath? Is it an environment variable?

Thanks,

Sas


-Original Message-
From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com] 
Sent: Wednesday, June 15, 2016 11:46 PM
To: user@nutch.apache.org
Subject: [E] Re: Newbie Question, hadoop error?

Hi Sas,
See response inline :)

On Wed, Jun 15, 2016 at 5:36 AM,  wrote:

> From: "Jamal, Sarfaraz" 
> To: "'user@nutch.apache.org'" 
> Cc:
> Date: Mon, 13 Jun 2016 17:36:44 -0400
> Subject: Newbie Question, hadoop error?
> Hi Guys,
>
> I am attempting to run nutch using cygwin,


Is this Nutch 1.11 binary distribution you mean?


> and I am having the following problem:
> Ps. I added Hadoop-core to the lib folder already -
>
> I appreciate any insight or comment you guys may have -
>
> $ bin/crawl -i urls/ TestCrawl/  2
> Injecting seed URLs
> /cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb 
> urls/ Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.commons.cli.OptionBuilder.withArgPattern(Ljava/lang/String;I)Lorg/apache/commons/cli/OptionBuilder;
> at
> org.apache.hadoop.util.GenericOptionsParser.buildGeneralOptions(GenericOptionsParser.java:207)
> at
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:370)
> at
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:153)
> at
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:138)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
>at org.apache.nutch.crawl.Injector.main(Injector.java:369)
> Error running:
>   /cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb 
> urls/ Failed with exit value 1.


There are a few issues above.
1) You should change the data structures parent directory from 'TestCrawl/'
to 'TestCrawl' e.g. remove the trailing forward slash. This will prevent you 
from generating the CrawlDB in 'TestCrawl//crawldb' and will generate it in 
'TestCrawl/crawldb' instead.
2) The presence of NoSuchMethodError would indicate that the $NUTCH_HOME/lib 
directory is not on the JVM classpath. Please make sure that it is.

Lewis


RE: [E] Re: Newbie Question, hadoop error?

2016-06-16 Thread Jamal, Sarfaraz
Hi Lewis,

Here is an update
(I spoke to one of our java guys) -

$ set classpath = C:\\apache-nutch-1.11\\lib
$ $classpath
/cygdrive/c/apache-nutch-1.11/lib

$ ../bin/crawl -i urls/ TestCrawl  2
Injecting seed URLs
/cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl crawldb urls/
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.cli.OptionBuilder.withArgPattern(Ljava/lang/String;I)Lorg/apache/commons/cli/OptionBuilder;
at org.apache.hadoop.util.GenericOptionsParser.buildGeneralOptions(GenericOptionsParser.java:207)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:370)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Error running:
  /cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl/crawldb urls/
Failed with exit value 1.

1. I set it using Cygwin notation, and a regular Windows path
2. I set it in DOS as well, by appending to what was already there

And in each instance I received the same error

Any thoughts? Or do you notice anything I might have missed?
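
For what it's worth, a shell variable only reaches a child JVM if it is exported
under the name CLASSPATH, and with the Windows JVM usually invoked from Cygwin it
needs Windows-style paths and ';' separators; a minimal sketch, with the caveat
that the nutch launcher script may build its own classpath anyway:

export CLASSPATH='C:\apache-nutch-1.11\lib\*'    # lib\* is the Java jar wildcard
../bin/crawl -i urls/ TestCrawl 2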

Thanks,

sas


-Original Message-
From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com] 
Sent: Wednesday, June 15, 2016 11:46 PM
To: user@nutch.apache.org
Subject: [E] Re: Newbie Question, hadoop error?

Hi Sas,
See response inline :)

On Wed, Jun 15, 2016 at 5:36 AM,  wrote:

> From: "Jamal, Sarfaraz" 
> To: "'user@nutch.apache.org'" 
> Cc:
> Date: Mon, 13 Jun 2016 17:36:44 -0400
> Subject: Newbie Question, hadoop error?
> Hi Guys,
>
> I am attempting to run nutch using cygwin,


Is this Nutch 1.11 binary distribution you mean?


> and I am having the following problem:
> Ps. I added Hadoop-core to the lib folder already -
>
> I appreciate any insight or comment you guys may have -
>
> $ bin/crawl -i urls/ TestCrawl/  2
> Injecting seed URLs
> /cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb 
> urls/ Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.commons.cli.OptionBuilder.withArgPattern(Ljava/lang/String;I)Lorg/apache/commons/cli/OptionBuilder;
> at
> org.apache.hadoop.util.GenericOptionsParser.buildGeneralOptions(GenericOptionsParser.java:207)
> at
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:370)
> at
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:153)
> at
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:138)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
>at org.apache.nutch.crawl.Injector.main(Injector.java:369)
> Error running:
>   /cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb 
> urls/ Failed with exit value 1.


There are a few issues above.
1) You should change the data structures parent directory from 'TestCrawl/'
to 'TestCrawl' e.g. remove the trailing forward slash. This will prevent you 
from generating the CrawlDB in 'TestCrawl//crawldb' and will generate it in 
'TestCrawl/crawldb' instead.
2) The presence of NoSuchMethodError would indicate that the $NUTCH_HOME/lib 
directory is not on the JVM classpath. Please make sure that it is.

Lewis


Newbie question about non-trunk plug-in locations

2011-11-29 Thread John Dhabolt
Hi,

So I'm looking to add standard keyword and description metadata to my index. 
I'm referencing NUTCH-809 (https://issues.apache.org/jira/browse/NUTCH-809) and 
it includes a patch file that appears to be for a file in the source at the 
following location:

src/plugin/index-metatags/src/java/at/scintillation/nutch/MetaTagsIndexer.java


It seems many have referenced the file, but I've looked high and low, 
downloading the source tree, issuing Google searches and swearing to myself to 
no avail. The file is not to be found.

Am I missing something obvious here? 

Thanks in advance!

Jon

Re: Newbie question about non-trunk plug-in locations

2011-11-29 Thread Faruk Berksöz
The issue is still open. As a result, the patch file has not been applied
to any version.

Faruk

2011/11/29 John Dhabolt 

> Hi,
>
> So I'm looking to add standard keyword and description metadata to my
> index. I'm referencing NUTCH-809 (
> https://issues.apache.org/jira/browse/NUTCH-809) and it includes a patch
> file that appears to be for a file in the source at the following location:
>
>
> src/plugin/index-metatags/src/java/at/scintillation/nutch/MetaTagsIndexer.java
>
>
> It seems many have referenced the file, but I've looked high and low,
> downloading the source tree, issuing Google searches and swearing to myself
> to no avail. The file is not to be found.
>
> Am I missing something obvious here?
>
> Thanks in advance!
>
> Jon


Fw: Newbie question about non-trunk plug-in locations

2011-11-29 Thread John Dhabolt


Whoops, forgot to reply all and left the mailing list out of my response.

- Forwarded Message -
From: John Dhabolt 
To: Faruk Berksöz  
Sent: Tuesday, November 29, 2011 4:59 PM
Subject: Re: Newbie question about non-trunk plug-in locations
 

Hi Frank,

Thank you for the reply. Is the original file(s) available somewhere that I can 
download and apply the patch to? Since there was a discussion about something 
that appears to be broken in the current version without the patch, I was just 
wondering where the code resides that started the discussion. It seems that 
some have access to the code since the patch was made against it.

I'm trying to find a plug-in that's already created so I don't have to roll my 
own. It seems many have solved this problem time and again...I'm just trying to 
avoid adding to the "again and yet again."

Thanks for the clarification!

John



 From: Faruk Berksöz 
To: user@nutch.apache.org; John Dhabolt  
Sent: Tuesday, November 29, 2011 4:39 PM
Subject: Re: Newbie question about non-trunk plug-in locations
 

The issue is still open. As a result, the patch file has not been applied to 
any version.

Faruk


2011/11/29 John Dhabolt 

Hi,
>
>So I'm looking to add standard keyword and description metadata to my index. 
>I'm referencing NUTCH-809 (https://issues.apache.org/jira/browse/NUTCH-809) 
>and it includes a patch file that appears to be for a file in the source at 
>the following location:
>
>src/plugin/index-metatags/src/java/at/scintillation/nutch/MetaTagsIndexer.java
>
>
>It seems many have referenced the file, but I've looked high and low, 
>downloading the source tree, issuing Google searches and swearing to myself to 
>no avail. The file is not to be found.
>
>Am I missing something obvious here? 
>
>Thanks in advance!
>
>Jon

Re: Fw: Newbie question about non-trunk plug-in locations

2011-11-30 Thread Elisabeth Adler

Hi John,
As mentioned, the plugin is not yet included in the code. Therefore, you 
have to basically build your own Nutch version including the plugin. You 
have to download the patch file and apply it to your local copy of the 
Nutch code base (e.g. in Eclipse via Team>Apply Patch). The code itself 
is contained within the patch file.
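
Outside Eclipse, a minimal command-line sketch of the same steps (the patch file
name and the -p level are assumptions; check the attachment on NUTCH-809):

cd apache-nutch              # your checkout of the Nutch source tree
patch -p0 < NUTCH-809.patch  # apply the patch downloaded from the JIRA issue
ant runtime                  # rebuild so the new plugin is compiled and deployed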

Best,
Elisabeth

On 29/11/2011 23:06, John Dhabolt wrote:


Whoops, forgot to reply all and left the mailing list out of my response.

- Forwarded Message -
From: John Dhabolt
To: Faruk Berksöz
Sent: Tuesday, November 29, 2011 4:59 PM
Subject: Re: Newbie question about non-trunk plug-in locations


Hi Frank,

Thank you for the reply. Is the original file(s) available somewhere that I can 
download and apply the patch to? Since there was a discussion about something 
that appears to be broken in the current version without the patch, I was just 
wondering where the code resides that started the discussion. It seems that 
some have access to the code since the patch was made against it.

I'm trying to find a plug-in that's already created so I don't have to roll my own. It 
seems many have solved this problem time and again...I'm just trying to avoid adding to 
the "again and yet again."

Thanks for the clarification!

John



  From: Faruk Berksöz
To: user@nutch.apache.org; John Dhabolt
Sent: Tuesday, November 29, 2011 4:39 PM
Subject: Re: Newbie question about non-trunk plug-in locations


The issue is still open. As a result, the patch file has not been applied to 
any version.

Faruk


2011/11/29 John Dhabolt

Hi,

So I'm looking to add standard keyword and description metadata to my index. 
I'm referencing NUTCH-809 (https://issues.apache.org/jira/browse/NUTCH-809) and 
it includes a patch file that appears to be for a file in the source at the 
following location:

src/plugin/index-metatags/src/java/at/scintillation/nutch/MetaTagsIndexer.java


It seems many have referenced the file, but I've looked high and low, 
downloading the source tree, issuing Google searches and swearing to myself to 
no avail. The file is not to be found.

Am I missing something obvious here? 


Thanks in advance!

Jon