Re: Character encoding on Html-Pages
Hi Alex, I cannot locate the Java file you mention at org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3... Having a quick look at org.apache.nutch.parse.HTMLMetaTags (identical in both versions above), it appears that you are right: double quotes for meta http-equiv are accepted whereas 'single quotes' are not. I would be interested to see what kind of output you get when Nutch 1.2 encounters the single-quote meta syntax you highlight. Can you elaborate please... If your regex suggestion is working then I would stick with it, however this is maybe something you wish to raise in JIRA... any comments? Lewis On Tue, Jun 7, 2011 at 4:05 PM, Alex F alexander.fahlke.mailingli...@googlemail.com wrote: Hi, the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not suitable for sites using single quotes for meta http-equiv. Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'> We experienced a couple of pages with that kind of quotes and Nutch 1.2 was not able to handle them. Is there any fallback, or would it be good to use the following regex: <meta\\s+([^>]*http-equiv=("|')?content-type("|')?[^>]*)> (single or double quotes are accepted)? BR Alexander Fahlke Software Development www.informera.de -- *Lewis*
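For reference, a quick self-contained check of the two patterns. The first is the Nutch 1.2 metaPattern as I remember it; the second is Alexander's proposed widening. A sketch only, not committed code:

  import java.util.regex.Pattern;

  public class MetaQuoteCheck {
    public static void main(String[] args) {
      String html = "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
      // Nutch 1.2's metaPattern, optional double quotes only (from memory):
      Pattern current = Pattern.compile(
          "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
          Pattern.CASE_INSENSITIVE);
      // The proposed pattern, accepting single or double quotes:
      Pattern proposed = Pattern.compile(
          "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
          Pattern.CASE_INSENSITIVE);
      System.out.println("current matches:  " + current.matcher(html).find());  // false
      System.out.println("proposed matches: " + proposed.matcher(html).find()); // true
    }
  }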
Re: keeping index up to date
Hi, To add to Markus' comments, if you take a look at the script it is written in such a way that if run in safe mode it protects us against an error which may occur. If this is the case we can recover segments etc. and take appropriate actions to resolve. On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I took a look at the recrawl script and noticed that all the steps except URL injection are repeated on each subsequent indexing run, and wondered why we would generate new segments. Is it possible to do fetch, update for all previous $s1..$sn, invertlink and index steps? No, the generator generates a segment with a list of URLs for the fetcher to fetch. You can, if you like, then merge segments. Thanks. Alex. -Original Message- From: Julien Nioche lists.digitalpeb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Jun 1, 2011 12:59 am Subject: Re: keeping index up to date You should use the adaptive fetch schedule. See http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for details On 1 June 2011 07:18, alx...@aim.com wrote: Hello, I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf files which do not change over time. I wondered if there is a way of configuring nutch not to fetch unchanged documents again and again, but keep the old index for them. Thanks. Alex. -- *Lewis*
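As a sketch of what the adaptive schedule looks like in nutch-site.xml (the interval value below is illustrative, not a recommendation):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- 30 days, in seconds -->
  </property>

With this schedule, pages that come back unmodified have their fetch interval stretched over time, so those 1500 static PDFs would get fetched less and less often.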
Re: nutch NoClassDefFound
Hi, I suggest that before you try to progress any further with this you read as much of the wiki [1] as you can, in particular I would start here [2] [3]. After this, try looking through some of the source and understanding what parameters are required to run various commands. The reason for this is that from time to time it is guaranteed that we will come across log output that indicates various errors in Nutch configuration or something else... it helps considerably if you have a sound working knowledge of the processes behind Nutch's internal operation. [1] http://wiki.apache.org/nutch/ [2] http://wiki.apache.org/nutch/NutchTutorial [3] http://wiki.apache.org/nutch/FAQ On Tue, Jun 7, 2011 at 9:42 PM, abhayd ajdabhol...@hotmail.com wrote: Hi, I'm very new to Nutch and am trying to set it up on Windows using Cygwin. I downloaded http://apache.mirrors.airband.net/nutch/apache-nutch-1.2-bin.zip, so I think I don't need to build. When I try the following command I get an error. I saw a similar question posted in the forum but it was related to running Nutch from source. Any idea what could be wrong? $ echo $JAVA_HOME C:\Program Files\Java\jdk1.6.0_12\ jj@D1QJ50C1 ~/nutch-1.2/bin $ ./nutch crawl java.lang.NoClassDefFoundError: and Caused by: java.lang.ClassNotFoundException: and at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) Could not find the main class: and. Program will exit. Exception in thread main -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-NoClassDefFound-tp3036674p3036674.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
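One common culprit on Windows/Cygwin, though I can't confirm it from the output alone: whitespace in JAVA_HOME or in the path Nutch was unpacked to (C:\Program Files..., C:\Documents and Settings...) gets split by the shell when the launcher script builds the java command line, and a stray token such as "and" ends up being treated as the main class. A quick check, assuming a Cygwin shell and the 8.3 short name for Program Files:

  $ export JAVA_HOME="/cygdrive/c/Progra~1/Java/jdk1.6.0_12"
  $ echo "$JAVA_HOME"
  /cygdrive/c/Progra~1/Java/jdk1.6.0_12

Moving the nutch-1.2 directory itself to a path without spaces is worth trying for the same reason.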
Updates to Nutch Wiki
Hi everyone, Was wondering if anyone (familiar with the topics) would be interested in sending me material for the following pages [1] [2]. The links appear to be non-existent in our wiki and it would be nice to get some material on these topics if they are important and required! Although 1.3 has just been released, material on previous releases is very much welcomed. [1] http://wiki.apache.org/nutch/SearchOverMultipleIndexes [2] http://wiki.apache.org/nutch/NutchWithChineseAnalyzer In addition we've now re-arranged the wiki somewhat. Hopefully this structure will make locating specifics an easier task. Thanks -- *Lewis*
Re: searcher.dir not working
Hi abhayd, In short... yes. Although you have correctly specified an absolute path, you need to drop the /crawldb/current/part-0. A good resource for this stuff can usually be found on the mailing lists. On Wed, Jun 8, 2011 at 8:03 AM, abhayd ajdabhol...@hotmail.com wrote: Hi, I am using Nutch 1.2. I created an index using the command bin/nutch crawl urls -dir crawl. Under the crawl directory I see crawldb/current/part-0/index. I added this to nutch-site.xml under the Tomcat installation as the searcher.dir property value /home/user1/nutch-1.2/crawl/crawldb/current/part-0 I am getting the message index is not a directory. Am I doing something wrong? Any help? -- View this message in context: http://lucene.472066.n3.nabble.com/searcher-dir-not-working-tp3038087p3038087.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
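In other words, a sketch of the property as I'd expect it in the Tomcat nutch-site.xml (path taken from the message above):

  <property>
    <name>searcher.dir</name>
    <value>/home/user1/nutch-1.2/crawl</value>
  </property>

searcher.dir should point at the top-level crawl directory containing crawldb, segments and the index, not at an individual part file.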
Re: bin folder missing in 1.3 release
We are a bit thin on supporting documentation for the new release at the moment but are actively working towards producing this. Hopefully once we have something contributed to the wiki the differences in configuration and functionality within release 1.3 will be fully explained. On Thu, Jun 9, 2011 at 11:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Here it is: http://svn.apache.org/viewvc/nutch/tags/release-1.3/src/bin/ It's being copied over when building with ant. As said in the subject, I can't find the bin folder in the 1.3 release. Is that intentional? Thanks -- *Lewis*
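For anyone hitting the same thing with the 1.3 source distribution, the build should materialise the scripts. A sketch (from memory, the runtime target creates runtime/local and runtime/deploy):

  $ ant runtime
  $ ls runtime/local/bin
  nutch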
Re: No Urls to fetch
Hi Adelaida, Assuming that you have been able to successfully crawl the top level domain http://elcorreo.com, e.g. that you have been able to crawl and create an index, at least we know that your configuration options are OK. I assume that you are using 1.2... can you confirm? What does the rest of your crawl-urlfilter.txt look like? Have you been setting any properties in nutch-site.xml which might alter Nutch behaviour? I am not perfect with the syntax for creating filter rules in crawl-urlfilter... can someone confirm that this is correct? (See the sketch after this message.) On Mon, Jun 13, 2011 at 12:10 PM, Adelaida Lejarazu alejar...@gmail.com wrote: Hello, I´m new to Nutch and I´m doing some tests to see how it works. I want to do some crawling of a digital newspaper webpage. To do so, I put in the urls directory, where I have my seed list, the URL I want to crawl, that is: * http://elcorreo.com* The thing is that I don´t want to crawl all the news on the site but only those of the current day, so I put a filter in the *crawl-urlfilter.txt* (for the moment I´m using the *crawl* command). The filter I put is: +^http://www.elcorreo.com/.*?/20110613/.*?.html A correct URL would be for example, http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html so I think the regular expression is correct, but Nutch doesn´t crawl anything. It says that there are *No Urls to Fetch - check your seed list and URL filters.* Am I missing something? Thanks, -- *Lewis*
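One plausible cause, though an assumption since we haven't seen the whole file: the seed http://elcorreo.com itself does not match that rule, so if the file ends with the usual "-." catch-all, the injector drops the seed and there is nothing to fetch. A sketch of a filter that also admits the front page so the crawl can start (it assumes the day's articles are linked from the front page within the crawl depth):

  # the day's article pages
  +^http://www\.elcorreo\.com/.*?/20110613/.*?\.html
  # the front page itself, so the seed survives injection
  +^http://(www\.)?elcorreo\.com/$
  # skip everything else
  -.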
Re: Injecting urls through code instead of file
Hi, Can you provide a use case? The reason I ask is that I can only assume that you would be hacking some code to inject your urls from some other URL store? On Tue, Jun 14, 2011 at 5:18 PM, shanWDC ssar...@web.com wrote: Is there a way to inject urls in the injector, through code, rather than specifying a url file? -- View this message in context: http://lucene.472066.n3.nabble.com/Injecting-urls-through-code-instead-of-file-tp3063662p3063662.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
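That said, the Injector can be driven from Java directly. A minimal sketch against the Nutch 1.2/1.3 API (note that inject() still reads seed URLs from text files under a directory, so a URL store would first have to be written out to, say, a temp dir; the class name and paths here are hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.nutch.crawl.Injector;
  import org.apache.nutch.util.NutchConfiguration;

  public class ProgrammaticInject {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      Injector injector = new Injector(conf);
      // write your URLs to text files under /tmp/seeds before this call
      injector.inject(new Path("crawl/crawldb"), new Path("/tmp/seeds"));
    }
  }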
Re: Problem with Nutch Search
Off the top of my head one property springs to mind, which you may or may not have configured in nutch-site: http.content.limit. However I think that this is not the source of the problem. I would advise you to have a look at your hadoop log file for any obvious warnings... how do you know it indexes about 50 lines and after that does not sweep over the rest of the text? Have you looked at a dump of the crawldb to see what content the database is aware of? Without verifying answers to some of the above it is hard to decouple the errors in nutch from the legacy architecture of Nutch 1.3. On Thu, Jun 16, 2011 at 3:03 PM, Jefferson jeff151520...@msn.com wrote: Hi, I'm testing Nutch. I followed the Nutch tutorial, but I found a problem. I ran the command bin/nutch crawl over 6 sites in plain text that each contain only about 400 lines of text; so far so normal. When I do a search with Nutch, it sweeps up about 50 lines; after that it does not sweep over the rest of the text. If I search, for example, for church and this word is beyond the first 50 lines of text, it returns 0 results. Anyone have any solution for this? -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-Nutch-Search-tp3072077p3072077.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
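For the crawldb dump mentioned above, something like this (paths assumed to match the crawl directory used above):

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -dump crawldb-dump

-stats gives a quick summary of how many URLs were fetched; -dump writes the per-URL records out as text so you can see exactly what the database is aware of.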
Re: I need step-by-step tutorial to run Nutch 1.2 from source code
Hi Mohammad, Try looking at the pre nutch 1.3 material on the wiki, I'm sure there must be something in there you can build on... or that will at least point you in the right direction http://wiki.apache.org/nutch/Archive%20and%20Legacy HTH On Fri, Jun 17, 2011 at 9:27 PM, Mohammad Hassan Pandi pandi...@gmail.comwrote: Hi everybody! I have already installed Hadoop 0.20.2 on a two-node cluster and I want to run Nutch 1.2 source code just to have a feeling of how it works. I need a step-by-step tutorial to do that. -- *Lewis*
Re: Empty indexes folder after crawling!
Have you set your crawl directory property value in nutch-site.xml when launching the war file on Tomcat? On Tue, Jun 21, 2011 at 4:01 AM, Mohammad Hassan Pandi pandi...@gmail.com wrote: Following http://wiki.apache.org/nutch/NutchHadoopTutorial I crawled lucene.apache.org with the command bin/nutch crawl urlsdir -dir crawl -depth 3 and copied the whole thing to the local file system by running the command bin/hadoop dfs -copyToLocal crawl /d01/local/ but the indexes folder is empty. This causes no results when searching for a query in the Nutch UI -- *Lewis*
Re: how to classify the search results by an indexed field with lucene?
To give a short answer to your question: the answer is I don't know. Many of us are not using Lucene as the indexing mechanism. I think as this is specifically linked to Lucene you would be better off asking there. Try the user list http://lucene.apache.org/java/docs/mailinglists.html#Java User List On Tue, Jun 21, 2011 at 6:58 AM, Joey majunj...@gmail.com wrote: Hi all, Has anyone ever encountered this problem before? Looking forward to your reply. :-) Thanks. Regards, Joey On 06/20/2011 02:09 PM, Joey Ma wrote: Hi all, I use Lucene as the indexer in nutch 1.2. I want to get classified search results by an indexed field, for example to show the hit count distributions over the different months of a year. I found that in Lucene 2.* this could be achieved with the QueryFilter().bits(IndexReader) method, calculating the hit count for each category. But in Lucene 3.*, the QueryFilter class has been removed and I couldn't find the equivalent of that method. Could anyone tell me how to achieve this effectively? Thanks very much. Regards, Joey -- *Lewis*
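For what it's worth, a rough Lucene 3.x equivalent of the old QueryFilter().bits() counting approach might look like this (a sketch, untested against your index):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.DocIdSet;
  import org.apache.lucene.search.DocIdSetIterator;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.QueryWrapperFilter;

  // Count the hits for one category (e.g. one month's date range).
  static int countHits(IndexReader reader, Query categoryQuery) throws Exception {
    DocIdSet docs = new QueryWrapperFilter(categoryQuery).getDocIdSet(reader);
    DocIdSetIterator it = (docs == null) ? null : docs.iterator();
    int count = 0;
    if (it != null) {
      while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
      }
    }
    return count;
  }

Calling this once per month bucket gives the distribution; Solr's faceting does the same thing out of the box, which is another reason many of us have moved in that direction.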
Re: Where Can I find Nutch war file??
Hi, Assuming that you are using 1.2 the war file should definitely be there. You will be able to get step by step directions for this in the tutorial on the Nutch site. http://wiki.apache.org/nutch/NutchTutorial Note that this will be getting updated soon to reflect changes incorporated into release 1.3, therefore search and indexing will not be covered under the legacy Lucene architecture and there will be no WAR file to locate if using 1.3. On Tue, Jun 21, 2011 at 12:06 AM, Mohammad Hassan Pandi pandi...@gmail.com wrote: Thanks for your response. I got nutch-2010-07-07_04-49-04.tar.gz, extracted and opened up the directory in Eclipse and ran build.xml. There are several tasks in build.xml such as init, compile, compile-core. The tutorial I followed (http://wiki.apache.org/nutch/NutchHadoopTutorial) says choose the job (the default task) and package tasks. I chose them and ran but no war file is created. On Tue, Jun 21, 2011 at 11:12 AM, Hasan Diwan hasan.di...@gmail.com wrote: You'll need to build it yourself -- try $ANT_HOME/bin/ant war or %ANT_HOME%\bin\ant war. Let me know how you get on... On 20 June 2011 23:19, Mohammad Hassan Pandi pandi...@gmail.com wrote: Hi guys, there is no war file in the build folder of my nutch. Where can I find the nutch war file to deploy on tomcat? -- Sent from my mobile device -- *Lewis*
Re: helpful books or tutorials on nutch
As this is open source I think the best way to solve your question/request is to get down and dirty with your own configuration. Many implementation scenarios are unique; to a new Nutch user this may provide no immediate help, however it clearly displays the adaptability and extensibility of the Nutch framework, and this can only be learned and understood by adhering to the suggestions made above. There are various books out there which include commentary on Nutch but none that will give you a one-stop shop for all answers. I have yet to find one which fully documents a real world scenario... With regards to Luke you may be best off asking on the Google group you highlighted, however has anyone else tried this out and can confirm? On Tue, Jun 21, 2011 at 9:30 AM, Shouguo Li the1plum...@gmail.com wrote: hey guys, i know this question has been asked several times on this mailing list but i didn't see good answers in the archive. are there any books or online tutorials that walk you through nutch with a couple of real world scenarios? there are several wiki pages on nutch.apache.org, but they're too brief, and somewhat out of date. also, i tried out nutch 1.3 and solr, but i can't browse the index using the latest luke tool even though luke says it's compatible with lucene 3 now, http://code.google.com/p/luke/downloads/detail?name=lukeall-3.1.0.jar&can=2&q= thx! -- *Lewis*
Re: Solrdedup NPE
Hi Markus, Can you list the steps you executed prior to the solrdedup please? I think I encountered something similar a while back and as my work was moving on I didn't get a chance to investigate it fully. On Tue, Jun 21, 2011 at 1:54 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Any idea what the exception below can result from? The dedup queries go all right and produce normal results. Some indices will not generate this NPE. Cheers, 11/06/21 20:47:37 WARN mapred.LocalJobRunner: job_local_0001 java.lang.NullPointerException at org.apache.hadoop.io.Text.encode(Text.java:388) at org.apache.hadoop.io.Text.set(Text.java:178) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) 11/06/21 20:47:37 INFO mapred.JobClient: map 0% reduce 0% 11/06/21 20:47:37 INFO mapred.JobClient: Job complete: job_local_0001 11/06/21 20:47:37 INFO mapred.JobClient: Counters: 0 Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:363) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:375) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:380) -- *Lewis*
Re: Building Nutch 2.0 from the trunk
I tried to build Nutch trunk in Eclipse about two months ago. Gora built fine and from memory it was the ivy configuration within Nutch which had to be altered. I'm positive the problems I was having have now been rectified but I haven't tried since. That is why I am interested in why the JUnit tests failed, as I thought the only problem with the build was with my Gora dependency. Sorry this is off topic. To relate to the original question: have you been able to build Nutch trunk using Markus' comments above? On Thu, Jun 23, 2011 at 3:28 PM, Markus Jelsma markus.jel...@openindex.io wrote: You can safely build Nutch trunk with Gora 1089728. I can also build the current Nutch and Gora trunks. What error do you get? Hi, I think this is your second thread on this topic? I tried to get trunk to build but was unable, as there are problems with Gora, as Julien highlighted to me some time ago. My first question is: did you get trunk to build following the tutorial you have highlighted? The problem I was having was with Gora, not with any JUnit tests. Can you please expand on your actions a bit. Thanks On Wed, Jun 22, 2011 at 4:50 AM, Nutch User - 1 nutch.use...@gmail.com wrote: Could someone give me step-by-step instructions on how to build Nutch 2.0 from the trunk and run it? I tried to follow this (http://techvineyard.blogspot.com/2010/12/build-nutch-20.html), but failed to do so as described here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html). -- *Lewis*
Re: Problem in search
Hi Jefferson, I cannot access either your nutch-site or nutch-default but I see that your http.content.limit is INFO http.Http - http.content.limit = 65536 It is a fairly large page so maybe this can be the cause. I'm sorry, I don't have access to my Linux workstation so I can't test myself; can you please advise if this has been accounted for in your nutch-site. Anything over the default 65536 limit is truncated, therefore you may not be able to search for it. Further to this, it seems that the hadoop.log does not show any erratic behaviour. On Fri, Jun 24, 2011 at 7:40 AM, Jefferson jeff151520...@msn.com wrote: My problem is in the search. I crawled the site http://en.wikipedia.org/wiki/Albert_Einstein When I access http://localhost:8080/nutch-1.1/ and type Adolf Hitler it returns me a result, OK. When I type phenomena it returns 0 results, not OK. Attached are my config files and logging. thanks http://lucene.472066.n3.nabble.com/file/n3104461/nutch-site.xml nutch-site.xml http://lucene.472066.n3.nabble.com/file/n3104461/nutch-default.xml nutch-default.xml http://lucene.472066.n3.nabble.com/file/n3104461/hadoop.log hadoop.log http://lucene.472066.n3.nabble.com/file/n3104461/crawl.log crawl.log -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3104461.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Apache Nutch 1.3 tutorial now on Wiki
Hi all, With permission from the author I managed to adapt a blog entry for the above which can be found here. At this stage I would ask for anyone interested to make changes/improvements/etc. Once we can verify the integrity and accuracy of the entry it would be nice to rebuild the website with this tutorial as the most recent tutorial resource for getting Nutch 1.3 up and running. Thank you -- *Lewis*
Re: Problem in search
Can you expand on this? I am not understanding your description of the problem. On Fri, Jun 24, 2011 at 12:52 PM, Jefferson jeff151520...@msn.com wrote: Done. Now I have another problem: I type phenomena and it returns this: - Albert Einstein - Wikipedia, the free encyclopedia Albert Einstein From Wikipedia, the free encyclopedia Jump ... - What might be happening? Thanks for the help. Below are my configuration files: http://lucene.472066.n3.nabble.com/file/n3105976/nutch-default.txt nutch-default.txt http://lucene.472066.n3.nabble.com/file/n3105976/nutch-site.txt nutch-site.txt -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3105976.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: Problem in search
I see within your nutch-site file that you have set an http.content.limit value of 340,671. Is there any reason for this value? I'm assuming you are not indexing this page merely so you can search for the term phenomena, and that there is other textual content within the page that you are interested in... would this assumption be right? As Markus explained, the page has an HTTP content length of some 600,000, and from looking at where the first occurrence of the term phenomena is, it is located roughly halfway through the page. When crawling large sites such as Wikipedia (which we all know contains large HTTP content within its webpages), I have found that a safeguard measure to ensure we get all page content is to set http.content.limit to a negative value, e.g. -1. This way we are guaranteed that we get all page content. Another useful tool which is widely used is LUKE [1]; this will enable you to search your Lucene index and confirm whether or not Nutch has fetched and sent the content you wish to be stored within your index. [1] http://code.google.com/p/luke/ On Sat, Jun 25, 2011 at 7:42 AM, Jefferson jeff151520...@msn.com wrote: The problem is that it returns the beginning of the text section of the website. The correct behaviour would be to return the passage in which the word phenomena is found. Sorry for my English... Jefferson -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3107810.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
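For reference, the nutch-site.xml fragment for the safeguard described above would look something like this (a sketch; -1 disables truncation entirely, so watch your segment sizes):

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>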
Nutch Gotchas as of release 1.3
Hello list, Do we have any suggestions we wish to discuss regarding the above? thanks -- *Lewis*
Re: Empty indexes folder after crawling!
Try reading the tutorial on the wiki for the 1.3 release. It gives step-by-step stages for crawling and indexing, then setting up the Nutch WAR in Tomcat and searching. You can find it under the archives section of the Nutch wiki. On Sat, Jun 25, 2011 at 9:12 PM, Mohammad Hassan Pandi pandi...@gmail.com wrote: My nutch-site.xml is empty. Perhaps that means Nutch uses the default path as the index location, right? On Thu, Jun 23, 2011 at 10:57 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Have you set your crawl directory property value in nutch-site.xml when launching the war file on Tomcat? On Tue, Jun 21, 2011 at 4:01 AM, Mohammad Hassan Pandi pandi...@gmail.com wrote: Following http://wiki.apache.org/nutch/NutchHadoopTutorial I crawled lucene.apache.org with the command bin/nutch crawl urlsdir -dir crawl -depth 3 and copied the whole thing to the local file system by running the command bin/hadoop dfs -copyToLocal crawl /d01/local/ but the indexes folder is empty. This causes no results when searching for a query in the Nutch UI -- *Lewis* -- *Lewis*
Re: Using nutch 1.3 in Eclipse
I will try to get a wiki entry for this sorted ASAP as it is a fundamental requirement for anyone wishing to debug/understand how classes work in Nutch 1.3. When the time comes around, any opinions/comments you have would be a great addition. Thanks 2011/6/30 Nutch User - 1 nutch.use...@gmail.com On 01/01/2011 08:52 AM, jeffersonzhou wrote: When you new a Java project for Nutch 1.3, what default location did you use? The folder where you unzipped the software or the runtime/local? -Original Message- From: Nutch User - 1 [mailto:nutch.use...@gmail.com] Sent: Thursday, June 30, 2011 2:43 PM To: user@nutch.apache.org Subject: Re: Using nutch 1.3 in Eclipse On 06/30/2011 02:00 AM, dyzc wrote: Hi, is there any information regarding working with nutch 1.3 in Eclipse? Thanks I got it working with the help of this (http://wiki.apache.org/nutch/RunNutchInEclipse1.0). However, I have had serious difficulties with the trunk of 2.0 and Eclipse as described here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html). I did as the tutorial suggested: Select Create project from existing source and use the location where you downloaded Nutch. I may have copied some .jar files from runtime/local/lib to lib, or else Ivy obtained them when running the Ant build from Eclipse. -- *Lewis*
Re: Memory leak in fetcher (1.0) ?
How many threads do you have running concurrently? Is there any log output to indicate any warnings or errors otherwise? On Sat, Jul 2, 2011 at 7:40 AM, Markus Jelsma markus.jel...@openindex.io wrote: Does it run out of memory? Is GC able to reclaim consumed heap space? I have a 300K-URL segment to fetch (no parsing) and I see memory continuously growing... looking like a memory leak. I have patches 769, 770 installed, and did not see any other patches related to memory leaks. -- *Lewis*
Nutch 1.3 CommandLineOptions updated to reflect new changes
Hi, Just finished the above, which you can find here [1] so please check out the pages if you are having trouble passing parameters to any commands. It would be great to mention if there are any mistakes or even better edit or add any missing information you think would make the documentation clearer. Also you will see that there is a section at the bottom of the page subtitled 'other classes', feel free to add any classes you have been using which we have not already included. Thanks [1] http://wiki.apache.org/nutch/CommandLineOptions -- *Lewis*
Re: Problems when crawling a .nsf site
Absolutely... There is a short (old) thread here on this topic [1]; from what I can see this issue has not been addressed. Therefore it looks like implementing your own parser plugin is what's required. [1] http://www.lucidimagination.com/search/document/a8d53fac1caa578c/nutch_with_nsf_files 2011/7/3 Alexander Aristov alexander.aris...@gmail.com Hi, If it is a text file then you can simply associate the extension with the text parser. But if I understand you right it's a Lotus DB file, in which case I suspect you have no other choice than implementing your own parser. I haven't heard of Lotus file support in nutch. Best Regards Alexander Aristov 2011/7/3 丛云牙之主 yanhaora...@qq.com Hello, I am using nutch-1.2 and have encountered a problem. The site is written with Lotus Domino; when I enter it with the browser and click on the links, the site URL does not change, unlike some sites which have a lot of suffixes. Then there is a web site, buptoa.bupt.edu.cn/student_broad.nsf, which I wanted to crawl, but it takes .nsf files and nutch does not support crawling .nsf files. Should I write my own plugin or should I solve this problem from the other side? Extremely grateful for your help -- *Lewis*
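If the server does hand the .nsf resource back with a resolvable content type, Alexander's first suggestion could be wired up in conf/parse-plugins.xml roughly like this (the mime type name is a guess at what the server actually sends; check the fetch logs first):

  <mimeType name="application/x-lotus-notes">
    <plugin id="parse-text" />
  </mimeType>

If the content is really a binary Notes database rather than text, this mapping won't produce anything useful and a dedicated parser plugin is the way to go.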
Re: Searching for documents with a certain boost value
Hi, I am sorry that I have not been able to try and replicate the scenario and confirm whether I get zero scores in a similar situation as I am temporarily unable to do so but I would like to add this resource [1], if you have not seen it yet. I am aware that this doesn't address the problem directly but if we can start thinking more about the way scoring is done then maybe we can get further to uncovering the solution to finding whether or not we can search for fields within our document or documents within our index which have a boost value of zero. Obviously the reference I include is relevant specifically to Nutch versions using Lucene however I'm hoping that as we are referring to scoring done by the OPIC filter that the outcome will be consistent across versions including those which do not use legacy Lucene. Can someone please correct me if I am wrong here... Focussing specifically on your question, it appears that a document field is not shown if a term was not found in a particular field e.g. there is no score value given. This would suggest that we cannot query for it, therefore my gut instinct is that we cannot query for a zero value present within these fields. N.B I cannot confirm this, I am merely going on the little research I have done into the OPIC scoring algorithm. It would be nice if someone could confirm otherwise and correct me though. [1] http://wiki.apache.org/nutch/FAQ#How_is_scoring_done_in_Nutch.3F_.28Or.2C_explain_the_.22explain.22_page.3F.29 -- Forwarded message -- From: Nutch User - 1 nutch.use...@gmail.com Date: Mon, Jul 4, 2011 at 12:43 AM Subject: Searching for documents with a certain boost value To: user@nutch.apache.org Hi. As I have described here ( http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html ) I have encountered a situation where some of my indexed documents have zero boost value. I'd like to know if there's a way to search which ones have zero as their boost value. I have tried to do a Lucene query with Luke but it failed. The query was: boost:00 00 00 00. (The boost field seems to be a binary one, so it may have something to do with the problem.) I allowed leading * in wildcard queries, and url:* returned me every document as it should. However, boost:* returned none. Can this boost field even be used as a search criteria? Best regards, Nutch User - 1 -- *Lewis*
Crawling a relational database
Hi, I'm curious to hear if anyone has information for configuring Nutch to crawl an RDB such as MySQL. In my hypothetical example there are N databases residing in various distributed geographical locations; to make a worst case scenario, say that they are NOT all the same type, and I wish to use Nutch trunk 2.0 to push the results to some other structured data store which I can then connect to to serve search results. Does anyone have any information such as an overview of database crawling and serving using Nutch? I have been unsuccessful obtaining info on the Web as query results are ambiguous and usually refer to crawldb or linkdb. If I can get this it would be a really nice entry for inclusion in our wiki. Thanks for any suggestions or info. -- *Lewis*
Re: Crawling a relational database
thanks to you both On Tue, Jul 5, 2011 at 4:35 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, About geographical search: Solr will do this for you. Built-in for 3.x+ and using third-party plugins for 1.4.x. Both provide different features. In Solr you'd not base similarity on geographical data but use spatial data to boost textually similar documents instead, or filter. This keeps text similarity intact and offers spatial features on top. You'll get more feedback on the Solr list indeed :) Cheers Thanks for this Markus, it had occurred to me that DIH was a very plausible solution to progress with. I think you have just confirmed it due to the flexibility it offers amongst other attributes. I'm looking at creating a context-aware web application which would use geographical search to obtain results based on location. This is required as the data will contain (amongst others) fields with integer values which vary dependent upon a building location cost index. Similarity is directly linked through a geographical location factor. I wanted to have the data stored within the N distributed RDBs available in a cloud environment which could be searched, as opposed to the non-trivial task of searching across a fragmented, distributed number of DBs. As you mention, it does make more sense to save documents in a doc (or column) oriented DB. Essentially, using the DIH tool would remove the requirement for Nutch? I think to progress with this, I'm best moving the thread to solr-user@ if I have further questions. Thank you On Tue, Jul 5, 2011 at 3:53 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Lewis, It sounds to me you'd be better off using Solr's very advanced DataImportHandler [1]. It can (delta) import data from your RDBMSs and offers much flexibility in how to transform entities. Besides crawling you also mention you'd like to push results (of what) to another structured data store. But why would you want that? Handling, processing and serving search results is done by Solr (and ES in the future) and since our entities are flat (just a document) it makes more sense to me to save documents in a document (or column) oriented DB. [1] http://wiki.apache.org/solr/DataImportHandler Cheers, Hi, I'm curious to hear if anyone has information for configuring Nutch to crawl an RDB such as MySQL. In my hypothetical example there are N databases residing in various distributed geographical locations; to make a worst case scenario, say that they are NOT all the same type, and I wish to use Nutch trunk 2.0 to push the results to some other structured data store which I can then connect to to serve search results. Does anyone have any information such as an overview of database crawling and serving using Nutch? I have been unsuccessful obtaining info on the Web as query results are ambiguous and usually refer to crawldb or linkdb. If I can get this it would be a really nice entry for inclusion in our wiki. Thanks for any suggestions or info. -- *Lewis*
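To make Markus' DIH suggestion concrete, a minimal data-config.xml sketch (driver, connection details, table and field names are all hypothetical):

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost/buildings"
                user="reader" password="secret"/>
    <document>
      <entity name="building"
              query="SELECT id, name, cost_index, lat, lon FROM building">
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="cost_index" name="cost_index"/>
        <field column="lat" name="lat"/>
        <field column="lon" name="lon"/>
      </entity>
    </document>
  </dataConfig>

Each entity can point at a different data source, which is how the "N databases of different types" scenario would be handled, and delta queries cover incremental imports.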
Re: crawling a list of urls
Hi C.B., This is way too vague. We really require more information regarding roughly what kind of results you wish to get. It would be a near impossible task for anyone to try and specify a solution to this open-ended question. Please elaborate. Thank you On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz camb...@gmail.com wrote: Hello, I have a case where I need to crawl a list of exact URLs, somewhere in the range of 1 to 1.5M URLs. I have written those urls in numerous files under /home/urls, i.e. /home/urls/1 /home/urls/2 Then by using the crawl command I am crawling to depth=1 Are there any recommendations or general guidelines that I should follow when making nutch just fetch and index a list of urls? Best Regards, C.B. -- *Lewis*
Re: Problems with nutch tutorial
Hi Paul, Please see this tutorial for working with Nutch 1.3 [1]. The tutorial you were using is for Nutch 1.2 from memory. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr Thank you On Thu, Jul 7, 2011 at 1:17 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: I'm completely new to nutch so I downloaded version 1.3 and worked through the beginners tutorial at http://wiki.apache.org/nutch/NutchTutorial. The first problem was that I did not find the file conf/crawl-urlfilter.txt so I omitted that and continued with launching nutch. Therefore I created a plain text file in /Users/toom/Downloads/nutch-1.3/crawled called urls.txt which contains the following text: tom:crawled toom$ cat urls.txt http://nutch.apache.org/ So after that I invoked nutch by calling tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50 solrUrl is not set, indexing will be skipped... crawl started in: /Users/toom/Downloads/nutch-1.3/sites rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled threads = 10 depth = 3 solrUrl=null topN = 50 Injector: starting at 2011-07-07 14:02:31 Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-07-07 14:02:35, elapsed: 00:00:03 Generator: starting at 2011-07-07 14:02:35 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 50 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238 Generator: finished at 2011-07-07 14:02:39, elapsed: 00:00:04 Fetcher: No agents listed in 'http.agent.name' property. Exception in thread main java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property. at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166) at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068) at org.apache.nutch.crawl.Crawl.run(Crawl.java:135) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:54) I do not understand what happened here, maybe one of you can help me? -- *Lewis*
Re: crawling a list of urls
See comments below. On Thu, Jul 7, 2011 at 4:31 PM, Cam Bazz camb...@gmail.com wrote: Hello Lewis, Pardon me for the non-verbose description. I have a set of URLs, namely product URLs, in the range of millions. Firstly (this is just a suggestion), I assume that you wish Nutch to fetch the full page content; ensure that http.content.limit is set to an appropriate limit to allow this. So I want to write my urls in a flat file, and have nutch crawl them to depth = 1. As you describe, you have various seed directories, so I assume that crawling a large set of seeds will be a recursive task. IMHO I would save myself the manual task of running the jobs and write a bash script to do this for me (a minimal skeleton is sketched after this message); this will also enable you to schedule a once-a-day update of your crawldb, linkdb, Solr index and so forth. There are plenty of scripts which have been tested and used throughout the community here http://wiki.apache.org/nutch/Archive%20and%20Legacy#Script_Administration However, I might remove URLs from this list, or add new ones. I would also like nutch to revisit each site every day. Check out nutch-site for crawldb fetch intervals; these values can be used to accommodate the dynamism of various pages. Once you have removed URLs (this is going to be a laborious and extremely tedious task if done manually), you would simply run your script again. I would like removed URLs to be deleted, and new ones to be reinjected each time nutch starts. With regards to deleting URLs in your crawldb (and subsequently index) I am not sure of this exactly. Can you justify completely deleting the URLs from the data store? What happens if you add the URL in again the next day? I'm not sure this is a sustainable method for maintaining your data store/index. Best Regards, -C.B. On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi C.B., This is way too vague. We really require more information regarding roughly what kind of results you wish to get. It would be a near impossible task for anyone to try and specify a solution to this open-ended question. Please elaborate. Thank you On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz camb...@gmail.com wrote: Hello, I have a case where I need to crawl a list of exact URLs, somewhere in the range of 1 to 1.5M URLs. I have written those urls in numerous files under /home/urls, i.e. /home/urls/1 /home/urls/2 Then by using the crawl command I am crawling to depth=1 Are there any recommendations or general guidelines that I should follow when making nutch just fetch and index a list of urls? Best Regards, C.B. -- *Lewis* -- *Lewis*
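For what it's worth, a minimal skeleton of such a script (Nutch 1.2/1.3 local mode assumed; the paths are the ones from the earlier message):

  #!/bin/bash
  # depth-1 recrawl: inject, generate one segment, fetch, parse, update
  CRAWLDB=/home/crawl/crawldb
  SEGMENTS=/home/crawl/segments
  bin/nutch inject $CRAWLDB /home/urls
  bin/nutch generate $CRAWLDB $SEGMENTS
  SEGMENT=$(ls -d $SEGMENTS/* | tail -1)   # newest segment just generated
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT

Run it from cron once a day; add solrindex (or the legacy index step) at the end as needed.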
Re: no agents listed in 'http.agent.name'
Hi Serenity, I don't know if you are aware but this message has been duplicated across both user@ and nutch-user@. General good practice for what to put in nutch-site and nutch-default can be found here [1] and here [2]; it is not required to add the properties to both of the conf files. To address your problem specifically, it should be pretty straightforward to implement this in nutch-site.xml; try copying over properties one by one and gradually building up a picture of where the discrepancy may be. [1] http://wiki.apache.org/nutch/FAQ#I_have_two_XML_files.2C_nutch-default.xml_and_nutch-site.xml.2C_why.3F [2] http://wiki.apache.org/nutch/NutchConfigurationFiles On Thu, Jul 7, 2011 at 4:45 PM, serenity serenitykenings...@gmail.com wrote: Hello Friends, I am experiencing the error message fetcher: no agents listed in 'http.agent.name' property when I am trying to crawl with Nutch 1.3. I referred to other mails regarding the same error message and tried to change the nutch-default.xml and nutch-site.xml file details with <property> <name>http.agent.name</name> <value>My Nutch Spider</value> <description>EMPTY</description> </property> I also filled out the other property details without leaving blanks and am still getting the same error. May I know my mistake? Thanks, Serenity -- View this message in context: http://lucene.472066.n3.nabble.com/no-agents-listed-in-http-agent-name-tp3148609p3148609.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: Partitioning selected urls for politeness and scoring
Yes this would limit the number of URLs from any one domain, but it would not explain why one domain seems to get fetched more after recursive fetches of some given seed set. Can you explain more about your crawling operation? Are you executing a crawl command? If so, what arguments are you passing? If not, can you give more detail of the job you are running. Thank you On Fri, Jul 8, 2011 at 2:50 PM, Hannes Carl Meyer hannesc...@googlemail.com wrote: Hi, you could set generate.max.per.host to a reasonable size to prevent this! On a default configuration this is set to -1 which means unlimited. BR Hannes --- Hannes Carl Meyer www.informera.de On Fri, Jul 8, 2011 at 2:53 PM, Eggebrecht, Thomas (GfK Marktforschung) thomas.eggebre...@gfk.com wrote: Hi list, My seed list contains URLs from about 20 different domains. In the first fetch cycles everything is all right and all domains will be selected quite equally distributed. But after about 10-15 cycles one domain starts to prevail. URLs from all other domains will not be selected anymore. It seems that URLs from that certain domain have the highest scoring and URLs from other domains don't have a chance anymore. Is this a right assumption? I'm not very happy because I would like to fetch URLs from all domains in each cycle. What would you do in that case? Best regards and thanks for answers Thomas (Using nutch-1.2) -- *Lewis*
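For reference, Hannes' suggestion as a nutch-site.xml fragment (the value 100 is illustrative only):

  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>

With this set, each generate cycle selects at most 100 URLs per host, so no single domain can monopolise the fetch list.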
Re: How to deploy Nutch 1.3 in the web server
The web app was deprecated when we released Nutch 1.3. This was so we could use the Solr interface for searching and offload the bulk associated with the web app (amongst other things). There has been quite a lot of chat regarding this on this list over the last while. The last version of Nutch to use the web app was Nutch 1.2. On Fri, Jul 8, 2011 at 8:42 PM, serenity serenitykenings...@gmail.com wrote: Hello, I downloaded and installed Nutch 1.3 successfully and would like to deploy it in the webserver. Do I need to modify the existing build.xml file for generating the war file? Serenity -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-deploy-Nutch-1-3-in-the-web-server-tp3152969p3152969.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: skipping invalid segments
Hi C.B., It looks like you may have simply missed the '-dir' flag when specifying the segments directory from which your crawldb should be updated. Have a look here [1]. Can you please try it and post your results. [1] http://wiki.apache.org/nutch/bin/nutch_updatedb On Fri, Jul 8, 2011 at 5:06 PM, Cam Bazz camb...@gmail.com wrote: Hello, I tried to crawl manually, only a list of urls. I have issued the following commands: bin/nutch inject /home/crawl/crawldb /home/urls bin/nutch generate /home/crawl/crawldb /home/crawl/segments bin/nutch fetch /home/crawl/segments/123456789 bin/nutch updatedb /home/crawl/crawldb /home/crawl/segments/123456789 -noAdditions However, for the last command it skips the segment 123456789, saying it is an invalid segment. This is exactly what I need (the -noAdditions flag) but it will not updatedb. What might I have done wrong? Best Regards, -C.B. -- *Lewis*
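Two things worth trying, as a sketch. First, the command sequence above never runs a parse, and from memory updatedb treats a segment without a crawl_parse directory as invalid, so parse the segment first (an assumption worth checking in the logs). Second, the -dir form that the wiki page shows:

  bin/nutch parse /home/crawl/segments/123456789
  bin/nutch updatedb /home/crawl/crawldb -dir /home/crawl/segments -noAdditions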
Re: Integrating Solr 3.2 with Nutch 1.3
Hi Serenity, How did you execute the crawl? With the crawl command? Have you ensured that parsing has been done? This looks like a different IIE (InvalidInputException) than others have been getting when indexing to Solr, so please ensure that parsing has been done on all fetched content. On Fri, Jul 8, 2011 at 6:20 PM, serenity serenitykenings...@gmail.com wrote: Hello, I successfully installed both Solr 3.2 and Nutch 1.3 separately and both of them are working well. Now, I am trying to integrate them to get search results over what has already been crawled and indexed by Nutch 1.3. I followed the steps according to the following URLs but nothing displays in the Solr search. http://wiki.apache.org/nutch/RunningNutchAndSolr http://thetechietutorials.blogspot.com/2011/06/solr-and-nutch-integration.html After I run the command "bin/nutch solrindex http://127.0.0.1:8983/solr/ firstSite/crawldb firstSite/linkdb firstSite/segments/*" to send crawl data to Solr for indexing, it fetches the links but I receive the following error: *Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/c:/apache-nutch-1.3-src/runtime/local/firstSite/segments/20110706135037/parse_data* May I know if I need to make any changes to the schema.xml file prior to copying it into the solr/conf folder? Serenity -- View this message in context: http://lucene.472066.n3.nabble.com/Integrating-Solr-3-2-with-Nutch-1-3-tp3152501p3152501.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
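If parsing was indeed skipped, something along these lines should create the missing parse_data before re-running the index step (the segment name is taken from the error above; a sketch, unverified on Windows/Cygwin):

  bin/nutch parse firstSite/segments/20110706135037
  bin/nutch solrindex http://127.0.0.1:8983/solr/ firstSite/crawldb firstSite/linkdb firstSite/segments/*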
Re: custom extractor
Hi C.B., Your description gets slightly cloudy towards the end, e.g. around "One difficulty with my htmlcleaner... taken from firebug"? Are you trying to say that some of the URLs are bad HTML, and you know this because it is flagged up by Firebug? If this is the case, are you able to edit the HTML and make it well-formed, so to speak? It would also be of great help if you could post a small example of the type of XPath extraction you are looking to do; if anyone has built plugins implementing XPath (which I have not) then they may be able to comment further. (A bare-bones parse filter skeleton is sketched after this message for orientation.) On Wed, Jul 6, 2011 at 5:10 PM, Cam Bazz camb...@gmail.com wrote: Hello, Previously I had built a primitive crawler in Java, extracting certain information per HTML page using XPaths. Then I discovered nutch, and now I want to be able to extract certain elements in the DOM, through XPath, with multiple XPaths per site. I am crawling a number of web sites, let's say 16, and I would like to be able to write multiple XPaths per site, and then index the output of those extractions in solr, as different fields. I have googled for a while, and I understand a plugin can be developed that will act as a custom html parser. I understand that another path is using tika. I have also experimented with the boilerpipe library, and it was insufficient to extract the data I want. (I am extracting specifications of certain products, usually in tables, and fragmented.) One difficulty with my htmlcleaner-based XPath evaluator was that real-world HTML was sometimes broken, and even when I cleaned it, htmlcleaner would not find XPaths taken from Firebug. Which way should I start? Any ideas / help / recommendations greatly appreciated. Best Regards, C.B. -- *Lewis*
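For orientation, a bare-bones HtmlParseFilter skeleton against the Nutch 1.2/1.3 plugin API. A sketch only: the package, class name and metadata key are hypothetical, and the plugin still needs a plugin.xml registering it under the HtmlParseFilter extension point:

  package org.example.nutch;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  public class XPathExtractorFilter implements HtmlParseFilter {
    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      // Evaluate your per-site XPath expressions against doc here and
      // stash the extracted values in the parse metadata so an indexing
      // filter can later turn them into Solr fields, e.g.:
      // parseResult.get(content.getUrl()).getData().getParseMeta()
      //     .set("product_spec", extractedValue);
      return parseResult;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }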
Re: Building Nutch 2.0 from the trunk
Hi, Just thought it reasonable to come back to this one and confirm that all the current trunks below build successfully (after some simple configuration), and that it is possible to get a Nutch/Gora/HBase trunk implementation up and running in good time should you wish. These technologies are pretty dynamic just now and there is a lot of exciting stuff in the pipeline for the near future. Thanks On Thu, Jun 23, 2011 at 11:55 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I tried to build Nutch trunk in Eclipse about two months ago. Gora built fine and from memory it was the ivy configuration within Nutch which had to be altered. I'm positive the problems I was having have now been rectified but I haven't tried since. That is why I am interested in why the JUnit tests failed, as I thought the only problem with the build was with my Gora dependency. Sorry this is off topic. To relate to the original question: have you been able to build Nutch trunk using Markus' comments above? On Thu, Jun 23, 2011 at 3:28 PM, Markus Jelsma markus.jel...@openindex.io wrote: You can safely build Nutch trunk with Gora 1089728. I can also build the current Nutch and Gora trunks. What error do you get? Hi, I think this is your second thread on this topic? I tried to get trunk to build but was unable, as there are problems with Gora, as Julien highlighted to me some time ago. My first question is: did you get trunk to build following the tutorial you have highlighted? The problem I was having was with Gora, not with any JUnit tests. Can you please expand on your actions a bit. Thanks On Wed, Jun 22, 2011 at 4:50 AM, Nutch User - 1 nutch.use...@gmail.com wrote: Could someone give me step-by-step instructions on how to build Nutch 2.0 from the trunk and run it? I tried to follow this (http://techvineyard.blogspot.com/2010/12/build-nutch-20.html), but failed to do so as described here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html). -- *Lewis* -- *Lewis*
Re: Are we losing Nutch?
Hi Carmmello, I would like to stress that I am only speaking from my own views on the way the project has been moving over the last year and a half or so, but I would like to add the following points to address your quite obvious concerns. There has been a lot of correspondence on closely linked topics over the past wee while. I think developers understand that there is a small step up to address the requirements of Nutch 1.3, and there are also always inherent problems when individuals and communities are faced with change. I would like to say with confidence that the current version of Nutch is a well-refined tool which is adapting very accurately to provide the best crawling functionality for a very dynamically developing web consisting of various complex graph structures. I would like to make it clear that it is extremely important to have a stable and well-designed crawling implementation such as Nutch 1.3: if you look at the dev lists you will see the barrage of tasks, varying in complexity, functionality and accuracy, which keep Nutch running in parallel with daily changes to the dynamic web. If Nutch is not focussing on crawling, then no matter whether we have a web application interface or a Lucene index, the quality of data fetched will simply not be up to scratch. I hope you can appreciate the burden which this imposes on the directional decisions made within the Project Management Committee in the last year or so... Developers across many of the ASF projects understand that being user friendly is an excellent attribute to have in any open source Apache implementation. However projects develop, and in the case of the Apache Software Foundation, some of these projects spawn sub-projects which graduate to become their own top-level independent projects. As I'm sure you are aware, this was the case with Nutch; it therefore means that as a community we should be able to make decisions independently in the best interests of the project. There has been talk about not reinventing the wheel; well, this is also going on across the ASF's board of projects. One thing to consider is that many developers, contributors and PMC members do not belong to one project; they give up time, effort and resources to sometimes several projects, therefore it is very important that as a project Nutch prioritises removing duplication across the board. Addressing your point regarding the real objectives established at the beginning of the project, there has been significant progress made within Nutch and excellent sub-projects which have since graduated to top-level projects (I'm sure there is no reason to name them) with their own bustling communities. Allowing Nutch to stagnate and claim to be a one-size-fits-all search engine would have jeopardised the viability of all of these successful projects and would have therefore prevented the very innovation that earns open source implementations under the ASF the reputation and widespread use that projects are renowned for. For example, if we take the latest Nutch 1.3 release, we have two options for deployment: local mode (running on one machine) or deploy mode (harnessing the strength of parallel processing jobs for different kinds of Nutch users). The development has been driven purely by variances seen across the community usage of Nutch. We draw upon progress made in other delegated areas for the benefit of the project, not to isolate non-programmers from using newer versions of the code base.
I would also like to add that there are many questions asked about Solr due to a number of factors, namely: various developers/committers/PMC members of Nutch are also members of various Solr groups, and developers and users are kind enough to take the time and effort to answer Solr-related questions, as it is commonly recognised now that Solr is the widespread indexing mechanism (which also has an easily configurable GUI). It is not very often that users on the Nutch user@ list are ignored or their queries left unanswered; however, if this is the case there is good reason behind it. In general, and in my opinion, when I started using Nutch I found the help on user@ not only extremely beneficial but also a confidence boost to get me working on Solr and other project lists. I suppose that there are always two sides to every story, and it is very discomforting to hear that you are really not happy with the latest release; there was a lot of hard work put into its development. Amongst bug fixes and other potential barriers mentioned above and previously on this list, I would like to think that as the project matures its users can also recognise the dynamism which needs to exist in a project of Nutch's nature in order to present users with a stable and robust software choice. Instead of becoming handicapped we have a clear vision for Nutch 2.0, Nutch branches e.g. 1.4, and many new fixes on the way. I suppose it depends which side of the table you are on when you mention that it is
Re: html of the crawled pages.
Hi C.B., Can you please expand on this description? On Sun, Jul 10, 2011 at 11:52 AM, Cam Bazz camb...@gmail.com wrote: Hello All, Is there a way to save the plain HTML from the crawl? Or is this already stored in the segments dir? Best Regards, -C.B. -- *Lewis*
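To answer the question directly as far as I understand it: yes, assuming fetcher.store.content was left at its default of true, the raw HTML is stored in each segment's content directory, and SegmentReader can dump it back out (the segment name below is hypothetical):

  bin/nutch readseg -dump crawl/segments/20110710000000 html-dump -nofetch -nogenerate -noparse -noparsedata -noparsetext

The -no* flags suppress everything except the stored content.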
Re: Problems with tutorial
Hi, For a 1.3 tutorial please see here [1]. I am in the process of overhauling the Nutch site to accommodate new changes as per the 1.3 release. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr Thank you On Sun, Jul 10, 2011 at 3:42 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: I'm completely new to nutch so I downloaded version 1.3 and worked through the beginners tutorial at http://wiki.apache.org/nutch/NutchTutorial. The first problem was that I did not find the file conf/crawl-urlfilter.txt so I omitted that and continued with launching nutch. Therefore I created a plain text file in /Users/toom/Downloads/nutch-1.3/crawled called urls.txt which contains the following text: tom:crawled toom$ cat urls.txt http://nutch.apache.org/ So after that I invoked nutch by calling tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50 solrUrl is not set, indexing will be skipped... crawl started in: /Users/toom/Downloads/nutch-1.3/sites rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled threads = 10 depth = 3 solrUrl=null topN = 50 Injector: starting at 2011-07-07 14:02:31 Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-07-07 14:02:35, elapsed: 00:00:03 Generator: starting at 2011-07-07 14:02:35 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 50 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238 Generator: finished at 2011-07-07 14:02:39, elapsed: 00:00:04 Fetcher: No agents listed in 'http.agent.name' property. Exception in thread main java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property. at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166) at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068) at org.apache.nutch.crawl.Crawl.run(Crawl.java:135) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:54) I do not understand what happened here, maybe one of you can help me? -- *Lewis*
Re: Error Network is unreachable in Nutch 1.3
Hi, Please see this new tutorial [1] for configuring Nutch 1.3. If you are familiar/comfortable working with Solr for improvements to indexing then you will find it no problem. If you need to stick with Lucene and the web application front end then please stick with Nutch 1.2 or before. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr On Mon, Jul 11, 2011 at 3:02 PM, Yusniel Hidalgo Delgado yhdelg...@uci.cu wrote: Hello. I'm trying to run nutch 1.3 in my LAN following the NutchTutorial from the wiki page. When I try to run with these command line options: nutch crawl urls -dir crawl -depth 3 I get the following output:

solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
Injector: starting at 2011-07-11 09:35:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-11 09:35:40, elapsed: 00:00:03
Generator: starting at 2011-07-11 09:35:40
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110711093542
Generator: finished at 2011-07-11 09:35:43, elapsed: 00:00:03
Fetcher: starting at 2011-07-11 09:35:43
Fetcher: segment: crawl/segments/20110711093542
Fetcher: threads: 10
QueueFeeder finished: total 2 records + hit by time limit :0
fetching http://FIRST SITE/
fetching http://SECOND SITE/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
fetch of http://FIRST SITE/ failed with: java.net.ConnectException: Network is unreachable
-finishing thread FetcherThread, activeThreads=1
fetch of http://SECOND SITE/ failed with: java.net.ConnectException: Network is unreachable
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-11 09:35:45, elapsed: 00:00:02
ParseSegment: starting at 2011-07-11 09:35:45
ParseSegment: segment: crawl/segments/20110711093542
ParseSegment: finished at 2011-07-11 09:35:47, elapsed: 00:00:01
CrawlDb update: starting at 2011-07-11 09:35:47
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20110711093542]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-11 09:35:48, elapsed: 00:00:01
Generator: starting at 2011-07-11 09:35:48
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-07-11 09:35:49
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/yusniel/Programas/nutch-1.3/runtime/local/bin/crawl/segments/20110711093542
LinkDb: finished at 2011-07-11 09:35:50, elapsed: 00:00:01
crawl finished: crawl

According to this output, the problem is related to network access; however, I can access those web sites using Firefox. I'm using the Debian testing version. Greetings. -- *Lewis*
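(One thing worth checking, purely as a guess from the symptom: if that machine can only reach the web through an HTTP proxy, Firefox may be configured for it while Nutch is not. The protocol-http plugin honours the proxy properties in conf/nutch-site.xml; proxy.example.com and 3128 below are placeholders:

<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
</property>
)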
Re: Nutch Novice help
Hi Please see this up to date 1.3 tutorial [1] on the wiki. Please try it out and take on board Markus' points regarding Nutch trunk, as the problems you are experiencing are common with trunk as it stands. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr On Mon, Jul 11, 2011 at 10:50 PM, Sethi, Parampreet parampreet.se...@teamaol.com wrote: Hi All, Sorry for such a naïve question, I downloaded the nutch 1.3 binary today and am trying to set it up as mentioned in the Tutorial at http://wiki.apache.org/nutch/NutchTutorial However I am not able to find crawl-urlfilter.txt inside the conf directory. Is there any other place where I should look for this file? Thanks Param -- *Lewis*
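(For anyone hitting the same question: crawl-urlfilter.txt is gone in 1.3 and its job is done by conf/regex-urlfilter.txt. A sketch of scoping a crawl to a single site there -- the domain is only an example:

# accept anything under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch.apache.org/
# reject everything else
-.
)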
Re: developing nutch, either in eclipse or netbeans
I must admit Markus that I agree with you that for making ad-hoc changes to your configuration it is usually more time efficient to use a text editor. Hi C.B. Is there any reason in particular you are interested in getting it up and running with an IDE? I had contemplated getting a revised tutorial up and running for Eclipse in due course. On Mon, Jul 11, 2011 at 11:15 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I remember some mails on this matter on the list recently. Try the search or don't use an IDE. I never got it running but quickly gave up anyway and use a simple text editor instead. Cheers, Hello, Hopeless to get a working build environment in eclipse or netbeans. I have followed http://wiki.apache.org/nutch/RunNutchInEclipse1.0 With NB there are maven related problems, and with eclipse it won't recognize the build/local structure in apache-nutch-1.3 What is the easiest way to get going with eclipse or netbeans? Best Regards, -C.B. -- *Lewis*
Re: Updating Tika in Nutch
Hi Fernando, One point for me to mention which I did not pick up from your post: did you rebuild your Nutch dist after making the changes to include your new parser? I know that this is a pretty simple suggestion but hopefully it might be the right one. Also can you please provide more details of I get an error saying C:/Program not found whenever I try to do anything...? (On Windows that message often points to an unquoted path containing a space, e.g. C:\Program Files, though that is only a guess.) Were you able to build your 1.3 dist? I understand that 1.2 is sufficient for your needs, however it might be beneficial to root out why you cannot get 1.3 working for future interests. Thanks On Tue, Jul 12, 2011 at 8:27 AM, Fernando Arreola jfarr...@gmail.com wrote: Hello, I have made some additions (a new parser) to the Apache Tika application and I am trying to see if I can run my new changes using the crawl mechanism in Nutch, but I am having some trouble updating Nutch with my modified Tika application. The Tika updates I made run fine if I run Tika as a standalone using either the command line or the Tika GUI. I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get an error saying C:/Program not found whenever I try to do anything but 1.2 should be fine for what I am trying to do which is just to see the parse results from the new parser I added to Tika). I have replaced the tika-core.jar, tika-parsers.jar and tika-mimetypes.xml files with my versions of those files as described in the following link: http://issues.apache.org/jira/browse/NUTCH-766. I also updated the nutch-site.xml to enable the parse-tika plugin. I also updated the parse-plugins.xml file with the following (afm files are what I am trying to parse):

<mimeType name="application/x-font-afm">
  <plugin id="parse-tika" />
</mimeType>

I am crawling a personal site in which I have links to .afm files. If I crawl before making any updates to Nutch, it fetches the files fine. After making the updates detailed above, I get the following error: fetch of http://scf.usc.edu/~jfarreol/woor2___.AFM failed with: java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException. Not really sure what the issue is but my guess is that I have not updated all the necessary files. Any help would be greatly appreciated. Thank you, Fernando Arreola -- *Lewis*
Re: Nutch Gotchas as of release 1.3
Hi I have duly updated both the Nutch Gotchas [1] and the tutorial [2] to incorporate these gotchas which have been highlighted. Thanks for pointing these out. [1] http://wiki.apache.org/nutch/NutchGotchas [2] http://wiki.apache.org/nutch/RunningNutchAndSolr On Tue, Jul 12, 2011 at 12:03 AM, Jerry E. Craig, Jr. jcr...@inforeverse.com wrote: Just from a total noob standpoint (just installed my first LAMP box over the last month) realizing that I needed to look in the runtime folder when I downloaded the tar.gz file was a HUGE step. Then we all run the Crawl at least to make sure things work. The main tutorial was missing the [-solr] part of the crawl command line to get that to index. It wasn't until someone helped me here and pointed me to the actual documents that I found it. Those were the two big things for me as a total noob, otherwise I'm really happy to have at least that part working. Now, my stupid CentOS install only has libxml2 2.6.15 and I need 2.6.17 for php and I'm a few revisions off on libcurl also. I have NO idea how to go back and fix that. Not sure if I should just try to upgrade to php53 and hope for the best or what. But, that's more of a solr / php question than a Nutch question I think. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, July 11, 2011 3:19 PM To: user@nutch.apache.org Cc: lewis john mcgibbney Subject: Re: Nutch Gotchas as of release 1.3 Well, now I'm thinking of it: yes. - there were three (incl. myself) people mentioning the problem described in NUTCH-1016; - a few users don't seem to catch the part of the tutorial telling them to add their robot to the config - missing crawl-urlfilter - mails about missing solrUrl I think quite a few users still rely on the crawl command instead of running a script. Hello list, Do we have any suggestions we wish to discuss regarding the above? thanks -- *Lewis*
Re: nutch crashes for unknown reason
From the looks of it you need to parse all segments before attempting to index them. As Markus has pointed out, the specific segment hasn't been parsed. Try parsing as per the following link http://wiki.apache.org/nutch/bin/nutch_parse On Tue, Jul 12, 2011 at 1:50 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: Okay, and what does that mean? How can I repair the error? 2011/7/12 Markus Jelsma markus.jel...@openindex.io: I don't see this segment 20110712114256 being parsed. On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote: I'm not sure if I understood you correctly. Here is the complete output of my crawl:

tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in: /Users/toom/Downloads/nutch-1.3/sites
rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2011-07-12 12:28:49
Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 12:28:53, elapsed: 00:00:04
Generator: starting at 2011-07-12 12:28:53
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
Generator: finished at 2011-07-12 12:28:57, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-12 12:28:57
Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://nutch.apache.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-12 12:29:01, elapsed: 00:00:03
ParseSegment: starting at 2011-07-12 12:29:01
ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
ParseSegment: finished at 2011-07-12 12:29:03, elapsed: 00:00:02
CrawlDb update: starting at 2011-07-12 12:29:03
CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-12 12:29:06, elapsed: 00:00:02
Generator: starting at 2011-07-12 12:29:06
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
Generator: finished at 2011-07-12 12:29:10, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-12 12:29:10
Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
Fetcher: threads: 10
QueueFeeder finished: total 50 records + hit by time limit :0
fetching http://www.cafepress.com/nutch/
fetching http://creativecommons.org/press-releases/entry/5064
fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
fetching http://www.apache.org/dist/nutch/CHANGES-1.0.txt
fetching http://eu.apachecon.com/c/aceu2009/sessions/138
fetching http://www.us.apachecon.com/c/acus2009/
fetching http://issues.apache.org/jira/browse/NUTCH
fetching http://forrest.apache.org/
fetching http://hadoop.apache.org/
fetching http://wiki.apache.org/nutch/
fetching http://nutch.apache.org/credits.html
fetching http://tika.apache.org/
fetching http://lucene.apache.org/solr/
fetching http://osuosl.org/news_folder/nutch
fetching
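(Concretely, the advice above comes down to running the parse job over the segment Markus flagged before indexing it; a sketch, with the segment path taken from this thread:

bin/nutch parse /Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256

followed by the usual updatedb/invertlinks/solrindex steps over that segment.)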
Re: A possible solution to my URL redirection and zero scores problem
Well I think in order to address the problem directly it would be better to focus on getting something working with a distribution of Nutch you are most comfortable working with. For the time being I would avoid working with trunk 2.0 unless you can justify otherwise. I would also either make a decision between Nutch 1.2 and the current 1.3 release rather than focussing on previous branches, which may or may not be stable depending on when you last svn updated. If you can try working with a fresh 1.2 or 1.3 (preferably 1.3) then we could maybe get to the bottom of this one, as it would be great to find whether there is scope to file a JIRA with this. Thank you On Tue, Jul 12, 2011 at 2:02 PM, Nutch User - 1 nutch.use...@gmail.com wrote: On 07/12/2011 03:42 PM, lewis john mcgibbney wrote: Hi, An observation is that you are using the 1.3 branch, which will now contain some older code. For example the fetcher class has now been upgraded to deal with NUTCH-962, which is mentioned at the top of the class as per your URL example. Can anyone explain what the existing metadata being transferred is as per below if it does not include the score as you state?

} else {
  CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_LINKED,
      datum.getFetchInterval());
  // transfer existing metadata
  newDatum.getMetaData().putAll(datum.getMetaData());
  try {
    scfilters.initialScore(url, newDatum);

I would have imagined that the metadata would have included the relative initial score we are discussing if it were to be of use in attributing an initial URL's metadata to a redirect? Apart from this, with the addition of your datum.getScore(), do the new scores attributed to the URL redirects accurately reflect your general understanding of the web graph? I have only been dealing with Nutch 1.2 and 1.3. I tried to set up 2.0 with Eclipse but failed as described here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html). The new scores were as they should have been in my opinion. (Even though I would state that Nutch's implementation of OPIC isn't exactly what the publication says.) I don't know what information is passed in metadata. -- *Lewis*
Re: running tests from the command line
What plugin are you hacking away on? Your own custom one or one already shipped with Nutch? Just so we are reading from the same page. This, along with some further documentation for running various classes from the command line, is definitely worth inclusion in the CommandLineOptions page of the wiki. On Tue, Jul 12, 2011 at 6:00 PM, Tim Pease tim.pe...@gmail.com wrote: At the root of the Nutch 1.3 project, what is the magic ant incantation to run only the tests for the plugin I'm currently hacking away on? I'm looking for the command line syntax. Blessings, TwP -- *Lewis*
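(For what it's worth: each bundled plugin's build.xml imports src/plugin/build-plugin.xml, which provides a test target, so one way to run a single plugin's tests -- a sketch, assuming the plugin lives under src/plugin and the core has been compiled first -- is:

cd src/plugin/parse-html
ant test
)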
Re: Nutch Novice help
Have a good look under your hadoop.log, which should be created when you initiate a crawl with Nutch; this will be extremely valuable. In addition there are various properties in nutch-site.xml which can be set to make logging more verbose at various levels, e.g. fetching. In order to root out various errors you will need to get used to looking through your logs. It is also advised to try and include as much log data as possible when posting queries on the user list. You can find more information about this here [1] as it will greatly help you get accurate and detailed help from the list in the future. I would advise you to delete all crawled data and begin a fresh crawl; this way you can try the above, looking at your logs, before we try to root out where exactly the errors are stemming from. HTH [1] http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Becoming_a_Nutch_Developer On Tue, Jul 12, 2011 at 7:31 PM, Sethi, Parampreet parampreet.se...@teamaol.com wrote: Hey Lewis, Thanks for the quick reply. Looks like I am tangled now =) I tried the tutorial mentioned at http://wiki.apache.org/nutch/RunningNutchAndSolr For me step 3 is not working. Two of the directories are not created (which should be there after step 3 is complete.) crawl/crawldb - Created crawl/linkdb - not created crawl/segments - not created Also, I changed the url to http://nutch.apache.org, but still the same log message Generator: 0 records selected for fetching, exiting ... Looks like I am missing some key step =(. -param On 7/12/11 1:37 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, I think you are maybe getting tangled here. Please see the following tutorial for Nutch 1.3 [1] Please also note that the URL you provided is the old Nutch site and now redirects to http://nutch.apache.org [1] http://wiki.apache.org/nutch/RunningNutchAndSolr On Tue, Jul 12, 2011 at 5:23 PM, Sethi, Parampreet parampreet.se...@teamaol.com wrote: Thanks for updating the tutorial. I tried my setup, the crawl command is running. But none of the pages are being crawled. I created a urls directory inside the local folder and added a new file nutch with the url in it as mentioned in the tutorial. (I also tried a file named urls inside the nutch/runtime/local directory. The contents of the urls file is http://lucene.apache.org/nutch/ ) Here's the log:

us137390:local parampreetsethi$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2011-07-12 12:22:12
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 12:22:15, elapsed: 00:00:03
Generator: starting at 2011-07-12 12:22:15
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

Please help. Thanks Param On 7/12/11 5:52 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: On 12 July 2011 10:30, Julien Nioche lists.digitalpeb...@gmail.com wrote: There seems to be no crawl-urlfilter file indeed. Don't know why it's gone since the crawl command is still there.
You can find the file in the 1.2 release: http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/ Crawl-urlfilter has been removed purposefully as it did not add anything to the other url filters (automaton | regex) in terms of functionality. By default the urlfilters contain (+.), which IIRC was what the crawl-urlfilter used to do. That's reasonable. But now new users are unaware and don't know what to do with this error message. Yep, the tutorial needs updating indeed. done Thanks for a quick reply. I searched in the nutch directory but still do not see that file :(. Here's the complete file list inside the runtime/local/conf directory:

us137390:conf parampreetsethi$ pwd
/Users/parampreetsethi/Documents/workspace/nutch/runtime/local/conf
us137390:conf parampreetsethi$ ls -t
automaton-urlfilter.txt  domain-urlfilter.txt  nutch-default.xml  prefix-urlfilter.txt  solrindex-mapping.xml
configuration.xsl  httpclient-auth.xml  nutch-site.xml  regex-normalize.xml  subcollections.xml
domain-suffixes.xml  log4j.properties  parse-plugins.dtd  regex-urlfilter.txt  suffix-urlfilter.txt
domain-suffixes.xsd  nutch-conf.xsl  parse-plugins.xml  schema.xml  tika
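(On the more verbose logging mentioned earlier in this thread, a sketch of the two usual knobs; the logger line is an example and cmdstdout is the appender name shipped in Nutch's conf/log4j.properties:

In conf/nutch-site.xml:
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
</property>

In conf/log4j.properties:
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
)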
Re: Need help: Can't find bundle for base name org.nutch.jsp.search, locale en_US
Assuming you're using Nutch 1.2, the web application you point to needs to be the exact name of the WAR file. In my case it was therefore always http://localhost:8080/nutch-1.2 Also I do not understand written Spanish (I think this is) so I can't help you out with the other stuff, sorry. On Wed, Jul 13, 2011 at 3:55 PM, Marlen zmach...@facinf.uho.edu.cu wrote: On 13/07/2011 10:30, Marlen wrote: I have been subscribed to the lucene help list, and it was great.. I hope this one will be great too... There is a problem for me.. I don't speak English quite well.. So the important thing.. I had a problem with the installation; when I type this: http://localhost:8080/nutch/ in my browser this comes out:

Estado HTTP 500 - type Informe de Excepción mensaje descripción El servidor encontró un error interno () que hizo que no pudiera rellenar este requerimiento. [The server encountered an internal error () that prevented it from fulfilling this request.] excepción

org.apache.jasper.JasperException: java.util.MissingResourceException: Can't find bundle for base name org.nutch.jsp.search, locale en_US
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:531)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:454)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:389)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:332)
javax.servlet.http.HttpServlet.service(HttpServlet.java:722)

causa raíz [root cause]

java.util.MissingResourceException: Can't find bundle for base name org.nutch.jsp.search, locale en_US
java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1539)
java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1278)
java.util.ResourceBundle.getBundle(ResourceBundle.java:805)
org.apache.jsp.index_jsp._jspService(index_jsp.java:56)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:68)
javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:416)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:389)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:332)
javax.servlet.http.HttpServlet.service(HttpServlet.java:722)

nota La traza completa de la causa de este error se encuentra en los archivos de diario de Apache Tomcat/7.0.5. [note: the full stack trace of the cause of this error can be found in the Apache Tomcat/7.0.5 log files.] Apache Tomcat/7.0.5 -- *Lewis*
Re: Can we use crawled data by Nutch 0.9 in other versions of Nutch
I think your question should be more along the lines of: is it possible to use data stored within a Lucene index in a Solr core for search? Unfortunately I am unable to answer this question; my suggestion would be to ask on solr-user@ Another option which you may wish to consider is using the convdb command line option to upgrade your 0.9 crawldb to a crawldb compatible with Nutch 1.2 and subsequently 1.3. You can then undertake crawls with Nutch 1.3 and index directly to Solr. Please someone correct me here if I am wrong. On Wed, Jul 13, 2011 at 3:50 PM, serenity serenitykenings...@gmail.com wrote: Hello, I have a question and I apologize if it sounds stupid. I just want to know if we can use the crawled data by Nutch 0.9 in Nutch 1.3, because search has been delegated to Solr in Nutch 1.3 and I want to get the search results from the crawled data by Nutch 0.9 in Nutch 1.3. Serenity -- View this message in context: http://lucene.472066.n3.nabble.com/Can-we-use-crawled-data-by-Nutch-0-9-in-other-versions-of-Nutch-tp3166259p3166259.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: Recrawling with Solr backend
Please see my comments below On Thu, Jul 14, 2011 at 12:52 PM, Chris Alexander chris.alexan...@kusiri.com wrote: Hi Lewis, First of all, thanks for the fantastic reply, most useful. I am working on testing out the functions you mention, of which I was not previously aware. Yes there has been a lot of action recently even between the 1.3 release and dev 1.4. There are a few offshoot questions from this to which the answers aren't immediately apparent. When a solrindex is run doing an update of a previous index, is it the case that all of the content is copied into the solr index again (overwriting unchanged files, for example) or is only the changed data modified in the index? We came to this question because we are thinking of running a rolling crawl (i.e. restart a new crawl when the previous one has terminated) and clearly if it re-adds already existing and unchanged data on each loop round then this would negatively impact performance and would increase the amount of compacting required in Solr. From my simple testing it looks like the date is not updated in the Solr index, implying that it is not modified? But I could use confirmation of this as it's a fairly important issue. If we take solrindex and solrdedup and omit solrclean for the time being, as this is a different matter and deals with removing a certain type of 'broken' document rather than comparing docs in our Solr index and acting accordingly. Solrindex - No data is technically copied, instead it is indexed from the crawldb based upon whatever type of content and metadata we wished to extract from it with our parsers (check out the plugins) along with URL links present in the linkdb. When fetching is undertaken each URL is given a unique fetch time in milliseconds; this way we can disambiguate between several pages which may be present in the solrindex and run the deduplication command accordingly. At the moment, commits for all reducers to the solr instance are handled in one go and yes you are correct this has been identified as fairly expensive as resources for crawls and subsequently Solr communication jobs increase proportionately. To prevent Nutch sending 'already existing and unchanged data', every page is given a metatag relating to a lastModified value. This means that any page which has not been modified since the last crawl will be skipped during fetching. Does this clear any of this up for you? The second point relates to removing documents from the index. In the scenario we are working on, a list of primary URLs is used to direct the start of the crawl. When a new site is to be crawled, its homepage URL is added to the seed urls file for the next crawl (it may also have a filter added to the filtering file to restrict the crawling spread). When a site is no longer desired in the index, its URL is removed from the seed urls file. When the next index is run, does this mean that the pages crawled under the previous URL will be removed from the solr index because they were not crawled on that occasion, or will they have to be removed manually by some other mechanism? From my simple testing it looks like they are not removed automatically. You are correct here, they most certainly are not removed automatically. I commented on a similar post a while ago. What happens if you were to remove a URL from the seed list, recrawl (and automatically remove the pages from your index), then find out you are perhaps required to re-add that URL to your seed list tomorrow or in the near future.
This would not be a sustainable way to maintain an index. I just found the db.ignore.external.links configuration value - which will solve a lot of the issues previously mentioned in passing regarding filtering the URLs to crawl. Yes, I would say that experience using properties in nutch-site and your various URL filters in a well tuned fashion should yield better results over time. Thanks again for the help (and apologies for the huge e-mail) Chris On 14 July 2011 10:59, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Chris, Yes a Nutch 1.3 crawl and Solr index bash script is something that has not been added to the wiki yet. I think this is partly because there are very few adjustments to be made to the comprehensive Nutch 1.2 scripts currently available on the Nutch wiki. This would however be a great addition if we could get the time to post one. The point of focus I pick up from your thread is that you require a script for a way of re-crawling previously crawled pages only a certain amount of time after they were last crawled etc. Generally speaking (at this stage anyway), I'll assume that etc just means various other property changes within nutch-site.xml. My recommended steps would be something like inject, generate, fetch, parse, updatedb, invertlinks, solrindex, solrdedup, solrclean (see the sketch after this message). We can obviously schedule Nutch to crawl regularly
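(A minimal sketch of one such cycle for 1.3, assuming a local runtime, a seed directory named urls, a crawl/ output directory and a Solr instance at http://127.0.0.1:8983/solr/ -- all of these names are examples:

#!/bin/bash
# one crawl/index cycle: inject, generate a segment, fetch and parse it,
# fold the results back into the crawldb, invert links, then push to Solr
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb $SEGMENT
bin/nutch solrdedup http://127.0.0.1:8983/solr/

Loop the generate..solrdedup block for deeper crawls, and add solrclean where your version provides it.)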
Re: The correct tutorial on the home page?
Hi Eric Please add any comments you wish to the new tutorial that Markus mentioned on the Wiki. I am in the process of rebuilding the Nutch site and this will be included tomorrow, e.g. from now on the default tutorial people are directed to from the wiki will be the RunningNutchAndSolr tutorial... The RunningNutchAndSolr tutorial was created as a bridge to a 'running Nutch in deploy mode' tutorial which I am working towards and would like to see constructed in the near future. We can harness a huge amount of power using Nutch in tandem with Hadoop, therefore this is the next step. As I said, please suggest anything which would make phasing into Nutch 1.3 a less laborious task. thanks On Thu, Jul 14, 2011 at 10:31 PM, Markus Jelsma markus.jel...@openindex.io wrote: Thanks. And check out open issues if possible. cheers I agree with updating NutchTutorial to be 1.3. Folks coming to Nutch and following a tutorial will almost certainly be wanting to know about the latest and greatest released code! I subscribed to the dev@ list and will keep an eye on the updates and provide any feedback I can! Eric On Jul 14, 2011, at 4:41 PM, Markus Jelsma wrote: Hi Erik, Lewis already moved a lot of 1.3 stuff to a legacy area on the wiki. The tutorial pointed to from the homepage is indeed old but also contains recent additions. Perhaps we should merge those two tutorials and get rid of RunningNutchAndSolr. The NutchTutorial seems more appropriate in >= 1.3. Cheers Hi all, I am getting back up to speed on Nutch after being away for a couple versions! I noticed the tutorial linked from the homepage is to this one: http://wiki.apache.org/nutch/NutchTutorial However, it seems like with Nutch moving to using Solr, that the tutorial that should be linked to is http://wiki.apache.org/nutch/RunningNutchAndSolr#A3._Crawl_your_first_website Alternatively, the content for all the pre-1.3 Nutch should be moved to a different NutchTutorial wiki page, and the NutchTutorial updated with the Solr content? Eric - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. -- *Lewis*
Re: what does the parse command do
Hi C.B., Quite a few things here On Fri, Jul 15, 2011 at 5:19 PM, Cam Bazz camb...@gmail.com wrote: Hello, Finally I got a working build environment, and I am doing some modifications and playing around. Good to hear, although it is off topic can you share any hurdles you overcame with us please. It would be good to hear how you solved your configuration problems. I also got my first plugin to build, and almost done with my custom parser. Excellent, I will proceed with adding your comment to a page in plugin central on the wiki, in the meantime it would be good to hear more about your plugin and what functionality it encapsulates! Would it be possible to get a wiki entry? We are a bit short of Nutch 1.3 custom plugin tutorials. I have my custom plugin and the method

public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) { ...

does indeed have all the information that I need to do my custom parsing. Now this is what I don't understand: there is a content field in solr. I have read the solrindexer code, and figured out that pretty much any field in the doc is indexed to solr. If you have a look at both your schema.xml and solrindex-mapping.xml documents you will see how fields are generated and passed to Solr for indexing. What must I do, so I can open another content-like field such as content2 and put my custom extracted data there, so solr indexes it? I think this does not have to do with solr, but with the fields in the document. My suggestion would be to specify extraction of the field within the plugin code then add the various configuration parameters to both of the aforementioned config documents. In the recommended example, the found result is only added to contentMeta - and this one is not indexed by solr. What recommended example? I am not following you here. Best Regards, -C.B. -- *Lewis*
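(To make the last point concrete, here is a minimal sketch of the indexing side against the 1.3 interface as I read it; the class name Content2IndexingFilter, the field name content2 and the parse-metadata key are all made-up examples, and it assumes your parse filter has already stashed the extracted value in the parse metadata:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class Content2IndexingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // the parse filter is assumed to have stored the extracted text
    // under this key in the parse metadata
    String value = parse.getData().getParseMeta().get("content2");
    if (value != null) {
      doc.add("content2", value);
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

The new field then needs a <field name="content2" .../> entry in Solr's schema.xml and a mapping in solrindex-mapping.xml so that solrindex passes it through.)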
Re: Deploying the web application in Nutch 1.2
Are you adding this to nutch-site within your webapp or just in your root Nutch installation? This needs to be included in your webapp version of nutch-site.xml. In my experience this was a small case of confusion at first. On Fri, Jul 15, 2011 at 7:03 PM, Chip Calhoun ccalh...@aip.org wrote: You've gotten me very close to a breakthrough. I've started over, and I've found that if I don't make any edits to nutch-site.xml, I get a working Nutch web app; I have no index and all of my searches fail, but I have Nutch. When I add my crawl location to nutch-site.xml and restart Tomcat, that's when I start getting the 404 with the The requested resource () is not available message. Clearly I'm doing something wrong when I edit nutch-site.xml. I'm going to paste the entire contents of my nutch-site.xml. Where am I screwing this up? Thanks for your help on this.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>C:/Apache/apache-nutch-1.2/crawl</value>
  </property>
</configuration>

-Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Thursday, July 14, 2011 5:38 PM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 On Thu, Jul 14, 2011 at 8:01 PM, Chip Calhoun ccalh...@aip.org wrote: Thanks Lewis. I'm still having trouble. I've moved the war file to $CATALINA_HOME/webapps/nutch/ and unpacked it. I don't seem to have a catalina.sh file, so I've skipped that step. From memory the catalina.sh file is used to start your Tomcat server instance... this has nothing to do with Nutch. Regardless of what kind of WAR files you have in your Tomcat webapps directory, starting your Tomcat server from the command line should be the same... And I've added the following to C:\Apache\Tomcat-5.5\webapps\nutch\WEB-INF\classes\nutch-site.xml : As far as I can remember nutch-site.xml is already there, however you need to specify various property values after this has been uploaded the first time. After rebooting Tomcat all of your property settings will be running.

<property>
  <name>searcher.dir</name>
  <value>C:\Apache\apache-nutch-1.2\crawl</value>
  <!-- There must be a crawl/index directory to run off -->
</property>

Looks fine, however please remove the <!-- ... --> comment as this is not required. However, when I go to http://localhost:8080/nutch/ I always get a 404 with the message, The requested resource () is not available. What am I missing? As I said the name of the WAR file needs to be identical to the webapp you specify in the Tomcat URL... can you confirm this? There should really be no problem starting up the Nutch web app if you follow the tutorial carefully. Thanks, Chip -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Thursday, July 14, 2011 5:40 AM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 Hi Chip, Please see this tutorial for 1.2 administration [1], many people have been using it recently and as far as I'm aware it is working perfectly.
Please post back if you have any troubles [1] http://wiki.apache.org/nutch/NutchTutorial On Wed, Jul 13, 2011 at 5:50 PM, Chip Calhoun ccalh...@aip.org wrote: I'm a newbie trying to set up a Nutch 1.2 web app, because it seems a bit better suited to my smallish site than the Nutch 1.3 / Solr connection. I'm going through the tutorial at http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine , and I've hit the following instruction: Deploy the Nutch web application as the ROOT context. I'm not sure what I'm meant to do here. I get the idea that I'm supposed to replace the current contents of $CATALINA_HOME/webapps/ROOT/ with something from my Nutch directory, but I don't know what from my Nutch directory I'm supposed to move. Can someone please explain what I need to move? Thanks, Chip -- *Lewis* -- *Lewis* -- *Lewis*
Re: Deploying the web application in Nutch 1.2
As a resource it would be wise to have a look at the list archives for an exact answer to this. Take a look at your catalina.out logs for more verbose info on where the error is. It has been a while since I have configured this now, sorry I can't be of more help in giving a definite answer. On Fri, Jul 15, 2011 at 8:27 PM, Chip Calhoun ccalh...@aip.org wrote: I'm definitely changing the file in my webapp. I can tell I'm doing that much right because it makes a noticeable change to the function of my web app; unfortunately, the change is that it seems to break everything. I've tried playing with the actual value for this, but with no success. In the tutorial's example, <value>/somewhere/crawl</value>, what is that relative to? Where would that hypothetical /somewhere/ directory be, relative to $CATALINA_HOME/webapps/? It feels like this is my problem, because I can't think of anything else it could be. -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Friday, July 15, 2011 3:19 PM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 Are you adding this to nutch-site within your webapp or just in your root Nutch installation? This needs to be included in your webapp version of nutch-site.xml. In my experience this was a small case of confusion at first. On Fri, Jul 15, 2011 at 7:03 PM, Chip Calhoun ccalh...@aip.org wrote: You've gotten me very close to a breakthrough. I've started over, and I've found that if I don't make any edits to nutch-site.xml, I get a working Nutch web app; I have no index and all of my searches fail, but I have Nutch. When I add my crawl location to nutch-site.xml and restart Tomcat, that's when I start getting the 404 with the The requested resource () is not available message. Clearly I'm doing something wrong when I edit nutch-site.xml. I'm going to paste the entire contents of my nutch-site.xml. Where am I screwing this up? Thanks for your help on this.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>C:/Apache/apache-nutch-1.2/crawl</value>
  </property>
</configuration>

-Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Thursday, July 14, 2011 5:38 PM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 On Thu, Jul 14, 2011 at 8:01 PM, Chip Calhoun ccalh...@aip.org wrote: Thanks Lewis. I'm still having trouble. I've moved the war file to $CATALINA_HOME/webapps/nutch/ and unpacked it. I don't seem to have a catalina.sh file, so I've skipped that step. From memory the catalina.sh file is used to start your Tomcat server instance... this has nothing to do with Nutch. Regardless of what kind of WAR files you have in your Tomcat webapps directory, starting your Tomcat server from the command line should be the same... And I've added the following to C:\Apache\Tomcat-5.5\webapps\nutch\WEB-INF\classes\nutch-site.xml : As far as I can remember nutch-site.xml is already there, however you need to specify various property values after this has been uploaded the first time. After rebooting Tomcat all of your property settings will be running.

<property>
  <name>searcher.dir</name>
  <value>C:\Apache\apache-nutch-1.2\crawl</value>
  <!-- There must be a crawl/index directory to run off -->
</property>

Looks fine, however please remove the <!-- ... --> comment as this is not required. However, when I go to http://localhost:8080/nutch/ I always get a 404 with the message, The requested resource () is not available. What am I missing? As I said the name of the WAR file needs to be identical to the webapp you specify in the Tomcat URL... can you confirm this? There should really be no problem starting up the Nutch web app if you follow the tutorial carefully. Thanks, Chip -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Thursday, July 14, 2011 5:40 AM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 Hi Chip, Please see this tutorial for 1.2 administration [1], many people have been using it recently and as far as I'm aware it is working perfectly. Please post back if you have any troubles [1] http://wiki.apache.org/nutch/NutchTutorial On Wed, Jul 13, 2011 at 5:50 PM, Chip Calhoun ccalh...@aip.org wrote: I'm a newbie trying to set up
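(For the archives: searcher.dir points at the directory that itself contains the crawl output, so with the value above the web app expects something like the following on disk -- this layout is what a bin/nutch crawl run produces by default, shown here as a sketch:

C:/Apache/apache-nutch-1.2/crawl/
    crawldb/
    index/       (the merged Lucene index the web app searches)
    linkdb/
    segments/

The path is absolute, not relative to $CATALINA_HOME/webapps/, and if index/ (or indexes/) is missing under it the searcher has nothing to serve.)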
Re: problem compiling plugin
Hi C.B., I'm in the process of overhauling PluginCentral on the wiki and have opened a wiki page for Plugin Gotchas [1]. Could I ask you to edit it and describe your understanding of the problem more specifically please? There is also an interesting page here [2], which you may or may not be interested in reading. Thanks for initiating this. [1] http://wiki.apache.org/nutch/PluginGotchas [2] http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading On Fri, Jul 15, 2011 at 10:21 AM, Cam Bazz camb...@gmail.com wrote: Hello Lewis, I have solved this problem by putting the ivy.jar where the ant related jars are in my system: /usr/share/lib/ant in ubuntu. I think we might want to add this to the documentation for building plugins. The current problem is that since lucene is gone in 1.3, I need a new solr based indexer, and I could not find an example for it. Best Regards, C.B. On Fri, Jul 15, 2011 at 11:17 AM, lewis.mcgibb...@gmail.com lewis.mcgibb...@gmail.com wrote: It looks like you do not have specifics set within your build.xml. The error log would also suggest this. Can you please post the lines causing the error -Original Message- From: Cam Bazz Sent: 14/07/2011, 6:19 PM To: user@nutch.apache.org Subject: problem compiling plugin Hello, I am following http://wiki.apache.org/nutch/WritingPluginExample-1.2 on 1.3 and when I try to build my plugin with ant I get:

moliere@blitz:~/java/apache-nutch-1.3/src/plugin/recommended$ ant
Buildfile: build.xml
BUILD FAILED
/home/moliere/java/apache-nutch-1.3/src/plugin/recommended/build.xml:5: The following error occurred while executing this line:
/home/moliere/java/apache-nutch-1.3/src/plugin/build-plugin.xml:46: Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
Cause: The name is undefined.
Action: Check the spelling.
Action: Check that any custom tasks/types have been declared.
Action: Check that any presetdef/macrodef declarations have taken place.
No types or tasks have been defined in this namespace yet
This appears to be an antlib declaration.
Action: Check that the implementing library exists in one of:
-/usr/share/ant/lib
-/home/moliere/.ant/lib
-a directory added on the command line with the -lib argument
Total time: 0 seconds -- *Lewis*
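(For background: a plugin's build.xml is deliberately minimal and inherits everything, including the ivy antlib declaration that failed above, from src/plugin/build-plugin.xml. A sketch for a plugin named recommended, the name from the thread:

<?xml version="1.0"?>
<project name="recommended" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>

which is why ant needs ivy.jar on its library path even for a one-plugin build.)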
Re: LinkRank scores
Hi, Do we have any suggestions to demystify this? I intend to look into webgraph in more detail soon as I wish to get a much more detailed picture of its functionality for link analysis purposes. On Wed, Jul 13, 2011 at 9:25 AM, Nutch User - 1 nutch.use...@gmail.com wrote: Does anyone know how the LinkRank scores are calculated exactly? The only sources of information I have are this wiki page (http://wiki.apache.org/nutch/NewScoring) and the source code of the tool. Is this the only difference from PageRank: It is different from PageRank in that nepotistic links such as links internal to a website and reciprocal links between websites can be ignored. The number of iterations can also be configured; by default 10 iterations are performed. ? I.e. if internal links are not ignored, would the LinkRank scores be equivalent to PageRank scores? -- *Lewis*
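(From my own, possibly incomplete, reading of the LinkRank source, each iteration recomputes

score(p) = (1 - d) + d * sum over inlinking pages q of score(q) / outdegree(q)

with d the damping factor (0.85 by default, configurable via link.analyze.damping.factor), so with internal and reciprocal links counted it would essentially be the classic PageRank recurrence. Please correct me if this reading is wrong.)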
Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?
Hi Gabriele, At first this seems like a plausible argument, however my question concerns what Nutch would do if we wished to change the Solr core to which we index? If we removed this functionality from the crawldb there would be no way to determine what Nutch was to fetch and what it wasn't. On Sat, Jul 16, 2011 at 1:00 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote: Hello, I had this draft lurking for a while now, and before archiving it for personal reference I wondered if it's accurate, and if you recommend posting it to the wiki. Nutch maintains a crawldb (and linkdb, for that matter) of the urls it crawled, the fetch status, and the date. This data is maintained beyond fetch so that pages may be re-crawled, after the re-crawling period. At the same time Solr maintains an inverted index of all the fetched pages. It'd seem more efficient if nutch relied on the index instead of maintaining its own crawldb, to !store the same url twice. [BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN SOLR] -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- *Lewis*
Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?
Please feel free to add this to the wiki as it is a question that will undoubtedly arise in the future. Lewis On Sat, Jul 16, 2011 at 12:37 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote: On Sat, Jul 16, 2011 at 1:29 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Gabriele, At first this seems like a plausible argument, Indeed, I think it could be a FAQ. Shall I add it to the nutch wiki? however my question concerns what Nutch would do if we wished to change the Solr core to which we index? If we removed this functionality from the crawldb there would be no way to determine what Nutch was to fetch and what it wasn't. Indeed, you confirm my thought. crawled, the fetch status, and the date. This data is maintained beyond fetch so that pages may be re-crawled, after the re-crawling period. At the same time Solr maintains an inverted index of all the fetched pages. It'd seem more efficient if nutch relied on the index instead of maintaining its own crawldb, to !store the same url twice. [BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN SOLR] -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- *Lewis* -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- *Lewis*
Re: running tests from the command line
Further to this, I have been working on a JIRA ticket for this [1]. If you could, can you please test? I will also shortly, and hopefully we can get this committed soon. Thank you [1] https://issues.apache.org/jira/browse/NUTCH-672 On Tue, Jul 12, 2011 at 9:36 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: OK, it seems like you are comfortable with JUnit testing under ant but I think for the purpose of the list, I will provide the following resource [1] for general info on configuring JUnit tests. I would comment that you may be able to get a more verbose output if you set haltonerror, printsummary and formatter type="plain" for easier reading of the output report. Basically what we are after is a report printed to a file or stdout to show where errors are present. Could you please have a look at the 'test' subsection of [1] and correct me on anything I have misinterpreted. [1] http://ant.apache.org/manual/Tasks/junit.html Finally, although it seems like everything is OK, it would be great to crack this one. It would be useful to run just JUnit tests with Ant from the command line. On Tue, Jul 12, 2011 at 8:55 PM, Tim Pease tim.pe...@gmail.com wrote: On Jul 12, 2011, at 11:51 AM, lewis john mcgibbney wrote: What plugin are you hacking away on? Your own custom one or one already shipped with Nutch? Just so we are reading from the same page. Adding some http.agent.name support to the HTMLMetaProcessor found in the parse-html plugin. For some reason all JUnit test results are not being output to stdout when running the tests. The ant task claims there are failures, but none are shown. I had to hack the ant task so that haltonfailure is true and fork is false. Then the expected output was showing up. To shorten the test loop a little bit I was hoping ant provided an easy way to run just the tests for the parse-html plugin. Thanks for the speedy reply! Blessings, TwP This, along with some further documentation for running various classes from the command line, is definitely worth inclusion in the CommandLineOptions page of the wiki. On Tue, Jul 12, 2011 at 6:00 PM, Tim Pease tim.pe...@gmail.com wrote: At the root of the Nutch 1.3 project, what is the magic ant incantation to run only the tests for the plugin I'm currently hacking away on? I'm looking for the command line syntax. Blessings, TwP -- *Lewis* -- *Lewis* -- *Lewis*
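(A sketch of the junit settings being discussed, as they might look in the build file; these are the values from the conversation above, not the shipped defaults:

<junit printsummary="yes" haltonfailure="yes" fork="no">
  <formatter type="plain" usefile="false"/>
  <!-- classpath and batchtest elements as already defined in the build -->
</junit>
)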
Extracting triple tags or hash tags from html
Hi, Is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have thought that this would have been dealt with in Tika, however I have seen no mention of anyone having problems extracting this from web documents when fetching with Nutch, or even discussing it. For example, say I had some geographical location in a meta tag such as geo:long=55.1244; is it possible to extract this with parse-tika or would I need to extend parse-html? And on the other point, is it possible to extract hash tags from twitter via the above? -- *Lewis*
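(If extending parse-html turns out to be the route, here is a minimal sketch of an HtmlParseFilter doing it; GeoMetaParseFilter is a made-up name, the geo: prefix is just the example above, and this assumes the filter is registered as a parse-html extension in its plugin.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class GeoMetaParseFilter implements HtmlParseFilter {
  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    walk(doc, parse.getData().getParseMeta());
    return parseResult;
  }

  // recursively scan the DOM for meta nodes whose name starts with "geo:"
  private void walk(Node node, Metadata meta) {
    if ("meta".equalsIgnoreCase(node.getNodeName())) {
      NamedNodeMap attrs = node.getAttributes();
      if (attrs != null) {
        Node name = attrs.getNamedItem("name");
        Node value = attrs.getNamedItem("content");
        if (name != null && value != null
            && name.getNodeValue().startsWith("geo:")) {
          meta.add(name.getNodeValue(), value.getNodeValue());
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      walk(children.item(i), meta);
    }
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

An indexing filter along the lines sketched earlier in this digest could then lift the geo:* values out of the parse metadata into the Solr document.)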
Re: Garbage with languageidentifier
Hi Markus, I think this is a good shout, and it is not hard to understand the points you make. Quite clearly, good practice relating to the inclusion of accurate and useful language information (as well as other types of information) in HTTP headers is not a reality, and it wouldn't be suitable for us to pretend as if this was not the case. One thing to note though: I just found out yesterday that language detection in trunk has been passed to Tika but this is not the case with branch 1.4. It's not my intention to put words into people's mouths, however by the looks of the conversation in NUTCH-657 I foresee that delegating language identification to Tika and making branch-1.4 consistent with trunk would be the next move? Am I correct here? Please say otherwise if this is not the case. If this is the plan then is there any requirement for Nutch to have an independent language detection plugin? If we can address why the decision was made for trunk to rely upon Tika for language detection then we can justify where we are with the comments you make. To be honest I am seeing a medium sized grey area here, however this has to do with my inexperience dealing with the language detection plugin and with the problems you mention. On Sun, Jul 17, 2011 at 2:04 PM, Markus Jelsma markus.jel...@openindex.io wrote: The proposal is to configure the order of detection: meta,header,identifier (which is the current order). Hi, I've found a lot of garbage produced by the language identifier, most likely caused by it relying on the HTTP header as the first hint for the language. Instead of a nice tight list of ISO codes I've got an index full of garbage making me unable to select a language. The lang field now contains a mess including ISO codes of various types (nl | ned, nl-NL | nederlands | Nederlands | dutch | Dutch etc etc) and even comma-separated combinations. It's impossible to do a simple fq:lang:nl due to this undeterminable set of language identifiers. Apart from language identifiers that we as humans understand, the headers also contain values such as {$plugin.meta.language} | Weerribben zuivel | Array or complete sentences and even MIME-types and more nonsense you can laugh about. Why do we rely on the HTTP header at all? Isn't it well-known that only very few developers and content management systems actually care about returning proper information in HTTP headers? This actually also goes for finding out content-type, which is a similar problem in the index. I know work is going on in Tika for improving MIME-type detection; I'm not sure if this is true for language identification. We still have to rely on the Nutch plugin to do this work, right? If so, I propose to make it configurable so we can choose if we want to rely on the current behaviour or do N-gram detection straight away. Comments? Thanks -- *Lewis*
Re: Fetched pages has no content
Hi, If you have a look at your regex-urlfilter.txt it will by default be rejecting ? in the URL. Please test with that line edited (or commented out) and see if the problem fades (the default rule in question is sketched below). On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask anr...@gmail.com wrote: Hi Markus! We are using a custom parser, but I don't think that the problem is in the parsing. I got the same problem when trying the ParserChecker. I also tried the following: I injected the following seeds: http://www.uu.se/news/news_item.php?id=1423&typ=pm http://www.uu.se/news/news_item.php?id=1421&typ=pm http://www.uu.se/news/news_item.php?id=1489&typ=artikel http://www.uu.se/news/news_item.php?id=1407&typ=pm http://www.uu.se/news/news_item.php?id=1234&typ=artikel http://www.uu.se/news/news_item.php?id=1233&typ=artikel http://www.uu.se/news/news_item.php?id=1180&typ=artikel http://www.uu.se/news/news_item.php?typ=pm&id=1381 http://www.uu.se/ Then generated a segment, fetched that segment and then did a readseg with -noparse, -noparsedata and -noparsetext. I have attached the readseg dump and it shows no content for: http://www.uu.se/news/news_item.php?typ=pm&id=1381 Can the problem somehow be in the configuration for the fetcher? Best regards, --Anders Rask www.findwise.com 2011/7/15 Markus Jelsma markus.jel...@openindex.io What parser are you using? What does bin/nutch org.apache.nutch.parse.ParserChecker say? Here it outputs the content fine with parse-tika enabled. On Friday 15 July 2011 15:04:55 Anders Rask wrote: Hi! We are using Nutch to crawl a bunch of websites and index them to Solr. At the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3 and at the same time going from one server to two servers. Unfortunately we are stuck with a problem which we haven't seen in the old environment. Several of the pages that we are fetching contain no content when they are stored in the segment. The following is an excerpt from readseg on a segment containing such a page: Recno:: 5 URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381 Content:: Version: -1 url: http://www.uu.se/news/news_item.php?typ=pm&id=1381 base: http://www.uu.se/news/news_item.php?typ=pm&id=1381 contentType: text/html metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195 nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049 Connection=close Content-Type=text/html Server=Apache Content: The fetch logs say nothing unusual about retrieving this page: 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381 There seems to be nothing strange about the page itself and a very similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and indexed without any problems. Anyone have any ideas about what might be wrong here? Best regards, --Anders Rask www.findwise.com -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350 -- *Lewis*
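For reference, the rule in question: the stock conf/regex-urlfilter.txt shipped with Nutch carries a line like the one below, which rejects any URL containing a query string. A sketch of the edit, against the shipped default:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# edited version that lets '?' and '=' through but still skips the rest:
-[*!@]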
Re: some Nutch questions
Hi Cheng, Please see this wiki page for some references to optimization [1] I can see your problem though. I think a possible solution may be to have two seed directories, with a specifically tailored Nutch implementation ready to crawl each. This way we guarantee top results if we take sites on a case-by-case basis. Please feel free to add any further comments to this wiki page based upon your personal experiences moving towards optimization. Thanks [1] http://wiki.apache.org/nutch/OptimizingCrawls On Sat, Jul 16, 2011 at 2:23 AM, Cheng Li chen...@usc.edu wrote: Hi, I have some questions for the optimization. 1) For the command bin/nutch crawl url -dir mydir -depth 2 -threads 4 -topN 50 logs/logs1.log, I know the meaning of each parameter, say, -depth 8: the maximum depth of links crawled is 8 (8 levels down from the seed urls); -topN 5: maximum number of links/pages that can be crawled at each depth; -threads 16: issue 16 threads simultaneously. But how to choose the proper number for each parameter? For example, on the craigslist web site, the usual url for a certain car goes like this: http://losangeles.craigslist.org/sgv/cto/2496560420.html But on Kbb.com, the usual url for a certain car goes like this: http://www.kbb.com/volkswagen/jetta/2003-volkswagen-jetta/gls-sedan-4d/?vehicleid=348329&intent=buy-used&options=4098815|true|4098881|true&pricetype=private-party&condition=good How to determine the value of each parameter for these 2 examples? 2) When I check the data in Luke in the overview panel, I found that on the left side (available fields and term counts per field table) the anchor number value is zero, while the content value is not, and on the right side (top ranking terms table) all the rank values are also the same. I want to know the reason that it displays the information like this. Thanks, -- Cheng Li -- *Lewis*
Re: How to use lucene to index Nutch 1.3 data
Hi Kelvin, I see you are posting on a couple of threads with regards to the Lucene index generated by Nutch which you correctly point out is not there. It is not possible to create a Lucene index from Nutch 1.3 anymore as all searching has been shifted to Solr therefore Nutch 1.3 has no use for a Lucene index. If you wish to find out more on why this is current practice please feel free to read into recent activity on the lists. I hope this clears things up. On Tue, Jul 19, 2011 at 3:32 PM, Kelvin k...@yahoo.com.sg wrote: Hi Александр, Thank you for your reply, but I am not using solr. How do I use Lucene to create an index of folder /crawl? I went to Lucene website, but it only explains how to index local files and html? From: Александр Кожевников b37hr3...@yandex.ru To: user@nutch.apache.org; k...@yahoo.com.sg Sent: Tuesday, 19 July 2011 8:10 PM Subject: Re: How to use lucene to index Nutch 1.3 data Kelvin, You should make index using solr $ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* 19.07.2011, 15:07, Kelvin k...@yahoo.com.sg: Dear all, After crawling using Nutch 1.3, I realise that my /crawl folder does not contain folder /index. Is there any way to create a lucene index from the /crawl folder? Thank you for your help. Best regards, Kelvin -- *Lewis*
Re: help, src modify to optimize the crawl
I don't think this has anything to do with modifying the crawl src. In fact it doesn't have anything to do with optimization either. Try using your URLFilters, e.g. regex (a hypothetical set of rules is sketched after this thread). It is important to try and understand what type of pages we can filter out of a Nutch crawl using the filters provided. HTH On Wed, Jul 20, 2011 at 11:04 AM, Cheng Li chen...@usc.edu wrote: Hi, I tried to use Nutch to crawl craigslist. The seeds I use are http://losangeles.craigslist.org/wst/ctd/ http://losangeles.craigslist.org/sfv/ctd/ http://losangeles.craigslist.org/lac/ctd/ http://losangeles.craigslist.org/sgv/ctd/ http://losangeles.craigslist.org/lgb/ctd/ http://losangeles.craigslist.org/ant/ctd/ http://losangeles.craigslist.org/wst/cto/ http://losangeles.craigslist.org/sfv/cto/ http://losangeles.craigslist.org/lac/cto/ http://losangeles.craigslist.org/sgv/cto/ http://losangeles.craigslist.org/lgb/cto/ http://losangeles.craigslist.org/ant/cto/ What I want to get is a result page like this one, for example http://losangeles.craigslist.org/lac/ctd/2501038362.html , which is a specific car selling page. What I DON'T want to get is a result page like this one, for example http://losangeles.craigslist.org/cta/. However, in my query results I always get results like http://losangeles.craigslist.org/cta/. Actually, I can get this kind of page from craigslist, just part of them, but not all of them. I tried to adjust the crawl command line parameters, but there was not much change. So what I plan to do is to modify the crawl code in the Nutch src code. Where can I start? What kind of work can I do to optimize the crawl process in the src code? -- Cheng Li -- *Lewis*
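A hypothetical set of rules for the case above (conf/regex-urlfilter.txt; the patterns are assumptions based on the URL shapes quoted in this thread, so verify them against real craigslist URLs). Note the section index pages such as /lac/ctd/ still need to be fetched so that individual listings can be discovered from them; the rules below keep them for crawling, and you would drop them at search time instead:

# keep individual listing pages, e.g. /lac/ctd/2501038362.html
+^http://losangeles\.craigslist\.org/[a-z]+/ct[do]/[0-9]+\.html$
# keep the section indexes so listings can be discovered
+^http://losangeles\.craigslist\.org/[a-z]+/ct[do]/?$
# skip everything else
-.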
Re: embedded google map in nutch query result page
I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like you're using Nutch 1.2 or earlier, therefore unless you are comfortable working with JSPs I wouldn't bother with the hassle. It might be better to use Solr for indexing and searching and build an interface such as Ajax Solr, which would then permit you to write a widget to do this task. However, unless you have the time and are competent and willing to learn and use Apache Solr and Javascript, this is not an ideal solution. I honestly have no idea how to implement this using the legacy JSP On Wed, Jul 20, 2011 at 11:09 AM, Cheng Li chen...@usc.edu wrote: Hi, I have done a google map marker html code. I plan to display the google map object in the nutch query result page, with the geo-markers which are extracted from the results listed on that page. How should I modify the nutch query result page to implement my design? Thanks, -- Cheng Li -- *Lewis*
Re: skipping invalid segments nutch 1.3
There is no documentation for the individual commands used to run a Nutch 1.3 crawl, so I'm not sure where anyone has been misled. In the instance that this was required I would direct newer users to the legacy documentation for the time being. My comment to Leo was to understand whether he managed to correct the invalid segments problem. Leo, if this still persists may I ask you to try again; I will do the same and will be happy to provide feedback. May I suggest you use the following commands: inject, generate, fetch, parse, updatedb (a full sequence is sketched at the end of this thread). At this stage we should be able to ascertain whether something is incorrect and hopefully debug. May I add the following... please make the following additions to nutch-site: fetcher verbose - true; http verbose - true; check for redirects and set accordingly On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: The wiki can be edited and you are welcome to suggest improvements if there is something missing On 20 July 2011 13:31, Cam Bazz camb...@gmail.com wrote: Hello, I think there is something misleading in the documentation; it does not tell us that we have to parse. On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Haven't you forgotten to call parse? On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote: Hi Lewis, You are correct about the last post not showing any errors. I just wanted to show that I don't get any errors if I use 'crawl' and to prove that I do not have any faults in the conf files or the directories. I still get the errors if I use the individual commands inject, generate, fetch Cheers, Leo On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote: Hi Leo Did you resolve? Your second log data doesn't appear to show any errors, however the problem you specify is one I have witnessed myself a while ago. Since you posted have you been able to replicate... or resolve? On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions llsub...@zudiewiener.com wrote: I've used crawl to ensure config is correct and I don't get any errors, so I must be doing something wrong with the individual steps, but can't see what. llist@LeosLinux:~/nutchData $ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl -depth 3 -topN 5 solrUrl is not set, indexing will be skipped... crawl started in: /home/llist/nutchData/crawl rootUrlDir = /home/llist/nutchData/seed/urls threads = 10 depth = 3 solrUrl=null topN = 5 Injector: starting at 2011-07-17 09:31:19 Injector: crawlDb: /home/llist/nutchData/crawl/crawldb Injector: urlDir: /home/llist/nutchData/seed/urls Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02 Generator: starting at 2011-07-17 09:31:22 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 5 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124 Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04 Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. 
Fetcher: starting at 2011-07-17 09:31:26 Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124 Fetcher: threads: 10 QueueFeeder finished: total 1 records + hit by time limit :0 fetching http://www.seek.com.au/ -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1
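Following on from the suggestion above, one full cycle with the individual commands, as a minimal sketch (directory names are assumptions; note that updatedb takes the segment path directly):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
seg=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $seg
bin/nutch parse $seg
bin/nutch updatedb crawl/crawldb $seg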
Re: crawling in any depth until no new pages were found
Hi Marek, As we're talking about automating the task, we're immediately looking at implementing a bash script. In the situation we have described, we wish Nutch to adopt a breadth-first search (BFS) behaviour when crawling. Between us can we suggest any methods for best practice relating to BFS? As you have highlighted, we can check the crawldb after every updatedb command to determine whether there are any db_unfetched urls, and ideally we wish to continue until this number is zero, whether we dump stats to a file or read them via stdout. I would suggest that we discuss a method for obtaining the db_unfetched value and creating a loop based on whether or not it is 0 (one such loop is sketched below). Is this possible? On Wed, Jul 20, 2011 at 2:05 PM, Marek Bachmann m.bachm...@uni-kassel.de wrote: Hi all, does anyone have suggestions how I could solve the following task: I want to crawl a sub-domain of our network completely. I have always done it by multiple fetch / parse / update cycles manually. After a few cycles I checked if there were unfetched pages in the crawldb. If so, I started the cycle over again. I repeated that until no new pages were discovered. But that is annoying me and that is why I am looking for a way to do these steps automatically until no unfetched pages are left. Any ideas? -- *Lewis*
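A sketch of such a loop, assuming the db_unfetched line printed by readdb -stats in 1.3 and local paths (both assumptions to verify against your installation):

while true; do
  bin/nutch generate crawl/crawldb crawl/segments
  seg=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $seg
  bin/nutch parse $seg
  bin/nutch updatedb crawl/crawldb $seg
  # pull the db_unfetched count out of the stats output
  unfetched=`bin/nutch readdb crawl/crawldb -stats | grep db_unfetched | awk '{print $NF}'`
  if [ "${unfetched:-0}" -eq 0 ]; then break; fi
  # (a production script would also stop when generate selects no records)
done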
Re: Nutch not indexing full collection
Hi Chip, I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote: I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's the correct file. I run my commands while in $NUTCH_HOME/ , which means all of my commands begin with runtime/local/bin/nutch... . That means my urls directory is $NUTCH_HOME/urls/ and my crawl directory ends up being $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and so forth), but it does seem to at least be getting my urlfilters from $NUTCH_HOME/runtime/local/conf/ . I get no output when I try runtime/local/bin/nutch readdb -stats , so that's weird. I dimly recall there being a total index size value somewhere in Nutch or Solr which has to be increased, but I can no longer find any reference to it. Chip -Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Wednesday, July 20, 2011 10:06 AM To: user@nutch.apache.org Subject: Re: Nutch not indexing full collection I'd have suspected db.max.outlinks.per.page but you seem to have set it up correctly. Are you running Nutch in runtime/local? in which case you modified nutch-site.xml in runtime/local/conf, right? nutch readdb -stats will give you the total number of pages known etc Julien On 20 July 2011 14:51, Chip Calhoun ccalh...@aip.org wrote: Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me. My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an alphabetical listing of other pages. It looks like the indexer stops partway down this page, meaning that entries later in the alphabet aren't indexed. My nutch-site.xml has the following content: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>http.agent.name</name> <value>OHI Spider</value> </property> <property> <name>db.max.outlinks.per.page</name> <value>-1</value> <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description> </property> </configuration> My regex-urlfilter.txt and crawl-urlfilter.txt both include the following, which should allow access to everything I want: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*aip.org/history/ohilist/ # skip everything else -. I've crawled with the following command: runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 50 Note that since we don't have NutchBean anymore, I can't tell whether this is actually a Nutch problem or whether something is failing when I port to Solr. What am I missing? Thanks, Chip -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com -- *Lewis*
Re: embedded google map in nutch query result page
You can find Ajax Solr here [1]. As I said this is only one option for doing this. The information you can return and display is really directly dependent on your requirements and your imagination. However, I wouldn't imagine it should be too hard to implement the maps you are looking for once you get to grips with writing widgets. [1] http://evolvingweb.github.com/ajax-solr/ On Wed, Jul 20, 2011 at 9:57 PM, Cheng Li chen...@usc.edu wrote: Thank you. I'll try to use solr to do the indexing and add the google map object. Do you know some resource for solr AJAX? Where should I add the google map js code in solr? Thanks again, On Wed, Jul 20, 2011 at 1:51 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like you're using Nutch 1.2 or earlier, therefore unless you are comfortable working with JSPs I wouldn't bother with the hassle. It might be better to use Solr for indexing and searching and build an interface such as Ajax Solr, which would then permit you to write a widget to do this task. However, unless you have the time and are competent and willing to learn and use Apache Solr and Javascript, this is not an ideal solution. I honestly have no idea how to implement this using the legacy JSP On Wed, Jul 20, 2011 at 11:09 AM, Cheng Li chen...@usc.edu wrote: Hi, I have done a google map marker html code. I plan to display the google map object in the nutch query result page, with the geo-markers which are extracted from the results listed on that page. How should I modify the nutch query result page to implement my design? Thanks, -- Cheng Li -- *Lewis* -- Cheng Li -- *Lewis*
Re: skipping invalid segments nutch 1.3
, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04 llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110721122519 ParseSegment: starting at 2011-07-21 12:27:22 ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110721122519 ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01 llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110721122519 CrawlDb update: starting at 2011-07-21 12:28:03 CrawlDb update: db: /home/llist/nutchData/crawl/crawldb CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text, file:/home/llist/nutchData/crawl/segments/20110721122519/content, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: false CrawlDb update: URL filtering: false - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/content - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate CrawlDb update: Merging segment data into db. CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01 On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote: There is no documentation for individual commands used to run a Nutch 1.3 crawl so I'm not sure where there has been a mislead. In the instance that this was required I would direct newer users to the legacy documentation for the time being. My comment to Leo was to understand whether he managed to correct the invalid segments problem. Leo, if this still persists may I ask you to try again, I will do the same and will be happy to provide feedback May I suggest the following use the following commands inject generate fetch parse updatedb At this stage we should be able to ascertain if something is correct and hopefully debug. May I add the following... please make the following additions to nutch-site. fetcher verbose - true http verbose - true check for redirects and set accordingly On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: The wiki can be edited and you are welcome to suggest improvements if there is something missing On 20 July 2011 13:31, Cam Bazz camb...@gmail.com wrote: Hello, I think there is a mislead in the documentation, it does not tell us that we have to parse. On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Haven't you forgotten to call parse? On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote: Hi Lewis, You are correct about the last post not showing any errors. 
I just wanted to show that I don't get any errors if I use 'crawl' and to prove that I do not have any faults in the conf files or the directories. I still get the errors if I use the individual commands inject, generate, fetch Cheers, Leo On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote: Hi Leo Did you resolve? Your second log data doesn't appear to show any errors however the problem you specify if one I have witnessed myself while ago. Since you posted have you been able to replicate... or resolve? On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions llsub...@zudiewiener.com wrote: I've used crawl to ensure config is correct and I don't get any errors, so I must be doing something wrong with the individual steps, but can;t see what. llist@LeosLinux:~/nutchData $ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed
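Judging by the output quoted above, the -dir flag is the likely culprit: updatedb was pointed at a single segment with -dir, which makes CrawlDb update treat each of that segment's subdirectories (parse_text, content, crawl_parse, ...) as a segment in its own right, hence the "skipping invalid segment" lines. A sketch of the two invocations the command accepts, using the paths from the thread:

# pass one segment directly:
bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110721122519
# or pass the parent directory that holds the segments via -dir:
bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments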
Re: solr index display
Specifically I would mention that you would get community input if this question was directed towards the Solr user list; however, I think you are looking for the velocity response writer. Have a search on the Solr wiki and you will find info there. In addition there are various other well-established client libraries; I previously worked with Ajax Solr. On Mon, Jul 25, 2011 at 12:32 AM, Cheng Li chen...@usc.edu wrote: Hi, I followed this instruction to run the index with solr: http://wiki.apache.org/nutch/RunningNutchAndSolr At the last step, it is said that If you want to see the raw HTML indexed by Solr, change the content field definition in solrconfig.xml to. But I found several solrconfig.xml files in the apache-solr directory. Which solrconfig.xml should I modify to make the query page look like the Nutch 1.2 query page? Thanks, -- Cheng Li -- *Lewis*
Re: embedded google map in nutch query result page
A while since I configured this. Try the tutorial, if I remember it was pretty verbose and I would imagine that it covers this subject area entirely. Sorry I couldn't be of more help. On Mon, Jul 25, 2011 at 4:33 AM, Cheng Li chen...@usc.edu wrote: Hi, I just looked up the website http://evolvingweb.github.com/ajax-solr/ you gave me . But I have some questions about that. Where should I add the javascript code file ? Is it in some subdirectory in apache-solr directory? Can you explain a little bit more? Thanks, On Wed, Jul 20, 2011 at 2:28 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: You can find Ajax Solr here [1]. As I said this is only one option for doing this. The information you can return and display is really directly dependant on your requirements and your imagination. However it should not be too hard implementing the maps you are looking for when you get to grips with writing widgets I wouldn't imagine. [1] http://evolvingweb.github.com/ajax-solr/ On Wed, Jul 20, 2011 at 9:57 PM, Cheng Li chen...@usc.edu wrote: Thank you . I'll try to use solr to do the indexing and add the google map object . Do you know some resource for solr AJAX ? where should I add the google map js code in solr ? Thanks again, On Wed, Jul 20, 2011 at 1:51 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like your using = Nutch 1.2, therefore unless you are comofrtable working with JSP's then I wouldn't bother with the hastle. It might be better to try and use Solr for indexing and searching and build an interface such as Solr AJAX which would then permit you to write a widget to do this task. However unless you have time and are competent and willing to learn and use Apache Solr and Javascript then this is not an ideal solution. I honestly have no idea how to implement this using the legacy JSP On Wed, Jul 20, 2011 at 11:09 AM, Cheng Li chen...@usc.edu wrote: Hi, I have done a google map marker html code. I plan to display the google map object in the nutch query result page, with the geo-markers which are extracted from the results listed on that page. How should I modify the nutch query result page to implement my design? Thanks, -- Cheng Li -- *Lewis* -- Cheng Li -- *Lewis* -- Cheng Li -- *Lewis*
Re: Storage of data between crawls
Hi Alexander, I don't want to state the obvious here but this will depend directly on what type of loading your Nutch implementation deals with... You are correct in stating that we store data in segments, namely /crawl_fetch /content /crawl_parse /parse_data /crawl_generate /parse_text I understand that this doesn't add much value to answering your question, but as we are now indexing with Solr (and therefore not storing larger amounts of data with Nutch) I am struggling slightly to understand the issues you are trying to answer. On Mon, Jul 25, 2011 at 5:13 PM, Chris Alexander chris.alexan...@kusiri.com wrote: Hi all, I have been asked to look at doing some disk space estimates for our Nutch usage. It looks like Nutch stores the content of the pages it downloads and indexes in its data directory for the segment, is this the case? Are there any other major storage requirements I should make note of with Nutch specifically (not the Solr storage, we can handle that bit)? Cheers Chris -- *Lewis*
Re: Nutch not indexing full collection
Has this been solved? If your http.content.limit has not been increased in nutch-site.xml then you will not be able to store this data and index it with Solr (a sketch of the property is given after this thread). On Mon, Jul 25, 2011 at 6:18 PM, Chip Calhoun ccalh...@aip.org wrote: I'm still having trouble. I've set a windows environment variable, NUTCH_HOME, which for me is C:\Apache\nutch-1.3\runtime\local . I now have my urls and crawl directories in that C:\Apache\nutch-1.3\runtime\local folder. But I'm still not crawling files later on my urls list, and apparently I can't search for words or phrases toward the end of any of my documents. Am I misremembering that there was a total file size value somewhere in Nutch or Solr that needs to be increased? -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Wednesday, July 20, 2011 5:23 PM To: user@nutch.apache.org Subject: Re: Nutch not indexing full collection Hi Chip, I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote: I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's the correct file. I run my commands while in $NUTCH_HOME/ , which means all of my commands begin with runtime/local/bin/nutch... . That means my urls directory is $NUTCH_HOME/urls/ and my crawl directory ends up being $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and so forth), but it does seem to at least be getting my urlfilters from $NUTCH_HOME/runtime/local/conf/ . I get no output when I try runtime/local/bin/nutch readdb -stats , so that's weird. I dimly recall there being a total index size value somewhere in Nutch or Solr which has to be increased, but I can no longer find any reference to it. Chip -Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Wednesday, July 20, 2011 10:06 AM To: user@nutch.apache.org Subject: Re: Nutch not indexing full collection I'd have suspected db.max.outlinks.per.page but you seem to have set it up correctly. Are you running Nutch in runtime/local? in which case you modified nutch-site.xml in runtime/local/conf, right? nutch readdb -stats will give you the total number of pages known etc Julien On 20 July 2011 14:51, Chip Calhoun ccalh...@aip.org wrote: Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me. My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an alphabetical listing of other pages. It looks like the indexer stops partway down this page, meaning that entries later in the alphabet aren't indexed. My nutch-site.xml has the following content: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>http.agent.name</name> <value>OHI Spider</value> </property> <property> <name>db.max.outlinks.per.page</name> <value>-1</value> <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description> </property> </configuration> My regex-urlfilter.txt and crawl-urlfilter.txt both include the following, which should allow access to everything I want: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*aip.org/history/ohilist/ # skip everything else -. I've crawled with the following command: runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 50 Note that since we don't have NutchBean anymore, I can't tell whether this is actually a Nutch problem or whether something is failing when I port to Solr. What am I missing? Thanks, Chip -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com -- *Lewis* -- *Lewis*
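A sketch of the setting suggested at the top of this thread, for conf/nutch-site.xml, assuming the stock default of 65536 bytes in nutch-default.xml is what truncates long pages such as the transcripts listing (a recrawl is needed afterwards, since already-fetched content stays truncated in the old segments):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>-1 removes the per-page content size limit.</description>
</property>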
Re: TF in wide internet crawls
Hi Markus, I am with you until the last parts of your comments. cope with non-edited... edited by whom? and for what purpose? To give a better relative tf score... To comment on the first part, and please ignore or correct me if I am wrong, but do we not give each page and therefore each document an initial score of 1.0 which is then subsequently used by whichever scoring algorithm we plug in? If this is the case then how are we specifying the score for a page and the tf of some term within a document, or the tf-idf of that term over the entire document collection, to determine relevance? How can we accurately disambiguate between these entities? As I said I'm losing you towards the end; however, it would be a good discussion to explore behind the surface architecture. On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I've done several projects where term frequency yields bad result sets and worse relevancy. These projects all had one similarity; user-generated content with a competitive edge. The latter means classifieds web sites such as e-bay etc. The internet is something similar. It contains edited content, classifieds and spam or other garbage. What do you do with tf in your wide internet index? Do you impose a threshold or are you emitting 1.0f for each match? For now I emit 1.0f for each match and rely on matches in multiple fields with varying boosts to improve relevancy and various other methods. Can tf*idf cope with non-edited (and untrusted) documents at all? I've seen great relevancy with good content but really bad relevance in several cases. Thanks! -- *Lewis*
Re: plugin build.xml file
Hi Cheng Li, Please experiment with this. We have been gradually getting the pluginCentral section of the wiki updated as it needed a total face lift, so we would appreciate any additional input you may have for updating the Writing Plugin Example which is already there. Apart from being completely out of date, the one you mention should have been moved to the archive and legacy section under OldPluginCentral. I'll be picking this up tomorrow and updating. On Tue, Jul 26, 2011 at 6:46 AM, Cheng Li chen...@usc.edu wrote: Hi, In http://wiki.apache.org/nutch/WritingPluginExample-0.9 , it is said that in the nutch/plugin/recommended directory there should be 2 files, which are build.xml and plugin.xml. But in Nutch 1.3, I checked other folders in plugin, and most of them have one plugin.xml file and a jar file. So, in Nutch 1.3, do I still need to follow the instruction to create a build.xml in the /plugin/recommended directory? Or what other configuration files should I modify or create? Thanks, -- Cheng Li -- *Lewis*
Re: Limit Nutch memory usage
Hi Marseld, I'm just putting my thoughts out here; however, Hadoop is not shipped with Nutch 1.3 anymore, therefore I don't know where you would set this specific property within your Nutch instances... How are you running Hadoop? What version of Nutch, and what mode are you running Nutch in? On Tue, Jul 26, 2011 at 8:55 AM, Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com wrote: Hello list, I have two instances of nutch running on my machine. I want to configure instance 1's maximum usage of RAM to be 4 GB and the max usage of RAM in instance 2 to be 8 GB. Can I do it by configuring HADOOP_HEAPSIZE for each instance? Will these configurations interfere with each other? Best Regards, Marseld -- *Lewis*
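If memory serves, the stock bin/nutch script honours a NUTCH_HEAPSIZE environment variable (in megabytes) when building the JVM arguments in local mode, which would keep the two instances independent. A sketch, treating the variable name and units as assumptions to verify against your bin/nutch:

# instance 1, capped at roughly 4 GB
NUTCH_HEAPSIZE=4000 bin/nutch crawl urls1 -dir crawl1 -depth 3
# instance 2, capped at roughly 8 GB
NUTCH_HEAPSIZE=8000 bin/nutch crawl urls2 -dir crawl2 -depth 3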
Re: Storage of data between crawls
Well, when Nutch undertakes a generate, fetch and parse, e.g. the steps that generate segment data for indexing, the data is stored in various forms within the segment. There is much more purpose to the segment than is explained in this reply, however it does not add to this particular thread. If you have a look at nutch-default.xml you will notice a deprecated property db.default.fetch.interval; ignore this for the time being and focus instead on db.fetch.interval.default (which is a much more accurate method of specifying the default value for re-fetches of any given page anyway). Any segment older than this value can be safely deleted as new segments will have been created in successive crawl processes, thus rendering it less useful to us. This is one option for reducing the amount of disk space Nutch data takes. An alternative option to this is to mergesegs with the option to pass filtering and slicing commands for a healthier output segment. I remember learning on this list some time ago that mergesegs is a useful command for managing a Nutch instance which produces several segments per day. Understandably this can get out of hand pretty quickly, therefore merging segment data enables us to manage this effectively. In general, but strictly dependent on the size and nature of your Nutch crawls, we rarely experience problems concerning the size of disk space occupied by Nutch 1.3 segment data, however I'm sure there are extreme cases out there. On Thu, Jul 28, 2011 at 9:18 AM, Chris Alexander chris.alexan...@kusiri.com wrote: Cheers Lewis, perhaps I should attempt to rephrase the question. Clearly Nutch must download and store the contents of a page during a crawl. However, once you have indexed this content, does Nutch keep this data, or is it cleaned up, automatically or is there a command to do it? Thanks Chris On 27 July 2011 17:14, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alexander, I don't want to state the obvious here but this will depend directly on what type of loading your Nutch implementation deals with... You are correct in stating that we store data in segments, namely /crawl_fetch /content /crawl_parse /parse_data /crawl_generate /parse_text I understand that this doesn't add much value to answering your question, but as we are now indexing with Solr (and therefore not storing larger amounts of data with Nutch) I am struggling slightly to understand the issues you are trying to answer. On Mon, Jul 25, 2011 at 5:13 PM, Chris Alexander chris.alexan...@kusiri.com wrote: Hi all, I have been asked to look at doing some disk space estimates for our Nutch usage. It looks like Nutch stores the content of the pages it downloads and indexes in its data directory for the segment, is this the case? Are there any other major storage requirements I should make note of with Nutch specifically (not the Solr storage, we can handle that bit)? Cheers Chris -- *Lewis* -- *Lewis*
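A sketch of the mergesegs option mentioned above (the output directory name and slice size are arbitrary; check bin/nutch mergesegs for the exact switches in your version):

# merge all segments into slices of at most 50000 URLs, filtering on the way
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter -slice 50000
# once happy with the merged output (and any index rebuilt from it),
# the old segments can be removed
rm -rf crawl/segments
mv crawl/MERGEDsegments crawl/segments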
Re: NullPointerException when calling readdb on empty database
Which version of Nutch are you using? Is chat a plain text file, with URLs listed one per line? If this is the case there is no need to add it to your crawl command. Additionally, there is no point in trying to read what is happening in your crawldb if your generator log output indicates that nothing has been selected for fetching, therefore this will be skipped. I'm slightly concerned about your crawl parameters; for example, is it necessary to use crawl-chat? I have never used hyphens before, and it is only a suggestion, but might it be possible that Nutch is taking -chat as a parameter as well? On Wed, Aug 3, 2011 at 8:34 AM, Christian Weiske christian.wei...@netresearch.de wrote: Hi, I'm getting the following error: $ bin/nutch readdb crawl-chat/crawldb -stats CrawlDb statistics start: crawl-chat/crawldb Statistics for CrawlDb: crawl-chat/crawldb Exception in thread "main" java.lang.NullPointerException at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352) at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502) The db has been created as follows, as you see no URLs have been fetched (another problem): $ bin/nutch crawl urls/chat -dir crawl-chat -depth 10 -topN 1 solrUrl is not set, indexing will be skipped... crawl started in: crawl-chat rootUrlDir = urls/chat threads = 10 depth = 10 solrUrl=null topN = 1 Injector: starting at 2011-08-03 09:31:53 Injector: crawlDb: crawl-chat/crawldb Injector: urlDir: urls/chat Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-08-03 09:31:57, elapsed: 00:00:04 Generator: starting at 2011-08-03 09:31:57 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 1 Generator: jobtracker is 'local', generating exactly one partition. Generator: 0 records selected for fetching, exiting ... Stopping at depth=0 - no more URLs to fetch. No URLs to fetch - check your seed list and URL filters. crawl finished: crawl-chat -- Best regards, Christian Weiske -- *Lewis*
Re: imported to solr
Hi Kiks, What kind of changes have you made to your schema when transferring to your Solr instance? You ask about the stored parsed text content; well, the default Nutch schema sets this to stored=false, as it is not always required for all content to be stored. Generally speaking, terms that occur in the title, meta, etc. fields will be more valuable to search across, especially when considering data stores. You can change this behaviour by simply making the changes described; however, Solr does not take kindly to changes in the schema, therefore it will be necessary to reindex your data to your Solr core. On Wed, Aug 3, 2011 at 7:31 AM, Kiks kikstern...@gmail.com wrote: This question was posted on the solr list and not answered because it is nutch related... The indexed contents of 100 sites were imported to solr from nutch using: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* now, a solr admin search for 'photography' includes these results: <doc> <float name="score">0.12570743</float> <float name="boost">1.0440307</float> <str name="digest">94d97f2806240d18d67cafe9c34f94e1</str> <str name="id">http://www.galleryhopper.org/</str> <str name="segment">...</str> <str name="title">Gallery Hopper: Todd Walker's photography ephemera. Read, enjoy, share, discard.</str> <date name="tstamp">...</date> <str name="url">http://www.galleryhopper.org/</str> </doc> but highlighting options are on the title field not page text. My question: Where is the stored parsetext content of the pages? What is the solr command to send it from nutch with url/id key? The information is contained in the crawl segments with solr id field matching nutch url. Thanks. -- *Lewis*
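Concretely, the change described would be along these lines (sketch; the field definition is assumed from the stock Nutch schema shipped for Solr, where content is indexed but not stored). In schema.xml change

<field name="content" type="text" stored="false" indexed="true"/>

to stored="true", reload Solr, and then reindex so existing documents pick up the stored text:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*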
Re: New wiki page for Running Nutch 1.3 in Eclipse
Sorry http://wiki.apache.org/nutch/RunNutchInEclipse On Wed, Aug 3, 2011 at 2:12 PM, Dr.Ibrahim A Alkharashi khara...@kacst.edu.sa wrote: thanks for the info, would you please post a pointer to the page. Regards Ibrahim On Aug 3, 2011, at 3:13 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, We've just posted a new updated wiki page covering the above topic. If there are any discrepancies within the page it would be nice if individuals could sign up to the wiki and edit based upon your own experiences using = Nutch 1.3 within an IDE. However, alternatively please post on the lists and we will get it updated. Thanks for now -- *Lewis* Warning: This message and its attachment, if any, are confidential and may contain information protected by law. If you are not the intended recipient, please contact the sender immediately and delete the message and its attachment, if any. You should not copy the message and its attachment, if any, or disclose its contents to any other person or use it for any purpose. Statements and opinions expressed in this e-mail and its attachment, if any, are those of the sender, and do not necessarily reflect those of King Abdulaziz city for Science and Technology (KACST) in the Kingdom of Saudi Arabia. KACST accepts no liability for any damage caused by this email. تحذير: هذه الرسالة وما تحويه من مرفقات (إن وجدت) تمثل وثيقة سرية قد تحتوي على معلومات محمية بموجب القانون. إذا لم تكن الشخص المعني بهذه الرسالة فيجب عليك تنبيه المُرسل بخطأ وصولها إليك، وحذف الرسالة ومرفقاتها (إن وجدت)، ولا يجوز لك نسخ أو توزيع هذه الرسالة أو مرفقاتها (إن وجدت) أو أي جزء منها، أو البوح بمحتوياتها للغير أو استعمالها لأي غرض. علماً بأن فحوى هذه الرسالة ومرفقاتها (ان وجدت) تعبر عن رأي المُرسل وليس بالضرورة رأي مدينة الملك عبدالعزيز للعلوم والتقنية بالمملكة العربية السعودية، ولا تتحمل المدينة أي مسئولية عن الأضرار الناتجة عن ما قد يحتويه هذا البريد. -- *Lewis*
Re: how to extract tf-idf
Hi Zhanibek, I would like to refer specifically to Markus' thread which he initiated a short time ago [1], as it shares close similarity with your own questions. I think the main question to be answered now is how do we extract tf-idf from a crawled website? And as we now refer to Nutch as an independent software project focused solely on crawling, this is a question which would provide significant value to understanding more about the inner workings. Markus mentioned that there are many aspects we need to consider before trying to compile a tf-idf score, e.g. link score, norms, boosts, functions etc. This makes it relatively hard for me (and I suspect others) to accurately comment on the actual components we are required to consider and understand in this specific context before we can address the fundamental question at hand... I think there is a good deal of lateral thinking required here ;0) In the meantime have you had any chance to delve into this? [1] http://www.mail-archive.com/user%40nutch.apache.org/msg03517.html On Wed, Aug 3, 2011 at 5:28 AM, Zhanibek Datbayev itoma...@gmail.com wrote: Hello Nutch Users, I've googled for a while and still can not find answers to the following: 1. After I crawl a web site, how can I extract tf-idf for it? 2. How can I access the original web pages crawled? 3. Is it possible to get, for each word, the id it corresponds to? Thanks in advance! -Zhanibek -- *Lewis*
Re: fetcher runs without error with no internet connection
Hi Alex, Did you get anywhere with this? What condition led to you seeing the unknown host exception? Unless the segment gets corrupted, I would assume you could fetch again. Hopefully you can confirm this. On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote: Hello, After running bin/nutch fetch $segment for 2 days, the internet connection was lost, but nutch did not give any errors. Usually I was seeing an Unknown host exception before. Any ideas what happened, and is it OK to stop the fetch and run it again on the same (old) segment? This is nutch-1.2 Thanks. Alex. -- *Lewis*
Re: force recrawl
Correct. There should be comprehensive documentation on the wiki for these parameters (and many more). On Fri, Aug 19, 2011 at 6:46 PM, Markus Jelsma markus.jel...@openindex.io wrote: addDays is not a crawl switch but a generator switch. You cannot use the crawl command. But if I use bin/nutch crawl urls -dir crawl -depth 2 -topN 50 addDays does not have any effect. Has anyone a nutch crawl script that can also be used to force a recrawl? Well, actually. You can! I seem to have forgotten the -addDays switch of the generator. It adds #days to the current time to force URL's with fetch times in the future to be eligible for fetch. -- View this message in context: http://lucene.472066.n3.nabble.com/force-recrawl-tp3268654p3268779.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
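For the archives, a sketch of the generator switch being discussed (the number of days is arbitrary):

# treat URLs whose fetch time lies up to 31 days in the future as due now
bin/nutch generate crawl/crawldb crawl/segments -adddays 31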
Re: Empty LinkDB after invertlinks
Hi, Small suggestion, but I do not see any -dir argument passed alongside your initial invertlinks command. I understand that you have multiple segment directories, which have been fetched over a recent number of days, and that the output would also suggest the process was properly executed; however, I have never used the command without the -dir option (and it has always worked for me), therefore I can only suggest that this may be the problem. On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann m.bachm...@uni-kassel.de wrote: Hi Markus, thank you for the quick reply. I already searched for this Configuration error and found: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html Where they say that "This exception is innocuous - it helps to debug at which points in the code the Configuration instances are being created." (...) I have indeed not much disk space on the machine but it should be enough at the moment: root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h . Filesystem Size Used Avail Use% Mounted on /dev/vda1 20G 5.9G 15G 30% /home As I am root and all directories under /home/nutchServer/relaunch_nutch/runtime/local/bin are set to root:root and 755, permissions shouldn't be the problem. Any further suggestions? :-/ Thank you once again On 23.08.2011 16:10, Markus Jelsma wrote: There are some peculiarities in your log: 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config() at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198) at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213) at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93) at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190) at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255) 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226) at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184) at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52) at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32) at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111) Can you check permissions, disk space etc? On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote: Hey Ho, for some reason the invertlinks command produces an empty linkdb. 
I did: root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter LinkDb: starting at 2011-08-23 14:47:21 LinkDb: linkdb: crawl/linkdb LinkDb: URL normalize: false LinkDb: URL filter: false LinkDb: adding segment: crawl/segments/20110817164804 LinkDb: adding segment: crawl/segments/20110817164912 LinkDb: adding segment: crawl/segments/20110817165053 LinkDb: adding segment: crawl/segments/20110817165524 LinkDb: adding segment: crawl/segments/20110817170729 LinkDb: adding segment: crawl/segments/20110817171757 LinkDb: adding segment: crawl/segments/20110817172919 LinkDb: adding segment: crawl/segments/20110819135218 LinkDb: adding segment: crawl/segments/20110819165658 LinkDb: adding segment: crawl/segments/20110819170807 LinkDb: adding segment: crawl/segments/20110819171841 LinkDb: adding segment: crawl/segments/20110819173350 LinkDb: adding segment: crawl/segments/20110822135934 LinkDb: adding segment: crawl/segments/20110822141229 LinkDb: adding segment: crawl/segments/20110822143419 LinkDb: adding segment: crawl/segments/20110822143824 LinkDb: adding segment: crawl/segments/20110822144031 LinkDb: adding segment: crawl/segments/20110822144232 LinkDb: adding segment: crawl/segments/20110822144435 LinkDb: adding segment: crawl/segments/20110822144617 LinkDb: adding segment: crawl/segments/20110822144750 LinkDb: adding segment: crawl/segments/20110822144927 LinkDb: adding segment: crawl/segments/20110822145249 LinkDb: adding segment: crawl/segments/20110822150757
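For comparison, the two invocations LinkDb accepts, as a sketch: -dir expects the parent directory that holds the segments, while the bare form names segments individually, which is what the shell glob above expands to (segment names taken from the output quoted in this thread):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter
bin/nutch invertlinks crawl/linkdb crawl/segments/20110817164804 crawl/segments/20110817164912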
Re: readdblink not showing alllinks
If you post your crawldb dump then we can see the structure of your crawldb and may be able to begin pinpointing the issue. It should not be required for you to undertake another crawl after inverting links for these URLs to be indexed when calling the solrindex command... there must be more to it. On Tue, Aug 23, 2011 at 6:54 PM, abhayd ajdabhol...@hotmail.com wrote: hi after doing invertlink I see the complete link graph... THANKS I'm a bit confused, please help me understand.. I do a crawl using the crawl command. I see around 7000+ urls when I dump the crawldb. Then I do invertlink and I see the complete link graph. After this I do solrindex. After solr indexing is completed I see only 2421 docs. I was expecting 7000+ docs (i.e. the exact number of unique urls which I got from dumping the crawldb as text). Why do I just see 2421 urls/docs in solr? Do I need to execute crawl again after invertlink? Here are some settings -- <name>db.update.max.inlinks</name> <value>1</value> <name>db.ignore.internal.links</name> <value>false</value> <name>db.max.inlinks</name> <value>1</value> <name>db.max.outlinks.per.page</name> <value>-1</value> -- View this message in context: http://lucene.472066.n3.nabble.com/readdblink-not-showing-alllinks-tp3274127p3278779.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: How to save html source to local drive
Hi Can you explain how you tried to save raw html obtained during a crawl to a local drive? I am not entirely sure what you mean here and why you would want to do so given that we already have an array of alternative options available. Can you please expand on this. Thank you On Wed, Aug 24, 2011 at 5:24 AM, dyzc 1393975...@qq.com wrote: Hi, I am using nutch within hadoop distributed computing environment. I tried saving html source to a local drive (not HDFS) via absolute filepath, but I can't find the saved contents on either master node or datanodes. How can I achieve this? Thanks! -- *Lewis*
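One of the alternatives alluded to above: readseg can dump segment data, raw fetched content included, to plain files on the local filesystem (sketch; segment selection and output directories are placeholders):

seg=`ls -d crawl/segments/2* | tail -1`
# dump everything in the segment (content, fetch data, parse data, parse text)
bin/nutch readseg -dump $seg dump_all
# or keep only the raw fetched content records
bin/nutch readseg -dump $seg dump_content -nofetch -nogenerate -noparse -noparsedata -noparsetext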
Re: Recursively searching through web dirs
Hi Adam, My initial thoughts are that you are correct. It is very unusual for your files to be located at a URL in the same domain which is not referenced by the top level or a subsequent level URL within the domain. What I would suggest is that you have a look through your hadoop.log as well as use some of the commands which enable you to investigate your crawldb, segment(s) and linkdb if you've created one. Have a look at the wiki under command line options. On Wed, Aug 24, 2011 at 9:03 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: All, I have a root domain and a couple of directories deep I have some files that I want to index. The problem is that they are not referenced on the main page using a hyperlink or anything like that. http://www.geoglobaldomination.org/kml/temp/ I want to be able to crawl down in to /kml/temp/ without knowing that it's even there. Is there a way to do this in Nutch? echo http://www.geoglobaldomination.org > urls ./nutch crawl urls -threads 10 -depth 10 -topN 20 -solr http://172.16.2.107:8983/solr Nothing, and I suspect that it's because there is not a hyperlink on the main page. Thoughts? Adam -- *Lewis*
Re: Trying to understand and use URLmeta
Hi JB, We have recently finished a complete plugin tutorial which fully explains the functionality of the urlmeta plugin on the wiki. It can be found here [1], could I ask you to have a thorough look at it, and the code and if you still have questions then please reinforce them. [1] http://wiki.apache.org/nutch/WritingPluginExample Thank you On Wed, Aug 24, 2011 at 9:36 PM, John R. Brinkema brink...@teo.uscourts.gov wrote: Hi all, I am trying use URLmeta to inject meta data into documents that I crawl and I am having some problems. First the context: Nutch 1.3 with Solr 3.2 My seed url files looks like: http://mySite.com/Guide/index.** html\trecommended= http://mySite.com/Guide/index.html%5Ctrecommended= Guide\**tkeywords=Guide,Policy,**JBmarker I put JBmarker there so I could see where the metadata got put. Index.html itself is a table of contents of a guide; that is, it is mostly a list of outlinks to parts of the overall guide. My nutch-site.xml includes the following properties: property nameplugin.includes/name valueprotocol-http|**urlfilter-regex|parse-(html|** tika)|index-(basic|anchor|**urlmeta)|scoring-opic|** urlnormalizer-(pass|regex|**basic)/value /property property nameurlmeta.tags/name valuerecommended,keywords/**value /property I fire up nutch to crawl and all goes well. To see what nutch did, I ran 'readseg -dump' and looked at the results. What I found was the following: ... other Recno's above ... Recno:: 56 URL:: http:/mySite.com/Guide/index.**html CrawlDatum:: Version: 7 Status: 65 (signature) Fetch time: Tue Aug 23 10:08:18 EDT 2011 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 0 seconds (0 days) Score: 1.0 Signature: 5c182af41027766eccf1ea60d11277**2c Metadata: CrawlDatum:: Version: 7 Status: 1 (db_unfetched) Fetch time: Tue Aug 23 10:08:04 EDT 2011 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0 Signature: null Metadata: recommended: Guide_ngt_: 1314108489210keywords: Guide,Policy,JBmarker Content:: Version: -1 url: http://mySite.com/Guide/index.**htmlhttp://mySite.com/Guide/index.html base: http://mySite.com/Guide/index.**htmlhttp://mySite.com/Guide/index.html ... lots more content ... CrawlDatum:: Version: 7 Status: 33 (fetch_success) Fetch time: Tue Aug 23 10:08:15 EDT 2011 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0 Signature: null Metadata: recommended: Guide_ngt_: 1314108489210keywords: Guide,Policy,JBmarker_pst_: success(1), lastModified=0 ParseData:: Version: 5 Status: success(1,0) Title: Guide Outlinks: 60 outlink: toUrl: http://mySite.com/Home/About.**htmlhttp://mySite.com/Home/About.htmlanchor: About Me outlink: toUrl: http://mySite.com/Guide/**Contact_The_Guide.htmlhttp://mySite.com/Guide/Contact_The_Guide.htmlanchor: Contact Me ... many more outlinks ... Content Metadata: nutch.content.digest=**5c182af41027766eccf1ea60d11277**2c Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT Content-Length=28798 Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=**20110823100811 Content-Type=text/html Connection=close Server=Netscape-Enterprise/6.0 Parse Metadata: CharEncodingForConversion=**windows-1252 OriginalCharEncoding=windows-**1252 ParseText:: ... lots of parsed text ... Recno:: 57 ... and so forth. JBmarker does not appear anywhere else, in this segment or any of the others. 
When I do a solrindex, JBmarker does not appear to be anywhere. ?? *What I expected* As I understand URLmeta (as defined by the two nutch patches), the metadata that is included with the url is injected into the seed url; that is to say, it is as if the lines: <META NAME="recommended" CONTENT="Guide"> <META NAME="keywords" CONTENT="Guide,Policy,JBmarker"> were in the seed url content. Furthermore, it is as if those two lines were in all the outlink content of the seed url. So, I expected that when I looked at all the CrawlDatum and ParseData of the outlinks from the seed url, I would see the same metadata as in the seed CrawlDatum and ParseData. Which is clearly not the case. As for solrindex, I assume that I have some work to do to get any special metadata actions moved over to solr; a special plugin of some sort. That is, urlmeta does not help get the collected metadata from Nutch to Solr. So what is happening? Where did I go astray? Am I analyzing the Nutch dumps incorrectly? One other side note: I assume that Luke will no longer help me debug Nutch, since it works with Lucene indexes and Nutch no longer creates such beasts. Are there any tools that help with viewing Nutch databases? It seems that Nutch takes some liberties with the data it is dumping (e.g., the meta tags all concatenated together without delimiters; I
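On the solrindex part of JB's question, the usual wiring (a sketch only; the recommended and keywords field names simply mirror the urlmeta.tags values above, and the exact attribute values are assumptions) is to declare matching fields in Solr's schema.xml and map them across in Nutch's conf/solrindex-mapping.xml:

  In Solr's schema.xml:
  <field name="recommended" type="string" stored="true" indexed="true"/>
  <field name="keywords" type="string" stored="true" indexed="true"/>

  In Nutch's conf/solrindex-mapping.xml:
  <field dest="recommended" source="recommended"/>
  <field dest="keywords" source="keywords"/>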
Re: Are there any tutorial for writing regex-normalize.xml?
Apart from looking through the list archives, as far as I am aware nothing has been specifically documented on this topic. In the meantime you may find this helpful: http://geekswithblogs.net/brcraju/articles/235.aspx On Fri, Aug 26, 2011 at 9:22 AM, Kaiwii Ho kaiwi...@gmail.com wrote: I'm going to specify my own regex-normalize.xml. Is there any tutorial for writing regex-normalize.xml? Waiting for your help, and thank you -- *Lewis*
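For anyone finding this thread later: the file is a list of pattern/substitution pairs applied to every URL, in the same format as the sample conf/regex-normalize.xml shipped with Nutch. A minimal sketch (the session-id rule is only an illustration):

  <?xml version="1.0"?>
  <regex-normalize>
    <!-- strip Java session ids appended to URLs -->
    <regex>
      <pattern>(?i);jsessionid=\w+</pattern>
      <substitution></substitution>
    </regex>
  </regex-normalize>

Rules are applied in order, and the patterns use Java regular expression syntax.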
Re: force recrawl
If you only wish to schedule recrawls for that one page, I'm sure this could easily be set up by writing a bash script specifying the -adddays argument with your commands. This could then be set up and run as a cron job? Please someone correct me if I am wrong. On Fri, Aug 26, 2011 at 10:22 PM, Radim Kolar h...@sendmail.cz wrote: It would be nice to have a command which would alter database refetch times for specified URLs. With a configuration like this: ^http://www\.google\.com/?$ 1d # fetch google homepage daily I am willing to help with sponsoring the development and testing of such a thing. -- *Lewis*
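A sketch of such a script (paths and the 31-day value are assumptions; -adddays shifts the generator's notion of the current time so that entries fall due earlier than their normal refetch interval):

  #!/bin/bash
  # force entries due within the next 31 days to be generated now
  NUTCH=/path/to/nutch/bin/nutch
  $NUTCH generate crawl/crawldb crawl/segments -adddays 31

Run daily from cron, e.g.: 0 3 * * * /path/to/force-recrawl.sh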
Trying to complete index structure wiki page
Hi, As the title suggests, I'm in the process of getting some comprehensive documentation sorted out for Nutch, and this obviously starts at wiki level. I'm currently working on the IndexStructure page [1]. I would appreciate it if some of you could have a quick look and correct where you see fit. In addition I have a couple of quick questions regarding the last 4 fields I'm trying to account for: 1) BOOST - As far as I am aware this was deprecated in Nutch 1.2 or Nutch 1.1... correct or wrong? 2) DIGEST - Don't have a clue 3) SEGMENT - as 2 4) TIMESTAMP - as 2 It would be great if people could fill me in on the grey areas please. Finally, what a job all contributors, devs and committers did cleaning up the plugin directory between the Nutch 1.2 and 1.3 releases. It's not until you see previous versions on SVN that you can fully appreciate the excellent job that has been done with the 1.3 release. :0) [1] http://wiki.apache.org/nutch/IndexStructure -- *Lewis*
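For what it's worth, these four fields appear in the schema.xml bundled with Nutch 1.3 roughly as follows (a sketch from memory, so the types and attributes should be checked against your copy):

  <field name="boost" type="float" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="tstamp" type="long" stored="true" indexed="false"/>

digest is generally understood as the hash of the fetched content (used for deduplication), segment as the name of the segment the document was fetched in, and tstamp as the fetch timestamp.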
Re: How to generate multiple small segments w/o -numFetchers?
Hi Gabriele, can you expand on your last comment... are you running in deploy mode? And to reply to your first point, yes you are correct, the FAQs need extensive updating. Please feel free to change anything you feel necessary, however as a matter of retaining knowledge for the legacy of Nutch we are now moving deprecated/old information resources to the archive section of the wiki. On Sun, Aug 28, 2011 at 7:58 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote: but that's no local solution: if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) { // override LOG.info("Generator: jobtracker is 'local', generating exactly one partition."); numLists = 1; } On Sun, Aug 28, 2011 at 8:57 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote: it was a bin/nutch generate option. On Sun, Aug 28, 2011 at 6:24 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote: Hello, All over the FAQ http://wiki.apache.org/nutch/FAQ the bin/nutch -numFetchers option is documented as a way to generate multiple small segments. However, that option doesn't seem to be available in either 1.3 or 1.4. So should the FAQ be updated, or am I missing something? How else could I generate multiple small segments? I can see doing that with -topN but that's less convenient. -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- *Lewis*
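A sketch of the -topN workaround Gabriele mentions (directory names are assumptions): since in local mode the generator forces exactly one partition, the usual way to get several small segments is to run several small generate/fetch/update cycles:

  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
  done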
Re: a question about job failed
Hi Zhao, Do you have any more verbose log info from hadoop.log? I have never worked with Nutch 0.9, but it would help if you could at least indicate whether you get something like LOG: info Dedup: starting ... blah blah blah. Taking this to a larger context, I am not particularly happy with the verboseness of logging when there are errors with indexing commands. When we experience an error during any of the index-related commands we get back "Job failed!". It would be nice to get back a reason for the job failing which was clearer than a stack trace. Finally, and this is from a personal point of view, I would highly recommend that you upgrade to a newer (1.3) version of Nutch if you are using this in production. There are significant improvements in functionality. Lewis On Mon, Aug 29, 2011 at 3:24 AM, zhao 253546...@qq.com wrote: Dear all, I am using Nutch 0.9 and have a question. A detailed description of the problem is: Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439) at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) Thank you for your help zhao -- *Lewis*
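To pull the underlying cause out of the log, something along these lines usually works (the log location shown, logs/hadoop.log, is the typical default and may differ in a 0.9 layout):

  grep -i -B 2 -A 10 "dedup" logs/hadoop.log
  tail -n 200 logs/hadoop.log

The lines immediately before the IOException are normally the real error the dedup job hit.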
Re: SSHD for Nutch 1.3 in Pseudo Distributed mode
If it complains about SSH errors then I would first ensure that you can log in through your SSH client, e.g. ssh -v localhost, prior to executing any Hadoop scripts. Further to this, unless you are actually experiencing Nutch-related problems on a pseudo or cluster setup, probably the best place to go is the Hadoop user lists. This is only a thought, but it would make most sense. On Mon, Aug 29, 2011 at 3:58 PM, webdev1977 webdev1...@gmail.com wrote: Do I NEED SSHD for Nutch 1.3 in Pseudo Distributed mode? I am running on a Windows server using cygwin (obviously :-) I can not get hadoop/nutch to run in deploy mode and I am not sure if it has something to do with ssh or not. When I run start-all.sh it gives me some ssh usage errors and also says it is starting the jobtracker and namenode. In the hadoop log it complains about not being able to write the file: hdfs://localhost:9000/cygdrive/r/EnterpriseSearch/hadoop/mapreduce/system/jobtracker.info. I have configured core-site.xml, hdfs-site.xml and mapred-site.xml -- View this message in context: http://lucene.472066.n3.nabble.com/SSHD-for-Nutch-1-3-in-Pseudo-Distributed-mode-tp3292907p3292907.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
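The passwordless-SSH setup that start-all.sh expects looks roughly like this (a sketch; under Cygwin the sshd service also needs to be installed and running first, typically via ssh-host-config):

  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  ssh localhost    # should now log in without a password prompt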