Re: Character encoding on Html-Pages

2011-06-07 Thread lewis john mcgibbney
Hi Alex, I cannot locate the java file you mention at org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3... Having a quick look at org.apache.nutch.parse.HTMLMetaTags (in both versions above it is identical) it appears that you are right the double quotes for meta http-equiv

Re: keeping index up to date

2011-06-07 Thread lewis john mcgibbney
Hi, To add to Markus' comments, if you take a look at the script it is written in such a way that if run in safe mode it protects us against an error which may occur. If this is the case we can recover segments etc and take appropriate actions to resolve. On Tue, Jun 7, 2011 at 9:01 PM, Markus

Re: nutch NoClassDefFound

2011-06-08 Thread lewis john mcgibbney
Hi, I suggest that before you try to progress any further with this you read as much of the wiki [1] as you can, in particular I would start here [2] [3] After this, try looking through some of the source and understanding what parameters are required to run various commands. The reason for this

Updates to Nutch Wiki

2011-06-08 Thread lewis john mcgibbney
Hi everyone, I was wondering if anyone (familiar with the topics) would be interested in sending me material for the following pages [1] [2]. The links appear to be nonexistent in our wiki and it would be nice to get some material on these topics if these topics are important and are required!

Re: searcher.dir not working

2011-06-08 Thread lewis john mcgibbney
Hi abhayd, In short...yes. Although you have correctly specified an absolute path, you need to drop the /crawldb/current/part-0 A good resource for this stuff can usually be found on the mailing lists. On Wed, Jun 8, 2011 at 8:03 AM, abhayd ajdabhol...@hotmail.com wrote: hi I am using

Re: bin folder missing in 1.3 release

2011-06-09 Thread lewis john mcgibbney
We are a bit thin on supporting documentation for the new release at the moment but are actively working towards producing this. Hopefully once we have something contributed to the wiki the differences in configuration and functionality within release 1.3 will be fully explained. On Thu, Jun 9,

Re: No Urls to fetch

2011-06-13 Thread lewis john mcgibbney
Hi Adelaida, Assuming that you have been able to successfully crawl the top level domain http://elcorreo.com e.g. that you have been able to crawl and create an index, at least we know that your configuration options are OK. I assume that you are using 1.2... can you confirm? What does the rest

Re: Injecting urls through code instead of file

2011-06-14 Thread lewis john mcgibbney
Hi, Can you provide a use case? The reason I ask is that I can only assume that you would be hacking some code to inject your urls from some other URL store? On Tue, Jun 14, 2011 at 5:18 PM, shanWDC ssar...@web.com wrote: Is there a way to inject urls in the injector, through code, rather than

Re: Problem with Nutch Search

2011-06-16 Thread lewis john mcgibbney
Off the top of my head one property springs to mind, which you may or may not have configured in nutch-site: http.content.limit. However, I think that this is not the source of the problem. I would advise you to have a look at your hadoop log file for any obvious warnings... how do you know he

Re: I need step-by-step tutorial to run Nutch 1.2 from source code

2011-06-18 Thread lewis john mcgibbney
Hi Mohammad, Try looking at the pre nutch 1.3 material on the wiki, I'm sure there must be something in there you can build on... or that will at least point you in the right direction http://wiki.apache.org/nutch/Archive%20and%20Legacy HTH On Fri, Jun 17, 2011 at 9:27 PM, Mohammad Hassan

Re: Empty indexes folder after crawling!

2011-06-23 Thread lewis john mcgibbney
Have you set your crawl directory property value in nutch-site.xml when launching the war file on tomcat? On Tue, Jun 21, 2011 at 4:01 AM, Mohammad Hassan Pandi pandi...@gmail.com wrote: following http://wiki.apache.org/nutch/NutchHadoopTutorial I crawled lucene.apache.org with command

Re: how to classify the search results by an indexed field with lucene?

2011-06-23 Thread lewis john mcgibbney
To give a short answer to your question, the answer is I don't know. Many of us are not using Lucene as the indexing mechanism. I think as this is specifically linked to Lucene you would be better asking there. Try the user list http://lucene.apache.org/java/docs/mailinglists.html#Java User List

Re: Where Can I find Nutch war file??

2011-06-23 Thread lewis john mcgibbney
Hi, Assuming that you are using 1.2 the war file should definitely be there. You will be able to get step by step directions for this in the tutorial on the Nutch site. http://wiki.apache.org/nutch/NutchTutorial Note that this will be getting updated soon to reflect changes incorporated into

Re: helpful books or tutorials on nutch

2011-06-23 Thread lewis john mcgibbney
As this is open source I think the best way to solve your question/request is to get down and dirty with your own configuration. Many implementation scenarios are unique, to a new Nutch user this may provide no immediate helpful credentials, however it clearly displays the adaptability and

Re: Solrdedup NPE

2011-06-23 Thread lewis john mcgibbney
Hi Markus, Can you list the steps you executed prior to the solrdedup please? I think I encountered something similar a while back and as my work was moving on I didn't get a chance to investigate it fully. On Tue, Jun 21, 2011 at 1:54 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi,

Re: Building Nutch 2.0 from the trunk

2011-06-23 Thread lewis john mcgibbney
I tried to build Nutch trunk in eclipse about 2 months ago. Gora built fine and from memory it was the ivy configuration within Nutch which had to be altered. I'm positive the problems I was having have now been rectified but I haven't tried since. That is why I am interested in why JUnit

Re: Problem in search

2011-06-24 Thread lewis john mcgibbney
Hi Jefferson, I cannot access either your nutch-site or nutch-default but I see that your http.content.limit is INFO http.Http - http.content.limit = 65536 It is a fairly large page so maybe this can be the cause. I'm sorry I don't have access to my linux worktop so I can't test myself can you
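To make the point about http.content.limit concrete, here is a minimal sketch of what a byte limit does to a large page. It is not Nutch code: the function names and the made-up page are assumptions for illustration, and the only value taken from the thread is the 65536 shown in the log line. The "negative means no truncation" behaviour follows the property description in nutch-default.xml.

```python
# Illustrative only: models the effect of a content length limit on a page.
# CONTENT_LIMIT is the value from the log line above; everything else
# (function names, the sample page) is hypothetical.
CONTENT_LIMIT = 65536


def truncate(content: bytes, limit: int) -> bytes:
    """Keep at most `limit` bytes; a negative limit means no truncation."""
    if limit < 0:
        return content
    return content[:limit]


def term_survives(content: bytes, term: bytes, limit: int) -> bool:
    """Report whether `term` is still present once the limit is applied."""
    return term in truncate(content, limit)


# A made-up page: 70,000 bytes of filler, then the term we later search for.
page = b"x" * 70000 + b" needle"

if __name__ == "__main__":
    print(term_survives(page, b"needle", CONTENT_LIMIT))  # term lies past the cut
    print(term_survives(page, b"needle", -1))             # no truncation applied
```

If a term only appears beyond the limit, it never reaches the parser or the index, which would explain a search that finds the page by title but not by a word near the end.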

Apache Nutch 1.3 tutorial now on Wiki

2011-06-24 Thread lewis john mcgibbney
Hi all, With permission from the author I managed to adapt a blog entry for the above which can be found here. At this stage I would ask for anyone interested to make changes/improvements/etc. Once we can verify the integrity and accuracy of the entry it would be nice to rebuild the website with

Re: Problem in search

2011-06-24 Thread lewis john mcgibbney
Can you expand on this? I am not understanding your description of the problem. On Fri, Jun 24, 2011 at 12:52 PM, Jefferson jeff151520...@msn.com wrote: ready. Now I have another problem: digit phenomena and he returns this: - Albert Einstein - Wikipedia, the free encyclopedia Albert

Re: Problem in search

2011-06-25 Thread lewis john mcgibbney
I see within your nutch-site file that you have set an http.content.limit value of 340,671. Is there any reason for this value? I'm assuming you are not indexing this page so you can merely search for the term phenomena, and that there is other textual content within the page that you are

Nutch Gotchas as of release 1.3

2011-06-25 Thread lewis john mcgibbney
Hello list, Do we have any suggestions we wish to discuss regarding the above? thanks -- *Lewis*

Re: Empty indexes folder after crawling!

2011-06-25 Thread lewis john mcgibbney
nutch-site.xml is empty. Perhaps it means nutch uses default path as Index location. right? On Thu, Jun 23, 2011 at 10:57 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Have you set your crawl directory property value in nutch-site.xml when launching the war file on tomcat

Re: Using nutch 1.3 in Eclipse

2011-06-30 Thread lewis john mcgibbney
I will try to get a wiki entry for this sorted ASAP as it is a fundamental requirement for anyone wishing to debug/understand how classes work in Nutch 1.3, when the time comes around any opinions/comments you have would be a great addition. Thanks 2011/6/30 Nutch User - 1 nutch.use...@gmail.com

Re: Memory leak in fetcher (1.0) ?

2011-07-02 Thread lewis john mcgibbney
How many threads do you have running concurrently? Is there any log output to indicate any warnings or errors otherwise? On Sat, Jul 2, 2011 at 7:40 AM, Markus Jelsma markus.jel...@openindex.io wrote: Does it run out of memory? Is GC able to reclaim consumed heap space? Have a 300K URLs

Nutch 1.3 CommandLineOptions updated to reflect new changes

2011-07-02 Thread lewis john mcgibbney
Hi, Just finished the above, which you can find here [1] so please check out the pages if you are having trouble passing parameters to any commands. It would be great to mention if there are any mistakes or even better edit or add any missing information you think would make the documentation

Re: Problems when crawl a .nsf site

2011-07-03 Thread lewis john mcgibbney
Absolutely... There is a short (old) thread here on this topic [1], from what I can see this issue has not been addressed. Therefore it looks like implementing your own parser plugin is what's required. [1] http://www.lucidimagination.com/search/document/a8d53fac1caa578c/nutch_with_nsf_files

Re: Searching for documents with a certain boost value

2011-07-05 Thread lewis john mcgibbney
Hi, I am sorry that I have not been able to try and replicate the scenario and confirm whether I get zero scores in a similar situation as I am temporarily unable to do so but I would like to add this resource [1], if you have not seen it yet. I am aware that this doesn't address the problem

Crawling relation database

2011-07-05 Thread lewis john mcgibbney
Hi, I'm curious to hear if anyone has information for configuring Nutch to crawl an RDB such as MySQL. In my hypothetical example there are N databases residing in various distributed geographical locations, to make a worst case scenario, say that they are NOT all the same type, and I

Re: Crawling relation database

2011-07-05 Thread lewis john mcgibbney
Thanks to you both On Tue, Jul 5, 2011 at 4:35 PM, Markus Jelsma markus.jel...@openindex.io wrote: H, About geographical search: Solr will do this for you. Built-in for 3.x+ and using third-party plugins for 1.4.x. Both provide different features. In Solr it's you'd not base similarity on

Re: crawling a list of urls

2011-07-07 Thread lewis john mcgibbney
Hi C.B., This is way too vague. We really require more information regarding roughly what kind of results you wish to get. It would be a near impossible task for anyone to try and specify a solution to this open ended question. Please elaborate. Thank you On Thu, Jul 7, 2011 at 12:56 PM, Cam

Re: Problems with nutch tutorial

2011-07-07 Thread lewis john mcgibbney
Hi Paul, Please see this tutorial for working with Nutch 1.3 [1] The tutorial you were using is for Nutch 1.2 from memory. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr Thank you On Thu, Jul 7, 2011 at 1:17 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: I'm completely new

Re: crawling a list of urls

2011-07-07 Thread lewis john mcgibbney
Regards, -C.B. On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi C.B., This is way too vague. We really require more information regarding roughly what kind of results you wish to get. It would be a near impossible task for anyone to try

Re: no agents listed in 'http.agent.name'

2011-07-07 Thread lewis john mcgibbney
Hi Serenity, I don't know if you are aware but this message has been duplicated across both user@ and nutch-user@. In general, good practice for what to put in nutch-site and nutch-default can be found here [1] and here [2]. It is not required to add the properties to both of the conf files.
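As a rough illustration of why a property does not need to appear in both files, the sketch below merges two small configuration documents so that values from the site file override the defaults. This is not how Nutch loads its configuration internally; the XML snippets, the agent name "MyCrawler" and the helper names are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-ins for nutch-default.xml and
# nutch-site.xml. "MyCrawler" is an invented agent name.
DEFAULT_XML = """<configuration>
  <property><name>http.agent.name</name><value></value></property>
  <property><name>http.content.limit</name><value>65536</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>http.agent.name</name><value>MyCrawler</value></property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value") or ""
        props[name] = value
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site values override them."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)

if __name__ == "__main__":
    for name, value in sorted(config.items()):
        print(f"{name} = {value!r}")
```

Under this model, setting http.agent.name once in the site file is enough; the default file can be left untouched.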

Re: Partitioning selected urls for politeness and scoring

2011-07-08 Thread lewis john mcgibbney
Yes this would limit the number of URLs from any one domain, but it would not explain why one domain seems to get fetched more after recursive fetches of some given seed set. Can you explain more about your crawling operation? Are you executing a crawl command? If so what arguments are you

Re: How to deploy Nutch 1.3 in the web server

2011-07-08 Thread lewis john mcgibbney
The web app was deprecated when we released Nutch 1.3. This was so we could use Solr interface for searching and offload the bulk associated with the web app (amongst other things). There has been quite a lot of chat regarding this on this list over the last while. The last version of Nutch to

Re: skipping invalid segments

2011-07-08 Thread lewis john mcgibbney
Hi C.B., It looks like you may have simply missed the '-dir' when you were specifying your crawldb directory to be updated from the fetched segment. Have a look here [1] Can you please try and post your results. [1] http://wiki.apache.org/nutch/bin/nutch_updatedb On Fri, Jul 8, 2011 at 5:06

Re: Integrating Solr 3.2 with Nutch 1.3

2011-07-08 Thread lewis john mcgibbney
Hi Serenity, How did you execute the crawl? With the crawl command? Have you ensured that parsing has been done? This looks like a different IIE than others have been getting when indexing to Solr. So please ensure that parsing has been done on all fetched content. On Fri, Jul 8, 2011 at 6:20 PM,

Re: custom extractor

2011-07-08 Thread lewis john mcgibbney
Hi C.B., Your description gets slightly cloudy towards the end e.g. around One difficulty with my htmlcleaner...taken from firebug??? Are you trying to say that some of the URLs are bad HTML, you know this because it is flagged up by firebug? If this is the case are you able to edit the HTML and

Re: Building Nutch 2.0 from the trunk

2011-07-08 Thread lewis john mcgibbney
are pretty dynamic just now and there is a lot of exciting stuff in the pipeline for the near future. Thanks On Thu, Jun 23, 2011 at 11:55 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I tried to build Nutch trunk in eclipse about 2 months ago. Gora built fine and from memory

Re: Are we losing Nutch?

2011-07-10 Thread lewis john mcgibbney
Hi Carmmello, I would like to stress that I am only speaking from my own views on the way the project has been moving over the last year and a half or so but I would like to add the following points to address your quite obvious concerns. There has been a lot of correspondence on closely linked

Re: html of the crawled pages.

2011-07-10 Thread lewis john mcgibbney
Hi C.B., Can you please expand on this description? On Sun, Jul 10, 2011 at 11:52 AM, Cam Bazz camb...@gmail.com wrote: Hello All, Is there a way to save the plain htmls from the crawl? Or is this already stored in segments dir? Best Regards, -C.B. -- *Lewis*

Re: Problems with tutorial

2011-07-10 Thread lewis john mcgibbney
Hi, For a 1.3 tutorial please see here [1]. I am in the process of overhauling the nutch site to accommodate new changes as per 1.3 release. Thank you On Sun, Jul 10, 2011 at 3:42 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: I'm completely new to nutch so I downloaded version 1.3

Re: Error Network is unreachable in Nutch 1.3

2011-07-11 Thread lewis john mcgibbney
Hi, Please see this new tutorial [1] for configuring Nutch 1.3. If you are familiar/comfortable working with Solr for improvements to indexing then you will find it no problem. If you require to stick with Lucene and the web application front end then please stick with Nutch 1.2 or before. [1]

Re: Nutch Novice help

2011-07-12 Thread lewis john mcgibbney
Hi Please see this tutorial [1] for up to date 1.3 tutorial on wiki. Please try it out and take on Markus' points regarding Nutch trunk as the problems you are experiencing are usual with Trunk as it stands. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr On Mon, Jul 11, 2011 at 10:50 PM,

Re: developing nutch, either in eclipse or netbeans

2011-07-12 Thread lewis john mcgibbney
I must admit Markus that I agree with you that for making ad-hoc changes to your configuration it is usually more time efficient to use a text editor. Hi C.B. Is there any reason in particular you are interested in getting it up and working with an IDE? I had contemplated getting a revised tutorial

Re: Updating Tika in Nutch

2011-07-12 Thread lewis john mcgibbney
Hi Fernando, One point for me to mention which I did not pick up from your post. Did you rebuild your Nutch dist after making the changes to include your new parser? I know that this is a pretty simple suggestion but hopefully it might be the right one. Also can you please provide more details

Re: Nutch Gotchas as of release 1.3

2011-07-12 Thread lewis john mcgibbney
of a solr / php question than a Nutch question I think. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, July 11, 2011 3:19 PM To: user@nutch.apache.org Cc: lewis john mcgibbney Subject: Re: Nutch Gotchas as of release 1.3 Well, now i'm

Re: nutch crashes for unknown reason

2011-07-12 Thread lewis john mcgibbney
From the looks of it you need to parse all segments before attempting to index them. As Markus has pointed out, the specific segment hasn't been parsed. Try parsing as per the following link http://wiki.apache.org/nutch/bin/nutch_parse On Tue, Jul 12, 2011 at 1:50 PM, Paul van Hoven

Re: A possible solution to my URL redirection and zero scores problem

2011-07-12 Thread lewis john mcgibbney
be great to find whether there is scope to file a JIRA with this. Thank you On Tue, Jul 12, 2011 at 2:02 PM, Nutch User - 1 nutch.use...@gmail.com wrote: On 07/12/2011 03:42 PM, lewis john mcgibbney wrote: Hi, An observation is that you are using the 1.3 branch, which will now contain some

Re: running tests from the command line

2011-07-12 Thread lewis john mcgibbney
What plugin are you hacking away on? Your own custom one or one already shipped with Nutch? Just so we are reading from the same page. This, along with some further documentation for running various classes from the command line is definitely worth inclusion in the CommandLineOptions page of

Re: Nutch Novice help

2011-07-12 Thread lewis john mcgibbney
for fetching, exiting ... Looks like I am missing some key step =(. -param On 7/12/11 1:37 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, I think you are maybe getting tangled here. Please see the following tutorial for Nutch 1.3 [1] Please also note that the URL you

Re: Need help: Can't find bundle for base name org.nutch.jsp.search, locale en_US

2011-07-14 Thread lewis john mcgibbney
Assuming you're using Nutch 1.2, the web application you point to needs to be the exact name of the WAR file. In my case it was therefore always http://localhost:8080/nutch-1.2 http://localhost:8080/nutch/ Also I do not understand written Spanish (I think this is) so I can help you out with the

Re: Can we use crawled data by Nutch 0.9 in other versions of Nutch

2011-07-14 Thread lewis john mcgibbney
I think your question should be more along the lines of, is it possible to use data stored within a Lucene index in a Solr core for search? Unfortunately I am unable to answer this question, my suggestion would be to ask on solr-user@ Another option which you may wish to consider is using the

Re: Recrawling with Solr backend

2011-07-14 Thread lewis john mcgibbney
in a well tuned fashion should yield better results over time. Thanks again for the help (and apologies for the huge e-mail) Chris On 14 July 2011 10:59, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Chris, Yes a Nutch 1.3 crawl and Solr index bash script is something that has

Re: The correct tutorial on the home page?

2011-07-14 Thread lewis john mcgibbney
Hi Eric Please add any comments you wish to the new tutorial that Markus mentioned on the Wiki. I am in the process of rebuilding the Nutch site and this will be included tomorrow e.g from now on the default tutorial people are directed to from the wiki will be the RunningNutchAndSolr tutorial...

Re: what does the parse command does

2011-07-15 Thread lewis john mcgibbney
Hi C.B., Quite a few things here On Fri, Jul 15, 2011 at 5:19 PM, Cam Bazz camb...@gmail.com wrote: Hello, Finally I got a working build environment, and I am doing some modifications and playing around. Good to hear, although it is off topic can you share any hurdles you overcame with us

Re: Deploying the web application in Nutch 1.2

2011-07-15 Thread lewis john mcgibbney
)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value> </property> <property> <name>searcher.dir</name> <value>C:/Apache/apache-nutch-1.2/crawl<value> </property> </configuration> -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent

Re: Deploying the web application in Nutch 1.2

2011-07-15 Thread lewis john mcgibbney
of anything else it could be. -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Friday, July 15, 2011 3:19 PM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 Are you adding this to nutch-site within your webapp

Re: problem compiling plugin

2011-07-15 Thread lewis john mcgibbney
Hi C.B., I'm in the process of overhauling PluginCentral on the wiki and have opened a wiki page for Plugin Gotchas [1]. Would it be possible to ask you to edit and define your understanding of the problem more specifically please. There is also an interesting page here [2], which you may or may

Re: LinkRank scores

2011-07-15 Thread lewis john mcgibbney
Hi, Do we have any suggestions to demystify this? I intend to look into webgraph in more detail soon as I wish to get a much more detailed picture of its functionality for link analysis purposes. On Wed, Jul 13, 2011 at 9:25 AM, Nutch User - 1 nutch.use...@gmail.com wrote: Does anyone know how

Re: Isn't there redudant/wasteful duplication between nutch crawldb and solr index?

2011-07-16 Thread lewis john mcgibbney
Hi Gabriele, At first this seems like a plausible argument, however my question concerns what Nutch would do if we wished to change which Solr core to index to. If we removed this functionality from the crawldb there would be no way to determine what Nutch was to fetch and what it wasn't.

Re: Isn't there redudant/wasteful duplication between nutch crawldb and solr index?

2011-07-16 Thread lewis john mcgibbney
Please feel free to add this to the wiki as it is a question that will undoubtedly arise in the future. Lewis On Sat, Jul 16, 2011 at 12:37 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote: On Sat, Jul 16, 2011 at 1:29 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi

Re: running tests from the command line

2011-07-16 Thread lewis john mcgibbney
Further to this, I have been working on a JIRA ticket for this [1] If you could, can you please test. I will also shortly and hopefully we can get this committed soon. Thank you [1] https://issues.apache.org/jira/browse/NUTCH-672 On Tue, Jul 12, 2011 at 9:36 PM, lewis john mcgibbney

Extracting triples tags or hash tags from html

2011-07-17 Thread lewis john mcgibbney
Hi, Is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have thought that this would have been dealt with in Tika, however I have seen no mention of anyone having problems extracting this from web documents when fetching with Nutch or even discussing it. For example say I had

Re: Garbage with languageidentifier

2011-07-17 Thread lewis john mcgibbney
Hi Markus, I think this is a good shout, and it is not hard to understand the points you make. Quite clearly, good practice relating to the inclusion of accurate and useful language information (as well as other types of information) in HTTP headers is not a reality and it wouldn't be suitable

Re: Fetched pages has no content

2011-07-18 Thread lewis john mcgibbney
Hi, If you have a look at your regex-urlfilter.txt it will by default be rejecting ? in the URL. Please test with the line edited (or commented out) and see if the problem fades. On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask anr...@gmail.com wrote: Hi Markus! We are using a custom parser, but I
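The effect of that default rule can be sketched in a few lines. The snippet below is a simplified model of a +/- rule file, not the actual Nutch filter: it assumes each rule is a sign followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The two rules shown are a reduced rule set, and the example URLs are invented.

```python
import re

# A reduced, illustrative rule set in the style of regex-urlfilter.txt.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule text into a list of (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches nothing is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


default_rules = parse_rules(DEFAULT_RULES)
# Same rules with the query-character line commented out.
relaxed_rules = parse_rules(DEFAULT_RULES.replace("-[?*!@=]", "#-[?*!@=]"))

if __name__ == "__main__":
    url = "http://example.com/page?id=1"  # invented URL
    print(accepts(default_rules, url))
    print(accepts(relaxed_rules, url))
```

With the rejecting line in place a URL containing ? is dropped before it is ever fetched; with the line commented out the same URL falls through to the accept-all rule.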

Re: some Nutch questions

2011-07-18 Thread lewis john mcgibbney
Hi Cheng, Please see this wiki page for some references to optimization [1] I can see your problem though. I think a possible solution may be to have two seed directories, with a specifically tailored Nutch implementation ready to crawl both. This way we guarantee top results if we take site in a

Re: How to use lucene to index Nutch 1.3 data

2011-07-19 Thread lewis john mcgibbney
Hi Kelvin, I see you are posting on a couple of threads with regards to the Lucene index generated by Nutch which you correctly point out is not there. It is not possible to create a Lucene index from Nutch 1.3 anymore as all searching has been shifted to Solr therefore Nutch 1.3 has no use for a

Re: help, src modify to optimize the crawl

2011-07-20 Thread lewis john mcgibbney
I don't think this has anything to do with modifying the crawl src. It doesn't in fact have anything to do with optimization either. Try using your URLFilters e.g. regex It is important to try and understand what type of pages we can filter out from a Nutch crawl using the filters provided. HTH

Re: embedded google map in nutch query result page

2011-07-20 Thread lewis john mcgibbney
I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like you're using <= Nutch 1.2, therefore unless you are comfortable working with JSPs then I wouldn't bother with the hassle. It might be better to try and use Solr for indexing and

Re: skipping invalid segments nutch 1.3

2011-07-20 Thread lewis john mcgibbney
errors if I use 'crawl' and to prove that I do not have any faults in the conf files or the directories. I still get the errors if I use the individual commands inject, generate, fetch Cheers, Leo On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote

Re: crawling in any depth until no new pages were found

2011-07-20 Thread lewis john mcgibbney
Hi Marek, As we're talking about automating the task we're immediately looking at implementing a bash script. In the situation we have described, we wish Nutch to adopt a breadth first search (BFS) behaviour when crawling. Between us can we suggest any methods for best practice relating to BFS? As
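The loop being discussed can be sketched independently of any script. The following is a toy model, not a Nutch script: it treats each round as one breadth-first level over a made-up in-memory link graph and stops when a round discovers nothing new. It ignores refetch intervals, topN limits and URL filtering, and every name in it is an assumption for illustration.

```python
def crawl_until_exhausted(seeds, outlinks, max_rounds=None):
    """Breadth-first rounds over `outlinks` until no unseen URL remains.

    `outlinks` maps a URL to the URLs it links to. Returns the list of
    rounds, each round being the URLs fetched at that depth.
    """
    frontier = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    seen = set(frontier)
    rounds = []
    while frontier and (max_rounds is None or len(rounds) < max_rounds):
        rounds.append(frontier)
        next_frontier = []
        for url in frontier:
            for link in outlinks.get(url, []):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier  # empty list ends the loop
    return rounds


# Invented link graph: a -> b, c; b -> c, d; c -> a; d has no outlinks.
GRAPH = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

if __name__ == "__main__":
    for depth, urls in enumerate(crawl_until_exhausted(["a"], GRAPH), start=1):
        print(depth, urls)
```

A wrapper script would follow the same shape: repeat the generate, fetch and update steps, and stop once a round selects nothing to fetch. The optional max_rounds guard is there because a real crawl of the open web may never run out of new links.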

Re: Nutch not indexing full collection

2011-07-20 Thread lewis john mcgibbney
Hi Chip, I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local/NUTCH_HOME On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote: I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's

Re: embedded google map in nutch query result page

2011-07-20 Thread lewis john mcgibbney
the google map js code in solr ? Thanks again, On Wed, Jul 20, 2011 at 1:51 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like you're using <= Nutch 1.2

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread lewis john mcgibbney
: Merging segment data into db. CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01 On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote: There is no documentation for individual

Re: solr index display

2011-07-25 Thread lewis john mcgibbney
Specifically I would mention that you would get a community input if this question was directed towards the Solr user list, however I think you are looking for the velocity response writer. Have a search on the Solr wiki you will find info there. In addition there are various other well

Re: embedded google map in nutch query result page

2011-07-25 Thread lewis john mcgibbney
://evolvingweb.github.com/ajax-solr/ you gave me . But I have some questions about that. Where should I add the javascript code file ? Is it in some subdirectory in apache-solr directory? Can you explain a little bit more? Thanks, On Wed, Jul 20, 2011 at 2:28 PM, lewis john mcgibbney

Re: Storage of data between crawls

2011-07-27 Thread lewis john mcgibbney
Hi Alexander, I don't want to state the obvious here but this will depend directly on what type of loading your Nutch implementation deals with... You are correct in stating that we store data in segments, namely /crawl_fetch /content /crawl_parse /parse_data /crawl_generate /parse_text I
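Given the segment subdirectories listed above, a quick way to see which segments have not been parsed yet is to look for the parse outputs on disk. The sketch below is an illustration under assumptions, not a Nutch tool: it assumes a local segments directory with one subdirectory per segment, and it treats crawl_parse, parse_data and parse_text as the outputs of the parse step. The function name is made up.

```python
from pathlib import Path

# Subdirectories assumed to be written by the parse step.
PARSE_OUTPUTS = ("crawl_parse", "parse_data", "parse_text")


def unparsed_segments(segments_dir):
    """Return the names of segments missing any of the parse outputs."""
    missing = []
    for segment in sorted(Path(segments_dir).iterdir()):
        if not segment.is_dir():
            continue  # ignore stray files next to the segments
        if not all((segment / name).is_dir() for name in PARSE_OUTPUTS):
            missing.append(segment.name)
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_segments.py path/to/segments
    for name in unparsed_segments(sys.argv[1]):
        print("not parsed:", name)
```

Any segment reported here would need parsing before it is passed to an indexing step.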

Re: Nutch not indexing full collection

2011-07-27 Thread lewis john mcgibbney
documents. Am I misremembering that there was a total file size value somewhere in Nutch or Solr that needs to be increased? -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Wednesday, July 20, 2011 5:23 PM To: user@nutch.apache.org Subject: Re

Re: TF in wide internet crawls

2011-07-27 Thread lewis john mcgibbney
Hi Markus, I am getting you until the last parts of your comments. cope with non-edited... edited by whom? and for what purpose? To give a better relative tf score... To comment on the first part, and please ignore or correct me if I am wrong, but do we not give each page and therefore each

Re: plugin build.xml file

2011-07-27 Thread lewis john mcgibbney
Hi Cheng Li, Please experiment with this. We have been gradually getting the pluginCentral section of the wiki updated as it needed a total face lift, so would appreciate any additional input you may have for updating the writing Plugin example which is already there. Apart from being completely out

Re: Limit Nutch memory usage

2011-07-27 Thread lewis john mcgibbney
Hi Marseld, I'm just putting my thoughts out here, however Hadoop is not shipped with Nutch 1.3 anymore therefore I don't know where you would set this specific property within your Nutch instances... How are you running Hadoop, what version of Nutch, and what mode are you running Nutch in? On Tue,

Re: Storage of data between crawls

2011-07-28 Thread lewis john mcgibbney
, automatically or is there a command to do it? Thanks Chris On 27 July 2011 17:14, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alexander, I don't want to state the obvious here but this will depend directly on what type of loading your Nutch implementation deals

Re: NullPointerException when calling readdb on empty database

2011-08-03 Thread lewis john mcgibbney
Which version of Nutch are you using? Is chat a plain text file, with URLs in a list one per line? If this is the case there is no need to add it to your crawl command. Additionally, there is no point in trying to read what is happening in your crawldb if your generator log output indicates that

Re: imported to solr

2011-08-03 Thread lewis john mcgibbney
Hi Kiks, What kind of changes have you made to your schema when transferring to Solr instance? You ask about the stored parsed text content, well the default Nutch schema sets this by default to stored=false as it is not always required for all content to be stored. Generally speaking terms that

Re: New wiki page for Running Nutch 1.3 in Eclipse

2011-08-03 Thread lewis john mcgibbney
Sorry http://wiki.apache.org/nutch/RunNutchInEclipse On Wed, Aug 3, 2011 at 2:12 PM, Dr.Ibrahim A Alkharashi khara...@kacst.edu.sa wrote: thanks for the info, would you please post a pointer to the page. Regards Ibrahim On Aug 3, 2011, at 3:13 PM, lewis john mcgibbney lewis.mcgibb

Re: how to extract tf-idf

2011-08-06 Thread lewis john mcgibbney
Hi Zhanibek, I would like to refer specifically to Markus' thread which he initiated a short time ago [1] sharing close similarity to your own questions. I think the main question to be answered now is how do we extract tf-idf from a crawled website? And as we now refer to Nutch as an independent

Re: fetcher runs without error with no internet connection

2011-08-23 Thread lewis john mcgibbney
Hi Alex, Did you get anywhere with this? What condition led to you seeing the unknown host exception? Unless the segment gets corrupted, I would assume you could fetch again. Hopefully you can confirm this. On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote: Hello, After running bin/nutch fetch

Re: force recrawl

2011-08-23 Thread lewis john mcgibbney
Correct. There should be comprehensive documentation on the wiki for these parameters (and many more). On Fri, Aug 19, 2011 at 6:46 PM, Markus Jelsma markus.jel...@openindex.io wrote: addDays is not a crawl switch but a generator switch. You cannot use the crawl command. But if I use

Re: Empty LinkDB after invertlinks

2011-08-23 Thread lewis john mcgibbney
Hi, Small suggestion, but I do not see any -dir argument passed alongside your initial invertlinks command. I understand that you have multiple segment directories, which have been fetched over a recent number of days, and that the output would also suggest the process was properly executed,

Re: readdblink not showing alllinks

2011-08-23 Thread lewis john mcgibbney
If you could please post your crawldb dump then we could see the structure of your crawldb and may be able to begin pinpointing the issue. It should not be required for you to undertake another crawl after inverting links for these URLs to be indexed when calling the solrindex command... there must be

Re: How to save html source to local drive

2011-08-24 Thread lewis john mcgibbney
Hi, Can you explain how you tried to save raw html obtained during a crawl to a local drive? I am not entirely sure what you mean here and why you would want to do so given that we already have an array of alternative options available. Can you please expand on this? Thank you. On Wed, Aug 24,

Re: Recursively searching through web dirs

2011-08-24 Thread lewis john mcgibbney
Hi Adam, My initial thoughts are that you are correct. It is very unusual for your files to be located on a URL in the same domain which is not referenced by the top level or a subsequent level URL within the domain. What I would suggest is that you have a look through your hadoop.log as well

Re: Trying to understand and use URLmeta

2011-08-25 Thread lewis john mcgibbney
Hi JB, We have recently finished a complete plugin tutorial on the wiki which fully explains the functionality of the urlmeta plugin. It can be found here [1]. Could I ask you to have a thorough look at it and at the code, and if you still have questions then please raise them again. [1]

Re: Are there any tutorial for writing regex-normalize.xml?

2011-08-26 Thread lewis john mcgibbney
Apart from looking through the list archives, as far as I am aware nothing has been specifically documented on this topic. In the meantime you may find this helpful: http://geekswithblogs.net/brcraju/articles/235.aspx On Fri, Aug 26, 2011 at 9:22 AM, Kaiwii Ho kaiwi...@gmail.com wrote: I'm gonna

Re: force recrawl

2011-08-27 Thread lewis john mcgibbney
If you only wish to serve crawls to that one page, I'm sure this could easily be set up by writing a bash script specifying the -adddays argument with your commands. This could then be set and run as a cron job? Please someone correct me if I am wrong. On Fri, Aug 26, 2011 at 10:22 PM, Radim

Trying to complete index structure wiki page

2011-08-27 Thread lewis john mcgibbney
Hi, As the title suggests, I'm in the process of getting some comprehensive documentation sorted out for Nutch; this obviously starts at wiki level. I'm currently working on the IndexStructure page [1]. I would appreciate it if some of you could have a quick look and correct where you see fit. In

Re: How to generate multiple small segments w/o -numFetchers?

2011-08-28 Thread lewis john mcgibbney
Hi Gabriele, can you expand on your last comment... are you running in deploy mode? And to reply to your first point, yes you are correct, the FAQs need extensive updating. Please feel free to change anything you feel necessary, however as a matter of retaining knowledge for the legacy of Nutch

Re: a question about job failed

2011-08-29 Thread lewis john mcgibbney
Hi Zhao, Do you have any more verbose log info from hadoop.log? I have never worked with Nutch 0.9, but if you could at least indicate whether you get something like LOG: info Dedup: starting ... blah blah blah. Taking this to a larger context I am not particularly happy with the verbosity of

Re: SSHD for Nutch 1.3 in Pseudo Distributed mode

2011-08-29 Thread lewis john mcgibbney
If it complains about SSH errors then I would ensure that you are logged into your SSH client e.g. ssh -v localhost, prior to executing any hadoop scripts. This would make sense. Further to this, unless you are actually experiencing Nutch related problems on a pseudo or cluster setup then
