Re: Character encoding on Html-Pages
Hi Alex, I cannot locate the Java file you mention at org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3... Having a quick look at org.apache.nutch.parse.HTMLMetaTags (identical in both versions above), it appears that you are right: double quotes for meta http-equiv are accepted whereas 'single quotes' are not. I would be interested to see what kind of output you get when Nutch 1.2 encounters the single-quote meta syntax you highlight. Can you elaborate please... If your regex suggestion is working then I would stick with it, however this is maybe something you wish to raise in JIRA... any comments? Lewis On Tue, Jun 7, 2011 at 4:05 PM, Alex F alexander.fahlke.mailingli...@googlemail.com wrote: Hi, the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not suitable for sites using single quotes for meta http-equiv. Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'> We experienced a couple of pages with that kind of quotes and Nutch 1.2 was not able to handle them. Is there any fallback, or would it be good to use the following regex: <meta\\s+([^>]*http-equiv=("|')?content-type("|')?[^>]*)> (single or double quotes are accepted)? BR Alexander Fahlke Software Development www.informera.de -- *Lewis*
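For reference, a quick self-contained check of the two patterns. The first is the Nutch 1.2 metaPattern as I remember it; the second is Alexander's proposed widening. A sketch only, not committed code:

  import java.util.regex.Pattern;

  public class MetaQuoteCheck {
    public static void main(String[] args) {
      String html = "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
      // Nutch 1.2's metaPattern, optional double quotes only (from memory):
      Pattern current = Pattern.compile(
          "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
          Pattern.CASE_INSENSITIVE);
      // The proposed pattern, accepting single or double quotes:
      Pattern proposed = Pattern.compile(
          "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
          Pattern.CASE_INSENSITIVE);
      System.out.println("current matches:  " + current.matcher(html).find());  // false
      System.out.println("proposed matches: " + proposed.matcher(html).find()); // true
    }
  }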
Re: keeping index up to date
Hi, To add to Markus' comments, if you take a look at the script it is written in such a way that if run in safe mode it protects us against an error which may occur. If this is the case we can recover segments etc. and take appropriate actions to resolve. On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I took a look at the recrawl script and noticed that all the steps except URL injection are repeated on each subsequent indexing run, and wondered why we would generate new segments. Is it possible to do fetch, update for all previous $s1..$sn, invertlink and index steps? No, the generator generates a segment with a list of URLs for the fetcher to fetch. You can, if you like, then merge segments. Thanks. Alex. -Original Message- From: Julien Nioche lists.digitalpeb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Jun 1, 2011 12:59 am Subject: Re: keeping index up to date You should use the adaptive fetch schedule. See http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for details On 1 June 2011 07:18, alx...@aim.com wrote: Hello, I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf files which do not change over time. I wondered if there is a way of configuring nutch not to fetch unchanged documents again and again, but keep the old index for them. Thanks. Alex. -- *Lewis*
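As a sketch of what the adaptive schedule looks like in nutch-site.xml (the interval value below is illustrative, not a recommendation):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- 30 days, in seconds -->
  </property>

With this schedule, pages that come back unmodified have their fetch interval stretched over time, so those 1500 static PDFs would get fetched less and less often.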
Re: nutch NoClassDefFound
Hi, I suggest that before you try to progress any further with this you read as much of the wiki [1] as you can, in particular I would start here [2] [3]. After this, try looking through some of the source and understanding what parameters are required to run various commands. The reason for this is that from time to time it is guaranteed that we will come across log output that indicates various errors in Nutch configuration or something else... it helps considerably if you have a sound working knowledge of the processes behind Nutch's internal operation. [1] http://wiki.apache.org/nutch/ [2] http://wiki.apache.org/nutch/NutchTutorial [3] http://wiki.apache.org/nutch/FAQ On Tue, Jun 7, 2011 at 9:42 PM, abhayd ajdabhol...@hotmail.com wrote: Hi, I'm very new to Nutch and am trying to set it up on Windows using Cygwin. I downloaded http://apache.mirrors.airband.net/nutch/apache-nutch-1.2-bin.zip, so I think I don't need to build. When I try the following command I get an error. I saw a similar question posted in the forum but it was related to running Nutch from source. Any idea what could be wrong? $ echo $JAVA_HOME C:\Program Files\Java\jdk1.6.0_12\ jj@D1QJ50C1 ~/nutch-1.2/bin $ ./nutch crawl java.lang.NoClassDefFoundError: and Caused by: java.lang.ClassNotFoundException: and at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) Could not find the main class: and. Program will exit. Exception in thread main -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-NoClassDefFound-tp3036674p3036674.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
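One common culprit on Windows/Cygwin, though I can't confirm it from the output alone: whitespace in JAVA_HOME or in the path Nutch was unpacked to (C:\Program Files..., C:\Documents and Settings...) gets split by the shell when the launcher script builds the java command line, and a stray token such as "and" ends up being treated as the main class. A quick check, assuming a Cygwin shell and the 8.3 short name for Program Files:

  $ export JAVA_HOME="/cygdrive/c/Progra~1/Java/jdk1.6.0_12"
  $ echo "$JAVA_HOME"
  /cygdrive/c/Progra~1/Java/jdk1.6.0_12

Moving the nutch-1.2 directory itself to a path without spaces is worth trying for the same reason.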
Updates to Nutch Wiki
Hi everyone, Was wondering if anyone (familiar with the topics) would be interested in sending me material for the following pages [1] [2]. The links appear to be non-existent in our wiki and it would be nice to get some material on these topics if they are important and required! Although 1.3 has just been released, material on previous releases is very much welcomed. [1] http://wiki.apache.org/nutch/SearchOverMultipleIndexes [2] http://wiki.apache.org/nutch/NutchWithChineseAnalyzer In addition we've now re-arranged the wiki somewhat. Hopefully this structure will make locating specifics an easier task. Thanks -- *Lewis*
Re: searcher.dir not working
Hi abhayd, In short... yes. Although you have correctly specified an absolute path, you need to drop the /crawldb/current/part-0. A good resource for this stuff can usually be found on the mailing lists. On Wed, Jun 8, 2011 at 8:03 AM, abhayd ajdabhol...@hotmail.com wrote: Hi, I am using Nutch 1.2. I created an index using the command bin/nutch crawl urls -dir crawl. Under the crawl directory I see crawldb/current/part-0/index. I added this to nutch-site.xml under the Tomcat installation as the searcher.dir property value /home/user1/nutch-1.2/crawl/crawldb/current/part-0 I am getting the message index is not a directory. Am I doing something wrong? Any help? -- View this message in context: http://lucene.472066.n3.nabble.com/searcher-dir-not-working-tp3038087p3038087.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
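In other words, a sketch of the property as I'd expect it in the Tomcat nutch-site.xml (path taken from the message above):

  <property>
    <name>searcher.dir</name>
    <value>/home/user1/nutch-1.2/crawl</value>
  </property>

searcher.dir should point at the top-level crawl directory containing crawldb, segments and the index, not at an individual part file.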
Re: bin folder missing in 1.3 release
We are a bit thin on supporting documentation for the new release at the moment but are actively working towards producing this. Hopefully once we have something contributed to the wiki the differences in configuration and functionality within release 1.3 will be fully explained. On Thu, Jun 9, 2011 at 11:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Here it is: http://svn.apache.org/viewvc/nutch/tags/release-1.3/src/bin/ It's being copied over when building with ant. As said in the subject, I can't find the bin folder in the 1.3 release. Is that intentional? Thanks -- *Lewis*
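For anyone hitting the same thing with the 1.3 source distribution, the build should materialise the scripts. A sketch (from memory, the runtime target creates runtime/local and runtime/deploy):

  $ ant runtime
  $ ls runtime/local/bin
  nutch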
Re: No Urls to fetch
Hi Adelaida, Assuming that you have been able to successfully crawl the top level domain http://elcorreo.com, e.g. that you have been able to crawl and create an index, at least we know that your configuration options are OK. I assume that you are using 1.2... can you confirm? What does the rest of your crawl-urlfilter.txt look like? Have you been setting any properties in nutch-site.xml which might alter Nutch behaviour? I am not perfect with the syntax for creating filter rules in crawl-urlfilter... can someone confirm that this is correct? (See the sketch after this message.) On Mon, Jun 13, 2011 at 12:10 PM, Adelaida Lejarazu alejar...@gmail.com wrote: Hello, I´m new to Nutch and I´m doing some tests to see how it works. I want to do some crawling of a digital newspaper webpage. To do so, I put in the urls directory, where I have my seed list, the URL I want to crawl, that is: * http://elcorreo.com* The thing is that I don´t want to crawl all the news on the site but only those of the current day, so I put a filter in the *crawl-urlfilter.txt* (for the moment I´m using the *crawl* command). The filter I put is: +^http://www.elcorreo.com/.*?/20110613/.*?.html A correct URL would be for example, http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html so I think the regular expression is correct, but Nutch doesn´t crawl anything. It says that there are *No Urls to Fetch - check your seed list and URL filters.* Am I missing something? Thanks, -- *Lewis*
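One plausible cause, though an assumption since we haven't seen the whole file: the seed http://elcorreo.com itself does not match that rule, so if the file ends with the usual "-." catch-all, the injector drops the seed and there is nothing to fetch. A sketch of a filter that also admits the front page so the crawl can start (it assumes the day's articles are linked from the front page within the crawl depth):

  # the day's article pages
  +^http://www\.elcorreo\.com/.*?/20110613/.*?\.html
  # the front page itself, so the seed survives injection
  +^http://(www\.)?elcorreo\.com/$
  # skip everything else
  -.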
Re: Injecting urls through code instead of file
Hi, Can you provide a use case? The reason I ask is that I can only assume that you would be hacking some code to inject your urls from some other URL store? On Tue, Jun 14, 2011 at 5:18 PM, shanWDC ssar...@web.com wrote: Is there a way to inject urls in the injector, through code, rather than specifying a url file? -- View this message in context: http://lucene.472066.n3.nabble.com/Injecting-urls-through-code-instead-of-file-tp3063662p3063662.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
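That said, the Injector can be driven from Java directly. A minimal sketch against the Nutch 1.2/1.3 API (note that inject() still reads seed URLs from text files under a directory, so a URL store would first have to be written out to, say, a temp dir; the class name and paths here are hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.nutch.crawl.Injector;
  import org.apache.nutch.util.NutchConfiguration;

  public class ProgrammaticInject {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      Injector injector = new Injector(conf);
      // write your URLs to text files under /tmp/seeds before this call
      injector.inject(new Path("crawl/crawldb"), new Path("/tmp/seeds"));
    }
  }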
Re: Problem with Nutch Search
Off the top of my head one property springs to mind, which you may or may not have configured in nutch-site: http.content.limit. However I think that this is not the source of the problem. I would advise you to have a look at your hadoop log file for any obvious warnings... how do you know it indexes about 50 lines and after that does not sweep over the rest of the text? Have you looked at a dump of the crawldb to see what content the database is aware of? Without verifying answers to some of the above it is hard to decouple the errors in nutch from the legacy architecture of Nutch 1.3. On Thu, Jun 16, 2011 at 3:03 PM, Jefferson jeff151520...@msn.com wrote: Hi, I'm testing Nutch. I followed the Nutch tutorial, but I found a problem. I ran the command bin/nutch crawl over 6 sites in plain text that each contain only about 400 lines of text; so far so normal. When I do a search with Nutch, it sweeps up about 50 lines; after that it does not sweep over the rest of the text. If I search, for example, for church and this word is beyond the first 50 lines of text, it returns 0 results. Anyone have any solution for this? -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-Nutch-Search-tp3072077p3072077.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
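For the crawldb dump mentioned above, something like this (paths assumed to match the crawl directory used above):

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -dump crawldb-dump

-stats gives a quick summary of how many URLs were fetched; -dump writes the per-URL records out as text so you can see exactly what the database is aware of.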
Re: I need step-by-step tutorial to run Nutch 1.2 from source code
Hi Mohammad, Try looking at the pre nutch 1.3 material on the wiki, I'm sure there must be something in there you can build on... or that will at least point you in the right direction http://wiki.apache.org/nutch/Archive%20and%20Legacy HTH On Fri, Jun 17, 2011 at 9:27 PM, Mohammad Hassan Pandi pandi...@gmail.comwrote: Hi everybody! I have already installed Hadoop 0.20.2 on a two-node cluster and I want to run Nutch 1.2 source code just to have a feeling of how it works. I need a step-by-step tutorial to do that. -- *Lewis*
Re: Empty indexes folder after crawling!
Have you set your crawl directory property value in nutch-site.xml when launching the war file on Tomcat? On Tue, Jun 21, 2011 at 4:01 AM, Mohammad Hassan Pandi pandi...@gmail.com wrote: Following http://wiki.apache.org/nutch/NutchHadoopTutorial I crawled lucene.apache.org with the command bin/nutch crawl urlsdir -dir crawl -depth 3 and copied the whole thing to the local file system by running the command bin/hadoop dfs -copyToLocal crawl /d01/local/ but the indexes folder is empty. This causes no results when searching for a query in the Nutch UI -- *Lewis*
Re: how to classify the search results by an indexed field with lucene?
To give a short answer to your question: the answer is I don't know. Many of us are not using Lucene as the indexing mechanism. I think as this is specifically linked to Lucene you would be better off asking there. Try the user list http://lucene.apache.org/java/docs/mailinglists.html#Java User List On Tue, Jun 21, 2011 at 6:58 AM, Joey majunj...@gmail.com wrote: Hi all, Has anyone ever encountered this problem before? Looking forward to your reply. :-) Thanks. Regards, Joey On 06/20/2011 02:09 PM, Joey Ma wrote: Hi all, I use Lucene as the indexer in nutch 1.2. I want to get classified search results by an indexed field, for example to show the hit count distributions over the different months of a year. I found that in Lucene 2.* this could be achieved with the QueryFilter().bits(IndexReader) method, calculating the hit count for each category. But in Lucene 3.*, the QueryFilter class has been removed and I couldn't find the equivalent of that method. Could anyone tell me how to achieve this effectively? Thanks very much. Regards, Joey -- *Lewis*
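For what it's worth, a rough Lucene 3.x equivalent of the old QueryFilter().bits() counting approach might look like this (a sketch, untested against your index):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.DocIdSet;
  import org.apache.lucene.search.DocIdSetIterator;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.QueryWrapperFilter;

  // Count the hits for one category (e.g. one month's date range).
  static int countHits(IndexReader reader, Query categoryQuery) throws Exception {
    DocIdSet docs = new QueryWrapperFilter(categoryQuery).getDocIdSet(reader);
    DocIdSetIterator it = (docs == null) ? null : docs.iterator();
    int count = 0;
    if (it != null) {
      while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
      }
    }
    return count;
  }

Calling this once per month bucket gives the distribution; Solr's faceting does the same thing out of the box, which is another reason many of us have moved in that direction.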
Re: Where Can I find Nutch war file??
Hi, Assuming that you are using 1.2 the war file should definitely be there. You will be able to get step by step directions for this in the tutorial on the Nutch site. http://wiki.apache.org/nutch/NutchTutorial Note that this will be getting updated soon to reflect changes incorporated into release 1.3, therefore search and indexing will not be covered under the legacy Lucene architecture and there will be no WAR file to locate if using 1.3. On Tue, Jun 21, 2011 at 12:06 AM, Mohammad Hassan Pandi pandi...@gmail.com wrote: Thanks for your response. I got nutch-2010-07-07_04-49-04.tar.gz, extracted and opened up the directory in Eclipse and ran build.xml. There are several tasks in build.xml such as init, compile, compile-core. The tutorial I followed (http://wiki.apache.org/nutch/NutchHadoopTutorial) says choose the job (the default task) and package tasks. I chose them and ran but no war file is created. On Tue, Jun 21, 2011 at 11:12 AM, Hasan Diwan hasan.di...@gmail.com wrote: You'll need to build it yourself -- try $ANT_HOME/bin/ant war or %ANT_HOME%\bin\ant war. Let me know how you get on... On 20 June 2011 23:19, Mohammad Hassan Pandi pandi...@gmail.com wrote: Hi guys, there is no war file in the build folder of my nutch. Where can I find the nutch war file to deploy on tomcat? -- Sent from my mobile device -- *Lewis*
Re: helpful books or tutorials on nutch
As this is open source I think the best way to solve your question/request is to get down and dirty with your own configuration. Many implementation scenarios are unique; to a new Nutch user this may provide no immediate help, however it clearly displays the adaptability and extensibility of the Nutch framework, and this can only be learned and understood by adhering to the suggestions made above. There are various books out there which include commentary on Nutch but none that will give you a one-stop shop for all answers. I have yet to find one which fully documents a real world scenario... With regards to Luke you may be best off asking on the Google group you highlighted, however has anyone else tried this out and can confirm? On Tue, Jun 21, 2011 at 9:30 AM, Shouguo Li the1plum...@gmail.com wrote: hey guys, i know this question has been asked several times on this mailing list but i didn't see good answers in the archive. are there any books or online tutorials that walk you through nutch with a couple of real world scenarios? there are several wiki pages on nutch.apache.org, but they're too brief, and somewhat out of date. also, i tried out nutch 1.3 and solr, but i can't browse the index using the latest luke tool even though luke says it's compatible with lucene 3 now, http://code.google.com/p/luke/downloads/detail?name=lukeall-3.1.0.jar&can=2&q= thx! -- *Lewis*
Re: Solrdedup NPE
Hi Markus, Can you list the steps you executed prior to the solrdedup please? I think I encountered something similar a while back and as my work was moving on I didn't get a chance to investigate it fully. On Tue, Jun 21, 2011 at 1:54 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Any idea what the exception below can result from? The dedup queries go all right and produce normal results. Some indices will not generate this NPE. Cheers, 11/06/21 20:47:37 WARN mapred.LocalJobRunner: job_local_0001 java.lang.NullPointerException at org.apache.hadoop.io.Text.encode(Text.java:388) at org.apache.hadoop.io.Text.set(Text.java:178) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) 11/06/21 20:47:37 INFO mapred.JobClient: map 0% reduce 0% 11/06/21 20:47:37 INFO mapred.JobClient: Job complete: job_local_0001 11/06/21 20:47:37 INFO mapred.JobClient: Counters: 0 Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:363) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:375) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:380) -- *Lewis*
Re: Building Nutch 2.0 from the trunk
I tried to build Nutch trunk in Eclipse about two months ago. Gora built fine and from memory it was the ivy configuration within Nutch which had to be altered. I'm positive the problems I was having have now been rectified but I haven't tried since. That is why I am interested in why the JUnit tests failed, as I thought the only problem with the build was with my Gora dependency. Sorry this is off topic. To relate to the original question: have you been able to build Nutch trunk using Markus' comments above? On Thu, Jun 23, 2011 at 3:28 PM, Markus Jelsma markus.jel...@openindex.io wrote: You can safely build Nutch trunk with Gora 1089728. I can also build the current Nutch and Gora trunks. What error do you get? Hi, I think this is your second thread on this topic? I tried to get trunk to build but was unable, as there are problems with Gora, as Julien highlighted to me some time ago. My first question is: did you get trunk to build following the tutorial you have highlighted? The problem I was having was with Gora, not with any JUnit tests. Can you please expand on your actions a bit. Thanks On Wed, Jun 22, 2011 at 4:50 AM, Nutch User - 1 nutch.use...@gmail.com wrote: Could someone give me step-by-step instructions on how to build Nutch 2.0 from the trunk and run it? I tried to follow this (http://techvineyard.blogspot.com/2010/12/build-nutch-20.html), but failed to do so as described here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html). -- *Lewis*
Re: Problem in search
Hi Jefferson, I cannot access either your nutch-site or nutch-default but I see that your http.content.limit is INFO http.Http - http.content.limit = 65536 It is a fairly large page so maybe this can be the cause. I'm sorry, I don't have access to my Linux workstation so I can't test myself; can you please advise if this has been accounted for in your nutch-site. Anything over the default 65536 limit is truncated, therefore you may not be able to search for it. Further to this, it seems that the hadoop.log does not show any erratic behaviour. On Fri, Jun 24, 2011 at 7:40 AM, Jefferson jeff151520...@msn.com wrote: My problem is in the search. I crawled the site http://en.wikipedia.org/wiki/Albert_Einstein When I access http://localhost:8080/nutch-1.1/ and type Adolf Hitler it returns me a result, OK. When I type phenomena it returns 0 results, not OK. Attached are my config files and logging. thanks http://lucene.472066.n3.nabble.com/file/n3104461/nutch-site.xml nutch-site.xml http://lucene.472066.n3.nabble.com/file/n3104461/nutch-default.xml nutch-default.xml http://lucene.472066.n3.nabble.com/file/n3104461/hadoop.log hadoop.log http://lucene.472066.n3.nabble.com/file/n3104461/crawl.log crawl.log -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3104461.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Apache Nutch 1.3 tutorial now on Wiki
Hi all, With permission from the author I managed to adapt a blog entry for the above which can be found here. At this stage I would ask for anyone interested to make changes/improvements/etc. Once we can verify the integrity and accuracy of the entry it would be nice to rebuild the website with this tutorial as the most recent tutorial resource for getting Nutch 1.3 up and running. Thank you -- *Lewis*
Re: Problem in search
Can you expand on this? I am not understanding your description of the problem. On Fri, Jun 24, 2011 at 12:52 PM, Jefferson jeff151520...@msn.com wrote: Done. Now I have another problem: I type phenomena and it returns this: - Albert Einstein - Wikipedia, the free encyclopedia Albert Einstein From Wikipedia, the free encyclopedia Jump ... - What might be happening? Thanks for the help. Below are my configuration files: http://lucene.472066.n3.nabble.com/file/n3105976/nutch-default.txt nutch-default.txt http://lucene.472066.n3.nabble.com/file/n3105976/nutch-site.txt nutch-site.txt -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3105976.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: Problem in search
I see within your nutch-site file that you have set an http.content.limit value of 340,671. Is there any reason for this value? I'm assuming you are not indexing this page merely so you can search for the term phenomena, and that there is other textual content within the page that you are interested in... would this assumption be right? As Markus explained, the page has an HTTP content length of some 600,000, and from looking at where the first occurrence of the term phenomena is, it is located roughly halfway through the page. When crawling large sites such as Wikipedia (which we all know contains large HTTP content within its webpages), I have found that a safeguard measure to ensure we get all page content is to set http.content.limit to a negative value, e.g. -1. This way we are guaranteed that we get all page content. Another useful tool which is widely used is LUKE [1]; this will enable you to search your Lucene index and confirm whether or not Nutch has fetched and sent the content you wish to be stored within your index. [1] http://code.google.com/p/luke/ On Sat, Jun 25, 2011 at 7:42 AM, Jefferson jeff151520...@msn.com wrote: The problem is that it returns the beginning of the text section of the website. The correct behaviour would be to return the passage in which the word phenomena is found. Sorry for my English... Jefferson -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3107810.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
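For reference, the nutch-site.xml fragment for the safeguard described above would look something like this (a sketch; -1 disables truncation entirely, so watch your segment sizes):

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>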
Nutch Gotchas as of release 1.3
Hello list, Do we have any suggestions we wish to discuss regarding the above? thanks -- *Lewis*
Re: Empty indexes folder after crawling!
Try reading the tutorial on the wiki for the 1.3 release. It gives step-by-step stages for crawling and indexing, then setting up the Nutch WAR in Tomcat and searching. You can find it under the archives section of the Nutch wiki. On Sat, Jun 25, 2011 at 9:12 PM, Mohammad Hassan Pandi pandi...@gmail.com wrote: My nutch-site.xml is empty. Perhaps that means Nutch uses the default path as the index location, right? On Thu, Jun 23, 2011 at 10:57 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Have you set your crawl directory property value in nutch-site.xml when launching the war file on Tomcat? On Tue, Jun 21, 2011 at 4:01 AM, Mohammad Hassan Pandi pandi...@gmail.com wrote: Following http://wiki.apache.org/nutch/NutchHadoopTutorial I crawled lucene.apache.org with the command bin/nutch crawl urlsdir -dir crawl -depth 3 and copied the whole thing to the local file system by running the command bin/hadoop dfs -copyToLocal crawl /d01/local/ but the indexes folder is empty. This causes no results when searching for a query in the Nutch UI -- *Lewis* -- *Lewis*
Re: Using nutch 1.3 in Eclipse
I will try to get a wiki entry for this sorted ASAP as it is a fundamental requirement for anyone wishing to debug/understand how classes work in Nutch 1.3. When the time comes around, any opinions/comments you have would be a great addition. Thanks 2011/6/30 Nutch User - 1 nutch.use...@gmail.com On 01/01/2011 08:52 AM, jeffersonzhou wrote: When you new a Java project for Nutch 1.3, what default location did you use? The folder where you unzipped the software or the runtime/local? -Original Message- From: Nutch User - 1 [mailto:nutch.use...@gmail.com] Sent: Thursday, June 30, 2011 2:43 PM To: user@nutch.apache.org Subject: Re: Using nutch 1.3 in Eclipse On 06/30/2011 02:00 AM, dyzc wrote: Hi, is there any information regarding working with nutch 1.3 in Eclipse? Thanks I got it working with the help of this (http://wiki.apache.org/nutch/RunNutchInEclipse1.0). However, I have had serious difficulties with the trunk of 2.0 and Eclipse as described here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html). I did as the tutorial suggested: Select Create project from existing source and use the location where you downloaded Nutch. I may have copied some .jar files from runtime/local/lib to lib, or else Ivy obtained them when running the Ant build from Eclipse. -- *Lewis*
Re: Memory leak in fetcher (1.0) ?
How many threads do you have running concurrently? Is there any log output to indicate any warnings or errors otherwise? On Sat, Jul 2, 2011 at 7:40 AM, Markus Jelsma markus.jel...@openindex.io wrote: Does it run out of memory? Is GC able to reclaim consumed heap space? I have a 300K-URL segment to fetch (no parsing) and I see memory continuously growing... looking like a memory leak. I have patches 769, 770 installed, and did not see any other patches related to memory leaks. -- *Lewis*
Nutch 1.3 CommandLineOptions updated to reflect new changes
Hi, Just finished the above, which you can find here [1] so please check out the pages if you are having trouble passing parameters to any commands. It would be great to mention if there are any mistakes or even better edit or add any missing information you think would make the documentation clearer. Also you will see that there is a section at the bottom of the page subtitled 'other classes', feel free to add any classes you have been using which we have not already included. Thanks [1] http://wiki.apache.org/nutch/CommandLineOptions -- *Lewis*
Re: Problems when crawling a .nsf site
Absolutely... There is a short (old) thread here on this topic [1]; from what I can see this issue has not been addressed. Therefore it looks like implementing your own parser plugin is what's required. [1] http://www.lucidimagination.com/search/document/a8d53fac1caa578c/nutch_with_nsf_files 2011/7/3 Alexander Aristov alexander.aris...@gmail.com Hi, If it is a text file then you can simply associate the extension with the text parser. But if I understand you right it's a Lotus DB file, in which case I suspect you have no other choice than implementing your own parser. I haven't heard of Lotus file support in nutch. Best Regards Alexander Aristov 2011/7/3 丛云牙之主 yanhaora...@qq.com Hello, I am using nutch-1.2 and have encountered a problem. The site is written with Lotus Domino; when I enter it with the browser and click on the links, the site URL does not change, unlike some sites which have a lot of suffixes. Then there is a web site, buptoa.bupt.edu.cn/student_broad.nsf, which I wanted to crawl, but it takes .nsf files and nutch does not support crawling .nsf files. Should I write my own plugin or should I solve this problem from the other side? Extremely grateful for your help -- *Lewis*
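If the server does hand the .nsf resource back with a resolvable content type, Alexander's first suggestion could be wired up in conf/parse-plugins.xml roughly like this (the mime type name is a guess at what the server actually sends; check the fetch logs first):

  <mimeType name="application/x-lotus-notes">
    <plugin id="parse-text" />
  </mimeType>

If the content is really a binary Notes database rather than text, this mapping won't produce anything useful and a dedicated parser plugin is the way to go.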
Re: Searching for documents with a certain boost value
Hi, I am sorry that I have not been able to try and replicate the scenario and confirm whether I get zero scores in a similar situation as I am temporarily unable to do so but I would like to add this resource [1], if you have not seen it yet. I am aware that this doesn't address the problem directly but if we can start thinking more about the way scoring is done then maybe we can get further to uncovering the solution to finding whether or not we can search for fields within our document or documents within our index which have a boost value of zero. Obviously the reference I include is relevant specifically to Nutch versions using Lucene however I'm hoping that as we are referring to scoring done by the OPIC filter that the outcome will be consistent across versions including those which do not use legacy Lucene. Can someone please correct me if I am wrong here... Focussing specifically on your question, it appears that a document field is not shown if a term was not found in a particular field e.g. there is no score value given. This would suggest that we cannot query for it, therefore my gut instinct is that we cannot query for a zero value present within these fields. N.B I cannot confirm this, I am merely going on the little research I have done into the OPIC scoring algorithm. It would be nice if someone could confirm otherwise and correct me though. [1] http://wiki.apache.org/nutch/FAQ#How_is_scoring_done_in_Nutch.3F_.28Or.2C_explain_the_.22explain.22_page.3F.29 -- Forwarded message -- From: Nutch User - 1 nutch.use...@gmail.com Date: Mon, Jul 4, 2011 at 12:43 AM Subject: Searching for documents with a certain boost value To: user@nutch.apache.org Hi. As I have described here ( http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html ) I have encountered a situation where some of my indexed documents have zero boost value. I'd like to know if there's a way to search which ones have zero as their boost value. I have tried to do a Lucene query with Luke but it failed. The query was: boost:00 00 00 00. (The boost field seems to be a binary one, so it may have something to do with the problem.) I allowed leading * in wildcard queries, and url:* returned me every document as it should. However, boost:* returned none. Can this boost field even be used as a search criteria? Best regards, Nutch User - 1 -- *Lewis*
Crawling a relational database
Hi, I'm curious to hear if anyone has information for configuring Nutch to crawl an RDB such as MySQL. In my hypothetical example there are N databases residing in various distributed geographical locations; to make a worst case scenario, say that they are NOT all the same type, and I wish to use Nutch trunk 2.0 to push the results to some other structured data store which I can then connect to to serve search results. Does anyone have any information such as an overview of database crawling and serving using Nutch? I have been unsuccessful obtaining info on the Web as query results are ambiguous and usually refer to crawldb or linkdb. If I can get this it would be a really nice entry for inclusion in our wiki. Thanks for any suggestions or info. -- *Lewis*
Re: Crawling a relational database
thanks to you both On Tue, Jul 5, 2011 at 4:35 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, About geographical search: Solr will do this for you. Built-in for 3.x+ and using third-party plugins for 1.4.x. Both provide different features. In Solr you'd not base similarity on geographical data but use spatial data to boost textually similar documents instead, or filter. This keeps text similarity intact and offers spatial features on top. You'll get more feedback on the Solr list indeed :) Cheers Thanks for this Markus, it had occurred to me that DIH was a very plausible solution to progress with. I think you have just confirmed it due to the flexibility it offers amongst other attributes. I'm looking at creating a context-aware web application which would use geographical search to obtain results based on location. This is required as the data will contain (amongst others) fields with integer values which vary dependent upon a building location cost index. Similarity is directly linked through a geographical location factor. I wanted to have the data stored within the N distributed RDBs available in a cloud environment which could be searched, as opposed to the non-trivial task of searching across a fragmented, distributed number of DBs. As you mention, it does make more sense to save documents in a doc (or column) oriented DB. Essentially, using the DIH tool would remove the requirement for Nutch? I think to progress with this, I'm best moving the thread to solr-user@ if I have further questions. Thank you On Tue, Jul 5, 2011 at 3:53 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Lewis, It sounds to me you'd be better off using Solr's very advanced DataImportHandler [1]. It can (delta) import data from your RDBMSs and offers much flexibility in how to transform entities. Besides crawling you also mention you'd like to push results (of what) to another structured data store. But why would you want that? Handling, processing and serving search results is done by Solr (and ES in the future) and since our entities are flat (just a document) it makes more sense to me to save documents in a document (or column) oriented DB. [1] http://wiki.apache.org/solr/DataImportHandler Cheers, Hi, I'm curious to hear if anyone has information for configuring Nutch to crawl an RDB such as MySQL. In my hypothetical example there are N databases residing in various distributed geographical locations; to make a worst case scenario, say that they are NOT all the same type, and I wish to use Nutch trunk 2.0 to push the results to some other structured data store which I can then connect to to serve search results. Does anyone have any information such as an overview of database crawling and serving using Nutch? I have been unsuccessful obtaining info on the Web as query results are ambiguous and usually refer to crawldb or linkdb. If I can get this it would be a really nice entry for inclusion in our wiki. Thanks for any suggestions or info. -- *Lewis*
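To make Markus' DIH suggestion concrete, a minimal data-config.xml sketch (driver, connection details, table and field names are all hypothetical):

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost/buildings"
                user="reader" password="secret"/>
    <document>
      <entity name="building"
              query="SELECT id, name, cost_index, lat, lon FROM building">
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="cost_index" name="cost_index"/>
        <field column="lat" name="lat"/>
        <field column="lon" name="lon"/>
      </entity>
    </document>
  </dataConfig>

Each entity can point at a different data source, which is how the "N databases of different types" scenario would be handled, and delta queries cover incremental imports.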
Re: crawling a list of urls
Hi C.B., This is way too vague. We really require more information regarding roughly what kind of results you wish to get. It would be a near impossible task for anyone to try and specify a solution to this open-ended question. Please elaborate. Thank you On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz camb...@gmail.com wrote: Hello, I have a case where I need to crawl a list of exact URLs, somewhere in the range of 1 to 1.5M URLs. I have written those urls in numerous files under /home/urls, i.e. /home/urls/1 /home/urls/2 Then by using the crawl command I am crawling to depth=1 Are there any recommendations or general guidelines that I should follow when making nutch just fetch and index a list of urls? Best Regards, C.B. -- *Lewis*
Re: Problems with nutch tutorial
Hi Paul, Please see this tutorial for working with Nutch 1.3 [1]. The tutorial you were using is for Nutch 1.2 from memory. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr Thank you On Thu, Jul 7, 2011 at 1:17 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: I'm completely new to nutch so I downloaded version 1.3 and worked through the beginners tutorial at http://wiki.apache.org/nutch/NutchTutorial. The first problem was that I did not find the file conf/crawl-urlfilter.txt so I omitted that and continued with launching nutch. Therefore I created a plain text file in /Users/toom/Downloads/nutch-1.3/crawled called urls.txt which contains the following text: tom:crawled toom$ cat urls.txt http://nutch.apache.org/ So after that I invoked nutch by calling tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50 solrUrl is not set, indexing will be skipped... crawl started in: /Users/toom/Downloads/nutch-1.3/sites rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled threads = 10 depth = 3 solrUrl=null topN = 50 Injector: starting at 2011-07-07 14:02:31 Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-07-07 14:02:35, elapsed: 00:00:03 Generator: starting at 2011-07-07 14:02:35 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 50 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238 Generator: finished at 2011-07-07 14:02:39, elapsed: 00:00:04 Fetcher: No agents listed in 'http.agent.name' property. Exception in thread main java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property. at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166) at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068) at org.apache.nutch.crawl.Crawl.run(Crawl.java:135) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:54) I do not understand what happened here, maybe one of you can help me? -- *Lewis*
Re: crawling a list of urls
See comments below. On Thu, Jul 7, 2011 at 4:31 PM, Cam Bazz camb...@gmail.com wrote: Hello Lewis, Pardon me for the non-verbose description. I have a set of URLs, namely product URLs, in the range of millions. Firstly (this is just a suggestion), I assume that you wish Nutch to fetch the full page content; ensure that http.content.limit is set to an appropriate limit to allow this. So I want to write my urls in a flat file, and have nutch crawl them to depth = 1. As you describe, you have various seed directories, so I assume that crawling a large set of seeds will be a recursive task. IMHO I would save myself the manual task of running the jobs and write a bash script to do this for me (a minimal skeleton is sketched after this message); this will also enable you to schedule a once-a-day update of your crawldb, linkdb, Solr index and so forth. There are plenty of scripts which have been tested and used throughout the community here http://wiki.apache.org/nutch/Archive%20and%20Legacy#Script_Administration However, I might remove URLs from this list, or add new ones. I would also like nutch to revisit each site every day. Check out nutch-site for crawldb fetch intervals; these values can be used to accommodate the dynamism of various pages. Once you have removed URLs (this is going to be a laborious and extremely tedious task if done manually), you would simply run your script again. I would like removed URLs to be deleted, and new ones to be reinjected each time nutch starts. With regards to deleting URLs in your crawldb (and subsequently index) I am not sure of this exactly. Can you justify completely deleting the URLs from the data store? What happens if you add the URL in again the next day? I'm not sure this is a sustainable method for maintaining your data store/index. Best Regards, -C.B. On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi C.B., This is way too vague. We really require more information regarding roughly what kind of results you wish to get. It would be a near impossible task for anyone to try and specify a solution to this open-ended question. Please elaborate. Thank you On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz camb...@gmail.com wrote: Hello, I have a case where I need to crawl a list of exact URLs, somewhere in the range of 1 to 1.5M URLs. I have written those urls in numerous files under /home/urls, i.e. /home/urls/1 /home/urls/2 Then by using the crawl command I am crawling to depth=1 Are there any recommendations or general guidelines that I should follow when making nutch just fetch and index a list of urls? Best Regards, C.B. -- *Lewis* -- *Lewis*
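For what it's worth, a minimal skeleton of such a script (Nutch 1.2/1.3 local mode assumed; the paths are the ones from the earlier message):

  #!/bin/bash
  # depth-1 recrawl: inject, generate one segment, fetch, parse, update
  CRAWLDB=/home/crawl/crawldb
  SEGMENTS=/home/crawl/segments
  bin/nutch inject $CRAWLDB /home/urls
  bin/nutch generate $CRAWLDB $SEGMENTS
  SEGMENT=$(ls -d $SEGMENTS/* | tail -1)   # newest segment just generated
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT

Run it from cron once a day; add solrindex (or the legacy index step) at the end as needed.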
Re: no agents listed in 'http.agent.name'
Hi Serenity, I don't know if you are aware but this message has been duplicated across both user@ and nutch-user@. General good practice for what to put in nutch-site and nutch-default can be found here [1] and here [2]; it is not required to add the properties to both of the conf files. To address your problem specifically, it should be pretty straightforward to implement this in nutch-site.xml; try copying over properties one by one and gradually building up a picture of where the discrepancy may be. [1] http://wiki.apache.org/nutch/FAQ#I_have_two_XML_files.2C_nutch-default.xml_and_nutch-site.xml.2C_why.3F [2] http://wiki.apache.org/nutch/NutchConfigurationFiles On Thu, Jul 7, 2011 at 4:45 PM, serenity serenitykenings...@gmail.com wrote: Hello Friends, I am experiencing the error message fetcher: no agents listed in 'http.agent.name' property when I am trying to crawl with Nutch 1.3. I referred to other mails regarding the same error message and tried to change the nutch-default.xml and nutch-site.xml file details with <property> <name>http.agent.name</name> <value>My Nutch Spider</value> <description>EMPTY</description> </property> I also filled out the other property details without leaving blanks and am still getting the same error. May I know my mistake? Thanks, Serenity -- View this message in context: http://lucene.472066.n3.nabble.com/no-agents-listed-in-http-agent-name-tp3148609p3148609.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: Partitioning selected urls for politeness and scoring
Yes this would limit the number of URLs from any one domain, but it would not explain why one domain seems to get fetched more after recursive fetches of some given seed set. Can you explain more about your crawling operation? Are you executing a crawl command? If so, what arguments are you passing? If not, can you give more detail of the job you are running. Thank you On Fri, Jul 8, 2011 at 2:50 PM, Hannes Carl Meyer hannesc...@googlemail.com wrote: Hi, you could set generate.max.per.host to a reasonable size to prevent this! On a default configuration this is set to -1 which means unlimited. BR Hannes --- Hannes Carl Meyer www.informera.de On Fri, Jul 8, 2011 at 2:53 PM, Eggebrecht, Thomas (GfK Marktforschung) thomas.eggebre...@gfk.com wrote: Hi list, My seed list contains URLs from about 20 different domains. In the first fetch cycles everything is all right and all domains will be selected quite equally distributed. But after about 10-15 cycles one domain starts to prevail. URLs from all other domains will not be selected anymore. It seems that URLs from that certain domain have the highest scoring and URLs from other domains don't have a chance anymore. Is this a right assumption? I'm not very happy because I would like to fetch URLs from all domains in each cycle. What would you do in that case? Best regards and thanks for answers Thomas (Using nutch-1.2) -- *Lewis*
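For reference, Hannes' suggestion as a nutch-site.xml fragment (the value 100 is illustrative only):

  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>

With this set, each generate cycle selects at most 100 URLs per host, so no single domain can monopolise the fetch list.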
Re: How to deploy Nutch 1.3 in the web server
The web app was deprecated when we released Nutch 1.3. This was so we could use the Solr interface for searching and offload the bulk associated with the web app (amongst other things). There has been quite a lot of chat regarding this on this list over the last while. The last version of Nutch to use the web app was Nutch 1.2. On Fri, Jul 8, 2011 at 8:42 PM, serenity serenitykenings...@gmail.com wrote: Hello, I downloaded and installed Nutch 1.3 successfully and would like to deploy it in the webserver. Do I need to modify the existing build.xml file for generating the war file? Serenity -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-deploy-Nutch-1-3-in-the-web-server-tp3152969p3152969.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: skipping invalid segments
Hi C.B., It looks like you may have simply missed the '-dir' flag when specifying the segments directory from which your crawldb should be updated. Have a look here [1]. Can you please try it and post your results. [1] http://wiki.apache.org/nutch/bin/nutch_updatedb On Fri, Jul 8, 2011 at 5:06 PM, Cam Bazz camb...@gmail.com wrote: Hello, I tried to crawl manually, only a list of urls. I have issued the following commands: bin/nutch inject /home/crawl/crawldb /home/urls bin/nutch generate /home/crawl/crawldb /home/crawl/segments bin/nutch fetch /home/crawl/segments/123456789 bin/nutch updatedb /home/crawl/crawldb /home/crawl/segments/123456789 -noAdditions However, for the last command it skips the segment 123456789, saying it is an invalid segment. This is exactly what I need (the -noAdditions flag) but it will not updatedb. What might I have done wrong? Best Regards, -C.B. -- *Lewis*
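Two things worth trying, as a sketch. First, the command sequence above never runs a parse, and from memory updatedb treats a segment without a crawl_parse directory as invalid, so parse the segment first (an assumption worth checking in the logs). Second, the -dir form that the wiki page shows:

  bin/nutch parse /home/crawl/segments/123456789
  bin/nutch updatedb /home/crawl/crawldb -dir /home/crawl/segments -noAdditions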
Re: Integrating Solr 3.2 with Nutch 1.3
Hi Serenity, How did you execute the crawl? With the crawl command? Have you ensured that parsing has been done? This looks like a different IIE (InvalidInputException) than others have been getting when indexing to Solr, so please ensure that parsing has been done on all fetched content. On Fri, Jul 8, 2011 at 6:20 PM, serenity serenitykenings...@gmail.com wrote: Hello, I successfully installed both Solr 3.2 and Nutch 1.3 separately and both of them are working well. Now, I am trying to integrate them to get search results over what has already been crawled and indexed by Nutch 1.3. I followed the steps according to the following URLs but nothing displays in the Solr search. http://wiki.apache.org/nutch/RunningNutchAndSolr http://thetechietutorials.blogspot.com/2011/06/solr-and-nutch-integration.html After I run the command "bin/nutch solrindex http://127.0.0.1:8983/solr/ firstSite/crawldb firstSite/linkdb firstSite/segments/*" to send crawl data to Solr for indexing, it fetches the links but I receive the following error: *Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/c:/apache-nutch-1.3-src/runtime/local/firstSite/segments/20110706135037/parse_data* May I know if I need to make any changes to the schema.xml file prior to copying it into the solr/conf folder? Serenity -- View this message in context: http://lucene.472066.n3.nabble.com/Integrating-Solr-3-2-with-Nutch-1-3-tp3152501p3152501.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
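If parsing was indeed skipped, something along these lines should create the missing parse_data before re-running the index step (the segment name is taken from the error above; a sketch, unverified on Windows/Cygwin):

  bin/nutch parse firstSite/segments/20110706135037
  bin/nutch solrindex http://127.0.0.1:8983/solr/ firstSite/crawldb firstSite/linkdb firstSite/segments/*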
Re: custom extractor
Hi C.B., Your description gets slightly cloudy towards the end, e.g. around "One difficulty with my htmlcleaner... taken from firebug"? Are you trying to say that some of the URLs are bad HTML, and you know this because it is flagged up by Firebug? If this is the case, are you able to edit the HTML and make it well-formed, so to speak? It would also be of great help if you could post a small example of the type of XPath extraction you are looking to do; if anyone has built plugins implementing XPath (which I have not) then they may be able to comment further. (A bare-bones parse filter skeleton is sketched after this message for orientation.) On Wed, Jul 6, 2011 at 5:10 PM, Cam Bazz camb...@gmail.com wrote: Hello, Previously I had built a primitive crawler in Java, extracting certain information per HTML page using XPaths. Then I discovered nutch, and now I want to be able to extract certain elements in the DOM, through XPath, with multiple XPaths per site. I am crawling a number of web sites, let's say 16, and I would like to be able to write multiple XPaths per site, and then index the output of those extractions in solr, as different fields. I have googled for a while, and I understand a plugin can be developed that will act as a custom html parser. I understand that another path is using tika. I have also experimented with the boilerpipe library, and it was insufficient to extract the data I want. (I am extracting specifications of certain products, usually in tables, and fragmented.) One difficulty with my htmlcleaner-based XPath evaluator was that real-world HTML was sometimes broken, and even when I cleaned it, htmlcleaner would not find XPaths taken from Firebug. Which way should I start? Any ideas / help / recommendations greatly appreciated. Best Regards, C.B. -- *Lewis*
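For orientation, a bare-bones HtmlParseFilter skeleton against the Nutch 1.2/1.3 plugin API. A sketch only: the package, class name and metadata key are hypothetical, and the plugin still needs a plugin.xml registering it under the HtmlParseFilter extension point:

  package org.example.nutch;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  public class XPathExtractorFilter implements HtmlParseFilter {
    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      // Evaluate your per-site XPath expressions against doc here and
      // stash the extracted values in the parse metadata so an indexing
      // filter can later turn them into Solr fields, e.g.:
      // parseResult.get(content.getUrl()).getData().getParseMeta()
      //     .set("product_spec", extractedValue);
      return parseResult;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }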
Re: Building Nutch 2.0 from the trunk
Hi, Just thought it reasonable to come back to this one and confirm that all the current trunks below build successfully (after some simple configuration), and that it is possible to get a Nutch/Gora/HBase trunk implementation up and running in good time should you wish. These technologies are pretty dynamic just now and there is a lot of exciting stuff in the pipeline for the near future. Thanks On Thu, Jun 23, 2011 at 11:55 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I tried to build Nutch trunk in Eclipse about two months ago. Gora built fine and from memory it was the ivy configuration within Nutch which had to be altered. I'm positive the problems I was having have now been rectified but I haven't tried since. That is why I am interested in why the JUnit tests failed, as I thought the only problem with the build was with my Gora dependency. Sorry this is off topic. To relate to the original question: have you been able to build Nutch trunk using Markus' comments above? On Thu, Jun 23, 2011 at 3:28 PM, Markus Jelsma markus.jel...@openindex.io wrote: You can safely build Nutch trunk with Gora 1089728. I can also build the current Nutch and Gora trunks. What error do you get? Hi, I think this is your second thread on this topic? I tried to get trunk to build but was unable, as there are problems with Gora, as Julien highlighted to me some time ago. My first question is: did you get trunk to build following the tutorial you have highlighted? The problem I was having was with Gora, not with any JUnit tests. Can you please expand on your actions a bit. Thanks On Wed, Jun 22, 2011 at 4:50 AM, Nutch User - 1 nutch.use...@gmail.com wrote: Could someone give me step-by-step instructions on how to build Nutch 2.0 from the trunk and run it? I tried to follow this (http://techvineyard.blogspot.com/2010/12/build-nutch-20.html), but failed to do so as described here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html). -- *Lewis* -- *Lewis*
Re: Are we losing Nutch?
Hi Carmmello, I would like to stress that I am only speaking from my own views on the way the project has been moving over the last year and a half or so, but I would like to add the following points to address your quite obvious concerns. There has been a lot of correspondence on closely linked topics over the past wee while. I think developers understand that there is a small step up to address the requirements of Nutch 1.3, and there are also always inherent problems when individuals and communities are faced with change. I would like to say with confidence that the current version of Nutch is a well-refined tool which is adapting very accurately to provide the best crawling functionality for a very dynamically developing web consisting of various complex graph structures. I would like to make it clear that it is extremely important to have a stable and well-designed crawling implementation such as Nutch 1.3: if you look at the dev lists you will see the barrage of tasks, varying in complexity, functionality and accuracy, which keep Nutch running in parallel with daily changes to the dynamic web. If Nutch is not focussing on crawling, then no matter whether we have a web application interface or a Lucene index, the quality of data fetched will simply not be up to scratch. I hope you can appreciate the burden which this imposes on the directional decisions made within the Project Management Committee in the last year or so... Developers across many of the ASF projects understand that being user friendly is an excellent attribute to have in any open source Apache implementation. However projects develop, and in the case of the Apache Software Foundation, some of these projects spawn sub-projects which graduate to become their own top-level independent projects. As I'm sure you are aware, this was the case with Nutch; it therefore means that as a community we should be able to make decisions independently in the best interests of the project. There has been talk about not reinventing the wheel; well, this is also going on across the ASF's board of projects. One thing to consider is that many developers, contributors and PMC members do not belong to one project; they give up time, effort and resources to sometimes several projects, therefore it is very important that as a project Nutch prioritises removing duplication across the board. Addressing your point regarding the real objectives established at the beginning of the project, there has been significant progress made within Nutch and excellent sub-projects which have since graduated to top-level projects (I'm sure there is no reason to name them) with their own bustling communities. Allowing Nutch to stagnate and claim to be a one-size-fits-all search engine would have jeopardised the viability of all of these successful projects and would have therefore prevented the very innovation that earns open source implementations under the ASF the reputation and widespread use that projects are renowned for. For example, if we take the latest Nutch 1.3 release, we have two options for deployment: local mode (running on one machine) or deploy mode (harnessing the strength of parallel processing jobs for different kinds of Nutch users). The development has been driven purely by variances seen across the community usage of Nutch. We draw upon progress made in other delegated areas for the benefit of the project, not to isolate non-programmers from using newer versions of the code base.
I would also like to add that there are many questions asked about Solr due to a number of factors, namely: various developers/committers/PMC members of Nutch are also members of various Solr groups, and developers and users are kind enough to take the time and effort to answer Solr-related questions, as it is commonly recognised now that Solr is the widespread indexing mechanism (which also has an easily configurable GUI). It is not very often that users on the Nutch user@ list are ignored or their queries left unanswered; however, if this is the case there is good reason behind it. In general, and in my opinion, when I started using Nutch I found the help on user@ not only extremely beneficial but also a confidence boost to get me working on Solr and other project lists. I suppose that there are always two sides to every story, and it is very discomforting to hear that you are really not happy with the latest release; there was a lot of hard work put into its development. Amongst bug fixes and other potential barriers mentioned above and previously on this list, I would like to think that as the project matures its users can also recognise the dynamism which needs to exist in a project of Nutch's nature in order to present users with a stable and robust software choice. Instead of becoming handicapped we have a clear vision for Nutch 2.0, Nutch branches e.g. 1.4, and many new fixes on the way. I suppose it depends which side of the table you are on when you mention that it is
Re: html of the crawled pages.
Hi C.B., Can you please expand on this description? On Sun, Jul 10, 2011 at 11:52 AM, Cam Bazz camb...@gmail.com wrote: Hello All, Is there a way to save the plain HTML from the crawl? Or is this already stored in the segments dir? Best Regards, -C.B. -- *Lewis*
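To answer the question directly as far as I understand it: yes, assuming fetcher.store.content was left at its default of true, the raw HTML is stored in each segment's content directory, and SegmentReader can dump it back out (the segment name below is hypothetical):

  bin/nutch readseg -dump crawl/segments/20110710000000 html-dump -nofetch -nogenerate -noparse -noparsedata -noparsetext

The -no* flags suppress everything except the stored content.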
Re: Problems with tutorial
Hi, For a 1.3 tutorial please see here [1]. I am in the process of overhauling the Nutch site to accommodate new changes as per the 1.3 release. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr Thank you On Sun, Jul 10, 2011 at 3:42 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: I'm completely new to nutch so I downloaded version 1.3 and worked through the beginners tutorial at http://wiki.apache.org/nutch/NutchTutorial. The first problem was that I did not find the file conf/crawl-urlfilter.txt so I omitted that and continued with launching nutch. Therefore I created a plain text file in /Users/toom/Downloads/nutch-1.3/crawled called urls.txt which contains the following text: tom:crawled toom$ cat urls.txt http://nutch.apache.org/ So after that I invoked nutch by calling tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50 solrUrl is not set, indexing will be skipped... crawl started in: /Users/toom/Downloads/nutch-1.3/sites rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled threads = 10 depth = 3 solrUrl=null topN = 50 Injector: starting at 2011-07-07 14:02:31 Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-07-07 14:02:35, elapsed: 00:00:03 Generator: starting at 2011-07-07 14:02:35 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 50 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238 Generator: finished at 2011-07-07 14:02:39, elapsed: 00:00:04 Fetcher: No agents listed in 'http.agent.name' property. Exception in thread main java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property. at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166) at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068) at org.apache.nutch.crawl.Crawl.run(Crawl.java:135) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:54) I do not understand what happened here, maybe one of you can help me? -- *Lewis*
Re: Error Network is unreachable in Nutch 1.3
Hi, Please see this new tutorial [1] for configuring Nutch 1.3. If you are familiar/comfortable working with Solr for improvements to indexing then you will find it no problem. If you need to stick with Lucene and the web application front end then please stick with Nutch 1.2 or before. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr On Mon, Jul 11, 2011 at 3:02 PM, Yusniel Hidalgo Delgado yhdelg...@uci.cu wrote: Hello. I'm trying to run nutch 1.3 in my LAN following the NutchTutorial from the wiki page. When I try to run with these command line options: nutch crawl urls -dir crawl -depth 3 I get the following output:

solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
Injector: starting at 2011-07-11 09:35:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-11 09:35:40, elapsed: 00:00:03
Generator: starting at 2011-07-11 09:35:40
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110711093542
Generator: finished at 2011-07-11 09:35:43, elapsed: 00:00:03
Fetcher: starting at 2011-07-11 09:35:43
Fetcher: segment: crawl/segments/20110711093542
Fetcher: threads: 10
QueueFeeder finished: total 2 records + hit by time limit :0
fetching http://FIRST SITE/
fetching http://SECOND SITE/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
fetch of http://FIRST SITE/ failed with: java.net.ConnectException: Network is unreachable
-finishing thread FetcherThread, activeThreads=1
fetch of http://SECOND SITE/ failed with: java.net.ConnectException: Network is unreachable
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-11 09:35:45, elapsed: 00:00:02
ParseSegment: starting at 2011-07-11 09:35:45
ParseSegment: segment: crawl/segments/20110711093542
ParseSegment: finished at 2011-07-11 09:35:47, elapsed: 00:00:01
CrawlDb update: starting at 2011-07-11 09:35:47
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20110711093542]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-11 09:35:48, elapsed: 00:00:01
Generator: starting at 2011-07-11 09:35:48
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-07-11 09:35:49
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/yusniel/Programas/nutch-1.3/runtime/local/bin/crawl/segments/20110711093542
LinkDb: finished at 2011-07-11 09:35:50, elapsed: 00:00:01
crawl finished: crawl

According to this output, the problem is related to network access; however, I can access those web sites using Firefox. I'm using the Debian testing version. Greetings. -- *Lewis*
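(One thing worth checking, purely as a guess from the symptom: if that machine can only reach the web through an HTTP proxy, Firefox may be configured for it while Nutch is not. The protocol-http plugin honours the proxy properties in conf/nutch-site.xml; proxy.example.com and 3128 below are placeholders:

<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
</property>
)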
Re: Nutch Novice help
Hi Please see this up to date 1.3 tutorial [1] on the wiki. Please try it out and take on board Markus' points regarding Nutch trunk, as the problems you are experiencing are common with trunk as it stands. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr On Mon, Jul 11, 2011 at 10:50 PM, Sethi, Parampreet parampreet.se...@teamaol.com wrote: Hi All, Sorry for such a naïve question, I downloaded the nutch 1.3 binary today and am trying to set it up as mentioned in the Tutorial at http://wiki.apache.org/nutch/NutchTutorial However I am not able to find crawl-urlfilter.txt inside the conf directory. Is there any other place where I should look for this file? Thanks Param -- *Lewis*
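(For anyone hitting the same question: crawl-urlfilter.txt is gone in 1.3 and its job is done by conf/regex-urlfilter.txt. A sketch of scoping a crawl to a single site there -- the domain is only an example:

# accept anything under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch.apache.org/
# reject everything else
-.
)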
Re: developing nutch, either in eclipse or netbeans
I must admit Markus that I agree with you that for making ad-hoc changes to your configuration it is usually more time efficient to use a text editor. Hi C.B. Is there any reason in particular you are interested in getting it up and running with an IDE? I had contemplated getting a revised tutorial up and running for Eclipse in due course. On Mon, Jul 11, 2011 at 11:15 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I remember some mails on this matter on the list recently. Try the search or don't use an IDE. I never got it running but quickly gave up anyway and use a simple text editor instead. Cheers, Hello, Hopeless to get a working build environment in eclipse or netbeans. I have followed http://wiki.apache.org/nutch/RunNutchInEclipse1.0 With NB there are maven related problems, and with eclipse it won't recognize the build/local structure in apache-nutch-1.3 What is the easiest way to get going with eclipse or netbeans? Best Regards, -C.B. -- *Lewis*
Re: Updating Tika in Nutch
Hi Fernando, One point for me to mention which I did not pick up from your post: did you rebuild your Nutch dist after making the changes to include your new parser? I know that this is a pretty simple suggestion but hopefully it might be the right one. Also can you please provide more details of I get an error saying C:/Program not found whenever I try to do anything...? (On Windows that message often points to an unquoted path containing a space, e.g. C:\Program Files, though that is only a guess.) Were you able to build your 1.3 dist? I understand that 1.2 is sufficient for your needs, however it might be beneficial to root out why you cannot get 1.3 working for future interests. Thanks On Tue, Jul 12, 2011 at 8:27 AM, Fernando Arreola jfarr...@gmail.com wrote: Hello, I have made some additions (a new parser) to the Apache Tika application and I am trying to see if I can run my new changes using the crawl mechanism in Nutch, but I am having some trouble updating Nutch with my modified Tika application. The Tika updates I made run fine if I run Tika as a standalone using either the command line or the Tika GUI. I am using Nutch 1.2, 1.3 seems to not be able to run for me (I get an error saying C:/Program not found whenever I try to do anything but 1.2 should be fine for what I am trying to do which is just to see the parse results from the new parser I added to Tika). I have replaced the tika-core.jar, tika-parsers.jar and tika-mimetypes.xml files with my versions of those files as described in the following link: http://issues.apache.org/jira/browse/NUTCH-766. I also updated the nutch-site.xml to enable the parse-tika plugin. I also updated the parse-plugins.xml file with the following (afm files are what I am trying to parse):

<mimeType name="application/x-font-afm">
  <plugin id="parse-tika" />
</mimeType>

I am crawling a personal site in which I have links to .afm files. If I crawl before making any updates to Nutch, it fetches the files fine. After making the updates detailed above, I get the following error: fetch of http://scf.usc.edu/~jfarreol/woor2___.AFM failed with: java.lang.NoClassDefFoundError: org/apache/james/mime4j/MimeException. Not really sure what the issue is but my guess is that I have not updated all the necessary files. Any help would be greatly appreciated. Thank you, Fernando Arreola -- *Lewis*
Re: Nutch Gotchas as of release 1.3
Hi I have duly updated both the Nutch Gotchas [1] and the tutorial [2] to incorporate these gotchas which have been highlighted. Thanks for pointing these out. [1] http://wiki.apache.org/nutch/NutchGotchas [2] http://wiki.apache.org/nutch/RunningNutchAndSolr On Tue, Jul 12, 2011 at 12:03 AM, Jerry E. Craig, Jr. jcr...@inforeverse.com wrote: Just from a total noob standpoint (just installed my first LAMP box over the last month) realizing that I needed to look in the runtime folder when I downloaded the tar.gz file was a HUGE step. Then we all run the Crawl at least to make sure things work. The main tutorial was missing the [-solr] part of the crawl command line to get that to index. It wasn't until someone helped me here and pointed me to the actual documents that I found it. Those were the two big things for me as a total noob, otherwise I'm really happy to have at least that part working. Now, my stupid CentOS install only has libxml2 2.6.15 and I need 2.6.17 for php and I'm a few revisions off on libcurl also. I have NO idea how to go back and fix that. Not sure if I should just try to upgrade to php53 and hope for the best or what. But, that's more of a solr / php question than a Nutch question I think. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, July 11, 2011 3:19 PM To: user@nutch.apache.org Cc: lewis john mcgibbney Subject: Re: Nutch Gotchas as of release 1.3 Well, now I'm thinking of it: yes. - there were three (incl. myself) people mentioning the problem described in NUTCH-1016; - a few users don't seem to catch the part of the tutorial telling them to add their robot to the config - missing crawl-urlfilter - mails about missing solrUrl I think quite a few users still rely on the crawl command instead of running a script. Hello list, Do we have any suggestions we wish to discuss regarding the above? thanks -- *Lewis*
Re: nutch crashes for unknown reason
From the looks of it you need to parse all segments before attempting to index them. As Markus has pointed out, the specific segment hasn't been parsed. Try parsing as per the following link http://wiki.apache.org/nutch/bin/nutch_parse On Tue, Jul 12, 2011 at 1:50 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: Okay, and what does that mean? How can I repair the error? 2011/7/12 Markus Jelsma markus.jel...@openindex.io: I don't see this segment 20110712114256 being parsed. On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote: I'm not sure if I understood you correctly. Here is the complete output of my crawl:

tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in: /Users/toom/Downloads/nutch-1.3/sites
rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2011-07-12 12:28:49
Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 12:28:53, elapsed: 00:00:04
Generator: starting at 2011-07-12 12:28:53
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
Generator: finished at 2011-07-12 12:28:57, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-12 12:28:57
Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://nutch.apache.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-12 12:29:01, elapsed: 00:00:03
ParseSegment: starting at 2011-07-12 12:29:01
ParseSegment: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
ParseSegment: finished at 2011-07-12 12:29:03, elapsed: 00:00:02
CrawlDb update: starting at 2011-07-12 12:29:03
CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
CrawlDb update: segments: [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-12 12:29:06, elapsed: 00:00:02
Generator: starting at 2011-07-12 12:29:06
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
Generator: finished at 2011-07-12 12:29:10, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-12 12:29:10
Fetcher: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
Fetcher: threads: 10
QueueFeeder finished: total 50 records + hit by time limit :0
fetching http://www.cafepress.com/nutch/
fetching http://creativecommons.org/press-releases/entry/5064
fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
fetching http://www.apache.org/dist/nutch/CHANGES-1.0.txt
fetching http://eu.apachecon.com/c/aceu2009/sessions/138
fetching http://www.us.apachecon.com/c/acus2009/
fetching http://issues.apache.org/jira/browse/NUTCH
fetching http://forrest.apache.org/
fetching http://hadoop.apache.org/
fetching http://wiki.apache.org/nutch/
fetching http://nutch.apache.org/credits.html
fetching http://tika.apache.org/
fetching http://lucene.apache.org/solr/
fetching http://osuosl.org/news_folder/nutch
fetching
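(Concretely, the advice above comes down to running the parse job over the segment Markus flagged before indexing it; a sketch, with the segment path taken from this thread:

bin/nutch parse /Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256

followed by the usual updatedb/invertlinks/solrindex steps over that segment.)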
Re: A possible solution to my URL redirection and zero scores problem
Well I think in order to address the problem directly it would be better to focus on getting something working with a distribution of Nutch you are most comfortable working with. For the time being I would avoid working with trunk 2.0 unless you can justify otherwise. I would also either make a decision between Nutch 1.2 and the current 1.3 release rather than focussing on previous branches, which may or may not be stable depending on when you last svn updated. If you can try working with a fresh 1.2 or 1.3 (preferably 1.3) then we could maybe get to the bottom of this one, as it would be great to find whether there is scope to file a JIRA with this. Thank you On Tue, Jul 12, 2011 at 2:02 PM, Nutch User - 1 nutch.use...@gmail.com wrote: On 07/12/2011 03:42 PM, lewis john mcgibbney wrote: Hi, An observation is that you are using the 1.3 branch, which will now contain some older code. For example the fetcher class has now been upgraded to deal with NUTCH-962, which is mentioned at the top of the class as per your URL example. Can anyone explain what the existing metadata being transferred is as per below if it does not include the score as you state?

} else {
  CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_LINKED,
      datum.getFetchInterval());
  // transfer existing metadata
  newDatum.getMetaData().putAll(datum.getMetaData());
  try {
    scfilters.initialScore(url, newDatum);

I would have imagined that the metadata would have included the relative initial score we are discussing if it were to be of use in attributing an initial URL's metadata to a redirect? Apart from this, with the addition of your datum.getScore(), do the new scores attributed to the URL redirects accurately reflect your general understanding of the web graph? I have only been dealing with Nutch 1.2 and 1.3. I tried to set up 2.0 with Eclipse but failed as described here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html). The new scores were as they should have been in my opinion. (Even though I would state that Nutch's implementation of OPIC isn't exactly what the publication says.) I don't know what information is passed in metadata. -- *Lewis*
Re: running tests from the command line
What plugin are you hacking away on? Your own custom one or one already shipped with Nutch? Just so we are reading from the same page. This, along with some further documentation for running various classes from the command line, is definitely worth inclusion in the CommandLineOptions page of the wiki. On Tue, Jul 12, 2011 at 6:00 PM, Tim Pease tim.pe...@gmail.com wrote: At the root of the Nutch 1.3 project, what is the magic ant incantation to run only the tests for the plugin I'm currently hacking away on? I'm looking for the command line syntax. Blessings, TwP -- *Lewis*
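(For what it's worth: each bundled plugin's build.xml imports src/plugin/build-plugin.xml, which provides a test target, so one way to run a single plugin's tests -- a sketch, assuming the plugin lives under src/plugin and the core has been compiled first -- is:

cd src/plugin/parse-html
ant test
)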
Re: Nutch Novice help
Have a good look under your hadoop.log, which should be created when you initiate a crawl with Nutch; this will be extremely valuable. In addition there are various properties in nutch-site.xml which can be set to make logging more verbose at various levels, e.g. fetching. In order to root out various errors you will need to get used to looking through your logs. It is also advised to try and include as much log data as possible when posting queries on the user list. You can find more information about this here [1] as it will greatly help you get accurate and detailed help from the list in the future. I would advise you to delete all crawled data and begin a fresh crawl; this way you can try the above, looking at your logs, before we try to root out where exactly the errors are stemming from. HTH [1] http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Becoming_a_Nutch_Developer On Tue, Jul 12, 2011 at 7:31 PM, Sethi, Parampreet parampreet.se...@teamaol.com wrote: Hey Lewis, Thanks for the quick reply. Looks like I am tangled now =) I tried the tutorial mentioned at http://wiki.apache.org/nutch/RunningNutchAndSolr For me step 3 is not working. Two of the directories are not created (which should be there after step 3 is complete.) crawl/crawldb - Created crawl/linkdb - not created crawl/segments - not created Also, I changed the url to http://nutch.apache.org, but still the same log message Generator: 0 records selected for fetching, exiting ... Looks like I am missing some key step =(. -param On 7/12/11 1:37 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, I think you are maybe getting tangled here. Please see the following tutorial for Nutch 1.3 [1] Please also note that the URL you provided is the old Nutch site and now redirects to http://nutch.apache.org [1] http://wiki.apache.org/nutch/RunningNutchAndSolr On Tue, Jul 12, 2011 at 5:23 PM, Sethi, Parampreet parampreet.se...@teamaol.com wrote: Thanks for updating the tutorial. I tried my setup, the crawl command is running. But none of the pages are being crawled. I created a urls directory inside the local folder and added a new file nutch with the url in it as mentioned in the tutorial. (I also tried a file named urls inside the nutch/runtime/local directory. The contents of the urls file is http://lucene.apache.org/nutch/ ) Here's the log:

us137390:local parampreetsethi$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2011-07-12 12:22:12
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 12:22:15, elapsed: 00:00:03
Generator: starting at 2011-07-12 12:22:15
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

Please help. Thanks Param On 7/12/11 5:52 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: On 12 July 2011 10:30, Julien Nioche lists.digitalpeb...@gmail.com wrote: There seems to be no crawl-urlfilter file indeed. Don't know why it's gone since the crawl command is still there.
You can find the file in the 1.2 release: http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/ Crawl-urlfilter has been removed purposefully as it did not add anything to the other url filters (automaton | regex) in terms of functionality. By default the urlfilters contain (+.), which IIRC was what the crawl-urlfilter used to do. That's reasonable. But now new users are unaware and don't know what to do with this error message. Yep, the tutorial needs updating indeed. done Thanks for a quick reply. I searched in the nutch directory but still do not see that file :(. Here's the complete file list inside the runtime/local/conf directory:

us137390:conf parampreetsethi$ pwd
/Users/parampreetsethi/Documents/workspace/nutch/runtime/local/conf
us137390:conf parampreetsethi$ ls -t
automaton-urlfilter.txt  domain-urlfilter.txt  nutch-default.xml  prefix-urlfilter.txt  solrindex-mapping.xml
configuration.xsl  httpclient-auth.xml  nutch-site.xml  regex-normalize.xml  subcollections.xml
domain-suffixes.xml  log4j.properties  parse-plugins.dtd  regex-urlfilter.txt  suffix-urlfilter.txt
domain-suffixes.xsd  nutch-conf.xsl  parse-plugins.xml  schema.xml  tika
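(On the more verbose logging mentioned earlier in this thread, a sketch of the two usual knobs; the logger line is an example and cmdstdout is the appender name shipped in Nutch's conf/log4j.properties:

In conf/nutch-site.xml:
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
</property>

In conf/log4j.properties:
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
)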
Re: Need help: Can't find bundle for base name org.nutch.jsp.search, locale en_US
Assuming you're using Nutch 1.2, the web application you point to needs to be the exact name of the WAR file. In my case it was therefore always http://localhost:8080/nutch-1.2 Also I do not understand written Spanish (I think this is) so I can't help you out with the other stuff, sorry. On Wed, Jul 13, 2011 at 3:55 PM, Marlen zmach...@facinf.uho.edu.cu wrote: On 13/07/2011 10:30, Marlen wrote: I have been subscribed to the lucene help list, and it was great.. I hope this one will be great too... There is a problem for me.. I don't speak English quite well.. So the important thing.. I had a problem with the installation; when I type this: http://localhost:8080/nutch/ in my browser this comes out:

Estado HTTP 500 - type Informe de Excepción mensaje descripción El servidor encontró un error interno () que hizo que no pudiera rellenar este requerimiento. [The server encountered an internal error () that prevented it from fulfilling this request.] excepción

org.apache.jasper.JasperException: java.util.MissingResourceException: Can't find bundle for base name org.nutch.jsp.search, locale en_US
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:531)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:454)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:389)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:332)
javax.servlet.http.HttpServlet.service(HttpServlet.java:722)

causa raíz [root cause]

java.util.MissingResourceException: Can't find bundle for base name org.nutch.jsp.search, locale en_US
java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1539)
java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1278)
java.util.ResourceBundle.getBundle(ResourceBundle.java:805)
org.apache.jsp.index_jsp._jspService(index_jsp.java:56)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:68)
javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:416)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:389)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:332)
javax.servlet.http.HttpServlet.service(HttpServlet.java:722)

nota La traza completa de la causa de este error se encuentra en los archivos de diario de Apache Tomcat/7.0.5. [note: the full stack trace of the cause of this error can be found in the Apache Tomcat/7.0.5 log files.] Apache Tomcat/7.0.5 -- *Lewis*
Re: Can we use crawled data by Nutch 0.9 in other versions of Nutch
I think your question should be more along the lines of: is it possible to use data stored within a Lucene index in a Solr core for search? Unfortunately I am unable to answer this question; my suggestion would be to ask on solr-user@ Another option which you may wish to consider is using the convdb command line option to upgrade your 0.9 crawldb to a crawldb compatible with Nutch 1.2 and subsequently 1.3. You can then undertake crawls with Nutch 1.3 and index directly to Solr. Please someone correct me here if I am wrong. On Wed, Jul 13, 2011 at 3:50 PM, serenity serenitykenings...@gmail.com wrote: Hello, I have a question and I apologize if it sounds stupid. I just want to know if we can use the crawled data by Nutch 0.9 in Nutch 1.3, because search has been delegated to Solr in Nutch 1.3 and I want to get the search results from the crawled data by Nutch 0.9 in Nutch 1.3. Serenity -- View this message in context: http://lucene.472066.n3.nabble.com/Can-we-use-crawled-data-by-Nutch-0-9-in-other-versions-of-Nutch-tp3166259p3166259.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: Recrawling with Solr backend
Please see my comments below On Thu, Jul 14, 2011 at 12:52 PM, Chris Alexander chris.alexan...@kusiri.com wrote: Hi Lewis, First of all, thanks for the fantastic reply, most useful. I am working on testing out the functions you mention, of which I was not previously aware. Yes there has been a lot of action recently even between the 1.3 release and dev 1.4. There are a few offshoot questions from this to which the answers aren't immediately apparent. When a solrindex is run doing an update of a previous index, is it the case that all of the content is copied into the solr index again (overwriting unchanged files, for example) or is only the changed data modified in the index? We came to this question because we are thinking of running a rolling crawl (i.e. restart a new crawl when the previous one has terminated) and clearly if it re-adds already existing and unchanged data on each loop round then this would negatively impact performance and would increase the amount of compacting required in Solr. From my simple testing it looks like the date is not updated in the Solr index, implying that it is not modified? But I could use confirmation of this as it's a fairly important issue. If we take solrindex and solrdedup and omit solrclean for the time being, as this is a different matter and deals with removing a certain type of 'broken' document rather than comparing docs in our Solr index and acting accordingly. Solrindex - No data is technically copied, instead it is indexed from the crawldb based upon whatever type of content and metadata we wished to extract from it with our parsers (check out the plugins) along with URL links present in the linkdb. When fetching is undertaken each URL is given a unique fetch time in milliseconds; this way we can disambiguate between several pages which may be present in the solrindex and run the deduplication command accordingly. At the moment, commits for all reducers to the solr instance are handled in one go and yes you are correct this has been identified as fairly expensive as resources for crawls and subsequently Solr communication jobs increase proportionately. To prevent Nutch sending 'already existing and unchanged data', every page is given a metatag relating to a lastModified value. This means that any page which has not been modified since the last crawl will be skipped during fetching. Does this clear any of this up for you? The second point relates to removing documents from the index. In the scenario we are working on, a list of primary URLs is used to direct the start of the crawl. When a new site is to be crawled, its homepage URL is added to the seed urls file for the next crawl (it may also have a filter added to the filtering file to restrict the crawling spread). When a site is no longer desired in the index, its URL is removed from the seed urls file. When the next index is run, does this mean that the pages crawled under the previous URL will be removed from the solr index because they were not crawled on that occasion, or will they have to be removed manually by some other mechanism? From my simple testing it looks like they are not removed automatically. You are correct here, they most certainly are not removed automatically. I commented on a similar post a while ago. What happens if you were to remove a URL from the seed list, recrawl (and automatically remove the pages from your index), then find out you are perhaps required to re-add that URL to your seed list tomorrow or in the near future.
This would not be a sustainable way to maintain an index. I just found the db.ignore.external.links configuration value - which will solve a lot of the issues previously mentioned in passing regarding filtering the URLs to crawl. Yes, I would say that experience using properties in nutch-site and your various URL filters in a well tuned fashion should yield better results over time. Thanks again for the help (and apologies for the huge e-mail) Chris On 14 July 2011 10:59, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Chris, Yes a Nutch 1.3 crawl and Solr index bash script is something that has not been added to the wiki yet. I think this is partly because there are very few adjustments to be made to the comprehensive Nutch 1.2 scripts currently available on the Nutch wiki. This would however be a great addition if we could get the time to post one. The point of focus I pick up from your thread is that you require a script for a way of re-crawling previously crawled pages only a certain amount of time after they were last crawled etc. Generally speaking (at this stage anyway), I'll assume that etc just means various other property changes within nutch-site.xml. My recommended steps would be something like inject, generate, fetch, parse, updatedb, invertlinks, solrindex, solrdedup, solrclean (see the sketch after this message). We can obviously schedule Nutch to crawl regularly
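(A minimal sketch of one such cycle for 1.3, assuming a local runtime, a seed directory named urls, a crawl/ output directory and a Solr instance at http://127.0.0.1:8983/solr/ -- all of these names are examples:

#!/bin/bash
# one crawl/index cycle: inject, generate a segment, fetch and parse it,
# fold the results back into the crawldb, invert links, then push to Solr
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb $SEGMENT
bin/nutch solrdedup http://127.0.0.1:8983/solr/

Loop the generate..solrdedup block for deeper crawls, and add solrclean where your version provides it.)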
Re: The correct tutorial on the home page?
Hi Eric Please add any comments you wish to the new tutorial that Markus mentioned on the Wiki. I am in the process of rebuilding the Nutch site and this will be included tomorrow, e.g. from now on the default tutorial people are directed to from the wiki will be the RunningNutchAndSolr tutorial... The RunningNutchAndSolr tutorial was created as a bridge to a 'running Nutch in deploy mode' tutorial which I am working towards and would like to see constructed in the near future. We can harness a huge amount of power using Nutch in tandem with Hadoop, therefore this is the next step. As I said, please suggest anything which would make phasing into Nutch 1.3 a less laborious task. thanks On Thu, Jul 14, 2011 at 10:31 PM, Markus Jelsma markus.jel...@openindex.io wrote: Thanks. And check out open issues if possible. cheers I agree with updating NutchTutorial to be 1.3. Folks coming to Nutch and following a tutorial will almost certainly be wanting to know about the latest and greatest released code! I subscribed to the dev@ list and will keep an eye on the updates and provide any feedback I can! Eric On Jul 14, 2011, at 4:41 PM, Markus Jelsma wrote: Hi Erik, Lewis already moved a lot of 1.3 stuff to a legacy area on the wiki. The tutorial pointed to from the homepage is indeed old but also contains recent additions. Perhaps we should merge those two tutorials and get rid of RunningNutchAndSolr. The NutchTutorial seems more appropriate in >= 1.3. Cheers Hi all, I am getting back up to speed on Nutch after being away for a couple versions! I noticed the tutorial linked from the homepage is to this one: http://wiki.apache.org/nutch/NutchTutorial However, it seems like with Nutch moving to using Solr, that the tutorial that should be linked to is http://wiki.apache.org/nutch/RunningNutchAndSolr#A3._Crawl_your_first_website Alternatively, the content for all the pre-1.3 Nutch should be moved to a different NutchTutorial wiki page, and the NutchTutorial updated with the Solr content? Eric - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. -- *Lewis*
Re: what does the parse command do
Hi C.B., Quite a few things here On Fri, Jul 15, 2011 at 5:19 PM, Cam Bazz camb...@gmail.com wrote: Hello, Finally I got a working build environment, and I am doing some modifications and playing around. Good to hear, although it is off topic can you share any hurdles you overcame with us please. It would be good to hear how you solved your configuration problems. I also got my first plugin to build, and almost done with my custom parser. Excellent, I will proceed with adding your comment to a page in plugin central on the wiki, in the meantime it would be good to hear more about your plugin and what functionality it encapsulates! Would it be possible to get a wiki entry? We are a bit short of Nutch 1.3 custom plugin tutorials. I have my custom plugin and the method

public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) { ...

does indeed have all the information that I need to do my custom parsing. Now this is what I don't understand: there is a content field in solr. I have read the solrindexer code, and figured out that pretty much any field in the doc is indexed to solr. If you have a look at both your schema.xml and solrindex-mapping.xml documents you will see how fields are generated and passed to Solr for indexing. What must I do, so I can open another content-like field such as content2 and put my custom extracted data there, so solr indexes it? I think this does not have to do with solr, but with the fields in the document. My suggestion would be to specify extraction of the field within the plugin code then add the various configuration parameters to both of the aforementioned config documents. In the recommended example, the found result is only added to contentMeta - and this one is not indexed by solr. What recommended example? I am not following you here. Best Regards, -C.B. -- *Lewis*
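(To make the last point concrete, here is a minimal sketch of the indexing side against the 1.3 interface as I read it; the class name Content2IndexingFilter, the field name content2 and the parse-metadata key are all made-up examples, and it assumes your parse filter has already stashed the extracted value in the parse metadata:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class Content2IndexingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // the parse filter is assumed to have stored the extracted text
    // under this key in the parse metadata
    String value = parse.getData().getParseMeta().get("content2");
    if (value != null) {
      doc.add("content2", value);
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

The new field then needs a <field name="content2" .../> entry in Solr's schema.xml and a mapping in solrindex-mapping.xml so that solrindex passes it through.)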
Re: Deploying the web application in Nutch 1.2
Are you adding this to nutch-site within your webapp or just in your root Nutch installation? This needs to be included in your webapp version of nutch-site.xml. In my experience this was a small case of confusion at first. On Fri, Jul 15, 2011 at 7:03 PM, Chip Calhoun ccalh...@aip.org wrote: You've gotten me very close to a breakthrough. I've started over, and I've found that if I don't make any edits to nutch-site.xml, I get a working Nutch web app; I have no index and all of my searches fail, but I have Nutch. When I add my crawl location to nutch-site.xml and restart Tomcat, that's when I start getting the 404 with the The requested resource () is not available message. Clearly I'm doing something wrong when I edit nutch-site.xml. I'm going to paste the entire contents of my nutch-site.xml. Where am I screwing this up? Thanks for your help on this.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>C:/Apache/apache-nutch-1.2/crawl</value>
  </property>
</configuration>

-Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Thursday, July 14, 2011 5:38 PM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 On Thu, Jul 14, 2011 at 8:01 PM, Chip Calhoun ccalh...@aip.org wrote: Thanks Lewis. I'm still having trouble. I've moved the war file to $CATALINA_HOME/webapps/nutch/ and unpacked it. I don't seem to have a catalina.sh file, so I've skipped that step. From memory the catalina.sh file is used to start your Tomcat server instance... this has nothing to do with Nutch. Regardless of what kind of WAR files you have in your Tomcat webapps directory, starting your Tomcat server from the command line should be the same... And I've added the following to C:\Apache\Tomcat-5.5\webapps\nutch\WEB-INF\classes\nutch-site.xml : As far as I can remember nutch-site.xml is already there, however you need to specify various property values after this has been uploaded the first time. After rebooting Tomcat all of your property settings will be running.

<property>
  <name>searcher.dir</name>
  <value>C:\Apache\apache-nutch-1.2\crawl</value>
  <!-- There must be a crawl/index directory to run off -->
</property>

Looks fine, however please remove the <!-- ... --> comment as this is not required. However, when I go to http://localhost:8080/nutch/ I always get a 404 with the message, The requested resource () is not available. What am I missing? As I said the name of the WAR file needs to be identical to the webapp you specify in the Tomcat URL... can you confirm this? There should really be no problem starting up the Nutch web app if you follow the tutorial carefully. Thanks, Chip -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Thursday, July 14, 2011 5:40 AM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 Hi Chip, Please see this tutorial for 1.2 administration [1], many people have been using it recently and as far as I'm aware it is working perfectly.
Please post back if you have any troubles [1] http://wiki.apache.org/nutch/NutchTutorial On Wed, Jul 13, 2011 at 5:50 PM, Chip Calhoun ccalh...@aip.org wrote: I'm a newbie trying to set up a Nutch 1.2 web app, because it seems a bit better suited to my smallish site than the Nutch 1.3 / Solr connection. I'm going through the tutorial at http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine , and I've hit the following instruction: Deploy the Nutch web application as the ROOT context. I'm not sure what I'm meant to do here. I get the idea that I'm supposed to replace the current contents of $CATALINA_HOME/webapps/ROOT/ with something from my Nutch directory, but I don't know what from my Nutch directory I'm supposed to move. Can someone please explain what I need to move? Thanks, Chip -- *Lewis* -- *Lewis* -- *Lewis*
Re: Deploying the web application in Nutch 1.2
As a resource it would be wise to have a look at the list archives for an exact answer to this. Take a look at your catalina.out logs for more verbose info on where the error is. It has been a while since I have configured this now, sorry I can't be of more help in giving a definite answer. On Fri, Jul 15, 2011 at 8:27 PM, Chip Calhoun ccalh...@aip.org wrote: I'm definitely changing the file in my webapp. I can tell I'm doing that much right because it makes a noticeable change to the function of my web app; unfortunately, the change is that it seems to break everything. I've tried playing with the actual value for this, but with no success. In the tutorial's example, <value>/somewhere/crawl</value>, what is that relative to? Where would that hypothetical /somewhere/ directory be, relative to $CATALINA_HOME/webapps/? It feels like this is my problem, because I can't think of anything else it could be. -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Friday, July 15, 2011 3:19 PM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 Are you adding this to nutch-site within your webapp or just in your root Nutch installation? This needs to be included in your webapp version of nutch-site.xml. In my experience this was a small case of confusion at first. On Fri, Jul 15, 2011 at 7:03 PM, Chip Calhoun ccalh...@aip.org wrote: You've gotten me very close to a breakthrough. I've started over, and I've found that if I don't make any edits to nutch-site.xml, I get a working Nutch web app; I have no index and all of my searches fail, but I have Nutch. When I add my crawl location to nutch-site.xml and restart Tomcat, that's when I start getting the 404 with the The requested resource () is not available message. Clearly I'm doing something wrong when I edit nutch-site.xml. I'm going to paste the entire contents of my nutch-site.xml. Where am I screwing this up? Thanks for your help on this.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>C:/Apache/apache-nutch-1.2/crawl</value>
  </property>
</configuration>

-Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Thursday, July 14, 2011 5:38 PM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 On Thu, Jul 14, 2011 at 8:01 PM, Chip Calhoun ccalh...@aip.org wrote: Thanks Lewis. I'm still having trouble. I've moved the war file to $CATALINA_HOME/webapps/nutch/ and unpacked it. I don't seem to have a catalina.sh file, so I've skipped that step. From memory the catalina.sh file is used to start your Tomcat server instance... this has nothing to do with Nutch. Regardless of what kind of WAR files you have in your Tomcat webapps directory, starting your Tomcat server from the command line should be the same... And I've added the following to C:\Apache\Tomcat-5.5\webapps\nutch\WEB-INF\classes\nutch-site.xml : As far as I can remember nutch-site.xml is already there, however you need to specify various property values after this has been uploaded the first time. After rebooting Tomcat all of your property settings will be running.

<property>
  <name>searcher.dir</name>
  <value>C:\Apache\apache-nutch-1.2\crawl</value>
  <!-- There must be a crawl/index directory to run off -->
</property>

Looks fine, however please remove the <!-- ... --> comment as this is not required. However, when I go to http://localhost:8080/nutch/ I always get a 404 with the message, The requested resource () is not available. What am I missing? As I said the name of the WAR file needs to be identical to the webapp you specify in the Tomcat URL... can you confirm this? There should really be no problem starting up the Nutch web app if you follow the tutorial carefully. Thanks, Chip -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Thursday, July 14, 2011 5:40 AM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 Hi Chip, Please see this tutorial for 1.2 administration [1], many people have been using it recently and as far as I'm aware it is working perfectly. Please post back if you have any troubles [1] http://wiki.apache.org/nutch/NutchTutorial On Wed, Jul 13, 2011 at 5:50 PM, Chip Calhoun ccalh...@aip.org wrote: I'm a newbie trying to set up
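(For the archives: searcher.dir points at the directory that itself contains the crawl output, so with the value above the web app expects something like the following on disk -- this layout is what a bin/nutch crawl run produces by default, shown here as a sketch:

C:/Apache/apache-nutch-1.2/crawl/
    crawldb/
    index/       (the merged Lucene index the web app searches)
    linkdb/
    segments/

The path is absolute, not relative to $CATALINA_HOME/webapps/, and if index/ (or indexes/) is missing under it the searcher has nothing to serve.)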
Re: problem compiling plugin
Hi C.B., I'm in the process of overhauling PluginCentral on the wiki and have opened a wiki page for Plugin Gotchas [1]. Could I ask you to edit it and describe your understanding of the problem more specifically please? There is also an interesting page here [2], which you may or may not be interested in reading. Thanks for initiating this. [1] http://wiki.apache.org/nutch/PluginGotchas [2] http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading On Fri, Jul 15, 2011 at 10:21 AM, Cam Bazz camb...@gmail.com wrote: Hello Lewis, I have solved this problem by putting the ivy.jar where the ant related jars are in my system: /usr/share/lib/ant in ubuntu. I think we might want to add this to the documentation for building plugins. The current problem is that since lucene is gone in 1.3, I need a new solr based indexer, and I could not find an example for it. Best Regards, C.B. On Fri, Jul 15, 2011 at 11:17 AM, lewis.mcgibb...@gmail.com lewis.mcgibb...@gmail.com wrote: It looks like you do not have specifics set within your build.xml. The error log would also suggest this. Can you please post the lines causing the error -Original Message- From: Cam Bazz Sent: 14/07/2011, 6:19 PM To: user@nutch.apache.org Subject: problem compiling plugin Hello, I am following http://wiki.apache.org/nutch/WritingPluginExample-1.2 on 1.3 and when I try to build my plugin with ant I get:

moliere@blitz:~/java/apache-nutch-1.3/src/plugin/recommended$ ant
Buildfile: build.xml
BUILD FAILED
/home/moliere/java/apache-nutch-1.3/src/plugin/recommended/build.xml:5: The following error occurred while executing this line:
/home/moliere/java/apache-nutch-1.3/src/plugin/build-plugin.xml:46: Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
Cause: The name is undefined.
Action: Check the spelling.
Action: Check that any custom tasks/types have been declared.
Action: Check that any presetdef/macrodef declarations have taken place.
No types or tasks have been defined in this namespace yet
This appears to be an antlib declaration.
Action: Check that the implementing library exists in one of:
-/usr/share/ant/lib
-/home/moliere/.ant/lib
-a directory added on the command line with the -lib argument
Total time: 0 seconds -- *Lewis*
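(For background: a plugin's build.xml is deliberately minimal and inherits everything, including the ivy antlib declaration that failed above, from src/plugin/build-plugin.xml. A sketch for a plugin named recommended, the name from the thread:

<?xml version="1.0"?>
<project name="recommended" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>

which is why ant needs ivy.jar on its library path even for a one-plugin build.)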
Re: LinkRank scores
Hi, Do we have any suggestions to demystify this? I intend to look into webgraph in more detail soon as I wish to get a much more detailed picture of its functionality for link analysis purposes. On Wed, Jul 13, 2011 at 9:25 AM, Nutch User - 1 nutch.use...@gmail.com wrote: Does anyone know how the LinkRank scores are calculated exactly? The only sources of information I have are this wiki page (http://wiki.apache.org/nutch/NewScoring) and the source code of the tool. Is this the only difference from PageRank: It is different from PageRank in that nepotistic links such as links internal to a website and reciprocal links between websites can be ignored. The number of iterations can also be configured; by default 10 iterations are performed. ? I.e. if internal links are not ignored, would the LinkRank scores be equivalent to PageRank scores? -- *Lewis*
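(From my own, possibly incomplete, reading of the LinkRank source, each iteration recomputes

score(p) = (1 - d) + d * sum over inlinking pages q of score(q) / outdegree(q)

with d the damping factor (0.85 by default, configurable via link.analyze.damping.factor), so with internal and reciprocal links counted it would essentially be the classic PageRank recurrence. Please correct me if this reading is wrong.)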
Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?
Hi Gabriele, At first this seems like a plausible argument, however my question concerns what Nutch would do if we wished to change the Solr core to which we index? If we removed this functionality from the crawldb there would be no way to determine what Nutch was to fetch and what it wasn't. On Sat, Jul 16, 2011 at 1:00 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote: Hello, I had this draft lurking for a while now, and before archiving it for personal reference I wondered if it's accurate, and if you recommend posting it to the wiki. Nutch maintains a crawldb (and linkdb, for that matter) of the urls it crawled, the fetch status, and the date. This data is maintained beyond fetch so that pages may be re-crawled, after the re-crawling period. At the same time Solr maintains an inverted index of all the fetched pages. It'd seem more efficient if nutch relied on the index instead of maintaining its own crawldb, to !store the same url twice. [BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN SOLR] -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- *Lewis*
Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?
Please feel free to add this to the wiki as it is a question that will undoubtedly arise in the future. Lewis On Sat, Jul 16, 2011 at 12:37 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote: On Sat, Jul 16, 2011 at 1:29 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Gabriele, At first this seems like a plausible argument, Indeed, I think it could be a FAQ. Shall I add it to the nutch wiki? however my question concerns what Nutch would do if we wished to change the Solr core to which we index? If we removed this functionality from the crawldb there would be no way to determine what Nutch was to fetch and what it wasn't. Indeed, you confirm my thought. crawled, the fetch status, and the date. This data is maintained beyond fetch so that pages may be re-crawled, after the re-crawling period. At the same time Solr maintains an inverted index of all the fetched pages. It'd seem more efficient if nutch relied on the index instead of maintaining its own crawldb, to !store the same url twice. [BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN SOLR] -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- *Lewis* -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- *Lewis*
Re: running tests from the command line
Further to this, I have been working on a JIRA ticket for this [1]. If you could, can you please test? I will also shortly, and hopefully we can get this committed soon. Thank you [1] https://issues.apache.org/jira/browse/NUTCH-672 On Tue, Jul 12, 2011 at 9:36 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: OK, it seems like you are comfortable with JUnit testing under ant but I think for the purpose of the list, I will provide the following resource [1] for general info on configuring JUnit tests. I would comment that you may be able to get a more verbose output if you set haltonerror, printsummary and formatter type="plain" for easier reading of the output report. Basically what we are after is a report printed to a file or stdout to show where errors are present. Could you please have a look at the 'test' subsection of [1] and correct me on anything I have misinterpreted. [1] http://ant.apache.org/manual/Tasks/junit.html Finally, although it seems like everything is OK, it would be great to crack this one. It would be useful to run just JUnit tests with Ant from the command line. On Tue, Jul 12, 2011 at 8:55 PM, Tim Pease tim.pe...@gmail.com wrote: On Jul 12, 2011, at 11:51 AM, lewis john mcgibbney wrote: What plugin are you hacking away on? Your own custom one or one already shipped with Nutch? Just so we are reading from the same page. Adding some http.agent.name support to the HTMLMetaProcessor found in the parse-html plugin. For some reason all JUnit test results are not being output to stdout when running the tests. The ant task claims there are failures, but none are shown. I had to hack the ant task so that haltonfailure is true and fork is false. Then the expected output was showing up. To shorten the test loop a little bit I was hoping ant provided an easy way to run just the tests for the parse-html plugin. Thanks for the speedy reply! Blessings, TwP This, along with some further documentation for running various classes from the command line, is definitely worth inclusion in the CommandLineOptions page of the wiki. On Tue, Jul 12, 2011 at 6:00 PM, Tim Pease tim.pe...@gmail.com wrote: At the root of the Nutch 1.3 project, what is the magic ant incantation to run only the tests for the plugin I'm currently hacking away on? I'm looking for the command line syntax. Blessings, TwP -- *Lewis* -- *Lewis* -- *Lewis*
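(A sketch of the junit settings being discussed, as they might look in the build file; these are the values from the conversation above, not the shipped defaults:

<junit printsummary="yes" haltonfailure="yes" fork="no">
  <formatter type="plain" usefile="false"/>
  <!-- classpath and batchtest elements as already defined in the build -->
</junit>
)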
Extracting triple tags or hash tags from html
Hi, Is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have thought that this would have been dealt with in Tika, however I have seen no mention of anyone having problems extracting this from web documents when fetching with Nutch, or even discussing it. For example, say I had some geographical location in a meta tag such as geo:long=55.1244; is it possible to extract this with parse-tika or would I need to extend parse-html? And on the other point, is it possible to extract hash tags from twitter via the above? -- *Lewis*
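(If extending parse-html turns out to be the route, here is a minimal sketch of an HtmlParseFilter doing it; GeoMetaParseFilter is a made-up name, the geo: prefix is just the example above, and this assumes the filter is registered as a parse-html extension in its plugin.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class GeoMetaParseFilter implements HtmlParseFilter {
  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    walk(doc, parse.getData().getParseMeta());
    return parseResult;
  }

  // recursively scan the DOM for meta nodes whose name starts with "geo:"
  private void walk(Node node, Metadata meta) {
    if ("meta".equalsIgnoreCase(node.getNodeName())) {
      NamedNodeMap attrs = node.getAttributes();
      if (attrs != null) {
        Node name = attrs.getNamedItem("name");
        Node value = attrs.getNamedItem("content");
        if (name != null && value != null
            && name.getNodeValue().startsWith("geo:")) {
          meta.add(name.getNodeValue(), value.getNodeValue());
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      walk(children.item(i), meta);
    }
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

An indexing filter along the lines sketched earlier in this digest could then lift the geo:* values out of the parse metadata into the Solr document.)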
Re: Garbage with languageidentifier
Hi Markus, I think this is a good shout, and it is not hard to understand the points you make. Quite clearly, good practice relating to the inclusion of accurate and useful language information (as well as other types of information) in HTTP headers is not a reality, and it wouldn't be suitable for us to pretend as if this was not the case. One thing to note though: I just found out yesterday that language detection in trunk has been passed to Tika but this is not the case with branch 1.4. It's not my intention to put words into people's mouths, however by the looks of the conversation in NUTCH-657 I foresee that delegating language identification to Tika and making branch-1.4 consistent with trunk would be the next move? Am I correct here? Please say otherwise if this is not the case. If this is the plan then is there any requirement for Nutch to have an independent language detection plugin? If we can address why the decision was made for trunk to rely upon Tika for language detection then we can justify where we are with the comments you make. To be honest I am seeing a medium sized grey area here, however this has to do with my inexperience dealing with the language detection plugin and with the problems you mention. On Sun, Jul 17, 2011 at 2:04 PM, Markus Jelsma markus.jel...@openindex.io wrote: The proposal is to configure the order of detection: meta,header,identifier (which is the current order). Hi, I've found a lot of garbage produced by the language identifier, most likely caused by it relying on the HTTP header as the first hint for the language. Instead of a nice tight list of ISO codes I've got an index full of garbage making me unable to select a language. The lang field now contains a mess including ISO codes of various types (nl | ned, nl-NL | nederlands | Nederlands | dutch | Dutch etc etc) and even comma-separated combinations. It's impossible to do a simple fq:lang:nl due to this undeterminable set of language identifiers. Apart from language identifiers that we as humans understand, the headers also contain values such as {$plugin.meta.language} | Weerribben zuivel | Array or complete sentences and even MIME-types and more nonsense you can laugh about. Why do we rely on the HTTP header at all? Isn't it well-known that only very few developers and content management systems actually care about returning proper information in HTTP headers? This actually also goes for finding out content-type, which is a similar problem in the index. I know work is going on in Tika for improving MIME-type detection; I'm not sure if this is true for language identification. We still have to rely on the Nutch plugin to do this work, right? If so, I propose to make it configurable so we can choose if we want to rely on the current behaviour or do N-gram detection straight away. Comments? Thanks -- *Lewis*
Re: Fetched pages has no content
Hi, If you have a look at your regex-urlfilter.txt it will by default be rejecting ? in the URL. Please test with that line edited (or commented out) and see if the problem fades (the default rule in question is sketched below). On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask anr...@gmail.com wrote: Hi Markus! We are using a custom parser, but I don't think that the problem is in the parsing. I got the same problem when trying the ParserChecker. I also tried the following: I injected the following seeds: http://www.uu.se/news/news_item.php?id=1423&typ=pm http://www.uu.se/news/news_item.php?id=1421&typ=pm http://www.uu.se/news/news_item.php?id=1489&typ=artikel http://www.uu.se/news/news_item.php?id=1407&typ=pm http://www.uu.se/news/news_item.php?id=1234&typ=artikel http://www.uu.se/news/news_item.php?id=1233&typ=artikel http://www.uu.se/news/news_item.php?id=1180&typ=artikel http://www.uu.se/news/news_item.php?typ=pm&id=1381 http://www.uu.se/ Then generated a segment, fetched that segment and then did a readseg with -noparse, -noparsedata and -noparsetext. I have attached the readseg dump and it shows no content for: http://www.uu.se/news/news_item.php?typ=pm&id=1381 Can the problem somehow be in the configuration for the fetcher? Best regards, --Anders Rask www.findwise.com 2011/7/15 Markus Jelsma markus.jel...@openindex.io What parser are you using? What does bin/nutch org.apache.nutch.parse.ParserChecker say? Here it outputs the content fine with parse-tika enabled. On Friday 15 July 2011 15:04:55 Anders Rask wrote: Hi! We are using Nutch to crawl a bunch of websites and index them to Solr. At the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3 and at the same time going from one server to two servers. Unfortunately we are stuck with a problem which we haven't seen in the old environment. Several of the pages that we are fetching contain no content when they are stored in the segment. The following is an excerpt from readseg on a segment containing such a page: Recno:: 5 URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381 Content:: Version: -1 url: http://www.uu.se/news/news_item.php?typ=pm&id=1381 base: http://www.uu.se/news/news_item.php?typ=pm&id=1381 contentType: text/html metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195 nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049 Connection=close Content-Type=text/html Server=Apache Content: The fetch logs say nothing unusual about retrieving this page: 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.uu.se/news/news_item.php?typ=pm&id=1381 There seems to be nothing strange about the page itself and a very similar page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and indexed without any problems. Anyone have any ideas about what might be wrong here? Best regards, --Anders Rask www.findwise.com -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350 -- *Lewis*
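For reference, the rule in question: the stock conf/regex-urlfilter.txt shipped with Nutch carries a line like the one below, which rejects any URL containing a query string. A sketch of the edit, against the shipped default:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# edited version that lets '?' and '=' through but still skips the rest:
-[*!@]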
Re: some Nutch questions
Hi Cheng, Please see this wiki page for some references to optimization [1] I can see your problem though. I think a possible solution may be to have two seed directories, with a specifically tailored Nutch implementation ready to crawl each. This way we guarantee top results if we take sites on a case-by-case basis. Please feel free to add any further comments to this wiki page based upon your personal experiences moving towards optimization. Thanks [1] http://wiki.apache.org/nutch/OptimizingCrawls On Sat, Jul 16, 2011 at 2:23 AM, Cheng Li chen...@usc.edu wrote: Hi, I have some questions for the optimization. 1) For the command bin/nutch crawl url -dir mydir -depth 2 -threads 4 -topN 50 logs/logs1.log, I know the meaning of each parameter, say, -depth 8: the maximum depth of links crawled is 8 (8 levels down from the seed urls); -topN 5: maximum number of links/pages that can be crawled at each depth; -threads 16: issue 16 threads simultaneously. But how to choose the proper number for each parameter? For example, on the craigslist web site, the usual url for a certain car goes like this: http://losangeles.craigslist.org/sgv/cto/2496560420.html But on Kbb.com, the usual url for a certain car goes like this: http://www.kbb.com/volkswagen/jetta/2003-volkswagen-jetta/gls-sedan-4d/?vehicleid=348329&intent=buy-used&options=4098815|true|4098881|true&pricetype=private-party&condition=good How to determine the value of each parameter for these 2 examples? 2) When I check the data in Luke in the overview panel, I found that on the left side (available fields and term counts per field table) the anchor number value is zero, while the content value is not, and on the right side (top ranking terms table) all the rank values are also the same. I want to know the reason that it displays the information like this. Thanks, -- Cheng Li -- *Lewis*
Re: How to use lucene to index Nutch 1.3 data
Hi Kelvin, I see you are posting on a couple of threads with regards to the Lucene index generated by Nutch which you correctly point out is not there. It is not possible to create a Lucene index from Nutch 1.3 anymore as all searching has been shifted to Solr therefore Nutch 1.3 has no use for a Lucene index. If you wish to find out more on why this is current practice please feel free to read into recent activity on the lists. I hope this clears things up. On Tue, Jul 19, 2011 at 3:32 PM, Kelvin k...@yahoo.com.sg wrote: Hi Александр, Thank you for your reply, but I am not using solr. How do I use Lucene to create an index of folder /crawl? I went to Lucene website, but it only explains how to index local files and html? From: Александр Кожевников b37hr3...@yandex.ru To: user@nutch.apache.org; k...@yahoo.com.sg Sent: Tuesday, 19 July 2011 8:10 PM Subject: Re: How to use lucene to index Nutch 1.3 data Kelvin, You should make index using solr $ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* 19.07.2011, 15:07, Kelvin k...@yahoo.com.sg: Dear all, After crawling using Nutch 1.3, I realise that my /crawl folder does not contain folder /index. Is there any way to create a lucene index from the /crawl folder? Thank you for your help. Best regards, Kelvin -- *Lewis*
Re: help, src modify to optimize the crawl
I don't think this has anything to do with modifying the crawl src. In fact it doesn't have anything to do with optimization either. Try using your URLFilters, e.g. regex (a hypothetical set of rules is sketched after this thread). It is important to try and understand what type of pages we can filter out of a Nutch crawl using the filters provided. HTH On Wed, Jul 20, 2011 at 11:04 AM, Cheng Li chen...@usc.edu wrote: Hi, I tried to use Nutch to crawl craigslist. The seeds I use are http://losangeles.craigslist.org/wst/ctd/ http://losangeles.craigslist.org/sfv/ctd/ http://losangeles.craigslist.org/lac/ctd/ http://losangeles.craigslist.org/sgv/ctd/ http://losangeles.craigslist.org/lgb/ctd/ http://losangeles.craigslist.org/ant/ctd/ http://losangeles.craigslist.org/wst/cto/ http://losangeles.craigslist.org/sfv/cto/ http://losangeles.craigslist.org/lac/cto/ http://losangeles.craigslist.org/sgv/cto/ http://losangeles.craigslist.org/lgb/cto/ http://losangeles.craigslist.org/ant/cto/ What I want to get is a result page like this one, for example http://losangeles.craigslist.org/lac/ctd/2501038362.html , which is a specific car selling page. What I DON'T want to get is a result page like this one, for example http://losangeles.craigslist.org/cta/. However, in my query results I always get results like http://losangeles.craigslist.org/cta/. Actually, I can get this kind of page from craigslist, just part of them, but not all of them. I tried to adjust the crawl command line parameters, but there was not much change. So what I plan to do is to modify the crawl code in the Nutch src code. Where can I start? What kind of work can I do to optimize the crawl process in the src code? -- Cheng Li -- *Lewis*
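A hypothetical set of rules for the case above (conf/regex-urlfilter.txt; the patterns are assumptions based on the URL shapes quoted in this thread, so verify them against real craigslist URLs). Note the section index pages such as /lac/ctd/ still need to be fetched so that individual listings can be discovered from them; the rules below keep them for crawling, and you would drop them at search time instead:

# keep individual listing pages, e.g. /lac/ctd/2501038362.html
+^http://losangeles\.craigslist\.org/[a-z]+/ct[do]/[0-9]+\.html$
# keep the section indexes so listings can be discovered
+^http://losangeles\.craigslist\.org/[a-z]+/ct[do]/?$
# skip everything else
-.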
Re: embedded google map in nutch query result page
I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like you're using Nutch 1.2 or earlier, therefore unless you are comfortable working with JSPs I wouldn't bother with the hassle. It might be better to use Solr for indexing and searching and build an interface such as Ajax Solr, which would then permit you to write a widget to do this task. However, unless you have the time and are competent and willing to learn and use Apache Solr and Javascript, this is not an ideal solution. I honestly have no idea how to implement this using the legacy JSP On Wed, Jul 20, 2011 at 11:09 AM, Cheng Li chen...@usc.edu wrote: Hi, I have done a google map marker html code. I plan to display the google map object in the nutch query result page, with the geo-markers which are extracted from the results listed on that page. How should I modify the nutch query result page to implement my design? Thanks, -- Cheng Li -- *Lewis*
Re: skipping invalid segments nutch 1.3
There is no documentation for the individual commands used to run a Nutch 1.3 crawl, so I'm not sure where anyone has been misled. In the instance that this was required I would direct newer users to the legacy documentation for the time being. My comment to Leo was to understand whether he managed to correct the invalid segments problem. Leo, if this still persists may I ask you to try again; I will do the same and will be happy to provide feedback. May I suggest you use the following commands: inject, generate, fetch, parse, updatedb (a full sequence is sketched at the end of this thread). At this stage we should be able to ascertain whether something is incorrect and hopefully debug. May I add the following... please make the following additions to nutch-site: fetcher verbose - true; http verbose - true; check for redirects and set accordingly On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: The wiki can be edited and you are welcome to suggest improvements if there is something missing On 20 July 2011 13:31, Cam Bazz camb...@gmail.com wrote: Hello, I think there is something misleading in the documentation; it does not tell us that we have to parse. On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Haven't you forgotten to call parse? On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote: Hi Lewis, You are correct about the last post not showing any errors. I just wanted to show that I don't get any errors if I use 'crawl' and to prove that I do not have any faults in the conf files or the directories. I still get the errors if I use the individual commands inject, generate, fetch Cheers, Leo On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote: Hi Leo Did you resolve? Your second log data doesn't appear to show any errors, however the problem you specify is one I have witnessed myself a while ago. Since you posted have you been able to replicate... or resolve? On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions llsub...@zudiewiener.com wrote: I've used crawl to ensure config is correct and I don't get any errors, so I must be doing something wrong with the individual steps, but can't see what. llist@LeosLinux:~/nutchData $ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl -depth 3 -topN 5 solrUrl is not set, indexing will be skipped... crawl started in: /home/llist/nutchData/crawl rootUrlDir = /home/llist/nutchData/seed/urls threads = 10 depth = 3 solrUrl=null topN = 5 Injector: starting at 2011-07-17 09:31:19 Injector: crawlDb: /home/llist/nutchData/crawl/crawldb Injector: urlDir: /home/llist/nutchData/seed/urls Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02 Generator: starting at 2011-07-17 09:31:22 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 5 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124 Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04 Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. 
Fetcher: starting at 2011-07-17 09:31:26 Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124 Fetcher: threads: 10 QueueFeeder finished: total 1 records + hit by time limit :0 fetching http://www.seek.com.au/ -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1
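Following on from the suggestion above, one full cycle with the individual commands, as a minimal sketch (directory names are assumptions; note that updatedb takes the segment path directly):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
seg=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $seg
bin/nutch parse $seg
bin/nutch updatedb crawl/crawldb $seg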
Re: crawling in any depth until no new pages were found
Hi Marek, As we're talking about automating the task, we're immediately looking at implementing a bash script. In the situation we have described, we wish Nutch to adopt a breadth-first search (BFS) behaviour when crawling. Between us can we suggest any methods for best practice relating to BFS? As you have highlighted, we can check the crawldb after every updatedb command to determine whether there are any db_unfetched urls, and ideally we wish to continue until this number is zero, whether we dump stats to a file or read them via stdout. I would suggest that we discuss a method for obtaining the db_unfetched value and creating a loop based on whether or not it is 0 (one such loop is sketched below). Is this possible? On Wed, Jul 20, 2011 at 2:05 PM, Marek Bachmann m.bachm...@uni-kassel.de wrote: Hi all, does anyone have suggestions how I could solve the following task: I want to crawl a sub-domain of our network completely. I have always done it by multiple fetch / parse / update cycles manually. After a few cycles I checked if there were unfetched pages in the crawldb. If so, I started the cycle over again. I repeated that until no new pages were discovered. But that is annoying me and that is why I am looking for a way to do these steps automatically until no unfetched pages are left. Any ideas? -- *Lewis*
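A sketch of such a loop, assuming the db_unfetched line printed by readdb -stats in 1.3 and local paths (both assumptions to verify against your installation):

while true; do
  bin/nutch generate crawl/crawldb crawl/segments
  seg=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $seg
  bin/nutch parse $seg
  bin/nutch updatedb crawl/crawldb $seg
  # pull the db_unfetched count out of the stats output
  unfetched=`bin/nutch readdb crawl/crawldb -stats | grep db_unfetched | awk '{print $NF}'`
  if [ "${unfetched:-0}" -eq 0 ]; then break; fi
  # (a production script would also stop when generate selects no records)
done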
Re: Nutch not indexing full collection
Hi Chip, I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote: I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's the correct file. I run my commands while in $NUTCH_HOME/ , which means all of my commands begin with runtime/local/bin/nutch... . That means my urls directory is $NUTCH_HOME/urls/ and my crawl directory ends up being $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and so forth), but it does seem to at least be getting my urlfilters from $NUTCH_HOME/runtime/local/conf/ . I get no output when I try runtime/local/bin/nutch readdb -stats , so that's weird. I dimly recall there being a total index size value somewhere in Nutch or Solr which has to be increased, but I can no longer find any reference to it. Chip -Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Wednesday, July 20, 2011 10:06 AM To: user@nutch.apache.org Subject: Re: Nutch not indexing full collection I'd have suspected db.max.outlinks.per.page but you seem to have set it up correctly. Are you running Nutch in runtime/local? in which case you modified nutch-site.xml in runtime/local/conf, right? nutch readdb -stats will give you the total number of pages known etc Julien On 20 July 2011 14:51, Chip Calhoun ccalh...@aip.org wrote: Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me. My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an alphabetical listing of other pages. It looks like the indexer stops partway down this page, meaning that entries later in the alphabet aren't indexed. My nutch-site.xml has the following content: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>http.agent.name</name> <value>OHI Spider</value> </property> <property> <name>db.max.outlinks.per.page</name> <value>-1</value> <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description> </property> </configuration> My regex-urlfilter.txt and crawl-urlfilter.txt both include the following, which should allow access to everything I want: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*aip.org/history/ohilist/ # skip everything else -. I've crawled with the following command: runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 50 Note that since we don't have NutchBean anymore, I can't tell whether this is actually a Nutch problem or whether something is failing when I port to Solr. What am I missing? Thanks, Chip -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com -- *Lewis*
Re: embedded google map in nutch query result page
You can find Ajax Solr here [1]. As I said this is only one option for doing this. The information you can return and display is really directly dependent on your requirements and your imagination. However, I wouldn't imagine it should be too hard to implement the maps you are looking for once you get to grips with writing widgets. [1] http://evolvingweb.github.com/ajax-solr/ On Wed, Jul 20, 2011 at 9:57 PM, Cheng Li chen...@usc.edu wrote: Thank you. I'll try to use solr to do the indexing and add the google map object. Do you know some resource for solr AJAX? Where should I add the google map js code in solr? Thanks again, On Wed, Jul 20, 2011 at 1:51 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like you're using Nutch 1.2 or earlier, therefore unless you are comfortable working with JSPs I wouldn't bother with the hassle. It might be better to use Solr for indexing and searching and build an interface such as Ajax Solr, which would then permit you to write a widget to do this task. However, unless you have the time and are competent and willing to learn and use Apache Solr and Javascript, this is not an ideal solution. I honestly have no idea how to implement this using the legacy JSP On Wed, Jul 20, 2011 at 11:09 AM, Cheng Li chen...@usc.edu wrote: Hi, I have done a google map marker html code. I plan to display the google map object in the nutch query result page, with the geo-markers which are extracted from the results listed on that page. How should I modify the nutch query result page to implement my design? Thanks, -- Cheng Li -- *Lewis* -- Cheng Li -- *Lewis*
Re: skipping invalid segments nutch 1.3
, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04 llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110721122519 ParseSegment: starting at 2011-07-21 12:27:22 ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110721122519 ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01 llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110721122519 CrawlDb update: starting at 2011-07-21 12:28:03 CrawlDb update: db: /home/llist/nutchData/crawl/crawldb CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text, file:/home/llist/nutchData/crawl/segments/20110721122519/content, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: false CrawlDb update: URL filtering: false - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/content - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate CrawlDb update: Merging segment data into db. CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01 On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote: There is no documentation for individual commands used to run a Nutch 1.3 crawl so I'm not sure where there has been a mislead. In the instance that this was required I would direct newer users to the legacy documentation for the time being. My comment to Leo was to understand whether he managed to correct the invalid segments problem. Leo, if this still persists may I ask you to try again, I will do the same and will be happy to provide feedback May I suggest the following use the following commands inject generate fetch parse updatedb At this stage we should be able to ascertain if something is correct and hopefully debug. May I add the following... please make the following additions to nutch-site. fetcher verbose - true http verbose - true check for redirects and set accordingly On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: The wiki can be edited and you are welcome to suggest improvements if there is something missing On 20 July 2011 13:31, Cam Bazz camb...@gmail.com wrote: Hello, I think there is a mislead in the documentation, it does not tell us that we have to parse. On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Haven't you forgotten to call parse? On 19 July 2011 23:40, Leo Subscriptions llsub...@zudiewiener.com wrote: Hi Lewis, You are correct about the last post not showing any errors. 
I just wanted to show that I don't get any errors if I use 'crawl' and to prove that I do not have any faults in the conf files or the directories. I still get the errors if I use the individual commands inject, generate, fetch Cheers, Leo On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote: Hi Leo Did you resolve? Your second log data doesn't appear to show any errors however the problem you specify if one I have witnessed myself while ago. Since you posted have you been able to replicate... or resolve? On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions llsub...@zudiewiener.com wrote: I've used crawl to ensure config is correct and I don't get any errors, so I must be doing something wrong with the individual steps, but can;t see what. llist@LeosLinux:~/nutchData $ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed
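Judging by the output quoted above, the -dir flag is the likely culprit: updatedb was pointed at a single segment with -dir, which makes CrawlDb update treat each of that segment's subdirectories (parse_text, content, crawl_parse, ...) as a segment in its own right, hence the "skipping invalid segment" lines. A sketch of the two invocations the command accepts, using the paths from the thread:

# pass one segment directly:
bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110721122519
# or pass the parent directory that holds the segments via -dir:
bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments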
Re: solr index display
Specifically I would mention that you would get community input if this question was directed towards the Solr user list; however, I think you are looking for the velocity response writer. Have a search on the Solr wiki and you will find info there. In addition there are various other well-established client libraries; I previously worked with Ajax Solr. On Mon, Jul 25, 2011 at 12:32 AM, Cheng Li chen...@usc.edu wrote: Hi, I followed this instruction to run the index with solr: http://wiki.apache.org/nutch/RunningNutchAndSolr At the last step, it is said that If you want to see the raw HTML indexed by Solr, change the content field definition in solrconfig.xml to. But I found several solrconfig.xml files in the apache-solr directory. Which solrconfig.xml should I modify to make the query page look like the Nutch 1.2 query page? Thanks, -- Cheng Li -- *Lewis*
Re: embedded google map in nutch query result page
A while since I configured this. Try the tutorial, if I remember it was pretty verbose and I would imagine that it covers this subject area entirely. Sorry I couldn't be of more help. On Mon, Jul 25, 2011 at 4:33 AM, Cheng Li chen...@usc.edu wrote: Hi, I just looked up the website http://evolvingweb.github.com/ajax-solr/ you gave me . But I have some questions about that. Where should I add the javascript code file ? Is it in some subdirectory in apache-solr directory? Can you explain a little bit more? Thanks, On Wed, Jul 20, 2011 at 2:28 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: You can find Ajax Solr here [1]. As I said this is only one option for doing this. The information you can return and display is really directly dependant on your requirements and your imagination. However it should not be too hard implementing the maps you are looking for when you get to grips with writing widgets I wouldn't imagine. [1] http://evolvingweb.github.com/ajax-solr/ On Wed, Jul 20, 2011 at 9:57 PM, Cheng Li chen...@usc.edu wrote: Thank you . I'll try to use solr to do the indexing and add the google map object . Do you know some resource for solr AJAX ? where should I add the google map js code in solr ? Thanks again, On Wed, Jul 20, 2011 at 1:51 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: I don't know if you are still pursuing this, and as you haven't had any response I will give some tips. It sounds like your using = Nutch 1.2, therefore unless you are comofrtable working with JSP's then I wouldn't bother with the hastle. It might be better to try and use Solr for indexing and searching and build an interface such as Solr AJAX which would then permit you to write a widget to do this task. However unless you have time and are competent and willing to learn and use Apache Solr and Javascript then this is not an ideal solution. I honestly have no idea how to implement this using the legacy JSP On Wed, Jul 20, 2011 at 11:09 AM, Cheng Li chen...@usc.edu wrote: Hi, I have done a google map marker html code. I plan to display the google map object in the nutch query result page, with the geo-markers which are extracted from the results listed on that page. How should I modify the nutch query result page to implement my design? Thanks, -- Cheng Li -- *Lewis* -- Cheng Li -- *Lewis* -- Cheng Li -- *Lewis*
Re: Storage of data between crawls
Hi Alexander, I don't want to state the obvious here but this will depend directly on what type of loading your Nutch implementation deals with... You are correct in stating that we store data in segments, namely /crawl_fetch /content /crawl_parse /parse_data /crawl_generate /parse_text I understand that this doesn't add much value to answering your question, but as we are now indexing with Solr (and therefore not storing larger amounts of data with Nutch) I am struggling slightly to understand the issues you are trying to answer. On Mon, Jul 25, 2011 at 5:13 PM, Chris Alexander chris.alexan...@kusiri.com wrote: Hi all, I have been asked to look at doing some disk space estimates for our Nutch usage. It looks like Nutch stores the content of the pages it downloads and indexes in its data directory for the segment, is this the case? Are there any other major storage requirements I should make note of with Nutch specifically (not the Solr storage, we can handle that bit)? Cheers Chris -- *Lewis*
Re: Nutch not indexing full collection
Has this been solved? If your http.content.limit has not been increased in nutch-site.xml then you will not be able to store this data and index it with Solr (a sketch of the property is given after this thread). On Mon, Jul 25, 2011 at 6:18 PM, Chip Calhoun ccalh...@aip.org wrote: I'm still having trouble. I've set a windows environment variable, NUTCH_HOME, which for me is C:\Apache\nutch-1.3\runtime\local . I now have my urls and crawl directories in that C:\Apache\nutch-1.3\runtime\local folder. But I'm still not crawling files later on my urls list, and apparently I can't search for words or phrases toward the end of any of my documents. Am I misremembering that there was a total file size value somewhere in Nutch or Solr that needs to be increased? -Original Message- From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Wednesday, July 20, 2011 5:23 PM To: user@nutch.apache.org Subject: Re: Nutch not indexing full collection Hi Chip, I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote: I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's the correct file. I run my commands while in $NUTCH_HOME/ , which means all of my commands begin with runtime/local/bin/nutch... . That means my urls directory is $NUTCH_HOME/urls/ and my crawl directory ends up being $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and so forth), but it does seem to at least be getting my urlfilters from $NUTCH_HOME/runtime/local/conf/ . I get no output when I try runtime/local/bin/nutch readdb -stats , so that's weird. I dimly recall there being a total index size value somewhere in Nutch or Solr which has to be increased, but I can no longer find any reference to it. Chip -Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Wednesday, July 20, 2011 10:06 AM To: user@nutch.apache.org Subject: Re: Nutch not indexing full collection I'd have suspected db.max.outlinks.per.page but you seem to have set it up correctly. Are you running Nutch in runtime/local? in which case you modified nutch-site.xml in runtime/local/conf, right? nutch readdb -stats will give you the total number of pages known etc Julien On 20 July 2011 14:51, Chip Calhoun ccalh...@aip.org wrote: Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me. My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an alphabetical listing of other pages. It looks like the indexer stops partway down this page, meaning that entries later in the alphabet aren't indexed. My nutch-site.xml has the following content: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>http.agent.name</name> <value>OHI Spider</value> </property> <property> <name>db.max.outlinks.per.page</name> <value>-1</value> <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description> </property> </configuration> My regex-urlfilter.txt and crawl-urlfilter.txt both include the following, which should allow access to everything I want: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*aip.org/history/ohilist/ # skip everything else -. I've crawled with the following command: runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 50 Note that since we don't have NutchBean anymore, I can't tell whether this is actually a Nutch problem or whether something is failing when I port to Solr. What am I missing? Thanks, Chip -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com -- *Lewis* -- *Lewis*
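A sketch of the setting suggested at the top of this thread, for conf/nutch-site.xml, assuming the stock default of 65536 bytes in nutch-default.xml is what truncates long pages such as the transcripts listing (a recrawl is needed afterwards, since already-fetched content stays truncated in the old segments):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>-1 removes the per-page content size limit.</description>
</property>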
Re: TF in wide internet crawls
Hi Markus, I am with you until the last parts of your comments. cope with non-edited... edited by whom? and for what purpose? To give a better relative tf score... To comment on the first part, and please ignore or correct me if I am wrong, but do we not give each page and therefore each document an initial score of 1.0 which is then subsequently used by whichever scoring algorithm we plug in? If this is the case then how are we specifying the score for a page and the tf of some term within a document, or the tf-idf of that term over the entire document collection, to determine relevance? How can we accurately disambiguate between these entities? As I said I'm losing you towards the end; however, it would be a good discussion to explore behind the surface architecture. On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I've done several projects where term frequency yields bad result sets and worse relevancy. These projects all had one similarity; user-generated content with a competitive edge. The latter means classifieds web sites such as e-bay etc. The internet is something similar. It contains edited content, classifieds and spam or other garbage. What do you do with tf in your wide internet index? Do you impose a threshold or are you emitting 1.0f for each match? For now I emit 1.0f for each match and rely on matches in multiple fields with varying boosts to improve relevancy and various other methods. Can tf*idf cope with non-edited (and untrusted) documents at all? I've seen great relevancy with good content but really bad relevance in several cases. Thanks! -- *Lewis*
Re: plugin build.xml file
Hi Cheng Li, Please experiment with this. We have been gradually getting the pluginCentral section of the wiki updated as it needed a total face lift, so we would appreciate any additional input you may have for updating the Writing Plugin Example which is already there. Apart from being completely out of date, the one you mention should have been moved to the archive and legacy section under OldPluginCentral. I'll be picking this up tomorrow and updating. On Tue, Jul 26, 2011 at 6:46 AM, Cheng Li chen...@usc.edu wrote: Hi, In http://wiki.apache.org/nutch/WritingPluginExample-0.9 , it is said that in the nutch/plugin/recommended directory there should be 2 files, which are build.xml and plugin.xml. But in Nutch 1.3, I checked other folders in plugin, and most of them have one plugin.xml file and a jar file. So, in Nutch 1.3, do I still need to follow the instruction to create a build.xml in the /plugin/recommended directory? Or what other configuration files should I modify or create? Thanks, -- Cheng Li -- *Lewis*
Re: Limit Nutch memory usage
Hi Marseld, I'm just putting my thoughts out here; however, Hadoop is not shipped with Nutch 1.3 anymore, therefore I don't know where you would set this specific property within your Nutch instances... How are you running Hadoop? What version of Nutch, and what mode are you running Nutch in? On Tue, Jul 26, 2011 at 8:55 AM, Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com wrote: Hello list, I have two instances of nutch running on my machine. I want to configure instance 1's maximum usage of RAM to be 4 GB and the max usage of RAM in instance 2 to be 8 GB. Can I do it by configuring HADOOP_HEAPSIZE for each instance? Will these configurations interfere with each other? Best Regards, Marseld -- *Lewis*
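If memory serves, the stock bin/nutch script honours a NUTCH_HEAPSIZE environment variable (in megabytes) when building the JVM arguments in local mode, which would keep the two instances independent. A sketch, treating the variable name and units as assumptions to verify against your bin/nutch:

# instance 1, capped at roughly 4 GB
NUTCH_HEAPSIZE=4000 bin/nutch crawl urls1 -dir crawl1 -depth 3
# instance 2, capped at roughly 8 GB
NUTCH_HEAPSIZE=8000 bin/nutch crawl urls2 -dir crawl2 -depth 3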
Re: Storage of data between crawls
Well, when Nutch undertakes a generate, fetch and parse, e.g. the steps that generate segment data for indexing, the data is stored in various forms within the segment. There is much more purpose to the segment than is explained in this reply, however it does not add to this particular thread. If you have a look at nutch-default.xml you will notice a deprecated property db.default.fetch.interval; ignore this for the time being and focus instead on db.fetch.interval.default (which is a much more accurate method of specifying the default value for re-fetches of any given page anyway). Any segment older than this value can be safely deleted as new segments will have been created in successive crawl processes, thus rendering it less useful to us. This is one option for reducing the amount of disk space Nutch data takes. An alternative option to this is to mergesegs with the option to pass filtering and slicing commands for a healthier output segment. I remember learning on this list some time ago that mergesegs is a useful command for managing a Nutch instance which produces several segments per day. Understandably this can get out of hand pretty quickly, therefore merging segment data enables us to manage this effectively. In general, but strictly dependent on the size and nature of your Nutch crawls, we rarely experience problems concerning the size of disk space occupied by Nutch 1.3 segment data, however I'm sure there are extreme cases out there. On Thu, Jul 28, 2011 at 9:18 AM, Chris Alexander chris.alexan...@kusiri.com wrote: Cheers Lewis, perhaps I should attempt to rephrase the question. Clearly Nutch must download and store the contents of a page during a crawl. However, once you have indexed this content, does Nutch keep this data, or is it cleaned up, automatically or is there a command to do it? Thanks Chris On 27 July 2011 17:14, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alexander, I don't want to state the obvious here but this will depend directly on what type of loading your Nutch implementation deals with... You are correct in stating that we store data in segments, namely /crawl_fetch /content /crawl_parse /parse_data /crawl_generate /parse_text I understand that this doesn't add much value to answering your question, but as we are now indexing with Solr (and therefore not storing larger amounts of data with Nutch) I am struggling slightly to understand the issues you are trying to answer. On Mon, Jul 25, 2011 at 5:13 PM, Chris Alexander chris.alexan...@kusiri.com wrote: Hi all, I have been asked to look at doing some disk space estimates for our Nutch usage. It looks like Nutch stores the content of the pages it downloads and indexes in its data directory for the segment, is this the case? Are there any other major storage requirements I should make note of with Nutch specifically (not the Solr storage, we can handle that bit)? Cheers Chris -- *Lewis* -- *Lewis*
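A sketch of the mergesegs option mentioned above (the output directory name and slice size are arbitrary; check bin/nutch mergesegs for the exact switches in your version):

# merge all segments into slices of at most 50000 URLs, filtering on the way
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter -slice 50000
# once happy with the merged output (and any index rebuilt from it),
# the old segments can be removed
rm -rf crawl/segments
mv crawl/MERGEDsegments crawl/segments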
Re: NullPointerException when calling readdb on empty database
Which version of Nutch are you using? Is chat a plain text file, with URLs listed one per line? If this is the case there is no need to add it to your crawl command. Additionally, there is no point in trying to read what is happening in your crawldb if your generator log output indicates that nothing has been selected for fetching, therefore this will be skipped. I'm slightly concerned about your crawl parameters; for example, is it necessary to use crawl-chat? I have never used hyphens before, and it is only a suggestion, but might it be possible that Nutch is taking -chat as a parameter as well? On Wed, Aug 3, 2011 at 8:34 AM, Christian Weiske christian.wei...@netresearch.de wrote: Hi, I'm getting the following error: $ bin/nutch readdb crawl-chat/crawldb -stats CrawlDb statistics start: crawl-chat/crawldb Statistics for CrawlDb: crawl-chat/crawldb Exception in thread "main" java.lang.NullPointerException at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352) at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502) The db has been created as follows, as you see no URLs have been fetched (another problem): $ bin/nutch crawl urls/chat -dir crawl-chat -depth 10 -topN 1 solrUrl is not set, indexing will be skipped... crawl started in: crawl-chat rootUrlDir = urls/chat threads = 10 depth = 10 solrUrl=null topN = 1 Injector: starting at 2011-08-03 09:31:53 Injector: crawlDb: crawl-chat/crawldb Injector: urlDir: urls/chat Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2011-08-03 09:31:57, elapsed: 00:00:04 Generator: starting at 2011-08-03 09:31:57 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 1 Generator: jobtracker is 'local', generating exactly one partition. Generator: 0 records selected for fetching, exiting ... Stopping at depth=0 - no more URLs to fetch. No URLs to fetch - check your seed list and URL filters. crawl finished: crawl-chat -- Best regards, Christian Weiske -- *Lewis*
Re: imported to solr
Hi Kiks, What kind of changes have you made to your schema when transferring to your Solr instance? You ask about the stored parsed text content; well, the default Nutch schema sets this to stored=false, as it is not always required for all content to be stored. Generally speaking, terms that occur in the title, meta, etc. fields will be more valuable to search across, especially when considering data stores. You can change this behaviour by simply making the changes described; however, Solr does not take kindly to changes in the schema, therefore it will be necessary to reindex your data to your Solr core. On Wed, Aug 3, 2011 at 7:31 AM, Kiks kikstern...@gmail.com wrote: This question was posted on the solr list and not answered because it is nutch related... The indexed contents of 100 sites were imported to solr from nutch using: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* now, a solr admin search for 'photography' includes these results: <doc> <float name="score">0.12570743</float> <float name="boost">1.0440307</float> <str name="digest">94d97f2806240d18d67cafe9c34f94e1</str> <str name="id">http://www.galleryhopper.org/</str> <str name="segment">...</str> <str name="title">Gallery Hopper: Todd Walker's photography ephemera. Read, enjoy, share, discard.</str> <date name="tstamp">...</date> <str name="url">http://www.galleryhopper.org/</str> </doc> but highlighting options are on the title field not page text. My question: Where is the stored parsetext content of the pages? What is the solr command to send it from nutch with url/id key? The information is contained in the crawl segments with solr id field matching nutch url. Thanks. -- *Lewis*
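Concretely, the change described would be along these lines (sketch; the field definition is assumed from the stock Nutch schema shipped for Solr, where content is indexed but not stored). In schema.xml change

<field name="content" type="text" stored="false" indexed="true"/>

to stored="true", reload Solr, and then reindex so existing documents pick up the stored text:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*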
Re: New wiki page for Running Nutch 1.3 in Eclipse
Sorry http://wiki.apache.org/nutch/RunNutchInEclipse On Wed, Aug 3, 2011 at 2:12 PM, Dr.Ibrahim A Alkharashi khara...@kacst.edu.sa wrote: thanks for the info, would you please post a pointer to the page. Regards Ibrahim On Aug 3, 2011, at 3:13 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, We've just posted a new updated wiki page covering the above topic. If there are any discrepancies within the page it would be nice if individuals could sign up to the wiki and edit based upon your own experiences using = Nutch 1.3 within an IDE. However, alternatively please post on the lists and we will get it updated. Thanks for now -- *Lewis* Warning: This message and its attachment, if any, are confidential and may contain information protected by law. If you are not the intended recipient, please contact the sender immediately and delete the message and its attachment, if any. You should not copy the message and its attachment, if any, or disclose its contents to any other person or use it for any purpose. Statements and opinions expressed in this e-mail and its attachment, if any, are those of the sender, and do not necessarily reflect those of King Abdulaziz city for Science and Technology (KACST) in the Kingdom of Saudi Arabia. KACST accepts no liability for any damage caused by this email. تحذير: هذه الرسالة وما تحويه من مرفقات (إن وجدت) تمثل وثيقة سرية قد تحتوي على معلومات محمية بموجب القانون. إذا لم تكن الشخص المعني بهذه الرسالة فيجب عليك تنبيه المُرسل بخطأ وصولها إليك، وحذف الرسالة ومرفقاتها (إن وجدت)، ولا يجوز لك نسخ أو توزيع هذه الرسالة أو مرفقاتها (إن وجدت) أو أي جزء منها، أو البوح بمحتوياتها للغير أو استعمالها لأي غرض. علماً بأن فحوى هذه الرسالة ومرفقاتها (ان وجدت) تعبر عن رأي المُرسل وليس بالضرورة رأي مدينة الملك عبدالعزيز للعلوم والتقنية بالمملكة العربية السعودية، ولا تتحمل المدينة أي مسئولية عن الأضرار الناتجة عن ما قد يحتويه هذا البريد. -- *Lewis*
Re: how to extract tf-idf
Hi Zhanibek, I would like to refer specifically to Markus' thread which he initiated a short time ago [1], as it shares close similarity with your own questions. I think the main question to be answered now is how do we extract tf-idf from a crawled website? And as we now refer to Nutch as an independent software project focused solely on crawling, this is a question which would provide significant value to understanding more about the inner workings. Markus mentioned that there are many aspects we need to consider before trying to compile a tf-idf score, e.g. link score, norms, boosts, functions etc. This makes it relatively hard for me (and I suspect others) to accurately comment on the actual components we are required to consider and understand in this specific context before we can address the fundamental question at hand... I think there is a good deal of lateral thinking required here ;0) In the meantime have you had any chance to delve into this? [1] http://www.mail-archive.com/user%40nutch.apache.org/msg03517.html On Wed, Aug 3, 2011 at 5:28 AM, Zhanibek Datbayev itoma...@gmail.com wrote: Hello Nutch Users, I've googled for a while and still can not find answers to the following: 1. After I crawl a web site, how can I extract tf-idf for it? 2. How can I access the original web pages crawled? 3. Is it possible to get, for each word, the id it corresponds to? Thanks in advance! -Zhanibek -- *Lewis*
Re: fetcher runs without error with no internet connection
Hi Alex, Did you get anywhere with this? What condition led to you seeing the unknown host exception? Unless the segment gets corrupted, I would assume you could fetch again. Hopefully you can confirm this. On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote: Hello, After running bin/nutch fetch $segment for 2 days, the internet connection was lost, but nutch did not give any errors. Usually I was seeing an Unknown host exception before. Any ideas what happened, and is it OK to stop the fetch and run it again on the same (old) segment? This is nutch-1.2 Thanks. Alex. -- *Lewis*
Re: force recrawl
Correct. There should be comprehensive documentation on the wiki for these parameters (and many more). On Fri, Aug 19, 2011 at 6:46 PM, Markus Jelsma markus.jel...@openindex.io wrote: addDays is not a crawl switch but a generator switch. You cannot use the crawl command. But if I use bin/nutch crawl urls -dir crawl -depth 2 -topN 50 addDays does not have any effect. Has anyone a nutch crawl script that can also be used to force a recrawl? Well, actually. You can! I seem to have forgotten the -addDays switch of the generator. It adds #days to the current time to force URL's with fetch times in the future to be eligible for fetch. -- View this message in context: http://lucene.472066.n3.nabble.com/force-recrawl-tp3268654p3268779.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
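For the archives, a sketch of the generator switch being discussed (the number of days is arbitrary):

# treat URLs whose fetch time lies up to 31 days in the future as due now
bin/nutch generate crawl/crawldb crawl/segments -adddays 31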
Re: Empty LinkDB after invertlinks
Hi, Small suggestion, but I do not see any -dir argument passed alongside your initial invertlinks command. I understand that you have multiple segment directories, which have been fetched over a recent number of days, and that the output would also suggest the process was properly executed; however, I have never used the command without the -dir option (and it has always worked for me), therefore I can only suggest that this may be the problem. On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann m.bachm...@uni-kassel.de wrote: Hi Markus, thank you for the quick reply. I already searched for this Configuration error and found: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html Where they say that "This exception is innocuous - it helps to debug at which points in the code the Configuration instances are being created." (...) I have indeed not much disk space on the machine but it should be enough at the moment: root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h . Filesystem Size Used Avail Use% Mounted on /dev/vda1 20G 5.9G 15G 30% /home As I am root and all directories under /home/nutchServer/relaunch_nutch/runtime/local/bin are set to root:root and 755, permissions shouldn't be the problem. Any further suggestions? :-/ Thank you once again On 23.08.2011 16:10, Markus Jelsma wrote: There are some peculiarities in your log: 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config() at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198) at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213) at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93) at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190) at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255) 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226) at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184) at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52) at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32) at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111) Can you check permissions, disk space etc? On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote: Hey Ho, for some reason the invertlinks command produces an empty linkdb. 
I did: root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter LinkDb: starting at 2011-08-23 14:47:21 LinkDb: linkdb: crawl/linkdb LinkDb: URL normalize: false LinkDb: URL filter: false LinkDb: adding segment: crawl/segments/20110817164804 LinkDb: adding segment: crawl/segments/20110817164912 LinkDb: adding segment: crawl/segments/20110817165053 LinkDb: adding segment: crawl/segments/20110817165524 LinkDb: adding segment: crawl/segments/20110817170729 LinkDb: adding segment: crawl/segments/20110817171757 LinkDb: adding segment: crawl/segments/20110817172919 LinkDb: adding segment: crawl/segments/20110819135218 LinkDb: adding segment: crawl/segments/20110819165658 LinkDb: adding segment: crawl/segments/20110819170807 LinkDb: adding segment: crawl/segments/20110819171841 LinkDb: adding segment: crawl/segments/20110819173350 LinkDb: adding segment: crawl/segments/20110822135934 LinkDb: adding segment: crawl/segments/20110822141229 LinkDb: adding segment: crawl/segments/20110822143419 LinkDb: adding segment: crawl/segments/20110822143824 LinkDb: adding segment: crawl/segments/20110822144031 LinkDb: adding segment: crawl/segments/20110822144232 LinkDb: adding segment: crawl/segments/20110822144435 LinkDb: adding segment: crawl/segments/20110822144617 LinkDb: adding segment: crawl/segments/20110822144750 LinkDb: adding segment: crawl/segments/20110822144927 LinkDb: adding segment: crawl/segments/20110822145249 LinkDb: adding segment: crawl/segments/20110822150757
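For comparison, the two invocations LinkDb accepts, as a sketch: -dir expects the parent directory that holds the segments, while the bare form names segments individually, which is what the shell glob above expands to (segment names taken from the output quoted in this thread):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter
bin/nutch invertlinks crawl/linkdb crawl/segments/20110817164804 crawl/segments/20110817164912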
Re: readdblink not showing alllinks
If you post your crawldb dump then we can see the structure of your crawldb and may be able to begin pinpointing the issue. It should not be required for you to undertake another crawl after inverting links for these URLs to be indexed when calling the solrindex command... there must be more to it. On Tue, Aug 23, 2011 at 6:54 PM, abhayd ajdabhol...@hotmail.com wrote: hi after doing invertlink I see the complete link graph... THANKS I'm a bit confused, please help me understand.. I do a crawl using the crawl command. I see around 7000+ urls when I dump the crawldb. Then I do invertlink and I see the complete link graph. After this I do solrindex. After solr indexing is completed I see only 2421 docs. I was expecting 7000+ docs (i.e. the exact number of unique urls which I got from dumping the crawldb as text). Why do I just see 2421 urls/docs in solr? Do I need to execute crawl again after invertlink? Here are some settings -- <name>db.update.max.inlinks</name> <value>1</value> <name>db.ignore.internal.links</name> <value>false</value> <name>db.max.inlinks</name> <value>1</value> <name>db.max.outlinks.per.page</name> <value>-1</value> -- View this message in context: http://lucene.472066.n3.nabble.com/readdblink-not-showing-alllinks-tp3274127p3278779.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
Re: How to save html source to local drive
Hi Can you explain how you tried to save raw html obtained during a crawl to a local drive? I am not entirely sure what you mean here and why you would want to do so given that we already have an array of alternative options available. Can you please expand on this. Thank you On Wed, Aug 24, 2011 at 5:24 AM, dyzc 1393975...@qq.com wrote: Hi, I am using nutch within hadoop distributed computing environment. I tried saving html source to a local drive (not HDFS) via absolute filepath, but I can't find the saved contents on either master node or datanodes. How can I achieve this? Thanks! -- *Lewis*
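One of the alternatives alluded to above: readseg can dump segment data, raw fetched content included, to plain files on the local filesystem (sketch; segment selection and output directories are placeholders):

seg=`ls -d crawl/segments/2* | tail -1`
# dump everything in the segment (content, fetch data, parse data, parse text)
bin/nutch readseg -dump $seg dump_all
# or keep only the raw fetched content records
bin/nutch readseg -dump $seg dump_content -nofetch -nogenerate -noparse -noparsedata -noparsetext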
Re: Recursively searching through web dirs
Hi Adam, My initial thoughts are that you are correct. It is very unusual for your files to be located at a URL in the same domain which is not referenced by the top level or a subsequent level URL within the domain. What I would suggest is that you have a look through your hadoop.log as well as use some of the commands which enable you to investigate your crawldb, segment(s) and linkdb if you've created one. Have a look at the wiki under command line options. On Wed, Aug 24, 2011 at 9:03 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: All, I have a root domain and a couple of directories deep I have some files that I want to index. The problem is that they are not referenced on the main page using a hyperlink or anything like that. http://www.geoglobaldomination.org/kml/temp/ I want to be able to crawl down in to /kml/temp/ without knowing that it's even there. Is there a way to do this in Nutch? echo http://www.geoglobaldomination.org > urls ./nutch crawl urls -threads 10 -depth 10 -topN 20 -solr http://172.16.2.107:8983/solr Nothing, and I suspect that it's because there is not a hyperlink on the main page. Thoughts? Adam -- *Lewis*
Re: Trying to understand and use URLmeta
Hi JB, We have recently finished a complete plugin tutorial which fully explains the functionality of the urlmeta plugin on the wiki. It can be found here [1], could I ask you to have a thorough look at it, and the code and if you still have questions then please reinforce them. [1] http://wiki.apache.org/nutch/WritingPluginExample Thank you On Wed, Aug 24, 2011 at 9:36 PM, John R. Brinkema brink...@teo.uscourts.gov wrote: Hi all, I am trying use URLmeta to inject meta data into documents that I crawl and I am having some problems. First the context: Nutch 1.3 with Solr 3.2 My seed url files looks like: http://mySite.com/Guide/index.** html\trecommended= http://mySite.com/Guide/index.html%5Ctrecommended= Guide\**tkeywords=Guide,Policy,**JBmarker I put JBmarker there so I could see where the metadata got put. Index.html itself is a table of contents of a guide; that is, it is mostly a list of outlinks to parts of the overall guide. My nutch-site.xml includes the following properties: property nameplugin.includes/name valueprotocol-http|**urlfilter-regex|parse-(html|** tika)|index-(basic|anchor|**urlmeta)|scoring-opic|** urlnormalizer-(pass|regex|**basic)/value /property property nameurlmeta.tags/name valuerecommended,keywords/**value /property I fire up nutch to crawl and all goes well. To see what nutch did, I ran 'readseg -dump' and looked at the results. What I found was the following: ... other Recno's above ... Recno:: 56 URL:: http:/mySite.com/Guide/index.**html CrawlDatum:: Version: 7 Status: 65 (signature) Fetch time: Tue Aug 23 10:08:18 EDT 2011 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 0 seconds (0 days) Score: 1.0 Signature: 5c182af41027766eccf1ea60d11277**2c Metadata: CrawlDatum:: Version: 7 Status: 1 (db_unfetched) Fetch time: Tue Aug 23 10:08:04 EDT 2011 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0 Signature: null Metadata: recommended: Guide_ngt_: 1314108489210keywords: Guide,Policy,JBmarker Content:: Version: -1 url: http://mySite.com/Guide/index.**htmlhttp://mySite.com/Guide/index.html base: http://mySite.com/Guide/index.**htmlhttp://mySite.com/Guide/index.html ... lots more content ... CrawlDatum:: Version: 7 Status: 33 (fetch_success) Fetch time: Tue Aug 23 10:08:15 EDT 2011 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0 Signature: null Metadata: recommended: Guide_ngt_: 1314108489210keywords: Guide,Policy,JBmarker_pst_: success(1), lastModified=0 ParseData:: Version: 5 Status: success(1,0) Title: Guide Outlinks: 60 outlink: toUrl: http://mySite.com/Home/About.**htmlhttp://mySite.com/Home/About.htmlanchor: About Me outlink: toUrl: http://mySite.com/Guide/**Contact_The_Guide.htmlhttp://mySite.com/Guide/Contact_The_Guide.htmlanchor: Contact Me ... many more outlinks ... Content Metadata: nutch.content.digest=**5c182af41027766eccf1ea60d11277**2c Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT Content-Length=28798 Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=**20110823100811 Content-Type=text/html Connection=close Server=Netscape-Enterprise/6.0 Parse Metadata: CharEncodingForConversion=**windows-1252 OriginalCharEncoding=windows-**1252 ParseText:: ... lots of parsed text ... Recno:: 57 ... and so forth. JBmarker does not appear anywhere else, in this segment or any of the others. 
When I do a solrindex, JBmarker does not appear to be anywhere. ?? *What I expected* As I understand URLmeta (as defined by the two nutch patches), the metadata that is included with the url is injected into the seed url; that is to say, it is as if the lines: <META NAME="recommended" CONTENT="Guide"> <META NAME="keywords" CONTENT="Guide,Policy,JBmarker"> were in the seed url content. Furthermore, it is as if those two lines were in all the outlink content of the seed url. So, I expected that when I looked at all the CrawlDatum and ParseData of the outlinks from the seed url, I would see the same metadata as in the seed CrawlDatum and ParseData. Which is clearly not the case. As for solrindex, I assume that I have some work to do to get any special metadata actions moved over to solr; a special plugin of some sort. That is, urlmeta does not help get the collected metadata from Nutch to Solr. So what is happening? Where did I go astray? Am I analyzing the Nutch dumps incorrectly? One other side note: I assume that Luke will no longer help me debug Nutch, since it works with Lucene indexes and Nutch no longer creates such beasts. Are there any tools that help with viewing Nutch databases? It seems that Nutch takes some liberties with the data it is dumping (e.g., the meta tags all concatenated together without delimiters; I
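On the solrindex part of JB's question, the usual wiring (a sketch only; the recommended and keywords field names simply mirror the urlmeta.tags values above, and the exact attribute values are assumptions) is to declare matching fields in Solr's schema.xml and map them across in Nutch's conf/solrindex-mapping.xml:

  In Solr's schema.xml:
  <field name="recommended" type="string" stored="true" indexed="true"/>
  <field name="keywords" type="string" stored="true" indexed="true"/>

  In Nutch's conf/solrindex-mapping.xml:
  <field dest="recommended" source="recommended"/>
  <field dest="keywords" source="keywords"/>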
Re: Are there any tutorial for writing regex-normalize.xml?
Apart from looking through the list archives, as far as I am aware nothing has been specifically documented on this topic. In the meantime you may find this helpful: http://geekswithblogs.net/brcraju/articles/235.aspx On Fri, Aug 26, 2011 at 9:22 AM, Kaiwii Ho kaiwi...@gmail.com wrote: I'm going to specify my own regex-normalize.xml. Is there any tutorial for writing regex-normalize.xml? Waiting for your help, and thank you -- *Lewis*
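For anyone finding this thread later: the file is a list of pattern/substitution pairs applied to every URL, in the same format as the sample conf/regex-normalize.xml shipped with Nutch. A minimal sketch (the session-id rule is only an illustration):

  <?xml version="1.0"?>
  <regex-normalize>
    <!-- strip Java session ids appended to URLs -->
    <regex>
      <pattern>(?i);jsessionid=\w+</pattern>
      <substitution></substitution>
    </regex>
  </regex-normalize>

Rules are applied in order, and the patterns use Java regular expression syntax.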
Re: force recrawl
If you only wish to schedule recrawls for that one page, I'm sure this could easily be set up by writing a bash script specifying the -adddays argument with your commands. This could then be set up and run as a cron job? Please someone correct me if I am wrong. On Fri, Aug 26, 2011 at 10:22 PM, Radim Kolar h...@sendmail.cz wrote: It would be nice to have a command which would alter database refetch times for specified URLs. With a configuration like this: ^http://www\.google\.com/?$ 1d # fetch google homepage daily I am willing to help with sponsoring the development and testing of such a thing. -- *Lewis*
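A sketch of such a script (paths and the 31-day value are assumptions; -adddays shifts the generator's notion of the current time so that entries fall due earlier than their normal refetch interval):

  #!/bin/bash
  # force entries due within the next 31 days to be generated now
  NUTCH=/path/to/nutch/bin/nutch
  $NUTCH generate crawl/crawldb crawl/segments -adddays 31

Run daily from cron, e.g.: 0 3 * * * /path/to/force-recrawl.sh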
Trying to complete index structure wiki page
Hi, As the title suggests, I'm in the process of getting some comprehensive documentation sorted out for Nutch, and this obviously starts at wiki level. I'm currently working on the IndexStructure page [1]. I would appreciate it if some of you could have a quick look and correct where you see fit. In addition I have a couple of quick questions regarding the last 4 fields I'm trying to account for: 1) BOOST - As far as I am aware this was deprecated in Nutch 1.2 or Nutch 1.1... correct or wrong? 2) DIGEST - Don't have a clue 3) SEGMENT - as 2 4) TIMESTAMP - as 2 It would be great if people could fill me in on the grey areas please. Finally, what a job all contributors, devs and committers did cleaning up the plugin directory between the Nutch 1.2 and 1.3 releases. It's not until you see previous versions on SVN that you can fully appreciate the excellent job that has been done with the 1.3 release. :0) [1] http://wiki.apache.org/nutch/IndexStructure -- *Lewis*
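For what it's worth, these four fields appear in the schema.xml bundled with Nutch 1.3 roughly as follows (a sketch from memory, so the types and attributes should be checked against your copy):

  <field name="boost" type="float" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="tstamp" type="long" stored="true" indexed="false"/>

digest is generally understood as the hash of the fetched content (used for deduplication), segment as the name of the segment the document was fetched in, and tstamp as the fetch timestamp.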
Re: How to generate multiple small segments w/o -numFetchers?
Hi Gabriele, can you expand on your last comment... are you running in deploy mode? And to reply to your first point, yes you are correct, the FAQs need extensive updating. Please feel free to change anything you feel necessary, however as a matter of retaining knowledge for the legacy of Nutch we are now moving deprecated/old information resources to the archive section of the wiki. On Sun, Aug 28, 2011 at 7:58 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote: but that's no local solution: if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) { // override LOG.info("Generator: jobtracker is 'local', generating exactly one partition."); numLists = 1; } On Sun, Aug 28, 2011 at 8:57 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote: it was a bin/nutch generate option. On Sun, Aug 28, 2011 at 6:24 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote: Hello, All over the FAQ http://wiki.apache.org/nutch/FAQ the bin/nutch -numFetchers option is documented as a way to generate multiple small segments. However, that option doesn't seem to be available in either 1.3 or 1.4. So should the FAQ be updated, or am I missing something? How else could I generate multiple small segments? I can see doing that with -topN but that's less convenient. -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- *Lewis*
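A sketch of the -topN workaround Gabriele mentions (directory names are assumptions): since in local mode the generator forces exactly one partition, the usual way to get several small segments is to run several small generate/fetch/update cycles:

  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
  done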
Re: a question about job failed
Hi Zhao, Do you have any more verbose log info from hadoop.log? I have never worked with Nutch 0.9, but it would help if you could at least indicate whether you get something like LOG: info Dedup: starting ... blah blah blah. Taking this to a larger context, I am not particularly happy with the verboseness of logging when there are errors with indexing commands. When we experience an error during any of the index-related commands we get back "Job failed!". It would be nice to get back a reason for the job failing which was clearer than a stack trace. Finally, and this is from a personal point of view, I would highly recommend that you upgrade to a newer (1.3) version of Nutch if you are using this in production. There are significant improvements in functionality. Lewis On Mon, Aug 29, 2011 at 3:24 AM, zhao 253546...@qq.com wrote: Dear all, I am using Nutch 0.9 and have a question. A detailed description of the problem is: Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439) at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) Thank you for your help zhao -- *Lewis*
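To pull the underlying cause out of the log, something along these lines usually works (the log location shown, logs/hadoop.log, is the typical default and may differ in a 0.9 layout):

  grep -i -B 2 -A 10 "dedup" logs/hadoop.log
  tail -n 200 logs/hadoop.log

The lines immediately before the IOException are normally the real error the dedup job hit.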
Re: SSHD for Nutch 1.3 in Pseudo Distributed mode
If it complains about SSH errors then I would first ensure that you can log in through your SSH client, e.g. ssh -v localhost, prior to executing any Hadoop scripts. Further to this, unless you are actually experiencing Nutch-related problems on a pseudo or cluster setup, probably the best place to go is the Hadoop user lists. This is only a thought, but it would make most sense. On Mon, Aug 29, 2011 at 3:58 PM, webdev1977 webdev1...@gmail.com wrote: Do I NEED SSHD for Nutch 1.3 in Pseudo Distributed mode? I am running on a Windows server using cygwin (obviously :-) I can not get hadoop/nutch to run in deploy mode and I am not sure if it has something to do with ssh or not. When I run start-all.sh it gives me some ssh usage errors and also says it is starting the jobtracker and namenode. In the hadoop log it complains about not being able to write the file: hdfs://localhost:9000/cygdrive/r/EnterpriseSearch/hadoop/mapreduce/system/jobtracker.info. I have configured core-site.xml, hdfs-site.xml and mapred-site.xml -- View this message in context: http://lucene.472066.n3.nabble.com/SSHD-for-Nutch-1-3-in-Pseudo-Distributed-mode-tp3292907p3292907.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis*
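The passwordless-SSH setup that start-all.sh expects looks roughly like this (a sketch; under Cygwin the sshd service also needs to be installed and running first, typically via ssh-host-config):

  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  ssh localhost    # should now log in without a password prompt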