RE: Suffix URLFilter not working
We happily use that filter just as it is shipped with Nutch. Just enabling it in plugin.includes works for us. To ease testing you can use bin/nutch org.apache.nutch.net.URLFilterChecker to test filters. -Original message- From:Bai Shen baishen.li...@gmail.com Sent: Wed 12-Jun-2013 14:32 To: user@nutch.apache.org Subject: Suffix URLFilter not working I'm dealing with a lot of file types that I don't want to index. I was originally using the regex filter to exclude them but it was getting out of hand. I changed my plugin includes from urlfilter-regex to urlfilter-(regex|suffix). I've tried both using the default urlfilter-suffix.txt file and adding the extensions I don't want, and making my own file that starts with + and includes the extensions I do want. Neither of these approaches seems to work. I continue to get URLs added to the database which contain extensions I don't want. Even adding a urlfilter.order section to my nutch-site.xml doesn't work. I don't see any obvious bugs in the code, so I'm a bit stumped. Any suggestions for what else to look at? Thanks.
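[Editor's note] If a programmatic check is easier than the command-line tool, a small harness like the one below can exercise the suffix filter directly. This is an untested sketch: it assumes the stock org.apache.nutch.urlfilter.suffix.SuffixURLFilter class, the urlfilter.suffix.file property from nutch-default.xml, and that the Nutch jars are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;
import org.apache.nutch.urlfilter.suffix.SuffixURLFilter;
import org.apache.nutch.util.NutchConfiguration;

public class SuffixFilterCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // Point at the rules file under test (path is an assumption).
    conf.set("urlfilter.suffix.file", "suffix-urlfilter.txt");

    URLFilter filter = new SuffixURLFilter();
    filter.setConf(conf);

    // filter() returns the URL when it is accepted and null when it is rejected.
    for (String url : args) {
      System.out.println(url + " -> " + filter.filter(url));
    }
  }
}

If URLs that the filter rejects here still show up in the CrawlDB, check that the suffix plugin is listed in plugin.includes of the configuration Nutch actually loads at runtime, not only in a local copy.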
RE: HTMLParseFilter equivalent in Nutch 2.2 ???
I think that for Nutch 2.x HTMLParseFilter was renamed to ParseFilter. This is not true for 1.x, see NUTCH-1482. https://issues.apache.org/jira/browse/NUTCH-1482 -Original message- From:Tony Mullins tonymullins...@gmail.com Sent: Wed 12-Jun-2013 14:37 To: user@nutch.apache.org Subject: HTMLParseFilter equivalent in Nutch 2.2 ??? Hi, if I go to http://wiki.apache.org/nutch/AboutPlugins, it shows me that HTMLParseFilter is the extension point for adding custom metadata to HTML and that its filter method's signature is 'public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)', but that is in the 1.4 API doc. I am on Nutch 2.2 and there is no class by the name of HTMLParseFilter in the v2.2 API doc http://nutch.apache.org/apidocs-2.2/allclasses-noframe.html. So please tell me which class to use in the v2.2 API for adding my custom rule to extract some data from an HTML page (is it ParseFilter?) and add it to the HTML metadata, so that later I could add it to my Solr using an indexing filter plugin. Thanks, Tony.
RE: using Tika within Nutch to remove boiler plates?
we don't use Boilerpipe anymore so no point in sharing. Just set those two configuration options in nutch-site.xml as
<property> <name>tika.use_boilerpipe</name> <value>true</value> </property>
<property> <name>tika.boilerpipe.extractor</name> <value>ArticleExtractor</value> </property>
and it should work -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 11-Jun-2013 01:42 To: user user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Marcus, do you mind sharing a sample nutch-site.xml? On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma markus.jel...@openindex.io wrote: Those settings belong to nutch-site. Enable BP and set the correct extractor and it should work just fine. -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Sun 09-Jun-2013 20:47 To: user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Hi Joe, I've not used this feature, it would be great if one of the others could chime in here. From what I can infer from the correspondence on the issue, and the available patches, you should be applying the most recent one uploaded by Markus [0] as your starting point. This is dated as 22/11/2011. On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com wrote: One of the comments mentioned the following: tika.use_boilerpipe=true tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor which part of the code is it referring to? You will see this included in one of the earlier patches uploaded by Markus on 11/05/2011 [1] Also, within the current Nutch config, should I focus on parse-plugins.xml? Look at the other patches and also Gabriele's comments. You may most likely need to alter something but AFAICT the work has been done... it's just a case of pulling together several contributions. Maybe you should look at the patch for 2.x (uploaded most recently by Roland) and see what is going on there. hth [0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch [1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
RE: Data Extraction from 100+ different sites...
Hi, Yes, you should write a plugin that has a parse filter and an indexing filter. To ease maintenance you would want to have a file per host/domain containing XPath expressions, far easier than switch statements that need to be recompiled. The indexing filter would then index the field values extracted by your parse filter. Cheers, Markus -Original message- From:Tony Mullins tonymullins...@gmail.com Sent: Tue 11-Jun-2013 16:07 To: user@nutch.apache.org Subject: Data Extraction from 100+ different sites... Hi, I have 100+ different sites (and maybe more will be added in the near future). I have to crawl them and extract my required information from each site, so each site would have its own extraction rules (XPaths). So far I have seen there is no built-in mechanism in Nutch to fulfill my requirement and I may have to write a custom HTMLParseFilter extension and an IndexingFilter plugin. And I may have to write 100+ switch cases in my plugin to handle the extraction rules of each site. Is this the best way to handle my requirement or is there any better way to handle it? Thanks for your support and help. Tony.
RE: using Tika within Nutch to remove boiler plates?
Yes, Boilerpipe is complex and difficult to adapt. It also requires you to preset an extraction algorithm, which is impossible for us. I've created an extractor instead that works for most pages and ignores stuff like news overviews and major parts of homepages. It's also tightly coupled with our date extractor (based on [1]) and language detector (based on LangDetect) and image extraction. In many cases Boilerpipe's ArticleExtractor will work very well, but date extraction such as NUTCH-141 won't do the trick as it only works on the extracted text as a whole and does not consider page semantics. [1]: https://issues.apache.org/jira/browse/NUTCH-1414 -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 11-Jun-2013 18:06 To: user user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Any particular reason why you don't use boilerpipe any more? So what do you suggest as an alternative? On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma markus.jel...@openindex.io wrote: we don't use Boilerpipe anymore so no point in sharing. Just set those two configuration options in nutch-site.xml as
<property> <name>tika.use_boilerpipe</name> <value>true</value> </property>
<property> <name>tika.boilerpipe.extractor</name> <value>ArticleExtractor</value> </property>
and it should work -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 11-Jun-2013 01:42 To: user user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Marcus, do you mind sharing a sample nutch-site.xml? On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma markus.jel...@openindex.io wrote: Those settings belong to nutch-site. Enable BP and set the correct extractor and it should work just fine. -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Sun 09-Jun-2013 20:47 To: user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Hi Joe, I've not used this feature, it would be great if one of the others could chime in here. From what I can infer from the correspondence on the issue, and the available patches, you should be applying the most recent one uploaded by Markus [0] as your starting point. This is dated as 22/11/2011. On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com wrote: One of the comments mentioned the following: tika.use_boilerpipe=true tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor which part of the code is it referring to? You will see this included in one of the earlier patches uploaded by Markus on 11/05/2011 [1] Also, within the current Nutch config, should I focus on parse-plugins.xml? Look at the other patches and also Gabriele's comments. You may most likely need to alter something but AFAICT the work has been done... it's just a case of pulling together several contributions. Maybe you should look at the patch for 2.x (uploaded most recently by Roland) and see what is going on there. hth [0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch [1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
RE: Data Extraction from 100+ different sites...
You can use URLUtil in that parse filter to determine which host/domain you are on and lazily load the file with expressions for that host. Just keep a Map<hostname, List<expressions>> in your object and load lists of expressions on demand. -Original message- From:Tony Mullins tonymullins...@gmail.com Sent: Tue 11-Jun-2013 18:59 To: user@nutch.apache.org Subject: Re: Data Extraction from 100+ different sites... Hi Markus, I couldn't understand how I can avoid switch cases in your suggested idea. I would have one plugin which will implement HtmlParseFilter, I would have to check the current URL by calling content.getUrl(), and this all will be happening in the same class, so I would have to add switch cases... I could add the XPath expressions for each site in separate files, but to get the XPath expressions I would have to decide which file to read, and for that I would have to put this logic in a switch case. Please correct me if I am getting this all wrong!!! And I think this is a common requirement for web crawling solutions, to get custom data from pages... so aren't there any such Nutch plugins available on the web? Thanks, Tony. On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Yes, you should write a plugin that has a parse filter and an indexing filter. To ease maintenance you would want to have a file per host/domain containing XPath expressions, far easier than switch statements that need to be recompiled. The indexing filter would then index the field values extracted by your parse filter. Cheers, Markus -Original message- From:Tony Mullins tonymullins...@gmail.com Sent: Tue 11-Jun-2013 16:07 To: user@nutch.apache.org Subject: Data Extraction from 100+ different sites... Hi, I have 100+ different sites (and maybe more will be added in the near future). I have to crawl them and extract my required information from each site, so each site would have its own extraction rules (XPaths). So far I have seen there is no built-in mechanism in Nutch to fulfill my requirement and I may have to write a custom HTMLParseFilter extension and an IndexingFilter plugin. And I may have to write 100+ switch cases in my plugin to handle the extraction rules of each site. Is this the best way to handle my requirement or is there any better way to handle it? Thanks for your support and help. Tony.
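[Editor's note] To make the lazy-loading idea above concrete, here is a minimal, self-contained sketch of a per-host rule cache. It is not an actual Nutch plugin: the rule directory layout (one file of XPath expressions per host) and all names are illustrative, and wiring it into an HtmlParseFilter and an indexing filter is left out.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Per-host XPath rule cache, loaded lazily on first use (illustrative sketch). */
public class HostRuleCache {
  private final Map<String, List<String>> rulesByHost = new HashMap<String, List<String>>();
  private final String ruleDir; // e.g. "conf/xpath-rules/" (hypothetical location)

  public HostRuleCache(String ruleDir) {
    this.ruleDir = ruleDir;
  }

  /** Returns the XPath expressions for the URL's host, reading the file only once. */
  public synchronized List<String> rulesFor(String url) throws IOException {
    String host = new URL(url).getHost(); // Nutch's URLUtil.getHost() can be used instead
    List<String> rules = rulesByHost.get(host);
    if (rules == null) {
      rules = new ArrayList<String>();
      // One expression per line in a file named after the host, e.g. www.example.com.txt
      BufferedReader in = new BufferedReader(new FileReader(ruleDir + host + ".txt"));
      try {
        String line;
        while ((line = in.readLine()) != null) {
          line = line.trim();
          if (!line.isEmpty() && !line.startsWith("#")) {
            rules.add(line);
          }
        }
      } finally {
        in.close();
      }
      rulesByHost.put(host, rules);
    }
    return rules;
  }
}

A parse filter would call rulesFor(content.getUrl()), evaluate each expression against the parsed DOM, and put the results into the parse metadata; the indexing filter then copies those metadata values into index fields, as Markus describes.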
RE: using Tika within Nutch to remove boiler plates?
In my opinion Boilerpipe is the most effective free and open source tool for the job :) It does require some patching (see linked issues) and a manual upgrade to Boilerpipe 1.2.0. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 11-Jun-2013 21:19 To: user user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? So what in your opinion is the most effective way of removing boilerplates in Nutch crawls? On Tue, Jun 11, 2013 at 12:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: Yes, Boilerpipe is complex and difficult to adapt. It also requires you to preset an extraction algorithm, which is impossible for us. I've created an extractor instead that works for most pages and ignores stuff like news overviews and major parts of homepages. It's also tightly coupled with our date extractor (based on [1]) and language detector (based on LangDetect) and image extraction. In many cases Boilerpipe's ArticleExtractor will work very well, but date extraction such as NUTCH-141 won't do the trick as it only works on the extracted text as a whole and does not consider page semantics. [1]: https://issues.apache.org/jira/browse/NUTCH-1414 -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 11-Jun-2013 18:06 To: user user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Any particular reason why you don't use boilerpipe any more? So what do you suggest as an alternative? On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma markus.jel...@openindex.io wrote: we don't use Boilerpipe anymore so no point in sharing. Just set those two configuration options in nutch-site.xml as
<property> <name>tika.use_boilerpipe</name> <value>true</value> </property>
<property> <name>tika.boilerpipe.extractor</name> <value>ArticleExtractor</value> </property>
and it should work -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 11-Jun-2013 01:42 To: user user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Marcus, do you mind sharing a sample nutch-site.xml? On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma markus.jel...@openindex.io wrote: Those settings belong to nutch-site. Enable BP and set the correct extractor and it should work just fine. -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Sun 09-Jun-2013 20:47 To: user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Hi Joe, I've not used this feature, it would be great if one of the others could chime in here. From what I can infer from the correspondence on the issue, and the available patches, you should be applying the most recent one uploaded by Markus [0] as your starting point. This is dated as 22/11/2011. On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com wrote: One of the comments mentioned the following: tika.use_boilerpipe=true tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor which part of the code is it referring to? You will see this included in one of the earlier patches uploaded by Markus on 11/05/2011 [1] Also, within the current Nutch config, should I focus on parse-plugins.xml? Look at the other patches and also Gabriele's comments. You may most likely need to alter something but AFAICT the work has been done... it's just a case of pulling together several contributions. Maybe you should look at the patch for 2.x (uploaded most recently by Roland) and see what is going on there.
hth [0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch [1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
RE: using Tika within Nutch to remove boiler plates?
Those settings belong to nutch-site. Enable BP and set the correct extractor and it should work just fine. -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Sun 09-Jun-2013 20:47 To: user@nutch.apache.org Subject: Re: using Tika within Nutch to remove boiler plates? Hi Joe, I've not used this feature, it would be great if one of the others could chime in here. From what I can infer from the correspondence on the issue, and the available patches, you should be applying the most recent one uploaded by Markus [0] as your starting point. This is dated as 22/11/2011. On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com wrote: One of the comments mentioned the following: tika.use_boilerpipe=true tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor which part of the code is it referring to? You will see this included in one of the earlier patches uploaded by Markus on 11/05/2011 [1] Also, within the current Nutch config, should I focus on parse-plugins.xml? Look at the other patches and also Gabriele's comments. You may most likely need to alter something but AFAICT the work has been done... it's just a case of pulling together several contributions. Maybe you should look at the patch for 2.x (uploaded most recently by Roland) and see what is going on there. hth [0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch [1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
RE: Generator -adddays
Please don't break existing scripts and support lower and uppercase. Markus -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Fri 31-May-2013 19:11 To: user@nutch.apache.org Subject: Re: Generator -adddays Seems like a small cli syntax bug. Please submit a patch and we can commit. Thanks Lewis On Friday, May 31, 2013, Bai Shen baishen.li...@gmail.com wrote: Two quick questions. 1. Why is the parameter -adddays and not -addDays? 2. Should it be changed to match the other parameters or is it another referer? Thanks. -- *Lewis*
RE: How to achieve different fetcher.server.delay configuration for different hosts/sub domains?
You can either use robots.txt or modify the Fetcher. The Fetcher has a FetchItemQueue for each queue; this also records the crawl delay for that queue. A FetchItemQueue is created by FetchItemQueues.getFetchItemQueue(), which is where the crawl delay for the queue is set. You can add a lookup table there that looks up the crawl delay for a given queue id (host, domain or IP). -Original message- From:vivekvl vive...@yahoo.com Sent: Tue 28-May-2013 16:01 To: user@nutch.apache.org Subject: How to achieve different fetcher.server.delay configuration for different hosts/sub domains? I have a problem configuring fetcher.server.delay for my crawl. Some of the subdomains need fetcher.server.delay to be high and some need it to be lower. Is there a straightforward way to achieve this? If yes, what are the configurations I need to make? If this is not going to be simple, is there any workaround to achieve this? Thanks, Vivek -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-achieve-different-fetcher-server-delay-configuration-for-different-hosts-sub-domains-tp4066505.html Sent from the Nutch - User mailing list archive at Nabble.com.
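[Editor's note] To illustrate the lookup-table idea, here is a small sketch of the per-queue delay map such a patch could use. It is not the actual Fetcher patch: the real change would go into FetchItemQueues.getFetchItemQueue() where the FetchItemQueue is constructed, and the class name, file format and example hosts are assumptions.

import java.util.HashMap;
import java.util.Map;

/** Per-queue crawl delay lookup (illustrative sketch, not the real Fetcher code). */
public class PerQueueCrawlDelay {
  private final Map<String, Long> delayMs = new HashMap<String, Long>();
  private final long defaultDelayMs;

  public PerQueueCrawlDelay(long defaultDelayMs) {
    this.defaultDelayMs = defaultDelayMs;
    // In a real patch these entries would be loaded from a config file,
    // e.g. one "queue-id <TAB> delay" pair per line.
    delayMs.put("slow.example.com", 10000L);
    delayMs.put("fast.example.com", 500L);
  }

  /** Crawl delay in milliseconds for a queue id (host, domain or IP). */
  public long delayFor(String queueId) {
    Long d = delayMs.get(queueId);
    return d != null ? d : defaultDelayMs;
  }
}

Inside getFetchItemQueue() you would then pass delayFor(id) instead of the global fetcher.server.delay value when creating the FetchItemQueue for that id.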
Fetcher corrupting some segments
Hi, For some reason the fetcher sometimes produces corrupt, unreadable segments. It then exits with exceptions like "problem advancing post" or a negative array size exception, etc.
java.lang.RuntimeException: problem advancing post rec#702
  at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1225)
  at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:250)
  at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:246)
  at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1431)
  at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1392)
  at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:520)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.io.EOFException
  at java.io.DataInputStream.readFully(DataInputStream.java:197)
  at org.apache.hadoop.io.Text.readString(Text.java:402)
  at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
  at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
  at org.apache.nutch.parse.ParseImpl.readFields(ParseImpl.java:70)
  at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
  at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1282)
  at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1222)
  ... 7 more
2013-05-26 22:41:41,344 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1520)
  at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1556)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1529)
These errors produce the following exception when trying to index.
java.io.IOException: IO error in map input file file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-0
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:242)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-0 at 2620416
  at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
  at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
  at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
  at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
  at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
  at java.io.DataInputStream.readFully(DataInputStream.java:195)
  at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
  at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
  at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
  ... 5 more
Is there any way we can debug this? The errors are usually related to Nutch reading metadata, but since we cannot read the metadata, I cannot know what data is causing the issue :) Any hints to share on how to tackle these issues? Markus
RE: rewriting urls that are index
Hi, The 1.x indexer takes a -normalize parameter and there you can rewrite your URLs. Judging from your patterns the RegexURLNormalizer should be sufficient. Make sure you use the config file containing that pattern only when indexing, otherwise they'll end up in the CrawlDB and segments. Use urlnormalizer.regex.file to specify the file or pass patterns directly using urlnormalizer.regex.rules. Cheers, Markus -Original message- From:Niels Boldt nielsbo...@gmail.com Sent: Mon 22-Apr-2013 15:56 To: user@nutch.apache.org Subject: rewriting urls that are index Hi, We are crawling a site using nutch 1.6 and indexing into solr. However, we need to rewrite the urls that are indexed in the following way. For instance, nutch crawls a page http://www.example.com/article=xxx but when moving data to the index we would like to use the url http://www.example.com/kb#article=xxx instead. So when we get data from solr it will show links to http://www.example.com/kb#article=xxx instead of http://www.example.com/article=xxx Is that possible to do by creating a plugin that implements URLNormalizer, e.g. http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html Or is it better to add a new indexed property that we use. Best Regards Niels
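[Editor's note] For completeness, a normalizer plugin for this rewrite would only be a few lines. The sketch below is illustrative, not a drop-in plugin: the class name is made up, the plugin.xml wiring and setConf()/getConf() from the URLNormalizer interface are omitted, and the scope name used is an assumption. In practice a single RegexURLNormalizer rule applied only in the indexing scope achieves the same thing without any code.

import java.net.MalformedURLException;

/** Rewrites /article=xxx to /kb#article=xxx at indexing time (illustrative sketch). */
public class KbAnchorNormalizer {

  public String normalize(String urlString, String scope) throws MalformedURLException {
    // Only rewrite in the indexing scope so the CrawlDB and segments keep the original URL.
    // The scope name "indexer" is an assumption; check URLNormalizers for the exact constant.
    if ("indexer".equals(scope) && urlString.contains("/article=")) {
      return urlString.replaceFirst("/article=", "/kb#article=");
    }
    return urlString;
  }
}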
RE: Period-terminated hostnames
Rodney, Those are valid URL's but you clearly don't need them. You can either use filters to get rid of them or normalize them away. Use the org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test your config. Markus -Original message- From:Rodney Barnett barn...@ploughman-analytics.com Sent: Thu 18-Apr-2013 22:31 To: user@nutch.apache.org Subject: Period-terminated hostnames I'm using nutch 1.6 to crawl a variety of web pages/sites and I'm finding that my solr database contains pairs of near-duplicate entries where the main difference is that one contains a period after the hostname in the id. For example: entry 1: id: http://example.com/ entry 2: id: http://example.com./ I can't find any references to this issue. Has anyone else noticed this? Is there a good way to correct this? I've added an entry to regex-normalize.xml to remove the period, but I'm not sure yet whether it works. Is there a good way to test the url normalizer configuration? I tracked the source of some of these urls back to hyperlinks extracted from PDF files where the hyperlink doesn't seem to have the period but the linked text is followed by a period. For example: {link}http://example.com{/link}.; where the curly braces indicate the hyperlink boundaries. The command nutch parsechecker reports that the outlink is http://example.com. for this case. Thanks for any assistance. Rodney
RE: How to Continue to Crawl with Nutch Even An Error Occurs?
If Nutch exits with an error then the segment is bad, but a failing thread is not an error that leads to a failed segment. It means the segment is properly fetched, just that some records failed. Those records will be eligible for refetch. Assuming you use the crawl command, the updatedb command will be successful so there should be no issue here. What's the problem? -Original message- From:kamaci furkankam...@gmail.com Sent: Wed 20-Mar-2013 23:48 To: user@nutch.apache.org Subject: How to Continue to Crawl with Nutch Even An Error Occurs? When I crawl with Nutch and an error occurs (i.e. when one of the threads doesn't return within a time limit) it stops crawling and exits. Is there any configuration to continue crawling even if such an error occurs in Nutch? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Continue-to-Crawl-with-Nutch-Even-An-Error-Occurs-tp4049567.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: Does Nutch Checks Whether A Page crawled before or not
The CrawlDB contains information on all URLs and their status, e.g. what HTTP code they got, the interval, some metadata and their fetch time. Use the readdb command to inspect a specific URL. -Original message- From:kamaci furkankam...@gmail.com Sent: Wed 20-Mar-2013 23:52 To: user@nutch.apache.org Subject: Re: Does Nutch Checks Whether A Page crawled before or not Where does Nutch store that information? 2013/3/21 Markus Jelsma-2 [via Lucene] ml-node+s472066n4049568...@n3.nabble.com Nutch selects records that are eligible for fetch. It's either due to a transient failure or because the fetch interval has expired. This means that failed fetches due to network issues are refetched within 24 hours. Successfully fetched pages are only refetched if the current time exceeds the previous fetchTime + interval. -Original message- From:kamaci [hidden email] Sent: Wed 20-Mar-2013 23:46 To: [hidden email] Subject: Does Nutch Checks Whether A Page crawled before or not Let's assume that I am crawling wikipedia.org with depth 1 and topN 1. After it finishes crawling, I rerun that command, and after it finishes, again and again. What happens? Does Nutch skip previously fetched pages or try to crawl the same pages again? -- View this message in context: http://lucene.472066.n3.nabble.com/Does-Nutch-Checks-Whether-A-Page-crawled-before-or-not-tp4049564p4049569.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: [WELCOME] Feng Lu as Apache Nutch PMC and Committer
Feng Lu, welcome! :) -Original message- From:Julien Nioche lists.digitalpeb...@gmail.com Sent: Mon 18-Mar-2013 13:23 To: user@nutch.apache.org Cc: d...@nutch.apache.org Subject: Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer Hi Feng, Congratulations on becoming a committer and welcome! [...] A problem that has been troubling me for a long time is what the target of Nutch 1.x is. Is Nutch 1.x just a transitional version towards Nutch 2.x, or can they coexist because Nutch 1.x has a different data processing method than Nutch 2.x? the latter, it's not so much the processing method that differs as they are very similar but the way data are stored. like Julien said, Nutch 1.x is great for batch processing and 2.x large scale processing. Hmm, I don't think I said that. Both are batch orientated and 1.x is probably better at large scale processing than 2.x (at least currently) Perhaps with more and more people using NoSQL as their back-end DB, the developers should focus more on the development of Nutch 2.x, ensure its stability and improve its functionality. IMHO it's not that the developers should focus on this or that. I see it more as an evolutionary process where things get improved because they are used in the first place or get derelict and abandoned if there is no interest from users. If as you say people prefer to have a SQL backend instead of the sequential HDFS data structures then there will be more contributions and as a result 2.x will be improved. Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
RE: keep all pages from a domain in one slice
Hi, You can't do this with -slice, but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain, but that's far too much work. Why do you want to do this? There may be better ways of achieving your goal. -Original message- From:Jason S jason.stu...@gmail.com Sent: Tue 05-Mar-2013 22:18 To: user@nutch.apache.org Subject: keep all pages from a domain in one slice Hello, I seem to remember seeing a discussion about this in the past but I can't seem to find it in the archives. When using mergesegs -slice, is it possible to keep all the pages from a domain in the same slice? I have just been messing around with this functionality (Nutch 1.6), and it seems like the records are simply split after the counter has reached the slice size specified, sometimes splitting the records from a single domain over multiple slices. How can I segregate a domain to a single slice? Thanks in advance, ~Jason
RE: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
The default heap size of 1G is just enough for a parsing fetcher with 10 threads. The only problem that may arise is very large and complicated PDF files or very large HTML files. If you generate fetch lists of a reasonable size there won't be a problem most of the time. And if you want to crawl a lot, then just generate more small segments. If there is a bug it's most likely to be the parser eating memory and not releasing it. -Original message- From:Tejas Patil tejas.patil...@gmail.com Sent: Sun 03-Mar-2013 22:19 To: user@nutch.apache.org Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread I agree with Sebastian. It was a crawl in local mode and not over a cluster. The intended crawl volume is huge and if we don't override the default heap size to some decent value, there is a high possibility of facing an OOM. On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi chitturikira...@gmail.com wrote: If you find the time you should trace the process. Seems to be either a misconfiguration or even a bug. I will try to track this down soon with the previous configuration. Right now, i am just trying to get data crawled by Monday. Kiran. Luckily, you should be able to retry via bin/nutch parse ... Then trace the system and the Java process to catch the reason. Sebastian On 03/02/2013 08:13 PM, kiran chitturi wrote: Sorry, i am looking to crawl 400k documents with the crawl. I said 400 in my last message. On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi chitturikira...@gmail.com wrote: Hi! I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 2.8GHz. Last night i started a crawl on local mode for 5 seeds with the config given below. If the crawl goes well, it should fetch a total of 400 documents. The crawling is done on a single host that we own. Config -
fetcher.threads.per.queue - 2
fetcher.server.delay - 1
fetcher.throughput.threshold.pages - -1
crawl script settings:
timeLimitFetch - 30
numThreads - 5
topN - 1
mapred.child.java.opts=-Xmx1000m
I have noticed today that the crawl has stopped due to an error and i have found the below error in the logs.
2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms): http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:658)
  at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
  at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
  at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
  at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
  at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
  at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
  at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
(END) Did anyone run into the same issue? I am not sure why the new native thread is not being created. The link here [0] says that it might be due to the limitation on the number of processes in my OS. Will increasing them solve the issue?
[0] - http://ww2.cs.fsu.edu/~czhang/errors.html Thanks! -- Kiran Chitturi
RE: a lot of threads spinwaiting
Hi, Regarding politeness, 3 threads per queue is not really polite :) Cheers -Original message- From:jc jvizu...@gmail.com Sent: Fri 01-Mar-2013 15:08 To: user@nutch.apache.org Subject: Re: a lot of threads spinwaiting Hi Roland and lufeng, Thank you very much for your replies, I already tested lufeng advice, with results pretty much as expected. By the way, my nutch installation is based on 2.1 version with hbase as crawldb storage Roland, maybe fetcher.server.delay param has something to do with that as well, I set it to 3 secs, setting it to 0 would be unpolite? All info you provided has helped me a lot, only one issue remains unfixed yet, there are more than 60 URLs from different hosts in my seed file, and only 20 queues, things may seem that all other 40 hosts have no more URLs to generate, but I really haven't seen any URL coming from those hosts since the creation of the crawldb. Based on my poor experience following params would allow a number of 60 queues for my vertical crawl, am I missing something? topN = 1 million fetcher.threads.per.queue = 3 fetcher.threads.per.host = 3 (just in case, I remember you told me to use per.queue instead) fetcher.threads.fetch = 200 seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only urls from these hosts, they're all there, I checked) crawldb record count 1 million Thanks again for all your help Regards, JC -- View this message in context: http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801p4043988.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: Nutch Incremental Crawl
The default or the injected interval? The default interval can be set in the config (see nutch-default for example). Per URL's can be set using the injector: URL\tnutch.fixedFetchInterval=86400 -Original message- From:David Philip davidphilipshe...@gmail.com Sent: Wed 27-Feb-2013 06:21 To: user@nutch.apache.org Subject: Re: Nutch Incremental Crawl Hi all, Thank you very much for the replies. Very useful information to understand how incremental crawling can be achieved. Dear Markus: Can you please tell me how do I over ride this fetch interval , incase if I require to fetch the page before the time interval is passed? Thanks very much - David On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma markus.jel...@openindex.iowrote: If you want records to be fetched at a fixed interval its easier to inject them with a fixed fetch interval. nutch.fixedFetchInterval=86400 -Original message- From:kemical mickael.lume...@gmail.com Sent: Thu 14-Feb-2013 10:15 To: user@nutch.apache.org Subject: Re: Nutch Incremental Crawl Hi David, You can also consider setting shorter fetch interval time with nutch inject. This way you'll set higher score (so the url is always taken in priority when you generate a segment) and a fetch.interval of 1 day. If you have a case similar to me, you'll often want some homepage fetch each day but not their inlinks. What you can do is inject all your seed urls again (assuming those url are only homepages). #change nutch option so existing urls can be injected again in conf/nutch-default.xml or conf/nutch-site.xml db.injector.update=true #Add metadata to update score/fetch interval #the following line will concat to each line of your seed urls files with the new score / new interval perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=8' [your_seed_url_dir]/* #run command bin/nutch inject crawl/crawldb [your_seed_url_dir] Now, the following crawl will take your urls in top priority and crawl them once a day. I've used my situation to illustrate the concept but i guess you can tweek params to fit your needs. This way is useful when you want a regular fetch on some urls, if it's occured rarely i guess freegen is the right choice. Best, Mike -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: Nutch Incremental Crawl
You can simply reinject the records. You can overwrite and/or update the current record. See the db.injector.update and overwrite settings. -Original message- From:David Philip davidphilipshe...@gmail.com Sent: Wed 27-Feb-2013 11:23 To: user@nutch.apache.org Subject: Re: Nutch Incremental Crawl HI Markus, I meant over riding the injected interval.. How to override the injected fetch interval? While crawling fetch interval was set 30days (default). Now I want to re-fetch same site (that is to force re-fetch) and not wait for fetch interval (30 days).. how can we do that? Feng Lu : Thank you for the reference link. Thanks - David On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma markus.jel...@openindex.iowrote: The default or the injected interval? The default interval can be set in the config (see nutch-default for example). Per URL's can be set using the injector: URL\tnutch.fixedFetchInterval=86400 -Original message- From:David Philip davidphilipshe...@gmail.com Sent: Wed 27-Feb-2013 06:21 To: user@nutch.apache.org Subject: Re: Nutch Incremental Crawl Hi all, Thank you very much for the replies. Very useful information to understand how incremental crawling can be achieved. Dear Markus: Can you please tell me how do I over ride this fetch interval , incase if I require to fetch the page before the time interval is passed? Thanks very much - David On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma markus.jel...@openindex.iowrote: If you want records to be fetched at a fixed interval its easier to inject them with a fixed fetch interval. nutch.fixedFetchInterval=86400 -Original message- From:kemical mickael.lume...@gmail.com Sent: Thu 14-Feb-2013 10:15 To: user@nutch.apache.org Subject: Re: Nutch Incremental Crawl Hi David, You can also consider setting shorter fetch interval time with nutch inject. This way you'll set higher score (so the url is always taken in priority when you generate a segment) and a fetch.interval of 1 day. If you have a case similar to me, you'll often want some homepage fetch each day but not their inlinks. What you can do is inject all your seed urls again (assuming those url are only homepages). #change nutch option so existing urls can be injected again in conf/nutch-default.xml or conf/nutch-site.xml db.injector.update=true #Add metadata to update score/fetch interval #the following line will concat to each line of your seed urls files with the new score / new interval perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=8' [your_seed_url_dir]/* #run command bin/nutch inject crawl/crawldb [your_seed_url_dir] Now, the following crawl will take your urls in top priority and crawl them once a day. I've used my situation to illustrate the concept but i guess you can tweek params to fit your needs. This way is useful when you want a regular fetch on some urls, if it's occured rarely i guess freegen is the right choice. Best, Mike -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: regex-urlfilter file for multiple domains
Yes, it will support that until you run out of memory. But having a million expressions is not going to work nicely. If you have a lot of expressions but can divide them by domain, I would patch the filter so it only executes the rules that are for a specific domain. -Original message- From:Danilo Fernandes dan...@kelsorfernandes.com.br Sent: Tue 26-Feb-2013 11:31 To: user@nutch.apache.org Subject: RE: regex-urlfilter file for multiple domains Tejas, do you have any idea how many rules I can use in the file? Probably I will work with 1M regexes for different URLs. Will Nutch support that?
RE: regex-urlfilter file for multiple domains
No, there is no feature for that. You would have to patch it up yourself. It shouldn't be very hard. -Original message- From:Danilo Fernandes dan...@kelsorfernandes.com.br Sent: Tue 26-Feb-2013 11:37 To: user@nutch.apache.org Subject: RE: regex-urlfilter file for multiple domains Yes, my first option is different files for different domains. The point is how can I link the files with each domain? Do I need to make some changes in the Nutch code or does the project have a feature for that? On Tue, 26 Feb 2013 10:33:37 +, Markus Jelsma wrote: Yes, it will support that until you run out of memory. But having a million expressions is not going to work nicely. If you have a lot of expressions but can divide them by domain, I would patch the filter so it only executes the rules that are for a specific domain. -Original message- From:Danilo Fernandes Sent: Tue 26-Feb-2013 11:31 To: user@nutch.apache.org Subject: RE: regex-urlfilter file for multiple domains Tejas, do you have any idea how many rules I can use in the file? Probably I will work with 1M regexes for different URLs. Will Nutch support that?
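[Editor's note] As a sketch of what such a patch could look like, the class below keys compiled rule sets by host and only evaluates the set matching the URL's host, so a million rules spread over many domains are never all run against one URL. It is illustrative only: loading the per-domain rule files and wiring this into the regex-urlfilter plugin is left out, and the names are made up.

import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Domain-keyed URL filtering (illustrative sketch, not the shipped RegexURLFilter). */
public class DomainKeyedRegexFilter {
  // host or domain -> accept patterns that apply only to it
  private final Map<String, List<Pattern>> rulesByDomain;

  public DomainKeyedRegexFilter(Map<String, List<Pattern>> rulesByDomain) {
    this.rulesByDomain = rulesByDomain;
  }

  /** Mirrors URLFilter.filter(): returns the URL to accept it, null to reject it. */
  public String filter(String url) {
    try {
      String host = new URL(url).getHost();
      List<Pattern> rules = rulesByDomain.get(host);
      if (rules == null) {
        return null; // no rules for this host: reject (or accept, depending on your policy)
      }
      for (Pattern p : rules) {
        if (p.matcher(url).find()) {
          return url;
        }
      }
      return null;
    } catch (Exception e) {
      return null; // malformed URLs are rejected
    }
  }
}

A simple convention for linking files to domains is to name each rule file after the host it applies to (for example regex-urlfilter-www.example.com.txt, a hypothetical naming scheme) and build the map from a directory listing when the plugin is configured.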
RE: Nutch status info on each domain individually
Well, you can always use the DomainStatistics utility to get the raw numbers on hosts, domains and TLDs, but this won't tell you whether a domain has been fully crawled because the crawling frontier can always change. You can be sure that everything (disregarding url filters) has been crawled if no more records are selected before fetched records are eligible again for refetch (default interval). NUTCH-1325 does a better job of providing stats for hosts than the current DomainStatistics but it's uncommitted. It'll work though. https://issues.apache.org/jira/browse/NUTCH-1325 -Original message- From:Tejas Patil tejas.patil...@gmail.com Sent: Mon 25-Feb-2013 20:46 To: user@nutch.apache.org Subject: Re: Nutch status info on each domain individually I can't think of any existing nutch utility which can be used here. Maybe dumping the crawldb and then grepping over it would sound reasonable if the number of hosts is large and the crawldb is small. This will be a bad idea if this has to be done after every nutch cycle on a large crawldb. If you are ready to write some small code, then it can become easy: 1. Write some code to query the index so that you need not have to do that manually. OR 2. Write a map reduce job to read the crawldb wherein the mapper emits the hosts of the urls. #1 is the better deal in terms of execution time. Thanks, Tejas Patil On Mon, Feb 25, 2013 at 11:28 AM, imehesz imeh...@gmail.com wrote: hello, I can finally run Nutch (+Solr) with JAVA, my only question left is, how can I make sure a particular domain has been crawled? Let's say I have 300 sites to crawl and index. So far my work-around was to execute a simple Solr query for each domain URL, and see if the indexing timestamp in the Solr DB is greater than the Nutch crawling start date-time. It works, but I'm curious if there is a better way to do this. thanks, --iM -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-status-info-on-each-domain-individually-tp4042815.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: Differences between 2.1 and 1.6
Something seems to be missing here. It's clear that 1.x has more features and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better if you are going to crawl on a very large scale, but I still haven't seen any numbers to support this assumption. Nutch 1.x can easily deal with many millions of records and deal with billions if you throw some hardware at it. Most users are not going to crawl millions of records. In that case I personally choose 1.x. I prefer the stability and predictability over some performance you are not likely going to need anyway. Besides our large 1.x research cluster we still use 1.x in production for all our customers, running locally on a 2 core 512MB RAM VPS with a crawldb of over 5 million records, and it runs fine, fast and keeps up with newly discovered URLs. The only significant improvements were a better scoring filter and integrating indexing in the fetcher. -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Mon 25-Feb-2013 23:37 To: user@nutch.apache.org Subject: Re: Differences between 2.1 and 1.6 Hi Danilo, You can check out the architecture changes here http://wiki.apache.org/nutch/#Nutch_2.x Nutch trunk (1.7-SNAPSHOT) is here http://svn.apache.org/repos/asf/nutch/trunk/ 2.x is here http://svn.apache.org/repos/asf/nutch/branches/2.x/ On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes dan...@kelsorfernandes.com.br wrote: Hi everyone, Somebody can tell me about the differences between 2.1 and 1.6? Is the SVN trunk 1.* or 2.*? Thanks, Danilo Fernandes -- *Lewis*
RE: Crawl script numberOfRounds
Yes. -Original message- From:Amit Sela am...@infolinks.com Sent: Tue 19-Feb-2013 13:40 To: user@nutch.apache.org Subject: Crawl script "numberOfRounds" Is the crawl script's numberOfRounds argument the equivalent of the depth argument in the crawl command? Thanks.
RE: fields in solrindex-mapping.xml
Those are added by IndexerMapReduce (or the 2.x equivalent) and index-basic. They contain the crawl datum's signature, the timestamp (see index-basic) and the crawl datum score. If you think you don't need them, you can safely omit them. -Original message- From:alx...@aim.com alx...@aim.com Sent: Sat 16-Feb-2013 19:21 To: user@nutch.apache.org Subject: Re: fields in solrindex-mapping.xml Hi Lewis, Why do we need to include the digest, tstamp, boost and batchId fields in solrindex? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Fri, Feb 15, 2013 4:21 pm Subject: Re: fields in solrindex-mapping.xml Hi Alex, OK so we can certainly remove segment from the 2.x solrindex-mapping.xml. It would however be nice to replace this with the appropriate batchId. Can someone advise where the 'segment' field currently comes from in trunk? That way we can at least map the field to the batchId equivalent in 2.x Thank you Lewis On Fri, Feb 15, 2013 at 2:23 PM, alx...@aim.com wrote: Hi Lewis, If I exclude one of the fields tstamp, digest, and boost from solrindex-mapping and schema.xml, solrindex gives the error SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=com.yahoo:http/] unknown field 'tstamp' for each of the above fields, except segment. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Thu, Feb 14, 2013 8:34 pm Subject: Re: fields in solrindex-mapping.xml Hi Alex, Tstamp represents fetch time, used for deduplication. Boost is for scoring-opic and link. This is required in 2.x as well. I don't have the code right now, but you can try removing digest and segment. To me they both look legacy. There is a wiki page on index structure which you can consult and/or add to should you wish. Thank you Lewis On Thursday, February 14, 2013, alx...@aim.com wrote: Hello, I see that there are
<field dest="segment" source="segment"/>
<field dest="boost" source="boost"/>
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>
fields in addition to the title, host and content ones in nutch-2.x's solrindex-mapping.xml. I thought tstamp may be needed for sorting documents. What about the other fields, segment, boost and digest? Can someone explain why these fields are included in solrindex-mapping.xml? Thanks. Alex.
RE: Nutch identifier while indexing.
You can use the subcollection indexing filter to set a value for URLs that match a string. With it you can distinguish them even if they are on the same host and domain. -Original message- From:mbehlok m_beh...@hotmail.com Sent: Wed 13-Feb-2013 21:20 To: user@nutch.apache.org Subject: Re: Nutch identifier while indexing. wish it was that simple: SiteA = www.myDomain.com/index.aspx?site=1 SiteB = www.myDomain.com/index.aspx?site=2 SiteC = www.myDomain.com/index.aspx?site=3 -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285p4040323.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: DiskChecker$DiskErrorException
Hi - Also enough space in your /tmp directory? Cheers -Original message- From:Alexei Korolev alexei.koro...@gmail.com Sent: Mon 11-Feb-2013 09:27 To: user@nutch.apache.org Subject: DiskChecker$DiskErrorException Hello, Already twice I got this error:
2013-02-08 15:26:11,674 WARN mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out in any of the configured local directories
  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
  at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
  at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1443)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2013-02-08 15:26:12,515 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
  at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
I've checked Google, but no luck. I run nutch 1.4 locally and have plenty of free space on disk. I would much appreciate some help. Thanks. -- Alexei A. Korolev
RE: performance question: fetcher and parser in separate map/reduce jobs?
A parsing fetcher does everything in the mapper. Please check the output() method around line 1012 onwards: http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup Parsing, signature, outlink processing (using code in ParseOutputFormat) all happens there. Cheers, Markus -Original message- From:Weilei Zhang zhan...@gmail.com Sent: Sat 09-Feb-2013 23:40 To: user@nutch.apache.org Subject: Re: performance question: fetcher and parser in separate map/reduce jobs? This is indeed helpful. Thanks Lewis. On Wed, Feb 6, 2013 at 6:50 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I've eventually added this to our FAQ's http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F This should explain for you. Lewis On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang zhan...@gmail.com wrote: Hi I have a performance question: why fetcher and parser is staged in two separate jobs instead of one? Intuitively, parser can be included as a part of fetcher reducer, is it? This seems to be more efficient. Thanks -- Best Regards -Weilei -- *Lewis* -- Best Regards -Weilei
RE: performance question: fetcher and parser in separate map/reduce jobs?
Oh, I'd like to add that the biggest problem is memory and the possibility for a parser to hang, consume resources, time out everything else and destroy the segment. -Original message- From:Weilei Zhang zhan...@gmail.com Sent: Sat 09-Feb-2013 23:40 To: user@nutch.apache.org Subject: Re: performance question: fetcher and parser in separate map/reduce jobs? This is indeed helpful. Thanks Lewis. On Wed, Feb 6, 2013 at 6:50 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I've eventually added this to our FAQ's http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F This should explain it for you. Lewis On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang zhan...@gmail.com wrote: Hi I have a performance question: why are the fetcher and parser staged in two separate jobs instead of one? Intuitively, the parser could be included as part of the fetcher reducer, couldn't it? This seems to be more efficient. Thanks -- Best Regards -Weilei -- *Lewis* -- Best Regards -Weilei
RE: Best Practice to optimize Parse reduce step / ParseoutputFormat
-Original message- From:kemical mickael.lume...@gmail.com Sent: Fri 08-Feb-2013 10:53 To: user@nutch.apache.org Subject: Best Practice to optimize Parse reduce step / ParseOutputFormat Hi, I've been looking for some time now into the reasons why the parse reduce step takes a lot of time, and I've found lots of different suggestions but not much feedback on which of them work or not. First, here is a list of the threads I've found, and also the patch for NUTCH-1314: http://lucene.472066.n3.nabble.com/Parse-reduce-slow-as-a-snail-td3296865.html http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-td3758053.html http://lucene.472066.n3.nabble.com/ParseSegment-slow-reduce-phase-td612119.html https://issues.apache.org/jira/browse/NUTCH-1314 Here are some questions about what I've found in them: - It seems that parse reduce time is mainly due to long urls = Can anyone who has excluded long urls (with a patch or regex or whatever) confirm that they now get better performance? Most certainly! - The normalizing step occurs before filtering: = If so, is there a real interest in filtering urls with a regex (like the -^.{350,}$ expression)? The sooner you can reject long URLs, the better. - The patch for NUTCH-1314 seems to apply when you parse with parse-html = I'm using boilerpipe with patch NUTCH-961, should the NUTCH-1314 patch work with it? (I guess not.) And what change should I make (I'm quite afraid to write a patch/plugin myself)? It will help a little but I don't think you'll win much vs. filtering by regex filter. This is not an exhaustive list of questions, so if you have questions and/or recommendations, please add them. Sorry to start a new thread since it could have been added as an answer to my last one: http://lucene.472066.n3.nabble.com/Very-long-time-just-before-fetching-and-just-after-parsing-td4037673.html but I think the title of this one could be useful for more people (mine was too specific) -- View this message in context: http://lucene.472066.n3.nabble.com/Best-Practice-to-optimize-Parse-reduce-step-ParseoutputFormat-tp4039200.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: Could not find any valid local directory for output/file.out
The /tmp directory is not cleaned up IIRC. You're safe to empty it as long a you don't have a job running ;) -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Fri 08-Feb-2013 20:48 To: user@nutch.apache.org Subject: Re: Could not find any valid local directory for output/file.out +1 This is a ridiculous size of tmp for a crawldb of minimal size. There is clearly something wrong On Friday, February 8, 2013, Tejas Patil tejas.patil...@gmail.com wrote: I dont think there is any such property. Maybe its time for you to cleanup /tmp :) Thanks, Tejas Patil On Fri, Feb 8, 2013 at 11:16 AM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Hi lewis an tejas again. I have point the hadoop.tmp.dir property but nutch still consuming to much space for me. Is posible to reduce the space of nutch in my tmp folder with some property of a fetcher process? I always get an exception because the hard disk is full. my crawldb only have 150 MB not more. but my tmp folder continue increasing without control until 60 GB, and fail at this point. please any help - Mensaje original - De: Eyeris Rodriguez Rueda eru...@uci.cu Para: user@nutch.apache.org Enviados: Viernes, 8 de Febrero 2013 10:45:52 Asunto: Re: Could not find any valid local directory for output/file.out Thanks a lot. lewis and tejas, you are very helpfull for me. It function ok, I have pointed to another partition and ok. Problem solved. - Mensaje original - De: Tejas Patil tejas.patil...@gmail.com Para: user@nutch.apache.org Enviados: Jueves, 7 de Febrero 2013 16:32:33 Asunto: Re: Could not find any valid local directory for output/file.out On Thu, Feb 7, 2013 at 12:47 PM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Thank to all for your replies. If i want to change the default location for hadoop job(/tmp), where i can do that ?, because my nutch-site.xml not include nothing pointing to /tmp. Add this property to nutch-site.xml with appropriate value: property namehadoop.tmp.dir/name valueXX/value /property So I have readed about nutch and hadoop but im not sure to understand at all. Is posible to use nutch 1.5.1 in distributed mode ? yes In this case what i need to do for that, I really appreciated your answer because I can´t find a good documentation for this topic. For distributed mode, Nutch is called from runtime/deploy. The conf files should be modified in runtime/local/conf, not in $NUTCH_HOME/conf. So modify the runtime/local/conf/nutch-site.xml to set http.agent.nameproperly. I am assuming that the hadoop setup is in place and hadoop variables are exported. Now, run the nutch commands from runtime/deploy. Thanks, Tejas Patil - Mensaje original - De: Tejas Patil tejas.patil...@gmail.com Para: user@nutch.apache.org Enviados: Jueves, 7 de Febrero 2013 14:04:26 Asunto: Re: Could not find any valid local directory for output/file.out Nutch jobs are executed by Hadoop. /tmp is the default location used by hadoop to store temporary data required for a job. If you dont over-ride hadoop.tmp.dir in any config file, it will use /tmp by default. In your case, /tmp doesnt have ample space left so better over-ride that property and point it to some other location which has ample space. Thanks, Tejas Patil On Thu, Feb 7, 2013 at 10:38 AM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Thanks lewis by your answer. My doubt is why /tmp is increasing while crawl process is doing, and why nutch use that folder. Im using nutch 1.5.1 in single mode and my nutch site not have properties hadoop.tmp.dir. 
I need to reduce the space used by that folder because I only have 40 GB for the nutch machine and 50 GB for the solr machine. Please, some advice or explanation will be accepted. -- *Lewis*
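A minimal sketch of the hadoop.tmp.dir override suggested earlier in this thread, for nutch-site.xml; the path is only an example and should point at a partition with enough free space:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>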
RE: Could not find any valid local directory for output/file.out
Hadoop stores temporary files there such as shuffling map output data, you need it! But you can rf -r it after a complete crawl cycle. Do not clear it while a job is running, it's going to miss it's temp files. -Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Fri 08-Feb-2013 20:53 To: user@nutch.apache.org Subject: Re: Could not find any valid local directory for output/file.out Im using ubuntu server 12.04 only for nutch, I have asigned 40 GB for this. Is /tmp needed for nutch crawl process ? or i can make a crontab for delete /tmp content without problem for nutch crawl. - Mensaje original - De: Tejas Patil tejas.patil...@gmail.com Para: user@nutch.apache.org Enviados: Viernes, 8 de Febrero 2013 14:33:25 Asunto: Re: Could not find any valid local directory for output/file.out I dont think there is any such property. Maybe its time for you to cleanup /tmp :) Thanks, Tejas Patil On Fri, Feb 8, 2013 at 11:16 AM, Eyeris Rodriguez Rueda eru...@uci.cuwrote: Hi lewis an tejas again. I have point the hadoop.tmp.dir property but nutch still consuming to much space for me. Is posible to reduce the space of nutch in my tmp folder with some property of a fetcher process? I always get an exception because the hard disk is full. my crawldb only have 150 MB not more. but my tmp folder continue increasing without control until 60 GB, and fail at this point. please any help - Mensaje original - De: Eyeris Rodriguez Rueda eru...@uci.cu Para: user@nutch.apache.org Enviados: Viernes, 8 de Febrero 2013 10:45:52 Asunto: Re: Could not find any valid local directory for output/file.out Thanks a lot. lewis and tejas, you are very helpfull for me. It function ok, I have pointed to another partition and ok. Problem solved. - Mensaje original - De: Tejas Patil tejas.patil...@gmail.com Para: user@nutch.apache.org Enviados: Jueves, 7 de Febrero 2013 16:32:33 Asunto: Re: Could not find any valid local directory for output/file.out On Thu, Feb 7, 2013 at 12:47 PM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Thank to all for your replies. If i want to change the default location for hadoop job(/tmp), where i can do that ?, because my nutch-site.xml not include nothing pointing to /tmp. Add this property to nutch-site.xml with appropriate value: property namehadoop.tmp.dir/name valueXX/value /property So I have readed about nutch and hadoop but im not sure to understand at all. Is posible to use nutch 1.5.1 in distributed mode ? yes In this case what i need to do for that, I really appreciated your answer because I can´t find a good documentation for this topic. For distributed mode, Nutch is called from runtime/deploy. The conf files should be modified in runtime/local/conf, not in $NUTCH_HOME/conf. So modify the runtime/local/conf/nutch-site.xml to set http.agent.nameproperly. I am assuming that the hadoop setup is in place and hadoop variables are exported. Now, run the nutch commands from runtime/deploy. Thanks, Tejas Patil - Mensaje original - De: Tejas Patil tejas.patil...@gmail.com Para: user@nutch.apache.org Enviados: Jueves, 7 de Febrero 2013 14:04:26 Asunto: Re: Could not find any valid local directory for output/file.out Nutch jobs are executed by Hadoop. /tmp is the default location used by hadoop to store temporary data required for a job. If you dont over-ride hadoop.tmp.dir in any config file, it will use /tmp by default. In your case, /tmp doesnt have ample space left so better over-ride that property and point it to some other location which has ample space. 
Thanks, Tejas Patil On Thu, Feb 7, 2013 at 10:38 AM, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Thanks lewis by your answer. My doubt is why /tmp is increasing while crawl process is doing, and why nutch use that folder. Im using nutch 1.5.1 in single mode and my nutch site not have properties hadoop.tmp.dir. I need reduce the space used for that folder because I only have 40 GB for nutch machine and 50 GB for solr machine. Please some advice or explanation will be accepted. Thanks for your time. - Mensaje original - De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Para: user@nutch.apache.org Enviados: Jueves, 7 de Febrero 2013 13:06:11 Asunto: Re: Could not find any valid local directory for output/file.out Hi, https://wiki.apache.org/nutch/NutchGotchas#DiskErrorException_while_fetching On Thursday, February 7, 2013, Eyeris Rodriguez Rueda eru...@uci.cu wrote: Hi all. I have a problem when i do a crawl for few hour or days, im using nutch 1.5.1 and solr 3.6, but the crawl process fails and i dont know how to fix
RE: increase the number of fetches at agiven time on nutch 1.6 or 2.1
Try setting -numFetchers N on the generator. -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Mon 28-Jan-2013 11:57 To: user@nutch.apache.org Subject: Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1 A higher number of per host threads, etc might not be useful if the bandwidth doesn't scale out. I have a different observation though. We run nutch on a hadoop cluster. Even as we added new machines to the cluster, the fetch phase only creates two tasks. (the original number of nodes when we started) Why is it so ? I have checked that the tasks do get spawned in the newly added nodes. We have this setting in hadoop mapred-site.xml property namemapred.tasktracker.map.tasks.maximum/name value20/value /property We have planned to double the number of websites and see if it still doesn't spawn tasks on each node. I will keep this forum updated with out results. In the meantime, can anyone point out if we have missed any particular configuration ? Thanks, Sourajit On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil tejas.patil...@gmail.comwrote: Hey Peter, I am guessing that you have just increased the global thread count. Have you even increased fetcher.threads.per.host ? This will improve the crawl rate as multiple threads can attack the same site. Dont make it too high or else the system will get overloaded. The nutch wiki has an article [0] about the potential reasons for slow crawls and some good suggestions. [0] : https://wiki.apache.org/nutch/OptimizingCrawls Thanks, Tejas Patil On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto peterbarrett...@gmail.com wrote: I tried increasing the numbers of threads to 50 but the speed is not affected I tried changing the partition.url.mode value to byDomain and fetcher.queue.mode to byDomain but still it does not help the speed. It seems to get urls from 2 domains now and the other domains are not getting crawled. Is this due to the url score? if so how do i crawl urls from all the domains? lewis john mcgibbney wrote Increase number of threads when fetching Also please see nutch-deault.xml for paritioning of urls, if you know your target domains you may wish to adapt the policy. Lewis On Sunday, January 27, 2013, peterbarretto lt; peterbarretto08@ gt; wrote: I want to increase the number of urls fetched at a time in nutch. I have around 10 websites to crawl. so how can i crawl all the sites at a time ? right now i am fetching 1 site with a fetch delay of 2 second but it is too slow. How to concurrently fetch from different domain? -- View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis* -- View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html Sent from the Nutch - User mailing list archive at Nabble.com.
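As a rough sketch with purely illustrative values: -numFetchers controls how many fetch lists (and therefore fetcher map tasks) the generator produces, while fetcher.threads.per.host in nutch-site.xml controls how many threads may hit the same host at once:

bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 4

<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>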
RE: Solr dynamic fields
Hi -Original message- From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu Sent: Mon 28-Jan-2013 17:01 To: user@nutch.apache.org Subject: Solr dynamic fields Hi: I'm currently working on a platform to crawl a large amount of PDF files. Using Nutch (and Tika) I'm able to extract and store the textual content of the files in Solr, but now we want to extract the content of the PDFs by page, meaning we want to store several Solr fields (one per page in the document). Is there any recommended way to accomplish this in Nutch/Solr? With a parse plugin I could store the text from each page in the document's parse metadata; would anything else be needed? Yes, make a custom indexing filter that reads your parsed metadata and adds page-specific fields to the NutchDocument. That should work fine. slds -- It is only in the mysterious equation of love that any logical reasons can be found. Good programmers often confuse halloween (31 OCT) with christmas (25 DEC)
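A minimal sketch of such an indexing filter for Nutch 1.x, assuming (hypothetically) that the parse plugin stored the text of each PDF page under parse-metadata keys page.1, page.2, and so on; the class name, keys and field names are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class PageFieldIndexer implements IndexingFilter {

  private Configuration conf;

  // Copy hypothetical per-page parse metadata (page.1, page.2, ...) into one
  // Solr field per page, e.g. page_1_t, page_2_t, matched by a dynamic *_t field.
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    int i = 1;
    String pageText;
    while ((pageText = parse.getData().getParseMeta().get("page." + i)) != null) {
      doc.add("page_" + i + "_t", pageText);
      i++;
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

The plugin still has to be listed in plugin.includes, and the Solr schema needs a matching dynamic field (for example *_t) for the page fields to be accepted.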
RE: conditional indexing
Hi - i've not yet committed a fix for: https://issues.apache.org/jira/browse/NUTCH-1449 This will allow you to stop documents from being indexed from within your indexing filter. Order can be configured using the indexing.filter.order or something configuration directive. -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Wed 23-Jan-2013 09:24 To: user@nutch.apache.org Subject: conditional indexing We have an implementation of Indexing filter that runs side-by-side the indexer-basic plugin. How is the order determined ? Also, how do I do conditional indexing i.e. stop certain urls from being indexed ? I think I can apply a filter but that approach will not work since we index based on the page contents.
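On the ordering part: in 1.x the property is normally called indexingfilter.order (check nutch-default.xml for the exact name in your release). A sketch for nutch-site.xml, where the custom filter class is of course just a placeholder:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter com.example.MyIndexingFilter</value>
</property>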
RE: Nutch support with regards to Deduplication and Document versioning
If you use 1.x and don't merge segments you still have older versions of documents. There is no active versioning in Nutch 1x except segment naming and merging, if you use it. -Original message- From:Tejas Patil tejas.patil...@gmail.com Sent: Wed 23-Jan-2013 09:25 To: user@nutch.apache.org Subject: Re: Nutch support with regards to Deduplication and Document versioning Hi Anand, Nutch will keep the latest content of a given url (based on the time when it was fetched). It wont store the old versions. Thanks, Tejas On Wed, Jan 23, 2013 at 12:12 AM, Anand Bhagwat abbhagwa...@gmail.comwrote: Hi, I want to know what kind of support does Nutch provides with regards to de-duplication and document versioning? Thanks, Anand.
RE: solrindex deleteGone vs solrclean
Hi, -deleteGone relies on segment information to delete records, which is faster and indeed somewhat on-the-fly. The solrclean command relies on CrawlDB information and will always work, even if you have lost your segments or just periodically delete old segments. Cheers -Original message- From:Jason S jason.stu...@gmail.com Sent: Thu 24-Jan-2013 03:01 To: user@nutch.apache.org Subject: solrindex deleteGone vs solrclean Hello, I'm curious about the difference between using -deleteGone with solrindex and the solrclean command. From what I understand, they basically do the same thing except -deleteGone is more on the fly. Is this correct? Is there any scenario where one would be more appropriate than the other? Thanks in advance! ~Jason
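Roughly, on the 1.x command line (the segment path and Solr URL are placeholders; run bin/nutch solrindex and bin/nutch solrclean without arguments to see the exact usage for your release):

# delete gone and permanently redirected pages while indexing a segment
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/20130101000000 -deleteGone

# periodic cleanup driven by the CrawlDB, independent of the segments
bin/nutch solrclean crawl/crawldb http://localhost:8983/solr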
RE: Synthetic Tokens
Hi, In Nutch a `synthetic token` maps to a field/value pair. You need an indexing filter to read the key/value pair from the parsed metadata and add it as a field/value pair to the NutchDocument. You may also need a custom parser filter to extract the data from somewhere and store it to the parsed metadata as key/value, which you then further process in your indexing filter. Check out the index-basic and index-more plugins for examples. Cheers, -Original message- From:Jakub Moskal jakub.mos...@gmail.com Sent: Mon 21-Jan-2013 04:58 To: user@nutch.apache.org Subject: Synthetic Tokens Hi, I would like to develop a plugin that creates synthetic tokens for some documents that are crawled by Nutch (as described here: http://www.ideaeng.com/synthetic-tokens-need-p2-0604). How can this be done in Nutch? Should I create a new field for every new synthetic token, or should I add them to metadata? I'm not quite sure how fields/metadata relate to the tokens described in the article. Thanks! Jakub
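A minimal sketch of the parse-filter half for Nutch 1.x; the indexing-filter half then reads the same key from the parse metadata and adds it to the NutchDocument. The metadata key, the trigger condition and the class name are all hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class SyntheticTokenParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    // Keep all per-document state in local variables: the same plugin instance
    // may process several documents in parallel (see the thread-safety
    // discussion further down in this digest).
    if (parse.getText().contains("terms and conditions")) {
      parse.getData().getParseMeta().set("doc.kind", "legal");
    }
    return parseResult;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}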
RE: Wrong ParseData in segment
Sebastian! I thought about that too since i do sometimes use class variables in some parse plugins such as storing the Parse object. However, i assumed the plugins were already in a thread-safe environment because each FetcherThread instance has it's own instance of ParseUtil. I'll modify the plugins and see if it helps ;) Thanks, Markus -Original message- From:Sebastian Nagel wastl.na...@googlemail.com Sent: Wed 16-Jan-2013 18:38 To: user@nutch.apache.org Subject: Re: Wrong ParseData in segment Hi Markus, right now I have seen this problem in a small test set of 20 documents: - various document types (HTML, PDF, XLS, zip, doc, ods) - small and quite large docs (up to 12 MB) - local docs via protocol-file - fetcher.parse = true - Nutch 1.4, local mode Somehow metadata from a one doc slipped into another doc: - extracted by a custom HtmlParseFilter plugin (author, keywords, description) - reproducible, though not easily (3-5 trials to get one, rarely two wrong meta fields) - wrong parsemeta is definitely in the segment After adding more and more debug logs the stupid answer is: the custom plugin was not 100% thread-safe. Yes, it wasn't clear to me ;-): the same instance of a plugin may process two documents in parallel. I found also this thread (and NUTCH-496): http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html I didn't find any hint in the wiki (eg. in http://wiki.apache.org/nutch/WritingPluginExample), but I'll add one. Cheers, Sebastian 2012/11/30 Markus Jelsma markus.jel...@openindex.io: Hi In our case it is really in the segment, and ends up in the index. Are there any known issues with parse filters? In that filter we do set the Parse object as class attribute but we reset it with the new Parse object right after filter() is called. I also cannot think of the custom Tika ContentHandler to be the issue, a new ContentHandler is created for each parse and passed to the TeeContentHandler, just all other ContentHandlers. I assume an individual parse is completely isolated from another because all those objects are created new for each record. Does anyone have a clue, however slight? Or any general tips on this, or how to attempt to reproduce it? Thanks -Original message- From:Sebastian Nagel wastl.na...@googlemail.com Sent: Fri 30-Nov-2012 21:04 To: user@nutch.apache.org Subject: Re: Wrong ParseData in segment Hi Markus, sounds somewhat similar to NUTCH-1252 but that was rather trivial and easy to reproduce. Sebastian 2012/11/30 Markus Jelsma markus.jel...@openindex.io: Hi, We've got an issue where one in a few thousand records partially contains another record's ParseMeta data. To be specific, record A ends up with the ParseMeta data of record B that is added by one of our custom parse plugins. I'm unsure as to where the problem really is because the parse plugin receives data from a modified parser plugin that in turn adds a custom Tika ContentHandler. Because i'm unable to reproduce this i had to inspect the code for places where an object is reused but an attribute is not reset. To me, that would be the most obvious problem, but until now i've been unsuccessful in finding the issue! Regardless of how remote the chance is of someone having had some similar issue: does anyone have some ideas to share? Thanks, Markus
RE: Wrong ParseData in segment
Hi Sebastian, Makes sense, i'll be sure to modify the parser plugins. Perhaps it would be worth trying to make sure a single thread uses a single instance. I don't know why it works the way it does. Judging from the pointed thread it's intended behaviour. On the other side, reusing parser plugins the way it's now doesn't make too much sense. There's usually not a huge amount of data involved per single instance so conserving heap space doesn't seem a reasonable justification. Thanks, Markus -Original message- From:Sebastian Nagel wastl.na...@googlemail.com Sent: Wed 16-Jan-2013 22:04 To: user@nutch.apache.org Subject: Re: Wrong ParseData in segment Hi Markus, However, i assumed the plugins were already in a thread-safe environment because each FetcherThread instance has it's own instance of ParseUtil. I had similar assumptions but the debug output to investigate my problem is straightforward (the number are object hash codes): 2013-01-16 17:04:29,386 DEBUG parse.CustomParseFilter (instance=1639291161): parsing file:.../1.xls 2013-01-16 17:04:29,452 DEBUG parse.CustomParseFilter (instance=1639291161): parsing file:.../2.doc 2013-01-16 17:04:29,452 DEBUG parse.FieldExtractor - docfragm=1634712296: node meta elem = 598132191 2013-01-16 17:04:29,452 DEBUG parse.FieldExtractor - docfragm=1634712296: author=Christina Maier 2013-01-16 17:04:29,507 DEBUG parse.FieldExtractor - docfragm=1758166206: node meta elem = 598132191 2013-01-16 17:04:29,507 DEBUG parse.FieldExtractor - docfragm=1758166206: author=Christina Maier The same parse filter instance processes two documents in parallel. The plugin does a lot (extracting metadata, pruning content) and the documents are large and take some time to process. Via a shared instance variable references to DOM nodes slipped from one call of filter() to the other. Is there a possibility to ensure that every instance of ParseUtil has it's own plugin instances? Would be worth to check. Cheers, Sebastian On 01/16/2013 06:55 PM, Markus Jelsma wrote: Sebastian! I thought about that too since i do sometimes use class variables in some parse plugins such as storing the Parse object. However, i assumed the plugins were already in a thread-safe environment because each FetcherThread instance has it's own instance of ParseUtil. I'll modify the plugins and see if it helps ;) Thanks, Markus -Original message- From:Sebastian Nagel wastl.na...@googlemail.com Sent: Wed 16-Jan-2013 18:38 To: user@nutch.apache.org Subject: Re: Wrong ParseData in segment Hi Markus, right now I have seen this problem in a small test set of 20 documents: - various document types (HTML, PDF, XLS, zip, doc, ods) - small and quite large docs (up to 12 MB) - local docs via protocol-file - fetcher.parse = true - Nutch 1.4, local mode Somehow metadata from a one doc slipped into another doc: - extracted by a custom HtmlParseFilter plugin (author, keywords, description) - reproducible, though not easily (3-5 trials to get one, rarely two wrong meta fields) - wrong parsemeta is definitely in the segment After adding more and more debug logs the stupid answer is: the custom plugin was not 100% thread-safe. Yes, it wasn't clear to me ;-): the same instance of a plugin may process two documents in parallel. I found also this thread (and NUTCH-496): http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html I didn't find any hint in the wiki (eg. in http://wiki.apache.org/nutch/WritingPluginExample), but I'll add one. 
Cheers, Sebastian 2012/11/30 Markus Jelsma markus.jel...@openindex.io: Hi In our case it is really in the segment, and ends up in the index. Are there any known issues with parse filters? In that filter we do set the Parse object as class attribute but we reset it with the new Parse object right after filter() is called. I also cannot think of the custom Tika ContentHandler to be the issue, a new ContentHandler is created for each parse and passed to the TeeContentHandler, just all other ContentHandlers. I assume an individual parse is completely isolated from another because all those objects are created new for each record. Does anyone have a clue, however slight? Or any general tips on this, or how to attempt to reproduce it? Thanks -Original message- From:Sebastian Nagel wastl.na...@googlemail.com Sent: Fri 30-Nov-2012 21:04 To: user@nutch.apache.org Subject: Re: Wrong ParseData in segment Hi Markus, sounds somewhat similar to NUTCH-1252 but that was rather trivial and easy to reproduce. Sebastian 2012/11/30 Markus Jelsma markus.jel...@openindex.io: Hi, We've got an issue where one in a few thousand records partially contains another record's ParseMeta data. To be specific, record A ends
RE: [ANNOUNCE] New Nutch committer and PMC : Tejas Patil
Nice! Thanks -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Mon 14-Jan-2013 20:28 To: d...@nutch.apache.org Cc: user@nutch.apache.org Subject: Re: [ANNOUNCE] New Nutch committer and PMC : Tejas Patil Welcome aboard Tejas Best Lewis On Monday, January 14, 2013, Julien Nioche lists.digitalpeb...@gmail.com wrote: Dear all, It is my pleasure to announce that Tejas Patil has joined the Nutch PMC and is a new committer. Tejas, would you mind telling us about yourself, what you've done so far with Nutch, which areas you think you'd like to get involved, etc... Congratulations Tejas and welcome on board! BTW If you haven't done so please have a look at http://www.apache.org/dev/new-committers-guide.html. I expect that your account will be created within a few days after reception of the ICLA Best, Julien -- http://digitalpebble.com/img/logo.gif Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- *Lewis*
RE: How segments is created?
-Original message- From:Bayu Widyasanyata bwidyasany...@gmail.com Sent: Sun 13-Jan-2013 07:34 To: user@nutch.apache.org Subject: Re: How segments is created? On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil tejas.patil...@gmail.com wrote: Well, if you know that the front page is updated frequently, set db.fetch.interval.default to a lower value so that urls will be eligible for re-fetch sooner. By default, if a url is fetched successfully, it becomes eligible for re-fetching after 30 days. Very clear! In summary, Nutch cannot identify whether a page has been updated, hence (if a page is updated frequently) we should set db.fetch.interval.default to a lower value to re-fetch the page. No, you can plug in another FetchSchedule that supports adjusting the interval based on whether a record is modified. See the AdaptiveFetchSchedule for an example. Thanks so much! -- wassalam, [bayu]
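A sketch of the relevant nutch-site.xml properties; the three-day interval is only an example:

<property>
  <name>db.fetch.interval.default</name>
  <value>259200</value>
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>

With the adaptive schedule, the db.fetch.schedule.adaptive.* properties in nutch-default.xml control how quickly the interval shrinks for pages found modified and grows for pages found unmodified.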
RE: code changes not reflecting when deployed on hadoop
Seems the job file is not deployed to all task trackers and i'm not sure why. Can you try using the nutch script to run your fetcher? -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Thu 27-Dec-2012 13:29 To: user@nutch.apache.org Subject: code changes not reflecting when deployed on hadoop We have made some changes to Fetcher (v1.5). However, when we build a .job (jar) and deploy it on hadoop it doesn't seem to pick up any changes. This is how we are running it. ./hadoop jar ../nutch/apache-nutch-1.5.1.job org.apache.nutch.fetcher.Fetcher segment on hdfs -threads 4 However, if we modify any of the plugins, it picks up the changes properly. Initially, I doubted that our logic wasn't getting hit. To cross check, we removed Fetcher.class from the .job file and re-executed. Still it seems to run an old version of the code. I strongly suspect, I am missing out something which needs to be done after a code change.
RE: code changes not reflecting when deployed on hadoop
It works the same as in local mode, just have the job file in the CWD. -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Thu 27-Dec-2012 14:51 To: user@nutch.apache.org Subject: Re: code changes not reflecting when deployed on hadoop We are using hadoop 1.1 On Thu, Dec 27, 2012 at 7:13 PM, Sourajit Basak sourajit.ba...@gmail.comwrote: How do you use the nutch script on a cluster ? On Thu, Dec 27, 2012 at 6:25 PM, Markus Jelsma markus.jel...@openindex.io wrote: Can you try using the nutch script to run your fetcher?
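In practice that means running the command from runtime/deploy, where the .job file sits next to the bin directory and the script submits it to Hadoop; the segment path below is a placeholder on HDFS:

cd runtime/deploy
bin/nutch fetch crawl/segments/20130101000000 -threads 4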
RE: Nutch approach for DeadLinks
Hi - Nutch 1.5 has a -deleteGone switch for the SolrIndexer job.This will delete permanent redirects and 404's that have been discovered during the crawl. 1.6 also has a -deleteRobotsNoIndex that will delete pages that have a robots meta tag with a noindex value. -Original message- From:David Philip davidphilipshe...@gmail.com Sent: Wed 26-Dec-2012 06:28 To: user@nutch.apache.org Subject: Nutch approach for DeadLinks Hi All, How does nutch work with deadlinks? say for example, there is a blog site being crawled today and all the blogs (documents) are indexed to solr. Tomorrow, if one of the blog is deleted which mean that the URL indexed yesterday is no more working today! In such cases, How to update the solr indexes such that this particular blog doesn’t come in search results? Recrawling the same site didn’t delete this record in solr. How to handle such cases? I am using nutch 1.5.1 bin. Thanks David
RE: About the version of the nutch
Hi - it depends on the estimated size of your data and the available hardware. You can simply get the current 1.0.x stable or 1.1.x beta Hadoop version, both will run fine. The choice is which Nutch to use, 1.x is very stable and has more features and can be used for very large scale crawls although you might have to use a bit more hardware. 2.x is more efficient in writing and reading data but also less stable, you will run into more problems that divert you from your core tasks. If you have a few powerful machines and your data is in the TB range 1.x is fine. If you like a challenge 2.x is the way to go. We process many TBs each month on just a few powerful machines and run a modified 1.x. -Original message- From:許懷文 k120861032...@gmail.com Sent: Mon 24-Dec-2012 18:17 To: user@nutch.apache.org Subject: About the version of the nutch Dear Nutch Project Team: I am interested in Nutch and Hadoop and want to use them to apply to big data analysis; but I have some problems with the version of them. I want to set up a search engine by myself, and I also choose the Hadoop+Nutch+Solr+Hbase to implement it. Would you mind give me the suitable version of them to set them up? I will appreciate your kind reply and helpful suggestions. Thanks! Best regards, Kevin Hsu.
RE: shouldFetch rejected
Hi - curTime does not exceed fetchTime, thus the record is not eligible for fetch. -Original message- From:Jan Philippe Wimmer i...@jepse.net Sent: Mon 17-Dec-2012 13:31 To: user@nutch.apache.org Subject: Re: shouldFetch rejected Hi again. i still have that issue. I start with a complete new crawl directory structure and get the following error: -shouldFetch rejected 'http://www.lequipe.fr/Football/', fetchTime=1359626286623, curTime=1355738313780 Full-Log: crawl started in: /opt/project/current/crawl_project/nutch/crawl/1300 rootUrlDir = /opt/project/current/crawl_project/nutch/urls/url_1300 threads = 20 depth = 3 solrUrl=http://192.168.1.144:8983/solr/ topN = 400 Injector: starting at 2012-12-17 10:57:36 Injector: crawlDb: /opt/project/current/crawl_project/nutch/crawl/1300/crawldb Injector: urlDir: /opt/project/current/crawl_project/nutch/urls/url_1300 Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: finished at 2012-12-17 10:57:51, elapsed: 00:00:14 Generator: starting at 2012-12-17 10:57:51 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 400 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759 Generator: finished at 2012-12-17 10:58:06, elapsed: 00:00:15 Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. Fetcher: starting at 2012-12-17 10:58:06 Fetcher: segment: /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759 Using queue mode : byHost Fetcher: threads: 20 Fetcher: time-out divisor: 2 QueueFeeder finished: total 1 records + hit by time limit :0 Using queue mode : byHost Using queue mode : byHost fetching http://www.lequipe.fr/Football/ -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Using queue mode : byHost -finishing thread FetcherThread, activeThreads=1 Fetcher: throughput threshold: -1 Fetcher: throughput threshold retries: 5 -finishing thread FetcherThread, activeThreads=0 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: 
finished at 2012-12-17 10:58:13, elapsed: 00:00:07 ParseSegment: starting at 2012-12-17 10:58:13 ParseSegment: segment: /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759 ParseSegment: finished at 2012-12-17 10:58:20, elapsed: 00:00:07 CrawlDb update: starting at 2012-12-17 10:58:20 CrawlDb update: db: /opt/project/current/crawl_project/nutch/crawl/1300/crawldb CrawlDb update: segments: [/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: 404 purging: false CrawlDb update: Merging segment data into db. CrawlDb update: finished at 2012-12-17 10:58:33, elapsed: 00:00:13 Generator: starting at 2012-12-17 10:58:33 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 400 Generator: jobtracker is 'local', generating exactly one partition. -shouldFetch rejected 'http://www.lequipe.fr/Football/', fetchTime=1359626286623, curTime=1355738313780 Generator: 0 records selected for fetching, exiting ... Stopping at depth=1 - no more URLs to fetch. LinkDb: starting at 2012-12-17 10:58:40 LinkDb: linkdb:
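If you want to force records that are not yet due (for example while testing), the generator can pretend the clock is further ahead via -adddays; the value below is arbitrary and roughly matches the gap in the log above, where fetchTime lies about 45 days past curTime:

bin/nutch generate /opt/project/current/crawl_project/nutch/crawl/1300/crawldb /opt/project/current/crawl_project/nutch/crawl/1300/segments -topN 400 -adddays 45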
RE: How to extend Nutch for article crawling
The 1.x indexer can filter and normalize. -Original message- From:Julien Nioche lists.digitalpeb...@gmail.com Sent: Mon 17-Dec-2012 15:11 To: user@nutch.apache.org Subject: Re: How to extend Nutch for article crawling Hi See comments below 1. Add article list pages into url/seed.txt Here's one problem. What I actually want to be indexed is the article pages, not the article list pages. But, if I don't allow the list page to be indexed, Nutch will do nothing because the list page is the entrance. So, how can I index only the article page without list pages? I think that the indexer can now filter URLs but can't remember whether it is for 1.x only or is in 2.x as well. Anyone? This would work if you can find a regular expression that captures the list pages. Another approach would be to tweak the indexer so that it skips documents containing an arbitrary metadatum (e.g. skip.indexing), this metadata would be set in a custom parser when processing the list pages. I think this would be a useful feature to have anyway. URL filters use the URL string only and having the option to skip based on metadata would be good IMHO 2. Write a plugin to parse out the 'author', 'date', 'article body', 'headline' and maybe other information from html. The 'Parser' plugin interface in Nutch 2.1 is: Parse getParse(String url, WebPage page) And the 'WebPage' class has some predefined attributs: public class WebPage extends PersistentBase { //... private Utf8 baseUrl; // ... private Utf8 title; private Utf8 text; // ... private MapUtf8,ByteBuffer metadata; // ... } So, the only field I can put my specified attributes in is the 'metadata'. Is it designed for this purpose? BTW, the Parser in trunk looks like: 'public ParseResult getParse(Content content)', and seems more reasonable for me. The extension point Parser is for low level parsing i.e extract text and metadata from binary formats, which is done typically by parse-tika. What you want to implement is an extension of ParseFilter and add your own entries to the parse metadata. The creative commons plugin should be a good example to get started 3. After the articles are indexed into Solr, another application can query it by 'date' then store the article information into Mysql. My question here is: can Nutch store the article directly into Mysql? Or can I write a plugin to specify the index behavior? you could use the mysql backend in GORA (but it is broken AFAIK) and get the other application to use it, alternatively you could write a custom indexer that sends directly into MySQL but that would be a bit redundant. Do you need to use SOLR at all or is the aim to simply to store in MySQL? Is Nutch a good choice for my purpose? If not, do you guys suggest another good quality framework/library for me? You can definitely do that with Nutch. There are certainly other resources that could be used but they might also need a bit of customisation anyway HTH Julien -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
RE: identify domains from fetch lists taking lot of time.
Hi - you have to get rid of those URL's via URL filters. If you cannot filter them out you can set the fetcher time limit (see nutch-default) to limit the time the fetcher runs or set the fetcher minumum throughput (see nutch-default). The latter will abort the fetcher if less than N pages/second are fetched. The unfetched records will be fetched later on together with other queues. -Original message- From:manubharghav manubharg...@gmail.com Sent: Fri 14-Dec-2012 07:39 To: user@nutch.apache.org Subject: identify domains from fetch lists taking lot of time. Hi, I initiated a crawl on 200 domains till a depth of 5 with a topN of 1 million. A single domain extended my fetch time by a day as it kept generating outlinks to the same page with different urls( the parameters change, but the content remains same.) .http://www.awex.com.au/about-awex.html?s=___.So is there anyway to run the content dedup while fetching itself or are there any other steps to avoid such cases. The problem is that as the size of the fetch list is increasing the fetcher has a delay of say 3 seconds hitting the same server. This is causing the delay in the node and hence delaying the effective time of the crawl. Thanks in advance. Manu Reddy. -- View this message in context: http://lucene.472066.n3.nabble.com/identify-domains-from-fetch-lists-taking-lot-of-time-tp4026942.html Sent from the Nutch - User mailing list archive at Nabble.com.
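A sketch of the two safety valves for nutch-site.xml (values are illustrative), plus an example regex-urlfilter.txt rule for the session-style parameter seen above:

<property>
  <name>fetcher.timelimit.mins</name>
  <value>180</value>
</property>
<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>1</value>
</property>

# drop URLs that only differ by an s= query parameter (adapt to the real pattern)
-[?&]s=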
RE: fetcher partitioning
Sourajit, Looks fine at a first glance. A partitioner does not partition between threads, only mappers. It also makes little sense because in the fetcher number of threads can be set plus the queue mode. Can you open an issue and attach your patch? Thanks, -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Mon 10-Dec-2012 10:55 To: user@nutch.apache.org Cc: Markus Jelsma markus.jel...@openindex.io Subject: Re: fetcher partitioning Could anyone review this patch for using a pluggable custom partitioner ? For the time, I have just copied over HashPartitioner impl. Need to understand a bit more about Hadoop's partitioning. Can the group also comment if this RandomPartioner will distribute urls from the same host across different fetcher threads ? Running in local mode, doesn't seem to have any affect. (My cluster is undergoing routine maintenance; need to wait for testing in distributed mode) Best, Sourajit On Thu, Dec 6, 2012 at 11:21 AM, Sourajit Basak sourajit.ba...@gmail.com mailto:sourajit.ba...@gmail.com wrote: Ok. Give me some time. On Thu, Dec 6, 2012 at 12:07 AM, Markus Jelsma markus.jel...@openindex.io mailto:markus.jel...@openindex.io wrote: -Original message- From:Sourajit Basak sourajit.ba...@gmail.com mailto:sourajit.ba...@gmail.com Sent: Wed 05-Dec-2012 18:16 To: user@nutch.apache.org mailto:user@nutch.apache.org Subject: fetcher partitioning Per my understanding, Nutch partitions urls based on either host, ip or domain. Is it possible to partition based on url patterns ? For e.g my company, a publishing house, is planning to expose its content like http://host/publicationA http://host/publicationA , http://host/publicationB http://host/publicationB . etc. We wish to partition the fetching based on url patterns like /publicationA/* to a thread, /publicationB/* to another, etc. This will not only help us expedite indexing the content but also test the throughput of the site, though the second is an additional benefit we get by doing no extra work. We can attempt to modify the URLPartitioner, but that does not seem to be plug and play like the FetchSchedule. And would mean changes to the core. Indeed, you have to modify the partitioner to make this happen. You are free to do so but you can also make it pluggable as fetch schedule via config and provide a patch so it can be added to the Nutch sources. Any suggestions ? Best, Sourajit
RE: fetcher partitioning
-Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Mon 10-Dec-2012 12:17 To: user@nutch.apache.org Subject: Re: fetcher partitioning Markus, I will open an issue. But I am confused now. Does the partitioner have no effect on the fetchers ? The partitioner decides which record ends up in which fetch list. When running locally, there is always one fetch list and one mapper to ingest that fetch list. Even if we allot 10 threads to the fetcher (all urls belonging to the same host), will each thread fetch its items simultaneously ? That depends on the queue mode used. The fetcher organizes URL's in queues, and threads will just pick the next URL to fetch. URL's are either queued by host, ip or domain. See nutch-default for descriptions on which queue to use and how many threads per queue to set up. What is queue mode? Best, Sourajit On Mon, Dec 10, 2012 at 4:23 PM, Markus Jelsma markus.jel...@openindex.iowrote: Sourajit, Looks fine at a first glance. A partitioner does not partition between threads, only mappers. It also makes little sense because in the fetcher number of threads can be set plus the queue mode. Can you open an issue and attach your patch? Thanks, -Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Mon 10-Dec-2012 10:55 To: user@nutch.apache.org Cc: Markus Jelsma markus.jel...@openindex.io Subject: Re: fetcher partitioning Could anyone review this patch for using a pluggable custom partitioner ? For the time, I have just copied over HashPartitioner impl. Need to understand a bit more about Hadoop's partitioning. Can the group also comment if this RandomPartioner will distribute urls from the same host across different fetcher threads ? Running in local mode, doesn't seem to have any affect. (My cluster is undergoing routine maintenance; need to wait for testing in distributed mode) Best, Sourajit On Thu, Dec 6, 2012 at 11:21 AM, Sourajit Basak sourajit.ba...@gmail.com mailto:sourajit.ba...@gmail.com wrote: Ok. Give me some time. On Thu, Dec 6, 2012 at 12:07 AM, Markus Jelsma markus.jel...@openindex.io mailto:markus.jel...@openindex.io wrote: -Original message- From:Sourajit Basak sourajit.ba...@gmail.com mailto: sourajit.ba...@gmail.com Sent: Wed 05-Dec-2012 18:16 To: user@nutch.apache.org mailto:user@nutch.apache.org Subject: fetcher partitioning Per my understanding, Nutch partitions urls based on either host, ip or domain. Is it possible to partition based on url patterns ? For e.g my company, a publishing house, is planning to expose its content like http://host/publicationA http://host/publicationA , http://host/publicationB http://host/publicationB . etc. We wish to partition the fetching based on url patterns like /publicationA/* to a thread, /publicationB/* to another, etc. This will not only help us expedite indexing the content but also test the throughput of the site, though the second is an additional benefit we get by doing no extra work. We can attempt to modify the URLPartitioner, but that does not seem to be plug and play like the FetchSchedule. And would mean changes to the core. Indeed, you have to modify the partitioner to make this happen. You are free to do so but you can also make it pluggable as fetch schedule via config and provide a patch so it can be added to the Nutch sources. Any suggestions ? Best, Sourajit
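Both the queue mode and the partitioning mode are plain configuration; a sketch for nutch-site.xml, where byHost is the default and byDomain and byIP are the alternatives:

<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
</property>
<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
</property>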
RE: [ANNOUNCE] Apache Nutch 1.6 Released
Thanks Lewis! :) -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Sat 08-Dec-2012 22:56 To: annou...@apache.org; user@nutch.apache.org Cc: d...@nutch.apache.org Subject: [ANNOUNCE] Apache Nutch 1.6 Released Hi All, The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API inluding the normalization of URL's and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. A full PMC statement can be found here [0] The release can be found on official Apache mirrors [1] as well as sources in Maven Central [2] Thank you Lewis On Behalf of the Nutch PMC [0] http://s.apache.org/NFp [1] http://www.apache.org/dyn/closer.cgi/nutch/ [2] http://search.maven.org/#artifactdetails|org.apache.nutch|nutch|1.6|jar -- Lewis
RE: New Scoring
-Original message- From:Pratik Garg saytopra...@gmail.com Sent: Wed 05-Dec-2012 19:17 To: user@nutch.apache.org Cc: Chirag Goel goel.chi...@gmail.com Subject: New Scoring Hi, Nutch provides a default and a new scoring method for giving a score to pages. I have a couple of questions. * What is the difference between these two methods? LinkRank is a power-iterative algorithm such as PageRank. It can be used incrementally and is very stable. OPIC has trouble with increments. * If I want to pass this data to solr during indexing, do I have to do anything extra? The CrawlDB has a score field which is used to populate the boost field. With OPIC this is added via the scoring filter. If you use the LinkRank algorithm, make sure you call its scoreupdater tool, which writes the calculated scores back to the crawldb. * If I want to sort the results from solr based on this data, which field should I use? The boost field. Thanks, Pratik
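Roughly, the LinkRank tool chain looks as follows (paths are placeholders; check the bin/nutch usage output for the exact flags in your release):

# build or refresh the web graph from the segments
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
# run the iterative LinkRank computation
bin/nutch linkrank -webgraphdb crawl/webgraphdb
# write the computed scores back into the crawldb so they reach the boost field at indexing time
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb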
RE: hung threads in big nutch crawl process
This page explains the individual steps: http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling -Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Mon 03-Dec-2012 21:08 To: user@nutch.apache.org Subject: RE: hung threads in big nutch crawl process Thank markus for your anwer. I always have used nutch with console making a complete cycle bin/nutch crawl urls -dir crawl -depth 10 -topN 10 -solr http://localhost:8080/solr Could you explain me how to use a separately process. I was reading the wiki but not function for me because I don’t understand the commands. I want to use nutch in distribuited mode, could you give me a good documentation of it. _ Ing. Eyeris Rodriguez Rueda Teléfono:837-3370 Universidad de las Ciencias Informáticas _ -Mensaje original- De: Markus Jelsma [mailto:markus.jel...@openindex.io] Enviado el: lunes, 03 de diciembre de 2012 1:42 PM Para: user@nutch.apache.org Asunto: RE: hung threads in big nutch crawl process Hi - Hadoop organizes some threads but in Nutch the only job that uses threads is the fetcher. Parses are done using the executor service. It is very well possible that you have some regexes that are very complex and Nutch can take a long time processing those, especially if you parse in the fetcher job. You should run the Nutch jobs separate to find out which job is giving you trouble. -Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Mon 03-Dec-2012 20:31 To: user@nutch.apache.org Subject: hung threads in big nutch crawl process Hi all. I have detected that in big nutch crawl process(depth:10 topN:100 000) some threads are hunged in some part of crawl cicle for example normalizing by regex and fetching urls to. Im using nutch 1.5.1 and solr 3.6. Ram:2GB CPU:CoreI3. OS:Ubuntu 12.04(server) I have a doubt, How nutch manipulate the threads in a cicle of crawl process ?. Is multithread the generation,fetching,parsing process ? PD:Sorry for my english. Is not my native language. 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
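As a rough outline (paths and the Solr URL are placeholders; the wiki page above is the authoritative version), one cycle of the separated commands looks like this:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 100000
SEGMENT=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT

Repeating generate, fetch, parse and updatedb gives the equivalent of the -depth parameter of the crawl command, and it makes it obvious which step is eating the time or the disk.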
RE: Fetch content inside nutch parse
See how the indexchecker fetches URL's: http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java?view=markup -Original message- From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu Sent: Fri 30-Nov-2012 16:46 To: user@nutch.apache.org Subject: Fetch content inside nutch parse Is it possible to use the Nutch fetcher inside a parse plugin? Or should I use some third-party library? slds -- It is only in the mysterious equation of love that any logical reasons can be found. Good programmers often confuse halloween (31 OCT) with christmas (25 DEC)
RE: Indexing-time URL filtering again
Please send us the regex file. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 29-Nov-2012 04:48 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again I made sure I got the most recent trunk, Markus. I don't understand why the problem persists. On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma markus.jel...@openindex.iowrote: I checked the code. You're probably not pointing it to a valid path or perhaps the build is wrong and you haven't used ant clean before building Nutch. If you keep having trouble you may want to check out trunk. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Mon 26-Nov-2012 00:40 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again OK. I'm testing it. But like I said, even when I reduce the patterns to the simpliest form -., the problem still persists. On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma markus.jel...@openindex.iowrote: It's taking input from stdin, enter some URL's to test it. You can add an issue with reproducable steps. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 23:49 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again I ran the regex tester command you provided. It seems to be taking forever (15 min + by now). On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang smartag...@gmail.com wrote: you mean the content my pattern file? well, even wehn I reduce it to simply -., the same problem still pops up. On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma markus.jel...@openindex.io wrote: You seems to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce you can file an issue in Jira. This NPE should also occur if your run the regex tester. nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined In the mean time you can check if a rule causes the NPE. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 23:26 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again the last few lines of hadoop.log: 2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer 2012-11-25 16:30:30,218 WARN mapred.LocalJobRunner - job_local_0001 java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 
5 more Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) ... 10 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601
RE: size of crawl
Impossible to say but perhaps there are more non-200 fetched records. Carefully look at the fetcher logs and inspect the crawldb with the readdb -stats command. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 29-Nov-2012 07:04 To: user user@nutch.apache.org Subject: size of crawl With the same set of parameters (-depth 5 -topN 200), I run two different crawls: Crawl 1: 2 sites Crawl 2: 4 sites (superset of the 2 in Crawl1) However, I end up having much fewer docs in Crawl 2. Can anybody suggest the reason(s)? Thanks. Joe.
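The crawldb inspection mentioned above is, for example:

bin/nutch readdb crawl/crawldb -stats

which prints totals per status (db_fetched, db_unfetched, db_gone, db_redir_temp, db_redir_perm), so you can see whether the missing documents were never generated, failed to fetch, or were filtered out.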
RE: Nutch efficiency and multiple single URL crawls
As i said, you don't rebuild, you just overwrite the config file in the hadoop config directory on the data nodes. Config files are looked up there as well. Just copy the file to the data nodes. -Original message- From:AC Nutch acnu...@gmail.com Sent: Thu 29-Nov-2012 05:38 To: user@nutch.apache.org Subject: Re: Nutch efficiency and multiple single URL crawls Thanks for the help. Perhaps I am misunderstanding, what would be the proper way to leverage this? I am a bit new to Nutch 1.5.1, I've been using 1.4 and have generally been using runtime/deploy/bin/nutch with a .job file. I notice things are done a bit differently in 1.5.1 with the lack of a nutch runtime and nutch deploy directories. How can I run a crawl while leveraging this functionality and not having to rebuild the job file each new crawl? More specifically, I'm picturing the following workflow... (1) update config file to restrict domain crawls - (2) run command that crawls a domain with changes from config file while not having to rebuild job file - (3) index to Solr What would the (general) command be for step (2) is my question. On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma markus.jel...@openindex.iowrote: Hi, Rebuilding the job file for each domain is not a good idea indeed, plus it adds the Hadoop overhead. But you don't have to, we write dynamic config files to each node's Hadoop configuration directory and it is picked up instead of the embedded configuration file. Cheers, -Original message- From:AC Nutch acnu...@gmail.com Sent: Mon 26-Nov-2012 06:50 To: user@nutch.apache.org Subject: Nutch efficiency and multiple single URL crawls Hello, I am using Nutch 1.5.1 and I am looking to do something specific with it. I have a few million base domains in a Solr index, so for example: http://www.nutch.org, http://www.apache.org, http://www.whatever.cometc. I am trying to crawl each of these base domains in deploy mode and retrieve all of their sub-urls associated with that domain in the most efficient way possible. To give you an example of the workflow I am trying to achieve: (1) Grab a base domain, let's say http://www.nutch.org (2) Crawl the base domain for all URLs in that domain, let's say http://www.nutch.org/page1 , http://www.nutch.org/page2, http://www.nutch.org/page3, etc. etc. (3) store these results somewhere (perhaps another Solr instance) and (4) move on to the next base domain in my Solr index and repeat the process. Essentially just trying to grab all links associated with a page and then move on to the next page. The part I am having trouble with is ensuring that this workflow is efficient. The only way I can think to do this would be: (1) Grab a base domain from Solr from my shell script (simple enough) (2) Add an entry to regex-urlfilter with the domain I am looking to restrict the crawl to, in the example above that would be an entry that says to only keep sub-pages of http://www.nutch.org/ (3) Recreate the Nutch job file (~25 sec.) (4) Start the crawl for pages associated with a domain and do the indexing My issue is with step #3, AFAIK if I want to restrict a crawl to a specific domain I have to change regex-urlfilter and reload the job file. This is a pretty significant problem, since adding 25 seconds every single time I start a new base domain is going to add way too many seconds to my workflow (25 sec x a few million = way too much time). Finally the question...is there a way to add url filters on the fly when I start a crawl and/or restrict a crawl to a particular domain on the fly. 
OR can you think of a decent solution to the problem/am I missing something?
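As a rough illustration of the approach Markus describes above (and of the -Durlfilter.regex.file override that comes up elsewhere on this list), the sketch below shows how step (2) could be driven without rebuilding the .job file: the regex rules file is only a configuration property, so it can be swapped per base domain at run time. The class name and file names are hypothetical, not part of any existing Nutch tool.

// Sketch only: run the generate step with a per-domain regex rules file.
// "regex-urlfilter-nutch.org.txt" is a hypothetical file prepared by the
// surrounding shell script and placed where the conf directory / task
// classpath can resolve it (or overwritten in the Hadoop conf dir on the
// data nodes, as suggested above).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.util.NutchConfiguration;

public class PerDomainGenerate {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // point the regex URL filter at the per-domain rules for this run
    conf.set("urlfilter.regex.file", "regex-urlfilter-nutch.org.txt");
    int res = ToolRunner.run(conf, new Generator(),
        new String[] { "crawl/crawldb", "crawl/segments", "-topN", "1000" });
    System.exit(res);
  }
}

The same kind of override works for the other steps (fetch, parse, updatedb, solrindex); the equivalent command-line form, bin/nutch <tool> -Durlfilter.regex.file=..., appears further down in this digest.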
RE: Access crawled content or parsed data of previous crawled url
Hi, This is a difficult problem in MapReduce because one image URL may be embedded in many documents. There are various methods you could use to aggregate the records but none I can think of will work very well or are straightforward to implement. I think the most straightforward and easy to implement method is to create a new key/value pair to store the surrounding text in for each image, and to do this during the parse. This would mean you have to emit a Text,Text pair for each image in every HTML page, with the image's URL as key and the surrounding text as value. You will have to modify the indexer to ingest that structure as well during indexing. This way the existing CrawlDatums for existing images will end up in the reducer together with zero or more of your new key/value pairs. In IndexerMapReduce you can deal with them appropriately. This method works well with MapReduce and does not require too much programming. The downside is that you cannot build a parse plugin and indexing plugin because they cannot handle your new key/value pair. Good luck and let us know what you came up with :)
-Original message- From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu Sent: Thu 29-Nov-2012 19:53 To: user@nutch.apache.org Subject: Re: Access crawled content or parsed data of previous crawled url For now I don't see any way of accessing metadata for a previously parsed document, or am I mistaken?
- Original message - From: alx...@aim.com To: user@nutch.apache.org Sent: Thursday, 29 November 2012 13:38:43 Subject: Re: Access crawled content or parsed data of previous crawled url Hi, Unfortunately, my employer does not want me to disclose details of the plugin at this time. Alex.
-Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 6:20 pm Subject: Re: Access crawled content or parsed data of previous crawled url Hi Alex: What you've done is basically what I'm trying to accomplish: I'm trying to get the text surrounding the img tags to improve the image search engine we're building (this is done when the html page containing the img tag is parsed), and when the image url itself is parsed we generate thumbnails and extract some metadata. But how do you keep these 2 pieces of data linked together inside your index (solr in my case)? The thing is that I'm getting two documents inside solr (one containing the text surrounding the img tag, and another document with the thumbnail). So what brings me trouble is: when the thumbnail is being generated, how can I get the surrounding text detected when the html was parsed? Thanks a lot for all the replies! P.S: Alex, can you share some piece of code (if it's possible) of your working plugins? Or walk me through what you've come up with?
- Original message - From: alx...@aim.com To: user@nutch.apache.org Sent: Wednesday, 28 November 2012 19:54:07 Subject: Re: Access crawled content or parsed data of previous crawled url It is not clear what you are trying to achieve. We have done something similar with regard to indexing img tags. We retrieve img tag data while parsing the html page and keep it in metadata, and when parsing the img url itself we create a thumbnail. hth. Alex.
-Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 2:58 pm Subject: Re: Access crawled content or parsed data of previous crawled url Any documentation about the crawldb api?
I'm guessing it shouldn't be so hard to retrieve a document by its URL (which is basically what I need). I'm also open to any suggestion on this matter, so if anyone has done something similar or has any thoughts on this and can share them, I'll be very grateful. Greetings!
- Original message - From: Stefan Scheffler sscheff...@avantgarde-labs.de To: user@nutch.apache.org Sent: Wednesday, 28 November 2012 15:04:44 Subject: Re: Access crawled content or parsed data of previous crawled url Hi, I think this is possible, because you can write a ParserPlugin which accesses the already stored documents via the segments/crawldb api. But I'm not sure how it will work exactly. Regards Stefan
On 28.11.2012 20:59, Jorge Luis Betancourt Gonzalez wrote: Hi: From what I've seen, Nutch plugins follow the philosophy of one NutchDocument per URL, but I was wondering if there is any way of accessing parsed/crawled content of a previously fetched/parsed URL. Let's say for instance that I have an HTML page with an image embedded: the start point will be http://host.com/test.html which is the first document that gets fetched/parsed; then the OutLink extractor will detect the embedded image inside test.html and then add
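A rough sketch of the Text,Text emit that Markus describes above, written as a plain Hadoop (old mapred API) mapper. The class and the helper it calls are hypothetical, not an existing Nutch plugin; the point is only to show one pair emitted per embedded image, keyed by the image URL, so that the surrounding text arrives in the same reducer group as the image's own CrawlDatum, where IndexerMapReduce would merge it into the image's NutchDocument.

// Hypothetical sketch of the per-image key/value emit described above.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ImageContextEmitter extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  // input: key = URL of the HTML page, value = its markup/extracted text
  // output: key = image URL found in the page, value = text surrounding that <img> tag
  public void map(Text pageUrl, Text pageHtml,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    for (ImageContext ic : extractImageContexts(pageHtml.toString())) {
      output.collect(new Text(ic.imageUrl), new Text(ic.surroundingText));
    }
  }

  // hypothetical helper: pull <img src=...> URLs plus nearby text out of the markup;
  // the HTML parser Nutch already ships with would do the real work here
  private Iterable<ImageContext> extractImageContexts(String html) {
    return java.util.Collections.emptyList();
  }

  private static class ImageContext {
    String imageUrl;
    String surroundingText;
  }
}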
RE: The topN parameter in nutch crawl
Nutch does neither. If scoring is used the records to fetch are ordered by score and if there is no score it's simply sorted alphabetically. With some tuning of a scoring filter you can do whatever you want but in the end everything is going to be crawled (if there are enough resources). What are you trying to do? If you're not going to process many millions of records it doesn't really matter because all records will be fetched within a reasonable amount of time.
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 29-Nov-2012 22:45 To: user@nutch.apache.org Subject: Re: The "topN" parameter in nutch crawl How would you characterize the crawling algorithm? Depth-first, breadth-first, or some heuristic-based? On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, None of all three. The topN parameter simply means that the generator will select up to N records to fetch for each time it is invoked. It's best to forget the notion of depth in crawling, it has little meaning in most cases. Usually one will just continuously crawl until there are no more records to fetch. We continuously invoke the crawler and tell it to do something. If there's nothing to do (but that never happens) we just invoke it again the next time. Cheers,
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 29-Nov-2012 21:58 To: user user@nutch.apache.org Subject: The "topN" parameter in nutch crawl Dear list, This parameter is causing me some confusion. To me, there are at least 3 possible meanings for topN: 1. The branching factor at a given node 2. "the maximum number of pages that will be retrieved at each level up to the depth" (from the wiki), which seems to refer to the total of branching factors at any given level 3. The size of the entire frontier/queue To me, (1) makes the most sense, and (3) is the easiest to implement programming-wise. If (2) is the actual implementation in nutch, it means the effective branching factor would be lower at deeper levels, correct? In this sense, in order to conduct a comprehensive crawl, if we have to trade off between depth and topN, we should probably favor larger topN? In other words, -depth 5 -topN 1000 would make more sense than -depth 10 -topN 100 for a comprehensive crawl, correct? Thanks!
RE: The topN parameter in nutch crawl
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 29-Nov-2012 23:33 To: user@nutch.apache.org Subject: Re: The "topN" parameter in nutch crawl I'm not sure I completely understand. Typically when we think about the crawling problem as one of graph traversal, the # of nodes visited would be some exponential function. Are you saying that this is not true, and if we specify something like -depth 5 -topN 100, we'll at most visit 500 nodes? Yes. Nutch generates fetch lists from the CrawlDB which is nothing more than a sorted list of URLs (by score, then alphabetically). It just picks the first eligible URLs in the sorted list. You really should take a good look at the Generator code, it'll answer most of your questions. On Thu, Nov 29, 2012 at 3:03 PM, Markus Jelsma markus.jel...@openindex.io wrote: Nutch does neither. If scoring is used the records to fetch are ordered by score and if there is no score it's simply sorted alphabetically. With some tuning of a scoring filter you can do whatever you want but in the end everything is going to be crawled (if there are enough resources). What are you trying to do? If you're not going to process many millions of records it doesn't really matter because all records will be fetched within a reasonable amount of time.
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 29-Nov-2012 22:45 To: user@nutch.apache.org Subject: Re: The "topN" parameter in nutch crawl How would you characterize the crawling algorithm? Depth-first, breadth-first, or some heuristic-based? On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, None of all three. The topN parameter simply means that the generator will select up to N records to fetch for each time it is invoked. It's best to forget the notion of depth in crawling, it has little meaning in most cases. Usually one will just continuously crawl until there are no more records to fetch. We continuously invoke the crawler and tell it to do something. If there's nothing to do (but that never happens) we just invoke it again the next time. Cheers,
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 29-Nov-2012 21:58 To: user user@nutch.apache.org Subject: The "topN" parameter in nutch crawl Dear list, This parameter is causing me some confusion. To me, there are at least 3 possible meanings for topN: 1. The branching factor at a given node 2. "the maximum number of pages that will be retrieved at each level up to the depth" (from the wiki), which seems to refer to the total of branching factors at any given level 3. The size of the entire frontier/queue To me, (1) makes the most sense, and (3) is the easiest to implement programming-wise. If (2) is the actual implementation in nutch, it means the effective branching factor would be lower at deeper levels, correct? In this sense, in order to conduct a comprehensive crawl, if we have to trade off between depth and topN, we should probably favor larger topN? In other words, -depth 5 -topN 1000 would make more sense than -depth 10 -topN 100 for a comprehensive crawl, correct? Thanks!
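To make the arithmetic above concrete, here is a toy illustration (not the actual Generator code) of what a per-run topN selection amounts to: order the CrawlDB entries by score, then alphabetically, and take the first N; entries that don't make the cut simply wait for a later generate run.

// Toy illustration of top-N selection; URLs and scores are made up.
import java.util.*;

public class TopNIllustration {
  static class Entry {
    final String url; final float score;
    Entry(String url, float score) { this.url = url; this.score = score; }
  }

  public static void main(String[] args) {
    List<Entry> crawldb = new ArrayList<Entry>(Arrays.asList(
        new Entry("http://example.com/a", 1.2f),
        new Entry("http://example.com/b", 0.3f),
        new Entry("http://example.com/c", 2.0f)));
    // order by score descending, then alphabetically, as described above
    Collections.sort(crawldb, new Comparator<Entry>() {
      public int compare(Entry x, Entry y) {
        int byScore = Float.compare(y.score, x.score);
        return byScore != 0 ? byScore : x.url.compareTo(y.url);
      }
    });
    int topN = 2;
    for (Entry e : crawldb.subList(0, Math.min(topN, crawldb.size()))) {
      System.out.println(e.url + " " + e.score);  // prints c (2.0) then a (1.2); b waits
    }
  }
}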
RE: trunk
Trunk is a directory in svn in which actual development is happening: http://svn.apache.org/viewvc/nutch/trunk/ -Original message- From:Joe Zhang smartag...@gmail.com Sent: Tue 27-Nov-2012 01:46 To: user user@nutch.apache.org Subject: trunk In a different thread, Markus suggested checking out trunk. The relationship between trunk and svn has been confusing to me. Can somebody provide a link to a tutorial, and offer advice on how to access nutch trunk? Thanks.
RE: problem with text/html content type of documents appears application/xhtml+xml in solr index
Hi - are you sure you have tabs separating the target and the mapped mimes? Use the nutch indexchecker tool to quickly test if it works.
-Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Tue 27-Nov-2012 21:18 To: user@nutch.apache.org Subject: RE: problem with text/html content type of documents appears application/xhtml+xml in solr index Hi Markus. I followed your recommendations but my problem persists: some documents still appear with application/xhtml+xml instead of text/html. I added the property to nutch-site.xml and created the conf/contenttype-mapping.txt file:
<property>
  <name>moreIndexingFilter.mapMimeTypes</name>
  <value>true</value>
</property>
I'm using nutch 1.5.1. Tell me if I need to replace index-more.jar in the plugin directory with a fixed version?
RE: problem with text/html content type of documents appears application/xhtml+xml in solr index
:http://blogs.prod.uci.cu/humanOS outlinks :http://blogs.prod.uci.cu/micro outlinks :http://blogs.prod.uci.cu/nova/ outlinks :http://coj.uci.cu/general/about.xhtml outlinks :http://pgs.soporte.uci.cu outlinks :http://portal.albet.prod.uci.cu outlinks :http://portal.calisoft.prod.uci.cu outlinks :http://portal.cdae.prod.uci.cu outlinks :http://portal.cedin.prod.uci.cu outlinks :http://portal.cegel.prod.uci.cu outlinks :http://portal.ceige.prod.uci.cu outlinks :http://portal.cenia.prod.uci.cu outlinks :http://portal.cesim.prod.uci.cu outlinks :http://portal.cice.prod.uci.cu outlinks :http://portal.cidi.prod.uci.cu outlinks :http://portal.cised.prod.uci.cu outlinks :http://portal.datec.prod.uci.cu outlinks :http://portal.dgp.prod.uci.cu outlinks :http://portal.dt.prod.uci.cu outlinks :http://portal.fortes.prod.uci.cu outlinks :http://portal.frcav.cav.uci.cu outlinks :http://portal.frgrm.grm.uci.cu outlinks :http://portal.frhab.hab.uci.cu outlinks :http://portal.geitel.prod.uci.cu outlinks :http://portal.geysed.prod.uci.cu outlinks :http://portal.hlg.uci.cu outlinks :http://portal.isec.prod.uci.cu outlinks :http://portal.tlm.prod.uci.cu outlinks :http://portal.vcl.uci.cu/ outlinks :http://postgresql.uci.cu outlinks :http://www.redmine.org/ outlinks :http://www.redmine.org/guide contentLength : 5280
and this is the page code that I checked with Firefox:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Comunidades UCI</title>
...
Do I need to replace the index-more.jar plugin?
- Original message - From: Markus Jelsma markus.jel...@openindex.io To: user@nutch.apache.org Sent: Tuesday, 27 November 2012 15:33:20 Subject: RE: problem with text/html content type of documents appears application/xhtml+xml in solr index Hi - are you sure you have tabs separating the target and the mapped mimes? Use the nutch indexchecker tool to quickly test if it works.
-Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Tue 27-Nov-2012 21:18 To: user@nutch.apache.org Subject: RE: problem with text/html content type of documents appears application/xhtml+xml in solr index Hi Markus. I followed your recommendations but my problem persists: some documents still appear with application/xhtml+xml instead of text/html. I added the property to nutch-site.xml and created the conf/contenttype-mapping.txt file:
<property>
  <name>moreIndexingFilter.mapMimeTypes</name>
  <value>true</value>
</property>
I'm using nutch 1.5.1. Tell me if I need to replace index-more.jar in the plugin directory with a fixed version?
RE: Indexing-time URL filtering again
Building from source with ant produces a local runtime in runtime/local, that's the same as when you extract an official release. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Mon 26-Nov-2012 22:23 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again yes that's wht i've been doing. but ant itself won't produce the official binary release. On Mon, Nov 26, 2012 at 2:16 PM, Markus Jelsma markus.jel...@openindex.iowrote: just ant will do the trick. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Mon 26-Nov-2012 22:03 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again talking about ant, after ant clean, which ant target should i use? On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma markus.jel...@openindex.iowrote: I checked the code. You're probably not pointing it to a valid path or perhaps the build is wrong and you haven't used ant clean before building Nutch. If you keep having trouble you may want to check out trunk. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Mon 26-Nov-2012 00:40 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again OK. I'm testing it. But like I said, even when I reduce the patterns to the simpliest form -., the problem still persists. On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma markus.jel...@openindex.iowrote: It's taking input from stdin, enter some URL's to test it. You can add an issue with reproducable steps. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 23:49 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again I ran the regex tester command you provided. It seems to be taking forever (15 min + by now). On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang smartag...@gmail.com wrote: you mean the content my pattern file? well, even wehn I reduce it to simply -., the same problem still pops up. On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma markus.jel...@openindex.io wrote: You seems to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce you can file an issue in Jira. This NPE should also occur if your run the regex tester. nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined In the mean time you can check if a rule causes the NPE. 
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 23:26 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again the last few lines of hadoop.log: 2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer 2012-11-25 16:30:30,218 WARN mapred.LocalJobRunner - job_local_0001 java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 5 more Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93
RE: Indexing-time URL filtering again
No, this is no bug. As i said, you need either to patch your Nutch or get the sources from trunk. The -filter parameter is not in your version. Check the patch manual if you don't know how it works. $ cd trunk ; patch -p0 file.patch -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 08:42 To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org Subject: Re: Indexing-time URL filtering again This does seem a bug. Can anybody help? On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang smartag...@gmail.com wrote: Markus, could you advise? Thanks a lot! On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang smartag...@gmail.com wrote: I followed your instruction and applied the patch, Markus, but the problem still persists --- -filter is interpreted as a path by solrindex. On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: Ah, i get it now. Please use trunk or patch your version with: https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 23-Nov-2012 03:08 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again But Markus said it worked for him. I was really he could send his command line. On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Is this a bug? On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang smartag...@gmail.com wrote: Putting -filter between crawldb and segments, I sitll got the same thing: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma markus.jel...@openindex.iowrote: These are roughly the available parameters: Usage: SolrIndexer solr url crawldb [-linkdb linkdb] [-hostdb hostdb] [-params k1=v1k2=v2...] (segment ... | -dir segments) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize] Having -filter at the end should work fine, if it, for some reason, doesn't work put it before the segment and after the crawldb and file an issue in jira, it works here if i have -filter at the end. Cheers -Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 22-Nov-2012 23:05 To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org Subject: Re: Indexing-time URL filtering again Yes, I forgot to do that. But still, what exactly should the command look like? bin/nutch solrindex -Durlfilter.regex.file=UrlFiltering.txt http://localhost:8983/solr/ http://localhost:8983/solr/ .../crawldb/ /segments/* -filter this command would cause nutch to interpret -filter as a path. On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma markus.jel...@openindex.io mailto:markus.jel...@openindex.io wrote: Hi, I just tested a small index job that usually writes 1200 records to Solr. It works fine if i specify -. in a filter (index nothing) and point to it with -Durlfilter.regex.file=path like you do. I assume you mean by `it doesn't work` that it filters nothing and indexes all records from the segment. Did you forget the -filter parameter? 
Cheers -Original message- From:Joe Zhang smartag...@gmail.com mailto: smartag...@gmail.com Sent: Thu 22-Nov-2012 07:29 To: user user@nutch.apache.org mailto:user@nutch.apache.org Subject: Indexing-time URL filtering again Dear List: I asked a similar question before, but I haven't solved the problem. Therefore I try to re-ask the question more clearly and seek advice. I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the rudimentary level. The basic problem I face in crawling/indexing is that I need to control which pages the crawlers should VISIT (so far through nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr. The latter are only a SUBSET of the former, and they are giving me headache. A real-life example would be: when we crawl CNN.com, we only want to index real content pages such as http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1
RE: problem with text/html content type of documents appears application/xhtml+xml in solr index
Hi - trunk's more indexing filter can map mime types to any target. With it you can map both (x)html mimes to text/html or to `web page`. https://issues.apache.org/jira/browse/NUTCH-1262
-Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Sun 25-Nov-2012 00:48 To: user@nutch.apache.org Subject: problem with text/html content type of documents appears application/xhtml+xml in solr index Hi. I have changed my nutch version from 1.4 to 1.5.1 and I have detected a problem with the content type of some documents: some pages that are text/html appear in the solr index as application/xhtml+xml, but when I check the links the browser tells me that they are effectively text/html. Can anybody help me fix this problem? I could change this content type manually in the solr index to text/html, but that is not a good way for me. Any suggestion or advice will be appreciated.
RE: Indexing-time URL filtering again
You should provide the log output. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 17:27 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012. The following command bin/nutch solrindex -Durlfilter.regex.file=.UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter produced the following output: SolrIndexer: starting at 2012-11-25 16:19:29 SolrIndexer: deleting gone documents: false SolrIndexer: URL filtering: true SolrIndexer: URL normalizing: false java.io.IOException: Job failed! Can anybody help? On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang smartag...@gmail.com wrote: How exactly do I get to trunk? I did download download NUTCH-1300-1.5-1.patch, and run the patch command correctly, and re-build nutch. But the problem still persists... On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma markus.jel...@openindex.io wrote: No, this is no bug. As i said, you need either to patch your Nutch or get the sources from trunk. The -filter parameter is not in your version. Check the patch manual if you don't know how it works. $ cd trunk ; patch -p0 file.patch -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 08:42 To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org Subject: Re: Indexing-time URL filtering again This does seem a bug. Can anybody help? On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang smartag...@gmail.com wrote: Markus, could you advise? Thanks a lot! On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang smartag...@gmail.com wrote: I followed your instruction and applied the patch, Markus, but the problem still persists --- -filter is interpreted as a path by solrindex. On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: Ah, i get it now. Please use trunk or patch your version with: https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 23-Nov-2012 03:08 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again But Markus said it worked for him. I was really he could send his command line. On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Is this a bug? On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang smartag...@gmail.com wrote: Putting -filter between crawldb and segments, I sitll got the same thing: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data Input path does not exist: file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma markus.jel...@openindex.iowrote: These are roughly the available parameters: Usage: SolrIndexer solr url crawldb [-linkdb linkdb] [-hostdb hostdb] [-params k1=v1k2=v2...] (segment ... | -dir segments) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize] Having -filter at the end should work fine, if it, for some reason, doesn't work put it before the segment and after the crawldb and file an issue in jira, it works here if i have -filter at the end. 
Cheers -Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 22-Nov-2012 23:05 To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org Subject: Re: Indexing-time URL filtering again Yes, I forgot to do that. But still, what exactly should the command look like? bin/nutch solrindex -Durlfilter.regex.file=UrlFiltering.txt http://localhost:8983/solr/ http://localhost:8983/solr/ .../crawldb/ /segments/* -filter this command would cause nutch to interpret -filter as a path. On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma markus.jel...@openindex.io mailto: markus.jel...@openindex.io wrote: Hi, I just tested a small index job that usually writes 1200 records to Solr. It works fine if i specify -. in a filter (index nothing) and point to it with -Durlfilter.regex.file=path like
RE: problem with text/html content type of documents appears application/xhtml+xml in solr index
Hi - you need to enable mime-type mapping in Nutch config and define your mappings. Enable it with:
<property>
  <name>moreIndexingFilter.mapMimeTypes</name>
  <value>true</value>
</property>
and add the following to your mapping config (conf/contenttype-mapping.txt), where TAB stands for a literal tab character:
# Target content type TAB type1 [TAB type2 ...]
text/html TAB application/xhtml+xml
This will map application/xhtml+xml to text/html when indexing documents to Solr. You can configure any arbitrary target such as `web page` or `document` for various similar content types. Trunk has this feature. You can either patch your version or check out from trunk and compile Nutch yourself. Patching is very simple: $ cd trunk ; patch -p0 < file.patch
-Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Sun 25-Nov-2012 20:42 To: user@nutch.apache.org Subject: RE: problem with text/html content type of documents appears application/xhtml+xml in solr index Thanks a lot Markus for your answer. My English is not so good. I have been reading but I don't know how to fix the problem yet. Could you explain the solution to me in detail, please? I was looking in the conf directory but I can't find how to map one mime type to another. Do I need to replace the index-more plugin? I was looking at the link you suggested and I saw a NUTCH-1262-1.5-1.patch but I don't know how to use that patch. Please tell me if I need to delete the index completely or if there is a way to replace application/xhtml+xml with text/html in the solr index.
- Original message - From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Sunday, 25 November 2012 4:33 AM To: user@nutch.apache.org Subject: RE: problem with text/html content type of documents appears application/xhtml+xml in solr index Hi - trunk's more indexing filter can map mime types to any target. With it you can map both (x)html mimes to text/html or to `web page`. https://issues.apache.org/jira/browse/NUTCH-1262
-Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Sun 25-Nov-2012 00:48 To: user@nutch.apache.org Subject: problem with text/html content type of documents appears application/xhtml+xml in solr index Hi. I have changed my nutch version from 1.4 to 1.5.1 and I have detected a problem with the content type of some documents: some pages that are text/html appear in the solr index as application/xhtml+xml, but when I check the links the browser tells me that they are effectively text/html. Can anybody help me fix this problem? I could change this content type manually in the solr index to text/html, but that is not a good way for me. Any suggestion or advice will be appreciated.
RE: Indexing-time URL filtering again
You seems to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce you can file an issue in Jira. This NPE should also occur if your run the regex tester. nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined In the mean time you can check if a rule causes the NPE. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 23:26 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again the last few lines of hadoop.log: 2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer 2012-11-25 16:30:30,218 WARN mapred.LocalJobRunner - job_local_0001 java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 5 more Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) ... 10 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 13 more Caused by: java.lang.NullPointerException at java.io.Reader.init(Reader.java:78) at java.io.BufferedReader.init(BufferedReader.java:94) at java.io.BufferedReader.init(BufferedReader.java:109) at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180) at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162) at org.apache.nutch.net.URLFilters.init(URLFilters.java:57) at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95) ... 18 more 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job failed! On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma markus.jel...@openindex.iowrote: You should provide the log output. 
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 17:27 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012. The following command bin/nutch solrindex -Durlfilter.regex.file=.UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter produced the following output: SolrIndexer: starting at 2012-11-25 16:19:29 SolrIndexer: deleting gone documents: false SolrIndexer: URL filtering: true SolrIndexer: URL normalizing: false java.io.IOException: Job failed! Can anybody help? On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang smartag...@gmail.com wrote: How exactly do I get to trunk? I did download download NUTCH-1300-1.5-1.patch, and run the patch command correctly, and re-build nutch. But the problem still persists... On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma markus.jel...@openindex.io wrote: No, this is no bug. As i said, you need either to patch your Nutch or get the sources from trunk. The -filter parameter is not in your version. Check the patch manual
RE: Indexing-time URL filtering again
It's taking input from stdin, enter some URL's to test it. You can add an issue with reproducable steps. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 23:49 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again I ran the regex tester command you provided. It seems to be taking forever (15 min + by now). On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang smartag...@gmail.com wrote: you mean the content my pattern file? well, even wehn I reduce it to simply -., the same problem still pops up. On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma markus.jel...@openindex.io wrote: You seems to have an NPE caused by your regex rules, for some weird reason. If you can provide a way to reproduce you can file an issue in Jira. This NPE should also occur if your run the regex tester. nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined In the mean time you can check if a rule causes the NPE. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 23:26 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again the last few lines of hadoop.log: 2012-11-25 16:30:30,021 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2012-11-25 16:30:30,026 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer 2012-11-25 16:30:30,218 WARN mapred.LocalJobRunner - job_local_0001 java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 5 more Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) ... 10 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 
13 more Caused by: java.lang.NullPointerException at java.io.Reader.init(Reader.java:78) at java.io.BufferedReader.init(BufferedReader.java:94) at java.io.BufferedReader.init(BufferedReader.java:109) at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180) at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162) at org.apache.nutch.net.URLFilters.init(URLFilters.java:57) at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95) ... 18 more 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job failed! On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma markus.jel...@openindex.iowrote: You should provide the log output. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 25-Nov-2012 17:27 To: user@nutch.apache.org Subject: Re: Indexing-time URL filtering again I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012. The following command bin/nutch solrindex -Durlfilter.regex.file=.UrlFiltering.txt http://localhost:8983/solr/ crawl
RE: Indexing-time URL filtering again
Hi, I just tested a small index job that usually writes 1200 records to Solr. It works fine if I specify -. in a filter (index nothing) and point to it with -Durlfilter.regex.file=path like you do. I assume you mean by `it doesn't work` that it filters nothing and indexes all records from the segment. Did you forget the -filter parameter? Cheers
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Thu 22-Nov-2012 07:29 To: user user@nutch.apache.org Subject: Indexing-time URL filtering again Dear List: I asked a similar question before, but I haven't solved the problem. Therefore I try to re-ask the question more clearly and seek advice. I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the rudimentary level. The basic problem I face in crawling/indexing is that I need to control which pages the crawlers should VISIT (so far through nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr. The latter are only a SUBSET of the former, and they are giving me a headache. A real-life example would be: when we crawl CNN.com, we only want to index real content pages such as http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1. When we start the crawling from the root, we can't specify tight patterns (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*) in nutch/conf/regex-urlfilter.txt, because the pages on the path between root and content pages do not satisfy such patterns. Putting such patterns in nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage of the crawl. The closest solution I've got so far (courtesy of Markus) was this: nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ... but unfortunately I haven't been able to make it work for me. The content of the urlfilter.regex.file is what I thought correct --- something like the following:
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
Everything seems quite straightforward. Am I doing anything wrong here? Can anyone advise? I'd greatly appreciate it. Joe
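A quick way to sanity-check the index-time pattern above is plain java.util.regex. This is not the RegexURLFilter itself (which also evaluates the +/- prefixes and the final -. rule), just a check that the expression matches content URLs and not the intermediate pages; the sample URLs are taken from the thread.

// Sanity check of the pattern from the message above, using java.util.regex only.
import java.util.regex.Pattern;

public class PatternCheck {
  public static void main(String[] args) {
    Pattern content = Pattern.compile(
        "^http://([a-z0-9]*\\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*");
    String[] urls = {
        "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1",
        "http://www.cnn.com/US/"
    };
    for (String u : urls) {
      // a "+" rule accepts on a match; anything left over is rejected by the trailing "-." rule
      boolean matched = content.matcher(u).find();
      System.out.println(u + " -> " + (matched ? "+ (index)" : "- (skip)"));
    }
  }
}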
RE: doubts about some properties on nutch-site.xml file
See: http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
-Original message- From:Eyeris Rodriguez Rueda eru...@uci.cu Sent: Fri 23-Nov-2012 03:29 To: user@nutch.apache.org Subject: doubts about some properties on nutch-site.xml file Hi all. I have some doubts about some properties in the nutch-site.xml file; I would appreciate it if anybody could explain their function to me in detail. I'm using nutch 1.5.1 and solr 3.6. First:
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
  <description>If a page is unmodified, its fetchInterval will be increased by this rate. This value should not exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
In this case, how is the fetchInterval modified by this value? What does that mean? Second:
<property>
  <name>db.fetch.schedule.adaptive.sync_delta</name>
  <value>true</value>
  <description>If true, try to synchronize with the time of page change, by shifting the next fetchTime by a fraction (sync_rate) of the difference between the last modification time and the last fetch time.</description>
</property>
I can't understand this property. Regards.
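As a back-of-the-envelope illustration of inc_rate, assuming "increased by this rate" means the interval of an unmodified page is multiplied by (1 + inc_rate) on each fetch (see the AdaptiveFetchSchedule javadoc linked above for the authoritative behaviour, including the min/max interval clamps and the corresponding dec_rate for modified pages):

// Rough arithmetic only; not the AdaptiveFetchSchedule code itself.
public class AdaptiveIntervalSketch {
  public static void main(String[] args) {
    float incRate = 0.4f;        // db.fetch.schedule.adaptive.inc_rate
    float interval = 86400f;     // start at one day, in seconds
    for (int i = 1; i <= 3; i++) {
      interval = interval * (1.0f + incRate);  // page found unmodified again
      System.out.println("after " + i + " unmodified fetches: "
          + (interval / 86400f) + " days between fetches");
    }
    // prints roughly 1.4, 1.96 and 2.74 days
  }
}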
RE: Best practices for running Nutch
Hi -Original message- From:kiran chitturi chitturikira...@gmail.com Sent: Sun 18-Nov-2012 18:38 To: user@nutch.apache.org Subject: Best practices for running Nutch Hi! I have been running crawls using Nutch for 13000 documents (protocol http) on a single machine and it goes on to take 2-3 days to get finished. I am using 2.x version of Nutch. I use a depth of 20 and topN of 1000 (2000) when i initiate the 'sh bin/nutch crawl -depth 20 -topN 1000'. I keep running in to Exceptions after one day. Sometimes its - Memory Exception : Heap Space (after the parsing of the documents) After parsing the documents? That should be during updatedb but are you sure? That job hardly ever runs out of memory. - Mysql Connection Error (because the crawler went on to fetch 10,000 documents after the command 'sh bin/nutch crawl -continue -depth 10 -topN 700' as the crawl failed because I increased the heap space and increased the timeout. I am wondering what are the best practices to run Nutch crawls. Is a full crawl a good thing to do or should i do it in steps (generate, fetch, parse, updatedb) ? Separate steps are good for debugging and give you more control. Also how do i choose the value of the parameters, even if i give topN as 700 the fetcher goes to fetch 3000 documents. What parameters have high impact on the running time of the crawl ? Are you sure? The generator (at least in trunk) honors the topN parameter and will not generate more than specified. Keep in mind that using the crawl script and the depth parameter you're multiplying topN by depth. All these options might be system based and need not have general values which work for everyone. I am wondering what are things that Nutch Users and Developers follow here when running big crawls ? What is a big crawl? 13.000 documents are very easy to manage on a very small machine running locally. If you're downloading from one or a few hosts it's expected to take a very long time due to crawler politeness, don't download faster than one page every 5 seconds unless you're allowed to or own the host. If you own a host or are allowed to you can increase or increase the number of threads per queue (host, domain or IP). Some of the exceptions come after 1 or 2 days of running the crawler, so its getting hard to know how to fix them before hand. I'm not sure this applies to you because i don't know what you mean by `running crawler`; never run the fetcher for longer than an hour orso. Are there any common exceptions that Nutch can run in to frequently ? The usual exceptions are network errors. Is there any documentation for Nutch practices ? I have seen people crawls go for a long time because of the filtering sometimes. I'm not sure but the best thing to do on this list is not talk about crawl (e.g. my crawl fails or takes too long) but to talk about the separate jobs. We don't know what's wrong if one tells us a crawl is taking long because it consists of the separate steps. Sorry for the long email. Thank you, -- Kiran Chitturi
RE: custom plugin's constructor unable to access hadoop conf
That's because the Configuration object is not set in the constructor. You can access the Configuration only after setConf() is called, so defer your work from the constructor to this method:
public void setConf(Configuration conf) {
  this.conf = conf;
}
-Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Fri 16-Nov-2012 11:28 To: user@nutch.apache.org Subject: custom plugin's constructor unable to access hadoop conf In my custom HtmlParseFilter plugin, I am getting an NPE when trying to access the Hadoop Configuration object in the plugin constructor. Is this a known behavior?
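A minimal sketch of what that looks like in a custom parse filter: the plugin is instantiated first and configured afterwards, so anything that needs the Configuration is read in setConf(). Class and property names here are hypothetical.

// Minimal sketch of a 1.x HtmlParseFilter that defers config reads to setConf().
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class MyParseFilter implements HtmlParseFilter {

  private Configuration conf;
  private String someSetting;   // hypothetical setting read from the config

  // do NOT touch conf here: it is still null at construction time
  public MyParseFilter() {
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // safe to read configuration from this point on
    this.someSetting = conf.get("myplugin.some.setting", "default");
  }

  public Configuration getConf() {
    return conf;
  }

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // someSetting is guaranteed to be initialised by the time parsing happens
    return parseResult;
  }
}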
RE: site-specific crawling policies
you can override some URL Filter paths in nutch site or with command line options (tools) such as bin/nutch fetch -Durlfilter.regex.file=bla. You can also set NUTCH_HOME and keep everything separate if you're running it locally. On Hadoop you'll need separate job files. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 16-Nov-2012 18:35 To: user@nutch.apache.org Subject: Re: site-specific crawling policies That's easy to do. But what about the configuration files? The same nutchs-site.xml, urlfiter files will be read. On Fri, Nov 16, 2012 at 3:28 AM, Sourajit Basak sourajit.ba...@gmail.comwrote: Group related sites together and use separate crawldb, segment directories. On Fri, Nov 16, 2012 at 9:40 AM, Joe Zhang smartag...@gmail.com wrote: So how exactly do I set up different nutch instances then? On Thu, Nov 15, 2012 at 7:52 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Joe, In all honesty, it might sound slightly optimistic, it may also depend upon the size and calibre of the different sites/domains but if you are attempting a depth first, domain specific crawl, then maybe separate Nutch instances will be your friend... Lewis On Thu, Nov 15, 2012 at 11:53 PM, Joe Zhang smartag...@gmail.com wrote: well, these are all details. The bigger question is, how to seperate the crawling policy of site A from that of site B? On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak sourajit.ba...@gmail.comwrote: You probably need to customize parse-metatags plugin. I think you go ahead and include all possible metatags. And take care of missing metatags in solr. On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang smartag...@gmail.com wrote: I understand conf/regex-urlfilter.txt; I can put domain names into the URL patterns. But what about meta tags? What if I want to parse out different meta tags for different sites? On Wed, Nov 14, 2012 at 1:33 AM, Sourajit Basak sourajit.ba...@gmail.com wrote: 1) For parsing indexing customized meta tags enable configure plugin parse-metatags 2) There are several filters of url, like regex based. For regex, the patterns are specified via conf/regex-urlfilter.txt On Wed, Nov 14, 2012 at 1:33 PM, Tejas Patil tejas.patil...@gmail.com wrote: While defining url patterns, have the domain name in it so that you get site/domain specific rules. I don't know about configuring meta tags. Thanks, Tejas On Tue, Nov 13, 2012 at 11:34 PM, Joe Zhang smartag...@gmail.com wrote: How to enforce site-specific crawling policies, i.e, different URL patterns, meta tags, etc. for different websites to be crawled? I got the sense that multiple instances of nutch are needed? Is it correct? If yes, how? -- Lewis
RE: re-Crawl re-fetch all pages each time
Hi - this should not happen. The only thing I can imagine is that the update step doesn't succeed, but that would mean nothing is going to be indexed either. You can inspect a URL using the readdb tool; check before and after.
-Original message- From:vetus ve...@isac.cat Sent: Thu 15-Nov-2012 15:41 To: user@nutch.apache.org Subject: re-Crawl re-fetch all pages each time Hello, I have a problem... I'm trying to index a small domain, and I'm using org.apache.nutch.crawl.Crawler to do it. The problem is that after the crawler has indexed all the pages of the domain, I execute the crawler again... and it fetches all the pages again although the fetch interval has not expired... This is wrong because it generates a lot of connections... I'm using the default config and this is the command that I execute: org.apache.nutch.crawl.Crawler -depth 1 -threads 1 -topN 5 Can you help me? please Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/re-Crawl-re-fetch-all-pages-each-time-tp4020464.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: adding custom metadata to CrawlDatum during parse
Hi - Sure, check the db.parsemeta.to.crawldb configuration directive.
-Original message- From:Sourajit Basak sourajit.ba...@gmail.com Sent: Wed 14-Nov-2012 08:10 To: user@nutch.apache.org Subject: adding custom metadata to CrawlDatum during parse Is it possible to add custom metadata (preferably via plugins) to the CrawlDatum of the url during parse or its associated filter phases? It seems you can do so if you parse along with fetch, but that would require modifications to Fetcher.java. Have I missed a better way to accomplish this? Sourajit
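A sketch of how that directive is typically used (key name below is hypothetical): a parse filter puts a value into the parse metadata, and db.parsemeta.to.crawldb is given that key so the value is carried over to the CrawlDatum during updatedb, with no changes to Fetcher.java.

// Sketch only: set a parse metadata key from within a parse filter's filter() method.
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;

public class ParseMetaExample {
  static void tagPage(ParseResult parseResult, String url) {
    Parse parse = parseResult.get(url);
    if (parse == null) return;
    Metadata parseMeta = parse.getData().getParseMeta();
    parseMeta.set("my.custom.flag", "true");   // hypothetical key
    // nutch-site.xml would then list the key:
    //   <property><name>db.parsemeta.to.crawldb</name><value>my.custom.flag</value></property>
  }
}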
RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1
In trunk the modified time is based on whether or not the signature has changed. It makes little sense relying on HTTP headers because almost no CMS implements it correctly and it messes (or allows to be messed with on purpose) with an adaptive schedule. https://issues.apache.org/jira/browse/NUTCH-1341
-Original message- From:j.sulli...@thomsonreuters.com j.sulli...@thomsonreuters.com Sent: Tue 13-Nov-2012 11:13 To: user@nutch.apache.org Subject: RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1 I think the modifiedTime comes from the http headers if available, if not it is left empty. In other words it is the time the content was last modified according to the source if available, and if not available it is left blank. Depending on what Jacob is trying to achieve, the one line patch at https://issues.apache.org/jira/browse/NUTCH-1475 might be what he needs (or might not be). James
-Original Message- From: Ferdy Galema [mailto:ferdy.gal...@kalooga.com] Sent: Tuesday, November 13, 2012 6:31 PM To: user@nutch.apache.org Subject: Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1 Hi, There might be something wrong with the field modifiedTime. I'm not sure how well you can rely on this field (with the default or the adaptive scheduler). If you want to get to the bottom of this, I suggest debugging or running small crawls to test the behaviour. In case something doesn't work as expected, please repost here or open a Jira. Ferdy.
On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk jacob.s...@gmail.com wrote: Hi, If this question has already been answered please forgive me and point me to the appropriate thread. I'd like to be able to find the ids of all new pages crawled by nutch or pages modified since a fixed point in the past. I'm using Nutch 2.1 with MySQL as the back-end and it seems like the appropriate back-end query should be something like: select id from webpage where (prevFetchTime = null and fetchTime >= X) or (modifiedTime >= X), where X is some point in the past. What I've found is that modifiedTime is always null. I am using the adaptive scheduler and the default md5 signature class. I've tried both re-injecting seed URLs as well as not; it seems to make no difference. modifiedTime remains null. I am most grateful for any help or advice. If my nutch-site.xml file would help I can forward it along. Thanks, jacob
RE: Simulating 2.x's page.putToInlinks() in trunk
In trunk you can use the Inlink and Inlinks classes. The first is for each inlink and the latter is what you add the Inlink objects to.
Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch"));
The inlink URL is the key in the key/value pair so you won't see that one.
-Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Mon 12-Nov-2012 16:29 To: user@nutch.apache.org Subject: Simulating 2.x's page.putToInlinks() in trunk Hi, I'm attempting to test the AnchorIndexingFilter by adding numerous inlinks and their anchor text, then checking whether the deduplication is working sufficiently. Can someone show me how I simulate the following using the trunk API?
// This is 2.x API
WebPage page = new WebPage();
page.putToInlinks(new Utf8($inlink1), new Utf8($anchor_text1));
page.putToInlinks(new Utf8($inlink2), new Utf8($anchor_text1));
page.putToInlinks(new Utf8($inlink3), new Utf8($anchor_text2));
If anchor deduplication is set to a boolean true value then we should only allow two anchor entries for the page inlinks. I wish therefore to simulate this in the trunk API using Inlinks, Inlink or the NutchDocument.add function, however I am stuck... Thank you very much in advance for any help. Best Lewis -- Lewis
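A sketch of what the trunk-side test setup could look like, assuming the 1.x IndexingFilter signature filter(doc, parse, url, datum, inlinks) and that anchor deduplication is enabled in the configuration; the URLs and anchor texts are made up for illustration.

// Sketch of building Inlinks for an AnchorIndexingFilter deduplication test.
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.anchor.AnchorIndexingFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.util.NutchConfiguration;

public class AnchorDedupSketch {
  static NutchDocument index(Parse parse) throws Exception {
    Inlinks inlinks = new Inlinks();
    inlinks.add(new Inlink("http://www.test1.com/", "anchor text 1"));
    inlinks.add(new Inlink("http://www.test2.com/", "anchor text 1"));  // duplicate anchor
    inlinks.add(new Inlink("http://www.test3.com/", "anchor text 2"));

    AnchorIndexingFilter filter = new AnchorIndexingFilter();
    // assumes deduplication is switched on in the loaded configuration
    filter.setConf(NutchConfiguration.create());

    NutchDocument doc = new NutchDocument();
    // with deduplication on, only two distinct anchor values should end up in the document
    return filter.filter(doc, parse, new Text("http://nutch.apache.org/"),
        new CrawlDatum(), inlinks);
  }
}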
RE: very slow generator step
Hi - Please use the -noFilter option. It is usually useless to filter in the generator because they've already been filtered in the parse step and or update step. -Original message- From:Mohammad wrk mhd...@yahoo.com Sent: Mon 12-Nov-2012 18:43 To: user@nutch.apache.org Subject: very slow generator step Hi, The generator time has gone from 8 minutes to 106 minutes few days ago and stayed there since then. AFAIK, I haven't made any configuration changes recently (attached you can find some of the configurations that I thought might be related). A quick CPU sampling shows that most of the time is spent on java.util.regex.Matcher.find(). Since I'm using default regex configurations and my crawldb has only 3,052,412 urls, I was wondering if this is a known issue with nutch-1.5.1 ? Here are some more information that might help: = Generator logs 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: starting at 2012-11-09 03:14:50 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch. 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: filtering: true 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: normalizing: true 2012-11-09 03:14:50,921 INFO crawl.Generator - Generator: topN: 3000 2012-11-09 03:14:50,923 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition. 2012-11-09 03:23:39,741 INFO crawl.Generator - Generator: Partitioning selected urls for politeness. 2012-11-09 03:23:40,743 INFO crawl.Generator - Generator: segment: segments/20121109032340 2012-11-09 03:23:47,860 INFO crawl.Generator - Generator: finished at 2012-11-09 03:23:47, elapsed: 00:08:56 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: starting at 2012-11-09 05:35:14 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch. 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: filtering: true 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: normalizing: true 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: topN: 3000 2012-11-09 05:35:14,037 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition. 2012-11-09 07:21:42,840 INFO crawl.Generator - Generator: Partitioning selected urls for politeness. 2012-11-09 07:21:43,841 INFO crawl.Generator - Generator: segment: segments/20121109072143 2012-11-09 07:21:51,004 INFO crawl.Generator - Generator: finished at 2012-11-09 07:21:51, elapsed: 01:46:36 = CrawlDb statistics CrawlDb statistics start: ./crawldb Statistics for CrawlDb: ./crawldb TOTAL urls:3052412 retry 0:3047404 retry 1:338 retry 2:1192 retry 3:822 retry 4:336 retry 5:2320 min score:0.0 avg score:0.015368268 max score:48.608 status 1 (db_unfetched):2813249 status 2 (db_fetched):196717 status 3 (db_gone):14204 status 4 (db_redir_temp):10679 status 5 (db_redir_perm):17563 CrawlDb statistics: done = System info Memory: 4 GB CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4 Available diskspace: 171.7 GB OS: Release 12.10 (quantal) 64-bit Thanks, Mohammad
RE: very slow generator step
You may need to change your expressions but it is performant. Not all features of traditional regex are supported. http://wiki.apache.org/nutch/RegexURLFiltersBenchs -Original message- From:Mohammad wrk mhd...@yahoo.com Sent: Mon 12-Nov-2012 22:17 To: user@nutch.apache.org Subject: Re: very slow generator step That's a good thinking. I have never used url-filter automation. Where can I find more info? Thanks, Mohammad From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org; Mohammad wrk mhd...@yahoo.com Sent: Monday, November 12, 2012 12:38:44 PM Subject: Re: very slow generator step Could be that a particularly long and tricky URL got into your crawldb and put the regex into a spin. I'd use the url-filter automaton instead as it is much faster. Would be interesting to know what caused the regex to take so much time, in case you fancy a bit of debugging ;-) Julien On 12 November 2012 20:29, Mohammad wrk mhd...@yahoo.com wrote: Thanks for the tip. It went down to 2 minutes :-) What I don't understand is that how come everything was working fine with the default configuration for about 4 days and all of a sudden one crawl causes a jump of 100 minutes? Cheers, Mohammad From: Markus Jelsma markus.jel...@openindex.io To: user@nutch.apache.org user@nutch.apache.org Sent: Monday, November 12, 2012 11:19:11 AM Subject: RE: very slow generator step Hi - Please use the -noFilter option. It is usually useless to filter in the generator because they've already been filtered in the parse step and or update step. -Original message- From:Mohammad wrk mhd...@yahoo.com Sent: Mon 12-Nov-2012 18:43 To: user@nutch.apache.org Subject: very slow generator step Hi, The generator time has gone from 8 minutes to 106 minutes few days ago and stayed there since then. AFAIK, I haven't made any configuration changes recently (attached you can find some of the configurations that I thought might be related). A quick CPU sampling shows that most of the time is spent on java.util.regex.Matcher.find(). Since I'm using default regex configurations and my crawldb has only 3,052,412 urls, I was wondering if this is a known issue with nutch-1.5.1 ? Here are some more information that might help: = Generator logs 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: starting at 2012-11-09 03:14:50 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch. 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: filtering: true 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: normalizing: true 2012-11-09 03:14:50,921 INFO crawl.Generator - Generator: topN: 3000 2012-11-09 03:14:50,923 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition. 2012-11-09 03:23:39,741 INFO crawl.Generator - Generator: Partitioning selected urls for politeness. 2012-11-09 03:23:40,743 INFO crawl.Generator - Generator: segment: segments/20121109032340 2012-11-09 03:23:47,860 INFO crawl.Generator - Generator: finished at 2012-11-09 03:23:47, elapsed: 00:08:56 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: starting at 2012-11-09 05:35:14 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch. 
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: filtering: true 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: normalizing: true 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: topN: 3000 2012-11-09 05:35:14,037 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition. 2012-11-09 07:21:42,840 INFO crawl.Generator - Generator: Partitioning selected urls for politeness. 2012-11-09 07:21:43,841 INFO crawl.Generator - Generator: segment: segments/20121109072143 2012-11-09 07:21:51,004 INFO crawl.Generator - Generator: finished at 2012-11-09 07:21:51, elapsed: 01:46:36 = CrawlDb statistics CrawlDb statistics start: ./crawldb Statistics for CrawlDb: ./crawldb TOTAL urls:3052412 retry 0:3047404 retry 1:338 retry 2:1192 retry 3:822 retry 4:336 retry 5:2320 min score:0.0 avg score:0.015368268 max score:48.608 status 1 (db_unfetched):2813249 status 2 (db_fetched):196717 status 3 (db_gone):14204 status 4 (db_redir_temp):10679 status 5 (db_redir_perm):17563 CrawlDb statistics: done = System info Memory: 4 GB CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4 Available diskspace: 171.7 GB OS: Release 12.10 (quantal) 64-bit Thanks, Mohammad -- * *Open Source Solutions for Text
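To try the automaton filter suggested above, the switch is roughly the following in nutch-site.xml; the plugin id urlfilter-automaton and the conf/automaton-urlfilter.txt rules file are the stock names, but the rest of the plugin list below is only illustrative, so start from the plugin.includes value in your own nutch-default.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-automaton|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

The rules keep the familiar +/- prefix format, but as the benchmark page notes, the automaton syntax supports fewer constructs than java.util.regex (no lookahead/lookbehind or back-references), so some expressions need rewriting.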
RE: Tika Parsing not working in the latest version of 2.X?
Try cleaning your build. -Original message- From:j.sulli...@thomsonreuters.com j.sulli...@thomsonreuters.com Sent: Thu 08-Nov-2012 07:23 To: user@nutch.apache.org Subject: Tika Parsing not working in the latest version of 2.X? Just tried the latest 2.X after being away for a while. Tika parsing doesn't seem to be working. Exception in thread main java.lang.NoSuchMethodError: org.apache.tika.mime.MediaType.set([Lorg/apache/tika/mime/MediaType;)Ljava/util/Set; at org.apache.tika.parser.crypto.Pkcs7Parser.getSupportedTypes(Pkcs7Parser.java:52) at org.apache.nutch.parse.tika.TikaConfig.init(TikaConfig.java:149) at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:210) at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:203) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162) at org.apache.nutch.parse.ParserFactory.getFields(ParserFactory.java:209) at org.apache.nutch.parse.ParserJob.getFields(ParserJob.java:193) at org.apache.nutch.fetcher.FetcherJob.getFields(FetcherJob.java:142) at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:184) at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:219) at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:301) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:307) Exception in thread main java.lang.NoSuchMethodError: org.apache.tika.mime.MediaType.set([Lorg/apache/tika/mime/MediaType;)Ljava/util/Set; at org.apache.tika.parser.crypto.Pkcs7Parser.getSupportedTypes(Pkcs7Parser.java:52) at org.apache.nutch.parse.tika.TikaConfig.init(TikaConfig.java:149) at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:210) at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:203) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162) at org.apache.nutch.parse.ParserFactory.getFields(ParserFactory.java:209) at org.apache.nutch.parse.ParserJob.getFields(ParserJob.java:193) at org.apache.nutch.parse.ParserJob.run(ParserJob.java:245) at org.apache.nutch.parse.ParserJob.parse(ParserJob.java:259) at org.apache.nutch.parse.ParserJob.run(ParserJob.java:302) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.parse.ParserJob.main(ParserJob.java:306)
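For reference, a clean rebuild from the checkout root usually amounts to the following (standard Ant targets; ant -p lists them for your copy):

    ant clean
    ant runtime

Then rerun the job from runtime/local (or redeploy the job file under runtime/deploy) so that no stale Tika jars from an earlier build remain on the classpath, which is the usual cause of this kind of NoSuchMethodError.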
RE: URL filtering: crawling time vs. indexing time
Just try it. With -D you can override Nutch and Hadoop configuration properties. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 04-Nov-2012 06:07 To: user user@nutch.apache.org Subject: Re: URL filtering: crawling time vs. indexing time Markus, I don't see -D as a valid command parameter for solrindex. On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma markus.jel...@openindex.iowrote: Ah, i understand now. The indexer tool can filter as well in 1.5.1 and if you enable the regex filter and set a different regex configuration file when indexing vs. crawling you should be good to go. You can override the default configuration file by setting urlfilter.regex.file and point it to the regex file you want to use for indexing. You can set it via nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ... Cheers -Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 02-Nov-2012 17:55 To: user@nutch.apache.org Subject: Re: URL filtering: crawling time vs. indexing time I'm not sure I get it. Again, my problem is a very generic one: - The patterns in regex-urlfitler.txt, howevery exotic they are, they control ***which URLs to visit***. - Generally speaking, the set of ULRs to be indexed into solr is only a ***subset*** of the above. We need a way to specify crawling filter (which is regex-urlfitler.txt) vs. indexing filter, I think. On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux r...@teorem.fr wrote: You have still several possibilities here : 1) find a way to seed the crawl with the URLs containing the links to the leaf pages (sometimes it is possible with a simple loop) 2) create regex for each step of the scenario going to the leaf page, in order to limit the crawl to necessary pages only. Use the $ sign at the end of your regexp to limit the match of regexp like http://([a-z0-9]*\.)* mysite.com. Le 2 nov. 2012 à 17:22, Joe Zhang smartag...@gmail.com a écrit : The problem is that, - if you write regex such as: +^http://([a-z0-9]*\.)*mysite.com, you'll end up indexing all the pages on the way, not just the leaf pages. - if you write specific regex for http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you start crawling at mysite.com, you'll get zero results, as there is no match. On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma markus.jel...@openindex.iowrote: -Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 02-Nov-2012 10:04 To: user@nutch.apache.org Subject: URL filtering: crawling time vs. indexing time I feel like this is a trivial question, but I just can't get my ahead around it. I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the rudimentary level. If my understanding is correct, the regex-es in nutch/conf/regex-urlfilter.txt control the crawling behavior, ie., which URLs to visit or not in the crawling process. Yes. On the other hand, it doesn't seem artificial for us to only want certain pages to be indexed. I was hoping to write some regular expressions as well in some config file, but I just can't find the right place. My hunch tells me that such things should not require into-the-box coding. Can anybody help? What exactly do you want? Add your custom regular expressions? The regex-urlfilter.txt is the place to write them to. Again, the scenario is really rather generic. Let's say we want to crawl http://www.mysite.com. 
We can use the regex-urlfilter.txt to skip loops and unncessary file types etc., but only expect to index pages with URLs like: http://www.mysite.com/level1pattern/level2pattern/pagepattern.html . To do this you must simply make sure your regular expressions can do this. Am I too naive to expect zero Java coding in this case? No, you can achieve almost all kinds of exotic filtering with just the URL filters and the regular expressions. Cheers
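One way to put that together, sketched with a made-up file name and patterns for the mysite.com example (nothing below ships with Nutch; check bin/nutch solrindex usage on your release for the exact argument order and whether indexing-time filtering must be switched on explicitly):

    # conf/regex-urlfilter-index.txt -- applied only at indexing time
    # accept only the leaf pages
    +^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$
    # reject everything else
    -.

    bin/nutch solrindex -Durlfilter.regex.file=regex-urlfilter-index.txt http://localhost:8983/solr/ crawldb -linkdb linkdb segments/20121109032340

The broad crawl-time rules stay in regex-urlfilter.txt, so the crawler can still walk the intermediate pages while only the leaf pages reach Solr.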
RE: timestamp in nutch schema
Hi - the timestamp is just the time when a page is being indexed. Not very useful except for deduplication. If you want to index some publishing date you must first identify the source of that date and get it out of the webpages. It's possible to use og:date or other meta tags, or perhaps other sources. Meta tags can be indexed without creating a custom parse filter, but if you don't trust websites or need special (re)formatting or checking logic you need to make a parse filter for it. I've also built a date parsing filter to retrieve dates in various formats from free text; check Jira for the dateparsefilter patch. It's an older version but still works well. -Original message- From:Joe Zhang smartag...@gmail.com Sent: Sun 04-Nov-2012 05:44 To: user user@nutch.apache.org Subject: timestamp in nutch schema My understanding is that the timestamp stores crawling time. Is there any way to get nutch to parse out the publishing time of webpages and store such info in timestamp or some other field?
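A bare-bones sketch of the parse-filter route for 1.x follows. The filter(...) signature is the stock HtmlParseFilter one; the class name, the article:published_time tag and the published_date metadata key are made up, it is written against the API where getGeneralTags() returns a java.util.Properties (adjust if your version returns Metadata), and whether a given tag actually shows up there depends on how the HTML parser collected it, so treat this purely as a shape to start from:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class PublishedDateFilter implements HtmlParseFilter {
      private Configuration conf;

      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        // Pick up a publishing date exposed as a meta tag, if present.
        String date = metaTags.getGeneralTags().getProperty("article:published_time");
        if (date != null) {
          Parse parse = parseResult.get(content.getUrl());
          // Stash it in the parse metadata so an indexing filter can map it to a Solr field.
          parse.getData().getParseMeta().set("published_date", date);
        }
        return parseResult;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }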
RE: URL filtering: crawling time vs. indexing time
-Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 02-Nov-2012 10:04 To: user@nutch.apache.org Subject: URL filtering: crawling time vs. indexing time I feel like this is a trivial question, but I just can't get my ahead around it. I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the rudimentary level. If my understanding is correct, the regex-es in nutch/conf/regex-urlfilter.txt control the crawling behavior, ie., which URLs to visit or not in the crawling process. Yes. On the other hand, it doesn't seem artificial for us to only want certain pages to be indexed. I was hoping to write some regular expressions as well in some config file, but I just can't find the right place. My hunch tells me that such things should not require into-the-box coding. Can anybody help? What exactly do you want? Add your custom regular expressions? The regex-urlfilter.txt is the place to write them to. Again, the scenario is really rather generic. Let's say we want to crawl http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and unncessary file types etc., but only expect to index pages with URLs like: http://www.mysite.com/level1pattern/level2pattern/pagepattern.html. To do this you must simply make sure your regular expressions can do this. Am I too naive to expect zero Java coding in this case? No, you can achieve almost all kinds of exotic filtering with just the URL filters and the regular expressions. Cheers
RE: URL filtering: crawling time vs. indexing time
Ah, i understand now. The indexer tool can filter as well in 1.5.1 and if you enable the regex filter and set a different regex configuration file when indexing vs. crawling you should be good to go. You can override the default configuration file by setting urlfilter.regex.file and point it to the regex file you want to use for indexing. You can set it via nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ... Cheers -Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 02-Nov-2012 17:55 To: user@nutch.apache.org Subject: Re: URL filtering: crawling time vs. indexing time I'm not sure I get it. Again, my problem is a very generic one: - The patterns in regex-urlfitler.txt, howevery exotic they are, they control ***which URLs to visit***. - Generally speaking, the set of ULRs to be indexed into solr is only a ***subset*** of the above. We need a way to specify crawling filter (which is regex-urlfitler.txt) vs. indexing filter, I think. On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux r...@teorem.fr wrote: You have still several possibilities here : 1) find a way to seed the crawl with the URLs containing the links to the leaf pages (sometimes it is possible with a simple loop) 2) create regex for each step of the scenario going to the leaf page, in order to limit the crawl to necessary pages only. Use the $ sign at the end of your regexp to limit the match of regexp like http://([a-z0-9]*\.)* mysite.com. Le 2 nov. 2012 à 17:22, Joe Zhang smartag...@gmail.com a écrit : The problem is that, - if you write regex such as: +^http://([a-z0-9]*\.)*mysite.com, you'll end up indexing all the pages on the way, not just the leaf pages. - if you write specific regex for http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you start crawling at mysite.com, you'll get zero results, as there is no match. On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma markus.jel...@openindex.iowrote: -Original message- From:Joe Zhang smartag...@gmail.com Sent: Fri 02-Nov-2012 10:04 To: user@nutch.apache.org Subject: URL filtering: crawling time vs. indexing time I feel like this is a trivial question, but I just can't get my ahead around it. I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the rudimentary level. If my understanding is correct, the regex-es in nutch/conf/regex-urlfilter.txt control the crawling behavior, ie., which URLs to visit or not in the crawling process. Yes. On the other hand, it doesn't seem artificial for us to only want certain pages to be indexed. I was hoping to write some regular expressions as well in some config file, but I just can't find the right place. My hunch tells me that such things should not require into-the-box coding. Can anybody help? What exactly do you want? Add your custom regular expressions? The regex-urlfilter.txt is the place to write them to. Again, the scenario is really rather generic. Let's say we want to crawl http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and unncessary file types etc., but only expect to index pages with URLs like: http://www.mysite.com/level1pattern/level2pattern/pagepattern.html. To do this you must simply make sure your regular expressions can do this. Am I too naive to expect zero Java coding in this case? No, you can achieve almost all kinds of exotic filtering with just the URL filters and the regular expressions. Cheers
RE: Information about compiling?
Hi, There are binary versions of 1.5.1 but not 2.x. http://apache.xl-mirror.nl/nutch/1.5.1/ About the scripts, you have to build nutch and then go to runtime/local directory to run bin/nutch. Cheers -Original message- From:Dr. Thomas Zastrow p...@thomas-zastrow.de Sent: Thu 01-Nov-2012 10:45 To: user@nutch.apache.org Subject: Information about compiling? Dear all, I found the following tutorial on the web: http://wiki.apache.org/nutch/NutchTutorial It starts with a binary version of Nutch. Unfortunateley, I didn't found any binary version, just the source code on the web page? So, I downloaded the latest version and compiled it with ant. Everything seems to work, but I'm a little bit confused about the paths and how I should go on? Following the tutorial, I have to change some files, but they exist in several versions: find . -iname regex-urlfilter.txt ./runtime/local/conf/regex-urlfilter.txt ./conf/regex-urlfilter.txt The same goes for the nutch command, I'm not sure which one is the right one. When I execute /src/bin/nutch with the following parameters: ./nutch crawl /opt/crawls/ -dir /opt/crawls/ -depth 3 -topN 5 I got an error which I understand that the script can not find the jar files: Exception in thread main java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawler Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawler at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) Could not find the main class: org.apache.nutch.crawl.Crawler. Program will exit. Any help would be nice ;-) Best regards and thank you for the software! Tom -- Dr. Thomas Zastrow Süsser Str. 5 72074 Tübingen www.thomas-zastrow.de
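Concretely the workflow looks roughly like this (the urls seed directory and the crawl options are placeholders; in local mode the configuration files read at run time are the copies under runtime/local/conf):

    ant runtime
    cd runtime/local
    # edit conf/regex-urlfilter.txt and conf/nutch-site.xml here for local runs
    bin/nutch crawl urls -dir crawl -depth 3 -topN 5

The script under src/bin is not the one to run; use the bin/nutch that the build drops into runtime/local/bin, which has the job jar on its classpath.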
RE: [crawler-common] infoQ article Apache Nutch 2 Features and Product Roadmap
Cheers! -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Thu 01-Nov-2012 18:30 To: user@nutch.apache.org Subject: Re: [crawler-common] infoQ article Apache Nutch 2 Features and Product Roadmap Nice one Julien. Its nothing short of a privilege to be part of the various communities and working alongside you guys. Have a great night. Lewis On Thu, Nov 1, 2012 at 11:39 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi all, Apologies for cross posting. Srini Penchikala has just published an interview with me about Nutch 2 on InfoQ at http://www.infoq.com/articles/nioche-apache-nutch2. Several projects are mentioned in relation to Nutch, hence the CC. The views and opinions expressed are entirely mine and do not reflect any official position of the Nutch PMC ;-) Thanks Julien -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- You received this message because you are subscribed to the Google Groups crawler-commons group. Visit this group at http://groups.google.com/group/crawler-commons?hl=en-US. -- *Lewis*
RE: fetch time
Hi - Yes, the fetch time is the time when the record is eligible for fetch again. Cheers, -Original message- From:Stefan Scheffler sscheff...@avantgarde-labs.de Sent: Sat 27-Oct-2012 14:49 To: user@nutch.apache.org Subject: fetch time Hi, When i dump out the crawl db, there is a fetch entry for each url, which is over one month in the future... Fetch time: Mon Nov 26 06:09:43 CET 2012 Does this mean, this is the next time of fetching? Regards Stefan
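The roughly one-month offset matches the default re-fetch interval; the relevant property looks like this (name and default as in a stock nutch-default.xml, worth double-checking for your version):

    <property>
      <name>db.fetch.interval.default</name>
      <!-- 2592000 seconds = 30 days until a fetched page becomes eligible again -->
      <value>2592000</value>
    </property>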
RE: Format of content file in segments?
Hi Морозов, It's a directory containing Hadoop map file(s) that store key/value pairs. Hadoop's Text class is the key and Nutch's Content class is the value. You would need Hadoop to easily process the files. http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup Cheers, Markus -Original message- From:Морозов Евгений ant...@yandex.ru Sent: Sat 27-Oct-2012 18:32 To: user@nutch.apache.org Subject: Format of "content" file in segments? Where can I find the format of the content file in a segment directory? Either source code or documentation. I'm looking at reading it with a program external to nutch. regards, keanta
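A small sketch of reading it from an external program with the Hadoop API; the segment path and part file name are placeholders, and the SequenceFile.Reader constructor used here is the classic one from the Hadoop 1.x line Nutch ships with:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class ContentDumper {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // The 'data' part of the map file inside the segment's content directory.
        Path data = new Path("segments/20121027120000/content/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();           // key: the page URL
        Content content = new Content(); // value: raw bytes plus protocol headers and metadata
        while (reader.next(url, content)) {
          System.out.println(url + "\t" + content.getContentType()
              + "\t" + content.getContent().length + " bytes");
        }
        reader.close();
      }
    }

If a human-readable dump is enough, bin/nutch readseg -dump <segment> <outputdir> does the same without writing any code.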
RE: How to recover data from /tmp/hadoop-myuser
Hi, You cannot recover the mapper output as far as i know. But anyway, one should never have a fetcher running for three days. It's far better to generate a large amount of smaller segments and fetch them sequentially. If an error occurs, only a small portion is affected. We never run fetchers for more than one hour, instead we run many in a row and sometimes concurrently. Cheers, -Original message- From:Mohammad wrk mhd...@yahoo.com Sent: Fri 26-Oct-2012 00:47 To: user@nutch.apache.org Subject: How to recover data from /tmp/hadoop-myuser Hi, My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) failed, after 3 days, with the error below. Under the segment folder (./segments/20121021205343/) there is only generated fetch list (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of data. I was wondering if there is a way to recover this data and parse the segment? org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265) at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318) at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327) Thanks, Mohammad
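In script form the small-segments approach suggested above looks something like this (paths, topN and the batch count are placeholders; the sub-command arguments mirror the 1.x tools, so compare with bin/nutch <command> usage on your release):

    # several short fetch cycles instead of one multi-day fetch
    for i in 1 2 3 4 5; do
      bin/nutch generate crawldb segments -topN 50000
      SEGMENT=segments/$(ls segments | sort | tail -1)
      bin/nutch fetch "$SEGMENT" -threads 25
      bin/nutch parse "$SEGMENT"
      bin/nutch updatedb crawldb "$SEGMENT"
    done

If one batch dies, only that segment has to be regenerated and fetched again.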
RE: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf
Hi, -Original message- From:kiran chitturi chitturikira...@gmail.com Sent: Thu 25-Oct-2012 20:49 To: user@nutch.apache.org Subject: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf Hi, i have built Nutch 2.x in eclipse using this tutorial ( http://wiki.apache.org/nutch/RunNutchInEclipse) and with some modifications. Its able to parse html files successfully but when it comes to pdf files it says 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf Is there anything wrong with my eclipse configuration? I am looking to debug some things in nutch, so i am working with eclipse and nutch. Do i need to point any libraries for eclipseto recognize tika parsers for application/pdf type ? What exactly is the reason for this type of error to appear for only pdf files and not html files ? I am using recent nutch 2.x which has tika upgraded to 1.2 This is possible if the PDFBox dependancy is not found anywhere or is wrongly mapped in Tika's plugin.xml. The above error can also happen if you happen to have a tika-parsers-VERSION.jar in your runtime/local/lib directory, for some strange reason. I would like some help here and would like to know if anyone has encountered similar problem with eclipse, nutch 2.x and parsing application/pdf files ? Many Thanks, -- Kiran Chitturi
RE: How to recover data from /tmp/hadoop-myuser
Hi - there's a similar entry already, however, the fetcher.done part doesn't seem to be correct. I can see no reason why that would ever work as Hadoop temp files are simply no copied to the segment if it fails. There's also no notion of an fetcher.done file in trunk. http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Fri 26-Oct-2012 15:15 To: user@nutch.apache.org Subject: Re: How to recover data from /tmp/hadoop-myuser I really think this should be in the FAQ's? http://wiki.apache.org/nutch/FAQ On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, You cannot recover the mapper output as far as i know. But anyway, one should never have a fetcher running for three days. It's far better to generate a large amount of smaller segments and fetch them sequentially. If an error occurs, only a small portion is affected. We never run fetchers for more than one hour, instead we run many in a row and sometimes concurrently. Cheers, -Original message- From:Mohammad wrk mhd...@yahoo.com Sent: Fri 26-Oct-2012 00:47 To: user@nutch.apache.org Subject: How to recover data from /tmp/hadoop-myuser Hi, My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) failed, after 3 days, with the error below. Under the segment folder (./segments/20121021205343/) there is only generated fetch list (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of data. I was wondering if there is a way to recover this data and parse the segment? org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265) at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318) at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327) Thanks, Mohammad -- Lewis
RE: RegEx URL Normalizer
Hi, Check the bottom normalizer, it uses the lookbehind operator to remove double slashes except the first two. Cheers, http://svn.apache.org/viewvc/nutch/trunk/conf/regex-normalize.xml.template?view=markup -Original message- From:Magnús Skúlason magg...@gmail.com Sent: Mon 22-Oct-2012 00:34 To: user@nutch.apache.org Cc: dkavr...@gmail.com; Markus Jelsma markus.jel...@openindex.io Subject: Re: RegEx URL Normalizer Hi, I am interested in doing this i.e. only strip out parameters from url if some other string is found as well, in my case it will be a domain name. I am using 1.5.1 but I am unfamiliar with the look-behind operator. Does anyone have a sample of how this is done? best regards, Magnus On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke alexander.fahlke.mailingli...@googlemail.com wrote: Thanks guys! @Dinçer: This does not check if the URL contains document.py. :( @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize RegexURLNormalizer. ;) -- regexNormalize(String urlString, String scope) { ... It now simple stupid checks if urlString contains document.py and then cuts out the unwanted stuff. I made this is even configurable via nutch-site.xml. Nutch 1.4 would be better for this. Maybe in the next project. BR On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal dkavr...@gmail.com wrote: Hi Alexander, Would this one work? (I am far away from a Nutch installation to test) (?:[?&](?:Date|Sort|Page|pos|anz)=[^?&]+|([?&](?:Name|Art|Blank|nr)=[^?&]*)) Don't forget to use &amp; instead of & in the regex. Best, Dinçer 2011/9/5 Alexander Fahlke alexander.fahlke.mailingli...@googlemail.com Hi! I have problems with the right setup of the RegExURLNormalizer. It should strip out some parameters for a specific script. Only pages where document.py is present should be normalized. Here is an example: Input: http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf Output: http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf Date, Sort, Page, pos, anz are the parameters to be stripped out. I tried it with the following setup: ([;_]?((?i)l|j|bv_)?((?i)date| sort|page|pos|anz)=.*?)(\?|&|#|$) How to tell nutch to use this regex only for pages with document.py? BR -- Alexander Fahlke Software Development www.informera.de -- Alexander Fahlke Software Development www.informera.de
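For reference, the kind of rule being pointed at, plus a sketch of the conditional stripping asked about, would look roughly like this in regex-normalize.xml. The first rule may differ slightly from the exact pattern in the shipped template, and the second is entirely illustrative: the document.py marker, the parameter names and the 500-character bound on the lookbehind are assumptions to adapt:

    <!-- collapse duplicate slashes, but not the two that follow the protocol -->
    <regex>
      <pattern>(?&lt;!:)/{2,}</pattern>
      <substitution>/</substitution>
    </regex>

    <!-- strip Date/Sort/Page/pos/anz parameters only on URLs containing document.py -->
    <regex>
      <pattern>(?&lt;=document\.py\?.{0,500})&amp;(?:Date|Sort|Page|pos|anz)=[^&amp;]*</pattern>
      <substitution></substitution>
    </regex>

Java only accepts bounded quantifiers inside a lookbehind, hence the {0,500}; a parameter that sits directly after the '?' rather than after an '&' would need one more rule.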
RE: Best practice to index a large crawl through Solr?
Hi - Hadoop can write more records per second than Solr can analyze and store, especially with multiple reducers (threads in Solr). SolrCloud is notoriously slow when it comes to indexing compared to a stand-alone setup. However, this should not be a problem at all as you're not dealing with millions of records. Trying to tie HBase as a backend to Solr is not a good idea at all. The best and fastest storage for Solr is a disk with MMapDirectory enabled (the default in recent versions) and plenty of RAM. Keep in mind that Solr keeps several parts of the index in memory, and more if it can, and it is very efficient in doing that. With only a few million records it's easy and fast enough to run Hadoop locally (or pseudo-distributed if you can) and have a single Solr node running. -Original message- From:Thilina Gunarathne cset...@gmail.com Sent: Mon 22-Oct-2012 22:35 To: user@nutch.apache.org Subject: Re: Best practice to index a large crawl through Solr? Hi Alex, Thanks again for the information. My current requirement is to implement a simple searching application for a publication. Our current data sizes probably would not exceed the amount of records you mentioned and for now, we should be fine with a single Solr instance. I'm going to check out the SolrCloud for our future needs. Hm, so you are thinking Nutch - HBase - Solr - HBase, that does sound pretty crazy. I agree :).. Unfortunately (or may be luckily) I do not have much time to invest on this and I'll probably have to rely on the existing tools, rather than trying to reinvent the wheels :).. thanks, Thilina On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres acace...@hyperiongray.com wrote: No problem. Wrt to your first question, Solr would actually be storing this data locally. Solr sharding actually uses its own mechanism called SolrCloud. I'd recommend checking it out here: http://wiki.apache.org/solr/SolrCloud, it seems cool though I have not used it myself. Hm, so you are thinking Nutch - HBase - Solr - HBase, that does sound pretty crazy. You can most definitely find a more efficient way to do this, either by going to HBase directly from the start (I wouldn't do so personally) or just using Solr. It might be good to know what kind of application you are looking to build and asking more specifically. Alex On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne cset...@gmail.com wrote: Hi Alex, Thanks for the very fast response :).. It sort of depends on your purpose and the amount of data. I currently have a single Solr instance (~1GB of memory, 2 processors on the server) serving almost ~3,700,000 records from Nutch and it's still working great for me. If you have around that I'd say a single Solr instance is OK, depending on if you are planning on making your data publicly available or not. This is very useful information. In this case, would the Solr instance be retrieving and storing all the data locally or is it still using the Nutch data store to retrieve the actual content while serving the queries? If you're creating something larger of some sort, Solr 4.0, which supports sharding natively would be a great option (I think it's still in Beta, but if you're feeling brave...). This is especially true if you are creating a search engine of some sort, or would like easily searchable data. That's interesting. I'll check that out. By any chance, do you know whether the Solr sharding is using the HDFS to store the data or is it using its own infrastructure?
I would imagine doing this directly from HBase would not be a great option, as Nutch is storing the data in the format that is convenient for Nutch itself to use, and not so much in a format that it is friendly for you to reuse for your own purposes. I was actually thinking of a scenario where we would use Solr to index the data and storing the resultant index in HBase. Then using the HBase directly to perform simple index lookups.. Please pardon my lack of knowledge on Nutch and Solr, if the above sounds ludicrous :).. thanks, Thilina IMO your best bet is going to try out Solr 4.0. Alex On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne cset...@gmail.com wrote: Dear All, What would be the best practice to index a large crawl using Solr? The crawl is performed on a multi node Hadoop cluster using HBase as the back end.. Would Solr become a bottleneck if we use just a single Solr instance? Is it possible to store the indexed data on HBase and to serve them from the HBase it self? thanks a lot, Thilina -- https://www.cs.indiana.edu/~tgunarat/ http://www.linkedin.com/in/thilina http://thilina.gunarathne.org -- ___ Alejandro Caceres Hyperion Gray, LLC Owner/CTO --
RE: Best practice to index a large crawl through Solr?
Hi -Original message- From:Thilina Gunarathne cset...@gmail.com Sent: Tue 23-Oct-2012 00:38 To: user@nutch.apache.org Subject: Re: Best practice to index a large crawl through Solr? Hi Markus, Thanks a lot for the info. Hi - Hadoop can write more records per second than Solr can analyze and store, especially with multiple reducers (threads in Solr). SolrCloud is notoriously slow when it comes to indexing compared to a stand-alone setup. Can this be overcome by using the Nutch Solrindex job for indexing? In other words, does the Solr becomes a bottleneck for the SolrIndex job? Nutch trunk can only write to a single Solr URL and if you have more than a few reducers Solr is the bottleneck. But that should not be a problem when dealing with a few milliion records. It is a matter of minutes. Out of curiosity, does SolrCloud supports any data locality when loading data from Nutch? For an example, if I'm co-locating SolrCloud on the same nodes that are running Hadoop/HBase, can SolrCloud work with the local region servers to load the data? Eventually, we would have to process millions of records and I'm just wondering whether the communication between Nutch and Solr would be a huge bottleneck. Data locallity is more a thing for distributed processing, moving the program to the data in the assumption that it's cheaper in terms of bandwidth. That does not apply to SolrCloud, it works with hash ranges based on your ID and then points documents to a specific shard (see SolrCloud wiki page referred to in this thread). If you want a stable and performing Nutch and Solr cluster you must separate them. Both have specific resource requirements and should not run on the same node. If you mix them, it is hard to provide a reliable service. We operate one Nutch cluster and several Solr clusters with a lot of documents and don't worry about the bottleneck. Based on my experiences i think you should not worry too much at this point about Solr being an indexing bottle neck, you can scale out if it becomes a problem. A significant improvement in very large scale indexing from a Nutch cluster to a SolrCloud cluster is NUTCH-1377 but it's tedious to implement. Right now we don't yet need it because the bottleneck is insignificant for now, even with many millions of documents. Unless you are going to work with A LOT of records this should not be a big problem for the next few months. https://issues.apache.org/jira/browse/NUTCH-1377 thanks, Thilina However, this should not be a problem at all as your not dealing with millions of records. Trying to tie HBase as a backend to Solr is not a good idea at all. The best and fastest storage for Solr is a disk and MMappedDirectory enabled (default in recent version) and plenty of RAM. Keep in mind that Solr keeps several parts of the index in memory and others if it can and it is very efficient in doing that. With only a few million records it's easy and fast enough to run Hadoop locally (or pseudo if you can) and have a single Solr node running. -Original message- From:Thilina Gunarathne cset...@gmail.com Sent: Mon 22-Oct-2012 22:35 To: user@nutch.apache.org Subject: Re: Best practice to index a large crawl through Solr? Hi Alex, Thanks again for the information. My current requirement is to implement a simple searching application for a publication. Our current data sizes probably would not exceed the amount of records you mentioned and for now, we should be fine with a single Solr instance. I'm going to check out the SolrCloud for our future needs. 
Hm, so you are thinking Nutch - HBase - Solr - HBase, that does sound pretty crazy. I agree :).. Unfortunately (or may be luckily) I do not have much time to invest on this and I'll probably have to rely on the existing tools, rather than trying to reinvent the wheels :).. thanks, Thilina On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres acace...@hyperiongray.com wrote: No problem. Wrt to your first question, Solr would actually be storing this data locally. Solr sharding actually uses its own mechanism called SolrCloud. I'd recommend checking it out here: http://wiki.apache.org/solr/SolrCloud, it seems cool though I have not used it myself. Hm, so you are thinking Nutch - HBase - Solr - HBase, that does sound pretty crazy. You can most definitely find a more efficient way to do this, either by going to HBase directly from the start (I wouldn't do so personally) or just using Solr. It might be good to know what kind of application you are looking to build and asking more specifically. Alex On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne cset...@gmail.com wrote: Hi Alex, Thanks for the very fast response :).. It sort of depends on your purpose and the amount of
RE: Fetcher Thread
Hi Ye, -Original message- From:Ye T Thet yethura.t...@gmail.com Sent: Thu 18-Oct-2012 15:46 To: user@nutch.apache.org Subject: Fetcher Thread Hi Folks, I have two questions about the Fetcher Thread in Nutch. The value fetcher.threads.fetch in configuration file determines the number of threads the Nutch would use to fetch. Of course threads.per.host is also used for politeness. I set 100 for fetcher.threads.fetch and 2 for threads.per.host value. So far on my development I have been using only one linux box to fetch thus it is clear that Nutch would fetch 100 urls at time provided that the threads.per.host criteria is met. The questions are: 1. What if I crawl on a hadoop cluster with with 5 linux box and set the fetcher.threads.fetch to 100? Would Nutch fetch 100 url at time or 500 (5 x 100) at time? All nodes are isolated and don't know what the other is doing. So if you set the threads to 100 for each machine, each machine will run with 100 threads. 2. Any advise on formulating optimum fetcher.threads.fetch and threads.per.host for a hadoop cluster with 5 linux box (Amazon EC2 medium instance, 3.7 GB memory). I would be crawling around 10,000 (10k) web sites. I think threads per host must not exceed 1 for most websites out of politeness. You can set the number of threads as high as you can, it only takes more memory. If you parse in the fetcher as well, you can run much fewer threads. Thanks, Ye
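Putting that advice into nutch-site.xml would look something like this (property names as used in this thread; newer releases use fetcher.threads.per.queue for the per-host limit, so check nutch-default.xml for your version):

    <property>
      <name>fetcher.threads.fetch</name>
      <!-- per node: with 5 fetcher nodes this allows up to 500 concurrent fetch threads in total -->
      <value>100</value>
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <!-- politeness: one request at a time per host -->
      <value>1</value>
    </property>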