Re: Issues on Compiling Nutch 2.x with Eclipse
Hi Tejas, Thanks a lot for setting up this new setup guide. It really helped me and maybe many other new Nutch users. Tony.

On Tue, Jun 11, 2013 at 7:02 AM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Tony, The simplified steps with snapshots are now added to the Nutch wiki [0]. It would be helpful if you could try those out and let us know if there are any improvements or corrections you can think of. PS: A few images look shrunk. I will be fixing that soon. [0]: https://wiki.apache.org/nutch/RunNutchInEclipse

On Mon, Jun 10, 2013 at 2:57 PM, Tejas Patil tejas.patil...@gmail.com wrote: I have created a Google doc [0] with several snapshots describing how to set up Nutch 2.x + Eclipse. This is different from the one on the wiki page and is tailored for Nutch 2.x. Please try it out and let us know if you still have issues with it. Based on your comments, I will add the same to the Nutch wiki. [0]: https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing

On Mon, Jun 10, 2013 at 11:32 AM, Tejas Patil tejas.patil...@gmail.com wrote: Yes.
- Close the project in Eclipse. Right-click on the project, click on Properties and get the location of the project.
- Go to that location in a terminal.
- Run 'ant eclipse'. (Note that you need to have Apache Ant (http://ant.apache.org/manual/index.html) installed and configured.)
After going to the command line, you might as well do this: specify the GORA backend in nutch-site.xml, uncomment its dependency in ivy/ivy.xml, and ensure that the store you selected is set as the default datastore in gora.properties.

On Mon, Jun 10, 2013 at 11:21 AM, Tony Mullins tonymullins...@gmail.com wrote: Hi, So the latest Nutch 2.x includes Tejas' patch (https://issues.apache.org/jira/browse/NUTCH-1577), which means if I have the latest source then it already has that patch. Now can someone please help me with what is meant by the second-to-last step, 'Run ant eclipse', on http://wiki.apache.org/nutch/RunNutchInEclipse? Do I need to go to the location where the source is and give the ant command 'ant -f build.xml', or is it something else? And after refreshing the source, will Eclipse let me compile and run my code? Thanks, Tony

On Mon, Jun 10, 2013 at 6:56 PM, Tony Mullins tonymullins...@gmail.com wrote: Hi Lewis, I understand that there may be something wrong on my end. And as I said, I get different errors when running Nutch 2.x with Eclipse after following different tutorials. My background is in .NET and I may well move to Java just because of this project (Nutch). But at the moment I am having a difficult time understanding the setup/configuration required to run Nutch in Eclipse. When you say '...you may find it convenient to patch your dist with Tejas' Eclipse ant target and simply run 'ant eclipse' from within your terminal prior to doing a file, import, existing projects into workspace from within Eclipse...', which patch do I need to get and how do I apply it? And by running 'ant eclipse', do you mean dropping build.xml into the Ant window in Eclipse, or building the Nutch source using the 'ant -f build.xml' command in a terminal? (By the way, I have done both, and both successfully build the source, but Eclipse doesn't run it.) So could you please guide me here in more detail? I would be really grateful to you and the Nutch community. Thanks, Tony.

On Mon, Jun 10, 2013 at 6:38 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Tony, These issues stem from your environment not being correct.
I, like many others, have been able to debug and develop the Nutch 1.7 and 2.x series from within Eclipse. As you are working with 2.x source, you may find it convenient to patch your dist with Tejas' Eclipse ant target and simply run 'ant eclipse' from within your terminal prior to doing File > Import > Existing Projects into Workspace from within Eclipse. I can guarantee you, the reason the tutorial is on the Nutch wiki is because at some stage, someone (many, many people), somewhere, has found it useful for developing Nutch in Eclipse. I don't want to sound like a balloon here, but your Java security exceptions are not a problem with Nutch... it's your environment. hth

On Monday, June 10, 2013, Tony Mullins tonymullins...@gmail.com wrote: Hi, OK, now I have followed this tutorial word for word: http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse. After getting the new 2.2 source, I built it using Ant, which was successful, then set the configurations, commented out the 'hsqldb' dependency and uncommented the Cassandra dependency (as I want to run it against Cassandra). After doing all this
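As a concrete illustration of the GORA backend steps described above, here is a minimal sketch for the Cassandra case (the server address and port are assumptions for a default local setup). In conf/nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>

In conf/gora.properties:

gora.cassandrastore.servers=localhost:9160

And in ivy/ivy.xml, uncomment the gora-cassandra dependency, then run 'ant eclipse' again so the new jars reach the Eclipse classpath.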
Re: Nutch Compilation Error with Eclipse
Hi, Thank you so much for providing detailed steps and updating the wiki site. My first problem is resolved (Nutch compilation through Eclipse) and now I can run the injector job from Eclipse. Now I'm trying to debug the crawl process to understand the internal workings of Nutch with Cassandra, so I can write a new plugin for my requirement. It would be great if you could provide me the program arguments ('commands') and their sequence, in order to run and understand the crawling process. I have gone through the script file (nutch2.2/src/bin/crawl), which dictates the whole crawling process in a few steps: 1. Inject, 2. Generate, 3. Fetch, 4. Parse, ... Now I am having difficulty figuring out how to add these commands to my Eclipse program/environment arguments to start a specific job and debug it. I would really appreciate your help in this regard. Regards, Jamshaid

On Tue, Jun 11, 2013 at 7:01 AM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Jamshaid, The simplified steps with snapshots are now added to the Nutch wiki [0]. It would be helpful if you could try those out and let us know if there are any improvements or corrections you can think of. PS: A few images look shrunk. I will be fixing that soon. [0]: https://wiki.apache.org/nutch/RunNutchInEclipse

On Mon, Jun 10, 2013 at 2:58 PM, Tejas Patil tejas.patil...@gmail.com wrote: I have created a Google doc [0] with several snapshots describing how to set up Nutch 2.x + Eclipse. This is different from the one on the wiki page and is tailored for Nutch 2.x. Please try it out and let us know if you still have issues with it. Based on your comments, I will add the same to the Nutch wiki. [0]: https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing

On Mon, Jun 10, 2013 at 6:23 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, It is (IMHO) kind of fruitless running the crawl class (which is deprecated now; we highly suggest you use and amend the /src/bin/crawl script for your use case) within Eclipse. You will learn far more setting breakpoints within individual classes and watching them execute on that basis. I notice you've not provided a URL directory to the crawl argument anyway, so you will need to sort that out. Best, Lewis

On Monday, June 10, 2013, Jamshaid Ashraf jamshaid...@gmail.com wrote: I'm performing the following tasks. Commands in the Arguments tab:
Program arguments: urls -dir crawl -depth 3 -topN 50
VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
And then just running the code. Regards, Jamshaid

On Mon, Jun 10, 2013 at 4:54 PM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi, Which task do you try to launch? Benjamin

On Mon, Jun 10, 2013 at 1:57 PM, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi, I am new to Nutch. I am trying to use Nutch with Cassandra and have successfully built Nutch 2.x, but it shows the following error when I run it from the latest Eclipse:

java.lang.NullPointerException
  at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
  at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)

I will be grateful for any help someone can provide. Thanks.
Re: Nutch Compilation Error with Eclipse
If you want to find the Java class corresponding to any command, just peek inside the src/bin/nutch script; at the bottom you will find a switch case with a case corresponding to each command. For 2.x, here are the important classes:

inject - org.apache.nutch.crawl.InjectorJob
generate - org.apache.nutch.crawl.GeneratorJob
fetch - org.apache.nutch.fetcher.FetcherJob
parse - org.apache.nutch.parse.ParserJob
updatedb - org.apache.nutch.crawl.DbUpdaterJob

Create a separate launcher for each of these. Running them without any input parameters will show you the usage of each command.
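For illustration, a minimal launcher sketch (the class name EclipseInjectLauncher and the 'urls' seed directory are assumptions, not part of Nutch); each of the job classes above implements Hadoop's Tool, so it can be driven through ToolRunner with the same arguments the bin/nutch script would pass:

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.InjectorJob;
import org.apache.nutch.util.NutchConfiguration;

public class EclipseInjectLauncher {
  public static void main(String[] args) throws Exception {
    // equivalent to "bin/nutch inject urls"; "urls" is the seed-list directory
    int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
        new String[] { "urls" });
    System.exit(res);
  }
}

Alternatively, point the Eclipse run configuration's main class directly at org.apache.nutch.crawl.InjectorJob and put the same arguments in the Program arguments field.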
RE: using Tika within Nutch to remove boiler plates?
we don't use Boilerpipe anymore so there's no point in sharing. Just set these two configuration options in nutch-site.xml:

<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>

and it should work.

-----Original message-----
From: Joe Zhang smartag...@gmail.com
Sent: Tue 11-Jun-2013 01:42
To: user user@nutch.apache.org
Subject: Re: using Tika within Nutch to remove boiler plates?
Markus, do you mind sharing a sample nutch-site.xml?

On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma markus.jel...@openindex.io wrote: Those settings belong to nutch-site. Enable BP and set the correct extractor and it should work just fine.

-----Original message-----
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Sun 09-Jun-2013 20:47
To: user@nutch.apache.org
Subject: Re: using Tika within Nutch to remove boiler plates?
Hi Joe, I've not used this feature; it would be great if one of the others could chime in here. From what I can infer from the correspondence on the issue, and the available patches, you should be applying the most recent one uploaded by Markus [0] as your starting point. This is dated 22/11/2011.

On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com wrote: One of the comments mentioned the following: tika.use_boilerpipe=true, tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor. Which part of the code is it referring to?

You will see this included in one of the earlier patches uploaded by Markus on 11/05/2011 [1].

Also, within the current Nutch config, should I focus on parse-plugin.xml?

Look at the other patches and also Gabriele's comments. You may most likely need to alter something, but AFAICT the work has been done... it's just a case of pulling together several contributions. Maybe you should look at the patch for 2.x (uploaded most recently by Roland) and see what is going on there. hth

[0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
[1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
Data Extraction from 100+ different sites...
Hi, I have 100+ different sites (and maybe more will be added in the near future). I have to crawl them and extract my required information from each site, so each site would have its own extraction rule (XPaths). So far I have seen there is no built-in mechanism in Nutch to fulfill my requirement, and I may have to write a custom HtmlParseFilter extension and an IndexingFilter plugin. And I may have to write 100+ switch cases in my plugin to handle the extraction rules of each site. Is this the best way to handle my requirement, or is there a better way to handle it? Thanks for your support and help. Tony.
RE: Data Extraction from 100+ different sites...
Hi, Yes, you should write a plugin that has a parse filter and an indexing filter. To ease maintenance you would want to have a file per host/domain containing XPath expressions; that is far easier than switch statements that need to be recompiled. The indexing filter would then index the field values extracted by your parse filter. Cheers, Markus
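To make the file-per-host idea concrete, a hypothetical expressions file (the directory layout, field names and XPath expressions below are invented for illustration), say conf/xpaths/www.example.com.txt, could look like:

title=//h1[@class='movie-title']/text()
rating=//span[@itemprop='ratingValue']/text()
review=//div[@id='review-body']//p

The parse filter picks the file matching the page's host and evaluates each expression, and the indexing filter copies the resulting field values into the index; adding a new site then means adding a file, not recompiling the plugin.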
Re: using Tika within Nutch to remove boiler plates?
Any particular reason why you don't use Boilerpipe any more? So what do you suggest as an alternative?
Re: Data Extraction from 100+ different sites...
Hi Markus, I couldn't understand how I can avoid switch cases in your suggested approach. I would have one plugin which implements HtmlParseFilter, and I would have to check the current URL by calling content.getUrl(); since this all happens in the same class, I would have to add switch cases... I could add the XPath expressions for each site in separate files, but to get the right XPath expression I would have to decide which file to read, and for that I would have to put this logic in a switch case. Please correct me if I am getting this all wrong! And I think this is a common requirement for web crawling solutions, to get custom data from a page... so aren't there any such Nutch plugins already available on the web? Thanks, Tony.
Re: Data Extraction from 100+ different sites...
Hi Tony, So if I understand correctly, you have 100+ web pages, each with a totally different format, that you're trying to extract separate/unrelated pieces of information from. If there's no connection between any of the web pages or any of the pieces of information that you're trying to extract, then it's pretty much unavoidable to have to provide separate identifiers and cases for finding each one. Markus' suggestion, I believe, is to just have a dictionary file with the URL as the key and the XPath expression for the info that you want as the value. No matter what crawling/parsing platform you're using, a solution of that sort is pretty much unavoidable given the assumptions. That being said, is there any common form that the data you're trying to extract from these pages follows? Is there a regex that could match it, or anything else that might identify it in a common way? Alex
Re: Data Extraction from 100+ different sites...
Hi Tony! Kinda like that:

// host -> xpath map; initialize the map only once
static Map<String, String> xpaths = ...
...
String xpath = xpaths.get(URLUtil.getHost(content.getUrl()));

(content.getUrl() returns a String, so use org.apache.nutch.util.URLUtil to get the host.) Best regards, Alexander
Re: Data Extraction from 100+ different sites...
Yes, all the web pages will have different HTML structure/layout and I will have to identify/define an XPath expression for each one of them. But I am trying to come up with a generic output format for these XPath expressions, so that whatever the expression is, the result lands in, let's say, Field A, Field B, Field C (in some cases some of these fields could be blank as well). That way I can map them to my Solr schema properly. In this regard I was hoping to get some help or guidance from your past experience... Thanks, Tony
Re: Data Extraction from 100+ different sites...
I'm a bit confused on where the requirement to *crawl* these sites comes into it. From what you're saying, it looks like you're just talking about parsing the contents of a list of sites that you're trying to extract data from, in which case there's not much of a use case for Nutch... or am I confused?
Re: Data Extraction from 100+ different sites...
I have to crawl the sub-links of these sites as well, identify the pattern of those sub-links' HTML layout, and extract my required data. One example could be a movie review site: every page of that site would (ideally) have the same HTML layout describing a particular movie, and I have to extract the info from each such page. And for this requirement I am relying on Nutch + an HtmlParseFilter plugin!
RE: using Tika within Nutch to remove boiler plates?
Yes, Boilerpipe is complex and difficult to adapt. It also requires you to preset an extraction algorithm, which is impossible for us. I've created an extractor instead that works for most pages and ignores stuff like news overviews and major parts of homepages. It's also tightly coupled with our date extractor (based on [1]), language detector (based on LangDetect) and image extraction. In many cases Boilerpipe's ArticleExtractor will work very well, but date extraction such as NUTCH-141 won't do the trick, as it only works on the extracted text as a whole and does not consider page semantics. [1]: https://issues.apache.org/jira/browse/NUTCH-1414
RE: Data Extraction from 100+ different sites...
You can use URLUtil in that parse filter to determine which host/domain you are on and lazily load the file with expressions for that host. Just keep a Map<hostname, List<expressions>> in your object and load lists of expressions on demand.
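A minimal sketch of that lazy-loading approach (the conf/xpaths directory and one-expression-per-line file format are assumptions carried over from the illustration earlier in this thread, not Nutch API):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.nutch.util.URLUtil;

public class XPathRules {
  // hostname -> XPath expressions, filled on demand
  private final Map<String, List<String>> byHost = new HashMap<>();

  public List<String> forUrl(String url) {
    String host = URLUtil.getHost(url);
    return byHost.computeIfAbsent(host, h -> {
      try {
        // hypothetical layout: one expressions file per host
        return Files.readAllLines(Paths.get("conf/xpaths", h + ".txt"));
      } catch (IOException e) {
        return Collections.emptyList(); // no rules for this host
      }
    });
  }
}

A parse filter would call forUrl(...) with the page URL once per document and evaluate each returned expression against the parsed DOM.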
Re: using Tika within Nutch to remove boiler plates?
So what in your opinion is the most effective way of removing boilerplate in Nutch crawls?
RE: using Tika within Nutch to remove boiler plates?
In my opinion Boilerpipe is the most effective free and open source tool for the job :) It does require some patching (see the linked issues) and a manual upgrade to Boilerpipe 1.2.0.
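For anyone wanting to sanity-check Boilerpipe's output before wiring it into Nutch, here is a minimal standalone sketch against Boilerpipe's own API (the sample HTML string is a placeholder):

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeSmokeTest {
  public static void main(String[] args) throws Exception {
    String html = "<html><body><div id='nav'>Home | News | Login</div>"
        + "<p>The actual article text lives here.</p></body></html>";
    // ArticleExtractor is the same extractor named by tika.boilerpipe.extractor above
    String text = ArticleExtractor.INSTANCE.getText(html);
    System.out.println(text);
  }
}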
Re: using Tika within Nutch to remove boiler plates?
It is interesting that you acknowledge BP as the most effective, yet you also said you don't use it any more.