Re: Issues on Compiling Nutch 2.x with Eclipse
Hi Tejas, Thanks a lot for setting up this new setup guide. It really helped me and maybe many other new Nutch users. Tony.

On Tue, Jun 11, 2013 at 7:02 AM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Tony, The simplified steps with snapshots are now added to the Nutch wiki [0]. It would be helpful if you could try those out and let us know if there are any improvements or corrections you can think of. PS: A few images look shrunk. I will be fixing that soon. [0]: https://wiki.apache.org/nutch/RunNutchInEclipse

On Mon, Jun 10, 2013 at 2:57 PM, Tejas Patil tejas.patil...@gmail.com wrote: I have created a Google doc [0] with several snapshots describing how to set up Nutch 2.x + Eclipse. This is different from the one on the wiki page and is tailored for Nutch 2.x. Please try it out and let us know if you still have issues with it. Based on your comments, I will add the same to the Nutch wiki. [0]: https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing

On Mon, Jun 10, 2013 at 11:32 AM, Tejas Patil tejas.patil...@gmail.com wrote: Yes.
- Close the project in Eclipse. Right-click on the project, click on Properties and get the location of the project.
- Go to that location in a terminal.
- Run 'ant eclipse'. (Note that you need to have Apache Ant (http://ant.apache.org/manual/index.html) installed and configured.)
After going to the command line, you might as well do this: specify the GORA backend in nutch-site.xml, uncomment its dependency in ivy/ivy.xml, and ensure that the store you selected is set as the default datastore in gora.properties.

On Mon, Jun 10, 2013 at 11:21 AM, Tony Mullins tonymullins...@gmail.com wrote: Hi, So the latest Nutch 2.x includes Tejas' patch (https://issues.apache.org/jira/browse/NUTCH-1577), which means if I have the latest source then it already has that patch. Now can someone please help me with what is meant by the second-to-last step, 'Run ant eclipse', on http://wiki.apache.org/nutch/RunNutchInEclipse? Do I need to go to the location where the source is and give the ant command 'ant -f build.xml', or is it something else? And after refreshing the source, will Eclipse let me compile and run my code? Thanks, Tony

On Mon, Jun 10, 2013 at 6:56 PM, Tony Mullins tonymullins...@gmail.com wrote: Hi Lewis, I understand that there may be something wrong on my end. And as I said, I get different errors when running Nutch 2.x with Eclipse after following different tutorials. My background is in .NET and I may well move to Java just because of this project (Nutch). But at the moment I am having a difficult time understanding the setup/configuration required to run Nutch in Eclipse. When you say '...you may find it convenient to patch your dist with Tejas' Eclipse ant target and simply run 'ant eclipse' from within your terminal prior to doing a file, import, existing projects into workspace from within Eclipse...', which patch do I need to get and how do I apply it? And by running 'ant eclipse', do you mean dropping build.xml into the Ant window in Eclipse, or building the Nutch source using the 'ant -f build.xml' command in a terminal? (By the way, I have done both, and both successfully build the source, but Eclipse doesn't run it.) So could you please guide me here in more detail? I would be really grateful to you and the Nutch community. Thanks, Tony.

On Mon, Jun 10, 2013 at 6:38 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Tony, These issues stem from your environment not being correct.
I, like many others, have been able to debug and develop the Nutch 1.7 and 2.x series from within Eclipse. As you are working with 2.x source, you may find it convenient to patch your dist with Tejas' Eclipse ant target and simply run 'ant eclipse' from within your terminal prior to doing File > Import > Existing Projects into Workspace from within Eclipse. I can guarantee you, the reason the tutorial is on the Nutch wiki is because at some stage, someone (many, many people), somewhere, has found it useful for developing Nutch in Eclipse. I don't want to sound like a balloon here, but your Java security exceptions are not a problem with Nutch... it's your environment. hth

On Monday, June 10, 2013, Tony Mullins tonymullins...@gmail.com wrote: Hi, OK, now I have followed this tutorial word for word: http://wiki.apache.org/nutch/RunNutchInEclipse#Checkout_Nutch_in_Eclipse. After getting the new 2.2 source, I built it using Ant, which was successful, then set the configurations, commented out the 'hsqldb' dependency and uncommented the Cassandra dependency (as I want to run it against Cassandra). After doing all this
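As a concrete illustration of the GORA backend steps described above, here is a minimal sketch for the Cassandra case (the server address and port are assumptions for a default local setup). In conf/nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>

In conf/gora.properties:

gora.cassandrastore.servers=localhost:9160

And in ivy/ivy.xml, uncomment the gora-cassandra dependency, then run 'ant eclipse' again so the new jars reach the Eclipse classpath.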
Re: Nutch Compilation Error with Eclipse
Hi, Thank you so much for providing detailed steps and updating the wiki site. My first problem is resolved (Nutch compilation through Eclipse) and now I can run the injector job from Eclipse. Now I'm trying to debug the crawl process to understand the internal workings of Nutch with Cassandra, so I can write a new plugin for my requirement. It would be great if you could provide me the program arguments ('commands') and their sequence, in order to run and understand the crawling process. I have gone through the script file (nutch2.2/src/bin/crawl), which dictates the whole crawling process in a few steps: 1. Inject, 2. Generate, 3. Fetch, 4. Parse, ... Now I am having difficulty figuring out how to add these commands to my Eclipse program/environment arguments to start a specific job and debug it. I would really appreciate your help in this regard. Regards, Jamshaid

On Tue, Jun 11, 2013 at 7:01 AM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Jamshaid, The simplified steps with snapshots are now added to the Nutch wiki [0]. It would be helpful if you could try those out and let us know if there are any improvements or corrections you can think of. PS: A few images look shrunk. I will be fixing that soon. [0]: https://wiki.apache.org/nutch/RunNutchInEclipse

On Mon, Jun 10, 2013 at 2:58 PM, Tejas Patil tejas.patil...@gmail.com wrote: I have created a Google doc [0] with several snapshots describing how to set up Nutch 2.x + Eclipse. This is different from the one on the wiki page and is tailored for Nutch 2.x. Please try it out and let us know if you still have issues with it. Based on your comments, I will add the same to the Nutch wiki. [0]: https://docs.google.com/document/d/1qvJwrZ9Sc0NAF9p3ie4uV7JsfCHxnrh9QF19HINw48c/edit?usp=sharing

On Mon, Jun 10, 2013 at 6:23 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, It is (IMHO) kind of fruitless running the crawl class (which is deprecated now; we highly suggest you use and amend the /src/bin/crawl script for your use case) within Eclipse. You will learn far more setting breakpoints within individual classes and watching them execute on that basis. I notice you've not provided a URL directory to the crawl argument anyway, so you will need to sort that out. Best, Lewis

On Monday, June 10, 2013, Jamshaid Ashraf jamshaid...@gmail.com wrote: I'm performing the following tasks. Commands in the Arguments tab:
Program arguments: urls -dir crawl -depth 3 -topN 50
VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
And then just running the code. Regards, Jamshaid

On Mon, Jun 10, 2013 at 4:54 PM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi, Which task do you try to launch? Benjamin

On Mon, Jun 10, 2013 at 1:57 PM, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi, I am new to Nutch. I am trying to use Nutch with Cassandra and have successfully built Nutch 2.x, but it shows the following error when I run it from the latest Eclipse:

java.lang.NullPointerException
  at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
  at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)

I will be grateful for any help someone can provide. Thanks.
Re: Nutch Compilation Error with Eclipse
If you want to find the Java class corresponding to any command, just peek inside the src/bin/nutch script; at the bottom you will find a switch case with a case corresponding to each command. For 2.x, here are the important classes:

inject - org.apache.nutch.crawl.InjectorJob
generate - org.apache.nutch.crawl.GeneratorJob
fetch - org.apache.nutch.fetcher.FetcherJob
parse - org.apache.nutch.parse.ParserJob
updatedb - org.apache.nutch.crawl.DbUpdaterJob

Create a separate launcher for each of these. Running them without any input parameters will show you the usage of each command.
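For illustration, a minimal launcher sketch (the class name EclipseInjectLauncher and the 'urls' seed directory are assumptions, not part of Nutch); each of the job classes above implements Hadoop's Tool, so it can be driven through ToolRunner with the same arguments the bin/nutch script would pass:

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.InjectorJob;
import org.apache.nutch.util.NutchConfiguration;

public class EclipseInjectLauncher {
  public static void main(String[] args) throws Exception {
    // equivalent to "bin/nutch inject urls"; "urls" is the seed-list directory
    int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
        new String[] { "urls" });
    System.exit(res);
  }
}

Alternatively, point the Eclipse run configuration's main class directly at org.apache.nutch.crawl.InjectorJob and put the same arguments in the Program arguments field.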
RE: using Tika within Nutch to remove boiler plates?
we don't use Boilerpipe anymore so there's no point in sharing. Just set these two configuration options in nutch-site.xml:

<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>

and it should work.

-----Original message-----
From: Joe Zhang smartag...@gmail.com
Sent: Tue 11-Jun-2013 01:42
To: user user@nutch.apache.org
Subject: Re: using Tika within Nutch to remove boiler plates?
Markus, do you mind sharing a sample nutch-site.xml?

On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma markus.jel...@openindex.io wrote: Those settings belong to nutch-site. Enable BP and set the correct extractor and it should work just fine.

-----Original message-----
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Sun 09-Jun-2013 20:47
To: user@nutch.apache.org
Subject: Re: using Tika within Nutch to remove boiler plates?
Hi Joe, I've not used this feature; it would be great if one of the others could chime in here. From what I can infer from the correspondence on the issue, and the available patches, you should be applying the most recent one uploaded by Markus [0] as your starting point. This is dated 22/11/2011.

On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com wrote: One of the comments mentioned the following: tika.use_boilerpipe=true, tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor. Which part of the code is it referring to?

You will see this included in one of the earlier patches uploaded by Markus on 11/05/2011 [1].

Also, within the current Nutch config, should I focus on parse-plugin.xml?

Look at the other patches and also Gabriele's comments. You may most likely need to alter something, but AFAICT the work has been done... it's just a case of pulling together several contributions. Maybe you should look at the patch for 2.x (uploaded most recently by Roland) and see what is going on there. hth

[0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
[1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
Data Extraction from 100+ different sites...
Hi, I have 100+ different sites (and maybe more will be added in the near future). I have to crawl them and extract my required information from each site, so each site would have its own extraction rule (XPaths). So far I have seen there is no built-in mechanism in Nutch to fulfill my requirement, and I may have to write a custom HtmlParseFilter extension and an IndexingFilter plugin. And I may have to write 100+ switch cases in my plugin to handle the extraction rules of each site. Is this the best way to handle my requirement, or is there a better way to handle it? Thanks for your support and help. Tony.
RE: Data Extraction from 100+ different sites...
Hi, Yes, you should write a plugin that has a parse filter and an indexing filter. To ease maintenance you would want to have a file per host/domain containing XPath expressions; that is far easier than switch statements that need to be recompiled. The indexing filter would then index the field values extracted by your parse filter. Cheers, Markus
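To make the file-per-host idea concrete, a hypothetical expressions file (the directory layout, field names and XPath expressions below are invented for illustration), say conf/xpaths/www.example.com.txt, could look like:

title=//h1[@class='movie-title']/text()
rating=//span[@itemprop='ratingValue']/text()
review=//div[@id='review-body']//p

The parse filter picks the file matching the page's host and evaluates each expression, and the indexing filter copies the resulting field values into the index; adding a new site then means adding a file, not recompiling the plugin.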
Re: using Tika within Nutch to remove boiler plates?
Any particular reason why you don't use Boilerpipe any more? So what do you suggest as an alternative?
Re: Data Extraction from 100+ different sites...
Hi Markus, I couldn't understand how I can avoid switch cases in your suggested approach. I would have one plugin which implements HtmlParseFilter, and I would have to check the current URL by calling content.getUrl(); since this all happens in the same class, I would have to add switch cases... I could add the XPath expressions for each site in separate files, but to get the right XPath expression I would have to decide which file to read, and for that I would have to put this logic in a switch case. Please correct me if I am getting this all wrong! And I think this is a common requirement for web crawling solutions, to get custom data from a page... so aren't there any such Nutch plugins already available on the web? Thanks, Tony.
Re: Data Extraction from 100+ different sites...
Hi Tony, So if I understand correctly, you have 100+ web pages, each with a totally different format, that you're trying to extract separate/unrelated pieces of information from. If there's no connection between any of the web pages or any of the pieces of information that you're trying to extract, then it's pretty much unavoidable to have to provide separate identifiers and cases for finding each one. Markus' suggestion, I believe, is to just have a dictionary file with the URL as the key and the XPath expression for the info that you want as the value. No matter what crawling/parsing platform you're using, a solution of that sort is pretty much unavoidable given the assumptions. That being said, is there any common form that the data you're trying to extract from these pages follows? Is there a regex that could match it, or anything else that might identify it in a common way? Alex
Re: Data Extraction from 100+ different sites...
Hi Tony! Kinda like that:

// host -> xpath map; initialize the map only once
static Map<String, String> xpaths = ...
...
String xpath = xpaths.get(URLUtil.getHost(content.getUrl()));

(content.getUrl() returns a String, so use org.apache.nutch.util.URLUtil to get the host.) Best regards, Alexander
Re: Data Extraction from 100+ different sites...
Yes, all the web pages will have different HTML structure/layout and I will have to identify/define an XPath expression for each one of them. But I am trying to come up with a generic output format for these XPath expressions, so that whatever the expression is, the result lands in, let's say, Field A, Field B, Field C (in some cases some of these fields could be blank as well). That way I can map them to my Solr schema properly. In this regard I was hoping to get some help or guidance from your past experience... Thanks, Tony
Re: Data Extraction from 100+ different sites...
I'm a bit confused on where the requirement to *crawl* these sites comes into it. From what you're saying, it looks like you're just talking about parsing the contents of a list of sites that you're trying to extract data from, in which case there's not much of a use case for Nutch... or am I confused?
Re: Data Extraction from 100+ different sites...
I have to crawl the sub-links of these sites as well, identify the pattern of those sub-links' HTML layout, and extract my required data. One example could be a movie review site: every page of that site would (ideally) have the same HTML layout describing a particular movie, and I have to extract the info from each such page. And for this requirement I am relying on Nutch + an HtmlParseFilter plugin!
RE: using Tika within Nutch to remove boiler plates?
Yes, Boilerpipe is complex and difficult to adapt. It also requires you to preset an extraction algorithm, which is impossible for us. I've created an extractor instead that works for most pages and ignores stuff like news overviews and major parts of homepages. It's also tightly coupled with our date extractor (based on [1]), language detector (based on LangDetect) and image extraction. In many cases Boilerpipe's ArticleExtractor will work very well, but date extraction such as NUTCH-141 won't do the trick, as it only works on the extracted text as a whole and does not consider page semantics. [1]: https://issues.apache.org/jira/browse/NUTCH-1414
RE: Data Extraction from 100+ different sites...
You can use URLUtil in that parse filter to determine which host/domain you are on and lazily load the file with expressions for that host. Just keep a Map<hostname, List<expressions>> in your object and load lists of expressions on demand.
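A minimal sketch of that lazy-loading approach (the conf/xpaths directory and one-expression-per-line file format are assumptions carried over from the illustration earlier in this thread, not Nutch API):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.nutch.util.URLUtil;

public class XPathRules {
  // hostname -> XPath expressions, filled on demand
  private final Map<String, List<String>> byHost = new HashMap<>();

  public List<String> forUrl(String url) {
    String host = URLUtil.getHost(url);
    return byHost.computeIfAbsent(host, h -> {
      try {
        // hypothetical layout: one expressions file per host
        return Files.readAllLines(Paths.get("conf/xpaths", h + ".txt"));
      } catch (IOException e) {
        return Collections.emptyList(); // no rules for this host
      }
    });
  }
}

A parse filter would call forUrl(...) with the page URL once per document and evaluate each returned expression against the parsed DOM.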
Re: using Tika within Nutch to remove boiler plates?
So what in your opinion is the most effective way of removing boilerplate in Nutch crawls?
RE: using Tika within Nutch to remove boiler plates?
In my opinion Boilerpipe is the most effective free and open source tool for the job :) It does require some patching (see the linked issues) and a manual upgrade to Boilerpipe 1.2.0.
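For anyone wanting to sanity-check Boilerpipe's output before wiring it into Nutch, here is a minimal standalone sketch against Boilerpipe's own API (the sample HTML string is a placeholder):

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeSmokeTest {
  public static void main(String[] args) throws Exception {
    String html = "<html><body><div id='nav'>Home | News | Login</div>"
        + "<p>The actual article text lives here.</p></body></html>";
    // ArticleExtractor is the same extractor named by tika.boilerpipe.extractor above
    String text = ArticleExtractor.INSTANCE.getText(html);
    System.out.println(text);
  }
}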
Re: using Tika within Nutch to remove boiler plates?
It is interesting that you acknowledge BP as the most effective, yet you also said you don't use it any more.