There is another way to do this: crawl the mobile site! 

The Fennec browser from Mozilla talks Android. I often use it to get pagecrap 
off my screen.
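
A minimal sketch of the idea, assuming the site picks its mobile layout from the User-Agent header (many do, but not all). The class name, the helper and the User-Agent string are illustrative, not part of any crawler's API:

```java
import java.net.HttpURLConnection;
import java.net.URI;

class MobileFetch {
    // Illustrative Android-style User-Agent (an assumption); substitute
    // whatever string a real mobile browser sends.
    static final String MOBILE_UA =
            "Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0";

    // Prepares a request that identifies itself as a mobile browser.
    // Nothing goes over the network until the caller reads from the connection.
    static HttpURLConnection openAsMobile(String address) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestProperty("User-Agent", MOBILE_UA);
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = openAsMobile("http://example.com/");
        // Show the header that would be sent with the request.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Calling conn.getInputStream() on the result would then fetch whatever page the server chooses for that User-Agent; a real crawler would set the same header through its own configuration.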

----- Original Message -----
| From: "Lance Norskog" <goks...@gmail.com>
| To: solr-user@lucene.apache.org
| Sent: Wednesday, August 29, 2012 7:37:37 PM
| Subject: Re: Document Processing
| 
| I've seen the JSoup HTML parser library used for this. It worked
| really well. The Boilerpipe library may be what you want. Its
| schwerpunkt (*) is to separate boilerplate from wanted text in an
| HTML
| page. I don't know what fine-grained control it has.
| 
| * raison d'être. There is no English word for this concept.
| 
| On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili
| <tommaso.teof...@gmail.com> wrote:
| > Hello Michael,
| >
| > I can help you with using the UIMA UpdateRequestProcessor [1]; the
| > current
| > implementation uses in-memory execution of UIMA pipelines but since
| > I was
| > planning to add the support for higher scalability (with UIMA-AS
| > [2]) that
| > may help you as well.
| >
| > Tommaso
| >
| > [1] : http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
| > [2] : http://uima.apache.org/doc-uimaas-what.html
| >
| > 2011/12/5 Michael Kelleher <mj.kelle...@gmail.com>
| >
| >> Hello Erik,
| >>
| >> I will take a look at both:
| >>
| >> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
| >>
| >> and
| >>
| >> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
| >>
| >>
| >> and figure out what I need to extend to handle processing in the
| >> way I am
| >> looking for.  I am assuming that "component" configuration is
| >> handled in a
| >> standard way such that I can configure my new UpdateProcessor in
| >> the same
| >> way I would configure any other UpdateProcessor "component"?
| >>
| >> Thanks for the suggestion.
| >>
| >>
| >> 1 more question:  given that I am probably going to convert the
| >> HTML to
| >> XML so I can use XPath expressions to "extract" my content, do you
| >> think
| >> that this kind of processing will overload Solr?  This Solr
| >> instance will
| >> be used solely for indexing, and will only ever have a single
| >> ManifoldCF
| >> crawling job feeding it documents at one time.
| >>
| >> --mike
| >>
| 
| 
| 
| --
| Lance Norskog
| goks...@gmail.com
| 

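On Michael's XPath question quoted above: once the HTML has been converted to well-formed XML, the extraction step itself is small. A rough sketch using only the JDK's built-in XML classes; the class name, the sample page and the XPath expression are made up for illustration, and the HTML-to-XML cleanup is assumed to have happened already:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathExtract {
    // Parses well-formed XML and returns the string value of the first
    // node matched by the XPath expression (empty string if none match).
    static String extract(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical page, already converted to well-formed XML.
        String page = "<html><body>"
                + "<div id=\"nav\">Home | About</div>"
                + "<div id=\"content\"><p>The text I want.</p></div>"
                + "</body></html>";
        System.out.println(extract(page, "//div[@id='content']/p"));
    }
}
```

This builds a full DOM per document, so the cost grows with page size. Whether that is a problem inside a Solr update chain depends on the documents and the crawl rate, which this sketch does not measure.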