There is another way to do this: crawl the mobile site! Mozilla's Fennec browser presents itself as an Android browser, so sites serve it their mobile pages. I often use it to get page crap off my screen.
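
For a crawler, the same trick can be approximated by sending a mobile-looking User-Agent header. This is only a rough sketch: the User-Agent string is an illustrative Firefox-for-Android style value, the function names are made up for this example, and whether a given site switches to its mobile layout based on this header is an assumption you would have to check per site.

```python
import urllib.request

# Illustrative Firefox-for-Android style User-Agent (an assumption, not a
# value taken from this thread); substitute whatever your target site
# treats as a mobile client.
MOBILE_UA = "Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0"


def build_mobile_request(url):
    """Return a Request that announces itself as a mobile browser."""
    return urllib.request.Request(url, headers={"User-Agent": MOBILE_UA})


def fetch_mobile_page(url, timeout=10):
    """Fetch the page body as text. This performs a real network call."""
    request = build_mobile_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_mobile_page("http://example.com/")` would return whatever the server sends to that User-Agent; if the site redirects mobile clients to a separate host, `urlopen` follows the redirect by default.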
----- Original Message -----
| From: "Lance Norskog" <goks...@gmail.com>
| To: solr-user@lucene.apache.org
| Sent: Wednesday, August 29, 2012 7:37:37 PM
| Subject: Re: Document Processing
|
| I've seen the JSoup HTML parser library used for this. It worked
| really well. The Boilerpipe library may be what you want. Its
| schwerpunkt (*) is to separate boilerplate from wanted text in an
| HTML page. I don't know what fine-grained control it has.
|
| * raison d'être. There is no English word for this concept.
|
| On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili
| <tommaso.teof...@gmail.com> wrote:
| > Hello Michael,
| >
| > I can help you with using the UIMA UpdateRequestProcessor [1]; the
| > current implementation uses in-memory execution of UIMA pipelines
| > but since I was planning to add the support for higher scalability
| > (with UIMA-AS [2]) that may help you as well.
| >
| > Tommaso
| >
| > [1] :
| > http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
| > [2] : http://uima.apache.org/doc-uimaas-what.html
| >
| > 2011/12/5 Michael Kelleher <mj.kelle...@gmail.com>
| >
| >> Hello Erik,
| >>
| >> I will take a look at both:
| >>
| >> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
| >>
| >> and
| >>
| >> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
| >>
| >> and figure out what I need to extend to handle processing in the
| >> way I am looking for. I am assuming that "component" configuration
| >> is handled in a standard way such that I can configure my new
| >> UpdateProcessor in the same way I would configure any other
| >> UpdateProcessor "component"?
| >>
| >> Thanks for the suggestion.
| >>
| >> 1 more question: given that I am probably going to convert the
| >> HTML to XML so I can use XPath expressions to "extract" my content,
| >> do you think that this kind of processing will overload Solr? This
| >> Solr instance will be used solely for indexing, and will only ever
| >> have a single ManifoldCF crawling job feeding it documents at one
| >> time.
| >>
| >> --mike
|
| --
| Lance Norskog
| goks...@gmail.com
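
On the quoted idea of converting HTML to XML and pulling the content out with XPath, here is a minimal sketch of the extraction step only. It assumes the page has already been cleaned into well-formed XML by some other tool (that step is not shown), and the sample markup, the `id` values and the `extract_text` helper are all hypothetical. Python's `xml.etree.ElementTree` supports only a subset of XPath, so more elaborate expressions would need a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML, standing in for a crawled page after an
# HTML-to-XML cleanup step (not shown here).
SAMPLE = """<html><body>
<div id="nav">Home | About</div>
<div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div id="footer">Copyright</div>
</body></html>"""


def extract_text(xml_text, path):
    """Return whitespace-normalized text under the first element matching path.

    Returns an empty string when nothing matches.
    """
    root = ET.fromstring(xml_text)
    node = root.find(path)
    if node is None:
        return ""
    # Join all text fragments below the node, then collapse runs of whitespace.
    return " ".join(" ".join(node.itertext()).split())


wanted = extract_text(SAMPLE, ".//div[@id='content']")
```

Here `wanted` holds only the text of the `content` div, with the navigation and footer left out. Doing this per document is a parse plus a tree walk, so the cost grows with page size rather than with anything specific to the index.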