Re: Tagging documents as they are indexed -- Is FST a reasonable approach?

2012-01-04 Thread Julien Nioche
Hi Ryan, Why not preprocess your documents with tools like Apache UIMA, GATE or OpenNLP before indexing them in Lucene? GATE, for instance, has FST-based gazetteers which would be perfect for your place names. AFAIK there is also a Dictionary component for UIMA which would be a good match.
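
To illustrate the gazetteer idea mentioned above, here is a minimal sketch of dictionary-based place-name tagging run before indexing. It uses a plain token trie with longest-match lookup as a simplified stand-in for a real FST, and it is not how GATE or UIMA implement their components. The function names (build_trie, tag_places) and the sample place names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Sentinel key marking the end of a complete gazetteer entry.
# Using an object() avoids any clash with real tokens.
_END = object()


def build_trie(phrases: List[str]) -> Dict:
    """Build a token-level trie from a list of gazetteer phrases."""
    root: Dict = {}
    for phrase in phrases:
        node = root
        for token in re.findall(r"\w+", phrase.lower()):
            node = node.setdefault(token, {})
        node[_END] = phrase  # remember the canonical form of the entry
    return root


def tag_places(text: str, trie: Dict) -> List[Tuple[int, int, str]]:
    """Return (start_token, end_token, entry) for each longest match."""
    tokens = re.findall(r"\w+", text.lower())
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        node = trie
        best = None
        j = i
        # Walk the trie as far as the tokens allow, keeping the longest hit.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:
                best = (i, j, node[_END])
        if best is not None:
            matches.append(best)
            i = best[1]  # continue after the matched span
        else:
            i += 1
    return matches


if __name__ == "__main__":
    gazetteer = build_trie(["New York", "New York City", "York"])
    print(tag_places("She moved from York to New York City.", gazetteer))
```

The resulting tags could then be stored in a separate field of each document at indexing time. A real FST would additionally share suffixes and be far more compact for large gazetteers.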

Re: Using categories with Lucene

2010-08-11 Thread Julien Nioche
BTW I don't remember anyone on the Nutch list suggesting that you use Carrot for this (see: http://search-lucene.com/?q=luan+carrot) or classify at query time. What I suggested in http://search-lucene.com/m/JWZTj1q4lB92 was about classifying during the parsing or indexing and generating a

Re: Extracting contact data

2010-01-14 Thread Julien Nioche
Hi, Tools like GATE (http://www.gate.ac.uk) or Apache UIMA would be good candidates for what you are trying to achieve. HTH -- DigitalPebble Ltd http://www.digitalpebble.com 2010/1/14 Ortelli, Gian Luca gianluca.orte...@truvo.com Well, the exact definition we're going to find out

Re: How to tune Analyzer for Text Extraction

2009-08-12 Thread Julien Nioche
Hi, you should also have a look at GATE (http://gate.ac.uk) which comes with a NER application called ANNIE. You could use it to analyse your docs before indexing them with Lucene or SOLR. As Grant mentioned, UIMA can also be used for that as there are a number of NER annotators available for it

Re: is there an histogram feature in lucene ak Magelan

2008-10-13 Thread Julien Nioche
Hi Thomas, Have a look at SOLR (lucene.apache.org/solr). It is based on Lucene and provides additional functionality, including faceted search. Best, Julien 2008/10/13 Thomas Birnbaum [EMAIL PROTECTED] hi... currently we are using a proprietary search engine which supports a histogram.
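
As a rough illustration of how faceted search can serve as a histogram, the sketch below builds a Solr select URL that asks for per-value counts on one field. The parameters q, rows, facet, facet.field and wt are standard Solr request parameters, but the host, core name ("docs") and field name ("category") are assumptions made up for the example, and the exact URL layout depends on the Solr version and setup.

```python
from urllib.parse import urlencode


def facet_query_url(base_url: str, field: str, query: str = "*:*") -> str:
    """Build a Solr select URL that returns only facet counts for `field`."""
    params = {
        "q": query,            # match-all by default
        "rows": 0,             # no documents, only the counts
        "facet": "true",       # switch faceting on
        "facet.field": field,  # field whose values are counted
        "wt": "json",          # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and core name.
    print(facet_query_url("http://localhost:8983/solr/docs", "category"))
```

The facet counts in the response give the number of matching documents per field value, which is the data a histogram display needs.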

Re: Lucene Zend Lucene Search : indexation speed, document parsing

2008-09-16 Thread Julien Nioche
Hello Romain, I'm asking myself a few questions. Mainly about speed (indexation time) and document parsing (a way to index most commonly used office documents). For document parsing, I'm planning to use different open source libraries. The company I'm doing this for will be indexing a few

Re: Lucene and Google Web 1T 5 Gram

2008-04-23 Thread Julien Nioche
Hi Raphael, We initially tried to do the same but ended up developing our own API for querying the Web 1T. You can find more details at http://digitalpebble.com/resources.html There could be a way to reuse elements from Lucene, e.g. the Term index only, but I could not find an obvious way to