Re: simple tokenizer question

Vulcanoid Developer Sun, 08 Dec 2013 03:52:32 -0800

Thanks for your email.

Great, I will look at the WordDelimiterFilterFactory. Just to be clear, I
DON'T want any tokenizing on digits, special characters, punctuation, etc.
other than word delimiting on whitespace.


All I want for my first version is NO removal of punctuation/special
characters at indexing time or at search time, i.e., input as-is and
search as-is (like a simple SQL db?). I was assuming this would be a
trivial case with Solr, and I am not sure what I am missing here.
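
For reference, the as-is behaviour described above is roughly what a
whitespace-only analyzer gives. Below is a minimal, untested schema.xml
sketch. The field type name text_ws_exact and the field name judgment_text
are made up for illustration; only solr.TextField and
solr.WhitespaceTokenizerFactory are actual Solr classes.

```xml
<!-- Hypothetical field type: split on whitespace only, no filters,
     so tokens such as 12AA, 80-IA and 9(1)(vii) are kept whole. -->
<fieldType name="text_ws_exact" class="solr.TextField" positionIncrementGap="100">
  <!-- An <analyzer> with no type attribute is used at both index and query time. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the type above. -->
<field name="judgment_text" type="text_ws_exact" indexed="true" stored="true"/>
```

With nothing after the tokenizer, matching is case-sensitive and adjacent
punctuation stays attached, so "12AA," (with the comma) is a different token
from "12AA". Parentheses are also special characters in the default query
syntax, so a query such as 9(1)(vii) probably needs to be quoted or escaped.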

thanks
Vulcanoid



On Sun, Dec 8, 2013 at 4:33 AM, Upayavira <u...@odoko.co.uk> wrote:

> Have you tried a WhitespaceTokenizerFactory followed by the
> WordDelimiterFilterFactory? The latter is perhaps more configurable in
> what it does. Alternatively, you could use a PatternReplaceFilterFactory
> to remove extraneous punctuation that wasn't removed by the Whitespace
> Tokenizer.
>
> Upayavira
>
> On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> > Hi,
> >
> > I am new to Solr and I guess this is a basic tokenizer question, so
> > please bear with me.
> >
> > I am trying to use Solr to index a few (Indian) legal judgments in text
> > form and search against them. One of the key points with these documents
> > is that the sections/provisions of law usually have punctuation/special
> > characters in them. For example, search queries will TYPICALLY be section
> > 12AA, section 80-IA, section 9(1)(vii), and the text of the judgments
> > themselves will contain this sort of text, with section references all
> > over the place.
> >
> > Now, using a default schema setup with StandardTokenizer, which seems to
> > delimit on whitespace AND punctuation, I get really bad results because
> > it looks like 12AA is split, and results having 12 and AA in them turn
> > up. It becomes worse with 9(1)(vii), with results containing 9 and 1 etc.
> > turning up.
> >
> > What is the best solution here? I really just want to index the document
> > as-is and also to do whitespace tokenizing on the search and nothing
> > more.
> >
> > So in other words:
> > a) I would like the text document to be indexed as-is, with say 12AA and
> > 9(1)(vii) stored exactly as they appear in the document.
> > b) I would like to be able to search for 12AA and for 9(1)(vii) and get
> > proper full matches on them without any splitting up/munging etc.
> >
> > Any suggestions are appreciated.  Thank you for your time.
> >
> > Thanks
> > Vulcanoid
>
