Thanks for your email. Great, I will look at the WordDelimiterFilterFactory. Just to be clear, I DON'T want any tokenizing on digits, special characters, punctuation, etc.; the only splitting I want is on whitespace.
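To make the distinction concrete, here is a minimal Python sketch of the behaviour I am after. It is not Solr code, and both function names are made up for illustration. The second function is only a rough stand-in for any tokenizer that also breaks on punctuation; it does not reproduce what Solr's StandardTokenizer actually does.

```python
import re


def whitespace_tokens(text):
    # Split on runs of whitespace only; punctuation stays inside each token.
    return text.split()


def punctuation_splitting_tokens(text):
    # Hypothetical stand-in for a tokenizer that also breaks on punctuation:
    # keep only runs of ASCII letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)


if __name__ == "__main__":
    query = "section 80-IA and section 9(1)(vii)"
    print(whitespace_tokens(query))
    print(punctuation_splitting_tokens(query))
```

One consequence of splitting on whitespace alone is that a token such as "12AA," (with a trailing comma) is different from "12AA", so a section reference followed directly by punctuation in the text would not match the bare query term.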
All I want for my first version is NO removal of punctuation or special characters at either index time or search time, i.e., input as-is and search as-is (like a simple SQL database?). I assumed this would be a trivial case for Solr, so I am not sure what I am missing here.

Thanks,
Vulcanoid

On Sun, Dec 8, 2013 at 4:33 AM, Upayavira <u...@odoko.co.uk> wrote:

> Have you tried a WhitespaceTokenizerFactory followed by the
> WordDelimiterFilterFactory? The latter is perhaps more configurable at
> what it does. Alternatively, you could use a RegexFilterFactory to
> remove extraneous punctuation that wasn't removed by the Whitespace
> Tokenizer.
>
> Upayavira
>
> On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> > Hi,
> >
> > I am new to Solr and I guess this is a basic tokenizer question, so
> > please bear with me.
> >
> > I am trying to use Solr to index a few (Indian) legal judgments in text
> > form and search against them. One of the key points with these documents
> > is that the sections/provisions of law usually have punctuation/special
> > characters in them. For example, search queries will TYPICALLY be section
> > 12AA, section 80-IA, section 9(1)(vii), and the text of the judgments
> > themselves will contain this sort of text, with section references all
> > over the place.
> >
> > Now, using a default schema setup with StandardTokenizer, which seems to
> > delimit on whitespace AND punctuation, I get really bad results because
> > it looks like 12AA is split, and results that merely contain 12 and AA
> > turn up. It becomes worse with 9(1)(vii), with results containing 9 and
> > 1 etc. turning up.
> >
> > What is the best solution here? I really just want to index the document
> > as-is and also to do whitespace tokenizing on the search and nothing
> > more.
> >
> > So in other words:
> > a) I would like the text document to be indexed as-is, with say 12AA and
> >    9(1)(vii) stored in the index exactly as they appear in the document.
> > b) I would like to be able to search for 12AA and for 9(1)(vii) and get
> >    proper full matches on them without any splitting up/munging etc.
> >
> > Any suggestions are appreciated. Thank you for your time.
> >
> > Thanks
> > Vulcanoid