Hi all,
My use case is very simple: given a string, I would like to suggest all the URLs that contain that string (within the limitations of the tokenizer and suggester). So far I have created a custom analyzer and tokenizer to parse URLs, and that analyzer is used to create an AnalyzingSuggester object. When I look up a suggestion using a prefix of a URL, it works fine. However, when I use a word from the middle of the URL, I don’t get any suggestions.

Here is my test case. I have a single suggestion entry, “www.google.com”, in my TermFreq array. If I look up suggestions for “www”, it returns the URL. If I look up suggestions for “google”, the result is empty.

My tokenizer splits the suggestion entry into the following (token,offset) tuples: (www,0:3), (google,4:10), (com,11:14). Please note that I’m dropping the dots.

The automaton created for this entry is:

  state 0 [reject]: w -> 1
  state 1 [reject]: w -> 2
  state 2 [reject]: w -> 3
  state 3 [reject]: \\U00000100 -> 4
  state 4 [reject]: g -> 5
  state 5 [reject]: o -> 6
  state 6 [reject]: o -> 7
  state 7 [reject]: g -> 8
  state 8 [reject]: l -> 9
  state 9 [reject]: e -> 10
  state 10 [reject]: \\U00000100 -> 11
  state 11 [reject]: c -> 12
  state 12 [reject]: o -> 13
  state 13 [reject]: m -> 14
  state 14 [accept]:

When I print the FST I get this: “wwwgooglecom”

The automaton created for “google” is:

  Initial state: 0
  state 0 [reject]: g -> 1
  state 1 [reject]: o -> 2
  state 2 [reject]: o -> 3
  state 3 [reject]: g -> 4
  state 4 [reject]: l -> 5
  state 5 [reject]: e -> 6
  state 6 [accept]:

I think I have a problem with my tokenizer (I’m not an expert) and this is affecting the creation of the first automaton. I really don’t know how to fix this. Any advice?

Best regards!
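
P.S. In case it helps to reproduce the behaviour without my analyzer, below is a minimal plain-Java sketch. It does not use Lucene, and the class and method names are made up for illustration. It assumes that a lookup only succeeds if the query's labels can be followed from state 0 of the entry's automaton. That is my guess at what the suggester does, not something I have confirmed in the Lucene source.

```java
import java.util.List;

/** Plain-Java illustration (no Lucene): walk a linear automaton from state 0. */
final class PrefixWalkDemo {

    // Stand-in for the separator label (code point 0x100) printed between tokens.
    static final char SEP = '\u0100';

    /** Joins analyzed tokens into the label sequence of the linear automaton. */
    static String labels(List<String> tokens) {
        return String.join(String.valueOf(SEP), tokens);
    }

    /** Number of query labels that can be followed starting at state 0. */
    static int walk(String entryLabels, String queryLabels) {
        int state = 0;
        while (state < queryLabels.length()
                && state < entryLabels.length()
                && entryLabels.charAt(state) == queryLabels.charAt(state)) {
            state++;
        }
        return state;
    }

    /** True if the whole query can be consumed from the initial state. */
    static boolean isPrefixMatch(String entryLabels, String queryLabels) {
        return walk(entryLabels, queryLabels) == queryLabels.length();
    }

    public static void main(String[] args) {
        // Same tokens my tokenizer produces for "www.google.com".
        String entry = labels(List.of("www", "google", "com"));
        System.out.println("www    -> " + isPrefixMatch(entry, "www"));
        System.out.println("google -> " + isPrefixMatch(entry, "google"));
    }
}
```

Under that assumption, “www” is consumed completely, while “google” already fails at state 0 because the only transition there is labelled w. That matches what I observe.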