Re: issue with singleton analyzer in single JVM multi-index setup

Dmitry Kan Wed, 18 Mar 2015 12:13:00 -0700

Yes, I use an analyzer provider. Here is the code:

[code]


public class SemanticAnalyzerTwitterLemmatizerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
    private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;

    private final Logger Log =
            Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     @Assisted String name,
                                                     Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer =
                    new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", e);
        }
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return russianLemmatizingGenericAnalyzer;
    }
}


[/code]

Would you recommend using your approach instead of this one? Do you see any
issues in my implementation of the provider?
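
To make the comparison concrete, here is a minimal sketch of the pattern
described in the quoted reply below, reduced to plain Java so that it runs
without Elasticsearch or Lucene. Every class name in it (Lemmatizer,
LemmatizerHolder, Analyzer, Provider) is a hypothetical stand-in, not an ES,
Lucene, or plugin class. It assumes the lemmatizer is immutable and safe to
share between threads, and it uses the initialization-on-demand holder idiom
in place of a synchronized static loader. Unlike loadCustomAnalyzer, the
holder takes no configuration argument, so this only illustrates the sharing
pattern and is not a drop-in replacement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedLemmatizerSketch {

    /** Hypothetical stand-in for the expensive lemmatizer; counts its own instances. */
    static final class Lemmatizer {
        static final AtomicInteger CREATED = new AtomicInteger();

        Lemmatizer() {
            CREATED.incrementAndGet();
        }
    }

    /** Holder idiom: the JVM runs this static initializer once, in a thread-safe way. */
    static final class LemmatizerHolder {
        static final Lemmatizer INSTANCE = new Lemmatizer();
    }

    /** Hypothetical stand-in for the analyzer: cheap to create, keeps only a reference. */
    static final class Analyzer {
        final Lemmatizer lemmatizer;

        Analyzer(Lemmatizer lemmatizer) {
            this.lemmatizer = lemmatizer;
        }
    }

    /** Hypothetical stand-in for a per-index provider. */
    static final class Provider {
        // Every provider picks up the same JVM-wide lemmatizer.
        private final Lemmatizer lemmatizer = LemmatizerHolder.INSTANCE;

        /** Creates a new lightweight analyzer on each call. */
        Analyzer get() {
            return new Analyzer(lemmatizer);
        }
    }

    public static void main(String[] args) {
        List<Analyzer> analyzers = new ArrayList<>();
        // Simulate 14 indices, each with its own provider handing out two analyzers.
        for (int index = 0; index < 14; index++) {
            Provider provider = new Provider();
            analyzers.add(provider.get());
            analyzers.add(provider.get());
        }
        System.out.println("analyzers created: " + analyzers.size());
        System.out.println("lemmatizers created: " + Lemmatizer.CREATED.get());
    }
}
```

In this sketch the number of analyzers grows with the number of providers and
get() calls, while the lemmatizer count stays at one, because the heavy object
lives outside both the provider and the analyzer.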

On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
>
> Do you use an analyzer provider?
>
> Example
>
> public class RussianLemmatizingTwitterAnalyzerProvider
>         extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>
>     private final MorphAnalyzer morphAnalyzer;
>
>     ...
>
>     @Inject
>     public RussianLemmatizingTwitterAnalyzerProvider(Index index,
>                                            @IndexSettings Settings indexSettings,
>                                            Environment environment,
>                                            @Assisted String name,
>                                            @Assisted Settings settings) {
>         super(index, indexSettings, name, settings);
>         this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
>     }
>
>     @Override
>     public RussianLemmatizingTwitterAnalyzer get() {
>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
>     }
>
>     private MorphAnalyzer createMorphAnalyzer(...) {
>     }
>
> }
>
>
> Only such a provider is bound as a singleton. So the analyzer provider can
> set up the analyzer configuration exactly once (with a MorphAnalyzer
> instance etc.), and its get() method creates analyzers as required.
>
> Jörg
>
> On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>
>>
>> Jörg,
>>
>> Thanks for replying!
>>
>> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest 
>> class in the simplified class sequence I have posted in the original 
>> message.
>>
>> [code]
>>
>> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>>
>>     private static MorphAnalyzer morphAnalyzerGlobal;
>>
>>     boolean useSyncMethod = true;
>>
>>     private static final boolean verbose = false;
>>     private MorphAnalyzer morphAnalyzer;
>>     private boolean analyzeBest = false;
>>
>>     private static final Logger Log =
>>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>>
>>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile,
>>                                              boolean analyzeBest) throws IOException {
>>         this.analyzeBest = analyzeBest;
>>
>>         if (useSyncMethod) {
>>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>>         } else {
>>             Properties properties = new Properties();
>>
>>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>
>>             properties.load(new StringReader(
>>                     IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>             this.morphAnalyzer =
>>                     MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>         }
>>     }
>>
>>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile)
>>             throws IOException {
>>         Properties properties = new Properties();
>>
>>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>
>>         properties.load(new StringReader(
>>                 IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>         MorphAnalyzer morphAnalyzer1 =
>>                 MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>
>>         if (verbose) {
>>             if (morphAnalyzer1 != null) {
>>                 Log.info("Successfully created the analyzer!");
>>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>>             } else {
>>                 Log.severe("Failed to create the morphAnalyzer object");
>>             }
>>         }
>>
>>         return morphAnalyzer1;
>>     }
>>
>>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
>>             throws IOException {
>>         if (morphAnalyzerGlobal == null) {
>>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>>         }
>>
>>         return morphAnalyzerGlobal;
>>     }
>>
>>     @Override
>>     protected TokenStreamComponents createComponents(String fieldName,
>>                                                      final Reader reader) {
>>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>>
>>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>>
>>         TokenStream tokenStream = tokenizer;
>>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>>         return new TokenStreamComponents(tokenizer, tokenStream);
>>     }
>>
>> }
>>
>>
>> [/code] 
>>
>> Note that in the code above the TwitterFlexLuceneTokenizer is not
>> thread-safe and extends o.a.lucene.analysis.Tokenizer. In jvisualvm there
>> are 97 instances of this class.
>>
>> Let me know if I should post other code snippets from further up the class
>> sequence.
>>
>> Dmitry
>>
>> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>>
>>> Is it possible to examine the code of your plugin?
>>>
>>> Generally speaking, analyzers are instantiated per index creation for 
>>> each thread.
>>>
>>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how 
>>> analyzer providers and factories are prepared for injection with the help 
>>> of the ES injection module, which is based on Guice. Basically, the 
>>> factories are kept as singletons, and each thread can pick analyzer 
>>> instances from the factory when needed. All in all, Lucene analyzer classes 
>>> are not thread-safe, in particular the tokenizers. This means it is up to 
>>> the implementor of an analyzer/tokenizer to store immutable objects as 
>>> singletons in a correct way so that all threads can safely access them.
>>>
>>> Jörg
>>>
>>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Could somebody answer, please?
>>>>
>>>>
>>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>>
>>>>> Hello!
>>>>>
>>>>> I'm new to Elasticsearch, so forgive me if the question is naive.
>>>>>
>>>>> I have implemented a custom plugin using a custom lemmatizer and a 
>>>>> tokenizer. The simplified class sequence: 
>>>>>
>>>>>
>>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>>
>>>>> In the RussianLemmatizingTwitterAnalyzer's ctor I load the custom object 
>>>>> for lemmatization (an object unrelated to Lucene/ES) in a singleton 
>>>>> fashion (in a synchronized code block).
>>>>> Then, when creating 14 indices in the same JVM I see 
>>>>>  14 instances of RussianLemmatizingTwitterAnalyzer, 
>>>>>  4 instances of SemanticAnalyzerTwitterLemmatizerProvider, 
>>>>>  4 instances of MorphologyAnalysisBinderProcessor,
>>>>>  30 instances of the custom lemmatizer (in each 
>>>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so 
>>>>> should be 14), 
>>>>>  1 instance of AnalysisMorphologyPlugin.
>>>>>
>>>>> The question is, can RussianLemmatizingTwitterAnalyzer object be made 
>>>>> shared between indices? Or is it by design, that they must load 
>>>>> separately per index?
>>>>> What could be wrong in the code that makes 30 instances of the custom 
>>>>> singleton lemmatizer instead of 14?
>>>>>
>>>>> Currently, *with* the plugin the JVM reserves 100M of RAM with no data. 
>>>>> *Without* the plugin the JVM reserves 2M with no data. 
>>>>> Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Dmitry Kan
>>>>>
>>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to elasticsearc...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>
