Re: issue with singleton analyzer in single JVM multi-index setup

Dmitry Kan Wed, 18 Mar 2015 13:38:19 -0700

Jörg,

Following your suggestion I refactored the code like so:


[code]

public class SemanticAnalyzerTwitterLemmatizerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
    //private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
    private final MorphAnalyzer morphAnalyzer;
    private String lemmatizerConfFile;
    boolean analyzeBest;

    private final Logger Log =
            Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     @Assisted String name,
                                                     Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            /*
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer =
                    new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
            */
            lemmatizerConfFile = settings.get("lemmatizerConf");
            morphAnalyzer = createMorphAnalyzer();

        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", e);
        }
    }

    private MorphAnalyzer createMorphAnalyzer() throws IOException {
        Log.info("start of createMorphAnalyzer()");
        MorphAnalyzer morphAnalyzer1;

        Properties properties = new Properties();

        Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);

        properties.load(new StringReader(
                IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
        morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));

        Log.info("end of createMorphAnalyzer()");

        return morphAnalyzer1;
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer);
    }
}


[/code]

Still, the logs show the MorphAnalyzer object being created more than 
once. Is something still missing in the logic?

log excerpt:

[2015-03-18 22:34:06,900][INFO ][cluster.metadata         ] [Soldier X] [rustest] deleting index
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider <init>
INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: Loading lemmatizer properties from /Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: end of createMorphAnalyzer()
[2015-03-18 22:34:07,711][INFO ][cluster.metadata         ] [Soldier X] [rustest] creating index, cause [api], shards [5]/[1], mappings []
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider <init>
INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: Loading lemmatizer properties from /Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:08 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: end of createMorphAnalyzer()
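
A possible reading of the log above, offered as a guess rather than a confirmed diagnosis: the provider constructor appears to run once per index creation, so each index loads its own MorphAnalyzer. If the goal is one instance per JVM, a minimal sketch of a load-once cache keyed by the configuration path could look like the following. SharedResourceCache is a hypothetical helper, not part of Elasticsearch or the plugin, and the sketch assumes Java 8 or later and that the cached object is safe to share between threads.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Hypothetical helper (not an Elasticsearch class): a JVM-wide cache that
 * loads the value for each key at most once and then reuses it.
 */
final class SharedResourceCache<K, V> {

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SharedResourceCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // ConcurrentHashMap.computeIfAbsent applies the loader at most once
        // per key, even when several threads ask for the same key at once.
        return cache.computeIfAbsent(key, loader);
    }
}
```

In the provider, a static field of type SharedResourceCache<String, MorphAnalyzer> (hypothetical wiring) could then be queried with lemmatizerConfFile instead of calling createMorphAnalyzer() directly. The IOException raised while loading would have to be wrapped in an unchecked exception, because Function cannot throw checked exceptions.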




On Wednesday, 18 March 2015 21:27:12 UTC+2, Jörg Prante wrote:
>
> In the provider's get() method, I would try to always return a new 
> analyzer instance. 
>
> The configuration and setup of the analyzer could be moved into the 
> provider.
>
> Jörg
>
> On Wed, Mar 18, 2015 at 8:12 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>
>> Yes, I use an analyzer provider. Here is the code:
>>
>> [code]
>>
>> public class SemanticAnalyzerTwitterLemmatizerProvider
>>         extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>>     private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
>>
>>     private final Logger Log =
>>             Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());
>>
>>     @Inject
>>     public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
>>                                                      @IndexSettings Settings indexSettings,
>>                                                      @Assisted String name,
>>                                                      Settings settings) {
>>         super(index, indexSettings, name, settings);
>>         Log.info("called super with name=" + name);
>>         try {
>>             String lemmatizerConfFile = settings.get("lemmatizerConf");
>>             boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
>>             russianLemmatizingGenericAnalyzer =
>>                     new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
>>         } catch (IOException ioe) {
>>             throw new ElasticsearchIllegalArgumentException(
>>                     "Unable to load Russian morphology analyzer", ioe);
>>         } catch (Exception e) {
>>             throw new ElasticsearchIllegalArgumentException(
>>                     "Unable to load Russian morphology analyzer", e);
>>         }
>>     }
>>
>>     @Override
>>     public RussianLemmatizingTwitterAnalyzer get() {
>>         return russianLemmatizingGenericAnalyzer;
>>     }
>> }
>>
>>
>> [/code]
>>
>> Would you recommend using your approach instead of this one? Do you spot 
>> any issues in my implementation of the provider?
>>
>> On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
>>>
>>> Do you use an analyzer provider?
>>>
>>> Example
>>>
>>> public class RussianLemmatizingTwitterAnalyzerProvider
>>>         extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>>>
>>>     private final MorphAnalyzer morphAnalyzer;
>>>
>>>     ...
>>>
>>>     @Inject
>>>     public RussianLemmatizingTwitterAnalyzerProvider(Index index,
>>>                                            @IndexSettings Settings indexSettings,
>>>                                            Environment environment,
>>>                                            @Assisted String name,
>>>                                            @Assisted Settings settings) {
>>>         super(index, indexSettings, name, settings);
>>>         this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
>>>     }
>>>
>>>     @Override
>>>     public RussianLemmatizingTwitterAnalyzer get() {
>>>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
>>>     }
>>>
>>>     private MorphAnalyzer createMorphAnalyzer(...) {
>>>     }
>>>
>>> }
>>>
>>>
>>> Only such a provider is bound as a singleton. So the analyzer provider 
>>> can set up the analyzer configuration exactly once (with a MorphAnalyzer 
>>> instance etc.), and its get() method creates analyzers as required.
>>>
>>> Jörg
>>>
>>> On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>>>
>>>>
>>>> Jörg,
>>>>
>>>> Thanks for replying!
>>>>
>>>> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest 
>>>> class in the simplified class sequence I have posted in the original 
>>>> message.
>>>>
>>>> [code]
>>>>
>>>> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>>>>
>>>>     private static MorphAnalyzer morphAnalyzerGlobal;
>>>>
>>>>     boolean useSyncMethod = true;
>>>>
>>>>     private static final boolean verbose = false;
>>>>     private MorphAnalyzer morphAnalyzer;
>>>>     private boolean analyzeBest = false;
>>>>
>>>>     private static final Logger Log =
>>>>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>>>>
>>>>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile,
>>>>                                              boolean analyzeBest) throws IOException {
>>>>         this.analyzeBest = analyzeBest;
>>>>
>>>>         if (useSyncMethod) {
>>>>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>>>>         } else {
>>>>             Properties properties = new Properties();
>>>>
>>>>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>>
>>>>             properties.load(new StringReader(
>>>>                     IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>>             this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>>         }
>>>>     }
>>>>
>>>>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile)
>>>>             throws IOException {
>>>>         Properties properties = new Properties();
>>>>
>>>>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>>
>>>>         properties.load(new StringReader(
>>>>                 IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>>         MorphAnalyzer morphAnalyzer1 =
>>>>                 MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>>
>>>>         if (verbose) {
>>>>             if (morphAnalyzer1 != null) {
>>>>                 Log.info("Successfully created the analyzer!");
>>>>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>>>>             } else {
>>>>                 Log.severe("Failed to create the morphAnalyzer object");
>>>>             }
>>>>         }
>>>>
>>>>         return morphAnalyzer1;
>>>>     }
>>>>
>>>>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
>>>>             throws IOException {
>>>>         if (morphAnalyzerGlobal == null) {
>>>>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>>>>         }
>>>>
>>>>         return morphAnalyzerGlobal;
>>>>     }
>>>>
>>>>     @Override
>>>>     protected TokenStreamComponents createComponents(String fieldName,
>>>>                                                      final Reader reader) {
>>>>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>>>>
>>>>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>>>>
>>>>         TokenStream tokenStream = tokenizer;
>>>>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>>>>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>>>>         return new TokenStreamComponents(tokenizer, tokenStream);
>>>>     }
>>>>
>>>> }
>>>>
>>>>
>>>> [/code] 
>>>>
>>>> Note that in the code above, TwitterFlexLuceneTokenizer is not 
>>>> thread-safe and extends o.a.lucene.analysis.Tokenizer. jvisualvm shows 
>>>> 97 instances of this class.
>>>>
>>>> Let me know if I should post code from further up the class sequence.
>>>>
>>>> Dmitry
>>>>
>>>> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>>>>
>>>>> Is it possible to examine the code of your plugin?
>>>>>
>>>>> Generally speaking, analyzers are instantiated per index creation for 
>>>>> each thread.
>>>>>
>>>>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how 
>>>>> analyzer providers and factories are prepared for injection with the 
>>>>> help of the ES injection module, which is based on Guice. Basically, 
>>>>> the factories are kept as singletons, and each thread can pick analyzer 
>>>>> instances from the factory when needed. All in all, Lucene analyzer 
>>>>> classes are not thread-safe, in particular the tokenizers. This means 
>>>>> it is up to the implementor of an analyzer/tokenizer to store immutable 
>>>>> objects as singletons correctly, so that all threads can safely access 
>>>>> them.
>>>>>
>>>>> Jörg
>>>>>
>>>>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <dmitr...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Could somebody answer, please?
>>>>>>
>>>>>>
>>>>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I'm new to Elasticsearch, so forgive me if the question is naive.
>>>>>>>
>>>>>>> I have implemented a custom plugin using a custom lemmatizer and a 
>>>>>>> tokenizer. The simplified class sequence: 
>>>>>>>
>>>>>>>
>>>>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>>>>
>>>>>>> In the RussianLemmatizingTwitterAnalyzer's constructor I load the 
>>>>>>> custom lemmatization object (unrelated to Lucene/ES) in a singleton 
>>>>>>> fashion (in a synchronized code block).
>>>>>>> Then, when creating 14 indices in the same JVM I see 
>>>>>>>  14 instances of RussianLemmatizingTwitterAnalyzer, 
>>>>>>>  4 instances of SemanticAnalyzerTwitterLemmatizerProvider, 
>>>>>>>  4 instances of MorphologyAnalysisBinderProcessor,
>>>>>>>  30 instances of the custom lemmatizer (in each 
>>>>>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so 
>>>>>>> should be 14), 
>>>>>>>  1 instance of AnalysisMorphologyPlugin.
>>>>>>>
>>>>>>> The question is: can a RussianLemmatizingTwitterAnalyzer object be 
>>>>>>> shared between indices? Or is it by design that one must be loaded 
>>>>>>> separately per index?
>>>>>>> What could be wrong in the code that produces 30 instances of the 
>>>>>>> custom singleton lemmatizer instead of 14?
>>>>>>>
>>>>>>> Currently, *with* the plugin the JVM reserves 100M of RAM with no 
>>>>>>> data. *Without* the plugin the JVM reserves 2M with no data. 
>>>>>>> Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Dmitry Kan
>>>>>>>
>>>>>>>  -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>>>>
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>
>
>

