Jörg, Following your suggestion I refactored the code like so:
[code]
public class SemanticAnalyzerTwitterLemmatizerProvider extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    //private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
    private final MorphAnalyzer morphAnalyzer;
    private String lemmatizerConfFile;
    boolean analyzeBest;

    private final Logger Log = Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     @Assisted String name,
                                                     Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            /*
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer = new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
            */
            lemmatizerConfFile = settings.get("lemmatizerConf");
            morphAnalyzer = createMorphAnalyzer();
        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", e);
        }
    }

    private MorphAnalyzer createMorphAnalyzer() throws IOException {
        Log.info("start of createMorphAnalyzer()");
        MorphAnalyzer morphAnalyzer1;
        Properties properties = new Properties();
        Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
        properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
        morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
        Log.info("end of createMorphAnalyzer()");
        return morphAnalyzer1;
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer);
    }
}
[/code]

Still, in the logs I see the MorphAnalyzer object being created more than once.
Probably something is still missing in the logic?

Log excerpt:

[2015-03-18 22:34:06,900][INFO ][cluster.metadata ] [Soldier X] [rustest] deleting index
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider <init>
INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:06 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: Loading lemmatizer properties from /Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: end of createMorphAnalyzer()
[2015-03-18 22:34:07,711][INFO ][cluster.metadata ] [Soldier X] [rustest] creating index, cause [api], shards [5]/[1], mappings []
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider <init>
INFO: called super with name=russian_morphology_twitter
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: start of createMorphAnalyzer()
Mar 18, 2015 10:34:07 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: Loading lemmatizer properties from /Users/dmitry/projects/information_retrieval/elasticsearch-analysis-morphology-youscan/lemmatizer/lemmatizer-ru.properties
Mar 18, 2015 10:34:08 PM org.elasticsearch.index.analysis.morphology.SemanticAnalyzerTwitterLemmatizerProvider createMorphAnalyzer
INFO: end of createMorphAnalyzer()

On Wednesday, 18 March 2015 21:27:12 UTC+2, Jörg Prante wrote:
>
> In the get() method of the provider, I would better try to always return a new analyzer instance.
>
> The configuration and setup of the analyzer could be refactored to the provider.
>
> Jörg
>
> On Wed, Mar 18, 2015 at 8:12 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>
>> Yes, I use an analyzer provider. Here is the code:
>>
>> [code]
>>
>> public class SemanticAnalyzerTwitterLemmatizerProvider extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>>     private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;
>>
>>     private final Logger Log = Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());
>>
>>     @Inject
>>     public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
>>                                                      @IndexSettings Settings indexSettings,
>>                                                      @Assisted String name,
>>                                                      Settings settings) {
>>         super(index, indexSettings, name, settings);
>>         Log.info("called super with name=" + name);
>>         try {
>>             String lemmatizerConfFile = settings.get("lemmatizerConf");
>>             boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
>>             russianLemmatizingGenericAnalyzer = new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
>>         } catch (IOException ioe) {
>>             throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", ioe);
>>         } catch (Exception e) {
>>             throw new ElasticsearchIllegalArgumentException("Unable to load Russian morphology analyzer", e);
>>         }
>>     }
>>
>>     @Override
>>     public RussianLemmatizingTwitterAnalyzer get() {
>>         return russianLemmatizingGenericAnalyzer;
>>     }
>> }
>>
>> [/code]
>>
>> Would you recommend to use your approach instead of this one? Do you spot issues in my implementation of the provider?
>>
>> On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
>>>
>>> Do you use an analyzer provider?
>>>
>>> Example
>>>
>>> public class RussianLemmatizingTwitterAnalyzerProvider extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>>>
>>>     private final MorphAnalyzer morphAnalyzer;
>>>
>>>     ...
>>>
>>>     @Inject
>>>     public RussianLemmatizingTwitterAnalyzerProvider(Index index,
>>>                                                      @IndexSettings Settings indexSettings,
>>>                                                      Environment environment,
>>>                                                      @Assisted String name,
>>>                                                      @Assisted Settings settings) {
>>>         super(index, indexSettings, name, settings);
>>>         this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
>>>     }
>>>
>>>     @Override
>>>     public RussianLemmatizingTwitterAnalyzer get() {
>>>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
>>>     }
>>>
>>>     private MorphAnalyzer createMorphAnalyzer(...) {
>>>     }
>>>
>>> }
>>>
>>> Only such a provider is bound to a singleton. So the analyzer provider can set up the analyzer configuration exactly once (with a MorphAnalyzer instance etc.), and with get() method, it creates analyzers as required.
>>>
>>> Jörg
>>>
>>> On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>>>
>>>> Jörg,
>>>>
>>>> Thanks for replying!
>>>>
>>>> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest class in the simplified class sequence I have posted in the original message.
>>>>
>>>> [code]
>>>>
>>>> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>>>>
>>>>     private static MorphAnalyzer morphAnalyzerGlobal;
>>>>
>>>>     boolean useSyncMethod = true;
>>>>
>>>>     private static final boolean verbose = false;
>>>>     private MorphAnalyzer morphAnalyzer;
>>>>     private boolean analyzeBest = false;
>>>>
>>>>     private static final Logger Log = Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>>>>
>>>>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile, boolean analyzeBest) throws IOException {
>>>>         this.analyzeBest = analyzeBest;
>>>>
>>>>         if (useSyncMethod) {
>>>>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>>>>         } else {
>>>>             Properties properties = new Properties();
>>>>
>>>>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>>
>>>>             properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>>             this.morphAnalyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>>         }
>>>>     }
>>>>
>>>>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile) throws IOException {
>>>>         Properties properties = new Properties();
>>>>
>>>>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>>>
>>>>         properties.load(new StringReader(IOUtils.readFile(new File(lemmatizerConfFile), Charsets.UTF_8)));
>>>>         MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));
>>>>
>>>>         if (verbose) {
>>>>             if (morphAnalyzer1 != null) {
>>>>                 Log.info("Successfully created the analyzer!");
>>>>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>>>>             } else {
>>>>                 Log.severe("Failed to create the morphAnalyzer object");
>>>>             }
>>>>         }
>>>>
>>>>         return morphAnalyzer1;
>>>>     }
>>>>
>>>>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile) throws IOException {
>>>>         if (morphAnalyzerGlobal == null) {
>>>>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>>>>         }
>>>>
>>>>         return morphAnalyzerGlobal;
>>>>     }
>>>>
>>>>     @Override
>>>>     protected TokenStreamComponents createComponents(String fieldName, final Reader reader) {
>>>>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>>>>
>>>>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>>>>
>>>>         TokenStream tokenStream = tokenizer;
>>>>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>>>>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>>>>         return new TokenStreamComponents(tokenizer, tokenStream);
>>>>     }
>>>>
>>>> }
>>>>
>>>> [/code]
>>>>
>>>> Note that in the code above the TwitterFlexLuceneTokenizer is not thread safe and extends o.a.lucene.analysis.Tokenizer. In jvisualvm there are 97 instances of this class.
>>>>
>>>> Let me know if I should copy other code snippets up the class stream.
>>>>
>>>> Dmitry
>>>>
>>>> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>>>>
>>>>> Is it possible to examine the code of your plugin?
>>>>>
>>>>> Generally speaking, analyzers are instantiated per index creation for each thread.
>>>>>
>>>>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how analyzer providers and factories are prepared for injection with the help of the ES injection module, which is based on Guice. Basically, the factories are kept as singletons, and each thread can pick analyzer instances from the factory when needed. All in all, Lucene analyzer classes are not threadsafe, in particular the tokenizers. It means it is up to the implementor of an analyzer/tokenizer to store immutable objects as singletons in a correct way so that all threads can safely access them.
>>>>>
>>>>> Jörg
>>>>>
>>>>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Could somebody answer, please?
>>>>>>
>>>>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I'm a newbie in elasticsearch, so forgive me if the question is lame.
>>>>>>>
>>>>>>> I have implemented a custom plugin using a custom lemmatizer and a tokenizer. The simplified class sequence:
>>>>>>>
>>>>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>>>>
>>>>>>> In the RussianLemmatizingTwitterAnalyzer's ctor I load the custom object for lemmatization (object unrelated to lucene/es) in a singleton fashion (in a synchronized code block).
>>>>>>> Then, when creating 14 indices in the same JVM I see
>>>>>>> 14 instances of RussianLemmatizingTwitterAnalyzer,
>>>>>>> 4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>>>>>> 4 instances of MorphologyAnalysisBinderProcessor,
>>>>>>> 30 instances of the custom lemmatizer (in each RussianLemmatizingTwitterAnalyzer only one instance is expected, so should be 14),
>>>>>>> 1 instance of AnalysisMorphologyPlugin.
>>>>>>>
>>>>>>> The question is, can the RussianLemmatizingTwitterAnalyzer object be shared between indices? Or is it by design that they must load separately per index?
>>>>>>> What could be wrong in the code that makes 30 instances of the custom singleton lemmatizer instead of 14?
>>>>>>>
>>>>>>> The current standing is that *with* the plugin 100M of RAM is reserved by the JVM with no data. *Without* the plugin the JVM reserves 2M with no data. Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Dmitry Kan
>>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearc...@googlegroups.com.
>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>>>>
>>>>>> For more options, visit https://groups.google.com/d/optout.
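
A note for readers following the thread: the log excerpt shows the provider constructor running once for each index operation, so holding the MorphAnalyzer in a provider field makes it one per provider instance rather than one per JVM. Below is a minimal sketch of one possible way to share the expensive object across indices, using a JVM-wide cache keyed by the configuration path. It assumes the loaded lemmatizer is immutable and safe to share between threads, which this thread does not confirm. All class names (`HeavyLemmatizer`, `LemmatizerCache`, `Provider`, `LightAnalyzer`) are hypothetical stand-ins and not Elasticsearch or Lucene API, and `computeIfAbsent` requires Java 8.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedLemmatizerDemo {

    /** Hypothetical stand-in for the expensive MorphAnalyzer; assumed immutable once built. */
    static final class HeavyLemmatizer {
        /** Counts how many times the expensive load actually ran. */
        static final AtomicInteger LOADS = new AtomicInteger();
        final String confFile;

        HeavyLemmatizer(String confFile) {
            this.confFile = confFile;
            LOADS.incrementAndGet();
        }
    }

    /** JVM-wide cache: at most one lemmatizer per configuration path. */
    static final class LemmatizerCache {
        private static final ConcurrentHashMap<String, HeavyLemmatizer> CACHE =
                new ConcurrentHashMap<>();

        static HeavyLemmatizer get(String confFile) {
            // ConcurrentHashMap.computeIfAbsent runs the loader at most once per key.
            return CACHE.computeIfAbsent(confFile, HeavyLemmatizer::new);
        }
    }

    /** Hypothetical stand-in for the lightweight per-use analyzer. */
    static final class LightAnalyzer {
        final HeavyLemmatizer lemmatizer;

        LightAnalyzer(HeavyLemmatizer lemmatizer) {
            this.lemmatizer = lemmatizer;
        }
    }

    /** Hypothetical stand-in for the per-index provider. */
    static final class Provider {
        private final HeavyLemmatizer lemmatizer;

        Provider(String confFile) {
            // Every provider instance looks up the shared object instead of loading it.
            this.lemmatizer = LemmatizerCache.get(confFile);
        }

        /** Returns a new cheap analyzer each time, wrapping the shared lemmatizer. */
        LightAnalyzer get() {
            return new LightAnalyzer(lemmatizer);
        }
    }

    public static void main(String[] args) {
        Provider first = new Provider("lemmatizer-ru.properties");
        Provider second = new Provider("lemmatizer-ru.properties");
        System.out.println("same lemmatizer: "
                + (first.get().lemmatizer == second.get().lemmatizer));
        System.out.println("loads: " + HeavyLemmatizer.LOADS.get());
    }
}
```

The provider still hands out a new analyzer from get(), as Jörg suggested, and only the heavy, read-only object is shared. A static cache like this lives as long as the plugin's class loader, so it trades memory that is never released for avoiding repeated loads.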