Yes, I use an analyzer provider. Here is the code: [code]
public class SemanticAnalyzerTwitterLemmatizerProvider
        extends AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {

    private final RussianLemmatizingTwitterAnalyzer russianLemmatizingGenericAnalyzer;

    private final Logger Log =
            Logger.getLogger(SemanticAnalyzerTwitterLemmatizerProvider.class.getName());

    @Inject
    public SemanticAnalyzerTwitterLemmatizerProvider(Index index,
                                                     @IndexSettings Settings indexSettings,
                                                     @Assisted String name,
                                                     Settings settings) {
        super(index, indexSettings, name, settings);
        Log.info("called super with name=" + name);
        try {
            String lemmatizerConfFile = settings.get("lemmatizerConf");
            boolean analyzeBest = Boolean.parseBoolean(settings.get("analyzeBest"));
            russianLemmatizingGenericAnalyzer =
                    new RussianLemmatizingTwitterAnalyzer(lemmatizerConfFile, analyzeBest);
        } catch (IOException ioe) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", ioe);
        } catch (Exception e) {
            throw new ElasticsearchIllegalArgumentException(
                    "Unable to load Russian morphology analyzer", e);
        }
    }

    @Override
    public RussianLemmatizingTwitterAnalyzer get() {
        return russianLemmatizingGenericAnalyzer;
    }
}
[/code]

Would you recommend using your approach instead of this one? Do you spot any issues in my implementation of the provider?

On Wednesday, 18 March 2015 20:52:08 UTC+2, Jörg Prante wrote:
>
> Do you use an analyzer provider?
>
> Example
>
> public class RussianLemmatizingTwitterAnalyzerProvider extends
>         AbstractIndexAnalyzerProvider<RussianLemmatizingTwitterAnalyzer> {
>
>     private final MorphAnalyzer morphAnalyzer;
>
>     ...
>
>     @Inject
>     public RussianLemmatizingTwitterAnalyzerProvider(Index index,
>                                                      @IndexSettings Settings indexSettings,
>                                                      Environment environment,
>                                                      @Assisted String name,
>                                                      @Assisted Settings settings) {
>         super(index, indexSettings, name, settings);
>         this.morphAnalyzer = createMorphAnalyzer(environment, settings, ...);
>     }
>
>     @Override
>     public RussianLemmatizingTwitterAnalyzer get() {
>         return new RussianLemmatizingTwitterAnalyzer(morphAnalyzer, ...);
>     }
>
>     private MorphAnalyzer createMorphAnalyzer(...) {
>     }
>
> }
>
> Only such a provider is bound to a singleton. So the analyzer provider can
> set up the analyzer configuration exactly once (with a MorphAnalyzer
> instance etc.), and with the get() method, it creates analyzers as required.
>
> Jörg
>
> On Wed, Mar 18, 2015 at 6:40 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>
>> Jörg,
>>
>> Thanks for replying!
>>
>> Here is the code of the RussianLemmatizingTwitterAnalyzer, the deepest
>> class in the simplified class sequence I have posted in the original
>> message.
>>
>> [code]
>>
>> public class RussianLemmatizingTwitterAnalyzer extends Analyzer {
>>
>>     private static MorphAnalyzer morphAnalyzerGlobal;
>>
>>     boolean useSyncMethod = true;
>>
>>     private static final boolean verbose = false;
>>     private MorphAnalyzer morphAnalyzer;
>>     private boolean analyzeBest = false;
>>
>>     private static final Logger Log =
>>             Logger.getLogger(RussianLemmatizingTwitterAnalyzer.class.getName());
>>
>>     public RussianLemmatizingTwitterAnalyzer(String lemmatizerConfFile,
>>                                              boolean analyzeBest) throws IOException {
>>         this.analyzeBest = analyzeBest;
>>
>>         if (useSyncMethod) {
>>             this.morphAnalyzer = loadCustomAnalyzer(lemmatizerConfFile);
>>         } else {
>>             Properties properties = new Properties();
>>
>>             Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>
>>             properties.load(new StringReader(IOUtils.readFile(
>>                     new File(lemmatizerConfFile), Charsets.UTF_8)));
>>             this.morphAnalyzer = MorphAnalyzerLoader.load(
>>                     new MorphAnalyzerConfig(properties));
>>         }
>>     }
>>
>>     private static MorphAnalyzer loadAnalyzer(String lemmatizerConfFile)
>>             throws IOException {
>>         Properties properties = new Properties();
>>
>>         Log.info("Loading lemmatizer properties from " + lemmatizerConfFile);
>>
>>         properties.load(new StringReader(IOUtils.readFile(
>>                 new File(lemmatizerConfFile), Charsets.UTF_8)));
>>         MorphAnalyzer morphAnalyzer1 = MorphAnalyzerLoader.load(
>>                 new MorphAnalyzerConfig(properties));
>>
>>         if (verbose) {
>>             if (morphAnalyzer1 != null) {
>>                 Log.info("Successfully created the analyzer!");
>>                 Log.info(morphAnalyzer1.analyzeBest("билета").toString());
>>             } else {
>>                 Log.severe("Failed to create the morphAnalyzer object");
>>             }
>>         }
>>
>>         return morphAnalyzer1;
>>     }
>>
>>     public static synchronized MorphAnalyzer loadCustomAnalyzer(String lemmatizerConfFile)
>>             throws IOException {
>>         if (morphAnalyzerGlobal == null) {
>>             morphAnalyzerGlobal = loadAnalyzer(lemmatizerConfFile);
>>         }
>>
>>         return morphAnalyzerGlobal;
>>     }
>>
>>     @Override
>>     protected TokenStreamComponents createComponents(String fieldName,
>>                                                      final Reader reader) {
>>         Tokenizer tokenizer = new TwitterFlexLuceneTokenizer(reader);
>>
>>         Log.config("Using Tokenizer: " + tokenizer.getClass().getSimpleName());
>>
>>         TokenStream tokenStream = tokenizer;
>>         tokenStream = new LowerCaseFilter(Version.LUCENE_4_9, tokenStream);
>>         tokenStream = new MorphAnalTokenFilter(tokenStream, morphAnalyzer, analyzeBest);
>>         return new TokenStreamComponents(tokenizer, tokenStream);
>>     }
>>
>> }
>>
>> [/code]
>>
>> Note that in the code above the TwitterFlexLuceneTokenizer is not thread
>> safe and extends o.a.lucene.analysis.Tokenizer. In jvisualvm there are 97
>> instances of this class.
>>
>> Let me know if I should copy other code snippets up the class stream.
>>
>> Dmitry
>>
>> On Wednesday, 18 March 2015 18:22:47 UTC+2, Jörg Prante wrote:
>>>
>>> Is it possible to examine the code of your plugin?
>>>
>>> Generally speaking, analyzers are instantiated per index creation for
>>> each thread.
>>>
>>> In org.elasticsearch.index.analysis.AnalysisModule, you can see how
>>> analyzer providers and factories are prepared for injection with the help
>>> of the ES injection module, which is based on Guice. Basically, the
>>> factories are kept as singletons, and each thread can pick analyzer
>>> instances from the factory when needed. All in all, Lucene analyzer
>>> classes are not threadsafe, in particular the tokenizers. This means it is
>>> up to the implementor of an analyzer/tokenizer to store immutable objects
>>> as singletons in a correct way so that all threads can safely access them.
>>>
>>> Jörg
>>>
>>> On Wed, Mar 18, 2015 at 4:02 PM, Dmitry Kan <dmitr...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Could somebody answer, please?
>>>>
>>>>
>>>> On Tuesday, 17 March 2015 19:05:38 UTC+2, Dmitry Kan wrote:
>>>>>
>>>>> Hello!
>>>>>
>>>>> I'm a newbie in elasticsearch, so forgive me if the question is lame.
>>>>>
>>>>> I have implemented a custom plugin using a custom lemmatizer and a
>>>>> tokenizer. The simplified class sequence:
>>>>>
>>>>> AnalysisMorphologyPlugin->MorphologyAnalysisBinderProcessor->SemanticAnalyzerTwitterLemmatizerProvider->RussianLemmatizingTwitterAnalyzer
>>>>>
>>>>> In the RussianLemmatizingTwitterAnalyzer's ctor I load the custom object
>>>>> for lemmatization (object unrelated to lucene/es) in a singleton fashion
>>>>> (in a synchronized code block).
>>>>> Then, when creating 14 indices in the same JVM I see
>>>>> 14 instances of RussianLemmatizingTwitterAnalyzer,
>>>>> 4 instances of SemanticAnalyzerTwitterLemmatizerProvider,
>>>>> 4 instances of MorphologyAnalysisBinderProcessor,
>>>>> 30 instances of the custom lemmatizer (in each
>>>>> RussianLemmatizingTwitterAnalyzer only one instance is expected, so
>>>>> should be 14),
>>>>> 1 instance of AnalysisMorphologyPlugin.
>>>>>
>>>>> The question is, can the RussianLemmatizingTwitterAnalyzer object be
>>>>> shared between indices? Or is it by design that they must load
>>>>> separately per index?
>>>>> What could be wrong in the code that makes 30 instances of the custom
>>>>> singleton lemmatizer instead of 14?
>>>>>
>>>>> The current standing is that *with* the plugin 100M of RAM is reserved by
>>>>> the JVM with no data. *Without* the plugin the JVM reserves 2M with no
>>>>> data. Elasticsearch 1.3.2, Lucene 4.9.0.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Dmitry Kan
>>>>>
>>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to elasticsearc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/c2c57184-ee3b-4600-9091-a515b496b867%40googlegroups.com.
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
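
For reference, the pattern discussed in this thread (load the expensive lemmatizer once, then hand out cheap analyzer objects from get()) can be sketched without any Lucene or Elasticsearch types. This is only an illustration under assumptions: HeavyLemmatizer, LightAnalyzer, AnalyzerProviderSketch and the file name "lemmatizer.conf" are hypothetical stand-ins, not classes from the plugin, and the sketch assumes Java 8 or later and a lemmatizer object that is safe to share between threads. It does not claim to reproduce how Elasticsearch itself instantiates providers.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the expensive lemmatizer; counts how often it is loaded. */
final class HeavyLemmatizer {
    static final AtomicInteger LOADS = new AtomicInteger();

    HeavyLemmatizer(String confFile) {
        LOADS.incrementAndGet(); // the costly dictionary load would happen here
    }

    String lemma(String token) {
        return token.toLowerCase(Locale.ROOT); // placeholder for real lemmatization
    }
}

/** Hypothetical stand-in for the analyzer: cheap to create, holds only a reference. */
final class LightAnalyzer {
    private final HeavyLemmatizer lemmatizer;

    LightAnalyzer(HeavyLemmatizer lemmatizer) {
        this.lemmatizer = lemmatizer;
    }

    String analyze(String token) {
        return lemmatizer.lemma(token);
    }
}

/** Hypothetical stand-in for the provider; several instances may exist in one JVM. */
final class AnalyzerProviderSketch {
    // JVM-wide cache keyed by configuration file, shared by all provider instances.
    private static final ConcurrentHashMap<String, HeavyLemmatizer> CACHE =
            new ConcurrentHashMap<>();

    private final HeavyLemmatizer lemmatizer;

    AnalyzerProviderSketch(String confFile) {
        // computeIfAbsent applies the loader at most once per key, even under contention.
        this.lemmatizer = CACHE.computeIfAbsent(confFile, HeavyLemmatizer::new);
    }

    /** Creates a new lightweight analyzer that reuses the shared lemmatizer. */
    LightAnalyzer get() {
        return new LightAnalyzer(lemmatizer);
    }
}

public class SharedLemmatizerDemo {
    public static void main(String[] args) {
        // Simulate 14 indices, each with its own provider and analyzer.
        for (int i = 0; i < 14; i++) {
            AnalyzerProviderSketch provider = new AnalyzerProviderSketch("lemmatizer.conf");
            provider.get().analyze("Билета");
        }
        System.out.println("lemmatizer loads: " + HeavyLemmatizer.LOADS.get());
    }
}
```

The design choice here is to key the shared object by its configuration file rather than to rely on a single nullable static field, so two different configurations would each get their own instance while identical ones are loaded only once.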