Hi all,
I'm investigating possible strategies for the following situation. I have data in a graph where the nodes and edges represent knowledge about certain topics. These topics may occur in unstructured text, and the knowledge about them is used in an analysis process to make sense of that text. The analysis results are indexed in ElasticSearch. The graph is simply stored in MySQL for now. It's not really large (about 4000 nodes and 4000 edges/relationships), but the expectation is that it will grow substantially.

The most important part of the analysis process involves identifying how well topics are represented in the unstructured text. This is done based on a number of rules which are represented in the knowledge graph. The analysis results for a single piece of unstructured text consist of a list of identified topics as well as a number of characteristics per topic. A topic is considered better represented the more rules from the knowledge graph it satisfies. I.e. a piece of text counts as representing a topic if it meets a single rule, but if a second piece of text represents the same topic by meeting 10 rules, the second document should score better in search results.

Searching the analysis results through ElasticSearch is performed using a combination of filters and queries. The score is calculated using a function score query. The script score part of this uses document fields (the characteristics for each topic) as well as a number of parameters in the formula. When I search the data, the query contains a number of topics I wish to search for (let's say 40 topics) and finds the documents that match best.

I am getting the right results when I search the data, which is great. The only issue I have is the following: the knowledge in the graph is updated regularly, and updates to the graph are required to be reflected in the scoring of documents in the ElasticSearch index, leading to better search results.
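To make the scoring setup a bit more concrete, here is a rough sketch of the kind of request body involved, built as plain Python dicts. It is an illustration only, not the real query: the field names (`topics`, `rules_matched.<topic>`), the `weight` parameter and the scoring formula are all made up, and the script is written in Painless with the `script.source` layout, which may differ depending on the ElasticSearch version.

```python
import json
from typing import Dict, List

# Painless script that sums a per-topic "rules matched" count.
# ASSUMPTION: one numeric field per topic named "rules_matched.<topic>";
# this naming scheme is hypothetical, not the real mapping.
SCORE_SCRIPT = """
double total = 0;
for (def topic : params.topics) {
  def field = 'rules_matched.' + topic;
  if (doc.containsKey(field) && doc[field].size() > 0) {
    total += doc[field].value;
  }
}
return total * params.weight;
"""


def build_search_body(topics: List[str], weight: float = 1.0) -> Dict:
    """Build a function score query: filter on topics, score by script."""
    return {
        "query": {
            "function_score": {
                # Only keep documents in which at least one topic was identified.
                "query": {"bool": {"filter": [{"terms": {"topics": topics}}]}},
                # Documents that met more rules for the requested topics score higher.
                "script_score": {
                    "script": {
                        "source": SCORE_SCRIPT,
                        "params": {"topics": topics, "weight": weight},
                    }
                },
                # Use the script result as the final score.
                "boost_mode": "replace",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body(["topic_a", "topic_b"], weight=2.0), indent=2))
```

With this shape, a document that met 10 rules for a requested topic gets a higher script score than one that met a single rule, which is the ranking behaviour described above.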
There are different strategies to have the changes to the graph reflected in the scoring by ElasticSearch:

- *Periodically re-analyse all pieces of unstructured text and index the results in ElasticSearch again.* A lot of precalculations are performed and stored in the ElasticSearch index. An index alias could be used to switch between a "live" and a "rebuilding" index. The benefit here is that it is easy to implement and the queries are really fast (<50ms), as much is precalculated. The drawback is that changes in the graph are only reflected in the ElasticSearch scoring after a period of time (in my case about 8 hours), as the analysis process takes long to perform.

- *Move parts of the analysis process to query-execution time* by dynamically building a filter+query using the knowledge graph to identify the topics, and calculating the characteristics on the fly where possible using function score queries with script scores. The benefit is that changes in the graph do not always require periodic updates to the entire index. The drawback is that if a graph section used to build the query has lots of related nodes, the resulting query DSL becomes huge and has lots of bool clauses. This adds overhead to programmatically construct the query and send it to ElasticSearch, and ElasticSearch also takes longer to execute the query (800 milliseconds). Going this route I have queries which are about 2 megabytes and contain 4000+ boolean clauses.

My wish is to have changes reflected in ElasticSearch asap. Within a couple of seconds is fine. I am wondering if there are other strategies possible.

I hope the above clarifies my challenges enough for you to answer, but ask away if you have questions. I just can't go into too much detail because of non-disclosure :) I'm open to using other technologies besides ElasticSearch, as well as ElasticSearch plugins.
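For reference, the two strategies above can be sketched roughly as follows, again as plain Python dicts. All names here (`analysis`, `analysis_v1`, `analysis_v2`, the `text` field, the toy graph) are hypothetical. The first function builds a body for the `_aliases` endpoint that moves the alias from the old index to the rebuilt one; the second shows how expanding each topic into its related graph nodes makes the number of bool clauses grow with the size of the graph neighbourhood.

```python
from typing import Dict, List


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> Dict:
    """Body for POST /_aliases: move the alias to the rebuilt index in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def expand_topic(graph: Dict[str, List[str]], topic: str) -> Dict:
    """One clause for the topic plus one per related node in the graph."""
    terms = [topic] + graph.get(topic, [])
    return {
        "bool": {
            # ASSUMPTION: the raw text lives in a field called "text".
            "should": [{"match_phrase": {"text": term}} for term in terms],
            "minimum_should_match": 1,
        }
    }


def build_query(graph: Dict[str, List[str]], topics: List[str]) -> Dict:
    """Query-time strategy: the query is rebuilt from the current graph on every search."""
    return {"query": {"bool": {"should": [expand_topic(graph, t) for t in topics]}}}


def count_leaf_clauses(query: Dict) -> int:
    """Count the match_phrase clauses, to show how quickly the query grows."""
    return sum(len(c["bool"]["should"]) for c in query["query"]["bool"]["should"])


if __name__ == "__main__":
    toy_graph = {"solar": ["photovoltaic", "pv panel"], "wind": ["turbine"]}
    print(alias_swap_actions("analysis", "analysis_v1", "analysis_v2"))
    print(count_leaf_clauses(build_query(toy_graph, ["solar", "wind"])))
```

The alias swap keeps queries fast but only picks up graph changes after a full rebuild, while the query-time expansion picks them up immediately at the cost of a clause count that scales with the number of related nodes per topic.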
Kind regards,
Eric