Hi all,

I'm investigating possible strategies for the following situation:

I have data in a graph where the nodes and edges represent knowledge about 
certain topics. These topics may occur in unstructured text. The knowledge 
about these topics is used in an analysis process to make sense of 
unstructured text. The analysis results are indexed in ElasticSearch. The 
graph is stored simply in MySQL for now. It's not really large (about 4000 
nodes and 4000 edges/relationships), but the expectation is that this will 
grow substantially.

The most important part of the analysis process involves identifying how 
well topics are represented in the unstructured text. This is done based on 
a number of rules which are represented in the knowledge graph. The 
analysis results of a single piece of unstructured text consist of a list 
of identified topics as well as a number of characteristics per topic. A 
topic is considered better represented the more rules from the knowledge 
graph it matches. I.e. a piece of text counts as representing a topic if it 
meets a single rule, but if a second piece of text represents the same 
topic by meeting 10 rules, the second document should score better in 
search results.
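
As a rough illustration of this idea (not the actual analysis), the sketch 
below counts how many distinct rules fire per topic for one document. The 
topic name "solar", the rule ids and the `count_rule_hits` helper are all 
made up for the example; the real characteristics per topic are richer than 
a single count.

```python
from collections import defaultdict


def count_rule_hits(hits):
    """Return {topic: number of distinct rules that matched the text}.

    `hits` is a hypothetical list of (topic, rule_id) pairs produced by
    the analysis of one piece of unstructured text.
    """
    rules_per_topic = defaultdict(set)
    for topic, rule_id in hits:
        rules_per_topic[topic].add(rule_id)
    return {topic: len(rules) for topic, rules in rules_per_topic.items()}


# First document meets a single rule for the topic.
doc_a = [("solar", "r1")]
# Second document meets ten different rules for the same topic.
doc_b = [("solar", f"r{i}") for i in range(1, 11)]

print(count_rule_hits(doc_a))
print(count_rule_hits(doc_b))
```

With these inputs the second document ends up with the higher rule count 
for the topic, which is the property the search scoring should preserve.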

Searching the analysis results through ElasticSearch is performed using a 
combination of filters and queries. Score is calculated using a function 
score query. The script score part of this uses document fields (the 
characteristics for each topic) as well as a number of parameters in the 
formula.
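
To give an idea of the shape of such a query, here is a minimal sketch that 
builds a function score request body as a plain Python dict. The field 
names (`topics`, `rule_count.<topic>`), the `weight` parameter and the 
formula itself are assumptions for illustration, not the real mapping or 
scoring formula. The script is written in Painless-style syntax; the exact 
script syntax and the layout of the `script` element depend on the 
ElasticSearch version in use.

```python
import json


def build_function_score(topics, weight=1.0):
    """Build a function_score query with one script_score per topic.

    Assumes (hypothetically) that each document has a keyword field
    `topics` and a numeric field `rule_count.<topic>` per topic.
    """
    functions = []
    for topic in topics:
        functions.append({
            # Only apply this function to documents that contain the topic.
            "filter": {"term": {"topics": topic}},
            "script_score": {
                "script": {
                    "source": "doc[params.field].value * params.weight",
                    "params": {
                        "field": f"rule_count.{topic}",
                        "weight": weight,
                    },
                }
            },
        })
    return {
        "query": {
            "function_score": {
                "query": {"terms": {"topics": list(topics)}},
                "functions": functions,
                "score_mode": "sum",      # add up the per-topic scores
                "boost_mode": "replace",  # ignore the base query score
            }
        }
    }


body = build_function_score(["solar", "wind"], weight=2.0)
print(json.dumps(body, indent=2))
```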

When I search for the data, the query contains a number of topics I wish to 
search for (let's say 40 topics) and finds documents that match best. I am 
getting the right results when I search the data, which is great.

The only issue I have is the following: The knowledge in the graph is 
updated regularly. Updates to the graph are required to be reflected in the 
scoring of documents in the ElasticSearch index, leading to better search 
results.

There are different strategies to have the changes to the graph reflected 
in the scoring by ElasticSearch:

- *Periodically re-analyse all pieces of unstructured text and index the 
results in ElasticSearch again* - A lot of precalculations are performed 
and stored in the ElasticSearch index. An index alias could be used to 
switch between a "live" and "rebuilding" index. The benefit here is that it 
is easy to implement and the queries are really fast (under 50 ms) because 
so much is precalculated. The drawback here is that changes in the graph 
are only reflected in the ElasticSearch search scoring after a period of 
time (in my case about 8 hours) as the analysis process takes long to 
perform.

- *Move parts of the analysis process to query-execution time* by 
dynamically building a filter+query using the knowledge graph to identify 
the topics and calculate the characteristics where possible on the fly 
using function score queries with script scores. The benefit is that the 
changes in the graph do not always require periodic updates to the entire 
index. The drawback here is that if a graph section used to build the query 
has lots of related nodes, the resulting query DSL becomes huge and has 
lots of bool clauses. Constructing the query programmatically and sending 
it to ElasticSearch adds overhead, and ElasticSearch also takes longer to 
execute it (about 800 milliseconds). Going this route I have queries 
which are about 2 megabytes and contain 4000+ boolean clauses.
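
The two strategies above can be sketched roughly as follows. The first 
function builds the body for the `_aliases` API that moves an alias from 
the old index to the freshly rebuilt one in a single request. The second 
one expands topics into a bool query using a toy adjacency dict. All index 
names, the alias name, the `text` field and the dict-based graph are 
hypothetical stand-ins; the real graph lives in MySQL and the real rules 
are more involved than phrase matches.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases: switch `alias` to the rebuilt index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def build_bool_query(graph, topics, field="text"):
    """Expand each topic into one `should` clause per related graph term.

    `graph` is a hypothetical {topic: [related terms]} mapping.
    """
    should = []
    for topic in topics:
        for term in graph.get(topic, []):
            should.append({"match_phrase": {field: term}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


# Toy graph: two topics with a handful of related terms each.
graph = {
    "solar": ["solar panel", "photovoltaic", "solar farm"],
    "wind": ["wind turbine", "offshore wind"],
}

swap = alias_swap_actions("analysis", "analysis_v1", "analysis_v2")
query = build_bool_query(graph, ["solar", "wind"])

print(swap)
print(len(query["query"]["bool"]["should"]), "bool clauses")
```

The clause count grows with the number of related nodes per topic, which 
is how 40 topics can turn into several thousand clauses.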

Ideally, changes to the graph would be reflected in ElasticSearch as soon 
as possible. A delay of a couple of seconds is fine.

I am wondering if there are other strategies possible. I hope the above 
clarifies my challenges enough for you to answer, but ask away if you have 
questions. I just can't detail too much because of non-disclosure :)

I'm open to using other technologies besides ElasticSearch, as well as 
ElasticSearch plugins.

Kind regards,

Eric
