GitHub user chenlica closed a discussion: Query Rewriter (from old wiki)

> From the page https://github.com/apache/texera/wiki/Query-Rewriter (may be dangling)

=====

Authors:
[Shiladitya Sen](http://github.com/shiladityasen), [Kishore 
Narendran](http://github.com/kishore-narendran)  

Reviewer: Chen Li (**DONE**)

## Synopsis

The purpose of the "QueryRewriter" operator is to correct missing-space errors 
in a query that can lead to incorrect tokenization. For instance, this operator 
can rewrite the query "newyork" as "new york". The operator can be used to 
return:

* The most likely rewritten query found using a word-frequency dictionary; or
* A set of valid rewritten queries.

## Status
As of 6/3/2016: **COMPLETED**

## Modules
`edu.uci.ics.texera.dataflow.queryrewriter`

## Related Issues
Design: Query Rewriter Issue - https://github.com/Texera/texera/issues/29

## Description

The operator inserts spaces into a query string to split it into likely words, 
and uses those words to rewrite the query. It has two implementations:

* A dynamic programming algorithm that uses a word-frequency dictionary to 
find the most likely tokenization. This algorithm was adapted from the Chinese 
character tokenization performed in the [Srch2 Chinese Tokenization 
module](https://github.com/SRCH2/srch2-ngn/blob/master/src/core/analyzer/ChineseTokenizer.cpp#L197).
  The word-frequency dictionary was derived from [Google 
unigrams](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) and 
the [NLTK English dictionary](http://www.nltk.org/book/ch02.html). The score 
the algorithm assigns to each word is the reciprocal of its frequency.

* A recursive algorithm that uses an English dictionary (possibly without word 
frequencies) to find all valid tokenizations of a search string. This 
algorithm can be found 
[here](https://github.com/Texera/texera/blob/master/texera/texera-dataflow/src/main/java/edu/uci/ics/texera/dataflow/queryrewriter/QuerySegmenter.java#L143).
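
The two approaches above can be illustrated with a minimal Python sketch. This is not the Texera implementation (which is written in Java); the function names `best_segmentation` and `all_segmentations`, the toy frequency values, and the choice to sum per-word scores and pick the lowest total are all assumptions made for illustration. The only detail taken from the description is that each word's score is the reciprocal of its frequency.

```python
from typing import Dict, List, Optional, Set


def best_segmentation(query: str, freq: Dict[str, int]) -> Optional[List[str]]:
    """Dynamic programming: lowest total score, score(word) = 1 / frequency.

    Assumes every frequency in `freq` is positive. Returns None when the
    query cannot be split into dictionary words at all.
    """
    n = len(query)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: lowest total score for query[:i]
    back = [-1] * (n + 1)    # back[i]: start index of the last word in query[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            word = query[start:end]
            if word in freq and best[start] < inf:
                score = best[start] + 1.0 / freq[word]
                if score < best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == inf:
        return None
    # Walk the back-pointers from the end of the query to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(query[back[i]:i])
        i = back[i]
    return words[::-1]


def all_segmentations(query: str, dictionary: Set[str]) -> List[List[str]]:
    """Recursion: every way to split the query into dictionary words."""
    if not query:
        return [[]]  # one valid segmentation of the empty string: no words
    results = []
    for end in range(1, len(query) + 1):
        prefix = query[:end]
        if prefix in dictionary:
            for rest in all_segmentations(query[end:], dictionary):
                results.append([prefix] + rest)
    return results


# Hypothetical toy dictionary; the frequencies are made up for this example.
toy_freq = {"new": 500, "york": 300, "ne": 2, "wyork": 1}

print(best_segmentation("newyork", toy_freq))       # ['new', 'york']
print(all_segmentations("newyork", set(toy_freq)))  # [['ne', 'wyork'], ['new', 'york']]
```

In this sketch, frequent words get small scores, so the dynamic programming variant prefers "new york" (total about 0.005) over "ne wyork" (total 1.5), while the recursive variant returns both because it only checks dictionary membership. The plain recursion can take exponential time on long queries; memoizing on the remaining suffix is one common way to limit that.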

## Presentation
[Query Rewriter Dynamic Programming 
Algorithm](https://docs.google.com/presentation/d/1-Ufi_1G2JYYdHCOWeSRxxchhKYXAP2AMHhBDOvTQyVw/pub?start=true&loop=true&delayms=10000)

GitHub link: https://github.com/apache/texera/discussions/3981
