GitHub user chenlica created a discussion: Regex Matcher (from old wiki)

>From the page https://github.com/apache/texera/wiki/Regex-Matcher (may be 
>dangling)

===


Author(s): [Zuozhi Wang](https://github.com/zuozhi), [Shuying 
Lai](https://github.com/laisycs)

Reviewer(s): [Chen Li](https://github.com/chenlica) **REVIEWED**

## Synopsys

Implement an operator to use an index-based method based on gram inverted index 
to support efficient regular expressions on large datasets.  Our implementation 
is based on Russ Cox's [algorithm](https://swtch.com/~rsc/regexp/regexp4.html) 
and the corresponding open source 
[implementation](https://github.com/google/codesearch).

## Status
As of 6/14/2016: **FINISHED**

## Modules

```
edu.uci.ics.texera.dataflow.regexmatch
com.google.re2j
```

## Related Issues

https://github.com/Texera/texera/issues/30 and 
https://github.com/Texera/texera/issues/99

## Description

Our approach has the following steps:
* Build a gram index for documents using Lucene.
* Translate a regular expression to a boolean expression using Russ Cox's 
algorithm.
* Run the Boolean query on Lucene to compute candidate documents.
* Postprocess these candidate documents to remove false positives.

## Presentation
[Presentation] 
(https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit?usp=sharing)
 (Team 3)

[Initial Design 
Slides](https://docs.google.com/presentation/d/15_fq_7y_FAHDa8P7CzPP0y_5x3d4taZhNjmT5qwwbtc/edit?usp=sharing)

## Performance Test

Machine setting: Macbook Air (mid-2013), Intel Core i7 (4650U), SSD hard drive, 
8GB memory.

* Data set: 1 million Medline records, about 1.53 GB
* Regex: `\bmedic(ine|al|ation|are|aid)?\b` ("medicine" or "medical" or 
"medication" or "medicare" or "medicaid")
* Performance results (average time reported in seconds):

|               | Index Based | Scan Based |
| ------------- | ----------- | ---------- |
| Regex Matcher |    7.43     |   67.65    |
|     RE2J      |    11.88    |   99.05    |

* Data set: 5 million Medline records, about 7.8 GB.
* Regex: `\bmedic(ine|al|ation|are|aid)?\b` ("medicine" or "medical" or 
"medication" or "medicare" or "medicaid")
* Performance results (average time reported in seconds):

|               | Index Based | Scan Based |
| ------------- | ----------- | ---------- |
| Regex Matcher |    28.11    |  330.754   |
|     RE2J      |    44.72    |  506.34    |

## TODOs

* Design and implement algorithm to compare two boolean expressions. The basic 
idea is to format boolean expressions into 
[DNF](https://en.wikipedia.org/wiki/Disjunctive_normal_form), and check the 
equivalence of two DNFs. There is a PR for this task: 
https://github.com/Texera/texera/pull/127 **FINISHED**
* Use token position information to generate a more selective query to reduce 
the number of candidate documents.



GitHub link: https://github.com/apache/texera/discussions/3983

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: 
[email protected]

Reply via email to