GitHub user chenlica closed a discussion: Regex Matcher (from old wiki)
>From the page https://github.com/apache/texera/wiki/Regex-Matcher (may be >dangling) === Author(s): [Zuozhi Wang](https://github.com/zuozhi), [Shuying Lai](https://github.com/laisycs) Reviewer(s): [Chen Li](https://github.com/chenlica) **REVIEWED** ## Synopsys Implement an operator to use an index-based method based on gram inverted index to support efficient regular expressions on large datasets. Our implementation is based on Russ Cox's [algorithm](https://swtch.com/~rsc/regexp/regexp4.html) and the corresponding open source [implementation](https://github.com/google/codesearch). ## Status As of 6/14/2016: **FINISHED** ## Modules ``` edu.uci.ics.texera.dataflow.regexmatch com.google.re2j ``` ## Related Issues https://github.com/Texera/texera/issues/30 and https://github.com/Texera/texera/issues/99 ## Description Our approach has the following steps: * Build a gram index for documents using Lucene. * Translate a regular expression to a boolean expression using Russ Cox's algorithm. * Run the Boolean query on Lucene to compute candidate documents. * Postprocess these candidate documents to remove false positives. ## Presentation [Presentation] (https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit?usp=sharing) (Team 3) [Initial Design Slides](https://docs.google.com/presentation/d/15_fq_7y_FAHDa8P7CzPP0y_5x3d4taZhNjmT5qwwbtc/edit?usp=sharing) ## Performance Test Machine setting: Macbook Air (mid-2013), Intel Core i7 (4650U), SSD hard drive, 8GB memory. * Data set: 1 million Medline records, about 1.53 GB * Regex: `\bmedic(ine|al|ation|are|aid)?\b` ("medicine" or "medical" or "medication" or "medicare" or "medicaid") * Performance results (average time reported in seconds): | | Index Based | Scan Based | | ------------- | ----------- | ---------- | | Regex Matcher | 7.43 | 67.65 | | RE2J | 11.88 | 99.05 | * Data set: 5 million Medline records, about 7.8 GB. * Regex: `\bmedic(ine|al|ation|are|aid)?\b` ("medicine" or "medical" or "medication" or "medicare" or "medicaid") * Performance results (average time reported in seconds): | | Index Based | Scan Based | | ------------- | ----------- | ---------- | | Regex Matcher | 28.11 | 330.754 | | RE2J | 44.72 | 506.34 | ## TODOs * Design and implement algorithm to compare two boolean expressions. The basic idea is to format boolean expressions into [DNF](https://en.wikipedia.org/wiki/Disjunctive_normal_form), and check the equivalence of two DNFs. There is a PR for this task: https://github.com/Texera/texera/pull/127 **FINISHED** * Use token position information to generate a more selective query to reduce the number of candidate documents. GitHub link: https://github.com/apache/texera/discussions/3983 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
