GitHub user chenlica created a discussion: Dictionary Based Matcher (from Old Wiki)
From the wiki page https://github.com/apache/texera/wiki/Dictionary-Based-Matcher (may be dangling)

======

Authors: [Sandeep Reddy Madugala](https://github.com/sandeepreddy602), [Sudeep Meduri](https://github.com/inkudo) and [Rajesh Yarlagadda](https://github.com/rajesh9625)

Reviewers: [Chen Li](https://github.com/chenlica)

## Synopsis

**Lucene** already provides basic functionality for performing a **Keyword search** and a **Phrase search**. We created a **Dictionary Matcher** feature on top of these existing features. The purpose of the Dictionary Matcher is to enable users to perform multiple phrase searches at a time.

## Status

As of 5/25/2016: COMPLETED

## Modules

* `edu.uci.ics.texera.dataflow.dictionarymatcher`
* `edu.uci.ics.texera.dataflow.common`
* `edu.uci.ics.texera.dataflow.keywordmatch`

## Related Issues

* [Issue #90] (Team -1) - Add Keyword based and Phrase Based Dictionary Matcher
* [Issue #53] (Team -1) - Design a Dictionary class for the DictionaryMatcher
* [Issue #52] (Team -1) - Implement a "Span" class
* [Issue #37] (Team -1) - Design: Dictionary Matcher Operator

## Description

DictionaryMatcher performs a scan-based, keyword-based, or phrase-based search, depending on the source operator type: it reads each dictionary entry and searches the documents for matches. Presently, two KeywordOperatorTypes are supported. Three kinds of source operators are considered:

* SCANOPERATOR
* KEYWORDOPERATOR
* PHRASEOPERATOR

##### SourceOperatorType.SCANOPERATOR:

Loops through the dictionary entries. For each dictionary entry, it loops through the tuples in the operator. For each tuple, it loops through the fields in the attribute list. For each field, it loops through all the matches. It returns only one tuple per document; if there are multiple matches, all spans are included in a list. Java Regex is used to match word boundaries.

Ex: If the dictionary word is "Lin" and the text is "Lin is Angelina's friend", the matches should include Lin but not Angelina.

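
The word-boundary matching described for SCANOPERATOR can be sketched with `java.util.regex`. This is a minimal illustration, not the Texera implementation: the class `DictionaryScanSketch`, the `Span` holder, and the method `findSpans` are hypothetical names introduced here, and the sketch assumes case-sensitive matching over a single field value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual Texera DictionaryMatcher code.
final class DictionaryScanSketch {

    // Hypothetical stand-in for a span: which entry matched, and where.
    static final class Span {
        final String entry;
        final int start; // inclusive offset in the field value
        final int end;   // exclusive offset in the field value

        Span(String entry, int start, int end) {
            this.entry = entry;
            this.start = start;
            this.end = end;
        }
    }

    // Collects every whole-word occurrence of every dictionary entry
    // in one field value, so multiple matches end up in a single list.
    static List<Span> findSpans(String fieldValue, List<String> dictionary) {
        List<Span> spans = new ArrayList<>();
        for (String entry : dictionary) {
            // Pattern.quote treats the entry literally; \b anchors both
            // ends of the entry at word boundaries.
            Pattern pattern = Pattern.compile("\\b" + Pattern.quote(entry) + "\\b");
            Matcher matcher = pattern.matcher(fieldValue);
            while (matcher.find()) {
                spans.add(new Span(entry, matcher.start(), matcher.end()));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Lin is Angelina's friend; Linus knows Lin";
        for (Span s : findSpans(text, Arrays.asList("Lin"))) {
            System.out.println(s.entry + " [" + s.start + ", " + s.end + ")");
        }
    }
}
```

In this sketch, "Linus" is rejected because the character after "Lin" is a word character, so the trailing `\b` fails. Under the case-sensitivity assumption, "Angelina" could not match "Lin" anyway; if matching were made case-insensitive, the leading `\b` would still reject it.
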
##### SourceOperatorType.KEYWORDOPERATOR:

Loops through the dictionary entries. For each dictionary entry, the keyword matcher's getNextTuple is called using KeyWordOperator.BASIC. It then updates the span information at the end of the tuple.

##### SourceOperatorType.PHRASEOPERATOR:

Loops through the dictionary entries. For each dictionary entry, the keyword matcher's getNextTuple is called using KeyWordOperator.PHRASE. The span returned is the span information provided by the keyword matcher's phrase operator.

## Presentation

[Lucene Presentation](https://docs.google.com/presentation/d/1P9HUFFW72ogqdEZf07r5Y7_gM9JK6Wu8UVgH0bGNkF0/edit#slide=id.g1288dd6e56_0_0) (Team 1)

## Performance Test

Machine configuration: MacBook Pro, 2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3

### Dataset: 100k medline records

* Index time: 29.4110 seconds
* Performance results for DictionaryMatcher with **SCANOPERATOR**:
  * **Dictionary**: {"medical"}
  * Lucene Query time: 0.1480 seconds
  * Match time: 5.2740 seconds
  * Total: 2459 results
* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical"}
  * Lucene Query time: 0.3840 seconds
  * Match time: 0.5980 seconds
  * Total: 2459 results
* Performance results for DictionaryMatcher with **SCANOPERATOR**:
  * **Dictionary**: {"medical","medication"}
  * Lucene Query time: 0.4430 seconds
  * Match time: 10.9500 seconds
  * Total: 2904 results
* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical","medication"}
  * Lucene Query time: 0.4560 seconds
  * Match time: 0.8950 seconds
  * Total: 2904 results
* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical","medication","medicare","medicaid"}
  * Lucene Query time: 0.5210 seconds
  * Match time: 0.9100 seconds
  * Total: 3022 results

### Dataset: 1M medline records

* Index time: 335.6620 seconds
* Performance results for DictionaryMatcher with **SCANOPERATOR**:
  * **Dictionary**: {"medical"}
  * Lucene Query time: 0.9840 seconds
  * Match time: 53.0320 seconds
  * Total: 29355 results
* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical"}
  * Lucene Query time: 0.5870 seconds
  * Match time: 5.2180 seconds
  * Total: 29355 results
* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical","medication","medicare","medicaid"}
  * Lucene Query time: 0.5950 seconds
  * Match time: 5.6970 seconds
  * Total: 36528 results

GitHub link: https://github.com/apache/texera/discussions/3963

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
