GitHub user chenlica created a discussion: Dictionary Based Matcher (from Old 
Wiki)

>From the wiki page 
>https://github.com/apache/texera/wiki/Dictionary-Based-Matcher (may be 
>dangling)

======

Authors: [Sandeep Reddy Madugala](https://github.com/sandeepreddy602), [Sudeep Meduri](https://github.com/inkudo) and [Rajesh Yarlagadda](https://github.com/rajesh9625)

Reviewers: [Chen Li](https://github.com/chenlica)

## Synopsis
**Lucene** already provides basic functionality for performing a **Keyword 
search** and a **Phrase search**. We built a **Dictionary Matcher** feature 
on top of these existing features. 

The purpose of the Dictionary Matcher is to enable users to perform multiple 
phrase searches at a time.

## Status
As of 5/25/2016: COMPLETED

## Modules
`edu.uci.ics.texera.dataflow.dictionarymatcher`

`edu.uci.ics.texera.dataflow.common`

`edu.uci.ics.texera.dataflow.keywordmatch`

## Related Issues
[Issue #90] (Team -1) - Add Keyword based and Phrase Based Dictionary Matcher

[Issue #53] (Team -1) - Design a Dictionary class for the DictionaryMatcher

[Issue #52] (Team -1) - Implement a "Span" class

[Issue #37] (Team -1) - Design: Dictionary Matcher Operator

## Description

DictionaryMatcher performs a scan-based, keyword-based, or phrase-based search, 
depending on the source operator type. It reads each entry from the dictionary 
and searches the documents for matches. Presently, two keyword operator types 
are supported (BASIC and PHRASE, used by the keyword and phrase source 
operators described below).

Three kinds of source operators are considered:
* SCANOPERATOR 
* KEYWORDOPERATOR
* PHRASEOPERATOR

##### SourceOperatorType.SCANOPERATOR:
      
Loops through the dictionary entries. For each dictionary entry, it loops 
through the tuples in the operator; for each tuple, through the fields in the 
attribute list; and for each field, through all the matches. Only one tuple is 
returned per document. If a document has multiple matches, all of their spans 
are included in a list.

A Java regular expression is used to match on word boundaries.

Example: if the dictionary word is "Lin" and the text is "Lin is Angelina's 
friend", the matches should include "Lin" but not "Angelina".
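The scan-style loop and the word-boundary check can be sketched as follows. This is a minimal illustration, not the Texera implementation: the class `ScanMatchSketch`, the `Span` holder, and the representation of a document as a field-to-text map are all hypothetical. Case-insensitive matching is also an assumption, made here so that the "Lin" versus "Angelina" example actually exercises the word boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScanMatchSketch {

    /** Hypothetical span: matched field, dictionary entry, and character offsets [start, end). */
    static final class Span {
        public final String field;
        public final String key;
        public final int start;
        public final int end;

        Span(String field, String key, int start, int end) {
            this.field = field;
            this.key = key;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return field + ":" + key + "[" + start + "," + end + ")";
        }
    }

    /** Finds whole-word occurrences of one dictionary entry in one field value. */
    static List<Span> findSpans(String field, String text, String entry) {
        // Pattern.quote treats the entry literally; \b requires a word boundary on both sides.
        // CASE_INSENSITIVE is an assumption of this sketch.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(entry) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<Span> spans = new ArrayList<>();
        while (matcher.find()) {
            spans.add(new Span(field, entry, matcher.start(), matcher.end()));
        }
        return spans;
    }

    /**
     * Loop order: dictionary entries, then documents, then fields, then matches.
     * Returns one map entry per matching document index, holding all of its spans.
     */
    static Map<Integer, List<Span>> scan(List<String> dictionary,
                                         List<Map<String, String>> documents,
                                         List<String> attributeList) {
        Map<Integer, List<Span>> spansByDocument = new LinkedHashMap<>();
        for (String entry : dictionary) {
            for (int doc = 0; doc < documents.size(); doc++) {
                for (String field : attributeList) {
                    String text = documents.get(doc).get(field);
                    if (text == null) {
                        continue; // this document has no such field
                    }
                    List<Span> found = findSpans(field, text, entry);
                    if (!found.isEmpty()) {
                        spansByDocument.computeIfAbsent(doc, k -> new ArrayList<>())
                                .addAll(found);
                    }
                }
            }
        }
        return spansByDocument;
    }

    public static void main(String[] args) {
        List<Map<String, String>> documents = Arrays.asList(
                Collections.singletonMap("text", "Lin is Angelina's friend"),
                Collections.singletonMap("text", "No match here"));
        System.out.println(scan(Arrays.asList("Lin"), documents, Arrays.asList("text")));
    }
}
```

Grouping spans by document index is one way to return a single result per document while keeping every match. Note that `\b` only behaves as expected when the dictionary entry starts and ends with a word character.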
      
##### SourceOperatorType.KEYWORDOPERATOR:
      
Loops through the dictionary entries. For each dictionary entry, the keyword 
matcher's `getNextTuple` is called using `KeyWordOperator.BASIC`. It then 
updates the span information at the end of the tuple.
      
##### SourceOperatorType.PHRASEOPERATOR:
      
Loops through the dictionary entries. For each dictionary entry, the keyword 
matcher's `getNextTuple` is called using `KeyWordOperator.PHRASE`. The span 
returned is the span information provided by the keyword matcher's phrase 
operator.
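The keyword and phrase modes share one control flow: create a matcher for each dictionary entry and drain it with repeated `getNextTuple` calls. The sketch below illustrates only that delegation pattern. The `TupleSource` interface, the `matchAll` helper, and the toy substring matcher are hypothetical stand-ins, not the actual Texera keyword matcher API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

final class DictionaryDelegationSketch {

    /** Hypothetical stand-in for a keyword matcher: next matching tuple, or null when exhausted. */
    interface TupleSource {
        String getNextTuple();
    }

    /** Runs one matcher per dictionary entry and concatenates the results. */
    static List<String> matchAll(List<String> dictionary,
                                 Function<String, TupleSource> matcherFor) {
        List<String> results = new ArrayList<>();
        for (String entry : dictionary) {
            // In the real operator this would be a keyword- or phrase-mode matcher.
            TupleSource matcher = matcherFor.apply(entry);
            String tuple;
            while ((tuple = matcher.getNextTuple()) != null) {
                results.add(tuple);
            }
        }
        return results;
    }

    /** Toy matcher for the demo only: substring filter over an in-memory list. */
    static TupleSource containsMatcher(List<String> documents, String entry) {
        Iterator<String> it = documents.iterator();
        return () -> {
            while (it.hasNext()) {
                String doc = it.next();
                if (doc.contains(entry)) {
                    return doc;
                }
            }
            return null;
        };
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("medical record", "medication list", "weather report");
        List<String> hits = matchAll(Arrays.asList("medical", "medication"),
                entry -> containsMatcher(docs, entry));
        System.out.println(hits);
    }
}
```

In this sketch a document that matches several dictionary entries appears once per entry. Whether and how the real operator merges such results is not specified here.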
      
## Presentation

[Lucene 
Presentation](https://docs.google.com/presentation/d/1P9HUFFW72ogqdEZf07r5Y7_gM9JK6Wu8UVgH0bGNkF0/edit#slide=id.g1288dd6e56_0_0)
 (Team 1)

## Performance Test

Machine configuration: MacBook Pro, 2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3

### Dataset: 100K MEDLINE records
* Index time: 29.4110 seconds

* Performance results for DictionaryMatcher with **SCANOPERATOR**:
  * **Dictionary**: {"medical"}
  * Lucene Query time: 0.1480 seconds
  * Match time: 5.2740 seconds
  * Total: 2459 results

* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical"}
  * Lucene Query time: 0.3840 seconds
  * Match time: 0.5980 seconds
  * Total: 2459 results

* Performance results for DictionaryMatcher with **SCANOPERATOR**:
  * **Dictionary**: {"medical","medication"}
  * Lucene Query time: 0.4430 seconds
  * Match time: 10.9500 seconds
  * Total: 2904 results

* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical","medication"}
  * Lucene Query time: 0.4560 seconds
  * Match time: 0.8950 seconds
  * Total: 2904 results

* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical","medication","medicare","medicaid"}
  * Lucene Query time: 0.5210 seconds
  * Match time: 0.9100 seconds
  * Total: 3022 results

### Dataset: 1M MEDLINE records
* Index time: 335.6620 seconds

* Performance results for DictionaryMatcher with **SCANOPERATOR**:
  * **Dictionary**: {"medical"}
  * Lucene Query time: 0.9840 seconds
  * Match time: 53.0320 seconds
  * Total: 29355 results

* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical"}
  * Lucene Query time: 0.5870 seconds
  * Match time: 5.2180 seconds
  * Total: 29355 results

* Performance results for DictionaryMatcher with **PHRASEOPERATOR**:
  * **Dictionary**: {"medical","medication","medicare","medicaid"}
  * Lucene Query time: 0.5950 seconds
  * Match time: 5.6970 seconds
  * Total: 36528 results


GitHub link: https://github.com/apache/texera/discussions/3963
