GitHub user chenlica created a discussion: SystemT Report (from old wiki)

> From the page https://github.com/apache/texera/wiki/SystemT-Report (may be dangling)

---

Author(s): Zuozhi Wang (zuozhiw AT uci DOT edu)

Reviewer(s): Chen Li (chenli AT gmail DOT com) and Team 6

# SystemT Project Report
Conducted by Zuozhi Wang  
Advised by Professor Chen Li and PhD student Jamshid Esmaelnezhad  
January 2016 - March 2016  
University of California, Irvine

## Introduction

[SystemT](http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=6335) is a text analytics product from IBM for information extraction. Different from traditional grammar-based or machine-learning-based methods, SystemT takes an algebraic approach. It provides the Annotation Query Language (AQL), a SQL-like declarative query language for extracting structured data from raw text, and it uses relational operators, as well as span extraction and aggregation operators, to build complex extraction models.

During the winter quarter of 2016 at UC Irvine, we learned to use SystemT and evaluated its performance. We also experimented with using Lucene and Russ Cox's regex-indexing algorithm to pre-process queries.

## Getting Started with SystemT

#### Install SystemT
SystemT provides multiple ways to access its features, including:
* Web interface: install IBM BigInsights on VirtualBox; it requires 16 GB of RAM.
* Java API: use the SystemT API in a Java program.
 
#### Sample AQL Queries
Here are the basic AQL elements: regular expressions, dictionaries, and patterns. For AQL tutorials, please see the **References, Tutorials, and Papers** section.

**Extract regular expressions**.  
The `Document` view is a special view that represents the current document.  

```sql
create view DateFormat as
    extract regex /(\d|0\d|1[0-2])\/(\d|[0-2]\d|3[0-1])\/(19\d{2}|2\d{3}|\d{2})/
    on D.text as Date from Document D;
```

**Extract dictionaries**.  
Here a dictionary is created from a local file. A dictionary can also be defined inline.  
```sql
create dictionary symptom_dict
        from file '../../../resources/dictionaries/WebMD_symptoms.txt';

create view Symptoms as
        extract dictionary 'symptom_dict'
        on D.text as SymptomName from Document D;
```

**Extract patterns**.  
This query extracts patterns where disease names and symptom names are close to 
each other (i.e., 0 to 20 tokens apart).  
```sql
create view DiseaseRelateSymptom as
        extract pattern <D.DiseaseNames> <Token>{0,20} <S.SymptomName>
                as match from Diseases D, Symptoms S;
```

#### References, Tutorials, and Papers

Access the [SystemT web page](http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=6335) for the latest instructions on how to get a copy of SystemT.

## Using SystemT in a Java program

All the code we wrote related to SystemT is available in [this GitHub repo](https://github.com/zuozhi/MedExtract).

To use SystemT, check the code in the **MedExtraction/src/extractor** folder. **Extractor.java** is a wrapper program that makes all the API calls and simplifies the process. **MyExtraction.java** is a sample program that uses the Extractor. Please see the comments in the code for details.

The dataset we used is mostly [iPubMed](http://ipubmed.ics.uci.edu/) data. The dictionary files were downloaded by Python crawlers we wrote; that code is in the **med_crawlers** folder of the GitHub repo. Please contact us to get the data and dictionary files.

In the **MedExtraction/textAnalytics/src** folder, there are some AQL scripts 
written by me and another student (Fan Mo).

The code in the **MedExtraction/src/preprocessor** and **MedExtraction/src/test** folders is not directly related to SystemT. If you are interested, see the **Experiments in preprocessing dictionary and regex queries** section.

#### SystemT Performance Evaluation

Tests were run on our laptop (a 2013 MacBook Air) with a single thread.  

SystemT's running time on extracting a small dictionary of 400 entries:  

| # of docs | SystemT execution time (secs) |
|:---:|:---:|
| 10K | 9.54 |
| 50K | 35.84 |
| 100K | 97.16 |
| 250K | 255.29 |
| 400K | 397.13 |

SystemT's running time on extracting datetime regexes:  

| # of docs | SystemT execution time (secs) |
|:---:|:---:|
| 10K | 0.97 |
| 20K | 1.17 |
| 50K | 2.70 |
| 100K | 5.30 |
| 200K | 10.41 |
| 300K | 15.90 |
| 400K | 21.32 |
| 500K | 26.24 |


## Experiments in preprocessing dictionary and regex queries

Please note that this section describes our own experiments on preprocessing dictionary and regex queries; it is not about SystemT itself.

In this big data era, many information extraction tasks run over huge amounts of data. For example, our iPubMed dataset has over 26 million records and is a valuable resource for information extraction. SystemT processes each individual document efficiently, but if we fed all 26 million documents to it, even a simple extraction task would take more than 6 hours. This led us to the following idea: if we pre-process a query, we only need to feed SystemT the documents relevant to that query.
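
As a minimal sketch of this idea (plain Java, with a naive substring scan standing in for a real index such as Lucene; all names below are ours, not from the repo):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the pre-filtering idea: keep only the documents that
// could possibly match the query, and hand just those to the extractor.
public class PreFilterSketch {
    // Naive stand-in for a real filter (Lucene / Google Code Search below).
    static boolean mightMatch(String doc, List<String> dictionary) {
        for (String entry : dictionary) {
            if (doc.contains(entry)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("fever", "cough");
        List<String> corpus = List.of(
                "Patient presents with fever and fatigue.",
                "Unrelated administrative note.");

        List<String> candidates = new ArrayList<>();
        for (String doc : corpus) {
            if (mightMatch(doc, dictionary)) {
                candidates.add(doc); // only these documents go to SystemT
            }
        }
        System.out.println(candidates.size() + " of " + corpus.size() + " documents kept");
    }
}
```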

A SystemT query is built on dictionaries and regexes; complex extraction models are composed from these two fundamental elements. So we did some experiments on filtering the whole dataset based on the dictionaries and regexes in a query. The code is available in the [preprocessor folder](https://github.com/zuozhi/MedExtract/tree/master/MedExtraction/src/preprocessor).

#### Scanning
The first step is to find the dictionaries and regular expressions in the AQL file. The scanner we wrote is relatively simple: it only extracts dictionaries and regexes and does not parse the whole AQL file, so if the query uses predicates such as **_not regex_** or **_not in dictionary_**, the preprocessing would be completely wrong. Writing a complete AQL parser takes a lot of effort; for this experiment, the simple scanner is enough.
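
A minimal sketch of the scanner idea (the two patterns below are illustrative, not the exact ones in our code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the scanner idea: pull regex literals and dictionary
// file paths out of AQL source with plain regexes, without parsing AQL.
public class AqlScannerSketch {
    // AQL regex literals appear as "extract regex /.../"; dictionaries as
    // "create dictionary ... from file '...'".
    private static final Pattern REGEX_LITERAL =
            Pattern.compile("extract\\s+regex\\s+/(.*?)/", Pattern.DOTALL);
    private static final Pattern DICT_FILE =
            Pattern.compile("create\\s+dictionary\\s+\\w+\\s+from\\s+file\\s+'([^']+)'");

    public static void main(String[] args) {
        String aql =
                "create dictionary symptom_dict from file 'dictionaries/symptoms.txt';\n"
              + "create view DateFormat as\n"
              + "    extract regex /\\d{4}-\\d{2}-\\d{2}/ on D.text as Date from Document D;";

        Matcher m = REGEX_LITERAL.matcher(aql);
        while (m.find()) {
            // Note: an escaped "\/" inside the AQL regex would fool this
            // naive non-greedy match; our simple scanner shares that limit.
            System.out.println("regex: " + m.group(1));
        }
        m = DICT_FILE.matcher(aql);
        while (m.find()) {
            System.out.println("dictionary file: " + m.group(1));
        }
    }
}
```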

#### Preprocessing Dictionaries Using Lucene
I used Lucene to build an index on the whole dataset, find the documents that contain entries of the dictionary, and feed only those filtered documents to SystemT.
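
The core of the approach looks roughly like the sketch below (assuming Lucene 5+ style APIs; the `text` field name and the single-token, lowercase dictionary entries are our simplifications — multi-word entries would need a `PhraseQuery`):

```java
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneFilterSketch {
    public static void main(String[] args) throws Exception {
        List<String> docs = List.of(
                "Patient presents with fever and fatigue.",
                "Unrelated administrative note.");
        List<String> dictionary = List.of("fever", "cough");

        // One-time cost: index every document's text.
        Directory dir = FSDirectory.open(Paths.get("filter-index"));
        try (IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String text : docs) {
                Document d = new Document();
                d.add(new TextField("text", text, Field.Store.YES));
                writer.addDocument(d);
            }
        }

        // Per-query cost: look up each dictionary entry, union the hits.
        Set<String> matched = new HashSet<>();
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (String entry : dictionary) {
                for (ScoreDoc hit : searcher.search(
                        new TermQuery(new Term("text", entry)), docs.size()).scoreDocs) {
                    matched.add(searcher.doc(hit.doc).get("text"));
                }
            }
        }
        // Only the matched documents would be fed to SystemT.
        matched.forEach(System.out::println);
    }
}
```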
In the following figures, the yellow line is the original setting, where all documents are fed into SystemT. The blue line is the total running time of building the index, searching it, and feeding the filtered documents to SystemT. The red line excludes the index-building time, since building the index is a one-time effort.

Here's the performance result for a small dictionary of 460 entries. Using Lucene for preprocessing is consistently much faster than the original SystemT run.
![image of small dictionary 
performance](https://docs.google.com/drawings/d/1MFDp2xbqLqROz9ByLFsigtWdnSAtVXvC0bJFcCOn_UU/pub?w=480&h=360)

Here's the performance result for a large dictionary of over 40,000 entries. Lucene starts to slow down because there is no index on the dictionary itself, so it has to search the document index once for each entry. But it is still faster than SystemT alone.
![image of large dictionary 
performance](https://docs.google.com/drawings/d/1Xu3UTJ5KMzVVEXFuLSod2rUYjQlfNWxk3UbCtRRlIGM/pub?w=480&h=360)

Using Lucene, we built an index on the input data but not on the dictionary. It would potentially be faster to build indexes on both the data and the dictionary and use an efficient algorithm to match the two indexes against each other.

#### Preprocessing Regexes Using Google Code Search
Google Code Search was one of the few online search tools that supported regular expressions. Its technique for efficient regular expression matching was long unknown; after the service was shut down in 2011, Russ Cox wrote an [article explaining the algorithm](https://swtch.com/~rsc/regexp/regexp4.html) in 2012, and the [Google Code Search tools](https://github.com/google/codesearch) are now open source.  
We used the Google Code Search tools to build a trigram index and perform regular expression matching. Since they are written in Go, we had to make system calls from our Java program to run them.
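
Roughly, the Java side shells out like this (a sketch; `cindex` and `csearch` are the command-line tools from the codesearch repo, assumed to be on the PATH, and the file layout is ours):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Sketch of driving the Go codesearch tools from Java via system calls.
public class CodeSearchFilterSketch {
    static List<String> run(String... command) throws Exception {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader r =
                new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // One-time cost: index the directory of per-record files
        // (cindex writes its index to ~/.csearchindex by default).
        run("cindex", "docs/");

        // Per-query cost: -l lists only the names of matching files,
        // grep-style; these are the documents we would feed to SystemT.
        List<String> matchingFiles = run("csearch", "-l", "\\d{2}/\\d{2}/\\d{4}");
        matchingFiles.forEach(System.out::println);
    }
}
```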

Here's the performance result of feeding all the data through Google Code Search. It turns out that this takes much more time than running SystemT directly. SystemT's own execution time on the filtered documents is usually about 4 times lower, but SystemT is already fast at matching regular expressions, so the filtering overhead dominates.  
![image of google code search 
performance](https://docs.google.com/drawings/d/1zv-fy9UaFhmuSsU_N4OI7UZAGiWuuZMu3tCDqQWQg3A/pub?w=480&h=360)

However, the Google Code Search program builds its index over the file system, so we had to split the single file containing 500K records into 500K small files. We suspected that feeding that many files to the program at once might hurt performance, so we split the dataset into chunks of 50K files and fed Google Code Search one chunk at a time. Here are the performance results: it turns out to be much faster, although still slower than SystemT.  
![image of google code search performance in 50K chunks](https://docs.google.com/drawings/d/1nWHBubo75RieMi_oND9RwbIXBEkbuOSDMTSsUK2YQUU/pub?w=480&h=360)
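
For reference, the splitting step is plain file I/O; a minimal sketch (the file and directory names, and the one-record-per-line input format, are our choices):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of splitting a one-record-per-line corpus into directories of
// 50K files each, so every chunk can be indexed and searched separately.
public class ChunkSplitterSketch {
    public static void main(String[] args) throws IOException {
        final int chunkSize = 50_000;
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader("records.txt"))) {
            String record;
            while ((record = reader.readLine()) != null) {
                Path chunkDir = Paths.get("chunks", "chunk-" + (count / chunkSize));
                Files.createDirectories(chunkDir);
                Files.write(chunkDir.resolve("doc-" + count + ".txt"),
                        record.getBytes(StandardCharsets.UTF_8));
                count++;
            }
        }
    }
}
```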


#### Future work
Our initial experiments with SystemT have helped us gain insights into this field. Our Texera project has just started, and I am proudly responsible for the regex matching part. We are doing more research on the Google Code Search program and, more generally, on regular expression matching with indexes. We believe we can find ways to make it much faster. For the latest updates, please visit the [**CS290 2016S Task: Regex Matcher**](https://github.com/Texera/texera/wiki/CS290-2016S-Task:-Regex-Matcher) wiki page.

#### Acknowledgements

We want to thank IBM for providing their software package for free for educational purposes, and the SystemT team for their great help throughout this project.

GitHub link: https://github.com/apache/texera/discussions/3985
