[D] CS290 Spring 2016 [texera]

via GitHub Sun, 19 Oct 2025 22:59:13 -0700


GitHub user chenlica created a discussion: CS290 Spring 2016


The content is from https://github.com/apache/texera/wiki/CS290-Spring-2016 
(may be dangling)

======

# CS290: Text Analytics in the Big Data Era  
Spring 2016, Department of Computer Science, UC Irvine  

* Instructor: [Prof. Chen Li](http://chenli.ics.uci.edu/)
* Lecture time: Mondays 4-5:30 pm, DBH 4011
* Office Hours: Mondays 3-4 pm, DBH 2092 (Email confirmation needed)

**Goal**: 
* Gain hands-on experience building a system that manages large amounts of text information
* Study research challenges related to text and data management
* Form teams to do a group project; learn tools and skills to manage a software project.

**[Poster Presentation](https://docs.google.com/presentation/d/1W9rm8EL3YgjDcXnuB8Gd5JK5Jf7Mf6zMHi8YExVu_Lo/edit?usp=drive_web)**

**[System Overview](https://github.com/Texera/texera/wiki)**

**[Email list (Google Groups)](https://groups.google.com/d/forum/texeraproject)**

**[Management Sheet](https://docs.google.com/spreadsheets/d/1RFv_GUtOuvncSz7Zvcc_OyI2TH6LQwREHBPuS6Lnj6E/edit#gid=1924715222)**

**[Google Drive](https://drive.google.com/folderview?id=0B_b7l2bhyZTuNzN1UlM2WjRiZlE&usp=sharing)**


Schedule

| No. | Date | Topics | Todos |
| :---: | :---: | :--- | :--- |
| 01 | 03/28/2016 | Introduction, [SystemT Overview](https://docs.google.com/presentation/d/15XUtm-NBCDyx_Oy9eVDYYcnyYNKGJlWyYOAJ6IyshPo/edit?usp=sharing) (by Instructor and Zuozhi) | Bid on tasks, form teams, github warmup |
| 02 | 04/04/2016 | Task assignments, [Lucene Overview](https://docs.google.com/presentation/d/1P9HUFFW72ogqdEZf07r5Y7_gM9JK6Wu8UVgH0bGNkF0/edit?usp=sharing) (by team 1) | Lucene sample program, design phase |
| 03 | 04/11/2016 | ScanOperator (team 1), Data Store (team 1), Development environment (team 2), progress report (all teams) | Design phase, operator interface, test cases |
| 04 | 04/18/2016 | [Token-based fuzzy operator](https://docs.google.com/presentation/d/123dmjHnazpfU82TVmlT3OXfcCGlK6o0NgLUWAUGj_ns/edit#slide=id.g12b64ffe2d_0_259) (Team 5), progress report (all teams) | Operator interface, test cases |
| 05 | 04/25/2016 | [Stanford NLP](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing) (Team 7), progress report (all teams) | Test cases, Implementation |
| 06 | 05/02/2016 | [Regex Matching](https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit#slide=id.g12e478a39d_0_10) (Team 3), progress report (all teams) | Implementation |
| 07 | 05/09/2016 | [Fuzzy Tokenizer] (Foobar) (Team 2), progress report (all teams) | Implementation, Documentation |
| 08 | 05/16/2016 | Progress report (all teams) | Finishing Implementation, Starting Documentation |


**Course schedule:**
* Meet weekly with talks and project discussions;
* Form teams to work together;
* Evaluate existing software packages;
* Design and implement a text-centric data-management system.

**Prerequisites:**
* Hands-on system-building experience;
* Familiarity with Java and C/C++;
* Desire to learn, read existing software, and build systems;
* Eagerness to solve open problems;
* (Optional but a big plus) Have taken [CS222](http://www.ics.uci.edu/~cs222) or [CS221](http://www.ics.uci.edu/~lopes/teaching/cs221W12/).

**Commitment**: 10 hours per week, 2 units

**Software Tools**:

* Java
* Maven
* Git
* Wiki
* Issue tracking
* Jenkins 




**Tasks (you are welcome to propose your own)**:
* Support dictionary-based search on documents (using [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene))
* Build a gram-based inverted index (using [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene))
* Support fuzzy search with a gram index (using [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene))
* Support regex search with a gram index (using [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene))
* Develop a query processor
* Write a parser and translator from a SystemT query to a Texera query
* (Optional) Design a declarative query language, TextSQL, and write a parser for it
* (Optional) Include an embedded DB ([Derby](https://db.apache.org/derby/)) and store query results in it
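
To make the gram-index tasks more concrete, here is a minimal in-memory sketch of a gram-based inverted index with a simple candidate filter for fuzzy search. It is only an illustration under stated assumptions: it does not use Lucene, the class name `GramIndexSketch` and its methods are hypothetical, and the "share at least `minShared` grams" filter is just one common way to prune candidates, not necessarily the approach the project teams chose.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: a tiny in-memory gram-based inverted index (hypothetical, no Lucene). */
final class GramIndexSketch {
    private final int n; // gram length
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    GramIndexSketch(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("gram length must be positive");
        }
        this.n = n;
    }

    /** Splits text into overlapping, lower-cased character grams of length n. */
    static List<String> grams(String text, int n) {
        String s = text.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        return result;
    }

    /** Adds a document to the index and returns its id. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String g : grams(text, n)) {
            postings.computeIfAbsent(g, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** Returns ids of documents that share at least minShared distinct grams with the query. */
    Set<Integer> candidates(String query, int minShared) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String g : new TreeSet<>(grams(query, n))) {
            for (int id : postings.getOrDefault(g, Collections.emptySet())) {
                counts.merge(id, 1, Integer::sum);
            }
        }
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minShared) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        GramIndexSketch index = new GramIndexSketch(2);
        index.add("lucene");
        index.add("luence");
        index.add("derby");
        // Candidates for a misspelled query; a real fuzzy search would verify them further.
        System.out.println(index.candidates("lucine", 3));
    }
}
```

The filter only narrows the search: a complete fuzzy matcher would still verify each candidate, for example with an edit-distance check, and a regex matcher could use the same postings to skip documents that lack grams the pattern requires.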

**Related Projects**:
* [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene) on keyword search (Java)
* [Flamingo (UCI)](http://flamingo.ics.uci.edu) on fuzzy search (C++)
* [RE2](https://github.com/google/re2) on index-based regex (C++)
* [SystemT (IBM)](http://researcher.watson.ibm.com/researcher/view_group.php?id=1264) on information extraction (Java)
* [Stanford NLP](http://nlp.stanford.edu/) on natural language processing (Java)

**Project Management**:
* Form teams to do tasks. Each team has 1 or 2 members;
* Write test cases first;
* If possible, start with the simplest solution (even if it is scan-based), then develop a more advanced one;
* Be prepared to make adjustments during the course of the project.
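
As one possible reading of the "simplest solution first" advice, the sketch below shows a scan-based dictionary matcher that checks every document against every dictionary entry, with no index at all. The class `ScanDictionaryMatcher`, the nested `Match` type, and the `scan` method are hypothetical names chosen for illustration; they are not taken from the project code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: a scan-based dictionary matcher (hypothetical names, no index). */
final class ScanDictionaryMatcher {

    /** One occurrence of a dictionary entry inside a document. */
    static final class Match {
        final int docId;
        final int offset; // position in the lower-cased document text
        final String entry;

        Match(int docId, int offset, String entry) {
            this.docId = docId;
            this.offset = offset;
            this.entry = entry;
        }

        @Override
        public String toString() {
            return docId + ":" + offset + ":" + entry;
        }
    }

    /** Scans all documents for all entries; the cost grows with documents times entries. */
    static List<Match> scan(List<String> docs, List<String> dictionary) {
        List<Match> matches = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String text = docs.get(docId).toLowerCase(Locale.ROOT);
            for (String entry : dictionary) {
                String needle = entry.toLowerCase(Locale.ROOT);
                if (needle.isEmpty()) {
                    continue; // an empty entry would match at every position
                }
                int from = text.indexOf(needle);
                while (from >= 0) {
                    matches.add(new Match(docId, from, entry));
                    from = text.indexOf(needle, from + 1);
                }
            }
        }
        return matches;
    }
}
```

A baseline like this is slow on large collections, but it is easy to get right, so its output can serve as the expected result in test cases written before an index-based operator exists.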


**Project Protocol**:

* Do not add large files to git.  Check [GitHub guidance](https://help.github.com/articles/what-is-my-disk-quota/) for details.
* Write high-quality code.
* Do high-quality peer reviews.
* Write good documentation on the GitHub wiki. Each wiki page lists its authors and reviewers with their email addresses.
* Drawing diagrams: Use Google Drawings. Add diagram source files to [Google Drive](https://drive.google.com/folderview?id=0B_b7l2bhyZTuNzN1UlM2WjRiZlE&usp=sharing) and change the ownership to "texeraproject AT gmail.com".  Add authors to each diagram, and include the source file link on the wiki.  Here is an [example](https://github.com/chenlica/texera/wiki/Design-and-Architecture).
* Use the "sandbox/" folder on git for your own experiments.  Name your folder under "sandbox/" in the format "[firstname]-[lastname]" (all lower case).
* Use [GitHub Issues](https://github.com/chenlica/texera/issues) to manage tasks and bugs.

# Project Lead:
![Chen Li](https://docs.google.com/drawings/d/1PIQwRDWhX66nWYO1hAGn7DA3T5KnARz5S-FKeiJzHvs/pub?w=200&h=200)  
[Chen Li](https://github.com/chenlica)  
# Tasks:

## Dictionary Matcher Operator  
![Sandeep Reddy Madugula](https://docs.google.com/drawings/d/1xwIqAuN9FohHpndKX3xFRsLypaxDgdzl0wtSyUtjYaw/pub?w=200&h=200)|![Rajesh Yarlagadda](https://docs.google.com/drawings/d/123dfzbe36jzEIk_2u7ZUt9gv2bQRx6eCsdB8xyuYdY0/pub?w=200&h=200)|![Sudeep Meduri](https://docs.google.com/drawings/d/1dQt5xtGWIvLNeD1kgQBxmOgLq0Zfwo09Y883id2c8nU/pub?w=200&h=200)
:---:|:---:|:---:
[Sandeep Reddy Madugala](https://github.com/sandeepreddy602) | [Rajesh Yarlagadda](https://github.com/rajesh9625) | [Sudeep Meduri](https://github.com/inkudo)

## Query-Rewriter Operator  
![Kishore Narendran](https://docs.google.com/drawings/d/1Jhc8gTQZn3RPu14EP9TnTKpRSjTSMxlPK86AjuPrApI/pub?w=200&h=200)|![Shiladitya Sen](https://docs.google.com/drawings/d/1-Hql_tZWOrYDSE74vDI5VC90yiPdUQ11jQDaEAj0m5E/pub?w=200&h=200)
:---:|:---:
[Kishore Narendran](https://github.com/kishore-narendran) | [Shiladitya Sen](https://github.com/shiladityasen)

## Regex Matcher Operator  
![Zuozhi Wang](https://docs.google.com/drawings/d/1n2j62Mxc4vkP0Qw70aiDwhi6rtmSX34GA4E_HY1I_BU/pub?w=200&h=200)|![Shuying](https://docs.google.com/drawings/d/1JuesfsuSXCVqjH00s_HDEVQA0XtqS8krSJBzb4qD6sY/pub?w=200&h=200)
:---:|:---:
[Zuozhi Wang](https://github.com/zuozhi) | [Shuying](https://github.com/laisycs)

## Keyword Matcher Operator  
![Akshay Jain](https://docs.google.com/drawings/d/17b06T0YmrHdlT3mfaq0G1FzYQZnh3tiFgqtqaWCczN0/pub?w=200&h=200)|![Prakul Agarwal](https://docs.google.com/drawings/d/1x6RXCEp4xyZYpl1rXIRPPy6rrjJCWHTne8_5b5RsPNg/pub?w=200&h=200)
:---:|:---:
[Akshay Jain](https://github.com/Akshaybetala) | [Prakul Agarwal](https://github.com/Prakul)

## Token-based Fuzzy Matcher  
![Varun Bharill](https://docs.google.com/drawings/d/1M8vVVAx-6IorMlwYJCaX4z7fe_EfsPYZwhnh3sswaHY/pub?w=200&h=200)|![Parag Sarogi](https://docs.google.com/drawings/d/1MxbP0ShejJHkoxrUUjbPSkp-IG9Lm3xgN-924m7cJfA/pub?w=200&h=200)
:---:|:---:
[Varun Bharill](https://github.com/varunbharill) | [Parag Sarogi](https://github.com/renacimiento)

## SystemT Comparison  
![Jinggang Diao](https://docs.google.com/drawings/d/1lolKnzGY-MMt4G8dH5jvQ-hSsa9qBFgi_GUjLFEIoEU/pub?w=200&h=200)|![Flavio Bayer](https://docs.google.com/drawings/d/1nRUFLoj16SkTtqvfScEmfhqKmcEgLZUV3mg3xORQRnI/pub?w=200&h=200)|![Qing Tang](https://docs.google.com/drawings/d/1vbyK0FsIT-H5QoeRAFGRnRxTzRP_LZ6AQpHHzosmSkU/pub?w=200&h=200)
:---:|:---:|:---:
[Jinggang Diao](https://github.com/diaojinggang) | [Flavio Bayer](https://github.com/FlavioBayer) | [Qing Tang](https://github.com/fukoyui)

## Integrating Stanford NLP
![Feng Hong](https://docs.google.com/drawings/d/1GHHqj8IR_PmOL8H8qo2uHv1XAzOPJB1_OuZgkVERDdU/pub?w=200&h=200)|![Yang Jiao](https://docs.google.com/drawings/d/1w3AiX3CY80a3LJKCapXgxPq0OX5NX30ZfnKCJvnTtJg/pub?w=200&h=200)
:---:|:---:
[Feng Hong](https://github.com/sam0227) | [Yang Jiao](https://github.com/yangjiao2)

GitHub link: https://github.com/apache/texera/discussions/3954
