GitHub user chenlica created a discussion: CS290 Spring 2016
The content is from https://github.com/apache/texera/wiki/CS290-Spring-2016 (may be dangling)

======

# CS290: Text Analytics in the Big Data Era

Spring 2016, Department of Computer Science, UC Irvine

* Instructor: [Prof. Chen Li](http://chenli.ics.uci.edu/)
* Lecture time: Mondays 4-5:30 pm, DBH 4011
* Office Hours: Mondays 3-4 pm, DBH 2092 (Email confirmation needed)

**Goal**:

* Gain hands-on experience building a system that manages large amounts of text information
* Study research challenges related to text and data management
* Form teams to do a group project; learn tools and skills to manage a software project.

**[Poster Presentation](https://docs.google.com/presentation/d/1W9rm8EL3YgjDcXnuB8Gd5JK5Jf7Mf6zMHi8YExVu_Lo/edit?usp=drive_web)**

**[System Overview](https://github.com/Texera/texera/wiki)**

**[Email list (Google Groups)](https://groups.google.com/d/forum/texeraproject)**

**[Management Sheet](https://docs.google.com/spreadsheets/d/1RFv_GUtOuvncSz7Zvcc_OyI2TH6LQwREHBPuS6Lnj6E/edit#gid=1924715222)**

**[Google Drive](https://drive.google.com/folderview?id=0B_b7l2bhyZTuNzN1UlM2WjRiZlE&usp=sharing)**

Schedule

| No. | Date | Topics | Todos |
| ------------- |:-------------:| :-----| :--------|
| 01 | 03/28/2016 | Introduction, [SystemT Overview](https://docs.google.com/presentation/d/15XUtm-NBCDyx_Oy9eVDYYcnyYNKGJlWyYOAJ6IyshPo/edit?usp=sharing) (by Instructor and Zuozhi) | Bid on tasks, form teams, GitHub warmup |
| 02 | 04/04/2016 | Task assignments, [Lucene Overview](https://docs.google.com/presentation/d/1P9HUFFW72ogqdEZf07r5Y7_gM9JK6Wu8UVgH0bGNkF0/edit?usp=sharing) (by Team 1) | Lucene sample program, design phase |
| 03 | 04/11/2016 | ScanOperator (Team 1), Data Store (Team 1), Development environment (Team 2), progress report (all teams) | Design phase, operator interface, test cases |
| 04 | 04/18/2016 | [Token-based fuzzy operator](https://docs.google.com/presentation/d/123dmjHnazpfU82TVmlT3OXfcCGlK6o0NgLUWAUGj_ns/edit#slide=id.g12b64ffe2d_0_259) (Team 5), progress report (all teams) | Operator interface, test cases |
| 05 | 04/25/2016 | [Stanford NLP](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing) (Team 7), progress report (all teams) | Test cases, implementation |
| 06 | 05/02/2016 | [Regex Matching](https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit#slide=id.g12e478a39d_0_10) (Team 3), progress report (all teams) | Implementation |
| 07 | 05/09/2016 | [Fuzzy Tokenizer] (Foobar) (Team 2), progress report (all teams) | Implementation, documentation |
| 08 | 05/16/2016 | Progress report (all teams) | Finishing implementation, starting documentation |

**Course schedule:**

* Meet weekly with talks and project discussions;
* Form teams to work together;
* Evaluate existing software packages;
* Design and implement a text-centric data-management system.
**Prerequisites:**

* Hands-on system-building experience;
* Familiarity with Java and C/C++;
* Desire to learn, read existing software, and build systems;
* Eagerness to solve open problems;
* (Optional but a big plus) Have taken [CS222](http://www.ics.uci.edu/~cs222) or [CS221](http://www.ics.uci.edu/~lopes/teaching/cs221W12/).

**Commitment**: 10 hours per week, 2 units

**Software Tools**:

* Java
* Maven
* Git
* Wiki
* Issue tracking
* Jenkins

**Tasks (Welcome to propose your own)**:

* Support dictionary-based search on documents (using [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene))
* Build gram-based inverted index (using [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene))
* Support fuzzy search with gram index (using [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene))
* Support regex search with gram index (using [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene))
* Develop a query processor
* Write a parser and translator from a SystemT query to a Texera query
* (Optional) Design a declarative query language TextSQL and write a parser
* (Optional) Include an embedded DB ([Derby](https://db.apache.org/derby/)) and store query results

**Related Projects**:

* [Lucene](https://github.com/apache/lucene-solr/tree/master/lucene) on keyword search (Java)
* [Flamingo (UCI)](http://flamingo.ics.uci.edu) on fuzzy search (C++)
* [RE2](https://github.com/google/re2) on index-based regex (C++)
* [SystemT (IBM)](http://researcher.watson.ibm.com/researcher/view_group.php?id=1264) on information extraction (Java)
* [Stanford NLP](http://nlp.stanford.edu/) on natural language processing (Java)

**Project Management**:

* Form teams to do tasks. Each team has 1 or 2 members;
* Write test cases first;
* If possible, start with the simplest solution (even if it is scan-based), then develop a more advanced one;
* Be prepared to make adjustments during the course of the project.
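To make the "scan-based first, then more advanced" advice concrete for the gram-index tasks, here is a minimal sketch in plain Java. It is an illustration only: it uses in-memory collections rather than Lucene, and the class name `GramIndexSketch` and its methods are hypothetical, not part of the Texera code base. The scan-based search serves as the simple baseline, and the gram-based search intersects posting lists to find candidates and then verifies each one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: a tiny in-memory n-gram inverted index (not Lucene, not Texera code). */
class GramIndexSketch {
    private final int n;                                   // gram length
    private final List<String> docs = new ArrayList<>();   // document id -> text
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // gram -> document ids

    GramIndexSketch(int n) {
        if (n < 1) throw new IllegalArgumentException("gram length must be positive");
        this.n = n;
    }

    /** Adds a document, indexes every n-gram of its lower-cased text, and returns its id. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String gram : grams(text.toLowerCase(Locale.ROOT))) {
            postings.computeIfAbsent(gram, g -> new HashSet<>()).add(id);
        }
        return id;
    }

    /** Baseline: scan every document for the query as a case-insensitive substring. */
    Set<Integer> scanSearch(String query) {
        String q = query.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>();
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase(Locale.ROOT).contains(q)) hits.add(id);
        }
        return hits;
    }

    /** Index-based: intersect the posting lists of the query grams, then verify candidates. */
    Set<Integer> indexSearch(String query) {
        String q = query.toLowerCase(Locale.ROOT);
        List<String> queryGrams = grams(q);
        if (queryGrams.isEmpty()) return scanSearch(query); // query shorter than n: fall back to scan
        Set<Integer> candidates = null;
        for (String gram : queryGrams) {
            Set<Integer> ids = postings.getOrDefault(gram, Collections.<Integer>emptySet());
            if (candidates == null) candidates = new HashSet<>(ids);
            else candidates.retainAll(ids);
            if (candidates.isEmpty()) break;                // no document contains all grams so far
        }
        // Sharing all grams does not guarantee the substring occurs, so verify each candidate.
        Set<Integer> hits = new TreeSet<>();
        for (int id : candidates) {
            if (docs.get(id).toLowerCase(Locale.ROOT).contains(q)) hits.add(id);
        }
        return hits;
    }

    /** Returns all overlapping substrings of length n, in order of occurrence. */
    private List<String> grams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) out.add(s.substring(i, i + n));
        return out;
    }

    public static void main(String[] args) {
        GramIndexSketch index = new GramIndexSketch(3);
        index.add("Lucene supports keyword search");
        index.add("Flamingo supports fuzzy search");
        index.add("RE2 supports regex matching");
        System.out.println("scan:  " + index.scanSearch("search"));
        System.out.println("index: " + index.indexSearch("search"));
    }
}
```

Because both methods must return the same document ids, the scan-based version doubles as a test oracle for the index-based one, which fits the "write test cases first" guideline. A Lucene-backed version would replace the `postings` map with a real index; the fuzzy and regex tasks would additionally need a different candidate-filtering rule than the plain intersection used here.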
**Project Protocol**:

* Do not add large files to git. Check [GitHub guidance](https://help.github.com/articles/what-is-my-disk-quota/) for details.
* Write high-quality code.
* Do high-quality peer reviews.
* Write good documentation using the GitHub wiki. Each wiki page lists its authors and reviewers with their email addresses.
* Drawing diagrams: Use Google Drawings. Add diagram source files to [Google Drive](https://drive.google.com/folderview?id=0B_b7l2bhyZTuNzN1UlM2WjRiZlE&usp=sharing) and change the ownership to "texeraproject AT gmail.com". Add authors to each diagram, and include the source file link on the wiki. Here is an [example](https://github.com/chenlica/texera/wiki/Design-and-Architecture).
* Use the "sandbox/" folder on git for your own experiments. Use the format of "[firstname]-[lastname]" (all lower case) for the name of your folder under "sandbox/".
* Use [GitHub Issues](https://github.com/chenlica/texera/issues) to manage tasks and bugs.

# Project Lead: [Chen Li](https://github.com/chenlica)

# Tasks:

## Dictionary Matcher Operator

| | | |
| :---: | :---: | :---: |
| [Sandeep Reddy Madugala](https://github.com/sandeepreddy602) | [Rajesh Yarlagadda](https://github.com/rajesh9625) | [Sudeep Meduri](https://github.com/inkudo) |

## Query-Rewriter Operator

| | |
| :---: | :---: |
| [Kishore Narendran](https://github.com/kishore-narendran) | [Shiladitya Sen](https://github.com/shiladityasen) |

## Regex Matcher Operator

| | |
| :---: | :---: |
| [Zuozhi Wang](https://github.com/zuozhi) | [Shuying](https://github.com/laisycs) |

## Keyword Matcher Operator

| | |
| :---: | :---: |
| [Akshay Jain](https://github.com/Akshaybetala) | [Prakul Agarwal](https://github.com/Prakul) |

## Token-based Fuzzy Matcher

| | |
| :---: | :---: |
| [Varun Bharill](https://github.com/varunbharill) | [Parag Sarogi](https://github.com/renacimiento) |

## System T comparison

| | | |
| :---: | :---: | :---: |
| [Jinggang Diao](https://github.com/diaojinggang) | [Flavio Bayer](https://github.com/FlavioBayer) | [Qing Tang](https://github.com/fukoyui) |

## Integrating Stanford NLP

| | |
| :---: | :---: |
| [Feng Hong](https://github.com/sam0227) | [Yang Jiao](https://github.com/yangjiao2) |

GitHub link: https://github.com/apache/texera/discussions/3954

----

This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
