Hey Guys, I submitted the below talk on Apache Tika, Nutch and Solr to ApacheCon NA 2014:

Title: Real Data Science: Exploring the FBI's Vault dataset with Apache Tika, Nutch and Solr

Event: ApacheCon North America

Submission Type: Lightning Talk

Category: Developer

Biography: Chris Mattmann has a wealth of experience in software design, and in the construction of large-scale data-intensive systems. His work has infected a broad set of communities, ranging from helping NASA unlock data from its next generation of earth science system satellites, to assisting graduate students at the University of Southern California (his alma mater) in the study of software architecture, all the way to helping industry and open source as a member of the Apache Software Foundation. When he's not busy being busy, he's spending time with his lovely wife and son braving the mean streets of Southern California.

Abstract: Apache Tika is a content detection and analysis toolkit allowing automated MIME type identification and rapid parsing of text and metadata from over 1200 types of files, including all major file types from the Internet Assigned Numbers Authority's MIME database. In this talk I'll show you how to use Apache Tika in practice to explore the FBI's vault of declassified PDF documents, how to use Apache Nutch to pull down the dataset, and how to use Solr to ingest and geoclassify the documents so that you can build a map of FBI PDF documents corresponding to your favorite conspiracies throughout the USA. I've taught this material in my CSCI 572 Search Engines class at USC and it's a big hit. These are normally three assignments, so I will do my best to boil down their essence into a 45-60 minute talk replete with danger and excitement.

Audience: Developers interested in using Tika, Nutch and Solr. Folks interested in the FBI vault dataset. GIS wonks. The like.

Experience Level: Intermediate

Benefits to the Ecosystem: The core of the talk will be Tika, but there will be some Nutch magic, and some Solr magic at very basic levels.
The benefit to the ecosystem will be a real display of the data science involved, on a real dataset.

Technical Requirements: I need an internet connection and a projector.

Status: New

Cheers,
Chris