Spark is a great framework for running work in parallel across multiple machines, and it should be a good fit for your case.
Once you can wrap your entire pipeline into a single Python function:

    def process_document(path, text):
        # you can call other tools or services here
        return xxx

then you can process all the documents in parallel as easily as:

    sc.wholeTextFiles("path/to/documents") \
      .map(lambda kv: process_document(kv[0], kv[1])) \
      .saveAsXXX("path/in/s3")

(wholeTextFiles gives you an RDD of (path, content) pairs, one pair per file.) A rough sketch of what such a process_document could look like follows at the end of this message.

On Wed, May 20, 2015 at 12:38 AM, jakeheller <j...@casetext.com> wrote:
> Hi all, I'm new to Spark -- so new that we're deciding whether to use it
> in the first place, and I was hoping someone here could help me figure
> that out.
>
> We're doing a lot of processing of legal documents -- in particular, the
> entire corpus of American law. It's about 10m documents, many of which
> are quite large as far as text goes (100s of pages).
>
> We'd like to
> (a) transform these documents from the various (often borked) formats
> they come to us in into a standard XML format,
> (b) once they're in a standard format, extract information from them
> (e.g., which judicial cases cite each other?) and annotate the documents
> with the information extracted, and then
> (c) deliver the end result to a repository (like s3) where it can be
> accessed by the user-facing application.
>
> Of course, we'd also like to do all of this quickly -- optimally, running
> the entire database through the whole pipeline in a few hours.
>
> We currently use a mix of Python and Java scripts (including XSLT, and
> NLP/unstructured-data tools like UIMA and Stanford's CoreNLP) in various
> places along the pipeline we built for ourselves to handle these tasks.
> The current pipeline infrastructure was built a while back -- it's
> basically a number of HTTP servers that each have a single task and pass
> the document along from server to server as it goes through the
> processing pipeline. It works, although it's having trouble scaling, and
> there are some reliability issues. It's also a headache to handle all the
> infrastructure. For what it's worth, metadata about the documents resides
> in SQL, and the actual text of the documents lives in s3.
>
> It seems like Spark would be ideal for this, but after some searching I
> wasn't able to find many examples of people using it for
> document-processing tasks (like transforming documents from one XML
> format into another), and I'm not clear whether I can chain those sorts
> of tasks with NLP tasks, especially if some happen in Python and others
> in Java. Finally, I don't know whether the size of the data (i.e., we'll
> likely want to run operations on whole documents, rather than just lines)
> imposes issues/constraints.
>
> Thanks all!
> Jake
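For concreteness, here is a minimal, untested sketch of a process_document that chains your (a) and (b) steps by shelling out to external tools on each executor. The stylesheet name "normalize.xsl" and the jar name "annotate.jar" are hypothetical placeholders -- swap in your own XSLT and UIMA/CoreNLP invocations:

    import subprocess

    def process_document(path, text):
        # (a) normalize the raw document into your standard XML format,
        #     e.g. via an XSLT processor ("normalize.xsl" is a placeholder;
        #     "-" tells xsltproc to read the document from stdin)
        p = subprocess.Popen(["xsltproc", "normalize.xsl", "-"],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        xml, _ = p.communicate(text.encode("utf-8"))

        # (b) annotate the XML with a Java tool ("annotate.jar" stands in
        #     for a wrapper around your UIMA/CoreNLP steps); each executor
        #     runs its own JVM subprocess, so Python and Java can coexist
        p = subprocess.Popen(["java", "-jar", "annotate.jar"],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        annotated, _ = p.communicate(xml)

        # hand back the annotated XML keyed by the original path
        return (path, annotated.decode("utf-8"))

Step (c) is then just the save, for example (bucket names are placeholders):

    sc.wholeTextFiles("s3n://your-bucket/documents/*") \
      .map(lambda kv: process_document(kv[0], kv[1])) \
      .values() \
      .saveAsTextFile("s3n://your-bucket/processed/")

Since each task only holds one document in memory at a time, whole-document (rather than line-by-line) processing is not a problem at this scale.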