Spark is a great framework for doing this kind of work in parallel across
multiple machines, and it should be really helpful for your case.
Once you can wrap your entire pipeline into a single Python function:
    def process_document(path, text):
        # call out to your other tools or services here
        return result  # the processed/annotated document
then you can process all the documents in parallel as easily as:
    (sc.wholeTextFiles("path/to/documents")
       .map(lambda kv: process_document(kv[0], kv[1]))
       .saveAsXXX("path/in/s3"))
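
For example, here is a fuller sketch of the same idea. To be clear,
convert_to_xml and annotate below are just stand-in placeholders for your
existing conversion and annotation tools, and the S3 paths are made up --
adapt everything to your setup:

    from pyspark import SparkContext

    sc = SparkContext(appName="doc-pipeline")

    def convert_to_xml(text):
        # placeholder: plug in your XSLT / format-cleanup step here
        return text

    def annotate(xml):
        # placeholder: plug in your UIMA / CoreNLP calls here
        return xml

    def process_document(path, text):
        # (a) normalize the raw input into your standard XML format
        xml = convert_to_xml(text)
        # (b) extract information and annotate the document
        annotated = annotate(xml)
        return (path, annotated)

    # wholeTextFiles yields one (filename, content) pair per file, so each
    # document stays intact instead of being split into lines.
    (sc.wholeTextFiles("s3n://bucket/documents/*")
       .map(lambda kv: process_document(kv[0], kv[1]))
       .saveAsTextFile("s3n://bucket/output/"))

Each document is processed independently, so Spark can spread the work
across the cluster. The per-document function can also shell out to your
existing Java tools (or you can stream records through an external process
with rdd.pipe()) if porting everything to Python isn't practical.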
On Wed, May 20, 2015 at 12:38 AM, jakeheller wrote:
> Hi all, I'm new to Spark -- so new that we're deciding whether to use it in
> the first place, and I was hoping someone here could help me figure that
> out.
>
> We're doing a lot of processing of legal documents -- in particular, the
> entire corpus of American law. It's about 10m documents, many of which are
> quite long (hundreds of pages of text).
>
> We'd like to
> (a) transform these documents from the various (often borked) formats they
> come to us in into a standard XML format,
> (b) once they're in a standard format, extract information from them (e.g.,
> which judicial cases cite each other?) and annotate the documents with the
> extracted information, and then
> (c) deliver the end result to a repository (like s3) where it can be
> accessed by the user-facing application.
>
> Of course, we'd also like to do all of this quickly -- ideally, running
> the entire database through the whole pipeline in a few hours.
>
> We currently use a mix of Python and Java scripts (including XSLT, and
> NLP/unstructured data tools like UIMA and Stanford's CoreNLP) in various
> places along the pipeline we built for ourselves to handle these tasks. The
> current pipeline infrastructure was built a while back -- it's basically a
> number of HTTP servers that each have a single task and pass the document
> along from server to server as it goes through the processing pipeline. It
> works, although it's having trouble scaling, and there are some reliability
> issues. Managing all the infrastructure is also a headache. For what it's
> worth, metadata about the documents resides in SQL, and the actual text of
> the documents lives in s3.
>
> It seems like Spark would be ideal for this, but after some searching I
> wasn't able to find too many examples of people using it for
> document-processing tasks (like transforming documents from one XML format
> into another), and I'm not clear whether I can chain those sorts of tasks
> with NLP tasks, especially if some happen in Python and others in Java. Finally, I
> don't know if the size of the data (we'll likely want to run operations
> on whole documents, rather than just lines) imposes any issues or
> constraints.
>
> Thanks all!
> Jake
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org