Re: Is this a good use case for Spark?

2015-05-20 Thread Davies Liu
Spark is a great framework for doing this kind of work in parallel across
multiple machines, and it should be really helpful for your case.

Once you can wrap your entire pipeline into a single Python function:

def process_document(path, text):
    # you can call other tools or services here
    return xxx

then you can process all the documents in parallel as easily as:

sc.wholeTextFiles("path/to/documents") \
  .map(lambda kv: process_document(kv[0], kv[1])) \
  .saveAsXXX("path/in/s3")
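
For a slightly fuller sketch: everything below that isn't already in your
setup is a placeholder -- the S3 paths, the normalize.jar tool, and the
assumption that any external tools you call are installed on every worker.

import subprocess

from pyspark import SparkContext

sc = SparkContext(appName="legal-doc-pipeline")  # app name is arbitrary

def process_document(path, text):
    # (a) normalize the source format to XML, e.g. by shelling out to a
    #     hypothetical Java tool that reads the raw document on stdin
    p = subprocess.Popen(["java", "-jar", "/opt/tools/normalize.jar"],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    xml, _ = p.communicate(text.encode("utf-8"))
    # (b) extract citations / annotations here (plain Python, CoreNLP, ...)
    # (c) return whatever should be written out for this document
    return path + "\t" + xml.decode("utf-8")

# wholeTextFiles gives (path, whole file content) pairs, one per document
docs = sc.wholeTextFiles("s3n://your-bucket/documents/")
results = docs.map(lambda kv: process_document(kv[0], kv[1]))
results.saveAsTextFile("s3n://your-bucket/processed/")

If some of your Java steps are command-line tools that read one record per
line, RDD.pipe() is another way to chain them in. And since wholeTextFiles
hands you whole files rather than lines, operating on entire documents is
not a problem.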

On Wed, May 20, 2015 at 12:38 AM, jakeheller  wrote:
> Hi all, I'm new to Spark -- so new that we're deciding whether to use it in
> the first place, and I was hoping someone here could help me figure that
> out.
>
> We're doing a lot of processing of legal documents -- in particular, the
> entire corpus of American law. It's about 10m documents, many of which are
> quite large as far as text goes (100s of pages).
>
> We'd like to
> (a) transform these documents from the various (often borked) formats they
> come to us in into a standard XML format,
> (b) once they are in a standard format, extract information from them (e.g.,
> which judicial cases cite each other?) and annotate the documents with the
> information extracted, and then
> (c) deliver the end result to a repository (like s3) where it can be
> accessed by the user-facing application.
>
> Of course, we'd also like to do all of this quickly -- ideally, running
> the entire database through the whole pipeline in a few hours.
>
> We currently use a mix of Python and Java scripts (including XSLT, and
> NLP/unstructured data tools like UIMA and Stanford's CoreNLP) in various
> places along the pipeline we built for ourselves to handle these tasks. The
> current pipeline infrastructure was built a while back -- it's basically a
> number of HTTP servers that each have a single task and pass the document
> along from server to server as it goes through the processing pipeline. It's
> great although it's having trouble scaling, and there are some reliability
> issues. It's also a headache to handle all the infrastructure. For what it's
> worth, metadata about the documents resides in SQL, and the actual text of
> the documents lives in s3.
>
> It seems like Spark would be ideal for this, but after some searching I
> wasn't able to find too many examples of people using it for
> document-processing tasks (like transforming documents from one XML format
> into another) and I'm not clear if I can chain those sorts of tasks and NLP
> tasks, especially if some happen in Python and others in Java. Finally, I
> don't know if the size of the data (i.e., we'll likely want to run
> operations on whole documents, rather than just lines) imposes
> issues/constraints.
>
> Thanks all!
> Jake

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


