Chris,

Thanks for stopping by! Here's a simple example. Imagine I've got a corpus
of data, which is an RDD[String], and I want to do some POS tagging on it.
In naive Spark, that might look like this:

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos")
val proc = new StanfordCoreNLP(props)
val data = sc.textFile("hdfs://some/distributed/corpus")

def processData(s: String): Annotation = {
  val a = new Annotation(s)
  proc.annotate(a) // annotate mutates the Annotation in place
  a
}

val processedData = data.map(processData) // note: this is executed lazily

Under the covers, Spark takes the closure (processData), serializes it and
all objects/methods that it references (including the "proc"), and ships
the serialized closure off to workers so that they can run it on their
local partitions of the corpus. The issue at hand is that since the
StanfordCoreNLP object isn't serializable, *this will fail at runtime.* Hence
the solutions to this problem suggested in this thread, which all come down
to initializing the processor on the worker side (preferably once).
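In code, the usual fix is to build the processor inside mapPartitions, or to hold it in a singleton object behind a @transient lazy val so each worker JVM constructs it once on first use. Below is a minimal, self-contained sketch of the singleton pattern; a toy uppercasing function stands in for StanfordCoreNLP, and plain collections stand in for RDD partitions, so it runs without Spark or CoreNLP on the classpath:

```scala
// Hedged sketch: the expensive, non-serializable processor lives in a
// singleton object. A stand-in function replaces StanfordCoreNLP here.
object Processors {
  var inits = 0 // construction counter, only to illustrate one-time init

  // @transient: not shipped with the closure. lazy: built on first use,
  // once per JVM -- i.e. once per worker, not once per record.
  @transient lazy val proc: String => String = {
    inits += 1
    s => s.toUpperCase // stand-in for new StanfordCoreNLP(props) + annotate
  }
}

object Demo extends App {
  // Simulate two partitions of an RDD[String] with plain collections.
  val partitions = Seq(Seq("the cat", "sat"), Seq("on the mat"))
  val processed = partitions.map(_.map(s => Processors.proc(s)))
  println(processed)        // List(List(THE CAT, SAT), List(ON THE MAT))
  println(Processors.inits) // 1 -- the processor was built exactly once
}
```

In real Spark code the same effect is often achieved with data.mapPartitions, constructing the processor at the top of the partition function; either way, construction happens on the worker and stays off the serialization path.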

Your intuition about not wanting to serialize huge objects is fine. This
issue is not unique to CoreNLP - any Java library which has
non-serializable objects will face this issue.

HTH,
Evan


On Tue, Nov 25, 2014 at 8:05 AM, Christopher Manning <mann...@stanford.edu>
wrote:

> I’m not (yet!) an active Spark user, but saw this thread on twitter … and
> am involved with Stanford CoreNLP.
>
> Could someone explain how things need to be to work better with Spark —
> since that would be a useful goal.
>
> That is, while Stanford CoreNLP is not quite uniform (being developed by
> various people for over a decade), the general approach has always been
> that models should be serializable but that processors should not be. This
> makes sense to me intuitively. It doesn’t really make sense to serialize a
> processor, which often has large mutable data structures used for
> processing.
>
> But does that not work well with Spark? Do processors need to be
> serializable, and then one needs to go through and make all the elements of
> the processor transient?
>
> Or what?
>
> Thanks!
>
> Chris
>
>
> > On Nov 25, 2014, at 7:54 AM, Evan Sparks <evan.spa...@gmail.com> wrote:
> >
> > If you only mark it as transient, then the object won't be serialized,
> and on the worker the field will be null. When the worker goes to use it,
> you get an NPE.
> >
> > Marking it lazy defers initialization to first use. If that use happens
> to be after serialization time (e.g. on the worker), then the worker will
> first check to see if it's initialized, and then initialize it if not.
> >
> > I think if you *do* reference the lazy val before serializing you will
> likely get an NPE.
> >
> >
> >> On Nov 25, 2014, at 1:05 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
> >>
> >> Great, Ian's approach seems to work fine.
> >>
> >> Can anyone provide an explanation as to why this works, but passing the
> >> CoreNLP object itself
> >> as transient does not?
> >>
> >>
> >>
> >> --
> >> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654p19739.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
>
>
