Hey all, trying to set up a pretty simple streaming app and getting some
weird behavior.
First, a non-streaming job that works fine: I'm trying to pull out lines
of a log file that match a regex, for which I've set up a function:
def getRequestDoc(s: String): String =
  "KBDOC-[0-9]*".r.findFirstIn(s).orNull
val logs = sc.textFile(logfiles)
logs.map(getRequestDoc).take(10)
That works, but I want to run the same thing on streaming data, so I
tried this:
val logs = ssc.socketTextStream("localhost",4444)
logs.map(getRequestDoc).print()
ssc.start()
From this code, I get:
14/05/08 03:32:08 ERROR JobScheduler: Error running job streaming job
1399545128000 ms.0
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException:
org.apache.spark.streaming.StreamingContext
But if I do the map function inline instead of calling a separate function,
it works:
logs.map("KBDOC-[0-9]*".r.findFirstIn(_).orNull).print()
So why is it able to serialize my little function in regular Spark, but
not in streaming?
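For what it's worth, one workaround I've seen suggested (not sure it's the right fix here) is to move the function into a standalone serializable object, so the closure references the object rather than dragging in the enclosing scope that holds the StreamingContext. The object name below is just something I made up; a minimal sketch:

```scala
// Hypothetical workaround: put the function in a top-level object that
// is explicitly Serializable, so logs.map(RequestDocExtractor.getRequestDoc)
// only needs to serialize this small object, not the surrounding context.
object RequestDocExtractor extends Serializable {
  // Same regex extraction as before: first KBDOC id in the line, or null.
  def getRequestDoc(s: String): String =
    "KBDOC-[0-9]*".r.findFirstIn(s).orNull
}
```

With that in place I'd expect `logs.map(RequestDocExtractor.getRequestDoc).print()` to behave like the inline version, but I haven't verified it.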
Thanks,
Diana