I am learning more about Spark (and in this case Spark Streaming), and my understanding is that functions like dstream.map() take a function, apply it to each element of the RDD, and return a new RDD based on the original.
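For example, the simple case looks something like this (a throwaway sketch using a queue-based stream just so there is something to map over; the names are made up):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="simple-map-example")
    ssc = StreamingContext(sc, 1)  # 1-second batches

    # A small queue of RDDs stands in for a real source.
    rdd_queue = [sc.parallelize(range(10)) for _ in range(3)]
    numbers = ssc.queueStream(rdd_queue)

    # One element in, one element out: each x becomes x * x.
    squares = numbers.map(lambda x: x * x)
    squares.pprint()

    ssc.start()
    ssc.awaitTermination()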
That's cool for the simple map functions in the examples, where a lambda is used to take x and return x * x, but what happens in Python (specifically) with more complex functions, especially ones that use modules (that ARE built in on all nodes)? For example, instead of a simple map, I want to take a line of data and regex-parse it into fields. It's still a map (not a flatMap) in that it's a one-to-one return: one record of the RDD (a line of text) returns one parsed record as a Python dict.

In my Spark Streaming job I have import re in the "main" part of the file, and this all seems to work, but I want to ensure I am not "by default" forcing computations onto the driver rather than distributing them. It is "working" in that it returns the expected data, but I want to make sure I am not doing something weird by having a transform function use a module that is imported only at the driver. (Should I be calling import re IN the function?) If there are any good docs on this, I'd love to understand it more.

Thanks!
John

Example:

    import re
    from pyspark.streaming.kafka import KafkaUtils

    def parseLine(line):
        restr = r"^(\w\w\w ?\d\d? \d\d:\d\d:\d\d) ([^ ]+) "
        logre = re.compile(restr)
        # Why does every record of the RDD have a None value in the first position of the tuple?
        m = logre.search(line[1])
        rec = {}
        if m:
            rec['field1'] = m.group(1)
            rec['field2'] = m.group(2)
        return rec

    fwlog_dstream = KafkaUtils.createStream(ssc, zkQuorum, "sparkstreaming-fwlog_parsed",
                                            {kafka_src_topic: 1})
    recs = fwlog_dstream.map(parseLine)
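To make the question concrete, these are the two layouts I'm comparing (just a sketch, with the parsing trimmed down from the function above; the function names are only for illustration):

    # Option A: module-level import in the driver script (what I do now).
    import re

    def parse_line_module_import(line):
        # line is a (key, message) tuple from the Kafka stream; parse the message part.
        m = re.search(r"^(\w\w\w ?\d\d? \d\d:\d\d:\d\d) ([^ ]+) ", line[1])
        return {'field1': m.group(1), 'field2': m.group(2)} if m else {}

    # Option B: import inside the function, so the import statement runs
    # wherever the function itself runs.
    def parse_line_local_import(line):
        import re
        m = re.search(r"^(\w\w\w ?\d\d? \d\d:\d\d:\d\d) ([^ ]+) ", line[1])
        return {'field1': m.group(1), 'field2': m.group(2)} if m else {}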