Thanks Marcelo, that actually cleared up a few concepts for me. (y)
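
For reference, here is a rough sketch of the pattern described in the quoted reply below: the closure uses only its argument and returns a value, an action triggers the job, and the workbook is written on the driver. The function names parse_line, write_rows and run are hypothetical. sc is assumed to be an existing SparkContext as in the original script, and collect() is assumed to be acceptable because the data fits in the driver's memory.

```python
def parse_line(line, n_cols=4):
    """Pure function: uses only its argument and returns a value.

    Slicing is used so that short lines do not raise IndexError.
    """
    return line.split(" ")[:n_cols]


def write_rows(rows, path="output_excel.xlsx"):
    """Runs on the driver: writes already-collected rows into one workbook."""
    import xlsxwriter  # imported here so parse_line has no extra dependency

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    # The row index is generated here on the driver, not inside the mapper.
    for row, fields in enumerate(rows):
        for col, value in enumerate(fields):
            worksheet.write(row, col, value)
    workbook.close()


def run(sc, input_path="xyz.txt"):
    """sc is assumed to be an existing SparkContext."""
    data = sc.textFile(input_path)
    # map() only builds the execution graph; collect() is the action that runs it.
    rows = data.map(parse_line).collect()
    write_rows(rows)
    return len(rows)
```

Calling run(sc) from the driver script would then produce the file locally, with no shared "row" counter and no workbook object shipped to the executors.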

On Fri, May 30, 2014 at 10:14 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Hello there,
>
> On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
> > workbook = xlsxwriter.Workbook('output_excel.xlsx')
> > worksheet = workbook.add_worksheet()
> >
> > data = sc.textFile("xyz.txt")
> > # xyz.txt is a file whose each line contains string delimited by <SPACE>
> >
> > row=0
> >
> > def mapperFunc(x):
> >     for i in range(0,4):
> >         worksheet.write(row, i , x.split(" ")[i])
> >     row++
> >     return len(x.split())
> >
> > data2 = data.map(mapperFunc)
>
> > Is using row in 'mapperFunc' like this is a correct way? Will it
> > increment row each time?
>
> No. "mapperFunc" will be executed somewhere else, not in the same
> process running this script. I'm not familiar with how serializing
> closures works in Spark/Python, but you'll most certainly be updating
> the local copy of "row" in the executor, and your driver's copy will
> remain at "0".
>
> In general, in a distributed execution environment like Spark you want
> to avoid as much as possible using state. "row" in your code is state,
> so to do what you want you'd have to use other means (like Spark's
> accumulators). But those are generally expensive in a distributed
> system, and to be avoided if possible.
>
> > Is writing in the excel file using worksheet.write() in side the
> > mapper function a correct way?
>
> No, for the same reasons. Your executor will have a copy of your
> "workbook" variable. So the write() will happen locally to the
> executor, and after the mapperFunc() returns, that will be discarded -
> so your driver won't see anything.
>
> As a rule of thumb, your closures should try to use only their
> arguments as input, or at most use local variables as read-only, and
> only produce output in the form of return values. There are cases
> where you might want to break these rules, of course, but in general
> that's the mindset you should be in.
>
> Also note that you're not actually executing anything here.
> "data.map()" is a transformation, so you're just building the
> execution graph for the computation. You need to execute an action
> (like collect() or take()) if you want the computation to actually
> occur.
>
> --
> Marcelo

--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka