Thanks Jey, it was helpful.

On Sat, May 31, 2014 at 12:45 AM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:
> Thanks Marcelo,
>
> It actually made my few concepts clear. (y).
>
>
> On Fri, May 30, 2014 at 10:14 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>
>> Hello there,
>>
>> On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> > workbook = xlsxwriter.Workbook('output_excel.xlsx')
>> > worksheet = workbook.add_worksheet()
>> >
>> > data = sc.textFile("xyz.txt")
>> > # xyz.txt is a file whose each line contains string delimited by <SPACE>
>> >
>> > row=0
>> >
>> > def mapperFunc(x):
>> >     for i in range(0,4):
>> >         worksheet.write(row, i , x.split(" ")[i])
>> >     row++
>> >     return len(x.split())
>> >
>> > data2 = data.map(mapperFunc)
>>
>> > Is using row in 'mapperFunc' like this is a correct way? Will it
>> > increment row each time?
>>
>> No. "mapperFunc" will be executed somewhere else, not in the same
>> process running this script. I'm not familiar with how serializing
>> closures works in Spark/Python, but you'll most certainly be updating
>> the local copy of "row" in the executor, and your driver's copy will
>> remain at "0".
>>
>> In general, in a distributed execution environment like Spark you want
>> to avoid as much as possible using state. "row" in your code is state,
>> so to do what you want you'd have to use other means (like Spark's
>> accumulators). But those are generally expensive in a distributed
>> system, and to be avoided if possible.
>>
>> > Is writing in the excel file using worksheet.write() in side the
>> > mapper function a correct way?
>>
>> No, for the same reasons. Your executor will have a copy of your
>> "workbook" variable. So the write() will happen locally to the
>> executor, and after the mapperFunc() returns, that will be discarded -
>> so your driver won't see anything.
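
To make the point about the executor's copy concrete, here is a small plain-Python imitation. It does not use Spark at all: pickle only stands in for shipping a closure's state to another process, and the names state, payload and mapper_on_executor are made up for this illustration. How PySpark actually serializes closures may differ in detail.

```python
import pickle

# Driver-side state, playing the role of "row" in the quoted script.
state = {"row": 0}

def mapper_on_executor(serialized_state, line):
    # The "executor" only ever sees its own deserialized copy of the state.
    local_state = pickle.loads(serialized_state)
    local_state["row"] += 1  # updates the copy, not the driver's dict
    return local_state["row"], len(line.split())

# Assumption: pickling here imitates sending the closure to an executor.
payload = pickle.dumps(state)
results = [mapper_on_executor(payload, line) for line in ["a b c d", "e f g h"]]

print(results)       # each call started again from the shipped value of row
print(state["row"])  # the driver's copy was never touched
```

Every call starts from the value that was shipped, and the original dict stays unchanged, which is the behaviour described for "row" above.
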
>>
>> As a rule of thumb, your closures should try to use only their
>> arguments as input, or at most use local variables as read-only, and
>> only produce output in the form of return values. There are cases
>> where you might want to break these rules, of course, but in general
>> that's the mindset you should be in.
>>
>> Also note that you're not actually executing anything here.
>> "data.map()" is a transformation, so you're just building the
>> execution graph for the computation. You need to execute an action
>> (like collect() or take()) if you want the computation to actually
>> occur.
>>
>> --
>> Marcelo
>>
>
>
>
> --
> Rahul K Bhojwani
> 3rd Year B.Tech
> Computer Science and Engineering
> National Institute of Technology, Karnataka
>

--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
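
For the archive, a minimal sketch of how the quoted advice might be applied to the original script: the mapped function uses only its argument and returns a value, collect() is the action that triggers the computation, and the Excel file is written on the driver. The helper names first_fields, write_rows and run are hypothetical, and the sketch assumes the collected rows fit in the driver's memory.

```python
def first_fields(line, n=4):
    """Pure function: depends only on its arguments and returns
    the first n space-delimited fields of the line."""
    return line.split(" ")[:n]

def write_rows(worksheet, rows):
    """Runs on the driver. enumerate() supplies the row index,
    so no shared counter is needed."""
    for row_idx, fields in enumerate(rows):
        for col_idx, value in enumerate(fields):
            worksheet.write(row_idx, col_idx, value)

def run(sc, in_path="xyz.txt", out_path="output_excel.xlsx"):
    """Assumes `sc` is an existing SparkContext and xlsxwriter is installed."""
    import xlsxwriter  # imported here so the helpers above work without it

    # map() only builds the execution graph; collect() is the action that
    # runs it and brings the results back to the driver.
    rows = sc.textFile(in_path).map(first_fields).collect()

    workbook = xlsxwriter.Workbook(out_path)
    worksheet = workbook.add_worksheet()
    write_rows(worksheet, rows)
    workbook.close()
```

Calling run(sc) from the driver script would then produce the workbook. For input too large to collect, writing the output with something like saveAsTextFile() instead would be the more usual route.
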