Thanks Marcelo,

It actually made my few concepts clear. (y).


On Fri, May 30, 2014 at 10:14 PM, Marcelo Vanzin <van...@cloudera.com>
wrote:

> Hello there,
>
> On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
> > workbook = xlsxwriter.Workbook('output_excel.xlsx')
> > worksheet = workbook.add_worksheet()
> >
> > data = sc.textFile("xyz.txt")
> > # xyz.txt is a file whose each line contains string delimited by <SPACE>
> >
> > row=0
> >
> > def mapperFunc(x):
> >     for i in range(0,4):
> >         worksheet.write(row, i , x.split(" ")[i])
> >     row++
> >     return len(x.split())
> >
> > data2 = data.map(mapperFunc)
>
> > Is using row in 'mapperFunc' like this is a correct way? Will it
> > increment row each time?
>
> No. "mapperFunc" will be executed somewhere else, not in the same
> process running this script. I'm not familiar with how serializing
> closures works in Spark/Python, but you'll most certainly be updating
> the local copy of "row" in the executor, and your driver's copy will
> remain at "0".
>
> In general, in a distributed execution environment like Spark you want
> to avoid as much as possible using state. "row" in your code is state,
> so to do what you want you'd have to use other means (like Spark's
> accumulators). But those are generally expensive in a distributed
> system, and to be avoided if possible.
>
> > Is writing in the excel file using worksheet.write() in side the
> > mapper function a correct way?
>
> No, for the same reasons. Your executor will have a copy of your
> "workbook" variable. So the write() will happen locally to the
> executor, and after the mapperFunc() returns, that will be discarded -
> so your driver won't see anything.
>
> As a rule of thumb, your closures should try to use only their
> arguments as input, or at most use local variables as read-only, and
> only produce output in the form of return values. There are cases
> where you might want to break these rules, of course, but in general
> that's the mindset you should be in.
>
> Also note that you're not actually executing anything here.
> "data.map()" is a transformation, so you're just building the
> execution graph for the computation. You need to execute an action
> (like collect() or take()) if you want the computation to actually
> occur.
>
> --
> Marcelo
>



-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka

Reply via email to