Thanks Jey,

It was helpful.


On Sat, May 31, 2014 at 12:45 AM, Rahul Bhojwani <
rahulbhojwani2...@gmail.com> wrote:

> Thanks Marcelo,
>
> It actually made a few concepts clear for me. (y)
>
>
> On Fri, May 30, 2014 at 10:14 PM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
>
>> Hello there,
>>
>> On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>> > workbook = xlsxwriter.Workbook('output_excel.xlsx')
>> > worksheet = workbook.add_worksheet()
>> >
>> > data = sc.textFile("xyz.txt")
>> > # xyz.txt is a file in which each line contains strings delimited by <SPACE>
>> >
>> > row=0
>> >
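>> > def mapperFunc(x):
>> >     global row  # needed so the assignment below targets the module-level "row"
>> >     for i in range(0,4):
>> >         worksheet.write(row, i , x.split(" ")[i])
>> >     row += 1  # Python has no "++" operator
>> >     return len(x.split())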
>> >
>> > data2 = data.map(mapperFunc)
>>
>> > Is using row in 'mapperFunc' like this a correct way? Will it
>> > increment row each time?
>>
>> No. "mapperFunc" will be executed somewhere else, not in the same
>> process running this script. I'm not familiar with how closure
>> serialization works in Spark/Python, but you'll almost certainly be
>> updating a local copy of "row" in the executor, and your driver's copy
>> will remain at "0".
>>
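A minimal sketch of that copy behaviour, without Spark. It assumes that shipping a closure to an executor behaves roughly like serializing its state and deserializing it elsewhere; `fake_executor` is a hypothetical stand-in and not part of any Spark API.

```python
import pickle

# Driver-side state, analogous to the "row" counter in the question.
state = {"row": 0}

def fake_executor(serialized_state, lines):
    """Hypothetical stand-in for an executor: it only ever sees a copy."""
    local_state = pickle.loads(serialized_state)  # deserialized copy of the state
    for _ in lines:
        local_state["row"] += 1  # updates the copy only
    return local_state["row"]

lines = ["a b c d", "e f g h", "i j k l"]
executor_row = fake_executor(pickle.dumps(state), lines)

print("executor copy:", executor_row)
print("driver copy:", state["row"])
```

The copy inside `fake_executor` is incremented once per line, while the dictionary in the calling scope is left untouched.
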
>> In general, in a distributed execution environment like Spark you want
>> to avoid using state as much as possible. "row" in your code is state,
>> so to do what you want you'd have to use other means (like Spark's
>> accumulators). But those are generally expensive in a distributed
>> system, and to be avoided if possible.
>>
>> > Is writing to the Excel file using worksheet.write() inside the
>> > mapper function a correct way?
>>
>> No, for the same reasons. Your executor will have a copy of your
>> "workbook" variable. So the write() will happen locally to the
>> executor, and after mapperFunc() returns, that copy will be discarded,
>> so your driver won't see anything.
>>
>> As a rule of thumb, your closures should try to use only their
>> arguments as input, or at most use local variables as read-only, and
>> only produce output in the form of return values. There are cases
>> where you might want to break these rules, of course, but in general
>> that's the mindset you should be in.
>>
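One way to follow that rule of thumb, sketched under some assumptions: the mapper only parses its argument and returns the fields, and the spreadsheet is written afterwards on the driver. `parse_line`, `write_rows` and `RecordingWorksheet` are hypothetical names; the last one is a stand-in for an xlsxwriter worksheet (it mimics `write(row, col, value)`) so that the sketch runs without Spark or xlsxwriter installed.

```python
def parse_line(line):
    """Pure mapper: depends only on its argument, returns the first four fields."""
    return line.split(" ")[:4]

def write_rows(worksheet, rows):
    """Runs on the driver: row numbers come from enumerate, not from shared state."""
    for row_index, fields in enumerate(rows):
        for col_index, value in enumerate(fields):
            worksheet.write(row_index, col_index, value)

class RecordingWorksheet:
    """Hypothetical stand-in for an xlsxwriter worksheet; it just records writes."""
    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value

lines = ["a b c d", "e f g h"]
# With Spark this step would be roughly: rows = data.map(parse_line).collect()
rows = [parse_line(line) for line in lines]

sheet = RecordingWorksheet()
write_rows(sheet, rows)
print(sheet.cells)
```

Bringing the rows back with collect() only suits results small enough to fit in the driver's memory, which is the case for something destined for a single spreadsheet.
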
>> Also note that you're not actually executing anything here.
>> "data.map()" is a transformation, so you're just building the
>> execution graph for the computation. You need to execute an action
>> (like collect() or take()) if you want the computation to actually
>> occur.
>>
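The laziness can be illustrated with plain Python, as an analogy only: the built-in map() also returns a lazy object and runs nothing until it is consumed. This is not Spark code; in PySpark the consuming step would be an action such as collect() or take(n) on the RDD. The `calls` list is used here only to observe, within a single process, when the function runs.

```python
calls = []

def mapper_func(line):
    calls.append(line)  # record that the function actually ran
    return len(line.split())

lines = ["a b c d", "e f g"]

lazy_result = map(mapper_func, lines)  # builds a lazy iterator, runs nothing yet
calls_before = len(calls)

counts = list(lazy_result)  # consuming the iterator is what runs mapper_func
calls_after = len(calls)

print(calls_before, calls_after, counts)
```
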
>> --
>> Marcelo
>>
>
>
>
> --
> Rahul K Bhojwani
> 3rd Year B.Tech
> Computer Science and Engineering
> National Institute of Technology, Karnataka
>



-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
