Hi Rahul,

Marcelo's explanation is correct. Here's a possible approach to your
program. It's an untested sketch that assumes xlrd for reading the
.xls input and XlsxWriter (which you're already using) for writing
the output:


from pyspark import SparkContext
import xlrd        # reads the .xls input
import xlsxwriter  # writes the output workbook

# connect to Spark cluster
sc = SparkContext(appName="excel_munge")

# load input data in the driver process
book = xlrd.open_workbook("input.xls")
sheet = book.sheet_by_name("Sheet1")
input_rows = [sheet.row_values(i) for i in range(sheet.nrows)]

# create RDD on cluster
input_rdd = sc.parallelize(input_rows)

# munge RDD; munge_row is your per-row transformation
result_rdd = input_rdd.map(munge_row)

# collect result RDD back to the local driver process
result_rows = result_rdd.collect()

# write output file in the driver process
workbook = xlsxwriter.Workbook("output.xlsx")
worksheet = workbook.add_worksheet()
for r, values in enumerate(result_rows):
    worksheet.write_row(r, 0, values)
workbook.close()
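
Here, munge_row stands in for whatever per-row transformation you
need. The important part is that the workbook objects only ever exist
in the driver process; the cluster just sees plain Python lists.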



Hope that helps,
-Jey

On Fri, May 30, 2014 at 9:44 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Hello there,
>
> On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> workbook = xlsxwriter.Workbook('output_excel.xlsx')
>> worksheet = workbook.add_worksheet()
>>
>> data = sc.textFile("xyz.txt")
>> # each line of xyz.txt contains strings delimited by <SPACE>
>>
>> row=0
>>
>> def mapperFunc(x):
>>     global row
>>     for i in range(0, 4):
>>         worksheet.write(row, i, x.split(" ")[i])
>>     row += 1
>>     return len(x.split())
>>
>> data2 = data.map(mapperFunc)
>
>> Is using 'row' in 'mapperFunc' like this the correct way? Will it
>> increment 'row' each time?
>
> No. "mapperFunc" will be executed somewhere else, not in the same
> process running this script. I'm not familiar with how serializing
> closures works in Spark/Python, but you'll most certainly be updating
> the local copy of "row" in the executor, and your driver's copy will
> remain at "0".
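>
> For example (an untested sketch), this prints 0 on the driver even
> though the map ran on the executors:
>
> row = 0
>
> def mapper(x):
>     global row
>     row += 1  # updates the executor's copy of "row" only
>     return x
>
> sc.parallelize(range(10)).map(mapper).count()
> print(row)  # still 0 in the driver process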
>
> In general, in a distributed execution environment like Spark, you
> want to avoid shared mutable state as much as possible. "row" in your
> code is state, so to do what you want you'd have to use other means
> (like Spark's accumulators). But those are generally expensive in a
> distributed system, and best avoided if possible.
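>
> If you really do need a counter, an accumulator would look roughly
> like this (untested):
>
> row_count = sc.accumulator(0)
>
> def mapper(x):
>     row_count.add(1)  # safe to update here; read it on the driver
>     return x
>
> sc.parallelize(range(10)).map(mapper).count()
> print(row_count.value)  # 10 on the driver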
>
>> Is writing to the Excel file using worksheet.write() inside the
>> mapper function the correct way?
>
> No, for the same reasons. Your executor will have a copy of your
> "workbook" variable. So the write() will happen locally to the
> executor, and after the mapperFunc() returns, that will be discarded -
> so your driver won't see anything.
>
> As a rule of thumb, your closures should try to use only their
> arguments as input, or at most use local variables as read-only, and
> only produce output in the form of return values. There are cases
> where you might want to break these rules, of course, but in general
> that's the mindset you should be in.
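>
> In your case that means keeping the mapper a pure function and doing
> all the workbook writing in the driver, roughly like this (untested):
>
> def mapperFunc(line):
>     return line.split(" ")[:4]  # input -> output, no shared state
>
> rows = data.map(mapperFunc).collect()
> for r, fields in enumerate(rows):
>     worksheet.write_row(r, 0, fields)
> workbook.close()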
>
> Also note that you're not actually executing anything here.
> "data.map()" is a transformation, so you're just building the
> execution graph for the computation. You need to execute an action
> (like collect() or take()) if you want the computation to actually
> occur.
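>
> For instance:
>
> data2 = data.map(mapperFunc)  # builds the graph; nothing runs yet
> result = data2.collect()      # this action triggers the computation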
>
> --
> Marcelo
