Thank you Holden, it works!

# wholeTextFiles returns an RDD of (filename, content) pairs
infile = sc.wholeTextFiles(sys.argv[1])
# collect()/parallelize() round-trips all data through the driver;
# infile.saveAsSequenceFile(sys.argv[2]) also works directly on the pair RDD
rdd = sc.parallelize(infile.collect())
rdd.saveAsSequenceFile(sys.argv[2])
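
For reference, a self-contained version of the script might look like the
sketch below (the app name and the spark-submit invocation are assumptions,
not something from this thread):

import sys
from pyspark import SparkContext

# Submit with: spark-submit text_to_seq.py <input_dir> <output_dir>
sc = SparkContext(appName="TextToSequenceFile")

# One (filename, content) pair per file under the input path
pairs = sc.wholeTextFiles(sys.argv[1])

# A pair RDD can be written out directly as a Hadoop SequenceFile
pairs.saveAsSequenceFile(sys.argv[2])

sc.stop()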

Csaba


2014-10-28 17:56 GMT+01:00 Holden Karau <hol...@pigscanfly.ca>:

> Hi Csaba,
>
> It sounds like the API you are looking for is sc.wholeTextFiles :)
>
> Cheers,
>
> Holden :)
>
>
> On Tuesday, October 28, 2014, Csaba Ragany <rag...@gmail.com> wrote:
>
>> Dear Spark Community,
>>
>> Is it possible to convert text files (.log or .txt files) into
>> SequenceFiles in Python?
>>
>> Using PySpark I can create a parallelized RDD with
>> rdd = sc.parallelize([('key1', 1.0)]) and save it as a SequenceFile with
>> rdd.saveAsSequenceFile(). But how can I put the whole content of my
>> text files into the 'value' of 'key1'?
>>
>> I want a SequenceFile where the keys are the filenames of the text files
>> and the values are their content.
>>
>> Thank you for any help!
>> Csaba
>>
>
>
> --
> Cell : 425-233-8271
>
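
To verify that the keys are the filenames and the values are the file
contents, the SequenceFile can be read back with sc.sequenceFile. A minimal
sketch, assuming the output path used above:

# Each record comes back as a (filename, content) pair
pairs = sc.sequenceFile(sys.argv[2])
for filename, content in pairs.take(2):
    print("%s: %d chars" % (filename, len(content)))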
