Without the second line it will be much faster: collect() pulls all the data back to the driver only for parallelize() to redistribute it, whereas wholeTextFiles() already returns an RDD of (filename, content) pairs that saveAsSequenceFile() can write directly:

 infile = sc.wholeTextFiles(sys.argv[1])
 infile.saveAsSequenceFile(sys.argv[2])
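
For reference, a minimal self-contained version of the same thing (just a sketch; the app name and script name below are placeholders, and the input and output paths are taken from the command line as in your snippet):

 import sys
 from pyspark import SparkContext

 if __name__ == "__main__":
     sc = SparkContext(appName="TextToSequenceFile")
     # wholeTextFiles() yields (filename, file content) pairs, which is
     # exactly the key/value shape a sequence file needs
     sc.wholeTextFiles(sys.argv[1]).saveAsSequenceFile(sys.argv[2])
     sc.stop()

Submit it with spark-submit, e.g. spark-submit convert.py <input_dir> <output_dir> (convert.py being a hypothetical name).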


On Wed, Oct 29, 2014 at 3:31 AM, Csaba Ragany <rag...@gmail.com> wrote:
> Thank you Holden, it works!
>
> infile = sc.wholeTextFiles(sys.argv[1])
> rdd = sc.parallelize(infile.collect())
> rdd.saveAsSequenceFile(sys.argv[2])
>
> Csaba
>
>
> 2014-10-28 17:56 GMT+01:00 Holden Karau <hol...@pigscanfly.ca>:
>>
>> Hi Csaba,
>>
>> It sounds like the API you are looking for is sc.wholeTextFiles :)
>>
>> Cheers,
>>
>> Holden :)
>>
>>
>> On Tuesday, October 28, 2014, Csaba Ragany <rag...@gmail.com> wrote:
>>>
>>> Dear Spark Community,
>>>
>>> Is it possible to convert text files (.log or .txt files) into
>>> sequencefiles in Python?
>>>
>>> Using PySpark I can create a parallelized file with
>>> rdd=sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile with
>>> rdd.saveAsSequenceFile(). But how can I put the whole content of my text
>>> files into the 'value' of 'key1' ?
>>>
>>> I want a sequencefile where the keys are the filenames of the text files
>>> and the values are their content.
>>>
>>> Thank you for any help!
>>> Csaba
>>
>>
>>
>> --
>> Cell : 425-233-8271
>
>
