Thank you Holden, it works!

infile = sc.wholeTextFiles(sys.argv[1])
rdd = sc.parallelize(infile.collect())
rdd.saveAsSequenceFile(sys.argv[2])
Csaba

2014-10-28 17:56 GMT+01:00 Holden Karau <hol...@pigscanfly.ca>:

> Hi Csaba,
>
> It sounds like the API you are looking for is sc.wholeTextFiles :)
>
> Cheers,
>
> Holden :)
>
>
> On Tuesday, October 28, 2014, Csaba Ragany <rag...@gmail.com> wrote:
>
>> Dear Spark Community,
>>
>> Is it possible to convert text files (.log or .txt files) into
>> sequencefiles in Python?
>>
>> Using PySpark I can create a parallelized file with
>> rdd=sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile
>> with rdd.saveAsSequenceFile(). But how can I put the whole content of my
>> text files into the 'value' of 'key1'?
>>
>> I want a sequencefile where the keys are the filenames of the text files
>> and the values are their content.
>>
>> Thank you for any help!
>> Csaba
>>
>
>
> --
> Cell : 425-233-8271
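[Editor's note] Two small points on the working snippet above. First, sc.wholeTextFiles already returns an RDD of (filename, content) pairs, so the collect()/parallelize() round-trip is unnecessary; rdd = sc.wholeTextFiles(sys.argv[1]) followed by rdd.saveAsSequenceFile(sys.argv[2]) should give the same result without pulling every file onto the driver. Second, the key/value shape it produces can be sketched in plain Python, with no Spark installation; the directory layout and file names below are hypothetical, for illustration only:

```python
import os
import tempfile

def whole_text_files(path):
    """Mimic the (filename, content) pairs that sc.wholeTextFiles yields:
    one pair per file, key = file path, value = the file's full content."""
    pairs = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            with open(full) as f:
                pairs.append((full, f.read()))
    return pairs

# Build a throwaway directory with two small "log" files (hypothetical data).
tmp = tempfile.mkdtemp()
for name, text in [("a.log", "first file\n"), ("b.log", "second file\n")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

# Keys are file paths, values are whole-file contents -- exactly the pairs
# that rdd.saveAsSequenceFile() would then serialize to a sequence file.
pairs = whole_text_files(tmp)
print([(os.path.basename(k), v) for k, v in pairs])
# -> [('a.log', 'first file\n'), ('b.log', 'second file\n')]
```

This matches what the thread asks for: the sequence-file key is the file name and the value is the entire file content.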