Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Davies Liu
Without the second line it will be much faster, since collect() pulls all the data to the driver and parallelize() ships it back out to the executors:

    infile = sc.wholeTextFiles(sys.argv[1])
    infile.saveAsSequenceFile(sys.argv[2])

On Wed, Oct 29, 2014 at 3:31 AM, Csaba Ragany wrote:
> Thank you Holden, it works!
>
> infile = sc.wholeTextFiles(sys.argv[1])
> rdd = sc.parallelize(infile.collect())

Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Csaba Ragany
Thank you Holden, it works!

    infile = sc.wholeTextFiles(sys.argv[1])
    rdd = sc.parallelize(infile.collect())
    rdd.saveAsSequenceFile(sys.argv[2])

Csaba

2014-10-28 17:56 GMT+01:00 Holden Karau:
> Hi Csaba,
>
> It sounds like the API you are looking for is sc.wholeTextFiles :)
>
> Cheers,
>
> Holden

Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba,

It sounds like the API you are looking for is sc.wholeTextFiles :)

Cheers,
Holden :)

On Tuesday, October 28, 2014, Csaba Ragany wrote:
> Dear Spark Community,
>
> Is it possible to convert text files (.log or .txt files) into
> sequencefiles in Python?
>
> Using PySpark I can create

pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Csaba Ragany
Dear Spark Community,

Is it possible to convert text files (.log or .txt files) into sequencefiles in Python?

Using PySpark I can create a parallelized file with rdd = sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile with rdd.saveAsSequenceFile(). But how can I put the whole content
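The answer upthread is sc.wholeTextFiles, which reads a directory and yields an RDD of (filename, content) pairs that saveAsSequenceFile can write directly. For readers without a Spark cluster at hand, that pairing can be sketched in plain Python; this is a local emulation of the shape of the data, not the Spark API itself, and the helper name and sample files are illustrative:

```python
import os
import tempfile

def whole_text_files(directory):
    """Emulate the shape of sc.wholeTextFiles locally:
    return a list of (path, file_content) pairs, one per file."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r", encoding="utf-8") as f:
                pairs.append((path, f.read()))
    return pairs

# Demo: two small "log" files in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for name, text in [("a.log", "first file"), ("b.txt", "second file")]:
        with open(os.path.join(d, name), "w", encoding="utf-8") as f:
            f.write(text)
    for path, content in whole_text_files(d):
        print(os.path.basename(path), "->", content)
```

In PySpark the equivalent pairs come back as an RDD, so `sc.wholeTextFiles(in_dir).saveAsSequenceFile(out_dir)` converts a directory of text files into a sequencefile in one pass, with the file path as the key and the full file content as the value.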