Re: pySpark - convert log/txt files into sequenceFile
Without the second line, it will be much faster:

    infile = sc.wholeTextFiles(sys.argv[1])
    infile.saveAsSequenceFile(sys.argv[2])

On Wed, Oct 29, 2014 at 3:31 AM, Csaba Ragany wrote:
> Thank you Holden, it works!
>
> infile = sc.wholeTextFiles(sys.argv[1])
> rdd = sc.parallelize(infile.collect())
> rdd.saveAsSequenceFile(sys.argv[2])
>
> Csaba
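For reference, a minimal standalone version of the faster approach; the SparkContext construction and appName are assumptions, since the thread's snippets run inside an already-created context:

    import sys
    from pyspark import SparkContext

    # Assumed setup; the thread's snippets use an existing SparkContext.
    sc = SparkContext(appName="TextToSequenceFile")

    # wholeTextFiles already yields an RDD of (filename, content) pairs,
    # so it can be saved directly. No collect()/parallelize() round trip
    # through the driver is needed.
    infile = sc.wholeTextFiles(sys.argv[1])
    infile.saveAsSequenceFile(sys.argv[2])

    sc.stop()

The collect()/parallelize() pair in the earlier snippet pulls every file's contents into the driver and ships them back out to the executors, which is why dropping it is faster; it also avoids an out-of-memory risk in the driver on large inputs.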
Re: pySpark - convert log/txt files into sequenceFile
Thank you Holden, it works!

    infile = sc.wholeTextFiles(sys.argv[1])
    rdd = sc.parallelize(infile.collect())
    rdd.saveAsSequenceFile(sys.argv[2])

Csaba

2014-10-28 17:56 GMT+01:00 Holden Karau:
> Hi Csaba,
>
> It sounds like the API you are looking for is sc.wholeTextFiles :)
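A quick way to verify the output is to read the SequenceFile back. A sketch, assuming PySpark's default Writable conversions for Text keys and values:

    # Keys are the original filenames, values their full contents.
    pairs = sc.sequenceFile(sys.argv[2])
    print(pairs.first())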
Re: pySpark - convert log/txt files into sequenceFile
Hi Csaba,

It sounds like the API you are looking for is sc.wholeTextFiles :)

Cheers,

Holden :)
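For context, sc.wholeTextFiles reads a directory of (preferably small) files and returns an RDD with one (path, content) pair per file. A sketch with a hypothetical input directory:

    # One element per file: (full file path, entire contents as a string).
    pairs = sc.wholeTextFiles("hdfs:///logs")  # hypothetical path
    first_path, first_content = pairs.first()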
pySpark - convert log/txt files into sequenceFile
Dear Spark Community,

Is it possible to convert text files (.log or .txt files) into SequenceFiles in Python?

Using PySpark I can create a parallelized pair RDD with rdd = sc.parallelize([('key1', 1.0)]) and save it as a SequenceFile with rdd.saveAsSequenceFile(). But how can I put the whole content of my text files into the value for 'key1'?

I want a SequenceFile where the keys are the filenames of the text files and the values are their content.

Thank you for any help!
Csaba
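To make the pattern described above concrete, a minimal sketch of saving a pair RDD as a SequenceFile; the output path is hypothetical:

    # Any RDD of (key, value) pairs can be saved as a SequenceFile.
    rdd = sc.parallelize([('key1', 1.0)])
    rdd.saveAsSequenceFile('/tmp/seq-demo')  # hypothetical output path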