pySpark saveAsSequenceFile append overwrite
Dear Spark community, Can the PySpark saveAsSequenceFile(folder) method append the new sequencefile to an existing one, or overwrite an existing sequencefile? If the folder already exists, I get an error message... Thank you! Csaba
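No reply to this question appears in the thread, so the following is only a sketch. As far as I know, saveAsSequenceFile has no append or overwrite option, and the error comes from Hadoop's check that the output folder does not exist yet. One workaround, assuming the output path is on the local filesystem, is to delete the old folder before saving. The helper name save_overwriting is hypothetical and not part of PySpark.

```python
import os
import shutil


def save_overwriting(rdd, path):
    """Delete a local output directory, if present, then save the RDD there.

    Assumption: `path` is on the local filesystem. For HDFS or S3 the old
    directory would have to be removed with that filesystem's own tools.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)  # drop the old output so the save does not fail
    rdd.saveAsSequenceFile(path)
```

For appending, a common pattern is to write each batch into its own new folder under one parent directory rather than adding to an existing folder.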
Re: pySpark - convert log/txt files into sequenceFile
Thank you Holden, it works!

    infile = sc.wholeTextFiles(sys.argv[1])
    rdd = sc.parallelize(infile.collect())
    rdd.saveAsSequenceFile(sys.argv[2])

Csaba

2014-10-28 17:56 GMT+01:00 Holden Karau hol...@pigscanfly.ca:

Hi Csaba,
It sounds like the API you are looking for is sc.wholeTextFiles :)
Cheers, Holden :)

On Tuesday, October 28, 2014, Csaba Ragany rag...@gmail.com wrote:

Dear Spark Community, Is it possible to convert text files (.log or .txt files) into sequencefiles in Python? Using PySpark I can create a parallelized file with rdd=sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile with rdd.saveAsSequenceFile(). But how can I put the whole content of my text files into the 'value' of 'key1' ? I want a sequencefile where the keys are the filenames of the text files and the values are their content. Thank you for any help! Csaba
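A possible simplification of the snippet above, offered as a sketch only: sc.wholeTextFiles already returns an RDD of (path, content) pairs, so the collect() and parallelize() round trip through the driver may not be needed. The keys it produces are full file paths, so the hypothetical helper basename_key below (not a PySpark function) trims them to bare filenames, as the original question asked. The function name text_files_to_sequence_file is also made up for illustration.

```python
import os


def basename_key(pair):
    """Turn (full_path, content) into (file_name, content)."""
    path, content = pair
    return os.path.basename(path), content


def text_files_to_sequence_file(sc, in_dir, out_dir):
    """Write one (file_name, content) record per text file in in_dir."""
    # wholeTextFiles yields (path, content) pairs, one per file
    files = sc.wholeTextFiles(in_dir)
    files.map(basename_key).saveAsSequenceFile(out_dir)
```

With a live SparkContext this would be called as text_files_to_sequence_file(sc, sys.argv[1], sys.argv[2]). Files with the same name in different subfolders would end up with identical keys, so keeping the full path as the key may be preferable in that case.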
pySpark - convert log/txt files into sequenceFile
Dear Spark Community, Is it possible to convert text files (.log or .txt files) into sequencefiles in Python? Using PySpark I can create a parallelized collection with rdd = sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile with rdd.saveAsSequenceFile(). But how can I put the whole content of my text files into the 'value' of 'key1'? I want a sequencefile where the keys are the filenames of the text files and the values are their contents. Thank you for any help! Csaba