pySpark saveAsSequenceFile append overwrite

2014-12-02 Thread Csaba Ragany
Dear Spark community,

Does the PySpark saveAsSequenceFile(folder) method have the ability to append
a new SequenceFile to an existing one, or to overwrite an existing
SequenceFile? If the folder already exists, I get an error message...

Thank You!
Csaba
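
As far as I can tell, the RDD save methods (saveAsSequenceFile included)
support neither appending nor overwriting: the save fails whenever the output
directory already exists. A common workaround is to delete the target
directory before saving. Below is a minimal sketch of that workaround; it
assumes an HDFS-style output path and reaches the Hadoop FileSystem through
SparkContext's private _jvm/_jsc gateways, which are not a stable public API.

import sys
from pyspark import SparkContext

sc = SparkContext(appName="overwrite-sketch")  # hypothetical app name
output_dir = sys.argv[1]

# Reach the Hadoop FileSystem via the JVM gateway (private, unofficial API).
hadoop_fs = sc._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
path = hadoop_fs.Path(output_dir)

# Recursively delete any existing output directory before saving.
if fs.exists(path):
    fs.delete(path, True)

sc.parallelize([('key1', 1.0)]).saveAsSequenceFile(output_dir)

True appending has no direct equivalent; one common pattern is to write each
batch to a fresh subdirectory and read them all back later with a glob path.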


Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Csaba Ragany
Thank you, Holden, it works!

import sys

# wholeTextFiles yields an RDD of (filename, file content) pairs
infile = sc.wholeTextFiles(sys.argv[1])
rdd = sc.parallelize(infile.collect())
rdd.saveAsSequenceFile(sys.argv[2])

Csaba
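
One side note on the snippet above: wholeTextFiles() already returns an RDD of
(filename, content) pairs, so the collect()/parallelize() round-trip should be
unnecessary (and it pulls every file onto the driver first). If that holds in
your setup, the job collapses to a single line:

sc.wholeTextFiles(sys.argv[1]).saveAsSequenceFile(sys.argv[2])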


2014-10-28 17:56 GMT+01:00 Holden Karau hol...@pigscanfly.ca:

 Hi Csaba,

 It sounds like the API you are looking for is sc.wholeTextFiles :)

 Cheers,

 Holden :)


 On Tuesday, October 28, 2014, Csaba Ragany rag...@gmail.com wrote:

 Dear Spark Community,

 Is it possible to convert text files (.log or .txt files) into
 SequenceFiles in Python?

 Using PySpark I can create an RDD with rdd = sc.parallelize([('key1',
 1.0)]) and save it as a SequenceFile with rdd.saveAsSequenceFile(). But
 how can I put the whole content of my text files into the 'value' of
 'key1'?

 I want a SequenceFile where the keys are the filenames of the text files
 and the values are their content.

 Thank you for any help!
 Csaba



 --
 Cell : 425-233-8271



pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Csaba Ragany
Dear Spark Community,

Is it possible to convert text files (.log or .txt files) into
SequenceFiles in Python?

Using PySpark I can create an RDD with rdd = sc.parallelize([('key1', 1.0)])
and save it as a SequenceFile with rdd.saveAsSequenceFile(). But how can I
put the whole content of my text files into the 'value' of 'key1'?

I want a SequenceFile where the keys are the filenames of the text files
and the values are their content.

Thank you for any help!
Csaba
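
For completeness, a self-contained sketch of the approach the replies above
converge on, with a read-back check at the end; the app name and the
command-line argument layout here are placeholders, not anything specified in
the thread.

import sys
from pyspark import SparkContext

sc = SparkContext(appName="txt-to-seqfile")
in_dir, out_dir = sys.argv[1], sys.argv[2]

# wholeTextFiles reads every file under in_dir as a single record:
# an RDD of (filename, whole file content) pairs.
pairs = sc.wholeTextFiles(in_dir)
pairs.saveAsSequenceFile(out_dir)

# Read the SequenceFile back to confirm the key/value layout.
check = sc.sequenceFile(out_dir)
print(check.first())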