Thank you Holden, it works!

infile = sc.wholeTextFiles(sys.argv[1])
rdd = sc.parallelize(infile.collect())
rdd.saveAsSequenceFile(sys.argv[2])

Csaba
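
A note on what the middle line costs (Holden's point below): collect() materializes every (path, content) pair as a local Python list on the driver, and parallelize() then ships that list back out as a new RDD. Split apart, assuming the infile RDD from the snippet above and hypothetical file paths, it does this:

# Equivalent decomposition of the parallelize(collect()) round trip.
# collect() returns a local list such as (paths are made up):
#   [('hdfs://.../a.log', '...contents...'), ('hdfs://.../b.log', '...')]
local_pairs = infile.collect()
# parallelize() redistributes that list as a new RDD, so every byte
# passes through the driver before being written.
rdd = sc.parallelize(local_pairs)
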
2014-10-28 17:56 GMT+01:00 Holden Karau hol...@pigscanfly.ca:
Hi Csaba,

It sounds like the API you are looking for is sc.wholeTextFiles :)
Without the second line, it will be much faster:

infile = sc.wholeTextFiles(sys.argv[1])
infile.saveAsSequenceFile(sys.argv[2])

Cheers,
Holden :)
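
For anyone running this end to end, a minimal self-contained version of the two-liner might look like the following; the script name, app name, and local[*] master are assumptions for illustration:

# text_to_seqfile.py -- hypothetical script name
# Usage: spark-submit text_to_seqfile.py <input_dir> <output_dir>
import sys

from pyspark import SparkContext

# Placeholder master and app name for a local test run.
sc = SparkContext("local[*]", "TextToSeqFile")

# wholeTextFiles returns an RDD of (filePath, fileContent) pairs, and
# saveAsSequenceFile writes those pairs out directly -- the data never
# has to pass through the driver.
infile = sc.wholeTextFiles(sys.argv[1])
infile.saveAsSequenceFile(sys.argv[2])

sc.stop()
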
On Tuesday, October 28, 2014, Csaba Ragany rag...@gmail.com wrote:
Dear Spark Community,

Is it possible to convert text files (.log or .txt files) into sequencefiles in Python?

Using PySpark I can create a parallelized file with rdd = sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile with rdd.saveAsSequenceFile(). But how can I put the whole content of my text files into a sequencefile?
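
One way to check the result of either approach is to read the sequence file back with sc.sequenceFile and look at one record; "output_dir" below is a placeholder path:

# Hypothetical verification: load the sequence file as (key, value) pairs
# and print the first record (key = original file path, value = contents).
pairs = sc.sequenceFile("output_dir")
print(pairs.first())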