Without the second line (the collect() / parallelize round trip), it will be much faster:
infile = sc.wholeTextFiles(sys.argv[1])
infile.saveAsSequenceFile(sys.argv[2])
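For anyone reading along: wholeTextFiles returns an RDD of (path, content) pairs, and saveAsSequenceFile writes each pair as one key/value record, which is why no parallelize step is needed. A minimal plain-Python sketch of that pair structure (no Spark required; the file names below are made up for illustration):

```python
import pathlib
import tempfile

# Set up a throwaway directory with two small text files,
# standing in for the input directory passed as sys.argv[1].
tmp = tempfile.mkdtemp()
pathlib.Path(tmp, "a.log").write_text("first file")
pathlib.Path(tmp, "b.txt").write_text("second file")

# This mirrors what sc.wholeTextFiles produces: one
# (filename, whole-file-content) pair per input file.
pairs = sorted(
    (str(p), p.read_text()) for p in pathlib.Path(tmp).iterdir()
)

# Each pair would become one key/value record in the sequencefile:
# key = file path, value = entire file content.
for path, content in pairs:
    print(pathlib.Path(path).name, "->", content)
```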
On Wed, Oct 29, 2014 at 3:31 AM, Csaba Ragany wrote:
Thank you Holden, it works!
infile = sc.wholeTextFiles(sys.argv[1])
rdd = sc.parallelize(infile.collect())
rdd.saveAsSequenceFile(sys.argv[2])
Csaba
2014-10-28 17:56 GMT+01:00 Holden Karau:
Hi Csaba,
It sounds like the API you are looking for is sc.wholeTextFiles :)
Cheers,
Holden :)
On Tuesday, October 28, 2014, Csaba Ragany wrote:
Dear Spark Community,
Is it possible to convert text files (.log or .txt files) into
sequencefiles in Python?
Using PySpark I can create a parallelized RDD with
rdd = sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile
with rdd.saveAsSequenceFile(). But how can I put the whole content of my
text files into a sequencefile?