Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Davies Liu
Without the second line it will be much faster. collect() pulls every file down to the driver and parallelize() ships it all back out, while wholeTextFiles already returns an RDD that can be saved directly:

 infile = sc.wholeTextFiles(sys.argv[1])
 infile.saveAsSequenceFile(sys.argv[2])
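
A complete minimal script along these lines might look like the following sketch; the SparkContext construction and app name are assumptions added for completeness, not part of the thread:

 import sys
 from pyspark import SparkContext

 # App name is an assumed placeholder.
 sc = SparkContext(appName="TextToSequenceFile")

 # wholeTextFiles returns an RDD of (file path, file contents) pairs,
 # which is already the key/value shape saveAsSequenceFile expects.
 infile = sc.wholeTextFiles(sys.argv[1])
 infile.saveAsSequenceFile(sys.argv[2])
 sc.stop()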


On Wed, Oct 29, 2014 at 3:31 AM, Csaba Ragany  wrote:
> Thank you Holden, it works!
>
> infile = sc.wholeTextFiles(sys.argv[1])
> rdd = sc.parallelize(infile.collect())
> rdd.saveAsSequenceFile(sys.argv[2])
>
> Csaba
>
>
> 2014-10-28 17:56 GMT+01:00 Holden Karau :
>>
>> Hi Csaba,
>>
>> It sounds like the API you are looking for is sc.wholeTextFiles :)
>>
>> Cheers,
>>
>> Holden :)
>>
>>
>> On Tuesday, October 28, 2014, Csaba Ragany  wrote:
>>>
>>> Dear Spark Community,
>>>
>>> Is it possible to convert text files (.log or .txt files) into
>>> SequenceFiles in Python?
>>>
>>> Using PySpark I can create an RDD with rdd = sc.parallelize([('key1',
>>> 1.0)]) and save it as a SequenceFile with rdd.saveAsSequenceFile(). But
>>> how can I put the whole content of my text files into the 'value' of
>>> 'key1'?
>>>
>>> I want a SequenceFile where the keys are the filenames of the text files
>>> and the values are their content.
>>>
>>> Thank you for any help!
>>> Csaba
>>
>>
>>
>> --
>> Cell : 425-233-8271
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Csaba Ragany
Thank you Holden, it works!

infile = sc.wholeTextFiles(sys.argv[1])
rdd = sc.parallelize(infile.collect())
rdd.saveAsSequenceFile(sys.argv[2])

Csaba


2014-10-28 17:56 GMT+01:00 Holden Karau :

> Hi Csaba,
>
> It sounds like the API you are looking for is sc.wholeTextFiles :)
>
> Cheers,
>
> Holden :)
>
>
> On Tuesday, October 28, 2014, Csaba Ragany  wrote:
>
>> Dear Spark Community,
>>
>> Is it possible to convert text files (.log or .txt files) into
>> SequenceFiles in Python?
>>
>> Using PySpark I can create an RDD with rdd = sc.parallelize([('key1',
>> 1.0)]) and save it as a SequenceFile with rdd.saveAsSequenceFile(). But
>> how can I put the whole content of my text files into the 'value' of
>> 'key1'?
>>
>> I want a SequenceFile where the keys are the filenames of the text files
>> and the values are their content.
>>
>> Thank you for any help!
>> Csaba
>>
>
>
> --
> Cell : 425-233-8271
>


Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba,

It sounds like the API you are looking for is sc.wholeTextFiles :)

Cheers,

Holden :)

On Tuesday, October 28, 2014, Csaba Ragany  wrote:

> Dear Spark Community,
>
> Is it possible to convert text files (.log or .txt files) into
> SequenceFiles in Python?
>
> Using PySpark I can create an RDD with rdd = sc.parallelize([('key1',
> 1.0)]) and save it as a SequenceFile with rdd.saveAsSequenceFile(). But
> how can I put the whole content of my text files into the 'value' of
> 'key1'?
>
> I want a SequenceFile where the keys are the filenames of the text files
> and the values are their content.
>
> Thank you for any help!
> Csaba
>


-- 
Cell : 425-233-8271
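
To illustrate what sc.wholeTextFiles returns, here is a small sketch for the PySpark shell; the directory and the sample output are hypothetical:

 # Each element is a (path, content) pair; the key is the full file
 # path (e.g. an hdfs:// URI), not just the bare filename.
 pairs = sc.wholeTextFiles("/logs")  # hypothetical input directory
 pairs.take(1)
 # e.g. [(u'hdfs://namenode/logs/app.log', u'first line\nsecond line\n')]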


pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Csaba Ragany
Dear Spark Community,

Is it possible to convert text files (.log or .txt files) into
SequenceFiles in Python?

Using PySpark I can create an RDD with rdd = sc.parallelize([('key1',
1.0)]) and save it as a SequenceFile with rdd.saveAsSequenceFile(). But
how can I put the whole content of my text files into the 'value' of
'key1'?

I want a SequenceFile where the keys are the filenames of the text files
and the values are their content.

Thank you for any help!
Csaba
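
For completeness, the resulting SequenceFile can be read back in PySpark with sc.sequenceFile to verify the (filename, content) pairs; the output path below is a placeholder:

 # sequenceFile converts the Writable keys and values back into
 # Python strings; keys() drops the contents and keeps the paths.
 rdd = sc.sequenceFile("output/path")  # placeholder path
 print(rdd.keys().take(2))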