Hi Daniel,

Yes, it will work without the collect method. You just do a map operation on every item of the RDD.
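Something along these lines should do it (an untested sketch of the idea, filling in the create_and_write_file helper from my earlier mail; note I use foreach instead of map so Spark actually executes it as an action, and the files are plain local files, so each cat-<key>.txt ends up on the local disk of whichever worker processes that key, not on the driver or in HDFS):

    import json

    def create_and_write_file(line):
        key, values = line[0], line[1]
        # one file per key, e.g. cat-red.txt, written on the worker
        # that processes this record (its local filesystem)
        with open("cat-%s.txt" % key, "w") as f:
            for ele in values:
                f.write(json.dumps(ele) + "\n")

    # foreach is an action; a map alone is lazy and would not run anything
    catGroupArr.foreach(create_and_write_file)

If you need the output back in HDFS you would have to swap the open() call for an HDFS write, but the shape of the code stays the same.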
Thanks
Abhishek S

> On 16 Dec 2015, at 18:10, Daniel Valdivia <h...@danielvaldivia.com> wrote:
>
> Hi Abhishek,
>
> Thanks for your suggestion, I did consider it, but I'm not sure if, to achieve that, I'd need to collect() the data first; I don't think it would fit into the driver memory.
>
> Since I'm trying all of this inside the pyspark shell, I'm using a small dataset; however, the main dataset is about 1.5 GB of data, and my cluster only has nodes with 2 GB of RAM (2 of them).
>
> Do you think that your suggestion could work without having to collect() the results?
>
> Thanks in advance!
>
>> On Wed, Dec 16, 2015 at 4:26 AM, Abhishek Shivkumar <abhisheksgum...@gmail.com> wrote:
>> Hello Daniel,
>>
>> I was thinking if you can write
>>
>> catGroupArr.map(lambda line: create_and_write_file(line))
>>
>> def create_and_write_file(line):
>>
>> 1. Look at the key of line: line[0]
>> 2. Open a file with the required file name based on the key
>> 3. Iterate through the values of this (key, value) pair:
>>
>>     for ele in line[1]:
>>
>> 4. Write every ele into the file created.
>> 5. Close the file.
>>
>> Do you think this works?
>>
>> Thanks
>> Abhishek S
>>
>>
>> Thank you!
>>
>> With Regards,
>> Abhishek S
>>
>>> On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia <h...@danielvaldivia.com> wrote:
>>> Hello everyone,
>>>
>>> I have a PairRDD with a set of keys and a list of values, where each value in the list is a JSON document which I already loaded at the beginning of my Spark app. How can I iterate over each value of the list in my PairRDD, transform it to a string, and then save the whole content of the key to a file? One file per key.
>>>
>>> My input files look like cat-0-500.txt:
>>>
>>> {cat:'red',value:'asd'}
>>> {cat:'green',value:'zxc'}
>>> {cat:'red',value:'jkl'}
>>>
>>> The PairRDD looks like
>>>
>>> ('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
>>> ('green', [{cat:'green',value:'zxc'}])
>>>
>>> So as you can see, I'd like to serialize each JSON document in the value list back to a string so I can easily saveAsTextFile(); of course, I'm trying to save a separate file for each key.
>>>
>>> The way I got here:
>>>
>>> rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
>>> import json
>>> categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
>>> categories = categoriesJson
>>>
>>> catByDate = categories.map(lambda x: (x['cat'], x))
>>> catGroup = catByDate.groupByKey()
>>> catGroupArr = catGroup.mapValues(lambda x : list(x))
>>>
>>> Ideally I want to create a cat-red.txt that looks like:
>>>
>>> {cat:'red',value:'asd'}
>>> {cat:'red',value:'jkl'}
>>>
>>> and the same for the rest of the keys.
>>>
>>> I already looked at this answer, but I'm slightly lost as to how to process each value in the list (turn it into a string) before I save the contents to a file; also, I cannot figure out how to import MultipleTextOutputFormat in Python either.
>>>
>>> I'm trying all this wacky stuff in the pyspark shell.
>>>
>>> Any advice would be greatly appreciated.
>>>
>>> Thanks in advance!