Hi Daniel,

Yes, it will work without the collect method. You just do a map operation on every item of the RDD.
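Something along these lines should do it (an untested sketch of the idea, filling in the create_and_write_file helper from my earlier mail; note I use foreach instead of map so Spark actually executes it as an action, and the files are plain local files, so each cat-<key>.txt ends up on the local disk of whichever worker processes that key, not on the driver or in HDFS):

    import json

    def create_and_write_file(line):
        key, values = line[0], line[1]
        # one file per key, e.g. cat-red.txt, written on the worker
        # that processes this record (its local filesystem)
        with open("cat-%s.txt" % key, "w") as f:
            for ele in values:
                f.write(json.dumps(ele) + "\n")

    # foreach is an action; a map alone is lazy and would not run anything
    catGroupArr.foreach(create_and_write_file)

If you need the output back in HDFS you would have to swap the open() call for an HDFS write, but the shape of the code stays the same.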
Thanks
Abhishek S

> On 16 Dec 2015, at 18:10, Daniel Valdivia <h...@danielvaldivia.com> wrote:
>
> Hi Abhishek,
>
> Thanks for your suggestion, I did consider it, but I'm not sure if, to achieve that, I'd need to collect() the data first; I don't think it would fit into the driver memory.
>
> Since I'm trying all of this inside the pyspark shell, I'm using a small dataset; however, the main dataset is about 1.5 GB of data, and my cluster only has nodes with 2 GB of RAM (2 of them).
>
> Do you think that your suggestion could work without having to collect() the results?
>
> Thanks in advance!
>
>> On Wed, Dec 16, 2015 at 4:26 AM, Abhishek Shivkumar <abhisheksgum...@gmail.com> wrote:
>> Hello Daniel,
>>
>> I was thinking if you can write
>>
>> catGroupArr.map(lambda line: create_and_write_file(line))
>>
>> def create_and_write_file(line):
>>
>> 1. Look at the key of line: line[0]
>> 2. Open a file with the required file name based on the key
>> 3. Iterate through the values of this (key, value) pair:
>>
>>     for ele in line[1]:
>>
>> 4. Write every ele into the file created.
>> 5. Close the file.
>>
>> Do you think this works?
>>
>> Thanks
>> Abhishek S
>>
>>
>> Thank you!
>>
>> With Regards,
>> Abhishek S
>>
>>> On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia <h...@danielvaldivia.com> wrote:
>>> Hello everyone,
>>>
>>> I have a PairRDD with a set of keys and a list of values, where each value in the list is a JSON document which I already loaded at the beginning of my Spark app. How can I iterate over each value of the list in my PairRDD, transform it to a string, and then save the whole content of the key to a file? One file per key.
>>>
>>> My input files look like cat-0-500.txt:
>>>
>>> {cat:'red',value:'asd'}
>>> {cat:'green',value:'zxc'}
>>> {cat:'red',value:'jkl'}
>>>
>>> The PairRDD looks like
>>>
>>> ('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
>>> ('green', [{cat:'green',value:'zxc'}])
>>>
>>> So as you can see, I'd like to serialize each JSON document in the value list back to a string so I can easily saveAsTextFile(); of course, I'm trying to save a separate file for each key.
>>>
>>> The way I got here:
>>>
>>> rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
>>> import json
>>> categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
>>> categories = categoriesJson
>>>
>>> catByDate = categories.map(lambda x: (x['cat'], x))
>>> catGroup = catByDate.groupByKey()
>>> catGroupArr = catGroup.mapValues(lambda x : list(x))
>>>
>>> Ideally I want to create a cat-red.txt that looks like:
>>>
>>> {cat:'red',value:'asd'}
>>> {cat:'red',value:'jkl'}
>>>
>>> and the same for the rest of the keys.
>>>
>>> I already looked at this answer, but I'm slightly lost as to how to process each value in the list (turn it into a string) before I save the contents to a file; also, I cannot figure out how to import MultipleTextOutputFormat in Python either.
>>>
>>> I'm trying all this wacky stuff in the pyspark shell.
>>>
>>> Any advice would be greatly appreciated.
>>>
>>> Thanks in advance!