Hello Daniel, I was thinking you could write something like:
catGroupArr.map(lambda line: create_and_write_file(line))

def create_and_write_file(line):
    1. Look at the key of line: line[0]
    2. Open a file, with the file name based on that key
    3. Iterate through the values of this key/value pair: for ele in line[1]:
    4. Write every ele into the file created
    5. Close the file

Do you think this works?

With Regards,
Abhishek S

On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia <h...@danielvaldivia.com> wrote:
> Hello everyone,
>
> I have a PairRDD with a set of keys and lists of values; each value in the
> list is a JSON document which I already loaded at the beginning of my Spark
> app. How can I iterate over each value of the list in my PairRDD to
> transform it to a string, then save the whole content of the key to a
> file? One file per key.
>
> My input file looks like cat-0-500.txt:
>
> {cat:'red',value:'asd'}
> {cat:'green',value:'zxc'}
> {cat:'red',value:'jkl'}
>
> The PairRDD looks like:
>
> ('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
> ('green', [{cat:'green',value:'zxc'}])
>
> So, as you can see, I'd like to serialize each JSON document in the value
> list back to a string so I can easily saveAsTextFile(); of course, I'm
> trying to save a separate file for each key.
>
> The way I got here:
>
> rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
> import json
> categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
> categories = categoriesJson
>
> catByDate = categories.map(lambda x: (x['cat'], x))
> catGroup = catByDate.groupByKey()
> catGroupArr = catGroup.mapValues(lambda x: list(x))
>
> Ideally I want to create a cat-red.txt that looks like:
>
> {cat:'red',value:'asd'}
> {cat:'red',value:'jkl'}
>
> and the same for the rest of the keys.
>
> I already looked at this answer
> <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>
> but I'm slightly lost as to how to process each value in the list (turn it
> into a string) before I save the contents to a file; I also cannot figure
> out how to import MultipleTextOutputFormat in Python either.
>
> I'm trying all this wacky stuff in the pyspark shell.
>
> Any advice would be greatly appreciated.
>
> Thanks in advance!
>
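For what it's worth, the numbered steps above can be sketched as a small helper. This is only a sketch under some assumptions: the grouped pairs look like ('red', [{'cat': 'red', ...}, ...]) as in the example, the cat-<key>.txt naming and the out_dir parameter are illustrative, and json.dumps is used to turn each dict back into a string:

```python
import json
import os

def create_and_write_file(pair, out_dir="."):
    # pair is one (key, values) element of the grouped RDD,
    # e.g. ('red', [{'cat': 'red', 'value': 'asd'}, ...])   (steps 1 and 3)
    key, values = pair
    path = os.path.join(out_dir, "cat-%s.txt" % key)        # step 2
    # 'with' closes the file even if a write fails          (step 5)
    with open(path, "w") as f:
        for ele in values:
            f.write(json.dumps(ele) + "\n")                 # step 4
    return path

# map() alone never runs this -- transformations are lazy, so an action
# is needed:
# catGroupArr.foreach(create_and_write_file)
# or, if the grouped data is small enough, pull it to the driver first:
# for pair in catGroupArr.collect():
#     create_and_write_file(pair)
```

One caveat: foreach() runs on the executors, so the files would be written to worker-local disks rather than the driver (or HDFS); collect() first if you need them all in one place and the data fits in driver memory.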