Hello everyone. I have a PairRDD mapping each key to a list of values, where each value in the list is a JSON object that I already parsed at the beginning of my Spark app. How can I iterate over each value in the lists of my PairRDD, turn it back into a string, and then save the whole content for each key to a file, one file per key?
My input files look like this (cat-0-500.txt):

```
{"cat": "red", "value": "asd"}
{"cat": "green", "value": "zxc"}
{"cat": "red", "value": "jkl"}
```

The PairRDD looks like:

```
('red', [{'cat': 'red', 'value': 'asd'}, {'cat': 'red', 'value': 'jkl'}])
('green', [{'cat': 'green', 'value': 'zxc'}])
```

So, as you can see, I'd like to serialize each JSON object in the value list back to a string so I can easily saveAsTextFile(); of course, I'm trying to save a separate file for each key. This is how I got here:

```python
import json

rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
categories = categoriesJson
catByDate = categories.map(lambda x: (x['cat'], x))
catGroup = catByDate.groupByKey()
catGroupArr = catGroup.mapValues(lambda x: list(x))
```

Ideally I want to create a cat-red.txt that looks like:

```
{"cat": "red", "value": "asd"}
{"cat": "red", "value": "jkl"}
```

and the same for the rest of the keys.

I already looked at this answer: <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>

But I'm slightly lost as to how to process each value in the list (turn it into a string) before I save the contents to a file, and I can't figure out how to use MultipleTextOutputFormat from Python either.

I'm trying all this wacky stuff in the pyspark shell. Any advice would be greatly appreciated. Thanks in advance!
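Update: to make the first part concrete, here is a minimal sketch of the serialization step I have in mind, picking up from catGroupArr above (json.dumps is just my guess at the right way to turn each dict back into a string):

```python
import json

# Join each key's list of dicts into one newline-separated block of JSON
# strings, e.g. ('red', '{"cat": "red", "value": "asd"}\n{"cat": "red", ...}')
catSerialized = catGroupArr.mapValues(
    lambda records: '\n'.join(json.dumps(r) for r in records)
)
```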
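And for the one-file-per-key part, the only workaround I can think of without MultipleTextOutputFormat is to collect the distinct keys on the driver and then filter and save once per key. The output path here is made up, and I realize saveAsTextFile actually writes a directory of part files rather than a single cat-red.txt:

```python
# One filter-and-save pass per key; acceptable for a handful of categories.
for key in catSerialized.keys().distinct().collect():
    (catSerialized
        .filter(lambda kv, k=key: kv[0] == k)  # k=key pins the loop variable
        .values()
        .saveAsTextFile("hdfs://x.x.x.../output/cat-%s" % key))
```

Is something like this reasonable, or is there a way to do it in a single pass?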