Hello Daniel,

  I was wondering whether you could write something like this:

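# foreach is an action and runs immediately; map is lazy and would not
# write anything until an action is called on its result
catGroupArr.foreach(lambda line: create_and_write_file(line))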

def create_and_write_file(line):

    1. Look at the key of the pair: line[0]
    2. Open a file whose name is based on that key.
    3. Iterate through the values of this (key, value) pair:

       for ele in line[1]:

    4. Write each ele to the file that was opened, as a string.
    5. Close the file.
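
The steps above could be sketched roughly as follows. This is only a sketch under assumptions: the out_dir parameter and the cat-<key>.txt naming are illustrative (the naming follows the cat-red.txt example in your mail), and each value is assumed to be a dict that json.dumps can serialize. Note also that open() writes to the local filesystem of whichever machine runs the function, so on a cluster the files would end up on the executors, not on the driver or in HDFS.

```python
import json
import os


def create_and_write_file(line, out_dir="."):
    """Write the values of one (key, values) pair to cat-<key>.txt.

    out_dir and the file naming are assumptions made for illustration.
    """
    key, values = line                      # step 1: key is line[0]
    path = os.path.join(out_dir, "cat-%s.txt" % key)
    with open(path, "w") as f:              # step 2: open a file named after the key
        for ele in values:                  # step 3: iterate over line[1]
            f.write(json.dumps(ele) + "\n")  # step 4: one JSON string per line
    return path                             # step 5: the with block closes the file


# In the pyspark shell this would be triggered with an action, for example:
# catGroupArr.foreach(create_and_write_file)
```

Because json.dumps produces standard JSON, the keys come out double-quoted (for example {"cat": "red", "value": "asd"}).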

Do you think this works?

Thanks
Abhishek S


On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia <h...@danielvaldivia.com>
wrote:

> Hello everyone,
>
> I have a PairRDD of keys and lists of values. Each value in the list is a
> JSON object that I already loaded at the beginning of my Spark app. How can
> I iterate over each value of the list in my pair RDD to transform it to a
> string, then save the whole content of the key to a file? One file per key.
>
> My input files, such as cat-0-500.txt, look like this:
>
> {cat:'red',value:'asd'}
> {cat:'green',value:'zxc'}
> {cat:'red',value:'jkl'}
>
> The PairRDD looks like:
>
> ('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
> ('green', [{cat:'green',value:'zxc'}])
>
> So, as you can see, I'd like to serialize each JSON object in the value list
> back to a string so I can easily saveAsTextFile(). Of course, I'm trying to
> save a separate file for each key.
>
> The way I got here:
>
> rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
> import json
> categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
> categories = categoriesJson
>
> catByDate = categories.map(lambda x: (x['cat'], x))
> catGroup = catByDate.groupByKey()
> catGroupArr = catGroup.mapValues(lambda x : list(x))
>
> Ideally I want to create a cat-red.txt that looks like:
>
> {cat:'red',value:'asd'}
> {cat:'red',value:'jkl'}
>
> and the same for the rest of the keys.
>
> I already looked at this answer
> <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>
> but I'm slightly lost as to how to process each value in the list (turn it
> into a string) before I save the contents to a file. I also cannot figure
> out how to import *MultipleTextOutputFormat* in Python.
>
> I'm trying all this wacky stuff in the pyspark shell
>
> Any advice would be greatly appreciated
>
> Thanks in advance!
>
