Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread Daniel Valdivia
Hi Abhishek, Thanks for your suggestion, I did consider it, but I'm not sure if to achieve that I'd need to collect() the data first; I don't think it would fit into the driver memory. Since I'm trying all of this inside the pyspark shell I'm using a small dataset, however the main dataset is

Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread abhisheksgumadi
Hi Daniel, Yes, it will work without the collect method. You just do a map operation on every item of the RDD. Thanks, Abhishek S

Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread Abhishek Shivkumar
Hello Daniel, I was thinking you could write:

catGroupArr.map(lambda line: create_and_write_file(line))

def create_and_write_file(line):
1. look at the key of line: line[0]
2. open a file with the required file name based on the key
3. iterate through the values of this key/value pair
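A minimal sketch of the helper suggested above, assuming catGroupArr is an RDD of (key, list-of-parsed-JSON) pairs; the function name follows the suggestion, while OUTPUT_DIR is a hypothetical stand-in for a real output directory:

```python
import json
import os
import tempfile

OUTPUT_DIR = tempfile.mkdtemp()  # hypothetical output directory for illustration

def create_and_write_file(pair):
    """Write one file per key; each value in the list is JSON-serialized on its own line."""
    key, values = pair
    # 1. the key of the pair names the file
    path = os.path.join(OUTPUT_DIR, "{}.txt".format(key))
    with open(path, "w") as f:
        # 2./3. iterate through the values and serialize each one
        for value in values:
            f.write(json.dumps(value) + "\n")
    return path

# On the RDD this would be driven by an action, e.g.:
#   catGroupArr.foreach(create_and_write_file)
# foreach rather than map, because transformations are lazy and only the
# side effect is wanted here; note the files land on each worker's local
# disk, not on the driver.
```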

PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-15 Thread Daniel Valdivia
Hello everyone, I have a PairRDD with a set of keys and a list of values; each value in the list is a JSON document that I already loaded at the beginning of my Spark app. How can I iterate over each value of the list in my pair RDD to transform it to a string, then save the whole content of the key to a file?
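The serialization half of the question can be sketched like this, assuming each value in the list is an already-parsed JSON object; the RDD name pairs and the helper serialize_values are hypothetical:

```python
import json

def serialize_values(values):
    # Turn each parsed JSON object back into a string, one per value,
    # so the whole list for a key can later be written out line by line.
    return [json.dumps(v) for v in values]

# On the pair RDD this would run per key without touching the driver, e.g.:
#   serialized = pairs.mapValues(serialize_values)
# mapValues keeps the key and applies the function only to the value side.
```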