Make a function (or lambda) that reads a text file. Make an RDD from the list of X/Y paths, then map that RDD through the file-reading function. Do the same with your X/Y/Z directory. You then have RDDs with the content of each file as a record. Work with those as needed.
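The approach above can be sketched in plain Python. The two reader functions below are hypothetical names for illustration; in Spark they would be passed to `map` over an RDD of file paths (e.g. `sc.parallelize(paths).map(read_serial)`), but they are shown here running locally so the parsing logic is clear.

```python
import os
import tempfile

def read_serial(path):
    # Parse an a.txt: a single serial number per file.
    with open(path) as f:
        return f.read().strip()

def read_options(path):
    # Parse a b.txt: one "key,value" pair per line.
    # A trailing comma (as in "b,1,") is tolerated.
    pairs = {}
    with open(path) as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line:
                continue
            key, value = line.split(",", 1)
            pairs[key] = value
    return pairs

# In Spark these would be mapped over RDDs of paths, for example:
#   serials = sc.parallelize(a_paths).map(read_serial)
#   options = sc.parallelize(b_paths).map(read_options)

# Local demonstration with temporary files:
tmp = tempfile.mkdtemp()
a_path = os.path.join(tmp, "a.txt")
b_path = os.path.join(tmp, "b.txt")
with open(a_path, "w") as f:
    f.write("12345\n")
with open(b_path, "w") as f:
    f.write("a,1\nb,1,\nc,0\n")

print(read_serial(a_path))        # 12345
print(read_options(b_path)["a"])  # 1
```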
On Wed, May 11, 2016 at 2:36 PM Pradeep Nayak <pradeep1...@gmail.com> wrote:

> Hi -
>
> I have a very unique problem which I am trying to solve, and I am not sure
> if Spark would help here.
>
> I have a directory: /X/Y/a.txt and, in the same structure, /X/Y/Z/b.txt.
>
> a.txt contains a unique serial number, say:
> 12345
>
> and b.txt contains key-value pairs:
> a,1
> b,1,
> c,0 etc.
>
> Every day you receive data for a system Y, so there are multiple a.txt and
> b.txt files per serial number. The serial number doesn't change, and it is
> the key. There are multiple systems, a whole year of data is available,
> and it's huge.
>
> I am trying to generate a report of unique serial numbers where the value
> of the option a has changed to 1 over the last few months. Let's say the
> default is 0. Also, figure out how many times it was toggled.
>
> I am not sure how to read two text files in Spark at the same time and
> associate them with the serial number. Is there a way of doing this in
> place, given that we know the directory structure? Or should we be
> transforming the data anyway to solve this?

-- 
Mathieu Longtin
1-514-803-8977
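Once the per-serial daily values of option a are collected (in Spark, e.g. via `groupByKey` on the serial number), the report logic itself is simple. The following is a minimal plain-Python sketch with hypothetical function names; it assumes the values for one serial arrive as a time-ordered list of strings, starting from a default of "0" as described in the question.

```python
def count_toggles(values, default="0"):
    # Count how many times the value changes across a time-ordered
    # sequence of readings, starting from the assumed default.
    toggles = 0
    previous = default
    for v in values:
        if v != previous:
            toggles += 1
            previous = v
    return toggles

def changed_to_one(values, default="0"):
    # True if the most recent reading is "1" and differs from the default,
    # i.e. the serial belongs in the report.
    latest = values[-1] if values else default
    return latest == "1" and default != "1"

# Hypothetical daily readings of option a for one serial number:
daily_a = ["0", "0", "1", "0", "1"]
print(count_toggles(daily_a))   # 3
print(changed_to_one(daily_a))  # True
```

In Spark, these functions would run per key after grouping the (serial, value) records, e.g. `pairs.groupByKey().mapValues(lambda vs: count_toggles(list(vs)))`.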