Hi - I have a rather unusual problem I am trying to solve, and I am not sure whether Spark would help here.
I have a directory /X/Y/a.txt, and in the same structure /X/Y/Z/b.txt. a.txt contains a unique serial number, say 12345, and b.txt contains key-value pairs: a,1 b,1 c,0 and so on. Every day we receive data for a system Y, so there are multiple a.txt and b.txt files for each serial number. The serial number doesn't change, and that's the key. There are multiple systems, and a whole year of data is available, so it's huge.

I am trying to generate a report of the unique serial numbers where the value of option a has changed to 1 over the last few months (let's say the default is 0), and also to figure out how many times it was toggled. I am not sure how to read two text files in Spark at the same time and associate them with the serial number. Is there a way of doing this in place, given that we know the directory structure, or should we be transforming the data first anyway?
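To make the question concrete, here is a minimal sketch of one pattern I have been considering: parse each b.txt into key-value pairs and count state changes per serial. The helpers below are plain Python; the path layout, the glob patterns, and the idea of pairing files via `sc.wholeTextFiles` keyed by their parent directory are all assumptions about how this could be wired into Spark, not a working solution.

```python
import os

def parse_kv(text):
    """Parse b.txt-style content like 'a,1 b,1 c,0' into (key, int value) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        for item in line.split():
            k, _, v = item.partition(",")
            if k and v:
                pairs.append((k.strip(), int(v)))
    return pairs

def count_toggles(values, default=0):
    """Count how many times a chronological sequence of values changes state,
    starting from the assumed default."""
    toggles = 0
    prev = default
    for v in values:
        if v != prev:
            toggles += 1
            prev = v
    return toggles

# Hypothetical Spark wiring (root path and globs are made up for illustration):
#
# serials = (sc.wholeTextFiles("/data/*/*/a.txt")
#              .map(lambda kv: (os.path.dirname(kv[0]), kv[1].strip())))
# options = (sc.wholeTextFiles("/data/*/*/Z/b.txt")
#              .map(lambda kv: (os.path.dirname(os.path.dirname(kv[0])),
#                               parse_kv(kv[1]))))
#
# # Join on the shared system directory so each b.txt is tagged with its serial,
# # then group by serial (sorting values by date) and apply count_toggles.
# joined = serials.join(options)
```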