Hi, I am new to Spark. I want to set up Spark Streaming to retrieve key-value pairs from files of the following format:
file: info1

Note: Each info file will have around 1,000 of these records, and our system is continuously generating info files, so I want to aggregate the results through Spark Streaming.

Can we give this kind of file as input to a Spark cluster? I am interested only in the "SF" and "DA" delimiters: "SF" corresponds to the source file, and "DA" corresponds to (line number, count).

Since this input data is not in a line-oriented format, is it a good idea to use these files directly as Spark input, or do I need an intermediate stage where I clean these files and generate new files with one record per line instead of blocks? Or can this be achieved in Spark itself? What is the right approach?

*What I want to achieve:* I want line-level information, i.e. each line (as a key) mapped to the info files (as values) in which it appears. The final output I want looks like this:

line 178 -> (info1, info2, info7, ...)
line 2908 -> (info3, info90, ...)

Do let me know if my explanation is not clear.

Thanks & Regards,
Vinti
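To make the output I am after concrete, here is a minimal plain-Python sketch of just the aggregation step (not a Spark job). It assumes the parsing of the "SF"/"DA" blocks has already produced (line number, info file) pairs; the `aggregate` name and the sample data are hypothetical. In Spark this step would roughly correspond to a `flatMap` over parsed records followed by a `groupByKey`:

```python
from collections import defaultdict

def aggregate(pairs):
    """Group (line_number, info_file) pairs into line_number -> [info files]."""
    out = defaultdict(list)
    for line_no, info_file in pairs:
        out[line_no].append(info_file)
    return dict(out)

# Hypothetical pairs, as if parsed from the "DA" and "SF" entries of several files.
pairs = [(178, "info1"), (178, "info2"), (2908, "info3"), (178, "info7")]
print(aggregate(pairs))
# {178: ['info1', 'info2', 'info7'], 2908: ['info3']}
```

The parsing step itself depends on the exact block layout of the info files, which I have not shown here.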