You can do this with a custom RDD implementation. The two methods you mainly need to override are "getPartitions", which holds the logic for splitting your input into partitions, and "compute", which computes and returns the values for a given partition on the executors.
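To make the split between the two methods concrete, here is a minimal pure-Scala sketch of the logic each one would carry. It deliberately avoids the Spark dependency so it runs standalone; the names (blockRanges, readBlocks) are illustrative, not Spark API, and the real RDD subclass would call them from getPartitions and compute respectively.

```scala
// Hypothetical sketch of the two pieces a block-based custom RDD needs.
object BlockRddSketch {

  // getPartitions side: split `totalBlocks` blocks into at most
  // `numPartitions` contiguous, non-overlapping (start, end) ranges.
  def blockRanges(totalBlocks: Int, numPartitions: Int): Seq[(Int, Int)] =
    (0 until numPartitions)
      .map { i =>
        val start = i * totalBlocks / numPartitions
        val end   = (i + 1) * totalBlocks / numPartitions
        (start, end)
      }
      .filter { case (s, e) => e > s } // drop empty ranges

  // compute side: group a line iterator into blocks of n lines each,
  // standing in for a real block-reading library.
  def readBlocks(lines: Iterator[String], n: Int): Iterator[Seq[String]] =
    lines.grouped(n)
}
```

In a real RDD subclass, getPartitions would return one Partition object per range, and compute would open the file, skip to its partition's first block, and iterate only its own range.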
On Tue, 17 Sep 2019 at 08:47, Marcelo Valle <marcelo.va...@ktech.com> wrote:

> Just to be clearer about my requirements: what I have is actually a custom
> format, with a header, a summary, and multi-line blocks. I want to create
> tasks per block, not per line. I already have a library that reads an
> InputStream and outputs an Iterator of Block, but now I need to integrate
> this with Spark.
>
> On Tue, 17 Sep 2019 at 16:28, Marcelo Valle <marcelo.va...@ktech.com> wrote:
>
>> Hi,
>>
>> I want to create a custom RDD which will read n lines in sequence from a
>> file, which I call a block, and each block should be converted to a Spark
>> dataframe to be processed in parallel.
>>
>> Question - do I have to implement a custom Hadoop input format to achieve
>> this? Or is it possible to do it only with RDD APIs?
>>
>> Thanks,
>> Marcelo.
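Given the setup quoted above (an existing library that turns an InputStream into an Iterator of Block), the compute method of such an RDD can wrap that iterator and hand each partition only its own slice of blocks. A minimal pure-Scala sketch, where the Block type and blocksForPartition are hypothetical stand-ins for the library's actual types:

```scala
// Hypothetical stand-in for the existing library's Block type.
case class Block(lines: Seq[String])

// compute-side sketch: the RDD's compute(split, context) would re-open the
// stream via the library and keep only this partition's slice of blocks.
// slice is lazy: it skips `start` blocks and yields `end - start` of them.
def blocksForPartition(all: Iterator[Block], start: Int, end: Int): Iterator[Block] =
  all.slice(start, end)
```

Each of the resulting per-partition blocks can then be parsed into Rows on the executor side, so no custom Hadoop InputFormat is strictly required as long as the partition boundaries can be computed up front.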