Right now I manage the pipelines as Spark batch processing. Moving to streaming would bring improvements such as:

- simplification of the pipeline
- more frequent data ingestion
- better resource management (?)

On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
> Why does it have to be a stream?
>
> > On 18.11.2018 at 23:29, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> >
> > Hi
> >
> > I have PDFs to load into Spark in at least <filename, byte_array>
> > format. I have considered some options:
> >
> > - Spark Streaming does not provide a native file stream for binaries of
> >   variable size (binaryRecordStream expects a constant record size), so
> >   I would have to write my own receiver.
> >
> > - Structured Streaming allows processing avro/parquet/orc files
> >   containing PDFs, but this makes things more complicated than
> >   monitoring a simple folder containing PDFs.
> >
> > - Kafka is not designed to handle messages > 100 KB, and for this
> >   reason it is not a good option to use in the stream pipeline.
> >
> > Does somebody have a suggestion?
> >
> > Thanks,
> >
> > --
> > nicolas

--
nicolas

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
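[Editor's note: the "monitor a simple folder" idea from the quoted message can be sketched without Spark at all. Below is a minimal, hypothetical Python polling loop that discovers new PDFs and yields (filename, byte_array) pairs per batch; the function name `poll_new_pdfs` and the polling design are assumptions for illustration, not an existing Spark or library API. In a real pipeline, each batch of pairs would be handed to Spark (e.g. via `sc.parallelize` or a DataFrame) rather than returned to the caller.]

```python
import os


def poll_new_pdfs(folder, seen):
    """Return (filename, byte_array) pairs for PDFs not yet processed.

    A Spark-free sketch of folder monitoring: `seen` is a mutable set
    of filenames already ingested, so calling this repeatedly yields
    only the newly arrived files each time (one "micro-batch").
    """
    batch = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".pdf") and name not in seen:
            with open(os.path.join(folder, name), "rb") as f:
                batch.append((name, f.read()))
            seen.add(name)
    return batch
```

Calling `poll_new_pdfs(folder, seen)` in a loop with a sleep between iterations approximates the micro-batch behaviour of a file-based stream, at the cost of hand-rolling checkpointing (the `seen` set would need to be persisted to survive restarts).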