streaming pdf

Nicolas Paris Sun, 18 Nov 2018 14:29:28 -0800

Hi

I have PDF files to load into Spark, at minimum as <filename, byte_array>
pairs. I have considered some options:


- Spark Streaming does not provide a native file stream for binary files
  of variable size (binaryRecordsStream requires a constant record
  length), so I would have to write my own receiver.

- Structured Streaming can process Avro/Parquet/ORC files containing
  PDFs, but packing the PDFs into such files first is more complicated
  than monitoring a simple folder of PDFs.

- Kafka is not designed to handle messages larger than 100 KB, so it is
  not a good fit for this streaming pipeline.
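
To make the folder-monitoring idea concrete, here is a minimal sketch in plain Python of the polling step that a hand-written receiver would need. It is not Spark API: the names scan_new_pdfs and seen are hypothetical, and it assumes each PDF is completely written before it appears in the folder and is never modified afterwards.

```python
import os
from typing import List, Set, Tuple


def scan_new_pdfs(folder: str, seen: Set[str]) -> List[Tuple[str, bytes]]:
    """Return (filename, content) pairs for PDFs not emitted by an earlier scan.

    `seen` holds the filenames already emitted and is updated in place.
    Assumption: files are complete when they appear and are never rewritten.
    """
    batch = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if name in seen:
            continue  # already emitted in a previous scan
        if not name.lower().endswith(".pdf") or not os.path.isfile(path):
            continue  # ignore non-PDF entries and subdirectories
        with open(path, "rb") as f:
            # The whole file is read at once, so the record size is variable.
            batch.append((name, f.read()))
        seen.add(name)
    return batch
```

Calling scan_new_pdfs repeatedly with the same seen set yields only the files that arrived since the previous call, each as a <filename, byte_array> pair of whatever size the file happens to be.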

Does anybody have a suggestion?

Thanks,

-- 
nicolas

