Re: streaming pdf

Jörn Franke Sun, 18 Nov 2018 22:23:30 -0800

Why does it have to be a stream?

> On 18.11.2018 at 23:29, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> 
> Hi
> 
> I have PDFs to load into Spark, at minimum as <filename, byte_array>
> pairs. I have considered some options:
> 
> - Spark Streaming does not provide a native file stream for binary
>  files of variable size (binaryRecordsStream requires a constant
>  record length), so I would have to write my own receiver.
> 
> - Structured Streaming can process Avro/Parquet/ORC files
>  containing PDFs, but this makes things more complicated than
>  monitoring a simple folder containing PDFs.
> 
> - Kafka is not designed to handle messages larger than 100 KB, so
>  it is not a good option for the stream pipeline.
> 
> Does anybody have a suggestion?
> 
> Thanks,
> 
> -- 
> nicolas
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
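
To make the "monitoring a simple folder" option concrete, here is a minimal sketch in plain Python (no Spark involved) that polls a directory and yields <filename, byte_array> pairs for PDFs it has not seen yet. The function name new_pdfs and the mtime-based change detection are assumptions made for illustration, not anything proposed in the thread. If a batch job turns out to be enough, SparkContext.binaryFiles already returns (path, content) pairs for every file in a directory, which may remove the need for a stream.

```python
import os
from typing import Dict, Iterator, Tuple


def new_pdfs(folder: str, seen: Dict[str, float]) -> Iterator[Tuple[str, bytes]]:
    """Yield (filename, content) for PDFs that are new or changed.

    `seen` maps filename -> last modification time already processed;
    it is updated in place so repeated calls only return fresh files.
    (Hypothetical helper, for illustration only.)
    """
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue  # ignore anything that is not a PDF
        path = os.path.join(folder, name)
        mtime = os.path.getmtime(path)
        if seen.get(name) == mtime:
            continue  # already handed out this version of the file
        with open(path, "rb") as fh:
            data = fh.read()  # whole file as bytes, whatever its size
        seen[name] = mtime
        yield name, data
```

Calling new_pdfs(folder, seen) in a loop with a sleep between calls gives a crude micro-batch source. It does not guard against files that are still being written, so a real pipeline would probably move completed files into the watched folder atomically.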


