Right now I manage the pipelines as Spark batch processing. Moving to streaming would bring improvements such as:

- simplification of the pipeline
- more frequent data ingestion
- better resource management (?)

On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
> Why does it have to be a stream?
>
> > On 18.11.2018 at 23:29, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> >
> > Hi
> >
> > I have PDFs to load into Spark in at least <filename, byte_array>
> > format. I have considered some options:
> >
> > - Spark Streaming does not provide a native file stream for binaries of
> >   variable size (binaryRecordStream expects a constant record size), so
> >   I would have to write my own receiver.
> >
> > - Structured Streaming allows processing avro/parquet/orc files
> >   containing PDFs, but this makes things more complicated than
> >   monitoring a simple folder containing PDFs.
> >
> > - Kafka is not designed to handle messages > 100 KB, and for this
> >   reason it is not a good option to use in the stream pipeline.
> >
> > Does somebody have a suggestion?
> >
> > Thanks,
> >
> > --
> > nicolas

--
nicolas

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
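[Editor's note: the "monitor a simple folder" idea from the quoted message can be sketched without Spark at all. Below is a minimal, hypothetical Python polling loop that discovers new PDFs and yields (filename, byte_array) pairs per batch; the function name `poll_new_pdfs` and the polling design are assumptions for illustration, not an existing Spark or library API. In a real pipeline, each batch of pairs would be handed to Spark (e.g. via `sc.parallelize` or a DataFrame) rather than returned to the caller.]

```python
import os


def poll_new_pdfs(folder, seen):
    """Return (filename, byte_array) pairs for PDFs not yet processed.

    A Spark-free sketch of folder monitoring: `seen` is a mutable set
    of filenames already ingested, so calling this repeatedly yields
    only the newly arrived files each time (one "micro-batch").
    """
    batch = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".pdf") and name not in seen:
            with open(os.path.join(folder, name), "rb") as f:
                batch.append((name, f.read()))
            seen.add(name)
    return batch
```

Calling `poll_new_pdfs(folder, seen)` in a loop with a sleep between iterations approximates the micro-batch behaviour of a file-based stream, at the cost of hand-rolling checkpointing (the `seen` set would need to be persisted to survive restarts).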