Well, I am not so sure about the use cases, but what about using 
StreamingContext.fileStream?
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream-java.lang.String-scala.Function1-boolean-org.apache.hadoop.conf.Configuration-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag-
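A minimal sketch of how that could look, using the `fileStream[K, V, F]` overload from the linked javadoc (directory, path filter, newFilesOnly). Spark ships no whole-file binary InputFormat, so the `WholeFileInputFormat` below is a hypothetical custom class following the standard Hadoop pattern (non-splittable files, one record per file); the directory path is a placeholder:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, IOUtils, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical custom InputFormat: emits each file as a single (path, bytes) record.
class WholeFileInputFormat extends FileInputFormat[Text, BytesWritable] {

  // A PDF must not be split across records.
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
    new RecordReader[Text, BytesWritable] {
      private var processed = false
      private var key: Text = _
      private var value: BytesWritable = _
      private var fileSplit: FileSplit = _
      private var conf: Configuration = _

      override def initialize(split: InputSplit, ctx: TaskAttemptContext): Unit = {
        fileSplit = split.asInstanceOf[FileSplit]
        conf = ctx.getConfiguration
      }

      // Read the whole file into one BytesWritable on the first call.
      override def nextKeyValue(): Boolean = {
        if (processed) return false
        val path = fileSplit.getPath
        val in = path.getFileSystem(conf).open(path)
        try {
          val bytes = new Array[Byte](fileSplit.getLength.toInt)
          IOUtils.readFully(in, bytes, 0, bytes.length)
          key = new Text(path.toString)
          value = new BytesWritable(bytes)
        } finally in.close()
        processed = true
        true
      }

      override def getCurrentKey: Text = key
      override def getCurrentValue: BytesWritable = value
      override def getProgress: Float = if (processed) 1.0f else 0.0f
      override def close(): Unit = ()
    }
}

// Only pick up files ending in .pdf (case-insensitive).
def isPdfName(name: String): Boolean = name.toLowerCase.endsWith(".pdf")

// Monitor a folder and yield (filename, byte array) pairs, the format asked for in the thread.
def pdfStream(ssc: StreamingContext, dir: String): DStream[(String, Array[Byte])] =
  ssc.fileStream[Text, BytesWritable, WholeFileInputFormat](
      dir,                                // e.g. "hdfs:///incoming/pdfs" (placeholder)
      (p: Path) => isPdfName(p.getName),  // path filter
      newFilesOnly = true)
    .map { case (k, v) => (k.toString, v.copyBytes()) }
```

This would avoid both a custom receiver and an avro/parquet wrapping step: the folder is monitored directly and each new PDF arrives as one record.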


> On 19.11.2018 at 09:22, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> 
>> On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
>> Why does it have to be a stream?
>> 
> 
> Right now I manage the pipelines as Spark batch processing. Moving to
> streaming would bring some improvements, such as:
> - simplification of the pipeline
> - more frequent data ingestion
> - better resource management (?)
> 
> 
>>> On 18.11.2018 at 23:29, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
>>> 
>>> Hi
>>> 
>>> I have PDFs to load into Spark, at least in <filename, byte_array>
>>> format. I have considered some options:
>>> 
>>> - Spark Streaming does not provide a native file stream for binary files
>>> of variable size (binaryRecordsStream requires a fixed record length),
>>> so I would have to write my own receiver.
>>> 
>>> - Structured Streaming can process avro/parquet/orc files containing
>>> PDFs, but this is more complicated than monitoring a simple folder
>>> containing PDFs.
>>> 
>>> - Kafka is not designed to handle messages larger than ~100 KB, so it
>>> is not a good option for this stream pipeline.
>>> 
>>> Does somebody have a suggestion?
>>> 
>>> Thanks,
>>> 
>>> -- 
>>> nicolas
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
>> 
> 
> -- 
> nicolas
> 
> 
