Yes. Set trigger(once=True) on every streaming query and each one will
behave like a batch job. You can then use any scheduler (e.g. Airflow) to
run it at whatever interval you need. With checkpointing, the next run
resumes processing files from the last checkpoint.
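
For illustration, here is a minimal PySpark sketch of that pattern. The
function name, the JSON/Parquet formats, the paths and the schema string
are all assumptions made for the example, not details from this thread;
only trigger(once=True) and the checkpointLocation option are the pieces
being discussed.

```python
def run_once(spark, input_path, output_path, checkpoint_path, schema_ddl):
    """Process the files that arrived since the last checkpoint, then stop.

    `spark` is an existing SparkSession. The formats and arguments are
    hypothetical placeholders chosen for this sketch.
    """
    stream = (
        spark.readStream
        .format("json")            # assumed input format
        .schema(schema_ddl)        # file sources need an explicit schema
        .load(input_path)
    )
    query = (
        stream.writeStream
        .format("parquet")         # assumed output format
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)        # one micro-batch, then the query ends
        .start(output_path)
    )
    query.awaitTermination()       # block until that single batch is done
    return query
```

A scheduled job (an Airflow task, a cron entry) would build a session with
SparkSession.builder.getOrCreate() and call run_once(...) with the same
checkpoint path on every run, so each run picks up where the last one
stopped.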
On Fri, Apr 23, 2021 at 8:13 AM Mich
Interesting.
If we go back to the classic Lambda architecture on premises, you could use
the Flume API with Kafka to add files to HDFS on a time-series basis.
Most higher-end CDC vendors do exactly that. Oracle GoldenGate (OGG) classic
gets data from Oracle redo logs and sends them to subscribers. One can
deploy OGC
Hi
In one of the Spark Summit demos, it was suggested that we should think of
batch jobs in a streaming pattern, using "run once" on a schedule.
I find this idea very interesting, and I understand how it can be achieved
for sources like Kafka, Kinesis or similar. In fact, we have implemented
this
Hello Asmath,
We had a similar challenge recently.
When you write back to Hive, you are creating files on HDFS, and the number
of files depends on your batch window.
If you increase your batch window, say from 1 minute to 5 minutes, you will
end up creating 5x fewer files.
The other factor is your partitioning.
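
As a rough back-of-the-envelope check of the batch-window point, assume
each batch writes a fixed number of files (for example, one per RDD
partition). The function name and the figure of 4 files per batch are
assumptions for illustration only.

```python
def files_per_hour(batch_seconds, files_per_batch):
    """Estimate files written per hour for a given batch window.

    Assumes every batch writes the same number of files, which is a
    simplification made for this sketch.
    """
    batches_per_hour = 3600 // batch_seconds
    return batches_per_hour * files_per_batch


one_minute = files_per_hour(60, 4)     # 60 batches/hour * 4 files
five_minutes = files_per_hour(300, 4)  # 12 batches/hour * 4 files
print(one_minute, five_minutes)        # prints "240 48"
```

Under these assumptions the 5-minute window produces one fifth as many
files, and the per-batch file count is a second, independent lever.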
Hi,
I am using Spark Streaming to write data back into Hive with the code
snippet below:
eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD(rdd => {
    val sparkSession = SparkSession
      .builder.enableHiveSupport.getOrCreate
    import
Sent from the Apache Spark User List mailing list archive at Nabble.com.