Re: Spark Streaming with Files

2021-04-30 Thread muru
Yes, set trigger(once=True) on all streaming sources and the job will be treated as batch mode. Then you can use any scheduler (e.g. Airflow) to run it on whatever time window you need. With checkpointing enabled, the next run will start processing files from the last checkpoint. On Fri, Apr 23, 2021 at 8:13 AM Mich
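
A minimal sketch of the pattern, assuming a JSON file source and a parquet sink (paths and schema are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("file-stream-once").getOrCreate()

    # Placeholder schema and paths; adjust to your data.
    schema = StructType([StructField("id", StringType()),
                         StructField("value", StringType())])

    df = spark.readStream.schema(schema).json("/data/incoming")

    query = (df.writeStream
             .format("parquet")
             .option("path", "/data/output")
             .option("checkpointLocation", "/data/ckpt/file-stream-once")
             .trigger(once=True)   # process all available files, then stop
             .start())
    query.awaitTermination()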

Re: Spark Structured Streaming with PySpark throwing error in execution

2021-02-22 Thread muru
You should include commons-pool2-2.9.0.jar and remove spark-streaming-kafka-0-10_2.12-3.0.1.jar (it is not needed for Structured Streaming). On Mon, Feb 22, 2021 at 12:42 PM Mich Talebzadeh wrote: > Hi, > > Trying to make PySpark with PyCharm work with Structured Streaming > > spark-3.0.1-bin-hadoop3.2 >
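
A hedged alternative to hand-picking jars is to let Spark resolve the Kafka connector and its transitive dependencies (including commons-pool2) at startup; versions here follow the thread:

    from pyspark.sql import SparkSession

    # spark-sql-kafka-0-10 is the Structured Streaming connector; the
    # DStream artifact spark-streaming-kafka-0-10 is not needed here.
    spark = (SparkSession.builder
             .appName("ss-kafka")
             .config("spark.jars.packages",
                     "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")
             .getOrCreate())

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "events")                        # placeholder topic
          .load())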

Re: Use case advice

2021-01-14 Thread muru
You need to make sure the delta-core_2.11-0.6.1.jar file is in your $SPARK_HOME/jars folder. On Thu, Jan 14, 2021 at 4:59 AM András Kolbert wrote: > sorry, missed out a bit. Added, highlighted in yellow. > > On Thu, 14 Jan 2021 at 13:54, András Kolbert > wrote: > >> Thank
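
A sketch of an alternative that pulls the same package at session startup instead of copying the jar by hand (the Scala 2.11 build implies a Spark 2.4.x runtime):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("delta-example")
             .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.1")
             .getOrCreate())

    # With the jar on the classpath, the "delta" format becomes available.
    spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-table")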

Re: Use case advice

2021-01-09 Thread muru
You could try Delta Lake or Apache Hudi for this use case. On Sat, Jan 9, 2021 at 12:32 PM András Kolbert wrote: > Sorry if my terminology is misleading. > > What I meant by driver-only is to use a local pandas dataframe (collect > the data to the master) and keep updating that instead of
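
For the Delta Lake route, a minimal upsert sketch that keeps the state distributed instead of collecting it into a pandas dataframe (table path and key column are illustrative):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Incoming updates stay a distributed DataFrame; no collect() to the driver.
    updates = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    target = DeltaTable.forPath(spark, "/data/state")  # existing Delta table
    (target.alias("t")
     .merge(updates.alias("u"), "t.id = u.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())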

Re: ForeachBatch Structured Streaming

2020-10-15 Thread muru
To achieve exactly-once with foreachBatch in SS, you must have checkpointing enabled. In case of any exception or failure the Spark SS job will be restarted and the same batchId reprocessed (for any data source). To avoid duplicates, you should have an external system to store and dedupe
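
A sketch of that dedupe pattern, with an in-memory set standing in for the external store (in practice it must be durable, e.g. a database table keyed by batch id):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    committed = set()  # stand-in for a durable external dedupe store

    def write_batch(df, batch_id):
        # On restart, Spark replays the last uncommitted batch with the same
        # batch_id, so skipping known ids makes the sink effectively exactly-once.
        if batch_id in committed:
            return
        df.write.format("parquet").mode("append").save("/data/out")
        committed.add(batch_id)

    query = (spark.readStream.format("rate").load()
             .writeStream
             .foreachBatch(write_batch)
             .option("checkpointLocation", "/data/ckpt/dedupe")
             .start())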

Re: How to Scale Streaming Application to Multiple Workers

2020-10-15 Thread muru
For file streaming in SS, you can try setting "maxFilesPerTrigger" to limit how many files go into each batch. foreachBatch is an action; the output is written to various sinks. Are you doing any post-transformation in foreachBatch? On Thu, Oct 15, 2020 at 1:24 PM Mich Talebzadeh wrote: > Hi, > > This in general depends on how
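
A small sketch of the option on a file source (path, schema, and the cap of 100 are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    schema = StructType([StructField("line", StringType())])

    df = (spark.readStream
          .schema(schema)
          .option("maxFilesPerTrigger", 100)  # at most 100 new files per micro-batch
          .csv("/data/landing"))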

Re: Pyspark: Issue using sql in foreachBatch sink

2020-08-03 Thread muru
Thanks Jungtaek for your help. On Fri, Jul 31, 2020 at 6:31 PM Jungtaek Lim wrote: > Python doesn't allow abbreviating () with no param, whereas Scala does. > Use `write()`, not `write`. > > On Wed, Jul 29, 2020 at 9:09 AM muru wrote: > >> In a pyspark SS job, trying to u
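
A pure-Python illustration of Jungtaek's point (the class is just a stand-in):

    class Writer:
        def write(self):
            return "written"

    w = Writer()
    print(w.write)    # bound method object; nothing is invoked
    print(w.write())  # "written" -- Python requires the parentheses,
                      # while Scala lets a no-arg method drop them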

Pyspark: Issue using sql in foreachBatch sink

2020-07-28 Thread muru
terminated with error:
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message:
Traceback (most recent call last):
  File "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2381, in _call_proxy
    return_value = getattr(self.pool[obj_id],
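
For context, a minimal sketch of the pattern this thread is about, running SQL over each micro-batch inside a foreachBatch sink (names and paths are illustrative; the view is queried through the batch DataFrame's own session, since that is where createOrReplaceTempView registers it):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def process_batch(df, batch_id):
        df.createOrReplaceTempView("batch")
        # Query through the micro-batch DataFrame's session, where the
        # temp view lives; the global session may be a different clone.
        result = df.sql_ctx.sql(
            "SELECT key, count(*) AS n FROM batch GROUP BY key")
        result.write.format("parquet").mode("append").save("/data/agg")

    stream = (spark.readStream.format("rate").load()
              .selectExpr("value % 10 AS key"))
    query = (stream.writeStream
             .foreachBatch(process_batch)
             .option("checkpointLocation", "/data/ckpt/sql-batch")
             .start())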