Re: Spark Streaming with Files

2021-04-30 Thread muru
Yes, set trigger(once=True) on all streaming sources and the job will be treated as batch mode. Then you can use any scheduler (e.g. Airflow) to run it on whatever time window you need. With checkpointing enabled, the next run will start processing files from the last checkpoint. On Fri, Apr 23, 2021 at 8:13 AM Mich
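
A minimal sketch of the pattern, assuming a JSON file source and a parquet sink (paths and schema are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("file-stream-once").getOrCreate()

    # Placeholder schema and paths; adjust to your data.
    schema = StructType([StructField("id", StringType()),
                         StructField("value", StringType())])

    df = spark.readStream.schema(schema).json("/data/incoming")

    query = (df.writeStream
             .format("parquet")
             .option("path", "/data/output")
             .option("checkpointLocation", "/data/ckpt/file-stream-once")
             .trigger(once=True)   # process all available files, then stop
             .start())
    query.awaitTermination()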

Re: Spark Structured Streaming with PySpark throwing error in execution

2021-02-22 Thread muru
You should include commons-pool2-2.9.0.jar and remove spark-streaming-kafka-0-10_2.12-3.0.1.jar (it is not needed for Structured Streaming). On Mon, Feb 22, 2021 at 12:42 PM Mich Talebzadeh wrote: > Hi, > > Trying to make PySpark with PyCharm work with Structured Streaming > > spark-3.0.1-bin-hadoop3.2 >
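
A hedged alternative to hand-picking jars is to let Spark resolve the Kafka connector and its transitive dependencies (including commons-pool2) at startup; versions here follow the thread:

    from pyspark.sql import SparkSession

    # spark-sql-kafka-0-10 is the Structured Streaming connector; the
    # DStream artifact spark-streaming-kafka-0-10 is not needed here.
    spark = (SparkSession.builder
             .appName("ss-kafka")
             .config("spark.jars.packages",
                     "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")
             .getOrCreate())

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "events")                        # placeholder topic
          .load())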

Re: Use case advice

2021-01-14 Thread muru
You need to make sure the delta-core_2.11-0.6.1.jar file is in your $SPARK_HOME/jars folder. On Thu, Jan 14, 2021 at 4:59 AM András Kolbert wrote: > sorry, missed out a bit. Added, highlighted in yellow. > > On Thu, 14 Jan 2021 at 13:54, András Kolbert > wrote: > >> Thank
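
A sketch of an alternative that pulls the same package at session startup instead of copying the jar by hand (the Scala 2.11 build implies a Spark 2.4.x runtime):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("delta-example")
             .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.1")
             .getOrCreate())

    # With the jar on the classpath, the "delta" format becomes available.
    spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-table")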

Re: Use case advice

2021-01-09 Thread muru
You could try Delta Lake or Apache Hudi for this use case. On Sat, Jan 9, 2021 at 12:32 PM András Kolbert wrote: > Sorry if my terminology is misleading. > > What I meant by driver-only is to use a local pandas dataframe (collect > the data to the master) and keep updating that instead of
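
For the Delta Lake route, a minimal upsert sketch that keeps the state distributed instead of collecting it into a pandas dataframe (table path and key column are illustrative):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Incoming updates stay a distributed DataFrame; no collect() to the driver.
    updates = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    target = DeltaTable.forPath(spark, "/data/state")  # existing Delta table
    (target.alias("t")
     .merge(updates.alias("u"), "t.id = u.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())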

Re: ForeachBatch Structured Streaming

2020-10-15 Thread muru
To achieve exactly-once with foreachBatch in SS, you must have checkpointing enabled. In case of any exception or failure the Spark SS job will be restarted and the same batchId reprocessed (for any data source). To avoid duplicates, you should have an external system to store and dedupe
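
A sketch of that dedupe pattern, with an in-memory set standing in for the external store (in practice it must be durable, e.g. a database table keyed by batch id):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    committed = set()  # stand-in for a durable external dedupe store

    def write_batch(df, batch_id):
        # On restart, Spark replays the last uncommitted batch with the same
        # batch_id, so skipping known ids makes the sink effectively exactly-once.
        if batch_id in committed:
            return
        df.write.format("parquet").mode("append").save("/data/out")
        committed.add(batch_id)

    query = (spark.readStream.format("rate").load()
             .writeStream
             .foreachBatch(write_batch)
             .option("checkpointLocation", "/data/ckpt/dedupe")
             .start())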

Re: How to Scale Streaming Application to Multiple Workers

2020-10-15 Thread muru
For file streaming in SS, you can try setting "maxFilesPerTrigger" to limit how many files go into each batch. foreachBatch is an action; the output is written to various sinks. Are you doing any post-transformation in foreachBatch? On Thu, Oct 15, 2020 at 1:24 PM Mich Talebzadeh wrote: > Hi, > > This in general depends on how
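
A small sketch of the option on a file source (path, schema, and the cap of 100 are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    schema = StructType([StructField("line", StringType())])

    df = (spark.readStream
          .schema(schema)
          .option("maxFilesPerTrigger", 100)  # at most 100 new files per micro-batch
          .csv("/data/landing"))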

Re: Pyspark: Issue using sql in foreachBatch sink

2020-08-03 Thread muru
Thanks Jungtaek for your help. On Fri, Jul 31, 2020 at 6:31 PM Jungtaek Lim wrote: > Python doesn't allow abbreviating () with no param, whereas Scala does. > Use `write()`, not `write`. > > On Wed, Jul 29, 2020 at 9:09 AM muru wrote: > >> In a pyspark SS job, trying to u
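
A pure-Python illustration of Jungtaek's point (the class is just a stand-in):

    class Writer:
        def write(self):
            return "written"

    w = Writer()
    print(w.write)    # bound method object; nothing is invoked
    print(w.write())  # "written" -- Python requires the parentheses,
                      # while Scala lets a no-arg method drop them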

Pyspark: Issue using sql in foreachBatch sink

2020-07-28 Thread muru
terminated with error:
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message:
Traceback (most recent call last):
  File "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2381, in _call_proxy
    return_value = getattr(self.pool[obj_id],
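
For context, a minimal sketch of the pattern this thread is about, running SQL over each micro-batch inside a foreachBatch sink (names and paths are illustrative; the view is queried through the batch DataFrame's own session, since that is where createOrReplaceTempView registers it):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def process_batch(df, batch_id):
        df.createOrReplaceTempView("batch")
        # Query through the micro-batch DataFrame's session, where the
        # temp view lives; the global session may be a different clone.
        result = df.sql_ctx.sql(
            "SELECT key, count(*) AS n FROM batch GROUP BY key")
        result.write.format("parquet").mode("append").save("/data/agg")

    stream = (spark.readStream.format("rate").load()
              .selectExpr("value % 10 AS key"))
    query = (stream.writeStream
             .foreachBatch(process_batch)
             .option("checkpointLocation", "/data/ckpt/sql-batch")
             .start())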