Re: Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread KhajaAsmath Mohammed
I was able to figure it out. The HDFS directory where the data is being pushed was previously used by a job run as a different user; not having the proper permissions resulted in this issue. Thanks, Asmath > On Jul 21, 2022, at 4:22 PM, Artemis User wrote: > > Not sure what you mean by offerts/offsets. I assume

Re: Pyspark and multiprocessing

2022-07-21 Thread Khalid Mammadov
Pool.map requires two arguments: first a function and second an iterable, i.e. a list, set, etc. Check out the examples in the official docs for how to use it: https://docs.python.org/3/library/multiprocessing.html On Thu, 21 Jul 2022, 21:25 Bjørn Jørgensen wrote: > Thank you. > The reason for using spark local is
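For reference, a minimal sketch of the Pool.map pattern described above; the worker function and file list here are hypothetical placeholders, not code from the thread:

```python
from multiprocessing import Pool

def process_file(path):
    # Hypothetical worker: read one file and return something about it.
    with open(path) as f:
        return len(f.read())

if __name__ == "__main__":
    files = ["a.json", "b.json", "c.json"]  # placeholder iterable
    with Pool(processes=4) as pool:
        # 1st argument: the function, 2nd argument: the iterable.
        results = pool.map(process_file, files)
    print(results)
```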

Re: Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread Artemis User
Not sure what you mean by offerts/offsets. I assume you were using file-based instead of Kafka-based data sources. Is the incoming data generated in mini-batch files or in a single large file? Have you had this type of problem before? On 7/21/22 1:02 PM, KhajaAsmath Mohammed wrote: Hi,

Re: Pyspark and multiprocessing

2022-07-21 Thread Bjørn Jørgensen
Thank you. The reason for using spark local is to test the code, and, as in this case, I find the bottlenecks and fix them before I spin up a K8S cluster. I did test it now with 16 cores and 10 files: import time tic = time.perf_counter() json_to_norm_with_null("/home/jovyan/notebooks/falk/test",
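As an illustration of the local-mode testing pattern mentioned above, a minimal sketch; local[16] and the perf_counter timing come from the thread, while the app name and the job placeholder are assumptions:

```python
import time
from pyspark.sql import SparkSession

# Local session for testing before moving to a K8S cluster;
# local[16] runs Spark with 16 cores on the local machine.
spark = (SparkSession.builder
         .master("local[16]")
         .appName("local-test")  # hypothetical name
         .getOrCreate())

tic = time.perf_counter()
# ... run the job under test here, e.g. json_to_norm_with_null(...) from the thread
toc = time.perf_counter()
print(f"Elapsed: {toc - tic:0.2f} s")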

Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread KhajaAsmath Mohammed
Hi, I am seeing weird behavior in our Spark Structured Streaming application where the offsets are not getting picked up by the streaming job. If I delete the checkpoint directory and run the job again, I can see the data for the first batch, but it is not picking up new offsets again from the next
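For context, a minimal sketch of a file-source structured streaming job with a checkpoint location; the paths and schema are hypothetical, not from the thread. Offsets for already-processed files are tracked under checkpointLocation, which is why deleting that directory makes the job reprocess the first batch:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

schema = StructType().add("id", StringType()).add("payload", StringType())

# Pick up new files as they land in the input directory.
stream = (spark.readStream
          .schema(schema)
          .json("hdfs:///data/incoming"))  # hypothetical input path

query = (stream.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/out")                # hypothetical output path
         .option("checkpointLocation", "hdfs:///chk/demo")  # offsets are stored here
         .start())

query.awaitTermination()
```

Note that the checkpoint and output directories must be writable by the user running the job, which ties into the permissions issue reported as the resolution above.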

Re: Pyspark and multiprocessing

2022-07-21 Thread Khalid Mammadov
One quick observation is that you allocate all your local CPUs to Spark and then execute that app with 10 threads, i.e. 10 Spark apps, so you will need 160 cores in total as each will need 16 CPUs, IMHO. Wouldn't that create a CPU bottleneck? Also, on a side note, why do you need Spark if you use that on lo
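To make the arithmetic explicit: 10 concurrent threads, each effectively running a Spark app sized at local[16], would demand 10 x 16 = 160 cores. One way to avoid that oversubscription, sketched below under the assumption that the per-file work can be expressed as a Spark read, is to use a single session sized to the machine and let Spark parallelize across the files itself rather than wrapping it in an outer thread pool; this is an illustrative alternative, not the thread's resolution:

```python
from pyspark.sql import SparkSession

# One session sized to the machine's real core count. Spark reads
# all files in the directory in parallel on its own, so no outer
# thread pool (and no 10 x 16 = 160-core demand) is needed.
spark = (SparkSession.builder
         .master("local[16]")
         .appName("batch-files")  # hypothetical name
         .getOrCreate())

df = spark.read.json("/home/jovyan/notebooks/falk/test")  # path from the thread
df.write.mode("overwrite").parquet("/tmp/out")            # hypothetical output
```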