I was able to figure it out. The HDFS directory where the data is being
pushed had previously been written to by a different user, so the job did
not have the proper permissions, and that caused this issue.
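For reference, the directory owner can be checked with hdfs dfs -ls <dir>
and corrected with hdfs dfs -chown -R <user>:<group> <dir> (the path, user,
and group here are placeholders).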
Thanks,
Asmath
> On Jul 21, 2022, at 4:22 PM, Artemis User wrote:
>
> Not sure what you mean by offerts/offsets. I assume
Pool.map requires two arguments: first a function and second an iterable,
e.g. a list, a set, etc.
Check out the examples in the official docs on how to use it:
https://docs.python.org/3/library/multiprocessing.html
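A minimal sketch of the pattern (the worker function and file list here are
made up for illustration, not your actual code):

from multiprocessing import Pool

def process_file(path):
    # placeholder worker; in your case this would wrap json_to_norm_with_null
    return len(path)

if __name__ == "__main__":
    files = ["a.json", "b.json", "c.json"]
    with Pool(processes=4) as pool:
        # map(func, iterable) applies process_file to every entry of files
        results = pool.map(process_file, files)
    print(results)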
On Thu, 21 Jul 2022, 21:25 Bjørn Jørgensen wrote:
> Thank you.
> The reason for using Spark local is
Not sure what you mean by offerts/offsets. I assume you were using
file-based instead of Kafka-based data sources. Are the incoming
data generated in mini-batch files or in a single large file? Have you
had this type of problem before?
On 7/21/22 1:02 PM, KhajaAsmath Mohammed wrote:
Hi,
Thank you.
The reason for using Spark local is to test the code and, as in this case,
find the bottlenecks and fix them before I spin up a K8s cluster.
I did test it now with 16 cores and 10 files:
import time
tic = time.perf_counter()
json_to_norm_with_null("/home/jovyan/notebooks/falk/test",
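(arguments to json_to_norm_with_null are trimmed above; it is our own
helper). The timing boilerplate around it is just the standard
time.perf_counter pattern, roughly:

import time

tic = time.perf_counter()
# json_to_norm_with_null(...)  # our own helper, arguments omitted here
toc = time.perf_counter()
print(f"Finished in {toc - tic:0.4f} seconds")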
Hi,
I am seeing weird behavior in our Spark Structured Streaming application
where the offerts are not getting picked up by the streaming job.
If I delete the checkpoint directory and run the job again, I can see the
data for the first batch, but it is not picking up new offsets again from
the next
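For context, the job is essentially the standard file-source read with a
checkpoint, along these lines (the paths, schema, and sink here are
simplified placeholders, not the real job):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("stream-test").getOrCreate()
schema = StructType().add("id", StringType()).add("value", StringType())

# stream new files from the input directory
df = spark.readStream.schema(schema).json("/data/input")

# progress (including per-batch offsets) is tracked under the
# checkpointLocation directory
query = (df.writeStream
         .format("console")
         .option("checkpointLocation", "/data/checkpoint")
         .start())
query.awaitTermination()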
One quick observation is that you allocate all your local CPUs to Spark and
then execute that app with 10 threads, i.e. 10 Spark apps, so you would need
160 cores in total as each app needs 16 CPUs IMHO. Wouldn't that create a
CPU bottleneck?
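To make the arithmetic concrete: 10 concurrent apps x 16 cores each = 160
cores. One way around it would be to cap each app's master, something like
this (a sketch, not your actual code):

from pyspark.sql import SparkSession

# give each concurrent app only a slice of the machine,
# e.g. 16 cores / 10 apps -> local[1] or local[2] each
spark = (SparkSession.builder
         .master("local[2]")
         .appName("worker-app")
         .getOrCreate())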
Also, on a side note, why do you need Spark if you use that on lo