Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Mich Talebzadeh
Since your Hbase is supported by the external vendor, I would ask them to justify their choice of storage for Hbase and any suggestions they have vis-a-vis S3 etc. Spark has an efficient API to Hbase, including remote Hbase; I have used it in the past for reading from Hbase. HTH

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Joris Billen
Thanks for pointing this out. So currently data is stored in hbase on adls. Question (sorry, I might be ignorant): is it clear that parquet on S3 would be faster as storage to read from than hbase on adls? In general, I've found it hard after my processing is done, if I have an application that

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Bjørn Jørgensen
"4. S3: I am not using it, but people in the thread started suggesting potential solutions involving s3. It is an azure system, so hbase is stored on adls. In fact the nature of my application (geospatial stuff) requires me to use geomesa libs, which only allows directly writing from spark to

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Mich Talebzadeh
Ok. Your architect has decided to emulate everything on-prem in the cloud. You are not really taking any advantage of cloud offerings or scalability. For example, how does your Hadoop cluster cater for the increased capacity? Likewise your Spark nodes are pigeonholed with your Hadoop nodes. Old wine

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Bjørn Jørgensen
"But it will be faster to use S3 (or GCS) through some network and it will be faster than writing to the local SSD. I don't understand the point here." Minio is a S3 mock, so you run minio local. tor. 7. apr. 2022 kl. 09:27 skrev Mich Talebzadeh : > Ok so that is your assumption. The whole thing

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Joris Billen
Thanks for the active discussion and for sharing your knowledge :-) 1. The cluster is a managed Hadoop cluster on Azure in the cloud. It has hbase, spark and hdfs, shared. 2. Hbase is on the cluster, so not standalone. It comes from an enterprise-level template from a commercial vendor, so assuming

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Mich Talebzadeh
Ok so that is your assumption. The whole thing is based on-premise on JBOD (including the hadoop cluster, which has Spark binaries on each node) as I understand. But it will be faster to use S3 (or GCS) through some network and it will be faster than writing to the local SSD. I don't

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Bjørn Jørgensen
1. Where does S3 come into this? He is processing data for one day at a time. So to dump each day to fast storage he can use parquet files and write them to S3. On Wed, 6 Apr 2022 at 22:27, Mich Talebzadeh wrote: > > Your statement below: > > > I believe I have found the issue: the job
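
A minimal sketch of that suggestion, writing one day's processed output as Parquet under a per-day prefix on S3 (the paths, columns and the "processing" step are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-parquet-dump").getOrCreate()

day = "2022-04-01"  # the day currently being processed

# Placeholder input and processing; substitute the real per-day pipeline.
df = spark.read.csv(f"hdfs:///data/raw/{day}")
processed = df.withColumn("ingest_day", F.lit(day))

# Each iteration produces an independent, cheap-to-read Parquet output.
processed.write.mode("overwrite").parquet(f"s3a://my-bucket/processed/day={day}")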

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Mich Talebzadeh
Your statement below: I believe I have found the issue: the job writes data to hbase which is on the same cluster. When I keep on processing data and writing with spark to hbase, eventually the garbage collection cannot keep up anymore for hbase, and the hbase memory consumption increases. As

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Bjørn Jørgensen
Great, upgrade from 2.4 to 3.X.X. It seems like you can use unpersist after df=read(fromhdfs), df2=spark.sql(using df1), ..., df10=spark.sql(using df9)? I did use kubernetes and spark with the S3 API
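
A sketch of what that might look like: cache the base DataFrame because several later steps reuse it, run the chained SQL, and unpersist once the single write action has finished (the paths and SQL are invented stand-ins):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unpersist-sketch").getOrCreate()

# df = read(fromhdfs), cached because later steps reuse it
df = spark.read.csv("hdfs:///data/input").cache()
df.createOrReplaceTempView("base")

# df2 = spark.sql(using df), ..., df10 = spark.sql(using df9)
df2 = spark.sql("SELECT * FROM base WHERE _c0 IS NOT NULL")            # placeholder SQL
df2.createOrReplaceTempView("step2")
df3 = spark.sql("SELECT _c0, COUNT(*) AS n FROM step2 GROUP BY _c0")   # placeholder SQL

df3.write.mode("overwrite").parquet("hdfs:///data/output")

# After the write (the only action) has run, drop the cached blocks.
df.unpersist()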

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Gourav Sengupta
Hi, super duper. Please try to see if you can write out the data to S3, and then write a load script to load that data from S3 to HBase. Regards, Gourav Sengupta On Wed, Apr 6, 2022 at 4:39 PM Joris Billen wrote: > HI, > thanks for your reply. > > > I believe I have found the issue: the job

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Joris Billen
Hi, thanks for your reply. I believe I have found the issue: the job writes data to hbase which is on the same cluster. When I keep on processing data and writing with spark to hbase, eventually the garbage collection cannot keep up anymore for hbase, and the hbase memory consumption

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-05 Thread Gourav Sengupta
Hi, can you please give details around: spark version, what operation you are running, why in loops, whether you are caching any data or not, and whether you are referencing the variables to create them, like in the following expression where we reference x to create x: x = x +
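
The pattern being asked about is roughly the one below: reassigning the same DataFrame variable so every iteration references the previous result, which makes the lineage grow with each pass of the loop (a hedged illustration, not the poster's actual code):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("self-reference-sketch").getOrCreate()

result = None
for day in ["2022-01-01", "2022-01-02", "2022-01-03"]:
    daily = spark.range(100).withColumn("day", F.lit(day))  # stand-in for real per-day work
    # "x = x + ..." style: result is built from the previous result,
    # so the query plan keeps growing across iterations.
    result = daily if result is None else result.unionByName(daily)

result.write.mode("overwrite").parquet("hdfs:///tmp/self-reference-demo")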

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-04 Thread Joris Billen
Clear - probably not a good idea. But a previous comment said "you are doing everything in the end in one go". So this made me wonder: in case your only action is a write at the end after lots of complex transformations, then what is the alternative to writing at the end, which means doing

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-01 Thread Sean Owen
This feels like premature optimization, and not clear it's optimizing, but maybe. Caching things that are used once is worse than not caching. It looks like a straight-line through to the write, so I doubt caching helps anything here. On Fri, Apr 1, 2022 at 2:49 AM Joris Billen wrote: > Hi, >

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-01 Thread Joris Billen
Hi, as said, thanks for the little discussion over mail. I understand that the action is triggered at the end by the write and then all of a sudden everything is executed at once. But I don't really need to trigger an action before. I am caching somewhere a df that will be reused several times

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-31 Thread Enrico Minack
How well Spark can scale up with your data (in terms of years of data) depends on two things: the operations performed on the data, and characteristics of the data, like value distributions. Failing tasks smell like you are using operations that do not scale (e.g. Cartesian product of your
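
As an illustration of the "operations that do not scale" point: a join with no usable equality condition degenerates into a Cartesian product, whose size grows with the product of both sides, while a keyed join stays roughly proportional to the inputs (the column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cartesian-sketch").getOrCreate()

a = spark.range(1000000).withColumnRenamed("id", "key")
b = spark.range(1000000).withColumnRenamed("id", "key")

# Grows as |a| * |b| - the kind of operation that stops working
# as more days (years) of data are added.
cartesian = a.crossJoin(b)

# Grows roughly with the size of the inputs instead.
keyed = a.join(b, on="key")
print(keyed.count())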

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-31 Thread Sean Owen
If that is your loop unrolled, then you are not doing parts of work at a time. That will execute all operations in one go when the write finally happens. That's OK, but may be part of the problem. For example if you are filtering for a subset, processing, and unioning, then that is just a harder
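
A hedged sketch of that contrast: the unrolled loop filters one day at a time, processes it and unions the pieces back into one large plan, whereas the same result can usually be expressed as a single pass over the full range (the 'day' and 'user_id' columns and the aggregation are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("loop-vs-single-pass").getOrCreate()

events = spark.read.parquet("hdfs:///data/events")  # assumed to have 'day' and 'user_id' columns
days = ["2022-01-01", "2022-01-02", "2022-01-03"]

# Loop-unrolled: every iteration adds filter + aggregate + union to the
# same plan, all of which executes at once when the write happens.
looped = None
for day in days:
    daily = events.filter(F.col("day") == day).groupBy("day", "user_id").count()
    looped = daily if looped is None else looped.unionByName(daily)

# Single pass over the whole range: one grouping, a much simpler plan.
single_pass = (
    events.filter(F.col("day").isin(days))
          .groupBy("day", "user_id")
          .count()
)

single_pass.write.mode("overwrite").parquet("hdfs:///data/daily_counts")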

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-31 Thread Joris Billen
Thanks for the reply :-) I am using pyspark. Basically my code (simplified) is: df=spark.read.csv(hdfs://somehdfslocation) df1=spark.sql(complex statement using df) ... dfx=spark.sql(complex statement using df x-1) ... dfx15.write() What exactly is meant by "closing resources"? Is it just
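
Filling in that outline as a runnable sketch (the paths, view names and SQL are invented stand-ins for the "complex statements"); "closing resources" at the application level would mean stopping the SparkSession once everything is written:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chained-sql-sketch").getOrCreate()

# df = spark.read.csv(hdfs://somehdfslocation)
df = spark.read.csv("hdfs:///some/hdfs/location")
df.createOrReplaceTempView("step0")

# df1 = spark.sql(complex statement using df), ..., dfx = spark.sql(... using df x-1)
df1 = spark.sql("SELECT * FROM step0 WHERE _c0 IS NOT NULL")           # placeholder SQL
df1.createOrReplaceTempView("step1")
df2 = spark.sql("SELECT _c0, COUNT(*) AS n FROM step1 GROUP BY _c0")   # placeholder SQL

# dfx15.write() - the single action that triggers the whole chain.
df2.write.mode("overwrite").parquet("hdfs:///some/hdfs/output")

# Release cached data (if any) and stop the session once the job is done.
spark.catalog.clearCache()
spark.stop()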

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Enrico Minack
> Wrt looping: if I want to process 3 years of data, my modest cluster will never do it in one go, I would expect? > I have to break it down into smaller pieces and run that in a loop (1 day is already lots of data). Well, that is exactly what Spark is made for. It splits the work up and

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Bjørn Jørgensen
It's quite impossible for anyone to answer your question about what is eating your memory without even knowing what language you are using. If you are using C then it's always pointers; that's the mem issue. If you are using Python, it can be something like not using a context manager, like with
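
The context-manager point is plain Python rather than Spark: a with block releases the resource deterministically, even on an exception, whereas a bare open() relies on the garbage collector (a generic illustration, assuming a local data.txt):

# Handle is only released when the object is garbage collected
# (or not at all on some error paths).
f = open("data.txt")
lines = f.readlines()
f.close()

# With a context manager the handle is closed as soon as the block exits,
# even if readlines() raises.
with open("data.txt") as f:
    lines = f.readlines()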

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Joris Billen
Thanks for the answer, much appreciated! This forum is very useful :-) I didn't know the sparkcontext stays alive. I guess this is eating up memory. The eviction means that Spark knows it should clear some of the old cached memory to be able to store new data. In case anyone has good articles about

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Sean Owen
The Spark context does not stop when a job does. It stops when you stop it. There could be many ways mem can leak. Caching maybe - but it will evict. You should be clearing caches when no longer needed. I would guess it is something else your program holds on to in its logic. Also consider not
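
A sketch of "clearing caches when no longer needed" inside a per-day loop: unpersist the DataFrame once its last action has run, or drop everything between iterations with spark.catalog.clearCache() (the paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clear-cache-sketch").getOrCreate()

for day in ["2022-01-01", "2022-01-02"]:
    df = spark.read.parquet(f"hdfs:///data/in/day={day}").cache()

    # Several actions reuse the cached df in the real job ...
    df.count()
    df.write.mode("overwrite").parquet(f"hdfs:///data/out/day={day}")

    # Release this iteration's cached blocks before the next day starts.
    df.unpersist()

# Or, more bluntly, drop every cached table/DataFrame at once.
spark.catalog.clearCache()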

loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Joris Billen
Hi, I have a pyspark job submitted through spark-submit that does some heavy processing for 1 day of data. It runs with no errors. I have to loop over many days, so I run this spark job in a loop. I notice that after a couple of executions the memory is increasing on all worker nodes and eventually this
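
One reading of the setup described here is an outer loop driving one spark-submit per day; a minimal sketch, assuming a hypothetical process_one_day.py script that takes the day as its argument:

import subprocess
from datetime import date, timedelta

# Hypothetical driver loop: one spark-submit per day, so each run gets
# its own SparkContext and its executors' memory is released afterwards.
start = date(2022, 3, 1)
for i in range(3):
    day = (start + timedelta(days=i)).isoformat()
    subprocess.run(
        ["spark-submit", "--master", "yarn", "process_one_day.py", day],
        check=True,
    )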