Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Enrico Minack
> Wrt looping: if I want to process 3 years of data, my modest cluster 
will never do it in one go, I would expect?
> I have to break it down into smaller pieces and run that in a loop (1 
day is already lots of data).


Well, that is exactly what Spark is made for. It splits the work up and 
processes it in small pieces, called partitions. No matter how much data 
you have, it will probably work even on your laptop (as long as the data 
fits on disk), though it will take some time. But it will succeed. A large 
cluster does nothing different, except that it processes more partitions 
in parallel.


You should expect it to work, no matter how many years of data you have. 
If it does not, you have to rethink your Spark code, not your cluster size.
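
For illustration, a minimal PySpark sketch of that idea (the paths, dates 
and aggregation below are placeholders, not taken from this thread): 
process the whole range in one job and let Spark split it into partitions, 
instead of looping day by day.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("three-years-in-one-go").getOrCreate()

# Hypothetical layout: one Parquet dataset with an 'event_date' column.
# Spark reads it as many partitions and processes a few at a time, so the
# full range never has to fit in cluster memory at once.
df = (spark.read.parquet("/data/events")  # assumed input path
        .where(F.col("event_date").between("2019-01-01", "2021-12-31")))

result = df.groupBy("event_date").agg(F.count("*").alias("rows"))  # placeholder work

result.write.mode("overwrite").parquet("/data/daily_counts")  # assumed output path
spark.stop()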


Share some code that does not work with 3 years of data and people might 
be able to help. Without that, speculation is all you will get.


Enrico



On 30.03.22 at 17:40, Joris Billen wrote:

Thanks for the answer - much appreciated! This forum is very useful :-)

I didn't know the SparkContext stays alive. I guess this is eating up 
memory. The eviction means that it knows it should clear some of 
the old cached memory to make room for new data. If anyone has 
good articles about memory leaks, I would be interested to read them.
I will try to add the following lines at the end of my job (as I cached 
the table in Spark SQL):



sqlContext.sql("UNCACHE TABLE mytableofinterest")
spark.stop()


Wrt looping: if I want to process 3 years of data, my modest cluster 
will never do it in one go, I would expect? I have to break it down into 
smaller pieces and run that in a loop (1 day is already lots of data).




Thanks!





On 30 Mar 2022, at 17:25, Sean Owen  wrote:

The Spark context does not stop when a job does. It stops when you 
stop it. There could be many ways mem can leak. Caching maybe - but 
it will evict. You should be clearing caches when no longer needed.


I would guess it is something else your program holds on to in its 
logic.


Also consider not looping; there is probably a faster way to do it in 
one go.


On Wed, Mar 30, 2022, 10:16 AM Joris Billen 
 wrote:


Hi,
I have a pyspark job submitted through spark-submit that does
some heavy processing for 1 day of data. It runs with no errors.
I have to loop over many days, so I run this spark job in a loop.
I notice that after a couple of executions the memory increases on all
worker nodes and eventually this leads to failures. My job does
some caching, but I understand that when the job ends
successfully the SparkContext is destroyed and the cache
should be cleared. However, it seems that something keeps
filling the memory a bit more after each run. This is
the memory behaviour over time, which in the end starts
leading to failures:

(what we see is: green = physical memory used, green-blue = physical
memory cached, grey = memory capacity, a straight line around 31 GB)
This runs on a healthy Spark 2.4 and was already optimized to
come to a stable job in terms of spark-submit resource
parameters like
driver-memory/num-executors/executor-memory/executor-cores/spark.locality.wait.
Any clue how to “really” clear the memory in between jobs? Currently
I can loop about 10 times and then need to restart my
cluster so that all memory is cleared completely.


Thanks for any info!






Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Bjørn Jørgensen
It's nearly impossible for anyone to answer your question about what is
eating your memory without even knowing what language you are using.

If you are using C, then it's always pointers; that's the memory issue.
If you are using Python, a common cause is not using context managers
(see "Context Managers and Python's with Statement").

Another one can be not closing resources after use.

In my experience you can process 3 years or more of data, IF you are
closing opened resources.
I use the web UI at http://spark:4040 to follow what Spark is doing.
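
For the Python points above (context managers and closing resources), a
minimal sketch of what that can look like; the wrapper, paths and dates are
illustrative, not taken from this thread.

from contextlib import contextmanager
from pyspark.sql import SparkSession

@contextmanager
def spark_session(app_name):
    # Yield a SparkSession and guarantee cleanup when the block exits.
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    try:
        yield spark
    finally:
        spark.catalog.clearCache()  # drop anything still cached
        spark.stop()                # release driver and executor resources

# Usage: resources are released even if the processing raises an exception.
with spark_session("daily-etl") as spark:
    df = spark.read.parquet("/data/events/2022-03-30")  # assumed path
    df.cache()
    df.count()  # stand-in for the real per-day processing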




On Wed, 30 Mar 2022 at 17:41, Joris Billen <
joris.bil...@bigindustries.be> wrote:

> Thanks for the answer - much appreciated! This forum is very useful :-)
>
> I didn't know the SparkContext stays alive. I guess this is eating up
> memory. The eviction means that it knows it should clear some of the
> old cached memory to make room for new data. If anyone has good
> articles about memory leaks, I would be interested to read them.
> I will try to add the following lines at the end of my job (as I cached
> the table in Spark SQL):
>
>
> sqlContext.sql("UNCACHE TABLE mytableofinterest")
> spark.stop()
>
>
> Wrt looping: if I want to process 3 years of data, my modest cluster will
> never do it in one go, I would expect? I have to break it down into smaller
> pieces and run that in a loop (1 day is already lots of data).
>
>
>
> Thanks!
>
>
>
>
> On 30 Mar 2022, at 17:25, Sean Owen  wrote:
>
> The Spark context does not stop when a job does. It stops when you stop
> it. There could be many ways mem can leak. Caching maybe - but it will
> evict. You should be clearing caches when no longer needed.
>
> I would guess it is something else your program holds on to in its logic.
>
> Also consider not looping; there is probably a faster way to do it in one
> go.
>
> On Wed, Mar 30, 2022, 10:16 AM Joris Billen 
> wrote:
>
>> Hi,
>> I have a pyspark job submitted through spark-submit that does some heavy
>> processing for 1 day of data. It runs with no errors. I have to loop over
>> many days, so I run this spark job in a loop. I notice that after a couple
>> of executions the memory increases on all worker nodes and eventually this
>> leads to failures. My job does some caching, but I understand that when
>> the job ends successfully the SparkContext is destroyed and the cache
>> should be cleared. However, it seems that something keeps filling the
>> memory a bit more after each run. This is the memory behaviour
>> over time, which in the end starts leading to failures:
>>
>> (what we see is: green = physical memory used, green-blue = physical memory
>> cached, grey = memory capacity, a straight line around 31 GB)
>> This runs on a healthy Spark 2.4 and was already optimized to come to a
>> stable job in terms of spark-submit resource parameters like
>> driver-memory/num-executors/executor-memory/executor-cores/spark.locality.wait.
>> Any clue how to “really” clear the memory in between jobs? Currently
>> I can loop about 10 times and then need to restart my cluster so all memory
>> is cleared completely.
>>
>>
>> Thanks for any info!
>>
>> 
>
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Joris Billen
Thanks for the answer - much appreciated! This forum is very useful :-)

I didn't know the SparkContext stays alive. I guess this is eating up memory. 
The eviction means that it knows it should clear some of the old cached 
memory to make room for new data. If anyone has good articles about 
memory leaks, I would be interested to read them.
I will try to add the following lines at the end of my job (as I cached the table 
in Spark SQL):


sqlContext.sql("UNCACHE TABLE mytableofinterest")
spark.stop()


Wrt looping: if I want to process 3 years of data, my modest cluster will never 
do it in one go, I would expect? I have to break it down into smaller pieces and 
run that in a loop (1 day is already lots of data).



Thanks!




On 30 Mar 2022, at 17:25, Sean Owen <sro...@gmail.com> 
wrote:

The Spark context does not stop when a job does. It stops when you stop it. 
There could be many ways mem can leak. Caching maybe - but it will evict. You 
should be clearing caches when no longer needed.

I would guess it is something else your program holds on to in its logic.

Also consider not looping; there is probably a faster way to do it in one go.

On Wed, Mar 30, 2022, 10:16 AM Joris Billen 
<joris.bil...@bigindustries.be> wrote:
Hi,
I have a pyspark job submitted through spark-submit that does some heavy 
processing for 1 day of data. It runs with no errors. I have to loop over many 
days, so I run this spark job in a loop. I notice that after a couple of executions 
the memory increases on all worker nodes and eventually this leads to 
failures. My job does some caching, but I understand that when the job ends 
successfully the SparkContext is destroyed and the cache should be 
cleared. However, it seems that something keeps filling the memory a bit 
more after each run. This is the memory behaviour over time, which in the 
end starts leading to failures:
[inline image: worker memory usage over time]

(what we see is: green = physical memory used, green-blue = physical memory cached, 
grey = memory capacity, a straight line around 31 GB)
This runs on a healthy Spark 2.4 and was already optimized to come to a stable 
job in terms of spark-submit resource parameters like 
driver-memory/num-executors/executor-memory/executor-cores/spark.locality.wait.
Any clue how to “really” clear the memory in between jobs? Currently 
I can loop about 10 times and then need to restart my cluster so all memory is 
cleared completely.


Thanks for any info!





Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Sean Owen
The Spark context does not stop when a job does. It stops when you stop it.
There could be many ways mem can leak. Caching maybe - but it will evict.
You should be clearing caches when no longer needed.
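
As a minimal sketch of that pattern (the dates, paths and per-day work are
placeholders, not from this thread): cache per iteration and release it
explicitly instead of relying on eviction.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-loop").getOrCreate()

for day in ["2022-03-28", "2022-03-29", "2022-03-30"]:  # illustrative dates
    df = spark.read.parquet(f"/data/events/{day}")       # assumed layout
    df.cache()
    df.count()  # materialise the cache; stand-in for the real per-day work
    df.unpersist(blocking=True)  # free executor storage memory before the next day

spark.catalog.clearCache()  # drop anything still cached before stopping
spark.stop()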

I would guess it is something else your program holds on to in its logic.

Also consider not looping; there is probably a faster way to do it in one
go.

On Wed, Mar 30, 2022, 10:16 AM Joris Billen 
wrote:

> Hi,
> I have a pyspark job submitted through spark-submit that does some heavy
> processing for 1 day of data. It runs with no errors. I have to loop over
> many days, so I run this spark job in a loop. I notice that after a couple
> of executions the memory increases on all worker nodes and eventually this
> leads to failures. My job does some caching, but I understand that when
> the job ends successfully the SparkContext is destroyed and the cache
> should be cleared. However, it seems that something keeps filling the
> memory a bit more after each run. This is the memory behaviour
> over time, which in the end starts leading to failures:
>
> (what we see is: green = physical memory used, green-blue = physical memory
> cached, grey = memory capacity, a straight line around 31 GB)
> This runs on a healthy Spark 2.4 and was already optimized to come to a
> stable job in terms of spark-submit resource parameters like
> driver-memory/num-executors/executor-memory/executor-cores/spark.locality.wait.
> Any clue how to “really” clear the memory in between jobs? Currently
> I can loop about 10 times and then need to restart my cluster so all memory
> is cleared completely.
>
>
> Thanks for any info!
>
>


loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Joris Billen
Hi,
I have a pyspark job submitted through spark-submit that does some heavy 
processing for 1 day of data. It runs with no errors. I have to loop over many 
days, so I run this spark job in a loop. I notice that after a couple of executions 
the memory increases on all worker nodes and eventually this leads to 
failures. My job does some caching, but I understand that when the job ends 
successfully the SparkContext is destroyed and the cache should be 
cleared. However, it seems that something keeps filling the memory a bit 
more after each run. This is the memory behaviour over time, which in the 
end starts leading to failures:
[inline image: worker memory usage over time]

(what we see is: green = physical memory used, green-blue = physical memory cached, 
grey = memory capacity, a straight line around 31 GB)
This runs on a healthy Spark 2.4 and was already optimized to come to a stable 
job in terms of spark-submit resource parameters like 
driver-memory/num-executors/executor-memory/executor-cores/spark.locality.wait.
Any clue how to “really” clear the memory in between jobs? Currently 
I can loop about 10 times and then need to restart my cluster so all memory is 
cleared completely.


Thanks for any info!



RE: [EXTERNAL] Re: spark ETL and spark thrift server running together

2022-03-30 Thread Alex Kosberg
Hi Christophe,
Thank you for the explanation!

Regards,
Alex


From: Christophe Préaud 
Sent: Wednesday, March 30, 2022 3:43 PM
To: Alex Kosberg ; user@spark.apache.org
Subject: [EXTERNAL] Re: spark ETL and spark thrift server running together

Hi Alex,

As stated in the Hive documentation 
(https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration):

An embedded metastore database is mainly used for unit tests. Only one process 
can connect to the metastore database at a time, so it is not really a 
practical solution but works well for unit tests.

You need to set up a remote metastore database (e.g. MariaDB / MySQL) for 
production use.

Regards,
Christophe.

On 3/30/22 13:31, Alex Kosberg wrote:
Hi,
Some details:
1.   Spark SQL (version 3.2.1)
2.   Driver: Hive JDBC (version 2.3.9)
3.   ThriftCLIService: Starting ThriftBinaryCLIService on port 1 with 
5...500 worker threads
4.   BI tool is connected via ODBC driver
After activating Spark Thrift Server I'm unable to run a pyspark script using 
spark-submit, as they both use the same metastore_db.
error:
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3acaa384,
 see the next exception for details.
at org.apache.derby.iapi.error.StandardException.newException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
... 140 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /tmp/metastore_db.

I need to be able to run PySpark (Spark ETL) while having spark thrift server 
up for BI tool queries. Any workaround for it?
Thanks!


Notice: This e-mail together with any attachments may contain information of 
Ribbon Communications Inc. and its Affiliates that is confidential and/or 
proprietary for the sole use of the intended recipient. Any review, disclosure, 
reliance or distribution by others or forwarding without express permission is 
strictly prohibited. If you are not the intended recipient, please notify the 
sender immediately and then delete all copies, including any attachments.





Re: spark ETL and spark thrift server running together

2022-03-30 Thread Christophe Préaud
Hi Alex,

As stated in the Hive documentation 
(https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration):

An embedded metastore database is mainly used for unit tests. Only one process 
can connect to the metastore database at a time, so it is not really a 
practical solution but works well for unit tests.


You need to set up a remote metastore database (e.g. MariaDB / MySQL) for 
production use.
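
As a minimal sketch of what that can look like from the Spark side (the
hostnames, ports and credentials below are placeholders, not values from this
thread), pointing the session at a shared metastore instead of the embedded
Derby one:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("etl-with-shared-metastore")
         .enableHiveSupport()
         # Option 1: a standalone Hive metastore service in front of the database.
         .config("hive.metastore.uris", "thrift://metastore-host:9083")
         # Option 2 (instead of the above): connect to the metastore database
         # directly; requires the MariaDB/MySQL JDBC driver on the classpath.
         # .config("javax.jdo.option.ConnectionURL",
         #         "jdbc:mysql://db-host:3306/metastore?createDatabaseIfNotExist=true")
         # .config("javax.jdo.option.ConnectionDriverName", "org.mariadb.jdbc.Driver")
         # .config("javax.jdo.option.ConnectionUserName", "hive")
         # .config("javax.jdo.option.ConnectionPassword", "***")
         .getOrCreate())

With either option, the Thrift server and spark-submit jobs no longer compete 
for the single-process Derby metastore_db.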

Regards,
Christophe.

On 3/30/22 13:31, Alex Kosberg wrote:
>
> Hi,
>
> Some details:
>
> · Spark SQL (version 3.2.1)
>
> · Driver: Hive JDBC (version 2.3.9)
>
> · ThriftCLIService: Starting ThriftBinaryCLIService on port 1 
> with 5...500 worker threads
>
> · BI tool is connected via ODBC driver
>
> After activating Spark Thrift Server I'm unable to run a pyspark script using 
> spark-submit, as they both use the same metastore_db.
>
> error:
>
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3acaa384, see 
> the next exception for details.
>
>     at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>
>     at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>
>     ... 140 more
>
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /tmp/metastore_db.
>
>  
>
> I need to be able to run PySpark (Spark ETL) while having spark thrift server 
> up for BI tool queries. Any workaround for it?
>
> Thanks!
>
>  
>
>
> Notice: This e-mail together with any attachments may contain information of 
> Ribbon Communications Inc. and its Affiliates that is confidential and/or 
> proprietary for the sole use of the intended recipient. Any review, 
> disclosure, reliance or distribution by others or forwarding without express 
> permission is strictly prohibited. If you are not the intended recipient, 
> please notify the sender immediately and then delete all copies, including 
> any attachments.



spark ETL and spark thrift server running together

2022-03-30 Thread Alex Kosberg
Hi,
Some details:
* Spark SQL (version 3.2.1)
* Driver: Hive JDBC (version 2.3.9)
* ThriftCLIService: Starting ThriftBinaryCLIService on port 1 with 
5...500 worker threads
* BI tool is connected via ODBC driver
After activating Spark Thrift Server I'm unable to run a pyspark script using 
spark-submit, as they both use the same metastore_db.
error:
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3acaa384,
 see the next exception for details.
at org.apache.derby.iapi.error.StandardException.newException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
... 140 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /tmp/metastore_db.

I need to be able to run PySpark (Spark ETL) while having spark thrift server 
up for BI tool queries. Any workaround for it?
Thanks!


Notice: This e-mail together with any attachments may contain information of 
Ribbon Communications Inc. and its Affiliates that is confidential and/or 
proprietary for the sole use of the intended recipient. Any review, disclosure, 
reliance or distribution by others or forwarding without express permission is 
strictly prohibited. If you are not the intended recipient, please notify the 
sender immediately and then delete all copies, including any attachments.

Call for Presentations now open, ApacheCon North America 2022

2022-03-30 Thread Rich Bowen
[You are receiving this because you are subscribed to one or more user
or dev mailing list of an Apache Software Foundation project.]

ApacheCon draws participants at all levels to explore “Tomorrow’s
Technology Today” across 300+ Apache projects and their diverse
communities. ApacheCon showcases the latest developments in ubiquitous
Apache projects and emerging innovations through hands-on sessions,
keynotes, real-world case studies, trainings, hackathons, community
events, and more.

The Apache Software Foundation will be holding ApacheCon North America
2022 at the New Orleans Sheraton, October 3rd through 6th, 2022. The
Call for Presentations is now open, and will close at 00:01 UTC on May
23rd, 2022.

We are accepting presentation proposals for any topic that is related
to the Apache mission of producing free software for the public good.
This includes, but is not limited to:

Community
Big Data
Search
IoT
Cloud
Fintech
Pulsar
Tomcat

You can submit your session proposals starting today at
https://cfp.apachecon.com/

Rich Bowen, on behalf of the ApacheCon Planners
apachecon.com
@apachecon

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Unusual bug, please help me, I can do nothing!!!

2022-03-30 Thread spark User
Hello, I am a Spark user. I use the "spark-shell.cmd" startup command in 
Windows cmd. The first startup is normal, but when I use "Ctrl+C" to 
force-close the Spark window, it cannot start normally again. The error 
message is as follows: "Failed to initialize Spark session. 
org.apache.spark.SparkException: Invalid Spark URL: 
spark://HeartbeatReceiver@x.168.137.41:49963".
When I try to add "x.168.137.41" to 'etc/hosts' it works fine, but after I use 
"Ctrl+C" again, it once more cannot start normally. Please help me.