Re: Can JVisual VM monitoring tool be used to Monitor Spark Executor Memory and CPU

2021-03-21 Thread Attila Zsolt Piros
Hi Ranju!

I am quite sure that for your requirement, "monitor every component and isolate
the resources consumed individually by every component", Spark's metrics system is
the right direction to go.

> Why only UsedstorageMemory should be checked?

Right, for your case storage memory alone won't be enough; you need the execution
memory and the JVM memory too.
I expect ".JVMHeapMemory" and ".JVMOffHeapMemory" are what you are looking for.

> Also I noticed cpuTime provides cpu time spent by an executor. But there
is no metric by which I can calculate the number of cores.


The number of cores is specified at spark-submit time (spark.executor.cores). IIRC,
if you pass 3 it means that each executor can run a maximum of 3 tasks at the same
time, so all of those cores will be used as long as there are enough tasks. I know
this is not a perfect solution, but I hope it helps.
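
So the core count is configuration rather than a metric, and you can simply read it
back, for example (a small sketch, assuming an active SparkSession named `spark`
and that the property was set explicitly at submit time):

# Read the configured executor core count back from the SparkConf; the default
# of 1 here only applies when spark.executor.cores was not set explicitly.
cores_per_executor = int(spark.sparkContext.getConf().get("spark.executor.cores", "1"))
print(f"each executor can run up to {cores_per_executor} tasks concurrently")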

> Also I see Grafana, a very good visualization tool where I see all the
metrics can be viewed , but I have less idea for steps to install on
virtual server and integrate.

I cannot help with specifics here, but a monitoring system is a good idea, whether
Grafana or Prometheus.
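
For what it is worth, a hedged sketch of how a sink could be wired up purely
through Spark conf entries (the Graphite host is a placeholder, GraphiteSink is
just one option that Grafana can read from, and the JmxSink line corresponds to
the JMXSink idea you mention below):

from pyspark.sql import SparkSession

# Metrics sinks can be configured straight from SparkConf with the
# "spark.metrics.conf." prefix instead of a separate metrics.properties file.
spark = (SparkSession.builder
         .appName("metrics-sink-demo")
         .config("spark.metrics.conf.*.sink.jmx.class",
                 "org.apache.spark.metrics.sink.JmxSink")
         .config("spark.metrics.conf.*.sink.graphite.class",
                 "org.apache.spark.metrics.sink.GraphiteSink")
         .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com")
         .config("spark.metrics.conf.*.sink.graphite.port", "2003")
         .getOrCreate())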

Best regards,
Attila

On Sun, Mar 21, 2021 at 3:01 PM Ranju Jain  wrote:

> Hi Mich/Attila,
>
>
>
> @Mich Talebzadeh : I considered spark GUI ,
> but I have a confusion first at memory level.
>
>
>
> App Configuration: spark.executor.memory= 4g for running spark job.
>
>
>
> In spark GUI I see running spark job has Peak Execution Memory is 1 KB as
> highlighted below:
>
> I do not have Storage Memory screenshot. So  I calculated Total Memory
> consumption at that point of time was:
>
>
>
> Spark UI shows:  spark.executor.memory = Peak Execution Memory + Storage Mem + Reserved Mem + User Memory
>                                        = 1 KB + Storage Mem + 300 MB + (4 GB * 0.25)
>                                        = 1 KB + Storage Mem + 300 MB + 1 GB
>                                        = approx 1.5 GB
>
>
>
>
>
>
>
> And if I look at the actual memory consumption of executors 0, 1 and 2 on the
> virtual server using the *top* command, it shows the readings below:
>
>
>
> Executor – 2:   *top*
>
>
>
>
>
> Executor-0 :*top*
>
>
>
> Please suggest: on the Spark GUI, can I go with the formula below to isolate how
> much memory the Spark component is consuming out of the several other components
> of a web application?
>
>   spark.executor.memory = Peak Execution Memory + Storage Mem + Reserved Mem + User Memory
>                         = 1 KB + Storage Mem + 300 MB + (4 GB * 0.25)
>
>
>
>
>
> @Attila Zsolt Piros : I checked the
> *memoryMetrics.** of executor-metrics
> ,
> but here I have a confusion about
>
> usedOnHeapStorageMemory
>
> usedOffHeapStorageMemory
>
> totalOnHeapStorageMemory
>
> totalOffHeapStorageMemory
>
>
>
> *Why only UsedstorageMemory should be checked?*
>
>
>
> To isolate spark.executor.memory, Should I check *memoryMetrics**.**
> where *only storageMemory* is given  or Should I check *peakMemoryMetrics*.*
> where all Peaks are specified
>
>1. Execution
>2. Storage
>3. JVM Heap
>
>
>
> Also I noticed cpuTime provides cpu time spent by an executor. But there
> is no metric by which I can calculate the number of cores.
>
>
>
> As suggested, I checked Luca Canali’s presentation, there I see JMXSink
> which Registers metrics for viewing in JMX Console. I think exposing this
> metric via JMXSink take it to visualize
>
> spark.executor.memory and number of cores by an executor on Java
> Monitoring tool.
>
> Also I see Grafana, a very good visualization tool where I see all the
> metrics can be viewed , but I have less idea for steps to install on
> virtual server and integrate. I need to go through in detail the Grafana.
>
>
>
> Kindly suggest your views.
>
>
>
> Regards
>
> Ranju
>
>
>
> *From:* Attila Zsolt Piros 
> *Sent:* Sunday, March 21, 2021 3:42 AM
> *To:* Mich Talebzadeh 
> *Cc:* Ranju Jain ; user@spark.apache.org
> *Subject:* Re: Can JVisual VM monitoring tool be used to Monitor Spark
> Executor Memory and CPU
>
>
>
> Hi Ranju!
>
> You can configure Spark's metric system.
>
> Check the *memoryMetrics.** of executor-metrics
> 
>  and
> in the component-instance-executor
> 
>  the
> CPU times.
>
> Regarding the details I suggest to check Luca Canali's presentations about
> Spark's metric system and maybe his github repo
> 
> .
>
> Best Regards,
> Attila
>
>
>
> On Sat, Mar 20, 2021 at 5:41 PM Mich Talebzadeh 
> wrote:
>
> Hi,
>
>
>
> Have you considered spark GUI first?
>
>
>
>
>
 view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and 

[Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-21 Thread Gaurav Singh
Hi Team,

We have lots of complex Oracle views (containing multiple tables, joins,
analytical and aggregate functions, subqueries, etc.) and we are wondering
whether Spark can help us execute those views faster.

Also, we want to know whether those complex views can be implemented using
Spark SQL.

Thanks and regards,
Gaurav Singh
+91 8600852256


Re: Spark version verification

2021-03-21 Thread Kent Yao
Hi Mich,

> What are the correlations among these links and the ability to establish a spark build version

Check the documentation list here: http://spark.apache.org/documentation.html. The
`latest` link always points to the head of the list; for example,
http://spark.apache.org/docs/latest/ currently means http://spark.apache.org/docs/3.1.1/.

The Spark build version in Spark releases is created by `spark-build-info`, see
https://github.com/apache/spark/blob/89bf2afb3337a44f34009a36cae16dd0ff86b353/build/spark-build-info#L32

Some other options to check the Spark build info:

1. the `RELEASE` file

cat RELEASE
Spark 3.0.1 (git revision 2b147c4cd5) built for Hadoop 2.7.4
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036

2. bin/spark-submit --version

The git revision itself does not tell you whether the release is an RC or final.
If you have the Spark source code locally, you can use `git show
1d550c4e90275ab418b9161925049239227f3dc9` and get the tag info, like `commit
1d550c4e90275ab418b9161925049239227f3dc9 (tag: v3.1.1-rc3, tag: v3.1.1)`.
Or you can compare the revision you have got with all tags here:
https://github.com/apache/spark/tags

Bests,

Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
spark-authorizer: A Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres: A library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: A library that brings excellent and useful functions from various modern database management systems to Apache Spark.
On 03/22/2021 00:02, Mich Talebzadeh wrote:

Hi Kent,

Thanks for the links.

You have to excuse my ignorance, what are the correlations among these links and
the ability to establish a spark build version?

   view my Linkedin profile

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  

On Sun, 21 Mar 2021 at 15:55, Kent Yao  wrote:

Please refer to http://spark.apache.org/docs/latest/api/sql/index.html#version

On 03/21/2021 23:28, Mich Talebzadeh wrote:

Many thanks

spark-sql> SELECT version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

What does 1d550c4e90275ab418b9161925049239227f3dc9 signify please?



   view my Linkedin profile

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  

On Sun, 21 Mar 2021 at 15:14, Sean Owen  wrote:

I believe you can "SELECT version()" in Spark SQL to see the build version.

On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh  wrote:

Thanks for the detailed info.

I was hoping that one can find a simpler answer to the Spark version than doing
forensic examination on base code so to speak.

The primer for this verification is that on GCP dataprocs originally built on
3.11-rc2, there was an issue with running Spark Structured Streaming (SSS) which I
reported to this forum before.

After a while and me reporting to Google, they have now upgraded the base to Spark
3.1.1 itself. I am not privy to how they did the upgrade itself.

In the meantime we installed 3.1.1 on-premise and ran it with the same Python code
for SSS. It worked fine.

However, when I run the same code on GCP dataproc upgraded to 3.1.1, occasionally I
see this error

21/03/18 16:53:38 ERROR

Re: Spark version verification

2021-03-21 Thread Attila Zsolt Piros
Hi!

Thanks Sean and Kent! By reading your answers I have also learnt something
new.

@Mich Talebzadeh: see the commit content by prefixing it with
*https://github.com/apache/spark/commit/*.
So in your case
https://github.com/apache/spark/commit/1d550c4e90275ab418b9161925049239227f3dc9
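
A tiny sketch of the same lookup from code, if that is more convenient (it assumes
an active SparkSession named `spark`):

# Split the output of the version() SQL function into release and git revision,
# then build the commit URL described above.
release, revision = spark.sql("SELECT version()").first()[0].split(" ")
print(release)                                                  # e.g. 3.1.1
print("https://github.com/apache/spark/commit/" + revision)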

Best Regards,
Attila

On Sun, Mar 21, 2021 at 5:02 PM Mich Talebzadeh 
wrote:

>
> Hi Kent,
>
> Thanks for the links.
>
> You have to excuse my ignorance, what are the correlations among these
> links and the ability to establish a spark build version?
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 21 Mar 2021 at 15:55, Kent Yao  wrote:
>
>> Please refer to
>> http://spark.apache.org/docs/latest/api/sql/index.html#version
>>
>> *Kent Yao *
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>> *kyuubi is a
>> unified multi-tenant JDBC interface for large-scale data processing and
>> analytics, built on top of Apache Spark .*
>> *spark-authorizer A Spark
>> SQL extension which provides SQL Standard Authorization for **Apache
>> Spark .*
>> *spark-postgres  A library
>> for reading data from and transferring data to Postgres / Greenplum with
>> Spark SQL and DataFrames, 10~100x faster.*
>> *spark-func-extras A
>> library that brings excellent and useful functions from various modern
>> database management systems to Apache Spark .*
>>
>>
>>
>> On 03/21/2021 23:28,Mich Talebzadeh
>>  wrote:
>>
>> Many thanks
>>
>> spark-sql> SELECT version();
>> 3.1.1 1d550c4e90275ab418b9161925049239227f3dc9
>>
>> What does 1d550c4e90275ab418b9161925049239227f3dc9 signify please?
>>
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sun, 21 Mar 2021 at 15:14, Sean Owen  wrote:
>>
>>> I believe you can "SELECT version()" in Spark SQL to see the build
>>> version.
>>>
>>> On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thanks for the detailed info.

 I was hoping that one can find a simpler answer to the Spark version
 than doing forensic examination on base code so to speak.

 The primer for this verification is that on GCP dataprocs originally
 built on 3.11-rc2, there was an issue with running Spark Structured
 Streaming (SSS) which I reported to this forum before.

 After a while and me reporting to Google, they have now upgraded the
 base to Spark 3.1.1 itself. I am not privy to how they did the upgrade
 itself.

 In the meantime we installed 3.1.1 on-premise and ran it with the same
 Python code for SSS. It worked fine.

 However, when I run the same code on GCP dataproc upgraded to 3.1.1,
 occasionally I see this error

 21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue:
 Listener EventLoggingListener threw an exception

 java.util.ConcurrentModificationException

 at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)

 This may be for other reasons or the consequence of upgrading from
 3.1.1-rc2 to 3.11?



view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Sat, 20 Mar 2021 at 22:41, Attila Zsolt Piros <
 piros.attila.zs...@gmail.com> wrote:

> Hi!
>
> I would check out the Spark source then diff those two RCs (first just
> take look to the list of the changed files):
>
> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat
> ...

Re: Spark version verification

2021-03-21 Thread Mich Talebzadeh
Hi Kent,

Thanks for the links.

You have to excuse my ignorance, what are the correlations among these
links and the ability to establish a spark build version?


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 21 Mar 2021 at 15:55, Kent Yao  wrote:

> Please refer to
> http://spark.apache.org/docs/latest/api/sql/index.html#version
>
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark .*
> *spark-authorizer A Spark
> SQL extension which provides SQL Standard Authorization for **Apache
> Spark .*
> *spark-postgres  A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.*
> *spark-func-extras A
> library that brings excellent and useful functions from various modern
> database management systems to Apache Spark .*
>
>
>
> On 03/21/2021 23:28,Mich Talebzadeh
>  wrote:
>
> Many thanks
>
> spark-sql> SELECT version();
> 3.1.1 1d550c4e90275ab418b9161925049239227f3dc9
>
> What does 1d550c4e90275ab418b9161925049239227f3dc9 signify please?
>
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 21 Mar 2021 at 15:14, Sean Owen  wrote:
>
>> I believe you can "SELECT version()" in Spark SQL to see the build
>> version.
>>
>> On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks for the detailed info.
>>>
>>> I was hoping that one can find a simpler answer to the Spark version
>>> than doing forensic examination on base code so to speak.
>>>
>>> The primer for this verification is that on GCP dataprocs originally
>>> built on 3.11-rc2, there was an issue with running Spark Structured
>>> Streaming (SSS) which I reported to this forum before.
>>>
>>> After a while and me reporting to Google, they have now upgraded the
>>> base to Spark 3.1.1 itself. I am not privy to how they did the upgrade
>>> itself.
>>>
>>> In the meantime we installed 3.1.1 on-premise and ran it with the same
>>> Python code for SSS. It worked fine.
>>>
>>> However, when I run the same code on GCP dataproc upgraded to 3.1.1,
>>> occasionally I see this error
>>>
>>> 21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue:
>>> Listener EventLoggingListener threw an exception
>>>
>>> java.util.ConcurrentModificationException
>>>
>>> at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>>>
>>> This may be for other reasons or the consequence of upgrading from
>>> 3.1.1-rc2 to 3.11?
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 20 Mar 2021 at 22:41, Attila Zsolt Piros <
>>> piros.attila.zs...@gmail.com> wrote:
>>>
 Hi!

 I would check out the Spark source then diff those two RCs (first just
 take look to the list of the changed files):

 $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat
 ...

 The shell scripts in the release can be checked very easily:

 $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".sh "
  bin/docker-image-tool.sh   |   6 +-
  dev/create-release/release-build.sh|   2 +-

 We are lucky as *docker-image-tool.sh* is part of the released
 version.
 Is it from v3.1.1-rc2 or v3.1.1-rc1?

 Of course this only works if docker-image-tool.sh is not changed from
 the v3.1.1-rc2 back to v3.1.1-rc1.
 So let's continue with the python (and 

Re: Spark version verification

2021-03-21 Thread Kent Yao
Please refer to http://spark.apache.org/docs/latest/api/sql/index.html#version

Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
spark-authorizer: A Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres: A library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: A library that brings excellent and useful functions from various modern database management systems to Apache Spark.
On 03/21/2021 23:28, Mich Talebzadeh wrote:

Many thanks

spark-sql> SELECT version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

What does 1d550c4e90275ab418b9161925049239227f3dc9 signify please?



   view my Linkedin profile

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  

On Sun, 21 Mar 2021 at 15:14, Sean Owen  wrote:

I believe you can "SELECT version()" in Spark SQL to see the build version.

On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh  wrote:

Thanks for the detailed info.

I was hoping that one can find a simpler answer to the Spark version than doing
forensic examination on base code so to speak.

The primer for this verification is that on GCP dataprocs originally built on
3.11-rc2, there was an issue with running Spark Structured Streaming (SSS) which I
reported to this forum before.

After a while and me reporting to Google, they have now upgraded the base to Spark
3.1.1 itself. I am not privy to how they did the upgrade itself.

In the meantime we installed 3.1.1 on-premise and ran it with the same Python code
for SSS. It worked fine.

However, when I run the same code on GCP dataproc upgraded to 3.1.1, occasionally I
see this error

21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener
EventLoggingListener threw an exception

java.util.ConcurrentModificationException
        at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)

This may be for other reasons or the consequence of upgrading from 3.1.1-rc2 to
3.11?

   view my Linkedin profile

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  

On Sat, 20 Mar 2021 at 22:41, Attila Zsolt Piros  wrote:

Hi!

I would check out the Spark source then diff those two RCs (first just take look
to the list of the changed files):

$ git diff v3.1.1-rc1..v3.1.1-rc2 --stat
...

The shell scripts in the release can be checked very easily:

$ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".sh "
 bin/docker-image-tool.sh                           |   6 +-
 dev/create-release/release-build.sh                |   2 +-

We are lucky as docker-image-tool.sh is part of the released version. Is it from
v3.1.1-rc2 or v3.1.1-rc1?

Of course this only works if docker-image-tool.sh is not changed from the
v3.1.1-rc2 back to v3.1.1-rc1. So let's continue with the python (and latter with
R) files:

$ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".py "
 python/pyspark/sql/avro/functions.py               |   4 +-
 python/pyspark/sql/dataframe.py                    |   1 +
 python/pyspark/sql/functions.py                    | 285 +--
 .../pyspark/sql/tests/test_pandas_cogrouped_map.py |  12 +
 python/pyspark/sql/tests/test_pandas_map.py        |   8 +
...

After you have enough proof you can stop (to decide what is enough here should be
decided by you). Finally you can use javap / scalap on the classes from the jars
and check some code changes which is more harder to be analyzed than a simple text
file.

Best Regards,
Attila

On Thu, Mar 18, 2021 at 4:09 PM Mich Talebzadeh  wrote:

Hi

What would be a signature in Spark version or binaries that confirms the release
is built on Spark built on 3.1.1 as opposed to 3.1.1-RC-1 or RC-2?

Thanks

Mich

   view my Linkedin profile

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or 

Re: In built Optimizer on Spark

2021-03-21 Thread Mich Talebzadeh
Hi Felix,

As you may be aware, Spark SQL does have the Catalyst optimizer.

What is the Catalyst Optimizer? - Databricks
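
For example, a small illustration of Catalyst doing its rewriting without any
manual tuning (just stock PySpark, assuming an active SparkSession named `spark`):

# explain(True) prints the parsed, analyzed, optimized logical and physical plans,
# so you can see what Catalyst has already done to the query on its own.
df = spark.range(1000).filter("id > 10").selectExpr("id * 2 AS doubled")
df.explain(True)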


You mentioned Spark Structured Streaming. What specifics are you looking
for?

Have you considered the Spark GUI streaming tab?

HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 21 Mar 2021 at 13:54, Felix Kizhakkel Jose <
felixkizhakkelj...@gmail.com> wrote:

> Hello,
>
> Is there any in-built optimizer in Spark as in Flink, to avoid manual
> configuration tuning to achieve better performance of your
> structured streaming pipeline?
> Or is there any work happening to achieve this?
>
> Regards,
> Felix K Jose
>


Re: Spark version verification

2021-03-21 Thread Mich Talebzadeh
Many thanks

spark-sql> SELECT version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

What does 1d550c4e90275ab418b9161925049239227f3dc9 signify please?




   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 21 Mar 2021 at 15:14, Sean Owen  wrote:

> I believe you can "SELECT version()" in Spark SQL to see the build version.
>
> On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh 
> wrote:
>
>> Thanks for the detailed info.
>>
>> I was hoping that one can find a simpler answer to the Spark version than
>> doing forensic examination on base code so to speak.
>>
>> The primer for this verification is that on GCP dataprocs originally
>> built on 3.11-rc2, there was an issue with running Spark Structured
>> Streaming (SSS) which I reported to this forum before.
>>
>> After a while and me reporting to Google, they have now upgraded the base
>> to Spark 3.1.1 itself. I am not privy to how they did the upgrade itself.
>>
>> In the meantime we installed 3.1.1 on-premise and ran it with the same
>> Python code for SSS. It worked fine.
>>
>> However, when I run the same code on GCP dataproc upgraded to 3.1.1,
>> occasionally I see this error
>>
>> 21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue:
>> Listener EventLoggingListener threw an exception
>>
>> java.util.ConcurrentModificationException
>>
>> at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>>
>> This may be for other reasons or the consequence of upgrading from
>> 3.1.1-rc2 to 3.11?
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 20 Mar 2021 at 22:41, Attila Zsolt Piros <
>> piros.attila.zs...@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> I would check out the Spark source then diff those two RCs (first just
>>> take look to the list of the changed files):
>>>
>>> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat
>>> ...
>>>
>>> The shell scripts in the release can be checked very easily:
>>>
>>> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".sh "
>>>  bin/docker-image-tool.sh   |   6 +-
>>>  dev/create-release/release-build.sh|   2 +-
>>>
>>> We are lucky as *docker-image-tool.sh* is part of the released version.
>>> Is it from v3.1.1-rc2 or v3.1.1-rc1?
>>>
>>> Of course this only works if docker-image-tool.sh is not changed from
>>> the v3.1.1-rc2 back to v3.1.1-rc1.
>>> So let's continue with the python (and latter with R) files:
>>>
>>> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".py "
>>>  python/pyspark/sql/avro/functions.py   |   4 +-
>>>  python/pyspark/sql/dataframe.py|   1 +
>>>  python/pyspark/sql/functions.py| 285 +--
>>>  .../pyspark/sql/tests/test_pandas_cogrouped_map.py |  12 +
>>>  python/pyspark/sql/tests/test_pandas_map.py|   8 +
>>> ...
>>>
>>> After you have enough proof you can stop (to decide what is enough here
>>> should be decided by you).
>>> Finally you can use javap / scalap on the classes from the jars and
>>> check some code changes which is more harder to be analyzed than a simple
>>> text file.
>>>
>>> Best Regards,
>>> Attila
>>>
>>>
>>> On Thu, Mar 18, 2021 at 4:09 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi

 What would be a signature in Spark version or binaries that confirms
 the release is built on Spark built on 3.1.1 as opposed to 3.1.1-RC-1 or
 RC-2?

 Thanks

 Mich


view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



>>>


Re: Spark version verification

2021-03-21 Thread Sean Owen
I believe you can "SELECT version()" in Spark SQL to see the build version.

On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh 
wrote:

> Thanks for the detailed info.
>
> I was hoping that one can find a simpler answer to the Spark version than
> doing forensic examination on base code so to speak.
>
> The primer for this verification is that on GCP dataprocs originally built
> on 3.11-rc2, there was an issue with running Spark Structured Streaming
> (SSS) which I reported to this forum before.
>
> After a while and me reporting to Google, they have now upgraded the base
> to Spark 3.1.1 itself. I am not privy to how they did the upgrade itself.
>
> In the meantime we installed 3.1.1 on-premise and ran it with the same
> Python code for SSS. It worked fine.
>
> However, when I run the same code on GCP dataproc upgraded to 3.1.1,
> occasionally I see this error
>
> 21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue:
> Listener EventLoggingListener threw an exception
>
> java.util.ConcurrentModificationException
>
> at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>
> This may be for other reasons or the consequence of upgrading from
> 3.1.1-rc2 to 3.11?
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 20 Mar 2021 at 22:41, Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> Hi!
>>
>> I would check out the Spark source then diff those two RCs (first just
>> take look to the list of the changed files):
>>
>> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat
>> ...
>>
>> The shell scripts in the release can be checked very easily:
>>
>> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".sh "
>>  bin/docker-image-tool.sh   |   6 +-
>>  dev/create-release/release-build.sh|   2 +-
>>
>> We are lucky as *docker-image-tool.sh* is part of the released version.
>> Is it from v3.1.1-rc2 or v3.1.1-rc1?
>>
>> Of course this only works if docker-image-tool.sh is not changed from
>> the v3.1.1-rc2 back to v3.1.1-rc1.
>> So let's continue with the python (and latter with R) files:
>>
>> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".py "
>>  python/pyspark/sql/avro/functions.py   |   4 +-
>>  python/pyspark/sql/dataframe.py|   1 +
>>  python/pyspark/sql/functions.py| 285 +--
>>  .../pyspark/sql/tests/test_pandas_cogrouped_map.py |  12 +
>>  python/pyspark/sql/tests/test_pandas_map.py|   8 +
>> ...
>>
>> After you have enough proof you can stop (to decide what is enough here
>> should be decided by you).
>> Finally you can use javap / scalap on the classes from the jars and check
>> some code changes which is more harder to be analyzed than a simple text
>> file.
>>
>> Best Regards,
>> Attila
>>
>>
>> On Thu, Mar 18, 2021 at 4:09 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> What would be a signature in Spark version or binaries that confirms the
>>> release is built on Spark built on 3.1.1 as opposed to 3.1.1-RC-1 or RC-2?
>>>
>>> Thanks
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>


Spark saveAsTextFile Disk Recommendation

2021-03-21 Thread ranju goel
Hi Attila,



I will check why INVALID is getting appended to the mailing address.



What is your use case here?

The client driver application is not using collect but internally calls a Python
script which reads the part-file records [comma-separated strings] of each cluster
separately and copies the records into another, final CSV file, thereby merging
all the part files' data into a single CSV file. This script runs on every node,
and later the outputs are all combined into a single file.
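
(A simplified sketch of that merge step, just for illustration; the paths are
placeholders:)

# Append the lines of every part file produced by saveAsTextFile into one CSV.
import glob

with open("/data/final/merged.csv", "w") as out:
    for part in sorted(glob.glob("/data/output/part-*")):
        with open(part) as f:
            out.writelines(f)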



*On the other hand is your data really just a collection of strings without
any repetitions*

[Ranju]:

Yes, it is a comma-separated string.

And I just checked the 2nd argument of saveAsTextFile, and I believe reads and
writes on disk will be faster after using compression. I will try this.



So I think there is no special requirement on the type of disk for the execution
of saveAsTextFile, as these are local I/O operations.



Regards

Ranju





Hi!

I would like to reflect only to the first part of your mail:

I have a large RDD dataset of around 60-70 GB which I cannot send to driver
using *collect* so first writing that to disk using  *saveAsTextFile* and
then this data gets saved in the form of multiple part files on each node
of the cluster and after that driver reads the data from that storage.


What is your use case here?

As you mention *collect()* I can assume you have to process the data
outside of Spark maybe with a 3rd party tool, isn't it?

If you have 60-70 GB of data and you write it to text file then read it
back within the same application then you still cannot call *collect()* on
it as it is still 60-70GB data, right?

On the other hand is your data really just a collection of strings without
any repetitions? I ask this because of the fileformat you are using: text
file. Even for text file at least you can pass a compression codec as the
2nd argument of *saveAsTextFile()*

(when
you use this link you might need to scroll up a little bit.. at least my
chrome displays the the *saveAsTextFile* method without the 2nd arg codec).
As IO is slow a compressed data could be read back quicker: as there will
be less data in the disk. Check the Snappy
 codec for example.

But if there is a structure of your data and you have plan to process this
data further within Spark then please consider something way better: a columnar
storage format namely ORC or Parquet.

Best Regards,

Attila





*From:* Ranju Jain 
*Sent:* Sunday, March 21, 2021 8:10 AM
*To:* user@spark.apache.org
*Subject:* Spark saveAsTextFile Disk Recommendation



Hi All,



I have a large RDD dataset of around 60-70 GB which I cannot send to driver
using *collect* so first writing that to disk using  *saveAsTextFile* and
then this data gets saved in the form of multiple part files on each node
of the cluster and after that driver reads the data from that storage.



I have a question like *spark.local.dir* is the directory which is used as
a scratch space where mapoutputs files and RDDs might need to write by
spark for shuffle operations etc.

And there it is strongly recommended to use *local and fast disk *to avoid
any failure or performance impact.



*Do we have any such recommendation for storing multiple part files of
large dataset [ or Big RDD ] in fast disk?*

This will help me to configure the write type of disk for resulting part
files.



Regards

Ranju


RE: Can JVisual VM monitoring tool be used to Monitor Spark Executor Memory and CPU

2021-03-21 Thread Ranju Jain
Hi Mich/Attila,

@Mich Talebzadeh: I considered the Spark GUI, but I first have a confusion at the
memory level.

App Configuration: spark.executor.memory= 4g for running spark job.

In the Spark GUI I see the running Spark job has a Peak Execution Memory of 1 KB,
as highlighted below:
I do not have a Storage Memory screenshot, so I calculated that the total memory
consumption at that point in time was:

Spark UI shows:  spark.executor.memory = Peak Execution Memory + Storage Mem + Reserved Mem + User Memory
                                       = 1 KB + Storage Mem + 300 MB + (4 GB * 0.25)
                                       = 1 KB + Storage Mem + 300 MB + 1 GB
                                       = approx 1.5 GB



And if I see Executor 0,1,2 actual memory consumption on virtual server using 
top  commnd , it shows below reading:

Executor – 2:   top

Executor-0 :top

Please suggest: on the Spark GUI, can I go with the formula below to isolate how
much memory the Spark component is consuming out of the several other components
of a web application?

  spark.executor.memory = Peak Execution Memory + Storage Mem + Reserved Mem + User Memory
                        = 1 KB + Storage Mem + 300 MB + (4 GB * 0.25)


@Attila Zsolt Piros: I checked the 
memoryMetrics.* of 
executor-metrics,
 but here I have a confusion about
usedOnHeapStorageMemory
usedOffHeapStorageMemory
totalOnHeapStorageMemory
totalOffHeapStorageMemory

Why only UsedstorageMemory should be checked?

To isolate spark.executor.memory, should I check memoryMetrics.*, where only
storageMemory is given, or should I check peakMemoryMetrics.*, where all the peaks
are specified:

  1.  Execution
  2.  Storage
  3.  JVM Heap

Also I noticed cpuTime provides cpu time spent by an executor. But there is no 
metric by which I can calculate the number of cores.

As suggested, I checked Luca Canali's presentation; there I see JMXSink, which
registers metrics for viewing in a JMX console. I think exposing these metrics via
JMXSink would make it possible to visualize spark.executor.memory and the number
of cores per executor in a Java monitoring tool.
I also see Grafana, a very good visualization tool in which all the metrics can be
viewed, but I have little idea of the steps to install it on a virtual server and
integrate it. I need to go through Grafana in detail.

Kindly suggest your views.

Regards
Ranju

From: Attila Zsolt Piros 
Sent: Sunday, March 21, 2021 3:42 AM
To: Mich Talebzadeh 
Cc: Ranju Jain ; user@spark.apache.org
Subject: Re: Can JVisual VM monitoring tool be used to Monitor Spark Executor 
Memory and CPU

Hi Ranju!

You can configure Spark's metric system.

Check the memoryMetrics.* of 
executor-metrics
 and in the 
component-instance-executor
 the CPU times.

Regarding the details I suggest to check Luca Canali's presentations about 
Spark's metric system and maybe his github 
repo.

Best Regards,
Attila

On Sat, Mar 20, 2021 at 5:41 PM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
Hi,

Have you considered spark GUI first?



view my Linkedin profile



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Sat, 20 Mar 2021 at 16:06, Ranju Jain  
wrote:
Hi All,

A virtual machine is running an application, and this application has various
other 3PP components running, such as Spark, a database, etc.

My requirement is to monitor every component and isolate the resources consumed
individually by every component.

I am thinking of using a common tool such as Java VisualVM, where I specify the
JMX URL of every component and monitor every component.

For other components I am able to view their resources.

Is there a possibility of Viewing the Spark Executor CPU/Memory via Java Visual 
VM Tool?

Please guide.

Regards

In built Optimizer on Spark

2021-03-21 Thread Felix Kizhakkel Jose
Hello,

Is there any in-built optimizer in Spark as in Flink, to avoid manual
configuration tuning to achieve better performance of your
structured streaming pipeline?
Or is there any work happening to achieve this?

Regards,
Felix K Jose


Re: Spark version verification

2021-03-21 Thread Mich Talebzadeh
Thanks for the detailed info.

I was hoping that one can find a simpler answer to the Spark version than
doing forensic examination on base code so to speak.

The primer for this verification is that on GCP dataprocs originally built
on 3.11-rc2, there was an issue with running Spark Structured Streaming
(SSS) which I reported to this forum before.

After a while and me reporting to Google, they have now upgraded the base
to Spark 3.1.1 itself. I am not privy to how they did the upgrade itself.

In the meantime we installed 3.1.1 on-premise and ran it with the same
Python code for SSS. It worked fine.

However, when I run the same code on GCP dataproc upgraded to 3.1.1,
occasionally I see this error

21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue:
Listener EventLoggingListener threw an exception

java.util.ConcurrentModificationException

at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)

This may be for other reasons or the consequence of upgrading from
3.1.1-rc2 to 3.1.1?



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 20 Mar 2021 at 22:41, Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi!
>
> I would check out the Spark source then diff those two RCs (first just
> take look to the list of the changed files):
>
> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat
> ...
>
> The shell scripts in the release can be checked very easily:
>
> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".sh "
>  bin/docker-image-tool.sh   |   6 +-
>  dev/create-release/release-build.sh|   2 +-
>
> We are lucky as *docker-image-tool.sh* is part of the released version.
> Is it from v3.1.1-rc2 or v3.1.1-rc1?
>
> Of course this only works if docker-image-tool.sh is not changed from
> the v3.1.1-rc2 back to v3.1.1-rc1.
> So let's continue with the python (and latter with R) files:
>
> $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".py "
>  python/pyspark/sql/avro/functions.py   |   4 +-
>  python/pyspark/sql/dataframe.py|   1 +
>  python/pyspark/sql/functions.py| 285 +--
>  .../pyspark/sql/tests/test_pandas_cogrouped_map.py |  12 +
>  python/pyspark/sql/tests/test_pandas_map.py|   8 +
> ...
>
> After you have enough proof you can stop (to decide what is enough here
> should be decided by you).
> Finally you can use javap / scalap on the classes from the jars and check
> some code changes which is more harder to be analyzed than a simple text
> file.
>
> Best Regards,
> Attila
>
>
> On Thu, Mar 18, 2021 at 4:09 PM Mich Talebzadeh 
> wrote:
>
>> Hi
>>
>> What would be a signature in Spark version or binaries that confirms the
>> release is built on Spark built on 3.1.1 as opposed to 3.1.1-RC-1 or RC-2?
>>
>> Thanks
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


RE: Spark saveAsTextFile Disk Recommendation

2021-03-21 Thread Ranju Jain
Hi Attila,

What is your use case here?
The client driver application is not using collect but internally calls a Python
script which reads the part-file records [comma-separated strings] of each cluster
separately and copies the records into another, final CSV file, thereby merging
all the part files' data into a single CSV file. This script runs on every node,
and later the outputs are all combined into a single file.

On the other hand is your data really just a collection of strings without any 
repetitions
[Ranju]:
Yes, it is a comma-separated string.
And I just checked the 2nd argument of saveAsTextFile, and I believe reads and
writes on disk will be faster after using compression. I will try this.

So I think there is no special requirement on the type of disk for the execution
of saveAsTextFile, as these are local I/O operations.

Regards
Ranju


Hi!

I would like to reflect only to the first part of your mail:


I have a large RDD dataset of around 60-70 GB which I cannot send to driver 
using collect so first writing that to disk using  saveAsTextFile and then this 
data gets saved in the form of multiple part files on each node of the cluster 
and after that driver reads the data from that storage.

What is your use case here?

As you mention collect() I can assume you have to process the data outside of 
Spark maybe with a 3rd party tool, isn't it?

If you have 60-70 GB of data and you write it to text file then read it back 
within the same application then you still cannot call collect() on it as it is 
still 60-70GB data, right?

On the other hand is your data really just a collection of strings without any 
repetitions? I ask this because of the fileformat you are using: text file. 
Even for text file at least you can pass a compression codec as the 2nd 
argument of 
saveAsTextFile()
 (when you use this link you might need to scroll up a little bit.. at least my 
chrome displays the the saveAsTextFile method without the 2nd arg codec). As IO 
is slow a compressed data could be read back quicker: as there will be less 
data in the disk. Check the 
Snappy codec for example.

But if there is a structure of your data and you have plan to process this data 
further within Spark then please consider something way better: a columnar 
storage format namely ORC or Parquet.

Best Regards,
Attila


From: Ranju Jain 
Sent: Sunday, March 21, 2021 8:10 AM
To: user@spark.apache.org
Subject: Spark saveAsTextFile Disk Recommendation

Hi All,

I have a large RDD dataset of around 60-70 GB which I cannot send to driver 
using collect so first writing that to disk using  saveAsTextFile and then this 
data gets saved in the form of multiple part files on each node of the cluster 
and after that driver reads the data from that storage.

I have a question like spark.local.dir is the directory which is used as a 
scratch space where mapoutputs files and RDDs might need to write by spark for 
shuffle operations etc.
And there it is strongly recommended to use local and fast disk to avoid any 
failure or performance impact.

Do we have any such recommendation for storing multiple part files of large 
dataset [ or Big RDD ] in fast disk?
This will help me to configure the write type of disk for resulting part files.

Regards
Ranju


Re: Spark saveAsTextFile Disk Recommendation

2021-03-21 Thread Attila Zsolt Piros
Hi!

I would like to reflect only to the first part of your mail:

I have a large RDD dataset of around 60-70 GB which I cannot send to driver
> using *collect* so first writing that to disk using  *saveAsTextFile* and
> then this data gets saved in the form of multiple part files on each node
> of the cluster and after that driver reads the data from that storage.


What is your use case here?

As you mention *collect()* I can assume you have to process the data
outside of Spark maybe with a 3rd party tool, isn't it?

If you have 60-70 GB of data and you write it to text file then read it
back within the same application then you still cannot call *collect()* on
it as it is still 60-70GB data, right?

On the other hand, is your data really just a collection of strings without
any repetitions? I ask this because of the file format you are using: text
file. Even for a text file you can at least pass a compression codec as the
2nd argument of *saveAsTextFile()*
(when you use this link you might need to scroll up a little bit; at least my
Chrome displays the *saveAsTextFile* method without the 2nd arg codec).
As IO is slow, compressed data can be read back quicker, since there will be
less data on the disk. Check the Snappy codec for example.

But if your data has structure and you plan to process it further within Spark,
then please consider something much better: a columnar storage format, namely ORC
or Parquet.
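
A minimal sketch of both suggestions (the paths are placeholders, `rdd` / `df`
stand for your dataset, and the Snappy codec is assumed to be available on the
cluster):

# Option 1: keep text output but compress it via the optional codec argument.
rdd.saveAsTextFile(
    "/data/out/text-snappy",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")

# Option 2: if the rows have structure, write a columnar format instead.
df.write.mode("overwrite").parquet("/data/out/parquet")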

Best Regards,
Attila


On Sun, Mar 21, 2021 at 3:40 AM Ranju Jain 
wrote:

> Hi All,
>
>
>
> I have a large RDD dataset of around 60-70 GB which I cannot send to
> driver using *collect* so first writing that to disk using
> *saveAsTextFile* and then this data gets saved in the form of multiple
> part files on each node of the cluster and after that driver reads the data
> from that storage.
>
>
>
> I have a question like *spark.local.dir* is the directory which is used
> as a scratch space where mapoutputs files and RDDs might need to write by
> spark for shuffle operations etc.
>
> And there it is strongly recommended to use *local and fast disk *to
> avoid any failure or performance impact.
>
>
>
> *Do we have any such recommendation for storing multiple part files of
> large dataset [ or Big RDD ] in fast disk?*
>
> This will help me to configure the write type of disk for resulting part
> files.
>
>
>
> Regards
>
> Ranju
>