RE: Regarding spark-3.2.0 decommission features.

2022-01-26 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Dongjoon Hyun,

Any inputs on the below issue would be helpful. Please let us know if we're 
missing anything.

Thanks and Regards,
Abhishek

From: Patidar, Mohanlal (Nokia - IN/Bangalore) 
Sent: Thursday, January 20, 2022 11:58 AM
To: user@spark.apache.org
Subject: Suspected SPAM - RE: Regarding spark-3.2.0 decommission features.

Gentle reminder!!!

Br,
-Mohan Patidar



From: Patidar, Mohanlal (Nokia - IN/Bangalore)
Sent: Tuesday, January 18, 2022 2:02 PM
To: user@spark.apache.org
Cc: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek@nokia.com>; Gowda Tp, Thimme 
(Nokia - IN/Bangalore) <thimme.gowda...@nokia.com>; Sharma, Prakash 
(Nokia - IN/Bangalore) <prakash.sha...@nokia.com>; Tarun, N (Nokia - 
IN/Bangalore) <n.ta...@nokia.com>; Badagandi, Srinivas B. (Nokia - IN/Bangalore) 
<srinivas.b.badaga...@nokia.com>
Subject: Regarding spark-3.2.0 decommission features.

Hi,
 We're using Spark 3.2.0 and have enabled the Spark decommission feature. As 
part of validating this feature, we wanted to check whether the RDD blocks and 
shuffle blocks from decommissioned executors are migrated to other executors. 
However, we could not see this happening. Below is the configuration we used.

  1.  Spark configuration used (see the programmatic sketch after the driver logs below):
 spark.local.dir /mnt/spark-ldir
 spark.decommission.enabled true
 spark.storage.decommission.enabled true
 spark.storage.decommission.rddBlocks.enabled true
 spark.storage.decommission.shuffleBlocks.enabled true
 spark.dynamicAllocation.enabled true
  2.  Brought up the Spark driver and executors on different nodes.

NAME                                           READY  STATUS    NODE
decommission-driver                            1/1    Running   Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-1  1/1    Running   Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-2  1/1    Running   Node2
gzip-compression-test-ae0b0b7e4d7fbe40-exec-3  1/1    Running   Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-4  1/1    Running   Node2
gzip-compression-test-ae0b0b7e4d7fbe40-exec-5  1/1    Running   Node1
  3.  Brought down Node2, after which the status of the pods is as follows.

NAME                                           READY  STATUS       NODE
decommission-driver                            1/1    Running      Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-1  1/1    Running      Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-2  1/1    Terminating  Node2
gzip-compression-test-ae0b0b7e4d7fbe40-exec-3  1/1    Running      Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-4  1/1    Terminating  Node2
gzip-compression-test-ae0b0b7e4d7fbe40-exec-5  1/1    Running      Node1
  4.  Driver logs:
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.296Z", 
"timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.459Z", 
"timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.564Z", 
"timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.601Z", 
"timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.667Z", 
"timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.885Z", 
"timezone":"UTC", "log":"Notify executor 5 to decommissioning."}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", 
"timezone":"UTC", "log":"Notify executor 1 to decommissioning."}
{"type":"log", "level":"INFO", "t

RE: Inclusive terminology usage in Spark

2021-06-30 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Sean,

Thanks for the quick response. We’ll look into this.

Thanks and Regards,
Abhishek

From: Sean Owen 
Sent: Wednesday, June 30, 2021 6:30 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) 
Cc: User 
Subject: Re: Inclusive terminology usage in Spark

This was covered and mostly done last year: 
https://issues.apache.org/jira/browse/SPARK-32004
In some instances, it's hard to change the terminology as it would break user 
APIs, and the marginal benefit may not be worth it, but have a look at the 
remaining tasks under that umbrella.

On Wed, Jun 30, 2021 at 5:25 AM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek@nokia.com> wrote:
Hi,


Terms such as Blacklist/Whitelist and master/slave are used at different places 
in the Spark code. We wanted to know if there are any plans to move to more 
inclusive terminology, e.g. Denylist/Allowlist and Leader/Follower. If so, what 
is the timeline?

I’ve also created an improvement ticket to track this.

https://issues.apache.org/jira/browse/SPARK-35952

Thanks and Regards,
Abhishek



Inclusive terminology usage in Spark

2021-06-30 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi,


Terms such as Blacklist/Whitelist and master/slave are used at different places 
in the Spark code. We wanted to know if there are any plans to move to more 
inclusive terminology, e.g. Denylist/Allowlist and Leader/Follower. If so, what 
is the timeline?

I've also created an improvement ticket to track this.

https://issues.apache.org/jira/browse/SPARK-35952

Thanks and Regards,
Abhishek



RE: Why is Spark 3.0.x faster than Spark 3.1.x

2021-05-17 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Maziyar, Mich

Do we have any ticket to track this? Any idea if this is going to be fixed in 
3.1.2?

Thanks and Regards,
Abhishek

From: Mich Talebzadeh 
Sent: Friday, April 9, 2021 2:11 PM
To: Maziyar Panahi 
Cc: User 
Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x


Hi,

Regarding your point:

 I won't be able to defend this request by telling Spark users the previous 
major release was and still is more stable than the latest major release ...

With the benefit of hindsight, version 3.1.1 was released recently and the 
definition of stable (from a practical point of view) does not come into it 
yet. That is perhaps the reason why some vendors like Cloudera are a few 
releases behind the latest version. In production, what matters most is 
predictability and stability. You are not doing anything wrong by rolling it 
back and awaiting further clarification and resolution of the error.

HTH


view my Linkedin profile



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 9 Apr 2021 at 08:58, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
Thanks Mich, I will ask all of our users to use pyspark 3.0.x and will change 
all the notebooks/scripts to switch back from 3.1.1 to 3.0.2.

That being said, I won't be able to defend this request by telling Spark users 
that the previous major release was and still is more stable than the latest 
major release, something that made everything default to 3.1.1 (pyspark, 
downloads, etc.).

I'll see if I can open a ticket for this as well.


On 8 Apr 2021, at 17:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Well, the normal course of action (considering the law of diminishing returns) 
is that your mileage varies:

Spark 3.0.1 is pretty stable and good enough. Unless there is an overriding 
reason why you have to use 3.1.1, you can set it aside and try it when you have 
other use cases. For now I guess you can carry on with 3.0.1 as BAU.

HTH


 
view my Linkedin profile






On Thu, 8 Apr 2021 at 16:19, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
I personally added the following to my SparkSession in 3.1.1 and the result was 
exactly the same as before (local master). 3.1.1 is still 4-5 times slower than 
3.0.2, at least for that piece of code. I will do more investigation to see how 
it does with other stuff, especially anything without .transform or Spark ML 
related functions, but the small code I provided, on any dataset that is big 
enough to take a minute to finish, will show you the difference going from 
3.0.2 to 3.1.1 by a factor of 4-5:


.config("spark.sql.adaptive.coalescePartitions.enabled", "false")
.config("spark.sql.adaptive.enabled", "false")



On 8 Apr 2021, at 16:47, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

spark 3.1.1

I enabled the parameter

spark_session.conf.set("spark.sql.adaptive.enabled", "true")

to see its effects

in yarn cluster mode, i.e. spark-submit --master yarn --deploy-mode client

with 4 executors it crashed the cluster.

I then reduced the number of executors to 2 and this time it ran OK, but the 
performance is worse.

I assume it adds some overhead?



 
view my Linkedin profile






On Thu, 8 Apr 2021 at 15:05, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
Thanks Sean,

I have already tried adding that and 

s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

2021-02-22 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi,

I'm running Spark 3 on Kubernetes and using the S3A staging committer 
(directory committer) to write data to an S3 bucket. The same setup works fine 
with Spark 2.4.5, but with Spark 3 the final data (written in Parquet format) 
is not visible in the S3 bucket, and when a read operation is performed on that 
Parquet data it fails because the path is empty, without any data.
As the S3A staging committer requires a shared file system (like NFS or HDFS) 
for staging data, I have set up a shared PVC for all executors and the driver 
(i.e., spark.hadoop.fs.s3a.committer.staging.tmp.path is set to a shared PVC 
with readWriteMany).
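
A minimal spark-shell-style sketch of the kind of write/read that shows the 
problem for us (illustrative only: it assumes the S3A credentials/endpoint and 
the directory staging committer are already configured as described above):

// Small illustrative dataset; any Parquet write to the bucket behaves the same for us.
val people = spark.range(0, 1000).selectExpr("id", "concat('name_', id) as name")

// With Spark 2.4.5 the Parquet files show up under this path;
// with Spark 3 only _SUCCESS is visible for us.
people.write.mode("overwrite").parquet("s3a://rookbucket/shiva/people.parquet")

// Reading the same path back then fails on Spark 3, since the path has no data files.
spark.read.parquet("s3a://rookbucket/shiva/people.parquet").count()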

In the S3 bucket I can see only the _SUCCESS file, without any data.

bash-4.2# s3cmd ls  --no-ssl --host=${AWS_ENDPOINT} --host-bucket= 
s3://rookbucket/shiva/ --recursive | grep people.parquet
2021-02-22 11:55  4074   s3://rookbucket/shiva/people.parquet/_SUCCESS
bash-4.2#

The _SUCCESS file is in json format with below content:

==
{
  "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
  "timestamp" : 1613994948681,
  "date" : "Mon Feb 22 11:55:48 UTC 2021",
  "hostname" : "spark-thrift-hdfs",
  "committer" : "directory",
  "description" : "Task committer attempt_20210222115547__m_00_0",
  "metrics" : {
"stream_write_block_uploads" : 0,
"files_created" : 5,
"S3guard_metadatastore_put_path_latencyNumOps" : 0,
"stream_write_block_uploads_aborted" : 0,
"committer_commits_reverted" : 0,
"op_open" : 2,
"stream_closed" : 12,
"committer_magic_files_created" : 0,
"object_copy_requests" : 0,
"s3guard_metadatastore_initialization" : 0,
"S3guard_metadatastore_put_path_latency90thPercentileLatency" : 0,
"stream_write_block_uploads_committed" : 0,
"S3guard_metadatastore_throttle_rate75thPercentileFrequency (Hz)" : 0,
"S3guard_metadatastore_throttle_rate90thPercentileFrequency (Hz)" : 0,
"committer_bytes_committed" : 0,
"op_create" : 5,
"stream_read_fully_operations" : 0,
"committer_commits_completed" : 0,
"object_put_requests_active" : 0,
"s3guard_metadatastore_retry" : 0,
"stream_write_block_uploads_active" : 0,
"stream_opened" : 12,
"S3guard_metadatastore_throttle_rate95thPercentileFrequency (Hz)" : 0,
"op_create_non_recursive" : 0,
"object_continue_list_requests" : 0,
"committer_jobs_completed" : 5,
"S3guard_metadatastore_put_path_latency50thPercentileLatency" : 0,
"stream_close_operations" : 12,
"stream_read_operations" : 378,
"object_delete_requests" : 4,
"fake_directories_deleted" : 8,
"stream_aborted" : 0,
"op_rename" : 0,
"object_multipart_aborted" : 0,
"committer_commits_created" : 0,
"op_get_file_status" : 26,
"s3guard_metadatastore_put_path_request" : 9,
"committer_commits_failed" : 0,
"stream_bytes_read_in_close" : 0,
"op_glob_status" : 1,
"stream_read_exceptions" : 0,
"op_exists" : 5,
"stream_read_version_mismatches" : 0,
"S3guard_metadatastore_throttle_rate50thPercentileFrequency (Hz)" : 0,
"S3guard_metadatastore_put_path_latency95thPercentileLatency" : 0,
"stream_write_block_uploads_pending" : 4,
"directories_created" : 0,
"S3guard_metadatastore_throttle_rateNumEvents" : 0,
"S3guard_metadatastore_put_path_latency99thPercentileLatency" : 0,
"stream_bytes_backwards_on_seek" : 0,
"stream_bytes_read" : 2997558,
"stream_write_total_data" : 16282,
"committer_jobs_failed" : 0,
"stream_read_operations_incomplete" : 29,
"files_copied_bytes" : 0,
"op_delete" : 8,
"object_put_bytes_pending" : 0,
"stream_write_block_uploads_data_pending" : 0,
"op_list_located_status" : 0,
"object_list_requests" : 19,
"stream_forward_seek_operations" : 0,
"committer_tasks_completed" : 0,
"committer_commits_aborted" : 0,
"object_metadata_requests" : 45,
"object_put_requests_completed" : 4,
"stream_seek_operations" : 0,
"op_list_status" : 0,
"store_io_throttled" : 0,
"stream_write_failures" : 0,
"op_get_file_checksum" : 0,
"files_copied" : 0,
"ignored_errors" : 8,
"committer_bytes_uploaded" : 0,
"committer_tasks_failed" : 0,
"stream_bytes_skipped_on_seek" : 0,
   "op_list_files" : 0,
"files_deleted" : 0,
"stream_bytes_discarded_in_abort" : 0,
"op_mkdirs" : 1,
"op_copy_from_local_file" : 0,
"op_is_directory" : 1,
"s3guard_metadatastore_throttled" : 0,
"S3guard_metadatastore_put_path_latency75thPercentileLatency" : 0,
"stream_write_total_time" : 0,
"stream_backward_seek_operations" : 0,
"object_put_requests" : 4,
"object_put_bytes" : 16282,
"directories_deleted" : 0,
"op_is_file" : 2,
"S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz)" : 0
  },
  "diagnostics" : {
"fs.s3a.metadatastore.impl" : 
"org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore",
"fs.s3a.committer.magic.enabled" : "false",
"fs.s3a.metadatastore.authoritative" : 

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-09-10 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi All,

We regenerated the TPC DS data on S3 and, after regeneration, we see that the 
queries run faster and the execution time is now comparable with the execution 
time on HDFS with Spark 3.0.0.
So maybe there was some issue in generating the TPC DS data the first time, 
which caused the discrepancy in query execution time on S3 with Spark 3.0.0.

Thanks and Regards,
Abhishek

From: Gourav Sengupta 
Sent: Wednesday, August 26, 2020 5:49 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) 
Cc: user 
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi
Can you try using EMRFS?
Your study looks good, best of luck.

Regards
Gourav

On Wed, 26 Aug 2020, 12:37 Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek@nokia.com> wrote:
Yeah… not sure if I'm missing any configuration that could be causing this 
issue. Any suggestions?

Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 2:35 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek@nokia.com>
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

So the results do not make sense.


Regards,
Gourav

On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek@nokia.com> wrote:
Hi Gourav,

Yes. We’re using s3a.

Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 1:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek@nokia.com>
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

Are you using s3a, which does not use EMRFS? In that case, these results do not 
make sense to me.

Regards,
Gourav Sengupta

On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek@nokia.com> wrote:
Hi All,

We’re doing some performance comparisons between Spark querying data on HDFS vs 
Spark querying data on S3 (Ceph Object Store used for S3 storage) using 
standard TPC DS Queries. We are observing that Spark 3.0 with S3 is consuming 
significantly larger duration for some set of queries when compared with HDFS.
We also ran similar queries with Spark 2.4.5 querying data from S3 and we see 
that for these set of queries, time taken by Spark 2.4.5 is lesser compared to 
Spark 3.0 looks to be very strange.
Below are the details of 9 queries where Spark 3.0 is taking >5 times the 
duration for running queries on S3 when compared to Hadoop.

Environment Details:

  *   Spark running on Kubernetes
  *   TPC DS Scale Factor: 500 GB
  *   Hadoop 3.x
  *   Same CPU and memory used for all executions

Query  Spark 3.0    Spark 3.0        Spark 2.4.5  Spark 3.0 HDFS  Spark 2.4.5 S3 vs      Table
       with S3 (s)  with Hadoop (s)  with S3 (s)  vs S3 (factor)  Spark 3.0 S3 (factor)  involved
9      880.129      106.109          147.65       8.294574        5.960914               store_sales
44     129.618      23.747           103.916      5.458289        1.247334               store_sales
58     142.113      20.996           33.936       6.768575        4.187677               store_sales
62     32.519       5.425            14.809       5.994286        2.195894               web_sales
76     138.765      20.73            49.892       6.693922        2.781308               store_sales
88     475.824      48.2             94.382       9.871867        5.04147                store_sales
90     53.896       6.804            18.11        7.921223        2.976035               web_sales
94     241.172      43.49            81.181       5.545459        2.970794               web_sales
96     67.059       10.396           15.993       6.450462        4.193022               store_sales

When we analysed it further, we saw that all these queries operate on either 
the store_sales or web_sales tables, and Spark 3 with S3 seems to download much 
more data from storage than Spark 3 with Hadoop or Spark 2.4.5 with S3, which 
results in more time for query completion. I'm attaching screenshots of the 
Driver UI for one such instance (Query 9) for reference.
Also attached are the Spark configurations (Spark 3.0) used for these tests.

We’re not sure why Spark 3.0 on S3 is having this behaviour. Any inputs on what 
we’re missing?

Thanks and Regards,
Abhishek




RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-26 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Yeah… Not sure if I’m missing any configurations which is causing this issue. 
Any suggestions?

Thanks and Regards,
Abhishek

From: Gourav Sengupta 
Sent: Wednesday, August 26, 2020 2:35 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) 
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

So the results do not make sense.


Regards,
Gourav

On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek@nokia.com> wrote:
Hi Gourav,

Yes. We’re using s3a.

Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 1:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek@nokia.com>
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

Are you using s3a, which does not use EMRFS? In that case, these results do not 
make sense to me.

Regards,
Gourav Sengupta

On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek@nokia.com> wrote:
Hi All,

We’re doing some performance comparisons between Spark querying data on HDFS vs 
Spark querying data on S3 (Ceph Object Store used for S3 storage) using 
standard TPC DS Queries. We are observing that Spark 3.0 with S3 is consuming 
significantly larger duration for some set of queries when compared with HDFS.
We also ran similar queries with Spark 2.4.5 querying data from S3 and we see 
that for these set of queries, time taken by Spark 2.4.5 is lesser compared to 
Spark 3.0 looks to be very strange.
Below are the details of 9 queries where Spark 3.0 is taking >5 times the 
duration for running queries on S3 when compared to Hadoop.

Environment Details:

  *   Spark running on Kubernetes
  *   TPC DS Scale Factor: 500 GB
  *   Hadoop 3.x
  *   Same CPU and memory used for all executions

Query  Spark 3.0    Spark 3.0        Spark 2.4.5  Spark 3.0 HDFS  Spark 2.4.5 S3 vs      Table
       with S3 (s)  with Hadoop (s)  with S3 (s)  vs S3 (factor)  Spark 3.0 S3 (factor)  involved
9      880.129      106.109          147.65       8.294574        5.960914               store_sales
44     129.618      23.747           103.916      5.458289        1.247334               store_sales
58     142.113      20.996           33.936       6.768575        4.187677               store_sales
62     32.519       5.425            14.809       5.994286        2.195894               web_sales
76     138.765      20.73            49.892       6.693922        2.781308               store_sales
88     475.824      48.2             94.382       9.871867        5.04147                store_sales
90     53.896       6.804            18.11        7.921223        2.976035               web_sales
94     241.172      43.49            81.181       5.545459        2.970794               web_sales
96     67.059       10.396           15.993       6.450462        4.193022               store_sales

When we analysed it further, we saw that all these queries operate on either 
the store_sales or web_sales tables, and Spark 3 with S3 seems to download much 
more data from storage than Spark 3 with Hadoop or Spark 2.4.5 with S3, which 
results in more time for query completion. I'm attaching screenshots of the 
Driver UI for one such instance (Query 9) for reference.
Also attached are the Spark configurations (Spark 3.0) used for these tests.

We’re not sure why Spark 3.0 on S3 is having this behaviour. Any inputs on what 
we’re missing?

Thanks and Regards,
Abhishek




RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-26 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Gourav,

Yes. We’re using s3a.

Thanks and Regards,
Abhishek

From: Gourav Sengupta 
Sent: Wednesday, August 26, 2020 1:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) 
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

Are you using s3a, which does not use EMRFS? In that case, these results do not 
make sense to me.

Regards,
Gourav Sengupta

On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek@nokia.com> wrote:
Hi All,

We’re doing some performance comparisons between Spark querying data on HDFS vs 
Spark querying data on S3 (Ceph Object Store used for S3 storage) using 
standard TPC DS Queries. We are observing that Spark 3.0 with S3 is consuming 
significantly larger duration for some set of queries when compared with HDFS.
We also ran similar queries with Spark 2.4.5 querying data from S3 and we see 
that for these set of queries, time taken by Spark 2.4.5 is lesser compared to 
Spark 3.0 looks to be very strange.
Below are the details of 9 queries where Spark 3.0 is taking >5 times the 
duration for running queries on S3 when compared to Hadoop.

Environment Details:

  *   Spark running on Kubernetes
  *   TPC DS Scale Factor: 500 GB
  *   Hadoop 3.x
  *   Same CPU and memory used for all executions

Query  Spark 3.0    Spark 3.0        Spark 2.4.5  Spark 3.0 HDFS  Spark 2.4.5 S3 vs      Table
       with S3 (s)  with Hadoop (s)  with S3 (s)  vs S3 (factor)  Spark 3.0 S3 (factor)  involved
9      880.129      106.109          147.65       8.294574        5.960914               store_sales
44     129.618      23.747           103.916      5.458289        1.247334               store_sales
58     142.113      20.996           33.936       6.768575        4.187677               store_sales
62     32.519       5.425            14.809       5.994286        2.195894               web_sales
76     138.765      20.73            49.892       6.693922        2.781308               store_sales
88     475.824      48.2             94.382       9.871867        5.04147                store_sales
90     53.896       6.804            18.11        7.921223        2.976035               web_sales
94     241.172      43.49            81.181       5.545459        2.970794               web_sales
96     67.059       10.396           15.993       6.450462        4.193022               store_sales

When we analysed it further, we saw that all these queries operate on either 
the store_sales or web_sales tables, and Spark 3 with S3 seems to download much 
more data from storage than Spark 3 with Hadoop or Spark 2.4.5 with S3, which 
results in more time for query completion. I'm attaching screenshots of the 
Driver UI for one such instance (Query 9) for reference.
Also attached are the Spark configurations (Spark 3.0) used for these tests.

We’re not sure why Spark 3.0 on S3 is having this behaviour. Any inputs on what 
we’re missing?

Thanks and Regards,
Abhishek




RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-25 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Luca,

Thanks for sharing the feedback. We'll include these recommendations in our 
tests. However, we feel the issue that we're seeing right now is due to the 
difference in the size of data downloaded from storage by the executors. In the 
case of S3, executors are downloading almost 50 GB of data, whereas in the case 
of HDFS, it is only 4.5 GB.
Any idea why this difference is there?


Thanks and Regards,
Abhishek

From: Luca Canali 
Sent: Monday, August 24, 2020 7:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) 
Cc: user@spark.apache.org
Subject: RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi Abhishek,

Just a few ideas/comments on the topic:

When benchmarking/testing I find it useful to collect a more complete view of 
resource usage and Spark metrics, beyond just measuring query elapsed time. 
Something like this:
https://github.com/cerndb/spark-dashboard

I'd rather not use dynamic allocation when benchmarking if possible, as it adds 
a layer of complexity when examining results.

If you suspect that reading from S3 vs. HDFS may play an important role on the 
performance you observe, you may want to drill down on that with a simple 
micro-benchmark, for example something like this (for Spark 3.0):

val df=spark.read.parquet("/TPCDS/tpcds_1500/store_sales")
df.write.format("noop").mode("overwrite").save

Best,
Luca

From: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek@nokia.com>
Sent: Monday, August 24, 2020 13:50
To: user@spark.apache.org
Subject: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi All,

We're doing some performance comparisons between Spark querying data on HDFS 
and Spark querying data on S3 (Ceph Object Store used for S3 storage) using 
standard TPC DS queries. We are observing that Spark 3.0 with S3 takes 
significantly longer for some set of queries when compared with HDFS.
We also ran similar queries with Spark 2.4.5 querying data from S3 and, for 
this set of queries, the time taken by Spark 2.4.5 is less than that of Spark 
3.0, which looks very strange.
Below are the details of 9 queries where Spark 3.0 takes >5 times the duration 
when running on S3 compared to Hadoop.

Environment Details:

  *   Spark running on Kubernetes
  *   TPC DS Scale Factor: 500 GB
  *   Hadoop 3.x
  *   Same CPU and memory used for all executions

Query  Spark 3.0    Spark 3.0        Spark 2.4.5  Spark 3.0 HDFS  Spark 2.4.5 S3 vs      Table
       with S3 (s)  with Hadoop (s)  with S3 (s)  vs S3 (factor)  Spark 3.0 S3 (factor)  involved
9      880.129      106.109          147.65       8.294574        5.960914               store_sales
44     129.618      23.747           103.916      5.458289        1.247334               store_sales
58     142.113      20.996           33.936       6.768575        4.187677               store_sales
62     32.519       5.425            14.809       5.994286        2.195894               web_sales
76     138.765      20.73            49.892       6.693922        2.781308               store_sales
88     475.824      48.2             94.382       9.871867        5.04147                store_sales
90     53.896       6.804            18.11        7.921223        2.976035               web_sales
94     241.172      43.49            81.181       5.545459        2.970794               web_sales
96     67.059       10.396           15.993       6.450462        4.193022               store_sales

When we analysed it further, we saw that all these queries operate on either 
the store_sales or web_sales tables, and Spark 3 with S3 seems to download much 
more data from storage than Spark 3 with Hadoop or Spark 2.4.5 with S3, which 
results in more time for query completion. I'm attaching screenshots of the 
Driver UI for one such instance (Query 9) for reference.
Also attached are the Spark configurations (Spark 3.0) used for these tests.

We're not sure why Spark 3.0 on S3 is having this behaviour. Any inputs on what 
we're missing?

Thanks and Regards,
Abhishek



RE: Spark Thrift Server in Kubernetes deployment

2020-06-22 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi,

STS deployment on K8s is not supported out of the box.
We made some minor changes in the Spark code to get the Spark Thrift Server 
working on K8s. Here is the PR that we had created:
https://github.com/apache/spark/pull/22433

Unfortunately, this could not be merged.

Thanks and Regards,
Abhishek

From: Subash K 
Sent: Monday, June 22, 2020 9:00 AM
To: user@spark.apache.org
Subject: Spark Thrift Server in Kubernetes deployment

Hi,

We are currently using Spark 2.4.4 with the Spark Thrift Server (STS) to expose 
a JDBC interface to the reporting tools to generate reports from Spark tables.

Now, as we are analyzing containerized deployment of Spark and STS, I would 
like to understand whether STS deployment on Kubernetes is supported out of the 
box, because we were not able to find any documentation on how to configure and 
spin up the container for STS. Please help us with this.
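
For context, a hypothetical sketch of the kind of JDBC access meant here: STS 
exposes a HiveServer2-compatible endpoint, so a client connects with the Hive 
JDBC driver (host, port, credentials and query below are placeholders, and the 
hive-jdbc driver is assumed to be on the classpath):

import java.sql.DriverManager

// Placeholder endpoint: a Kubernetes Service in front of STS would be used here.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://sts-service.example.svc:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("SELECT 1")
while (rs.next()) println(rs.getInt(1))
conn.close()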

Regards,
Subash Kunjupillai



RE: [External Sender] Spark Executor pod not getting created on kubernetes cluster

2019-10-07 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Manish,

Is this issue resolved? If not, please check the overlay network of your 
cluster. We had faced similar issues when we had problems with overlay 
networking.
In our case, the executor had spawned, but the communication between driver and 
executor had failed (due to issues with the overlay network) and we were seeing 
similar logs.

One way to quickly check this is to quarantine all worker nodes except one. 
This way, both driver and executor will be launched on the same worker node. If 
driver/executor communication happens in this case, then it is confirmed that 
there is an issue with the overlay network.

Thanks and Regards,
Abhishek

From: manish gupta 
Sent: 01 October 2019 PM 09:20
To: Prudhvi Chennuru (CONT) 
Cc: user 
Subject: Re: [External Sender] Spark Executor pod not getting created on 
kubernetes cluster

Kube-api server logs are not enabled. I will enable and check and get back on 
this.

Regards
Manish Gupta

On Tue, Oct 1, 2019 at 9:05 PM Prudhvi Chennuru (CONT) 
<prudhvi.chenn...@capitalone.com> wrote:
If you are passing the service account for executors as a Spark property, then 
the executors will use the one you are passing, not the default service 
account. Did you check the API server logs?

On Tue, Oct 1, 2019 at 11:07 AM manish gupta <tomanishgupt...@gmail.com> wrote:
While launching the driver pod, I am passing the service account, which has a 
cluster role with all the required permissions to create a new pod. So will the 
driver pass the same details to the API server while creating the executor pod, 
or will executors be created with the default service account?

Regards
Manish Gupta

On Tue, Oct 1, 2019 at 8:01 PM Prudhvi Chennuru (CONT) 
<prudhvi.chenn...@capitalone.com> wrote:
By default, executors use the default service account in the namespace where 
you are creating the driver and executors, so I am guessing that the executors 
don't have access to run on the cluster. If you check the kube-apiserver logs, 
you will know the issue.
Try giving privileged access to the default service account in the namespace 
where you are creating the executors; it should work.

On Tue, Oct 1, 2019 at 10:25 AM manish gupta <tomanishgupt...@gmail.com> wrote:
Hi Prudhvi

I can see this issue consistently. I am doing a POC wherein I am trying to 
create a dynamic Spark cluster to run my job using spark-submit on Kubernetes. 
On Minikube it works fine, but on RBAC-enabled Kubernetes it fails to launch 
the executor pods. It is able to launch the driver pod, but I am not sure why 
it cannot launch the executor pods even though it has ample resources. I don't 
see any error message in the logs apart from the warning message that I have 
provided above.
Not even a single executor pod is getting launched.

Regards
Manish Gupta

On Tue, Oct 1, 2019 at 6:31 PM Prudhvi Chennuru (CONT) 
<prudhvi.chenn...@capitalone.com> wrote:
Hi Manish,

            Are you seeing this issue consistently or sporadically? And when 
you say executors are not launched, is not even a single executor created for 
that driver pod?

On Tue, Oct 1, 2019 at 1:43 AM manish gupta <tomanishgupt...@gmail.com> wrote:
Hi Team

I am trying to create a Spark cluster on Kubernetes with RBAC enabled, using a 
spark-submit job. I am using Spark 2.4.1.
spark-submit is able to launch the driver pod by contacting the Kubernetes API 
server, but the executor pods are not getting launched. I can see the below 
warning message in the driver pod logs.

19/09/27 10:16:01 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
19/09/27 10:16:16 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources

I have faced this issue in standalone Spark clusters and resolved it, but I am 
not sure how to resolve it on Kubernetes. I have not given any ResourceQuota 
configuration in the Kubernetes RBAC YAML file, and there is ample memory and 
CPU available for any new pod/container to be launched.

Any leads/pointers to resolve this issue would be of great help.

Thanks and Regards
Manish Gupta


--
Thanks,
Prudhvi Chennuru.





--
Thanks,
Prudhvi Chennuru.



RE: web access to sparkUI on docker or k8s pods

2019-08-27 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi,

We have seen this issue when we tried to bring up the UI on a custom ingress 
path (the default ingress path "/" works). Do you also have a similar 
configuration? We tried setting spark.ui.proxyBase and spark.ui.reverseProxy, 
but it did not help.

As a workaround, we're using the ingress port (port on the edge node) for now. 
There is also the option of using NodePort; that works as well.

Thanks and Regards,
Abhishek

From: Yaniv Harpaz 
Sent: Tuesday, August 27, 2019 7:34 PM
To: user@spark.apache.org
Subject: web access to sparkUI on docker or k8s pods

Hello guys,
When I launch driver pods, or even when I use docker run with the Spark image, 
the Spark master UI (8080) works great, but the Spark UI (4040) loads without 
its CSS.

when I dig a bit deeper I see
"Refused to apply style from '' because its MIME type ('text/html') is not 
supported stylesheet MIME type, and strict MIME checking is enabled."

what am I missing here?
Yaniv

Yaniv Harpaz
[ yaniv.harpaz at gmail.com ]


RE: Spark on Kubernetes - log4j.properties not read

2019-06-10 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Dave,

As part of the driver pod bring-up, a ConfigMap is created from all the Spark 
configuration parameters (with the name spark.properties) and mounted to 
/opt/spark/conf, so all the other files present in /opt/spark/conf will be 
overwritten. The same is happening with log4j.properties in this case. You 
could try to build the container with log4j.properties placed at some other 
location and point to it via spark.driver.extraJavaOptions.

Thanks and Regards,
Abhishek

From: Dave Jaffe 
Sent: Tuesday, June 11, 2019 6:45 AM
To: user@spark.apache.org
Subject: Spark on Kubernetes - log4j.properties not read

I am using Spark on Kubernetes from Spark 2.4.3. I have created a 
log4j.properties file in my local spark/conf directory and modified it so that 
the console (or, in the case of Kubernetes, the log) only shows warnings and 
higher (log4j.rootCategory=WARN, console). I then added the command
COPY conf /opt/spark/conf
to /root/spark/kubernetes/dockerfiles/spark/Dockerfile and built a new 
container.

However, when I run that under Kubernetes, the program runs successfully but 
/opt/spark/conf/log4j.properties is not used (I still see the INFO lines when I 
run kubectl logs ).

I have tried other things such as explicitly adding a --properties-file to my 
spark-submit command and even
--conf 
spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/conf/log4j.properties

My log4j.properties file is never seen.

How do I customize log4j.properties with Kubernetes?

Thanks, Dave Jaffe



RE: Spark UI History server on Kubernetes

2019-01-23 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Lakshman,

We’ve set these 2 properties to bringup spark history server

spark.history.fs.logDirectory 
spark.history.ui.port 

We’re writing the logs to HDFS. In order to write logs, we’re setting following 
properties while submitting the spark job
spark.eventLog.enabled true
spark.eventLog.dir 
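
For illustration, a minimal sketch of the same event-log settings attached to a 
session programmatically (the HDFS path and app name are placeholders; 
spark.history.fs.logDirectory on the history server side should point at the 
same directory):

import org.apache.spark.sql.SparkSession

// Illustrative sketch only: enable event logging so the history server can
// pick up the completed application logs from the shared directory.
val spark = SparkSession.builder()
  .appName("history-server-demo")  // placeholder name
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs://namenode:8020/spark-events")  // placeholder path
  .getOrCreate()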

Thanks and Regards,
Abhishek

From: Battini Lakshman 
Sent: Wednesday, January 23, 2019 1:55 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) 
Subject: Re: Spark UI History server on Kubernetes

HI Abhishek,

Thank you for your response. Could you please let me know the properties you 
configured for bringing up the History Server and its UI?

Also, are you writing the logs to any directory on persistent storage? If yes, 
could you let me know the changes you made in Spark to write logs to that 
directory. Thanks!

Best Regards,
Lakshman Battini.

On Tue, Jan 22, 2019 at 10:53 PM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek@nokia.com> wrote:
Hi,

We’ve setup spark-history service (based on spark 2.4) on K8S. UI works 
perfectly fine when running on NodePort. We’re facing some issues when on 
ingress.
Please let us know what kind of inputs do you need?

Thanks and Regards,
Abhishek

From: Battini Lakshman <battini.laksh...@gmail.com>
Sent: Tuesday, January 22, 2019 6:02 PM
To: user@spark.apache.org
Subject: Spark UI History server on Kubernetes

Hello,

We are running Spark 2.4 on a Kubernetes cluster and are able to access the 
Spark UI using "kubectl port-forward".

However, this Spark UI contains only the currently running Spark application 
logs; we would like to maintain the 'completed' Spark application logs as well. 
Could someone help us set up the 'Spark History server' on Kubernetes? Thanks!

Best Regards,
Lakshman Battini.


RE: Spark UI History server on Kubernetes

2019-01-22 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi,

We’ve setup spark-history service (based on spark 2.4) on K8S. UI works 
perfectly fine when running on NodePort. We’re facing some issues when on 
ingress.
Please let us know what kind of inputs do you need?

Thanks and Regards,
Abhishek

From: Battini Lakshman 
Sent: Tuesday, January 22, 2019 6:02 PM
To: user@spark.apache.org
Subject: Spark UI History server on Kubernetes

Hello,

We are running Spark 2.4 on a Kubernetes cluster and are able to access the 
Spark UI using "kubectl port-forward".

However, this Spark UI contains only the currently running Spark application 
logs; we would like to maintain the 'completed' Spark application logs as well. 
Could someone help us set up the 'Spark History server' on Kubernetes? Thanks!

Best Regards,
Lakshman Battini.