Spark saveAsTextFile Disk Recommendation

2021-03-20 Thread Ranju Jain
Hi All, I have a large RDD of around 60-70 GB which I cannot send to the driver using collect, so I first write it to disk using saveAsTextFile. The data then gets saved as multiple part files on each node of the cluster, and after that the driver reads the data from that

Re: Spark version verification

2021-03-20 Thread Attila Zsolt Piros
Hi! I would check out the Spark source and then diff those two RCs (first just take a look at the list of changed files): $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat ... The shell scripts in the release can be checked very easily: $ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep ".sh "

Re: Can JVisual VM monitoring tool be used to Monitor Spark Executor Memory and CPU

2021-03-20 Thread Attila Zsolt Piros
Hi Ranju! You can configure Spark's metric system. Check the *memoryMetrics.** of executor-metrics and in the component-instance-executor

Re: Coalesce vs reduce operation parameter

2021-03-20 Thread Attila Zsolt Piros
Hi! Actually *coalesce()* is usually a cheap operation as it moves some existing partitions from one node to another. So it is not a (full) shuffle. See the documentation
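
To make the distinction above concrete, here is a toy model in plain Python. It is not Spark code and not Spark's actual implementation (the real coalescer also takes data locality into account). The function names `coalesce` and `repartition` and the modulo routing rule are assumptions chosen only for illustration. The sketch shows that coalescing merges whole input partitions, while a full shuffle routes every record individually.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: merge whole input partitions into n output partitions.

    Records that shared an input partition stay together, so nothing
    is redistributed record by record.
    """
    if n >= len(partitions):
        # Without a shuffle the partition count cannot grow.
        return [list(p) for p in partitions]
    out: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Contiguous grouping: neighbouring input partitions are merged.
        out[i * n // len(partitions)].extend(part)
    return out


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model of a full shuffle: each record is routed on its own.

    The modulo rule is a stand-in for whatever partitioner is used.
    """
    out: List[Partition] = [[] for _ in range(n)]
    for part in partitions:
        for rec in part:
            out[rec % n].append(rec)
    return out


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))     # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(repartition(parts, 2))  # [[2, 4, 6, 8], [1, 3, 5, 7]]
```

In the coalesced result every original partition ends up intact inside one output partition, whereas the shuffled result splits every original partition across both outputs.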

Re: Can JVisual VM monitoring tool be used to Monitor Spark Executor Memory and CPU

2021-03-20 Thread Mich Talebzadeh
Hi, Have you considered the Spark GUI first?

Parallelism parameter in cross validation

2021-03-20 Thread Arshee Siddiqui
How do we set the value of parallelism in cross validation (Spark)? Also, does GBM take a lot of time?

Can JVisual VM monitoring tool be used to Monitor Spark Executor Memory and CPU

2021-03-20 Thread Ranju Jain
Hi All, I have a virtual machine running an application, and this application has various other 3PP components running, such as Spark, a database, etc. My requirement is to monitor every component and isolate the resources consumed individually by each component. I am thinking of using a common tool

Re: Coalesce vs reduce operation parameter

2021-03-20 Thread vaquar khan
Hi Pedro, What is your use case, and why did you use coalesce? coalesce() is a very expensive operation, as it shuffles the data across many partitions, so try to minimize repartitioning as much as possible. Regards, Vaquar khan On Thu, Mar 18, 2021, 5:47 PM Pedro Tuero wrote: > I was reviewing a