Spark History Server log files questions

2021-03-22 Thread Hung Vu
Hi, I have a couple of questions regarding the Spark history server: 1. Is there a way for a cluster to selectively clean old files? For example, if we want to keep some logs from 3 days ago but clean some logs from 2 days ago, is there a filter or config to do that? 2. We have over

Question about how hadoop configurations populated in driver/executor pod

2021-03-22 Thread Yue Peng
Hi, I am trying to run the SparkPi example via Spark on Kubernetes in my cluster. However, it consistently fails because the executor does not have the correct Hadoop configurations. I could fix it by pre-creating a ConfigMap and mounting it into the executor via the pod template. But I do

Re: Repartition or Coalesce not working

2021-03-22 Thread KhajaAsmath Mohammed
Thanks, Sean. I just realized it. Let me try that. On Mon, Mar 22, 2021 at 12:31 PM Sean Owen wrote: > You need to do something with the result of repartition. You haven't > changed textDF > > On Mon, Mar 22, 2021, 12:15 PM KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> Hi, >> >> I

Re: unit testing for spark code

2021-03-22 Thread Attila Zsolt Piros
Hi! Let me draw your attention to Holden's *spark-testing-base* project. The documentation is at https://github.com/holdenk/spark-testing-base/wiki. As I usually write tests for Spark internal features, I haven't needed to test at such a high level, but I am interested in your experiences. Best

Re: Repartition or Coalesce not working

2021-03-22 Thread Sean Owen
You need to do something with the result of repartition. You haven't changed textDF On Mon, Mar 22, 2021, 12:15 PM KhajaAsmath Mohammed wrote: > Hi, > > I have a use case where there are large files in hdfs. > > Size of the file is 3 GB. > > It is an existing code in production and I am trying
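
The point in this reply can be sketched in PySpark as follows. This is a minimal illustration, not the original poster's code: the helper name `with_partitions` and the `demo` function are hypothetical, and `spark.range(1000)` is only a stand-in for the DataFrame read from HDFS. It assumes the standard behaviour that `DataFrame.repartition` returns a new DataFrame and leaves the original untouched.

```python
def with_partitions(df, num_partitions):
    """Return a new DataFrame with `num_partitions` partitions.

    DataFrames are immutable: repartition() returns a new DataFrame and
    does not modify `df`, so the caller must keep the returned value.
    """
    return df.repartition(num_partitions)


def demo():
    # pyspark is imported lazily so the helper above can be imported without it.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("repartition-demo")  # hypothetical app name
        .getOrCreate()
    )
    text_df = spark.range(1000)  # stand-in for the DataFrame created from HDFS

    text_df.repartition(8)  # result discarded: text_df is unchanged
    print("partitions, result discarded:", text_df.rdd.getNumPartitions())

    text_df = with_partitions(text_df, 8)  # result assigned back to the name
    print("partitions, result assigned:", text_df.rdd.getNumPartitions())

    spark.stop()
```

Calling `demo()` requires a local PySpark installation. The same reasoning applies to `coalesce`, which also returns a new DataFrame.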

Repartition or Coalesce not working

2021-03-22 Thread KhajaAsmath Mohammed
Hi, I have a use case where there are large files in HDFS. The file size is 3 GB. This is existing production code, and I am trying to improve the performance of the job. Sample code: textDF=dataframe (this is the DataFrame that was created from the HDFS path) logging.info("Number of

Re: unit testing for spark code

2021-03-22 Thread Nicholas Gustafson
I've found pytest works well if you're using PySpark. Though if you have a lot of tests, running them all can be pretty slow. On Mon, Mar 22, 2021 at 6:32 AM Amit Sharma wrote: > Hi, can we write unit tests for spark code. Is there any specific > framework? > > > Thanks > Amit >

Re: unit testing for spark code

2021-03-22 Thread Mich Talebzadeh
Coding in Scala or Python? Are you using an IDE (IntelliJ, PyCharm)?

unit testing for spark code

2021-03-22 Thread Amit Sharma
Hi, can we write unit tests for Spark code? Is there a specific framework? Thanks, Amit

Re: Why code is failing to connect to Oracle DB in 3.1.1 through JDBC with Scala

2021-03-22 Thread Mich Talebzadeh
I sorted this one out. It was caused by a mismatch between Spark 3.1.1, which uses Scala 2.12, and the old Scala 2.11 packages.

Re: [Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-22 Thread Mich Talebzadeh
Hi Gaurav, What version of Spark will you be using? Have you tried a simple example of reading one of the views through a JDBC connection to Oracle yourself? HTH

Why code is failing to connect to Oracle DB in 3.1.1 through JDBC with Scala

2021-03-22 Thread Mich Talebzadeh
Hi, The error is in Spark 3.1.1 with Scala. The JDBC connection used to work fine in spark-2.4.3: val s = HiveContext.read.format("jdbc").options( Map("url" -> url, "dbtable" -> _dbtable, "user" -> _username, "password" -> _password)).load However, in 3.1.1 it is

Re: Bucketing 3.1.1

2021-03-22 Thread German Schiavon
Ohh! That is why! I missed that rename. Thanks a lot, Bartosz! On Mon, 22 Mar 2021 at 09:55, Bartosz Konieczny wrote: > Hi German Schiavon, > > The property is supported in shuffle hash join strategy too and it was > renamed here https://github.com/apache/spark/pull/29079/files. Try with >

Re: Bucketing 3.1.1

2021-03-22 Thread Bartosz Konieczny
Hi German Schiavon, The property is supported in shuffle hash join strategy too and it was renamed here https://github.com/apache/spark/pull/29079/files. Try with *spark.sql.bucketing.coalesceBucketsInJoin.enabled* instead of spark.sql.bucketing.coalesceBucketsInSortMergeJoin.enabled :) (You can
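
As a rough sketch of the suggestion in this reply, the renamed property can be set on a running session before building the join. The two property names are the ones quoted in the reply; the constant names and the helper `enable_bucket_coalescing` are hypothetical and exist only for illustration.

```python
# Property names as quoted in the reply; the constants and helper are illustrative.
OLD_KEY = "spark.sql.bucketing.coalesceBucketsInSortMergeJoin.enabled"  # name before the rename
NEW_KEY = "spark.sql.bucketing.coalesceBucketsInJoin.enabled"  # name to try in 3.1.1


def enable_bucket_coalescing(spark):
    """Set the renamed property on an existing SparkSession and return it."""
    spark.conf.set(NEW_KEY, "true")
    return spark
```

After calling `enable_bucket_coalescing(spark)` on a real `SparkSession`, the physical plan of the bucketed join can be inspected with `explain()` to check whether it now matches the example from the PR.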

Bucketing 3.1.1

2021-03-22 Thread German Schiavon
Hi all! In the 3.1.1 release a new bucket property was added in this PR. I'm trying to check this new behaviour but I'm not getting the same physical plan as the one given in the example. I'm executing the same code snippet from the PR in a 3.1.1

RE: Can JVisual VM monitoring tool be used to Monitor Spark Executor Memory and CPU

2021-03-22 Thread Ranju Jain
Hi Attila, I was configuring metrics.properties by following the steps below: 1. *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink master.source.jvm.class=org.apache.spark.metrics.source.JvmSource worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource

Invite Spark community as Pulsar Summit NA 2021 Community Partner

2021-03-22 Thread Dianjin Wang
Hi there, I'm from StreamNative and am currently organizing the [Pulsar Virtual Summit NA 2021][1]. I'm trying to contact the Spark community and invite it to be a community partner of Pulsar Summit NA 2021. As a community partner, the Spark logo will be featured on the Pulsar