date:20210810

ExecutorMonitor timeout

2021-08-10 Thread Zhenyu Hu

In the private class Tracker of org.apache.spark.scheduler.dynalloc.ExecutorMonitor, the method `updateTimeout` takes the min of

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-10 Thread Khalid Mammadov

Hi Mich, I think you need to check your code. If the code does not use the PySpark API effectively, you may get this, i.e. if you use the pure Python/pandas API rather than the PySpark pattern of transform -> transform -> action, e.g. df.select(..).withColumn(...)...count(). Hope this helps point you in the right direction.

How can I write data to hive with jdbc

2021-08-10 Thread igyu

var cfg:Map[String,String] = Map()
cfg += ("url"->"jdbc:hive2://tidb4ser:11000/joinwarehouse;user=jztwk;password=123456;hive.server2.proxy.user=jztwk")
cfg += ("dbtable"->"ods_job_log")
cfg += ("user"->"jztwk")
cfg += ("passwrod"-> "123456")
cfg += ("driver"->

Facing weird problem while reading Parquet

2021-08-10 Thread Prateek Rajput

Hi everyone, I am using spark-core-2.4 and spark-sql-2.4 (Java Spark). While reading 40K Parquet part files from a single HDFS directory, Spark somehow spawns only 20037 parallel tasks, which is weird. My initial experience with Spark is that when reading, the total number of tasks is equal to
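One possible reason for a task count below the file count is that Spark SQL's file source packs several small files into one read partition. The following is a simplified pure-Python sketch of that packing as I understand it for Spark 2.4, not a copy of Spark's code. The function name plan_file_partitions and the example file sizes are hypothetical; the two defaults correspond to spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB). The sketch assumes every file is smaller than the split size, so it does not model the splitting of large files.

```python
MIB = 1024 * 1024


def plan_file_partitions(file_sizes, default_parallelism,
                         max_partition_bytes=128 * MIB,
                         open_cost_in_bytes=4 * MIB):
    """Group file sizes (in bytes) into read partitions; one partition is one task."""
    # Every file is charged its length plus a fixed "open cost".
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    max_split_bytes = min(max_partition_bytes,
                          max(open_cost_in_bytes, bytes_per_core))

    partitions = []
    current, current_size = [], 0
    # Largest files first, then greedily fill each partition.
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current_size += size + open_cost_in_bytes
        current.append(size)
    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    # Hypothetical input: 40,000 files of 1 MiB each, default parallelism of 200.
    parts = plan_file_partitions([1 * MIB] * 40000, default_parallelism=200)
    print(len(parts), "tasks for 40000 files")
```

Under these assumptions the number of tasks depends on file sizes, the open cost, and the default parallelism rather than on the file count alone, so a count such as 20037 for 40K files is plausible when many of the part files are small.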