When does SparkContext.defaultParallelism have the correct value?

2020-07-06 Thread Stephen Coy
Hi there, I have found that if I invoke sparkContext.defaultParallelism() too early, it will not return the correct value. For example, if I write this: final JavaSparkContext sparkContext = new JavaSparkContext(sparkSession.sparkContext()); final int workerCount =
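This is likely because defaultParallelism is not stable until the executors have registered with the driver. One workaround is to poll until the reported value reaches what you expect; below is a minimal plain-Python sketch of that pattern (`wait_for_stable_value`, `min_expected`, and the usage names are invented for illustration, not Spark API):

```python
import time

def wait_for_stable_value(read_value, min_expected, timeout_s=30.0, poll_s=0.5):
    """Poll read_value() until it is at least min_expected or the timeout
    expires; return the last observed value."""
    deadline = time.monotonic() + timeout_s
    value = read_value()
    while value < min_expected and time.monotonic() < deadline:
        time.sleep(poll_s)
        value = read_value()
    return value

# Hypothetical Spark usage (names illustrative):
#   workers = wait_for_stable_value(
#       lambda: sparkContext.defaultParallelism(), min_expected=expected_cores)
```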

Re: java.lang.ClassNotFoundException for s3a committer

2020-07-06 Thread Stephen Coy
Hi Steve, While I understand your point regarding the mixing of Hadoop jars, this does not address the java.lang.ClassNotFoundException. Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or Hadoop 3.2, not Hadoop 3.1. The only place that I have found that missing class is in

Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Daniel de Oliveira Mantovani
Hi Teja, To access Hive 3 using Apache Spark 2.x.x you need to use this connector from Cloudera https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html . It has many limitations. You can only write to Hive
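For orientation, launching a Spark session with the Hive Warehouse Connector typically looks roughly like the following. Hostnames, ports, and the assembly-jar path are placeholders, and the exact configuration keys vary by HDP version, so treat this as a sketch and check the Cloudera doc above:

```shell
# Illustrative only -- every host, port, and path below is a placeholder.
spark-shell \
  --jars /opt/hwc/hive-warehouse-connector-assembly-<version>.jar \
  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://hs2-host:10000" \
  --conf spark.datasource.hive.warehouse.metastoreUri="thrift://metastore-host:9083" \
  --conf spark.hadoop.hive.llap.daemon.service.hosts="@llap0"
```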

Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Daniel de Oliveira Mantovani
Hello Gabor, I meant, third-party connector* not "connection". Thank you so much! On Mon, Jul 6, 2020 at 1:09 PM Gabor Somogyi wrote: > Hi Daniel, > > I'm just working on the developer API where any custom JDBC connection > provider(including Hive) can be added. > Not sure what you mean by

Load distribution in Structured Streaming

2020-07-06 Thread Eric Beabes
In my structured streaming job I've noticed that a LOT of data keeps going to one executor whereas the other executors don't process much data. As a result, tasks on that executor take a long time to complete. In other words, the distribution is skewed. I believe in Structured Streaming the
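A common mitigation for this kind of key skew is salting: append a random suffix to the hot key so its records spread across partitions, then aggregate in two steps (partial aggregate on the salted key, then merge on the original key). A framework-free Python sketch of the idea follows; the partitioner and all names are illustrative, not Spark's API:

```python
import random
from collections import Counter

NUM_PARTITIONS = 32
SALT_BUCKETS = 8

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for a framework's hash partitioner.
    return hash(key) % num_partitions

def salted_key(key, buckets=SALT_BUCKETS):
    # Append a random salt so one hot key spreads over `buckets` sub-keys.
    return f"{key}#{random.randrange(buckets)}"

# 10,000 records all carrying the same hot key:
records = ["hot-key"] * 10_000
plain = Counter(partition_for(k) for k in records)
salted = Counter(partition_for(salted_key(k)) for k in records)
# `plain` puts everything in one partition; `salted` spreads it out.
```

After the partial aggregation, strip the `#n` suffix and combine the per-salt results to get the true per-key totals.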

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread Jungtaek Lim
In SS, checkpointing is now part of running a micro-batch and it's supported natively. (To be clear, my library doesn't deal with the native behavior of checkpointing.) In other words, it can't be customized like you have been doing with your database. You probably don't need to do it with SS,

upsert dataframe to kudu

2020-07-06 Thread Umesh Bansal
Hi All, We are running into issues when Spark tries to insert a dataframe into a Kudu table having 300 columns. A few of the columns are getting inserted with NULL values. In code, we are using the built-in upsert method and passing the dataframe to it. Thanks

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread KhajaAsmath Mohammed
Thanks Lim, this is really helpful. I have a few questions. Our earlier approach used a low-level consumer to read offsets from a database and use that information to read using Spark Streaming with DStreams, saving the offsets back once the process finished. This way we never lost data. With your

Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Sean Owen
2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work connecting to Hadoop 3 / Hive 3; it's possible in a few cases. It's also possible some vendor distributions support this combination. On Mon, Jul 6, 2020 at 7:51 AM Teja wrote: > > We use spark 2.4.0 to connect to Hadoop 2.7

Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Teja
We use Spark 2.4.0 to connect to a Hadoop 2.7 cluster and query Hive Metastore version 2.3. But the cluster management team has decided to upgrade to Hadoop 3.x and Hive 3.x. We could not migrate to Spark 3 yet, which is compatible with Hadoop 3 and Hive 3, as we could not test if anything

Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Gabor Somogyi
Hi Daniel, I'm just working on the developer API where any custom JDBC connection provider (including Hive) can be added. Not sure what you mean by third-party connection, but AFAIK there is no workaround at the moment. BR, G On Mon, Jul 6, 2020 at 12:09 PM Daniel de Oliveira Mantovani <

How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Daniel de Oliveira Mantovani
Hello List, Is it possible to access Hive 2 through JDBC with Kerberos authentication from the Apache Spark JDBC interface? If it is possible, do you have an example? I found these tickets on JIRA: https://issues.apache.org/jira/browse/SPARK-12312 https://issues.apache.org/jira/browse/SPARK-31815 Do
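Independently of Spark's JDBC data source (whose Kerberos support is what the JIRAs above track), the HiveServer2 JDBC driver itself takes the server's Kerberos principal as a `;principal=` session parameter in the URL. A small illustrative helper — the function name and default values are made up, but the URL shape follows standard HiveServer2 syntax:

```python
def hive2_kerberos_url(host, port=10000, database="default",
                       principal="hive/_HOST@EXAMPLE.COM"):
    # HiveServer2 JDBC URLs carry the server's Kerberos principal as a
    # session parameter appended after the database name.
    return f"jdbc:hive2://{host}:{port}/{database};principal={principal}"
```

The client still needs a valid Kerberos ticket (e.g. via kinit) before the driver can authenticate.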