Hi there,
I have found that if I invoke
sparkContext.defaultParallelism()
too early, it will not return the correct value.
For example, if I write this:
final JavaSparkContext sparkContext =
    new JavaSparkContext(sparkSession.sparkContext());
final int workerCount = sparkContext.defaultParallelism();
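A workaround sketch (my assumption is that executors simply have not
registered yet when the value is read; minExecutors and the timeout are
illustrative, not from the original report):

import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical helper: block until at least `minExecutors` executors have
// registered before reading defaultParallelism. getExecutorMemoryStatus
// also counts the driver, hence the +1.
static int waitForParallelism(JavaSparkContext sc, int minExecutors, long timeoutMs)
        throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (sc.sc().getExecutorMemoryStatus().size() < minExecutors + 1
            && System.currentTimeMillis() < deadline) {
        Thread.sleep(100);  // executors register asynchronously after startup
    }
    return sc.defaultParallelism();
}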
Hi Steve,
While I understand your point regarding the mixing of Hadoop jars, it does
not address the java.lang.ClassNotFoundException.
Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or Hadoop
3.2, not Hadoop 3.1.
The only place that I have found that missing class is in t
Hi Teja,
To access Hive 3 from Apache Spark 2.x you need to use this connector
from Cloudera:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
It has many limitations. You can only write to Hive ma
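For reference, basic usage of the connector looks roughly like this (a
sketch based on the Cloudera docs linked above; the table name is
illustrative):

import com.hortonworks.hwc.HiveWarehouseSession;

// Build an HWC session on top of an existing SparkSession, then read a
// Hive 3 managed table through HiveServer2 (table name is a placeholder).
HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build();
hive.executeQuery("SELECT * FROM db.managed_table").show();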
Hello Gabor,
I meant "third-party connector", not "connection".
Thank you so much!
On Mon, Jul 6, 2020 at 1:09 PM Gabor Somogyi wrote:
> Hi Daniel,
>
> I'm just working on the developer API where any custom JDBC connection
> provider(including Hive) can be added.
> Not sure what you mean by third-party connection but AFAIK there is no
> workaround at the moment.
In my structured streaming job I've noticed that a LOT of data keeps going
to one executor whereas other executors don't process that much data. As a
result, tasks on that executor take a lot of time to complete. In other
words, the distribution is skewed.
I believe in Structured Streaming the Par
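One common mitigation is to salt the hot key so its rows spread across
tasks; a minimal sketch, not from this thread (the input name, column
names, and the salt width of 16 are all assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

// Add a random salt column and repartition on (key, salt) so a single
// hot key no longer lands entirely on one executor's task.
Dataset<Row> balanced = events
        .withColumn("salt", floor(rand().multiply(lit(16))))
        .repartition(col("key"), col("salt"));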
In SS, checkpointing is now part of running a micro-batch and is
supported natively. (To be clear, my library doesn't deal with the native
checkpointing behavior.)
In other words, it can't be customized the way you have been doing with your
database. You probably don't need to do that with SS, but
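For context, the native checkpointing is enabled just by pointing the
query at a checkpoint directory; a minimal sketch (the dataframe, sink
format, and paths are illustrative):

import org.apache.spark.sql.streaming.StreamingQuery;

// Offsets and state are tracked under checkpointLocation across restarts.
StreamingQuery query = df.writeStream()
        .format("parquet")
        .option("path", "/data/out")
        .option("checkpointLocation", "/data/checkpoints/my-query")
        .start();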
Hi All,
We are running into issues when Spark tries to insert a dataframe into
a Kudu table that has 300 columns. A few of the tables are getting inserted
with NULL values.
In the code, we are using the built-in upsert method and passing the
dataframe to it.
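The pattern we are using looks roughly like this (a sketch; the master
address and table name are placeholders):

import org.apache.kudu.spark.kudu.KuduContext;

// Upsert the dataframe into Kudu. Only the columns present in df are
// written; any table column absent from df (names are case-sensitive)
// is left NULL on a fresh insert.
KuduContext kuduContext =
        new KuduContext("kudu-master:7051", spark.sparkContext());
kuduContext.upsertRows(df, "impala::default.my_table");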
Thanks
Thanks Lim, this is really helpful. I have a few questions.
Our earlier approach used a low-level consumer to read offsets from a
database and used that information to read with Spark Streaming DStreams,
saving the offsets back once the processing finished. This way we never
lost data.
With your libr
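For reference, the offset-resume pattern described above is roughly this
in the DStream API (loadOffsetsFromDb, jssc, and kafkaParams are
placeholders for your own setup):

import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

// Resume exactly where the last run left off, from offsets stored externally.
Map<TopicPartition, Long> fromOffsets = loadOffsetsFromDb();
JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Assign(
                        fromOffsets.keySet(), kafkaParams, fromOffsets));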
2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work
connecting to Hadoop 3 / Hive 3; it's possible in a few cases.
It's also possible some vendor distributions support this combination.
On Mon, Jul 6, 2020 at 7:51 AM Teja wrote:
>
> We use spark 2.4.0 to connect to Hadoop 2.7 cluster and query from Hive
> Metastore version 2.3.
We use Spark 2.4.0 to connect to a Hadoop 2.7 cluster and query Hive
Metastore version 2.3. But the cluster management team has decided to
upgrade to Hadoop 3.x and Hive 3.x. We cannot migrate to Spark 3 yet,
which is compatible with Hadoop 3 and Hive 3, as we have not been able to
test whether anything breaks.
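For what it's worth, Spark's Hive metastore client version is
configurable; a sketch of the relevant settings (note the caveat: the
versions documented for Spark 2.4 only go up to 2.3.x, so this alone may
not reach a Hive 3 metastore):

import org.apache.spark.sql.SparkSession;

// Pin the metastore client version and let Spark pull matching jars.
SparkSession spark = SparkSession.builder()
        .config("spark.sql.hive.metastore.version", "2.3.3")
        .config("spark.sql.hive.metastore.jars", "maven")
        .enableHiveSupport()
        .getOrCreate();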
Hi Daniel,
I'm just working on the developer API where any custom JDBC connection
provider (including Hive) can be added.
Not sure what you mean by third-party connection but AFAIK there is no
workaround at the moment.
BR,
G
On Mon, Jul 6, 2020 at 12:09 PM Daniel de Oliveira Mantovani <daniel.o
Hello List,
Is it possible to access Hive 2 through JDBC with Kerberos authentication
from the Apache Spark JDBC interface? If it's possible, do you have an
example?
I found these tickets on JIRA:
https://issues.apache.org/jira/browse/SPARK-12312
https://issues.apache.org/jira/browse/SPARK-31815
Do
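For illustration, the kind of access being asked about would look roughly
like this with Spark's generic JDBC source (host, principal, and table are
placeholders; whether Kerberos authentication actually works this way is
exactly the open question in the tickets above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read a Hive table through HiveServer2 via Spark's generic JDBC source.
// The Kerberos principal in the URL only helps where the driver and
// executors already hold a valid TGT, which is the gap discussed in the
// JIRAs above.
Dataset<Row> df = spark.read()
        .format("jdbc")
        .option("url",
                "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM")
        .option("driver", "org.apache.hive.jdbc.HiveDriver")
        .option("dbtable", "my_table")
        .load();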