When does SparkContext.defaultParallelism have the correct value?

2020-07-06 Thread Stephen Coy
Hi there,

I have found that if I invoke

sparkContext.defaultParallelism()

too early, it will not return the correct value.

For example, if I write this:

final JavaSparkContext sparkContext =
    new JavaSparkContext(sparkSession.sparkContext());
final int workerCount = sparkContext.defaultParallelism();

I will get some small number (which I can’t recall right now).

However, if I insert:

sparkContext.parallelize(List.of(1, 2, 3, 4)).collect()

between these two lines, I get the expected value, which is something like
node_count * node_core_count.

This seems like a hacky workaround to me. Is there a better way to
get this value initialised properly?

FWIW, I need this value to size a connection pool (fs.s3a.connection.maximum)
correctly in a cluster-independent way.
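
A minimal sketch of one alternative to the dummy parallelize() job: wait (with a bound)
for executors to register via the status tracker before reading defaultParallelism, then
size the pool. The 60-second timeout and the "* 2" multiplier below are illustrative
assumptions only, not recommendations.

import org.apache.spark.SparkExecutorInfo;
import org.apache.spark.sql.SparkSession;

// Sketch only: poll until at least one non-driver executor has registered
// (the driver itself usually appears in getExecutorInfos(), hence "> 1"),
// or until a timeout elapses, then read defaultParallelism.
final SparkSession spark = SparkSession.builder().getOrCreate();
final long deadline = System.currentTimeMillis() + 60_000L;
while (System.currentTimeMillis() < deadline) {
    SparkExecutorInfo[] infos = spark.sparkContext().statusTracker().getExecutorInfos();
    if (infos.length > 1) {
        break;
    }
    try {
        Thread.sleep(500L);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}
final int workerCount = spark.sparkContext().defaultParallelism();
// Size the S3A connection pool from the (now meaningful) parallelism.
spark.sparkContext().hadoopConfiguration()
    .set("fs.s3a.connection.maximum", Integer.toString(workerCount * 2));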

Thanks,

Steve C




Re: java.lang.ClassNotFoundException for s3a committer

2020-07-06 Thread Stephen Coy
Hi Steve,

While I understand your point regarding the mixing of Hadoop jars, this does 
not address the java.lang.ClassNotFoundException.

Prebuilt Apache Spark 3.0 distributions are only available for Hadoop 2.7 or Hadoop
3.2, not Hadoop 3.1.

The only place that I have found that missing class is in the Spark
“hadoop-cloud” source module, and currently the only way to get the jar
containing it is to build it yourself. If any of the devs are listening, it
would be nice if this were included in the standard distribution. It has a
sizeable chunk of a repackaged Jetty embedded in it, which I find a bit odd.

But I am relatively new to this stuff so I could be wrong.

I am currently running Spark 3.0 clusters with no HDFS. Spark is set up like:

hadoopConfiguration.set("spark.hadoop.fs.s3a.committer.name", "directory");
hadoopConfiguration.set("spark.sql.sources.commitProtocolClass", 
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol");
hadoopConfiguration.set("spark.sql.parquet.output.committer.class", 
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter");
hadoopConfiguration.set("fs.s3a.connection.maximum", Integer.toString(coreCount 
* 2));

Querying and updating s3a data sources seems to be working ok.
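
For comparison, a minimal sketch of the same settings supplied at SparkSession build
time instead, so the spark.* keys sit in the Spark conf and fs.s3a.connection.maximum
is pushed into the Hadoop Configuration via the spark.hadoop. prefix; coreCount is
assumed to be computed elsewhere, as above.

import org.apache.spark.sql.SparkSession;

// Sketch only: committer settings supplied through the session builder.
SparkSession spark = SparkSession.builder()
        .config("spark.hadoop.fs.s3a.committer.name", "directory")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .config("spark.hadoop.fs.s3a.connection.maximum",
                Integer.toString(coreCount * 2))
        .getOrCreate();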

Thanks,

Steve C

On 29 Jun 2020, at 10:34 pm, Steve Loughran <ste...@cloudera.com.INVALID> wrote:

You are going to need hadoop-3.1 on your classpath, with hadoop-aws and the
same aws-sdk it was built with (1.11.something). Mixing Hadoop JARs is doomed.
Using a different AWS SDK jar is a bit risky, though more recent upgrades have
all been fairly low stress.

On Fri, 19 Jun 2020 at 05:39, murat migdisoglu <murat.migdiso...@gmail.com> wrote:
Hi all,
I've upgraded my test cluster to Spark 3 and changed my committer to directory,
and I still get this error. The documentation is somewhat obscure on that.
Do I need to add a third-party jar to support the new committers?

java.lang.ClassNotFoundException: 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol


On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu <murat.migdiso...@gmail.com> wrote:
Hello all,
We have a Hadoop cluster (using YARN) that uses S3 as the filesystem, with
S3Guard enabled.
We are using Hadoop 3.2.1 with Spark 2.4.5.

When I try to save a dataframe in parquet format, I get the following exception:
java.lang.ClassNotFoundException: 
com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My relevant Spark configurations are as follows:
"hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
"fs.s3a.committer.name":
 "magic",
"fs.s3a.committer.magic.enabled": true,
"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",

While Spark streaming fails with the exception above, Apache Beam succeeds in
writing parquet files.
What might be the problem?

Thanks in advance


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare






Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Daniel de Oliveira Mantovani
Hi Teja,

To access Hive 3 from Apache Spark 2.x.x you need to use the Hive Warehouse
Connector from Cloudera:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
It has many limitations: you can only write to Hive managed tables in ORC
format. But you can mitigate this by writing to Hive unmanaged (external)
tables instead, so Parquet will work.
The performance is also not the same.
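
For illustration, a minimal sketch of typical HWC usage based on the Cloudera
documentation linked above; verify the class and method names against the connector
version you deploy, and spark and df are assumed to already exist.

import com.hortonworks.hwc.HiveWarehouseSession;

// Sketch only: build an HWC session on top of an existing SparkSession.
HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build();

// Reads go through the HWC data source rather than Spark's built-in Hive support.
hive.executeQuery("SELECT * FROM my_db.my_table").show();

// Writes to Hive managed tables are limited to ORC; the table name is a placeholder.
df.write()
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_db.target_table")
  .save();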

Good luck


On Mon, Jul 6, 2020 at 3:16 PM Sean Owen  wrote:

> 2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work
> connecting to Hadoop 3 / Hive 3; it's possible in a few cases.
> It's also possible some vendor distributions support this combination.
>
> On Mon, Jul 6, 2020 at 7:51 AM Teja  wrote:
> >
> > We use spark 2.4.0 to connect to Hadoop 2.7 cluster and query from Hive
> > Metastore version 2.3. But the Cluster managing team has decided to
> upgrade
> > to Hadoop 3.x and Hive 3.x. We could not migrate to spark 3 yet, which is
> > compatible with Hadoop 3 and Hive 3, as we could not test if anything
> > breaks.
> >
> > *Is there any possible way to stick to spark 2.4.x version and still be
> able
> > to use Hadoop 3 and Hive 3?
> > *
> >
> > I got to know backporting is one option but I am not sure how. It would
> be
> > great if you could point me in that direction.
> >
> >
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 

--
Daniel Mantovani


Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Daniel de Oliveira Mantovani
Hello Gabor,

I meant, third-party connector* not "connection".

Thank you so much!

On Mon, Jul 6, 2020 at 1:09 PM Gabor Somogyi 
wrote:

> Hi Daniel,
>
> I'm just working on the developer API where any custom JDBC connection
> provider(including Hive) can be added.
> Not sure what you mean by third-party connection but AFAIK there is no
> workaround at the moment.
>
> BR,
> G
>
>
> On Mon, Jul 6, 2020 at 12:09 PM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Hello List,
>>
>> Is it possible to access Hive 2 through JDBC with Kerberos authentication
>> from Apache Spark JDBC interface ? If it's possible do you have an example ?
>>
>> I found this tickets on JIRA:
>> https://issues.apache.org/jira/browse/SPARK-12312
>> https://issues.apache.org/jira/browse/SPARK-31815
>>
>> Do you know if there's a workaround for this ? Maybe using a third-party
>> connection ?
>>
>> Thank you so much
>> --
>>
>> --
>> Daniel Mantovani
>>
>>

-- 

--
Daniel Mantovani


Load distribution in Structured Streaming

2020-07-06 Thread Eric Beabes
In my structured streaming job I've noticed that a LOT of data keeps going
to one executor, whereas the other executors don't process that much data. As a
result, tasks on that executor take a long time to complete. In other
words, the distribution is skewed.

I believe that in Structured Streaming the partitions of the input Kafka topic
get evenly distributed amongst executors, right? In our input Kafka topic
the data is fairly evenly distributed amongst partitions, I would think.
Any reason for this skew? Is there a way to fix it by using a Partitioner
or something like that? Please let me know.
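
One common mitigation, sketched below with placeholder broker and topic names, is to
repartition the stream explicitly before the heavy stage; note this only helps if the
skew comes from the Kafka partition assignment rather than from a hot key in a later
aggregation.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch only: force a round-robin redistribution of the streamed records so
// work is spread across all executors. Broker and topic names are placeholders.
Dataset<Row> kafka = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "input_topic")
        .load();

Dataset<Row> balanced = kafka.repartition(spark.sparkContext().defaultParallelism());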

Thanks in advance for the help.


Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread Jungtaek Lim
In SS, checkpointing is now part of running the micro-batch and it's
supported natively. (To be clear, my library doesn't deal with the native
behavior of checkpointing.)

In other words, it can't be customized the way you have been doing it with your
database. You probably don't need to do that with SS, but it still depends on
what you did with the offsets in the database.
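
A minimal sketch of what that native mechanism looks like in practice: the
checkpointLocation option is where SS tracks offsets and state, so no separate offset
table is needed. The paths below are placeholders, and df is assumed to be a streaming
Dataset.

import org.apache.spark.sql.streaming.StreamingQuery;

// Sketch only: offsets and state live under checkpointLocation, managed by SS.
// Output and checkpoint paths are placeholders.
StreamingQuery query = df.writeStream()
        .format("parquet")
        .option("path", "s3a://my-bucket/output")
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/my-query")
        .start();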

On Tue, Jul 7, 2020 at 1:40 AM KhajaAsmath Mohammed 
wrote:

> Thanks Lim, this is really helpful. I have a few questions.
>
> Our earlier approach used a low-level consumer to read offsets from the database
> and used that information to read with Spark streaming in DStreams. We saved
> the offsets back once the process finished; this way we never lost data.
>
> With your library, will it automatically resume from the last offset it
> processed when the application was stopped or killed for some time?
>
> Thanks,
> Asmath
>
> On Sun, Jul 5, 2020 at 6:22 PM Jungtaek Lim 
> wrote:
>
>> There're sections in SS programming guide which exactly answer these
>> questions:
>>
>>
>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries
>>
>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
>>
>> Also, for Kafka data source, there's a 3rd party project (DISCLAIMER: I'm
>> the author) to help you commit the offset to Kafka with the specific group
>> ID.
>>
>> https://github.com/HeartSaVioR/spark-sql-kafka-offset-committer
>>
>> After then, you can also leverage the Kafka ecosystem to monitor the
>> progress in point of Kafka's view, especially the gap between highest
>> offset and committed offset.
>>
>> Hope this helps.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Mon, Jul 6, 2020 at 2:53 AM Gabor Somogyi 
>> wrote:
>>
>>> In 3.0 the community just added it.
>>>
>>> On Sun, 5 Jul 2020, 14:28 KhajaAsmath Mohammed, 
>>> wrote:
>>>
 Hi,

 We are trying to move our existing code from spark dstreams to
 structured streaming for one of the old application which we built few
 years ago.

 Structured streaming job doesn’t have streaming tab in sparkui. Is
 there a way to monitor the job submitted by us in structured streaming ?
 Since the job runs for every trigger, how can we kill the job and restart
 if needed.

 Any suggestions on this please

 Thanks,
 Asmath



 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org




upsert dataframe to kudu

2020-07-06 Thread Umesh Bansal
Hi All,

We are running into issues when Spark tries to insert a dataframe into
a Kudu table that has 300 columns. A few of the columns are getting inserted
with NULL values.

In code, we are using the built-in upsert method and passing the dataframe to it.
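
Presumably this is the kudu-spark KuduContext upsert path; a minimal sketch for
reference, where the master address and table name are placeholders and the API should
be verified against the kudu-spark version in use.

import org.apache.kudu.spark.kudu.KuduContext;

// Sketch only: upsert a dataframe through kudu-spark's KuduContext.
// Master address and table name are placeholders.
KuduContext kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext());
kuduContext.upsertRows(df, "impala::default.my_table");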

Thanks


Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread KhajaAsmath Mohammed
Thanks Lim, this is really helpful. I have a few questions.

Our earlier approach used a low-level consumer to read offsets from the database
and used that information to read with Spark streaming in DStreams. We saved
the offsets back once the process finished; this way we never lost data.

With your library, will it automatically resume from the last offset it
processed when the application was stopped or killed for some time?

Thanks,
Asmath

On Sun, Jul 5, 2020 at 6:22 PM Jungtaek Lim 
wrote:

> There're sections in SS programming guide which exactly answer these
> questions:
>
>
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries
>
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
>
> Also, for Kafka data source, there's a 3rd party project (DISCLAIMER: I'm
> the author) to help you commit the offset to Kafka with the specific group
> ID.
>
> https://github.com/HeartSaVioR/spark-sql-kafka-offset-committer
>
> After then, you can also leverage the Kafka ecosystem to monitor the
> progress in point of Kafka's view, especially the gap between highest
> offset and committed offset.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Mon, Jul 6, 2020 at 2:53 AM Gabor Somogyi 
> wrote:
>
>> In 3.0 the community just added it.
>>
>> On Sun, 5 Jul 2020, 14:28 KhajaAsmath Mohammed, 
>> wrote:
>>
>>> Hi,
>>>
>>> We are trying to move our existing code from spark dstreams to
>>> structured streaming for one of the old application which we built few
>>> years ago.
>>>
>>> Structured streaming job doesn’t have streaming tab in sparkui. Is there
>>> a way to monitor the job submitted by us in structured streaming ? Since
>>> the job runs for every trigger, how can we kill the job and restart if
>>> needed.
>>>
>>> Any suggestions on this please
>>>
>>> Thanks,
>>> Asmath
>>>
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Sean Owen
2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work
connecting to Hadoop 3 / Hive 3; it's possible in a few cases.
It's also possible some vendor distributions support this combination.

On Mon, Jul 6, 2020 at 7:51 AM Teja  wrote:
>
> We use spark 2.4.0 to connect to Hadoop 2.7 cluster and query from Hive
> Metastore version 2.3. But the Cluster managing team has decided to upgrade
> to Hadoop 3.x and Hive 3.x. We could not migrate to spark 3 yet, which is
> compatible with Hadoop 3 and Hive 3, as we could not test if anything
> breaks.
>
> *Is there any possible way to stick to spark 2.4.x version and still be able
> to use Hadoop 3 and Hive 3?
> *
>
> I got to know backporting is one option but I am not sure how. It would be
> great if you could point me in that direction.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Teja
We use Spark 2.4.0 to connect to a Hadoop 2.7 cluster and query a Hive
Metastore, version 2.3. But the cluster managing team has decided to upgrade
to Hadoop 3.x and Hive 3.x. We could not migrate to Spark 3 yet, which is
compatible with Hadoop 3 and Hive 3, as we could not test whether anything
breaks.

*Is there any possible way to stick to the Spark 2.4.x line and still be able
to use Hadoop 3 and Hive 3?*

I got to know that backporting is one option, but I am not sure how. It would be
great if you could point me in that direction.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Gabor Somogyi
Hi Daniel,

I'm just working on the developer API where any custom JDBC connection
provider (including Hive) can be added.
Not sure what you mean by third-party connection, but AFAIK there is no
workaround at the moment.

BR,
G


On Mon, Jul 6, 2020 at 12:09 PM Daniel de Oliveira Mantovani <
daniel.oliveira.mantov...@gmail.com> wrote:

> Hello List,
>
> Is it possible to access Hive 2 through JDBC with Kerberos authentication
> from Apache Spark JDBC interface ? If it's possible do you have an example ?
>
> I found this tickets on JIRA:
> https://issues.apache.org/jira/browse/SPARK-12312
> https://issues.apache.org/jira/browse/SPARK-31815
>
> Do you know if there's a workaround for this ? Maybe using a third-party
> connection ?
>
> Thank you so much
> --
>
> --
> Daniel Mantovani
>
>


How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Daniel de Oliveira Mantovani
Hello List,

Is it possible to access Hive 2 through JDBC with Kerberos authentication
from the Apache Spark JDBC interface? If it's possible, do you have an example?

I found these tickets on JIRA:
https://issues.apache.org/jira/browse/SPARK-12312
https://issues.apache.org/jira/browse/SPARK-31815

Do you know if there's a workaround for this? Maybe using a third-party
connection?

Thank you so much
-- 

--
Daniel Mantovani