Hadoop-Token-Across-Kerberized-Cluster

2018-10-16 Thread Davinder Kumar
Hello All,


I need some help with a Kerberized setup. We have two Ambari clusters; Cluster A 
and Cluster B are both Kerberized with the same KDC.


Use case: access Hive data on Cluster B from Cluster A.


Actions taken:


- The remote Cluster B principal and keytab were provided to Cluster A [admadmin 
is the user].

- The remote cluster's Hive metastore principal/keytab were provided to Cluster A.

- The Spark job runs on Cluster A to access the data from Cluster B [Spark 
on YARN].

- Cluster A is able to connect to the Hive metastore of remote Cluster B.

- Now getting an error related to Hadoop tokens [any help or suggestion is 
appreciated]. A configuration sketch and the error logs follow below.
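
For reference, a minimal sketch of how such a cross-cluster read is often wired
up (hedged: host names, principals, paths, and database/table names below are
illustrative assumptions, not values from the clusters above):

    // Hedged sketch, not the poster's actual job. In yarn-cluster mode the
    // token-related settings usually have to be passed to spark-submit
    // (--principal, --keytab, --conf) rather than set in code.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cross-cluster-hive-read")
      // Cluster B's kerberized metastore
      .config("hive.metastore.uris", "thrift://clusterB-hms.example.com:9083")
      .config("hive.metastore.sasl.enabled", "true")
      .config("hive.metastore.kerberos.principal", "hive/_HOST@EXAMPLE.REALM")
      // Ask YARN to obtain HDFS delegation tokens for Cluster B as well;
      // missing remote-cluster tokens is a common source of token errors here.
      .config("spark.yarn.access.hadoopFileSystems", "hdfs://clusterB-nn.example.com:8020")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SELECT count(*) FROM remote_db.remote_table").show()

The error logs are like this: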



18/10/16 20:33:55 INFO RMProxy: Connecting to ResourceManager at 
davinderrc15.c.ampool-141120.internal/10.128.15.198:8030
18/10/16 20:33:55 INFO YarnRMClient: Registering the ApplicationMaster
18/10/16 20:33:55 INFO YarnAllocator: Will request 2 executor container(s), 
each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
18/10/16 20:33:55 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: 
ApplicationMaster registered as 
NettyRpcEndpointRef(spark://YarnAM@10.128.15.198:38524)
18/10/16 20:33:55 INFO YarnAllocator: Submitted 2 unlocalized container 
requests.
18/10/16 20:33:55 INFO ApplicationMaster: Started progress reporter thread with 
(heartbeat : 3000, initial allocation : 200) intervals
18/10/16 20:33:56 INFO AMRMClientImpl: Received new token for : 
davinderrc15.c.ampool-141120.internal:45454
18/10/16 20:33:56 INFO YarnAllocator: Launching container 
container_e07_1539521606680_0045_02_02 on host 
davinderrc15.c.ampool-141120.internal for executor with ID 1
18/10/16 20:33:56 INFO YarnAllocator: Received 1 containers from YARN, 
launching executors on 1 of them.
18/10/16 20:33:56 INFO ContainerManagementProtocolProxy: 
yarn.client.max-cached-nodemanagers-proxies : 0
18/10/16 20:33:56 INFO ContainerManagementProtocolProxy: Opening proxy : 
davinderrc15.c.ampool-141120.internal:45454
18/10/16 20:33:57 INFO YarnAllocator: Launching container 
container_e07_1539521606680_0045_02_03 on host 
davinderrc15.c.ampool-141120.internal for executor with ID 2
18/10/16 20:33:57 INFO YarnAllocator: Received 1 containers from YARN, 
launching executors on 1 of them.
18/10/16 20:33:57 INFO ContainerManagementProtocolProxy: 
yarn.client.max-cached-nodemanagers-proxies : 0
18/10/16 20:33:57 INFO ContainerManagementProtocolProxy: Opening proxy : 
davinderrc15.c.ampool-141120.internal:45454
18/10/16 20:33:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered 
executor NettyRpcEndpointRef(spark-client://Executor) (10.128.15.198:39686) 
with ID 1
18/10/16 20:33:59 INFO BlockManagerMasterEndpoint: Registering block manager 
davinderrc15.c.ampool-141120.internal:36291 with 366.3 MB RAM, 
BlockManagerId(1, davinderrc15.c.ampool-141120.internal, 36291, None)
18/10/16 20:33:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered 
executor NettyRpcEndpointRef(spark-client://Executor) (10.128.15.198:39704) 
with ID 2
18/10/16 20:33:59 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
18/10/16 20:33:59 INFO YarnClusterScheduler: YarnClusterScheduler.postStartHook 
done
18/10/16 20:33:59 INFO SharedState: Setting hive.metastore.warehouse.dir 
('null') to the value of spark.sql.warehouse.dir 
('file:/hadoop/yarn/local/usercache/admadmin/appcache/application_1539521606680_0045/container_e07_1539521606680_0045_02_01/spark-warehouse').
18/10/16 20:33:59 INFO SharedState: Warehouse path is 
'file:/hadoop/yarn/local/usercache/admadmin/appcache/application_1539521606680_0045/container_e07_1539521606680_0045_02_01/spark-warehouse'.
18/10/16 20:33:59 INFO BlockManagerMasterEndpoint: Registering block manager 
davinderrc15.c.ampool-141120.internal:45507 with 366.3 MB RAM, 
BlockManagerId(2, davinderrc15.c.ampool-141120.internal, 45507, None)
18/10/16 20:34:00 INFO HiveUtils: Initializing HiveMetastoreConnection version 
1.2.1 using Spark classes.
18/10/16 20:34:00 INFO HiveClientImpl: Attempting to login to Kerberos using 
principal: admadmin/ad...@ampool.io and keytab: 
admadmin.keytab-a796621d-bacd-47e2-bd97-077090fe8aa8
18/10/16 20:34:00 INFO UserGroupInformation: Login successful for user 
admadmin/ad...@ampool.io using keytab file 
admadmin.keytab-a796621d-bacd-47e2-bd97-077090fe8aa8
18/10/16 20:34:01 INFO metastore: Trying to connect to metastore with URI 
thrift://10.128.0.39:9083
18/10/16 20:34:01 INFO metastore: Connected to metastore.
18/10/16 20:34:01 INFO SessionState: Created local directory: 
/hadoop/yarn/local/usercache/admadmin/appcache/application_1539521606680_0045/container_e07_1539521606680_0045_02_01/tmp/admadmin
18/10/16 20:34:01 INFO SessionState: Created local directory: 

Re: Starting to make changes for Spark 3 -- what can we delete?

2018-10-16 Thread Marcelo Vanzin
Might be good to take a look at things marked "@DeveloperApi" and
whether they should stay that way.

e.g. I was looking at SparkHadoopUtil and I've always wanted to just
make it private to Spark. I don't see why apps would need any of those
methods.
On Tue, Oct 16, 2018 at 10:18 AM Sean Owen  wrote:
>
> There was already agreement to delete deprecated things like Flume and
> Kafka 0.8 support in master. I've got several more on my radar, and
> wanted to highlight them and solicit general opinions on where we
> should accept breaking changes.
>
> For example how about removing accumulator v1?
> https://github.com/apache/spark/pull/22730
>
> Or using the standard Java Optional?
> https://github.com/apache/spark/pull/22383
>
> Or cleaning up some old workarounds and APIs while at it?
> https://github.com/apache/spark/pull/22729 (still in progress)
>
> I think I talked myself out of replacing Java function interfaces with
> java.util.function because...
> https://issues.apache.org/jira/browse/SPARK-25369
>
> There are also, say, old json and csv and avro reading method
> deprecated since 1.4. Remove?
> Anything deprecated since 2.0.0?
>
> Interested in general thoughts on these.
>
> Here are some more items targeted to 3.0:
> https://issues.apache.org/jira/browse/SPARK-17875?jql=project%3D%22SPARK%22%20AND%20%22Target%20Version%2Fs%22%3D%223.0.0%22%20ORDER%20BY%20priority%20ASC
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SparkSQL read Hive transactional table

2018-10-16 Thread daily
Hi,


Spark version: 2.3.0
Hive version: 2.1.0



Best regards.





------ Original Message ------
From: "Gourav Sengupta"
Date: 2018-10-16 (Tue) 6:35 PM
To: "daily"
Cc: "user"; "dev"
Subject: Re: SparkSQL read Hive transactional table



Hi,

can I please ask which version of Hive and Spark are you using?


Regards,
Gourav Sengupta


On Tue, Oct 16, 2018 at 2:42 AM daily  wrote:

 
Hi,

I use the HCatalog Streaming Mutation API to write data to a Hive transactional 
table, and then I use SparkSQL to read data from that table. I get the right 
result.
However, SparkSQL takes more time to read the Hive ORC bucketed transactional 
table, because it reads all columns (not only the columns involved in the SQL).
My question is: why does SparkSQL read all columns of the Hive ORC bucketed 
transactional table rather than only the columns involved in the SQL? Is it 
possible to make SparkSQL read only the columns involved in the SQL?
   
 
   
For example:

Hive tables:

create table dbtest.t_a1 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
partitioned by (sd string, st string) clustered by (t0) into 10 buckets stored
as orc TBLPROPERTIES ('transactional'='true');

create table dbtest.t_a2 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
partitioned by (sd string, st string) clustered by (t0) into 10 buckets stored
as orc TBLPROPERTIES ('transactional'='false');

SparkSQL:

select sum(t1), sum(t2) from dbtest.t_a1 group by t0;
select sum(t1), sum(t2) from dbtest.t_a2 group by t0;
   
SparkSQL's stage input size:

dbtest.t_a1 = 113.9 GB
dbtest.t_a2 = 96.5 MB
   
 
   
Best regards.
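
A quick way to check what Spark actually scans for the two tables above is to
inspect the physical plan (a minimal sketch, assuming a SparkSession named
spark; it diagnoses the full-column read but does not fix it):

    // Hedged sketch: compare the scan node's ReadSchema for both tables.
    // If the transactional table's scan lists every column instead of just
    // t0, t1 and t2, column pruning was not pushed down to that reader.
    spark.sql("select sum(t1), sum(t2) from dbtest.t_a1 group by t0").explain(true)
    spark.sql("select sum(t1), sum(t2) from dbtest.t_a2 group by t0").explain(true)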

Re: Hadoop 3 support

2018-10-16 Thread t4
Has anyone got Spark jars working with Hadoop 3.1 that they can share? I am
looking to use the latest hadoop-aws fixes from v3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Dongjoon Hyun
I also agree with Reynold and Xiao.

Although I love that new feature, the Spark 2.4 branch cut was made a long
time ago.

We cannot backport new features at this stage (RC4).

In addition, could you split the Apache SPARK issue IDs, Ilan? It's confusing
during the discussion.

 (1) [SPARK-23257][K8S] Kerberos Support for Spark on K8S (merged
yesterday for Apache Spark 3.0)
 (2) [SPARK-23257][K8S][TESTS] Kerberos Support Integration Tests (a
live PR with about *2000 lines. It's not a follow-up size.*)

For (1), it was merged yesterday. That means more people will start to try (1)
from today. We need more time to stabilize it.
For (2), it's still under review.

Both (1) and (2) look valid only for Spark 3.0.0.

Bests,
Dongjoon.



On Tue, Oct 16, 2018 at 1:32 PM Xiao Li  wrote:

> We need to strictly follow the backport and release policy. We can't merge
> such a new feature into an RC branch or a minor release (e.g., 2.4.1).
>
> Cheers,
>
> Xiao
>
> Bolke de Bruin  wrote on Tuesday, October 16, 2018 at 12:48 PM:
>
>> Chiming in here. We are in the same boat as Bloomberg.
>>
>> (But being a release manager often myself I understand the trade-off)
>>
>> B.
>>
>> On Tue, Oct 16, 2018 at 21:24, Ilan Filonenko  wrote:
>>
>>> On Erik's note, would SPARK-23257 be included in, say, 2.4.1? When would
>>> the next RC be? I would like to propose the inclusion of the Kerberos
>>> feature sooner rather than later as it would increase Spark-on-K8S adoption
>>> in production workloads while bringing greater feature parity with Yarn and
>>> Mesos. I would like to note that the feature itself is isolated from Core
>>> and isolated via the step-based architecture of the Kubernetes
>>> Driver/Executor builders.
>>>
>>> Furthermore, Spark users traditionally use HDFS for storage and in
>>> production use-cases these HDFS clusters would be kerberized. At Bloomberg,
>>> for example, all of the HDFS clusters are kerberized and for this reason,
>>> the only thing stopping our internal Data Science Platform from adopting
>>> Spark-on-K8S is this feature.
>>>
>>> On Tue, Oct 16, 2018 at 10:21 AM Erik Erlandson 
>>> wrote:
>>>

 SPARK-23257 merged more recently than I realized. If that isn't on
 branch-2.4 then the first question is how soon on the release sequence that
 can be adopted

 On Tue, Oct 16, 2018 at 9:33 AM Reynold Xin 
 wrote:

> We shouldn’t merge new features into release branches anymore.
>
> On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse 
> wrote:
>
>> Right now the Kerberos support for Spark on K8S is only on master
>> AFAICT i.e. the feature is not present on branch-2.4
>>
>>
>>
>> Therefore I don’t see any point in adding the tests into branch-2.4
>> unless the plan is to also merge the Kerberos support to branch-2.4
>>
>>
>>
>> Rob
>>
>>
>>
>> *From: *Erik Erlandson 
>> *Date: *Tuesday, 16 October 2018 at 16:47
>> *To: *dev 
>> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests
>> for Spark 2.4
>>
>>
>>
>> I'd like to propose including integration testing for Kerberos on the
>> Spark 2.4 release:
>>
>> https://github.com/apache/spark/pull/22608
>>
>>
>>
>> Arguments in favor:
>>
>> 1) it improves testing coverage on a feature important for
>> integrating with HDFS deployments
>>
>> 2) its intersection with existing code is small - it consists
>> primarily of new testing code, with a bit of refactoring into 'main' and
>> 'test' sub-trees. These new tests appear stable.
>>
>> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>>
>>
>>
>> The argument 'against' that I'm aware of would be the relatively
>> large size of the PR. I believe this is considered above, but am 
>> soliciting
>> community feedback before committing.
>>
>> Cheers,
>>
>> Erik
>>
>>
>>
>


Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Xiao Li
We need to strictly follow the backport and release policy. We can't merge
such a new feature into an RC branch or a minor release (e.g., 2.4.1).

Cheers,

Xiao

Bolke de Bruin  wrote on Tuesday, October 16, 2018 at 12:48 PM:

> Chiming in here. We are in the same boat as Bloomberg.
>
> (But being a release manager often myself I understand the trade-off)
>
> B.
>
> On Tue, Oct 16, 2018 at 21:24, Ilan Filonenko  wrote:
>
>> On Erik's note, would SPARK-23257 be included in, say, 2.4.1? When would
>> the next RC be? I would like to propose the inclusion of the Kerberos
>> feature sooner rather than later as it would increase Spark-on-K8S adoption
>> in production workloads while bringing greater feature parity with Yarn and
>> Mesos. I would like to note that the feature itself is isolated from Core
>> and isolated via the step-based architecture of the Kubernetes
>> Driver/Executor builders.
>>
>> Furthermore, Spark users traditionally use HDFS for storage and in
>> production use-cases these HDFS clusters would be kerberized. At Bloomberg,
>> for example, all of the HDFS clusters are kerberized and for this reason,
>> the only thing stopping our internal Data Science Platform from adopting
>> Spark-on-K8S is this feature.
>>
>> On Tue, Oct 16, 2018 at 10:21 AM Erik Erlandson 
>> wrote:
>>
>>>
>>> SPARK-23257 merged more recently than I realized. If that isn't on
>>> branch-2.4 then the first question is how soon on the release sequence that
>>> can be adopted
>>>
>>> On Tue, Oct 16, 2018 at 9:33 AM Reynold Xin  wrote:
>>>
 We shouldn’t merge new features into release branches anymore.

 On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse  wrote:

> Right now the Kerberos support for Spark on K8S is only on master
> AFAICT i.e. the feature is not present on branch-2.4
>
>
>
> Therefore I don’t see any point in adding the tests into branch-2.4
> unless the plan is to also merge the Kerberos support to branch-2.4
>
>
>
> Rob
>
>
>
> *From: *Erik Erlandson 
> *Date: *Tuesday, 16 October 2018 at 16:47
> *To: *dev 
> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests
> for Spark 2.4
>
>
>
> I'd like to propose including integration testing for Kerberos on the
> Spark 2.4 release:
>
> https://github.com/apache/spark/pull/22608
>
>
>
> Arguments in favor:
>
> 1) it improves testing coverage on a feature important for integrating
> with HDFS deployments
>
> 2) its intersection with existing code is small - it consists
> primarily of new testing code, with a bit of refactoring into 'main' and
> 'test' sub-trees. These new tests appear stable.
>
> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>
>
>
> The argument 'against' that I'm aware of would be the relatively large
> size of the PR. I believe this is considered above, but am soliciting
> community feedback before committing.
>
> Cheers,
>
> Erik
>
>
>



Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Bolke de Bruin
Chiming in here. We are in the same boat as Bloomberg.

(But being a release manager often myself I understand the trade-off)

B.

On Tue, Oct 16, 2018 at 21:24, Ilan Filonenko  wrote:

> On Erik's note, would SPARK-23257 be included in, say, 2.4.1? When would
> the next RC be? I would like to propose the inclusion of the Kerberos
> feature sooner rather than later as it would increase Spark-on-K8S adoption
> in production workloads while bringing greater feature parity with Yarn and
> Mesos. I would like to note that the feature itself is isolated from Core
> and isolated via the step-based architecture of the Kubernetes
> Driver/Executor builders.
>
> Furthermore, Spark users traditionally use HDFS for storage and in
> production use-cases these HDFS clusters would be kerberized. At Bloomberg,
> for example, all of the HDFS clusters are kerberized and for this reason,
> the only thing stopping our internal Data Science Platform from adopting
> Spark-on-K8S is this feature.
>
> On Tue, Oct 16, 2018 at 10:21 AM Erik Erlandson 
> wrote:
>
>>
>> SPARK-23257 merged more recently than I realized. If that isn't on
>> branch-2.4 then the first question is how soon on the release sequence that
>> can be adopted
>>
>> On Tue, Oct 16, 2018 at 9:33 AM Reynold Xin  wrote:
>>
>>> We shouldn’t merge new features into release branches anymore.
>>>
>>> On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse  wrote:
>>>
 Right now the Kerberos support for Spark on K8S is only on master
 AFAICT i.e. the feature is not present on branch-2.4



 Therefore I don’t see any point in adding the tests into branch-2.4
 unless the plan is to also merge the Kerberos support to branch-2.4



 Rob



 *From: *Erik Erlandson 
 *Date: *Tuesday, 16 October 2018 at 16:47
 *To: *dev 
 *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests
 for Spark 2.4



 I'd like to propose including integration testing for Kerberos on the
 Spark 2.4 release:

 https://github.com/apache/spark/pull/22608



 Arguments in favor:

 1) it improves testing coverage on a feature important for integrating
 with HDFS deployments

 2) its intersection with existing code is small - it consists primarily
 of new testing code, with a bit of refactoring into 'main' and 'test'
 sub-trees. These new tests appear stable.

 3) Spark 2.4 is still in RC, with outstanding correctness issues.



 The argument 'against' that I'm aware of would be the relatively large
 size of the PR. I believe this is considered above, but am soliciting
 community feedback before committing.

 Cheers,

 Erik



>>>


Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Ilan Filonenko
On Erik's note, would SPARK-23257 be included in, say, 2.4.1? When would
the next RC be? I would like to propose the inclusion of the Kerberos
feature sooner rather than later as it would increase Spark-on-K8S adoption
in production workloads while bringing greater feature parity with Yarn and
Mesos. I would like to note that the feature itself is isolated from Core
and isolated via the step-based architecture of the Kubernetes
Driver/Executor builders.

Furthermore, Spark users traditionally use HDFS for storage and in
production use-cases these HDFS clusters would be kerberized. At Bloomberg,
for example, all of the HDFS clusters are kerberized and for this reason,
the only thing stopping our internal Data Science Platform from adopting
Spark-on-K8S is this feature.

On Tue, Oct 16, 2018 at 10:21 AM Erik Erlandson  wrote:

>
> SPARK-23257 merged more recently than I realized. If that isn't on
> branch-2.4 then the first question is how soon on the release sequence that
> can be adopted
>
> On Tue, Oct 16, 2018 at 9:33 AM Reynold Xin  wrote:
>
>> We shouldn’t merge new features into release branches anymore.
>>
>> On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse  wrote:
>>
>>> Right now the Kerberos support for Spark on K8S is only on master AFAICT
>>> i.e. the feature is not present on branch-2.4
>>>
>>>
>>>
>>> Therefore I don’t see any point in adding the tests into branch-2.4
>>> unless the plan is to also merge the Kerberos support to branch-2.4
>>>
>>>
>>>
>>> Rob
>>>
>>>
>>>
>>> *From: *Erik Erlandson 
>>> *Date: *Tuesday, 16 October 2018 at 16:47
>>> *To: *dev 
>>> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests for
>>> Spark 2.4
>>>
>>>
>>>
>>> I'd like to propose including integration testing for Kerberos on the
>>> Spark 2.4 release:
>>>
>>> https://github.com/apache/spark/pull/22608
>>>
>>>
>>>
>>> Arguments in favor:
>>>
>>> 1) it improves testing coverage on a feature important for integrating
>>> with HDFS deployments
>>>
>>> 2) its intersection with existing code is small - it consists primarily
>>> of new testing code, with a bit of refactoring into 'main' and 'test'
>>> sub-trees. These new tests appear stable.
>>>
>>> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>>>
>>>
>>>
>>> The argument 'against' that I'm aware of would be the relatively large
>>> size of the PR. I believe this is considered above, but am soliciting
>>> community feedback before committing.
>>>
>>> Cheers,
>>>
>>> Erik
>>>
>>>
>>>
>>


Starting to make changes for Spark 3 -- what can we delete?

2018-10-16 Thread Sean Owen
There was already agreement to delete deprecated things like Flume and
Kafka 0.8 support in master. I've got several more on my radar, and
wanted to highlight them and solicit general opinions on where we
should accept breaking changes.

For example how about removing accumulator v1?
https://github.com/apache/spark/pull/22730

Or using the standard Java Optional?
https://github.com/apache/spark/pull/22383

Or cleaning up some old workarounds and APIs while at it?
https://github.com/apache/spark/pull/22729 (still in progress)

I think I talked myself out of replacing Java function interfaces with
java.util.function because...
https://issues.apache.org/jira/browse/SPARK-25369

There are also, say, old json and csv and avro reading method
deprecated since 1.4. Remove?
Anything deprecated since 2.0.0?

Interested in general thoughts on these.

Here are some more items targeted to 3.0:
https://issues.apache.org/jira/browse/SPARK-17875?jql=project%3D%22SPARK%22%20AND%20%22Target%20Version%2Fs%22%3D%223.0.0%22%20ORDER%20BY%20priority%20ASC

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Erik Erlandson
SPARK-23257 merged more recently than I realized. If that isn't on
branch-2.4 then the first question is how soon on the release sequence that
can be adopted

On Tue, Oct 16, 2018 at 9:33 AM Reynold Xin  wrote:

> We shouldn’t merge new features into release branches anymore.
>
> On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse  wrote:
>
>> Right now the Kerberos support for Spark on K8S is only on master AFAICT
>> i.e. the feature is not present on branch-2.4
>>
>>
>>
>> Therefore I don’t see any point in adding the tests into branch-2.4
>> unless the plan is to also merge the Kerberos support to branch-2.4
>>
>>
>>
>> Rob
>>
>>
>>
>> *From: *Erik Erlandson 
>> *Date: *Tuesday, 16 October 2018 at 16:47
>> *To: *dev 
>> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests for
>> Spark 2.4
>>
>>
>>
>> I'd like to propose including integration testing for Kerberos on the
>> Spark 2.4 release:
>>
>> https://github.com/apache/spark/pull/22608
>>
>>
>>
>> Arguments in favor:
>>
>> 1) it improves testing coverage on a feature important for integrating
>> with HDFS deployments
>>
>> 2) its intersection with existing code is small - it consists primarily
>> of new testing code, with a bit of refactoring into 'main' and 'test'
>> sub-trees. These new tests appear stable.
>>
>> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>>
>>
>>
>> The argument 'against' that I'm aware of would be the relatively large
>> size of the PR. I believe this is considered above, but am soliciting
>> community feedback before committing.
>>
>> Cheers,
>>
>> Erik
>>
>>
>>
>


Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Yinan Li
Yep, the Kerberos support for K8s is in master but not in branch-2.4. I
see no reason to get the integration tests, which depend on the feature in
master, into 2.4.

On Tue, Oct 16, 2018 at 9:32 AM Rob Vesse  wrote:

> Right now the Kerberos support for Spark on K8S is only on master AFAICT
> i.e. the feature is not present on branch-2.4
>
>
>
> Therefore I don’t see any point in adding the tests into branch-2.4 unless
> the plan is to also merge the Kerberos support to branch-2.4
>
>
>
> Rob
>
>
>
> *From: *Erik Erlandson 
> *Date: *Tuesday, 16 October 2018 at 16:47
> *To: *dev 
> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests for
> Spark 2.4
>
>
>
> I'd like to propose including integration testing for Kerberos on the
> Spark 2.4 release:
>
> https://github.com/apache/spark/pull/22608
>
>
>
> Arguments in favor:
>
> 1) it improves testing coverage on a feature important for integrating
> with HDFS deployments
>
> 2) its intersection with existing code is small - it consists primarily of
> new testing code, with a bit of refactoring into 'main' and 'test'
> sub-trees. These new tests appear stable.
>
> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>
>
>
> The argument 'against' that I'm aware of would be the relatively large
> size of the PR. I believe this is considered above, but am soliciting
> community feedback before committing.
>
> Cheers,
>
> Erik
>
>
>


Re: configure yarn to use more vcores than the node provides?

2018-10-16 Thread Peter Liu
Hi Khaled,

The 50-2-3g I mentioned below refers to the --conf spark.executor.* settings,
in particular spark.executor.instances=50, spark.executor.cores=2
and spark.executor.memory=3g (see the sketch below).
For each run, I configured the streaming producer and Kafka broker to have
the partitions aligned with the consumer side (in this case, 100 partitions),
viewable on the Spark UI.
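
A sketch of how that sizing maps onto a SparkConf (hedged: the 50/2/3g values
are the ones above; the commented 25x4 variant is an illustrative alternative
at the same total core count, for comparing spark.executor.cores settings):

    import org.apache.spark.SparkConf

    // 50 executors x 2 cores = 100 vcores requested, 3g heap each
    val conf = new SparkConf()
      .set("spark.executor.instances", "50")
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "3g")
    // Same total cores with fatter executors, for comparing latency:
    //   .set("spark.executor.instances", "25").set("spark.executor.cores", "4")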

The YARN scheduler does not seem to check the actual hardware threads (logical
cores) on the node. On the other hand, there seems to be some
mechanism in YARN to allow overcommit for certain jobs (with high or low
watermarks, etc.).

Not sure how this should work in good practice. Would appreciate any
hint from the experts here.

Thanks for your reply!

Peter / Gang

On Tue, Oct 16, 2018 at 3:43 AM Khaled Zaouk  wrote:

> Hi Peter,
>
> I actually meant the spark configuration that you put in your spark-submit
> program (such as --conf spark.executor.instances= ..., --conf
> spark.executor.memory= ..., etc...).
>
> I advise you to check the number of partitions that you get in each stage
> of your workload in the Spark GUI while the workload is running. I feel like
> this number is beyond 80, and this is why overcommitting cpu cores can
> achieve better latency if the workload is not cpu intensive.
>
> Another question, did you try different values for spark.executor.cores?
> (for example 3, 4 or 5 cores per executor in addition to 2?) Try to play a
> little bit with this parameter and check how it affects your latency...
>
> Best,
>
> Khaled
>
>
>
> On Tue, Oct 16, 2018 at 3:06 AM Peter Liu  wrote:
>
>> Hi Khaled,
>>
>> I have attached the spark streaming config below in (a).
>> In case of the 100vcore run (see the initial email), I used 50 executors
>> where each executor has 2 vcores and 3g memory. For 70 vcore case, 35
>> executors, for 80 vcore case, 40 executors.
>> In the yarn config (yarn-site.xml, (b) below), the available vcores are set
>> over 80 (I call it "overcommit").
>>
>> Not sure if there is a more proper way to do this (overcommit) and what
>> would be the best practice in this type of situation (say, light cpu
>> workload in a dedicated yarn cluster) to increase the cpu utilization for
>> better performance.
>>
>> Any help would be very much appreciated.
>>
>> Thanks ...
>>
>> Peter
>>
>> (a)
>>
>>val df = spark
>>   .readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", kafkaCluster.kafkaNodesString)
>>   .option("startingOffsets", "latest")
>>   .option("subscribe", Variables.EVENTS_TOPIC)
>>   .option("kafkaConsumer.pollTimeoutMs", "5000")
>>   .load()
>>   .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS
>> TIMESTAMP)").as[(String, Timestamp)]
>>   .select(from_json($"value", mySchema).as("data"), $"timestamp")
>>   .select("data.*", "timestamp")
>>   .where($"event_type" === "view")
>>   .select($"ad_id", $"event_time")
>>   .join(campaigns.toSeq.toDS().cache(), Seq("ad_id"))
>>   .groupBy(millisTime(window($"event_time", "10
>> seconds").getField("start")) as 'time_window, $"campaign_id")
>>   .agg(count("*") as 'count, max('event_time) as 'lastUpdate)
>>   .select(to_json(struct("*")) as 'value)
>>   .writeStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", kafkaCluster.kafkaNodesString)
>> //original
>>   .option("topic", Variables.OUTPUT_TOPIC)
>>   .option("checkpointLocation",
>> s"/tmp/${java.util.UUID.randomUUID()}") //TBD: ram disk?
>>   .outputMode("update")
>>   .start()
>>
>> (b)
>> <property>
>>   <name>yarn.nodemanager.resource.cpu-vcores</name>
>>   <value>110</value>
>> </property>
>> <property>
>>   <name>yarn.scheduler.maximum-allocation-vcores</name>
>>   <value>110</value>
>> </property>
>>
>> On Mon, Oct 15, 2018 at 4:26 PM Khaled Zaouk 
>> wrote:
>>
>>> Hi Peter,
>>>
>>> What parameters are you putting in your spark streaming configuration?
>>> What are you putting as number of executor instances and how many cores per
>>> executor are you setting in your Spark job?
>>>
>>> Best,
>>>
>>> Khaled
>>>
>>> On Mon, Oct 15, 2018 at 9:18 PM Peter Liu  wrote:
>>>
 Hi there,

 I have a system with 80 vcores and a relatively light spark streaming
workload. Overcommitting the vcore resource (i.e. > 80) in the config (see (a)
 below) seems to help to improve the average spark batch time (see (b)
 below).

 Is there any best practice guideline on resource overcommit with cpu /
 vcores, such as yarn config options, candidate cases ideal for
overcommitting vcores, etc.?

 the slide below (from 2016 though) seems to address the memory
overcommit topic and hint at a "future" topic on cpu overcommit:

 https://www.slideshare.net/HadoopSummit/investing-the-effects-of-overcommitting-yarn-resources
 

Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Reynold Xin
We shouldn’t merge new features into release branches anymore.

On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse  wrote:

> Right now the Kerberos support for Spark on K8S is only on master AFAICT
> i.e. the feature is not present on branch-2.4
>
>
>
> Therefore I don’t see any point in adding the tests into branch-2.4 unless
> the plan is to also merge the Kerberos support to branch-2.4
>
>
>
> Rob
>
>
>
> *From: *Erik Erlandson 
> *Date: *Tuesday, 16 October 2018 at 16:47
> *To: *dev 
> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests for
> Spark 2.4
>
>
>
> I'd like to propose including integration testing for Kerberos on the
> Spark 2.4 release:
>
> https://github.com/apache/spark/pull/22608
>
>
>
> Arguments in favor:
>
> 1) it improves testing coverage on a feature important for integrating
> with HDFS deployments
>
> 2) its intersection with existing code is small - it consists primarily of
> new testing code, with a bit of refactoring into 'main' and 'test'
> sub-trees. These new tests appear stable.
>
> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>
>
>
> The argument 'against' that I'm aware of would be the relatively large
> size of the PR. I believe this is considered above, but am soliciting
> community feedback before committing.
>
> Cheers,
>
> Erik
>
>
>


Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Rob Vesse
Right now the Kerberos support for Spark on K8S is only on master AFAICT i.e. 
the feature is not present on branch-2.4 

 

Therefore I don’t see any point in adding the tests into branch-2.4 unless the 
plan is to also merge the Kerberos support to branch-2.4

 

Rob

 

From: Erik Erlandson 
Date: Tuesday, 16 October 2018 at 16:47
To: dev 
Subject: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

 

I'd like to propose including integration testing for Kerberos on the Spark 2.4 
release:

https://github.com/apache/spark/pull/22608

 

Arguments in favor:

1) it improves testing coverage on a feature important for integrating with 
HDFS deployments

2) its intersection with existing code is small - it consists primarily of new 
testing code, with a bit of refactoring into 'main' and 'test' sub-trees. These 
new tests appear stable.

3) Spark 2.4 is still in RC, with outstanding correctness issues.

 

The argument 'against' that I'm aware of would be the relatively large size of 
the PR. I believe this is considered above, but am soliciting community 
feedback before committing.

Cheers,

Erik

 



Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Felix Cheung
I’m in favor of it. If you check the PR, it’s a few isolated script changes and 
all test-only changes. It should have low impact on the release but give much 
better integration test coverage.



From: Erik Erlandson 
Sent: Tuesday, October 16, 2018 8:20 AM
To: dev
Subject: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

I'd like to propose including integration testing for Kerberos on the Spark 2.4 
release:
https://github.com/apache/spark/pull/22608

Arguments in favor:
1) it improves testing coverage on a feature important for integrating with 
HDFS deployments
2) its intersection with existing code is small - it consists primarily of new 
testing code, with a bit of refactoring into 'main' and 'test' sub-trees. These 
new tests appear stable.
3) Spark 2.4 is still in RC, with outstanding correctness issues.

The argument 'against' that I'm aware of would be the relatively large size of 
the PR. I believe this is considered above, but am soliciting community 
feedback before committing.
Cheers,
Erik



[DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Erik Erlandson
I'd like to propose including integration testing for Kerberos on the Spark
2.4 release:
https://github.com/apache/spark/pull/22608

Arguments in favor:
1) it improves testing coverage on a feature important for integrating with
HDFS deployments
2) its intersection with existing code is small - it consists primarily of
new testing code, with a bit of refactoring into 'main' and 'test'
sub-trees. These new tests appear stable.
3) Spark 2.4 is still in RC, with outstanding correctness issues.

The argument 'against' that I'm aware of would be the relatively large size
of the PR. I believe this is considered above, but am soliciting community
feedback before committing.
Cheers,
Erik


HADOOP-13421 anyone using this with Spark

2018-10-16 Thread t4
https://issues.apache.org/jira/browse/HADOOP-13421 mentions s3a can use the S3 v2
API for performance. Has anyone been able to use this new hadoop-aws jar
with Spark?
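
If a suitable hadoop-aws build is on the classpath, the listing API version
should be controllable through the s3a configuration (a hedged sketch:
fs.s3a.list.version is the setting HADOOP-13421 introduces, and the bucket
path is an illustrative assumption):

    // Minimal sketch, assuming "spark" is an existing SparkSession and a
    // hadoop-aws jar carrying HADOOP-13421 is on the classpath.
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.list.version", "2")
    spark.read.text("s3a://some-bucket/some/prefix/").count()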



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Timestamp Difference/operations

2018-10-16 Thread Paras Agarwal
Thanks Srabasti,


I am trying to convert Teradata SQL to Spark SQL.


TERADATA:
select * from Table1 where Date '1974-01-02' > CAST(birth_date AS TIMESTAMP(0)) 
+ (TIME '12:34:34' - TIME '00:00:00' HOUR TO SECOND);

HIVE (with some tweaks I can write):
SELECT * FROM foodmart.trimmed_employee WHERE Date '1974-01-02' > 
CAST(CAST(CURRENT_DATE AS TIMESTAMP) + (CAST('2000-01-01 12:34:34' AS 
TIMESTAMP) - (CAST('2000-01-01 00:00:00' AS TIMESTAMP))) AS DATE)

SPARK (so I need the Spark equivalent):

SELECT * FROM foodmart.trimmed_employee WHERE Date '1974-01-02' > 
CAST(CAST(CURRENT_DATE AS TIMESTAMP) + (??) AS DATE)


I need to fill in the ?? above so that I can proceed.


Thanks & Regards,

Paras

9130006036
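
For what it's worth, one candidate for the missing term above is an interval
literal (a hedged sketch: Spark 2.x accepts multi-unit interval literals, and
unix_timestamp covers a general timestamp difference in seconds; table and
column names follow the example above, with a SparkSession named spark):

    // Hedged sketch: the interval stands in for the
    // TIME '12:34:34' - TIME '00:00:00' term in the Teradata query.
    spark.sql("""
      SELECT * FROM foodmart.trimmed_employee
      WHERE DATE '1974-01-02' >
        CAST(CAST(CURRENT_DATE AS TIMESTAMP)
             + INTERVAL 12 HOURS 34 MINUTES 34 SECONDS AS DATE)
    """).show()

    // A general timestamp difference in seconds, since "-" on two
    // timestamps is not supported here:
    spark.sql("""
      SELECT unix_timestamp(CAST('2000-01-01 12:34:34' AS TIMESTAMP))
           - unix_timestamp(CAST('2000-01-01 00:00:00' AS TIMESTAMP)) AS diff_seconds
    """).show()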


From: Srabasti Banerjee 
Sent: Tuesday, October 16, 2018 6:45:26 AM
To: Paras Agarwal; John Zhuge
Cc: user; dev
Subject: Re: Timestamp Difference/operations

Hi Paras,
Check out the link "Spark Scala: DateDiff of two columns by hour or minute"
(a question about getting the minute or hour difference of two timestamp
columns in a dataframe).
Looks like you can get the difference in seconds as well.
Hopefully this helps!
Are you looking for a specific use case? Can you please elaborate with an
example?

Thanks
Srabasti Banerjee


Sent from Yahoo Mail on Android

On Sun, Oct 14, 2018 at 23:41, Paras Agarwal
 wrote:

Thanks John,


Actually I need the full date and time difference, not just the date difference,

which I guess is not supported.


Let me know if it's possible, or if any UDF is available for the same.


Thanks And Regards,

Paras


From: John Zhuge 
Sent: Friday, October 12, 2018 9:48:47 PM
To: Paras Agarwal
Cc: user; dev
Subject: Re: Timestamp Difference/operations

Yeah, the operator "-" does not seem to be supported; however, you can use the
"datediff" function:

In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), 
CAST('2000-01-01 00:00:00' AS TIMESTAMP))
Out[9]:
+-----------------------------------------------------------------------------------------------------------------------+
| datediff(CAST(CAST(2000-02-01 12:34:34 AS TIMESTAMP) AS DATE), CAST(CAST(2000-01-01 00:00:00 AS TIMESTAMP) AS DATE))   |
+-----------------------------------------------------------------------------------------------------------------------+
| 31                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------+

In [10]: select datediff('2000-02-01 12:34:34', '2000-01-01 00:00:00')
Out[10]:
+---------------------------------------------------------------------------------+
| datediff(CAST(2000-02-01 12:34:34 AS DATE), CAST(2000-01-01 00:00:00 AS DATE))  |
+---------------------------------------------------------------------------------+
| 31                                                                              |
+---------------------------------------------------------------------------------+

In [11]: select datediff(timestamp '2000-02-01 12:34:34', timestamp '2000-01-01 
00:00:00')
Out[11]:
+-----------------------------------------------------------------------------------------------------------------+
| datediff(CAST(TIMESTAMP('2000-02-01 12:34:34.0') AS DATE), CAST(TIMESTAMP('2000-01-01 00:00:00.0') AS DATE))     |
+-----------------------------------------------------------------------------------------------------------------+
| 31                                                                                                               |
+-----------------------------------------------------------------------------------------------------------------+

On Fri, Oct 12, 2018 at 7:01 AM Paras Agarwal  wrote:

Hello Spark Community,

Currently in Hive we can do operations on timestamps, like:
CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS 
TIMESTAMP)

It seems this is not supported in Spark.
Is there any way available?

Kindly provide some insight on this.


Paras
9130006036


--
John


Re: overcommit: cpus / vcores

2018-10-16 Thread Khaled Zaouk
Hi Peter,

I actually meant the spark configuration that you put in your spark-submit
program (such as --conf spark.executor.instances= ..., --conf
spark.executor.memory= ..., etc...).

I advise you to check the number of partitions that you get in each stage
of your workload in the Spark GUI while the workload is running. I feel like
this number is beyond 80, and this is why overcommitting cpu cores can
achieve better latency if the workload is not cpu intensive.

Another question, did you try different values for spark.executor.cores?
(for example 3, 4 or 5 cores per executor in addition to 2?) Try to play a
little bit with this parameter and check how it affects your latency...

Best,

Khaled



On Tue, Oct 16, 2018 at 3:06 AM Peter Liu  wrote:

> Hi Khaled,
>
> I have attached the spark streaming config below in (a).
> In case of the 100vcore run (see the initial email), I used 50 executors
> where each executor has 2 vcores and 3g memory. For 70 vcore case, 35
> executors, for 80 vcore case, 40 executors.
> In the yarn config (yarn-site.xml, (b) below), the available vcores are set
> over 80 (I call it "overcommit").
>
> Not sure if there is a more proper way to do this (overcommit) and what
> would be the best practice in this type of situation (say, light cpu
> workload in a dedicated yarn cluster) to increase the cpu utilization for
> better performance.
>
> Any help would be very much appreciated.
>
> Thanks ...
>
> Peter
>
> (a)
>
>val df = spark
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", kafkaCluster.kafkaNodesString)
>   .option("startingOffsets", "latest")
>   .option("subscribe", Variables.EVENTS_TOPIC)
>   .option("kafkaConsumer.pollTimeoutMs", "5000")
>   .load()
>   .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS
> TIMESTAMP)").as[(String, Timestamp)]
>   .select(from_json($"value", mySchema).as("data"), $"timestamp")
>   .select("data.*", "timestamp")
>   .where($"event_type" === "view")
>   .select($"ad_id", $"event_time")
>   .join(campaigns.toSeq.toDS().cache(), Seq("ad_id"))
>   .groupBy(millisTime(window($"event_time", "10
> seconds").getField("start")) as 'time_window, $"campaign_id")
>   .agg(count("*") as 'count, max('event_time) as 'lastUpdate)
>   .select(to_json(struct("*")) as 'value)
>   .writeStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", kafkaCluster.kafkaNodesString)
> //original
>   .option("topic", Variables.OUTPUT_TOPIC)
>   .option("checkpointLocation",
> s"/tmp/${java.util.UUID.randomUUID()}") //TBD: ram disk?
>   .outputMode("update")
>   .start()
>
> (b)
> <property>
>   <name>yarn.nodemanager.resource.cpu-vcores</name>
>   <value>110</value>
> </property>
> <property>
>   <name>yarn.scheduler.maximum-allocation-vcores</name>
>   <value>110</value>
> </property>
>
> On Mon, Oct 15, 2018 at 4:26 PM Khaled Zaouk  wrote:
>
>> Hi Peter,
>>
>> What parameters are you putting in your spark streaming configuration?
>> What are you putting as number of executor instances and how many cores per
>> executor are you setting in your Spark job?
>>
>> Best,
>>
>> Khaled
>>
>> On Mon, Oct 15, 2018 at 9:18 PM Peter Liu  wrote:
>>
>>> Hi there,
>>>
>>> I have a system with 80 vcores and a relatively light spark streaming
>>> workload. Overcommitting the vcore resource (i.e. > 80) in the config (see (a)
>>> below) seems to help to improve the average spark batch time (see (b)
>>> below).
>>>
>>> Is there any best practice guideline on resource overcommit with cpu /
>>> vcores, such as yarn config options, candidate cases ideal for
>>> overcommitting vcores, etc.?
>>>
>>> the slide below (from 2016 though) seems to address the memory
>>> overcommit topic and hint at a "future" topic on cpu overcommit:
>>>
>>> https://www.slideshare.net/HadoopSummit/investing-the-effects-of-overcommitting-yarn-resources
>>> 
>>>
>>> Would like to know if this is a reasonable config practice and why this
>>> is not achievable without overcommit. Any help/hint would be very much
>>> appreciated!
>>>
>>> Thanks!
>>>
>>> Peter
>>>
>>> (a) yarn-site.xml
>>> <property>
>>>   <name>yarn.nodemanager.resource.cpu-vcores</name>
>>>   <value>110</value>
>>> </property>
>>>
>>> <property>
>>>   <name>yarn.scheduler.maximum-allocation-vcores</name>
>>>   <value>110</value>
>>> </property>
>>>
>>>
>>> (b)
>>> FYI:
>>> I have a system with 80 vcores and a relatively light spark streaming
>>> workload. Overcommitting the vcore resource (here 100) seems to help the
>>> average spark batch time; I need more understanding of this practice.
>>> Skylake (1 x 900K msg/sec) | total batch# (avg) | avg batch time in ms (avg) | avg user cpu (%) | nw read (mb/sec)
>>> 70 vcores                  | 178.20             | 8154.69                    | n/a              | n/a
>>> 80 vcores                  | 177.40             | 7865.44                    | 27.85            | 222.31
>>> 100 vcores                 | 177.00             | 7209.37                    | 30.02            | 220.86
>>>
>>>