Re: spark-submit parameters about two keytab files to yarn and kafka

2020-11-01 Thread kevin chen
Hi,

Hope the following method can solve the issue:

*step 1:*
Create a Kafka Kerberos JAAS config named kafka_client_jaas.conf:

KafkaClient {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   keyTab="./kafka.service.keytab"
   storeKey=true
   useTicketCache=false
   serviceName="kafka"
   principal="kafka/x...@example.com";
};


*step 2:*
spark-submit command:

/usr/local/spark/bin/spark-submit \
  --files ./kafka_client_jaas.conf,./kafka.service.keytab \
  --driver-java-options "-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
  --conf spark.yarn.keytab=./hadoop.service.keytab \
  --conf spark.yarn.principal=hadoop/EXAMPLE.COM \
  ...

*step 3:*

Change security.protocol in the Kafka client config to SASL_PLAINTEXT if your
Spark version is 1.6.
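
For illustration, here is a minimal PySpark sketch (my assumption, not from the
original setup) of where these Kafka client settings end up when reading with
the Structured Streaming Kafka source (spark-sql-kafka-0-10, Spark 2.0+); the
broker address and topic name are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kerberized-kafka-read").getOrCreate()

# options prefixed with "kafka." are forwarded to the underlying Kafka consumer
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1.example.com:9092")  # placeholder
      .option("subscribe", "my_topic")                                # placeholder
      .option("kafka.security.protocol", "SASL_PLAINTEXT")
      .option("kafka.sasl.kerberos.service.name", "kafka")  # matches serviceName in the JAAS file
      .load())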


*note:*
my test env: Spark 2.0.2, Kafka 0.10

references
1. using-spark-streaming



-- 

Best,

Kevin Pis

Gabor Somogyi wrote on Wed, Oct 28, 2020 at 5:25 PM:

> Hi,
>
> Cross-realm trust must be configured. One can find several docs on how to
> do that.
>
> BR,
> G
>
>
> On Wed, Oct 28, 2020 at 8:21 AM big data  wrote:
>
>> Hi,
>>
>> We want to submit a Spark Streaming job to YARN and consume a Kafka topic.
>>
>> YARN and Kafka are in two different clusters, and each cluster has its own
>> Kerberos authentication.
>>
>> We have two keytab files for YARN and Kafka.
>>
>> And my question is: how do we add parameters to the spark-submit command for
>> this situation?
>>
>> Thanks.
>>
>>




Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-11-01 Thread kevin chen
Perhaps adding random numbers (a salt) to the entity_id column can help avoid
the errors (exhausting executor and driver memory) when you solve the issue
Patrick's way; see the sketch below.
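
For what it's worth, a minimal PySpark sketch of the aggregate + UDF approach
Patrick suggests, with the salting idea added. The ~130 million vector size,
the number of salt buckets, and names like to_sparse are assumptions for
illustration; the vector is built directly as a SparseVector, so no pivot or
VectorAssembler is needed:

from pyspark.sql import functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

num_attributes = 130000000   # ~130M attribute_ids; assumes 0-based ids < this size
num_salt_buckets = 32        # assumption: tune to the skew of entity_id

# 1) salt: split each entity_id's rows into smaller partial groups
salted = long_frame.withColumn(
    "salt", (F.rand() * num_salt_buckets).cast("int"))

# 2) aggregate (attribute_id, event_count) pairs per (entity_id, salt),
#    then collect the partial lists per entity_id
partial = (salted
    .groupBy("entity_id", "salt")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs")))
merged = (partial
    .groupBy("entity_id")
    .agg(F.collect_list("pairs").alias("pair_lists")))

# 3) UDF that assembles a SparseVector from the collected pairs
def to_sparse(pair_lists):
    values = {}
    for pairs in pair_lists:
        for p in pairs:
            idx = int(p["attribute_id"])
            values[idx] = values.get(idx, 0.0) + float(p["event_count"] or 0)
    return SparseVector(num_attributes, values)

to_sparse_udf = F.udf(to_sparse, VectorUDT())

vectors = merged.select("entity_id", to_sparse_udf("pair_lists").alias("features"))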

Daniel Chalef wrote on Sat, Oct 31, 2020 at 12:42 AM:

> Yes, the resulting matrix would be sparse. Thanks for the suggestion. Will
> explore ways of doing this using an agg and UDF.
>
> On Fri, Oct 30, 2020 at 6:26 AM Patrick McCarthy
>  wrote:
>
>> That's a very large vector. Is it sparse? Perhaps you'd have better luck
>> performing an aggregate instead of a pivot, and assembling the vector using
>> a UDF.
>>
>> On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef
>>  wrote:
>>
>>> Hello,
>>>
>>> I have a very large long-format dataframe (several billion rows) that
>>> I'd like to pivot and vectorize (using the VectorAssembler), with the aim
>>> to reduce dimensionality using something akin to TF-IDF. Once pivoted, the
>>> dataframe will have ~130 million columns.
>>>
>>> The source, long-format schema looks as follows:
>>>
>>> root
>>>  |-- entity_id: long (nullable = false)
>>>  |-- attribute_id: long (nullable = false)
>>>  |-- event_count: integer (nullable = true)
>>>
>>> Pivoting as per the following fails, exhausting executor and driver
>>> memory. I am unsure whether increasing memory limits would be successful
>>> here as my sense is that pivoting and then using a VectorAssembler isn't
>>> the right approach to solving this problem.
>>>
>>> wide_frame = (
>>>     long_frame.groupBy("entity_id")
>>>     .pivot("attribute_id")
>>>     .agg(F.first("event_count"))
>>> )
>>>
>>> Are there other Spark patterns that I should attempt in order to achieve
>>> my end goal of a vector of attributes for every entity?
>>>
>>> Thanks, Daniel
>>>
>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>