Re: Spark standalone - reading kerberos hdfs

2021-01-24 Thread jelmer
The only way I ever got this to work with Spark standalone was via WebHDFS.

See
https://issues.apache.org/jira/browse/SPARK-5158?focusedCommentId=16516856&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16516856
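
As a rough illustration of the WebHDFS route, the sketch below only rewrites an hdfs:// path into its webhdfs:// form. The helper name `toWebHdfs` is hypothetical, and the HTTP port is an assumption: 9870 is the Hadoop 3 default for the NameNode web endpoint and 50070 the Hadoop 2 default, so check dfs.namenode.http-address on the actual cluster. The rest of the setup is in the linked comment and is not reproduced here.

```scala
import java.net.URI

object WebHdfsUri {
  // Hypothetical helper: rewrites an hdfs:// URI to webhdfs://, replacing the
  // RPC port with the NameNode HTTP port supplied by the caller.
  // httpPort is an assumption about the cluster (9870 on Hadoop 3 by default,
  // 50070 on Hadoop 2 by default).
  def toWebHdfs(hdfsUri: String, httpPort: Int): String = {
    val u = new URI(hdfsUri)
    require(u.getScheme == "hdfs", s"expected an hdfs:// URI, got: $hdfsUri")
    new URI("webhdfs", null, u.getHost, httpPort, u.getPath, null, null).toString
  }
}
```

With the path from the original question this would be used as spark.read.parquet(WebHdfsUri.toWebHdfs("hdfs://namenode:8020/test/parquet/", 9870)).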

On Fri, 8 Jan 2021 at 18:49, Sudhir Babu Pothineni wrote:

> I spun up a Spark standalone cluster (spark.authenticate=false) and submitted
> a job that reads from a remote kerberized HDFS:
>
> val spark = SparkSession.builder()
>   .master("spark://spark-standalone:7077")
>   .getOrCreate()
>
> UserGroupInformation.loginUserFromKeytab(principal, keytab)
> val df = spark.read.parquet("hdfs://namenode:8020/test/parquet/")
>
> It ran into the following exception:
>
> Caused by:
> java.io.IOException: java.io.IOException: Failed on local exception:
> java.io.IOException: org.apache.hadoop.security.AccessControlException:
> Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host
> is: "..."; destination host is: "...":10346;
>
>
> Any suggestions?
>
> Thanks
> Sudhir
>




Re: Using same rdd from two threads

2021-01-24 Thread jelmer
Well it is now...

The RDD had a repartition call on it.

When I removed the repartition call, it worked.
When I kept the repartition but called rdd.partitions.length on the RDD
first, it also worked!

I looked into the partitions method, and some instance variables get
initialized in it, so saying RDDs are immutable is only true on a "logical" level.
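
To make the "logically immutable, internally mutable" point concrete, here is a simplified stand-in. The class `LazyPartitions` is hypothetical and is not Spark's RDD code; it only mimics the pattern of a partitions accessor that fills in a cached field on first use without any synchronization.

```scala
// Hypothetical stand-in for an RDD-like object; NOT Spark's actual class.
final class LazyPartitions(compute: () => Array[Int]) {
  // Mutable cache hidden behind an immutable-looking API.
  private var cached: Array[Int] = null
  // Counts how often the partitions were actually computed.
  var computeCalls: Int = 0

  // Unsynchronized check-then-act: if two threads call this at the same time
  // while cached is still null, both may run compute() and publish different
  // arrays. A single-threaded first call avoids that window.
  def partitions: Array[Int] = {
    if (cached == null) {
      computeCalls += 1
      cached = compute()
    }
    cached
  }
}

object LazyPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val rdd = new LazyPartitions(() => Array(0, 1, 2, 3))
    val first = rdd.partitions   // first call mutates internal state
    val second = rdd.partitions  // later calls only read the cache
    println(s"same instance: ${first eq second}, computeCalls: ${rdd.computeCalls}")
  }
}
```

Once the first call has completed on one thread, later callers only read the cached array, which is consistent with the observation that touching rdd.partitions.length up front made the hang go away.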

It seems I ran into https://issues.apache.org/jira/browse/SPARK-28917

And it looks like this change fixed it

https://github.com/apache/spark/blame/485145326a9c97ede260b0e267ee116f182cfd56/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L298

But since we're using an old version, that does not really help.
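
For older versions, the workaround described above (calling rdd.partitions.length before sharing the RDD) can be sketched roughly as follows. The object name, the local[2] master, the sample data and the two actions are all assumptions made for illustration; only the repartition call and the early rdd.partitions.length call come from this thread.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object SharedRddWorkaround {
  // Runs two actions on the same repartitioned RDD from two threads and
  // returns (count, sum). The data and actions are illustrative only.
  def run(spark: SparkSession): (Long, Long) = {
    val rdd = spark.sparkContext.parallelize(1 to 1000).repartition(4)

    // Workaround from this thread: touch partitions on the calling thread so
    // the lazily initialized state is populated before other threads use it.
    rdd.partitions.length

    val count = Future(rdd.count())
    val total = Future(rdd.map(_.toLong).reduce(_ + _))
    (Await.result(count, 1.minute), Await.result(total, 1.minute))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shared-rdd-workaround")
      .getOrCreate()
    try println(run(spark))
    finally spark.stop()
  }
}
```

This is a mitigation rather than a fix; upgrading to a release that contains the SPARK-28917 change is the proper solution where that is possible.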


On Fri, 22 Jan 2021 at 15:34, Sean Owen wrote:

> RDDs are immutable, and Spark itself is thread-safe. This should be fine.
> Something else is going on in your code.
>
> On Fri, Jan 22, 2021 at 7:59 AM jelmer wrote:
>
>> Hi,
>>
>> I have a piece of code in which an RDD is created from a main method.
>> It then does work on this RDD from two different threads running in
>> parallel.
>>
>> When running this code as part of a test with a local master, it will
>> sometimes make Spark hang (one task never completes).
>>
>> If I make a copy of the RDD, the job completes fine.
>>
>> I suspect it's a bad idea to use the same RDD from two threads, but I
>> could not find any documentation on the subject.
>>
>> Should it be possible to do this, and if not, can anyone point me to
>> documentation pointing out that this is not on the table?
>>
>> --jelmer
>>
>