Re: Spark standalone - reading kerberos hdfs

2021-01-24 Thread jelmer
The only way I ever got it to work with Spark standalone is via WebHDFS. See https://issues.apache.org/jira/browse/SPARK-5158?focusedCommentId=16516856&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16516856 On Fri, 8 Jan 2021 at 18:49, Sudhir Babu Pothineni wrote:
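A minimal sketch of the WebHDFS approach mentioned above. The master URL, namenode host, port, and file path are all placeholder assumptions; the point is only that a `webhdfs://` URI goes over HTTP(S) rather than the Kerberos-authenticated HDFS RPC protocol that Spark standalone cannot negotiate.

```scala
import org.apache.spark.sql.SparkSession

object WebHdfsRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("spark://master-host:7077") // placeholder standalone master
      .appName("webhdfs-read")
      .getOrCreate()

    // WebHDFS reads over HTTP(S), sidestepping the Kerberos RPC handshake;
    // host, port (9870 is the default HTTP namenode port) and path are
    // assumptions for illustration.
    val df = spark.read.text("webhdfs://namenode-host:9870/user/me/data.txt")
    df.show()

    spark.stop()
  }
}
```

This requires a running Spark standalone cluster and a reachable namenode, so it is a sketch rather than something runnable in isolation.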

Re: Using same rdd from two threads

2021-01-24 Thread jelmer
/spark/rdd/RDD.scala#L298 But since we're using an old version that does not really help On Fri, 22 Jan 2021 at 15:34, Sean Owen wrote: > RDDs are immutable, and Spark itself is thread-safe. This should be fine. > Something else is going on in your code. > > On Fri, Jan 22, 2021 at

Using same rdd from two threads

2021-01-22 Thread jelmer
a copy of the rdd the job will complete fine. I suspect it's a bad idea to use the same rdd from two threads, but I could not find any documentation on the subject. Should it be possible to do this, and if not, can anyone point me to documentation pointing out that this is not on the table --jelmer
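The pattern being asked about can be sketched as follows: two jobs submitted concurrently against the same cached RDD, one per thread. As the reply above notes, RDDs are immutable and job submission through the `SparkContext` is thread-safe, so on current Spark versions this is expected to work; the data and actions here are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object SharedRddFromTwoThreads {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-rdd").getOrCreate()

    // One RDD, cached so both jobs reuse the same computed partitions.
    val rdd = spark.sparkContext.parallelize(1 to 1000000).cache()

    // Two actions submitted from two threads against the same RDD.
    val sum   = Future(rdd.sum())
    val count = Future(rdd.count())

    println(Await.result(sum, Duration.Inf))
    println(Await.result(count, Duration.Inf))

    spark.stop()
  }
}
```

Requires a Spark runtime on the classpath, so this is a sketch of the pattern rather than a standalone program.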

Cleanup hook for temporary files produced as part of a spark job

2020-05-24 Thread jelmer
I am writing something that partitions a data set and then trains a machine learning model on the data in each partition. The resulting model is very big and right now I am storing it in an rdd as a pair of: partition_id and very_big_model_that_is_hundreds_of_megabytes_big but it is becoming
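One way to get a cleanup hook for job-scoped temporary files is a `SparkListener` whose `onApplicationEnd` callback deletes a staging directory on the driver. The staging directory and its use are hypothetical here; only the listener registration is the actual Spark API.

```scala
import java.nio.file.{Files, Path}
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}
import org.apache.spark.sql.SparkSession

object TempCleanup {
  def main(args: Array[String]): Unit = {
    val spark  = SparkSession.builder().appName("cleanup-hook").getOrCreate()
    val tmpDir = Files.createTempDirectory("model-staging") // hypothetical staging dir

    spark.sparkContext.addSparkListener(new SparkListener {
      // Runs once on the driver when the application shuts down.
      override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
        Files.walk(tmpDir)
          .sorted(java.util.Comparator.reverseOrder()) // children before parents
          .forEach((p: Path) => { Files.deleteIfExists(p); () })
      }
    })

    // ... job that writes temporary model files under tmpDir ...
    spark.stop()
  }
}
```

Note this only covers driver-local files; temporary files written on executors would need their own cleanup, e.g. at the end of each `mapPartitions` call.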

Re: Request more yarn vcores than executors

2019-12-08 Thread jelmer
tasks per executor. > > On Sun, 8 Dec 2019, 8:16 pm jelmer, wrote: > >> I have a job, running on yarn, that uses multithreading inside of a >> mapPartitions transformation >> >> Ideally I would like to have a small number of partitions but have a high >> number of yar

Request more yarn vcores than executors

2019-12-08 Thread jelmer
I have a job, running on yarn, that uses multithreading inside of a mapPartitions transformation. Ideally I would like to have a small number of partitions but have a high number of yarn vcores allocated to the task (that I can take advantage of because of multithreading). Is this possible? I
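The standard knob for this is `spark.task.cpus`: setting it equal to `--executor-cores` reserves a whole executor's vcores for a single task, which can then spawn that many threads inside `mapPartitions`. The concrete numbers below are illustrative, not a recommendation.

```shell
# Illustrative spark-submit flags (values are assumptions):
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 8 \
  --conf spark.task.cpus=8 \
  my-job.jar
# With spark.task.cpus equal to executor-cores, each executor schedules a
# single task at a time, and that task can legitimately use 8 threads.
```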

Any way to make catalyst optimise away join

2019-11-29 Thread jelmer
I have 2 dataframes, let's call them A and B. A is made up out of [unique_id, field1], B is made up out of [unique_id, field2]. They have the exact same number of rows, and every id in A is also present in B. If I execute a join like this A.join(B, Seq("unique_id")).select($"unique_id", $"field1")
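The situation can be reproduced with a small sketch (data is illustrative): even though only columns from A are selected, Catalyst keeps the join, because it cannot prove from the schema alone that every `unique_id` in A matches exactly one row in B — without that guarantee, dropping the join could change the row count.

```scala
import org.apache.spark.sql.SparkSession

object JoinElimination {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-elim").getOrCreate()
    import spark.implicits._

    val A = Seq((1, "a"), (2, "b")).toDF("unique_id", "field1")
    val B = Seq((1, "x"), (2, "y")).toDF("unique_id", "field2")

    // Only A's columns are selected, but the optimizer cannot prove the
    // 1:1 relationship between A and B, so the join survives.
    A.join(B, Seq("unique_id"))
      .select($"unique_id", $"field1")
      .explain() // the physical plan still contains a join node

    spark.stop()
  }
}
```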

Custom encoders and udf's

2019-09-10 Thread jelmer
Hi, I am using an org.apache.spark.sql.Encoder to serialize a custom object. I now want to pass this column to a udf so it can do some operations on it, but this gives me the error: Caused by: java.lang.ClassCastException: [B cannot be cast to The code included at the problem demonstrates the
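A likely explanation of the `[B` (byte array) cast failure: a udf receives the column's internal representation, and a custom-encoded object is stored as serialized bytes, not as the original class. One workaround is to stay in the typed `Dataset` API, where `map` deserializes through the encoder before your function runs. `MyThing` below is a hypothetical stand-in for the custom class.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object EncoderVsUdf {
  case class MyThing(value: Int) // hypothetical stand-in for the custom object

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-vs-udf").getOrCreate()
    import spark.implicits._

    val ds: Dataset[MyThing] = Seq(MyThing(1), MyThing(2)).toDS()

    // A udf applied to the encoded column would see the serialized bytes
    // ([B) and throw ClassCastException; the typed map goes through the
    // encoder and hands the function a real MyThing.
    val doubled = ds.map(t => MyThing(t.value * 2))
    doubled.show()

    spark.stop()
  }
}
```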

Re: Map side join without broadcast

2019-06-30 Thread jelmer
evant items by doing a groupBy On Sun, 30 Jun 2019 at 01:45, Chris Teoh wrote: > The closest thing I can think of here is if you have both dataframes > written out using buckets. Hive uses this technique for join optimisation > such that both datasets of the same bucket are read
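The bucketing technique suggested in the reply above can be sketched like this: write both sides bucketed and sorted by the join key, then join the resulting tables. Table names, bucket count, and the toy data are all illustrative.

```scala
import org.apache.spark.sql.SparkSession

object BucketedJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-join")
      .enableHiveSupport() // bucketed tables need a persistent catalog
      .getOrCreate()
    import spark.implicits._

    val a = Seq((1, "index-shard")).toDF("join_key", "payload_a")
    val b = Seq((1, "small-row")).toDF("join_key", "payload_b")

    // Both sides bucketed and sorted by the join key; rows with the same
    // key land in the same bucket number on both sides.
    a.write.bucketBy(8, "join_key").sortBy("join_key").saveAsTable("a_bucketed")
    b.write.bucketBy(8, "join_key").sortBy("join_key").saveAsTable("b_bucketed")

    // When the bucketing is picked up, the join inputs need no shuffle.
    spark.table("a_bucketed")
      .join(spark.table("b_bucketed"), "join_key")
      .explain()

    spark.stop()
  }
}
```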

Map side join without broadcast

2019-06-29 Thread jelmer
I have 2 dataframes: Dataframe A, which contains 1 element per partition that is gigabytes big (an index), and Dataframe B, which is made up out of millions of small rows. I want to join B on A, but I want all the work to be done on the executors holding the partitions of dataframe A. Is there a way to