Re: Collections passed from driver to executors

2019-09-23 Thread Reynold Xin
It was done in 2014 by yours truly (https://github.com/apache/spark/pull/1498), so any modern version would have it. On Mon, Sep 23, 2019 at 9:04 PM, Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote: > > Thanks. Could you please let me know in which version of Spark it changed? > We are still at

Re: Collections passed from driver to executors

2019-09-23 Thread Dhrubajyoti Hati
Thanks. Could you please let me know in which version of Spark it changed? We are still at 2.2. On Tue, 24 Sep, 2019, 9:17 AM Reynold Xin wrote: > A while ago we changed it so the task gets broadcast too, so I think the > two are fairly similar. > > > > On Mon, Sep 23, 2019 at 8:17 PM,

Re: Collections passed from driver to executors

2019-09-23 Thread Reynold Xin
A while ago we changed it so the task gets broadcast too, so I think the two are fairly similar. On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote: > > I was wondering if anyone could help with this question. > > On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti

Re: Collections passed from driver to executors

2019-09-23 Thread Dhrubajyoti Hati
I was wondering if anyone could help with this question. On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati wrote: > Hi, > > I have a question regarding passing a dictionary from the driver to the executors > in Spark on YARN. This dictionary is needed in a UDF. I am using PySpark. > > As I understand

Re: Efficient cosine similarity computation

2019-09-23 Thread Chee Yee Lim
I've been trying to achieve the same objective and came up with approaches similar to your methods 1 and 2. Method 2 is the slowest for me due to the massive amount of data being shuffled around at each matrix operation stage. Method 3 is new to me, so I can't comment much. I ended up using an approach

PySpark with custom transformer project organization

2019-09-23 Thread Femi Anthony
I have a PySpark project that requires a custom ML Pipeline Transformer written in Scala. What is the best practice regarding project organization? Should I include the Scala files in the general Python project, or should they be in a separate repo? Opinions and suggestions welcome. Sent

Efficient cosine similarity computation

2019-09-23 Thread Stevens, Clay
There are several ways I can compute the cosine similarities between a Spark ML vector and each ML vector in a Spark DataFrame column, then sort for the highest results. However, I can't come up with a method that is faster than replacing the `/data/` in a Spark ML Word2Vec model, then using
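
For reference, the underlying computation (cosine similarity of one query vector against many stored vectors, then sorting for the highest scores) can be sketched in plain Python. This is only an illustration of the arithmetic, not any of the poster's methods and not Spark code; the function names and the `rows` dictionary are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    rows: Dict[str, Sequence[float]],  # hypothetical stand-in for a DataFrame column
    k: int,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in rows.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
result = top_k_similar([1.0, 0.0], rows, k=2)
```

Here `result` holds the two closest keys with their scores, best first. In a distributed setting the per-row scoring is the part that would be parallelized; the sort and cut-off are what make the naive approach expensive on large columns.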

Kafka offset committer tool for structured streaming query

2019-09-23 Thread Jungtaek Lim
Hi Spark users, especially Structured Streaming users who are using Kafka as a data source, I'm pleased to introduce Kafka offset committer, which commits the offsets of batches that have been processed. The tool is basically an implementation of a streaming query listener, which listens for events and