Thanks. Created: https://issues.apache.org/jira/browse/SPARK-26616
On Mon, Jan 14, 2019 at 9:19 PM Sean Owen <sro...@gmail.com> wrote:

> Yes that seems OK to me.
>
> On Mon, Jan 14, 2019 at 9:40 AM Jatin Puri <purija...@gmail.com> wrote:
> >
> > Thanks for the response. So do I go ahead and create a JIRA ticket?
> > I can then send a pull request with the changes.
> >
> > On Mon, Jan 14, 2019 at 8:18 PM Sean Owen <sro...@gmail.com> wrote:
> >>
> >> I think that's reasonable. The caller probably has the number of docs
> >> already but sure, it's one long and is already computed. This would
> >> have to be added to PySpark too.
> >>
> >> On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri <purija...@gmail.com> wrote:
> >> >
> >> > Hello.
> >> >
> >> > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a
> >> > good idea to also expose:
> >> >
> >> > 1. Document frequency vector
> >> > 2. Number of documents
> >> >
> >> > We get the above for free currently and they just need to be exposed
> >> > as public val.
> >> >
> >> > This avoids re-implementation for someone who needs to compute the
> >> > document frequency of terms. Currently, if someone needs df, they
> >> > have to reverse-compute it from the idf values obtained.
> >> >
> >> > Afaik, we don't explicitly provide such functionality in mllib. And
> >> > we don't need a separate class if we can expose it in `IDFModel`
> >> > itself.
> >> >
> >> > Does it sound alright?
> >> >
> >> > Regards,
> >> > Jatin
> >
> >
> > --
> > Jatin Puri
> > http://jatinpuri.com
>

--
Jatin Puri
http://jatinpuri.com
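
For reference, here is a rough sketch of the "reverse compute" workaround mentioned in the quoted proposal. It assumes the smoothed formula given in the Spark MLlib docs, idf = log((m + 1) / (df + 1)), where m is the number of documents. The function names `idf_from_df` and `df_from_idf` are made up for this illustration and are not part of any Spark API; the code is plain Python with no Spark dependency.

```python
import math


def idf_from_df(df_values, num_docs):
    # Forward direction, assuming idf = log((m + 1) / (df + 1)).
    return [math.log((num_docs + 1) / (df + 1)) for df in df_values]


def df_from_idf(idf_values, num_docs):
    # Inverse of the formula above: df = (m + 1) / exp(idf) - 1.
    # round() absorbs floating-point error, since df is an integer count.
    # Caveat (assumption about usage): if a minimum document frequency
    # filter zeroed out some idf values, this inversion cannot recover
    # the true df for those terms.
    return [round((num_docs + 1) / math.exp(v) - 1) for v in idf_values]


if __name__ == "__main__":
    m = 10                      # hypothetical corpus size
    dfs = [0, 1, 5, 10]         # hypothetical document frequencies
    idfs = idf_from_df(dfs, m)
    print(df_from_idf(idfs, m))
```

Note that the inversion needs the number of documents as an input, so the caller has to obtain it separately. Exposing both values on `IDFModel`, as proposed, would make this step unnecessary.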