Re: [mllib] Document frequency

Sean Owen Mon, 14 Jan 2019 07:50:16 -0800

Yes that seems OK to me.

On Mon, Jan 14, 2019 at 9:40 AM Jatin Puri <purija...@gmail.com> wrote:
>
> Thanks for the response. So should I go ahead and create a JIRA ticket?
> I can then send a pull request with the changes.
>
> On Mon, Jan 14, 2019 at 8:18 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> I think that's reasonable. The caller probably has the number of docs
>> already but sure, it's a single long and is already computed. This would
>> have to be added to PySpark too.
>>
>> On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri <purija...@gmail.com> wrote:
>> >
>> > Hello.
>> >
>> > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good 
>> > idea to also expose:
>> >
>> > 1. Document frequency vector
>> > 2. Number of documents
>> >
>> > We currently get the above for free; they just need to be exposed as
>> > public vals.
>> >
>> > This avoids re-implementation for anyone who needs to compute the
>> > document frequency (df) of terms. Currently, anyone who needs df has to
>> > reverse-compute it from the idf values obtained.
>> >
>> > Afaik, we don't explicitly provide such functionality in mllib. And we
>> > don't need a separate class if we can expose it in `IDFModel`
>> > itself.
>> >
>> > Does it sound alright?
>> >
>> > Regards,
>> > Jatin
>> >
>
>
>
> --
> Jatin Puri
> http://jatinpuri.com
>
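
For reference, the reverse computation mentioned in the quoted message can be sketched as below. This is a minimal sketch in plain Python, assuming the smoothed formula idf = log((m + 1) / (df + 1)) given in Spark's IDF documentation, where m is the number of documents. The function names `idf_from_df` and `df_from_idf` are hypothetical and are not part of any Spark API.

```python
import math


def idf_from_df(doc_freq, num_docs):
    """Smoothed idf, assuming idf = log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1.0) / (doc_freq + 1.0))


def df_from_idf(idf, num_docs):
    """Invert the assumed formula: df = (m + 1) / exp(idf) - 1.

    The result is rounded because floating-point error can leave it
    slightly off the original integer count.
    """
    return round((num_docs + 1.0) / math.exp(idf) - 1.0)


if __name__ == "__main__":
    num_docs = 1000  # hypothetical corpus size; the caller must supply this
    for df in (0, 1, 10, 500, 1000):
        idf = idf_from_df(df, num_docs)
        recovered = df_from_idf(idf, num_docs)
        print(df, recovered)
```

Note that the inversion needs the number of documents, which the idf values alone do not carry. Also, if a minimum document frequency is configured, filtered terms get an idf of 0 and likely cannot be told apart from terms that occur in every document, so the recovered value would not be reliable for them.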


