I think that's reasonable. The caller probably has the number of docs
already, but sure, it's a single long and is already computed. This would
have to be added to PySpark too.

On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri <purija...@gmail.com> wrote:
>
> Hello.
>
> As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good idea 
> to also expose:
>
> 1. Document frequency vector
> 2. Number of documents
>
> We get the above for free currently and they just need to be exposed as 
> public val.
>
> This avoids re-implementation for someone who needs to compute the
> document frequency of terms. Currently, anyone who needs df has to
> reverse-compute it from the idf values obtained.
>
> AFAIK, we don't explicitly provide such functionality in mllib. And we don't
> need to have a separate class if we can expose it in `IDFModel` itself.
>
> Does it sound alright?
>
> Regards,
> Jatin
>
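
For reference, a minimal sketch of the reverse computation described in the
quoted message, assuming the smoothed formula idf = log((m + 1) / (df + 1))
given in the Spark MLlib documentation, where m is the number of documents.
The function names below are hypothetical helpers for illustration, not part
of any Spark API, and the sample counts are made up.

```python
import math


def idf_from_df(df, num_docs):
    # Smoothed IDF, assumed to match the MLlib docs: log((m + 1) / (df + 1)).
    return math.log((num_docs + 1) / (df + 1))


def df_from_idf(idf, num_docs):
    # Invert the formula above: df = (m + 1) / exp(idf) - 1.
    # Rounding absorbs floating-point error, since df is an integer count.
    return round((num_docs + 1) / math.exp(idf) - 1)


# Illustrative values only: 4 documents, three terms.
num_docs = 4
doc_freqs = [1, 3, 4]

idf_values = [idf_from_df(df, num_docs) for df in doc_freqs]
recovered = [df_from_idf(v, num_docs) for v in idf_values]
```

Note that the inversion needs the number of documents, which is why exposing
both values together is convenient. It is also lossy wherever idf is 0: a
term filtered out by `minDocFreq` (which, as far as I know, gets idf 0) would
be indistinguishable from a term that appears in every document.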

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
