GitHub user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2735
  
    Thank you for your contribution, Kalman!
    
    I just took a brief look; this is a big PR, so it will probably take some time to review.
    
    For now, a few things that come to mind:
    
    * We'll need to add docs for the algorithm, and they should be example-heavy. [Here's](https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/standard_scaler.html) a simple example for another pre-processing algorithm. I see you already have extensive ScalaDocs; we could probably replicate those in the docs (see the usage sketch after this list).
    * Have you tested it on a relatively large-scale dataset? Ideally in a distributed setting where the input files are on HDFS. That way we test the scalability of the implementation, which is usually where problems arise.
    * Have you compared the output with a reference implementation? My knowledge of word2vec is not very deep, but as far as I understand the output is non-deterministic, so we would need some sort of proxy to evaluate the correctness of the integrated implementation (a rough similarity-based proxy is sketched after this list).
    * Finally, I see this introduces a new nlp package. I'm not sure how to treat this (and related algorithms, say TF-IDF), as they are not necessarily NLP-specific: even though they stem from that field, you could treat any sequence of objects as a "sentence" and embed them. I would favor including them as pre-processing steps, and hence inheriting from the `Transformer` interface, perhaps by having a feature pre-processing package.
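
    To give an idea of what an example-heavy docs snippet could look like, here is a rough usage sketch in the style of the StandardScaler page. The `Word2Vec` class name and all setter names below are placeholders, not necessarily this PR's actual API:

    ```scala
    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment

    // A "sentence" is just a sequence of tokens; any sequence of objects
    // could be embedded the same way.
    val sentences: DataSet[Seq[String]] = env.fromElements(
      Seq("flink", "processes", "streams"),
      Seq("word2vec", "learns", "word", "embeddings")
    )

    // Placeholder transformer following the FlinkML fit/transform pattern;
    // class and setter names are illustrative only.
    val w2v = Word2Vec()
      .setVectorSize(100)
      .setLearningRate(0.025)
      .setIterations(5)

    w2v.fit(sentences)                      // learn the embeddings
    val embedded = w2v.transform(sentences) // map tokens to vectors
    ```

    If it follows the `Transformer` interface, chaining with other pipeline steps via `chainTransformer` should then come for free.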
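
    On the correctness point, since the vectors themselves won't match across runs, one possible proxy (just a sketch, not a prescription) is to check that relative similarities between embeddings agree with a reference implementation:

    ```scala
    // Cosine similarity between two embedding vectors.
    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot = a.zip(b).map { case (x, y) => x * y }.sum
      val normA = math.sqrt(a.map(x => x * x).sum)
      val normB = math.sqrt(b.map(x => x * x).sum)
      dot / (normA * normB)
    }

    // Regardless of random initialization, related words should end up
    // closer together than unrelated ones (vec(...) is a hypothetical
    // lookup into the learned embeddings):
    // assert(cosine(vec("paris"), vec("france")) > cosine(vec("paris"), vec("banana")))
    ```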
    
    Regards,
    Theodore

