SnakePipe - get it? Python and 'plumbing'?
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation

On Wed, Apr 5, 2017 at 2:28 PM, Erik Bernhardson <[email protected]> wrote:

> We seem to have some consensus that for the upcoming learning to rank work
> we will build out a python library to handle the bulk of the backend data
> plumbing work. The library will primarily be code integrating with pyspark
> to do various pieces such as:
>
> # Sampling from the click logs to generate the set of queries + pages
> that will be labeled with click models
> # Distributing the work of running click models against those sampled data
> sets
> # Pushing queries we use for feature generation into kafka, and reading
> back the resulting feature vectors (the other end of this will run those
> generated queries against either the hot-spare elasticsearch cluster or the
> relforge cluster to get feature scores)
> # Merging feature vectors with labeled data, splitting into
> test/train/validate sets, and writing out files formatted for whichever
> training library we decide on (xgboost, lightgbm and ranklib are in the
> running currently)
> # Whatever plumbing is necessary to run the actual model training and do
> hyperparameter optimization
> # Converting the resulting models into a format suitable for use with the
> elasticsearch learn to rank plugin
> # Reporting on the quality of models vs some baseline
>
> The high level goal is that we would have relatively simple python scripts
> in our analytics repository that are called from oozie. Those scripts would
> know the appropriate locations to load/store data and pass into this
> library for the bulk of the processing. There will also be some script,
> probably within the library, that combines many of these steps for feature
> engineering purposes to take some set of features and run the whole thing.
>
> So, what do we call this thing? Horrible first attempts:
>
> * ltr-pipeline
> * learn-to-rank-pipeline
> * bob
> * cirrussearch-ltr
> * ???
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
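
For what it's worth, the "merging feature vectors with labeled data, splitting into test/train/validate sets" step in the quoted list could look roughly like the sketch below. It is plain Python rather than pyspark, and everything in it is an assumption made for illustration: the function names, the (query, page_id) join key, the row layout and the 80/10/10 ratios are all hypothetical and do not come from the proposed library. The one design point it tries to show is assigning whole queries to a split by hashing the query string, so that all pages for a given query land in the same set.

```python
import hashlib
from collections import defaultdict


def merge_features_with_labels(labels, features):
    """Join label rows and feature vectors on (query, page_id).

    labels:   iterable of (query, page_id, relevance_label) tuples
    features: dict mapping (query, page_id) -> list of feature scores
    Label rows that have no feature vector are dropped.
    (Hypothetical layout, chosen only for this sketch.)
    """
    merged = []
    for query, page_id, label in labels:
        vector = features.get((query, page_id))
        if vector is not None:
            merged.append((query, page_id, label, vector))
    return merged


def assign_split(query, train=0.8, test=0.1):
    """Deterministically map a query string to 'train', 'test' or 'validate'.

    Hashing the query (instead of drawing a random number per row) keeps
    every row for the same query in the same split.
    """
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0  # value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + test:
        return "test"
    return "validate"


def split_by_query(rows):
    """Group merged rows into splits keyed by split name."""
    splits = defaultdict(list)
    for row in rows:
        splits[assign_split(row[0])].append(row)
    return splits


if __name__ == "__main__":
    labels = [("foo", 1, 3), ("foo", 2, 0), ("bar", 5, 1), ("baz", 9, 2)]
    features = {
        ("foo", 1): [0.1, 0.2],
        ("foo", 2): [0.3, 0.4],
        ("bar", 5): [0.5, 0.6],
    }
    rows = merge_features_with_labels(labels, features)
    for name, members in split_by_query(rows).items():
        print(name, len(members))
```

In pyspark the same idea would presumably become a join followed by a per-query split, but that part is left open here.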
