Tjones has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/347058 )
Change subject: Setup tox for running flake8 and pytest
......................................................................


Setup tox for running flake8 and pytest

More plumbing for general python project setup. This allows the tox
command to be used to run all the tests, syntax checking, etc. that we
want to run on every commit. Work to get this running as part of the
jenkins CI pipeline will follow later. One of the constraints there will
be getting the cdh5.10.0 spark packages installed, but that shouldn't
be too difficult.

* Moves the virtualenv for running tox in the vm to /vagrant/venv to
  keep the mess in one place. Tried to avoid needing an extra
  virtualenv, as tox builds venvs anyway, but tox+pip weren't playing
  nice and errored out with the .[test] dep otherwise.
* Switched to debian jessie. Prod is moving in that direction, and
  there's no harm in switching now before anything complex is set up.
* Replace requirements.txt with setup.py.
* Add a LICENSE file; it's MIT.
* Adjust the Vagrantfile to use an NFS share. With the default share,
  tox/virtualenv were unable to create hardlinks.

Change-Id: Id57bd5fd0476fc061d4b0a1cd93a1b2f639b7ed4
---
M .gitignore
A LICENSE
A MANIFEST.in
D README
A README.rst
M Vagrantfile
M bootstrap-vm.sh
M mjolnir/test/conftest.py
D requirements.txt
A setup.py
A tox.ini
11 files changed, 159 insertions(+), 54 deletions(-)

Approvals:
  Tjones: Verified; Looks good to me, approved

diff --git a/.gitignore b/.gitignore
index c476043..4b7c536 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,23 +4,8 @@
 *~

 # Distribution / packaging
-.Python
-env/
-bin/
-build/
-develop-eggs/
-dist/
-eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-local/
-include/
-share/
+venv/
 *.egg-info/
-.installed.cfg
 *.egg

 *.log
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..d13cc4b
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,19 @@
+The MIT License (MIT)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/MANIFEST.in b/MANIFEST.in
new file mode 100644
index 0000000..7623449
--- /dev/null
+++ b/MANIFEST.in
@@ -0,0 +1 @@
+include LICENESE README.rst
diff --git a/README b/README
deleted file mode 100644
index e0b5a60..0000000
--- a/README
+++ /dev/null
@@ -1,15 +0,0 @@
-== MjoLniR - Machine Learned Ranking for Wikimedia
-
-MjoLniR is a library for handling the backend data processing
-for s Machine Learned Ranking at Wikimedia. It is specialized
-to how click logs are stored at wikimedia and provides functionality
-to transform the source click logs into machine ML models for ranking.
-
-== Requirements
-
-Targets pyspark 1.6.0 running on python 2.7
-
-== Other
-
-Documentation follows the numpy documentation guidelines:
-    https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
diff --git a/README.rst b/README.rst
new file mode 100644
index 0000000..e399627
--- /dev/null
+++ b/README.rst
@@ -0,0 +1,41 @@
+MjoLniR - Machine Learned Ranking for Wikimedia
+===============================================
+
+MjoLniR is a library for handling the backend data processing for Machine
+Learned Ranking at Wikimedia. It is specialized to how click logs are stored at
+Wikimedia and provides functionality to transform the source click logs into ML
+models for ranking in elasticsearch.
+
+Requirements
+============
+
+Targets pyspark from cdh5.10.0. This is mostly pyspark 1.6.0, but has various
+backports integrated. Requires python 2.7, as some dependencies (clickmodels)
+do not support python 3 yet.
+
+Running tests
+=============
+
+Tests can be run from within the provided Vagrant configuration. Use the
+following from the root of this repository to build a vagrant box, ssh into it,
+and run the tests::
+
+    vagrant up
+    vagrant ssh
+    cd /vagrant
+    venv/bin/tox
+
+The test suite includes both flake8 (linter) and pytest (unit) tests. These
+can be run independently with the -e option for tox::
+
+    venv/bin/tox -e flake8
+
+Individual pytest tests can be run by specifying the path on the command line::
+
+    venv/bin/tox -e pytest mjolnir/test/test_sampling.py
+
+Other
+=====
+
+Documentation follows the numpy documentation guidelines:
+    https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
diff --git a/Vagrantfile b/Vagrantfile
index 2dbb89e..ea4ae99 100644
--- a/Vagrantfile
+++ b/Vagrantfile
@@ -1,14 +1,18 @@
 Vagrant.configure("2") do |config|

   config.vm.provider :virtualbox do |vb, override|
-    override.vm.box = "trusty-cloud"
-    override.vm.box_url = 'https://cloud-images.ubuntu.com/vagrant/trusty/current/trusty-server-cloudimg-amd64-vagrant-disk1.box'
-    override.vm.box_download_insecure = true
-    override.vm.synced_folder ".", "/vagrant", :mount_options => ["dmode=777"]
+    override.vm.box = 'debian/contrib-jessie64'
     vb.customize ['modifyvm', :id, '--memory', '2048']
   end

-  config.vm.hostname = "MjoLniR"
+  root_share_options = { id: 'vagrant-root' }
+  root_share_options[:type] = :nfs
+  root_share_options[:mount_options] = ['noatime', 'rsize=32767', 'wsize=3267', 'async']
+  config.nfs.map_uid = Process.uid
+  config.nfs.map_gid = Process.gid
+  config.vm.synced_folder ".", "/vagrant", root_share_options
+  config.vm.hostname = "MjoLniR"
+  config.vm.network "private_network", type: "dhcp"

   config.vm.provision "shell", path: "bootstrap-vm.sh"
 end
diff --git a/bootstrap-vm.sh b/bootstrap-vm.sh
index 24e720a..dd8f202 100644
--- a/bootstrap-vm.sh
+++ b/bootstrap-vm.sh
@@ -3,9 +3,9 @@
 set -e

 cat >/etc/apt/sources.list.d/cloudera.list <<EOD
-# Packages for Cloudera's Distribution for Hadoop, Version 5.10.0, on Ubuntu 14.04 amd64
-deb [arch=amd64] http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh trusty-cdh5.10.0 contrib
-deb-src http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh trusty-cdh5.10.0 contrib
+# Packages for Cloudera's Distribution for Hadoop, Version 5.10.0, on Ubuntu 14.04 amd64
+deb [arch=amd64] http://archive.cloudera.com/cdh5/debian/jessie/amd64/cdh jessie-cdh5.10.0 contrib
+deb-src http://archive.cloudera.com/cdh5/debian/jessie/amd64/cdh jessie-cdh5.10.0 contrib
 EOD

 cat >/etc/apt/preferences.d/cloudera.pref <<EOD
@@ -23,11 +23,31 @@
     openjdk-7-jre-headless \
     python-virtualenv

+# findspark needs a SPARK_HOME to setup pyspark
 cat >/etc/profile.d/spark.sh <<EOD
 SPARK_HOME=/usr/lib/spark
 export SPARK_HOME
 EOD

-cd /vagrant
-virtualenv .
-bin/pip install -r requirements.txt
+# pyspark wants to put a metastore_db directory in /vagrant, put it somewhere else
+cat >/etc/spark/conf/hive-site.xml <<EOD
+<configuration>
+  <property>
+    <name>hive.metastore.warehouse.dir</name>
+    <value>/tmp/</value>
+    <description>location of default database for the warehouse</description>
+  </property>
+</configuration>
+EOD
+
+# pyspark wants to put a derby.log in /vagrant as well, put it elsewhere
+cat >> /etc/spark/conf/spark-defaults.conf <<EOD
+spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp/derby.log
+EOD
+
+if [ ! -d /vagrant/venv ]; then
+    cd /vagrant
+    mkdir venv
+    virtualenv venv
+    venv/bin/pip install tox
+fi
diff --git a/mjolnir/test/conftest.py b/mjolnir/test/conftest.py
index aa93ae6..4074c53 100644
--- a/mjolnir/test/conftest.py
+++ b/mjolnir/test/conftest.py
@@ -1,10 +1,10 @@
 import findspark
-findspark.init()
+findspark.init()  # must happen before importing pyspark

-import pytest
-import logging
-from pyspark import SparkContext, SparkConf
-from pyspark.sql import HiveContext
+import pytest  # noqa: E402
+import logging  # noqa: E402
+from pyspark import SparkContext, SparkConf  # noqa: E402
+from pyspark.sql import HiveContext  # noqa: E402


 def quiet_log4j():
diff --git a/requirements.txt b/requirements.txt
deleted file mode 100644
index a809f40..0000000
--- a/requirements.txt
+++ /dev/null
@@ -1,7 +0,0 @@
-argparse==1.2.1
-clickmodels==1.0.2
-findspark==1.1.0
-py==1.4.33
-py4j==0.10.4
-pytest==3.0.7
-wsgiref==0.1.2
diff --git a/setup.py b/setup.py
new file mode 100644
index 0000000..9b30466
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,41 @@
+import os
+from setuptools import find_packages, setup
+
+
+requirements = [
+    'clickmodels',
+    'py4j',
+]
+
+test_requirements = [
+    'pytest',
+    'findspark',
+    'flake8',
+    'tox',
+]
+
+setup(
+    name='MjoLniR',
+    version='0.0.1',
+    author='Wikimedia Search Team',
+    author_email='discov...@lists.wikimedia.org',
+    description='A plumbing library for Machine Learned Ranking at Wikimedia',
+    license='MIT',
+    packages=find_packages(),
+    include_package_data=True,
+    data_files=['README.rst'],
+    install_requires=requirements,
+    test_requires=test_requirements,
+    extras_require={
+        "test": test_requirements
+    },
+    classifiers=[
+        "Development Status :: 3 - Alpha",
+        "Programming Language :: Python",
+        "Programming Language :: Python :: 2",
+        "Environment :: Other Environment",
+        "Intended Audience :: Developers",
+        "License :: OSI Approved :: MIT License",
+        "Operating System :: OS Independent"
+    ],
+)
diff --git a/tox.ini b/tox.ini
new file mode 100644
index 0000000..61d24ee
--- /dev/null
+++ b/tox.ini
@@ -0,0 +1,16 @@
+[tox]
+minversion = 1.6
+envlist = flake8,pytest
+
+[flake8]
+max-line-length = 120
+
+[testenv:flake8]
+basepython = python2.7
+commands = flake8 mjolnir/
+deps = flake8
+
+[testenv:pytest]
+commands = pytest --pyargs mjolnir
+deps = .[test]
+passenv = SPARK_HOME

--
To view, visit https://gerrit.wikimedia.org/r/347058
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Id57bd5fd0476fc061d4b0a1cd93a1b2f639b7ed4
Gerrit-PatchSet: 8
Gerrit-Project: search/MjoLniR
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>
Gerrit-Reviewer: DCausse <dcau...@wikimedia.org>
Gerrit-Reviewer: EBernhardson <ebernhard...@wikimedia.org>
Gerrit-Reviewer: Smalyshev <smalys...@wikimedia.org>
Gerrit-Reviewer: Tjones <tjo...@wikimedia.org>
Gerrit-Reviewer: Volans <rcocci...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
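
The conftest.py hunk in this change keeps findspark.init() ahead of the pyspark imports and marks those imports with noqa: E402. A minimal sketch of why the order matters, assuming findspark.init() works by adding Spark's Python directories to sys.path at runtime: a module that lives outside the default search path only becomes importable after its directory has been added. The module name fake_pyspark and the helper can_import are hypothetical stand-ins for illustration, not part of MjoLniR, findspark, or pyspark.

```python
import importlib
import os
import sys
import tempfile


def can_import(name):
    """Return True if `name` can be imported with the current sys.path."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# Hypothetical stand-in for the python directory that lives under SPARK_HOME.
spark_python_dir = tempfile.mkdtemp()
with open(os.path.join(spark_python_dir, "fake_pyspark.py"), "w") as f:
    f.write("VERSION = 'stand-in'\n")

# Before the search path is extended, the import fails.
before = can_import("fake_pyspark")

# Stand-in for the assumed effect of findspark.init(): extend sys.path.
sys.path.insert(0, spark_python_dir)

# After the search path is extended, the same import succeeds.
after = can_import("fake_pyspark")

print("%s %s" % (before, after))
```

Because the imports only work after the path has been changed, they have to sit below executable code, which is what flake8 reports as E402 and what the noqa comments in the hunk suppress.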