Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Wikipedia Bayes Example (https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example)

Comment edited by Chris Smith:
---------------------------------------------------------------------
A much gentler way to acquire the source data. The file list below is for the 04 Apr 2012 data:

cat > wikipediafiles.txt <<EOF
enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2 
enwiki-latest-pages-articles2.xml-p000010002p000024999.bz2 
enwiki-latest-pages-articles3.xml-p000025001p000055000.bz2 
enwiki-latest-pages-articles4.xml-p000055002p000104998.bz2 
enwiki-latest-pages-articles5.xml-p000105002p000184999.bz2 
enwiki-latest-pages-articles6.xml-p000185003p000305000.bz2 
enwiki-latest-pages-articles7.xml-p000305002p000464996.bz2 
enwiki-latest-pages-articles8.xml-p000465001p000665000.bz2 
enwiki-latest-pages-articles9.xml-p000665001p000925000.bz2 
enwiki-latest-pages-articles10.xml-p000925001p001325000.bz2
enwiki-latest-pages-articles11.xml-p001325001p001825000.bz2
enwiki-latest-pages-articles12.xml-p001825001p002425000.bz2
enwiki-latest-pages-articles13.xml-p002425002p003124997.bz2
enwiki-latest-pages-articles14.xml-p003125001p003924999.bz2
enwiki-latest-pages-articles15.xml-p003925001p004824998.bz2
enwiki-latest-pages-articles16.xml-p004825001p006024996.bz2
enwiki-latest-pages-articles17.xml-p006025001p007524997.bz2
enwiki-latest-pages-articles17.xml-p006025001p007524999.bz2
enwiki-latest-pages-articles18.xml-p007525004p009225000.bz2
enwiki-latest-pages-articles19.xml-p009225002p011124997.bz2
enwiki-latest-pages-articles20.xml-p011125004p013324998.bz2
enwiki-latest-pages-articles21.xml-p013325003p015724999.bz2
enwiki-latest-pages-articles22.xml-p015725013p018225000.bz2
enwiki-latest-pages-articles23.xml-p018225004p020925000.bz2
enwiki-latest-pages-articles24.xml-p020925002p023724999.bz2
enwiki-latest-pages-articles25.xml-p023725001p026625000.bz2
enwiki-latest-pages-articles26.xml-p026625004p029624976.bz2
enwiki-latest-pages-articles27.xml-p029625017p035314669.bz2
EOF

wget -o ./wikipedia.log -i ./wikipediafiles.txt \
  -B http://dumps.wikimedia.org/enwiki/latest/ -t 10 -c --waitretry=10 &
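
Since the download runs in the background and retries on its own, it can help to check afterwards which files from the list actually arrived. The following is only a minimal sketch, assuming a POSIX shell; the function name missing_files is hypothetical and not part of the original comment. It checks only that each file exists, not that the download is complete or the archive is intact.

```sh
# Hypothetical helper: print every file named in a list that is absent
# from a directory. Arguments: list file, then directory (default: .).
missing_files() {
    list=$1
    dir=${2:-.}
    # read strips leading and trailing blanks, so trailing spaces in the
    # list are harmless; the || clause handles a final line with no newline.
    while read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue               # skip blank lines
        [ -f "$dir/$name" ] || printf '%s\n' "$name"
    done < "$list"
}

# Example call once wget has finished:
#   missing_files ./wikipediafiles.txt .
```

An empty result means every listed file is present. Any names printed can be placed in a new list file and fetched by re-running the same wget command against it.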

