On May 12, 2014, at 6:24am, David Noel <david.i.n...@gmail.com> wrote:

> What's everyone's opinion on using large stop word lists vs a very
> small value for maxDFPercent (like 30)? I'm playing around with both
> and am having trouble deciding whether one is better than the other,
> or if I should use a combination of both.

Not sure what you mean by "a combination of both".

I usually extract and output the top 1K terms (by DF) and graph them to see 
where the elbow occurs, and pick that as the cut-off; terms with DF > the elbow 
go into the stopwords bucket.

This has the issues that come with any step function (a hard cut-off), but in 
practice it typically hasn't been a significant problem.
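
A rough sketch of the procedure above, for anyone who wants to try it. I pick the elbow by eye from a graph; the code swaps in a simple heuristic instead (the point farthest from the straight line joining the first and last points of the DF curve). The function names, the tokenized-document input format and that heuristic are all assumptions made for illustration.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def top_terms_by_df(df, k=1000):
    """Return the k (term, DF) pairs with the highest DF, ties broken by term."""
    return sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def elbow_index(values):
    """Index of the point farthest from the chord joining first and last points.

    This is a stand-in heuristic for "look at the graph and find the elbow".
    """
    n = len(values)
    if n < 3:
        return 0
    x1, y0, y1 = n - 1, values[0], values[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Unnormalised distance from (i, y) to the line (0, y0) -> (x1, y1).
        d = abs((y1 - y0) * i - x1 * y + x1 * y0)
        if d > best_d:
            best_i, best_d = i, d
    return best_i


def stopwords_from_elbow(docs, k=1000):
    """Terms whose DF is strictly above the DF at the elbow, plus that cut-off."""
    ranked = top_terms_by_df(document_frequencies(docs), k)
    if not ranked:
        return set(), 0
    cutoff = ranked[elbow_index([count for _, count in ranked])][1]
    return {term for term, count in ranked if count > cutoff}, cutoff
```

Each element of docs is assumed to be a list of already tokenized, lower-cased terms. It is still worth plotting the ranked DF values and checking that the computed cut-off matches what the curve looks like.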

One advantage is that this adapts to the domain, since the set of stopwords 
often changes with domains - e.g. when processing online ads, words like 
"free", "trial" and "download" are high DF terms.

And it's amazing how often (across multiple languages and domains) the DF limit 
winds up being 20%.
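
For comparison, a fixed percentage limit like that can be applied directly, without looking for an elbow. This is only a minimal sketch: the function name is made up, the input is assumed to be tokenized documents, and the strict "greater than" comparison is an assumption rather than a statement about how any particular tool treats its own DF percentage setting.

```python
from collections import Counter


def stopwords_by_df_percent(docs, max_df_percent=20.0):
    """Return terms that occur in more than max_df_percent of the documents."""
    n_docs = len(docs)
    if n_docs == 0:
        return set()
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term for term, count in df.items()
            if 100.0 * count / n_docs > max_df_percent}
```

With five documents and the default limit, a term found in two of them (40%) is treated as a stopword, while a term found in only one (exactly 20%) is kept.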

-- Ken

> My data set is one day's
> worth of news articles gathered from 1000 online news outlets. It's
> probably similar to the reuters data set, but with a little more
> noise. I used Boilerpipe for article extraction.
> 
> I spent a good while Googling around to build the largest (English)
> stop word-list I could. I'll paste it below for anyone who's
> interested and would like to save themselves an hour of Googling and
> collating.
> 
> -------------------------------------------------------------
> 
> 'll,'tis,'twas,a,a's,aan,able,about,above,abroad,abst,accordance,according,accordingly,across,act,actually,ad,added,adj,af,affected,affecting,affects,after,afterwards,again,against,ago,ah,ahead,ain't,al,all,alle,alles,allow,allows,almost,alone,along,alongside,already,als,also,alt,although,altijd,always,am,amid,amidst,among,amongst,amoungst,amount,an,and,anden,andere,announce,another,any,anybody,anyhow,anymore,anyone,anything,anyway,anyways,anywhere,apart,apparently,appear,appreciate,appropriate,approximately,are,area,areas,aren,aren't,arent,arise,around,as,aside,ask,asked,asking,asks,associated,at,auth,available,away,awfully,b,back,backed,backing,backs,backward,backwards,be,became,because,become,becomes,becoming,been,before,beforehand,began,begin,beginning,beginnings,begins,behind,being,beings,believe,below,ben,beside,besides,best,better,between,beyond,big,bij,bill,biol,blev,blive,bliver,both,bottom,brief,briefly,but,by,c,c'mon,c's,ca,call,came,can,can't,cannot,cant,caption,case,cases,cause,causes,certain,certainly,changes,clear,clearly,co,co.,com,come,comes,computer,con,concerning,consequently,consider,considering,contain,containing,contains,corresponding,could,could've,couldn't,couldnt,course,cry,currently,d,da,daar,dan,dare,daren't,dat,date,de,dear,definitely,dem,den,denne,der,deres,describe,described,despite,det,detail,dette,deze,did,didn't,die,differ,different,differently,dig,din,directly,disse,dit,do,doch,doen,does,doesn't,dog,doing,don,don't,done,door,down,downed,downing,downs,downwards,du,due,during,dus,e,each,early,ed,edu,een,eens,effect,efter,eg,eight,eighty,either,eleven,eller,else,elsewhere,empty,en,end,ended,ending,ends,enough,entirely,er,especially,et,et-al,etc,even,evenly,ever,evermore,every,everybody,everyone,everything,everywhere,ex,exactly,example,except,f,face,faces,fact,facts,fairly,far,farther,felt,few,fewer,ff,fifteen,fifth,fify,fill,find,finds,fire,first,five,fix,followed,following,follows,for,forever,former,formerly,forth,forty,forward,found,
> four,fra,from,front,full,fully,further,furthered,furthering,furthermore,furthers,g,gave,ge,geen,general,generally,get,gets,getting,geweest,give,given,gives,giving,go,goes,going,gone,good,goods,got,gotten,great,greater,greatest,greeting,greetings,group,grouped,grouping,groups,h,haar,had,hadn't,half,ham,han,hans,happen,happens,har,hardly,has,hasn't,hasnt,havde,have,haven't,having,he,he'd,he'll,he's,heb,hebben,hed,heeft,hello,help,hem,hence,hende,hendes,her,here,here's,hereafter,hereby,herein,heres,hereupon,hers,herself,hes,het,hi,hid,hier,high,higher,highest,hij,him,himself,his,hither,hoe,home,hopeful,hopefully,hos,how,how'd,how'll,how's,howbeit,however,hun,hundred,hvad,hvis,hvor,i,i'd,i'll,i'm,i've,id,ie,iemand,iets,if,ignored,ik,ikke,im,immediate,immediately,importance,important,in,inasmuch,inc,inc.,ind,indeed,index,indicate,indicated,indicates,information,inner,inside,insofar,instead,interest,interested,interesting,interests,into,invention,inward,is,isn't,it,it'd,it'll,it's,itd,its,itself,j,ja,je,jeg,jer,jo,just,k,kan,keep,keeps,kept,kg,kind,km,knew,know,knowing,known,knows,kon,kunne,kunnen,l,large,largely,last,lastly,late,lately,later,latest,latter,latterly,least,less,lest,let,let's,lets,like,liked,likely,likewise,line,little,long,longer,longest,look,looking,looks,low,lower,ltd,m,maar,made,mainly,make,makes,making,man,mange,many,may,maybe,mayn't,me,mean,means,meantime,meanwhile,med,meer,meget,member,members,men,merely,met,mg,mig,might,might've,mightn't,mij,mijn,mill,million,millions,min,mine,minus,miss,mit,ml,mod,moet,more,moreover,most,mostly,move,mr,mrs,much,must,must've,mustn't,my,myself,n,na,naar,name,namely,nay,nd,near,nearly,necessarily,necessary,ned,need,needed,needing,needn't,needs,neither,never,neverf,neverless,nevertheless,new,newer,newest,next,niet,niets,nine,ninety,no,no-one,nobody,nog,noget,nogle,non,none,nonetheless,noone,nor,normally,nos,not,noted,nothing,notwithstanding,novel,now,nowhere,nu,number,numbers,o,obtain,obtained,obviously,of,off,often,
> og,ogs,oh,ok,okay,old,older,oldest,om,omdat,omitted,on,once,onder,one,one's,ones,only,ons,onto,ook,op,open,opened,opening,opens,opposite,or,ord,order,ordered,ordering,orders,os,other,others,otherwise,ought,oughtn't,our,ours,ourselves,out,outside,over,overall,owing,own,p,page,pages,part,parted,particular,particularly,parting,parts,past,per,perhaps,place,placed,places,please,plus,point,pointed,pointing,points,poorly,possible,possibly,potentially,pp,predominantly,present,presented,presenting,presents,presumably,presume,presumed,previously,primarily,primary,probable,probably,problem,problems,prompt,promptly,proud,provide,provided,provides,put,puts,q,que,quick,quickly,quite,qv,r,ran,rather,rd,re,readily,really,reasonable,reasonably,recent,recently,reeds,ref,refs,regard,regarding,regardless,regards,relate,related,relative,relatively,respective,respectively,result,resulted,resulting,results,right,room,rooms,round,run,s,said,same,saw,say,saying,says,sec,second,secondly,seconds,see,seeing,seem,seemed,seeming,seems,seen,sees,self,selv,selves,sensible,sent,serious,seriously,seven,several,shall,shan't,she,she'd,she'll,she's,shed,shes,should,should've,shouldn't,show,showed,showing,shown,showns,shows,side,sides,sig,significant,significantly,similar,similarly,sin,since,sincere,sine,sit,six,sixty,skal,skulle,slightly,small,smaller,smallest,so,som,some,somebody,someday,somehow,someone,somethan,something,sometime,sometimes,somewhat,somewhere,soon,sorry,specifically,specified,specify,specifying,state,states,still,stop,strongl,strongly,sub,substantial,substantially,successfully,such,sufficient,sufficiently,suggest,suggested,suggests,sup,sure,system,t,t's,take,taken,taking,te,tegen,tell,tells,ten,tends,th,than,thank,thanks,thanx,that,that'll,that's,that've,thats,the,their,theirs,them,themselves,then,thence,there,there'd,there'll,there're,there's,there've,thereafter,thereby,therefore,therein,theres,thereupon,these,they,they'd,they'll,they're,they've,thi,thick,thin,thing,things,think,thinks,
> third,thirty,this,thorough,thoroughly,those,though,thought,thoughts,three,through,throughout,thru,thus,til,till,tis,to,toch,today,toen,together,too,took,top,tot,toward,towards,tried,tries,truly,try,trying,turn,turned,turning,turns,twas,twelve,twenty,twice,two,u,ud,uit,un,under,underneath,undoing,unfortunately,unless,unlike,unlikely,until,unto,up,upon,upwards,us,use,used,useful,uses,using,usually,uucp,uw,v,value,van,var,various,veel,versus,very,vi,via,vil,ville,viz,voor,vor,vs,være,været,w,want,wanted,wanting,wants,waren,was,wasn't,wat,way,ways,we,we'd,we'll,we're,we've,welcome,well,wells,went,werd,were,weren't,wezen,what,what'd,what'll,what's,what've,whatever,when,when'd,when'll,when's,whence,whenever,where,where'd,where'll,where's,whereafter,whereas,whereby,wherein,whereupon,wherever,whether,which,whichever,while,whilst,whither,who,who'd,who'll,who's,whoever,whole,whom,whomever,whose,why,why'd,why'll,why's,wie,wil,will,willing,wish,with,within,without,won't,wonder,worden,wordt,work,worked,working,works,would,would've,wouldn't,www,x,y,year,years,yes,yet,you,you'd,you'll,you're,you've,young,younger,youngest,your,yours,yourself,yourselves,z,zal,ze,zelf,zero,zich,zij,zijn,zo,zonder,zou

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




