Hi Andy,

Your problem seems to be a general Java problem rather than a Hadoop one; you may get better help in a Java forum. String.split uses regular expressions, which you definitely don't need here. I would write my own split function, without regular expressions.
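For example, here is a minimal sketch of such a hand-written split on a single literal delimiter character (a hypothetical helper of my own, not anything from Hadoop). Unlike StringTokenizer, it keeps the empty tokens between consecutive delimiters:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleSplit {

    // Splits 'line' on a single literal delimiter character,
    // preserving empty tokens (including a trailing empty token).
    static List<String> split(String line, char delim) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delim) {
                tokens.add(line.substring(start, i));
                start = i + 1;
            }
        }
        tokens.add(line.substring(start)); // last token, possibly empty
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,c,,d", ','));
        // [a, b, c, , d] -- the empty token between the commas is kept
    }
}
```

No regex compilation, no object churn beyond the result list, and the empty-token behavior is explicit rather than an accident of the API.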
This link may help you better understand the underlying operations:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split

Also, there is a StringTokenizer constructor that returns the delimiters as well:
StringTokenizer(String string, String delimiters, boolean returnDelimiters);
(I would write my own, though.)

Rasit

2009/2/10 Andy Sautins <andy.saut...@returnpath.net>:
>
> I have a question. I've dabbled with different ways of tokenizing an
> input file line for processing. I've noticed in my somewhat limited
> tests that there seem to be some pretty reasonable performance
> differences between different tokenizing methods. For example, when
> splitting a line into tokens (tab-delimited in my case), Scanner seems
> to be the slowest, followed by String.split, with StringTokenizer being
> the fastest. StringTokenizer, for my application, has the unfortunate
> characteristic of not returning blank tokens (i.e., parsing "a,b,c,,d"
> returns "a","b","c","d" instead of "a","b","c","","d"). The WordCount
> example uses StringTokenizer, which makes sense to me, except I'm
> currently getting hung up on its not returning blank tokens. I did run
> across the com.Ostermiller.util StringTokenizer replacement that
> handles null/blank tokens
> (http://ostermiller.org/utils/StringTokenizer.html), which seems
> possible to use, but it sure seems like someone else has already solved
> this problem better than I have.
>
> So, my question is: is there a "best practice" for splitting an input
> line, especially when NULL tokens are expected (i.e., two consecutive
> delimiter characters)?
>
> Any thoughts would be appreciated.
>
> Thanks,
>
> Andy

--
M. Raşit ÖZDAŞ