Hi, Andy

Your problem seems to be a general Java problem rather than a Hadoop one,
so you may get better help in a Java forum.
String.split uses regular expressions, which you definitely don't need here.
I would write my own split function, without regular expressions.
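
Something along these lines is what I have in mind. It is only a rough sketch, assuming a single-character delimiter; the class and method names (SimpleSplit, split) are made up for illustration and do not come from Hadoop or any library:

```java
import java.util.ArrayList;
import java.util.List;

class SimpleSplit {
    // Splits 'line' on a single delimiter character without using regular
    // expressions. Two consecutive delimiters produce an empty token, and a
    // leading or trailing delimiter produces an empty first or last token.
    static List<String> split(String line, char delim) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        int pos;
        // indexOf returns -1 once no further delimiter is found
        while ((pos = line.indexOf(delim, start)) >= 0) {
            tokens.add(line.substring(start, pos));
            start = pos + 1;
        }
        // whatever follows the last delimiter (possibly empty)
        tokens.add(line.substring(start));
        return tokens;
    }
}
```

With this, SimpleSplit.split("a,b,c,,d", ',') gives the five tokens "a", "b", "c", "", "d". For tab delimited input you would pass '\t' as the delimiter.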

This link may help to better understand underlying operations:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split

Also, StringTokenizer has a constructor that returns the delimiters as tokens too:
StringTokenizer(String string, String delimiters, boolean returnDelimiters);
(I would write my own, though.)
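
If you do want to stay with StringTokenizer, the returned delimiters can be used to detect blank tokens. Again just a sketch under my own assumptions: the class name TokenizerWithBlanks and its split method are hypothetical, and it relies on StringTokenizer returning each delimiter as a one-character token when the third argument is true:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerWithBlanks {
    // Uses StringTokenizer with returnDelims = true and emits an empty
    // token whenever two delimiters are seen in a row (or when the line
    // starts or ends with a delimiter).
    static List<String> split(String line, String delims) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(line, delims, true);
        boolean lastWasDelim = true; // treat the start of the line like a delimiter
        while (st.hasMoreTokens()) {
            String t = st.nextToken();
            boolean isDelim = t.length() == 1 && delims.indexOf(t.charAt(0)) >= 0;
            if (isDelim) {
                if (lastWasDelim) {
                    tokens.add(""); // nothing between this delimiter and the previous one
                }
                lastWasDelim = true;
            } else {
                tokens.add(t);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            tokens.add(""); // line ended with a delimiter (or was empty)
        }
        return tokens;
    }
}
```

TokenizerWithBlanks.split("a,b,c,,d", ",") then yields "a", "b", "c", "", "d". It does more bookkeeping than the indexOf approach, which is part of why I would still write my own.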

Rasit

2009/2/10 Andy Sautins <andy.saut...@returnpath.net>:
>
>
>   I have a question.  I've dabbled with different ways of tokenizing an
> input file line for processing.  I've noticed in my somewhat limited
> tests that there seem to be some pretty reasonable performance
> differences between different tokenizing methods.  For example, when
> splitting a line into tokens ( tab delimited in my case ), it roughly seems
> that Scanner is the slowest, followed by String.split, with StringTokenizer
> being the fastest.  StringTokenizer, for my application, has the
> unfortunate characteristic of not returning blank tokens ( i.e., parsing
> "a,b,c,,d" would return "a","b","c","d" instead of "a","b","c","","d").
> The WordCount example uses StringTokenizer which makes sense to me,
> except I'm currently getting hung up on not returning blank tokens.  I
> did run across the com.Ostermiller.util StringTokenizer replacement that
> handles null/blank tokens
> (http://ostermiller.org/utils/StringTokenizer.html ) which seems
> possible to use, but it sure seems like someone else has solved this
> problem already better than I have.
>
>
>
>   So, my question is, is there a "best practice" for splitting an input
> line especially when NULL tokens are expected ( i.e., two consecutive
> delimiter characters )?
>
>
>
>   Any thoughts would be appreciated
>
>
>
>   Thanks
>
>
>
>   Andy
>
>



-- 
M. Raşit ÖZDAŞ
