Re: Best practices on splitting an input line?
Stefan Podkowinski wrote:
> I'm currently using OpenCSV, which can be found at http://opencsv.sourceforge.net/, but I haven't done any performance tests on it yet. In my case simply splitting strings would not work anyway, since I need to handle quotes and separators within quoted values, e.g. "a","a,b","c".

I've used it in the past and found it pretty reliable. Again, no performance tests, just reading in CSV files exported from spreadsheets.
Re: Best practices on splitting an input line?
I'm currently using OpenCSV, which can be found at http://opencsv.sourceforge.net/, but I haven't done any performance tests on it yet. In my case simply splitting strings would not work anyway, since I need to handle quotes and separators within quoted values, e.g. "a","a,b","c".

On Tue, Feb 10, 2009 at 9:18 PM, Andy Sautins wrote:
> I have a question. I've dabbled with different ways of tokenizing an input file line for processing. I've noticed in my somewhat limited tests that there seem to be some noticeable performance differences between different tokenizing methods. For example, when splitting a line into tokens (tab-delimited in my case), Scanner seems to be the slowest, followed by String.split, with StringTokenizer being the fastest. StringTokenizer, for my application, has the unfortunate characteristic of not returning blank tokens (i.e., parsing "a,b,c,,d" would return "a","b","c","d" instead of "a","b","c","","d"). The WordCount example uses StringTokenizer, which makes sense to me, except I'm currently getting hung up on it not returning blank tokens. I did run across the com.Ostermiller.util StringTokenizer replacement that handles null/blank tokens (http://ostermiller.org/utils/StringTokenizer.html), which seems possible to use, but it sure seems like someone else has already solved this problem better than I have.
>
> So, my question is: is there a "best practice" for splitting an input line, especially when NULL tokens are expected (i.e., two consecutive delimiter characters)?
>
> Any thoughts would be appreciated.
>
> Thanks
>
> Andy
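To illustrate why a plain split is not enough for input such as "a","a,b","c", here is a minimal sketch of a quote-aware splitter in plain Java. It is not OpenCSV's implementation or API; the class name QuotedLineSplitter and its split method are hypothetical, and it assumes a single-line record in which a doubled quote inside a quoted field stands for a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenCSV.
class QuotedLineSplitter {

    /** Splits one line on separator, ignoring separators that appear inside quoted fields. */
    static List<String> split(String line, char separator, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quote) {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    // Assumption: a doubled quote inside a quoted field is a literal quote.
                    current.append(quote);
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not part of the value
                }
            } else if (c == separator && !inQuotes) {
                fields.add(current.toString()); // end of field; may be empty
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field
        return fields;
    }

    public static void main(String[] args) {
        // The example from the message: three fields, the second containing a comma.
        for (String field : split("\"a\",\"a,b\",\"c\"", ',', '"')) {
            System.out.println("[" + field + "]");
        }
    }
}
```

For real CSV files (multi-line records, escape characters, encodings) a tested library is likely the safer choice; the sketch only shows the basic state machine.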
Re: Best practices on splitting an input line?
Hi Andy,

Your problem seems to be a general Java problem rather than a Hadoop one. You may get better help in a Java forum.

String.split uses regular expressions, which you definitely don't need. I would write my own split function, without regular expressions.

This link may help to better understand the underlying operations: http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split

Also, there is a StringTokenizer constructor that returns the delimiters as tokens too: StringTokenizer(String str, String delim, boolean returnDelims). (I would write my own, though.)

Rasit

2009/2/10 Andy Sautins:
> I have a question. I've dabbled with different ways of tokenizing an input file line for processing. I've noticed in my somewhat limited tests that there seem to be some noticeable performance differences between different tokenizing methods. For example, when splitting a line into tokens (tab-delimited in my case), Scanner seems to be the slowest, followed by String.split, with StringTokenizer being the fastest. StringTokenizer, for my application, has the unfortunate characteristic of not returning blank tokens (i.e., parsing "a,b,c,,d" would return "a","b","c","d" instead of "a","b","c","","d"). The WordCount example uses StringTokenizer, which makes sense to me, except I'm currently getting hung up on it not returning blank tokens. I did run across the com.Ostermiller.util StringTokenizer replacement that handles null/blank tokens (http://ostermiller.org/utils/StringTokenizer.html), which seems possible to use, but it sure seems like someone else has already solved this problem better than I have.
>
> So, my question is: is there a "best practice" for splitting an input line, especially when NULL tokens are expected (i.e., two consecutive delimiter characters)?
>
> Any thoughts would be appreciated.
>
> Thanks
>
> Andy

--
M. Raşit ÖZDAŞ
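The two suggestions in this reply can be sketched as follows. This is only one possible reading of them, not code from the thread: the class name BlankAwareSplit and both method names are hypothetical. The first method is a hand-written split on a single delimiter character using indexOf, with no regular expressions; the second uses the three-argument StringTokenizer constructor with returnDelims set to true and infers an empty token whenever two delimiters are adjacent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for illustration only.
class BlankAwareSplit {

    /** Hand-written split on one delimiter character; empty fields are kept. */
    static List<String> splitByIndexOf(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = line.indexOf(delimiter, start)) >= 0) {
            fields.add(line.substring(start, end)); // empty when two delimiters are adjacent
            start = end + 1;
        }
        fields.add(line.substring(start)); // remainder after the last delimiter
        return fields;
    }

    /** Same result via StringTokenizer, using returned delimiters to detect empty fields. */
    static List<String> splitByTokenizer(String line, String delimiters) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delimiters, true);
        boolean lastWasDelimiter = true; // treat the start of the line like a delimiter
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            boolean isDelimiter = token.length() == 1 && delimiters.indexOf(token.charAt(0)) >= 0;
            if (isDelimiter) {
                if (lastWasDelimiter) {
                    fields.add(""); // nothing between this delimiter and the previous one
                }
                lastWasDelimiter = true;
            } else {
                fields.add(token);
                lastWasDelimiter = false;
            }
        }
        if (lastWasDelimiter) {
            fields.add(""); // line was empty or ended with a delimiter
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitByIndexOf("a,b,c,,d", ','));   // [a, b, c, , d]
        System.out.println(splitByTokenizer("a,b,c,,d", ",")); // [a, b, c, , d]
    }
}
```

Whether either version is actually faster than String.split for a given workload would need to be measured; the thread itself reports only informal timings.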
Best practices on splitting an input line?
I have a question. I've dabbled with different ways of tokenizing an input file line for processing. I've noticed in my somewhat limited tests that there seem to be some noticeable performance differences between different tokenizing methods. For example, when splitting a line into tokens (tab-delimited in my case), Scanner seems to be the slowest, followed by String.split, with StringTokenizer being the fastest. StringTokenizer, for my application, has the unfortunate characteristic of not returning blank tokens (i.e., parsing "a,b,c,,d" would return "a","b","c","d" instead of "a","b","c","","d"). The WordCount example uses StringTokenizer, which makes sense to me, except I'm currently getting hung up on it not returning blank tokens. I did run across the com.Ostermiller.util StringTokenizer replacement that handles null/blank tokens (http://ostermiller.org/utils/StringTokenizer.html), which seems possible to use, but it sure seems like someone else has already solved this problem better than I have.

So, my question is: is there a "best practice" for splitting an input line, especially when NULL tokens are expected (i.e., two consecutive delimiter characters)?

Any thoughts would be appreciated.

Thanks

Andy
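The blank-token behaviour described above can be reproduced with a short, self-contained sketch. The class and method names (BlankTokenDemo, withStringTokenizer, withSplit) are hypothetical and exist only for this illustration. It contrasts the two-argument StringTokenizer with String.split called with a limit of -1, which keeps empty strings, including trailing ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class BlankTokenDemo {

    /** StringTokenizer skips runs of delimiters, so empty fields disappear. */
    static List<String> withStringTokenizer(String line, String delimiter) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delimiter);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    /** String.split with limit -1 keeps empty fields; the delimiter is quoted so it is matched literally. */
    static List<String> withSplit(String line, String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        String line = "a,b,c,,d";
        System.out.println(withStringTokenizer(line, ",")); // [a, b, c, d]
        System.out.println(withSplit(line, ","));           // [a, b, c, , d]
    }
}
```

This says nothing about relative speed; it only shows which approach preserves empty fields.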