Pierre GM wrote:
> I was thinking about something this weekend: we could create a second
> list when looping on the rows, where we would store the length of each
> split row. After the loop, we can find if these values don't match
> the expected number of columns `nbcols` and where. Then, we can decide
> to strip the `rows` list of its invalid values (that corresponds to
> skipping) or raise an exception, but in both cases we know where the
> problem is.
> My only concern is that we'd be creating yet another list of integers,
> which would increase memory usage. Would it be a problem?
I doubt it would be that big a deal, however...

Skipper Seabold wrote:
> One of the datasets I was working with was about a million lines with
> about 500 columns in each.

In this use case, it's clearly not a big deal, but it's probably pretty common for folks to have data sets with a smaller number of columns, maybe even two or so (I know I do sometimes). In that case, I suppose we're increasing memory usage by 50% or so, which may be an issue.

Another idea: only store the indexes of the rows that have the "wrong" number of columns -- if that's a large number, then the user has bigger problems than memory usage!

> I can't think of a case where I would want to just skip bad rows.

I can't either, but someone suggested it. It certainly shouldn't happen by default or without a big ol' message of some sort to the user.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
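P.S. For the archives, here's a minimal sketch of the "store only the bad-row indexes" variant discussed above. The function name `split_rows`, the delimiter handling, and the error message are all hypothetical -- this is not the actual np.genfromtxt implementation, just an illustration of the memory trade-off: O(number of bad rows) instead of a full parallel list of lengths.

```python
def split_rows(lines, delimiter=",", nbcols=None):
    """Split text lines into fields, recording indexes of malformed rows.

    Hypothetical sketch: instead of keeping a second list with the
    length of every row, we keep only the indexes of rows whose field
    count differs from the expected number of columns `nbcols`.
    """
    rows = [line.strip().split(delimiter) for line in lines]
    if nbcols is None:
        nbcols = len(rows[0])  # infer the expected width from the first row
    # Only the offending indexes are stored -- if this list is large,
    # the user has bigger problems than memory usage.
    bad = [i for i, row in enumerate(rows) if len(row) != nbcols]
    if bad:
        # Raise rather than silently skip: the user is told exactly
        # which rows are wrong and can decide what to do.
        raise ValueError(
            "rows %s do not have the expected %d columns" % (bad, nbcols))
    return rows

print(split_rows(["1,2,3", "4,5,6"]))  # [['1', '2', '3'], ['4', '5', '6']]
```

Skipping instead of raising would just mean dropping the indexes in `bad` from `rows`, but as noted, that shouldn't happen silently by default.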