On 19/05/2007 10:04 AM, James Stroud wrote:
> py_genetic wrote:
>> Hello,
>>
>> I'm importing large text files of data using csv. I would like to add
>> some more auto-sensing abilities. I'm considering sampling the data
>> file and doing some fuzzy logic scoring on the attributes (columns in a
>> database/CSV file, e.g. height, weight, income, etc.) to determine the
>> most efficient 'type' to convert the attribute column into for further
>> processing and efficient storage...
>>
>> Example row from sampled file data: [['8', '2.33', 'A', 'BB',
>> 'hello there', '100,000,000,000'], [next row...] ...]
>>
>> Aside from a missing attribute designator, we can assume that the same
>> type of data continues through a column. For example, a string, int8,
>> int16, float, etc.
>>
>> 1. What is the most efficient way in Python to test whether a string
>> can be converted into a given numeric type, or left alone if it's
>> really a string like 'A' or 'hello'? Speed is key. Any thoughts?
>>
>> 2. Is there anything out there already which deals with this issue?
>>
>> Thanks,
>> Conor
>
> This is untested, but here is an outline to do what you want.
>
> First convert rows to columns:
>
>     columns = zip(*rows)
>
> Okay, that was a lot of typing. Now, you should run down the columns,
> testing with the most restrictive type and working to less restrictive
> types. You will also need to keep in mind the potential for commas in
> your numbers -- so you will need to write your own converters, determining
> for yourself what literals map to what values. Only you can decide what
> you really want here. Here is a minimal idea of how I would do it:
>
>     def make_int(astr):
>         if not astr:
>             return 0
>         else:
>             return int(astr.replace(',', ''))
>
>     def make_float(astr):
>         if not astr:
>             return 0.0
>         else:
>             return float(astr.replace(',', ''))
>
>     make_str = lambda s: s
>
> Now you can put the converters in a list, remembering to order them.
>     converters = [make_int, make_float, make_str]
>
> Now, go down the columns checking, moving to the next, less restrictive
> converter when a particular converter fails. We assume that the make_str
> identity operator will never fail. We could leave it out and have a
> flag, etc., for efficiency, but that is left as an exercise.
>
>     new_columns = []
>     for column in columns:
>         for converter in converters:
>             try:
>                 new_column = [converter(v) for v in column]
>                 break
>             except ValueError:
>                 continue
>         new_columns.append(new_column)
>
> For no reason at all, convert back to rows:
>
>     new_rows = zip(*new_columns)
>
> You must decide for yourself how to deal with ambiguities. For example,
> will '1.0' be a float or an int? The above assumes you want all values
> in a column to have the same type. Reordering the loops can give mixed
> types in columns, but would not fulfill your stated requirements. Some
> things are not as efficient as they might be (for example, eliminating
> the clumsy make_str). But adding tests to improve efficiency would cloud
> the logic.
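[Editor's note: assembling the quoted outline into one self-contained, runnable sketch; the sample rows and the `infer_columns` function name are made up for illustration.]

```python
def make_int(astr):
    # Empty fields default to 0; thousands-separator commas are stripped.
    return int(astr.replace(',', '')) if astr else 0

def make_float(astr):
    return float(astr.replace(',', '')) if astr else 0.0

def make_str(astr):
    # Identity converter: the fallback that never fails.
    return astr

# Ordered from most to least restrictive.
converters = [make_int, make_float, make_str]

def infer_columns(rows):
    """Convert each column with the most restrictive converter
    that succeeds on every value in that column."""
    new_columns = []
    for column in zip(*rows):  # transpose rows -> columns
        for converter in converters:
            try:
                new_column = [converter(v) for v in column]
                break
            except ValueError:
                continue
        new_columns.append(new_column)
    return [list(row) for row in zip(*new_columns)]  # back to rows

rows = [['8', '2.33', 'A', '100,000,000,000'],
        ['9', '4.5', 'BB', '7']]
print(infer_columns(rows))
# -> [[8, 2.33, 'A', 100000000000], [9, 4.5, 'BB', 7]]
```

Note that `'2.33'` fails `make_int` with a ValueError, so the whole column falls through to `make_float`; `'A'` fails both numeric converters and the column stays text.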
[apologies in advance if this appears more than once]

This approach is quite reasonable, IF:
(1) the types involved follow a simple "ladder" hierarchy [ints pass the
float test, floats pass the str test]
(2) the supplier of the data has ensured that all values in a column are
actually instances of the intended type.

Constraint (1) falls apart if you need dates. Consider 31/12/99,
31/12/1999, 311299 [int?], 31121999 [int?], 31DEC99, ... and that's
before you allow for dates in three different orders (dmy, mdy, ymd).

Constraint (2) just falls apart -- with user-supplied data, there seem
to be no rules but Rafferty's and no laws but Murphy's.

The approach that I've adopted is to test the values in a column for all
types, and choose the non-text type that has the highest success rate
(provided the rate is greater than some threshold, e.g. 90%; otherwise
it's text).

For large files, taking a 1-in-N sample can save a lot of time with
little chance of misdiagnosis. Example: a file of 1,079,000 records,
with 15 columns, ultimately diagnosed as being 8 x text, 3 x int,
1 x float, 2 x date (dmy order), and [no kidding] 1 x date (ymd order).
Using N==101 took about 15 seconds [Python 2.5.1, Win XP Pro SP2,
3.2GHz dual-core]; N==1 takes about 900 seconds. The "converter"
function for dates is written in C.

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list
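[Editor's note: a minimal sketch of the success-rate approach described above. The candidate type set, the date formats tried, and the 90% default threshold are illustrative assumptions -- John's actual date converter is written in C and presumably handles far more formats.]

```python
from datetime import datetime

# Illustrative formats only: dmy (2- and 4-digit year) and ymd.
DATE_FORMATS = ('%d/%m/%Y', '%d/%m/%y', '%Y/%m/%d')

def is_int(s):
    try:
        int(s.replace(',', ''))
        return True
    except ValueError:
        return False

def is_float(s):
    try:
        float(s.replace(',', ''))
        return True
    except ValueError:
        return False

def is_date(s):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(s, fmt)
            return True
        except ValueError:
            pass
    return False

def diagnose(column, n=1, threshold=0.9):
    """Test every Nth value against all candidate types and return the
    non-text type with the highest success rate, falling back to 'text'
    when no type clears the threshold."""
    sample = column[::n]  # 1-in-N sampling
    tests = {'int': is_int, 'float': is_float, 'date': is_date}
    rates = {name: sum(test(v) for v in sample) / len(sample)
             for name, test in tests.items()}
    best = max(rates, key=rates.get)
    return best if rates[best] >= threshold else 'text'

print(diagnose(['31/12/1999', '01/02/2000', 'N/A', '15/06/2001'],
               threshold=0.7))
# -> date  (3 of 4 values parse as dates, clearing the 70% threshold)
```

Unlike the ladder approach, a stray 'N/A' or other junk value no longer forces an otherwise numeric or date column to be diagnosed as text, which is the point of constraint (2) above.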