On Mar 7, 6:12 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Mar 8, 1:53 am, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
> > On Mar 6, 9:17 pm, Luis M. González <[EMAIL PROTECTED]> wrote:
> > > On 6 mar, 11:27, Pierre Quentel <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to know if there is a module that converts a string to a
> > > > value of the "most probable type" ; for instance :
> > > > - if the string is "abcd" the value is the same string "abcd"
> > > > - string "123" : value = the integer 123
> > > > - string "-1.23" (or "-1,23" if the locale for decimals is ,) : value
> > > > = the float -1.23
> > > > - string "2008/03/06" (the format is also locale-dependant) : value =
> > > > datetime.date(2008,03,06)
> > > >
> > > > Like in spreadsheets, special prefixes could be used to force the
> > > > type : for instance '123 would be converted to the *string* "123"
> > > > instead of the *integer* 123
> > > >
> > > > I could code it myself, but this wheel is probably already invented
> > > >
> > > > Regards,
> > > > Pierre
> > >
> > > >>> def convert(x):
> > >         if '.' in x:
> > >                 try: return float(x)
> > >                 except ValueError: return x
> > >         else:
> > >                 try: return int(x)
> > >                 except: return x
> > >
> > > >>> convert('123')
> > > 123
> > > >>> convert('123.99')
> > > 123.98999999999999
> > > >>> convert('hello')
> > > 'hello'
> >
> > Neat solution. The real challenge though is whether to support
> > localised dates, these are all valid:
> > 20/10/01
> > 102001
> > 20-10-2001
> > 20011020
>
> Neat solution doesn't handle the case of using dots as date separators
> e.g. 20.10.01 [they are used in dates in some locales and the
> location of . on the numeric keypad is easier on the pinkies than / or
> -]
>
> I'm a bit dubious about the utility of "most likely format" for ONE
> input.
>
> I've used a brute-force approach when inspecting largish CSV files
> (with a low but non-zero rate of typos etc) with the goal of
> determining what is the most likely type of data in each column.
> E.g 102001 could be a valid MMDDYY date, but not a valid DDMMYY or
> YYMMDD date. 121212 could be all of those. Both qualify as int, float
> and text. A column with 100% of entries qualifying as text, 99.999% as
> float, 99.99% as integer, 99.9% as DDMMYY, and much lower percentages
> as MMDDYY and YYMMDD would be tagged as DDMMYY. The general rule is:
> pick the type whose priority is highest and whose score exceeds a
> threshold. Priorities: date > int > float > text. Finding the date
> order works well with things like date of birth where there is a wide
> distribution of days and years. However a field (e.g. date interest
> credited to bank account) where the day is always 01 and the year is
> in 01 to 08 would give the same scores for each of 3 date orders ...
> eye-balling the actual data never goes astray.
Where dots are used as a date separator, count the dots in the string (and count the commas too). If a single comma appears and is preceded only by digits, or by digits and dots, assume a "foreign float" such as -1,23. If a single dot appears and is preceded only by digits, or by digits and commas, assume a "float". If two dots appear and all three fields are numeric and at most two characters long, assume a date. If two dots appear, the first two fields are numeric and at most two characters long, and the last field is numeric and four characters long, also assume a date.

There are things you can do, but be wary: heuristics like these will not be right 100% of the time.
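A minimal sketch of those rules, in case it helps. The function name guess_kind and the returned labels are made up for illustration. I have also assumed an optional leading sign and that only digits may follow the decimal separator, which is what lets "1,234.56" and "1.234,56" be told apart.

```python
import re

# Digits (optionally grouped with commas), one dot, then digits: 1,234.56
FLOAT_RE = re.compile(r"[+-]?\d[\d,]*\.\d+")
# Digits (optionally grouped with dots), one comma, then digits: 1.234,56
FOREIGN_FLOAT_RE = re.compile(r"[+-]?\d[\d.]*,\d+")

def guess_kind(s):
    """Guess 'date', 'foreign float', 'float', 'int' or 'text' for one string."""
    s = s.strip()
    dots, commas = s.count("."), s.count(",")

    # Two dots and no commas: candidate date such as 20.10.01 or 20.10.2001.
    if dots == 2 and commas == 0:
        fields = s.split(".")
        if all(f.isdigit() for f in fields):
            first_two_short = all(len(f) <= 2 for f in fields[:2])
            if first_two_short and len(fields[2]) in (1, 2, 4):
                return "date"

    # A single comma acting as the decimal separator.
    if commas == 1 and FOREIGN_FLOAT_RE.fullmatch(s):
        return "foreign float"

    # A single dot acting as the decimal separator.
    if dots == 1 and FLOAT_RE.fullmatch(s):
        return "float"

    if re.fullmatch(r"[+-]?\d+", s):
        return "int"
    return "text"
```

Note that a string like "1,234" is still ambiguous: under these rules it comes out as a foreign float, although it could just as well be the integer 1234 with a thousands separator.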