Benjamin, any chance you can pre-process the data with an external
script prior to loading into your database?

Doing this sort of manipulation in SQL may be possible, but it surely
will be tricky.  RegEx support in MySQL is present, but in my
(limited) experience with it, it's really only good for determining
whether a string matches a pattern, not manipulating strings, which is
what you need to do.

An alternative would be to filter in your reporting language, if
you're using something like perl, ruby, PHP etc to display the data.
Problem would be that performing calculations would also have to be
external - no use of SUM etc in SQL since these aren't currently
numbers.

In either case, you're on to the same approach I would take, I think.
I'd use a  series of regex's, like so

# remove all characters that are not digits commas periods
s/[^\d,\.]//g

# replace all commas with periods
s/,/\./g

# remove periods not used to indicate decimal place, looking for .ddd pattern
# will break if you have 3 digit (or more) decimal precision!
1 while s/^(.*)\.(\d{3})(.*)/$1$2$3/

Now you should have a number that looks like one of the following:
1000000.00
1000000
1
1.00
1000.00
1000
and is therefore suitable for storage in a numeric column in MySQL.

Hopefully all your currency units are the same!

Dan

On 10/10/06, Benjamin Bittner <[EMAIL PROTECTED]> wrote:
Hi list subscribers,

i am inserting millions of product rows from csv files via LOAD DATA INFILE.
Every product has a price, but this "price-strings" vary heavily.
The main difference between them, is the decimal format. Sometimes it is 
european like this:
1.000.000,00
sometimes its american like this:
1,000,000.00

so some examples for these strings are:
EUR 1,00                 (meaning decimal 1.00)
1.00 &euro;             (meaning decimal 1.00)
1.000,00 EUR          (meaning decimal 1000.00)
EUR 1.000,00          (meaning decimal 1000.00)
1.000 EUR               (meaning decimal 1000.00)
1,000 EUR               (meaning decimal 1000.00)

and now, i want to filter/validate that directly within the query.
I think some RegEx could do the trick, but this is to much for me. Ive searched 
for some RegEx for validating decimals, but they allways use just one notation 
of a decimal. I think a good strategy for that, to filter all chars but 
[0-9\.,], and than to do some logic like to check how man chars after the last 
dot or comma (if its two you know its something like 0.00).

No my problem is, i don't know where to start. Maybe with a stored procedure or 
something like that? I don't wanna use another language for this, because i 
would have to do some comprehensive update work then (selecting every row, 
checking the price, updating the price), and these rows get written every day, 
so i would have to do these updates once a day on a couple of million rows.

Anyone get me in the right direction?
Thanks in advance
Regards Benjamin Bittner


---------------------------------
Was ist Glück? Schlafen Fische überhaupt? Die Antworten gibt's auf Yahoo! 
Clever.


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to