Re: Question about data storage

2005-09-15 Thread Jay Savage
On 9/15/05, Bob Showalter <[EMAIL PROTECTED]> wrote:
> Peter Rabbitson wrote:
> > * I store dirty orders as XML in a SQL table with 2 columns - one for
> > the order id and another for the arbitrary XML structure. Problem once
> > again is that my XML can be thousands of characters long.
> 
> I vote for this option. You can use a TEXT or MEDIUMTEXT column to store
> this data (Unicode, UTF-8).
> 

That would be my choice, too, just because a straight 'select *'
on the table then produces human-readable output when the data is so
messed up that the web interface has trouble parsing it. From a
performance standpoint, though, there's no reason to prefer XML stored
as text over Data::Dumper output stored as a blob.

Another option, if length is an issue, is to write the XML or
Data::Dumper output to an external file and just store the filename in
the DB. It will take up less space on disk--no DB overhead--and the
additional I/O is negligible (in some cases it may actually be more
efficient, depending on your system), especially for a small number of
requests. The only drawback is that you'll have to be careful about
file locking in the human cleanup interface, to make sure that two
people don't try to clean the same file at the same time and munge it.
If you store the data in a TEXT or BLOB column, you can just use a
transaction-safe table and rollbacks. That's not a big loss, though:
writing Perl code with e.g. flock is simpler than writing
transaction-safe SQL, IMHO.
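For illustration, a minimal sketch of the flock side of things (the
spool path and order id are invented for the example):

    use strict;
    use warnings;
    use Fcntl qw(:flock);

    my $order_id = 12345;                    # invented for the example
    my $file     = "/var/spool/orders/dirty/$order_id.xml";

    open my $fh, '+<', $file or die "Can't open $file: $!";
    flock($fh, LOCK_EX)      or die "Can't lock $file: $!";  # blocks until we hold it

    my $xml = do { local $/; <$fh> };    # slurp the stored order
    # ... hand $xml to the cleanup interface, get the cleaned copy back ...
    seek($fh, 0, 0);
    truncate($fh, 0);
    print {$fh} $xml;                    # write the cleaned order back

    close $fh;                           # lock is released on close

As long as every writer asks for LOCK_EX first, two people can't munge
the same file.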

HTH,

-- jay


Re: Question about data storage

2005-09-15 Thread Bob Showalter

Peter Rabbitson wrote:

* I store dirty orders as XML in a SQL table with 2 columns - one for
the order id and another for the arbitrary XML structure. Problem once
again is that my XML can be thousands of characters long.


I vote for this option. You can use a TEXT or MEDIUMTEXT column to store 
this data (Unicode, UTF-8). 
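A minimal sketch with DBI (table and column names are invented;
placeholders take care of quoting, so arbitrary XML is safe to store):

    use strict;
    use warnings;
    use DBI;

    # Hypothetical table:
    #   CREATE TABLE dirty_orders (
    #       order_id  INT PRIMARY KEY,
    #       order_xml MEDIUMTEXT
    #   );
    my $dbh = DBI->connect('dbi:mysql:database=orders', 'user', 'password',
                           { RaiseError => 1 });

    my ($order_id, $xml) = (12345, '<order>...</order>');  # invented sample data

    $dbh->do('INSERT INTO dirty_orders (order_id, order_xml) VALUES (?, ?)',
             undef, $order_id, $xml);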







Re: Question about data storage

2005-09-15 Thread Peter Rabbitson
On Thu, Sep 15, 2005 at 04:57:28PM -0500, Walter Hunnel wrote:

> Not perl, but as a DBA:
> 

Thanks for answering! 

> Unless you are getting the data as XML or have to change the data to XML
> for some other reason, I would not move to XML just for the cleanup you
> are asking about.

I was considering XML because of the structural complexity of the data. 
In some places internally you get a hash-of-hashes-of-hashes-of-hashes 
situation, and I wanted to maintain some ability to search those as well 
(although painfully slowly) - so it had to be clear text. As XML is the 
most portable, I decided to use it, with XML::Simple to pack and unpack 
the structure, in lieu of Data::Dumper or YAML.
You seem to be questioning my use of XML - is there a reason not to do 
it? I will stick with Data::Dumper if XML poses a problem I am not aware 
of.
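For reference, the round trip looks roughly like this (the nested
structure is invented for the example):

    use strict;
    use warnings;
    use XML::Simple;

    my $order = {
        id       => 12345,
        customer => {
            name    => 'J. Doe',
            address => { street => '1 Main St', zip => '123456' },  # bogus zip
        },
    };

    my $xs   = XML::Simple->new(KeepRoot => 1);
    my $xml  = $xs->XMLout({ order => $order });   # hash ref -> XML string
    my $back = $xs->XMLin($xml);                   # XML string -> hash ref again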
 
> Understanding that you specify that you are accepting data from other
> sources and may not have control of validation of the data before
> receiving it, you have no choice but to accept ANYTHING, store it,
> validate it, and deal with garbage as necessary.

This is precisely correct - no control over any of the sources, and 
unfortunately a couple of broken sources. And this is the whole reason I 
am doing it - I need to hold the data efficiently, yet intact until a 
human decides how to change it or whether to keep it at all. 
 
> I have to assume that the data is delimited in some way, so perhaps you
> can parse the delimited data and decide  based on that what to do with
> the bad data.
> 
> If that is the case, you can validate each field and any invalid field
> data can be stored in a separate table with an associated key and the
> key used in place of the data - so the valid data is stored where it
> should be and only bad data is stored in the other table, but you can
> rejoin them when giving the bad data to a human for interpretation. In
> that case, you need only two columns, a key and a large field to hold
> the bad data, every thing in the main table would meet the data
> restriction requirements. Of course, you would need to check for bad
> data table keys in ANY column before printing out any row, or use the
> flag idea from above...

Yes, this is exactly how it works. Each source returns data in its own 
format, which I parse with a designated helper, marking what needs to be 
looked at right there. The problem, once again, was how to store it 
intact until someone looks at it. 
The above is not a bad idea, but it will get messy once my dirty table 
index grows bigger than one of the limits for the fields in the accepted 
table. Then I will not be able to insert a pointer to the dirty table 
into the field, because the pointer value itself becomes illegal for that 
field. On the other hand, I can add an additional "dirty" column which 
will signify that there is SOMETHING in the dirty table waiting to be 
joined with the rest of the data. In the dirty table I keep a keyed 
structure in the second column, once again something like XML or the 
like... 

Then of course comes the question whether it's worth the hassle, and why 
not just store the entire dirty order as-is in a dirty table, which is 
what I am leaning toward. I am currently running some tests on how big 
the table gets in the worst-case scenario (somebody doesn't do his job 
and dirty orders keep piling up). 

Thanks for the hints though :)

Peter






RE: Question about data storage

2005-09-15 Thread Walter Hunnel
Not perl, but as a DBA:

Some possibilities (I am sure there are others):

Make the fields large enough to hold any expected bogus data. Accept
ANYTHING submitted and store it in the database (possibly as character
data). Have a column that is a 'dirty' flag, and set it to (for example)
1 for ALL incoming data. Run a program or SQL against the data, and if
it is valid by your definition of valid, set the flag to 2 (again, just
for example); if it is not valid in ALL fields, set the flag to 3.
Automate the flag setting and checking; then all you have to do is print
the data with a flag of 2 (valid data in my example) and route anything
with a flag of 3 (bogus data in my example) to a human for cleanup or
decision making.
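As a sketch (the table, columns, and the qty check are invented; the
flag values are the ones above), the automated sweep could be as simple
as:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=orders', 'user', 'password',
                           { RaiseError => 1 });

    # 1 = incoming, 2 = valid, 3 = bogus
    my $incoming = $dbh->selectall_arrayref(
        'SELECT * FROM orders WHERE dirty = 1', { Slice => {} });

    for my $row (@$incoming) {
        my $flag = row_is_valid($row) ? 2 : 3;
        $dbh->do('UPDATE orders SET dirty = ? WHERE order_id = ?',
                 undef, $flag, $row->{order_id});
    }

    sub row_is_valid {
        my ($row) = @_;
        return $row->{qty} =~ /^\d+$/;   # one sample check; add one per field
    }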

Making every field that large would, however, waste a lot of space
unnecessarily.

Unless you are getting the data as XML or have to change the data to XML
for some other reason, I would not move to XML just for the cleanup you
are asking about.

If there exists the possibility of entering character data in a numeric
field, the frontend is not doing an acceptable job of validation of the
data. For example, the zip code should never be other than 5 digits, or
9 digits, or 5 digits followed by a dash followed by 4 digits for US
locations. Any other combination is invalid as a zip code and should be
bounced back to the user for clarification.
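In Perl terms, assuming those three formats are the whole story:

    my $zip    = '123456';                                 # invented sample (bogus)
    my $zip_ok = $zip =~ /^(?:\d{5}|\d{9}|\d{5}-\d{4})$/;  # 5, 9, or 5-4 digits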

Understanding that you specify that you are accepting data from other
sources and may not have control of validation of the data before
receiving it, you have no choice but to accept ANYTHING, store it,
validate it, and deal with garbage as necessary.

I have to assume that the data is delimited in some way, so perhaps you
can parse the delimited data and decide, based on that, what to do with
the bad data.

If that is the case, you can validate each field and any invalid field
data can be stored in a separate table with an associated key and the
key used in place of the data - so the valid data is stored where it
should be and only bad data is stored in the other table, but you can
rejoin them when giving the bad data to a human for interpretation. In
that case, you need only two columns: a key and a large field to hold
the bad data. Everything in the main table would meet the data
restriction requirements. Of course, you would need to check for bad
data table keys in ANY column before printing out any row, or use the
flag idea from above...
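A rough sketch of that layout (table and column names invented, reusing
the flag values from my earlier example):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=orders', 'user', 'password',
                           { RaiseError => 1 });

    # Hypothetical side table:
    #   CREATE TABLE bad_fields (
    #       bad_id   INT AUTO_INCREMENT PRIMARY KEY,
    #       raw_data TEXT
    #   );
    my ($order_id, $bogus_zip) = (12345, '123456');   # invented sample data

    # Park the bad value in the side table and get its key back:
    $dbh->do('INSERT INTO bad_fields (raw_data) VALUES (?)', undef, $bogus_zip);
    my $key = $dbh->last_insert_id(undef, undef, 'bad_fields', 'bad_id');

    # Plant the key where the data would have gone, and mark the row:
    $dbh->do('UPDATE orders SET zip = ?, dirty = 3 WHERE order_id = ?',
             undef, $key, $order_id);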

This approach also wastes some space, but nowhere near as much as making
EVERY COLUMN big enough to hold ANY possible data before validation.

I think I would do the second one. All valid data is stored where
expected, all invalid data is stored in one place, and the wasted space
is kept to a minimum.

Good luck.

Walt





From: Peter Rabbitson [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 15, 2005 16:03
To: beginners@perl.org
Subject: Question about data storage

Hello perlers,
This is somewhat of a database-related question, but it is as much
related to programming practices, so excuse me if this is not the
appropriate list.
I am collecting and processing online orders from several different
sources. Each source is queried periodically, information about new
orders is retrieved, parsed, and presented to a central table in a
unified form (orders end up looking alike, no matter where they came
from). Then they are grouped, printed and so on and so forth. 
The problem is that some of the sources allow orders to be submitted
with bogus information (such as 6-digit zip codes), so these orders must
aggregate in an intermediate container and wait for an employee to make
the necessary changes through a web interface. What I cannot decide is
how to implement this intermediate container. So far I have thought of
the following options, but each has a fairly serious shortcoming:

* I store "dirty" orders alongside approved orders in my SQL tables. The
problem is that unlike approved orders, where each field has passed a
content/length check, the dirty order might have a string in a place
where only nemerics are expected - thus preventing me to store this
piece of information. If make separate tables for accepted and dirty
orders, I still face the problem that almost all fields must be of
arbotrary length (TEXT/BLOB) since I can not be sure of the individual
field sizes.

* I store dirty orders as XML in a SQL table with 2 columns - one for
the order id and another for the arbitrary XML structure. Problem once
again is that my XML can be thousands of characters long.

* I store dirty orders as XML in a native format like BDB. Problem is
that I lose the fine-grained user access control that an RDBMS provides.