Re: [GENERAL] Date format for bulk copy
On Wed, Oct 13, 2004 at 03:37:14PM -0400, David Rysdam wrote:
> Michael Fuhr wrote:
> >
> > I'd probably choose to extend PostgreSQL rather than hack what
> > already exists, though.
>
> By "extend PostgreSQL" do you mean create a custom input function for
> timestamp?  Are there docs that give hints for replacing the input
> function of an existing type?  Someone else replied similarly, but I'm
> afraid I'm not familiar enough with PG to decipher it all.

I'd been thinking about your idea of creating a custom type or
providing a custom input function for a standard type.  I've done
simple tests where I updated a type's typinput field in
pg_catalog.pg_type and it worked, but I don't know what problems that
might cause.  One thing that comes to mind is that I don't know whether
pg_dump/pg_dumpall would recognize such a change to a standard type.
The more I think about it, the less convinced I am that it's a good
idea.

I like Pierre-Frédéric Caillaud's suggestion to have a filter script
query the database for field positions and types and reformat the
fields as appropriate.  That would give the filter knowledge of which
fields should be reformatted without having to program that logic
explicitly.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
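To make that concrete, a minimal sketch of such a catalog lookup in
Perl with DBI might look like this; the database name, table name, and
connection details are hypothetical:

    #!/usr/bin/perl
    # Sketch: ask the database which columns of the target table are
    # timestamps or bytea, so the filter knows which fields to rewrite.
    # "mydb" and "mytable" are hypothetical names.
    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', '',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare(q{
        SELECT a.attnum, a.attname, t.typname
        FROM pg_catalog.pg_attribute a
        JOIN pg_catalog.pg_type t ON t.oid = a.atttypid
        WHERE a.attrelid = 'mytable'::regclass
          AND a.attnum > 0
          AND NOT a.attisdropped
        ORDER BY a.attnum
    });
    $sth->execute;

    # Map 0-based field position in the dump to the column's type name.
    my %type_of;
    while (my ($pos, $name, $type) = $sth->fetchrow_array) {
        $type_of{$pos - 1} = $type;
    }
    $dbh->disconnect;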
Re: [GENERAL] Date format for bulk copy
On Wed, Oct 13, 2004 at 08:36:50PM +0200, Pierre-Frédéric Caillaud wrote:
>
> You can have your script make a query in the database to fetch the
> data types of the fields and then know which ones are to be
> transformed and how.  The script would take as arguments a dump file
> and a database, schema, and table name; it would read the file and
> pipe the transformed data into psql with a COPY FROM stdin command...
> could save you a lot of work, no?

That's a good idea, and it would mean not having to hack PostgreSQL
or create custom types or custom input functions for standard types.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
Re: [GENERAL] Date format for bulk copy
Michael Fuhr wrote:

> On Wed, Oct 13, 2004 at 01:32:01PM -0400, David Rysdam wrote:
> > Michael Fuhr wrote:
> > > You could filter the data through a script that reformats certain
> > > fields, then feed the reformatted data to PostgreSQL.  This is
> > > usually a trivial task for Perl, awk, sed, or the like.
> >
> > Right, I *can* do this.  But then I have to build knowledge into
> > that script so it can find each of these date fields (there's like
> > 20 of them across 10 different files) and then update that
> > knowledge each time it changes.
>
> In your case that's a reasonable argument against filtering the data
> with a script.  Using a regular expression in the script might reduce
> or eliminate the need for some of the logic, but then you'd run the
> risk of reformatting data that shouldn't have been touched.
>
> > I'm still leaning towards just making postgres accept a ':'
> > delimiter for milliseconds.
>
> Based on your requirements, that might indeed be a better solution.
> I'd probably choose to extend PostgreSQL rather than hack what
> already exists, though.  Doing the latter might break something else,
> and you have to remember to add the hack every time you upgrade the
> server software.  That can cause headaches for whoever inherits the
> system from you unless it's well-documented.

By "extend PostgreSQL" do you mean create a custom input function for
timestamp?  Are there docs that give hints for replacing the input
function of an existing type?  Someone else replied similarly, but I'm
afraid I'm not familiar enough with PG to decipher it all.

> > Why not the user-defined type with associated user-defined input
> > function?
>
> If filtering the data is awkward, then that might be a better way to
> go.

I think I will, when I get to that point.
Re: [GENERAL] Date format for bulk copy
> > Right, I *can* do this.  But then I have to build knowledge into
> > that script so it can find each of these date fields (there's like
> > 20 of them across 10 different files) and then update that
> > knowledge each time it changes.
>
> In your case that's a reasonable argument against filtering the data
> with a script.  Using a regular expression in the script might reduce
> or eliminate the need for some of the logic, but then you'd run the
> risk of reformatting data that shouldn't have been touched.

Yes, but:

You can have your script make a query in the database to fetch the
data types of the fields and then know which ones are to be
transformed and how.  The script would take as arguments a dump file
and a database, schema, and table name; it would read the file and
pipe the transformed data into psql with a COPY FROM stdin command...
could save you a lot of work, no?

A bonus is that your script can complain if it detects
incompatibilities, and be more fool-proof.

Plu
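A rough sketch of that pipeline in Perl, under the assumption that the
dump is tab-delimited and that the position-to-type map has already
been fetched from the catalog (all names here are made up):

    #!/usr/bin/perl
    # Sketch: read a bcp dump on stdin, rewrite the fields that need
    # it, and pipe the result into psql's COPY FROM stdin.  %type_of
    # comes from a catalog query like the one shown earlier;
    # "mydb"/"mytable" and the field positions are hypothetical.
    use strict;

    my %type_of = (2 => 'timestamp', 5 => 'bytea');   # made-up map

    open my $psql, '|-', 'psql', '-d', 'mydb',
                         '-c', 'COPY mytable FROM stdin'
        or die "cannot start psql: $!";

    while (my $line = <STDIN>) {
        chomp $line;
        my @fields = split /\t/, $line, -1;
        while (my ($i, $type) = each %type_of) {
            if ($type =~ /^timestamp/) {
                # "...10:28:00:000AM" -> "...10:28:00.000AM"
                $fields[$i] =~ s/:(\d{3})([AP]M)$/.$1$2/;
            }
            # A bytea field would be re-escaped here (see the
            # hex-to-octal helper later in the thread).
        }
        print {$psql} join("\t", @fields), "\n";
    }
    print {$psql} "\\.\n";   # end-of-data marker for COPY
    close $psql or die "psql failed";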
Re: [GENERAL] Date format for bulk copy
On Wed, Oct 13, 2004 at 01:32:01PM -0400, David Rysdam wrote:
> Michael Fuhr wrote:
> > You could filter the data through a script that reformats certain
> > fields, then feed the reformatted data to PostgreSQL.  This is
> > usually a trivial task for Perl, awk, sed, or the like.
>
> Right, I *can* do this.  But then I have to build knowledge into that
> script so it can find each of these date fields (there's like 20 of
> them across 10 different files) and then update that knowledge each
> time it changes.

In your case that's a reasonable argument against filtering the data
with a script.  Using a regular expression in the script might reduce
or eliminate the need for some of the logic, but then you'd run the
risk of reformatting data that shouldn't have been touched.

> I'm still leaning towards just making postgres accept a ':'
> delimiter for milliseconds.

Based on your requirements, that might indeed be a better solution.
I'd probably choose to extend PostgreSQL rather than hack what
already exists, though.  Doing the latter might break something else,
and you have to remember to add the hack every time you upgrade the
server software.  That can cause headaches for whoever inherits the
system from you unless it's well-documented.

> Also, how much would a secondary script slow down the bulk copy,
> if any?

Probably some, but perhaps not enough to be significant.  I'd expect
the database to be the bottleneck, but I'd have to run tests to say
for certain.

> > Sounds like Sybase is dumping in hex, whereas PostgreSQL expects
> > octal.  If you can't change the dump format, then again, filtering
> > the data through a script might work.
>
> Oh, so I can load binary data into PG if it's ASCII-encoded octal?

Yes -- see the "Binary Data Types" documentation:

http://www.postgresql.org/docs/7.4/static/datatype-binary.html

> Why not the user-defined type with associated user-defined input
> function?

If filtering the data is awkward, then that might be a better way to
go.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
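On the hex-versus-octal point, a small Perl helper along these lines
could re-escape a Sybase hex field into the octal form COPY accepts
for bytea; this is a sketch assuming the dump field is plain hex such
as 0x4142:

    # Sketch: convert a hex-encoded field (e.g. "0x4142" from Sybase
    # bcp) into the octal-escaped form COPY accepts for bytea.  Inside
    # COPY text data a backslash must itself be escaped, hence two
    # backslashes per byte: "0x4142" becomes \\101\\102, which COPY
    # hands to bytea input as \101\102, i.e. the bytes "AB".
    sub hex_to_copy_bytea {
        my ($hex) = @_;
        $hex =~ s/^0x//i;                      # drop any 0x prefix
        my $out = '';
        for my $byte (unpack '(A2)*', $hex) {  # two hex digits per byte
            $out .= sprintf '\\\\%03o', hex $byte;
        }
        return $out;
    }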
Re: [GENERAL] Date format for bulk copy
Greg Stark wrote:

> David Rysdam <[EMAIL PROTECTED]> writes:
>
> > In my brute force port, I just bulk copied the date fields into
> > temporary tables and then did a to_timestamp(field, 'Mon DD
> > HH:MI:SS:MSAM').
>
> > Again, I created a temporary table and did a decode(field, 'hex')
> > to the real table.
>
> This is the standard approach.  You're rather lucky these are the
> only data representation changes you've had to do so far.  I fear
> you'll run into more and more complex changes over time, and trying
> to avoid the temporary table will get harder and harder.

No, I think I'm OK there.  These are programmatically generated values
and I've already been through them all once.  Just the millisecond
issue and the hex binary issue, AFAIK.

> If it were me I would consider processing the files in perl.  It
> should be pretty easy to do both of these modifications very quickly.

Very quick and easy to do one time.  A little trickier to handle in an
elegant, maintainable way for the dozens of data reloads I do every
month for GBs of data onto two different server types.

> If you really want to go with custom C code then you might be able
> to just grab the byteain/byteaout functions from
> src/backend/utils/adt/varlena.c into a separate module and create
> new functions with modified names.  Load it with
>
>   CREATE FUNCTION byteain ... AS 'my_bytea_funcs.so', 'my_byteain';
>
> Or maybe create the function as my_byteain in postgres and then
> update the catalog entries somehow.  I'm not sure how to do that but
> it shouldn't be too hard.  And it might make it easier to do the
> substitution for the data load and then undo the change afterwards.

Why not create a type and then define the load function to be the
equivalent of "decode('hex')"?

> Doing the same for timestamp is a bit trickier, but you could copy
> ParseDateTime from datetime.c as a static function for your module.
>
> Be careful though: test this out thoroughly on a test database.  I'm
> not sure of all the impacts of altering the in/out functions for
> data types.  I expect it would break pg_dump, for example.  And I
> would worry about the statistics tables too.

This is kind of a hybrid of my suggestions, and the problems are a
hybrid as well. :)

1) Just change the timestamp type so that it allows a ':' delimiter
   for milliseconds.  Potential problems: other parts of the code
   won't expect it; people don't want that.

2) Create a new type.  Potential problem: things like date ranges
   probably wouldn't work anymore, since the server would no longer
   know it's a date.
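For reference, the temporary-table load David describes might look
roughly like this in Perl with DBI; the table and column names are
invented, and the format string adds YYYY for the year that appears
in the sample data earlier in the thread:

    #!/usr/bin/perl
    # Sketch of the staging-table approach: COPY the raw text into a
    # temporary table, then convert with to_timestamp() and decode()
    # on the way into the real table.  "mydb", "mytable", and the
    # column names are made up.
    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', '',
                           { RaiseError => 1 });

    $dbh->do(q{CREATE TEMP TABLE staging
               (id text, when_t text, blob_t text)});

    # Server-side COPY for brevity; \copy or COPY FROM stdin also work.
    $dbh->do(q{COPY staging FROM '/tmp/mytable.bcp'});

    $dbh->do(q{
        INSERT INTO mytable (id, when_t, blob_t)
        SELECT id::integer,
               to_timestamp(when_t, 'Mon DD YYYY HH:MI:SS:MSAM'),
               decode(blob_t, 'hex')
        FROM staging
    });

    $dbh->disconnect;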
Re: [GENERAL] Date format for bulk copy
Michael Fuhr wrote:

> On Wed, Oct 13, 2004 at 10:06:58AM -0400, David Rysdam wrote:
> > Sybase bulk copies the date fields out in this format:
> >
> >   Mar 4 1973 10:28:00:000AM
> >
> > Postgresql's COPY (or psql \copy) doesn't like that format.
>
> You could filter the data through a script that reformats certain
> fields, then feed the reformatted data to PostgreSQL.  This is
> usually a trivial task for Perl, awk, sed, or the like.

Right, I *can* do this.  But then I have to build knowledge into that
script so it can find each of these date fields (there's like 20 of
them across 10 different files) and then update that knowledge each
time it changes.

I'm still leaning towards just making postgres accept a ':' delimiter
for milliseconds.

Also, how much would a secondary script slow down the bulk copy, if
any?

> > I have a similarish problem with another field type.  In Sybase
> > it's a binary format.  In postgres it is a binary format (bytea).
> > But Sybase bcps the data out in ASCII.  Sybase recognizes when it
> > is a binary field and auto-converts the ASCII back to binary.
> > Postgres doesn't.  Again, I created a temporary table and did a
> > decode(field, 'hex') to the real table.
>
> Sounds like Sybase is dumping in hex, whereas PostgreSQL expects
> octal.  If you can't change the dump format, then again, filtering
> the data through a script might work.

Oh, so I can load binary data into PG if it's ASCII-encoded octal?

Why not the user-defined type with associated user-defined input
function?
Re: [GENERAL] Date format for bulk copy
On Wed, Oct 13, 2004 at 10:06:58AM -0400, David Rysdam wrote:
> Sybase bulk copies the date fields out in this format:
>
>   Mar 4 1973 10:28:00:000AM
>
> Postgresql's COPY (or psql \copy) doesn't like that format.

You could filter the data through a script that reformats certain
fields, then feed the reformatted data to PostgreSQL.  This is
usually a trivial task for Perl, awk, sed, or the like.

> I have a similarish problem with another field type.  In Sybase it's
> a binary format.  In postgres it is a binary format (bytea).  But
> Sybase bcps the data out in ASCII.  Sybase recognizes when it is a
> binary field and auto-converts the ASCII back to binary.  Postgres
> doesn't.  Again, I created a temporary table and did a
> decode(field, 'hex') to the real table.

Sounds like Sybase is dumping in hex, whereas PostgreSQL expects
octal.  If you can't change the dump format, then again, filtering
the data through a script might work.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
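To illustrate how small such a filter can be for the date issue alone,
a Perl sketch like the following would rewrite the millisecond
delimiter, with the usual caveat that a bare regex can also touch data
that merely looks like a timestamp:

    #!/usr/bin/perl -p
    # Sketch: rewrite "Mar 4 1973 10:28:00:000AM" as
    # "Mar 4 1973 10:28:00.000AM", which PostgreSQL's timestamp input
    # accepts.  Run as a filter:  fixdates.pl < dump.bcp | psql ...
    s/(\d{2}:\d{2}:\d{2}):(\d{3})([AP]M)/$1.$2$3/g;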
Re: [GENERAL] Date format for bulk copy
David Rysdam <[EMAIL PROTECTED]> writes:

> In my brute force port, I just bulk copied the date fields into
> temporary tables and then did a to_timestamp(field, 'Mon DD
> HH:MI:SS:MSAM').

> Again, I created a temporary table and did a decode(field, 'hex') to
> the real table.

This is the standard approach.  You're rather lucky these are the
only data representation changes you've had to do so far.  I fear
you'll run into more and more complex changes over time, and trying
to avoid the temporary table will get harder and harder.

If it were me I would consider processing the files in perl.  It
should be pretty easy to do both of these modifications very quickly.

If you really want to go with custom C code then you might be able to
just grab the byteain/byteaout functions from
src/backend/utils/adt/varlena.c into a separate module and create new
functions with modified names.  Load it with

  CREATE FUNCTION byteain ... AS 'my_bytea_funcs.so', 'my_byteain';

Or maybe create the function as my_byteain in postgres and then
update the catalog entries somehow.  I'm not sure how to do that but
it shouldn't be too hard.  And it might make it easier to do the
substitution for the data load and then undo the change afterwards.

Doing the same for timestamp is a bit trickier, but you could copy
ParseDateTime from datetime.c as a static function for your module.

Be careful though: test this out thoroughly on a test database.  I'm
not sure of all the impacts of altering the in/out functions for data
types.  I expect it would break pg_dump, for example.  And I would
worry about the statistics tables too.

--
greg
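To make "update the catalog entries somehow" concrete, the swap could
be done like this in Perl with DBI.  This is only a sketch of the
approach the thread itself calls risky; it assumes a my_byteain
function has already been created with CREATE FUNCTION, and it is not
a recommendation:

    #!/usr/bin/perl
    # Sketch ONLY -- the thread warns this may break pg_dump and more.
    # Temporarily point bytea's input function at a custom my_byteain
    # (assumed to exist already), run the load, then restore the
    # original.  Requires superuser; try it on a scratch database.
    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=scratch', 'postgres', '',
                           { RaiseError => 1 });

    $dbh->do(q{UPDATE pg_catalog.pg_type
               SET typinput = 'my_byteain'::regproc
               WHERE typname = 'bytea'});

    # ... perform the COPY here ...

    $dbh->do(q{UPDATE pg_catalog.pg_type
               SET typinput = 'byteain'::regproc
               WHERE typname = 'bytea'});

    $dbh->disconnect;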