Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Christophe



You have made clear to me why my attempt for a RFE for COPY FROM CVS
has found some technical resistance/disagreement, but I still think my
idea even if not so popular for concrete and cultural reasons makes at
least sense to some people


It's a perfectly reasonable problem to want to solve; the question is  
whether COPY is the right place to solve it. I would think that a tool  
that reads the CSV data and produces a proposed schema definition for  
the table would be a more generally-useful approach.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Albretch Mueller
> ... are times local or UTC
~
 this is a rather semantic, not a syntactic issue that some code could
NOT decide based on the data it reads
~
> Should we assume integer or float?
~
 is a dot anywhere in the data you read in for that particular column? ...
~
> Varchar or text?
~
 Is the length of the data read in always less than 255 bytes ( or
characters?)?  ...
~
 You have made clear to me why my attempt for a RFE for COPY FROM CVS
has found some technical resistance/disagreement, but I still think my
idea even if not so popular for concrete and cultural reasons makes at
least sense to some people
~
 DBAs ussualy have a mental map of the data they have on each table,
etc; whereas as a data analyst you find yourself constantly reading
in, cleasing and marshaling data from which you have no prior
knowledge
~
 lbrtchx

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] indexes on functions and create or replace function

2008-08-30 Thread Matthew Dennis
On Thu, Aug 28, 2008 at 7:45 PM, Matthew Dennis <[EMAIL PROTECTED]> wrote:

> Another question though.  Since I could potentially start transaction, drop
> indexes/checks, replace function, create indexes/checks, commit tranasaction
> could I deal with the case of the constant folding into the cached plan by
> flushing the entire cache in the same transaction?  Is cache flushing
> transactional?  The cases I have for this are infrequent in time and the
> overhead of reindexing things, rechecking checks/unique indexes already
> dwarf the performance lost to flushing the cache.
>
> On a related note, if I had a maintenence window where I can shutdown all
> DB access, make the referenced changes to the
> functions/indexes/caches/checks and restart PG - in your opinion, are there
> other likely problems to changing an immutable function under those
> circumstances, or should that be pretty safe?  In other words, I have a
> function that has indexes on it that does the wrong thing - what do I do to
> replace it?
>


In the thread below, we kind of got side tracked on some other stuff and I
never got an answer to the questions above.  Does anyone have any
insight/suggestions about the best way to replace a function that is used by
an index?

http://groups.google.com/group/pgsql.general/browse_thread/thread/92289ef0c2f5a109/8f96fb24bdd668e8


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Christophe


On Aug 30, 2008, at 10:33 AM, Albretch Mueller wrote:

 well, yeah! I would totally agree with you, but since I doubt very
much "COPY FROM CSV" is part of the SQL standard to beging with, why
not spice it up a little more?


I'd guess that coming up with a general algorithm to guess the type  
from a column of CSV text would satisfy no one, since we'd always  
miss a particular case that is important to someone (are times local  
or UTC? Should we assume integer or float? Varchar or text?), and the  
option is a forest of switches that would be extremely complex and  
error prone.


This sounds very much like an application-domain problem, best solved  
in the application domain.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] DUPS in tables columns ERROR: column ". . . " does not exist

2008-08-30 Thread Martijn van Oosterhout
On Sat, Aug 30, 2008 at 01:36:25PM -0400, Albretch Mueller wrote:
> > The system is smart enough to only do the count() once.
> ~
>  But not smart enough to make a variable you declare point to that
> internal variable so that things are clearer/ easier ;-)

The SQL standard has pretty clear rules about what variables can be
referenced from where, and this is one of those places (the rationale
is probably in there too).

Have a nice day
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while 
> boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Adrian Klaver
On Saturday 30 August 2008 9:42:19 am Adrian Klaver wrote:
> On Saturday 30 August 2008 5:23:25 am Albretch Mueller wrote:
> > On Thu, Aug 28, 2008 at 7:50 PM, Adrian Klaver <[EMAIL PROTECTED]> 
wrote:
> > > Define easily.
> >
> > ~
> >  OK, let me try to outline the approach I would go for:
> > ~
> >  I think "COPY FROM CSV" should have three options, namely:
> > ~
> >  1) the way we have used it in which you create the table first
> > ~
> >  2) another way in which defaults are declared, generally as:
> > ~
> >  2.1) aggressive: data type, value and formatting analysis is done; if
> > only 1 or 0 are found declare then a BOOLEAN, if repeated data is
> > found (say state codes) and the stratification nodes cover the rest of
> > the data, stratify the data out to other extra table (they have a name
> > I can't recall now), index it ..., if data is kind of numeric with
> > front slashes and/or hyphen could they possibly be dates? if they are
> > definitelly dates convert them to bigint (and do the formatting in the
> > presentation code (also this a win-win situation with i18n code)) ...
> > ~
> >  2.2) conservative: data type and value, but no formatting analysis is
> > done and the greater encompassing data type is selected, say for 1 or
> > 0 data use bytes [0, 255], for bytes use int, if something could be
> > encoded as char(2), use varchar instead, . . .
> > ~
> >  2.3) dumn: just use the coarsest data type possible; bigint for
> > anything that looks like a number and varchar for the rest
> > ~
> >  the "dumn" option should suggest to the DBA the option they are
> > using, quantified consequences for their desicions (larger DBs for no
> > reason, approx. reduction in speed, . .) and how not to be "dumn"
> > ~
> >  3) or you could define "import templates" declaring which specific
> > data types to use for data in a certain way, which could be declared
> > per column using regexps
> > ~
> >
> > > I could go on, but the point is that table data types require some
> > > thought on the part of the DBA.
> >
> > ~
> >  Well, it still requires their minds and input, but they will have
> > jobs even if they get some help, don't you think so ;-)
> > ~
> >  lbrtchx
>
> This is a combination of more work then necessary and putting the cart
> after the horse.  All I can see happening is delaying the point of decision  

Lets try this again. The cart before the horse.
Memo to self:
1) Drink sufficient coffee.
2) Answer email.

> to a later time and or dumping the decision process on someone else. There
> is already a "dumb" solution that has been brought many times on this list.
> It involve creating a holding table that has text only fields and copying
> the data into and then moving the data from there to a final table. As far
> as import templates I suggest looking at:
> http://pgloader.projects.postgresql.org/
> It also addresses some of your other suggestions. It does not automatically
> create a table though.
>
>
>
> --
> Adrian Klaver
> [EMAIL PROTECTED]



-- 
Adrian Klaver
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] DUPS in tables columns ERROR: column ". . . " does not exist

2008-08-30 Thread Albretch Mueller
> The system is smart enough to only do the count() once.
~
 But not smart enough to make a variable you declare point to that
internal variable so that things are clearer/ easier ;-)
~
 Thanks
 lbrtchx

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Albretch Mueller
> spreadsheet programs (generally; I'm sure there are exceptions) don't have 
> the notion of a schema; each cell can hold its own particular type.
~
 Oh, now I see what Martin meant!
~
> that's not a traditional part of a database engine.
~
 well, yeah! I would totally agree with you, but since I doubt very
much "COPY FROM CSV" is part of the SQL standard to beging with, why
not spice it up a little more?
~
> This is probably one of those classic "twenty lines of Perl" problems.
~
 java since 1.5 comes with a full blown, PERL-like regexp engine
~
> I suggest looking at: http://pgloader.projects.postgresql.org/
> [1] A validator (regex) for each data type . . .
~
 that sort of things was what i was talking about, but I would go
quite a bit farther
~
 Thanks
 lbrtchx

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] DUPS in tables columns ERROR: column ". . . " does not exist

2008-08-30 Thread Tom Lane
"Albretch Mueller" <[EMAIL PROTECTED]> writes:
>  thank you Stefan your SQL worked, but still; I am just asking and my
> programming bias will certainly show, but aren't you effectivly
> "calling" count on the table three times if you go:

The system is smart enough to only do the count() once.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] DUPS in tables columns ERROR: column ". . . " does not exist

2008-08-30 Thread Albretch Mueller
 thank you Stefan your SQL worked, but still; I am just asking and my
programming bias will certainly show, but aren't you effectivly
"calling" count on the table three times if you go:
~
SELECT md5, COUNT(md5)
FROM jdk1_6_0_07_txtfls_md5
GROUP BY md5
HAVING COUNT(md5) > 1
ORDER BY COUNT(md5) DESC;
~
 Shouldn't
~
SELECT md5, COUNT(md5) AS CNT
FROM jdk1_6_0_07_txtfls_md5
GROUP BY md5
HAVING CNT > 1
ORDER BY CNT DESC;
~
 work?
~
jpk=# SELECT md5, COUNT(md5) AS CNT
FROM jdk1_6_0_07_txtfls_md5
GROUP BY md5
HAVING CNT > 1
ORDER BY CNT DESC;
jpk-# jpk-# jpk-# jpk-# ERROR:  column "cnt" does not exist
LINE 4: HAVING CNT > 1
~
 Thanks
 lbrtchx

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Steve Atkins


On Aug 30, 2008, at 9:19 AM, Christophe wrote:



On Aug 30, 2008, at 6:26 AM, Albretch Mueller wrote:

Well, then obviously there is the need for it and you were not
successful enough at convincing these developers that they were
"confusing postgresql with a spreadsheet program"


The behavior you are looking for is typical of a spreadsheet,  
because spreadsheet programs (generally; I'm sure there are  
exceptions) don't have the notion of a schema; each cell can hold  
its own particular type.  That being said, the automatic type- 
guessing that Excel, say, provides is far from foolproof; I've  
probably spent more time cleaning up Excel's bad guesses than would  
have been saved by my just specifying a type for each column.


As has been noted, text representation of values are extremely  
ambiguous as of which Postgres type they mean... and, of course, you  
could have user-defined domains and types as well.  It's true that  
it could take a wild guess, but that's not a traditional part of a  
database engine.


That being said, it would not be too hard to write a client that  
accepted a CSV or tab-delimited file, parsed the header into column  
names, and then scanned the values of the columns to take a  
reasonable guess as to the column type from a highly limited set of  
possibilities.  This is probably one of those classic "twenty lines  
of Perl" problems.


About 150 line of perl[1]. It can actually work quite well, but is  
entirely a client-side problem. None of that sort of heuristics should  
go anywhere near COPY in.


It doesn't seem as though COPY INTO is the right place for that,  
since the particular guesses and set of types that one would make  
strike me as very closely tied to your particular application domain.


Cheers,
  Steve

[1] A validator (regex) for each data type, then for each column track  
which data types it may be, as you scan through the file. Use the  
relative priorities of different data types to assign something  
appropriate for each column, then do a second pass translating the  
format into something Postgresql is comfortable with and feed it into  
pg_putcopydata.




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Adrian Klaver
On Saturday 30 August 2008 5:23:25 am Albretch Mueller wrote:
> On Thu, Aug 28, 2008 at 7:50 PM, Adrian Klaver <[EMAIL PROTECTED]> wrote:
> > Define easily.
>
> ~
>  OK, let me try to outline the approach I would go for:
> ~
>  I think "COPY FROM CSV" should have three options, namely:
> ~
>  1) the way we have used it in which you create the table first
> ~
>  2) another way in which defaults are declared, generally as:
> ~
>  2.1) aggressive: data type, value and formatting analysis is done; if
> only 1 or 0 are found declare then a BOOLEAN, if repeated data is
> found (say state codes) and the stratification nodes cover the rest of
> the data, stratify the data out to other extra table (they have a name
> I can't recall now), index it ..., if data is kind of numeric with
> front slashes and/or hyphen could they possibly be dates? if they are
> definitelly dates convert them to bigint (and do the formatting in the
> presentation code (also this a win-win situation with i18n code)) ...
> ~
>  2.2) conservative: data type and value, but no formatting analysis is
> done and the greater encompassing data type is selected, say for 1 or
> 0 data use bytes [0, 255], for bytes use int, if something could be
> encoded as char(2), use varchar instead, . . .
> ~
>  2.3) dumn: just use the coarsest data type possible; bigint for
> anything that looks like a number and varchar for the rest
> ~
>  the "dumn" option should suggest to the DBA the option they are
> using, quantified consequences for their desicions (larger DBs for no
> reason, approx. reduction in speed, . .) and how not to be "dumn"
> ~
>  3) or you could define "import templates" declaring which specific
> data types to use for data in a certain way, which could be declared
> per column using regexps
> ~
>
> > I could go on, but the point is that table data types require some
> > thought on the part of the DBA.
>
> ~
>  Well, it still requires their minds and input, but they will have
> jobs even if they get some help, don't you think so ;-)
> ~
>  lbrtchx

This is a combination of more work then necessary and putting the cart after 
the horse. All I can see happening is delaying the point of decision to a 
later time and or dumping the decision process on someone else. There is 
already a "dumb" solution that has been brought many times on this list. It 
involve creating a holding table that has text only fields and copying the 
data into and then moving the data from there to a final table. As far as 
import templates I suggest looking at:
http://pgloader.projects.postgresql.org/
It also addresses some of your other suggestions. It does not automatically 
create a table though.



-- 
Adrian Klaver
[EMAIL PROTECTED]

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Christophe


On Aug 30, 2008, at 6:26 AM, Albretch Mueller wrote:

 Well, then obviously there is the need for it and you were not
successful enough at convincing these developers that they were
"confusing postgresql with a spreadsheet program"


The behavior you are looking for is typical of a spreadsheet, because  
spreadsheet programs (generally; I'm sure there are exceptions) don't  
have the notion of a schema; each cell can hold its own particular  
type.  That being said, the automatic type-guessing that Excel, say,  
provides is far from foolproof; I've probably spent more time  
cleaning up Excel's bad guesses than would have been saved by my just  
specifying a type for each column.


As has been noted, text representation of values are extremely  
ambiguous as of which Postgres type they mean... and, of course, you  
could have user-defined domains and types as well.  It's true that it  
could take a wild guess, but that's not a traditional part of a  
database engine.


That being said, it would not be too hard to write a client that  
accepted a CSV or tab-delimited file, parsed the header into column  
names, and then scanned the values of the columns to take a  
reasonable guess as to the column type from a highly limited set of  
possibilities.  This is probably one of those classic "twenty lines  
of Perl" problems.


It doesn't seem as though COPY INTO is the right place for that,  
since the particular guesses and set of types that one would make  
strike me as very closely tied to your particular application domain.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] DUPS in tables columns ERROR: column ". . . " does not exist

2008-08-30 Thread Stefan Kaltenbrunner

Albretch Mueller wrote:

 Hi,
~
 I am trying to get dups from some data from files which md5sums I
previously calculated
~
 Here is my mere mortal SQL
~
SELECT md5, COUNT(md5) AS md5cnt
FROM jdk1_6_0_07_txtfls_md5
WHERE (md5cnt > 1)
GROUP BY md5
ORDER BY md5cnt DESC;


I think you are looking for HAVING as in:

SELECT md5, COUNT(md5)
FROM jdk1_6_0_07_txtfls_md5
GROUP BY md5
HAVING count(md5) > 1


Stefan

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Albretch Mueller
> I think you're confusing postgresql with a spreadsheet program.
~
 I wonder what makes you think so
~
> There are client programs which will do this for you, perhaps you wan one of 
> those?
~
 Well, then obviously there is the need for it and you were not
successful enough at convincing these developers that they were
"confusing postgresql with a spreadsheet program"
~
> PostgreSQL has 60+ types and many look like eachother. How do you propose to 
> differentiate?
~
 Data Types are basically about value ranges (how many bits do you
need to hold the value) and formatting. IMHO, finding an optimal
[im|ex]port reasoning among 60+ types should not be that much of a big
deal. In fact as a data analyst I have exported and imported CSV data
a whole lot and in many occasions it required some extra custom
coding. I may as well consolidate my code as a whole jcsvport library
in java and start an OS project when I find the time to so
~
 lbrtchx

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] DUPS in tables columns ERROR: column ". . . " does not exist

2008-08-30 Thread Albretch Mueller
 Also I know there is a DISTINCT keyword, but I also need to know how
many times the particular data in the column is repeated if it is,
that is why I need to go:
~
 SELECT md5, COUNT(md5) AS md5cnt
 FROM jdk1_6_0_07_txtfls_md5
 WHERE (md5cnt > 1)
 GROUP BY md5
 ORDER BY md5cnt DESC;
~
 Thanks
 lbrtchx

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Martijn van Oosterhout
On Sat, Aug 30, 2008 at 08:23:25AM -0400, Albretch Mueller wrote:
>  OK, let me try to outline the approach I would go for:
> ~
>  I think "COPY FROM CSV" should have three options, namely:

I think you're confusing postgresql with a spreadsheet program. A
database is designed to take care of your data and ensure its
integrity. As such it requires a little more thought.

There are client programs which will do this for you, perhaps you wan
one of those?

What's so hard about:

create table foo (a text, b text);

After which your COPY will complete fine.

>  2.1) aggressive: data type, value and formatting analysis is done; if
> only 1 or 0 are found declare then a BOOLEAN, if repeated data is
> found (say state codes) and the stratification nodes cover the rest of
> the data, stratify the data out to other extra table (they have a name
> I can't recall now), index it ..., if data is kind of numeric with
> front slashes and/or hyphen could they possibly be dates? if they are
> definitelly dates convert them to bigint (and do the formatting in the
> presentation code (also this a win-win situation with i18n code)) ...

PostgreSQL has 60+ types and many look like eachother. How do you
propose to differentiate?

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while 
> boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [GENERAL] ERROR: relation . . . does not exist

2008-08-30 Thread Albretch Mueller
On Thu, Aug 28, 2008 at 7:50 PM, Adrian Klaver <[EMAIL PROTECTED]> wrote:
> Define easily.
~
 OK, let me try to outline the approach I would go for:
~
 I think "COPY FROM CSV" should have three options, namely:
~
 1) the way we have used it in which you create the table first
~
 2) another way in which defaults are declared, generally as:
~
 2.1) aggressive: data type, value and formatting analysis is done; if
only 1 or 0 are found declare then a BOOLEAN, if repeated data is
found (say state codes) and the stratification nodes cover the rest of
the data, stratify the data out to other extra table (they have a name
I can't recall now), index it ..., if data is kind of numeric with
front slashes and/or hyphen could they possibly be dates? if they are
definitelly dates convert them to bigint (and do the formatting in the
presentation code (also this a win-win situation with i18n code)) ...
~
 2.2) conservative: data type and value, but no formatting analysis is
done and the greater encompassing data type is selected, say for 1 or
0 data use bytes [0, 255], for bytes use int, if something could be
encoded as char(2), use varchar instead, . . .
~
 2.3) dumn: just use the coarsest data type possible; bigint for
anything that looks like a number and varchar for the rest
~
 the "dumn" option should suggest to the DBA the option they are
using, quantified consequences for their desicions (larger DBs for no
reason, approx. reduction in speed, . .) and how not to be "dumn"
~
 3) or you could define "import templates" declaring which specific
data types to use for data in a certain way, which could be declared
per column using regexps
~
> I could go on, but the point is that table data types require some thought on 
> the part of the DBA.
~
 Well, it still requires their minds and input, but they will have
jobs even if they get some help, don't you think so ;-)
~
 lbrtchx

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] DUPS in tables columns ERROR: column ". . . " does not exist

2008-08-30 Thread Albretch Mueller
 Hi,
~
 I am trying to get dups from some data from files which md5sums I
previously calculated
~
 Here is my mere mortal SQL
~
SELECT md5, COUNT(md5) AS md5cnt
FROM jdk1_6_0_07_txtfls_md5
WHERE (md5cnt > 1)
GROUP BY md5
ORDER BY md5cnt DESC;
~
 and this is what I get:
~
jpk=# SELECT md5, COUNT(md5) AS md5cnt
FROM jdk1_6_0_07_txtfls_md5
WHERE (md5cnt > 1)
GROUP BY md5
ORDER BY md5cnt DESC;
jpk-# jpk-# jpk-# jpk-# ERROR:  column "md5cnt" does not exist
LINE 3: WHERE (md5cnt > 1)
~
 I think I know what that one means based on the clear error message,
namely md5cntis not a table column itself, but I still think there
should be a way to formulate a simple query like this because PG does
take "ORDER BY md5cnt DESC" even if md5cnt is not a table column, why
on earth then it does not swallow and digest the "WHERE (md5cnt > 1)"
part?
~
 You could go the monkey way running a query like:
~
 SELECT md5, COUNT(md5) AS md5cnt FROM jdk1_6_0_07_txtfls_md5 GROUP BY
md5 ORDER BY md5cnt DESC;
~
 and then use code to jump of the loop when md5cnt becomes 1 or you
could use nested SQL statements
~
 How can you find duplicate records in a table?
~
 Thanks
 lbrtchx

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general