[HACKERS] Drop separate CRC32 implementations in ltree, tsearch, tsearch2?

2005-06-02 Thread Tom Lane
contrib/ltree, contrib/tsearch, and contrib/tsearch2 each contain
implementations of CRC32 calculations.  I think these are now pretty
redundant with the CRC32 code existing in the main backend.  They
use a different CRC polynomial than the main backend code does,
but it's hard to credit that that is an important difference.
Anyone see a reason not to rip that code out and make these modules
use pg_crc.h/pg_crc.c instead?
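For illustration, the guts of a table-driven CRC32 are the same for any
polynomial; only the lookup table's contents change.  A minimal sketch
(not the actual pg_crc code):

#include <stddef.h>
#include <stdint.h>

/*
 * Build the lookup table for a given (reflected) polynomial.  0xEDB88320
 * is the reflected form of the common CRC-32 polynomial; a different
 * polynomial just produces a different table.
 */
static uint32_t crc_table[256];

static void
crc32_init(uint32_t poly)
{
	uint32_t	i;

	for (i = 0; i < 256; i++)
	{
		uint32_t	c = i;
		int			k;

		for (k = 0; k < 8; k++)
			c = (c & 1) ? (c >> 1) ^ poly : (c >> 1);
		crc_table[i] = c;
	}
}

/* Standard byte-at-a-time CRC over a buffer. */
static uint32_t
crc32_calc(const void *buf, size_t len)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint32_t	crc = 0xFFFFFFFF;

	while (len-- > 0)
		crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
	return crc ^ 0xFFFFFFFF;
}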

regards, tom lane



Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Oliver Jowett
Alon Goldshuv wrote:

 I think that the basic issue is that there are some database users that would 
 like to take their data and put it into the database without pre-processing 
 it [...]
 The only responsibility of these users is to explicitly escape any delimiter 
 or 0x0A (LF) characters that they intend to have as DATA.  That's all. 

Haven't you just replaced one preprocessing step with another, then?

-O



Re: [HACKERS] Drop separate CRC32 implementations in ltree, tsearch, tsearch2?

2005-06-02 Thread Teodor Sigaev

contrib/ltree, contrib/tsearch, and contrib/tsearch2 each contain
implementations of CRC32 calculations.  I think these are now pretty
redundant with the CRC32 code existing in the main backend.  They
use a different CRC polynomial than the main backend code does,
but it's hard to credit that that is an important difference.
I think it doesn't matter much.  Although we did experiment to choose the best 
hash function and picked crc32: other functions produced many more collisions 
on non-English words.

Anyone see a reason not to rip that code out and make these modules
use pg_crc.h/pg_crc.c instead?


contrib/tsearch is already marked as obsolete, and I think we can remove it 
from contrib.  BTW, ltree/crc32.c and tsearch2/crc32.c have a small 
difference: the ltree variant may lowercase the string (depending on the 
LOWER_NODE define) before calculating the checksum.
But pg_crc computes a 64-bit checksum while the contrib modules need 32 bits. 
Would it be correct to use only half of the value?  I am afraid the number of 
collisions would grow...
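The simplest options would be to keep the low-order word or to fold the two
halves together; a sketch, where crc64_calc is just a stand-in, not the real
pg_crc interface:

#include <stddef.h>
#include <stdint.h>

/* Stand-in for the backend's 64-bit CRC; not the real pg_crc API. */
extern uint64_t crc64_calc(const void *buf, size_t len);

static uint32_t
crc32_from_crc64(const void *buf, size_t len)
{
	uint64_t	crc = crc64_calc(buf, len);

	/* fold the high half into the low half so all 64 bits contribute */
	return (uint32_t) (crc ^ (crc >> 32));
}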




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Hannu Krosing
On Wed, 2005-06-01 at 18:05 -0400, Bruce Momjian wrote:

 What we could do is to do no-WAL automatically for empty tables (like
 when a database is first loaded),

You forget that some databases use WAL for PITR / replication and doing
it automatically there would surely mess up their replica.

How is index creation handled if it is not logged in WAL?
 - Is it not automatically WAL'ed?
 - Must one recreate indexes after PITR or failover?

  and use the flag for cases where the
 table is not zero pages.  The fact is that database loads are a perfect
 case for this optimization, old dumps are not going to have that flag
 anyway, and automatic is better if we can do it.

-- 
Hannu Krosing [EMAIL PROTECTED]



Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Hannu Krosing
On Wed, 2005-06-01 at 11:31 -0400, Tom Lane wrote:
 Hannu Krosing [EMAIL PROTECTED] writes:
  I think this should be a decision done when creating a table, just like
  TEMP tables. So you always know if a certain table is or is not
  safe/replicated/recoverable.
  This has also the advantage of requiring no changes to actual COPY and
  INSERT commands.
 
 That doesn't seem right to me; the scenario I envision is that you are
 willing to do the initial data loading over again (since you presumably
 still have the source data available).  But once you've got it loaded
 you want full protection.

What I meant was that since it can't be safely replicated using log-
shipping, it should be visible as such.

 Perhaps it could work to use an ALTER TABLE command to flip the state.

No. It would be the same as flipping a TEMP table to an ordinary table,
which we don't support, and IMHO for a good reason.

 But I'm not really seeing the point compared to treating it as a COPY
 option.  

The point is having a separate (sub)type of storage - non-WAL/non-
replicated table and its indexes.

 I do not believe that anyone needs this to work on individual
 INSERT commands --- if you are after max speed, why aren't you using
 COPY?  And treating it as an ALTER property opens the possibility of
 forgetting to ALTER the table back to normal behavior, which would be
 a foot-gun of large caliber indeed :-(

That's what I'm trying to avoid.  If it is obvious that the whole table
is quasi-stable (in the PITR/log-shipping sense), it is more clearly a user
choice what kinds of data can be stored there.  Same as TEMP tables again.

-- 
Hannu Krosing [EMAIL PROTECTED]



Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Zeugswetter Andreas DAZ SD

 Escape processing would proceed as before, but the semantics would change to
 allow the use of different characters as the escape character, in addition
 to the special characters for delimiter and newline.

If you mean syntax to specify the escape and delimiter (and newline?), 
that is a great addition that imho all would like.

  Also, escape
 processing would be off by default, so that the only special
 characters by default would be the newline and delimiter characters.

I don't see how that would be an advantage.  What is so wrong with having
syntax to choose no escape processing, like escape ''?
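
To make the comparison concrete, here is an illustrative field scanner with
a configurable escape character (a sketch, not the actual COPY code); with
escape '' no byte is special except the delimiter and newline:

#include <stddef.h>

/*
 * Find the end of one field.  Passing esc == 0 disables escape processing
 * entirely, so the only special bytes are the delimiter and newline.
 */
static const char *
scan_field(const char *p, const char *end, char delim, char esc)
{
	while (p < end)
	{
		if (esc != 0 && *p == esc && p + 1 < end)
			p += 2;				/* skip the escape and the escaped byte */
		else if (*p == delim || *p == '\n')
			break;
		else
			p++;
	}
	return p;					/* at delimiter, newline, or end of buffer */
}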

Andreas



Re: [HACKERS] Google's Summer of Code ...

2005-06-02 Thread Jonah H. Harris

So, has anyone gone ahead and contacted Google yet?





Re: [HACKERS] Google's Summer of Code ...

2005-06-02 Thread Marc G. Fournier


Yes, been working on this since last night ...

On Thu, 2 Jun 2005, Jonah H. Harris wrote:


So, has anyone gone ahead and contacted Google yet?

Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664



Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Luke Lonergan
Andreas,

 Escape processing would proceed as before, but the semantics would change to
 allow the use of different characters as the escape character, in addition
 to the special characters for delimiter and newline.
 
 If you mean syntax to specify the escape and delimiter (and newline?),
 that is a great addition that imho all would like.

Cool.  We've found it invaluable with customers doing web data analysis.
 
 I don't see how that would be an advantage.  What is so wrong with having
 syntax to choose no escape processing, like escape ''?

I agree.

- Luke





Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Luke Lonergan
Oliver,

 Haven't you just replaced one preprocessing step with another, then?

Generally not.  The most common problem with the current choice of escape
character is that there are *lots* of data load scenarios with backslash in
the text strings.  The extra preprocessing to escape them is unnecessary on
other databases and, in effect, causes the load to be even slower because
you have to prepare the data ahead of time.

Note that this patch can also do escape processing, and the net result
will still be 5+ times faster than what is there.

In the data warehousing industry, data conversion and manipulation is
normally kept distinct from data loading.  Conversion is done by tools
called ETL (Extract Transform Load) and the database will have a very fast
path for direct loading of the resulting data.  PostgreSQL is definitely a
strange database right now in that there is a default filter applied to the
data on load.

It's even more strange because the load path is so slow, and we've now found
that the slowness is mostly due to non-optimized parsing and attribute
conversion routines.  The question of how to do escape processing is a
separate one, but it is wrapped up in the question of whether to introduce
a new loading routine or to optimize the old one.

- Luke





Re: [HACKERS] Google's Summer of Code ...

2005-06-02 Thread Jonah H. Harris

Cool.  Thanks Marc.

Marc G. Fournier wrote:



Yes, been working on this since last night ...

On Thu, 2 Jun 2005, Jonah H. Harris wrote:


So, has anyone gone ahead and contacted Google yet?


Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Steve Atkins
On Thu, Jun 02, 2005 at 07:33:13AM -0700, Luke Lonergan wrote:
 Oliver,
 
  Haven't you just replaced one preprocessing step with another, then?
 
 Generally not.  The most common problem with the current choice of escape
 character is that there are *lots* of data load scenarios with backslash in
 the text strings. 

I can only think of one case where it's common: Windows filenames.  But if
you're going to support arbitrary data in a load, then whatever escape
character you choose will appear sometimes.

 The extra preprocessing to escape them is unnecessary on
 other databases and, in effect, causes the load to be even slower because
 you have to prepare the data ahead of time.

 Also, note that this patch can also do escape processing and the net result
 will still be 5+ times faster than what is there.

I strongly suspect that a patch to improve performance without changing
behaviour would be accepted with no questions asked.

One that allowed specifying the field and record delimiters and the
escape character and null symbol might require more discussion about
an appropriate syntax at the very least. So you may want to separate
the two.

 In the data warehousing industry, data conversion and manipulation is
 normally kept distinct from data loading.  Conversion is done by tools
 called ETL (Extract Transform Load) and the database will have a very fast
 path for direct loading of the resulting data.  PostgreSQL is definitely a
 strange database right now in that there is a default filter applied to the
 data on load.

 It's even more strange because the load path is so slow, and now that we've
 found that the slowness is there mostly because of non-optimized parsing and
 attribute conversion routines.  The question of how to do escape processing
 is a separate one, but is wrapped up in the question of whether to introduce
 a new loading routine or whether to optimize the old one.

There are already two loader routines.  One of them is text-based and is
designed so that the load format is easy to generate with simple text
manipulation tools, using delimiters.  It also allows (unlike your
suggestion) loading arbitrary data from a text file.

Because it allows for arbitrary data and uses delimiters to separate
fields it has to use an escaping mechanism.

If you want to be able to load arbitrary data and not have to handle
escape characters, there are two obvious ways to do it.

The first is that used by MIME and suggested by you: use a separator that
you believe will not appear in the data.  That can be done by using a long
multicharacter separator containing random characters and assuming that
sequence won't appear; by parsing the input data twice, looking for strings
that don't appear, for use as delimiters; or by taking advantage of
knowledge about what characters can and cannot appear in the input data.
(I can't imagine any case involving data-mining of web logs where the last
is likely to be relevant.)

The other is to use a length+data format for each tuple, avoiding all
issues of escapes in the data and allowing arbitrary data to be
represented.  That's how the binary load format PG supports works, I
believe.  If you're really concerned about speed of load, that may be a
better format for your front-end to generate, perhaps?
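
The length+data idea in a nutshell (a generic sketch, not PG's actual
binary COPY layout):

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>			/* htonl */

/*
 * Write one field as a 4-byte network-order length followed by the raw
 * bytes.  Any byte value, including delimiters and newlines, round-trips
 * with no escaping at all.
 */
static int
write_field(FILE *out, const void *data, uint32_t len)
{
	uint32_t	nlen = htonl(len);

	if (fwrite(&nlen, sizeof(nlen), 1, out) != 1)
		return -1;
	if (len > 0 && fwrite(data, len, 1, out) != 1)
		return -1;
	return 0;
}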

Cheers,
  Steve



Re: [HACKERS] Google's Summer of Code ...

2005-06-02 Thread Vishal Kashyap @ [SaiHertz]
Dear all,

 Incidentally I haven't seen any objections; if there are none, should we go
 ahead and whip up an email to google?  Do we want to run this through the

We must go straight away for it.

I wonder why big names like Bruce, Tom, and Josh have not responded to this
yet.

In my opinion, the best way to do it is to make a single person responsible
for all of this and let everything be done via him, from filing the
application to the end.

-- 
With Best Regards,
Vishal Kashyap.
Lead Software Developer,
http://saihertz.com,
http://vishalkashyap.tk



Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Luke Lonergan
Steve,

 I can only think of one where it's common. Windows filenames.

Nearly all weblog data then.

 But if
 you're going to support arbitrary data in a load then whatever escape
 character you choose will appear sometimes.

If we allow an 8-bit character set in the text file, then yes, any
delimiter you choose has the potential to appear in your input data.  In
practice, with *mostly* 7-bit ASCII characters and even with international
8-bit text encodings, you can choose a delimiter and newline that work well.
Exceptions are handled by the forthcoming single row error handling patch.

 I strongly suspect that a patch to improve performance without changing
 behaviour would be accepted with no questions asked.

Understood - not sure it's the best thing for support of the users yet.
We've found a large number of issues from customers with the unmodified
behavior.
  
 There are already two loader routines. One of them is text-based and is
 designed for easy generation of data load format using simple text
 manipulation tools by using delimiters. It also allows (unlike your
 suggestion) for loading of arbitrary data from a text file.

Not to distract, but try loading a binary null into a text field.  The
assumption of null-terminated strings penetrates deep into the codebase.
The existing system does not allow loading arbitrary data from a text
file.

Our suggestion allows for escapes, but requires the ability to specify
alternate characters or none.
 
 Because it allows for arbitrary data and uses delimiters to separate
 fields it has to use an escaping mechanism.
 
 If you want to be able to load arbitrary data and not have to handle
 escape characters there's are two obvious ways to do it.

Let's dispense with the notion that we're suggesting no escapes (see above).

Binary with a bookends format is a fine idea, and would be my personal
preference if it were fast, which it isn't.  Customers in the web log
analysis and other data warehousing fields prefer mostly 7-bit ASCII text
input, which is what we're trying to support with this change.

- Luke





Re: [HACKERS] Google's Summer of Code ...

2005-06-02 Thread Dave Page
 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Vishal Kashyap @ [SaiHertz]
 Sent: 02 June 2005 16:19
 To: Robert Treat
 Cc: Jonah H. Harris; Marc G. Fournier; 
 pgsql-hackers@postgresql.org; [EMAIL PROTECTED]
 Subject: Re: [HACKERS] Google's Summer of Code ...
 
 Dear all,
 
 Incidentally I haven't seen any objections; if there are none, should we
 go ahead and whip up an email to google?  Do we want to run this through
 the
 
 We must go straight away for it.
 
 I wonder why big names like Bruce, Tom, and Josh have not responded to
 this yet.

I thought Marc was dealing with it?  And they don't get much bigger than him
<grin> :-)

Regards, Dave.



Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Alon Goldshuv
 2) A modified command syntax for introducing direct single-row error
 handling.  By direct I mean a row that, if rejected from within the COPY
 command context, does not throw an error and roll back the whole
 transaction.  Instead the error is caught and recorded elsewhere, maybe in
 some error table, with some more information that can later on be
 retrieved.  The following rows continue to be processed.  This way there is
 barely any error handling overhead.
 
 Is there any idea on exactly how this would be done?  Do you plan on
 using savepoints to implement it?  I fail to see how this is barely any
 overhead.  Savepoints are not that expensive, but they are not free either.
 (No, I haven't measured it.)

Good question.  I am not entirely sure this is possible, as I haven't
thought it through completely yet.

I guess data errors can be divided into two main categories: malformed data,
where the error is detected even before forming a tuple, and errors caused
by some constraint violation, that is, detected after the tuple is formed
and inserted.

From what I hear and have experienced, the big majority of errors are of the
first type.  In that case the error can be caught, the data line + line
number + error description can be inserted into an ERROR table (all TEXT
fields), and COPY can skip forming a tuple and move on to parsing the next
line.  In this process there is barely any overhead.
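
The control flow I have in mind is roughly the following.  This is a toy
sketch: parse_line stands in for real tuple formation, and the error
stream stands in for the ERROR table.

#include <stdio.h>
#include <stddef.h>

#define NFIELDS 3				/* toy schema: three tab-separated fields */

/* Toy stand-in for tuple formation: succeed iff the field count matches. */
static int
parse_line(const char *line, char *errmsg, size_t errlen)
{
	int			fields = 1;
	const char *p;

	for (p = line; *p && *p != '\n'; p++)
		if (*p == '\t')
			fields++;
	if (fields != NFIELDS)
	{
		snprintf(errmsg, errlen, "expected %d fields, found %d",
				 NFIELDS, fields);
		return -1;
	}
	return 0;
}

/* Malformed rows are recorded (line number, error, raw line) and skipped. */
static void
load_with_error_table(FILE *in, FILE *errtab)
{
	char		line[8192];
	char		errmsg[256];
	long		lineno = 0;

	while (fgets(line, sizeof(line), in) != NULL)
	{
		lineno++;
		if (parse_line(line, errmsg, sizeof(errmsg)) != 0)
		{
			fprintf(errtab, "%ld\t%s\t%s", lineno, errmsg, line);
			continue;			/* keep going; no transaction abort */
		}
		/* ... form the tuple and insert it ... */
	}
}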

The more difficult part obviously is handling the second error type, which I
haven't looked at deeply yet.  Hopefully it is not impossible to do while
keeping transaction integrity (any ideas, anyone?).  The overhead for this
one will probably be larger, but again, we expect those errors to happen
less often (in most cases at least).  Nevertheless, it is surely much faster
than recursively narrowing down batch sizes.


Alon.





Re: [HACKERS] NOLOGGING option, or ?

2005-06-02 Thread Greg Stark

Luke Lonergan [EMAIL PROTECTED] writes:

 In the data warehousing industry, data conversion and manipulation is
 normally kept distinct from data loading.  

It's a bit strange to call this conversion or manipulation. One way or another
you have to escape whatever your delimiters are. How would you propose loading
strings that contain newlines?

The ETL transformations you're talking about are a different beast entirely.
You're talking about things like canonicalizing case or looking up foreign key
ids to replace strings and such.

Simply parsing the file format properly isn't part of that game.  Otherwise
where do you stop?  You could take this to a silly extreme and say postgres
should just load each line as a record with a single text field and let
tools deal with the actual parsing.  Or better yet, load the whole thing as
a single big blob.

Personally I would prefer to make prepared inserts as efficient as COPY and
deprecate COPY. Then we could have an entirely client-side tool that handled
as many formats as people want to implement without complicating the server.
Things like various vintages of Excel, fixed column files, etc should all be
handled as plugins for such a tool.

That would have the side benefit of allowing people to do other batch jobs
efficiently: pipelining parameters to hundreds of executions of a prepared
query over the network.
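
For instance, an illustrative libpq sketch (the table and columns are made
up); today each PQexecPrepared call still waits for its result, which is
exactly the round trip that pipelining would remove:

#include <libpq-fe.h>

/* Push rows through one prepared INSERT; illustrative only. */
static int
load_rows(PGconn *conn, const char *const rows[][2], int nrows)
{
	PGresult   *res;
	int			i;

	res = PQprepare(conn, "ins",
					"INSERT INTO t (a, b) VALUES ($1, $2)", 2, NULL);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		PQclear(res);
		return -1;
	}
	PQclear(res);

	for (i = 0; i < nrows; i++)
	{
		res = PQexecPrepared(conn, "ins", 2, rows[i], NULL, NULL, 0);
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
		{
			PQclear(res);
			return -1;
		}
		PQclear(res);
	}
	return 0;
}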

Actually, it seems like there's no particular reason the NOLOGGING option Tom
described (where it only inserts on new pages, doesn't write any special WAL
entries, and just fsyncs at the end instead of WAL logging) can't work with
arbitrary inserts.  Somehow some state has to be preserved remembering which
pages the nologging inserts have created and hold locks on.

-- 
greg




Re: [HACKERS] Google's Summer of Code ...

2005-06-02 Thread Marc G. Fournier


It is already being worked on ... more once we know more ...

On Thu, 2 Jun 2005, Vishal Kashyap @ [SaiHertz] wrote:


Dear all ,



Incidentally I haven't seen any objections; if there are none, should we go
ahead and whip up an email to google?  Do we want to run this through the



We must go straight away for it.

I wonder why big names like Bruce, Tom, and Josh have not responded to this
yet.

In my opinion, the best way to do it is to make a single person responsible
for all of this and let everything be done via him, from filing the
application to the end.





Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664



Re: [HACKERS] Google's Summer of Code ...

2005-06-02 Thread Vikram Kalsi
I am an MSEE student at Penn State (University Park); for the past few
months I have been working on modifying parts of PostgreSQL for my
research work.  I doubt my current work would serve any purpose for pgsql,
since it is experimental and research oriented, but all the same, I have
gained familiarity with parts of pgsql.  I'm interested in Google's Summer
of Code, but I would definitely need help, starting with the selection of
an idea to work on.  So, if anybody from PostgreSQL would like to support
this, please get in touch with me as early as possible.

By the way, I didn't see PostgreSQL in the list of Participating
Organizations on http://code.google.com/summerofcode.html?

Thanks and Regards,
-Vikram Kalsi
MSEE PennState
vzk101 at psu dot edu
www.personal.psu.edu/vzk101


On 5/25/05, Tom Lane [EMAIL PROTECTED] wrote:
 Vikram Kalsi [EMAIL PROTECTED] writes:
  So, I suppose that during the query planning and optimization stage,
  the value of the original variables in the plan are somehow copied to
  the plan which is finally returned inside pg_plan_query().
 
 Look in createplan.c --- there are a couple places in there you need to
 fix.
 
regards, tom lane




[HACKERS] PostgreSQL Developer Network

2005-06-02 Thread Gevik babakhani
Dear People,

After a long time of various timeouts, I managed to get back on track
developing the long-promised PGDN website.  As some of you might know, I was
working on a project called the PostgreSQL Developer Network.

The website (which is a la MSDN) is meant to provide a knowledge base for
the precious programming knowledge regarding writing code for PostgreSQL.
Hopefully, with your contributions, it will be a great place for those (like
me) who would like to contribute.

The website is being developed at www.truesoftware.net:8081/pgdn/
using PHP5 and, of course, PostgreSQL 8.0.3.

Please do not hesitate to share your ideas.

Kind regards,

Gevik

[HACKERS] Google's Summer of Code: Too Late

2005-06-02 Thread Marc G. Fournier


Just got word back that they have reached their 'limit' on the number of 
projects they are working with this summer ... considering they only 
'opened up' the floor on the 31st, and it's closed already ... damn, that 
was a small window of opportunity.


But I've been told to contact the co-ordinator in a week or so, since 
apparently this is just one program they are running, and we may be able 
to find another that we can fit into ...


I'll keep in touch with the guy and see if we can get in on the ground floor 
of other programs they might start up ...



Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664



[HACKERS] Tablespaces

2005-06-02 Thread Christopher Kings-Lynne
I'm interested in whether anyone is using tablespaces.  Do we have any actual 
reports of people actually using them, to advantage, in the field?


Maybe the next postgresql.org survey could be on tablespace usage?

Chris



