[HACKERS] Drop separate CRC32 implementations in ltree, tsearch, tsearch2?
contrib/ltree, contrib/tsearch, and contrib/tsearch2 each contain implementations of CRC32 calculations. I think these are now pretty redundant with the CRC32 code existing in the main backend. They use a different CRC polynomial than the main backend code does, but it's hard to credit that that is an important difference.

Anyone see a reason not to rip that code out and make these modules use pg_crc.h/pg_crc.c instead?

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] NOLOGGING option, or ?
Alon Goldshuv wrote:
> I think that the basic issue is that there are some database users that would like to take their data and put it into the database without pre-processing it [...] The only responsibility of these users is to explicitly escape any delimiter or 0x0A (LF) characters that they intend to have as DATA. That's all.

Haven't you just replaced one preprocessing step with another, then?

-O
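The single preprocessing step Alon describes above, escaping the delimiter and LF (0x0A) in field data, can be sketched in a few lines of C. The function name, the choice of backslash as the escape character, and the buffer handling are illustrative assumptions, not code from any proposed patch:

```c
#include <stddef.h>

/* Minimal sketch of the one preprocessing step left to the user:
 * backslash-escape the field delimiter, LF, and the backslash itself
 * in a field value before handing the line to the loader.
 * Returns the length of the escaped string written to dst. */
size_t escape_field(const char *src, char delim, char *dst, size_t dstlen)
{
    size_t j = 0;
    for (size_t i = 0; src[i] != '\0' && j + 2 < dstlen; i++) {
        if (src[i] == '\n') {               /* LF becomes the two bytes \n */
            dst[j++] = '\\';
            dst[j++] = 'n';
        } else if (src[i] == delim || src[i] == '\\') {
            dst[j++] = '\\';                /* prefix delimiter/backslash */
            dst[j++] = src[i];
        } else {
            dst[j++] = src[i];              /* everything else passes through */
        }
    }
    dst[j] = '\0';
    return j;
}
```

So a field value `a|b` followed by a newline and `c`, loaded with `|` as the delimiter, would be emitted as `a\|b\nc`, which is Oliver's point: the escaping pass has not disappeared, it has just been narrowed.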
Re: [HACKERS] Drop separate CRC32 implementations in ltree, tsearch, tsearch2?
> contrib/ltree, contrib/tsearch, and contrib/tsearch2 each contain implementations of CRC32 calculations. I think these are now pretty redundant with the CRC32 code existing in the main backend. They use a different CRC polynomial than the main backend code does, but it's hard to credit that that is an important difference.

I think the polynomial doesn't matter. Although we did experiment to choose the best hash function and settled on crc32: the other functions produced many more collisions on non-English words.

> Anyone see a reason not to rip that code out and make these modules use pg_crc.h/pg_crc.c instead?

contrib/tsearch is already marked as obsolete, and I think we could remove it from contrib entirely. BTW, ltree/crc32.c and tsearch2/crc32.c have a small difference: the ltree variant may lowercase the string (depending on the LOWER_NODE define) before calculating the checksum.

But pg_crc computes a 64-bit checksum, while the contrib modules need 32 bits. Would it be correct to use only half the value? I am afraid the number of collisions would increase...

--
Teodor Sigaev
E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
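Teodor's question (is half of a 64-bit checksum still a good 32-bit checksum?) can be illustrated with a standalone sketch. The CRC routine below uses the common reflected polynomial 0xEDB88320 (the zlib/Ethernet one) purely for illustration; it is not necessarily the polynomial pg_crc or the contrib modules use, and `fold64to32` is just one conventional reduction, not anything from the PostgreSQL source:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 with the common reflected polynomial 0xEDB88320.
 * Illustrative only: pg_crc and the contrib modules may use different
 * polynomials, which is part of the point of this thread. */
uint32_t crc32_simple(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* One conventional way to reduce a 64-bit checksum to 32 bits:
 * XOR the two halves, so every input bit still influences the result,
 * rather than simply discarding the high (or low) word. */
uint32_t fold64to32(uint64_t sum64)
{
    return (uint32_t)(sum64 >> 32) ^ (uint32_t)sum64;
}
```

Because every output bit of a CRC is a linear function of the input bits, either half of a 64-bit CRC (or the XOR fold above) should behave like an ordinary 32-bit hash; the fear of extra collisions applies to truncating a poorly mixing checksum, not a CRC.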
Re: [HACKERS] NOLOGGING option, or ?
On K, 2005-06-01 at 18:05 -0400, Bruce Momjian wrote:
> What we could do is to do no-WAL automatically for empty tables (like when a database is first loaded),

You forget that some databases use WAL for PITR / replication, and doing it automatically there would surely mess up their replicas.

How is index creation handled if it is not logged in WAL?
- Is it not automatically WAL'ed?
- Must one recreate indexes after PITR or failover?

> and use the flag for cases where the table is not zero pages. The fact is that database loads are a perfect case for this optimization and old dumps are not going to have that flag anyway, and automatic is better if we can do it.

--
Hannu Krosing [EMAIL PROTECTED]
Re: [HACKERS] NOLOGGING option, or ?
On K, 2005-06-01 at 11:31 -0400, Tom Lane wrote:
> Hannu Krosing [EMAIL PROTECTED] writes:
> > I think this should be a decision done when creating a table, just like TEMP tables. So you always know if a certain table is or is not safe/replicated/recoverable. This has also the advantage of requiring no changes to actual COPY and INSERT commands.
>
> That doesn't seem right to me; the scenario I envision is that you are willing to do the initial data loading over again (since you presumably still have the source data available). But once you've got it loaded you want full protection.

What I meant was that, as it can't be safely replicated using log-shipping, it should be visible as such.

> Perhaps it could work to use an ALTER TABLE command to flip the state.

No. It would be the same as flipping a TEMP table to an ordinary table, which we don't support, and IMHO for good reason.

> But I'm not really seeing the point compared to treating it as a COPY option.

The point is having a separate (sub)type of storage: a non-WAL/non-replicated table and its indexes.

> I do not believe that anyone needs this to work on individual INSERT commands --- if you are after max speed, why aren't you using COPY? And treating it as an ALTER property opens the possibility of forgetting to ALTER the table back to normal behavior, which would be a foot-gun of large caliber indeed :-(

That's what I'm trying to avoid. If it is obvious that the whole table is quasi-stable (in the PITR/log-shipping sense), it is more clearly a user choice what kinds of data can be stored there. Same as TEMP tables again.

--
Hannu Krosing [EMAIL PROTECTED]
Re: [HACKERS] NOLOGGING option, or ?
> Escape processing would proceed as before, but the semantics would change to allow the use of different characters as the escape character, in addition to the special characters for delimiter and newline.

If you mean syntax to specify escape and delimiter (and newline?), that is a great addition, that imho all would like.

> Also, escape processing would be false as the default, so that the only special characters by default would be the newline and delimiter characters.

I don't see how that would be any advantage? What is so wrong with having syntax to choose no escape processing, like escape ''?

Andreas
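To make the proposal concrete, here is a minimal sketch of field parsing with a caller-chosen escape character, where a NUL escape means no escape processing at all (Andreas's escape '' case). All of this, names, in-place semantics, and behavior at buffer edges, is illustrative guesswork, not code from the actual patch:

```c
/* Split src (modified in place) on delim into fields[]. If escape is
 * '\0', no escape processing is done; otherwise the character after an
 * escape is taken literally. Returns the number of fields found. */
int split_fields(char *src, char delim, char escape,
                 char *fields[], int maxfields)
{
    int n = 0;
    char *out = src;        /* write cursor (unescaped text is shorter) */
    char *start = src;      /* beginning of the current field */
    for (char *p = src; ; p++) {
        if (escape != '\0' && *p == escape && p[1] != '\0') {
            *out++ = *++p;              /* take the next char literally */
        } else if (*p == delim || *p == '\0') {
            int at_end = (*p == '\0');
            *out++ = '\0';              /* terminate the finished field */
            if (n < maxfields)
                fields[n++] = start;
            start = out;
            if (at_end)
                break;
        } else {
            *out++ = *p;
        }
    }
    return n;
}
```

With escape set to '~', the line `a~|b|c` parses as the two fields `a|b` and `c`; with escape '\0' the same bytes parse with no characters treated specially, which is exactly the "no escape processing" default being discussed.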
Re: [HACKERS] Google's Summer of Code ...
So, has anyone gone ahead and contacted Google yet?
Re: [HACKERS] Google's Summer of Code ...
Yes, been working on this since last night ...

On Thu, 2 Jun 2005, Jonah H. Harris wrote:
> So, has anyone gone ahead and contacted Google yet?

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664
Re: [HACKERS] NOLOGGING option, or ?
Andreas,

> > Escape processing would proceed as before, but the semantics would change to allow the use of different characters as the escape character, in addition to the special characters for delimiter and newline.
>
> If you mean syntax to specify escape and delimiter (and newline?), that is a great addition, that imho all would like.

Cool. We've found it invaluable with customers doing web data analysis.

> I don't see how that would be any advantage? What is so wrong with having syntax to choose no escape processing, like escape ''?

I agree.

- Luke
Re: [HACKERS] NOLOGGING option, or ?
Oliver,

> Haven't you just replaced one preprocessing step with another, then?

Generally not. The most common problem with the current choice of escape character is that there are *lots* of data load scenarios with backslashes in the text strings. The extra preprocessing to escape them is unnecessary on other databases and, in effect, causes the load to be even slower because you have to prepare the data ahead of time. Also, note that this patch can also do escape processing, and the net result will still be 5+ times faster than what is there.

In the data warehousing industry, data conversion and manipulation is normally kept distinct from data loading. Conversion is done by tools called ETL (Extract Transform Load), and the database will have a very fast path for direct loading of the resulting data. PostgreSQL is definitely a strange database right now in that there is a default filter applied to the data on load. It's even more strange because the load path is so slow, and we've now found that the slowness is there mostly because of non-optimized parsing and attribute conversion routines.

The question of how to do escape processing is a separate one, but it is wrapped up in the question of whether to introduce a new loading routine or whether to optimize the old one.

- Luke
Re: [HACKERS] Google's Summer of Code ...
Cool. Thanks Marc.

Marc G. Fournier wrote:
> Yes, been working on this since last night ...
>
> On Thu, 2 Jun 2005, Jonah H. Harris wrote:
> > So, has anyone gone ahead and contacted Google yet?
Re: [HACKERS] NOLOGGING option, or ?
On Thu, Jun 02, 2005 at 07:33:13AM -0700, Luke Lonergan wrote:
> Oliver,
>
> > Haven't you just replaced one preprocessing step with another, then?
>
> Generally not. The most common problem with the current choice of escape character is that there are *lots* of data load scenarios with backslashes in the text strings.

I can only think of one where it's common: Windows filenames. But if you're going to support arbitrary data in a load, then whatever escape character you choose will appear sometimes.

> The extra preprocessing to escape them is unnecessary on other databases and, in effect, causes the load to be even slower because you have to prepare the data ahead of time. Also, note that this patch can also do escape processing, and the net result will still be 5+ times faster than what is there.

I strongly suspect that a patch to improve performance without changing behaviour would be accepted with no questions asked. One that allowed specifying the field and record delimiters, the escape character, and the null symbol might require more discussion, about an appropriate syntax at the very least. So you may want to separate the two.

> In the data warehousing industry, data conversion and manipulation is normally kept distinct from data loading. Conversion is done by tools called ETL (Extract Transform Load), and the database will have a very fast path for direct loading of the resulting data. PostgreSQL is definitely a strange database right now in that there is a default filter applied to the data on load. It's even more strange because the load path is so slow, and we've now found that the slowness is there mostly because of non-optimized parsing and attribute conversion routines. The question of how to do escape processing is a separate one, but it is wrapped up in the question of whether to introduce a new loading routine or whether to optimize the old one.

There are already two loader routines. One of them is text-based and is designed for easy generation of the data load format using simple text manipulation tools, by using delimiters. It also allows (unlike your suggestion) for loading of arbitrary data from a text file. Because it allows for arbitrary data and uses delimiters to separate fields, it has to use an escaping mechanism.

If you want to be able to load arbitrary data and not have to handle escape characters, there are two obvious ways to do it. The first is the one used by MIME and suggested by you: use a separator that you believe will not appear in the data. That can be done by using a long multicharacter separator containing random characters and assuming that sequence won't appear; it can be done by parsing the input data twice, looking for strings that don't appear, for use as delimiters; or it can take advantage of knowledge about what characters can and cannot appear in the input data. (I can't imagine any case involving data-mining of web logs where the last is likely to be relevant.)

The other is to use a length+data format for each tuple, avoiding all issues of escapes in the data and allowing arbitrary data to be represented. That's how the binary load format PG supports works, I believe. If you're really concerned about speed of load, that may be a better format for your front-end to generate, perhaps?

Cheers, Steve
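Steve's second option, length+data framing, can be sketched in a few lines. The fixed 4-byte host-order length and the function names below are illustrative assumptions; PostgreSQL's real binary COPY format is more involved (a file header, per-tuple field counts, per-field lengths in network byte order):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of length+data framing: each field is a 4-byte length followed
 * by raw bytes, so arbitrary data (including delimiters and newlines)
 * needs no escaping at all. Byte-order handling is omitted. */
size_t put_field(unsigned char *out, const void *data, uint32_t len)
{
    memcpy(out, &len, 4);           /* length prefix */
    memcpy(out + 4, data, len);     /* raw payload, no escaping */
    return 4 + len;
}

uint32_t get_field(const unsigned char *in, const void **data)
{
    uint32_t len;
    memcpy(&len, in, 4);
    *data = in + 4;
    return len;
}
```

A field containing both the delimiter and a newline round-trips untouched, which is precisely why such formats need no escape mechanism; the trade-off is that the data is no longer editable with line-oriented text tools.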
Re: [HACKERS] Google's Summer of Code ...
Dear all,

> Incidentally I havent seen any objections, if there are none should we go ahead and whip up an email to google? Do we want to run this through the

We must go straight for it. I wonder why the biggies like Bruce, Tom, and Josh have not responded to this yet. In my opinion, the best way to do it is to make a single person responsible for all of this and let everything be done via him, from filing the application to the end.

--
With Best Regards,
Vishal Kashyap.
Lead Software Developer,
http://saihertz.com, http://vishalkashyap.tk
Re: [HACKERS] NOLOGGING option, or ?
Steve,

> I can only think of one where it's common: Windows filenames.

Nearly all weblog data, then.

> But if you're going to support arbitrary data in a load, then whatever escape character you choose will appear sometimes.

If we allow an 8-bit character set in the text file, then yes, any delimiter you choose has the potential to appear in your input data. In practice, with *mostly* 7-bit ASCII characters and even with international 8-bit text encodings, you can choose a delimiter and newline that work well. Exceptions are handled by the forthcoming single-row error handling patch.

> I strongly suspect that a patch to improve performance without changing behaviour would be accepted with no questions asked.

Understood; I'm just not sure it's the best thing for the users yet. We've found a large number of issues from customers with the unmodified behavior.

> There are already two loader routines. One of them is text-based and is designed for easy generation of the data load format using simple text manipulation tools, by using delimiters. It also allows (unlike your suggestion) for loading of arbitrary data from a text file.

Not to distract, but try loading a binary null into a text field. The assumption of null-terminated strings penetrates deep into the codebase. The existing system does not allow for loading arbitrary data from a text file. Our suggestion allows for escapes, but requires the ability to specify alternate characters, or none.

> Because it allows for arbitrary data and uses delimiters to separate fields, it has to use an escaping mechanism. If you want to be able to load arbitrary data and not have to handle escape characters, there are two obvious ways to do it.

Let's dispense with the notion that we're suggesting no escapes (see above). Binary with a bookends format is a fine idea and would be my personal preference if it were fast, which it isn't. Customers in the web log analysis and other data warehousing fields prefer mostly 7-bit ASCII text input, which is what we're trying to support with this change.

- Luke
Re: [HACKERS] Google's Summer of Code ...
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Vishal Kashyap @ [SaiHertz]
> Sent: 02 June 2005 16:19
> To: Robert Treat
> Cc: Jonah H. Harris; Marc G. Fournier; pgsql-hackers@postgresql.org; [EMAIL PROTECTED]
> Subject: Re: [HACKERS] Google's Summer of Code ...
>
> We must go straight for it. I wonder why the biggies like Bruce, Tom, and Josh have not responded to this yet.

I thought Marc was dealing with it? And they don't get much bigger than him <grin> :-)

Regards, Dave.
Re: [HACKERS] NOLOGGING option, or ?
> > 2) A modified command syntax for introducing direct single-row error handling. By direct I mean: a row that is rejected from within the COPY command context does not throw an error and roll back the whole transaction. Instead, the error is caught and recorded elsewhere, maybe in some error table, with some more information that can later be retrieved. The following rows continue to be processed. This way there is barely any error handling overhead.
>
> Is there any idea on exactly how this would be done? Do you plan on using savepoints to implement it? I fail to see how this is barely any overhead. Savepoints are not that expensive, but they are not free either. (No, I haven't measured it.)

Good question. I am not entirely sure this is possible yet, as I haven't thought it through entirely. I guess data errors could be divided into 2 main categories: malformed data, where the error is detected even before forming a tuple, and errors caused by some constraint violation, that is, after the tuple is formed and inserted. From what I hear and experience, the big majority of errors are of the first type. In that case the error could be caught, the data line + line number + error description could be inserted into an ERROR table (all TEXT fields), and then COPY can skip forming a tuple and move on to parsing the next line. In this process there is barely any overhead.

The more difficult part, obviously, is handling the second error type, which I haven't looked at deeply yet. Hopefully it is not impossible to do while keeping transaction integrity (any ideas, anyone?). The overhead for this one will probably be larger, but again, we expect those errors to happen less often (in most cases at least). Nevertheless, it is surely much faster than recursively narrowing down batch sizes.

Alon.
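Alon's first category (malformed lines caught before tuple formation) amounts to a loop like the one below, where a bad line is logged and skipped rather than aborting the load. The field-count check stands in for real parse errors, and all names are mine, not from any patch:

```c
#include <stdio.h>

/* Sketch of single-row error handling for pre-tuple parse errors:
 * a malformed line is recorded (line number + text) to an error log,
 * standing in for Alon's ERROR table, and processing continues.
 * "Malformed" here just means the wrong field count. */
int load_lines(const char **lines, int nlines, char delim,
               int expected_fields, FILE *errlog)
{
    int loaded = 0;
    for (int i = 0; i < nlines; i++) {
        int fields = 1;
        for (const char *p = lines[i]; *p != '\0'; p++)
            if (*p == delim)
                fields++;
        if (fields != expected_fields) {
            /* record the error; no exception, no transaction rollback */
            fprintf(errlog, "line %d: bad field count: %s\n",
                    i + 1, lines[i]);
            continue;               /* skip tuple formation, keep going */
        }
        loaded++;                   /* would form and insert the tuple here */
    }
    return loaded;
}
```

Since a rejected line never reaches tuple formation, nothing has been inserted and there is nothing to roll back, which is why this category carries almost no overhead; constraint violations (Alon's second category) are harder precisely because the tuple has already been inserted by the time the error surfaces.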
Re: [HACKERS] NOLOGGING option, or ?
Luke Lonergan [EMAIL PROTECTED] writes:
> In the data warehousing industry, data conversion and manipulation is normally kept distinct from data loading.

It's a bit strange to call this conversion or manipulation. One way or another you have to escape whatever your delimiters are. How would you propose loading strings that contain newlines?

The ETL transformations you're talking about are a different beast entirely. You're talking about things like canonicalizing case or looking up foreign-key ids to replace strings and such. Simply parsing the file format properly isn't part of that game. Otherwise, where do you stop? You could take this to a silly extreme and say postgres should just load each line as a record with a single text field and let tools deal with actually parsing it. Or better yet, load the whole thing as a single big blob.

Personally, I would prefer to make prepared inserts as efficient as COPY and deprecate COPY. Then we could have an entirely client-side tool that handled as many formats as people want to implement, without complicating the server. Things like various vintages of Excel, fixed-column files, etc. could all be handled as plugins for such a tool. That would have the side benefit of allowing people to do other batch jobs efficiently, pipelining parameters to hundreds of executions of a prepared query over the network.

Actually, it seems like there's no particular reason the NOLOGGING option Tom described (where it only inserts on new pages, doesn't make any special WAL entries, and just fsyncs at the end instead of WAL logging) can't work with arbitrary inserts. Somehow some state has to be preserved remembering which pages the nologging inserts have created and holding locks on them.

-- greg
Re: [HACKERS] Google's Summer of Code ...
It is already being worked on ... more once we know more ...

On Thu, 2 Jun 2005, Vishal Kashyap @ [SaiHertz] wrote:
> We must go straight for it. I wonder why the biggies like Bruce, Tom, and Josh have not responded to this yet. In my opinion, the best way to do it is to make a single person responsible for all of this and let everything be done via him, from filing the application to the end.
Re: [HACKERS] Google's Summer of Code ...
I am an MSEE student at Penn State (University Park); for the past few months I have been working on modifying parts of PostgreSQL for my research work. I doubt my current work would serve any purpose for pgsql, since it is experimental and research-oriented, but all the same, I have gained familiarity with parts of pgsql. I'm interested in Google's Summer of Code, but I would definitely need help, starting with selection of an idea to work on. So, if anybody from PostgreSQL would like to support this, please get in touch with me as early as possible.

By the way, I didn't see PostgreSQL in the list of Participating Organizations on http://code.google.com/summerofcode.html?

Thanks and Regards,
-Vikram Kalsi
MSEE PennState
vzk101 at psu dot edu
www.personal.psu.edu/vzk101

On 5/25/05, Tom Lane [EMAIL PROTECTED] wrote:
> Vikram Kalsi [EMAIL PROTECTED] writes:
> > So, I suppose that during the query planning and optimization stage, the value of the original variables in the plan are somehow copied to the plan which is finally returned inside pg_plan_query().
>
> Look in createplan.c --- there are a couple places in there you need to fix.
>
> regards, tom lane
[HACKERS] PostgreSQL Developer Network
Dear People,

After a long time of various timeouts, I managed to get back on track developing the long-promised PGDN website. As some of you might know, I was working on a project called the PostgreSQL Developer Network. The website (which is a la MSDN) is meant to provide a knowledge base for the precious programming knowledge regarding writing code for PostgreSQL. Hopefully, with your contributions, it will be a great place for those (like me) who would like to contribute.

The website is being developed on www.truesoftware.net:8081/pgdn/ using PHP5 and, of course, PostgreSQL 8.0.3.

Please do not hesitate to share your ideas.

Kind regards, Gevik
[HACKERS] Google's Summer of Code: Too Late
Just got word back that they have reached their 'limit' on the # of projects they are working with this summer ... considering they only 'opened up' the floor on the 31st, and it's closed already ... damn, that was a small window of opportunity.

But I've been told to contact the co-ordinator in a week or so, since apparently this is just one program they are running, and we may be able to find another that we can fit into ... I'll keep in touch with the guy and see if we can get in on the ground floor of other programs they might start up ...
[HACKERS] Tablespaces
I'm interested to know if anyone is using tablespaces. Do we have any actual reports of people using them, to advantage, in the field? Maybe the next postgresql.org survey could be on tablespace usage?

Chris