[SQL] large xml database
Hi,
I have a very big XML document, larger than 50GB, which I want to import into a database and transform into a relational schema. When splitting this document into smaller independent XML documents I get ~11.1 million XML documents.
I have spent a lot of time trying to find the fastest way to transform all this data, but every time I give up because it takes too much time; sometimes it would take more than a month if not stopped.
I have tried to insert each line as varchar into the database and parse it using plperl regexes. I have also tried to store every document as XML and parse it, but that is also too slow. I have tried to store every document as varchar, but that is also slow when using regexes to extract the data.

Many tries have failed because 8GB of RAM and 10GB of swap were not enough. Also, sometimes I get an error that more than 2^32 operations were performed, and the functions stop working.

I just wanted to ask if someone knows how to speed this up.

Thanks in advance
--
Viktor Bojović
---
Wherever I go, Murphy goes with me
Re: [SQL] large xml database
On 10/30/2010 11:49 PM, Viktor Bojović wrote:
> I have a very big XML document, larger than 50GB, which I want to import
> into a database and transform into a relational schema. When splitting
> this document into smaller independent XML documents I get ~11.1 million
> XML documents.
> [...]
> Many tries have failed because 8GB of RAM and 10GB of swap were not enough.
> [...]
> I just wanted to ask if someone knows how to speed this up.

Use a SAX parser and handle the endElement(String name) events to insert the element's content into your db.

--
Andreas Joseph Krogh
Senior Software Developer / CTO
Public key: http://home.officenet.no/~andreak/public_key.asc
OfficeNet AS | Rosenholmveien 25 | 1414 Trollåsen | NORWAY
Tlf: +47 24 15 38 90 | Fax: +47 24 15 38 91 | Mobile: +47 909 56 963
"The most difficult thing in the world is to know how to do a thing and to watch somebody else doing it wrong, without comment."
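For illustration, here is a minimal sketch of that SAX approach, written in Python rather than Java since the thread already mentions plperl and Python. The element names ("entry", "accession", "name"), the target table "entries", and the use of psycopg2 for the database connection are assumptions for the sketch, not something from the original post; the point is that the handler only ever holds one record in memory, so the 50GB file is never loaded at once.

import xml.sax
import psycopg2  # assumed driver; any DB-API adapter would do

class EntryHandler(xml.sax.ContentHandler):
    """Collects the text of interesting elements and inserts one row
    per finished <entry> element (element/table names are placeholders)."""
    def __init__(self, conn):
        xml.sax.ContentHandler.__init__(self)
        self.cur = conn.cursor()
        self.text = []
        self.current = {}

    def startElement(self, name, attrs):
        self.text = []                      # reset the text buffer

    def characters(self, content):
        self.text.append(content)           # text may arrive in several chunks

    def endElement(self, name):
        value = "".join(self.text).strip()
        if name in ("accession", "name"):   # fields we want to keep
            self.current[name] = value
        elif name == "entry":               # one logical record is complete
            self.cur.execute(
                "INSERT INTO entries (accession, name) VALUES (%s, %s)",
                (self.current.get("accession"), self.current.get("name")))
            self.current = {}

conn = psycopg2.connect("dbname=mydb")
xml.sax.parse("big_document.xml", EntryHandler(conn))  # streams the file
conn.commit()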
Re: [SQL] large xml database
Andreas Joseph Krogh wrote:
> Use a SAX parser and handle the endElement(String name) events to insert
> the element's content into your db.

If you still have the 11 million subfiles, I would start there: SAX-parse as above and maybe write CSV files, then load those with a bulk COPY, since a begin/end transaction for each data item discovered will hurt.

Can the subfiles be segregated by specific data types, or at least by holders of specific data types, such that they relate to a specific subset of your new db/schema? This will play into what gets loaded first and who depends on whom w.r.t. foreign keys etc.

You can parallelize marginally with multiple threads (hoping to separate the file read from the SAX parsing from the element construction from the save/network), but more boxes would be the way to go. Partitioning remains a problem.
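As a rough sketch of the bulk-load step suggested here, assuming the streaming parse has already written a CSV file and that the target table is called "entries" with columns "accession" and "name" (all assumptions), a single COPY replaces millions of per-row transactions:

import psycopg2  # assumed driver

conn = psycopg2.connect("dbname=mydb")
with conn:  # one transaction around the whole load, committed on success
    with conn.cursor() as cur, open("entries.csv") as f:
        cur.copy_expert(
            "COPY entries (accession, name) FROM STDIN WITH (FORMAT csv)", f)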
Re: [SQL] large xml database
> "VB" == Viktor Bojović writes:

VB> I have a very big XML document, larger than 50GB, which I want to
VB> import into a database and transform into a relational schema.

Were I doing such a conversion, I'd use Perl to convert the XML into something which COPY can grok. Any other language, script or compiled, would work just as well. The goal is to avoid having to slurp the whole XML structure into memory.

-JimC
--
James Cloos
OpenPGP: 1024D/ED7DAEA6
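A hedged sketch of that idea, using Python's iterparse rather than Perl; the element and column names are made up for illustration. The script writes tab-separated lines to stdout, which could then be piped into psql, e.g. psql mydb -c "\copy entries (accession, name) FROM STDIN".

import sys
import xml.etree.ElementTree as ET

# Stream the document and emit one tab-separated line per <entry>.
# Real data would also need COPY text-format escaping of tabs, backslashes
# and newlines inside values.
for event, elem in ET.iterparse("big_document.xml", events=("end",)):
    if elem.tag == "entry":
        accession = elem.findtext("accession", "")
        name = elem.findtext("name", "")
        sys.stdout.write(accession + "\t" + name + "\n")
        elem.clear()  # drop the finished subtree so memory stays flat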
Re: [SQL] large xml database
On Sat, 30 Oct 2010 23:49:29 +0200 Viktor Bojović wrote:
> Many tries have failed because 8GB of RAM and 10GB of swap were not enough.
> Also, sometimes I get an error that more than 2^32 operations were
> performed, and the functions stop working.

We have a similar problem, and we use the Amara XML toolkit for Python. To avoid the big memory consumption, use pushbind. A 30GB bme catalog file takes at most about 20 minutes to import. It could be faster; we are building complex objects with an ORM, so the time consumption depends on how complex the catalog is. If you use Amara only to perform a conversion from XML to CSV, the final import can be done much faster.

Regards
--
Lutz
http://www.4c-gmbh.de
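For reference, the pushbind pattern described here looks roughly like the sketch below. It is written from memory of the Amara 1.x API (Python 2 era), so the exact call signature should be checked against the Amara documentation; the element pattern "entry" and the fields read from it are assumptions.

# Rough sketch of Amara pushbind streaming (Amara 1.x, Python 2 era).
# NOTE: written from memory; verify the exact pushbind() signature in the
# Amara docs. "entry", "accession" and "name" are placeholder names.
import amara

for entry in amara.pushbind("big_document.xml", u"entry"):
    accession = unicode(entry.accession)  # child elements appear as attributes
    name = unicode(entry.name)
    # hand the values to an ORM object, or write a CSV line for a later COPY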