On Apr 28, 2016 14:33, "Johann Spies" <johann.sp...@gmail.com> wrote:
>
> I have several large (7GB+) xml files to get into an SQL database.
>
> The xml files can contain up to 500 000 subrecords which I want to be
> able to query in the database.
>
> They are too large to do something like this:
>
> insert into rawxml (xml)
>   select XMLPARSE (DOCUMENT CONVERT_FROM(PG_READ_BINARY_FILE('FOO.xml'), 'UTF8'));
>
> If it were possible, each file would be one huge record in the table,
> which could then be unpacked using XPath.
>
> The options I am considering are:
>
> 1. Unpack the individual records (there will be more than 50 million)
>    using something like Python with lxml and psycopg2, and insert them
>    after dropping all indexes and triggers on the table.
>
> 2. Unpack the individual records, write a (very) large tsv file, and
>    then insert it using COPY.
>
The fastest way I found is to combine these two: use iterparse from lxml
together with load_rows and COPY from py-postgresql:

  http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements

That way you can stream the data (rough sketch at the bottom of this mail).

> It would be convenient if I could use the present xml files as 'foreign
> tables' and parse them using the XPath capabilities of PostgreSQL.
>
> Is this possible?

There is a Multicorn FDW for that:

  https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py

But I never tried it. It looks like it loads all rows into a Python list.

Regards,
Arjen
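
Untested sketch of the streaming approach above. The table rawrecords(data xml),
the <REC> element name and the connection string are placeholders; the exact
COPY line format expected by load_rows is my assumption from the docs linked
above, so adjust as needed.

# Sketch: stream <REC> subrecords from a huge XML file straight into
# PostgreSQL with COPY, without building the whole tree in memory.
# Table name, column, tag name and connection string are assumptions.
import postgresql          # py-postgresql
from lxml import etree

def copy_escape(text):
    # Escape the characters that are special in COPY text format.
    return (text.replace('\\', '\\\\')
                .replace('\t', '\\t')
                .replace('\n', '\\n')
                .replace('\r', '\\r'))

def copy_lines(path, tag='REC'):
    # iterparse fires at the end of each <REC>; yield one COPY line
    # (bytes, newline-terminated) per subrecord.
    for event, elem in etree.iterparse(path, tag=tag):
        xml_text = etree.tostring(elem, encoding='unicode')
        yield (copy_escape(xml_text) + '\n').encode('utf-8')
        # Free the parts of the tree we no longer need.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

db = postgresql.open('pq://user:password@localhost/mydb')    # adjust
copy_in = db.prepare("COPY rawrecords (data) FROM STDIN")    # assumes rawrecords(data xml)
with db.xact():
    copy_in.load_rows(copy_lines('FOO.xml'))

With 50 million rows you would still want to drop indexes and triggers first,
as in option 1 of your mail.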