[SQL] A more efficient way?

2010-10-31 Thread James Cloos
I've a third-party app which is slower than it ought to be.

It does one operation in particular by running a query to get a list of
rows, and then iterates through them to confirm whether it actually wants
each row.  As you can imagine, that is very slow.

This query gets the correct data set in one go, but is also slow:

 select p from m where s = 7 and p not in (select p from m where s != 7);

The p column is not unique; the goal is the set of p for which *every*
row with a matching p also has s=7.  (s and p are both integer columns,
if that matters.)

That takes about 38 seconds with this (explain analyze) plan:

QUERY PLAN

 Index Scan using m_5 on m  (cost=8141.99..11519.73 rows=1 width=4) (actual time=564.689..37964.163 rows=243 loops=1)
   Index Cond: (s = 7)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=8141.99..13586.24 rows=523955 width=4) (actual time=0.003..80.734 rows=523948 loops=243)
           ->  Seq Scan on m  (cost=0.00..7413.34 rows=523955 width=4) (actual time=0.023..259.901 rows=523948 loops=1)
                 Filter: (s <> 7)
 Total runtime: 38121.781 ms

The index m_5 is a btree on (s).

Can that be done in a way which requires only one loop?

My understanding is that:

 select p from m where s = 7  except select p from m where s != 7;

will return the same results as a simple 'select p from m where s = 7', yes?

Is there a way to do it with a self join which requires just a single loop?
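
For reference, a single-pass formulation of "the set of p for which every
row with a matching p has s = 7" can also be written as an aggregate over
p; this is only a sketch (untested against this table), and note that it
returns each qualifying p once rather than one row per s = 7 row:

    select p
      from m
     group by p
    having bool_and(s = 7);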

-JimC
-- 
James Cloos  OpenPGP: 1024D/ED7DAEA6



Re: [SQL] A more efficient way?

2010-10-31 Thread Tom Lane
James Cloos  writes:
> I've a third-party app which is slower than it ought to be.
> It does one operation in particular by running a query to get a list of
> rows, and then iterates through them to confirm whether it actually wants
> each row.  As you can imagine, that is very slow.

> This query gets the correct data set in one go, but is also slow:

>  select p from m where s = 7 and p not in (select p from m where s != 7);

See if you can recast it as a NOT EXISTS.  NOT IN is hard to optimize
because of its weird behavior with nulls.

If you're working with a less-than-current version of PG, you may
instead have to resort to a left-join-with-is-null locution, ie
select m.p from m left join
(select p from m where s != 7) m2
on (m.p=m2.p)
where m2.p is null and m.s = 7;
but it's hard to wrap your head around that sort of thing, so I'd
advise against using it if you can get decent performance with EXISTS.
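
For example, a NOT EXISTS recast of the original query might look like this
(just a sketch, untested):

    select p
      from m
     where s = 7
       and not exists (select 1
                         from m m2
                        where m2.p = m.p
                          and m2.s <> 7);

An index on m(p) (or on (p, s)) would likely help the inner probe avoid the
repeated sequential scan shown in the plan above.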

regards, tom lane



Re: [SQL] large xml database

2010-10-31 Thread Viktor Bojović
On Sun, Oct 31, 2010 at 7:08 AM, Lutz Steinborn wrote:

> On Sat, 30 Oct 2010 23:49:29 +0200
> Viktor Bojović wrote:
> >
> > many tries have failed because 8GB of RAM and 10GB of swap were not
> > enough. also sometimes I get that more than 2^32 operations were
> > performed, and the functions stopped working.
> >
> we have a similar problem and we use the Amara XML toolkit for Python. To
> avoid the big memory consumption, use pushbind. A 30G bme catalog file
> takes at most about 20 min to import. It might be faster, because we are
> preparing complex objects with an ORM, so the time consumption depends on
> how complex the catalog is.
> If you use Amara only to perform a conversion from XML to CSV, the final
> import can be done much faster.
>
> regards
>
> --
> Lutz
>
> http://www.4c-gmbh.de
>
Thanks Lutz, I will try Amara, and I will also try parsing it with SAX.
I have tried twig and some other parsers, but they consumed too much RAM.
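
For reference, a minimal sketch of the CSV-then-COPY route Lutz describes;
the table, columns and file path here are hypothetical:

    -- Staging table for rows produced by streaming the XML to CSV outside
    -- the database (e.g. with a SAX-style parser).
    CREATE TABLE entry_raw (
        accession text,
        name      text,
        seq       text
    );

    -- Server-side bulk load; use \copy in psql if the file is on the client.
    COPY entry_raw (accession, name, seq)
        FROM '/tmp/entries.csv' WITH CSV HEADER;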



-- 
---
Viktor Bojović
---
Wherever I go, Murphy goes with me


Re: [SQL] large xml database

2010-10-31 Thread Viktor Bojović
On Sun, Oct 31, 2010 at 2:26 AM, James Cloos  wrote:

> > "VB" == Viktor Bojović  writes:
>
> VB> I have a very big XML document, larger than 50GB, and want to
> VB> import it into the database and transform it to a relational schema.
>
> Were I doing such a conversion, I'd use perl to convert the xml into
> something which COPY can grok.  Any other language, script or compiled,
> would work just as well.  The goal is to avoid having to slurp the whole
> xml structure into memory.
>
> -JimC
> --
> James Cloos  OpenPGP: 1024D/ED7DAEA6
>

The insertion into the database is not a big problem.
I insert it as XML docs, as varchar lines, or as XML docs in varchar
format. Usually I use a transaction and commit after each block of 1000
inserts, and it goes very fast, so the insertion is over after a few hours.
But the problem occurs when I want to transform it inside the database from
XML (varchar or xml format) into tables by parsing.
That processing takes too much time in the database no matter whether it is
stored as varchar lines, varchar nodes or the xml data type.
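
For reference, a minimal sketch of that in-database shredding using the
built-in xpath() function; the staging table, target table and element
paths here are hypothetical (and real-world XML like this is often
namespaced, in which case xpath() needs its namespace argument):

    -- Assumes a staging table raw_entry(id integer, doc xml) filled by the
    -- bulk insert described above.
    INSERT INTO entry (accession, full_name, seq)
    SELECT (xpath('/entry/accession/text()', doc))[1]::text,
           (xpath('/entry/protein/recommendedName/fullName/text()', doc))[1]::text,
           (xpath('/entry/sequence/text()', doc))[1]::text
      FROM raw_entry;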

-- 
---
Viktor Bojović
---
Wherever I go, Murphy goes with me


Re: [SQL] large xml database

2010-10-31 Thread Rob Sargent




Viktor Bojović wrote:
> The insertion into the database is not a big problem. [...] But the
> problem occurs when I want to transform it inside the database from XML
> (varchar or xml format) into tables by parsing. That processing takes too
> much time in the database no matter whether it is stored as varchar
> lines, varchar nodes or the xml data type.

Are you saying you first load the XML into the database, then parse that
XML into instances of objects (rows in tables)?





Re: [SQL] large xml database

2010-10-31 Thread Viktor Bojović
On Sun, Oct 31, 2010 at 9:42 PM, Rob Sargent  wrote:

> Are you saying you first load the XML into the database, then parse that
> XML into instances of objects (rows in tables)?
>
Yes. That way takes less RAM than using twig or simple xml, so I tried
using the PostgreSQL XML functions or regexes.



-- 
---
Viktor Bojović
---
Wherever I go, Murphy goes with me


11.xml.gz
Description: GNU Zip compressed data



Re: [SQL] large xml database

2010-10-31 Thread Rob Sargent



Viktor Bojović wrote:
> Yes. That way takes less RAM than using twig or simple xml, so I tried
> using the PostgreSQL XML functions or regexes.
Is the entire load a set of "entry" elements, as your example contains?
I believe this would parse nicely into a tidy but non-trivial schema
directly, without the "middle-man" of having XML in the db (unless of
course you prefer XPath to SQL ;) )

The single most significant caveat I would offer is: beware, biologists
involved. Inconsistency (or at least overloaded concepts) is almost
assured :). EMBL too is suspect imho, but I've been out of that arena for
a while.
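
By way of illustration only, a fragment of what such a direct schema might
look like; every table and column name below is a guess based on the
element-path listing further down the thread:

    -- One row per <entry>
    CREATE TABLE entry (
        entry_id  serial PRIMARY KEY,
        accession text,
        version   integer
    );

    -- entry[]->sequence[]
    CREATE TABLE entry_sequence (
        entry_id integer REFERENCES entry,
        version  integer,
        modified date,
        seq      text
    );

    -- entry[]->dbReference[] and entry[]->dbReference[]->property[]
    CREATE TABLE db_reference (
        db_reference_id serial PRIMARY KEY,
        entry_id        integer REFERENCES entry,
        type            text
    );

    CREATE TABLE db_reference_property (
        db_reference_id integer REFERENCES db_reference,
        key             text,
        value           text
    );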








Re: [SQL] large xml database

2010-10-31 Thread Viktor Bojović
On Sun, Oct 31, 2010 at 10:26 PM, Rob Sargent  wrote:



> Is the entire load a set of "entry" elements, as your example contains?
> I believe this would parse nicely into a tidy but non-trivial schema
> directly, without the "middle-man" of having XML in the db (unless of
> course you prefer XPath to SQL ;) )
>
> The single most significant caveat I would offer is: beware, biologists
> involved. Inconsistency (or at least overloaded concepts) is almost
> assured :). EMBL too is suspect imho, but I've been out of that arena
> for a while.
>
Unfortunately some elements are always missing, so I had to create a script
which scanned the whole SwissProt and TrEMBL document and stored the result
in a file, to use as a template for building a code generator once I find
the best parser for this purpose. To parse all the elements in one day I
would need a parser capable of handling at least 128 entry blocks per
second @ 2.4GHz. You are right about inconsistency; I constantly have
problems with PDB files.

btw,
you mentioned "This I believe would parse nicely into a tidy but
non-trivial schema directly"; does that mean that PostgreSQL has support
for deriving a database schema from XML files?

-- 
---
Viktor Bojović
---
Wherever I go, Murphy goes with me

entry[]->sequence[]
entry[]->feature[]
entry[]->reference[]
entry[]->feature[]->location[]->position[]->status
entry[]->dbReference[]->property[]
entry[]->reference[]->citation[]->last
entry[]->comment[]->text[]->status
entry[]->geneLocation[]->type
entry[]->comment[]->experiments[]
entry[]->comment[]->conflict[]->sequence[]
entry[]->comment[]->subcellularLocation[]->orientation[]->status
entry[]->protein[]->domain[]->alternativeName[]->fullName[]
entry[]->evidence[]->category
entry[]->feature[]->location[]->begin[]->status
entry[]->reference[]->citation[]->volume
entry[]->feature[]->evidence
entry[]->dbReference[]->type
entry[]->reference[]->citation[]->authorList[]->consortium[]
entry[]->version
entry[]->comment[]->location[]->sequence
entry[]->sequence[]->version
entry[]->proteinExistence[]
entry[]->reference[]->scope[]
entry[]->reference[]->source[]->plasmid[]
entry[]->reference[]->citation[]->dbReference[]
entry[]->comment[]->locationType
entry[]->protein[]->domain[]
entry[]->reference[]->citation[]->publisher
entry[]->gene[]->name[]
entry[]->protein[]->domain[]->alternativeName[]->ref
entry[]->comment[]->conflict[]
entry[]->evidence[]
entry[]->sequence[]->modified
entry[]->comment[]->confl

Re: [SQL] large xml database

2010-10-31 Thread Rob Sargent

Skipping much of the included thread, urgently.

Viktor Bojović wrote:
> btw, you mentioned "This I believe would parse nicely into a tidy but
> non-trivial schema directly"; does that mean that PostgreSQL has support
> for deriving a database schema from XML files?
No, sorry, I did not mean to imply that. If you had an XSD file from them
you might have a better chance. I have never looked, but I bet someone has
tried that sort of manipulation. Throw out the cardinality constraints, of
course :).

