Re: [sqlalchemy] Handling unique key violations with bulk inserts of large number of rows

2015-01-22 Thread SF Markus Elfring
> If I'm doing a bulk insert of a very large number of rows
> is it possible to add only the ones that don't violate
> unique constraints and log the rest?

Yes, but such functionality depends on some design details.

Are you willing to invest software development effort to get the involved
application programming interfaces into a shape that provides enough error
information, so that the corresponding exception handling can be made
safe, consistent and efficient?


> But if there's even 1 row that violates a unique constraint,
> none of the hundreds of non-duplicated rows will get inserted either.

How much do you actually need this core transaction-management
functionality of the database?

Does each record belong to a change which affects the consistency/integrity
of your data structures?

Can the data updates really be reordered for failure handling?

Regards,
Markus



Re: [sqlalchemy] Difficulties with parallel data insertion into the same table

2015-01-22 Thread SF Markus Elfring
>> I would appreciate if I can fill these data structures in parallel
>> without a serial database preparation step (table creation with
>> repeated meta-data specification).
> 
> You’d need to implement checks for this concurrency.

I would expect database implementations to provide functionality
for parallel updates, including the concurrent creation of tables.


> A create table would need to be preceded by a check to see that it exists,
> for example, and that would likely need to be mutexed on that name so that
> no race condition occurs in between the time that the name is checked
> vs. the create table is emitted.

I am curious to clarify further software development challenges here.
How many open issues are already known around such implementation details?

Regards,
Markus



Re: [sqlalchemy] Handling unique key violations with bulk inserts of large number of rows

2015-01-22 Thread Michael Bayer


alchemy1  wrote:

> If I'm doing a bulk insert of a very large number of rows is it possible to 
> add only the ones that don't violate unique constraints and log the rest? 
> Since I'm inserting so many rows I don't do a flush/commit after each one; 
> instead I let my application batch several hundred at a time and then do a 
> .commit(). But if there's even 1 row that violates a unique constraint, none of 
> the hundreds of non-duplicated rows will get inserted either.

you’d need to load the rows ahead of time and remove them from your bulk list.

if the number of rows is too large to do that at once, then you have to handle 
it in chunks.  sort the items in your bulk list, extract chunks of say 1000 at 
a time; for each chunk of 1000, load the rows from the DB within the same range 
covered by those thousand, then remove from the chunk those rows already 
present.   bulk insert the 1000, then onto the next chunk.

ORM / core doesn’t matter.
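
A minimal Core sketch of that chunked approach (the table and column names —
"items" with a unique "key" column — the connection URL and the chunk size are
only assumptions for illustration; the same idea works with the ORM):

from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, select)

engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder URL
metadata = MetaData()
items = Table("items", metadata,
              Column("id", Integer, primary_key=True),
              Column("key", String(100), unique=True),
              Column("payload", String(500)))

def bulk_insert_skipping_duplicates(rows, chunk_size=1000):
    # sort the incoming rows by the unique key so each chunk covers a key range
    rows = sorted(rows, key=lambda r: r["key"])
    with engine.begin() as conn:
        for start in range(0, len(rows), chunk_size):
            chunk = rows[start:start + chunk_size]
            lo, hi = chunk[0]["key"], chunk[-1]["key"]
            # load the keys already present within this chunk's key range
            existing = set(
                key for (key,) in conn.execute(
                    select([items.c.key]).where(items.c.key.between(lo, hi))))
            # drop rows that would violate the unique constraint, insert the rest
            fresh = [r for r in chunk if r["key"] not in existing]
            if fresh:
                conn.execute(items.insert(), fresh)

(Rows that duplicate each other within the same incoming batch would still need
to be de-duplicated in Python before the insert.)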


> 
> I thought about adding the rows one at a time from my application (doing a 
> .commit() after each row) so that I can add only the ones that don't violate 
> the constraint. But then the loading process goes extremely slowly because of 
> all the database round trips and disk writes I presume. Is there a way to 
> bulk load while keeping performance high?
> 
> I am currently using ORM but would use Core if it's doable that way instead.
> 



[sqlalchemy] Handling unique key violations with bulk inserts of large number of rows

2015-01-22 Thread alchemy1
If I'm doing a bulk insert of a very large number of rows is it possible to 
add only the ones that don't violate unique constraints and log the rest? 
Since I'm inserting so many rows I don't do a flush/commit after each one; 
instead I let my application batch several hundred at a time and then do a 
.commit(). But if there's even 1 row that violates a unique constraint, none 
of the hundreds of non-duplicated rows will get inserted either.

I thought about adding the rows one at a time from my application (doing a 
.commit() after each row) so that I can add only the ones that don't 
violate the constraint. But then the loading process goes extremely slowly 
because of all the database round trips and disk writes I presume. Is there 
a way to bulk load while keeping performance high?

I am currently using ORM but would use Core if it's doable that way instead.



Re: [sqlalchemy] Difficulties with parallel data insertion into the same table

2015-01-22 Thread Michael Bayer


SF Markus Elfring  wrote:

> 
>> So there is no valid use case for parallel generation of tables
>> except in particular kinds of multi-tenancy situations.
> 
> I find my use case simple enough (and therefore very valid).
> I am going to manage two database tables by corresponding
> Python classes with SQLAlchemy services.
> 
> I would appreciate if I can fill these data structures in parallel
> without a serial database preparation step (table creation with
> repeated meta-data specification).


You’d need to implement checks for this concurrency.   A create table would 
need to be preceded by a check to see that it exists, for example, and that 
would likely need to be mutexed on that name so that no race condition occurs 
in between the time that the name is checked vs. the create table is emitted.   
If the mutex is on the database end, then you’d need to use Postgresql’s 
facilities; pg_advisory_lock() appears to be session level 
(http://www.postgresql.org/docs/9.1/static/functions-admin.html), which is 
probably not sufficient if you are using multiple processes.  
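
For illustration only, a rough sketch of that mutexed check-then-create, using
a session-level advisory lock keyed on the table name (whether such a lock is
adequate across multiple processes is exactly the caveat above; the connection
URL is a placeholder):

import zlib
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/mydb")

def create_table_guarded(table):
    # derive a stable integer lock key from the table name
    lock_key = zlib.crc32(table.name.encode("utf-8"))
    conn = engine.connect()
    try:
        conn.execute(text("SELECT pg_advisory_lock(:k)"), k=lock_key)
        try:
            # the existence check and the CREATE TABLE both happen while the
            # lock is held, so no other session can race in between the two steps
            table.create(conn, checkfirst=True)
        finally:
            conn.execute(text("SELECT pg_advisory_unlock(:k)"), k=lock_key)
    finally:
        conn.close()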





Re: [sqlalchemy] Difficulties with parallel data insertion into the same table

2015-01-22 Thread SF Markus Elfring

> So there is no valid use case for parallel generation of tables
> except in particular kinds of multi-tenancy situations.

I find my use case simple enough (and therefore very valid).
I am going to manage two database tables by corresponding
Python classes with SQLAlchemy services.

I would appreciate if I can fill these data structures in parallel
without a serial database preparation step (table creation with
repeated meta-data specification).

Regards,
Markus



Re: [sqlalchemy] Difficulties with parallel data insertion into the same table

2015-01-22 Thread Jonathan Vanasco

> On Thursday, January 22, 2015 at 2:40:19 PM UTC-5, SF Markus Elfring wrote:
> I am surprised that this database software shows such (unexpected)
> behaviour.

Aside from what Michael wrote...

The reason why you're seeing an IntegrityError like that is that 
Postgres is raising an integrity error on its internal tables.  The `pg_` 
prefix is well known to relate to internal postgres system data. 
 PostgreSQL explicitly reserves the `pg_` prefix for its own schemas and 
advises users against creating tables that begin with `pg_` as well 
(http://www.postgresql.org/docs/9.2/static/ddl-schemas.html).  So to those 
with experience in Postgres, this wouldn't be an unexpected error, and it 
accurately explains what the underlying problem is.



Re: [sqlalchemy] Difficulties with parallel data insertion into the same table

2015-01-22 Thread Michael Bayer


SF Markus Elfring  wrote:

> 
>> The concurrent creation of tables is an extremely unusual use case
>> that I would not recommend, but especially when two different
>> threads/processes are contending to create the same table,
> 
> Should this use case not have become better known as the number of
> processor cores in various computers has grown over the years?

not at all.

The schema of a relational database is fixed in a similar way as to the 
structure of the software itself.  Just like it’s not a great idea to 
dynamically produce and compile new code within an application in order to 
solve an issue, it’s not a great idea to produce new schema structures on the 
fly. Decent relational database design will typically have as many as a few 
dozen tables in order to handle a particular component of a reasonably-designed 
application; databases that have hundreds of tables are typically dealing with 
multiple sub-components of an application.  A schema that has “N” tables 
with no upper bound doesn’t actually have any schema design; the app is just 
generating tables in response to data input.   So there is no valid use case 
for parallel generation of tables except in particular kinds of multi-tenancy 
situations.




> 
> 
>> that suggests the system attempts to encode data within the database
>> schema itself; that is, the existence of a table “XYZ” in fact
>> represents the data “XYZ” being present.
> 
> I do not try to achieve such a data encoding for my software application
> at the moment.
> 
> 
>> Otherwise, the PG developers in that thread suggest this use case
>> can be facilitated using postgres advisory locks.
> 
> I am going to try another approach out.
> 
> * Will it be more useful here to extend a serial database preparation step
>   before records are stored by background processes simultaneously?
> 
> * How often will corresponding implementation details need to be redesigned?
> 
> Regards,
> Markus



[sqlalchemy] Re: Conditional Insert Rewrite (part 2)

2015-01-22 Thread William Grisaitis
Lycovian:

I have one rather trivial question that might solve your problem - is your 
driver 32 bit or 64 bit? What about your python installation? 

I also use teradata and pyodbc and previously had similar issues with 
numeric data types, specifically the decimal type in teradata. What fixed 
things for me was using 32 bit versions of both python and the ODBC driver. 
Initially I was using 64 bit python with a 32 bit ODBC driver for Teradata, 
and my problems went away after changing to 32 bit python. Before, with 64 
bit python, I wasn't even able to do "select 0.0" via pyodbc. Now, with 32 
bit everything, I'm able to select and insert values with decimal type 
fields on linux or windows 7.
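
If it helps anyone to reproduce this quickly, a bare pyodbc check (the DSN name
and credentials below are placeholders) is enough to show whether a given
driver/Python bitness combination is broken:

import pyodbc

# placeholder DSN/credentials; this "SELECT 0.0" failed for me with 64-bit
# Python and a 32-bit Teradata ODBC driver, and works once both sides match
conn = pyodbc.connect("DSN=teradata_dsn;UID=myuser;PWD=mypassword")
cursor = conn.cursor()
cursor.execute("SELECT 0.0")
print(cursor.fetchone())
conn.close()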

One other idea is to use the Teradata command line utilities. (Referring to 
bteq, fastexport, fastload.) Personally I support your approach with 
sqlalchemy and ODBC (indeed I found your posts by searching for a 
sqlalchemy dialect for teradata), but if this approach doesn't work out, 
then a plan B could be using a flat file intermediary with some shell 
scripting or python subprocess, involving bteq or fastload.

Best,
William


On Friday, January 16, 2015 at 8:50:50 PM UTC-5, Lycovian wrote:
>
> For a Teradata SA dialect I had posted an earlier question regarding 
> re-writing a query triggered only on Integer columns with primary_key=True. 
>   Thanks to a comment from Mike B I figured out a workaround.  Today though 
> I discovered that this issue is due to a rather nasty bug in 
> Teradata's ODBC implementation.  The short version is that via ODBC (in my 
> case PyODBC and Ubuntu) you cannot insert None into any numeric column.  Le 
> sigh.
>
> For example here is the issue showing a table can be inserted:
> # connection stuff here 
> t = Table('test', metadata, Column('id', sqlalchemy.Integer(), 
> primary_key=True), Column('x', sqlalchemy.Integer()))
> ins = t.insert().values(x=2).compile()
> engine.connect().execute(ins)
> # success!
>
> ins = t.insert().values(x=None).compile()
> engine.connect().execute(ins)
>
> ...
>
> /home/etl/build/sqlalchemy/lib/sqlalchemy/engine/default.pyc in 
> do_execute(self, cursor, statement, parameters, context)
> 434
> 435 def do_execute(self, cursor, statement, parameters, 
> context=None):
> --> 436 cursor.execute(statement, parameters)
> 437
> 438 def do_execute_no_params(self, cursor, statement, 
> context=None):
>
> DBAPIError: (pyodbc.Error) ('HY000', '[HY000] [Teradata][ODBC Teradata 
> Driver][Teradata Database] The source parcel length does not match data 
> that was defined.  (-2673) (SQLExecDirectW)') [SQL: u'INSERT INTO 
> testsqlatd.test (id, x) VALUES (?, ?)'] [parameters: (Decimal('2'), None)]
>
> This is a rather serious bug in the driver, and possibly it has been fixed 
> in later versions of their product's ODBC driver, but I unfortunately don't 
> have access to any updates. 
>
> I guess the question is basically the same:
> *How do I remove a bind parameter in my dialect if the underlying datatype 
> is numeric (Float/Integer/Boolean/etc), the statement is an INSERT (or 
> UPDATE), and the bind value is None?*
>
> I've been looking at subclassing visit_insert or possibly visit_bindparam 
> to remove parameters that attempt to set any number-based column to None, 
> without success thus far.  Does that sound reasonable?  Anyone 
> have an example of removing a bind parameter from an INSERT statement based 
> on the bound value and the type?
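
Not a dialect-level answer, but as a stopgap one could filter the values at the
application level so that None is never bound to a numeric column at all (the
column is simply omitted from the INSERT). A rough sketch, reusing the table
object `t` from the example above; the helper name is made up:

from sqlalchemy import Integer, Numeric, Boolean

NUMERIC_TYPES = (Integer, Numeric, Boolean)   # Float is a subclass of Numeric

def insert_skipping_none_numerics(conn, table, row):
    # drop None values destined for numeric columns so no NULL numeric
    # bind parameter ever reaches the broken ODBC driver
    values = dict(
        (name, value) for name, value in row.items()
        if not (value is None and isinstance(table.c[name].type, NUMERIC_TYPES)))
    conn.execute(table.insert().values(**values))

# e.g. insert_skipping_none_numerics(engine.connect(), t, {'id': 1, 'x': None})
# emits an "INSERT INTO test (id) VALUES (?)" style statement with x left out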



Re: [sqlalchemy] Difficulties with parallel data insertion into the same table

2015-01-22 Thread SF Markus Elfring
> However in this case, this appears to be a side effect of attempting
> to issue CREATE TABLE statements in parallel, and in particular attempting
> to create a table of the same name twice in two different processes.

It seems that I dared to take this approach.


> PG developers describe this expected behavior here:
> http://www.postgresql.org/message-id/ca+tgmozadyvtwbfp1fl2smzbihcwt4uprzrlnnx1nb30ku3...@mail.gmail.com

Does the dialogue on a topic like "Errors on CREATE TABLE IF NOT EXISTS"
show any remaining open issues?


> I’m not surprised that PG does not prioritize attempting to make
> this error more user friendly as this is a very odd and probably
> unnecessary use case.

I am surprised that this database software shows such (unexpected) behaviour.


> this has nothing to do with the ORM nor does it really have much
> to do with SQLAlchemy as a whole.

It makes it more convenient to stress other software components,
doesn't it?


> You would have this issue in exactly the same way if you were emitting
> conflicting CREATE TABLE statements on the DBAPI cursor directly.

Thanks for your acknowledgement.


> The concurrent creation of tables is an extremely unusual use case
> that I would not recommend, but especially when two different
> threads/processes are contending to create the same table,

Should this use case not have become better known as the number of
processor cores in various computers has grown over the years?


> that suggests the system attempts to encode data within the database
> schema itself; that is, the existence of a table “XYZ” in fact
> represents the data “XYZ” being present.

I do not try to achieve such a data encoding for my software application
at the moment.


> Otherwise, the PG developers in that thread suggest this use case
> can be facilitated using postgres advisory locks.

I am going to try another approach out.

* Will it be more useful here to extend a serial database preparation step
  before records are stored by background processes simultaneously?
  (A rough sketch of this idea follows below.)

* How often will corresponding implementation details need to be redesigned?
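
A minimal sketch of such a serial preparation step — create the tables once in
the parent process, then let the worker processes only insert. The column list
is only guessed from the truncated CREATE TABLE statement in the error message,
and the connection URL is a placeholder:

from multiprocessing import Pool
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

DB_URL = "postgresql://user:pass@localhost/mydb"   # placeholder

metadata = MetaData()
positions = Table("positions_parallel1", metadata,
                  Column("line", Integer, primary_key=True),
                  Column("function", String(200)),
                  Column("pattern", String(200)))

def store(rows):
    # every worker process opens its own engine/connection; no DDL is emitted here
    engine = create_engine(DB_URL)
    with engine.begin() as conn:
        conn.execute(positions.insert(), rows)
    engine.dispose()

if __name__ == "__main__":
    # serial preparation step: emit the DDL exactly once, before any worker runs
    engine = create_engine(DB_URL)
    metadata.create_all(engine)
    engine.dispose()          # avoid sharing pooled connections across fork()

    batches = [[{"function": "f%d" % i, "pattern": "p%d" % i}] for i in range(4)]
    Pool(processes=4).map(store, batches)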

Regards,
Markus



Re: [sqlalchemy] Difficulties with parallel data insertion into the same table

2015-01-22 Thread Michael Bayer


SF Markus Elfring  wrote:

> Hello,
> 
> I am trying to write some data from a source code analysis,
> which is performed by a few processor cores simultaneously,
> into a single table.
> Now I stumble over a message like the following again.
> 
> "…
> sqlalchemy.exc.IntegrityError: (IntegrityError) duplicate key value violates 
> unique constraint "pg_type_typname_nsp_index"
> DETAIL:  Key (typname, typnamespace)=(positions_parallel1_line_seq, 2200) 
> already exists.
> '\nCREATE TABLE positions_parallel1 (\n\tfunction […], pattern)\n)\n\n' {}
> …"
> 
> 
> The following software components were involved for this
> application test on my openSUSE Tumbleweed system.
> 
> 1. SQLAlchemy 0.9.8-78.1
> 2. Psycopg 2.5.2-3.1
> 3. PostgreSQL 9.3.5-3.1
> 4. Python 2.7.9-2.1
> 5. Coccinelle spatch 1.0.0-rc23
> 6. make 4.1-2.2
> 
> 
> I have searched a bit on the internet.

Normally, this is an integrity constraint error raised by your database; it 
occurs when a process attempts to INSERT or UPDATE a row that contains a key 
which already exists.   When performing an INSERT or UPDATE you 
need to ensure that the row will not duplicate a uniquely-constrained value that 
is already present in the table.

However in this case, this appears to be a side effect of attempting to issue 
CREATE TABLE statements in parallel, and in particular attempting to create a 
table of the same name twice in two different processes.   PG developers 
describe this expected behavior here: 
http://www.postgresql.org/message-id/ca+tgmozadyvtwbfp1fl2smzbihcwt4uprzrlnnx1nb30ku3...@mail.gmail.com.
  I’m not surprised that PG does not prioritize attempting to make this error 
more user friendly as this is a very odd and probably unnecessary use case.


> Are there any further
> software development challenges to consider for the parallel
> creation of database tables
> with the object-relational interface?

this has nothing to do with the ORM nor does it really have much to do with 
SQLAlchemy as a whole.   SQLAlchemy can be seen as a comprehensive system to 
automate the sending and receiving of messages along a database connection.   
You would have this issue in exactly the same way if you were emitting 
conflicting CREATE TABLE statements on the DBAPI cursor directly.
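
To make that concrete, here is a rough sketch at the raw DBAPI level — no
SQLAlchemy anywhere — that can trip over the very same
pg_type_typname_nsp_index error when the timing is unlucky (connection
parameters and the column list are placeholders):

import multiprocessing
import psycopg2

def create_table(_):
    conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
    conn.autocommit = True
    cur = conn.cursor()
    # two processes reaching this statement at nearly the same moment can
    # both pass the existence check and then collide in pg_type
    cur.execute("CREATE TABLE IF NOT EXISTS positions_parallel1 "
                "(line serial PRIMARY KEY, function text, pattern text)")
    conn.close()

if __name__ == "__main__":
    multiprocessing.Pool(4).map(create_table, range(4))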

The concurrent creation of tables is an extremely unusual use case that I would 
not recommend, but especially when two different threads/processes are 
contending to create the same table, that suggests the system attempts to 
encode data within the database schema itself; that is, the existence of a 
table “XYZ” in fact represents the data “XYZ” being present.   There are some 
high volume websites that do things like this, like Reddit, but I think they’re 
nuts.   But certainly, for an application that isn’t handling millions of 
simultaneous users or hundreds of billions of rows, such an architecture is 
absolutely not called for.  Data should be stored in rows, not in the names of 
tables.

Otherwise, the PG developers in that thread suggest this use case can be 
facilitated using postgres advisory locks.  I have no experience with those, 
but see that thread for more details.










> 
> Regards,
> Markus
> 



[sqlalchemy] Difficulties with parallel data insertion into the same table

2015-01-22 Thread SF Markus Elfring
Hello,

I am trying to write some data from a source code analysis,
which is performed by a few processor cores simultaneously,
into a single table.
Now I stumble over a message like the following again.

"…
sqlalchemy.exc.IntegrityError: (IntegrityError) duplicate key value violates 
unique constraint "pg_type_typname_nsp_index"
DETAIL:  Key (typname, typnamespace)=(positions_parallel1_line_seq, 2200) 
already exists.
 '\nCREATE TABLE positions_parallel1 (\n\tfunction […], pattern)\n)\n\n' {}
…"


The following software components were involved for this
application test on my openSUSE Tumbleweed system.

1. SQLAlchemy 0.9.8-78.1
2. Psycopg 2.5.2-3.1
3. PostgreSQL 9.3.5-3.1
4. Python 2.7.9-2.1
5. Coccinelle spatch 1.0.0-rc23
6. make 4.1-2.2


I have searched a bit on the internet. Are there any further
software development challenges to consider for the parallel
creation of database tables with the object-relational interface?

Regards,
Markus
