On Sunday, April 19, 2020 at 4:44:56 PM UTC-4, Ben wrote:
>
>
> These seem to be embedded in the related SQL implementations but are 
> clearly not ANSI standard. I'm not sure if that makes it disqualifying for 
> a SQLAlchemy feature request, or if anyone else could even use it, but 
> functionality like this is something that, at least for me, would make my 
> implementation more DB independent.
>

The major database servers and clients do have similar extensions to ANSI 
and the Python DBAPI here, but they're fairly different from one another in 
how they work, their input/output formats, and their capabilities.  They 
also are not necessarily supported by the underlying database drivers. 
Psycopg2 has "copy_*" methods on the cursor (
https://www.psycopg.org/docs/usage.html#copy); there are several MySQL 
drivers, and of the two I looked at, neither has explicit support for this. 
I'm not sure how this would really work in any case - the feature is 
basically designed to be implemented in the C clients working against 
local files. 
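
For what it's worth, the psycopg2 route looks roughly like this - just a 
sketch, and the DSN, table and column names below are made up:

    # minimal sketch of psycopg2's COPY support; the DSN, table and
    # columns are placeholders, not anything from this thread
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")
    with conn, conn.cursor() as cur, open("data.csv") as f:
        # COPY ... FROM STDIN streams the local file over the client
        # connection, so no server-side file access is needed
        cur.copy_expert(
            "COPY my_table (id, name, value) FROM STDIN WITH CSV HEADER",
            f,
        )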

The "bulk loading" functionality in SQLAlchemy is largely in the realm of 
low-level DBAPI operations for INSERT statements, and SQLAlchemy's 
performance there is about as close as you can get to using the driver 
directly - while still having the syntactic sugar and not being tied to a 
specific database.  I don't believe there is any code that invokes a 
database client to read a source file itself; I am often mistaken, though, 
as Mike has put a ton of work into SQLAlchemy over the years.
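
To illustrate what I mean by low-level DBAPI operations for INSERT 
statements - a list of dicts passed to a Core insert() goes through the 
driver's executemany(); the table, columns and URL here are invented:

    # rough sketch of a Core bulk insert; all names are placeholders
    from sqlalchemy import (
        create_engine, MetaData, Table, Column, Integer, String
    )

    engine = create_engine("postgresql://localhost/mydb")
    metadata = MetaData()
    my_table = Table(
        "my_table", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String(50)),
    )

    rows = [{"id": i, "name": "row %d" % i} for i in range(10000)]

    # passing a list of dicts sends one INSERT through the DBAPI's
    # executemany(), which is about as fast as using the driver directly
    with engine.begin() as conn:
        conn.execute(my_table.insert(), rows)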

From experience with large imports, I can tell you the following:

* The best performance for large imports involves:
    1) take the database offline to clients
    2) drop all the foreign key and check constraints (saving their 
definitions so you can recreate them)
    3) import your data from a file with the native client that ships with 
your database
    4) add the foreign key and check constraints back in
    if you feel the need to script this, you could use `subprocess` to 
invoke the database's native client on the machine, as in the sketch below
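
If you do go the `subprocess` route, it's not much more than this (the 
psql invocation, table and file names are one possible example for 
PostgreSQL; MySQL's equivalents are mysqlimport / LOAD DATA LOCAL INFILE):

    # sketch of driving the native client from Python; names and paths
    # are placeholders
    import subprocess

    subprocess.run(
        [
            "psql", "mydb",
            "-c", r"\copy my_table FROM 'data.csv' WITH CSV HEADER",
        ],
        check=True,
    )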

* The second best performance is:
    use SQLAlchemy's bulk strategies
    segment your source data file into line ranges
    run multiple processes in parallel, each processing a subset of the 
source data file (there's a sketch of this right after the list)
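
A rough sketch of that second approach, assuming a CSV whose header row 
matches the table's columns - every name here is made up for illustration:

    import csv
    import itertools
    from multiprocessing import Pool
    from sqlalchemy import (
        create_engine, MetaData, Table, Column, Integer, String
    )

    metadata = MetaData()
    my_table = Table(
        "my_table", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String(50)),
    )

    def load_range(line_range):
        start, stop = line_range
        # each worker process gets its own engine / connection pool
        engine = create_engine("postgresql://localhost/mydb")
        with open("data.csv") as f, engine.begin() as conn:
            reader = csv.DictReader(f)
            rows = [dict(row) for row in itertools.islice(reader, start, stop)]
            if rows:
                conn.execute(my_table.insert(), rows)

    if __name__ == "__main__":
        # e.g. four workers, each taking a 250k-line slice of the file
        ranges = [(i, i + 250_000) for i in range(0, 1_000_000, 250_000)]
        with Pool(4) as pool:
            pool.map(load_range, ranges)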

* The third best performance is:
    write a script that uses the SQLAlchemy ORM with the "windowed query" 
recipe to iterate over a range of lines in your source file
    have the script use a 3rd party tool like Redis to mark that it has 
"checked out" a section of the source data, and whether it has completed 
that range (a sketch of the checkout idea follows this list)
    run multiple processes in parallel, each processing a subset of the 
source data file
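
The Redis "check out" piece could be as simple as a SET NX per range - the 
key names and chunk size here are invented, and process_lines() is a 
hypothetical stand-in for the actual ORM work:

    import redis

    r = redis.Redis()

    def claim_range(start, stop):
        # SET NX succeeds only for the first worker to claim this range
        return r.set("import:range:%d-%d" % (start, stop), "in-progress",
                     nx=True)

    def mark_done(start, stop):
        r.set("import:range:%d-%d" % (start, stop), "done")

    for start in range(0, 1_000_000, 50_000):
        stop = start + 50_000
        if not claim_range(start, stop):
            continue  # another process already owns this slice
        process_lines(start, stop)   # hypothetical: the windowed-query work
        mark_done(start, stop)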

 
In the second and third approaches, you can periodically run a query or 
script to calculate the right number of processes to run.  It could be 7, 
or 20, or another number.