Re: [sqlalchemy] SQLAlchemy Access to Native Bulk Loaders

2020-04-22 Thread Benjamin Taub
Thank you for taking the time to lay that out, Jonathan. I am not sure yet
whether this will be a major issue for me, and I would like to stay within
SQLAlchemy to the greatest extent possible. Since I'm using the Core, I
think I'll try your second option. If I run into issues, I can rethink at
that time.

Thanks again!
Ben
__
Benjamin Taub, CEO
Direct: (734) 585-3503
Check out Talentspace  to see
a list of prescreened analytics and data engineering candidates
Blog: http://www.dataspace.com/insights/blog/
Twitter: twitter.com/dataspaceinc

*NOTE: I usually check email only 2 - 3 times / day.  For issues needing
immediate attention, please call.*


On Mon, Apr 20, 2020 at 12:40 PM Jonathan Vanasco wrote:

>
> On Sunday, April 19, 2020 at 4:44:56 PM UTC-4, Ben wrote:
>>
>>
>> These seem to be embedded in the respective SQL implementations but are
>> clearly not ANSI standard. I'm not sure whether that disqualifies it as a
>> SQLAlchemy feature request, or whether anyone else could even use it, but
>> functionality like this would, at least for me, make my implementation
>> more DB-independent.
>>
>
> The major database servers and clients do have similar extensions to ANSI
> and the Python DBAPI here, but they're fairly different from one another in
> how they work and in their input/output formats (and capabilities).  They
> are also not necessarily supported by the underlying database drivers.
> Psycopg2 has "copy_" methods on the cursor (
> https://www.psycopg.org/docs/usage.html#copy); there are several MySQL
> drivers, and of the two I looked at, neither has explicit support for this.
> I'm not sure how this would really work anyway; it's basically designed to
> be implemented in the C clients against local files.
>
> The "bulk loading" functionality in SQLAlchemy largely amounts to
> low-level DBAPI operations for insert statements, and SQLAlchemy's
> performance there is about as close as you can get to using the driver
> itself, while still having the syntactic sugar and not needing to write to
> a specific database.  I don't believe there is any code that invokes a
> database client to read a source file itself; I am often mistaken, though,
> as Mike has put a ton of work into SQLAlchemy over the years.
>
> From experience with large imports, I can tell you the following:
>
> * The best performance for large imports involves:
> 1) take the database offline to clients
> 2) drop (saving a copy of) all the foreign key and check constraints
> 3) import your data from a file with the native client that ships with
> your database
> 4) add the foreign key and check constraints back in
> if you feel the need to script this, you could use `subprocess` to
> invoke the database's native client on the machine
>
> * The second best performance is:
> use SQLAlchemy's bulk strategies,
> segment your source data file into ranges, and
> run multiple processes in parallel, each processing a subset of the
> source data file
>
> * The third best performance is:
>    write a script that uses the SQLAlchemy ORM with the 'windowed query'
> option to iterate over a range of lines in your source file,
>    have the script use a third-party tool like Redis to mark that it has
> "checked out" sections of the source data and has completed/not-completed
> each range, and
>    run multiple processes in parallel, each processing a subset of the
> source data file
>
>
> In the second and third approaches, you can periodically run a query or
> script to calculate the right number of processes to run.  It could be 7,
> or 20, or another number.
>


Re: [sqlalchemy] SQLAlchemy Access to Native Bulk Loaders

2020-04-20 Thread Jonathan Vanasco

On Sunday, April 19, 2020 at 4:44:56 PM UTC-4, Ben wrote:
>
>
> These seem to be embedded in the respective SQL implementations but are 
> clearly not ANSI standard. I'm not sure whether that disqualifies it as a 
> SQLAlchemy feature request, or whether anyone else could even use it, but 
> functionality like this would, at least for me, make my implementation 
> more DB-independent.
>

The major database servers and clients do have similar extensions to ANSI 
and the Python DBAPI here, but they're fairly different from one another in 
how they work and in their input/output formats (and capabilities).  They 
are also not necessarily supported by the underlying database drivers. 
Psycopg2 has "copy_" methods on the cursor (
https://www.psycopg.org/docs/usage.html#copy); there are several MySQL 
drivers, and of the two I looked at, neither has explicit support for this. 
I'm not sure how this would really work anyway; it's basically designed to 
be implemented in the C clients against local files.
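
For what it's worth, those "copy_" methods are reachable from SQLAlchemy by 
dropping down to the raw DBAPI connection. A rough, untested sketch, 
assuming Postgres and a local CSV; the URL, table, and column names are 
placeholders:

    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

    # raw_connection() hands back the underlying psycopg2 connection
    raw_conn = engine.raw_connection()
    try:
        with raw_conn.cursor() as cur, open("data.csv") as f:
            # copy_expert streams the file over the wire, so it only
            # needs to exist on the *client* machine
            cur.copy_expert(
                "COPY my_table (col_a, col_b) FROM STDIN WITH (FORMAT csv, HEADER)",
                f,
            )
        raw_conn.commit()
    finally:
        raw_conn.close()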

The "bulk loading" functionality in SQLAlchemy largely amounts to 
low-level DBAPI operations for insert statements, and SQLAlchemy's 
performance there is about as close as you can get to using the driver 
itself, while still having the syntactic sugar and not needing to write to 
a specific database.  I don't believe there is any code that invokes a 
database client to read a source file itself; I am often mistaken, though, 
as Mike has put a ton of work into SQLAlchemy over the years.

From experience with large imports, I can tell you the following:

* The best performance for large imports involves:
1) take the database offline to clients
2) drop (saving a copy of) all the foreign key and check constraints
3) import your data from a file with the native client that ships with 
your database
4) add the foreign key and check constraints back in
if you feel the need to script this, you could use `subprocess` to 
invoke the database's native client on the machine, as in the sketch below
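
A rough, untested sketch of that `subprocess` step, using psql's \copy as 
one example; the database, table, column, and file names are placeholders, 
and mysqlimport or SQL Server's bcp would be the analogous clients:

    import subprocess

    # \copy reads the file on the client side, so this works even when
    # the server is a remote machine
    subprocess.run(
        [
            "psql",
            "--dbname", "mydb",
            "--command",
            r"\copy my_table (col_a, col_b) from 'data.csv' csv header",
        ],
        check=True,  # raise if the client exits non-zero
    )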

* The second best performance is:
use SQLAlchemy's bulk strategies, 
segment your source data file into ranges, and 
run multiple processes in parallel, each processing a subset of the 
source data file (see the sketch below)
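
Roughly, with Core (untested; the URL, names, chunk size, and worker count 
are placeholders, and CSV values arrive as strings, so real code would also 
coerce types):

    import csv
    import itertools
    from multiprocessing import Pool

    from sqlalchemy import MetaData, Table, create_engine

    DB_URL = "postgresql+psycopg2://user:pass@localhost/mydb"
    CHUNK = 10_000  # rows per executemany batch

    def load_range(bounds):
        start, stop = bounds
        engine = create_engine(DB_URL)  # one engine per worker process
        table = Table("my_table", MetaData(), autoload=True, autoload_with=engine)
        with open("data.csv") as f:
            rows = list(itertools.islice(csv.DictReader(f), start, stop))
        with engine.begin() as conn:
            for i in range(0, len(rows), CHUNK):
                # a list of dicts triggers the DBAPI executemany path
                conn.execute(table.insert(), rows[i : i + CHUNK])

    if __name__ == "__main__":
        # four workers, 250k rows each; tune both numbers to your data
        ranges = [(n * 250_000, (n + 1) * 250_000) for n in range(4)]
        with Pool(4) as pool:
            pool.map(load_range, ranges)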

* The third best performance is:
   write a script that uses the SQLAlchemy ORM with the 'windowed query' 
option to iterate over a range of lines in your source file,
   have the script use a third-party tool like Redis to mark that it has 
"checked out" sections of the source data and has completed/not-completed 
each range (see the sketch below), and
   run multiple processes in parallel, each processing a subset of the 
source data file
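
The Redis bookkeeping could be as simple as the following; the key names 
and range size are invented for illustration:

    import redis

    r = redis.Redis()
    RANGE_SIZE = 100_000

    def claim_next_range():
        """Atomically claim the next unprocessed slice of the source file."""
        # INCR is atomic, so two workers can never claim the same slice
        n = r.incr("import:next_range") - 1
        return n * RANGE_SIZE, (n + 1) * RANGE_SIZE

    def mark_done(start, stop):
        r.sadd("import:completed", "%d-%d" % (start, stop))

    start, stop = claim_next_range()
    # ... iterate rows[start:stop] with the ORM / windowed queries here ...
    mark_done(start, stop)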

 
In the second and third approaches, you can periodically run a query or 
script to calculate the right number of processes to run.  It could be 7, 
or 20, or another number.


Re: [sqlalchemy] SQLAlchemy Access to Native Bulk Loaders

2020-04-19 Thread Richard Damon
On 4/19/20 4:44 PM, Benjamin Taub wrote:
> Thanks for taking the time to respond, Richard. I may be thinking
> about the command line option that you mentioned. However, I do see
> that MySQL has a LOAD DATA statement
> (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) that, I
> think, does what I'm thinking about. Similarly, Postgres has COPY
> (https://www.postgresql.org/docs/9.2/sql-copy.html) and SQL Server has
> BULK INSERT
> (https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15).
>
> These seem to be embedded in the respective SQL implementations but are
> clearly not ANSI standard. I'm not sure whether that disqualifies it as a
> SQLAlchemy feature request, or whether anyone else could even use it,
> but functionality like this would, at least for me, make my
> implementation more DB-independent.
>
> Anyhow, thanks again for your note and your work on SQLAlchemy. I
> appreciate it.
>
> Ben

I will admit that wasn't a command I was familiar with, but being
DB-specific, it is something I would tend to minimize the use of.

-- 
Richard Damon



Re: [sqlalchemy] SQLAlchemy Access to Native Bulk Loaders

2020-04-19 Thread Benjamin Taub
Thanks for taking the time to respond, Richard. I may be thinking about the
command line option that you mentioned. However, I do see that MySQL has a
LOAD DATA statement (https://dev.mysql.com/doc/refman/8.0/en/load-data.html)
that, I think, does what I'm thinking about. Similarly, Postgres has COPY (
https://www.postgresql.org/docs/9.2/sql-copy.html) and SQL Server has BULK
INSERT (
https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15
).
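
As far as I can tell, all three can at least be issued through SQLAlchemy 
as textual SQL, so only the statement is vendor-specific, not the calling 
pattern. A rough, untested sketch with placeholder URLs, tables, and paths; 
note that Postgres's COPY FROM 'file' reads the file on the *server*, while 
MySQL's LOCAL INFILE has to be enabled on both client and server:

    from sqlalchemy import create_engine, text

    # Postgres: server-side COPY (the file must be readable by the server)
    pg = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
    with pg.begin() as conn:
        conn.execute(text(
            "COPY my_table FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER)"
        ))

    # MySQL: LOAD DATA LOCAL INFILE (client-side file)
    my = create_engine("mysql+pymysql://user:pass@localhost/mydb?local_infile=1")
    with my.begin() as conn:
        conn.execute(text(
            "LOAD DATA LOCAL INFILE 'data.csv' INTO TABLE my_table "
            "FIELDS TERMINATED BY ',' IGNORE 1 LINES"
        ))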

These seem to be embedded in the respective SQL implementations but are
clearly not ANSI standard. I'm not sure whether that disqualifies it as a
SQLAlchemy feature request, or whether anyone else could even use it, but
functionality like this would, at least for me, make my implementation
more DB-independent.

Anyhow, thanks again for your note and your work on SQLAlchemy. I
appreciate it.

Ben
__
Benjamin Taub, CEO
Direct: (734) 585-3503
Check out Talentspace  to see
a list of prescreened analytics and data engineering candidates
Blog: http://www.dataspace.com/insights/blog/
Twitter: twitter.com/dataspaceinc

*NOTE: I usually check email only 2 - 3 times / day.  For issues needing
immediate attention, please call.*


On Sun, Apr 19, 2020 at 3:59 PM Richard Damon wrote:

>
> I can't think of any SQL engines where the engine itself can read a CSV
> file to load a database (in many cases the actual SQL engine is off on
> another machine with the database and has no direct link to the local
> file system). Like SQLite, many have a command-line interface that can
> read the file and insert it into the database.
>
> Also, some higher-end wrappers might provide such a feature, but that is
> NOT part of the base SQL language.
>
> --
> Richard Damon
>



Re: [sqlalchemy] SQLAlchemy Access to Native Bulk Loaders

2020-04-19 Thread Richard Damon
On 4/19/20 3:01 PM, Ben wrote:
> Hi, James, thank you for the info. I'll give this a try. Still, it
> does seem to require that the data be read into Python and then
> written out to the DB, albeit quickly. What I'd prefer is a way to
> issue a call to the DB to have it 'suck in' some CSV file directly,
> rather than having to read it into Python first. Most databases have a
> bulk loader to handle such jobs, so I would think it would be possible
> to provide a vendor-neutral way to call this functionality, but I
> could be missing something. Perhaps I should look at adding this to
> the SQLAlchemy GitHub project, but I'm not sure that the world would
> want to rely on my code :)
>
> In any case, thank you so much for taking the time to reply.
>
> Ben

I can't think of any SQL engines where the engine itself can read a CSV
file to load a database (in many cases the actual SQL engine is off on
another machine with the database and has no direct link to the local
file system). Like SQLite, many have a command-line interface that can
read the file and insert it into the database.

Also, some higher-end wrappers might provide such a feature, but that is
NOT part of the base SQL language.

-- 
Richard Damon



Re: [sqlalchemy] SQLAlchemy Access to Native Bulk Loaders

2020-04-19 Thread Ben
Hi, James, thank you for the info. I'll give this a try. Still, it does 
seem to require that the data be read into Python and then written out to 
the DB, albeit quickly. What I'd prefer is a way to issue a call to the DB 
to have it 'suck in' some CSV file directly, rather than having to read it 
into Python first. Most databases have a bulk loader to handle such jobs, 
so I would think it would be possible to provide a vendor-neutral way to 
call this functionality, but I could be missing something. Perhaps I should 
look at adding this to the SQLAlchemy GitHub project, but I'm not sure that 
the world would want to rely on my code :)

In any case, thank you so much for taking the time to reply.

Ben

On Sunday, April 19, 2020 at 1:23:06 PM UTC-4, James Fennell wrote:
>
> Hi Ben,
>
> Have you checked out bulk operations? 
> https://docs.sqlalchemy.org/en/13/orm/persistence_techniques.html#bulk-operations
>
> Bulk operations provide a DB-agnostic API for doing large numbers of 
> inserts and/or updates. The speed up you see, compared to session.add, 
> depends on your database and the database connection arguments. With 
> Postgres, I've needed to enable batch mode to see the fullest benefit: 
> https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-batch-mode
>
> Personally I've seen huge performance improvements in my applications after 
> migrating big operations from session.add over to the bulk API.
>
> Of course, as you say, you can drop to lower-level SQL calls to get it even 
> faster, but then you run into a bunch of other issues.
>
> James
>
>
> On Sun, Apr 19, 2020, 12:46 PM Ben wrote:
>
>> I hope this is the right place for this... I need to load large files 
>> into my database. As I understand it, I can do this in one of two ways with 
>> SQLAlchemy Core: 1) bring the data into Python and then write it out with 
>> the *add* method or, alternatively, 2) use SQLAlchemy to issue a command 
>> to the DB to use its native bulk loader to read data from my file. I would 
>> expect this second approach to be faster, to require less code, and to 
>> avoid issues such as trying to put too much in memory at one time. However, 
>> it is DB-vendor-specific (i.e. I believe the command I send to a MySQL DB 
>> will differ from the one I send to a Postgres DB). 
>>
>> So, 
>>
>>    - Do I properly understand SQLAlchemy's capabilities here, or am I 
>>    missing something?
>>    - If I do have this right, is generic access to bulk loaders 
>>    something that is on the upgrade / new development list?
>>
>> Sorry if this isn't the right place for this.
>> Thanks!
>> Ben
>>



Re: [sqlalchemy] SQLAlchemy Access to Native Bulk Loaders

2020-04-19 Thread James Fennell
Hi Ben,

Have you checked out bulk operations?
https://docs.sqlalchemy.org/en/13/orm/persistence_techniques.html#bulk-operations

Bulk operations provide a DB-agnostic API for doing large numbers of
inserts and/or updates. The speed up you see, compared to session.add,
depends on your database and the database connection arguments. With
Postgres, I've needed to enable batch mode to see the fullest benefit:
https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-batch-mode
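
For example, something like this (untested; the model, URL, and rows are 
placeholders, and executemany_mode="batch" is the SQLAlchemy 1.3.7+ 
spelling of the batch-mode flag described at that link):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class MyModel(Base):  # placeholder model
        __tablename__ = "my_table"
        id = Column(Integer, primary_key=True)
        col_a = Column(Integer)
        col_b = Column(String)

    engine = create_engine(
        "postgresql+psycopg2://user:pass@localhost/mydb",
        executemany_mode="batch",  # psycopg2 execute_batch under the hood
    )
    Session = sessionmaker(bind=engine)
    session = Session()

    # plain dicts, no per-object identity-map bookkeeping: that is where
    # the bulk API gets its speed
    session.bulk_insert_mappings(
        MyModel,
        [{"col_a": i, "col_b": "row %d" % i} for i in range(100_000)],
    )
    session.commit()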

Personally I've seen huge performance improvements in my applications after
migrating big operations from session.add over to the bulk API.

Of course, as you say, you can drop to lower-level SQL calls to get it even
faster, but then you run into a bunch of other issues.

James


On Sun, Apr 19, 2020, 12:46 PM Ben wrote:

> I hope this is the right place for this... I need to load large files into
> my database. As I understand it, I can do this in one of two ways with
> SQLAlchemy Core: 1) bring the data into Python and then write it out with
> the *add* method or, alternatively, 2) use SQLAlchemy to issue a command
> to the DB to use its native bulk loader to read data from my file. I would
> expect this second approach to be faster, to require less code, and to
> avoid issues such as trying to put too much in memory at one time. However,
> it is DB-vendor-specific (i.e. I believe the command I send to a MySQL DB
> will differ from the one I send to a Postgres DB).
>
> So,
>
>    - Do I properly understand SQLAlchemy's capabilities here, or am I
>    missing something?
>    - If I do have this right, is generic access to bulk loaders something
>    that is on the upgrade / new development list?
>
> Sorry if this isn't the right place for this.
> Thanks!
> Ben
>



[sqlalchemy] SQLAlchemy Access to Native Bulk Loaders

2020-04-19 Thread Ben
I hope this is the right place for this... I need to load large files into 
my database. As I understand it, I can do this in one of two ways with 
SQLAlchemy Core: 1) bring the data into Python and then write it out with 
the *add* method or, alternatively, 2) use SQLAlchemy to issue a command to 
the DB to use its native bulk loader to read data from my file. I would 
expect this second approach to be faster, to require less code, and to 
avoid issues such as trying to put too much in memory at one time. However, 
it is DB-vendor-specific (i.e. I believe the command I send to a MySQL DB 
will differ from the one I send to a Postgres DB). 

So, 

   - Do I properly understand SQLAlchemy's capabilities here, or am I 
   missing something?
   - If I do have this right, is generic access to bulk loaders something 
   that is on the upgrade / new development list?

Sorry if this isn't the right place for this.
Thanks!
Ben
