Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Dimitri Fontaine
On Thursday, February 7, 2008, Greg Smith wrote:
 On Wednesday, February 6, 2008, Dimitri Fontaine wrote:
 In other cases, a logical line is a physical line, so we start after the first
 newline found after the given lseek start position, and continue reading past the
 last lseek position until a newline.

 Now you're talking.  Find a couple of split points that way, fine-tune the
 boundaries a bit so they rest on line termination points, and off you go.

I was thinking of not even reading the file content from the controller 
thread: just decide the splitting points in bytes (0..ST_SIZE/4, 
ST_SIZE/4+1..2*ST_SIZE/4, etc.) and let each reading thread fine-tune by 
beginning to process input only after having read its first newline, etc.
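
Here's a minimal sketch of that splitting scheme (hypothetical helper names, 
not actual pgloader code), assuming half-open byte ranges and the convention 
that a line belongs to the chunk that contains its first byte:

import os

def split_points(path, nb_threads):
    # naive byte ranges of roughly equal size, one half-open [start, end) per thread
    size = os.stat(path).st_size
    step = size // nb_threads
    bounds = [i * step for i in range(nb_threads)] + [size]
    return list(zip(bounds[:-1], bounds[1:]))

def read_chunk(path, start, end):
    # yield the logical lines owned by [start, end): discard the partial line
    # straddling `start` (it belongs to the previous chunk), and keep reading
    # past `end` until the newline that terminates the last line we own
    with open(path, 'rb') as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()       # skip up to the first newline at or after start-1
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line

Since every chunk applies the same rule, no thread needs to tell another where 
it actually began, which is the point debated in the replies below.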

And while we're still at the design board, I'm also thinking of adding a 
per-section parameter (with a global default value possible), 
split_file_reading, which defaults to False and which you'll have to set to True 
for pgloader to behave the way we're talking about.

When split_file_reading = False and section_threads != 1, pgloader will have to 
manage several processing threads per section but only one file reading 
thread, handing the read input to processing threads in a round-robin fashion, 
as sketched below. In the future the choice of processing thread will possibly 
(another knob) be smarter than that, as soon as we get CE support into pgloader.
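
A minimal sketch of that single-reader, round-robin dispatch (queue size and 
helper names are hypothetical):

import itertools
import queue
import threading

def round_robin_load(lines, nb_workers, process):
    # one reading thread (the caller) feeds nb_workers processing threads,
    # one bounded queue per worker, lines handed out in round-robin order
    queues = [queue.Queue(maxsize=1000) for _ in range(nb_workers)]

    def worker(q):
        while True:
            line = q.get()
            if line is None:      # sentinel: no more input for this worker
                break
            process(line)

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()

    for line, q in zip(lines, itertools.cycle(queues)):
        q.put(line)
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()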

When split_file_reading = True and section_threads != 1, pgloader will have to 
manage several processing threads per section, each one responsible for 
reading its own part of the file, with processing boundaries to be discovered at 
reading time. Adding CE support in this case means managing two 
separate thread pools per section, one responsible for split file reading 
and another responsible for data buffering and routing (COPY to the partition 
instead of to the parent table).

In both cases, maybe it would also be needed for pgloader to be able to have a 
separate thread for COPYing the buffer to the server, allowing it to continue 
preparing the next buffer in the meantime?

This will need some re-architecturing of pgloader, but it seems worth it 
(I'm not entirely sold on the two thread-pools idea, though, and this last 
continue-reading-while-copying idea still has to be examined).
Some of the work needing to be done is by now quite clear to me, but part 
of it still needs its design-time share. As usual, though, the real hard part 
is knowing exactly what we want to get done, and we're showing good progress 
here :)

Greg's behaviour:
max_threads   = N 
max_parallel_sections = 1
section_threads   = -1
split_file_reading= True

Simon's behaviour:
max_threads   = N
max_parallel_sections = 1   # I don't think Simon wants parallel sections
section_threads   = -1
split_file_reading= False

Comments?
-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Kenneth Marshall
On Thu, Feb 07, 2008 at 12:06:42PM -0500, Greg Smith wrote:
 On Thu, 7 Feb 2008, Dimitri Fontaine wrote:

 I was thinking of not even reading the file content from the controller
 thread, just decide splitting points in bytes (0..ST_SIZE/4 -
 ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
 beginning to process input after having read first newline, etc.

 The problem I was pointing out is that if chunk#2 moved forward a few bytes 
 before it started reading in search of a newline, how will chunk#1 know 
 that it's supposed to read up to that further point?  You have to stop #1 
 from reading further when it catches up with where #2 started.  Since the 
 start of #2 is fuzzy until some reading is done, what you're describing 
 will need #2 to send some feedback to #1 after they've both started, and 
 that sounds bad to me.  I like designs where the boundaries between threads 
 are clearly defined before any of them start and none of them ever talk to 
 the others.


As long as both processes understand the start condition, there
is not a problem. p1 starts at the beginning and processes through the chunk2
offset until it reaches the start condition. p2 starts loading from the
chunk2 offset plus the amount needed to reach the start condition, ...

DBfile|---|--x--|x|-x--|
  x chunk1---
 x chunk2
 x chunk3---...

As long as both pieces use the same test, they will each process
non-overlapping segments of the file and still process 100% of the
file.

Ken

 In both cases, maybe it would also be needed for pgloader to be able to have a
 separate thread for COPYing the buffer to the server, allowing it to continue
 preparing the next buffer in the meantime?

 That sounds like a V2.0 design to me.  I'd only chase after that level of 
 complexity if profiling suggests that's where the bottleneck really is.

 --
 * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD





Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Greg Smith

On Thu, 7 Feb 2008, Dimitri Fontaine wrote:


I was thinking of not even reading the file content from the controller
thread, just decide splitting points in bytes (0..ST_SIZE/4 -
ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
beginning to process input after having read first newline, etc.


The problem I was pointing out is that if chunk#2 moved forward a few bytes 
before it started reading in search of a newline, how will chunk#1 know 
that it's supposed to read up to that further point?  You have to stop #1 
from reading further when it catches up with where #2 started.  Since the 
start of #2 is fuzzy until some reading is done, what you're describing 
will need #2 to send some feedback to #1 after they've both started, and 
that sounds bad to me.  I like designs where the boundaries between 
threads are clearly defined before any of them start and none of them ever 
talk to the others.



In both cases, maybe it would also be needed for pgloader to be able to have a
separate thread for COPYing the buffer to the server, allowing it to continue
preparing the next buffer in the meantime?


That sounds like a V2.0 design to me.  I'd only chase after that level of 
complexity if profiling suggests that's where the bottleneck really is.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Mark Lewis
  I was thinking of not even reading the file content from the controller
  thread, just decide splitting points in bytes (0..ST_SIZE/4 -
  ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
  beginning to process input after having read first newline, etc.
 
 The problem I was pointing out is that if chunk#2 moved forward a few bytes 
 before it started reading in search of a newline, how will chunk#1 know 
 that it's supposed to read up to that further point?  You have to stop #1 
 from reading further when it catches up with where #2 started.  Since the 
 start of #2 is fuzzy until some reading is done, what you're describing 
 will need #2 to send some feedback to #1 after they've both started, and 
 that sounds bad to me.  I like designs where the boundaries between 
 threads are clearly defined before any of them start and none of them ever 
 talk to the others.

I don't think that any communication is needed beyond the beginning of
the threads.  Each thread knows that it should start at byte offset X
and end at byte offset Y, but if Y happens to be in the middle of a
record then just keep going until the end of the record.  As long as the
algorithm for reading past the end marker is the same as the algorithm
for skipping past the beginning marker then all is well.

-- Mark Lewis



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Matthew

On Thu, 7 Feb 2008, Greg Smith wrote:
The problem I was pointing out is that if chunk#2 moved forward a few bytes 
before it started reading in search of a newline, how will chunk#1 know that 
it's supposed to read up to that further point?  You have to stop #1 from 
reading further when it catches up with where #2 started.  Since the start of 
#2 is fuzzy until some reading is done, what you're describing will need #2 
to send some feedback to #1 after they've both started, and that sounds bad 
to me.


It doesn't have to be fuzzy at all. Both threads will presumably be able 
to use the same algorithm to work out where the boundary is, therefore 
they'll get the same result. No need to pass back information.


Matthew

--
There is something in the lecture course which may not have been visible so
far, which is reality   -- Computer Science Lecturer



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Simon Riggs
On Wed, 2008-02-06 at 12:27 +0100, Dimitri Fontaine wrote:
 Multi-Threading behavior and CE support
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 
 Now, pgloader will be able to run N threads, each one loading some data to a
 partitioned child-table target. N will certainly be configured depending on
 the number of server cores and not on the number of partitions...
 
 So what do we do when reading a tuple we want to store in a partition which
 has no dedicated thread started yet, and we already have N threads running?
 I'm thinking about some LRU(Thread) scheme to choose a thread to terminate
 (launch COPY with the current buffer and quit) and start a new one for the
 current partition target.
 Hopefully there won't be such high values of N that the LRU is a bad choice
 per se, and the input data won't be so messy that we have to stop/start
 threads at each new line.

For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in a round-robin manner to n
different concurrent COPY statements, i.e. a non-routing version. Making
that work well, whilst continuing to do error-handling, seems like a
challenge, but a very useful goal.

Adding intelligence to the row distribution may be technically hard but
may also simply move the bottleneck onto pg_loader. We may need multiple
threads in pg_loader, or we may just need multiple sessions from
pg_loader. Experience from doing the non-routing parallel version may
help in deciding whether to go for the routing version.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
Hi,

I've been thinking about this topic some more, and as I don't know when I'll 
be able to go and implement it, I want to publish the ideas here. This way 
I'll be able to find them again :)

On Tuesday, February 5, 2008, Dimitri Fontaine wrote:
 On Tuesday, February 5, 2008, Simon Riggs wrote:
  Much better than triggers and rules, but it will be hard to get it to
  work.

 Well, I'm thinking about providing a somewhat modular approach where
 pgloader code is able to recognize CHECK constraints, load a module
 registered to the operator and data types, then use it.

Here's how I think I'm gonna implement it:

User level configuration
-=-=-=-=-=-=-=-=-=-

At the user level, you will have to add a constraint_exclusion = on parameter to 
the pgloader section configuration for it to bother checking whether the 
destination table has some children, etc.
You'll also need to provide a global ce_path parameter (where to find user 
Python constraint exclusion modules) and a ce_modules parameter for each 
section where constraint_exclusion = on:
  ce_modules = columnA:module:class, columnB:module:class

As ce_path could point to any number of modules, and a single type could be 
supported by several modules, I'll let the user choose which module to use.
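
As an illustration only (the parameter names follow the proposal above, and the 
INI-style layout is assumed from pgloader's existing configuration format), a 
setup could look something like:

[pgsql]
base = mydb
ce_path = /usr/local/share/pgloader/ce

[orders]
table                = orders
filename             = orders.data
format               = csv
constraint_exclusion = on
ce_modules           = order_date:datece:DateCE, amount:intce:IntCE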

Constraint exclusion modules
-=-=-=-=-=-=-=-=-=-=-=-=-

The modules will provide one or several classes (kind of a packaging issue); 
each one will have to register which datatypes and operators it knows about. 
Here's some pseudo-code of a module, which certainly is the best way to 
express a code design idea:

class MyCE:
    def __init__(self, operator, constant, cside='r'):
        """CHECK ( col operator constant ) => cside = 'r', could be 'l'"""
        ...

    @classmethod
    def support_type(cls, type):
        return type in ['integer', 'bigint', 'smallint', 'real', 'double']

    @classmethod
    def support_operator(cls, op):
        return op in ['=', '<', '>', '<=', '>=', '%']

    def check(self, op, data):
        if op == '>': return self.gt(data)
        ...

    def gt(self, data):
        if self.cside == 'l':
            return self.constant > data
        elif self.cside == 'r':
            return data > self.constant

This way pgloader will be able to support any datatype (user datatype like 
IP4R included) and operator (@@, ~= or whatever). For pgloader to handle a 
CHECK() constraint, though, it'll have to be configured to use a CE class 
supporting the used operators and datatypes.

PGLoader constraint exclusion support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The CHECK() constraint being a tree of check expressions[*] linked by logical 
operators, pgloader will have to build some logic tree of MyCE (user CE 
modules) and evaluate all the checks in order to be able to choose the 
partition for each input line.

[*]: check((a % 10) = 1) makes an expression tree containing 2 check nodes

After having parsed pg_constraint.consrc (not conbin, which seems too much of an 
internal dump to use from user code) and built a CHECK tree for each 
partition, pgloader will try to decide whether it's about range partitioning (the 
most common case). 

If each partition's CHECK tree is AND(a >= b, a < c) or a variation of it, we have 
range partitioning. Then surely we can optimize the code that chooses the 
partition to COPY data to while still using the module operator 
implementation, e.g. by making a binary search on a tree of partition limits.
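
A sketch of that lookup (hypothetical names, assuming the lower and upper 
bounds have already been pulled out of each partition's CHECK tree):

import bisect

def choose_partition(partitions, value):
    # partitions: list of (lower, upper, table_name), sorted by lower bound,
    # each standing for CHECK (col >= lower AND col < upper)
    lowers = [lower for lower, upper, table in partitions]
    i = bisect.bisect_right(lowers, value) - 1
    if i >= 0:
        lower, upper, table = partitions[i]
        if lower <= value < upper:
            return table
    return None

parts = [(0, 100, 'data_p0'), (100, 200, 'data_p1'), (200, 300, 'data_p2')]
assert choose_partition(parts, 150) == 'data_p1'
assert choose_partition(parts, 999) is None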

If you want some other widely used (or not) partitioning scheme to be 
recognized and optimized by pgloader, just tell me and we'll see about it :)
Having this step as a user module seems overkill at the moment, though.

Multi-Threading behavior and CE support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Now, pgloader will be able to run N threads, each one loading some data to a 
partitioned child-table target. N will certainly be configured depending on 
the number of server cores and not on the number of partitions...

So what do we do when reading a tuple we want to store in a partition which 
has no dedicated thread started yet, and we already have N threads running?
I'm thinking about some LRU(Thread) scheme to choose a thread to terminate 
(launch COPY with the current buffer and quit) and start a new one for the 
current partition target; see the sketch below.
Hopefully there won't be such high values of N that the LRU is a bad choice 
per se, and the input data won't be so messy that we have to stop/start threads 
at each new line.
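
Here's a rough sketch of that LRU bookkeeping (an OrderedDict keyed by target 
partition; the actual COPY and thread start/stop is elided and the names are 
hypothetical):

from collections import OrderedDict

class PartitionThreads:
    # keep at most max_threads partition loaders alive; when a new partition
    # shows up and the pool is full, flush (COPY) and stop the least recently
    # used loader to make room
    def __init__(self, max_threads, start_loader, flush_loader):
        self.max_threads = max_threads
        self.start_loader = start_loader
        self.flush_loader = flush_loader
        self.loaders = OrderedDict()              # partition name -> loader

    def loader_for(self, partition):
        if partition in self.loaders:
            self.loaders.move_to_end(partition)   # mark as most recently used
            return self.loaders[partition]
        if len(self.loaders) >= self.max_threads:
            victim, loader = self.loaders.popitem(last=False)   # LRU entry
            self.flush_loader(victim, loader)     # COPY current buffer and quit
        self.loaders[partition] = self.start_loader(partition)
        return self.loaders[partition]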

Comments welcome, regards,
-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008, Simon Riggs wrote:
 For me, it would be good to see a --parallel=n parameter that would
 allow pg_loader to distribute rows in round-robin manner to n
 different concurrent COPY statements. i.e. a non-routing version.

What happens when you want at most N parallel threads and have several sections 
configured: do you want pgloader to serialize section loading (often there's 
one section per table, sometimes different sections target the same table) 
but parallelise the loading of each section?

I'm thinking we should have a global max_threads knob *and* a per-section 
max_threads one if we want to go this way, but then multi-threaded sections 
will somewhat fight against other sections (multi-threaded or not) for 
threads to use.

So I'll also add a parameter to configure how many (max) sections to load in 
parallel at any time.

We'll then have (default values presented):
max_threads = 1
max_parallel_sections = 1
section_threads = -1

The section_threads parameter would be overloadable at the section level but would 
need to stay <= max_threads (if not, it's discarded and a warning is issued). When 
section_threads is -1, pgloader tries to use the highest number of them 
possible, still within the max_threads global limit.
If max_parallel_sections is -1, pgloader starts a new thread for each new 
section, maxing out at max_threads, then it waits for a thread to finish 
before launching a new section loading.
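
Those rules could boil down to something like this (a sketch only, with a 
hypothetical helper name, assuming a discarded per-section value falls back to 
the -1 behaviour):

def effective_section_threads(section_threads, max_threads, threads_in_use):
    # -1 means "as many as possible within max_threads"; a per-section value
    # above max_threads is discarded (warning issued) and treated the same way
    available = max(1, max_threads - threads_in_use)
    if section_threads == -1 or section_threads > max_threads:
        return available
    return min(section_threads, available)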

If you have N max_threads and max_parallel_sections = section_threads = -1, 
then we'll see some kind of a fight between new section threads and in-section 
threads (the parallel non-routing COPY behaviour). But then it's a 
user choice.

Adding Constraint_Exclusion support to this would not mess it up, but it'll 
only be of interest when section_threads != 1 and max_threads > 1.

 Making 
 that work well, whilst continuing to do error-handling seems like a
 challenge, but a very useful goal.

Quick tests showed me the Python threading model allows for easy sharing of 
objects between several threads, so I don't think I'll need to adjust my reject 
code when going per-section multi-threaded. I'll just have to use a semaphore 
object to continue rejecting one line at a time. Not that complex if reliable.
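
Something as simple as this could do (a sketch assuming hypothetical reject 
file names; a Lock is just a binary semaphore here):

import threading

class SharedReject:
    # reject sink shared by all loader threads of a section: one rejected
    # line is written at a time so entries never interleave
    def __init__(self, data_path, log_path):
        self.lock = threading.Lock()          # threading.Semaphore(1) works too
        self.data = open(data_path, 'a')
        self.log = open(log_path, 'a')

    def reject(self, line, reason):
        with self.lock:
            self.data.write(line)
            self.log.write(reason + '\n')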

 Adding intelligence to the row distribution may be technically hard but
 may also simply move the bottleneck onto pg_loader. We may need multiple
 threads in pg_loader, or we may just need multiple sessions from
 pg_loader. Experience from doing the non-routing parallel version may
 help in deciding whether to go for the routing version.

If non-routing per-section multi-threading is a user request and not that hard 
to implement (thanks to Python), that sounds like a good enough reason for me to 
provide it :)

I'll keep you (and the list) informed as soon as I have the code to play 
with.
-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Greg Smith

On Wed, 6 Feb 2008, Simon Riggs wrote:


For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in round-robin manner to n
different concurrent COPY statements. i.e. a non-routing version.


Let me expand on this.  In many of these giant COPY situations the 
bottleneck is plain old sequential I/O to a single process.  You can 
almost predict how fast the rows will load using dd.  Having a process 
that pulls rows in and distributes them round-robin is good, but it won't 
crack that bottleneck.  The useful approaches I've seen for other 
databases all presume that the data files involved are large enough that 
on big hardware, you can start multiple processes running at different 
points in the file and beat anything possible with a single reader.


If I'm loading a TB file, odds are good I can split that into 4 or more 
vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders 
at once, and get way more than 1 disk worth of throughput reading.  You 
have to play with the exact number because if you push the split too far 
you introduce seek slowdown instead of improvements, but that's the basic 
design I'd like to see one day.  It's not parallel loading that's useful 
for the cases I'm thinking about until something like this comes around.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Luke Lonergan
Hi Greg,

On 2/6/08 7:56 AM, Greg Smith [EMAIL PROTECTED] wrote:

 If I'm loading a TB file, odds are good I can split that into 4 or more
 vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
 at once, and get way more than 1 disk worth of throughput reading.  You
 have to play with the exact number because if you push the split too far
 you introduce seek slowdown instead of improvements, but that's the basic
 design I'd like to see one day.  It's not parallel loading that's useful
 for the cases I'm thinking about until something like this comes around.

Just load 4 relfiles.  You have to be able to handle partial relfiles, which
changes the storage mgmt a bit, but the benefits are easier to achieve.

- Luke




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Jignesh K. Shah



Greg Smith wrote:

On Wed, 6 Feb 2008, Simon Riggs wrote:


For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in round-robin manner to n
different concurrent COPY statements. i.e. a non-routing version.


Let me expand on this.  In many of these giant COPY situations the 
bottleneck is plain old sequential I/O to a single process.  You can 
almost predict how fast the rows will load using dd.  Having a process 
that pulls rows in and distributes them round-robin is good, but it 
won't crack that bottleneck.  The useful approaches I've seen for 
other databases all presume that the data files involved are large 
enough that on big hardware, you can start multiple processes running 
at different points in the file and beat anything possible with a 
single reader.


If I'm loading a TB file, odds are good I can split that into 4 or 
more vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 
4 loaders at once, and get way more than 1 disk worth of throughput 
reading.  You have to play with the exact number because if you push 
the split too far you introduce seek slowdown instead of improvements, 
but that's the basic design I'd like to see one day.  It's not 
parallel loading that's useful for the cases I'm thinking about until 
something like this comes around.




Some food for thought here: most BI-type applications which do data 
conversions/cleansing also might end up sorting the data before it's 
loaded into a database, so starting parallel loaders at totally different 
points ruins that effort. A more pragmatic approach would be to read the 
next rows from the input file: if there are N parallel streams, then 
each one is offset by 1 from the others and jumps by N rows, so the seeks 
are pretty much narrowed down to a few rows (ideally 1) instead of 
jumping 1/Nth of the rows every time a read happens.
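
A sketch of that access pattern (purely illustrative; every stream still scans 
the whole file, but all N streams advance through it together so the OS cache 
absorbs most of the cost):

from itertools import islice

def interleaved_rows(path, stream_id, nb_streams):
    # stream `stream_id` (0-based) handles rows stream_id, stream_id + N,
    # stream_id + 2N, ... so the streams stay within a few rows of each other
    with open(path, 'rb') as f:
        for line in islice(f, stream_id, None, nb_streams):
            yield line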


For example, to replicate this with dd and see the impact, use a big file 
with the seek option and various blocksizes. Somebody out here once did 
that test and showed that seek time on the file being read is reduced 
significantly, and, depending on the file system's intelligent prefetching 
(which unfortunately UFS in Solaris does not do well by default), all the 
reads for the next stream will already be in memory.




Regards,
Jignesh




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008, Greg Smith wrote:
 If I'm loading a TB file, odds are good I can split that into 4 or more
 vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
 at once, and get way more than 1 disk worth of throughput reading.

pgloader already supports starting at any input file line number, and limiting 
itself to any number of reads:

  -C COUNT, --count=COUNT
        number of input lines to process
  -F FROMCOUNT, --from=FROMCOUNT
        number of input lines to skip

So you could already launch 4 pgloader processes with the same configuration 
file but different command-line arguments. If there's interest/demand, it's 
easy enough for me to add those parameters as file configuration knobs too.

Still, you have to pay for client-to-server communication instead of having the 
backend read the file locally, but now maybe we begin to compete?

Regards,
-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Luke Lonergan
Improvements are welcome, but to compete in the industry, loading will need to 
speed up by a factor of 100.

Note that Bizgres loader already does many of these ideas and it sounds like 
pgloader does too.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Dimitri Fontaine [mailto:[EMAIL PROTECTED]
Sent:   Wednesday, February 06, 2008 12:41 PM Eastern Standard Time
To: pgsql-performance@postgresql.org
Cc: Greg Smith
Subject:Re: [PERFORM] Benchmark Data requested --- pgloader CE design 
ideas

On Wednesday, February 6, 2008, Greg Smith wrote:
 If I'm loading a TB file, odds are good I can split that into 4 or more
 vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
 at once, and get way more than 1 disk worth of throughput reading.

pgloader already supports starting at any input file line number, and limiting 
itself to any number of reads:

  -C COUNT, --count=COUNT
        number of input lines to process
  -F FROMCOUNT, --from=FROMCOUNT
        number of input lines to skip

So you could already launch 4 pgloader processes with the same configuration 
file but different command-line arguments. If there's interest/demand, it's 
easy enough for me to add those parameters as file configuration knobs too.

Still, you have to pay for client-to-server communication instead of having the 
backend read the file locally, but now maybe we begin to compete?

Regards,
-- 
dim


Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008 at 18:37:41, Dimitri Fontaine, you wrote:
 On Wednesday, February 6, 2008, Greg Smith wrote:
  If I'm loading a TB file, odds are good I can split that into 4 or more
  vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4
  loaders at once, and get way more than 1 disk worth of throughput
  reading.

 pgloader already supports starting at any input file line number, and limiting
 itself to any number of reads:

In fact, the -F option works by having pgloader read the given number of lines 
but skip processing them, which is not at all what Greg is talking about here 
I think.

Plus, I think it would be easier for me to code some stat() then lseek() then 
read() into the pgloader reader's machinery than to change the code 
architecture to support a separate thread for the file reader.

Greg, what would you think of a pgloader which will separate file reading 
based on file size as given by stat (os.stat(file)[ST_SIZE]) and the number of 
threads: we split into as many pieces as the section_threads section config 
value.

This behaviour won't be available for sections where type = text and 
field_count(*) is given, because in this case I don't see how pgloader could 
reliably recognize the beginning of a new logical line and start processing there.
In other cases, a logical line is a physical line, so we start after the first 
newline found after the given lseek start position, and continue reading past the 
last lseek position until a newline.

*:http://pgloader.projects.postgresql.org/#_text_format_configuration_parameters

Comments?
-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Dimitri Fontaine
On Wednesday, February 6, 2008 at 18:49:56, Luke Lonergan, you wrote:
 Improvements are welcome, but to compete in the industry, loading will need
 to speed up by a factor of 100.

Oh, I meant competing with the internal COPY command instead of the \copy one, not 
with the competition. AIUI, competing with the competition will need some 
PostgreSQL internal improvements, which I'll let the -hackers do :)

 Note that Bizgres loader already does many of these ideas and it sounds
 like pgloader does too.

We're talking about how to improve pgloader :)

-- 
dim




Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-06 Thread Greg Smith

On Wed, 6 Feb 2008, Dimitri Fontaine wrote:


In fact, the -F option works by having pgloader read the given number of lines
but skip processing them, which is not at all what Greg is talking about here
I think.


Yeah, that's not useful.


Greg, what would you think of a pgloader which will separate file reading
based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of
threads: we split into as many pieces as section_threads section config
value.


Now you're talking.  Find a couple of split points that way, fine-tune the 
boundaries a bit so they rest on line termination points, and off you go. 
Don't forget that the basic principle here implies you'll never know until 
you're done just how many lines were really in the file.  When thread#1 is 
running against chunk#1, it will never have any idea what line chunk#2 
really started at until it reaches there, at which point it's done and 
that information isn't helpful anymore.


You have to stop thinking in terms of lines for this splitting; all you 
can do is split the file into useful byte sections and then count the 
lines within them as you go.  Anything else requires a counting scan of 
the file and such a sequential read is exactly what can't happen 
(especially not more than once); it just takes too long.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
