On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmh...@gmail.com> wrote:
> - If we're unable to supply data to the COPY process as fast as the
> workers could load it, then speed will be limited at that point. We
> know reading the file from disk is pretty fast compared to what a
> single process can do. I'm not sure we've tested what happens with a
> network socket. It will depend on the network speed some, but it might
> be useful to know how many MB/s we can pump through over a UNIX
> socket.
This raises a good point. If at some point we want to minimize the
number of memory copies, we might want to allow RDMA to write incoming
network traffic directly into the distributing ring buffer, protocol
level headers included. But at this point we are so far off from
network reception becoming a bottleneck that I don't think it's worth
holding anything up to allow for zero-copy transfers.

> - The portion of the time that is used to split the lines is not
> easily parallelizable. That seems to be a fairly small percentage for
> a reasonably wide table, but it looks significant (13-18%) for a
> narrow table. Such cases will gain less performance and be limited to
> a smaller number of workers. I think we also need to be careful about
> files whose lines are longer than the size of the buffer. If we're not
> careful, we could get a significant performance drop-off in such
> cases. We should make sure to pick an algorithm that seems like it
> will handle such cases without serious regressions and check that a
> file composed entirely of such long lines is handled reasonably
> efficiently.

I don't have a proof, but my gut feeling tells me that it's
fundamentally impossible to ingest CSV without a serial
line-ending/comment tokenization pass. The current line splitting
algorithm is terrible: I'm currently working with some scientific data
where CopyReadLineText() accounts for about 25% of ingestion profiles.
I prototyped a replacement that can do ~8GB/s on narrow rows, more on
wider ones.

For rows that are consistently wider than the input buffer I think
parallelism will still give a win: the serial phase is just a memcpy
through a ring buffer, after which a worker goes away to perform the
actual insert, letting the next worker read in more data. That memcpy
is already happening today, because CopyReadLineText() copies the
input buffer into a StringInfo, so the only extra work is the
synchronization between leader and worker.
> - There could be index contention. Let's suppose that we can read data
> super fast and break it up into lines super fast. Maybe the file we're
> reading is fully RAM-cached and the lines are long. Now all of the
> backends are inserting into the indexes at the same time, and they
> might be trying to insert into the same pages. If so, lock contention
> could become a factor that hinders performance.

Different data distribution strategies can have an effect on that.
Dealing out input data in larger or smaller chunks will have a
considerable effect on contention, btree page splits, and all kinds of
other things. I think the common theme would be a push to increase
chunk size to reduce contention.

> - There could also be similar contention on the heap. Say the tuples
> are narrow, and many backends are trying to insert tuples into the
> same heap page at the same time. This would lead to many lock/unlock
> cycles. This could be avoided if the backends avoid targeting the same
> heap pages, but I'm not sure there's any reason to expect that they
> would do so unless we make some special provision for it.

I thought there already was a provision for that. Am I mis-remembering?

> - What else? I bet the above list is not comprehensive.

I think the parallel copy patch needs to concentrate on splitting the
input data across workers. After that, any performance issues would be
basically the same as for a normal parallel insert workload. There may
well be bottlenecks there, but those could be tackled independently.

Regards,
Ants Aasma
Cybertec