Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-10 Thread Joel Jacobson
What about:

   COPY ... FROM ... WITH PATTERN 'regexp_pattern'

Where the columns would be matched with the capture groups.

This could handle the quite common case of varying white-space as column 
separators:

   COPY log (col1, col2, col3) FROM 'log.txt' WITH PATTERN 
'^(\S+)\s+(\S+)\s+(\S+)$'

This could also handle $SUBJECT:

   COPY table_name (single_column) FROM 'unknown.txt' WITH PATTERN '^(.*)$';

And lots of other more complex use-cases.

/Joel

Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-09 Thread Daniel Verite
Darafei "Komяpa" Praliaskouski wrote:

> What I would prefer is some new COPY mode like RAW that will just push
> whatever it gets on the stdin/input into the cell on the server side. This
> way it can be proxied by psql, utilize existing infra for passing streams
> and be used in shell scripting.

COPY RAW has been proposed and discussed quite a bit previously:
https://commitfest.postgresql.org/12/676/


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-06 Thread Andrew Dunstan


On 5/6/21 7:41 AM, Isaac Morland wrote:
> On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski
> mailto:m...@komzpa.net>> wrote:
>  
>
> What I would prefer is some new COPY mode like RAW that will just
> push whatever it gets on the stdin/input into the cell on the
> server side. This way it can be proxied by psql, utilize existing
> infra for passing streams and be used in shell scripting.
>
>
> Yes! A significant missing feature is “take this arbitrary bucket of
> bits and move it to/from the database from/to this file without
> modification of any kind”. There are all sorts of tutorials on the Web
> about how to almost do it or fake it or convert to hex or whatever but
> I’ve never seen a nice simple explanation of “here is the command:”.
>
> Of course there is a lot to think about. Coming out of the database,
> the query result must be exactly one row containing exactly one
> column; how do we handle other database results? Coming into the
> database, only one data value can come from a single file; so how do
> we populate the other columns of whatever table we copy to?
>


What I'd like is something a bit more general. We could extend the FDW
API to allow for a CopyStdin handler or some such. The input could be
sent unfiltered to the handler, which would do whatever it liked with
it. That way FDWs like file_fdw and file_text_array_fdw could read from
stdin, for example. I'm pretty sure it would handle the OP's use case.
The downside is you'd need to write an FDW handler, but that's not too
hard, and there are lots of examples.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-06 Thread Isaac Morland
On Thu, 6 May 2021 at 12:02, Joel Jacobson  wrote:

> On Thu, May 6, 2021, at 13:41, Isaac Morland wrote:
>
> Yes! A significant missing feature is “take this arbitrary bucket of bits
> and move it to/from the database from/to this file without modification of
> any kind”. There are all sorts of tutorials on the Web about how to almost
> do it or fake it or convert to hex or whatever but I’ve never seen a nice
> simple explanation of “here is the command:”.
>
> Of course there is a lot to think about. Coming out of the database, the
> query result must be exactly one row containing exactly one column; how do
> we handle other database results? Coming into the database, only one data
> value can come from a single file; so how do we populate the other columns
> of whatever table we copy to?
>
> If the file is on the server, you can use pg_read_binary_file() for that
> purpose.
>

Yes, sorry, I should have explicitly said “on the client” somewhere up
there. Getting files from the DB server into the DB is no problem.


Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-06 Thread Joel Jacobson
On Thu, May 6, 2021, at 13:41, Isaac Morland wrote:
> On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski  
> wrote:
>  
>> What I would prefer is some new COPY mode like RAW that will just push 
>> whatever it gets on the stdin/input into the cell on the server side. This 
>> way it can be proxied by psql, utilize existing infra for passing streams 
>> and be used in shell scripting.
> 
> Yes! A significant missing feature is “take this arbitrary bucket of bits and 
> move it to/from the database from/to this file without modification of any 
> kind”. There are all sorts of tutorials on the Web about how to almost do it 
> or fake it or convert to hex or whatever but I’ve never seen a nice simple 
> explanation of “here is the command:”.
> 
> Of course there is a lot to think about. Coming out of the database, the 
> query result must be exactly one row containing exactly one column; how do we 
> handle other database results? Coming into the database, only one data value 
> can come from a single file; so how do we populate the other columns of 
> whatever table we copy to?
> 

If the file is on the server, you can use pg_read_binary_file() for that 
purpose.

/Joel

Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-06 Thread Isaac Morland
On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski 
wrote:


> What I would prefer is some new COPY mode like RAW that will just push
> whatever it gets on the stdin/input into the cell on the server side. This
> way it can be proxied by psql, utilize existing infra for passing streams
> and be used in shell scripting.
>

Yes! A significant missing feature is “take this arbitrary bucket of bits
and move it to/from the database from/to this file without modification of
any kind”. There are all sorts of tutorials on the Web about how to almost
do it or fake it or convert to hex or whatever but I’ve never seen a nice
simple explanation of “here is the command:”.

Of course there is a lot to think about. Coming out of the database, the
query result must be exactly one row containing exactly one column; how do
we handle other database results? Coming into the database, only one data
value can come from a single file; so how do we populate the other columns
of whatever table we copy to?


Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-06 Thread Komяpa
I have similar problems and what is really needed is a way to get a file
from client side into a server side object that can be dealt with later.
The most popular way is COPY and it is built into the psql tool. In general
it supports \copy wrapper, and there is COPY FROM STDIN. However, it is not
available to the files that are not following the csv-like structure. I had
to use it for XML and huge JSON files before, and it's always `sed` before
the import and a replace() after.

pg_read_file does not help on cloud and managed installs of postgres here.

What I would prefer is some new COPY mode like RAW that will just push
whatever it gets on the stdin/input into the cell on the server side. This
way it can be proxied by psql, utilize existing infra for passing streams
and be used in shell scripting.



On Thu, May 6, 2021 at 9:14 AM Joel Jacobson  wrote:

> On Wed, May 5, 2021, at 20:45, Tom Lane wrote:
>
> "Joel Jacobson"  writes:
> > I think you misunderstood the problem.
> > I don't want the entire file to be considered a single value.
> > I want each line to become its own row, just a row with a single column.
>
> > So I actually think COPY seems like a perfect match for the job,
> > since it does precisely that, except there is no delimiter in this case.
>
> Well, there's more to it than just the column delimiter.
>
> * What about \N being converted to NULL?
> * What about \. being treated as EOF?
> * Do you want to turn off the special behavior of backslash (ESCAPE)
>   altogether?
> * What about newline conversions (\r\n being seen as just \n, etc)?
>
> I'm inclined to think that "use pg_read_file and then split at newlines"
> might be a saner answer than delving into all these fine points.
> Not least because people yell when you add cycles to the COPY
> inner loops.
>
>
> Thanks for providing strong arguments why the COPY approach is a dead-end,
> I agree.
>
> However, as demonstrated in my previous email, using
>
>string_to_table(pg_read_file( filename ), E'\n')
>
> has its performance as well as max size issues.
>
> Maybe these two problems could be solved by combining the two functions
> into one?
>
>file_to_table ( filename text, delimiter text [, null_string text ] ) →
> setof text
>
> I'm thinking thanks to returning "setof text", such a function could read
> a stream,
> and return a line as soon as a delimiter is encountered, not having to keep
> the entire file in memory at any time.
>
> /Joel
>


-- 
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/


Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-06 Thread Joel Jacobson
On Wed, May 5, 2021, at 20:45, Tom Lane wrote:
> "Joel Jacobson" mailto:joel%40compiler.org>> writes:
> > I think you misunderstood the problem.
> > I don't want the entire file to be considered a single value.
> > I want each line to become its own row, just a row with a single column.
> 
> > So I actually think COPY seems like a perfect match for the job,
> > since it does precisely that, except there is no delimiter in this case.
> 
> Well, there's more to it than just the column delimiter.
> 
> * What about \N being converted to NULL?
> * What about \. being treated as EOF?
> * Do you want to turn off the special behavior of backslash (ESCAPE)
>   altogether?
> * What about newline conversions (\r\n being seen as just \n, etc)?
> 
> I'm inclined to think that "use pg_read_file and then split at newlines"
> might be a saner answer than delving into all these fine points.
> Not least because people yell when you add cycles to the COPY
> inner loops.

Thanks for providing strong arguments why the COPY approach is a dead-end, I 
agree.

However, as demonstrated in my previous email, using

   string_to_table(pg_read_file( filename ), E'\n')

has its performance as well as max size issues.

Maybe these two problems could be solved by combining the two functions into 
one?

   file_to_table ( filename text, delimiter text [, null_string text ] ) → 
setof text

I'm thinking thanks to returning "setof text", such a function could read a 
stream,
and return a line as soon as a delimiter is encountered, not having to keep
the entire file in memory at any time.

/Joel

Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-06 Thread Thomas Kellerer
Joel Jacobson schrieb am 05.05.2021 um 17:30:
> Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an 
> error?
>
>     ERROR:  COPY delimiter cannot be newline or carriage return
>
> That is, to change E'\n' to be a valid delimiter, which would simply read 
> each line
> delimited by newlines, as a single column.
>
> The hack I'm currently abusing is to find some one-byte character that is not 
> present anywhere in the text file,
> and then to use that character as a delimiter. This doesn't work when needing 
> to deal with a text file
> which content is unknown at the time when writing the code though, so it's 
> mostly useful for throwaway one-off queries.

What about

delimiter E'\1'

The probability that a file contains the ASCII "character" 1 seems rather low.

Thomas






Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Joel Jacobson
On Wed, May 5, 2021, at 21:51, Tom Lane wrote:
> Andrew Dunstan mailto:andrew%40dunslane.net>> writes:
> > On 5/5/21 2:45 PM, Tom Lane wrote:
> >> Yeah, that's because of the conversion to "chr".  But a regexp
> >> is overkill for that anyway.  Don't we have something that will
> >> split on simple substring matches?
> 
> > Not that I know of. There is split_part but I don't think that's fit for
> > purpose here. Do we need one, or have I missed something?
> 
> [ checks manual ... ]
> 
> string_to_array or string_to_table would do, I think.
> 
> regards, tom lane
> 

Thanks for these new functions, they seem really useful for a lot of cases.

However, I see two problems with using string_to_table() for this particular 
use-case.
 
- Doesn't work with files larger than 1GB, due to pg_read_file()'s limit.
- 68% slower than using the COPY-hack.

% ls -lah foo.txt
-rw-r--r--  1 joel  staff   623M May  6 07:31 foo.txt

% wc -l foo.txt
6771864 foo.txt

# \d txt
   Table "public.txt"
Column | Type | Collation | Nullable | Default
+--+---+--+-
line   | text |   |  |

# COPY txt (line) FROM 'foo.txt' DELIMITER '"';
COPY 6771864
Time: 9829.707 ms (00:09.830)
Time: 9552.286 ms (00:09.552)
Time: 9483.115 ms (00:09.483)

# TRUNCATE txt;
TRUNCATE TABLE

# INSERT INTO txt (line) SELECT string_to_table(pg_read_file('foo.txt'),E'\n');
INSERT 0 6771865
Time: 16556.078 ms (00:16.556)
Time: 14720.343 ms (00:14.720)
Time: 17266.088 ms (00:17.266)

/Joel


Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Tom Lane
Andrew Dunstan  writes:
> On 5/5/21 2:45 PM, Tom Lane wrote:
>> Yeah, that's because of the conversion to "chr".  But a regexp
>> is overkill for that anyway.  Don't we have something that will
>> split on simple substring matches?

> Not that I know of. There is split_part but I don't think that's fit for
> purpose here. Do we need one, or have I missed something?

[ checks manual ... ]

string_to_array or string_to_table would do, I think.

regards, tom lane




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Andrew Dunstan


On 5/5/21 3:36 PM, Justin Pryzby wrote:
> On Wed, May 05, 2021 at 02:45:41PM -0400, Tom Lane wrote:
>>> I'm currently using the pg_read_file()-hack in a project,
>>> and even though it can read files up to 1GB,
>>> using e.g. regexp_split_to_table() to split on E'\n'
>>> seems to need 4x as much memory, so it only
>>> works with files less than ~256MB.
>> Yeah, that's because of the conversion to "chr".  But a regexp
>> is overkill for that anyway.  Don't we have something that will
>> split on simple substring matches?
> For v14
>
> commit 66f163068030b5c5fe792a0daee27822dac43791
> Author: Tom Lane 
> Date:   Wed Sep 2 18:23:56 2020 -0400
>
> Add string_to_table() function.
>

Ha! just in time :-)


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Justin Pryzby
On Wed, May 05, 2021 at 02:45:41PM -0400, Tom Lane wrote:
> > I'm currently using the pg_read_file()-hack in a project,
> > and even though it can read files up to 1GB,
> > using e.g. regexp_split_to_table() to split on E'\n'
> > seems to need 4x as much memory, so it only
> > works with files less than ~256MB.
> 
> Yeah, that's because of the conversion to "chr".  But a regexp
> is overkill for that anyway.  Don't we have something that will
> split on simple substring matches?

For v14

commit 66f163068030b5c5fe792a0daee27822dac43791
Author: Tom Lane 
Date:   Wed Sep 2 18:23:56 2020 -0400

Add string_to_table() function.

-- 
Justin




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Andrew Dunstan


On 5/5/21 2:45 PM, Tom Lane wrote:
> "Joel Jacobson"  writes:
>> I think you misunderstood the problem.
>> I don't want the entire file to be considered a single value.
>> I want each line to become its own row, just a row with a single column.
>> So I actually think COPY seems like a perfect match for the job,
>> since it does precisely that, except there is no delimiter in this case.
> Well, there's more to it than just the column delimiter.
>
> * What about \N being converted to NULL?
> * What about \. being treated as EOF?
> * Do you want to turn off the special behavior of backslash (ESCAPE)
>   altogether?
> * What about newline conversions (\r\n being seen as just \n, etc)?
>
> I'm inclined to think that "use pg_read_file and then split at newlines"
> might be a saner answer than delving into all these fine points.
> Not least because people yell when you add cycles to the COPY
> inner loops.



+1


Also we have generally been resistant to supporting odd formats. FDWs
can help here (e.g. file_text_array), but they can't use STDIN IIRC.


>
>> I'm currently using the pg_read_file()-hack in a project,
>> and even though it can read files up to 1GB,
>> using e.g. regexp_split_to_table() to split on E'\n'
>> seems to need 4x as much memory, so it only
>> works with files less than ~256MB.
> Yeah, that's because of the conversion to "chr".  But a regexp
> is overkill for that anyway.  Don't we have something that will
> split on simple substring matches?
>
>   



Not that I know of. There is split_part but I don't think that's fit for
purpose here. Do we need one, or have I missed something?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Tom Lane
"Joel Jacobson"  writes:
> I think you misunderstood the problem.
> I don't want the entire file to be considered a single value.
> I want each line to become its own row, just a row with a single column.

> So I actually think COPY seems like a perfect match for the job,
> since it does precisely that, except there is no delimiter in this case.

Well, there's more to it than just the column delimiter.

* What about \N being converted to NULL?
* What about \. being treated as EOF?
* Do you want to turn off the special behavior of backslash (ESCAPE)
  altogether?
* What about newline conversions (\r\n being seen as just \n, etc)?

I'm inclined to think that "use pg_read_file and then split at newlines"
might be a saner answer than delving into all these fine points.
Not least because people yell when you add cycles to the COPY
inner loops.

> I'm currently using the pg_read_file()-hack in a project,
> and even though it can read files up to 1GB,
> using e.g. regexp_split_to_table() to split on E'\n'
> seems to need 4x as much memory, so it only
> works with files less than ~256MB.

Yeah, that's because of the conversion to "chr".  But a regexp
is overkill for that anyway.  Don't we have something that will
split on simple substring matches?

regards, tom lane




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Joel Jacobson
On Wed, May 5, 2021, at 19:34, Isaac Morland wrote:
> Would DELIMITER NULL make sense? The existing values are literal strings so 
> NULL fits with that. Do we already have NONE as a keyword somewhere? It's 
> listed in the keyword appendix to the documentation but I can't think of 
> where it is used off the top of my head.

+1 to using some keyword. NULL or NONE seems fine to me. Or maybe WITHOUT 
DELIMITER?

/Joel

Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Joel Jacobson
On Wed, May 5, 2021, at 19:58, David G. Johnston wrote:
> On Wed, May 5, 2021 at 10:34 AM Isaac Morland  wrote:
>> On Wed, 5 May 2021 at 13:23, Chapman Flack  wrote:
>>> On 05/05/21 13:02, David G. Johnston wrote:
>>> > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
>>> > what it says and does exactly what you desire?
>>> 
>>> What would it mean? That you get one column, multiple rows of text
>>> corresponding to "lines" delimited by something, or that you get one
>>> column, one row of text for the entire content of the file?
>> 
>> It means no column delimiter. In other words, there is no character which 
>> marks the end of a data value, so the entire line is a single data value.
>> 
> 
> This.  When dealing with COPY it's expected that each line becomes its own 
> row.  On the server you can do pg_read_file() if you need the entire file to 
> be considered a single value.  psql (\I and variables) is a bit more hackey, 
> but I'd rather see that improved directly anyway if the goal is to try and 
> make getting the "whole document" easier - copy isn't the right API for that 
> IMO.

I think you misunderstood the problem.
I don't want the entire file to be considered a single value.
I want each line to become its own row, just a row with a single column.

So I actually think COPY seems like a perfect match for the job,
since it does precisely that, except there is no delimiter in this case.

I'm currently using the pg_read_file()-hack in a project,
and even though it can read files up to 1GB,
using e.g. regexp_split_to_table() to split on E'\n'
seems to need 4x as much memory, so it only
works with files less than ~256MB.

SELECT COUNT(*) FROM regexp_split_to_table(repeat(E'\n',10),E'\n');
ERROR:  invalid memory alloc request size 44
Time: 4151.374 ms (00:04.151)

/Joel

Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread David G. Johnston
On Wed, May 5, 2021 at 10:34 AM Isaac Morland 
wrote:

> On Wed, 5 May 2021 at 13:23, Chapman Flack  wrote:
>
>> On 05/05/21 13:02, David G. Johnston wrote:
>> > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
>> > what it says and does exactly what you desire?
>>
>> What would it mean? That you get one column, multiple rows of text
>> corresponding to "lines" delimited by something, or that you get one
>> column, one row of text for the entire content of the file?
>>
>
> It means no column delimiter. In other words, there is no character
> which marks the end of a data value, so the entire line is a single data
> value.
>
>
This.  When dealing with COPY it's expected that each line becomes its own
row.  On the server you can do pg_read_file() if you need the entire file
to be considered a single value.  psql (\I and variables) is a bit more
hackey, but I'd rather see that improved directly anyway if the goal is to
try and make getting the "whole document" easier - copy isn't the right API
for that IMO.

David J.


Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Isaac Morland
On Wed, 5 May 2021 at 13:23, Chapman Flack  wrote:

> On 05/05/21 13:02, David G. Johnston wrote:
> > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> > what it says and does exactly what you desire?
>
> What would it mean? That you get one column, multiple rows of text
> corresponding to "lines" delimited by something, or that you get one
> column, one row of text for the entire content of the file?
>

It means no column delimiter. In other words, there is no character
which marks the end of a data value, so the entire line is a single data
value.

Would DELIMITER NULL make sense? The existing values are literal strings so
NULL fits with that. Do we already have NONE as a keyword somewhere? It's
listed in the keyword appendix to the documentation but I can't think of
where it is used off the top of my head.


Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Chapman Flack
On 05/05/21 13:02, David G. Johnston wrote:
> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> what it says and does exactly what you desire?

What would it mean? That you get one column, multiple rows of text
corresponding to "lines" delimited by something, or that you get one
column, one row of text for the entire content of the file?

Regards,
-Chap




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread David G. Johnston
On Wed, May 5, 2021 at 8:31 AM Joel Jacobson  wrote:

> Could it be an idea to exploit the fact that DELIMITER E'\n' is currently
> an error?
>
>
Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
what it says and does exactly what you desire?

David J.