Re: [GENERAL] v9.1.3 WITH with_query UPDATE

2012-06-16 Thread Bill House
On 06/16/2012 01:27 PM, Vibhor Kumar wrote:
> On Jun 16, 2012, at 1:11 PM, Bill House wrote:
>
>> md5sum may be duplicated and I am trying to mark the column "del" of the
>> redundant records leaving one unmarked.
>>
>> Here is one variation of the syntax I have tried on one group:
>>
>>  WITH batch AS (select * from files_test where
>>  md5sum =  '0010a3e4cc6cb8623c014f5bb95b5be1'
>> ORDER BY path DESC OFFSET 1)
>>  UPDATE batch SET del = False;
> In the UPDATE clause you have to use the table name (the CTE cannot be the
> UPDATE target). The syntax would be something like the following:
> WITH batch AS (SELECT columnname, columnname FROM tablename) UPDATE files_test
> SET del = false FROM batch WHERE files_test.columnname = batch.columnname ...
>
> Also, AFAIU you want to set the del flag to false for the duplicate md5 rows; you can
> achieve this with a normal UPDATE.
>
> Something like the following:
> UPDATE files_test SET del = false WHERE ctid NOT IN (SELECT min(ctid) FROM
> files_test WHERE md5sum = '0010a3e4cc6cb8623c014f5bb95b5be1');
>
>
> Thanks & Regards,
> Vibhor Kumar
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> Blog: http://vibhork.blogspot.com
>
>
>
Thanks for the help.

Based on your pointers and some careful re-reading of the manuals, I
composed a command that does what I want it to do (at least in my
preliminary test): mark all but one record for deletion.

Here it is for the record:

wch=# WITH batch AS (SELECT * FROM files_test
          WHERE md5sum = '0010a3e4cc6cb8623c014f5bb95b5be1'
          ORDER BY path DESC
          OFFSET 1)
      UPDATE files_test
      SET del = True
      FROM batch
      WHERE batch.md5sum || batch.path =
            files_test.md5sum || files_test.path;
UPDATE 2
wch=#
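
A variant I have not tested, which joins on recno (assuming recno uniquely
identifies a row) instead of comparing concatenated text columns, would be:

  WITH batch AS (SELECT recno FROM files_test
                 WHERE md5sum = '0010a3e4cc6cb8623c014f5bb95b5be1'
                 ORDER BY path DESC
                 OFFSET 1)
  UPDATE files_test
  SET del = True
  FROM batch
  WHERE files_test.recno = batch.recno;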

If anyone can suggest a more efficient composition, I would be glad to
see it.  I am very new to this.

Thanks again,

Bill



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] v9.1.3 WITH with_query UPDATE

2012-06-16 Thread Vibhor Kumar

On Jun 16, 2012, at 1:11 PM, Bill House wrote:

> md5sum may be duplicated and I am trying to mark the column "del" of the
> redundant records leaving one unmarked.
> 
> Here is one variation of the syntax I have tried on one group:
> 
>  WITH batch AS (select * from files_test where
>  md5sum =  '0010a3e4cc6cb8623c014f5bb95b5be1'
> ORDER BY path DESC OFFSET 1)
>  UPDATE batch SET del = False;

In the UPDATE clause you have to use the table name (the CTE cannot be the
UPDATE target). The syntax would be something like the following:
WITH batch AS (SELECT columnname, columnname FROM tablename) UPDATE files_test
SET del = false FROM batch WHERE files_test.columnname = batch.columnname ...

Also, AFAIU you want to set the del flag to false for the duplicate md5 rows; you can
achieve this with a normal UPDATE.

Something like the following:
UPDATE files_test SET del = false WHERE ctid NOT IN (SELECT min(ctid) FROM
files_test WHERE md5sum = '0010a3e4cc6cb8623c014f5bb95b5be1');
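
A sketch of the same idea extended to every md5sum group at once, using a
window function instead of ctid (untested, and assuming recno uniquely
identifies a row):

UPDATE files_test SET del = false
FROM (SELECT recno,
             row_number() OVER (PARTITION BY md5sum ORDER BY path DESC) AS rn
      FROM files_test) dup
WHERE files_test.recno = dup.recno
  AND dup.rn > 1;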


Thanks & Regards,
Vibhor Kumar
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Blog: http://vibhork.blogspot.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] any solution for doing a data file import spawning it on multiple processes

2012-06-16 Thread h...@101-factory.eu
Thanks all, I will be looking into it. 

Met vriendelijke groet,

Henk 

On 16 jun. 2012, at 18:23, Edson Richter  wrote:

> Em 16/06/2012 12:59, h...@101-factory.eu escreveu:
>> thanks i thought about splitting the file, but that did no work out well.
>> 
>> so we receive 2 files evry 30 seconds and need to import this as fast as 
>> possible.
>> 
>> we do not run java curently but maybe it's an option.
>> are you willing to share your code?
>> 
>> also i was thinking using perl for it
>> 
>> 
>> henk
>> 
>> On 16 jun. 2012, at 17:37, Edson Richter  wrote:
>> 
>>> Em 16/06/2012 12:04, h...@101-factory.eu escreveu:
>>>> hi there,
>>>> 
>>>> I am trying to import large data files into pg.
>>>> for now i used the xargs linux command to spawn the file line by line, and 
>>>> set it to use the maximum available connections.
>>>> 
>>>> we use pgpool as a connection pool to the database, and so try to maximize 
>>>> the concurrent data import of the file.
>>>> 
>>>> the problem for now is that it seems to work well but we miss a line once in a 
>>>> while, and that is not acceptable. also it creates zombies ;(.
>>>> 
>>>> does anybody have any other tricks that will do the job?
>>>> 
>>>> thanks,
>>>> 
>>>> Henk
>>> I've used a custom Java application using connection pooling (limited to 1000 
>>> connections, meaning 1000 concurrent file imports).
>>> 
>>> I'm able to import more than 64000 XML files (about 13Kb each) in 5 
>>> minutes, without memory leaks or zombies, and (of course) no missing 
>>> records.
>>> 
>>> Besides having each thread import a separate file, I have another situation where 
>>> separate threads import different lines of the same file. No 
>>> problems at all. Do not forget to check your OS "file open" limits (it was 
>>> a big issue in the past for me due to Lucene indexes generated during import).
>>> 
>>> Server: 8 core Xeon, 16Gig, SAS 15000 rpm disks, PgSQL 9.1.3, Linux Centos 
>>> 5, Sun Java 1.6.27.
>>> 
>>> Regards,
>>> 
>>> Edson Richter
>>> 
>>> 
> I'm not allowed to publish my company's code, but the logic is very easy to 
> understand (you will have to "invent" your own solution; the code below is 
> bare bones):
> 
> import java.io.File;
> import java.io.FileFilter;
>
> class MainThread implements Runnable {
>     // volatile so a stop request from another thread is seen by run()
>     private volatile boolean keepRunning = true;
>
>     public void run() {
>         while (keepRunning) {
>             try {
>                 executeFiles();
>                 Thread.sleep(30000); // sleep 30 seconds
>             } catch (Exception ex) {
>                 ex.printStackTrace();
>             }
>         }
>     }
>
>     private void executeFiles() {
>         File monitorDir = new File("/var/mydatafolder/");
>         File processingDir = new File("/var/myprocessingfolder/");
>
>         // I'll import only files with names like "data20120621.csv":
>         FileFilter fileFilter = new FileFilter() {
>             public boolean accept(File file) {
>                 boolean isFile = file.isFile() && !file.isHidden() && !file.isDirectory();
>                 if (!isFile) return false;
>                 String fname = file.getName();
>                 return fname.startsWith("data") && fname.endsWith("csv");
>             }
>         };
>
>         File[] forProcessing = monitorDir.listFiles(fileFilter);
>
>         for (File fileFound : forProcessing) {
>             // FileUtil is a utility class, you will have to create your own...
>             // your move method will vary according to your Operating System
>             FileUtil.move(fileFound, processingDir);
>             // ProcessFile is a class that implements Runnable; do your stuff there...
>             Thread t = new Thread(new ProcessFile(processingDir, fileFound.getName()));
>             t.start();
>         }
>     }
>
>     /** Use this method to stop the thread from another place in your complex system! */
>     public synchronized void stopWorker() {
>         keepRunning = false;
>     }
>
>     public static void main(String[] args) {
>         Thread t = new Thread(new MainThread());
>         t.start();
>     }
> }
> 
> 
> 
> 

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] v9.1.3 WITH with_query UPDATE

2012-06-16 Thread Yeb Havinga

On 2012-06-16 19:11, Bill House wrote:

Would someone please point me to (or supply) some working examples of
UPDATE commands using the WITH clause syntax as described in the manual
(pdf version page 1560) and referring to Section 7.8 (pdf version page 104)?


http://www.depesz.com/2011/03/16/waiting-for-9-1-writable-cte/

http://www.slideshare.net/davidfetter/writeable-ct-espgconmay2011
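
For context, those posts are about data-modifying statements inside WITH
itself, which is new in 9.1. A minimal sketch against the files_test table
(untested, and assuming the del flag marks the rows to drop):

WITH gone AS (
    DELETE FROM files_test
    WHERE del
    RETURNING md5sum, path
)
SELECT count(*) AS rows_deleted FROM gone;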

regards,
Yeb


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] v9.1.3 WITH with_query UPDATE

2012-06-16 Thread Bill House
Hello all,

Would someone please point me to (or supply) some working examples of
UPDATE commands using the WITH clause syntax as described in the manual
(pdf version page 1560) and referring to Section 7.8 (pdf version page 104)?

I have looked around a lot and haven't seen much on this.

I have a table laid out as follows:

  wch=# \d files_test
          Table "public.files_test"
   Column |      Type      | Modifiers
  --------+----------------+-----------
   md5sum | character(33)  |
   path   | character(475) |
   file   | character(200) |
   del    | boolean        |
   recno  | integer        | not null
  Indexes:
      "files_test_ord" btree (recno)

md5sum may be duplicated and I am trying to mark the column "del" of the
redundant records leaving one unmarked.

Here is one variation of the syntax I have tried on one group:

  WITH batch AS (select * from files_test where
  md5sum =  '0010a3e4cc6cb8623c014f5bb95b5be1'
 ORDER BY path DESC OFFSET 1)
  UPDATE batch SET del = False;

The error message I get in this instance is:

  ERROR:  relation "batch" does not exist
  LINE 1: ...4f5bb95b5be1' ORDER BY path DESC OFFSET 1) UPDATE batch
SET ..

Thanks for your help,

Bill



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] any solution for doing a data file import spawning it on multiple processes

2012-06-16 Thread Edson Richter

Em 16/06/2012 12:59, h...@101-factory.eu escreveu:

thanks, i thought about splitting the file, but that did not work out well.

so we receive 2 files every 30 seconds and need to import them as fast as 
possible.

we do not run java currently but maybe it's an option.
are you willing to share your code?

also i was thinking of using perl for it


henk

On 16 jun. 2012, at 17:37, Edson Richter  wrote:


Em 16/06/2012 12:04, h...@101-factory.eu escreveu:

hi there,

I am trying to import large data files into pg.
for now i used the xargs linux command to spawn the file line by line, and set it 
to use the maximum available connections.

we use pgpool as a connection pool to the database, and so try to maximize the 
concurrent data import of the file.

the problem for now is that it seems to work well but we miss a line once in a while, 
and that is not acceptable. also it creates zombies ;(.

does anybody have any other tricks that will do the job?

thanks,

Henk

I've used a custom Java application using connection pooling (limited to 1000 
connections, meaning 1000 concurrent file imports).

I'm able to import more than 64000 XML files (about 13Kb each) in 5 minutes, 
without memory leaks or zombies, and (of course) no missing records.

Besides having each thread import a separate file, I have another situation where 
separate threads import different lines of the same file. No problems at all. Do not 
forget to check your OS "file open" limits (it was a big issue in the past for 
me due to Lucene indexes generated during import).

Server: 8 core Xeon, 16Gig, SAS 15000 rpm disks, PgSQL 9.1.3, Linux Centos 5, 
Sun Java 1.6.27.

Regards,

Edson Richter


I'm not allowed to publish my company's code, but the logic is very easy 
to understand (you will have to "invent" your own solution; the code below 
is bare bones):


import java.io.File;
import java.io.FileFilter;

class MainThread implements Runnable {
    // volatile so a stop request from another thread is seen by run()
    private volatile boolean keepRunning = true;

    public void run() {
        while (keepRunning) {
            try {
                executeFiles();
                Thread.sleep(30000); // sleep 30 seconds
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

    private void executeFiles() {
        File monitorDir = new File("/var/mydatafolder/");
        File processingDir = new File("/var/myprocessingfolder/");

        // I'll import only files with names like "data20120621.csv":
        FileFilter fileFilter = new FileFilter() {
            public boolean accept(File file) {
                boolean isFile = file.isFile() && !file.isHidden() && !file.isDirectory();
                if (!isFile) return false;
                String fname = file.getName();
                return fname.startsWith("data") && fname.endsWith("csv");
            }
        };

        File[] forProcessing = monitorDir.listFiles(fileFilter);

        for (File fileFound : forProcessing) {
            // FileUtil is a utility class, you will have to create your own...
            // your move method will vary according to your Operating System
            FileUtil.move(fileFound, processingDir);
            // ProcessFile is a class that implements Runnable; do your stuff there...
            Thread t = new Thread(new ProcessFile(processingDir, fileFound.getName()));
            t.start();
        }
    }

    /** Use this method to stop the thread from another place in your complex system! */
    public synchronized void stopWorker() {
        keepRunning = false;
    }

    public static void main(String[] args) {
        Thread t = new Thread(new MainThread());
        t.start();
    }
}
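
As an aside, whichever process spawns the workers, a common way to avoid
silently dropped lines on the database side is to load each file inside one
transaction with COPY. This is only a sketch, with a hypothetical target table
mytable and a server-side CSV path (not taken from the code above); COPY FROM
a file needs superuser, otherwise use psql's \copy or COPY FROM STDIN:

BEGIN;
COPY mytable FROM '/var/myprocessingfolder/data20120621.csv' WITH (FORMAT csv);
COMMIT;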




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] any solution for doing a data file import spawning it on multiple processes

2012-06-16 Thread Bosco Rama
h...@101-factory.eu wrote:
> thanks, i thought about splitting the file, but that did not work out well. 
> 
> so we receive 2 files every 30 seconds and need to import them as fast as 
> possible. 
> 
> we do not run java currently but maybe it's an option. 
> are you willing to share your code?
> 
> also i was thinking of using perl for it

Not sure if this will help, but have you looked at pgloader?



Bosco.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] any solution for doing a data file import spawning it on multiple processes

2012-06-16 Thread h...@101-factory.eu
thanks, i thought about splitting the file, but that did not work out well. 

so we receive 2 files every 30 seconds and need to import them as fast as 
possible. 

we do not run java currently but maybe it's an option. 
are you willing to share your code?

also i was thinking of using perl for it


henk

On 16 jun. 2012, at 17:37, Edson Richter  wrote:

> Em 16/06/2012 12:04, h...@101-factory.eu escreveu:
>> hi there,
>> 
>> I am trying to import large data files into pg.
>> for now i used the xargs linux command to spawn the file line by line, and 
>> set it to use the maximum available connections.
>> 
>> we use pgpool as a connection pool to the database, and so try to maximize 
>> the concurrent data import of the file.
>> 
>> the problem for now is that it seems to work well but we miss a line once in a 
>> while, and that is not acceptable. also it creates zombies ;(.
>> 
>> does anybody have any other tricks that will do the job?
>> 
>> thanks,
>> 
>> Henk
> 
> I've used a custom Java application using connection pooling (limited to 1000 
> connections, meaning 1000 concurrent file imports).
> 
> I'm able to import more than 64000 XML files (about 13Kb each) in 5 minutes, 
> without memory leaks or zombies, and (of course) no missing records.
> 
> Besides having each thread import a separate file, I have another situation where 
> separate threads import different lines of the same file. No problems at 
> all. Do not forget to check your OS "file open" limits (it was a big issue in 
> the past for me due to Lucene indexes generated during import).
> 
> Server: 8 core Xeon, 16Gig, SAS 15000 rpm disks, PgSQL 9.1.3, Linux Centos 5, 
> Sun Java 1.6.27.
> 
> Regards,
> 
> Edson Richter
> 
> 

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] any solution for doing a data file import spawning it on multiple processes

2012-06-16 Thread Edson Richter

Em 16/06/2012 12:04, h...@101-factory.eu escreveu:

hi there,

I am trying to import large data files into pg.
for now i used the xargs linux command to spawn the file line by line, and set it 
to use the maximum available connections.

we use pgpool as a connection pool to the database, and so try to maximize the 
concurrent data import of the file.

the problem for now is that it seems to work well but we miss a line once in a while, 
and that is not acceptable. also it creates zombies ;(.

does anybody have any other tricks that will do the job?

thanks,

Henk


I've used a custom Java application using connection pooling (limited to 
1000 connections, meaning 1000 concurrent file imports).

I'm able to import more than 64000 XML files (about 13Kb each) in 5 
minutes, without memory leaks or zombies, and (of course) no 
missing records.

Besides having each thread import a separate file, I have another situation 
where separate threads import different lines of the same file. 
No problems at all. Do not forget to check your OS "file open" limits 
(it was a big issue in the past for me due to Lucene indexes generated 
during import).


Server: 8 core Xeon, 16Gig, SAS 15000 rpm disks, PgSQL 9.1.3, Linux 
Centos 5, Sun Java 1.6.27.


Regards,

Edson Richter


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] any solution for doing a data file import spawning it on multiple processes

2012-06-16 Thread h...@101-factory.eu
hi there,

I am trying to import large data files into pg. 
for now i used the xargs linux command to spawn the file line by line, and set it 
to use the maximum available connections. 

we use pgpool as a connection pool to the database, and so try to maximize the 
concurrent data import of the file. 

the problem for now is that it seems to work well but we miss a line once in a while, 
and that is not acceptable. also it creates zombies ;(. 

does anybody have any other tricks that will do the job?

thanks,

Henk
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general