Re: [GENERAL] any solution for doing a data file import spawning it on multiple processes
thanks all, i will be looking into it.

Kind regards,

Henk

On 16 jun. 2012, at 18:23, Edson Richter wrote:

> On 16/06/2012 12:59, h...@101-factory.eu wrote:
>> thanks, i thought about splitting the file, but that did not work out well.
>>
>> so we receive 2 files every 30 seconds and need to import these as fast as
>> possible.
>>
>> we do not run java currently, but maybe it's an option.
>> are you willing to share your code?
>>
>> also i was thinking of using perl for it
>>
>> henk
>>
>> On 16 jun. 2012, at 17:37, Edson Richter wrote:
>>
>>> On 16/06/2012 12:04, h...@101-factory.eu wrote:
>>>> hi there,
>>>>
>>>> I am trying to import large data files into pg.
>>>> for now i used the xargs linux command to feed the file line by line,
>>>> set up to use the maximum number of available connections.
>>>>
>>>> we use pgpool as the connection pool to the database, and so try to
>>>> maximize the concurrent data import of the file.
>>>>
>>>> the problem for now: it seems to work well, but we miss a line once in
>>>> a while, and that is not acceptable. it also creates zombie processes ;(.
>>>>
>>>> does anybody have any other tricks that will do the job?
>>>>
>>>> thanks,
>>>>
>>>> Henk
>>> I've used a custom Java application with connection pooling (limited to
>>> 1000 connections, meaning 1000 concurrent file imports).
>>>
>>> I'm able to import more than 64000 XML files (about 13 KB each) in 5
>>> minutes, without memory leaks or zombies, and (of course) no missing
>>> records.
>>>
>>> Besides the case where each thread imports a separate file, I have
>>> another situation where separate threads import different lines of the
>>> same file. No problems at all. Do not forget to check your OS "open
>>> files" limit (it was a big issue for me in the past due to Lucene
>>> indexes generated during import).
>>>
>>> Server: 8-core Xeon, 16 GB, SAS 15000 rpm disks, PgSQL 9.1.3, Linux
>>> CentOS 5, Sun Java 1.6.27.
>>>
>>> Regards,
>>>
>>> Edson Richter
> I'm not allowed to publish my company's code, but the logic is very easy to
> understand (you will have to "invent" your own solution; the code below is
> bare bones):
>
> import java.io.File;
> import java.io.FileFilter;
>
> class MainThread implements Runnable {
>     // volatile so the stop flag set by another thread is seen here
>     private volatile boolean keepRunning = true;
>
>     public void run() {
>         while (keepRunning) {
>             try {
>                 executeFiles();
>                 Thread.sleep(30000); // sleep 30 seconds
>             } catch (Exception ex) {
>                 ex.printStackTrace();
>             }
>         }
>     }
>
>     private void executeFiles() {
>         File monitorDir = new File("/var/mydatafolder/");
>         File processingDir = new File("/var/myprocessingfolder/");
>
>         // I'll import only files with names like "data20120621.csv":
>         FileFilter fileFilter = new FileFilter() {
>             public boolean accept(File file) {
>                 boolean isfile = file.isFile() && !file.isHidden();
>                 if (!isfile) return false;
>                 String fname = file.getName();
>                 return fname.startsWith("data") && fname.endsWith("csv");
>             }
>         };
>
>         File[] forProcessing = monitorDir.listFiles(fileFilter);
>         if (forProcessing == null) return; // directory missing or unreadable
>
>         for (File fileFound : forProcessing) {
>             // FileUtil is a utility class, you will have to create your
>             // own... your move method will vary according to your
>             // operating system.
>             FileUtil.move(fileFound, processingDir);
>             // ProcessFile is a class that implements Runnable; do your
>             // import work there.
>             Thread t = new Thread(new ProcessFile(processingDir, fileFound.getName()));
>             t.start();
>         }
>     }
>
>     /** Use this method to stop the thread from another place in your
>      *  complex system! */
>     public synchronized void stopWorker() {
>         keepRunning = false;
>     }
>
>     public static void main(String[] args) {
>         Thread t = new Thread(new MainThread());
>         t.start();
>     }
> }
Re: [GENERAL] any solution for doing a data file import spawning it on multiple processes
thanks, i thought about splitting the file, but that did not work out well.

so we receive 2 files every 30 seconds and need to import these as fast as
possible.

we do not run java currently, but maybe it's an option.
are you willing to share your code?

also i was thinking of using perl for it

henk

On 16 jun. 2012, at 17:37, Edson Richter wrote:

> On 16/06/2012 12:04, h...@101-factory.eu wrote:
>> hi there,
>>
>> I am trying to import large data files into pg.
>> for now i used the xargs linux command to feed the file line by line,
>> set up to use the maximum number of available connections.
>>
>> we use pgpool as the connection pool to the database, and so try to
>> maximize the concurrent data import of the file.
>>
>> the problem for now: it seems to work well, but we miss a line once in
>> a while, and that is not acceptable. it also creates zombie processes ;(.
>>
>> does anybody have any other tricks that will do the job?
>>
>> thanks,
>>
>> Henk
>
> I've used a custom Java application with connection pooling (limited to
> 1000 connections, meaning 1000 concurrent file imports).
>
> I'm able to import more than 64000 XML files (about 13 KB each) in 5
> minutes, without memory leaks or zombies, and (of course) no missing
> records.
>
> Besides the case where each thread imports a separate file, I have
> another situation where separate threads import different lines of the
> same file. No problems at all. Do not forget to check your OS "open
> files" limit (it was a big issue for me in the past due to Lucene
> indexes generated during import).
>
> Server: 8-core Xeon, 16 GB, SAS 15000 rpm disks, PgSQL 9.1.3, Linux
> CentOS 5, Sun Java 1.6.27.
>
> Regards,
>
> Edson Richter
[GENERAL] any solution for doing a data file import spawning it on multiple processes
hi there,

I am trying to import large data files into pg.
for now i used the xargs linux command to feed the file line by line,
set up to use the maximum number of available connections.

we use pgpool as the connection pool to the database, and so try to
maximize the concurrent data import of the file.

the problem for now: it seems to work well, but we miss a line once in a
while, and that is not acceptable. it also creates zombie processes ;(.

does anybody have any other tricks that will do the job?

thanks,

Henk
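[For this kind of workload, batching with COPY usually beats feeding single lines through xargs: split the incoming file on line boundaries, bulk-load each chunk in its own psql process, and wait for the children so none are left as zombies. A minimal sketch, assuming GNU split and placeholder names (mydb, my_table, /var/incoming/data.csv):

#!/bin/bash
# Split on line boundaries so no record is ever cut in half,
# COPY each chunk in parallel, then reap all children with wait.
set -e
FILE=/var/incoming/data.csv   # placeholder input file
JOBS=4                        # number of parallel loaders
LINES=$(( ($(wc -l < "$FILE") + JOBS - 1) / JOBS ))

split -l "$LINES" "$FILE" /tmp/chunk_

for chunk in /tmp/chunk_*; do
    psql -d mydb -v ON_ERROR_STOP=1 \
         -c "\copy my_table from '$chunk' with delimiter ','" &
done
wait    # collects every child, so no zombies are left behind
rm -f /tmp/chunk_*

Because each chunk is loaded by a single COPY, a failed chunk can be re-run as a unit instead of hunting for individual missing lines.]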
Re: [GENERAL] Re: how to set up automatically startup database when the server boot or reboot.
are there any useful startup scripts when running in a master-slave setup
with pgpool?

Henk

On 27 apr. 2012, at 19:22, leaf_yxj wrote:

> My OS is Red Hat Linux 5.5, and my database is Greenplum 4.2.1 (PostgreSQL
> 8.2.15). I will take a look at the init.d directory.
>
> Thanks, guys. Any opinion is welcome. Please help.
>
> --
> View this message in context:
> http://postgresql.1045698.n5.nabble.com/how-to-set-up-automatically-startup-database-when-the-server-boot-or-reboot-tp5670442p5670905.html
> Sent from the PostgreSQL - general mailing list archive at Nabble.com.
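[On a Red Hat-style system the usual approach is an init.d script registered with chkconfig that brings up the database first and pgpool after it, stopping them in reverse order. A bare-bones sketch with placeholder paths, assuming stock pg_ctl and pgpool binaries; Greenplum installations would call gpstart/gpstop instead of pg_ctl:

#!/bin/bash
# /etc/init.d/pg-pgpool -- placeholder name; register with:
#   chkconfig --add pg-pgpool
# chkconfig: 2345 85 15
# description: start PostgreSQL, then pgpool, at boot

PGDATA=/var/lib/pgsql/data      # placeholder data directory

case "$1" in
  start)
    su - postgres -c "pg_ctl start -D $PGDATA -w"
    pgpool -f /etc/pgpool.conf
    ;;
  stop)
    pgpool -f /etc/pgpool.conf -m fast stop
    su - postgres -c "pg_ctl stop -D $PGDATA -m fast"
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac

In a master-slave pair the same script can run on both database nodes; pgpool itself typically runs on just one of them (or on a separate host).]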
Re: [GENERAL] Valid query times out when run from bash script
run the script with bash -v, or bash -x for extra-detailed verbose logging,
and see what's wrong. most of the time it's a matter of getting the quoting
right, closing statements properly with ' or ".

Henk Bronk

On 27 mrt. 2012, at 20:37, "W. David Jarvis" wrote:

> Hello all -
>
> I've been trying to get a bash script set up that creates a DB, does a
> join, writes the join to file, and then cleans up after itself.
> Everything within the script runs fine when entered at an actual
> prompt, but when I run the script the SELECT query runs indefinitely.
> I'm not sure what the cause is; the script is being run on a box
> connecting to its own local postgres installation, so it can't be a
> connection issue. Any thoughts?
>
> The bash script is included below; the postgres version is 9.0.4. Any
> help very much appreciated :)
>
> #!/bin/bash
>
> : ${1?"ERROR: Incorrect number of arguments (files have not been
> properly specified). Proper format is 'mta-join old_mta_file
> new_mta_file desired_output_filename.'"}
> : ${2?"ERROR: Incorrect number of arguments (files have not been
> properly specified). Proper format is 'mta-join old_mta_file
> new_mta_file desired_output_filename.'"}
> : ${3?"ERROR: Incorrect number of arguments (files have not been
> properly specified). Proper format is 'mta-join old_mta_file
> new_mta_file desired_output_filename.'"}
>
> OLD_MTA_FILE=$1
> NEW_MTA_FILE=$2
> OUTPUT_FILE=$3
>
> # Figure out how many days we need columns for
> export NEW_MTA_COLWIDTH=`head -1 $NEW_MTA_FILE | grep ',' -o | wc -l`
> NEW_MTA_COLWIDTH=$(( (NEW_MTA_COLWIDTH - 7) / 4 ))
>
> # Assemble the extra SQL for the above
> ADDITIONAL_CREATION_FIELDS=""
> ADDITIONAL_SELECTION_FIELDS=""
> for (( c=0; c<=$NEW_MTA_COLWIDTH; c++ ))
> do
>     ADDITIONAL_CREATION_FIELDS=$ADDITIONAL_CREATION_FIELDS"DAY_"$c" varchar(255), event_count_"$c" numeric(20,10), conversions_"$c" numeric(20,10), revenue_"$c" numeric(20,10), "
>     ADDITIONAL_SELECTION_FIELDS=$ADDITIONAL_SELECTION_FIELDS"new_mta.DAY_"$c", new_mta.event_count_"$c", new_mta.conversions_"$c", new_mta.revenue_"$c", "
> done
>
> # Let's get rid of that extra comma at the end.
> ADDITIONAL_CREATION_FIELDS=${ADDITIONAL_CREATION_FIELDS:0:${#ADDITIONAL_CREATION_FIELDS}-2}
> ADDITIONAL_SELECTION_FIELDS=${ADDITIONAL_SELECTION_FIELDS:0:${#ADDITIONAL_SELECTION_FIELDS}-2}
>
> echo -n "Creating database mta-join..."
> createdb mta-join
> echo -e "Done!\n"
>
> echo "Creating table new_mta..."
> # PSQL commands start here
> psql mta-join <<EOF
> CREATE TABLE new_mta (
>     report_date          date,
>     campaign_id          integer,
>     tracking_campaign_id integer,
>     placement_id         integer,
>     creative_id          integer,
>     package_id           integer,
>     audience_id          integer,
>     $ADDITIONAL_CREATION_FIELDS);
> EOF
> echo -e "Done!\n"
>
> echo -n "Inserting new mta file into new_mta table..."
> psql mta-join <<EOF
> \copy new_mta from '$NEW_MTA_FILE' with delimiter ','
> \q
> EOF
> echo -e "Done!\n"
>
> echo "Creating table old_mta..."
> # PSQL commands start here
> psql mta-join <<EOF
> CREATE TABLE old_mta (
>     report_date_day         date,
>     report_date_week        date,
>     report_date_month       date,
>     campaign_name           varchar(255),
>     package_name            varchar(255),
>     audience_name           varchar(255),
>     inventory_provider_name varchar(255),
>     placement_name          varchar(255),
>     creative_size           varchar(255),
>     creative_name           varchar(255),
>     impressions             bigint,
>     data_cost               numeric(20,10),
>     media_cost              numeric(20,10),
>     gross_cost              numeric(20,10),
>     clicks                  integer,
>     lta_click_actions       integer,
>     lta_view_actions        integer,
>     lta_click_revenue       integer,
>     lta_view_revenue        integer,
>     mta_actions             numeric(20,10),
>     mta_revenue             integer,
>     mta_action_count        integer,
>     mta_seconds             integer,
>     campaign_id             integer,
>     placement_id            bigint,
>     creative_id             bigint,
>     package_id              bigint,
>     audience_id             integer);
> \q
> EOF
> echo -e "Done!\n"
>
> # Upload old MTA file into table old_mta
> echo -n "Inserting old mta file into old_mta table..."
> psql mta-join <<EOF
> \COPY old_mta from '$OLD_MTA_FILE' with delimiter ',' CSV header
> EOF
> echo -e "Done!\n"
>
> # Create a bunch of indexes
> echo -n "Creating table indexes for faster querying..."
> psql mta-join <<EOF
> CREATE INDEX campaign_id_index_old ON old_mta (campaign_id);
> CREATE INDEX campaign_id_index_new ON new_mta (campaign_id);
> CREATE INDEX placement_id_index_old ON old_mta (placement_id);
> CREATE INDEX placement_id_index_new ON new_mta (placement_id);
> CREATE INDEX creative_id_index_old ON old_mta (creative_id);
> CREATE INDEX creative_id_index_new ON new_mta (creative_id);
> CREATE INDEX package_id_index_old ON old_mta (package_id);
> CREATE INDEX package_id_index_new ON new_mta (package_id);
> CREATE I
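[A frequent cause of a psql heredoc script "hanging" on one statement is a here-document that never terminates: if the closing EOF is indented, mistyped, or lost, psql keeps reading the rest of the script as SQL and then blocks waiting for more input on stdin. A minimal sketch of the safer pattern; ON_ERROR_STOP and --single-transaction are standard psql options, while the table and file names are placeholders:

#!/bin/bash
# The closing EOF must start in column 0 (or use <<-EOF and indent
# with tabs only). ON_ERROR_STOP makes psql exit non-zero on the
# first SQL error instead of silently carrying on.
psql -d mta-join -v ON_ERROR_STOP=1 --single-transaction <<EOF
CREATE TABLE demo (id integer);
\copy demo from 'demo.csv' with delimiter ','
EOF
echo "psql exit status: $?"

Checking $? (or running the whole script under bash -x) after each psql block narrows down exactly which statement is stalling.]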