On Mon, Mar 7, 2022 at 4:22 AM Saint Michael <[email protected]> wrote:
>
> So how would I submit the contents of many files to parallel, without
> concatenating them?
Why do you see this as a problem? If you are going to start a process
for each line of input, cat will not slow things down.
You _can_ avoid the cat, but it seems a bit silly:
< file1.csv parallel --colsep ',' function "{1} {2} {3} {4} {5} {6} {7}"
< file2.csv parallel --colsep ',' function "{1} {2} {3} {4} {5} {6} {7}"
< file3.csv parallel --colsep ',' function "{1} {2} {3} {4} {5} {6} {7}"
< file4.csv parallel --colsep ',' function "{1} {2} {3} {4} {5} {6} {7}"
< file5.csv parallel --colsep ',' function "{1} {2} {3} {4} {5} {6} {7}"
< file6.csv parallel --colsep ',' function "{1} {2} {3} {4} {5} {6} {7}"
< file7.csv parallel --colsep ',' function "{1} {2} {3} {4} {5} {6} {7}"
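(The seven lines above can of course be written as a loop; this is
just shorthand for the same per-file runs, assuming your files match
file*.csv and with 'function' still standing in for your own command:)

for f in file*.csv; do
  < "$f" parallel --colsep ',' function "{1} {2} {3} {4} {5} {6} {7}"
done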
And I think you will find the total run time is longer.
> The function needs to process each file line by line.
> I am sure there must be a better way.
> Why concatenate them at all?
Because you want to feed them into GNU Parallel as a single input source.
cat is way faster than GNU Parallel will ever be, so please explain
why you see cat as a problem. You can compare the two yourself:
seq 10000 > file
time cat file >/dev/null
time parallel echo < file >/dev/null
> There is no relationship between a line and the next line.
If you can change your function to read from stdin (standard input),
then we can do something way more efficient:
myfunc() { wc; }
export -f myfunc
parallel --pipepart --block -1 myfunc :::: *.csv
--pipepart has some limitations, but it is insanely fast (almost as
fast as a parallelized cat).
> Maybe a new feature?
If the above does not answer your question, then it is unclear to me
what you really want to do.
If you read https://stackoverflow.com/help/minimal-reproducible-example,
you will see how to make it easier to help you.
/Ole