Francis,

Thank you for the gist. I looked through it briefly just now, but won't be able to look at it in detail until later tonight or tomorrow morning.
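
To check that I understand the indexing pass described in the quoted message below, here is a rough sketch of the idea. It is written in Python rather than Clojure purely for illustration, and it is not taken from the gist or from iota. The names `index_xyz_frames` and `read_frame` are my own, and the sketch assumes every frame is complete: one atom-count line, one comment line (possibly empty), then exactly that many atom lines.

```python
import io


def index_xyz_frames(stream):
    """Scan a binary XYZ stream once, sequentially.

    Returns one (byte_offset, n_atoms, byte_length) tuple per time step.
    The scan has to be sequential because a count line cannot be told
    apart from a comment line without knowing where the previous frame ended.
    """
    frames = []
    while True:
        start = stream.tell()
        count_line = stream.readline()
        if not count_line:          # end of file
            break
        if not count_line.strip():  # tolerate stray blank lines between frames
            continue
        n_atoms = int(count_line.split()[0])
        stream.readline()           # comment line, skipped during indexing
        for _ in range(n_atoms):
            stream.readline()       # atom lines, skipped during indexing
        frames.append((start, n_atoms, stream.tell() - start))
    return frames


def read_frame(stream, entry):
    """Read and parse one frame, given its index entry."""
    offset, n_atoms, length = entry
    stream.seek(offset)
    lines = stream.read(length).decode().splitlines()
    comment = lines[1]
    atoms = [line.split() for line in lines[2:2 + n_atoms]]
    return comment, atoms


# Tiny in-memory example with a different atom count per time step.
demo = io.BytesIO(
    b"2\n\nC 0.000 0.000 0.0000\nC 0.000 0.000 0.0000\n"
    b"3 \nabc\nC 10.0 10.0 10.0 1\nC 10.0 10.0 10.0 2\nC 10.0 10.0 10.0 3\n"
)
index = index_xyz_frames(demo)
second_comment, second_atoms = read_frame(demo, index[1])
```

Once the index exists, each entry is an independent unit of work, so the frames could be handed to parallel workers, and the index itself could be cached or written to disk for later runs.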

On Friday, December 26, 2014 2:40:53 PM UTC-5, Francis Avila wrote:
>
> If you need parallelism, you need to do an indexing pass first to
> determine the group boundaries. Then you can process them in parallel
> because you know the units-of-work.
>
> Iota is an ok fit for this, so I suggest trying it first. (You may have to
> dial down the parallelism of r/fold to avoid stressing your OS's mmap.)
>
> Unfortunately this file format makes it impossible to determine chunk
> boundaries in parallel: there's no (easy) way to distinguish atom count and
> comment lines without knowing the state of previous lines, so you cannot
> index in parallel. However you can cache the index or even write it to disk
> for iota to read later.
>
> I wrote a small gist to demonstrate the basic procedure:
> https://gist.github.com/favila/035718ab762c6adfc8dc
>
>
> On Friday, December 26, 2014 10:00:57 AM UTC-6, cej38 wrote:
>>
>> Line-by-line is the problem. I need groups of lines at a time.
>>
>>
>> On Friday, December 26, 2014 10:33:27 AM UTC-5, Jony Hudson wrote:
>>>
>>> I think clojure.csv reads CSV files lazily, line-by-line, so might be
>>> useful to take a look at:
>>>
>>> https://github.com/clojure/data.csv
>>>
>>>
>>> Jony
>>>
>>> On Friday, 26 December 2014 14:49:59 UTC, cej38 wrote:
>>>>
>>>> In molecular dynamics a popular format for writing out the positions of
>>>> the atoms in a system is the xyz file format (see:
>>>> http://en.wikipedia.org/wiki/XYZ_file_format and/or
>>>> http://www.ks.uiuc.edu/Research/vmd/plugins/molfile/xyzplugin.html).
>>>> The format allows for storing the positions of the atoms at different
>>>> snapshots in time (aka "time step"). You may have a few to millions of
>>>> atoms in your system and you may have thousands of time steps represented
>>>> in the file. It is easy to end up with a single file that is many GB in
>>>> size.
>>>>
>>>> Here is a shell command that will create a very simple, and very
>>>> small, test file (note that the positions of the atoms are completely
>>>> unrealistic-they are all sitting on top of each other)
>>>>
>>>> perl -e 'open(F, ">>test1.xyz"); for( $t= 1; $t < 11; $t = $t +1){print
>>>> F "10\n\n"; for( $a = 1; $a < 11; $a = $a + 1 ){print F "C 0.000 0.000
>>>> 0.0000\n";}}; close(F);'
>>>>
>>>>
>>>> Here is a shell command that will produce a more complicated file
>>>> structure (note that depending on who wrote the code that output the file
>>>> there may be other columns of data at the end of each row, also the number
>>>> of decimal places kept and the type of spacing between elements may
>>>> change), this file has a different number of atoms with each time step :
>>>>
>>>> perl -e 'open(F, ">>test2.xyz"); for( $t= 1; $t < 5; $t = $t +1){my $s=
>>>> $t + 10; print F "$s \n"; my $color = substr ("abcd efghij klmno pqrs tuv
>>>> wxyz", int(rand(10)), int(rand(10))); print F $color; print F "\n" ;for(
>>>> $a
>>>> = 1; $a < (11 +$t); $a = $a + 1 ){print F "C 10.000000 10.00000
>>>> 10.00000 $a\n";}}; close(F);'
>>>>
>>>> Ok, that is the background to get to my question. I need a way to
>>>> parse these files and group the lines into time steps. I currently have
>>>> something that works but only in cases where the file size is relatively
>>>> small-it reads the whole file into memory. I would like to use something
>>>> like iota that will allow me lazily parse the file and run reducers on the
>>>> data. Any help would be really appreciated.
>>>>

-- 
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en