Re: parsing and chunking large xyz files

cej38 Fri, 26 Dec 2014 13:28:23 -0800

Francis,
  Thank you for the gist.  I looked through it briefly just now, but won't 
be able to look at it in detail until later tonight or tomorrow morning.




On Friday, December 26, 2014 2:40:53 PM UTC-5, Francis Avila wrote:
>
> If you need parallelism, you need to do an indexing pass first to 
> determine the group boundaries. Then you can process them in parallel 
> because you know the units-of-work.
>
> Iota is an ok fit for this, so I suggest trying it first. (You may have to 
> dial down the parallelism of r/fold to avoid stressing your OS's mmap.)
>
> Unfortunately this file format makes it impossible to determine chunk 
> boundaries in parallel: there's no (easy) way to distinguish atom count and 
> comment lines without knowing the state of previous lines, so you cannot 
> index in parallel. However you can cache the index or even write it to disk 
> for iota to read later.
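A minimal sketch of the index-then-parallelize procedure described above, in plain Clojure. This is not the code from the gist. The names frame-index, frame-lines and total-atoms are hypothetical, and `lines` is an ordinary in-memory vector standing in for a random-access line store; whether an iota vector can be dropped in unchanged is an assumption that has not been checked here.

```clojure
(ns xyz-index-sketch
  (:require [clojure.core.reducers :as r]
            [clojure.string :as str]))

(defn frame-index
  "Sequential pass over a vector of lines. Returns a vector of
  [first-atom-line atom-count] pairs, one per time step. The pass has to be
  sequential: a line is only known to be an atom count because the previous
  frame ended just before it."
  [lines]
  (loop [i 0, idx []]
    (if (and (< i (count lines))
             (not (str/blank? (nth lines i))))
      (let [n (Long/parseLong (str/trim (nth lines i)))]
        ;; line i is the atom count, line i+1 the comment,
        ;; and the atoms occupy lines i+2 .. i+1+n
        (recur (+ i 2 n) (conj idx [(+ i 2) n])))
      idx)))

(defn frame-lines
  "Atom lines of one indexed frame."
  [lines [start n]]
  (subvec lines start (+ start n)))

(defn total-atoms
  "Folds over the index (not the raw lines), so each unit of work is one
  whole frame. r/fold may process partitions of the index in parallel."
  [lines]
  (r/fold + (r/map #(count (frame-lines lines %)) (frame-index lines))))
```

The index is small compared to the file (one pair per time step), so it is cheap to cache. Note that a comment line may itself look like a number, which is why the index cannot be built by inspecting lines in isolation.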
>
> I wrote a small gist to demonstrate the basic procedure: 
> https://gist.github.com/favila/035718ab762c6adfc8dc
>
>
> On Friday, December 26, 2014 10:00:57 AM UTC-6, cej38 wrote:
>>
>> Line-by-line is the problem.  I need groups of lines at a time.
>>
>>
>>
>>
>> On Friday, December 26, 2014 10:33:27 AM UTC-5, Jony Hudson wrote:
>>>
>>> I think clojure.data.csv reads CSV files lazily, line-by-line, so it might 
>>> be useful to take a look at:
>>>
>>> https://github.com/clojure/data.csv
>>>
>>>
>>> Jony
>>>
>>> On Friday, 26 December 2014 14:49:59 UTC, cej38 wrote:
>>>>
>>>> In molecular dynamics, a popular format for writing out the positions of 
>>>> the atoms in a system is the xyz file format (see: 
>>>> http://en.wikipedia.org/wiki/XYZ_file_format and/or 
>>>> http://www.ks.uiuc.edu/Research/vmd/plugins/molfile/xyzplugin.html). 
>>>> The format allows for storing the positions of the atoms at different 
>>>> snapshots in time (aka "time steps").  You may have anywhere from a few to 
>>>> millions of atoms in your system, and you may have thousands of time steps 
>>>> represented in the file.  It is easy to end up with a single file that is 
>>>> many GB in size.  Here is a shell command that will create a very simple, 
>>>> and very small, test file (note that the positions of the atoms are 
>>>> completely unrealistic: they are all sitting on top of each other):
>>>>
>>>> perl -e 'open(F, ">>test1.xyz");
>>>>   for ($t = 1; $t < 11; $t = $t + 1) {
>>>>     print F "10\n\n";
>>>>     for ($a = 1; $a < 11; $a = $a + 1) { print F "C  0.000 0.000 0.0000\n"; }
>>>>   }
>>>>   close(F);'
>>>>
>>>>
>>>> Here is a shell command that will produce a more complicated file 
>>>> structure (note that, depending on who wrote the code that output the 
>>>> file, there may be other columns of data at the end of each row; the 
>>>> number of decimal places kept and the type of spacing between elements 
>>>> may also change).  This file has a different number of atoms in each 
>>>> time step:
>>>>
>>>> perl -e 'open(F, ">>test2.xyz");
>>>>   for ($t = 1; $t < 5; $t = $t + 1) {
>>>>     my $s = $t + 10; print F "$s \n";
>>>>     my $color = substr("abcd efghij klmno pqrs tuv wxyz", int(rand(10)), int(rand(10)));
>>>>     print F $color; print F "\n";
>>>>     for ($a = 1; $a < (11 + $t); $a = $a + 1) {
>>>>       print F "C    10.000000   10.00000   10.00000   $a\n";
>>>>     }
>>>>   }
>>>>   close(F);'
>>>>
>>>> Ok, that is the background to get to my question.  I need a way to 
>>>> parse these files and group the lines into time steps.  I currently have 
>>>> something that works, but only when the file is relatively small, because 
>>>> it reads the whole file into memory.  I would like to use something like 
>>>> iota that will allow me to lazily parse the file and run reducers on the 
>>>> data.  Any help would be really appreciated.
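For reference, a minimal sketch of grouping lines into time steps lazily, using only clojure.core and clojure.java.io rather than iota. The names frames and atoms-per-frame are hypothetical, and the sketch assumes a well-formed file in which every time step is an atom-count line, a comment line, and then exactly that many atom lines.

```clojure
(ns xyz-frames-sketch
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

(defn frames
  "Lazily groups a seq of XYZ lines into time steps. Each time step is a map
  with the comment line and a vector of the raw atom lines. Stops at the end
  of input or at a blank line where an atom count is expected."
  [lines]
  (lazy-seq
    (when-let [[count-line & more] (seq lines)]
      (when-not (str/blank? count-line)
        (let [n (Long/parseLong (str/trim count-line))
              ;; (first more) is the comment line; the atoms follow it
              [atoms remaining] (split-at n (rest more))]
          (cons {:comment (first more)
                 :atoms   (vec atoms)}
                (frames remaining)))))))

(defn atoms-per-frame
  "Reads the file at `path` and returns the number of atoms in each time
  step. mapv is eager, so the lazy seq is fully consumed before with-open
  closes the reader."
  [path]
  (with-open [rdr (io/reader path)]
    (mapv #(count (:atoms %)) (frames (line-seq rdr)))))
```

Because both line-seq and frames are lazy, only the time step being processed needs to be in memory, provided the caller does not hold on to the head of the seq. This version is sequential; it does not by itself give the parallelism of reducers.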
>>>>
>>>>
>>>>
>>>>
>>>>

