Re: Grammars and biological data formats

Fields, Christopher J Sat, 16 Aug 2014 11:48:06 -0700

Yes, that looks like an even better option.  I see that this is implemented in 
p5 as File::Map, which is a nice portable option.
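
For anyone who wants to try the mmap-as-string idea, here is a rough sketch. It is written in Python purely as a neutral stand-in (it is not Perl 6 and it is not File::Map's API), and `index_headers` and `HEADER` are names made up for illustration. It maps a FASTA file read-only and lets the regex engine walk the mapping as if it were one large byte string, recording each header ID and its byte offset, so the OS pages data in as the scan proceeds.

```python
import mmap
import re

# Assumption for illustration: a FASTA header is a line starting with ">"
# followed by a non-whitespace ID.
HEADER = re.compile(rb"^>(\S+)", re.MULTILINE)

def index_headers(path):
    """Return [(record_id, byte_offset), ...] for every header line in `path`.

    The file is memory-mapped rather than slurped; the regex engine reads the
    mapping through the buffer protocol. Note that an empty file cannot be
    mapped (mmap raises ValueError), so callers may want to check for that.
    """
    with open(path, "rb") as fh, \
         mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return [(m.group(1).decode("ascii"), m.start())
                for m in HEADER.finditer(mm)]
```

Backtracking needs no special support here, since the matcher simply moves back and forth over addresses within the mapping.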


Chris

> On Aug 16, 2014, at 7:51 AM, "Martin D Kealey" <mar...@kurahaupo.gen.nz> 
> wrote:
> 
> 
> Hmmm, what about just implementing mmap-as-string?
> 
> Then, assuming the parsing process is somewhat stream-like, the OS will take
> care of swapping in chunks as you need them. You don't even need anything
> special to support backtracking -- it's just a memory address, after all.
> 
> -Martin
> 
>> On Thu, 14 Aug 2014, Fields, Christopher J wrote:
>> Yeah, I'm thinking of a Cat-like class that would chunkify the data and 
>> check for matches.
>> 
>> The main reason I would like to stick with a consistent grammar-based 
>> approach is I have seen many instances in BioPerl where a parser is 
>> essentially rewritten based on its purpose (full parsing, lazy parsing, 
>> indexing of flat files, adding to a persistent data store, etc).  Having a 
>> way both to parse with a full grammar and to subparse for a specific 
>> token/rule is very handy, and will be even more so when Cat comes around.
>> 
>> Chris
>> 
>>> On Aug 14, 2014, at 6:40 AM, "Carl Mäsak" <cma...@gmail.com> wrote:
>>> 
>>> I was going to pipe up and say that I wouldn't wait around for Cat,
>>> I'd write something that reads chunks and then parses that. It'll be a
>>> bit more code, but it'll work today. But I see you reached that
>>> conclusion already. :)
>>> 
>>> Lately I've found myself writing more and more grammars that parse
>>> just one line of some input. Provided that the same action object gets
>>> attached to the parse each time, that's an excellent place to store
>>> information that you want to persist between lines. Actually, action
>>> objects started to make a whole lot more sense to me after I found
>>> that use case, because such an object takes on the role of a session/lifetime
>>> object for the parse process itself.
>>> 
>>> // Carl
>>> 
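The one-line-per-parse pattern described above, with a single object carrying state from line to line, might look roughly like the following. This is a Python stand-in rather than a Perl 6 grammar: the two regexes play the part of the grammar's rules, and `FastaActions` and `parse_line` are hypothetical names chosen only for this sketch.

```python
import re

# Stand-ins for two grammar rules (assumed, simplified FASTA line shapes).
HEADER = re.compile(r">(\S+)")
SEQUENCE = re.compile(r"[A-Za-z*-]+")

class FastaActions:
    """Plays the role of the action object: state that outlives a single line."""

    def __init__(self):
        self.lengths = {}    # record id -> total sequence length seen so far
        self.current = None  # id of the record currently being read

    def header(self, match):
        self.current = match.group(1)
        self.lengths[self.current] = 0

    def sequence(self, match):
        if self.current is None:
            raise ValueError("sequence line before any header")
        self.lengths[self.current] += len(match.group(0))

def parse_line(line, actions):
    """Parse exactly one line and report the result to the shared actions object."""
    line = line.rstrip("\r\n")
    m = HEADER.match(line)
    if m:
        actions.header(m)
        return
    m = SEQUENCE.fullmatch(line)
    if m:
        actions.sequence(m)
    elif line:  # blank lines are skipped; anything else is an error
        raise ValueError(f"unparseable line: {line!r}")
```

Usage would be along the lines of creating one `FastaActions()` and calling `parse_line(line, actions)` for each line of an open file, so that memory use depends on the longest line rather than on the file size.
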
>>> On Wed, Aug 13, 2014 at 3:19 PM, Fields, Christopher J
>>> <cjfie...@illinois.edu> wrote:
>>>> On Aug 13, 2014, at 8:11 AM, Christopher Fields <cjfie...@illinois.edu> 
>>>> wrote:
>>>> 
>>>>>> On Aug 13, 2014, at 4:50 AM, Solomon Foster <colo...@gmail.com> wrote:
>>>>>> 
>>>>>> On Sat, Aug 9, 2014 at 7:26 PM, Fields, Christopher J
>>>>>> <cjfie...@illinois.edu> wrote:
>>>>>>> I have a fairly simple question regarding the feasibility of using 
>>>>>>> grammars with commonly used biological data formats.
>>>>>>> 
>>>>>>> My main question: if I wanted to parse() or subparse() very large files 
>>>>>>> (it is not unheard of for FASTA/FASTQ or similar data files to exceed 
>>>>>>> hundreds of GB), would a grammar be the best solution?  Based on what I 
>>>>>>> am reading, the semantics appear to be greedy; for instance:
>>>>>>> 
>>>>>>> Grammar.parsefile($file)
>>>>>>> 
>>>>>>> appears to be a convenient shorthand for:
>>>>>>> 
>>>>>>> Grammar.parse($file.slurp)
>>>>>>> 
>>>>>>> since Grammar.parse() works on a Str, not an IO::Handle or Buf.  Or am I 
>>>>>>> misunderstanding how this could be accomplished?
>>>>>> 
>>>>>> My understanding is it is intended that parsing can work on Cats
>>>>>> (hypothetical lazy strings) but this hasn't been implemented yet
>>>>>> anywhere.
>>>>>> 
>>>>>> --
>>>>>> Solomon Foster: colo...@gmail.com
>>>>>> HarmonyWare, Inc: http://www.harmonyware.com
>>>>> 
>>>>> Yeah, that’s what I recall as well.  I see very little in the specs re: 
>>>>> Cat unfortunately.
>>>>> 
>>>>> chris
>>>> 
>>>> Ah, never mind.  I did a search of the IRC channel and found it’s 
>>>> considered to be a ‘6.1’ feature:
>>>> 
>>>>   http://irclog.perlgeek.de/perl6/2014-07-06#i_8978974
>>>> 
>>>> It is mentioned a few times in the specs, I’m guessing based on where it’s 
>>>> thought to fit in best.  For the moment the proposal is to run grammar 
>>>> parsing on sized chunks of the input data, which might be how Cat would be 
>>>> implemented anyway.
>>>> 
>>>> chris
>>>> 
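The interim proposal of running the parser over sized chunks of the input could be sketched as below. Again this is Python used as a stand-in, not Perl 6, and `fasta_records`, `RECORD`, and the 1 MiB default chunk size are assumptions made for illustration. The part that needs care is a record straddling a chunk boundary, so the incomplete tail of each chunk is carried over to the next read.

```python
import re

# One record: a line starting with ">" up to (not including) the next such line.
RECORD = re.compile(r"^>.*?(?=^>|\Z)", re.MULTILINE | re.DOTALL)

def fasta_records(handle, chunk_size=1 << 20):
    """Yield complete FASTA records (as str) from a text handle, chunk by chunk."""
    pending = ""  # text read so far that may end in an incomplete record
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last "\n>" consists of complete records.
        cut = pending.rfind("\n>")
        if cut == -1:
            continue  # no record boundary seen yet; keep reading
        complete, pending = pending[:cut + 1], pending[cut + 1:]
        for m in RECORD.finditer(complete):
            yield m.group(0)
    # At end of input, whatever is left is the final record (if any).
    for m in RECORD.finditer(pending):
        yield m.group(0)
```

Each yielded record is small enough to be handed to a full-record parser as an ordinary string, which is roughly what one would expect a lazy string to do behind the scenes.
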
>>
