Hi Khalil,

Did you try the GenBank XML format?

Mark

On Fri, Jun 17, 2011 at 9:21 AM, Khalil El Mazouari <[email protected]> wrote:

> Hi,
>
> exec time for parsing Genbank, EMBL and EMBL-XML is ± the same.
>
> However, writing sequences in EMBL format was 87% slower than in Genbank format.
>
> Regards,
>
> khalil
>
>
> On 17 Jun 2011, at 12:36, Martin Jones wrote:
>
> > Yes, this approach won't be much use if you are interested in the
> > contents of every genbank record.
> >
> > Have you thought about parsing the gb files in parallel? In my
> > experience, parsing genbank files scales quite nicely when done in
> > multiple threads. I have used the GPars library for this type of job
> > and it is very nice to use:
> >
> > http://gpars.codehaus.org/Parallelizer
> >
> > M
> >
> > On 17 June 2011 11:33, Khalil El Mazouari <[email protected]> wrote:
> >> Thanks Martin,
> >>
> >> I already tried the regex. The performance increase was < 10%.
> >>
> >> My situation is different in 2 points:
> >> 1. the info to extract from the genbank file is always present.
> >> 2. there are multiple features to extract from each record.
> >>
> >> I agree with you. Extracting a single field from a genbank file is done
> >> much faster with a simple regex than with FeatureFilter.
> >>
> >> Regards,
> >>
> >> khalil
> >>
> >> On 17 Jun 2011, at 12:12, Martin Jones wrote:
> >>
> >>> Hi,
> >>>
> >>> I have had the same issue when parsing large sets of genbank files. In
> >>> my case, the workaround was to first treat the whole genbank record as
> >>> a string, and do a quick regex match to check if it contained
> >>> something of interest (in my case I was searching for specific
> >>> taxids):
> >>>
> >>> // first do a quick pattern-match to extract the taxid so we can
> >>> // exit early without the overhead of parsing the whole file
> >>> private final Pattern taxidPattern =
> >>>     Pattern.compile("db_xref=\\\"taxon:(\\d+)");
> >>> Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
> >>> if (taxidMatcher.find()) {
> >>>     def taxid = taxidMatcher[0][1].toInteger()
> >>>     if (!taxidList.contains(taxid)) {
> >>>         return
> >>>     }
> >>> }
> >>> // here do the slow part of actually parsing all the features
> >>>
> >>> This is in Groovy so there are a few syntactical differences. If you
> >>> are only interested in a subset of the GenBank records, then this
> >>> approach might be of use.
> >>>
> >>> M
> >>>
> >>> On 17 June 2011 10:16, Khalil El Mazouari <[email protected]> wrote:
> >>>> Hi,
> >>>>
> >>>> I am developing an app where features are extracted from a large
> >>>> genbank file, and processed: multiple alignment, annotation....
> >>>>
> >>>> The feature extraction is a real bottleneck in my app. It consumes 87%
> >>>> of total execution time.
> >>>>
> >>>> Feature extraction is done via:
> >>>>
> >>>> FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
> >>>> FeatureHolder fh = richSequence.filter(ff);
> >>>> Feature feat = fh.features().next();
> >>>> ...
> >>>>
> >>>> Any suggestion on how to improve the performance of feature
> >>>> extraction is welcome.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> khalil
> >>>> _______________________________________________
> >>>> Biojava-l mailing list - [email protected]
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
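
For reference, the two ideas discussed in this thread (a cheap regex pre-filter on the raw record text, and processing records in parallel) might be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: each GenBank record is assumed to be available as a raw string already, the names `TaxidPrefilter`, `extractTaxid` and `selectRecords` are hypothetical, and `parallelStream()` is used as a stand-in for the GPars approach Martin mentions. The full BioJava parse would then run only on the records that survive the filter.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TaxidPrefilter {
    // Same idea as the pattern in the thread: capture the digits after db_xref="taxon:
    private static final Pattern TAXID = Pattern.compile("db_xref=\"taxon:(\\d+)");

    /** Returns the first taxon id found in the raw record text, if there is one. */
    static Optional<Integer> extractTaxid(String rawRecord) {
        Matcher m = TAXID.matcher(rawRecord);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1))) : Optional.empty();
    }

    /**
     * Keeps only the records whose taxid is in the wanted set.
     * Records are checked in parallel (assumption: checks are independent);
     * the collected list keeps the original order.
     */
    static List<String> selectRecords(List<String> rawRecords, Set<Integer> wanted) {
        return rawRecords.parallelStream()
                .filter(r -> extractTaxid(r).map(wanted::contains).orElse(false))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical, heavily shortened record texts for illustration only.
        List<String> records = List.of(
                "LOCUS A ... /db_xref=\"taxon:9606\" ...",
                "LOCUS B ... /db_xref=\"taxon:10090\" ...",
                "LOCUS C ... (no taxon annotation)");
        List<String> toParse = selectRecords(records, Set.of(9606));
        // Only the first record survives; the slow feature parsing would start here.
        toParse.forEach(System.out::println);
    }
}
```

As Khalil notes, a pre-filter like this helps mainly when only a subset of records is of interest; if every record has to be parsed fully, only the parallel part is likely to make a difference.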
