I will add that to the topic list!

Stef


On 22/2/15 22:15, Sven Van Caekenberghe wrote:
There are some cool ideas here:

https://github.com/BurntSushi/xsv

On 31 Jan 2015, at 14:26, stepharo <steph...@free.fr> wrote:

hernan

if you need some help, you can also find a smart student and ask ESUG to sponsor
him during a SummerTalk.

Stef

On 26/1/15 21:03, Hernán Morales Durand wrote:

2015-01-26 9:01 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
Hernán,

On 26 Jan 2015, at 08:00, Hernán Morales Durand <hernan.mora...@gmail.com> 
wrote:

It is possible :)
I work with DNA sequences, there could be millions of common SNPs in a genome.
Still weird for CSV. How many records are there then?

We genotyped a few individuals (24 records), but now we have a genotyping platform 
(GeneTitan) with array plates allowing up to 96 samples, which means up to 2.6 
million markers. The first run I completed generated CSVs of 1 million 
records (see attachment). Sadly, the high-level analysis of this data (annotation, 
clustering, discrimination) is currently performed in R with packages like 
SNPolisher.

And this is just microarray analysis; NGS platforms produce larger volumes of data 
in a shorter period of time (several genomes in a day). See 
http://www.slideshare.net/allenday/renaissance-in-medicine-strata-nosql-and-genomics
 for the 2014-2020 predictions.

Feel free to contact me if you want to experiment with metrics.

I assume they all have the same number of fields ?

Yes, I have never seen a CSV file with a variable number of fields (in this domain).

Anyway, could you point me to the specification of the format you want to read ?

Actually I am in no rush for this; I want to avoid awk, sed and shell scripts in 
the next run. I would also like to avoid Python, but it spreads like a virus.

I will be working mostly with CSVs from Axiom annotation files [1] and genotyping 
results. Other file formats I use are genotype formats for programs like PLINK 
[2] (PED files, column 7 onwards) and HaploView. It is worse than you might think, 
because you have to transpose the output generated by the genotyping platforms 
(millions of records), and then filter and cut it by chromosome, because those 
Java programs cannot deal with all chromosomes at the same time.
And a pointer to the older one that you used to use?


http://www.smalltalkhub.com/#!/~hernan/CSV

Cheers,
Hernán


[1] http://www.affymetrix.com/support/technical/annotationfilesmain.affx
[2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped


Thx,

Sven

Cheers,

Hernán


2015-01-26 3:33 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:

On 26 Jan 2015, at 06:32, Hernán Morales Durand <hernan.mora...@gmail.com> 
wrote:



2015-01-23 18:00 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:

On 23 Jan 2015, at 20:53, Hernán Morales Durand <hernan.mora...@gmail.com> 
wrote:

Hi Sven,

2015-01-23 16:06 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
Hi Hernán,

On 23 Jan 2015, at 19:50, Hernán Morales Durand <hernan.mora...@gmail.com> 
wrote:

I used to use a CSV parser from Squeak where I could attach conditional 
iterations:

csvParser rowsSkipFirst: 2 do: [ :row | "some action ignoring first 2 fields on each row" ].
csvParser rowsSkipLast: 2 do: [ :row | "some action ignoring last 2 fields on each row" ].
With NeoCSVReader you can describe how each field is read and converted; using 
the same mechanism you can ignore fields. Have a look at the senders of 
#addIgnoredField in the unit tests.


I am trying to understand the implementation; I see you included 
#addIgnoredFields: for consecutive fields in Neo-CSV-Core-SvenVanCaekenberghe.21.
A question about usage then: does adding ignored field(s) require adding field 
types for all the other remaining fields?
Yes, like this:

testReadWithIgnoredField
	| input |
	input := String crlf join: #( '1,2,a,3' '1,2,b,3' '1,2,c,3' '' ).
	self
		assert: ((NeoCSVReader on: input readStream)
				addIntegerField;
				addIntegerField;
				addIgnoredField;
				addIntegerField;
				upToEnd)
		equals: {
			#(1 2 3).
			#(1 2 3).
			#(1 2 3) }



Maybe you would like to know, in case you make a pass over NeoCSV: for some data 
sets I have 1 million columns, so something like an addFieldsInterval: would be nice.
1 million columns? How is that possible, or useful?

The reader is like a builder. You could try to do this yourself by writing a 
little loop or two.

But still, 1 million ?
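A sketch of what that "little loop or two" could look like, using the builder-style NeoCSVReader API shown earlier in the thread (#addIntegerField and #addIgnoredField appear above; addFieldsInterval: itself does not exist in NeoCSV, and `input` and `totalColumns` are assumed placeholders):

```smalltalk
"Keep only the columns in a given interval of a very wide CSV,
ignoring everything else. Purely illustrative sketch."
| reader interval |
reader := NeoCSVReader on: input readStream.
interval := 1000 to: 1010.
1 to: totalColumns do: [ :i |
	(interval includes: i)
		ifTrue: [ reader addIntegerField ]
		ifFalse: [ reader addIgnoredField ] ].
reader upToEnd
```

Since the reader is configured field by field, any selection policy (an interval, a collection of column indices, a predicate on the header) reduces to the same kind of loop.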

Thank you.

Hernán









