Hi,

For the input part you can build an input operation which uses a
ZipInputStream (from SharpZipLib) to read directly from the archive:

  using System.Collections.Generic;
  using System.IO;
  using ICSharpCode.SharpZipLib.Zip;
  using Rhino.Etl.Core;
  using Rhino.Etl.Core.Operations;

  public class ReadCompressedCSV : AbstractOperation
  {
      public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
      {
          using (var file = File.OpenRead(@"c:\temp\test.zip"))
          using (var zipInputStream = new ZipInputStream(file))
          {
              // assume the archive contains only one entry
              zipInputStream.GetNextEntry();
              using (var sr = new StreamReader(zipInputStream))
              {
                  string read;
                  while ((read = sr.ReadLine()) != null)
                  {
                      var split = read.Split(';');
                      var inputFormat = new InputFormat
                      {
                          Column1 = split[0],
                          Column2 = split[1]
                      };
                      yield return Row.FromObject(inputFormat);
                  }
              }
          }
      }
  }
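
Wiring that operation into a process then looks roughly like this
(WriteToDb is a placeholder for your own bulk-insert step, not a real
class):

```csharp
using Rhino.Etl.Core;

public class ImportPricesProcess : EtlProcess
{
    protected override void Initialize()
    {
        // operations run as a pipeline; each one's output
        // (IEnumerable<Row>) becomes the next one's input
        Register(new ReadCompressedCSV());
        Register(new WriteToDb()); // hypothetical: your SqlBulkCopy operation
    }
}

// then: new ImportPricesProcess().Execute();
```

So you don't return anything other than rows; you separate your steps by
registering one operation per step.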


Try to sort your input files first! Then the duplicate check is easy to
implement if you remember the last row.
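
For example, a minimal sketch of such an operation, assuming the rows
arrive already sorted and that Column1/Column2 form the key (adjust the
comparison to your real key columns):

```csharp
using System.Collections.Generic;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

public class RemoveDuplicates : AbstractOperation
{
    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        Row last = null;
        foreach (var row in rows)
        {
            // sorted input guarantees duplicates are adjacent,
            // so comparing with the previous row is enough
            if (last == null
                || !Equals(last["Column1"], row["Column1"])
                || !Equals(last["Column2"], row["Column2"]))
            {
                yield return row;
            }
            last = row;
        }
    }
}
```

Register it between your read and your bulk-insert operations; it runs in
O(n) and keeps only one row in memory, so it should not slow the pipeline
down noticeably.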



2013/11/28 Twist <[email protected]>

> Hi,
>
> I tried to look for a similar requirement but did not manage to find a
> suitable answer, so I am sorry in advance if I am re-asking the same
> questions as others :)
>
> I would like to use Rhino ETL for my file processing. I have around 200K
> zipped csv files, which correspond to daily prices for stocks, and have
> to
>
> 1 - unzip
> 2 - process
> 3 - validate (?)
> 4 - save in db
>
> In total there must be more than 400 million rows worth of data. Rhino ETL
> seems to be very suitable for this operation, as it allows me to apply the
> pipeline pattern "easily", but right now I don't understand how I can
> separate my steps, because from the examples I see left and right, the only
> type returned by an operation is an IEnumerable<Row>.
>
> I would also like to bulk insert the data. Right now I am using
> SqlBulkCopy, but some files contain duplicates, which bothers me. I would
> like to remove them before inserting, but whenever I add this duplicate
> existence check, it slows down the process too much.
>
> Right now my pipeline is composed of the steps I mentioned.
>
> Is it possible for me to return a stream or something other than Row?
>
> Again, sorry if this is a repeated subject; please do not hesitate to
> redirect me.
>
> Thanks!
>
> --
> You received this message because you are subscribed to the Google Groups
> "Rhino Tools Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/rhino-tools-dev.
> For more options, visit https://groups.google.com/groups/opt_out.
>

