Seems like a nice effort, generic ETLs are tough!

On Jan 20, 2011, at 10:25 PM, Stefan Urbanek wrote:

> Hi,
> 
> I am working on a framework called Brewery. The goal is to provide an
> abstract interface for streaming data from heterogeneous sources into
> heterogeneous targets. More information, with images, is here:
> 
> http://databrewery.org/doc/streams.html
> 
> The point is to have objects similar to file streams, but streaming
> structured data in the form of records/rows instead of bytes.
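> 
> To make the file-stream analogy concrete (illustrative pseudocode
> only, not the actual Brewery API):
> 
>     # a file stream yields bytes...
>     f = open("data.bin", "rb")
>     chunk = f.read(1024)
> 
>     # ...a data stream yields structured records instead
>     row = ["2011-01-20", "widgets", 10]                            # list form
>     record = {"date": "2011-01-20", "item": "widgets", "qty": 10}  # dict form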
> 
> STREAMS
> 
> Currently implemented sources/targets are:
> 
> * Relational database table through SQLAlchemy (source+target)
> * CSV file (source+target)
> * XLS file (source only)
> * MongoDB (source+target)
> * Google spreadsheet (source only)
> * directory with YAML files - one file per record (source+target)
> 
> For each source there are three basic members:
> 
> - fields - the list of fields provided by the source (it has to be
> explicitly set for sources with unknown fields)
> - rows() - an iterator over the data, with each row represented as a list
> - records() - an iterator over the data, with each record represented
> as a dict
> 
> Optionally you can use read_fields(limit) to learn which fields are
> present in a data source (for example, in MongoDB).
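> 
> A minimal usage sketch (the class name and the initialize/finalize
> calls are illustrative, not guaranteed to match the actual API - see
> the docs below for the exact names):
> 
>     import brewery.ds as ds
> 
>     # open a CSV source and set it up
>     src = ds.CSVDataSource("invoices.csv")
>     src.initialize()
> 
>     print(src.fields)             # fields provided by the source
> 
>     for record in src.records():  # dicts keyed by field name;
>         print(record["amount"])   # rows() would yield lists instead
> 
>     src.finalize()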
> 
> For each target there is a single method:
> 
> - append() - appends an object, either a dictionary or a list, to the
> target
> 
> With this simple interface you can easily create pipes between MongoDB
> and Postgres, import a directory of YAML files into MySQL, and so on.
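> 
> For example, a MongoDB-to-Postgres copy is just a loop (a sketch;
> class names and constructor parameters here are illustrative):
> 
>     import brewery.ds as ds
> 
>     src = ds.MongoDBDataSource(collection="expenses", database="data")
>     target = ds.SQLDataTarget(url="postgres://localhost/data",
>                               table="expenses")
> 
>     src.initialize()
>     target.fields = src.fields   # the target has to know the field list
>     target.initialize()
> 
>     for row in src.rows():
>         target.append(row)       # append() takes either a list or a dict
> 
>     target.finalize()
>     src.finalize()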
> 
> DATA QUALITY
> 
> In addition, there is a simple data auditing tool for basic data
> quality audits. You can use StreamAuditor (a stream target) to collect
> information about the data and then generate a data quality report.
> The currently audited data properties are:
> 
> * record and value counts (these might differ in document-based DBs,
> but are the same in relational ones)
> * null count
> * empty string count
> * distinct value count
> * distinct values
> * storage types (only one for relational databases)
> * ratios of measured properties, such as null/value count or null/
> record count
> 
> More probes to come (in a modular way).
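> 
> A sketch of an audit run - the auditor is used like any other stream
> target (the result attribute name is illustrative):
> 
>     import brewery.ds as ds
> 
>     src = ds.CSVDataSource("customers.csv")
>     auditor = ds.StreamAuditor()
> 
>     src.initialize()
>     auditor.fields = src.fields
>     auditor.initialize()
> 
>     for row in src.rows():
>         auditor.append(row)      # collects per-field statistics
> 
>     # per-field results: null counts, distinct values, ratios, ...
>     stats = auditor.field_statistics   # attribute name illustrative
> 
>     auditor.finalize()
>     src.finalize()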
> 
> The API is documented here:
> 
> http://databrewery.org/doc/api/index.html
> 
> Source code:
> 
> Bitbucket: https://bitbucket.org/Stiivi/brewery (the main, Mercurial
> repository)
> GitHub: https://github.com/Stiivi/brewery/ (synchronized with the main one)
> 
> Example usage: some source streams (XLS/CSV) are already being used
> by the data proxy in the CKAN project to convert data from various
> resources into a common structured form:
> 
>    http://blog.ckan.org/2011/01/11/raw-data-in-ckan-resources-and-data-proxy/
> 
> FUTURE
> 
> Plans for the future are:
> 
> * command-line tools for simple data streaming tasks: copy, quality
> audit
> * data processing stream network with nodes for simple
> transformations, analysis and data mining
> * modular data quality probes - injectable into the network
> 
> The Brewery project is at an early stage and I would like some
> feedback: what do you think about it? Do you have any suggestions or
> comments? If anyone would like to try it and runs into any trouble,
> just drop me a line and I will help.
> 
> Regards,
> 
> Stefan Urbanek
> --
> Twitter: @Stiivi
> 
> 
