Seems like a nice effort; generic ETLs are tough!
On Jan 20, 2011, at 10:25 PM, Stefan Urbanek wrote:

> Hi,
>
> I am working on a framework called Brewery. The goal is to provide an
> abstract interface for data streams from heterogeneous sources into
> heterogeneous targets. More information, with images:
>
> http://databrewery.org/doc/streams.html
>
> The point is to have objects similar to file streams, but streaming
> structured data in the form of records/rows instead of bytes.
>
> STREAMS
>
> Currently implemented sources/targets are:
>
> * relational database table through SQLAlchemy (source + target)
> * CSV file (source + target)
> * XLS file (source only)
> * MongoDB (source + target)
> * Google spreadsheet (source only)
> * directory with YAML files, one file per record (source + target)
>
> Each source provides three basic methods:
>
> - fields - list of fields provided by the source (has to be set
>   explicitly for sources with unknown fields)
> - rows() - iterator over data represented as lists
> - records() - iterator over data represented as dict objects
>
> Optionally you can use read_fields(limit) to learn which fields are
> present in a data source (for example, in MongoDB).
>
> Each target provides:
>
> - append() - append an object, either a dictionary or a list, to the
>   target
>
> With this simple interface you can easily create pipes between MongoDB
> and Postgres, import a directory of YAML files into MySQL, and so on.
>
> DATA QUALITY
>
> In addition, there is a simple tool for basic data quality auditing.
> You can use StreamAuditor (a stream target) to collect information
> about data and then generate a data quality report. Currently audited
> data properties are:
>
> * record and value count (these may differ in document-based
>   databases; they are the same in relational ones)
> * null count
> * empty string count
> * distinct value count
> * distinct values
> * storage types (only one for relational databases)
> * ratios of the measured properties, such as null/value count or
>   null/record count
>
> More probes to come (in a modular way).
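To make sure I follow the interface: here is a rough sketch of how I read the source/target protocol quoted above. These are NOT the actual Brewery classes (CSVSource, ListTarget, AuditorTarget and pipe() are names I made up); it is just an illustration that a source exposes `fields`, `rows()` and `records()`, a target exposes `append()`, and a pipe is a loop feeding one into the other, so an auditor is just another target.

```python
import csv
import io

class CSVSource:
    """Sketch of a source backed by CSV text; the first row supplies the
    field names (hypothetical class, not part of Brewery)."""
    def __init__(self, text):
        rows = list(csv.reader(io.StringIO(text)))
        self.fields = rows[0]   # list of field names, as described
        self._data = rows[1:]

    def rows(self):
        # iterator over records represented as lists, in field order
        return iter(self._data)

    def records(self):
        # iterator over records represented as dicts keyed by field name
        return (dict(zip(self.fields, row)) for row in self._data)

class ListTarget:
    """Sketch of a target that just collects appended records."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

class AuditorTarget:
    """Target in the spirit of StreamAuditor (field layout of the stats
    is my guess): accumulates per-field statistics instead of storing
    the records themselves."""
    def __init__(self, fields):
        self.record_count = 0
        self.stats = {f: {"nulls": 0, "empty": 0, "distinct": set()}
                      for f in fields}

    def append(self, record):
        self.record_count += 1
        for field, value in record.items():
            s = self.stats[field]
            if value is None:
                s["nulls"] += 1
            elif value == "":
                s["empty"] += 1
            s["distinct"].add(value)

def pipe(source, target):
    """Copy every record from source to target."""
    for record in source.records():
        target.append(record)

source = CSVSource("name,amount\nwidget,10\ngadget,\n")
copy, audit = ListTarget(), AuditorTarget(source.fields)

# Both targets speak the same append() protocol, so the same pipe
# works for copying and for auditing:
pipe(source, copy)
pipe(source, audit)
```

If that is roughly the idea, it is a very pleasant protocol: any new backend only has to implement one method on the target side.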
> The API is documented here:
>
> http://databrewery.org/doc/api/index.html
>
> Sources:
>
> Bitbucket: https://bitbucket.org/Stiivi/brewery (main Mercurial
> repository)
> GitHub: https://github.com/Stiivi/brewery/ (synchronized with main)
>
> Example usage: some source streams (XLS/CSV) are already being used
> for the data proxy in the CKAN project, converting data from various
> resources into a common structured form:
>
> http://blog.ckan.org/2011/01/11/raw-data-in-ckan-resources-and-data-proxy/
>
> FUTURE
>
> Plans for the future are:
>
> * command-line tools for simple data streaming tasks: copy, quality
>   audit
> * a data processing stream network with nodes for simple
>   transformations, analysis and data mining
> * modular data quality probes, injectable into the network
>
> The Brewery project is at an early stage. I would like to have some
> feedback: what do you think about it? Do you have any suggestions or
> comments? If anyone would like to try it and runs into any trouble,
> just drop me a line and I will help.
>
> Regards,
>
> Stefan Urbanek
> --
> Twitter: @Stiivi

--
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com.
To unsubscribe from this group, send email to sqlalchemy+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en.