Ah, forgot to add my footnote to the dirspec - we all know the link, but in any case:
[1]: https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt

This was in the context of discussing which fields from section 2.1 to include.

On Tue, Jun 11, 2013 at 12:34 AM, Kostas Jakeliunas <kos...@jakeliunas.com> wrote:

>>> Here, I think it is realistic to try to use and import all the fields
>>> available from metrics-db-*. My PoC is overly simplistic in this
>>> regard: only relay descriptors, and only a limited subset of data
>>> fields is used in the schema for the import.
>>
>> I'm not entirely sure which fields that would include. Two options come
>> to mind...
>>
>> * Include just the fields that we need. This would require us to
>> update the schema and perform another backfill whenever we need
>> something new. I don't consider this 'frequent backfill' requirement
>> to be a bad thing, though - it would force us to make it extremely
>> easy to spin up a new instance, which is a very nice attribute to have.
>>
>> * Make the backend a more-or-less complete data store of descriptor
>> data. This would mean schema updates whenever there's a dir-spec
>> addition [1]. An advantage of this is that the ORM could provide us
>> with stem Descriptor instances [2]. For high-traffic applications,
>> though, we'd probably still want to query the backend directly, since
>> we usually won't care about most descriptor attributes.
>
> In truth, I'm not sure here either. I agree that it basically boils down
> to one of the two aforementioned options, and I'm okay with either. I'd
> like, however, to see how well the db import scales if we were to import
> all relay descriptor fields. There aren't a lot of them (dir-spec [1]),
> if we don't count extra-info of course and only deal with the router
> descriptor format (section 2.1). So I think I should try working with
> those fields and see whether the import goes well and quickly enough.
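[Editor's note: a minimal sketch of what "importing the router descriptor fields" amounts to, using the `router` and `bandwidth` line formats from dir-spec section 2.1. The sample descriptor text and the field names chosen for the schema dict are illustrative, not taken from the actual importer.]

```python
# Pull a few indexable fields out of raw router descriptor text
# (dir-spec section 2.1). The sample text below is made up.

SAMPLE = """\
router ExampleRelay 203.0.113.5 9001 0 9030
platform Tor 0.2.4.12-alpha on Linux
bandwidth 1073741824 1073741824 55832
"""

def extract_fields(raw):
    """Return a dict of the descriptor fields our schema would cover."""
    fields = {}
    for line in raw.splitlines():
        parts = line.split()
        if not parts:
            continue
        keyword, args = parts[0], parts[1:]
        if keyword == 'router':
            # "router" nickname address ORPort SOCKSPort DirPort
            fields.update(nickname=args[0], address=args[1],
                          or_port=int(args[2]), socks_port=int(args[3]),
                          dir_port=int(args[4]))
        elif keyword == 'bandwidth':
            # "bandwidth" bandwidth-avg bandwidth-burst bandwidth-observed
            fields.update(bandwidth_avg=int(args[0]),
                          bandwidth_burst=int(args[1]),
                          bandwidth_observed=int(args[2]))
    return fields

print(extract_fields(SAMPLE))
```

In practice stem's parsers would do this work; the point is just that the set of keyword lines in section 2.1 is small enough that a full-field schema is plausible.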
> I plan to do simple Python timeit / timing report macros that can be
> attached to / detached from functions easily; that would be a simple and
> clean way to measure things.
>
>> [...] An advantage of [a more-or-less complete data store of descriptor
>> data] is that the ORM could provide us with stem Descriptor instances
>> [2]. For high-traffic applications, though, we'd probably still want to
>> query the backend directly, since we usually won't care about most
>> descriptor attributes.
>
> I can try experimenting with this later on (when we have the full /
> needed importer working, e.g.), but it might indeed be difficult to
> scale (not sure, of course). Do you have any specific use cases in mind?
> (Actually curious - could be interesting to hear.) The [2] footnote is
> noted; I'll think about it.
>
>>> The idea would be to import all data as DB fields (so, indexable), but
>>> it makes sense to also import the raw text lines to be able to, e.g.,
>>> supply the frontend application with raw data if needed, as the
>>> current tools do. But I think this could be made a separate table,
>>> with descriptor id as primary key, which means it can be done later on
>>> if need be and would not cause a problem. I guess there's no need to
>>> do this right now.
>>
>> I like this idea. A couple of advantages that this could provide us are...
>>
>> * The importer can provide warnings when our present schema is out of
>> sync with stem's Descriptor attributes (i.e. there has been a new
>> dir-spec addition).
>>
>> * After making the schema update, the importer could then run over this
>> raw data table, constructing Descriptor instances from it and
>> performing updates for any missing attributes.
>
> The 'schema/format mismatch report' sounds like a really good idea!
> Certainly if we are to aim for Onionoo compatibility / eventual
> replacement, but in any case this seems like a very useful thing for the
> future.
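[Editor's note: the attachable/detachable timing macros mentioned above could be a plain decorator along these lines - a sketch, not code from the actual importer; `import_descriptor` is a hypothetical stand-in.]

```python
import functools
import time

def timed(fn):
    """Attachable timing wrapper: records wall-clock duration per call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.timings.append(time.perf_counter() - start)
    wrapper.timings = []  # inspect this for a timing report
    return wrapper

@timed
def import_descriptor(desc):
    # stand-in for the real per-descriptor import work
    return len(desc)

import_descriptor('router ExampleRelay ...')
print(len(import_descriptor.timings), 'call(s) timed')
# Detach by restoring the original: functools.wraps stores it
# on wrapper.__wrapped__, so `import_descriptor.__wrapped__` is
# the undecorated function.
```

Because the wrapper is applied per function, it can be added or dropped without touching the import logic itself.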
> I will keep this in mind for the near future / database importer
> rewrite.
>
>> * After making the schema update, the importer could then run over this
>> raw data table, constructing Descriptor instances from it and
>> performing updates for any missing attributes.
>
> I can't say I can easily see the specifics of how all this would work,
> but if we had an always-up-to-date data model (mediated by the stem
> relay descriptor class, but not necessarily), this might work. (The
> ORM <-> stem Descriptor object mapping itself is trivial, so all is well
> in that regard.)
>
> On Wed, May 29, 2013 at 5:49 PM, Damian Johnson <ata...@torproject.org> wrote:
>
>>> Here, I think it is realistic to try to use and import all the fields
>>> available from metrics-db-*. My PoC is overly simplistic in this
>>> regard: only relay descriptors, and only a limited subset of data
>>> fields is used in the schema for the import.
>>
>> I'm not entirely sure which fields that would include. Two options come
>> to mind...
>>
>> * Include just the fields that we need. This would require us to
>> update the schema and perform another backfill whenever we need
>> something new. I don't consider this 'frequent backfill' requirement
>> to be a bad thing, though - it would force us to make it extremely
>> easy to spin up a new instance, which is a very nice attribute to have.
>>
>> * Make the backend a more-or-less complete data store of descriptor
>> data. This would mean schema updates whenever there's a dir-spec
>> addition [1]. An advantage of this is that the ORM could provide us
>> with stem Descriptor instances [2]. For high-traffic applications,
>> though, we'd probably still want to query the backend directly, since
>> we usually won't care about most descriptor attributes.
>>
>>> The idea would be to import all data as DB fields (so, indexable), but
>>> it makes sense to also import the raw text lines to be able to, e.g.,
>>> supply the frontend application with raw data if needed, as the
>>> current tools do. But I think this could be made a separate table,
>>> with descriptor id as primary key, which means it can be done later on
>>> if need be and would not cause a problem. I guess there's no need to
>>> do this right now.
>>
>> I like this idea. A couple of advantages that this could provide us are...
>>
>> * The importer can provide warnings when our present schema is out of
>> sync with stem's Descriptor attributes (i.e. there has been a new
>> dir-spec addition).
>>
>> * After making the schema update, the importer could then run over this
>> raw data table, constructing Descriptor instances from it and
>> performing updates for any missing attributes.
>>
>> Cheers! -Damian
>>
>> [1] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt
>> [2] This might be a no-go. Stem Descriptor instances are constructed
>> from the raw descriptor content, and need it for str(), get_bytes(),
>> and signature validation. If we don't care about those, we can subclass
>> Descriptor and override those methods.
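[Editor's note: the 'schema/format mismatch report' discussed in the thread could be as simple as a set difference - a sketch only. Both name sets are illustrative stand-ins: in practice the schema columns would come from the ORM table metadata and the attribute names from a parsed stem Descriptor (e.g. via vars()).]

```python
# Warn when the importer's schema no longer covers the attributes
# that a parsed descriptor exposes (i.e. a new dir-spec addition).

SCHEMA_COLUMNS = {'nickname', 'address', 'or_port', 'bandwidth_observed'}

def schema_mismatch(descriptor_attrs, schema_columns=SCHEMA_COLUMNS):
    """Return descriptor attributes that are absent from the schema."""
    return sorted(set(descriptor_attrs) - schema_columns)

missing = schema_mismatch({'nickname', 'address', 'or_port',
                           'bandwidth_observed', 'ntor_onion_key'})
for name in missing:
    print('warning: descriptor attribute %r not in schema' % name)
```

Run against the raw-data table after a schema update, the same difference tells the importer which columns still need backfilling.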
_______________________________________________
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev