Arek Kasprzyk wrote:
On 23 Aug 2006, at 15:48, Tom Oinn wrote:
Damian Smedley wrote:
How about a clear policy as to what forms of access are legal - a
sensible service interface suggests that bulk querying is legitimate
surely?
I don't want to ban people from doing bulk querying, though putting all the
IDs into one query is obviously much more efficient.
I'm not sure it's always equivalent, though? I agree it's a problem;
obviously if the server's going down you need to do something to
resolve that, but hopefully we can work out some kind of best practice
/ code change in Taverna to help out as well.
David Withers is our developer for the BioMart side of things. I now
know relatively little of how it works internally, but I believe he's
on the list as well :)
Cheers,
Tom
ok, some more details about this problem. I hope we can work this out
together, as we do not want to ban anybody from doing anything but
simply to optimize the access so it works optimally for Taverna as
well as for us.
(apologies for the massive cross-posting, but I'm not sure which list all
the relevant people are subscribed to :))
please feel free to redirect or narrow down this discussion, or even
reject it if you do not recognize the Taverna request pattern :)
This should be going to taverna-hackers; everyone appropriate is on
there I think. Add mart-dev if there are people at your end who need to see it.
ok, here it goes:
The BioMart central server went down twice after a series of over 100,000
requests coming from a single source over a relatively short period of
time. After analyzing the access logs and contacting the people who
were firing those requests, it seems that they originated from
Taverna workflows.
The requests came in the following pattern:
<snip>
After further analysis of the logs, it seems that those users wanted
sequences for ~300 Ensembl transcripts. This in itself is a perfectly
valid and sensible use case.
However, what is unclear to me is why it is necessary to request each
sequence individually and, more importantly, why for each query the
software (Taverna?) needs to undergo a full configuration (as above).
Surely this could be done once and then be followed either by individual
queries if necessary or, better still, by fewer queries making requests
in batches. This is normally a lightweight and sensible request when done
properly. For comparison, I enclose below an example of exactly the same
usage sent as a single query, together with a small Perl script which
quickly and harmlessly retrieves it from our web service, so you can run
and compare.
In this case that sounds plausible; the time when you want to run one
query per identifier is when you're getting more than one result
returned and want to maintain the mapping from input to output. This
could be done by altering the workflow as well, but the most obvious way
to use the BioMart process within Taverna will tend to make lots of
distinct queries.
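The input-to-output mapping concern above can be handled while still batching, provided the batched result echoes the query ID on each row. A minimal sketch, assuming a tab-separated result whose first column is the input ID (the column layout and IDs here are invented):

```python
from collections import defaultdict

def group_by_input_id(rows):
    """Group tab-separated result rows by their first column (the query ID),
    preserving the input-to-output mapping even when one input ID yields
    several result rows."""
    mapping = defaultdict(list)
    for row in rows:
        transcript_id, value = row.split("\t", 1)
        mapping[transcript_id].append(value)
    return dict(mapping)

# One batched query, many rows back; the mapping is rebuilt afterwards.
rows = [
    "ENST_A\tresult1",
    "ENST_A\tresult2",
    "ENST_B\tresult3",
]
grouped = group_by_input_id(rows)
```

With this, a workflow can issue one batched query and still iterate per-input downstream, which was the reason for the one-query-per-identifier pattern.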
The issue with retrieval of dataset configs has, I think, been fixed in
CVS, but David can confirm or deny that. That should massively reduce the
number of queries once we deploy the new code.
Cheers,
Tom