On Tue, Jul 23, 2019 at 1:23 PM Stas Malyshev <smalys...@wikimedia.org>
wrote:

> Hi!
>
> > Will this be approachable: will my 2-hour query actually, finally
> > return results into my 1 GB csv.zip file?
>
> Not sure about 2 hours, as again it'd be a service open to a wide
> community, and time is the most limited resource of all - once a
> 2-hour query is running, the resource serving it is consumed for
> 2 hours and not available to anybody else. Even with batching, we
> only have 24 hours per day, in which we'd be able to run only 12 such
> queries (well, parallelism exists, but let's not complicate it too much
> for the sake of example), and then the 13th person would have to wait the
> whole day for their query to even be run. Without some limit you'd have
> to book it months in advance like a posh restaurant :) Of course, it's a
> consideration of resources available and demand for such queries, so
> we'd have to see what the precise limit is when we get there. Maybe
> there aren't 13 people who want to run such queries and we'd be ok.
>
>
Was thinking the same thing.  I wouldn't actually run a 2-hour query, just
saying.  In practice, I'd spend a day or two downloading the data dump.

> Also, with live updates, long queries create other technical challenges
> (if a query runs for 2 hours, the database basically has to keep the
> snapshot it runs on for 2 hours, which may make it much less efficient).
> We could of course have a database without live updates, but updating it
> would then be a bit tricky, as loading a full dump takes a week now and
> catching up for that week takes even more time (hello, Achilles, hello,
> Tortoise). We're working on improving those, but for now 2-hour queries
> may be poorly compatible with both the resources we have and the model
> we have. Shorter queries, though, may definitely be possible - we'd need
> to find the boundary that is safe given the current resources.
>
>
Yep, agreed.  It's a balancing act; we do the same in the enterprise,
where even extremely large companies still have budgets.  But the CEO and
his reports come first, yeah? :)

Thanks for the explanations, Stas; they confirm my assumptions there.
Let's continue to focus on the 80% of common user queries, and for the 20%
like my special cases, point users to the data dumps and say "roll
your own, kid, and have fun while doing it!"

Thad
https://www.linkedin.com/in/thadguidry/
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
