On Tue, Jul 23, 2019 at 1:23 PM Stas Malyshev <smalys...@wikimedia.org> wrote:
> Hi!
>
> > Will this be approachable: My 2 hour query will actually finally
> > return results into my 1gig csv.zip file?
>
> Not sure about 2 hours, as again it'd be a service that would be open to
> a wide community, and time is the most limited resource of all - once
> 2-hour query is running, that means the resource to serve it is consumed
> for 2 hours and not available to anybody else. Even with batching, we
> only have 24 hours per day which we'd be able to run only 12 such
> queries (well, parallelism exists, but let's not complicate it too much
> for the sake of example) and then the 13th person would have to wait the
> whole day for their query to be even run. Without some limit you'd have
> to book it months in advance like a posh restaurant :) Of course, it's a
> consideration of resources available and demand for such queries, so
> we'd have to see what the precise limit is when we get there. Maybe
> there are no 13 people to run such queries and we'd be ok.
>

Was thinking the same thing. I wouldn't form a 2 hour query, just saying. In actuality, I'd spend a day or two downloading the data dump.

> Also, with live updates, long queries create other technical challenges
> (if query is running for 2 hours, the database has basically to keep the
> snapshot it runs on for 2 hours, which may make it much less efficient).
> We could of course have non-live-updates database, but updating it then
> would be a bit tricky as loading full dump takes a week now and catching
> up for that week takes even more time (hello, Achilles, hello,
> Tortoise). We're working on improving those, but for now 2 hour queries
> may be poorly compatible with both resources we have and the model we
> have. Shorter queries though may definitely be possible - we'd need to
> find the boundary that is safe given the current resources.
>

Yep, agreed. It's a balancing act. We do that even in the enterprise, where even extremely large companies still have budgets.
But the CEO and his reports come first, yah? :)

Thanks for the explanations, Stas; they confirm my assumptions there. Let's continue to focus on the 80% of common user queries, and for the other 20%, like my special cases, point users to the data dumps and say "roll your own, kid, and have fun while doing it!"

Thad
https://www.linkedin.com/in/thadguidry/
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata