Antonio-RiveroMartnez commented on PR #27386: URL: https://github.com/apache/superset/pull/27386#issuecomment-1977412955
> > > > time_offsets joins DataFrames, which is not where you would need to apply the join, would you? You need the join to happen at query pre-execution level, not when you get the results of individual queries. `time_offset` works when you have the total data at your disposal, for example when you're working with other vizs, but not, for example, when working with paginated data like in tables.
> > >
> > > Okay, let's narrow the scope of the discussion, I'm not sure what join you are talking about. The goal of this PR is to do one thing: build the comparison time range in the frontend rather than the backend. So, let's keep the discussion around this PR.
> >
> > Yes, totally fine with that, just giving you context of why/when the query object needs to identify something "new" for time comparisons, because again, you are not always able to compare using multiple queries sent from the frontend. Meaning the backend would be generating those ranges too, so that's why I'm not worried, but just pointing out that some of the code deleted here would actually be required to handle those cases, and that instead of having this in two places, "front" and "back", I would rather have it in just one place.
> >
> > If it's easier that way, we can move ahead with these changes the moment CI passes, and when I'm adjusting the other PR I will point out and reintroduce the necessary code and we can discuss there.
> >
> > Thanks as always.
>
> Thanks for your reply, I hope you could provide some examples of why/when/what the exact `JOIN` is in your [PR](https://github.com/apache/superset/pull/27355), but maybe the current PR is not a good place to talk about `join`.
>
> Theoretically, the frontend- and backend-built time ranges are exactly the same because all the time ranges will be parsed by the pyparser.
>
> I think we might need to think of a new way to address code like this https://github.com/apache/superset/pull/27355/files#diff-af5e15a55ff9c81f441eb6b9b10dae13b2ee17fec614d3e3ad0057ed0dc814aeR1434, I don't think such code is maintainable, but we can discuss that in your PR.

I tagged you in that other PR already if you want to discuss the `JOIN`. But again, IMHO the backend would still need to build these ranges because, as I said, it's not always a good idea (or even valid) to have separate `queryObjects` to build the comparison. Meaning, the "frontend" doing it like in here only addresses one type of comparison scenario, while having it on the backend might serve all scenarios, or at least ease reusability when needed.

TL;DR: the code you're removing here will be needed and reintroduced again very shortly, so I see no point in removing it in the first place.

I'm going to try to explain why building the time range in the frontend is only "good" for one type of comparison. Imagine your queries are like this:

```
Query A: a metric from 1900-01-01 to 1950-01-01
Results: 1000

Query B: a metric from 1950-01-01 to 2000-01-01
Results: 100
```

Here having the frontend send two queries is OK and totally fine, like you have in this PR.

Now, what's the limitation of this approach? Let's imagine another type of query for another type of viz, for example the Table chart. Same exercise as before:

```
Query A: the users that have bought the most items, along with the number of items,
         from 1900-01-01 to 1950-01-01, ordered by # items bought desc
Results: User1: 100, User2: 90, User3: 80

Query B: the users that have bought the most items, along with the number of items,
         from 1950-01-01 to 2000-01-01, ordered by # items bought desc
Results: User4: 95, User1: 70, User2: 65
```

In that scenario it's also fine to build two queries in the frontend with individual time ranges and process each queryObject individually to run the metrics, etc. Again, just fine with your PR here.

Limitation? You can paginate a result set in the Table chart (or any other chart that does the same, or would do the same later on). If you paginate using, again, just the frontend with isolated queries like you have in this PR, you would get something like:

```
Query A: the users that have bought the most items, along with the number of items,
         from 1900-01-01 to 1950-01-01, ordered by # items bought desc, LIMIT 2
Results: User1: 100, User2: 90

Query B: the users that have bought the most items, along with the number of items,
         from 1950-01-01 to 2000-01-01, ordered by # items bought desc, LIMIT 2
Results: User4: 95, User1: 70
```

When creating the comparison metrics you would assume "User2 in query A has 90 items and is not present in the result set of query B, so it's 0", which is, as you can imagine, invalid.
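To make that concrete, here is a minimal pandas sketch (made-up data and column names, not Superset code) of what a post-execution, DataFrame-level join over the two paginated result sets, which is essentially what a `time_offsets`-style merge does, would produce:

```python
# Minimal sketch with hypothetical data: merging the two LIMIT 2 pages
# after the queries ran, the way a DataFrame-level join would.
import pandas as pd

page_a = pd.DataFrame({"user": ["User1", "User2"], "items": [100, 90]})  # 1900-1950, LIMIT 2
page_b = pd.DataFrame({"user": ["User4", "User1"], "items": [95, 70]})   # 1950-2000, LIMIT 2

merged = page_a.merge(page_b, on="user", how="left", suffixes=("", "_prev"))
merged["items_prev"] = merged["items_prev"].fillna(0)
print(merged)
#     user  items  items_prev
# 0  User1    100        70.0
# 1  User2     90         0.0   <- wrong: User2 really has 65 in the second range,
#                                  it just fell outside that query's page
```

No amount of post-processing can recover User2's real value here, because the row was never fetched; that's why the join has to happen before the LIMIT is applied.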
So, how to solve that? We could produce the queryObject needed in the backend itself (with the `JOIN` we can discuss in the other PR) and let the backend build the time ranges needed when constructing the queries; there's a rough sketch of that idea at the bottom of this comment.

But yet again, why not just two queries from the frontend with different ranges?

1. It's redundant (as per the definition of redundant) to send two queryObjects when we can produce both with just one. My mistake, because I used Mixed Time-Series as the base when building the BigNumber with time comparison, so I introduced that pattern there.
2. Data might have changed between the execution times of the two queries, which may leave you with inconsistent results.

Hopefully this little exercise helped to clarify my point of view on why we should NOT bring that time_range building to the frontend only (if at all), and also helps to show why time_offsets are not usable here either. I just want to be clear about the limitation and the fact that solving the comparison for good, like in the other PR, would require the backend to process these ranges, which is at most adding an optional field in the extras or similar in the queryObject. But if others are happy with this, we can move forward acknowledging what I stated here.

Thanks as always @zhaoyongjie for the help and for helping to bring more robust solutions whenever possible.
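P.S. For illustration only, this is roughly the shape of the single backend-built comparison query I mean. The schema, table, and column names are hypothetical, written as plain SQL in a Python string; it is not the exact query the other PR generates:

```python
# Rough sketch, hypothetical schema: one query where the previous-period values
# are joined in BEFORE pagination, so every returned row carries its comparison.
COMPARISON_SQL = """
WITH current_period AS (
    SELECT user_id, SUM(items) AS items
    FROM purchases
    WHERE purchased_at >= '1950-01-01' AND purchased_at < '2000-01-01'
    GROUP BY user_id
),
previous_period AS (
    SELECT user_id, SUM(items) AS items
    FROM purchases
    WHERE purchased_at >= '1900-01-01' AND purchased_at < '1950-01-01'
    GROUP BY user_id
)
SELECT cur.user_id,
       cur.items               AS items,
       COALESCE(prev.items, 0) AS items_prev
FROM current_period cur
LEFT JOIN previous_period prev ON prev.user_id = cur.user_id
ORDER BY cur.items DESC
LIMIT 2 OFFSET 0  -- pagination applied after the join, so every page stays consistent
"""
```

Because the LIMIT is applied after the join, every user on the requested page already carries its previous-period value, which is exactly what two isolated frontend queries cannot guarantee.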