timsaucer commented on issue #16821:
URL: https://github.com/apache/datafusion/issues/16821#issuecomment-3161406074

   > I have a problem I'd love to solve but I'm not exactly sure how to go 
about it. My issue is I need to do a join across a time axis, where an event in 
the past has a corresponding event between two dates in the future, and where 
field A is identical between the two events and field B is within some set of 
values in the past and within another set of values in the future. I believe if 
I partitioned on field A and ordered by date then I could do the self-join 
manually with far more efficiency than a more generic self-join.
   
   This sounds very interesting. Can we make it a concrete example? I think I'm 
missing part of what the output would look like.
   
   Suppose I had this data frame:
   
   ```
   +------------+------+-------+---------+
   | event      | time | price | acct_nr |
   +------------+------+-------+---------+
   | purchase-1 | 1    | 90.0  | 429     |
   | sale-2     | 2    | 135.0 | 184     |
   | sale-3     | 3    | 150.0 | 129     |
   | purchase-1 | 4    | 100.0 | 584     |
   | sale-2     | 5    | 125.0 | 231     |
   +------------+------+-------+---------+
   ```
   
   Suppose I then did the self-join you're talking about, where `event` is the
   common field A you describe, and I keep only the cases where the price goes
   up from the early time to the late time. This would yield:
   
   ```
   +------------+------------+-------------+---------------+-----------+------------+--------------+
   | event      | early_time | early_price | early_acct_nr | late_time | late_price | late_acct_nr |
   +------------+------------+-------------+---------------+-----------+------------+--------------+
   | purchase-1 | 1          | 90.0        | 429           | 4         | 100.0      | 584          |
   +------------+------------+-------------+---------------+-----------+------------+--------------+
   ```
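
To make the join condition explicit, here is a minimal sketch of that self-join written as plain SQL. It runs against Python's built-in `sqlite3` module rather than DataFusion, purely so the query is easy to try. The table name `events` and the aliases `e` and `l` are illustrative, and the predicate (same `event`, later `time`, higher `price`) is my reading of the request, not a confirmed specification.

```python
import sqlite3

# Sample rows from the data frame above: (event, time, price, acct_nr).
rows = [
    ("purchase-1", 1, 90.0, 429),
    ("sale-2", 2, 135.0, 184),
    ("sale-3", 3, 150.0, 129),
    ("purchase-1", 4, 100.0, 584),
    ("sale-2", 5, 125.0, 231),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event TEXT, time INTEGER, price REAL, acct_nr INTEGER)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Generic self-join: same event, strictly later time, strictly higher price.
query = """
SELECT e.event,
       e.time AS early_time, e.price AS early_price, e.acct_nr AS early_acct_nr,
       l.time AS late_time,  l.price AS late_price,  l.acct_nr AS late_acct_nr
FROM events AS e
JOIN events AS l
  ON e.event = l.event
 AND e.time < l.time
 AND e.price < l.price
ORDER BY e.event, e.time, l.time
"""
result = conn.execute(query).fetchall()
print(result)
```

The `sale-2` pair is dropped because its price falls from 135.0 to 125.0, and `sale-3` has no partner row at all.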
   
   I added an extra piece of data because I didn't know what all the self-join
   would entail. Do you want something that ends up sending out only a subset
   of the data? If you have a real-world use case that is more compelling, that
   would be helpful.
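
For comparison, the "partition on field A, order by date" idea from the quoted question could look roughly like the sketch below. This is pure Python over the sample rows, not DataFusion code. The function name `partitioned_self_join` is hypothetical, and the price comparison stands in for the field B conditions, which have not been spelled out yet.

```python
from collections import defaultdict

# Sample rows from the data frame above: (event, time, price, acct_nr).
rows = [
    ("purchase-1", 1, 90.0, 429),
    ("sale-2", 2, 135.0, 184),
    ("sale-3", 3, 150.0, 129),
    ("purchase-1", 4, 100.0, 584),
    ("sale-2", 5, 125.0, 231),
]


def partitioned_self_join(rows):
    """Pair each row with later rows of the same event whose price is higher."""
    # Partition on the shared key (the `event` column, i.e. "field A").
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)

    out = []
    for event in sorted(partitions):
        # Order each partition by time so later rows always follow earlier ones.
        part = sorted(partitions[event], key=lambda r: r[1])
        for i, early in enumerate(part):
            for late in part[i + 1:]:
                # The strict time check guards against rows with equal timestamps.
                if late[1] > early[1] and late[2] > early[2]:
                    out.append(
                        (event, early[1], early[2], early[3],
                         late[1], late[2], late[3])
                    )
    return out


pairs = partitioned_self_join(rows)
print(pairs)
```

Rows are only ever compared within their own partition, so the work grows with the partition sizes rather than with the full table. With a bounded time window, the sorted order would also let the inner loop stop early.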



