RE: Use stream result like a query (alternative to innerJoin)
Fetch would work for my specific case (since I’m working with IDs, there is no one-to-many relationship) if I were able to restrict fetch’s target domain with a query. I would first get all possible deleted IDs, then use fetch against the items collection. But the current fetch implementation would find all deleted items, not something like “deleted items with these names” or “deleted items within this time range”.

I came upon your video while researching this: https://www.youtube.com/watch?v=kTNe3TaqFvo

I’m trying to use the “let” expression to feed one stream’s result to another as a query, using the string concat function and the eval stream. So far I haven’t been able to write a working example, but it’s an idea I’m playing with.

From: Joel Bernstein
Sent: 23 November 2020 23:23
To: solr-user@lucene.apache.org
Subject: Re: Use stream result like a query (alternative to innerJoin)

H
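The "deleted IDs first, then a restricted query on items" idea above can also be approximated from the client side, outside of streaming expressions. The following is only a sketch under stated assumptions: the IDs have already been read from deletedItems, the restriction uses Solr's terms query parser (`{!terms f=id}...`), and the extra filter `name:a`, the batch size of 2, and the use of `qt="/export"` (which needs docValues on the listed fields) are illustrative choices, not something confirmed in this thread.

```python
from typing import Iterable, Iterator, List


def chunked(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` ids."""
    batch: List[int] = []
    for item_id in ids:
        batch.append(item_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def terms_query(field: str, ids: Iterable[int]) -> str:
    """Build a Solr terms-parser query matching any of the given ids."""
    return "{!terms f=%s}%s" % (field, ",".join(str(i) for i in ids))


def items_search_expr(ids: List[int], extra_filter: str) -> str:
    """Build a search() expression over `items` restricted to `ids`.

    `extra_filter` is a hypothetical filter query; this sketch does no
    escaping, so it must not contain double quotes.
    """
    return (
        'search(items, q="%s", fq="%s", fl="id,name", sort="id asc", qt="/export")'
        % (terms_query("id", ids), extra_filter)
    )


# Ids as they might come back from a first query on deletedItems (made up here).
deleted_ids = [1, 2, 3, 4, 5]

# One expression per batch, so no single query carries every id.
expressions = [items_search_expr(batch, "name:a") for batch in chunked(deleted_ids, 2)]

for expr in expressions:
    print(expr)
```

Each generated expression only touches the items that match both the ID batch and the extra filter, so the full items result never has to be streamed to a worker. The cost is one request per batch, and the batch size would need tuning against query-length limits.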
Re: Use stream result like a query (alternative to innerJoin)
Here is the documentation for fetch:

https://lucene.apache.org/solr/guide/8_4/stream-decorator-reference.html#fetch

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Nov 23, 2020 at 3:22 PM Joel Bernstein wrote:

> There are two streams that behave like that.
>
> One is the "nodes" expression, which is not going to work for this use
> case because it does everything in memory.
>
> The second one is the "fetch" expression which behaves like a nested loop
> join with some limitations. Unfortunately the main limitation is likely to
> be a blocker for you which is that it doesn't support one-to-many joins yet.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sun, Nov 22, 2020 at 10:37 AM ufuk yılmaz wrote:
>
>> Hi all,
>>
>> I’m looking for a way to query two collections and find documents that
>> exist in both, I know this can be done with innerJoin streaming expression
>> but I want to avoid it, since one of the collection streams can possibly
>> have billions of results:
>>
>> Let’s say two collections are:
>>
>> deletedItems = [{deletedItemId: 1}, {deletedItemId: 2}...]
>> items = [
>>   {
>>     id: 1,
>>     name: "a"
>>   },
>>   {
>>     id: 2,
>>     name: "b"
>>   },
>>   {
>>     id: 3,
>>     name: "c"
>>   }.
>> ]
>>
>> “deletedItems” contain a few documents compared to “items” collection
>> (1mil vs 2-3 bil). If I query them both with a typical query in our system,
>> deletedItems gives a few thousand results but items give tens/hundreds of
>> millions. To use innerJoin, I have to stream the whole items result to
>> worker node over network.
>>
>> Is there a way to avoid this, something like using “deletedItems” result
>> as a query to “items” stream?
>>
>> Thanks in advance for the help
>>
>> Sent from Mail for Windows 10
Re: Use stream result like a query (alternative to innerJoin)
There are two streams that behave like that.

One is the "nodes" expression, which is not going to work for this use case because it does everything in memory.

The second one is the "fetch" expression which behaves like a nested loop join with some limitations. Unfortunately the main limitation is likely to be a blocker for you which is that it doesn't support one-to-many joins yet.

Joel Bernstein
http://joelsolr.blogspot.com/

On Sun, Nov 22, 2020 at 10:37 AM ufuk yılmaz wrote:

> Hi all,
>
> I’m looking for a way to query two collections and find documents that
> exist in both, I know this can be done with innerJoin streaming expression
> but I want to avoid it, since one of the collection streams can possibly
> have billions of results:
>
> Let’s say two collections are:
>
> deletedItems = [{deletedItemId: 1}, {deletedItemId: 2}...]
> items = [
>   {
>     id: 1,
>     name: "a"
>   },
>   {
>     id: 2,
>     name: "b"
>   },
>   {
>     id: 3,
>     name: "c"
>   }.
> ]
>
> “deletedItems” contain a few documents compared to “items” collection
> (1mil vs 2-3 bil). If I query them both with a typical query in our system,
> deletedItems gives a few thousand results but items give tens/hundreds of
> millions. To use innerJoin, I have to stream the whole items result to
> worker node over network.
>
> Is there a way to avoid this, something like using “deletedItems” result
> as a query to “items” stream?
>
> Thanks in advance for the help
>
> Sent from Mail for Windows 10
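For the collections in the question, a "fetch" expression might be put together as below. This is only a sketch, written as a small Python helper that assembles the expression string. The collection and field names come from the example in the question; the `q="*:*"` query, the `qt="/export"` handler, the sort, and the batch size are assumptions for illustration. The `on` parameter is written as field-on-tuple=field-on-collection, following the fetch reference documentation.

```python
def fetch_expr(batch_size: int = 50) -> str:
    """Build a fetch() expression that looks up `items` for each deleted id.

    The inner search streams only the small deletedItems collection;
    fetch then pulls the `name` field from `items` in batches.
    """
    # Left stream: ids from deletedItems (query and handler are assumptions).
    inner = (
        'search(deletedItems, q="*:*", fl="deletedItemId", '
        'sort="deletedItemId asc", qt="/export")'
    )
    # on="deletedItemId=id": tuple field on the left, items field on the right.
    return 'fetch(items, %s, fl="name", on="deletedItemId=id", batchSize="%d")' % (
        inner,
        batch_size,
    )


expr = fetch_expr(batch_size=100)
print(expr)
```

Because deletedItemId maps to the unique id of items, each tuple matches at most one document, so the one-to-many limitation mentioned above should not come into play for this particular join. The expression still has no place to add a further filter on items.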