RE: Use stream result like a query (alternative to innerJoin)

2020-11-24 Thread ufuk yılmaz
Fetch would work for my specific case (since I’m working with ids, there’s no 
one-to-many), if I were able to restrict fetch’s target domain with a query. I 
would first get all possible deleted ids, then use fetch against the items 
collection. But the current fetch implementation would find all deleted 
items, not something like “deleted items with these names” or “deleted items 
within this time range”.

I came upon your video while researching this stuff: 
https://www.youtube.com/watch?v=kTNe3TaqFvo

I’m trying to use the “let” expression to feed one stream’s result to another 
as a query, using the string concat function and the eval stream. So far I 
haven’t been able to write a working example, but it’s an idea that I’m 
playing with.
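
For comparison, the same idea of feeding one result into another as a query can be done on the client side instead of inside a streaming expression. The sketch below is not the let/concat/eval approach described above; it is a plain two-step alternative that assumes the deleted ids have already been read from the first stream. It only builds request parameters, using Solr's terms query parser as a filter on the id field. The function names, the batch size, and the example name query are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def chunked(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield the ids in lists of at most `size` elements."""
    batch: List[int] = []
    for item_id in ids:
        batch.append(item_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def terms_filter(field: str, ids: List[int]) -> str:
    """Build a terms query parser filter, e.g. {!terms f=id}1,2,3."""
    return "{!terms f=%s}%s" % (field, ",".join(str(i) for i in ids))


def build_item_queries(
    deleted_ids: Iterable[int], extra_query: str, batch_size: int = 1000
) -> List[Dict[str, str]]:
    """One parameter set per batch: the extra restriction goes in q,
    the batch of deleted ids goes in fq."""
    return [
        {"q": extra_query, "fq": terms_filter("id", batch)}
        for batch in chunked(deleted_ids, batch_size)
    ]


# Hypothetical input: ids 1..3 came back from the deletedItems query.
queries = build_item_queries([1, 2, 3], 'name:"a"', batch_size=2)
for params in queries:
    print(params)
```

Each parameter set would be sent to the items collection as an ordinary query, so only matching items travel over the network. Whether this is practical depends on how many deleted ids the first query returns.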




Re: Use stream result like a query (alternative to innerJoin)

2020-11-23 Thread Joel Bernstein
Here is the documentation for fetch:

https://lucene.apache.org/solr/guide/8_4/stream-decorator-reference.html#fetch


Joel Bernstein
http://joelsolr.blogspot.com/



Re: Use stream result like a query (alternative to innerJoin)

2020-11-23 Thread Joel Bernstein
There are two streams that behave like that.

One is the "nodes" expression, which is not going to work for this use case
because it does everything in memory.

The second one is the "fetch" expression, which behaves like a nested loop
join with some limitations. Unfortunately, the main limitation, that it
doesn't support one-to-many joins yet, is likely to be a blocker for you.
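
As a rough illustration of the fetch approach for the collections in this thread, the sketch below builds a fetch expression as a string. The collection and field names come from the original question; the q, qt, sort, and batchSize values are assumptions chosen for the example, and the /export handler assumes docValues on deletedItemId. The helper name is hypothetical. The `on` parameter is written as field-in-tuple=field-in-collection, following the fetch reference documentation.

```python
def build_fetch_expression(batch_size: int = 50) -> str:
    """Return a fetch() expression that looks up items for each deleted id."""
    # Inner stream: the small side of the join (deletedItems).
    # q, qt and sort are assumed values, not taken from the thread.
    inner = (
        'search(deletedItems, q="*:*", qt="/export", '
        'fl="deletedItemId", sort="deletedItemId asc")'
    )
    # fetch queries the items collection in batches of tuples and adds the
    # requested fields (here: name) to each tuple whose deletedItemId
    # matches an items id.
    return (
        f'fetch(items, {inner}, fl="name", '
        f'on="deletedItemId=id", batchSize="{batch_size}")'
    )


print(build_fetch_expression())
```

Because ids are unique in the items collection, each deleted id matches at most one item, so the missing one-to-many support would not matter for this particular join.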

Joel Bernstein
http://joelsolr.blogspot.com/


On Sun, Nov 22, 2020 at 10:37 AM ufuk yılmaz wrote:

> Hi all,
>
> I’m looking for a way to query two collections and find documents that
> exist in both. I know this can be done with the innerJoin streaming
> expression, but I want to avoid it, since one of the collection streams
> can have billions of results.
>
> Let’s say the two collections are:
>
> deletedItems = [{deletedItemId: 1}, {deletedItemId: 2}...]
> items = [
>   { id: 1, name: "a" },
>   { id: 2, name: "b" },
>   { id: 3, name: "c" }
> ]
>
> “deletedItems” contains few documents compared to the “items” collection
> (1 million vs. 2-3 billion). If I query them both with a typical query in
> our system, deletedItems gives a few thousand results but items gives tens
> or hundreds of millions. To use innerJoin, I have to stream the whole items
> result to a worker node over the network.
>
> Is there a way to avoid this, something like using the “deletedItems”
> result as a query to the “items” stream?
>
> Thanks in advance for the help