We likely have the same laptop :-)

There must be something weird with my schema or usage, but even if I had 10x
the throughput I have now, throwing around that many docs for a single join
isn't conducive to the desired latency, concurrent request load, network
bandwidth, etc.  I feel like I'm not using the tool properly, so I'll do
some more thinking.

Really enjoying the work you're doing!

On Mon, May 16, 2016 at 8:09 AM, Joel Bernstein <joels...@gmail.com> wrote:

> So, with that setup you're getting around 150,000 docs per second
> throughput. On my laptop with a similar query I was able to stream around
> 650,000 docs per second. I have an SSD and 16 Gigs of RAM. Also I did lots
> of experimenting with different numbers of workers and tested after warming
> the partition filters. I was also able to maintain that speed exporting
> larger result sets in the 25,000,000 doc range.
>
> Based on our discussion, it's clear that there needs to be documentation
> about how to build and scale streaming architectures with Solr. I'm working
> on that now. The work in progress is here:
>
> https://cwiki.apache.org/confluence/display/solr/Scaling+with+Worker+Collections
>
> As I work on the documentation I'll revalidate the performance numbers I
> was seeing when I did the performance testing several months ago.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, May 16, 2016 at 10:51 AM, Ryan Cutter <ryancut...@gmail.com>
> wrote:
>
> > Thanks for all this info, Joel.  I found that if I artificially limit
> > the triples stream to 3M and use the /export handler with only 2
> > workers, I can get results in about 20 seconds and Solr doesn't tip
> > over.  That seems to be the best config for this local/single instance.
> >
> > It's also clear I'm not using streaming expressions optimally, so I need
> > to do some more thinking!  I don't want to stream all 26M triples (much
> > less billions of docs) just for a simple join in which I expect a couple
> > hundred results.  I wanted to see if I could directly port a SQL join
> > into this framework using normalized Solr docs and a single streaming
> > expression.  I'll do some more tinkering.
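> >
> > (For reference, the relational shape I'm trying to port is roughly the
> > following -- a sketch using the field names from the expressions later
> > in this thread, not a query I've actually run:
> >
> >   SELECT t.subject_id, t.type_id, tt.triple_type_label
> >   FROM   triple t
> >   JOIN   triple_type tt ON t.type_id = tt.triple_type_id;
> >
> > which is what an innerJoin() over two sorted /export streams expresses.)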
> >
> > Thanks again, Ryan
> >
> > On Sun, May 15, 2016 at 4:14 PM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > One other thing to keep in mind is how the partitioning is done when
> > > you add the partitionKeys.
> > >
> > > Partitioning is done using the HashQParserPlugin, which builds a
> > > filter for each worker. Under the covers this uses the normal filter
> > > query mechanism. So after the filters are built and cached they are
> > > effectively free from a performance standpoint. But on the first run
> > > they need to be built, and they need to be rebuilt after each commit.
> > > This means several things:
> > >
> > > 1) If you have 8 workers then 8 filters need to be computed. The
> > > workers call down to the shards in parallel, so the filters will build
> > > in parallel. But this can take time, and the larger the index, the
> > > more time it takes.
> > >
> > > 2) Like all filters, the partitioning filters can be pre-computed
> > > using warming queries. You can check the logs and look for the
> > > {!hash ...} filter queries to see the syntax. Basically you would need
> > > a warming query for each worker ID (see the sketch at the end of this
> > > message).
> > >
> > > 3) If you don't pre-warm the partitioning filters there will be a
> > > performance penalty the first time they are computed. The next query
> > > will be much faster.
> > >
> > > 4) This is another area where having more shards helps with
> > > performance, because having fewer documents per shard means faster
> > > times building the partition filters.
> > >
> > > In the future we'll switch to segment-level partitioning filters, so
> > > that following each commit only the new segments need to be built. But
> > > this is still on the TODO list.
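> > >
> > > A minimal warming sketch for solrconfig.xml, assuming 8 workers and a
> > > partition key of type_id (check your logs for the exact {!hash ...}
> > > syntax your version emits; the field name and worker count here are
> > > illustrative):
> > >
> > > <listener event="newSearcher" class="solr.QuerySenderListener">
> > >   <arr name="queries">
> > >     <!-- one entry per worker ID, worker=0 through worker=7 -->
> > >     <lst>
> > >       <str name="q">*:*</str>
> > >       <str name="partitionKeys">type_id</str>
> > >       <str name="fq">{!hash workers=8 worker=0}</str>
> > >     </lst>
> > >     <lst>
> > >       <str name="q">*:*</str>
> > >       <str name="partitionKeys">type_id</str>
> > >       <str name="fq">{!hash workers=8 worker=1}</str>
> > >     </lst>
> > >     <!-- ... and so on through worker=7 ... -->
> > >   </arr>
> > > </listener>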
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Sun, May 15, 2016 at 5:38 PM, Joel Bernstein <joels...@gmail.com>
> > > wrote:
> > >
> > > > Ah, you also used 4 shards. That means with 8 workers there were 32
> > > > concurrent queries against the /select handler, each requesting
> > > > 100,000 rows. That's a really heavy load!
> > > >
> > > > You can still try out the approach from my last email on the 4-shard
> > > > setup; as you add workers you'll gradually ramp up the parallelism on
> > > > the machine. With a single worker you'll have 4 shards working in
> > > > parallel. With 8 workers you'll have 32 threads working in parallel.
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Sun, May 15, 2016 at 5:23 PM, Joel Bernstein <joels...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi Ryan,
> > > >>
> > > >> The rows=100000 on the /select handler is likely going to cause
> > > >> problems with 8 workers. This is calling the /select handler with 8
> > > >> concurrent workers, each retrieving 100,000 rows. The /select
> > > >> handler bogs down as the number of rows increases. So using the
> > > >> rows parameter with the /select handler is really not a strategy
> > > >> for limiting the size of the join. To limit the size of the join
> > > >> you would need to place some kind of filter on the query and still
> > > >> use the /export handler.
> > > >>
> > > >> The /export handler was developed to handle large exports without
> > > >> getting bogged down.
> > > >>
> > > >> You may want to start by just getting an understanding of how much
> > > >> data a single node can export, and how long it takes.
> > > >>
> > > >> 1) Try running a single *:* search() using the /export handler on
> > > >> the triple collection. Time how long it takes. If you run into
> > > >> problems getting this to complete, then attach a memory profiler.
> > > >> It may be that 8 gigs is not enough to hold the docValues in memory
> > > >> and process the query. The /export handler does not use more memory
> > > >> as the result set grows, so the /export handler should be able to
> > > >> process the entire query (30,000,000 docs). But it does take a lot
> > > >> of memory to hold the docValues fields in memory. This query will
> > > >> likely take some time to complete, though, as you are sorting and
> > > >> exporting 30,000,000 docs from a single node.
> > > >>
> > > >> 2) Then try running the same *:* search() against the /export
> > > >> handler in parallel(), gradually increasing the number of workers.
> > > >> Time how long it takes as you add workers and watch the load it
> > > >> places on the server. Eventually you'll max out your performance.
> > > >>
> > > >> Then you'll start to get an idea of how fast a single node can sort
> > > >> and export data.
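> > > >>
> > > >> For example, sketches of both steps (field names taken from your
> > > >> expressions; adjust to your schema):
> > > >>
> > > >> Step 1, plain export:
> > > >>
> > > >> search(triple, q=*:*, fl="subject_id,type_id",
> > > >>        sort="type_id asc", qt="/export")
> > > >>
> > > >> Step 2, the same export partitioned across workers:
> > > >>
> > > >> parallel(triple,
> > > >>          search(triple, q=*:*, fl="subject_id,type_id",
> > > >>                 sort="type_id asc", partitionKeys="type_id",
> > > >>                 qt="/export"),
> > > >>          workers="2", sort="type_id asc")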
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Joel Bernstein
> > > >> http://joelsolr.blogspot.com/
> > > >>
> > > >> On Sat, May 14, 2016 at 4:14 PM, Ryan Cutter <ryancut...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hello, I'm running Solr on my laptop with -Xmx8g and gave each
> > > >>> collection 4 shards and 2 replicas.
> > > >>>
> > > >>> Even grabbing 100k triple documents (like the following) takes 20
> > > >>> seconds to complete and is prone to falling over.  I could try
> > > >>> this in a proper cluster with multiple hosts and more sharding,
> > > >>> etc.  I just thought I was tinkering with a small enough data set
> > > >>> to use locally.
> > > >>>
> > > >>> parallel(
> > > >>>     triple,
> > > >>>     innerJoin(
> > > >>>       search(triple, q=*:*, fl="subject_id,type_id",
> > > >>>           sort="type_id asc", partitionKeys="type_id",
> > > >>>           rows="100000"),
> > > >>>       search(triple_type, q=*:*, fl="triple_type_id",
> > > >>>           sort="triple_type_id asc",
> > > >>>           partitionKeys="triple_type_id", qt="/export"),
> > > >>>       on="type_id=triple_type_id"
> > > >>>     ),
> > > >>>     sort="subject_id asc",
> > > >>>     workers="8")
> > > >>>
> > > >>>
> > > >>> When Solr does crash, it's leaving messages like this.
> > > >>>
> > > >>> ERROR - 2016-05-14 20:00:53.892; [c:triple s:shard3 r:core_node2
> > > >>> x:triple_shard3_replica2] org.apache.solr.common.SolrException;
> > > >>> null:java.io.IOException: java.util.concurrent.TimeoutException:
> > > >>> Idle timeout expired: 50001/50000 ms
> > > >>>
> > > >>> at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:226)
> > > >>> at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:164)
> > > >>> at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:530)
> > > >>> at org.apache.solr.response.QueryResponseWriterUtil$1.write(QueryResponseWriterUtil.java:54)
> > > >>> at java.io.OutputStream.write(OutputStream.java:116)
> > > >>> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> > > >>> at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
> > > >>> at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
> > > >>> at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
> > > >>> at org.apache.solr.util.FastWriter.flush(FastWriter.java:140)
> > > >>> at org.apache.solr.util.FastWriter.write(FastWriter.java:54)
> > > >>> at org.apache.solr.response.JSONWriter.writeMapCloser(JSONResponseWriter.java:420)
> > > >>> at org.apache.solr.response.JSONWriter.writeSolrDocument(JSONResponseWriter.java:364)
> > > >>> at org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:246)
> > > >>> at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:150)
> > > >>> at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
> > > >>>
> > > >>> On Fri, May 13, 2016 at 5:50 PM, Joel Bernstein <
> joels...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> > Also, the hashJoin is going to read the entire entity table into
> > > >>> > memory. If that's a large index, that could be using lots of
> > > >>> > memory.
> > > >>> >
> > > >>> > 25 million docs should be OK to /export from one node, as long
> > > >>> > as you have enough memory to load the docValues for the fields
> > > >>> > used for sorting and exporting.
> > > >>> >
> > > >>> > Breaking down the query into its parts will show where the issue
> > > >>> > is. Also, adding more heap might give you enough memory.
> > > >>> >
> > > >>> > In my testing, the max docs per second I've seen the /export
> > > >>> > handler push from a single node is 650,000. In order to get
> > > >>> > 650,000 docs per second on one node you have to partition the
> > > >>> > stream with workers. In my testing it took 8 workers hitting one
> > > >>> > node to achieve the 650,000 docs per second.
> > > >>> >
> > > >>> > But the numbers get big as the cluster grows. With 20 shards, 4
> > > >>> > replicas, and 32 workers, you could export 52,000,000 docs per
> > > >>> > second. With 40 shards, 5 replicas, and 40 workers you could
> > > >>> > export 130,000,000 docs per second.
> > > >>> >
> > > >>> > So with large clusters you could do very large distributed joins
> > > >>> > with sub-second performance.
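> > > >>> >
> > > >>> > (The arithmetic behind those numbers: every replica pushes
> > > >>> > partitions of the stream at once, so throughput is roughly
> > > >>> > shards x replicas x 650,000 docs per second per node, e.g.
> > > >>> > 20 x 4 x 650,000 = 52,000,000 and 40 x 5 x 650,000 =
> > > >>> > 130,000,000.)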
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > Joel Bernstein
> > > >>> > http://joelsolr.blogspot.com/
> > > >>> >
> > > >>> > On Fri, May 13, 2016 at 8:11 PM, Ryan Cutter <
> ryancut...@gmail.com
> > >
> > > >>> wrote:
> > > >>> >
> > > >>> > > Thanks very much for the advice.  Yes, I'm running in a very
> > > >>> > > basic single-shard environment.  I thought that 25M docs was
> > > >>> > > small enough not to require anything special, but I will try
> > > >>> > > scaling like you suggest and let you know what happens.
> > > >>> > >
> > > >>> > > Cheers, Ryan
> > > >>> > >
> > > >>> > > On Fri, May 13, 2016 at 4:53 PM, Joel Bernstein <
> > > joels...@gmail.com>
> > > >>> > > wrote:
> > > >>> > >
> > > >>> > > > I would try breaking down the second query to see when the
> > > >>> > > > problems occur.
> > > >>> > > >
> > > >>> > > > 1) Start with just a single *:* search from one of the
> > > >>> > > > collections.
> > > >>> > > > 2) Then test the innerJoin. The innerJoin won't take much
> > > >>> > > > memory as it's a streaming merge join.
> > > >>> > > > 3) Then try the full thing.
> > > >>> > > >
> > > >>> > > > If you're running a large join like this all on one host,
> > > >>> > > > then you might not have enough memory for the docValues and
> > > >>> > > > the two joins. In general, streaming is designed to scale by
> > > >>> > > > adding servers. It scales 3 ways:
> > > >>> > > >
> > > >>> > > > 1) Adding shards splits up the index for more pushing power.
> > > >>> > > > 2) Adding workers partitions the streams and splits up the
> > > >>> > > > join / merge work.
> > > >>> > > > 3) Adding replicas, when you have workers, adds pushing
> > > >>> > > > power. This is because workers will fetch partitions of the
> > > >>> > > > streams from across the entire cluster, so ALL replicas will
> > > >>> > > > be pushing at once.
> > > >>> > > >
> > > >>> > > > So, imagine a setup with 20 shards, 4 replicas, and 20
> > > >>> > > > workers. You can perform massive joins quickly.
> > > >>> > > >
> > > >>> > > > But for your scenario and available hardware you can
> > > >>> > > > experiment with different cluster sizes.
> > > >>> > > >
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > Joel Bernstein
> > > >>> > > > http://joelsolr.blogspot.com/
> > > >>> > > >
> > > >>> > > > On Fri, May 13, 2016 at 7:27 PM, Ryan Cutter <
> > > ryancut...@gmail.com
> > > >>> >
> > > >>> > > wrote:
> > > >>> > > >
> > > >>> > > > > qt="/export" immediately fixed the query in Question #1.
> > > >>> > > > > Sorry for missing that in the docs!
> > > >>> > > > >
> > > >>> > > > > The second query (with /export) crashes the server, so I
> > > >>> > > > > was going to look at parallelization if you think that's a
> > > >>> > > > > good idea.  It also seems unwise to join into 26M docs, so
> > > >>> > > > > maybe I can reconfigure the query to run along a happier
> > > >>> > > > > path :-)  The schema is very RDBMS-centric, so maybe that
> > > >>> > > > > just won't ever work in this framework.
> > > >>> > > > >
> > > >>> > > > > Here's the log, but it's not very helpful.
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > INFO  - 2016-05-13 23:18:13.214; [c:triple s:shard1
> > > >>> > > > > r:core_node1 x:triple_shard1_replica1]
> > > >>> > > > > org.apache.solr.core.SolrCore; [triple_shard1_replica1]
> > > >>> > > > > webapp=/solr path=/export
> > > >>> > > > > params={q=*:*&distrib=false&fl=triple_id,subject_id,type_id&sort=type_id+asc&wt=json&version=2.2}
> > > >>> > > > > hits=26305619 status=0 QTime=61
> > > >>> > > > >
> > > >>> > > > > INFO  - 2016-05-13 23:18:13.747; [c:triple_type s:shard1
> > > >>> > > > > r:core_node1 x:triple_type_shard1_replica1]
> > > >>> > > > > org.apache.solr.core.SolrCore; [triple_type_shard1_replica1]
> > > >>> > > > > webapp=/solr path=/export
> > > >>> > > > > params={q=*:*&distrib=false&fl=triple_type_id,triple_type_label&sort=triple_type_id+asc&wt=json&version=2.2}
> > > >>> > > > > hits=702 status=0 QTime=2
> > > >>> > > > >
> > > >>> > > > > INFO  - 2016-05-13 23:18:48.504; [   ]
> > > >>> > > > > org.apache.solr.common.cloud.ConnectionManager; Watcher
> > > >>> > > > > org.apache.solr.common.cloud.ConnectionManager@6ad0f304
> > > >>> > > > > name:ZooKeeperConnection Watcher:localhost:9983 got event
> > > >>> > > > > WatchedEvent state:Disconnected type:None path:null
> > > >>> > > > >
> > > >>> > > > > INFO  - 2016-05-13 23:18:48.504; [   ]
> > > >>> > > > > org.apache.solr.common.cloud.ConnectionManager; zkClient
> > > >>> > > > > has disconnected
> > > >>> > > > >
> > > >>> > > > > ERROR - 2016-05-13 23:18:51.316; [c:triple s:shard1
> > > >>> > > > > r:core_node1 x:triple_shard1_replica1]
> > > >>> > > > > org.apache.solr.common.SolrException; null:Early Client
> > > >>> > > > > Disconnect
> > > >>> > > > >
> > > >>> > > > > WARN  - 2016-05-13 23:18:51.431; [   ]
> > > >>> > > > > org.apache.zookeeper.ClientCnxn$SendThread; Session
> > > >>> > > > > 0x154ac66c81e0002 for server localhost/0:0:0:0:0:0:0:1:9983,
> > > >>> > > > > unexpected error, closing socket connection and attempting
> > > >>> > > > > reconnect
> > > >>> > > > >
> > > >>> > > > > java.io.IOException: Connection reset by peer
> > > >>> > > > >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> > > >>> > > > >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> > > >>> > > > >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> > > >>> > > > >         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> > > >>> > > > >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> > > >>> > > > >         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
> > > >>> > > > >         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> > > >>> > > > >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> > > >>> > > > >
> > > >>> > > > > On Fri, May 13, 2016 at 3:09 PM, Joel Bernstein <
> > > >>> joels...@gmail.com>
> > > >>> > > > > wrote:
> > > >>> > > > >
> > > >>> > > > > > A couple of other things:
> > > >>> > > > > >
> > > >>> > > > > > 1) Your innerJoin can be parallelized across workers to
> > > >>> > > > > > improve performance. Take a look at the docs on the
> > > >>> > > > > > parallel function for the details.
> > > >>> > > > > >
> > > >>> > > > > > 2) It looks like you might be doing graph operations
> > > >>> > > > > > with joins. You might want to take a look at the
> > > >>> > > > > > gatherNodes function coming in 6.1:
> > > >>> > > > > >
> > > >>> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238
> > > >>> > > > > >
> > > >>> > > > > > Joel Bernstein
> > > >>> > > > > > http://joelsolr.blogspot.com/
> > > >>> > > > > >
> > > >>> > > > > > On Fri, May 13, 2016 at 5:57 PM, Joel Bernstein <
> > > >>> > joels...@gmail.com>
> > > >>> > > > > > wrote:
> > > >>> > > > > >
> > > >>> > > > > > > When doing things that require all the results (like
> > > >>> > > > > > > joins) you need to specify the /export handler in the
> > > >>> > > > > > > search function:
> > > >>> > > > > > >
> > > >>> > > > > > > qt="/export"
> > > >>> > > > > > >
> > > >>> > > > > > > The search function defaults to the /select handler,
> > > >>> > > > > > > which is designed to return the top N results. The
> > > >>> > > > > > > /export handler always returns all results that match
> > > >>> > > > > > > the query. Also keep in mind that the /export handler
> > > >>> > > > > > > requires that sort fields and fl fields have docValues
> > > >>> > > > > > > set.
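> > > >>> > > > > > >
> > > >>> > > > > > > For example, in the schema each field you sort on or
> > > >>> > > > > > > return would need docValues enabled -- a sketch, with
> > > >>> > > > > > > illustrative field and type names:
> > > >>> > > > > > >
> > > >>> > > > > > > <field name="type_id" type="string" indexed="true"
> > > >>> > > > > > >        stored="true" docValues="true"/>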
> > > >>> > > > > > >
> > > >>> > > > > > > Joel Bernstein
> > > >>> > > > > > > http://joelsolr.blogspot.com/
> > > >>> > > > > > >
> > > >>> > > > > > > On Fri, May 13, 2016 at 5:36 PM, Ryan Cutter <
> > > >>> > ryancut...@gmail.com
> > > >>> > > >
> > > >>> > > > > > wrote:
> > > >>> > > > > > >
> > > >>> > > > > > >> Question #1:
> > > >>> > > > > > >>
> > > >>> > > > > > >> The triple_type collection has a few hundred docs and
> > > >>> > > > > > >> triple has 25M docs.
> > > >>> > > > > > >>
> > > >>> > > > > > >> When I search for a particular subject_id in triple,
> > > >>> > > > > > >> which I know has 14 results, and do not pass in the
> > > >>> > > > > > >> 'rows' param, it returns 0 results:
> > > >>> > > > > > >>
> > > >>> > > > > > >> innerJoin(
> > > >>> > > > > > >>     search(triple, q=subject_id:1656521,
> > > >>> > > > > > >>         fl="triple_id,subject_id,type_id",
> > > >>> > > > > > >>         sort="type_id asc"),
> > > >>> > > > > > >>     search(triple_type, q=*:*,
> > > >>> > > > > > >>         fl="triple_type_id,triple_type_label",
> > > >>> > > > > > >>         sort="triple_type_id asc"),
> > > >>> > > > > > >>     on="type_id=triple_type_id"
> > > >>> > > > > > >> )
> > > >>> > > > > > >>
> > > >>> > > > > > >> When I do the same search with rows=10000, it
> > > >>> > > > > > >> returns 14 results:
> > > >>> > > > > > >>
> > > >>> > > > > > >> innerJoin(
> > > >>> > > > > > >>     search(triple, q=subject_id:1656521,
> > > >>> > > > > > >>         fl="triple_id,subject_id,type_id",
> > > >>> > > > > > >>         sort="type_id asc", rows=10000),
> > > >>> > > > > > >>     search(triple_type, q=*:*,
> > > >>> > > > > > >>         fl="triple_type_id,triple_type_label",
> > > >>> > > > > > >>         sort="triple_type_id asc", rows=10000),
> > > >>> > > > > > >>     on="type_id=triple_type_id"
> > > >>> > > > > > >> )
> > > >>> > > > > > >>
> > > >>> > > > > > >> Am I doing this right?  Is there a magic number to
> > > >>> > > > > > >> pass into rows which says "give me all the results
> > > >>> > > > > > >> which match this query"?
> > > >>> > > > > > >>
> > > >>> > > > > > >>
> > > >>> > > > > > >> Question #2:
> > > >>> > > > > > >>
> > > >>> > > > > > >> Perhaps related to the first question, but I want to
> > > >>> > > > > > >> run the innerJoin() without the subject_id; rather,
> > > >>> > > > > > >> have it use the results of another query.  But this
> > > >>> > > > > > >> does not return any results.  I'm saying "search for
> > > >>> > > > > > >> this entity based on id, then use that result's
> > > >>> > > > > > >> entity_id as the subject_id to look through the
> > > >>> > > > > > >> triple/triple_type collections":
> > > >>> > > > > > >>
> > > >>> > > > > > >> hashJoin(
> > > >>> > > > > > >>     innerJoin(
> > > >>> > > > > > >>         search(triple, q=*:*,
> > > >>> > > > > > >>             fl="triple_id,subject_id,type_id",
> > > >>> > > > > > >>             sort="type_id asc"),
> > > >>> > > > > > >>         search(triple_type, q=*:*,
> > > >>> > > > > > >>             fl="triple_type_id,triple_type_label",
> > > >>> > > > > > >>             sort="triple_type_id asc"),
> > > >>> > > > > > >>         on="type_id=triple_type_id"
> > > >>> > > > > > >>     ),
> > > >>> > > > > > >>     hashed=search(entity,
> > > >>> > > > > > >>         q=id:"urn:sid:entity:455dfa1aa27eedad21ac2115797c1580bb3b3b4e",
> > > >>> > > > > > >>         fl="entity_id,entity_label", sort="entity_id asc"),
> > > >>> > > > > > >>     on="subject_id=entity_id"
> > > >>> > > > > > >> )
> > > >>> > > > > > >>
> > > >>> > > > > > >> Am I doing this hashJoin right?
> > > >>> > > > > > >>
> > > >>> > > > > > >> Thanks very much, Ryan
> > > >>> > > > > > >>
> > > >>> > > > > > >
> > > >>> > > > > > >
> > > >>> > > > > >
> > > >>> > > > >
> > > >>> > > >
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>
