Re: Streaming Expression joins not returning all results

2016-05-16 Thread Ryan Cutter
We likely have the same laptop :-) There must be something weird with my schema or usage but even if I had 10x the throughput I have now, throwing around that many docs for a single join isn't conducive to desired latency, concurrent requests, network bandwidth, etc. I feel like I'm not using

Re: Streaming Expression joins not returning all results

2016-05-16 Thread Joel Bernstein
So, with that setup you're getting around 150,000 docs per second throughput. On my laptop with a similar query I was able to stream around 650,000 docs per second. I have an SSD and 16 Gigs of RAM. Also I did lots of experimenting with different numbers of workers and tested after warming the
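
The throughput figures quoted in this part of the thread can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes the 150,000 docs/second figure refers to the 3M-document stream that completed in roughly 20 seconds, and the helper name `docs_per_second` is made up for the example.

```python
def docs_per_second(doc_count, elapsed_seconds):
    """Average streaming throughput in documents per second."""
    return doc_count / elapsed_seconds

# Assumed inputs: 3M docs streamed in about 20 seconds.
observed = docs_per_second(3_000_000, 20)

# Figure reported for a laptop with an SSD and 16 GB of RAM.
reference = 650_000

print(f"observed:  {observed:,.0f} docs/sec")
print(f"reference: {reference:,} docs/sec")
print(f"ratio:     {reference / observed:.2f}x")
```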

Re: Streaming Expression joins not returning all results

2016-05-16 Thread Ryan Cutter
Thanks for all this info, Joel. I found if I artificially limit the triples stream to 3M and use the /export handler with only 2 workers, I can get results in about 20 seconds and Solr doesn't tip over. That seems to be the best config for this local/single instance. It's also clear I'm not using

Re: Streaming Expression joins not returning all results

2016-05-15 Thread Joel Bernstein
One other thing to keep in mind is how the partitioning is done when you add the partitionKeys. Partitioning is done using the HashQParserPlugin, which builds a filter for each worker. Under the covers this is using the normal filter query mechanism. So after the filters are built and cached they are
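
The effect of partitioning on partitionKeys can be modelled roughly as follows. This is a sketch of the outcome only, not of how HashQParserPlugin works internally: in Solr each worker applies a filter query, whereas here a list is simply split. `zlib.crc32` is a stand-in hash chosen for the example (an assumption, not the hash Solr uses), and the field name `type_id` is hypothetical.

```python
import zlib

def worker_for(key_value, num_workers):
    """Map a partition key value to a worker id (stand-in hash, not Solr's)."""
    return zlib.crc32(str(key_value).encode("utf-8")) % num_workers

def partition(docs, key, num_workers):
    """Split docs so that every doc with the same key lands on one worker."""
    parts = [[] for _ in range(num_workers)]
    for doc in docs:
        parts[worker_for(doc[key], num_workers)].append(doc)
    return parts

docs = [{"id": i, "type_id": i % 5} for i in range(20)]
parts = partition(docs, "type_id", 2)
for worker_id, part in enumerate(parts):
    print(worker_id, sorted({d["type_id"] for d in part}))
```

Because equal keys always hash to the same worker, each worker can join its own slice without seeing the others.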

Re: Streaming Expression joins not returning all results

2016-05-15 Thread Joel Bernstein
Ah, you also used 4 shards. That means with 8 workers there were 32 concurrent queries against the /select handler, each requesting 100,000 rows. That's a really heavy load! You can still try out the approach from my last email on the 4 shards setup. As you add workers gradually, you'll gradually
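
The load described here follows from multiplying workers by shards. A small sketch of that arithmetic, using the numbers from the message (the variable names are illustrative, and the row total is an upper bound that assumes every query returns its full 100,000 rows):

```python
workers = 8
shards = 4
rows_per_query = 100_000

# Each worker sends one query to each shard.
concurrent_queries = workers * shards

# Upper bound on rows requested across all queries at once.
rows_requested = concurrent_queries * rows_per_query

print(concurrent_queries)  # 32
print(rows_requested)      # 3200000
```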

Re: Streaming Expression joins not returning all results

2016-05-15 Thread Joel Bernstein
Hi Ryan, The rows=100000 on the /select handler is likely going to cause problems with 8 workers. This is calling the /select handler with 8 concurrent workers each retrieving 100,000 rows. The /select handler bogs down as the number of rows increases. So using the rows parameter with the /select

Re: Streaming Expression joins not returning all results

2016-05-14 Thread Ryan Cutter
Hello, I'm running Solr on my laptop with -Xmx8g and gave each collection 4 shards and 2 replicas. Even grabbing 100k triple documents (like the following) is taking 20 seconds to complete and prone to fall over. I could try this in a proper cluster with multiple hosts and more sharding, etc. I

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Joel Bernstein
Also the hashJoin is going to read the entire entity table into memory. If that's a large index that could be using lots of memory. 25 million docs should be ok to /export from one node, as long as you have enough memory to load the docValues for the fields for sorting and exporting. Breaking
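
Why hashJoin holds one side in memory can be seen in a generic hash join. The sketch below is a plain Python illustration of the technique, not Solr's implementation; the function name and the field names (`type_id`, `subject_id`, `name`) are hypothetical.

```python
from collections import defaultdict

def hash_join(stream, hashed, key):
    """Inner-join `stream` against `hashed`, which is loaded fully into memory."""
    table = defaultdict(list)
    for row in hashed:
        # The entire hashed side is materialised here, so memory use
        # grows with the size of that side.
        table[row[key]].append(row)
    for row in stream:
        # The streamed side is read one row at a time.
        for match in table.get(row[key], []):
            yield {**match, **row}

types = [{"type_id": 1, "name": "a"}, {"type_id": 2, "name": "b"}]
triples = [{"subject_id": 10, "type_id": 2}, {"subject_id": 11, "type_id": 3}]
joined = list(hash_join(triples, types, "type_id"))
print(joined)
```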

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Ryan Cutter
Thanks very much for the advice. Yes, I'm running in a very basic single shard environment. I thought that 25M docs was small enough to not require anything special but I will try scaling like you suggest and let you know what happens. Cheers, Ryan On Fri, May 13, 2016 at 4:53 PM, Joel

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Joel Bernstein
I would try breaking down the second query to see when the problems occur. 1) Start with just a single *:* search from one of the collections. 2) Then test the innerJoin. The innerJoin won't take much memory as it's a streaming merge join. 3) Then try the full thing. If you're running a large
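
The remark that innerJoin needs little memory follows from how a streaming merge join works: both inputs arrive sorted on the join key, so only the rows for the current key are buffered. A minimal Python sketch of that general technique (not Solr's code; the names and fields are hypothetical):

```python
def merge_join(left, right, key):
    """Inner-join two iterables of dicts, both sorted ascending by `key`."""
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if l[key] < r[key]:
            l = next(left_it, None)
        elif l[key] > r[key]:
            r = next(right_it, None)
        else:
            k = l[key]
            # Buffer only the right-side rows that share the current key.
            group = []
            while r is not None and r[key] == k:
                group.append(r)
                r = next(right_it, None)
            # Pair each matching left row with the buffered group.
            while l is not None and l[key] == k:
                for g in group:
                    yield {**g, **l}
                l = next(left_it, None)

triples = [
    {"type_id": 1, "subject_id": 10},
    {"type_id": 2, "subject_id": 11},
    {"type_id": 2, "subject_id": 12},
]
types = [{"type_id": 2, "name": "b"}, {"type_id": 3, "name": "c"}]
joined = list(merge_join(triples, types, "type_id"))
print(joined)
```

If either input is not sorted on the join key, matches are silently missed, which is why the sort order of both streams matters.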

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Ryan Cutter
qt="/export" immediately fixed the query in Question #1. Sorry for missing that in the docs! The second query (with /export) crashes the server so I was going to look at parallelization if you think that's a good idea. It also seems unwise to join into 26M docs so maybe I can reconfigure the

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Joel Bernstein
A couple of other things: 1) Your innerJoin can be parallelized across workers to improve performance. Take a look at the docs on the parallel function for the details. 2) It looks like you might be doing graph operations with joins. You might want to take a look at the gatherNodes function coming in
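
For point 1, a parallelized join might look roughly like the expression assembled below. This is a hedged sketch: the helper functions, the field names (`type_id`, `subject_id`, `name`), the queries, and the worker count are all hypothetical, and the parameter layout reflects one reading of the parallel function docs, so it should be checked against the Solr version in use (some versions also expect a zkHost parameter).

```python
def search_expr(collection, q, fl, sort, partition_keys):
    """Build a search(...) expression that exports and partitions results."""
    return (
        f'search({collection}, q={q}, fl="{fl}", sort="{sort}", '
        f'qt="/export", partitionKeys="{partition_keys}")'
    )

def parallel_join(worker_collection, left, right, on, workers, sort):
    """Wrap an innerJoin of two search expressions in parallel(...)."""
    join = f'innerJoin({left}, {right}, on="{on}")'
    return (
        f'parallel({worker_collection}, {join}, '
        f'workers="{workers}", sort="{sort}")'
    )

# Both sides are sorted and partitioned on the join key (assumed: type_id).
left = search_expr("triple", "*:*", "subject_id,type_id", "type_id asc", "type_id")
right = search_expr("triple_type", "*:*", "type_id,name", "type_id asc", "type_id")
expr = parallel_join("triple", left, right, "type_id", 2, "type_id asc")
print(expr)
```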

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Joel Bernstein
When doing things that require all the results (like joins) you need to specify the /export handler in the search function. qt="/export" The search function defaults to the /select handler which is designed to return the top N results. The /export handler always returns all results that match
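
As a rough illustration of the qt="/export" advice, the sketch below assembles a search expression that names the /export handler explicitly. Only the collection name, the query, and `qt="/export"` come from the thread; the helper `search_expr` and the `fl` and `sort` values are hypothetical. The /export handler needs explicit `fl` and `sort` parameters, so both are included.

```python
def search_expr(collection, q, fl, sort, qt="/export"):
    """Build a Solr streaming search(...) expression as a string."""
    return (
        f'search({collection}, q={q}, fl="{fl}", '
        f'sort="{sort}", qt="{qt}")'
    )

# Hypothetical field list and sort; adjust to the real schema.
expr = search_expr(
    "triple",
    "subject_id:1656521",
    "subject_id,type_id",
    "type_id asc",
)
print(expr)
```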

Streaming Expression joins not returning all results

2016-05-13 Thread Ryan Cutter
Question #1: triple_type collection has a few hundred docs and triple has 25M docs. When I search for a particular subject_id in triple which I know has 14 results and do not pass in 'rows' params, it returns 0 results: innerJoin( search(triple, q=subject_id:1656521,