Re: querying using filter query and lots of possible values
Hi,

thanks for this hint. I will check it out. Sounds promising.

Daniel

On Sat, Jul 28, 2012 at 3:18 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

> Another avenue to consider...
>
> http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/schema/ExternalFileField.html
>
> ...would allow you to map values in your source_id to some numeric values (many to many), and these numeric values would then be accessible in functions -- so you could use something like fq={!frange ...} to select all docs with value 67, where your external file field says that value 67 is mapped to the following thousand source_id values.
>
> The external file fields can then be modified at any time just by doing a commit on your index.
>
> -Hoss
Re: querying using filter query and lots of possible values
: the list of IDs is constant for a longer time. I will take a look at
: these join thematic.
: Maybe another solution would be to really create a whole new
: collection or set of documents containing the aggregated documents (from the
: ids) from scratch and to execute queries on this collection. Then this
: would take some time, but maybe it's worth it because the querying will
: thank you.

Another avenue to consider...

http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/schema/ExternalFileField.html

...would allow you to map values in your source_id to some numeric values (many to many), and these numeric values would then be accessible in functions -- so you could use something like fq={!frange ...} to select all docs with value 67, where your external file field says that value 67 is mapped to the following thousand source_id values.

The external file fields can then be modified at any time just by doing a commit on your index.

-Hoss
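A minimal sketch of the external-file idea, in Python. It assumes a hypothetical external file field called source_group that is keyed on source_id; the helper names and sample IDs are made up for illustration. As far as I know, ExternalFileField reads plain key=value lines from a file named external_<fieldname> in the index data directory and holds a single float per key, so please check the Javadoc linked above before relying on any of this.

```python
def render_external_file(group_by_source_id):
    """Render one key=value line per source_id (sorted for stable output)."""
    lines = [f"{sid}={group}" for sid, group in sorted(group_by_source_id.items())]
    return "\n".join(lines) + "\n"


def frange_filter(field, value):
    """Build an fq string matching docs whose field value equals `value`."""
    return f"{{!frange l={value} u={value}}}{field}"


# Hypothetical data: three source_ids mapped to group 67, one to group 12.
mapping = {"id-a": 67, "id-b": 67, "id-c": 12, "id-d": 67}

# Text that would be written to external_source_group (assumed file name).
external_file_text = render_external_file(mapping)

# Filter that selects every document whose source_id maps to 67.
fq = frange_filter("source_group", 67)
```

The point of the design is that the long ID list lives in the file, not in the request, so the query stays short no matter how many source_id values belong to the group.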
querying using filter query and lots of possible values
Hi,

I am facing the following issue: I have a couple of million documents, each with a field called source_id. My problem is that I want to retrieve all the documents which have a source_id in a specific range of values. This range can be pretty big, for example a list of 200 to 2000 source IDs.

I was thinking that a filter query could be used, like

fq=source_id:(1 2 3 4 5 6 ...)

but this reminds me of SQL's WHERE IN (...), which was always a bit slow for a huge number of values.

Another solution that came to my mind was to assign a new kind of filter ID to all the documents I want to retrieve, so that all the documents I want to analyse get a new ID. But for this I would need to update all the millions of documents and assign them a new ID. This could take some time.

Can you think of a nicer way to solve this issue?

Regards and greetings,
Daniel
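For illustration, here is a rough Python sketch of how such a filter query string could be assembled from a list of IDs. The function names are hypothetical, and the explicit OR is an assumption made so that the result does not depend on the default query operator. Each ID becomes one boolean clause, so very long lists can hit Solr's maxBooleanClauses limit (1024 by default, as far as I know).

```python
def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_id_filter(field, ids):
    """Return field:("a" OR "b" ...) for a non-empty list of IDs."""
    if not ids:
        raise ValueError("need at least one id")
    joined = " OR ".join(quote_term(str(i)) for i in ids)
    return f"{field}:({joined})"


# Value for the fq parameter, built from three sample IDs.
fq = build_id_filter("source_id", ["1", "2", "3"])
```

Quoting every value keeps the same helper usable for IDs that contain characters with a special meaning in the query syntax.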
Re: querying using filter query and lots of possible values
Hi Daniel,

index the id into a field of type tint or tlong and use a range query (http://wiki.apache.org/solr/SolrQuerySyntax?highlight=%28rangequery%29):

fq=id:[200 TO 2000]

If you want to exclude certain ids, it might be wiser to simply add an exclusion query in addition to the range query instead of listing all the single values.

You will run into problems with overly long request URLs. If you cannot avoid long URLs, you might want to increase maxBooleanClauses (see http://wiki.apache.org/solr/SolrConfigXml/#The_Query_Section).

Cheers,
Chantal

On 26.07.2012 at 18:01, Daniel Brügge wrote:

> Hi,
>
> I am facing the following issue: I have a couple of million documents, each with a field called source_id. My problem is that I want to retrieve all the documents which have a source_id in a specific range of values. This range can be pretty big, for example a list of 200 to 2000 source IDs.
>
> I was thinking that a filter query could be used, like
>
> fq=source_id:(1 2 3 4 5 6 ...)
>
> but this reminds me of SQL's WHERE IN (...), which was always a bit slow for a huge number of values.
>
> Another solution that came to my mind was to assign a new kind of filter ID to all the documents I want to retrieve, so that all the documents I want to analyse get a new ID. But for this I would need to update all the millions of documents and assign them a new ID. This could take some time.
>
> Can you think of a nicer way to solve this issue?
>
> Regards and greetings,
> Daniel
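The range-plus-exclusion suggestion could look roughly like this in Python. This is only a sketch: the function name and the sample values are assumptions, and it relies on Solr accepting a purely negative filter query as a separate fq parameter.

```python
from urllib.parse import urlencode


def range_with_exclusions(field, low, high, excluded):
    """Return fq strings: one range filter plus an optional exclusion filter."""
    filters = [f"{field}:[{low} TO {high}]"]
    if excluded:
        joined = " OR ".join(str(v) for v in excluded)
        filters.append(f"-{field}:({joined})")
    return filters


# Sample: ids 200..2000, except 250 and 300 (hypothetical values).
filters = range_with_exclusions("id", 200, 2000, [250, 300])

# Each filter is sent as its own fq parameter; Solr intersects them.
query_string = urlencode([("q", "*:*")] + [("fq", f) for f in filters])
```

Because separate fq parameters are intersected, the exclusion only has to list the few ids to drop, which keeps the request short.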
Re: querying using filter query and lots of possible values
Hey Chantal,

thanks for your answer. Range queries would not work, because the values are not consecutive. They can be randomly ordered, with gaps. The above was just an example. Excluding is also not a solution, because the list of excluded IDs would be even longer.

To be more specific: the IDs are not even integers, but UUIDs. There are tens of thousands of them, and the document pool contains hundreds of millions of documents.

Thanks.
Daniel

On Thu, Jul 26, 2012 at 6:22 PM, Chantal Ackermann c.ackerm...@it-agenten.com wrote:

> Hi Daniel,
>
> index the id into a field of type tint or tlong and use a range query (http://wiki.apache.org/solr/SolrQuerySyntax?highlight=%28rangequery%29):
>
> fq=id:[200 TO 2000]
>
> If you want to exclude certain ids, it might be wiser to simply add an exclusion query in addition to the range query instead of listing all the single values.
>
> You will run into problems with overly long request URLs. If you cannot avoid long URLs, you might want to increase maxBooleanClauses (see http://wiki.apache.org/solr/SolrConfigXml/#The_Query_Section).
>
> Cheers,
> Chantal
Re: querying using filter query and lots of possible values
You can't update the original documents except by reindexing them, so there is no easy group assignment option.

If you create this 'collection' once but query it multiple times, you may be able to use the Solr 4 join, with the IDs stored separately and joined on. It is still not great, because performance is an issue when mapping on IDs: http://www.lucidimagination.com/blog/2012/06/20/solr-and-joins/

If the list is some sort of combination of smaller lists, you could probably precompute those fragments at index time and run a compound query over them. But if you have to query every time and the list is different every time, that could be complicated.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Jul 26, 2012 at 12:01 PM, Daniel Brügge daniel.brue...@googlemail.com wrote:

> Hi,
>
> I am facing the following issue: I have a couple of million documents, each with a field called source_id. My problem is that I want to retrieve all the documents which have a source_id in a specific range of values. This range can be pretty big, for example a list of 200 to 2000 source IDs.
>
> I was thinking that a filter query could be used, like
>
> fq=source_id:(1 2 3 4 5 6 ...)
>
> but this reminds me of SQL's WHERE IN (...), which was always a bit slow for a huge number of values.
>
> Another solution that came to my mind was to assign a new kind of filter ID to all the documents I want to retrieve, so that all the documents I want to analyse get a new ID. But for this I would need to update all the millions of documents and assign them a new ID. This could take some time.
>
> Can you think of a nicer way to solve this issue?
>
> Regards and greetings,
> Daniel
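A rough Python sketch of the join approach follows, assuming the ID list is indexed as small separate documents. The field names list_name and member_id, the list name campaign_42, and both helper functions are hypothetical; only the {!join from=... to=...} local-params syntax comes from Solr 4 itself.

```python
def list_membership_docs(list_name, source_ids):
    """Build one small document per (list, source_id) pair for indexing."""
    return [
        {"id": f"{list_name}/{sid}", "list_name": list_name, "member_id": sid}
        for sid in source_ids
    ]


def join_filter(from_field, to_field, inner_query):
    """Build an fq that joins matches of inner_query onto to_field."""
    return f"{{!join from={from_field} to={to_field}}}{inner_query}"


# Index these once; they describe which source_ids belong to the list.
docs = list_membership_docs("campaign_42", ["uuid-1", "uuid-2"])

# Select main documents whose source_id appears as member_id in that list.
fq = join_filter("member_id", "source_id", "list_name:campaign_42")
```

With this layout, changing the list means reindexing only the small membership documents, not the millions of main documents.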
Re: querying using filter query and lots of possible values
Thanks Alexandre,

the list of IDs is constant for a longer time. I will take a look at this join topic.

Maybe another solution would be to really create a whole new collection or set of documents, containing the aggregated documents (from the IDs), from scratch, and to execute queries on this collection. This would take some time, but maybe it's worth it, because the querying will thank you.

Daniel

On Thu, Jul 26, 2012 at 7:43 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

> You can't update the original documents except by reindexing them, so there is no easy group assignment option.
>
> If you create this 'collection' once but query it multiple times, you may be able to use the Solr 4 join, with the IDs stored separately and joined on. It is still not great, because performance is an issue when mapping on IDs: http://www.lucidimagination.com/blog/2012/06/20/solr-and-joins/
>
> If the list is some sort of combination of smaller lists, you could probably precompute those fragments at index time and run a compound query over them. But if you have to query every time and the list is different every time, that could be complicated.
>
> Regards,
> Alex.
Re: querying using filter query and lots of possible values
Hi Daniel,

depending on how you decide on the list of IDs in the first place, you could also create a new index (core) and populate it with DIH, which would select only the documents from your main index (core) that are in this range of IDs. When updating, you could try a delta import.

Of course, this is only worth the effort if that core would exist for some time, but you've written that the subset of IDs is constant for a longer time.

Just another idea on top ;-)
Chantal
Re: querying using filter query and lots of possible values
Exactly. Creating a new index from the aggregated documents is the plan I described above. I don't really know how long this will take for each new index. Hopefully under 1 hour or so. That would be tolerable.

Thanks.
Daniel

On Thu, Jul 26, 2012 at 8:47 PM, Chantal Ackermann c.ackerm...@it-agenten.com wrote:

> Hi Daniel,
>
> depending on how you decide on the list of IDs in the first place, you could also create a new index (core) and populate it with DIH, which would select only the documents from your main index (core) that are in this range of IDs. When updating, you could try a delta import.
>
> Of course, this is only worth the effort if that core would exist for some time, but you've written that the subset of IDs is constant for a longer time.
>
> Just another idea on top ;-)
> Chantal