Re: querying using filter query and lots of possible values

2012-07-28 Thread Daniel Brügge
Hi,

thanks for this hint. Will check this out. Sounds promising.

Daniel




Re: querying using filter query and lots of possible values

2012-07-27 Thread Chris Hostetter

: the list of IDs stays constant for a longer time. I will take a look at
: this join topic.
: Maybe another solution would be to really create a whole new
: collection or set of documents containing the aggregated documents (from the
: ids) from scratch and to execute queries on this collection. This
: would take some time, but maybe it's worth it because querying will
: benefit from it.

Another avenue to consider...

http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/schema/ExternalFileField.html

...would allow you to map values in your source_id to some numeric
values (many to many) and these numeric values would then be accessible in
functions -- so you could use something like fq={!frange ...} to select
all docs with value 67 where your external file field says that value 67
is mapped to the following thousand source_id values.

The external file fields can then be modified at any time just by doing a
commit on your index.



-Hoss
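
A minimal sketch of the ExternalFileField idea above, in Python. It assumes a hypothetical external file field called source_group that is keyed on source_id; the field name, the sample ids, the group number 67 and both helper names are illustrative and do not come from the thread. The sketch only produces the key=value lines for the external file and the matching frange filter string; it does not talk to Solr.

```python
def external_file_lines(group_by_source_id):
    """Return one 'key=value' line per source_id (values written as floats)."""
    return [f"{source_id}={float(group)}"
            for source_id, group in sorted(group_by_source_id.items())]


def frange_filter(field, group):
    """Build an fq value matching docs whose function value equals `group`."""
    # l and u are the lower and upper bounds of the function range.
    return f"{{!frange l={group} u={group}}}{field}"


# Hypothetical mapping: two source ids belong to group 67, one to group 12.
mapping = {"uuid-b": 67, "uuid-a": 67, "uuid-c": 12}
lines = external_file_lines(mapping)
fq = frange_filter("source_group", 67)
```

Changing the selection then means rewriting the external file and issuing a commit, rather than reindexing the documents.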


querying using filter query and lots of possible values

2012-07-26 Thread Daniel Brügge
Hi,

I am facing the following issue:

I have a couple of million documents, which have a field called source_id.
My problem is that I want to retrieve all the documents which have a
source_id in a specific range of values. This range can be pretty big, for
example a list of 200 to 2000 source ids.

I was thinking that a filter query could be used, like
fq=source_id:(1 2 3 4 5 6 ...), but this reminds me of SQL's
WHERE IN (...), which was always a bit slow for a huge number of values.

Another solution that came to my mind was to assign all the documents I
want to retrieve a new kind of filter id, so that all the documents which I
want to analyse get a new id. But I would need to update all the millions of
documents for this and assign them a new id. This could take some time.

Can you think of a nicer way to solve this issue?

Regards & greetings

Daniel
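
A small Python sketch of the filter-query approach from the question, assuming the ids are sent in a POST body rather than in the URL. The helper names and the sample ids are made up for illustration. Each id is quoted so that characters such as hyphens are not interpreted by the query parser, and the space-separated values are only OR-ed together if the default operator is OR.

```python
from urllib.parse import urlencode


def source_id_filter(source_ids, field="source_id"):
    """Build fq text like source_id:("a" "b" ...), quoting each value."""
    quoted = " ".join(f'"{sid}"' for sid in source_ids)
    return f"{field}:({quoted})"


def select_body(source_ids, query="*:*"):
    """URL-encode q and fq so they can be sent as a POST body."""
    return urlencode({"q": query, "fq": source_id_filter(source_ids)})


# Hypothetical ids; a real list would hold hundreds or thousands of values.
ids = ["3f2a-01", "9c4e-02"]
fq = source_id_filter(ids)
body = select_body(ids)
```

This only builds the request; whether it performs well for thousands of values is exactly the open question of the thread.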


Re: querying using filter query and lots of possible values

2012-07-26 Thread Chantal Ackermann
Hi Daniel,

index the id into a field of type tint or tlong and use a range query 
(http://wiki.apache.org/solr/SolrQuerySyntax?highlight=%28rangequery%29):

fq=id:[200 TO 2000]

If you want to exclude certain ids it might be wiser to simply add an exclusion 
query in addition to the range query instead of listing all the single values. 
Listing every value will run into problems with overly long request URLs. If you 
cannot avoid long lists you might also need to increase maxBooleanClauses (see 
http://wiki.apache.org/solr/SolrConfigXml/#The_Query_Section).

Cheers,
Chantal
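
A short Python sketch of the range-plus-exclusion suggestion above. It assumes a numeric id field and made-up excluded values; the helper names are illustrative. The two strings are meant to be sent as two separate fq parameters, which Solr intersects.

```python
def range_filter(field, low, high):
    """Inclusive range filter, e.g. id:[200 TO 2000]."""
    return f"{field}:[{low} TO {high}]"


def exclusion_filter(field, excluded):
    """Negative filter that drops documents matching any excluded value."""
    values = " ".join(str(v) for v in excluded)
    return f"-{field}:({values})"


# Hypothetical example: keep ids 200..2000 except 250 and 1337.
filters = [range_filter("id", 200, 2000), exclusion_filter("id", [250, 1337])]
```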




Re: querying using filter query and lots of possible values

2012-07-26 Thread Daniel Brügge
Hey Chantal,

thanks for your answer.

The range queries would not work, because the values are not consecutive.
They can be randomly ordered, with gaps. The above was just an example.

Excluding is also not a solution, because the list of excluded ids would be
even longer.

To be more specific: the IDs are not even integers, but UUIDs. There are
tens of thousands of them, and the document pool contains hundreds of
millions of documents.

Thanks. Daniel







Re: querying using filter query and lots of possible values

2012-07-26 Thread Alexandre Rafalovitch
You can't update the original documents except by reindexing them, so
there is no easy group assignment option.

If you create this 'collection' once but query it multiple times, you
may be able to use the Solr 4 join, with the IDs stored separately and
joined on. Still not great, because performance is an issue when
mapping on IDs:
http://www.lucidimagination.com/blog/2012/06/20/solr-and-joins/ .

If the list is some sort of combination of smaller lists, you could
probably precompute (at index time) those fragments and do a compound
query over them.

But if you have to query every time and the list is different every
time, that could be complicated.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)
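
A minimal Python sketch of what the join suggestion above could look like as a filter query. It assumes a hypothetical second core named idlists whose documents carry a source_id and a list_name field; the core name, field names and helper name are illustrative only. The sketch just assembles the {!join} local-params string.

```python
def join_filter(from_field, to_field, from_index, inner_query):
    """Build a join fq: match inner_query in from_index, then join on the ids."""
    local_params = f"{{!join from={from_field} to={to_field} fromIndex={from_index}}}"
    return local_params + inner_query


# Hypothetical: select every main-index doc whose source_id is in list 'analysis_a'.
fq = join_filter("source_id", "source_id", "idlists", "list_name:analysis_a")
```

With this layout, changing a list means updating only the small id documents in the second core, not the main documents.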




Re: querying using filter query and lots of possible values

2012-07-26 Thread Daniel Brügge
Thanks Alexandre,

the list of IDs stays constant for a longer time. I will take a look at
this join topic.
Maybe another solution would be to really create a whole new
collection or set of documents containing the aggregated documents (from the
ids) from scratch and to execute queries on this collection. This
would take some time, but maybe it's worth it because querying will
benefit from it.

Daniel




Re: querying using filter query and lots of possible values

2012-07-26 Thread Chantal Ackermann
Hi Daniel,

Depending on how you decide on the list of ids in the first place, you could 
also create a new index (core) and populate it with DIH, which would select only 
the documents from your main index (core) that are in this list of ids. When 
updating you could try a delta import.

Of course, this is only worth the effort if that core would exist for some time 
- but you've written that the subset of ids is constant for a longer time.

Just another idea on top ;-)
Chantal

Re: querying using filter query and lots of possible values

2012-07-26 Thread Daniel Brügge
Exactly. Creating a new index from the aggregated documents is the plan
I described above. I don't really know how long this will take for each
new index. Hopefully under 1 hour or so. That would be tolerable.

Thanks. Daniel
