EC2 instance type recommended for SOLR?

2015-11-07 Thread Costi Muraru
Hi folks,

I'm trying to decide on the EC2 instance type
 to use for a Solr cluster. Some
details about the cluster:
1) The total index size is 89.9GB (somewhere around 20 mil records).
2) The number of requests that reach Solr is pretty low (thousands per
day), but they are heavy (long queries with frange and stuff like that).
3) Running Solr 4.10
4) The focus is on quick response time

What I'm thinking is that:
- The entire index should fit into memory
- Limit the number of nodes to reduce inter-node network communication in
order to have a faster response time
- Have a replication factor of at least 2

So far, I'm leaning towards using:
- 6 x c3.4xlarge (each with 16 CPU and 30GB RAM)
or
- 3 x c3.8xlarge (each with 32 CPU and 60GB RAM)

Which one do you think would yield better results (faster response
times)?
Feedback is gladly appreciated.

Thanks,
Costi


SOLR plugin: Retrieve all values of multivalued field

2015-05-11 Thread Costi Muraru
Hi folks,

I'm playing with a custom SOLR plugin and I'm trying to retrieve the value
for a multivalued field, using the code below.

==
schema.xml:
<field name="my_field_name" type="string" indexed="true" stored="false"
multiValued="true"/>
==
input data:

<add>

<doc>
  <field name="id">83127</field>
  <field name="my_field_name">somevalue</field>
  <field name="my_field_name">some other value</field>
  <field name="my_field_name">some other value 3</field>
  <field name="my_field_name">some other value 4</field>
</doc>

</add>
==

plugin:

SortedDocValues termsIndex = FieldCache.DEFAULT.getTermsIndex(atomicReader,
"my_field_name");
...
int document = 12;
BytesRef spare = termsIndex.get(document);
String value = new String(spare.bytes, spare.offset, spare.length);

--

This only returns the value "some other value 3". Is there any way to
obtain the other values as well (e.g. "somevalue", "some other value")?
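For what it's worth, this is the direction I've been looking at, based on
the doc-term-ords FieldCache API in Lucene 4.x (just a sketch, I haven't
verified the exact signatures):

SortedSetDocValues ords = FieldCache.DEFAULT.getDocTermOrds(atomicReader,
"my_field_name");
List<String> allValues = new ArrayList<String>();
BytesRef scratch = new BytesRef();
ords.setDocument(document);
long ord;
while ((ord = ords.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
    ords.lookupOrd(ord, scratch);          // fills scratch with the term bytes
    allValues.add(scratch.utf8ToString()); // "somevalue", "some other value", ...
}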
Any help is gladly appreciated.

Thanks,
Costi


Re: Using SolrCloud on Amazon EC2

2014-09-26 Thread Costi Muraru
Hi Timo,
Why not use Cloudera's CDH5 which comes with Solr?

Costi

On Thu, Sep 25, 2014 at 10:43 AM, Timo Schmidt timo-schm...@gmx.net wrote:

 Hi all,

 we currently plan to set up a project based on Solr Cloud and Amazon
 Web Services. Our main search application is deployed using AWS OpsWorks,
 which works out quite well.
 Since we also want to provision Solr to EC2, I want to ask for experiences
 with the different deployment/provisioning tools.
 So far I see the following 3 approaches.

 1. Using the Lucidworks solr-scale-tk to set up and maintain the cluster
 Who is using this in production and what are your experiences?

 2. Implementing our own Chef cookbooks for AWS OpsWorks to install SolrCloud
 as a custom OpsWorks layer
 Did somebody do this already?
 What are your experiences?

 Are there any cookbooks out, where we can contribute and reuse?

 3. Implementing our own Chef cookbooks for AWS OpsWorks to install SolrCloud
 as a Docker container
 Any experiences with this?

 Do you see other options? AFAIK Elastic Beanstalk could also be an option.
 It would be very nice to get some experiences and recommendations.

 Cheers

 Timo



Re: Evaluate function only on subset of documents

2014-06-24 Thread Costi Muraru
Thanks guys for your answers.
Sorry for the query syntax errors I've added in the previous queries.

Chris, you've been really helpful. Indeed, point 3 is the one I'm trying to
solve, rather than 2.
You're saying that "BooleanScorer will consult the clauses in order based
on which clause says it can skip the most documents."
I think this might be the culprit for me.

Let's take this query sample:
XXX OR AAA AND {!frange ...}

For my use case:
AAA returns a subset of 100k documents.
frange returns 5k documents, all part of these 100k documents.

Therefore, frange skips the most documents. From what you are saying,
frange is going to be applied on all documents (since it skips the most
documents) and AAA is going to be applied on the subset. This is kind of
what I've originally noticed. My goal is to have this in reverse order,
since frange is much more expensive than AAA.
I was hoping to do so by specifying the cost, saying: "Hey, frange has
cost 100 while AAA has cost 1, so run AAA first and then run frange on the
subset." However this does not seem to be taken into consideration.
Does this make sense / Am I getting something wrong? Is there something I
can do to achieve this?

Thanks,
Costi


On Tue, Jun 24, 2014 at 4:23 AM, Chris Hostetter hossman_luc...@fucit.org
wrote:

 : Now, if I want to make a query that also contains some OR, it is
 impossible
 : to do so with this approach. This is because fq with OR operator is not
 : supported (SOLR-1223). As an alternative I've tried these queries:
 :
 : county='New York' AND (location:Maylands OR location:Holliscort or
 : parking:yes) AND _val_:{!frange u=0 cost=150
 cache=false}mycustomfunction()

 1) most of the examples you've posted have syntax errors in them that are
 probably throwing a wrench into your testing.  in this example county='New
 York' is not valid syntax, presumably you want county:"New York"

 2) based on the example you give, what you're trying to do here doesn't
 really depend on using SHOULD (ie: OR) type logic against the frange:
 the only disjunction you have is in a sub-query of a top level
 conjunction (ie: all required) ... the frange itself is still mandatory.

 so you could still use it as a non-cached postfilter just like in your
 previous example:

 q=+XXX +(YYY ZZZ)&fq={!frange cost=150 cache=false ...}


 3) if that query wasn't exactly what you meant, and your top level query is
 more complex, containing a mix of MUST, MUST_NOT, and SHOULD clauses, ie:

 q=+XXX YYY ZZZ -AAA +{!frange ...}

 ...then the internal behavior of BooleanQuery will automatically do what
 you want (no need for cache or cost params on the fq) to the best
 of its ability, because of how the evaluation of boolean clauses is
 re-ordered internally based on the next match.

 it's kind of complicated to explain, but the short version is:

 a) BooleanScorer will avoid asking any clause if it matches a document
 which has already been disqualified by another clause
 b) BooleanScorer will consult the clauses in order based on which clause
 says it can skip the most documents

 So you might see your custom function evaluated for some docs that
 ultimately don't match, but if there are rarer mandatory clauses
 of your BQ that tell Lucene it can skip over a large number of docs,
 then your custom function will be skipped.

 This is how BooleanQuery has always worked, but i just committed a test to
 verify it even when wrapping a FunctionRangeQuery...

 https://svn.apache.org/r1604990


 4) the extreme of #3 is that if you need to use the {!frange} as part of
 a full disjunction, ie:

q=XXX OR YYY OR {!frange ...}

 ...then it would be impossible for Solr to only execute the expensive
 function against the subset of documents that match the query -- because
 BooleanScorer won't be able to tell which documents match the query unless
 it evaluates the function (it's a catch-22).   even if a doc does not
 match either XXX or YYY, solr has to evaluate the function against that
 doc to see if that function *makes* the document match the entire query.






 -Hoss
 http://www.lucidworks.com/



Re: Evaluate function only on subset of documents

2014-06-24 Thread Costi Muraru
Hi Chris,

Thanks for your patience, I've now got a better image on how things work.
I don't believe however that the two queries (the one with the post filter
and the one without one) are equivalent.

Suppose out of the whole document set:
XXX returns documents 1,2,3.
AAA returns documents  6,7,8.
{!frange}customfunction returns documents 7,8.

Running this query:
XXX OR AAA AND {!frange ...}
Matched documents are:
(1,2,3) OR (6,7,8) AND (7,8) = (1,2,3) OR (7,8) = 1,2,3,7,8

With the post filter:
q=XXX OR AAA&fq={!frange cost=150 cache=false ...}
Matched documents are:
(1,2,3) OR (6,7,8) = (1,2,3,6,7,8) with post filter (7,8) = (7,8)


I was hoping that the evaluation process would short-circuit.
Document set: 1,2,3,4,5,6,7,8

Document id 1:
Does it match XXX? Yes. Document matches query. Skip the second clause (AAA
AND {!frange ...}) and evaluate next doc.
Document id 2:
Does it match XXX? Yes. Document matches query. Skip second clause and
evaluate next doc.
Document id 3:
Does it match XXX? Yes. Document matches query. Skip second clause and
evaluate next doc.

Document id 4:
Does it match XXX? No.
Does it match AAA? No. Document does not match query. Skip frange and
evaluate next doc.

Document id 5:
Does it match XXX? No.
Does it match AAA? No. Document does not match query. Skip frange and
evaluate next doc.

Document id 6:
Does it match XXX? No.
Does it match AAA? Yes.
Does it match frange? No.  Document does not match query. [Only here the
custom function would be evaluated first.]

Document id 7:
Does it match XXX? No.
Does it match AAA? Yes.
Does it match frange? Yes.  Document matches query.

Document id 8:
Does it match XXX? No.
Does it match AAA? Yes.
Does it match frange? Yes.  Document matches query.

Returned documents: 1,2,3,7,8.

So with this logic the custom function would be evaluated on documents
6,7,8 rather than on the whole set (to find the smallest doc id), like
you've described in your last email.

I hope I'm not rambling. :-)
Does it make sense?

Costi


On Tue, Jun 24, 2014 at 7:26 PM, Chris Hostetter hossman_luc...@fucit.org
 wrote:


 : Let's take this query sample:
 : XXX OR AAA AND {!frange ...}
 :
 : For my use case:
 : AAA returns a subset of 100k documents.
 : frange returns 5k documents, all part of these 100k documents.
 :
 : Therefore, frange skips the most documents. From what you are saying,
 : frange is going to be applied on all documents (since it skips the most
 : documents) and AAA is going to be applied on the subset. This is kind of
 : what I've originally noticed. My goal is to have this in reverse order,

 That's not exactly it ... there's no way for the query to know in advance
 how many documents it matches -- what BooleanQuery asks each clause is
 "looking at the index, tell me the (internal) lucene docid of the first doc
 you match".  it then looks at the lowest matching docid of each clause, and
 the Occur property of the clause (MUST, MUST_NOT, SHOULD) to be able to
 tell if/when it can say things like "clause AAA is mandatory but the
 lowest id it matches is doc# 8675 -- so it doesn't matter that clause XXX's
 lowest match is doc# 10 or that clause {!frange}'s lowest match is doc#
 100"

 it can then ask XXX and {!frange} to both skip ahead, and find the lowest
 docid they each match that is no less than 8675, etc...

 from the perspective of {!frange} in particular, this means that on the
 first call it will evaluate itself against docid #0, #1, #2, etc... until
 it finds a match.  and on the second call it will evaluate itself against
 docid #8675, 8676, etc... until it finds a match...

 : since frange is much more expensive than AAA.
 : I was hoping to do so by specifying the cost, saying that Hey, frange
 has

 There is no support for specifying cost on individual clauses inside of a
 BooleanQuery.

 But i really want to re-iterate, that even with the example you posted
 above you *still* don't need to nest your {!frange} inside of a boolean
 query -- what you have is this:

 XXX OR AAA AND {!frange ...}

 in which the {!frange ...} clause is completely mandatory -- so my
 previous point #2 still applies...

 :  2) based on the example you give, what you're trying to do here doesn't
 :  really depend on using SHOULD (ie: OR) type logic against the frange:
 :  the only disjunction you have is in a sub-query of a top level
 :  conjunction (e: all required) ... the frange itself is still mandatory.
 : 
 :  so you could still use it as a non-cached postfilter just like in your
 :  previous example:

   q=XXX OR AAA&fq={!frange cost=150 cache=false ...}


 -Hoss
 http://www.lucidworks.com/



Evaluate function only on subset of documents

2014-06-23 Thread Costi Muraru
Hi guys,

I'm running some tests and I can't seem to figure this one out.
Suppose we have a real estate index, containing homes for rent and purchase.
The first kind of query I want to make is like so:
- type:purchase AND {!frange u=10}mycustomfunction()

The function is expensive and, in order to improve performance, I want it
to be applied only on the subset of documents that match type:purchase.

Using:
q=*:*&fq={!cost=1}type:purchase&fq={!frange u=0 cost=3}mycustomfunction()
The function is applied on all documents, instead of only those that match
the *purchase* type. I verified this by checking the query time
and also by debugging the custom function.
So after reading more, I found the post-filters feature that exists in
SOLR. This is activated when cost >= 100 and cache=false. This led me to
the following query:
q=*:*&fq={!cost=1}type:purchase&fq={!frange u=0 cost=150
cache=false}mycustomfunction()
This works beautifully, with the function being applied on only the subset
of documents that match the desired type.

Now, if I want to make a query that also contains some OR, it is impossible
to do so with this approach. This is because fq with OR operator is not
supported (SOLR-1223). As an alternative I've tried these queries:

county='New York' AND (location:Maylands OR location:Holliscort or
parking:yes) AND _val_:{!frange u=0 cost=150 cache=false}mycustomfunction()

county='New York' AND (location:Maylands OR location:Holliscort or
parking:yes) AND {!parent which=' {!frange u=0 cost=150
cache=false}mycustomfunction()'}

On both these queries, the function is applied on all the documents that
exist in the index, instead of at least limiting to those homes that are in
New York. I've also tried with different cost/cache params.

Is there any way I can achieve queries containing AND/OR operators, with a
custom function being applied on only the subset of documents that match
the previous query parts?

Thanks,
Costi


Store Java object in field and retrieve it in custom function?

2014-06-19 Thread Costi Muraru
Hi,

I'm trying to save a Java object in a binary field and afterwards use this
value in a custom solr function.
I'm able to put and retrieve the Java object in Base64 via the UI, but I
can't seem to be able to retrieve the value in the custom function.

In the function I'm using:
termsIndex = FieldCache.DEFAULT.getTermsIndex(reader, fieldName);
termsIndex.get(doc, spare);
Log.debug("Length: " + spare.length);

The length is always 0. It works well if the field type is not binary, but
string.
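In case it helps clarify the goal, this is roughly the decoding step I want
to run once I get the raw bytes back (just a sketch; it assumes the bytes in
the BytesRef are the serialized object written with ObjectOutputStream, and
MyObject is a placeholder for my own class; exception handling omitted):

// Deserialize the object directly from the BytesRef slice.
ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(spare.bytes, spare.offset, spare.length));
MyObject obj = (MyObject) in.readObject();
in.close();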
Do you have any tips?

Thanks,
Costi


How to retrieve entire field value (text_general) in custom function?

2014-06-11 Thread Costi Muraru
I have a text_general field and want to use its value in a custom function.
I'm unable to do so. It seems that the tokenizer messes this up and only a
fraction of the entire value is being retrieved. See below for more details.

<doc>
  <str name="id">1</str>
  <str name="field_t">term1 term2 term3</str>
  <long name="_version_">1470628088879513600</long>
</doc>
<doc>
  <str name="id">2</str>
  <str name="field_t">x1 x2 x3</str>
  <long name="_version_">1470628088907825152</long>
</doc>


public class MyFunction extends ValueSource {

@Override
public FunctionValues getValues(Map context, AtomicReaderContext
readerContext) throws IOException {
final FunctionValues values = valueSource.getValues(context,
readerContext);
return new StrDocValues(this) {

@Override
public String strVal(int doc) {
return values.strVal(doc);
}
};
}
}

Tried with SOLR 4.8.1.

Function returns:
- term3 (for first document)
- null (for the second document)

I want the function to return:
- term1 term2 term3 (for first document)
- x1 x2 x3 (for the second document)

How can I achieve this? I tried to google it but no luck. I also looked
through the SOLR code but could not find something similar.
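One direction I've been wondering about is reading the stored value per
document instead of going through the wrapped ValueSource (a rough sketch,
assuming field_t is stored="true"; per-document stored-field access is slow,
so this is only to illustrate the idea):

@Override
public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
        throws IOException {
    final AtomicReader reader = readerContext.reader();
    return new StrDocValues(this) {
        @Override
        public String strVal(int doc) {
            try {
                // returns the full original value, e.g. "term1 term2 term3"
                return reader.document(doc).get("field_t");
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    };
}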

Thanks!
Costi


Extract values from custom function for ValueSource with multiple indexable fields

2014-06-08 Thread Costi Muraru
Hi guys,

I have a custom FieldType that adds several IndexableFields for each
document.
I also have a custom function, in which I want to retrieve these indexable
fields. I can't seem to be able to do so. I have added some code snippets
below.
Any help is gladly appreciated.

Thanks,
Costi

public class MyField extends FieldType {
    @Override
    public final java.util.List<IndexableField> createFields(SchemaField field,
            Object val, float boost) {
        List<IndexableField> result = new ArrayList<IndexableField>();
        result.add(new Field(field.getName(), "field1", FIELD_TYPE));
        result.add(new Field(field.getName(), "123", FIELD_TYPE));
        result.add(new Field(field.getName(), "ABC", FIELD_TYPE));
        return result;
    }
}


public class MyFunctionParser extends ValueSourceParser {
@Override
public ValueSource parse(FunctionQParser fqp) throws SyntaxError {
ValueSource fieldName = fqp.parseValueSource();
return new MyFunction(fieldName);
}
}

public class MyFunction extends ValueSource {
    ...
    @Override
    public FunctionValues getValues(Map context, AtomicReaderContext
            readerContext) throws IOException {
        final FunctionValues values = valueSource.getValues(context,
                readerContext);
        LOG.debug("Value is: " + values.strVal(doc)); // prints "123" - how
        // can I retrieve the "field1" and "ABC" indexable fields as well?
    }
}


MapReduceIndexerTool takes a lot of time for a limited number of documents

2014-05-26 Thread Costi Muraru
Hey guys,

I'm using the MapReduceIndexerTool to import data into a SolrCloud
cluster made out of 3 decent machines.
Looking in the JobTracker, I can see that the mapper jobs finish quite
fast. The reduce jobs get to ~80% quite fast as well. It is here that
they get stuck for a long period of time (picture + log attached).
I'm only trying to insert ~80k documents with 10-50 different fields
each. Why is this happening? Am I not setting something correctly? Could
it be the fact that most of the documents have different field names, or
too many of them for that matter?
Any tips are gladly appreciated.

Thanks,
Costi

From the reduce logs:
60208 [main] INFO  org.apache.solr.update.UpdateHandler  - start
commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]: commit: start
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]: commit: enter lock
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]: commit: now prepare
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]: prepareCommit: flush
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[IW][main]:   index before flush
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[DW][main]: main startFullFlush
60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[DW][main]: anyChanges? numDocsInRam=25603 deletes=true
hasTickets:false pendingChangesInFullFlush: false
60209 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[DWFC][main]: addFlushableState DocumentsWriterPerThread
[pendingDeletes=gen=0 25602 deleted terms (unique count=25602)
bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603,
deleteQueue=DWDQ: [ generation: 0 ]]
61542 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
[DWPT][main]: flush postings as segment _0 numDocs=25603
61664 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
125115 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
199408 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
271088 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
336754 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
417810 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
479495 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
552357 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
621450 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads
683173 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
heart beat for 1 threads

This is the run command I'm using:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar
org.apache.solr.hadoop.MapReduceIndexerTool \
 --log4j /home/cmuraru/solr/log4j.properties \
 --morphline-file morphline.conf \
 --output-dir hdfs://nameservice1:8020/tmp/outdir \
 --verbose --go-live --zk-host localhost:2181/solr \
 --collection collection1 \
hdfs://nameservice1:8020/tmp/indir


Re: MapReduceIndexerTool takes a lot of time for a limited number of documents

2014-05-26 Thread Costi Muraru
Hey Erick,

The job reducers began to die with "Error: Java heap space", after 1h and
22 minutes of being stuck at ~80%.

I did a few more tests:

Test 1.
80,000 documents
Each document had *20* fields. The field names were* the same *for all the
documents. Values were different.
Job status: successful
Execution time: 33 seconds.

Test 2.
80,000 documents
Each document had *20* fields. The field names were *different* for all the
documents. Values were also different.
Job status: successful
Execution time: 643 seconds.

Test 3.
80,000 documents
Each document had *50* fields. The field names were *the same* for all the
documents. Values were different.
Job status: successful
Execution time: 45.96 seconds.

Test 4.
80,000 documents
Each document had *50* fields. The field names were *different* for all the
documents. Values were also different.
Job status: failed
Execution time: after 1h reducers failed.
Unfortunately, this is my use case.

My guess is that the reduce time (to perform the merges) depends on whether
the field names are the same across the documents. If they are different, the
merge time increases dramatically. I don't have any knowledge of the Solr
merge operation, but is it possible that it tries to group the fields with
the same name across all the documents?
In the first case, when the field names are the same across documents, the
number of buckets is equal to the number of unique field names, which is 20.
In the second case, where all the field names are different (my use case),
it creates a lot more buckets (80k documents * 50 different field names = 4
million buckets) and the process gets slowed down significantly.
Is this assumption correct / Is there any way to get around it?

Thanks again for reaching out. Hope this is clearer now.

This is what one of the 80k documents looks like (JSON format):
{
  "id" : 442247098240414508034066540706561683636,
  "items" : {
    "IT49597_1180_i" : 76,
    "IT25363_1218_i" : 4,
    "IT12418_1291_i" : 95,
    "IT55979_1051_i" : 31,
    "IT9841_1224_i" : 36,
    "IT40463_1010_i" : 87,
    "IT37932_1346_i" : 11,
    "IT17653_1054_i" : 37,
    "IT59414_1025_i" : 96,
    "IT51080_1133_i" : 5,
    "IT7369_1395_i" : 90,
    "IT59974_1245_i" : 25,
    "IT25374_1345_i" : 75,
    "IT16825_1458_i" : 28,
    "IT56643_1050_i" : 76,
    "IT46274_1398_i" : 50,
    "IT47411_1275_i" : 11,
    "IT2791_1000_i" : 97,
    "IT7708_1053_i" : 96,
    "IT46622_1112_i" : 90,
    "IT47161_1382_i" : 64
  }
}

Costi


On Mon, May 26, 2014 at 7:45 PM, Erick Erickson erickerick...@gmail.com wrote:

 The MapReduceIndexerTool is really intended for very large data sets,
 and by today's standards 80K doesn't qualify :).

 Basically, MRIT creates N sub-indexes, then merges them, which it
 may do in a tiered fashion. That is, it may merge gen1 to gen2, then
 merge gen2 to gen3 etc. Which is great when indexing a bazillion
 documents into 20 shards, but all that copying around may take
 more time than you really gain for 80K docs.

 Also be aware that MRIT does NOT update docs with the same ID; this
 is due to the inherent limitation of the Lucene mergeIndex process.

 How long is a long time? Attachments tend to get filtered out, so if you
 want us to see the graph you might paste it somewhere and provide a link.

 Best,
 Erick

 On Mon, May 26, 2014 at 8:51 AM, Costi Muraru costimur...@gmail.com
 wrote:
  Hey guys,
 
  I'm using the MapReduceIndexerTool to import data into a SolrCloud
  cluster made out of 3 decent machines.
  Looking in the JobTracker, I can see that the mapper jobs finish quite
  fast. The reduce jobs get to ~80% quite fast as well. It is here that
  they get stuck for a long period of time (picture + log attached).
  I'm only trying to insert ~80k documents with 10-50 different fields
  each. Why is this happening? Am I not setting something correctly? Could
  it be the fact that most of the documents have different field names, or
  too many of them for that matter?
  Any tips are gladly appreciated.
 
  Thanks,
  Costi
 
  From the reduce logs:
  60208 [main] INFO  org.apache.solr.update.UpdateHandler  - start
 
 commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
  60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
  [IW][main]: commit: start
  60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
  [IW][main]: commit: enter lock
  60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
  [IW][main]: commit: now prepare
  60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
  [IW][main]: prepareCommit: flush
  60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
  [IW][main]:   index before flush
  60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
  [DW][main]: main startFullFlush
  60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
  [DW][main]: anyChanges? numDocsInRam=25603 deletes=true
  hasTickets:false pendingChangesInFullFlush: false
  60209 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
  [DWFC

Update existing documents using MapReduceIndexerTool?

2014-05-06 Thread Costi Muraru
Hi guys,

I've used the MapReduceIndexerTool [1] in order to import data into SOLR
and seem to have stumbled upon something. I've followed the tutorial [2] and
managed to import data into a SolrCloud cluster using the map reduce job.
I ran the job a second time in order to update some of the existing
documents. The job itself was successful, but the documents maintained the
same field values as before.
In order to update some fields for the existing IDs, I've decompiled the
AVRO sample file
(examples/test-documents/sample-statuses-20120906-141433-medium.avro),
updated some of the fields with new values, while maintaining the same IDs
and packaged the AVRO back. After this I ran the MapReduceIndexerTool and,
although successful, the records were not updated.
I've tried this several times. Even with a few documents the result is the
same - the documents are not being updated with the new values. Instead,
the old field values are kept.
If I manually delete the old document from SOLR and after this I run the
job, the document is inserted with the new values.

Do you guys have any experience with this tool? Is this by design,
or am I missing something? Can this behavior be overridden to force an
update? Any feedback is gladly appreciated.

Thanks,
Constantin

[1]
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html#csug_topic_6_1

[2]
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html


Re: Update existing documents using MapReduceIndexerTool?

2014-05-06 Thread Costi Muraru
Thanks, Wolfgang! Appreciate your support.
Is there any plan to make it possible to update/delete existing SOLR docs
using the MapReduceIndexerTool? Is such a thing even possible given the way
it works behind the scenes?

Costi



On Tue, May 6, 2014 at 3:58 PM, Wolfgang Hoschek whosc...@cloudera.com wrote:

 Yes, this is a known issue. Repeatedly running the MapReduceIndexerTool on
 the same set of input files can result in duplicate entries in the Solr
 collection. This occurs because currently the tool can only insert
 documents and cannot update or delete existing Solr documents.

 Wolfgang.

 On May 6, 2014, at 3:08 PM, Costi Muraru costimur...@gmail.com wrote:

  Hi guys,
 
  I've used the MapReduceIndexerTool [1] in order to import data into SOLR
  and seem to stumbled upon something. I've followed the tutorial [2] and
  managed to import data into a SolrCloud cluster using the map reduce job.
  I ran the job a second time in order to update some of the existing
  documents. The job itself was successful, but the documents maintained
 the
  same field values as before.
  In order to update some fields for the existing IDs, I've decompiled the
  AVRO sample file
  (examples/test-documents/sample-statuses-20120906-141433-medium.avro),
  updated some of the fields with new values, while maintaining the same
 IDs
  and packaged the AVRO back. After this I ran the MapReduceIndexerTool
 and,
  although successful, the records were not updated.
  I've tried this several times. Even with a few documents the result is
 the
  same - the documents are not being updated with the new values. Instead,
  the old field values are kept.
  If I manually delete the old document from SOLR and after this I run the
  job, the document is inserted with the new values.
 
  Do you guys have any experience with this tool? Is this something by
 design
  / Am I missing something? Can this behavior be overwritten to force an
  update? Any feedback is gladly appreciated.
 
  Thanks,
  Constantin
 
  [1]
 
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html#csug_topic_6_1
 
  [2]
 
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html




Fastest way to import big amount of documents in SolrCloud

2014-05-01 Thread Costi Muraru
Hi guys,

What would you say is the fastest way to import data into SolrCloud?
Our use case: each day we do a single import of a big number of documents.

Should we use SolrJ/DataImportHandler/other? Or perhaps is there a bulk
import feature in SOLR? I came upon this promising link:
http://wiki.apache.org/solr/UpdateCSV
Any idea on how UpdateCSV is performance-wise compared with
SolrJ/DataImportHandler?

If SolrJ, should we split the data into chunks and start multiple clients at
once? That way we could perhaps take advantage of the multiple servers in
the SolrCloud configuration.
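For reference, this is the kind of SolrJ batching I have in mind (just a
sketch against the Solr 4.x SolrJ API; the ZooKeeper hosts, collection name,
batch size and the MyRecord/records bits are placeholders, and exception
handling is omitted):

CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181/solr");
server.setDefaultCollection("collection1");

List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
for (MyRecord record : records) {          // records = the daily data feed
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", record.getId());
    doc.addField("value_i", record.getValue());
    batch.add(doc);
    if (batch.size() == 1000) {            // send documents in chunks
        server.add(batch);
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    server.add(batch);
}
server.commit();                           // single commit at the end
server.shutdown();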

Either way, after the import is finished, should we do an optimize or a
commit or none (
http://wiki.solarium-project.org/index.php/V1:Optimize_command)?

Any tips and tricks to perform this process the right way are gladly
appreciated.

Thanks,
Costi


Re: Fastest way to import big amount of documents in SolrCloud

2014-05-01 Thread Costi Muraru
Thanks for the reply, Anshum. Please see my answers to your questions below.

* Why do you want to do a full index everyday?
Not sure I understand what you mean by full index. Every day we want to
import additional documents to the existing ones. Of course, we want to
remove older ones as well, so the total amount remains roughly the same.
* How much of data are we talking about?
The number of new documents is around 500k each day.
* What's your SolrCloud setup like?
We're currently using Solr 3.6 with 16 shards and planning to switch to
SolrCloud, hence the inquiry.
* Do you already have some benchmarks which you're not happy with?
Not yet. Planning to do some tests quite soon. I was looking for some
guidance before jumping in.

"Also, it helps to set the commit intervals reasonable."
What do you mean by *reasonable*? Also, do you recommend using autoCommit?
We are currently doing an optimize after each import (in Solr 3), in order
to speed up future queries. This is proving to take very long though
(several hours). Doing a commit instead of an optimize usually brings the
master and slave nodes down. We reverted to calling optimize on every
ingest.
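For reference, the kind of autoCommit settings I'm asking about would look
something like this in solrconfig.xml (the values are placeholders, not a
recommendation):

<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit every 60s -->
  <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commits -->
</autoCommit>
<autoSoftCommit>
  <maxTime>300000</maxTime>           <!-- soft commit every 5 min for visibility -->
</autoSoftCommit>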



On Thu, May 1, 2014 at 11:57 PM, Anshum Gupta ans...@anshumgupta.net wrote:

 Hi Costi,

 I'd recommend SolrJ, parallelize the inserts. Also, it helps to set the
 commit intervals reasonable.

 Just to get a better perspective
 * Why do you want to do a full index everyday?
 * How much of data are we talking about?
 * What's your SolrCloud setup like?
 * Do you already have some benchmarks which you're not happy with?



 On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com
 wrote:

  Hi guys,
 
  What would you say it's the fastest way to import data in SolrCloud?
  Our use case: each day do a single import of a big number of documents.
 
  Should we use SolrJ/DataImportHandler/other? Or perhaps is there a bulk
  import feature in SOLR? I came upon this promising link:
  http://wiki.apache.org/solr/UpdateCSV
  Any idea on how UpdateCSV is performance-wise compared with
  SolrJ/DataImportHandler?
 
  If SolrJ, should we split the data in chunks and start multiple clients
 at
  once? In this way we could perhaps take advantage of the multitude number
  of servers in the SolrCloud configuration?
 
  Either way, after the import is finished, should we do an optimize or a
  commit or none (
  http://wiki.solarium-project.org/index.php/V1:Optimize_command)?
 
  Any tips and tricks to perform this process the right way are gladly
  appreciated.
 
  Thanks,
  Costi
 



 --

 Anshum Gupta
 http://www.anshumgupta.net



Re: Delete fields from document using a wildcard

2014-04-29 Thread Costi Muraru
Thanks, Alex for the input.

Let me provide a better example of what I'm trying to achieve. I have
documents like this:

<doc>
  <field name="id">100</field>
  <field name="2_1600_i">1</field>
  <field name="5_1601_i">5</field>
  <field name="112_1602_i">7</field>
</doc>

The schema looks the usual way:
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
The dynamic field naming pattern I'm using is this: <id>_<day>_i.

Each day I want to add new fields for the current day and remove the fields
for the oldest one.

<add><doc>
  <field name="id">100</field>

  <!-- add fields for current day -->
  <field name="251_1603_i" update="set">25</field>

  <!-- remove fields for oldest day -->
  <field name="2_1600_i" update="set" null="true"/>
</doc></add>

The problem is, I don't know the exact names of the fields I want to
remove. All I know is that they end in *_1600_i.

When removing fields from a document, I want to avoid querying SOLR to see
what fields are actually present for the specific document. In this way,
hopefully I can speed up the process. Querying the schema.xml is not
going to help me much, since the field is defined as a dynamic field *_i. This
makes me think that expanding the documents client-side is not the best way
to do it.

Regarding the second approach, expanding the documents server-side: I took
a look over the SOLR code and came upon the UpdateRequestProcessor.java class,
which has this interesting javadoc:

 * This is a good place for subclassed update handlers to process the
 * document before it is indexed.  You may wish to add/remove fields or
 * check if the requested user is allowed to update the given document...

As you can imagine, I have no expertise in the SOLR code. How would you say it
would be possible to retrieve the document and its fields for the given id
and update the update/delete command to include the fields that match the
pattern I'm giving (e.g. *_1600_i)? A rough sketch of what I'm imagining is
below.
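Something along these lines is what I'm picturing (just a sketch, untested;
it assumes a plain string uniqueKey, that RealTimeGetComponent.getInputDocument
can be used to fetch the stored document, and that this processor sits before
DistributedUpdateProcessor in the chain so the "set"=null map is interpreted
as an atomic remove; imports and error handling omitted):

public class WildcardRemoveProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    public WildcardRemoveProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
        super(next);
        this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        String id = (String) doc.getFieldValue("id");
        // Look up the currently stored version of this document.
        SolrInputDocument existing =
                RealTimeGetComponent.getInputDocument(req.getCore(), new BytesRef(id));
        if (existing != null) {
            for (String name : existing.getFieldNames()) {
                if (name.endsWith("_1600_i")) {   // the fields I want to purge
                    // atomic update: "set" to null removes the field
                    doc.addField(name, Collections.singletonMap("set", null));
                }
            }
        }
        super.processAdd(cmd);
    }
}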

Thanks,
Costi


On Tue, Apr 29, 2014 at 6:41 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 Not out of the box, as far as I know.

 Custom UpdateRequestProcessor could possibly do some sort of expansion
 of the field name by verifying the actual schema. Not sure if API
 supports that level of flexibility. Or, for latest Solr, you can
 request the list of known field names via REST and do client-side
 expansion instead.

 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Tue, Apr 29, 2014 at 12:20 AM, Costi Muraru costimur...@gmail.com
 wrote:
  Hi guys,
 
  Would be possible, using Atomic Updates in SOLR4, to remove all fields
  matching a pattern? For instance something like:
 
   <add><doc>
     <field name="id">100</field>
     <field name="*_name_i" update="set" null="true"/>
   </doc></add>
 
  Or something similar to remove certain fields in all documents.
 
  Thanks,
  Costi



Re: Delete fields from document using a wildcard

2014-04-29 Thread Costi Muraru
I've opened an issue: https://issues.apache.org/jira/browse/SOLR-6034
Feedback in Jira is appreciated.


On Tue, Apr 29, 2014 at 8:34 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 I think this is useful as well. Can you open an issue?


 On Tue, Apr 29, 2014 at 7:53 PM, Shawn Heisey s...@elyograg.org wrote:

  On 4/29/2014 5:25 AM, Costi Muraru wrote:
   The problem is, I don't know the exact names of the fields I want to
   remove. All I know is that they end in *_1600_i.
  
   When removing fields from a document, I want to avoid querying SOLR to
  see
   what fields are actually present for the specific document. In this
 way,
   hopefully I can speed up the process. Querying to see the schema.xml is
  not
   going to help me much, since the field is defined a dynamic field *_i.
  This
   makes me think that expanding the documents client-side is not the best
  way
   to do it.
 
  Unfortunately at this time, you'll have to query the document and
  go through the list of fields to determine which need to be deleted,
  then build a request that deletes them.
 
  I don't know how hard it is to accomplish this in Solr.  Getting it
  implemented might require a bunch of people standing up and saying "we
  want this!"
 
  Thanks,
  Shawn
 



 --
 Regards,
 Shalin Shekhar Mangar.



Delete fields from document using a wildcard

2014-04-28 Thread Costi Muraru
Hi guys,

Would it be possible, using Atomic Updates in SOLR4, to remove all fields
matching a pattern? For instance something like:

<add><doc>
  <field name="id">100</field>
  <field name="*_name_i" update="set" null="true"/>
</doc></add>

Or something similar to remove certain fields in all documents.

Thanks,
Costi