Re: Merge Two Fields in SOLR

2015-04-07 Thread Damien Dykman
Ravi, what about using field aliasing at search time? Would that do the
trick for your use case?

http://localhost:8983/solr/mycollection/select?defType=edismax&q=name:john doe&f.name.qf=firstname surname

For more details:
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
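
Properly URL-encoded, that query would look something like this (a sketch;
the host, collection, and field names are illustrative):

curl "http://localhost:8983/solr/mycollection/select?defType=edismax&q=name:(john+doe)&f.name.qf=firstname+surname"

The f.name.qf parameter defines "name" as a query-time alias that expands
to the real fields firstname and surname.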

Damien

On 04/07/2015 10:21 AM, Erick Erickson wrote:
 I don't understand why copyField doesn't work. Admittedly the
 firstName and SurName would be separate tokens, but isn't that what
 you want? The fact that it's multiValued isn't really a problem,
 multiValued fields are really functionally identical to single valued
 fields if you set positionIncrementGap to... hmmm.. 1 or 0 I'm not
 quite sure which.

 Of course if you're sorting by the field, that's a different story.

 Here's a discussion with several options, but I really wonder what
 your specific objection to copyField is, it's the simplest and on the
 surface it seems like it would work.

 http://lucene.472066.n3.nabble.com/Concat-2-fields-in-another-field-td4086786.html
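
 Something like this in schema.xml would be the minimal copyField setup (a
 sketch; the field type and names are illustrative, not from your schema):

  <field name="name" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="FirstName" dest="name"/>
  <copyField source="SurName" dest="name"/>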

 Best,
 Erick

 On Tue, Apr 7, 2015 at 10:08 AM, EXTERNAL Taminidi Ravi (ETI,
 AA-AS/PAS-PTS) external.ravi.tamin...@us.bosch.com wrote:
 Hi Group,

 I am not sure if there is an easy way to merge the data of two fields into
 one field; copyField doesn't work for us because it stores the result as
 multivalued.

 Can someone suggest a workaround to achieve this use case?

 FirstName:ABC
 SurName:XYZ

 I need another field, Name:ABCXYZ, which I have to create on the Solr end, as
 the source data is read-only and I have no control to combine them at the source.


 Thanks

 Ravi



Retrieving list of words for highlighting

2015-03-25 Thread Damien Dykman
In Solr 5 (or 4), is there an easy way to retrieve the list of words to
highlight?

Use case: allow an external application to highlight the matching words
of a matching document, rather than using the highlighted snippets
returned by Solr.

Thanks,
Damien


Re: Solr 5.0.0 - Multiple instances sharing Solr server *read-only* dir

2015-03-10 Thread Damien Dykman
Thanks Timothy for the pointer to the Jira ticket. That's exactly it :-)

Erick, the main reason why I would run multiple instances on the same
machine is to simulate a multi-node environment. But beyond that, I like
the idea of being able to clearly separate the server dir and the data
dirs. That way the server dir could be deployed by root, yet Solr
instances could run in userland.

Damien

On 03/10/2015 09:31 AM, Timothy Potter wrote:
 I think the next step here is to ship Solr with the war already extracted
 so that Jetty doesn't need to extract it on first startup -
 https://issues.apache.org/jira/browse/SOLR-7227

 On Tue, Mar 10, 2015 at 10:15 AM, Erick Erickson erickerick...@gmail.com
 wrote:

 If I'm understanding your problem correctly, I think you want the -d
 option,
 then all the -s guys would be under that.
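
 For example (an untested sketch; paths are illustrative):

  bin/solr start -d /opt/solr-5.0.0/server -s /var/solr/home1 -p 31100
  bin/solr start -d /opt/solr-5.0.0/server -s /var/solr/home2 -p 31200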

 Just to check, though, why are you running multiple Solrs? There are
 sometimes
 very good reasons, just checking that you're not making things more
 difficult
 than necessary...

 Best,
 Erick

 On Mon, Mar 9, 2015 at 4:59 PM, Damien Dykman damien.dyk...@gmail.com
 wrote:
 Hi all,

 Quoted from

 https://cwiki.apache.org/confluence/display/solr/Solr+Start+Script+Reference
 When running multiple instances of Solr on the same host, it is more
 common to use the same server directory for each instance and use a
 unique Solr home directory using the -s option.

 Is there a way to achieve this without making *any* changes to the
 extracted content of solr-5.0.0.tgz, using only runtime parameters? In
 other words, can the extracted folder solr-5.0.0 be made strictly read-only?

 By default, the Solr web app is deployed under server/solr-webapp, as
 per solr-jetty-context.xml. So unless I change solr-jetty-context.xml, I
 cannot make folder solr-5.0.0 read-only to my Solr instances.

 I've figured out how to make the log files and pid file be located
 under the Solr data dir by doing:

 export SOLR_PID_DIR=mySolrDataDir/logs; \
 export SOLR_LOGS_DIR=mySolrDataDir/logs; \
 bin/solr start -c -z localhost:32101/solr \
  -s mySolrDataDir \
  -a -Dsolr.log=mySolrDataDir/logs \
  -p 31100 -h localhost

 But if there was a way to not have to change solr-jetty-context.xml that
 would be awesome! Thoughts?

 Thanks,
 Damien



Solr 5.0.0 - Multiple instances sharing Solr server *read-only* dir

2015-03-09 Thread Damien Dykman
Hi all,

Quoted from
https://cwiki.apache.org/confluence/display/solr/Solr+Start+Script+Reference

When running multiple instances of Solr on the same host, it is more
common to use the same server directory for each instance and use a
unique Solr home directory using the -s option.

Is there a way to achieve this without making *any* changes to the
extracted content of solr-5.0.0.tgz, using only runtime parameters? In
other words, can the extracted folder solr-5.0.0 be made strictly read-only?

By default, the Solr web app is deployed under server/solr-webapp, as
per solr-jetty-context.xml. So unless I change solr-jetty-context.xml, I
cannot make folder solr-5.0.0 read-only to my Solr instances.

I've figured out how to make the log files and pid file be located
under the Solr data dir by doing:

export SOLR_PID_DIR=mySolrDataDir/logs; \
export SOLR_LOGS_DIR=mySolrDataDir/logs; \
bin/solr start -c -z localhost:32101/solr \
 -s mySolrDataDir \
 -a -Dsolr.log=mySolrDataDir/logs \
 -p 31100 -h localhost

But if there was a way to not have to change solr-jetty-context.xml that
would be awesome! Thoughts?

Thanks,
Damien


/export - Why need sort criteria (4.10.2)?

2014-12-17 Thread Damien Dykman
The /export request handler mandates a sort order. Is there a particular
reason?

It'd be nice to have the option to tell Solr: just export in the order
you want, to limit any kind of overhead added by sorting. Or am I
missing something? If exports were distributed, I can see the need for
some kind of sort order, but they are not.
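
For reference, a typical /export call looks something like this (a sketch;
the collection and field names are illustrative, and the fl/sort fields
must have docValues enabled):

curl "http://localhost:8983/solr/mycollection/export?q=*:*&sort=id+asc&fl=id"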

BTW, kudos for adding this feature, it rocks and seems to scale really
well :-) Though, I did see some weird behaviors (NullPointerException @
SortingResponseWriter.java:784) in some cases. I'll further investigate
and if I manage to make that issue a little more deterministic and
reproducible, I'll share my findings.

Thanks,
Damien


Duplicate unique ID in implicit collection - Illegal?

2014-12-10 Thread Damien Dykman
Hi all,

With an implicit collection, is it legal to index the same document
(same unique ID) in 2 different shards? I know, it kind of defeats the
purpose of having a unique ID...

The reason I'm doing this is that I want to move a single document
from one shard to another. During the transition period, I'd use a search
criterion to specify which shard I want to target to find that document.
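
For example, during the transition I'd restrict a query to one shard with
something like this (a sketch; the shard name and document ID are illustrative):

curl "http://localhost:8983/solr/mycollection/select?q=id:doc1&shards=shard2"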

At search time, I do notice some weird behavior. The facets do account
for the duplicates, but the number of results varies, for
instance depending on the rows=xx parameter. But that doesn't surprise me too
much given the non-uniqueness-of-the-unique-ID.

So my actual question is the following: if my search query guarantees
there will be no duplicate matches, is my search result going to be
consistent? That's assuming it's legal to have duplicates across
shards from an indexing point of view.
 
Thanks,
Damien


Re: Transparently rebalancing a Solr cluster without splitting or moving shards

2014-07-08 Thread Damien Dykman

Thanks for your suggestions and recommendations.

If I understand correctly, the MIGRATE command does shard splitting
(around the range of the split.key) and merging behind the scenes.
Though, it's a bit difficult to properly monitor the actual migration,
set the proper timeouts, know when to direct indexing and search traffic
to the destination collection, etc.
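
For reference, the kind of MIGRATE call I mean (a sketch; the collection
names and split.key are illustrative):

curl "http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=source_coll&target.collection=target_coll&split.key=a!"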


Not sure how to MIGRATE an entire collection. By providing the full
list of split.keys? I'd be surprised if that were doable, but I guess it
would skip the splitting part, which makes it easier ;-) Or much tougher,
by splitting around all the ranges. More seriously, doing a MERGEINDEXES
at the core level might not be a bad alternative, provided the hash
ranges are compatible.


Damien

On 07/07/2014 05:14 PM, Shawn Heisey wrote:

I don't think you'd want to disable mmap. It could be done, by choosing
another DirectoryFactory object. Adding memory is likely to be the only
sane way forward.
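
If you did want to try it, the switch lives in solrconfig.xml; for example
(a sketch, with NIOFSDirectoryFactory as one non-mmap choice):

<directoryFactory name="DirectoryFactory" class="solr.NIOFSDirectoryFactory"/>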

Another possibility would be to bump up the maxShardsPerNode value and
build the new collection (with the proper number of shards) only on the
new machines... Then, when they are built, move them to their proper homes
and manually adjust the cluster state in zookeeper. This will still
generate a lot of I/O, but hopefully it will take less wall-clock time,
and it will be something you can do when load is low.

After that's done and you've switched to it, you can add replicas with
either the ADDREPLICA collections API or the core admin API. You
should be on the newest Solr version... Lots of bugs have been found and
fixed.
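
An ADDREPLICA call looks something like this (a sketch; the collection,
shard, and node names are illustrative):

curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test0&shard=shard1&node=host2:8983_solr"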

One thing I wonder is whether the MIGRATE api can be used on an entire
collection. It says it works by shard key, but I suspect that most users
will not be using that functionality.

Thanks,
Shawn



Transparently rebalancing a Solr cluster without splitting or moving shards

2014-07-07 Thread Damien Dykman
I have a cluster of N boxes/nodes and I'd like to add M boxes/nodes and 
rebalance data accordingly.


Let's add the following constraints:
  - 1. boxes have different characteristics (RAM, CPU, disks)
  - 2. different number of shards per box/node (let's pretend we have 
found the sweet spot for each box)
  - 3. once rebalancing is over, the layout of the cluster should be 
the same as if it had been bootstrapped from N+M boxes


Because of the above constraints, shard splitting or moving shards 
around is not an option. And to keep the discussion simple, let's ignore 
shard replicas.


So far, the best scenario I could think of is the following:
  - a. 1 collection on the N nodes using implicit routing
  - b. add shards on the M new nodes as part of that collection (see the 
sketch below)
  - c. reindex a portion of the data on the shards of the M new nodes, 
while restricting them from search
  - d. in 1 transaction, delete the old data and immediately issue a 
soft commit and remove search restrictions
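
A sketch of step b, assuming a Solr version whose Collections API supports
CREATESHARD on implicitly routed collections (the collection and shard
names are illustrative):

curl "http://localhost:8983/solr/admin/collections?action=CREATESHARD&collection=mycollection&shard=shard_m1"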


Any better idea?

I could also use 1 collection per box and have Solr do the routing 
within each collection. I would still have to handle the routing across 
collections, but collection aliases would come in handy. Overall, it 
would be similar to the above scenario. Actually, in my case it wouldn't 
work as well, because I also use some kind of "flag" document on the M 
new nodes which I need to update atomically with the delete of the old 
stuff. And, if I'm not mistaken, I'd lose atomicity with the 
multi-collection scenario.


Thank you for your feedback,
Damien


Re: Transparently rebalancing a Solr cluster without splitting or moving shards

2014-07-07 Thread Damien Dykman
Thanks Shawn, clean way to do it, indeed. And going your route, one 
could even copy the existing shards into the new collection and then 
delete the data which is getting reindexed on the new nodes. That would 
save reindexing everything.


But in my case, I add boxes after a noticeable performance degradation 
due to data volume increase. So the old boxes cannot afford reindexing 
data (or deleting it, if using the proposed variation) in the new collection 
while serving searches with the old collection. Unless there is a way to 
aggressively bound the RAM consumption of the new collection (by disabling 
mmap?), given that it's not being used for search during the transition? 
That said, even if that were possible, both collections would still compete 
for disk I/O.


Thanks,
Damien

On 07/07/2014 12:26 PM, Shawn Heisey wrote:

On 7/7/2014 12:41 PM, Damien Dykman wrote:

I have a cluster of N boxes/nodes and I'd like to add M boxes/nodes
and rebalance data accordingly.

Let's add the following constraints:
   - 1. boxes have different characteristics (RAM, CPU, disks)
   - 2. different number of shards per box/node (let's pretend we have
found the sweet spot for each box)
   - 3. once rebalancing is over, the layout of the cluster should be
the same as if it had been bootstrapped from N+M boxes

Because of the above constraints, shard splitting or moving shards
around is not an option. And to keep the discussion simple, let's
ignore shard replicas.

So far, the best scenario I could think of is the following:
   - a. 1 collection on the N nodes using implicit routing
   - b. add shards on the M new nodes as part of that collection
   - c. reindex a portion of the data on the shards of the M new nodes,
while restricting them from search
   - d. in 1 transaction, delete the old data and immediately issue a
soft commit and remove search restrictions

You may not like this answer, but here's a fairly clean way to do this,
assuming you have enough disk space on the existing machines:

1. Add the new boxes to the cluster.
2. Create a new collection across all the boxes.
2a. If your current collection is named "test" then name the new one
 "test0" or something else that's related, but different.
3. Index all data into the new collection.
4. As quickly as possible, do the following actions:
4a. Stop indexing.
4b. Do a synchronization pass on the new collection so it's current.
4c. Delete the original collection.
4d. Create a collection alias so that you can access the new collection
 with the original collection name.
4e. Restart indexing.
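
For step 4d, the alias call would be something like this (a sketch, using
the test/test0 names from 2a):

curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=test&collections=test0"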


Thanks,
Shawn





Re: Adding router.field property to an existing collection.

2014-06-25 Thread Damien Dykman

Hi Modassar,

I ran into the same issue (Solr 4.8.1) with an existing collection set
to implicit routing but with no router.field defined. I managed to
set the router.field by modifying /clusterstate.json and pushing it
back to Zookeeper. For instance, I use the field shard_name for routing.
Now, in my /clusterstate.json, I have:

"router":{
  "name":"implicit",
  "field":"shard_name"
}
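
To pull and push the file, the zkcli script that ships with Solr can be
used, something like this (a sketch; in 4.x the script lives under
example/scripts/cloud-scripts):

# fetch the current cluster state, edit it locally, then push it back
./zkcli.sh -zkhost localhost:2181 -cmd getfile /clusterstate.json /tmp/clusterstate.json
./zkcli.sh -zkhost localhost:2181 -cmd putfile /clusterstate.json /tmp/clusterstate.json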

Warning: you'll probably need to reload your collection (see the Collections
API) for the change to be taken into account. Or, more brutally,
restart your Solr nodes. Then you should see the update in
http://localhost:8983/solr/admin/collections?action=clusterstatus.

I'd be curious to know if there's a cleaner method though, rather than
modifying /clusterstate.json.

Otherwise, if you want to create a collection from scratch with implicit
routing and a router.field (see the Collections API), use:

http://localhost:8983/solr/admin/collections?action=CREATE&name=my_collection&router.name=implicit&router.field=shard_name

Good luck,
Damien

On 05/06/2014 05:59 AM, Modassar Ather wrote:
 Hi,

 I have a setup of two shards with embedded zookeeper and one collection on
 two tomcat instances. I cannot use uniqueKey, i.e. compositeId routing,
 for document routing, as per my understanding it will change the uniqueKey.
 There is another way mentioned on the Solr wiki: using router.field. I
 could not find a way of setting it in solr.xml or any other configuration
 file.

 Kindly share your suggestions on:
  How can I use router.field in an existing collection?
  How can I create a collection with router.field and implicit routing enabled?

 Thanks,
 Modassar



Atomic commit across shards?

2013-09-16 Thread Damien Dykman

Is a commit (hard or soft) atomic across shards?

In other words, can I guarantee that any given search on a multi-shard 
collection will hit the same index generation of each shard?


Thanks,
Damien