Re: how to get row no. of current record

2011-08-03 Thread Gora Mohanty
On Wed, Aug 3, 2011 at 9:35 AM, Ranveer ranveer.s...@gmail.com wrote:
 Hi Anshum,

 Thanks for reply.

 My requirement is to get results starting from the current id. For this I need to
 set start/rows.
 I am looking for something like Jonty's post:
 http://lucene.472066.n3.nabble.com/previous-and-next-rows-of-current-record-td3187935.html
[...]

Jonathan's replies (the last two) in that thread pretty much
tell you what to do: your web app has to keep track of
where you are in paging through the list of results. This should
not be overly difficult to implement.
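A minimal sketch of that bookkeeping, as a hedged illustration (the helper names and the Solr base URL are assumptions, not anything from the thread): the app remembers the page it is on and derives Solr's start/rows parameters from it.

```python
from urllib.parse import urlencode

def page_params(page: int, page_size: int = 10) -> dict:
    """Translate a 1-based page number into Solr start/rows parameters."""
    return {"start": (page - 1) * page_size, "rows": page_size}

def build_query_url(base: str, q: str, page: int, page_size: int = 10) -> str:
    """Build a paged /select URL; the app stores only `page` between requests."""
    params = {"q": q, **page_params(page, page_size)}
    return f"{base}/select?{urlencode(params)}"

# e.g. page 3 at 10 results per page starts at record 20
url = build_query_url("http://localhost:8983/solr", "id:[100 TO *]", page=3)
```

To show the records around a "current" record, the app would store that record's offset the same way and issue one query with start set just before it.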

Regards,
Gora


Re: Why Slop doesn't match anything?

2011-08-03 Thread Gora Mohanty
On Wed, Aug 3, 2011 at 1:33 AM, Alexander Ramos Jardim
alexander.ramos.jar...@gmail.com wrote:
[...]
 I am not using dismax. I didn't find the solution to the problem. I just
 did a full-import and the problem ended. Still odd.
[...]

Maybe you changed the type of the field in question, or changed
positionIncrementGap for the field, in between?

Regards,
Gora


Re: Jetty error message regarding EnvEntry in WebAppContext

2011-08-03 Thread Marian Steinbach
On Tue, Aug 2, 2011 at 18:42, Jonathan Rochkind rochk...@jhu.edu wrote:

 You know that the Solr distro comes with a Jetty with Solr in it, right, as an
 example application? Even if you don't want to use it for some reason, that
 would probably be the best model to look at for a working Jetty with Solr.

Sure, I know about the pre-configured Jetty, and that one runs fine on
the command line.

 Or is the problem that you want a different version of jetty?

What I actually wanted is a robust background service with an init script.

 As it happens, I just recently set up a jetty 6.1.26 for another project,
 not for solr. It was kind of a pain, not being too familiar with java
 deployment or jetty.  But I did get JNDI working by following the jetty
 instructions here: http://docs.codehaus.org/display/JETTY/JNDI  (It was a
 bit confusing to figure out what they were talking about, not being familiar
 with jetty, but eventually I got it, and the instructions were correct.)

I can imagine. I'll probably try to hand that task over to someone who
does have a clue. :)

Thanks for your response!

Marian


Re: performance crossover between single index and sharding

2011-08-03 Thread Bernd Fehling


On 02.08.2011 21:00, Shawn Heisey wrote:

...
I did try some early tests with a single large index. Performance was pretty 
decent once it got warmed up, but I was worried about how it would
perform under a heavy load, and how it would cope with frequent updates. I 
never really got very far with testing those fears, because the full
rebuild time was unacceptable - at least 8 hours. The source database can keep 
up with six DIH instances reindexing at once, which completes
much quicker than a single machine grabbing the entire database. I may increase 
the number of shards after I remove virtualization, but I'll
need to fix a few limitations in my build system.
...


First of all, thanks a lot for all the answers; here is my setup.

I know that it is very difficult to give specific recommendations about this.
Having changed from FAST Search to Solr, I can state that Solr performs
very well, if not excellently.

To show that I am comparing apples and oranges, here is my previous FAST Search
setup:
- one master server (controlling, logging, search dispatcher)
- six index servers (4.25 million docs per server, 5 slices per index)
  (searching and indexing at the same time, indexing once per week during the
weekend)
- each server has 4GB RAM, all servers are physical, on separate machines
- RAM usage controlled by the processes
- total of 25.5 million docs (mainly metadata) from 1500 databases worldwide
- index size is about 67GB per indexer -- about 402GB total
- about 3 qps at peak times
- with an average search time of 0.05 seconds at peak times

And here is my current Solr setup:
- one master server (indexing only)
- two slave servers (search only), but only one is online; the second is a fallback
- each server has 32GB RAM, all servers are virtual
  (master on a separate physical machine, both slaves together on one physical
machine)
- RAM usage is currently 20GB of java heap
- total of 31 million docs (all metadata) from 2000 databases worldwide
- index size is 156GB total
- the search handler statistics report 0.6 average requests per second
- average time per request 39.5 (is that seconds?)
- building the index from scratch takes about 20 hours

The good thing is I have the ability to compare a commercial product and
enterprise system to open source.

I started with my simple Solr setup because of KISS (keep it simple,
stupid).
Actually it is doing excellently as a single index on a single virtual server.
But the average time per request should be reduced now; that's why I started
this discussion.
While searches with a smaller Solr index (3 million docs) showed that it could
stand up to FAST Search, it now shows that it's time to go with sharding.
I think we are already far beyond the point of search performance crossover.

What I hope to get with sharding:
- reduce time for building the index
- reduce average time per request

What I fear with sharding:
- I currently have master/slave; do I then have e.g. 3 masters and 3 slaves?
- the query changes because of sharding (is there a search distributor?)
- how do I distribute the content to the indexers with DIH on 3 servers?
- anything else to think about while changing to sharding?
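On the second fear: Solr's built-in distributed search can act as the dispatcher. The node that receives the query fans it out via the shards parameter and merges the results itself. A sketch of the URL (host names here are assumptions for illustration):

```python
from urllib.parse import urlencode

# Assumed shard hosts; any one of them can receive and dispatch the query.
SHARDS = [
    "solr1:8983/solr",
    "solr2:8983/solr",
    "solr3:8983/solr",
]

def sharded_query(base: str, q: str, rows: int = 10) -> str:
    """Build a distributed query; the receiving node fans out to every
    shard listed in `shards` and merges the per-shard results."""
    params = {"q": q, "rows": rows, "shards": ",".join(SHARDS)}
    return f"http://{base}/select?{urlencode(params)}"

url = sharded_query("solr1:8983/solr", "title:lucene")
```

So no extra search distributor is needed for plain queries, though the client (or a thin front end) has to add the shards parameter.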

Conclusion:
- Solr can handle much more than 30 million docs of metadata in a single index
  if the java heap size is large enough. Keep an eye on Lucene's fieldCache and
  sorted fields, especially title (string) fields.
- The crossover in my case is somewhere between 3 million and 10 million docs
  per index for Solr (compared to FAST Search). FAST recommends about 3 to 6
  million docs per 4GB-RAM server for their system.

Anyone able to reduce my fears about sharding?
Thanks again for all your answers.

Regards
Bernd

--
*
BASE - Bielefeld Academic Search Engine - www.base-search.net
*


Re: performance crossover between single index and sharding

2011-08-03 Thread Dmitry Kan
OK, here is a brief on our sharded setup.

We have 10 shards, 3 per high-end Amazon machine. The majority of searches
are done on at most 2 shards, the ones that have the latest data in their
indices. We use logical sharding, not hash-based. Together these lead to a
situation where, given a user query that *will for sure* hit only the last 2
(or adjacent-in-time) shards, the other solr shards would have to search in
vain. Therefore, we have implemented a query router, which is essentially
solr itself with modifications in the QueryComponent. Before implementing
the router it was nearly impossible to run the system.
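The routing idea can be sketched in the application layer like this (the real router lives inside a modified QueryComponent; the shard names and the date-based layout below are assumptions for illustration): with logical, time-based sharding, the router maps a query's date range onto only the shards that can contain matches.

```python
from datetime import date

# Logical shards, each holding one quarter of the data (assumed layout).
SHARDS = {
    "shard_2011q1": (date(2011, 1, 1), date(2011, 3, 31)),
    "shard_2011q2": (date(2011, 4, 1), date(2011, 6, 30)),
    "shard_2011q3": (date(2011, 7, 1), date(2011, 9, 30)),
}

def route(query_start: date, query_end: date) -> list:
    """Return only the shards whose date range overlaps the query's range,
    so the remaining shards never search in vain."""
    return [name for name, (lo, hi) in SHARDS.items()
            if query_start <= hi and query_end >= lo]

# A query over July 2011 touches only the newest shard.
targets = route(date(2011, 7, 1), date(2011, 7, 31))
```

The router then issues the distributed query with its shards parameter set to just those targets.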

Why did we do the sharding?

Simply because we started to see a lot of OOM exceptions and various other
instability issues. Also, we had to rebuild the index very often due to
changes in the preceding pipeline, so distributing over shards was
another asset for us in the sense that reindexing could be carried out in
parallel. On top of that, and certainly not least, our search became
faster the slimmer we kept the shards.

We don't yet have a master/slave architecture, as that is done when the user
base grows. We started with growing amounts of data, so horizontal scaling
came first.

Regards,
Dmitry Kan
twitter.com/DmitryKan

On Wed, Aug 3, 2011 at 12:24 PM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:


 On 02.08.2011 21:00, Shawn Heisey wrote:

 ...

 I did try some early tests with a single large index. Performance was
 pretty decent once it got warmed up, but I was worried about how it would
 perform under a heavy load, and how it would cope with frequent updates. I
 never really got very far with testing those fears, because the full
 rebuild time was unacceptable - at least 8 hours. The source database can
 keep up with six DIH instances reindexing at once, which completes
 much quicker than a single machine grabbing the entire database. I may
 increase the number of shards after I remove virtualization, but I'll
 need to fix a few limitations in my build system.
 ...


 First of all, thanks a lot for all the answers; here is my setup.

 I know that it is very difficult to give specific recommendations about
 this.
 Having changed from FAST Search to Solr, I can state that Solr performs
 very well, if not excellently.

 To show that I am comparing apples and oranges, here is my previous FAST Search
 setup:
 - one master server (controlling, logging, search dispatcher)
 - six index servers (4.25 million docs per server, 5 slices per index)
  (searching and indexing at the same time, indexing once per week during
 the weekend)
 - each server has 4GB RAM, all servers are physical, on separate machines
 - RAM usage controlled by the processes
 - total of 25.5 million docs (mainly metadata) from 1500 databases worldwide
 - index size is about 67GB per indexer -- about 402GB total
 - about 3 qps at peak times
 - with an average search time of 0.05 seconds at peak times

 And here is my current Solr setup:
 - one master server (indexing only)
 - two slave servers (search only), but only one is online; the second is a
 fallback
 - each server has 32GB RAM, all servers are virtual
  (master on a separate physical machine, both slaves together on one physical
 machine)
 - RAM usage is currently 20GB of java heap
 - total of 31 million docs (all metadata) from 2000 databases worldwide
 - index size is 156GB total
 - the search handler statistics report 0.6 average requests per second
 - average time per request 39.5 (is that seconds?)
 - building the index from scratch takes about 20 hours

 The good thing is I have the ability to compare a commercial product and
 enterprise system to open source.

 I started with my simple Solr setup because of KISS (keep it simple,
 stupid).
 Actually it is doing excellently as a single index on a single virtual server.
 But the average time per request should be reduced now; that's why I started
 this discussion.
 While searches with a smaller Solr index (3 million docs) showed that it
 could stand up to FAST Search, it now shows that it's time to go with
 sharding.
 I think we are already far beyond the point of search performance
 crossover.

 What I hope to get with sharding:
 - reduce time for building the index
 - reduce average time per request

 What I fear with sharding:
 - I currently have master/slave; do I then have e.g. 3 masters and 3 slaves?
 - the query changes because of sharding (is there a search distributor?)
 - how do I distribute the content to the indexers with DIH on 3 servers?
 - anything else to think about while changing to sharding?

 Conclusion:
 - Solr can handle much more than 30 million docs of metadata in a single index
  if the java heap size is large enough. Keep an eye on Lucene's fieldCache and
  sorted fields, especially title (string) fields.
 - The crossover in my case is somewhere between 3 million and 10 million docs
  per index for Solr (compared to FAST Search). FAST recommends about 3 to 6
  million docs per 4GB-RAM server for their system.

 Anyone able to reduce my fears about sharding?

Dispatching a query to multiple different cores

2011-08-03 Thread Ahmed Boubaker
Hello there!

I have a multicore solr setup with 6 different simple cores with somewhat
different schemas, and I defined another meta core which I would like to be a
dispatcher: requests are sent to the simple cores and the results are
aggregated before being sent back to the user.

Any ideas or hints on how I can achieve this?
I am wondering whether a custom SearchComponent or a custom
SearchHandler would be a good entry point.
Is it possible to access other SolrCores that are in the same container as
the meta core?
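One hedged way to sketch the meta core's job (the core names, the fetch callback, and the merge-by-score rule below are all illustrative assumptions; a real custom SearchComponent would use the core container's APIs rather than a callback): fan the query out to each simple core, then merge the hits into one score-ordered list.

```python
def dispatch(query, cores, fetch):
    """Send `query` to every core via `fetch(core, query) -> list[dict]`
    and merge all hits into a single list ordered by descending score."""
    merged = []
    for core in cores:
        for hit in fetch(core, query):
            hit["_core"] = core          # remember which core the hit came from
            merged.append(hit)
    return sorted(merged, key=lambda h: h["score"], reverse=True)

# Stand-in for a real per-core search call (HTTP or embedded):
def fake_fetch(core, query):
    data = {
        "books":  [{"id": "b1", "score": 0.9}],
        "movies": [{"id": "m1", "score": 1.2}],
    }
    return data.get(core, [])

results = dispatch("solr", ["books", "movies"], fake_fetch)
```

Note that merging raw scores across cores with different schemas is only a rough heuristic, since scores from different indexes are not strictly comparable.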

Many thanks for your help.

Boubaker


Re: Why Slop doesn't match anything?

2011-08-03 Thread Alexander Ramos Jardim
Hm...

No.

2011/8/3 Gora Mohanty g...@mimirtech.com

 On Wed, Aug 3, 2011 at 1:33 AM, Alexander Ramos Jardim
 alexander.ramos.jar...@gmail.com wrote:
 [...]
  I am not using dismax. I didn't find the solution to the problem. I just
  did a full-import and the problem ended. Still odd.
 [...]

 Maybe you changed the type of the field in question, or changed
 positionIncrementGap for the field, in between?

 Regards,
 Gora




-- 
Alexander Ramos Jardim


Re: Why Slop doesn't match anything?

2011-08-03 Thread Ahmet Arslan
 Hm...
 
 No.

Can you paste the output of debugQuery=on for the two queries?


RE: Joining on multi valued fields

2011-08-03 Thread matthew . fowler
Hi Yonik

Sorry for my late reply. I have been trying to get to the bottom of this,
but I'm getting inconsistent behaviour. Here's an example:

Query = pi:rcs100   -   here I'm going to use pid_rcs as the join value

<result name="response" numFound="1" start="0">
  <doc>
    <str name="pi">rcs100</str>
    <str name="ct">rcs</str>
    <str name="pid_rcs">G1</str>
    <str name="name_rcs">Emerging Market Countries</str>
    <str name="definition_rcs">All business events relating to companies and other issuers of securities.</str>
  </doc>
</result>
</response>

Query = code:G1   -   see how many docs have G1 in their
code field. Notice that code is multi-valued.

<result name="response" numFound="2" start="0">
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:57Z</date>
    <str name="pin">CIF3wGpXk+1029782</str>
    <arr name="code">
      <str>G1</str>
      <str>G7U</str>
      <str>GK</str>
      <str>ME7</str>
      <str>ME8</str>
      <str>MN</str>
      <str>MR</str>
    </arr>
  </doc>
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:57Z</date>
    <str name="pin">CIF7YcLP+1029782</str>
    <arr name="code">
      <str>G1</str>
      <str>G7U</str>
      <str>GK</str>
      <str>ME7</str>
      <str>ME8</str>
      <str>MN</str>
      <str>MR</str>
    </arr>
  </doc>
</result>
</response>

Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join
from=pid_rcs to=code}pi:rcs100

<result name="response" numFound="3" start="0">
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:57Z</date>
    <str name="pin">CIF3wGpXk+1029782</str>
    <arr name="code">
      <str>G1</str>
      <str>G7U</str>
      <str>GK</str>
      <str>ME7</str>
      <str>ME8</str>
      <str>MN</str>
      <str>MR</str>
    </arr>
  </doc>
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:57Z</date>
    <str name="pin">CIF7YcLP+1029782</str>
    <arr name="code">
      <str>G1</str>
      <str>G7U</str>
      <str>GK</str>
      <str>ME7</str>
      <str>ME8</str>
      <str>MN</str>
      <str>MR</str>
    </arr>
  </doc>
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:58Z</date>
    <str name="pin">CN1763203+1029782</str>
    <arr name="code">
      <str>A2</str>
      <str>A5</str>
      <str>A9</str>
      <str>AN</str>
      <str>B125</str>
      <str>B126</str>
      <str>B130</str>
      <str>BL63</str>
      <str>G41</str>
      <str>GK</str>
      <str>MZ</str>
    </arr>
  </doc>
</result>
</response>

So, as you can see, I get back 3 results when only 2 match the criteria,
i.e. docs where G1 is present in the multi-valued code field. Why should
the last document be included in the result of the join?

Thank you,

Matt


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
Seeley
Sent: 01 August 2011 18:28
To: solr-user@lucene.apache.org
Subject: Re: Joining on multi valued fields

On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com
wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

That should work (and the unit tests do test with multi-valued fields).
Can you come up with a simple example where you are not getting the
expected results?

-Yonik
http://www.lucidimagination.com

This email was sent to you by Thomson Reuters, the global news and information 
company. Any views expressed in this message are those of the individual 
sender, except where the sender specifically states them to be the views of 
Thomson Reuters.


Re: changing the root directory where solrCloud stores info inside zookeeper File system

2011-08-03 Thread Mark Miller
Thanks - let me try and do this here manually later today and get back to you.

- Mark

On Aug 2, 2011, at 7:41 AM, Yatir Ben Shlomo wrote:

 Thanks a lot Mark,
 Since my SolrCloud code was old, I tried downloading and building the
 newest code from here:
 https://svn.apache.org/repos/asf/lucene/dev/trunk/
 I am using tomcat6.
 I manually created the sc sub-directory in my zooKeeper ensemble
 file-system.
 I used this connection string for my ZK ensemble:
 zook1:2181/sc,zook2:2181/sc,zook3:2181/sc
 but I still get the same problem.
 Here is the entire catalina.out log with the exception:
 
 Using CATALINA_BASE:   /opt/tomcat6
 Using CATALINA_HOME:   /opt/tomcat6
 Using CATALINA_TMPDIR: /opt/tomcat6/temp
 Using JRE_HOME:/usr/java/default/
 Using CLASSPATH:   /opt/tomcat6/bin/bootstrap.jar
 Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory
 (errno = 12).
 Aug 2, 2011 4:28:46 AM org.apache.catalina.core.AprLifecycleListener init
 INFO: The APR based Apache Tomcat Native library which allows optimal
 performance in production environments was not found on the
 java.library.path:
 /usr/java/jdk1.6.0_21/jre/lib/amd64/server:/usr/java/jdk1.6.0_21/jre/lib/a
 md64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:/usr/java/packages/lib/amd64:/
 usr/lib64:/lib64:/lib:/usr/lib
 Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init
 INFO: Initializing Coyote HTTP/1.1 on http-8983
 Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init
 INFO: Initializing Coyote HTTP/1.1 on http-8080
 Aug 2, 2011 4:28:46 AM org.apache.catalina.startup.Catalina load
 INFO: Initialization processed in 448 ms
 Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardService start
 INFO: Starting service Catalina
 Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardEngine start
 INFO: Starting Servlet Engine: Apache Tomcat/6.0.29
 Aug 2, 2011 4:28:46 AM org.apache.catalina.startup.HostConfig
 deployDescriptor
 INFO: Deploying configuration descriptor solr1.xml
 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader
 locateSolrHome
 INFO: Using JNDI solr.home: /home/tomcat/solrCloud1
 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader init
 INFO: Solr home set to '/home/tomcat/solrCloud1/'
 Aug 2, 2011 4:28:46 AM org.apache.solr.servlet.SolrDispatchFilter init
 INFO: SolrDispatchFilter.init()
 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader
 locateSolrHome
 INFO: Using JNDI solr.home: /home/tomcat/solrCloud1
 Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer$Initializer
 initialize
 INFO: looking for solr.xml: /home/tomcat/solrCloud1/solr.xml
 Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer init
 INFO: New CoreContainer 853527367
 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader
 locateSolrHome
 INFO: Using JNDI solr.home: /home/tomcat/solrCloud1
 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader init
 INFO: Solr home set to '/home/tomcat/solrCloud1/'
 Aug 2, 2011 4:28:46 AM org.apache.solr.cloud.SolrZkServerProps
 getProperties
 INFO: Reading configuration from: /home/tomcat/solrCloud1/zoo.cfg
 Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer initZooKeeper
 INFO: Zookeeper client=zook1:2181/sc,zook2:2181/sc,zook3:2181/sc
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:zookeeper.version=3.3.1-942149, built on
 05/07/2010 17:14 GMT
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:host.name=ob1079.nydc1.outbrain.com
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:java.version=1.6.0_21
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:java.vendor=Sun Microsystems Inc.
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:java.home=/usr/java/jdk1.6.0_21/jre
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:java.class.path=/opt/tomcat6/bin/bootstrap.jar
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client
 environment:java.library.path=/usr/java/jdk1.6.0_21/jre/lib/amd64/server:/
 usr/java/jdk1.6.0_21/jre/lib/amd64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:
 /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:java.io.tmpdir=/opt/tomcat6/temp
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:java.compiler=NA
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:os.name=Linux
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:os.arch=amd64
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client environment:os.version=2.6.18-194.8.1.el5
 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
 INFO: Client 

Re: about the Solr request filter

2011-08-03 Thread Erick Erickson
Can we see the queries you're running and the data you expect back?
Also an idea of the documents you expect to match, including
the field definitions from your schema.xml for the fields in question.

Are you using SolrJ? Just a URL in a browser?

What do you mean by manually?

It might help to review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

2011/7/28 于浩 yuhao.1...@qq.com:
 Hello, dear friends,
  I have run into a problem developing with solr.
  In my application, multiple queries must be sent to the solr server after the
  page is loaded. Then I found a problem: some requests return statusCode:0
  and QTime:0 -- solr has accepted the request, but it does not return any
  result documents. If I send each request one by one manually, it returns
  results. But if I send the requests frequently within a very short time, it
  returns nothing, only statusCode:0 and QTime:0.
  I think this may be a strategy of solr's, but I can't find any documents or
  discussions about it on the internet.
  I hope you can help me.

  --
                         Surely, you are always the best!


Re: Possible to use quotes in dismax qf?

2011-08-03 Thread Erick Erickson
Did you look at phrase fields (pf) in dismax?

Best
Erick

On Thu, Jul 28, 2011 at 11:26 AM, O. Klein kl...@octoweb.nl wrote:
 I removed the post as it might confuse people.

 But because the analysers combine 2 words into a phrase query using shingles
 and a position filter, and because of the usage of dismax, I need q to be the
 original query plus the original query as a phrase query. That way the
 combined words are also highlighted and I get the results I need.

 qf is not the place to do this, it seems. Is there any way to do this in Solr?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Re-Possible-to-use-quotes-in-dismax-qf-tp3206891p3206986.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: dealing with so many different sorting options

2011-08-03 Thread Erick Erickson
Well, you're kind of stuck, unfortunately. It's pretty much required that you
reindex when you add new fields if you want existing documents to
have those fields.

I don't think there's any good way to use the DB to sort Solr results
that would be performant.

About using Solr as your data source: I'm not sure what you mean here. Solr
is many things, but it's not intended to be a data store. You can essentially
store the entire DB in the Solr index, though; there's nothing wrong with that.
Admittedly, re-indexing is a pain, but I suspect that as your app matures you'll
find yourself adding fields less and less often.

Sorry I can't offer better suggestions, but that's the nature of developing
apps <G>..

Best
Erick

On Fri, Jul 29, 2011 at 3:36 PM, Jason Toy jason...@gmail.com wrote:
 As I'm using solr more and more, I'm finding that I need to do searches and
 then order by new criteria.  So I am constantly adding new fields into solr
 and then reindexing everything.

 I want to know if adding all this data into solr is the normal way to
 deal with sorting.  I'm finding that I have almost a whole copy of my
 database in solr.

 Should I be pulling all the data out of solr and then sorting in my database?
  This solution seems like it would take too long.
 Could/should I just move to solr as my primary store, so I can query directly
 against it without having to reindex all the time?


 Right now we store about 50 million docs, but the size is growing pretty
 fast, and it is a pain to reindex everything every time I add a new column to
 sort by.



Re: Looking for a senior search engineer

2011-08-03 Thread Erick Erickson
Here's a page where you can find hired guns that you might be interested in...

http://wiki.apache.org/solr/Support

Best
Erick

On Fri, Jul 29, 2011 at 8:59 PM, Michael Economy mich...@goodreads.com wrote:
 Hi,

 Sorry if this isn't the right place for this message, but it's a very
 specific role we're looking for and I'm not sure where else to find
 solr experts!


 I was wondering if anyone would be interested, or knew anyone who
 would be interested in working on goodreads.com's search:


 We're using Solr, and we'd like someone with experience doing:
 solr-replication
 faceted search
 more cool stuff

 We run ruby on rails for the website.  Potential applicants don't need
 to know ruby or rails, but they'd be expected to pick it up after
 starting.

 More info on our website:
 http://.goodreads.com/about/us



 Michael Economy
 Director Engineering, Goodreads Inc.



Re: Different Access Permissions?

2011-08-03 Thread Erick Erickson
Sure, it's possible. It's just that you have to do the work yourself G...

You could define a series of request handlers for various classes of
user and route the request to the correct handler based on that
user's attributes. You could construct the query manually based on the
user's attributes. You could create a page with drop-downs for various
fields based on the user's attributes...

But the common thread  here is that you have to do all the logic in the app,
outside Solr, that routes the query to the correct place in Solr based on
user attributes that are also outside solr.

If this is irrelevant, perhaps you could  explain the use case in a bit more
depth?

Best
Erick

On Mon, Aug 1, 2011 at 2:36 AM, deniz denizdurmu...@gmail.com wrote:
 Hi All,

 here comes the problem... Let's say that I have a document having different
 fields. Is it possible to let some users to query the documents partially?

 Like this:

 Document  has name, age, occupation, country fields.

 UserA can make search within name and country fields while UserB can make a
 search in the whole document...

 is this possible?

 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Different-Access-Permissions-tp3215190p3215190.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Joining on multi valued fields

2011-08-03 Thread Yonik Seeley
Hmmm, if these are real responses from a solr server at rest (i.e.
documents not being changed between queries), then what you show
definitely looks like a bug.
That's interesting, since TestJoin implements a random test that
should cover cases like this pretty well.

I assume you are using a version of trunk (4.0-dev) and not just the
actual patch attached to the JIRA issue (which IIRC had at least one bug...
SOLR-2521).
Have you tried a more recent version of trunk?

-Yonik
http://www.lucidimagination.com



On Wed, Aug 3, 2011 at 7:00 AM,  matthew.fow...@thomsonreuters.com wrote:
 Hi Yonik

 Sorry for my late reply. I have been trying to get to the bottom of this,
 but I'm getting inconsistent behaviour. Here's an example:

 Query = pi:rcs100   -   here I'm going to use pid_rcs as the join value

 <result name="response" numFound="1" start="0">
   <doc>
     <str name="pi">rcs100</str>
     <str name="ct">rcs</str>
     <str name="pid_rcs">G1</str>
     <str name="name_rcs">Emerging Market Countries</str>
     <str name="definition_rcs">All business events relating to companies and other issuers of securities.</str>
   </doc>
 </result>
 </response>

 Query = code:G1   -   see how many docs have G1 in their
 code field. Notice that code is multi-valued.

 <result name="response" numFound="2" start="0">
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:57Z</date>
     <str name="pin">CIF3wGpXk+1029782</str>
     <arr name="code">
       <str>G1</str>
       <str>G7U</str>
       <str>GK</str>
       <str>ME7</str>
       <str>ME8</str>
       <str>MN</str>
       <str>MR</str>
     </arr>
   </doc>
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:57Z</date>
     <str name="pin">CIF7YcLP+1029782</str>
     <arr name="code">
       <str>G1</str>
       <str>G7U</str>
       <str>GK</str>
       <str>ME7</str>
       <str>ME8</str>
       <str>MN</str>
       <str>MR</str>
     </arr>
   </doc>
 </result>
 </response>

 Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join
 from=pid_rcs to=code}pi:rcs100

 <result name="response" numFound="3" start="0">
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:57Z</date>
     <str name="pin">CIF3wGpXk+1029782</str>
     <arr name="code">
       <str>G1</str>
       <str>G7U</str>
       <str>GK</str>
       <str>ME7</str>
       <str>ME8</str>
       <str>MN</str>
       <str>MR</str>
     </arr>
   </doc>
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:57Z</date>
     <str name="pin">CIF7YcLP+1029782</str>
     <arr name="code">
       <str>G1</str>
       <str>G7U</str>
       <str>GK</str>
       <str>ME7</str>
       <str>ME8</str>
       <str>MN</str>
       <str>MR</str>
     </arr>
   </doc>
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:58Z</date>
     <str name="pin">CN1763203+1029782</str>
     <arr name="code">
       <str>A2</str>
       <str>A5</str>
       <str>A9</str>
       <str>AN</str>
       <str>B125</str>
       <str>B126</str>
       <str>B130</str>
       <str>BL63</str>
       <str>G41</str>
       <str>GK</str>
       <str>MZ</str>
     </arr>
   </doc>
 </result>
 </response>

 So, as you can see, I get back 3 results when only 2 match the criteria,
 i.e. docs where G1 is present in the multi-valued code field. Why should
 the last document be included in the result of the join?

 Thank you,

 Matt


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: 01 August 2011 18:28
 To: solr-user@lucene.apache.org
 Subject: Re: Joining on multi valued fields

 On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com
 wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

 That should work (and the unit tests do test with multi-valued fields).
 Can you come up with a simple example where you are not getting the
 expected results?

 -Yonik
 http://www.lucidimagination.com




Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-03 Thread Erick Erickson
How are these fields used? If they're not used for searching, you could
put them in their own core, rebuild that index at your whim, and then
query that core when you need the relationship information.

If you have a DB backing your system, you could perhaps store the info there
and query that (but I like the second core better <G>)..

But if you can use a separate index just for the relationships, you won't
have to deal with the slow re-indexing of all the docs...
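A sketch of that two-core pattern (the callback names and the join-by-id step are assumptions for illustration): search the stable main core, then look the link fields up in the small, frequently rebuilt core and attach them to each hit.

```python
def search_with_links(q, search_main, lookup_links):
    """search_main(q) returns a list of doc dicts, each with an 'id';
    lookup_links(ids) returns {id: link-field dict} from the small link core.
    Attach the link fields to each main-core hit."""
    docs = search_main(q)
    links = lookup_links([d["id"] for d in docs])
    for d in docs:
        d.update(links.get(d["id"], {}))   # attach link fields, if any exist
    return docs

# Stand-ins for the two cores:
main = lambda q: [{"id": "d1"}, {"id": "d2"}]
link = lambda ids: {"d1": {"linked_to": ["d7"]}}

docs = search_with_links("foo", main, link)
```

Only the small link index is ever rebuilt, so the expensive XeLDA-analyzed fields in the main index are never touched.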

Best
Erick

On Mon, Aug 1, 2011 at 4:12 AM,  karsten-s...@gmx.de wrote:
 Hi lucene/solr-folk,

 Issue:
 Our documents are stable except for two fields which are used for linking
 between the docs, so we would like to update these two fields in a batch once a
 month (possibly once a week).
 We cannot reindex all docs once a month, because we are using XeLDA in some
 fields for stemming (morphological analysis), and XeLDA is slow. We have 14
 million docs (less than 100GB main index and 3GB for these two changeable
 fields).
 In the next half year we will be migrating our search engine from Verity K2 to
 solr, so we could wait for solr 4.0
 (
 btw, any news about
 http://lucene.472066.n3.nabble.com/Release-schedule-Lucene-4-td2256958.html
 ?
 ).

 Solution?

 Our issue is exactly the purpose of ParallelReader.
 But Solr do not support ParallelReader (for a good reason:
 http://lucene.472066.n3.nabble.com/Vertical-Partitioning-advice-td494623.html#a494624
 ).
 So I see two possible ways to solve our issue:
 1. waiting for the new Parallel incremental indexing
 (
 https://issues.apache.org/jira/browse/LUCENE-1879
 ) and hoping that solr will integrate this.
 Pro:
  - nothing to do for us except waiting.
 Contra:
  - I did not find anything of the (old) patch in the current trunk.

 2. Change lucene index below/without solr in a batch:
   a) Each month generate a new index only with our two changed fields
      (e.g. with DIH)
   b) Use FilterIndex and ParallelReader to mock a correct index
   c) “Merge” this mock index to a new Index
      (via IndexWriter.addIndexes(IndexReader...) )
 Pro:
  - The patch for https://issues.apache.org/jira/browse/LUCENE-1812
   should be a good example, how to do this.
 Contra:
  - relation between DocId and document index order is not a guaranteed 
 feature of DIH (e.g. we will have to split the main index to ensure that no 
 merge will occur in/after DIH).
  - To run this batch, solr has to be stopped and restarted.
  - Even if we know that our two fields should change only for a subset of the 
 docs, we nevertheless have to reindex these two fields for all the docs.

 Any comments, hints or tips?
 Is there a third (better) way to solve our issue?
 Is there already an working example of the 2. solution?
 Will LUCENE-1879 (Parallel incremental indexing) be part of solr 4.0?

 Best regards
  Karsten



RE: Joining on multi valued fields

2011-08-03 Thread matthew . fowler
No I haven't. I will get the latest out of the trunk and report back.

Cheers again,

Matt

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: 03 August 2011 14:51
To: Fowler, Matthew (Markets Eikon)
Cc: solr-user@lucene.apache.org
Subject: Re: Joining on multi valued fields

Hmmm, if these are real responses from a solr server at rest (i.e.
documents not being changed between queries) then what you show
definitely looks like a bug.
That's interesting, since TestJoin implements a random test that
should cover cases like this pretty well.

I assume you are using a version of trunk (4.0-dev) and not just the
patch attached to the JIRA issue (which IIRC had at least one bug...
SOLR-2521).
Have you tried a more recent version of trunk?

-Yonik
http://www.lucidimagination.com



On Wed, Aug 3, 2011 at 7:00 AM,  matthew.fow...@thomsonreuters.com wrote:
 Hi Yonik

 Sorry for my late reply. I have been trying to get to the bottom of this
 but I'm getting inconsistent behaviour. Here's an example:

 Query = pi:rcs100     -       Here I'm going to use pid_rcs as the join
 value

 result name=response numFound=1 start=0
  doc
  str name=pircs100/str
  str name=ctrcs/str
  str name=pid_rcsG1/str
  str name=name_rcsEmerging Market Countries/str
  str name=definition_rcsAll business events relating to companies
 and other issuers of securities./str
  /doc
  /result
  /response

 Query = code:G1       -       See how many docs have G1 in their
 code field. Notice that code is multi valued

 - result name=response numFound=2 start=0
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF3wGpXk+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF7YcLP+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
  /result
  /response

 Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join
 from=pid_rcs to=code}pi:rcs100

 - result name=response numFound=3 start=0
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF3wGpXk+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:57Z/date
  str name=pinCIF7YcLP+1029782/str
 - arr name=code
  strG1/str
  strG7U/str
  strGK/str
  strME7/str
  strME8/str
  strMN/str
  strMR/str
  /arr
  /doc
 - doc
  str name=ctcat/str
  date name=maindocdate2011-04-22T05:48:58Z/date
  str name=pinCN1763203+1029782/str
 - arr name=code
  strA2/str
  strA5/str
  strA9/str
  strAN/str
  strB125/str
  strB126/str
  strB130/str
  strBL63/str
  strG41/str
  strGK/str
  strMZ/str
  /arr
  /doc
  /result
  /response

 So as you can see I get back 3 results when only 2 match the criteria.
 i.e. docs where G1 is present in multi valued code field. Why should
 the last document be included in the result of the join?

 Thank you,

 Matt


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: 01 August 2011 18:28
 To: solr-user@lucene.apache.org
 Subject: Re: Joining on multi valued fields

 On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com
 wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

 That should work (and the unit tests do test with multi-valued fields).
 Can you come up with a simple example where you are not getting the
 expected results?

 -Yonik
 http://www.lucidimagination.com



Re: performance crossover between single index and sharding

2011-08-03 Thread Shawn Heisey

Replies inline.

On 8/3/2011 2:24 AM, Bernd Fehling wrote:
To show that I am comparing apples and oranges, here is my previous FAST 
Search setup:

- one master server (controlling, logging, search dispatcher)
- six index servers (4.25 million docs per server, 5 slices per index)
  (searching and indexing at the same time, indexing once per week 
during the weekend)

- each server has 4GB RAM, all servers are physical on separate machines
- RAM usage controlled by the processes
- total of 25.5 million docs (mainly metadata) from 1500 databases worldwide
- index size is about 67GB per indexer -- about 402GB total
- about 3 qps at peak times
- with average search time of 0.05 seconds at peak times


An average query time of 50 milliseconds isn't too bad.  If the number 
from your Solr setup below (39.5) is the QTime, then Solr thinks it is 
performing better, but Solr's QTime does not include absolutely 
everything that has to happen.  Do you by chance have 95th and 99th 
percentile query times for either system?



And here is now my current Solr setup:
- one master server (indexing only)
- two slave server (search only) but only one is online, the second is 
fallback

- each server has 32GB RAM, all servers are virtual
  (master on a separate physical machine, both slaves together on a 
physical machine)

- RAM usage is currently 20GB to java heap
- total of 31 million docs (all metadata) from 2000 databases worldwide
- index size is 156GB total
- search handler statistic report 0.6 average requests per second
- average time per request 39.5 (is that seconds?)
- building the index from scratch takes about 20 hours


I can't tell whether you mean that each physical host has 32GB or each 
VM has 32GB.  You want to be sure that you are not oversubscribing your 
memory.  If you can get more memory in your machines, you really 
should.  Do you know whether that 0.6 seconds is most of the delay that 
a user sees when making a search request, or are there other things 
going on that contribute more delay?  In our webapp, the Solr request 
time is usually small compared with everything else the server and the 
user's browser are doing to render the results page.  As much as I hate 
being the tall pole in the tent, I look forward to the day when the 
developers can change that balance.



The good thing is I have the ability to compare a commercial product and
enterprise system to open source.

I started with my simple Solr setup because of KISS (keep it simple, 
stupid).
Actually it is doing excellently as a single index on a single virtual 
server.
But the average time per request should be reduced now, thats why I 
started

this discussion.
While searches with a smaller Solr index size (3 million docs) showed that 
it can

stand with FAST Search, it now shows that it's time to go with sharding.
I think we are already far beyond the point of search performance 
crossover.


What I hope to get with sharding:
- reduce time for building the index
- reduce average time per request


You will probably achieve both of these things by sharding, especially 
if you have a lot of CPU cores available.  Like mine, your query volume 
is very low, so the CPU cores are better utilized distributing the search.



What I fear with sharding:
- I currently have master/slave; do I then have e.g. 3 masters and 3 
slaves?

- the query changes because of sharding (is there a search distributor?)
- how to distribute the content to the indexers with DIH on 3 servers?
- anything else to think about while changing to sharding?


I think sharding is probably a good idea for you, as long as you don't 
lose redundancy.  You can duplicate the FAST concept of a master server, 
in a Solr core with no index.  The solrconfig.xml for the core needs to 
include the shards parameter.  That core combined with those shards will 
make up one complete index chain, and you need to have at least two 
complete chains, running on separate physical hardware.  A load balancer 
will be critical.  I use two small VMs on separate hosts with heartbeat 
and haproxy for mine.
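A sketch of what the index-less "broker" core's solrconfig.xml might contain. The shard host names, ports, and core paths below are hypothetical, and this is a minimal illustration rather than a tested configuration:

```xml
<!-- Hypothetical solrconfig.xml fragment for a core with no local index.
     Queries sent to this handler are distributed across the listed shards. -->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="shards">idx1:8983/solr/shard1,idx2:8983/solr/shard2,idx3:8983/solr/shard3</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
```

Two such cores running on separate hardware give the two complete chains mentioned above, with the load balancer in front of them.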


Thanks,
Shawn



Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Olson, Ron
Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron

DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is  unauthorized and strictly prohibited.  If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.


Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Mike Sokolov
Although you weren't very clear about it, it sounds as if you want the 
results to be sorted by a name that actually matched the query?  In 
general that is not going to be easy, since it is not something that can 
be computed in advance and thus indexed.



-Mike

On 08/03/2011 10:39 AM, Olson, Ron wrote:

Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron



RE: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Olson, Ron
Right, the search term is the sort field. I can manually sort an individual 
page, but when the user clicks on the next page, the sort is reset, visually.

-Original Message-
From: Mike Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, August 03, 2011 9:52 AM
To: solr-user@lucene.apache.org
Cc: Olson, Ron
Subject: Re: Strategies for sorting by array, when you can't sort by array?

Although you weren't very clear about it, it sounds as if you want the
results to be sorted by a name that actually matched the query?  In
general that is not going to be easy, since it is not something that can
be computed in advance and thus indexed.


-Mike

On 08/03/2011 10:39 AM, Olson, Ron wrote:
 Hi all-

 Well, this is a problem. I have a list of names as a multi-valued field and I 
 am searching on this field and need to return the results sorted. I know from 
 searching and reading the documentation (and getting the error) that sorting 
 on a multi-valued field isn't possible. Okay, so, what I haven't found is any 
 real good solution/workaround to the problem. I was wondering what strategies 
 others have done to overcome this particular situation; collapsing the 
 individual names into a single field with copyField doesn't work because the 
 name searched may not be the first name in the field.

 Thanks for any hints/tips/tricks.

 Ron






Dismax mm per field

2011-08-03 Thread Dmitriy Shvadskiy
Hello,
Is there a way to apply (e)dismax mm parameter per field? If I have a query 
field1:(blah blah) AND field2:(foo bar)

is there a way to apply mm only to field2?

Thanks,
Dmitriy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-tp3222594p3222594.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Smiley, David W.
Hi Ron.
This is an interesting problem you have. One idea would be to create an index 
with the entity relationship going in the other direction.  So instead of one 
to many, go many to one.  You would end up with multiple documents with varying 
names but repeated parent entity information -- perhaps simply using just an ID 
which is used as a lookup. Do a search on this name field, sorting by a 
non-tokenized variant of the name field. Use Result-Grouping to consolidate 
multiple matches of a name to the same parent document. This whole idea might 
very well be academic since duplicating all the parent entity information for 
searching on that too might be a bit more than you care to bother with. And I 
don't think Solr 4's join feature addresses this use case. In the end, I think 
Solr could be modified to support this, with some work. It would make a good 
feature request in JIRA.
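A sketch of the grouped request this inverted layout might use. The field names `name` (tokenized, searched), `name_sort` (non-tokenized copy, sorted on), and `parent_id` (the lookup key back to the parent entity) are hypothetical; this only builds the parameter string, not a verified recipe:

```python
from urllib.parse import urlencode

# Hypothetical fields for the inverted (many-to-one) index:
#   name      - tokenized name field that is searched
#   name_sort - non-tokenized variant used for sorting
#   parent_id - grouping key pointing back to the parent entity
params = {
    "q": "name:smith",
    "sort": "name_sort asc",
    "group": "true",
    "group.field": "parent_id",   # one group per parent document
    "group.limit": 1,             # keep the best-matching name per parent
}
query_string = urlencode(params)
print(query_string)
```

The application would then resolve each group's `parent_id` to the full parent record.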

~ David Smiley

On Aug 3, 2011, at 10:39 AM, Olson, Ron wrote:

 Hi all-
 
 Well, this is a problem. I have a list of names as a multi-valued field and I 
 am searching on this field and need to return the results sorted. I know from 
 searching and reading the documentation (and getting the error) that sorting 
 on a multi-valued field isn't possible. Okay, so, what I haven't found is any 
 real good solution/workaround to the problem. I was wondering what strategies 
 others have done to overcome this particular situation; collapsing the 
 individual names into a single field with copyField doesn't work because the 
 name searched may not be the first name in the field.
 
 Thanks for any hints/tips/tricks.
 
 Ron
 



Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex

2011-08-03 Thread karsten-solr
Hi Erick,

our two changeable fields are used for linking between documents at the 
application level.
From a Lucene point of view they are just two searchable fields, with a stored 
term vector for one of them.
Our queries will use one of these fields and a couple of fields from the 
stable fields.

So the question is really about updating two fields in an existing Lucene index 
with more than fifty other fields.

Best regards
  Karsten

P.S. about our linking between documents:
Our two fields are called outgoingLinks and possibleIncomingLinks.

Our source documents have an abstract and a couple of metadata fields.
We are using regular expressions to find outgoing links in this abstract. This 
means a couple of words which indicate 
 1. that the author made a reference (like in my previous work published as 
'Very important Article' in Nature 2010, 12 page 7)
 2. that this reference contains metadata pointing to another document

Each of these links is transformed to a special key (2010NaturNr12Page7).
On the other side, we transform the metadata to all possible keys.
This key generation grows with our knowledge of possible link patterns.
For the Lucene indexer this is a black box: there is a service which produces 
the keys for outgoing and possibleIncoming from our source (XML) documents, and 
these keys must be searchable in Lucene/Solr.
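The key-generation idea can be sketched roughly as follows. The citation pattern and exact key format below are invented for illustration (the real service and its patterns are a black box to the indexer); this only shows the shape of the transformation from abstract text to searchable link keys:

```python
import re

# Hypothetical citation pattern, loosely modeled on the example above
# ("... in Nature 2010, 12 page 7"). The real patterns are unknown.
CITATION = re.compile(
    r"in (?P<journal>[A-Z][a-z]+) (?P<year>\d{4}), (?P<issue>\d+) page (?P<page>\d+)"
)

def outgoing_link_keys(abstract: str) -> list:
    """Turn each recognized citation in an abstract into a searchable link key."""
    return [
        f"{m['year']}{m['journal']}Nr{m['issue']}Page{m['page']}"
        for m in CITATION.finditer(abstract)
    ]

print(outgoing_link_keys(
    "my previous work published as 'Very important Article' in Nature 2010, 12 page 7"
))
```

On the metadata side, the same key format would be generated from a document's fields, so outgoingLinks and possibleIncomingLinks meet as plain searchable terms.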

P.P.S. in Context:
http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html

 Original Message 
 Date: Wed, 3 Aug 2011 09:57:03 -0400
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Update some fields for all documents: LUCENE-1879 vs. 
 ParallelReader & FilterIndex

 How are these fields used? Because if they're not used for searching, you
 could
 put them in their own core and rebuild that index at your whim, then
 querying that
 core when you need the relationship information.
 
 If you have a DB backing your system, you could perhaps store the info
 there
  and query that (but I like the second core better <g>).
 
 But if you could use a separate index just for the relationships, you
 wouldn't
 have to deal with the slow re-indexing of all the docs...
 
 Best
 Erick
 
 On Mon, Aug 1, 2011 at 4:12 AM,  karsten-s...@gmx.de wrote:
  Hi lucene/solr-folk,
 
  Issue:
  Our documents are stable except for two fields which are used for
 linking between the docs. So we would like to update these two fields in a 
 batch once a
 month (possibly once a week).
  We can not reindex all docs once a month, because we are using XeLDA in
 some fields for stemming (morphological analysis), and XeLDA is slow. We
 have 14 million docs (less than 100 GByte main index and 3 GByte for these two
 changeable fields).
  In the next half year we will migrating our search engine from verity K2
 to solr; so we could wait for solr 4.0
  (
  btw any news about
 
 http://lucene.472066.n3.nabble.com/Release-schedule-Lucene-4-td2256958.html
  ?
  ).
 
  Solution?
 
  Our issue is exactly the purpose of ParallelReader.
  But Solr do not support ParallelReader (for a good reason:
 
 http://lucene.472066.n3.nabble.com/Vertical-Partitioning-advice-td494623.html#a494624
  ).
  So I see two possible ways to solve our issue:
  1. waiting for the new Parallel incremental indexing
  (
  https://issues.apache.org/jira/browse/LUCENE-1879
  ) and hoping that solr will integrate this.
  Pro:
   - nothing to do for us except waiting.
  Contra:
   - I did not find anything of the (old) patch in the current trunk.
 
  2. Change lucene index below/without solr in a batch:
    a) Each month generate a new index only with our two changed fields
       (e.g. with DIH)
    b) Use FilterIndex and ParallelReader to mock a correct index
    c) “Merge” this mock index to a new Index
       (via IndexWriter.addIndexes(IndexReader...) )
  Pro:
   - The patch for https://issues.apache.org/jira/browse/LUCENE-1812
    should be a good example, how to do this.
  Contra:
   - relation between DocId and document index order is not a guaranteed
 feature of DIH, (e.g. we will have to split the main index to ensure that
 no merge will occur in/after DIH).
   - To run this batch, solr has to be stopped and restarted.
   - Even if we know that our two fields should change only for a subset
 of the docs, we nevertheless have to reindex these two fields for all the
 docs.
 
  Any comments, hints or tips?
  Is there a third (better) way to solve our issue?
  Is there already an working example of the 2. solution?
  Will LUCENE-1879 (Parallel incremental indexing) be part of solr 4.0?
 
  Best regards
   Karsten
 


RE: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Olson, Ron
*Sigh*...I had thought maybe reversing it would work, but that would require 
creating a whole new index, on a separate core, as the existing index is used 
for other purposes. Plus, given the volume of data, that would be a big deal, 
update-wise. What would be better would be to remove that particular sort 
option-button on the webpage. ;)

I'll create a Jira issue, but in the meanwhile I'll have to come up with 
something else. I guess I didn't realize how much of a corner case this 
problem is. :)

Thanks for the suggestions!

Ron

-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org]
Sent: Wednesday, August 03, 2011 10:26 AM
To: solr-user@lucene.apache.org
Subject: Re: Strategies for sorting by array, when you can't sort by array?

Hi Ron.
This is an interesting problem you have. One idea would be to create an index 
with the entity relationship going in the other direction.  So instead of one 
to many, go many to one.  You would end up with multiple documents with varying 
names but repeated parent entity information -- perhaps simply using just an ID 
which is used as a lookup. Do a search on this name field, sorting by a 
non-tokenized variant of the name field. Use Result-Grouping to consolidate 
multiple matches of a name to the same parent document. This whole idea might 
very well be academic since duplicating all the parent entity information for 
searching on that too might be a bit more than you care to bother with. And I 
don't think Solr 4's join feature addresses this use case. In the end, I think 
Solr could be modified to support this, with some work. It would make a good 
feature request in JIRA.

~ David Smiley

On Aug 3, 2011, at 10:39 AM, Olson, Ron wrote:

 Hi all-

 Well, this is a problem. I have a list of names as a multi-valued field and I 
 am searching on this field and need to return the results sorted. I know from 
 searching and reading the documentation (and getting the error) that sorting 
 on a multi-valued field isn't possible. Okay, so, what I haven't found is any 
 real good solution/workaround to the problem. I was wondering what strategies 
 others have done to overcome this particular situation; collapsing the 
 individual names into a single field with copyField doesn't work because the 
 name searched may not be the first name in the field.

 Thanks for any hints/tips/tricks.

 Ron






Re: Dismax mm per field

2011-08-03 Thread Jonathan Rochkind
There is not, and the way dismax works makes it not really that feasible 
in theory, sadly.


One thing you could do instead is combine multiple separate dismax 
queries using the nested query syntax. This will affect your relevancy 
ranking, possibly in odd ways, but anything that accomplishes 'mm per 
field' will necessarily not really be using dismax's disjunction-max 
relevancy ranking in the way it's intended.


Here's how you could combine two separate dismax queries:

defType=lucene
q=_query_:{!dismax qf=field1 mm=100%}blah blah AND _query_:{!dismax 
qf=field2 mm=80%}foo bar


That whole q value would need to be properly URI escaped, which I 
haven't done here for human-readability.


Dismax has always got an mm; there's no way to not have an mm with 
dismax, but mm=100% might be what you mean. Of course, one of those 
queries could also not be dismax at all, but the ordinary lucene query 
parser or anything else. And of course you could repeat the same query 
text, e.g. blah blah, in both nested queries.
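As a side note, the URI escaping mentioned above can be done with a standard-library call. The host and core path below are hypothetical, and the quoting of each nested `{!dismax ...}` clause is an assumption of this sketch rather than a verified Solr request:

```python
from urllib.parse import quote_plus

# Nested dismax clauses combined with the lucene query parser, as above.
# Quoting each {!dismax ...} clause is an assumption of this sketch.
q = ('_query_:"{!dismax qf=field1 mm=100%}blah blah" AND '
     '_query_:"{!dismax qf=field2 mm=80%}foo bar"')

url = "http://localhost:8983/solr/select?defType=lucene&q=" + quote_plus(q)
print(url)
```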




On 8/3/2011 11:24 AM, Dmitriy Shvadskiy wrote:

Hello,
Is there a way to apply (e)dismax mm parameter per field? If I have a query
field1:(blah blah) AND field2:(foo bar)

is there a way to apply mm only to field2?

Thanks,
Dmitriy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-tp3222594p3222594.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Jonathan Rochkind
There's no great way to do this. I understand your problem as: It's a 
multi-valued field, but you want to sort on whichever of those values 
matched the query, not on the values that didn't. (Not entirely clear 
what to do if the documents are in the result set because of a match in 
an entirely different field!)


I would sometimes like to do that too, and haven't really been able to 
come up with any great way to do it.


Something involving faceting kind of gets you closer, but ends up being 
a huge pain and doesn't get  you (or at least me) all the way to 
supporting the interface I'd really want.


On 8/3/2011 10:39 AM, Olson, Ron wrote:

Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron




Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Jonathan Rochkind
Not so much that it's a corner case in the sense of being unusual 
necessarily (I'm not sure); it's just something that fundamentally 
doesn't fit well into lucene's architecture.


I'm not sure that filing a JIRA will be much use; it's really unclear 
how one would get lucene to do this, it would be significant work to do, 
and it's unlikely any Solr developer is going to decide to spend 
significant time on it unless they need it for their own clients.


On 8/3/2011 11:40 AM, Olson, Ron wrote:

*Sigh*...I had thought maybe reversing it would work, but that would require 
creating a whole new index, on a separate core, as the existing index is used 
for other purposes. Plus, given the volume of data, that would be a big deal, 
update-wise. What would be better would be to remove that particular sort 
option-button on the webpage. ;)

I'll create a Jira issue, but in the meanwhile I'll have to come up with something else. 
I guess I didn't realize how much of a corner case this problem is. :)

Thanks for the suggestions!

Ron

-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org]
Sent: Wednesday, August 03, 2011 10:26 AM
To: solr-user@lucene.apache.org
Subject: Re: Strategies for sorting by array, when you can't sort by array?

Hi Ron.
This is an interesting problem you have. One idea would be to create an index 
with the entity relationship going in the other direction.  So instead of one 
to many, go many to one.  You would end up with multiple documents with varying 
names but repeated parent entity information -- perhaps simply using just an ID 
which is used as a lookup. Do a search on this name field, sorting by a 
non-tokenized variant of the name field. Use Result-Grouping to consolidate 
multiple matches of a name to the same parent document. This whole idea might 
very well be academic, since duplicating all the parent entity information for 
searching on it too might be more than you care to bother with. And I 
don't think Solr 4's join feature addresses this use case. In the end, I think 
Solr could be modified to support this, with some work. It would make a good 
feature request in JIRA.
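
As an illustration of the denormalized approach above, here is a hedged sketch of what such a grouped request might look like, built with Python's urlencode; the field names name_sort and parent_id are illustrative, not from this thread:

```python
from urllib.parse import urlencode

# Search the per-name documents, sort by a non-tokenized copy of the name,
# and group matches back together by their parent entity's ID.
params = urlencode({
    "q": "name:smith",
    "sort": "name_sort asc",
    "group": "true",
    "group.field": "parent_id",
})
print(params)
```

The resulting string can be appended to the core's /select URL.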

~ David Smiley

On Aug 3, 2011, at 10:39 AM, Olson, Ron wrote:


Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron




Re: Solr request filter and indexing process

2011-08-03 Thread 于浩
Aha, I have found the root cause: Solr has been returning the result properly.
The root cause is the SolrPHPClient, which uses the file_get_contents function
for connecting to Solr by default; this function is not stable and often
returns an HTTP status error.

Thanks to everybody who helped me. Good luck!

2011/8/2 Chris Hostetter hossman_luc...@fucit.org


 : thanks for the reply. This is tomcat log files on my Solr Server:
 : I found that : if the server returns status=0 and QTime=0, the
 SolrPhpClient
 : will throughs an Exception. But the same query String will not always
 return
 : status=0 and QTime=0.  The Query String is valid, I have tested them in
 Solr

 I know nothing about PHP but if your client code is throwing an exception
 anytime status=0 and QTime=0 then it sounds like a bug in your client code
 -- there is no reason why those two numbers being 0 should be considered
 an error.  It just means the request was processed in under a millisecond.


 -Hoss



A rant about field collapsing

2011-08-03 Thread baronDodd
I am working on an implementation of search within our application using
solr.

About 2 months ago we had the need to group results by a certain field.
After some searching I came across the JIRA in progress for this - field
collapsing: https://issues.apache.org/jira/browse/SOLR-236

It was scheduled for the next solr release and had a full set of proper JIRA
subtasks and patch files of almost complete implementations attached. So as
you can imagine I was happy to apply this patch and build it into our
application and await for the next release when it would be part of the main
trunk.

Now imagine my surprise when we have come around to upgrade to see that
suddenly field collapsing has been thrown away in favour of a totally
different grouping implementation
https://issues.apache.org/jira/browse/SOLR-2524

How was it decided that this would be used instead? It was not made very
clear that LUCENE-1421 was in progress which would effectively make the
field collapsing work irrelevant by fixing the problem in lucene rather than
primarily in solr. This has cost me days of work to now merge our custom
changes somehow to the new implementation. I guess it is my own fault for
basing our custom changes around an unresolved enhancement but as SOLR-236
had been 3-4 years in progress and SOLR-2524 did not exist at the time it
seemed pretty safe to assume that the same problem was not being fixed in 2
totally different ways!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/A-rant-about-field-collapsing-tp3222798p3222798.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Dismax mm per field

2011-08-03 Thread Dmitriy Shvadskiy
Thanks Jonathan. I thought it would be possible via nested queries but
somehow could not get it to work.
I'll give it another shot.

On Wed, Aug 3, 2011 at 12:32 PM, Jonathan Rochkind [via Lucene] 
ml-node+3222792-952640420-221...@n3.nabble.com wrote:

 There is not, and the way dismax works makes it not really that feasible
 in theory, sadly.

 One thing you could do instead is combine multiple separate dismax
queries using the nested query syntax. This will possibly affect your relevancy 
ranking in odd ways, but anything that accomplishes 'mm per 
field' will necessarily not really be using dismax's disjunction-max 
relevancy ranking in the way it's intended.

 Here's how you could combine two seperate dismax queries:

 defType=lucene
 q=_query_:"{!dismax qf=field1 mm=100%}blah blah" AND _query_:"{!dismax
 qf=field2 mm=80%}foo bar"

 That whole q value would need to be properly URI escaped, which I
 haven't done here for human-readability.

 Dismax has always got an mm, there's no way to not have an mm with
 dismax, but mm 100% might be what you mean. Of course, one of those
 queries could also not be dismax at all, but ordinary lucene query
 parser or anything else. And of course you could use the same query
 text for both nested queries, e.g. repeating blah blah in both.
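
Since the combined q value has to be URI-escaped before it is sent, here is a small Python sketch of that step (the quoting of each nested query string is my assumption about the intended syntax; the parameter names are standard Solr ones):

```python
from urllib.parse import urlencode

# Two dismax sub-queries with different mm values, combined by the
# lucene query parser via the _query_ nested-query syntax.
q = ('_query_:"{!dismax qf=field1 mm=100%}blah blah" AND '
     '_query_:"{!dismax qf=field2 mm=80%}foo bar"')
params = urlencode({"defType": "lucene", "q": q})
print(params)
```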



 On 8/3/2011 11:24 AM, Dmitriy Shvadskiy wrote:

  Hello,
  Is there a way to apply (e)dismax mm parameter per field? If I have a
 query
  field1:(blah blah) AND field2:(foo bar)
 
  is there a way to apply mm only to field2?
 
  Thanks,
  Dmitriy
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-tp3222594p3222594.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-tp3222594p3222851.html
Sent from the Solr - User mailing list archive at Nabble.com.

Records skipped when using DataImportHandler

2011-08-03 Thread anand sridhar
Hi,
I am a newbie to Solr and have been trying to learn using
DataImportHandler.
I have a query in data-config.xml that fetches about 5 records when I fire
it in SQL Query Manager.
However, when Solr does a full import, it is skipping 4 records and only
importing 1 record.
What could be the reason for that?

My data-config.xml looks like this -

<dataConfig>
  <dataSource type="JdbcDataSource"
              name="GeoService"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://10.168.50.104/ZipCodeLookup"
              user="sa"
              password="psiuser"/>
  <document>
    <entity name="city"
            query="select ll.cityId as id, ll.zip as zipCode, c.cityName as cityName, st.stateName as state, ct.countryName as country from latlonginfo ll, city c, state st, country ct where ll.cityId = c.cityID and c.stateID = st.stateID and st.countryID = ct.countryID order by ll.areacode"
            dataSource="GeoService">
      <field column="zipCode" name="zipCode"/>
      <field column="cityName" name="cityName"/>
      <field column="state" name="state"/>
      <field column="country" name="country"/>
    </entity>
  </document>
</dataConfig>

My fields definition in schema.xml looks as below -

<field name="CityName" type="text_general" indexed="true" stored="true"/>
<field name="zipCode" type="text_general" indexed="true" stored="true"/>
<field name="state" type="text_general" indexed="true" stored="true"/>
<field name="country" type="text_general" indexed="true" stored="true"/>

One observation I made is that the one record being indexed is the last
record in the result set.  I have verified that there are no duplicate
records being retrieved.

For eg, if the result set from Database is -

zipcode  CityName   state country
---  -   - ---
91324 Northridge CA USA
91325 Northridge CA USA
91327 Northridge CA USA
91328 Northridge CA USA
91329 Northridge CA USA
91330 Northridge CA USA

The record being indexed is the last record all the time.

Any suggestions are welcome.

Thanks,
Anand


Setting up Namespaces to Avoid Running Multiple Solr Instances

2011-08-03 Thread Mike Papper
Hi, we run several independent websites on the same machines. Each site uses
a similar codebase for search. Currently each site contacts its own solr
server on a slightly different port. This means of course that we are
running several solr servers (each on their own port) on the same machine. I
would like to make this simpler by running just one server, listening on one
port. Can we do this and at the same time have the indexes and search data
separated for each web site?

So, I'm asking if I can namespace or federate the solr server. But by doing
so I would like to have the indexes etc. not comingled within the server.

I'm new to solr so there might be a hiccup from the fact that currently each
solr server points to its own directory on a site-specific path (something
like /apps/site/solr/*) which contains the solr plugin (we're using Ruby on
Rails). Can this be set up as a namespace (one for each web site) within the
single server instance?

Mike


Re: Setting up Namespaces to Avoid Running Multiple Solr Instances

2011-08-03 Thread Jonathan Rochkind
I think that Solr multi-core (nothing to do with CPU cores, just what 
it's called in Solr) is what you're looking for. 
http://wiki.apache.org/solr/CoreAdmin
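
For reference, a minimal multi-core solr.xml along the lines described on that page (the core names and paths here are illustrative):

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- one core per site; each instanceDir carries its own conf/ and data/ -->
    <core name="site1" instanceDir="site1"/>
    <core name="site2" instanceDir="site2"/>
  </cores>
</solr>
```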


On 8/3/2011 2:25 PM, Mike Papper wrote:

Hi, we run several independent websites on the same machines. Each site uses
a similar codebase for search. Currently each site contacts its own solr
server on a slightly different port. This means of course that we are
running several solr servers (each on their own port) on the same machine. I
would like to make this simpler by running just one server, listening on one
port. Can we do this and at the same time have the indexes and search data
separated for each web site?

So, I'm asking if I can namespace or federate the solr server. But by doing
so I would like to have the indexes etc. not comingled within the server.

I'm new to solr so there might be a hiccup from the fact that currently each
solr server points to its own directory on a site-specific path (something
like /apps/site/solr/*) which contains the solr plugin (we're using Ruby on
Rails). Can this be set up as a namespace (one for each web site) within the
single server instance?

Mike



RE: question on solr.ASCIIFoldingFilterFactory

2011-08-03 Thread cquezel

lboutros wrote:
 
 I used Spanish stemming, put the ASCIIFoldingFilterFactory before the
 stemming filter and added it in the query part too.
 
 Ludovic.
 

My experiments with the French stemmer do not yield good results with this
order. Applying the ASCIIFoldingFilterFactory before stemming confuses the
language-specific stemmer. For example:

 étranglée => ASCIIFoldingFilterFactory => etranglee => FrenchStemmer
=> etranglee
 étranglé => ASCIIFoldingFilterFactory => etrangle => FrenchStemmer =>
etrangl

whereas with folding applied after stemming:

 étranglée => FrenchStemmer => étrangl => ASCIIFoldingFilterFactory =>
etrangl
 étranglé => FrenchStemmer => étrangl => ASCIIFoldingFilterFactory =>
etrangl
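
For reference, a sketch of a schema.xml analyzer chain reflecting this order (stem first, then fold); the factory names are standard Solr ones, while the fieldType name is illustrative:

```xml
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stem the accented form first... -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <!-- ...then fold diacritics, so étrangl and etrangl collapse together -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```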



--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-on-solr-ASCIIFoldingFilterFactory-tp2780463p3223314.html
Sent from the Solr - User mailing list archive at Nabble.com.


Does solr support multiple index set

2011-08-03 Thread Sharath Jagannath
Hey,

This might be completely naive question.

Could I create more than one index set on a single instance of
solr server?
If so, how could I specify which schema to use and which index set to use?


I am planning to create 2 separate index sets using a single solr server.
The data that needs to be indexed comes from 2 disparate sources and has
two different schemas.

I want to create 2 separate schema like
<schema name="example1" version="1.4">
</schema>

<schema name="example2" version="1.4">
</schema>

and do all the regular operations (index, update, delete and query).


Thanks,
Sharath


Re: Does solr support multiple index set

2011-08-03 Thread Helton Alponti
Hello Sharath,

Yes you can create many indexes.

See this article: http://wiki.apache.org/solr/CoreAdmin

See you,
Helton

On Wed, Aug 3, 2011 at 4:55 PM, Sharath Jagannath
shotsonclo...@gmail.comwrote:

 Hey,

 This might be completely naive question.

 Could, I create more than one instance of index sets on a single instance
 of
 solr server?
 If so, how could I specify which schema to use and which index set to use.


 I am planning to create 2 separate index set using a single solr server.
 Data that needs to be indexed are coming from 2 disparate source and have
 different scheme.

 I want to create 2 separate schema like
 <schema name="example1" version="1.4">
 </schema>

 <schema name="example2" version="1.4">
 </schema>

 and do all the regular operations (index, update, delete and query).


 Thanks,
 Sharath



Help with ShardParams

2011-08-03 Thread John Brewer
Hello,

  Can someone point me to a good example or two of the usage of the ShardParams 
shards.start and shards.rows?

  I have a Solr instance of 250M documents spread across 4 shards. And I need 
to be able to reliably and quickly access the records by page at the request 
of the user.

  I understand the searching limitation of the Distributed search when the 
start parameter gets high and have recently found the ShardParams and was 
hoping that this might be of some use.

Thanks,
John


Is there anyway to sort differently for facet values?

2011-08-03 Thread Way Cool
Hi, guys,

Is there any way to sort differently for facet values? For example, sometimes
I want to sort facet values by their values instead of # of docs, and I want
to be able to have a predefined order for certain facets as well. Is that
possible in Solr?

Thanks,

YH


Re: indexing taking very long time

2011-08-03 Thread Erick Erickson
What version of Solr are you using? If it's a recent version, then
optimizing is not that  essential, you can do it during off hours, perhaps
nightly or weekly.

As far as indexing speed, have you profiled your application to see whether
it's Solr or your indexing process that's the bottleneck? A quick check
would be to monitor the CPU utilization on the server and see if it's high.

As far as multithreading, one option is to simply have multiple clients
indexing simultaneously. But you haven't indicated how the indexing is being
done. Are you using DIH? SolrJ? Streaming documents to Solr? You have to
provide those kinds of details to get meaningful help.
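
To illustrate the commit-every-10k-threads strategy the original post mentions, here is a hedged Python sketch; the HTTP transport is stubbed out as a `post` callable (a real client would POST each payload to Solr's /update handler):

```python
import xml.etree.ElementTree as ET

def docs_to_add_xml(docs):
    """Build a Solr <add> payload from a list of field dicts."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def index_in_batches(docs, post, batch=10000):
    """Send documents in batches, committing once per batch, not per doc."""
    for i in range(0, len(docs), batch):
        post(docs_to_add_xml(docs[i:i + batch]))
        post("<commit/>")
```

With batch=10000 this matches the commit cadence described above; several such clients can also run in parallel to push more load to the server.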

Best
Erick
On Aug 2, 2011 8:06 AM, Naveen Gupta nkgiit...@gmail.com wrote:
 Hi

 We have a requirement where we are indexing all the messages of a thread;
 a thread may have attachments too. We are adding them to Solr for indexing
 and searching, in order to apply a few business rules.

 For a user, we have many threads (almost 100k), and each thread may
 have 10-20 messages.

 Now what we are finding is that it is taking 30 minutes to index all the
 threads.

 When we run optimize then it is taking faster time.

 The question here is that how frequently this optimize should be called
and
 when ?

 Please note that we are following commit strategy (that is every after 10k
 threads, commit is called). we are not calling commit after every doc.

 Secondly how can we use multi threading from solr perspective in order to
 improve jvm and other utilization ?


 Thanks
 Naveen


Re: lucene/solr, raw indexing/searching

2011-08-03 Thread Erick Erickson
I predict you'll spend a lot of time on the admin/analysis page
understanding what the various combinations of tokenizers and filters do.
Because, you see, you already have differences, to wit: your Solr schema
has LowercaseFilter and removeDuplicates.

Have you determined *why* Solr indexing is slower? You might consider using
SolrJ and firing multiple threads/processes at the issue to bring indexing
performance up to acceptable levels and avoid this problem entirely.

Best
Erick
On Aug 2, 2011 12:37 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 In your solr schema.xml, are the fields you are using defined as text
 fields with analyzers? It sounds like you want no analysis at all, which
 probably means you don't want text fields either, you just want string
 fields. That will make it impossible to search for individual tokens
 though, searches will match only on complete matches of the value.

 I'm not quite sure how to do what you want, it depends on exactly what
 you want. What kind of searching do you expect to support? If you still
 do want tokenization, you'll still want some analysis... but I'm not
 quite sure how that corresponds to what you'd want to do on the lucene
 end. What you're trying to do is going to be inevitably confusing, I
 think. Which doesn't mean it's not possible. You might find it less
 confusing if you were willing to use Solr to index though, rather than
 straight lucene -- you could use Solr via the SolrJ java classes, rather
 than the HTTP interface.
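
As a sketch, here is a schema.xml fieldType that tokenizes on whitespace only, so tokens such as rev. and 23.302 survive verbatim (the fieldType name is illustrative, and it only lines up with the Lucene side if the same analyzer is used there):

```xml
<fieldType name="text_raw" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- whitespace-only tokenization: no splitting on ".", no lowercasing -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```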

 On 8/2/2011 11:14 AM, dhastings wrote:
 Hello,
 I am trying to get lucene and solr to agree on a completely Raw indexing
 method. I use lucene in my indexers that write to an index on disk, and
 solr to search those indexes that i create, as creating the indexes
without
 solr is much much faster than using the solr server.

 Are there settings for BOTH solr and lucene to use EXACTLY what's in the
 content, as opposed to interpreting what it thinks I'm trying to do? My
 content is extremely specific and needs no interpretation or adjustment,
 for either indexing or searching, of a text field.

 for example:

 203.1 seems to be indexed as 2031. Searching for 203.1 I can get to work
 correctly, but then it won't find what's indexed using 3.1's standard
 analyzer.

 if i have content that is :
 this is rev. 23.302

 i need it indexed EXACTLY as it appears,
 this is rev. 23.302

 I do not want any of solr or lucenes attempts to fix my content or my
 queries. rev. needs to stay rev. and not turn into rev, 23.302
 needs to stay as such, and NOT turn into 23302. this is for BOTH
indexing
 and searching.

 any hints?

 right now for indexing i have:

 Set nostopwords = new HashSet();
 nostopwords.add("buahahahahahaha");

 Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
 writer = new IndexWriter(fsDir, an, MaxFieldLength.UNLIMITED);
 writer.setUseCompoundFile(false);


 and for searching i have in my schema :


 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>


 Thanks. Very much appreciated.


 --
 View this message in context:
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: MultiSearcher/ParallelSearcher - searching over multiple cores?

2011-08-03 Thread Erick Erickson
As far as I know, you're right. There's no built-in way to do what you want,
especially since the fact that you're talking about different search fields
implies that the scores from the documents aren't comparable anyway. How do
you intend to combine the results for presentation to the user?

Best
Erick
On Aug 2, 2011 5:11 PM, Ralf Musick ra...@gmx.de wrote:


Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-03 Thread Erick Erickson
Hmmm, the only thing that comes to mind is the join feature being added to
Solr 4.x, but I confess I'm not entirely familiar with that functionality so
can't tell if it really solves your problem.

Other than that I'm out of ideas, but then again it's late and I'm tired so
maybe I'm not being very creative <G>...

Best
Erick
On Aug 3, 2011 11:40 AM, karsten-s...@gmx.de wrote:


Re: Records skipped when using DataImportHandler

2011-08-03 Thread Erick Erickson
Sorry, I'm on a restricted machine so can't get the precise URL. But
there's a debug page for DIH that might allow you to see what the query
actually returns. I'd guess one of two things:
1) you aren't getting the number of rows you think.
2) you aren't committing the documents you add.

But that's just a guess.

Best
Erick
On Aug 3, 2011 2:15 PM, anand sridhar anand.for...@gmail.com wrote:
 Hi,
 I am a newbie to Solr and have been trying to learn using
 DataImportHandler.
 I have a query in data-config.xml that fetches about 5 records when i fire
 it in SQL Query manager.
 However, when Solr does a full import, it is skipping 4 records and only
 importing 1 record.
 What could be the reason for that. ?

 My data-config.xml looks like this -

 <dataConfig>
 <dataSource type="JdbcDataSource"
 name="GeoService"
 driver="net.sourceforge.jtds.jdbc.Driver"
 url="jdbc:jtds:sqlserver://10.168.50.104/ZipCodeLookup"
 user="sa"
 password="psiuser"/>
 <document>
 <entity name="city"
 query="select ll.cityId as id, ll.zip as zipCode, c.cityName as cityName, st.stateName as state, ct.countryName as country from latlonginfo ll, city c, state st, country ct where ll.cityId = c.cityID and c.stateID = st.stateID and st.countryID = ct.countryID order by ll.areacode"
 dataSource="GeoService">
 <field column="zipCode" name="zipCode"/>
 <field column="cityName" name="cityName"/>
 <field column="state" name="state"/>
 <field column="country" name="country"/>
 </entity>
 </document>
 </dataConfig>

 My fields definition in schema.xml looks as below -

 <field name="CityName" type="text_general" indexed="true" stored="true"/>
 <field name="zipCode" type="text_general" indexed="true" stored="true"/>
 <field name="state" type="text_general" indexed="true" stored="true"/>
 <field name="country" type="text_general" indexed="true" stored="true"/>

 One observation I made is that the one record being indexed is the last
 record in the result set. I have verified that there are no duplicate
 records being retrieved.

 For eg, if the result set from Database is -

 zipcode CityName state country
 --- - - ---
 91324 Northridge CA USA
 91325 Northridge CA USA
 91327 Northridge CA USA
 91328 Northridge CA USA
 91329 Northridge CA USA
 91330 Northridge CA USA

 The record being indexed is the last record all the time.

 Any suggestions are welcome.

 Thanks,
 Anand


Re: Is there anyway to sort differently for facet values?

2011-08-03 Thread Erick Erickson
Have you looked at the facet.sort parameter? The "index" value is what I
think you want.
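
For example (field name illustrative):

```
/select?q=*:*&facet=true&facet.field=category&facet.sort=index
```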

Best
Erick
On Aug 3, 2011 7:03 PM, Way Cool way1.wayc...@gmail.com wrote:
 Hi, guys,

 Is there anyway to sort differently for facet values? For example,
sometimes
 I want to sort facet values by their values instead of # of docs, and I
want
 to be able to have a predefined order for certain facets as well. Is that
 possible in Solr we can do that?

 Thanks,

 YH


Re: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory'

2011-08-03 Thread Satish Talim
Guys, I am still stuck. Any help?

Thanks,

Satish

On Tue, Aug 2, 2011 at 5:23 PM, Robert Muir rcm...@gmail.com wrote:

 did you add the analysis-extras jar itself? That's what has this factory.

 On Tue, Aug 2, 2011 at 5:03 AM, Satish Talim satish.ta...@gmail.com
 wrote:
  I am using Solr 3.3 on a Windows box.
 
  I want to use the solr.ICUTokenizerFactory in my schema.xml and added the
  <fieldType name="text_icu"> as per the URL -
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory
 
  I also added the following files to my apache-solr-3.3.0\example\lib
 folder:
  lucene-icu-3.3.0.jar
  lucene-smartcn-3.3.0.jar
  icu4j-4_8.jar
  lucene-stempel-3.3.0.jar
 
  When I start my Solr server from apache-solr-3.3.0\example folder:
  java -jar start.jar
 
  I get the following errors:
 
  SEVERE: org.apache.solr.common.SolrException: Error loading class
  'solr.ICUTokenizerFactory'
 
  SEVERE: org.apache.solr.common.SolrException: analyzer without class or
  tokenizer & filter list
 
  SEVERE: org.apache.solr.common.SolrException: Unknown fieldtype
 'text_icu'
  specified on field subject
 
  I tried adding various other jar files to the lib folder but it does not
  help.
 
  What am I doing wrong?
 
  Satish
 




Highlighting does not works with uniqueField set

2011-08-03 Thread Anand.Nigam
Hi,

I am new to solr. I am facing an issue wherein the highlighting of the 
search results for matches is not working when I have set a unique field as:

<uniqueKey>id</uniqueKey>

If this is commented out then highlighting starts working. I need to have a unique 
field. Could someone please explain this erratic behaviour. I am setting this 
field while posting the documents to be indexed.

Thanks & Regards,
Anand


***
 
The Royal Bank of Scotland plc. Registered in Scotland No 90312. 
Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB. 
Authorised and regulated by the Financial Services Authority. The 
Royal Bank of Scotland N.V. is authorised and regulated by the 
De Nederlandsche Bank and has its seat at Amsterdam, the 
Netherlands, and is registered in the Commercial Register under 
number 33002587. Registered Office: Gustav Mahlerlaan 350, 
Amsterdam, The Netherlands. The Royal Bank of Scotland N.V. and 
The Royal Bank of Scotland plc are authorised to act as agent for each 
other in certain jurisdictions. 
  
This e-mail message is confidential and for use by the addressee only. 
If the message is received by anyone other than the addressee, please 
return the message to the sender by replying to it and then delete the 
message from your computer. Internet e-mails are not necessarily 
secure. The Royal Bank of Scotland plc and The Royal Bank of Scotland 
N.V. including its affiliates (RBS group) does not accept responsibility 
for changes made to this message after it was sent. For the protection
of RBS group and its clients and customers, and in compliance with
regulatory requirements, the contents of both incoming and outgoing
e-mail communications, which could include proprietary information and
Non-Public Personal Information, may be read by authorised persons
within RBS group other than the intended recipient(s). 

Whilst all reasonable care has been taken to avoid the transmission of 
viruses, it is the responsibility of the recipient to ensure that the onward 
transmission, opening or use of this message and any attachments will 
not adversely affect its systems or data. No responsibility is accepted 
by the RBS group in this regard and the recipient should carry out such 
virus and other checks as it considers appropriate. 

Visit our website at www.rbs.com 

***
  


java.lang.IllegalStateException: Committed error in the logs

2011-08-03 Thread Anand.Nigam
I am getting the following error in the logs when trying to search. Any idea why this error 
occurs? Search results come back only after a long delay.



SEVERE: org.mortbay.jetty.EofException
	at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)
	at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
	at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)
	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:278)
	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
	at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
	at org.apache.solr.common.util.FastWriter.flush(FastWriter.java:115)
	at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:344)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
	at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
	at org.mortbay.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:368)
	at org.mortbay.io.bio.StreamEndPoint.flush(StreamEndPoint.java:129)
	at org.mortbay.io.bio.StreamEndPoint.flush(StreamEndPoint.java:149)
	at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:714)
	... 25 more

2011-08-04 06:05:10.550:WARN::Committed before 500 

csv responsewriter and numfound

2011-08-03 Thread Pooja Verlani
Hi,

Is there any way to get numFound from the CSV response format? Some parameter?
Or shall I change the code for csvResponseWriter for this?

Thanks,
Pooja