Early Access Release #3 for Solr 4.x Deep Dive book is now available for download on Lulu.com

2013-07-19 Thread Jack Krupansky
Okay, it’s hot off the e-presses: Solr 4.x Deep Dive, Early Access Release #3 
is now available for purchase and download as an e-book for $9.99 on Lulu.com 
at:

http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html


(That link says “1”, but it apparently correctly redirects to EAR #3.)

My recent blog posts over the past two weeks detailed the changes from EAR#2. 
Besides more cleanup, the focus was on features of Solr 4.4, including update 
processors and token filters. I still haven’t finished 4.4 coverage, but this 
is progress.

See:
http://basetechnology.blogspot.com/

The next EAR will be in approximately two weeks, contents TBD.

If you have purchased EAR#1 or #2, there is no need to rush out and pick up 
EAR#3. I mean, the technical content changes were relatively modest (68 new 
pages), and EAR#4 will be out in another two weeks anyway. That said, EAR#3 is 
a significant improvement over EAR#1 and EAR#2.

-- Jack Krupansky

Re: Sort by document similarity counts

2013-07-19 Thread zygis


Not sure if it will work. Say we have a SearchComponent which does this in its 
process method:

1. DocList docs = rb.getResults().docList;

2. Go over docs and for each doc do:

3. 
BooleanQuery q = new BooleanQuery(); // construct a query which matches all docs
// that are not equal to the current one and are from a different host (we deal
// with web pages there)
q.add(new TermQuery(new Term("host", host)), BooleanClause.Occur.MUST_NOT);
q.add(new TermQuery(new Term("id", name)), BooleanClause.Occur.MUST_NOT);
DocListAndSet sim = searcher.getDocListAndSet(q, (TermQuery) null, null, 0,
    1000); // TODO: how to set a proper limit instead of the hard-coded 1000

4. for all docs in sim calculate similarity to current doc (from #2)

5. Count all similar documents and add a new field:
            FieldType ft = new FieldType();
            ft.setStored(true);
            ft.setIndexed(true);
            Field f = new IntField("similarCount", ds.size(), ft);
            d.add(f);


Now the problem is with #1: the docs come in already sorted. That is, if I call 
Solr with q=*&sort=similarityCount, the sort is applied before the last component 
is called, which performs all the steps defined above. If I add this to 
first-components then the call in #1 will return null.


A completely different approach would be to calculate aggregate values on update 
via an UpdateRequestProcessor. But then I need to be able to do searches in the 
update processor (step #3), and in that case docs are available to the searcher 
only after a commit. I'd expect the following to work, but the search always 
returns 0:

public void processCommit(CommitUpdateCommand cmd) throws IOException {
    TopDocs docs = searcher.search(new MatchAllDocsQuery(), 100);
    DocListAndSet sim = searcher.getDocListAndSet(
        new MatchAllDocsQuery(), (TermQuery) null, null, 0, 10);
    DocList docList = sim.docList; // is always empty

(Tried placing it after solr.RunUpdateProcessorFactory in update chain, no 
change)

Even if the searcher worked, it looks bad, because in this case I would need to 
update not only the incoming document but also all the documents which are 
similar to the current one (that is, if A is similar to B and C, then B and C are 
similar to A, and the similarCount field has to be increased in B and C as well).




 From: Koji Sekiguchi k...@r.email.ne.jp
To: solr-user@lucene.apache.org 
Sent: Thursday, July 18, 2013 4:29 PM
Subject: Re: Sort by document similarity counts
 

 I have tried doing this via custom SearchComponent, where I can find all 
 similar documents for each document in current search result, then add a new 
 field into document hoping to use sort parameter (q=*&sort=similarityCount).

I don't understand this part very well, but:

 But this will not work because sort is done before handling my custom search 
 component, if added via last-components. Can't add it via first-components, 
 because then I will have no access to query results. And I do not want to 
 override QueryComponent because I need to have all the functionality it 
 covers: grouping, facets, etc.

You may want to put your custom SearchComponent in last-components and inject a 
SortSpec in your prepare() so that QueryComponent can sort the result according 
to your SortSpec.
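
(A minimal sketch of that idea; the "similarityCount" field name and the empty
process() body are placeholders taken from this thread, not a tested
implementation:)

import java.io.IOException;
import org.apache.lucene.search.Sort;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.SortSpec;

public class SimilaritySortComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Inject the sort before QueryComponent runs, so the result list
        // comes back ordered even though this component sits in last-components.
        SchemaField sf = rb.req.getSchema().getField("similarityCount");
        Sort sort = new Sort(sf.getSortField(true)); // true = descending
        SortSpec old = rb.getSortSpec();
        rb.setSortSpec(new SortSpec(sort, old.getOffset(), old.getCount()));
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // compute and attach the similarity counts here (steps 1-5 above)
    }

    @Override
    public String getDescription() { return "sort by similarity count"; }

    @Override
    public String getSource() { return null; }
}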

koji
-- 
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-19 Thread Neil Prosser
While indexing some documents to a SolrCloud cluster (10 machines, 5 shards
and 2 replicas, so one replica on each machine) one of the replicas stopped
receiving documents, while the other replica of the shard continued to grow.

That was overnight so I was unable to track exactly what happened (I'm
going off our Graphite graphs here). This morning when I was able to look
at the cluster both replicas of that shard were marked as down (with one
marked as leader). I attempted to restart the non-leader node but it took a
long time to restart so I killed it and restarted the old leader, which
also took a long time. I killed that one (I'm impatient) and left the
non-leader node to restart, not realising it was missing approximately 700k
documents that the old leader had. Eventually it restarted and became
leader. I restarted the old leader and it dropped the number of documents
it had to match the previous non-leader.

Is this expected behaviour when a replica with fewer documents is started
before the other and elected leader? Should I have been paying more
attention to the number of documents on the server before restarting nodes?

I am still in the process of tuning the caches and warming for these
servers but we are putting some load through the cluster so it is possible
that the nodes are having to work quite hard when a new version of the core
is made available. Is this likely to explain why I occasionally see
nodes dropping out? Unfortunately in restarting the nodes I lost the GC
logs, so I can't check whether that was the culprit. Is this the sort of
situation where you raise the ZooKeeper timeout a bit? Currently the
timeout for all nodes is 15 seconds.
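
(For anyone looking for where that lives: in Solr 4.x the timeout is the
zkClientTimeout attribute on the cores element in solr.xml; the stock example
ships with the 15-second default, e.g.:)

<cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}"
       zkClientTimeout="${zkClientTimeout:15000}">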

Are there any known issues which might explain what's happening? I'm just
getting started with SolrCloud after using standard master/slave
replication for an index which has got too big for one machine over the
last few months.

Also, is there any particular information that would be helpful to help
with these issues if it should happen again?


IDNA Support For Solr

2013-07-19 Thread Furkan KAMACI
Hi;

Is there any support for IDNA at Solr? (IDNA:
http://en.wikipedia.org/wiki/Internationalized_domain_name)


RE: IDNA Support For Solr

2013-07-19 Thread Markus Jelsma
Hi - What kind of support would you expect Solr to provide? IDN is only about 
conversion between Unicode in your address bar and ASCII in the DNS.
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Friday 19th July 2013 11:09
 To: solr-user@lucene.apache.org
 Subject: IDNA Support For Solr
 
 Hi;
 
 Is there any support for IDNA at Solr? (IDNA:
 http://en.wikipedia.org/wiki/Internationalized_domain_name)
 


Help !

2013-07-19 Thread narasimha.g
HI,

Need help on configuring SOLR search in Alfresco.

 --
 Regards
 Narasimha


Please do not print this email unless it is absolutely necessary.   

ATTENTION: The information in this electronic mail message is private and 
confidential, and only intended for the addressee. Should you receive this 
message by mistake, you are hereby notified that any disclosure, reproduction, 
distribution or use of this message is strictly prohibited. Please inform the 
sender by reply transmission and delete the message without copying or opening 
it. Messages and attachments are scanned for all viruses known. If this message 
contains password-protected attachments, the files have NOT been scanned for 
viruses by the ING mail domain. Always scan attachments before opening them.

ING Vysya Bank Limited, Regd  Corp off: ING Vysya House, # 22, M. G. Road, 
Bangalore – 560 001. www.ingvysyabank.com.


custom field type plugin

2013-07-19 Thread Kevin Stone
I have a particular use case that I think might require a custom field type, 
however I am having trouble getting the plugin to work.
My use case has to do with genetics data, and we are running into several 
situations where we need to be able to query multiple regions of a chromosome 
(or gene, or other object types). All that really boils down to is being able 
to give a number, e.g. 10234, and return documents that have regions containing 
the number. So you'd have a document with a list like 
[1:16090,400:8000,40123:43564], and it should come back because 10234 
falls between 1:16090. If there is a better or easier way to do this 
please speak up. I'd rather not have to use a join on another index, because 
1) it's more complex to set up, and 2) we might need to join against something 
else and you can only do one join at a time.

Anyway… I tried creating a field type similar to a PointType just to see if I 
could get one working. I added the following jars to get it to compile: 
apache-solr-core-4.0.0,lucene-core-4.0.0,lucene-queries-4.0.0,apache-solr-solrj-4.0.0.
 I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib folder, 
and specified it in my solr.xml (I have multiple cores).

After starting up solr, I got the line that it picked up the jar:
INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader

But I get this error about it not being able to find the 
AbstractSubTypeFieldType class.
Here is the first bit of the trace:

SEVERE: null:java.lang.NoClassDefFoundError: 
org/apache/solr/schema/AbstractSubTypeFieldType
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
...etc…


Any hints as to what I did wrong? I can provide source code, or a fuller stack 
trace, config settings, etc.

Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib, then 
repack. However, when I did that, I get a NoClassDefFoundError for my plugin 
itself.


Thanks,
Kevin

The information in this email, including attachments, may be confidential and 
is intended solely for the addressee(s). If you believe you received this email 
by mistake, please notify the sender by return email as soon as possible.


Re: Indexing into SolrCloud

2013-07-19 Thread Erick Erickson
Usually EOF errors indicate that the packets you're sending are too big.

Wait, though. 50K is not buffered docs, I think it's buffered _requests_.
So you're creating a queue that's ginormous and asking 2 threads to empty it.

But that's not really the issue, I suspect. How many documents are you adding
at a time when you call server.add? I.e. are you using server.add(doc) or
server.add(doclist)? If the latter and you're adding a bunch of docs, try
lowering that number. If you're sending one doc at a time I'm on the
wrong track.
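
(A hedged sketch of the batched-add pattern being described; the batch size of
500 is an arbitrary illustration, not a number from this thread:)

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class BatchedAdd {
    // Send documents in modest batches rather than one huge request.
    static void indexAll(SolrServer server, Collection<SolrInputDocument> docs)
            throws SolrServerException, IOException {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(500);
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            if (batch.size() >= 500) {
                server.add(batch); // one update request per 500 docs
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
    }
}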

Best
Erick

On Thu, Jul 18, 2013 at 2:51 PM, Beale, Jim (US-KOP) jim.be...@hibu.com wrote:
 Hey folks,

 I've been migrating an application which indexes about 15M documents from 
 straight-up Lucene into SolrCloud.  We've set up 5 Solr instances with a 3 
 zookeeper ensemble using HAProxy for load balancing. The documents are 
 processed on a quad core machine with 6 threads and indexed into SolrCloud 
 through HAProxy using ConcurrentUpdateSolrServer in order to batch the 
 updates.  The indexing box is heavily-loaded during indexing but I don't 
 think it is so bad that it would cause issues.

 I'm using Solr 4.3.1 on client and server side, zookeeper 3.4.5 and HAProxy 
 1.4.22.

 I've been accepting the default HttpClient with 50K buffered docs and 2 
 threads, i.e.,

 int solrMaxBufferedDocs = 50000;
 int solrThreadCount = 2;
 solrServer = new ConcurrentUpdateSolrServer(solrHttpIPAddress,
     solrMaxBufferedDocs, solrThreadCount);

 autoCommit is configured in the solrconfig as follows:

  <autoCommit>
    <maxTime>60</maxTime>
    <maxDocs>50</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>

 I'm getting the following errors on the client and server sides respectively:

 Client side:

 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO  
 SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught 
 when processing request: Software caused connection abort: socket write error
 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO  
 SystemDefaultHttpClient - Retrying request
 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO  
 SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught 
 when processing request: Software caused connection abort: socket write error
 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO  
 SystemDefaultHttpClient - Retrying request

 Server side:

 7988753 [qtp1956653918-23] ERROR org.apache.solr.core.SolrCore – 
 java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] 
 early EOF
 at 
 com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
 at 
 com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
 at 
 com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
 at 
 com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
 at 
 org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)

 When I disabled autoCommit on the server side, I didn't see any errors there 
 but I still get the issue client-side after about 2 million documents - which 
 is about 45 minutes.

 Has anyone seen this issue before?  I couldn't find anything useful on the 
 usual places.

 I suppose I could setup wireshark to see what is happening but I'm hoping 
 that someone has a better suggestion.

 Thanks in advance for any help!


 Best regards,
 Jim Beale

 hibu.com
 2201 Renaissance Boulevard, King of Prussia, PA, 19406
 Office: 610-879-3864
 Mobile: 610-220-3067

 The information contained in this email message, including any attachments, 
 is intended solely for use by the individual or entity named above and may be 
 confidential. If the reader of this message is not the intended recipient, 
 you are hereby notified that you must not read, use, disclose, distribute or 
 copy any part of this communication. If you have received this communication 
 in error, please immediately notify me by email and destroy the original 
 message, including any attachments. Thank you.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Auto-sharding and numShard parameter

2013-07-19 Thread Erick Erickson
First the numShards parameter is only relevant the very first time you
create your collection. It's a little confusing because in the SolrCloud
examples you're getting collection1 by default. Look further down the
SolrCloud Wiki page, the section titled
Managing Collections via the Collections API for creating collections
with a different name.

Either way, either when you run the bootstrap command or when you
create a new collection, that's the only time numShards counts. It's
ignored the rest of the time.

As far as data growing, you need to either
1) create enough shards to handle the eventual size things will be,
sometimes called "oversharding"
or
2) use the splitShard capabilities in very recent Solrs to expand
capacity.
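
(For reference, shard splitting in recent Solr goes through the Collections
API; a hypothetical call against the stock example cluster would look like:)

http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1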

Best
Erick

On Thu, Jul 18, 2013 at 4:52 PM, Flavio Pompermaier
pomperma...@okkam.it wrote:
 Hi to all,
 Probably this question has a simple answer but I just want to be sure of
 the potential drawbacks. When I run SolrCloud I run the main Solr instance
 with the -numShard option (e.g. 2).
 Then as data grows, shards could potentially become a huge number. If I
 had to restart all nodes and re-run the master with numShard=2,
 what will happen? Will it just be ignored, or will Solr try to reduce
 the shards?

 Another question...in SolrCloud, how do I restart all the cloud at once? Is
 it possible?

 Best,
 Flavio


Re: IDNA Support For Solr

2013-07-19 Thread Furkan KAMACI
I mean that:

there is a web address: çorba.com (http://xn--orba-zoa.com)

However, its IDNA-coded version is: xn--orba-zoa.com

You can check it from here:
http://www.whois.com.tr/?q=%C3%A7orbasldtld=com

Let's assume that I've indexed a web page with that URL (xn--orba-zoa.com) and
someone searches for the word çorba. Then I have to say that there is a URL
match for that search. However, since I've indexed that URL as IDNA-coded, I
will not be able to see that the URL includes the word çorba.



2013/7/19 Markus Jelsma markus.jel...@openindex.io

 Hi - What kind of support would you expect Solr to provide? IDN is only
 about conversion between Unicode in your address bar and ASCII in the DNS.

 -Original message-
  From:Furkan KAMACI furkankam...@gmail.com
  Sent: Friday 19th July 2013 11:09
  To: solr-user@lucene.apache.org
  Subject: IDNA Support For Solr
 
  Hi;
 
  Is there any support for IDNA at Solr? (IDNA:
  http://en.wikipedia.org/wiki/Internationalized_domain_name)
 



RE: IDNA Support For Solr

2013-07-19 Thread Markus Jelsma
No, you'll have to index the Unicode version of the domain name. Nutch 1.x 
already deals with this conversion for you. Or you could create a custom update 
processor for Solr and do the conversion there. It's quite simple; IDN is in the 
java.net package.
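
(A minimal illustration of the java.net.IDN conversion mentioned above, e.g.
for use inside a custom update processor; the sample domain comes from this
thread:)

import java.net.IDN;

public class IdnExample {
    public static void main(String[] args) {
        String ascii = "xn--orba-zoa.com";
        String unicode = IDN.toUnicode(ascii);  // "çorba.com"
        String back = IDN.toASCII(unicode);     // "xn--orba-zoa.com"
        System.out.println(unicode + " <-> " + back);
    }
}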
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Friday 19th July 2013 14:39
 To: solr-user@lucene.apache.org
 Subject: Re: IDNA Support For Solr
 
 I mean that:
 
 there is a web address: çorba.com (http://xn--orba-zoa.com)
 
 However, its IDNA-coded version is: xn--orba-zoa.com
 
 You can check it from here:
 http://www.whois.com.tr/?q=%C3%A7orbasldtld=com
 
 Let's assume that I've indexed a web page with that URL (xn--orba-zoa.com) and
 someone searches for the word çorba. Then I have to say that there is a URL
 match for that search. However, since I've indexed that URL as IDNA-coded, I
 will not be able to see that the URL includes the word çorba.
 
 
 
 2013/7/19 Markus Jelsma markus.jel...@openindex.io
 
  Hi - What kind of support would you expect Solr to provide? IDN is only
  about conversion between Unicode in your address bar and ASCII in the DNS.
 
  -Original message-
   From:Furkan KAMACI furkankam...@gmail.com
   Sent: Friday 19th July 2013 11:09
   To: solr-user@lucene.apache.org
   Subject: IDNA Support For Solr
  
   Hi;
  
   Is there any support for IDNA at Solr? (IDNA:
   http://en.wikipedia.org/wiki/Internationalized_domain_name)
  
 
 


Re: Auto-sharding and numShard parameter

2013-07-19 Thread Flavio Pompermaier
Thank you for the reply, Erick.
I was facing exactly that problem. From the documentation it seems that those
parameters are required to run SolrCloud; instead they are just used to
initialize a sample collection.
I think the examples in the user doc should better separate those two concepts:
one is starting the server, the other is creating/managing collections.

Best,
Flavio


On Fri, Jul 19, 2013 at 2:13 PM, Erick Erickson erickerick...@gmail.comwrote:

 First the numShards parameter is only relevant the very first time you
 create your collection. It's a little confusing because in the SolrCloud
 examples you're getting collection1 by default. Look further down the
 SolrCloud Wiki page, the section titled
 Managing Collections via the Collections API for creating collections
 with a different name.

 Either way, either when you run the bootstrap command or when you
 create a new collection, that's the only time numShards counts. It's
 ignored the rest of the time.

 As far as data growing, you need to either
 1) create enough shards to handle the eventual size things will be,
 sometimes called "oversharding"
 or
 2) use the splitShard capabilities in very recent Solrs to expand
 capacity.

 Best
 Erick

 On Thu, Jul 18, 2013 at 4:52 PM, Flavio Pompermaier
 pomperma...@okkam.it wrote:
  Hi to all,
  Probably this question has a simple answer but I just want to be sure of
  the potential drawbacks..when I run SolrCloud I run the main solr
 instance
  with the -numShard option (e.g. 2).
  Then as data grows, shards could potentially become a huge number. If I
  had to restart all nodes and re-run the master with numShard=2,
  what will happen? Will it just be ignored, or will Solr try to reduce
  the shards?
 
  Another question...in SolrCloud, how do I restart all the cloud at once?
 Is
  it possible?
 
  Best,
  Flavio



AW: Avoid Solr Pivot Faceting Out of Memory / Shorter result for pivot faceting requests with facet.pivot.ngroup=true and facet.pivot.showLastList=false

2013-07-19 Thread Sandro Zbinden
Dear Members,

Do you guys think I am better off asking this question in the Solr developer
group?

To summarize: I would like to add a facet.pivot.ngroup=true param to show the
count of the facet list. Further, I would like to avoid out-of-memory
exceptions by reducing the result of a facet.pivot query.

Best Regards 

Sandro Zbinden


-Original Message-
From: Sandro Zbinden [mailto:zbin...@imagic.ch] 
Sent: Wednesday, July 17, 2013 13:45
To: solr-user@lucene.apache.org
Subject: Avoid Solr Pivot Faceting Out of Memory / Shorter result for pivot 
faceting requests with facet.pivot.ngroup=true and 
facet.pivot.showLastList=false

Dear Usergroup,


I am getting an out-of-memory exception in the following scenario.
I have 4 SQL tables -- patient, visit, study and image -- that are denormalized
for the Solr index. The Solr index looks like the following:



| p_id | p_lastname | v_id | v_name  | ...
|------|------------|------|---------|----
| 1    | Miller     | 10   | Study 1 | ...
| 2    | Miller     | 11   | Study 2 | ...
| 2    | Miller     | 12   | Study 3 | ...  <-- duplication because of denormalization
| 3    | Smith      | 13   | Study 4 | ...

Now I am executing a facet query:

q=*:*&facet=true&facet.pivot=p_lastname,p_id&facet.limit=-1

And I get the following result:

<lst>
  <str name="field">p_lastname</str>
  <str name="value">Miller</str>
  <int name="count">3</int>
  <arr name="pivot">
    <lst>
      <str name="field">p_id</str>
      <int name="value">1</int>
      <int name="count">1</int>
    </lst>
    <lst>
      <str name="field">p_id</str>
      <int name="value">2</int>
      <int name="count">2</int>
    </lst>
  </arr>
</lst>
<lst>
  <str name="field">p_lastname</str>
  <str name="value">Smith</str>
  <int name="count">1</int>
  <arr name="pivot">
    <lst>
      <str name="field">p_id</str>
      <int name="value">3</int>
      <int name="count">1</int>
    </lst>
  </arr>
</lst>


The goal is to show our clients a list of the group values and, in parentheses,
how many patients each group contains:
- Miller (2)
- Smith (1)

This is why we need to use the facet.pivot method with facet.limit=-1. It is, as
far as I know, the only way to get a grouping for two criteria.
And we need the pivot list to count how many patients are in a group.


Currently this works well on smaller indexes, but if we have around 1'000'000
patients and we execute a query like the one above, we run into an out-of-memory
error. I figured out that the problem is not the calculation of the pivot but
the presentation of the result.
Because we load all fields (we cannot use facet.offset because we need to order
the results ascending and descending), the result can get really big.

To avoid this overload I made a change in the solr-core PivotFacetHandler.java
class. In the method doPivots I added the following code:

   NamedList<Integer> nl = this.getTermCounts(subField);
   pivot.add("ngroups", nl.size());

This gives me the group size of the list.
Then I removed the recursive call pivot.add("pivot", doPivots(nl, subField,
nextField, fnames, subset)). With this, my result looks like the following:

<lst>
  <str name="field">p_lastname</str>
  <str name="value">Miller</str>
  <int name="count">3</int>
  <int name="ngroup">2</int>
</lst>
<lst>
  <str name="field">p_lastname</str>
  <str name="value">Smith</str>
  <int name="count">1</int>
  <int name="ngroup">1</int>
</lst>


My question is now whether there is already something planned like
facet.pivot.ngroup=true and facet.pivot.showLastList=false to improve the
performance of pivot faceting.

Is there a chance we could get this into the Solr code? I think it's a really
small change but it could improve the product enormously.

Best Regards

Sandro Zbinden



Re: Help !

2013-07-19 Thread Gora Mohanty
On 19 July 2013 10:39,  narasimh...@ingvysyabank.com wrote:
 HI,

 Need help on configuring SOLR search in Alfresco.

Please do not ask questions that are so overly broad
that they are impossible to respond to. Firstly, do your
basic homework: Alfresco is now integrated with Solr.
Secondly, your question is more pertinent to an
Alfresco list. You might want to take a look at how
best to use mailing lists:
http://wiki.apache.org/solr/UsingMailingLists

Regards,
Gora


RE: Indexing into SolrCloud

2013-07-19 Thread Beale, Jim (US-KOP)
Hi Erick!

Thanks for the reply.  When I call server.add() it is just to add a single 
document.

But, still, I think you might be correct about the size of the ultimate 
request.  I decided to grab the bull by the horns by instantiating my own 
HttpClient and, in so doing, my first run changed the following parameters,

SOLR_HTTP_THREAD_COUNT=4
SOLR_MAX_BUFFERED_DOCS=10000
SOLR_MAX_CONNECTIONS=256
SOLR_MAX_CONNECTIONS_PER_HOST=128
SOLR_CONNECTION_TIMEOUT=0
SOLR_SO_TIMEOUT=0

I doubled the number of emptying threads, reduced the size of the request 
buffer 5x, increased the connection limits and set the timeouts to infinite.  
(I'm not actually sure what the defaults for the timeouts were since I didn't 
see them in the Solr code and didn't track it down.)

Anyway, the good news is that this combination of parameters worked.  The bad 
news is that I don't know whether it was resolved by changing one or more of 
the parameters.

But, regardless, I think the whole experiment verifies your thinking that the 
request was too big!

Thanks again!! :)


Jim Beale
Lead Developer
hibu.com
2201 Renaissance Boulevard, King of Prussia, PA, 19406
Office: 610-879-3864
Mobile: 610-220-3067




-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, July 19, 2013 8:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing into SolrCloud

Usually EOF errors indicate that the packets you're sending are too big.

Wait, though. 50K is not buffered docs, I think it's buffered _requests_.
So you're creating a queue that's ginormous and asking 2 threads to empty it.

But that's not really the issue, I suspect. How many documents are you adding
at a time when you call server.add? I.e. are you using server.add(doc) or
server.add(doclist)? If the latter and you're adding a bunch of docs, try
lowering that number. If you're sending one doc at a time I'm on the
wrong track.

Best
Erick

On Thu, Jul 18, 2013 at 2:51 PM, Beale, Jim (US-KOP) jim.be...@hibu.com wrote:
 Hey folks,

 I've been migrating an application which indexes about 15M documents from 
 straight-up Lucene into SolrCloud.  We've set up 5 Solr instances with a 3 
 zookeeper ensemble using HAProxy for load balancing. The documents are 
 processed on a quad core machine with 6 threads and indexed into SolrCloud 
 through HAProxy using ConcurrentUpdateSolrServer in order to batch the 
 updates.  The indexing box is heavily-loaded during indexing but I don't 
 think it is so bad that it would cause issues.

 I'm using Solr 4.3.1 on client and server side, zookeeper 3.4.5 and HAProxy 
 1.4.22.

 I've been accepting the default HttpClient with 50K buffered docs and 2 
 threads, i.e.,

 int solrMaxBufferedDocs = 50000;
 int solrThreadCount = 2;
 solrServer = new ConcurrentUpdateSolrServer(solrHttpIPAddress,
     solrMaxBufferedDocs, solrThreadCount);

 autoCommit is configured in the solrconfig as follows:

  <autoCommit>
    <maxTime>60</maxTime>
    <maxDocs>50</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>

 I'm getting the following errors on the client and server sides respectively:

 Client side:

 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO  
 SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught 
 when processing request: Software caused connection abort: socket write error
 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO  
 SystemDefaultHttpClient - Retrying request
 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO  
 SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught 
 when processing request: Software caused connection abort: socket write error
 2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO  
 SystemDefaultHttpClient - Retrying request

 Server side:

 7988753 [qtp1956653918-23] ERROR org.apache.solr.core.SolrCore – 
 java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] 
 early EOF
 at 
 com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
 at 
 com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
 at 
 com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
 at 
 com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
 at 
 org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)

 When I disabled autoCommit on the server side, I didn't see any errors there 
 but I still get the issue client-side after about 2 million documents - which 
 is about 45 minutes.

 Has anyone seen this issue before?  I couldn't find anything useful on the 
 usual places.

 I suppose I could setup wireshark to see what is happening but I'm hoping 
 that someone has a better suggestion.

 Thanks in advance for any help!


 Best regards,
 Jim Beale

 hibu.com
 2201 Renaissance Boulevard, King of Prussia, PA, 19406
 Office: 610-879-3864
 Mobile: 

Date for 4.4 solr release

2013-07-19 Thread Jabouille Jean Charles

Hi,

we are currently using Solr 4.2.1. There are a lot of fixes in 4.4
that we need. Can we have an approximate date for the first stable
release of Solr 4.4, please?

Regards,

jean charles

Kelkoo SAS
A French simplified joint-stock company (Société par Actions Simplifiée)
Share capital: €4,168,964.30
Registered office: 8, rue du Sentier, 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively for
their addressees. If you are not the intended recipient of this message, please
delete it and notify the sender.


Re: dataimporter, custom fields and parsing error

2013-07-19 Thread Alexandre Rafalovitch
Dumb question: are they in your schema? Spelled right, in the right
section, using types that are also defined? Can you populate them by hand with a
CSV file and post.jar?
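
(A hypothetical post.jar invocation for that kind of hand test; the file name
and port are placeholders:)

java -Durl=http://localhost:8983/solr/update/csv -Dtype=text/csv -jar post.jar test.csv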

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote:

 i'm using solr 4.3 which i just downloaded today and am using only jars
 that came with it. i have enabled the dataimporter and it runs without
 error. but the fields "path" (included in schema.xml) and "text" (file
 content) aren't indexed. what am i doing wrong?

 solr-path: C:\ColdFusion10\cfusion\jetty-new
 collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
 pdf-doc-path: C:\web\development\tkb\internet\public


 data-config.xml:

 <dataConfig>
   <dataSource type="BinFileDataSource" name="data"/>
   <dataSource type="BinURLDataSource" name="dataUrl"/>
   <dataSource type="URLDataSource"
       baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
   <document>
     <entity name="rec" processor="XPathEntityProcessor"
         url="docImportUrl.xml" forEach="/albums/album"
         dataSource="main"> <!-- transformer="script:GenerateId" -->
       <field column="title" xpath="//title" />
       <field column="id" xpath="//file" />
       <field column="path" xpath="//path" />
       <field column="Author" xpath="//author" />

       <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->

       <entity name="tika" processor="TikaEntityProcessor"
           url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}"
           dataSource="data">
         <field column="text" />
       </entity>
     </entity>
   </document>
 </dataConfig>


 docImportUrl.xml:

 <?xml version="1.0" encoding="utf-8"?>
 <albums>
   <album>
     <author>Peter Z.</author>
     <title>Beratungsseminar kundenbrief</title>
     <description>wie kommuniziert man</description>
     <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
     <path>download/online</path>
   </album>
   <album>
     <author>Marcel X.</author>
     <title>kuchen backen</title>
     <description>torten, kuchen, gebäck ...</description>
     <file>Kundenbrief.pdf</file>
     <path>download/online</path>
   </album>
 </albums>


Collapsing similar queries

2013-07-19 Thread Otis Gospodnetic
Hi,

Are there any known good tools or approaches for collapsing queries?
For example, imagine 4 original queries:
* big house
* big houses
* the big house
* bigger house

...and all 4 being reduced/collapsed to just big house.

What might be some good approaches for doing this?
1) stem them all and collapse if they are identical
2) compute Levenshtein distance and collapse if they are close enough

Maybe also remove stop words from them first? (not so good for queries
consisting of all or lots of stop words, like "to be or not to be")

Any better approaches?

Thanks,
Otis
--
Solr  ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm


Re: Date for 4.4 solr release

2013-07-19 Thread Gmail
Hahahaha ... Good 1

On 20/07/2013, at 1:43 AM, Jack Krupansky j...@basetechnology.com wrote:

 real_soon:[NOW+3DAYS TO NOW+10DAYS]
 
 -- Jack Krupansky
 
 -Original Message- From: Jabouille Jean Charles
 Sent: Friday, July 19, 2013 11:10 AM
 To: solr-user@lucene.apache.org
 Subject: Date for 4.4 solr release
 
 Hi,
 
 we are currently using Solr 4.2.1. There are a lot of fixes in 4.4
 that we need. Can we have an approximate date for the first stable
 release of Solr 4.4, please?
 
 Regards,
 
 jean charles
 
 Kelkoo SAS
 A French simplified joint-stock company (Société par Actions Simplifiée)
 Share capital: €4,168,964.30
 Registered office: 8, rue du Sentier, 75002 Paris
 425 093 069 RCS Paris
 
 This message and its attachments are confidential and intended exclusively
 for their addressees. If you are not the intended recipient of this message,
 please delete it and notify the sender.


Re: Date for 4.4 solr release

2013-07-19 Thread Jack Krupansky

real_soon:[NOW+3DAYS TO NOW+10DAYS]

-- Jack Krupansky

-Original Message- 
From: Jabouille Jean Charles

Sent: Friday, July 19, 2013 11:10 AM
To: solr-user@lucene.apache.org
Subject: Date for 4.4 solr release

Hi,

we are currently using Solr 4.2.1. There are a lot of fixes in 4.4
that we need. Can we have an approximate date for the first stable
release of Solr 4.4, please?

Regards,

jean charles

Kelkoo SAS
A French simplified joint-stock company (Société par Actions Simplifiée)
Share capital: €4,168,964.30
Registered office: 8, rue du Sentier, 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively for
their addressees. If you are not the intended recipient of this message, please
delete it and notify the sender.



Indexing CSV files in a Folder

2013-07-19 Thread Rajesh Jain
Hi

I have Flume dumping CSV files into folders and I would like Solr to build an
index using these CSV files.


What should I  do?

Thanks,
Rajesh


Re: Indexing CSV files in a Folder

2013-07-19 Thread Jack Krupansky

Read:
http://wiki.apache.org/solr/UpdateCSV
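
(The basic pattern from that wiki page, assuming a local Solr and a file named
data.csv:)

curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @data.csv -H 'Content-type:text/csv; charset=utf-8'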

-- Jack Krupansky

-Original Message- 
From: Rajesh Jain 
Sent: Friday, July 19, 2013 1:55 PM 
To: solr-user@lucene.apache.org 
Subject: Indexing CSV files in a Folder 


Hi

I have Flume dumping CSV files into folders and I would like Solr to build an
index using these CSV files.


What should I  do?

Thanks,
Rajesh


Re: Custom RequestHandlerBase XML Response Issue

2013-07-19 Thread Chris Hostetter

: So as you mentioned in your last mail, how can I prepare a combined
: response for this xml doc and even if I do I don't think it would work
: because the same I am doing in the RequstHandler.

Part of the disconnect you seem to be having with the advice others have 
been giving you is that Solr does a very good job of abstracting away the 
*data* being returned to users from the *format* of that data, similar to 
an MVC setup (Ryan McKinley once told me he used Solr as his MVC framework 
for all sorts of applications, even ones that didn't use the underlying 
index).

RequestHandlers are responsible for processing the logic of a request (the 
Controller) and creating/manipulating the SolrQueryResponse (Model), which 
is then formatted and written back to clients using a ResponseWriter (View) 
... clients can request a completely arbitrary ResponseWriter 
depending on what format they want to get data in, independent of the 
RequestHandler they use to generate the data.

In your case, the data that your custom RequestHandler wants to return is 
itself an XML structure -- but that doesn't mean the existing Solr XML 
ResponseWriter is prepared to write it out to you as-is -- the XML 
ResponseWriter is designed to serialize structures of the supported data 
types in a specific Solr XML format -- just as the JSON ResponseWriter is 
designed to serialize structures of the supported data types in a specific
Solr JSON format, etc...

you could serialize your XML DOM as a string and ask the response writer 
to handle that -- but it's probably not going to be what you want, because 
the response writer itself is going to take your arbitrary string data 
(that just so happens to be XML) and wrap it in its own markup (XML, 
JSON, etc...)

In general, I agree with the questions/comments made by several other 
people...

1) what *exactly* is your ultimate goal (XY Problem?)
2) why are you doing this XML combining logic in Solr, and not in your own 
application?

But if you insist on the approach you are taking, you may find that the 
RawResponseWriter is useful to you -- it is an extremely specialized 
ResponseWriter used by the Solr admin request handlers and for remotely 
streaming files in DIH, but it may also work for your 
purposes.
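
(Registering it looks like any other response writer in solrconfig.xml; the
name "raw" here is an arbitrary choice, and a client would then select it with
wt=raw:)

<queryResponseWriter name="raw" class="solr.RawResponseWriter"/>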


-Hoss


Re: Collapsing similar queries

2013-07-19 Thread Jack Krupansky
For starters, I think you need to elaborate your criteria for queries that 
can be collapsed. You can say they're "similar", but that begs the 
questions of: 1) how to measure similarity, and 2) what threshold level of 
similarity counts as "ok to collapse".


Two measures of similarity to consider:

1. How many top results do they have in common?
2. How many top terms and phrases from their top results do they have in 
common.


Maybe, ultimately, some arbitrary heuristic is good enough, say using 
editing distance on the raw query text. Or some adjusted editing distance. 
Or editing distance of the top terms of the top documents. Or simply ANY 
heuristic that seems to both discriminate on differences and combine 
on similarities.
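
(A toy illustration of the normalize-then-edit-distance idea from this thread;
the stopword list, the crude plural stripping, and the threshold of 2 are all
arbitrary choices, not a recommendation:)

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class QueryCollapser {
    private static final Set<String> STOP =
            new HashSet<String>(Arrays.asList("the", "a", "an"));

    // Lowercase, drop stopwords, strip a trailing 's' as a crude stemmer.
    static String normalize(String q) {
        StringBuilder sb = new StringBuilder();
        for (String t : q.toLowerCase().split("\\s+")) {
            if (STOP.contains(t)) continue;
            if (t.endsWith("s")) t = t.substring(0, t.length() - 1);
            sb.append(t).append(' ');
        }
        return sb.toString().trim();
    }

    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    static boolean collapsible(String q1, String q2) {
        return levenshtein(normalize(q1), normalize(q2)) <= 2;
    }

    public static void main(String[] args) {
        System.out.println(collapsible("big house", "the big houses")); // true
        System.out.println(collapsible("big house", "office dvd"));     // false
    }
}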


Here's a test case: query set

1. Office
2. The Office
3. Official
4. Office release
5. Official release
6. Office DVD

There are three distinct groups there.

If you have a specific, narrow domain in mind, a thesaurus of concepts and 
synonyms for that domain would help you a lot.


-- Jack Krupansky
-Original Message- 
From: Otis Gospodnetic

Sent: Friday, July 19, 2013 12:33 PM
To: solr-user@lucene.apache.org
Subject: Collapsing similar queries

Hi,

Are there any known good tools or approaches for collapsing queries?
For example, imagine 4 original queries:
* big house
* big houses
* the big house
* bigger house

...and all 4 being reduced/collapsed to just big house.

What might be some good approaches for doing this?
1) stem them all and collapse if they are identical
2) compute Levenshtein distance and collapse if they are close enough

Maybe also remove stop words from them first? (not so good for queries
consisting of all or lots of stop words, like "to be or not to be")

Any better approaches?

Thanks,
Otis
--
Solr  ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm 



Re: Indexing CSV files in a Folder

2013-07-19 Thread Rajesh Jain
Thanks Jack,

I am looking for something more real-time: I am streaming CSV files into a
folder using Flume and I would like Solr to build the index automatically,
rather than posting with curl.

I think there is some discussion about the MorphlineSolrSink on the Flume site,
but the documentation is very sparse.

I can think of writing the curl call as a periodic job, but ... the file name
might change every time.

Thanks,
Rajesh



On Fri, Jul 19, 2013 at 2:18 PM, Jack Krupansky j...@basetechnology.comwrote:

 Read:
 http://wiki.apache.org/solr/UpdateCSV

 -- Jack Krupansky

 -Original Message- From: Rajesh Jain Sent: Friday, July 19, 2013
 1:55 PM To: solr-user@lucene.apache.org Subject: Indexing CSV files in a
 Folder
 Hi

 I have Flume dumping CSV files into folders and I would like Solr to build an
 index using these CSV files.


 What should I  do?

 Thanks,
 Rajesh



Re: custom field type plugin

2013-07-19 Thread Chris Hostetter

: a chromosome (or gene, or other object types). All that really boils 
: down to is being able to give a number, e.g. 10234, and return documents 
: that have regions containing the number. So you'd have a document with a 
: list like [1:16090,400:8000,40123:43564], and it should come 

You should take a look at some of the built-in features using the spatial 
types...

http://wiki.apache.org/solr/SpatialForTimeDurations

I believe David also covered this use case in his talk in San Diego...

http://www.lucenerevolution.org/2013/Lucene-Solr4-Spatial-Deep-Dive

: But I get this error about it not being able to find the 
AbstractSubTypeFieldType class.
: Here is the first bit of the trace:
...
: Any hints as to what I did wrong? I can provide source code, or a fuller 
stack trace, config settings, etc.
: 
: Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib, 
: then repack. However, when I did that, I get a NoClassDefFoundError for 
: my plugin itself.

a fuller stack trace might help -- but the key question is: what order did 
you try these two approaches in? and what exactly did your <fieldType/> 
declaration look like?

my guess is that you tried repacking the war first, and maybe your 
exploded war classpath is still polluted with your old jar from when you 
repacked it, and now you have multiple copies in the plugin classloader's 
classpath.  (the initial NoClassDefFoundError could have been from a 
mistake in your <fieldType/> declaration)

try starting completely clean, using the stock war and sample configs, and 
make sure you get no errors.  then try declaring your custom fieldType, 
using the fully qualified classname w/o even telling Solr about your jar, 
and ensure that you get a NoClassDefFoundError for your custom class -- if 
you get an error about AbstractSubTypeFieldType again then you still have 
a copy of your custom class somewhere in the classpath.  *THEN* try adding 
a <lib/> directive to load your jar.
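
(A <lib/> directive in solrconfig.xml would look roughly like this, reusing the
path from the log line quoted earlier in the thread:)

<lib dir="/blah/blah/lib" regex="CustomPlugins\.jar" />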

if that still doesn't work, provide us with the details of your servlet 
container, Solr version, the full stack trace, the details of how you are 
configuring your <fieldType/>, how you declared the <lib/>, and what your 
filesystem looks like for your solr home, war, etc...




-Hoss


AUTO: Siobhan Roche is out of the office (returning 22/07/2013)

2013-07-19 Thread Siobhan Roche

I am out of the office until 22/07/2013.

I will respond to your query on my return,
Thanks
Siobhan


Note: This is an automated response to your message "custom field type
plugin" sent on 19/07/2013 13:06:27.

This is the only notification you will receive while this person is away.



Request to be added to the ContributorsGroup

2013-07-19 Thread ricky gill
Hello,

 

Would someone please be kind enough and add me to the ContributorsGroup?
My Wiki Username is: RickyGill

 

Thanks again.

 

Regards

 

Ricky Gill | Managing Director | Jobuzu.co.uk
Mob: 07455071710 (Any Time) | Tel: 0845 805 2162 (11:00am - 5:30pm)
Skype: JobuzuLTD | Email: ricky.g...@jobuzu.co.uk
Web: http://jobuzu.co.uk

We are a NO-SPAM company and respect your privacy if you would like not to
receive further emails from us please reply back with the following subject:
Remove Me


Jobuzu Ltd or any of its subsidiary companies may not be held responsible
for the content of this email as it may reflect the personal view of the
sender and not that of the company. Should you receive this email in error,
please notify the sender immediately and do not disclose copy or distribute
it. While Jobuzu Ltd runs anti-virus software on all servers and all
workstations, it cannot be held responsible for any infected files that you
may receive Jobuzu Ltd advises all recipients to virus scan any files.

 



dataimporter, custom fields and parsing error

2013-07-19 Thread Andreas Owen
i'm using solr 4.3 which i just downloaded today and am using only jars that 
came with it. i have enabled the dataimporter and it runs without error. but 
the fields "path" (included in schema.xml) and "text" (file content) aren't 
indexed. what am i doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public


data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource"
      baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor"
        url="docImportUrl.xml" forEach="/albums/album"
        dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title" />
      <field column="id" xpath="//file" />
      <field column="path" xpath="//path" />
      <field column="Author" xpath="//author" />

      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->

      <entity name="tika" processor="TikaEntityProcessor"
          url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}"
          dataSource="data">
        <field column="text" />
      </entity>
    </entity>
  </document>
</dataConfig>


docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>

Re: Solr 4.3 open a lot more files than solr 3.6

2013-07-19 Thread SolrLover

Did you try setting useCompoundFile to true in solrconfig.xml?

Also, try using a lower mergeFactor which will result in fewer segments and
hence fewer open files.

Also, I assume you can set the limit using a ulimit command..

ex: 
ulimit -n20
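
(In solrconfig.xml terms, the first two suggestions would look roughly like
this; the mergeFactor value is illustrative:)

<indexConfig>
  <useCompoundFile>true</useCompoundFile>
  <mergeFactor>5</mergeFactor>
</indexConfig>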



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-3-open-a-lot-more-files-than-solr-3-6-tp4079013p4079221.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: The way edismax parses colon seems weird

2013-07-19 Thread Jack Krupansky
What field type, analyzer, and tokenizer are you using, and what does a sample 
of the input data look like?


Generally, a single backslash is all that is needed for escaping.

And escaping is not needed within a quoted phrase, except for quotes and 
literal backslashes.

-- Jack Krupansky

-Original Message- 
From: jefferyyuan

Sent: Friday, July 19, 2013 6:01 PM
To: solr-user@lucene.apache.org
Subject: The way edismax parses colon seems weird

In our application, users may search for error codes like 12:34.

We define default search fields, like: <str name="qf">title^10 body_stored^8
content^5</str>
So when users search for 12:34, we want to search for the error code in the
specified fields.

If we search q=12:34 directly, this can't find anything. That's
expected, as it tries to search for 34 in a field named 12.

Then we try to escape the colon and search 12\:34; the parsed query becomes
+12\:34, which still can't find the expected page.
<str name="parsedquery">(+12\:34)/no_coord</str>
<str name="parsedquery_toString">+12\:34</str>
<str name="QParser">ExtendedDismaxQParser</str>

If I type two backslashes, it can find the page:
q=12\\:34
<str name="parsedquery">
(+DisjunctionMaxQuery((content:"12 34"^0.5 | body_stored:"(12\:34 12)
34"^0.8 | title:"12 34"^1.1)))/no_coord
</str>
<str name="parsedquery_toString">
+(content:"12 34"^0.5 | body_stored:"(12\:34 12) 34"^0.8 | title:"12
34"^1.1)
</str>
<str name="QParser">ExtendedDismaxQParser</str>

Is this a bug in Solr edismax or not?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-way-edismax-parses-colon-seems-weird-tp4079226.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Request to be added to the ContributorsGroup

2013-07-19 Thread Stefan Matheis
Sure :) Done!

- Stefan  


On Friday, July 19, 2013 at 9:28 PM, ricky gill wrote:

  
 Hello,

 Would someone please be kind enough and add me to the “ContributorsGroup”? My
 Wiki Username is: RickyGill

 Thanks again.

 Regards

 Ricky Gill | Managing Director | Jobuzu.co.uk (http://Jobuzu.co.uk)
 Mob: 07455071710 (Any Time) | Tel: 0845 805 2162 (11:00am - 5:30pm)
 Skype: JobuzuLTD | Email: ricky.g...@jobuzu.co.uk
 Web: http://jobuzu.co.uk

 We are a NO-SPAM company and respect your privacy if you would like not to
 receive further emails from us please reply back with the following subject:
 Remove Me

 Jobuzu Ltd or any of its subsidiary companies may not be held responsible for
 the content of this email as it may reflect the personal view of the sender
 and not that of the company. Should you receive this email in error, please
 notify the sender immediately and do not disclose copy or distribute it.
 While Jobuzu Ltd runs anti-virus software on all servers and all
 workstations, it cannot be held responsible for any infected files that you
 may receive Jobuzu Ltd advises all recipients to virus scan any files.




The way edismax parses colon seems weird

2013-07-19 Thread jefferyyuan
In our application, user may search error code like 12:34.

We define default search field, like: str name=qftitle^10 body_stored^8
content^5/str
So when user search: 12:34, we want to search the error code in the
specified fields.

In the code, if we search q=12:34 directly, this can't find anything. It's
expected as it'ss to search 34 on 12 field.

Then we try to escape the colon, search: 12\:34, the parsedquery would be
+12\:34, still can't find the expected page.
str name=parsedquery(+12\:34)/no_coord/str
str name=parsedquery_toString+12\:34/str
str name=QParserExtendedDismaxQParser/str

If I type 2 \\, seems it can find the error page:
q=12\\:34
str name=parsedquery
(+DisjunctionMaxQuery((content:12 34^0.5 | body_stored:(12\:34 12)
34^0.8 | title:12 34^1.1)))/no_coord
/str
str name=parsedquery_toString
+(content:12 34^0.5 | body_stored:(12\:34 12) 34^0.8 | title:12
34^1.1)
/str
str name=QParserExtendedDismaxQParser/str

Is this a bug in Solr edismax or not?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-way-edismax-parses-colon-seems-weird-tp4079226.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: The way edismax parses colon seems weird

2013-07-19 Thread Shawn Heisey

On 7/19/2013 4:01 PM, jefferyyuan wrote:

If I type two backslashes, it can find the page:
q=12\\:34
<str name="parsedquery">
(+DisjunctionMaxQuery((content:"12 34"^0.5 | body_stored:"(12\:34 12)
34"^0.8 | title:"12 34"^1.1)))/no_coord
</str>
<str name="parsedquery_toString">
+(content:"12 34"^0.5 | body_stored:"(12\:34 12) 34"^0.8 | title:"12
34"^1.1)
</str>
<str name="QParser">ExtendedDismaxQParser</str>

Is this a bug in Solr edismax or not?


It sounds like it's a requirement of whatever you are using to 
construct your queries.  When building strings in Java, for instance, a 
double backslash is required for a literal backslash, because a single 
backslash introduces special characters, like \n for newline.  It's 
similar for Perl, and probably PHP as well as other programming languages.
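
(Shawn's point as a one-liner: in Java source the literal "12\\:34" denotes
the string 12\:34, so only a single backslash reaches Solr:)

public class BackslashDemo {
    public static void main(String[] args) {
        String q = "12\\:34";
        System.out.println(q); // prints 12\:34
    }
}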


Thanks,
Shawn



Re: The way edismax parses colon seems weird

2013-07-19 Thread Alexandre Rafalovitch
Could this be related:
https://issues.apache.org/jira/browse/SOLR-4333 (fixed in 4.4, so you
could even run your test against RC1)?

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jul 19, 2013 at 6:01 PM, jefferyyuan yuanyun...@gmail.com wrote:

 In our application, users may search for error codes like 12:34.

 We define default search fields, like: <str name="qf">title^10 body_stored^8
 content^5</str>
 So when users search for 12:34, we want to search for the error code in the
 specified fields.

 If we search q=12:34 directly, this can't find anything. That's
 expected, as it tries to search for 34 in a field named 12.

 Then we try to escape the colon and search 12\:34; the parsed query becomes
 +12\:34, which still can't find the expected page.
 <str name="parsedquery">(+12\:34)/no_coord</str>
 <str name="parsedquery_toString">+12\:34</str>
 <str name="QParser">ExtendedDismaxQParser</str>

 If I type two backslashes, it can find the page:
 q=12\\:34
 <str name="parsedquery">
 (+DisjunctionMaxQuery((content:"12 34"^0.5 | body_stored:"(12\:34 12)
 34"^0.8 | title:"12 34"^1.1)))/no_coord
 </str>
 <str name="parsedquery_toString">
 +(content:"12 34"^0.5 | body_stored:"(12\:34 12) 34"^0.8 | title:"12
 34"^1.1)
 </str>
 <str name="QParser">ExtendedDismaxQParser</str>

 Is this a bug in Solr edismax or not?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/The-way-edismax-parses-colon-seems-weird-tp4079226.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: The way edismax parses colon seems weird

2013-07-19 Thread Chris Hostetter

You haven't told us anything about how you have the analysis configured 
for the fields you are using -- and those details probably contain the 
specifics of your problem.

once you've escaped the colon so that edismax no longer recognizes it as 
"search this specific user field" syntax, any other questions about what 
the final query for that clause winds up being, and what it does or 
doesn't match, are entirely dependent on your analysis.

When I use the Solr 4.3.1 example configs, and index a document like 
this...

$ java -Ddata=args -jar post.jar '<add><doc><field name="id">HOSS</field><field 
name="title">12:34</field><field name="cat">12:34</field></doc></add>'

Then the following query will find it... 
http://localhost:8983/solr/select?debugQuery=true&defType=edismax&qf=title&q=12\:34
 
...and the parsed query, because of the fieldType analyzer for the 
title field, looks like...

(+DisjunctionMaxQuery(((title:12 title:34))))/no_coord

This query will also find it...
http://localhost:8983/solr/select?debugQuery=true&defType=edismax&qf=cat&q=56\:78
(+DisjunctionMaxQuery((cat:56:78)))/no_coord

As will this one...
http://localhost:8983/solr/select?debugQuery=true&defType=edismax&qf=sku&q=99\:00
(+DisjunctionMaxQuery((sku:9900)))/no_coord


As does combining them all together...
http://localhost:8983/solr/select?debugQuery=true&q.op=AND&defType=edismax&qf=sku+title+cat&q=12\:34+56\:78+99\:00

Note that if you use the uf option to tighten down which field names 
edismax allows with the ":" syntax, you don't even have to escape it...

http://localhost:8983/solr/select?debugQuery=true&q.op=AND&defType=edismax&uf=-*&qf=sku+title+cat&q=12:34+56:78+99:00



-Hoss


Re: Indexing CSV files in a Folder

2013-07-19 Thread SolrLover
Did you look in to this link?

http://www.marshut.com/ruzyy/download-and-configure-morphlinesolrsink.html





Re: The way edismax parses colon seems weird

2013-07-19 Thread jefferyyuan
Thanks very much for the reply. 
We are querying Solr directly from the browser:
http://localhost:8080/solr/select?q=12\:34&defType=edismax&debug=query&qf=content

<str name="rawquerystring">12\:34</str>
<str name="querystring">12\:34</str>
<str name="parsedquery">(+12\:34)/no_coord</str>
<str name="parsedquery_toString">+12\:34</str>
<str name="QParser">ExtendedDismaxQParser</str>

And it seems this is not related to which (default) field I use to query.





Re: Date for 4.4 solr release

2013-07-19 Thread Alexandre Rafalovitch
Shouldn't that be
real_soon:[NOW/DAY+3DAYS TO NOW/DAY+10DAYS]

You know, just to avoid the performance problems of the people asking every
five minutes. :-)

Regards,
   Alex.
P.s. Or is this a premature optimization?
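
Joking aside, the NOW/DAY rounding is a real optimization: an unrounded NOW 
changes every millisecond, so the filter text is never identical between 
requests and the filterCache never gets a hit. A minimal SolrJ fragment, 
assuming a date field actually named real_soon:

SolrQuery q = new SolrQuery("*:*");
// Unrounded, real_soon:[NOW+3DAYS TO NOW+10DAYS] differs on every request,
// so the cached filter is never reused. Rounded to day boundaries, the fq
// stays textually identical all day and the filterCache can serve it:
q.addFilterQuery("real_soon:[NOW/DAY+3DAYS TO NOW/DAY+10DAYS]");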

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jul 19, 2013 at 11:43 AM, Jack Krupansky j...@basetechnology.com wrote:

 real_soon:[NOW+3DAYS TO NOW+10DAYS]

 -- Jack Krupansky

 -Original Message- From: Jabouille Jean Charles
 Sent: Friday, July 19, 2013 11:10 AM
 To: solr-user@lucene.apache.org
 Subject: Date for 4.4 solr release


 Hi,

 we are currently using Solr 4.2.1. There are a lot of fixes in 4.4
 that we need. Can we have an approximate date for the first stable
 release of Solr 4.4, please?

 Regards,

 jean charles

 Kelkoo SAS
 Société par Actions Simplifiée
 Au capital de € 4.168.964,30
 Siège social : 8, rue du Sentier 75002 Paris
 425 093 069 RCS Paris

 This message and its attachments are confidential and intended solely
 for their addressees. If you are not the intended recipient of this
 message, please delete it and notify the sender.



Re: The way edismax parses colon seems weird

2013-07-19 Thread Jack Krupansky

Very good chance that is it.

-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Friday, July 19, 2013 7:16 PM
To: solr-user@lucene.apache.org
Subject: Re: The way edismax parses colon seems weird

Could this be related:
https://issues.apache.org/jira/browse/SOLR-4333 (Fixed in 4.4, so you
could even run your test against RC1)

Regards,
  Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jul 19, 2013 at 6:01 PM, jefferyyuan yuanyun...@gmail.com wrote:


In our application, users may search for an error code like 12:34.

We define default search fields, like: <str name="qf">title^10 body_stored^8 content^5</str>
So when a user searches 12:34, we want to search for the error code in the
specified fields.

In the code, if we search q=12:34 directly, this can't find anything. That's
expected, as it tries to search for 34 on the 12 field.

Then we try to escape the colon and search: 12\:34; the parsedquery would be
+12\:34, and it still can't find the expected page.
<str name="parsedquery">(+12\:34)/no_coord</str>
<str name="parsedquery_toString">+12\:34</str>
<str name="QParser">ExtendedDismaxQParser</str>

If I escape with two backslashes, it seems it can find the expected page:
q=12\\:34
<str name="parsedquery">
(+DisjunctionMaxQuery((content:12 34^0.5 | body_stored:(12\:34 12)
34^0.8 | title:12 34^1.1)))/no_coord
</str>
<str name="parsedquery_toString">
+(content:12 34^0.5 | body_stored:(12\:34 12) 34^0.8 | title:12
34^1.1)
</str>
<str name="QParser">ExtendedDismaxQParser</str>

Is this a bug in Solr edismax or not?








Re: The way edismax parses colon seems weird

2013-07-19 Thread Jack Krupansky
I noticed that the single backslash in the URL query came through as a literal 
backslash in the parsed query, which implies that the backslash itself was 
escaped (improperly) by Solr:


http://localhost:8080/solr/select?q=12\:34&defType=edismax&debug=query&qf=content

<str name="parsedquery_toString">+12\:34</str>

As a workaround, enclose the term in quotes, without the escaping:

http://localhost:8080/solr/select?q="12:34"&defType=edismax&debug=query&qf=content
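
From SolrJ, the same quoting workaround is a one-liner. A sketch, assuming the 
original poster's qf=content setup:

SolrQuery q = new SolrQuery("\"12:34\""); // phrase quotes, no backslash escaping
q.set("defType", "edismax");
q.set("qf", "content");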

-- Jack Krupansky

-Original Message- 
From: jefferyyuan

Sent: Friday, July 19, 2013 7:09 PM
To: solr-user@lucene.apache.org
Subject: Re: The way edismax parses colon seems weird

Thanks very much for the reply.
We are querying Solr directly from the browser:
http://localhost:8080/solr/select?q=12\:34&defType=edismax&debug=query&qf=content

<str name="rawquerystring">12\:34</str>
<str name="querystring">12\:34</str>
<str name="parsedquery">(+12\:34)/no_coord</str>
<str name="parsedquery_toString">+12\:34</str>
<str name="QParser">ExtendedDismaxQParser</str>

And it seems this is not related to which (default) field I use to query.






RE: custom field type plugin

2013-07-19 Thread Kevin Stone
I can try again this weekend to get a clean environment. However, the order I 
did things in was the reverse of what you suggest. I got the 
AbstractSubTypeFieldType error first. Then I removed my jar from the sharedLib 
folder, and tried the war repacking solution. That is when I got 
NoClassDefFoundError on my custom class.

The spatial feature looks intriguing, although I have no idea whether it could fit 
my use case. It looks like a fairly complex concept, but maybe it is all the 
different shapes and geometry that are confusing me. If I thought of my problem 
in terms of geometry, I would say a chromosome region is like a segment of a 
line. I would need to define multiple line segments and be able to query by a 
single point and only return documents that have a line segment that the single 
point falls on. Does that make sense? Is that at all doable with a spatial 
query?

-Kevin

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Friday, July 19, 2013 3:15 PM
To: solr-user@lucene.apache.org
Subject: Re: custom field type plugin

: a chromosome (or gene, or other object types). All that really boils
: down to is being able to give a number, e.g. 10234, and return documents
: that have regions containing the number. So you'd have a document with a
: list like [1:16090,400:8000,40123:43564], and it should come

You should take a look at some of the build in features using the spatial
types...

http://wiki.apache.org/solr/SpatialForTimeDurations

I believe David also covered this usecase in his talk in san diego...

http://www.lucenerevolution.org/2013/Lucene-Solr4-Spatial-Deep-Dive

: But I get this error about it not being able to find the 
AbstractSubTypeFieldType class.
: Here is the first bit of the trace:
...
: Any hints as to what I did wrong? I can provide source code, or a fuller 
stack trace, config settings, etc.
:
: Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib,
: then repack. However, when I did that, I get a NoClassDefFoundError for
: my plugin itself.

a fuller stack trace might help -- but the key question is what order did
you try these two approaches in? and what exactly did your <fieldType/>
declaration look like?

my guess is that you tried repacking the war first, and maybe your
exploded war classpath is still polluted with your old jar from when you
repacked it, and now you have multiple copies in the plugin classloader's
classpath.  (the initial NoClassDefFoundError could have been from a
mistake in your <fieldType/> declaration)

try starting completely clean, using the stock war and sample configs, and
make sure you get no errors.  then try declaring your custom fieldType,
using the fully qualified classname w/o even telling solr about your jar,
and ensure that you get a NoClassDefFoundError for your custom class -- if
you get an error about AbstractSubTypeFieldType again then you still have
a copy of your custom class somewhere in the classpath.  *THEN* try adding
a <lib/> directive to load your jar.

if that still doesn't work, provide us with the details of your servlet
container, solr version, the full stack trace, the details of how you are
configuring your <fieldType/>, how you declared the <lib/>, and what your
filesystem looks like for your solrhome, war, etc...




-Hoss

The information in this email, including attachments, may be confidential and 
is intended solely for the addressee(s). If you believe you received this email 
by mistake, please notify the sender by return email as soon as possible.


RE: custom field type plugin

2013-07-19 Thread Chris Hostetter

: I can try again this weekend to get a clean environment. However, the 
: order I did things in was the reverse of what you suggest. I got the 

Hmmm... then I'm kind of at a loss to explain what you're describing.  
We'd need to see more details of the configs, dir structure, jar 
structure, etc...

: The spatial feature looks intriguing, although I have no idea if it 
: could fit my use case. It looks fairly complex a concept, but maybe it 
: is all the different shapes and geometry that is confusing me. If I 
: thought of my problem in terms of geometry, I would say a chromosome 
: region is like a segment of a line. I would need to define multiple line 
: segments and be able to query by a single point and only return 
: documents that have a line segment that the single point falls on. Does 
: that make sense? Is that at all doable with a spatial query?

The tricky thing about leveraging the spatial stuff for this type of 
problem is that it's frequently better to *not* let yourself think in 
terms of a straightforward mapping between your problem space and 
geometry.

Instead of modeling your data as documents containing multiple line 
segments and trying to search for a document containing a line segment 
that contains your 1D point, imagine modeling your data as documents 
containing multiple 2D points, one point per range, where the X 
coordinate is the lower bound of your range, and the Y coordinate is the upper 
bound of the range...

https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/#slide8

...and to find all documents containing a range that contains a 
specified input value V, you then query for all documents containing 
points inside of a specially crafted bounding box based on V...

https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/#slide11

...the big caveat to this approach that I failed to mention before is that 
it presumes there is an absolute min/max definable for the overall range 
of values you are dealing with, so that you can define the bounding boxes 
appropriately -- otherwise the geometry won't work.
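
A minimal SolrJ sketch of the mapping, under assumptions not in the original 
posts: a multiValued field named region declared as a 
solr.SpatialRecursivePrefixTreeFieldType with geo="false" and worldBounds 
spanning 0 to 2147483647 on both axes:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class RegionContainsPoint {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // Index each range lo:hi as the 2D point "lo hi"
        // (x = lower bound, y = upper bound).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "gene-1");
        doc.addField("region", "1 16090");
        doc.addField("region", "400 8000");
        doc.addField("region", "40123 43564");
        server.add(doc);
        server.commit();

        // A range contains V exactly when lo <= V <= hi, i.e. the point
        // (lo, hi) lies in the rectangle x:[0..V], y:[V..max].
        // ENVELOPE arguments are (minX, maxX, maxY, minY).
        long v = 10234;
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("region:\"Intersects(ENVELOPE(0, " + v + ", 2147483647, " + v + "))\"");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults().getNumFound() + " docs have a region containing " + v);
    }
}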

In any case... it's an interesting idea I wanted to throw out there for 
you to consider, in case it works for you, before you jump through a ton 
of hoops trying to get a new custom FieldType to work.


-Hoss


Re: Solr index lot of pdf, doc, txt

2013-07-19 Thread sodoo
I'm using Solr 4.2, but I don't understand this recursive posting approach well. 
Maybe I'll write a bash script, but a bash script is not a good solution.

Is there another way or a better solution?

Please advise me.
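
If it helps, one alternative is a small SolrJ program that walks the folder 
tree and posts each file to the extracting handler. A sketch, assuming Solr at 
localhost:8983 with /update/extract enabled; the /data/docs path is 
illustrative:

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class RecursiveIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        index(server, new File("/data/docs"));
        server.commit();
    }

    static void index(HttpSolrServer server, File f) throws Exception {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children == null) return;
            for (File child : children) {
                index(server, child); // recurse into subfolders
            }
        } else if (f.getName().matches("(?i).*\\.(pdf|doc|txt)")) {
            // The extracting handler runs Tika to pull text and metadata
            // out of each file; Tika sniffs the real content type itself.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(f, "application/octet-stream");
            req.setParam("literal.id", f.getAbsolutePath());
            server.request(req);
        }
    }
}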





Re: Date for 4.4 solr release

2013-07-19 Thread adityab
+1
:D


