Re: use of filter queries in Lucene/Solr Alpha40 and Beta4.0

2012-09-06 Thread guenter.hip...@unibas.ch
Hoss, I'm so happy you realized the problem because I was quite worried 
about it!!


Let me know if I can provide support with testing it.
The last two days I was busy migrating a bunch of hosts, which 
should - hopefully - be finished today.

Then I will have the infrastructure for running tests again.

Günter

On 09/05/2012 11:19 PM, Chris Hostetter wrote:

: Subject: Re: use of filter queries in Lucene/Solr Alpha40 and Beta4.0

Günter, This is definitely strange

The good news is, i can reproduce your problem.
The bad news is, i can reproduce your problem - and i have no idea what's
causing it.

I've opened SOLR-3793 to try to get to the bottom of this, and included
some basic steps to demonstrate the bug using the Solr 4.0-BETA example
data, but i'm really not sure what the problem might be...

https://issues.apache.org/jira/browse/SOLR-3793


-Hoss



--
Universität Basel
Universitätsbibliothek
Günter Hipler
Projekt SwissBib
Schoenbeinstrasse 18-20
4056 Basel, Schweiz
Tel.: +41 (0)61 267 31 12  Fax: +41 61 267 31 03
e-mail: guenter.hip...@unibas.ch
URL: www.swissbib.org / http://www.ub.unibas.ch/



RE: Delete all documents in the index

2012-09-06 Thread Alexey Kozhemiakin
One more thanks for posting this!
I struggled with the same issue yesterday and solved it with the _version_ hint 
from the mailing list.

Alex.

-Original Message-
From: Mark Mandel [mailto:mark.man...@gmail.com] 
Sent: Thursday, September 06, 2012 1:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Delete all documents in the index

Thanks for posting this!

I ran into exactly this issue yesterday, and ended up deleting the files to get 
around it.

Mark

Sent from my mobile doohickey.
On Sep 6, 2012 4:13 AM, Rohit Harchandani rhar...@gmail.com wrote:

 Thanks everyone. Adding the _version_ field in the schema worked.
 Deleting the data directory works for me, but was not sure why 
 deleting using curl was not working.
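
For what it's worth, the same delete can also be issued from SolrJ instead of
curl. A minimal sketch, assuming the same core URL as in the curl examples
quoted below and that the _version_ field is present in the schema (SolrJ 4.x):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteAll {
  public static void main(String[] args) throws Exception {
    // Core URL is an assumption taken from the curl examples below.
    HttpSolrServer server = new HttpSolrServer("http://localhost:9020/solr/core1");
    server.deleteByQuery("*:*"); // remove every document
    server.commit();             // make the deletion visible to searchers
    server.shutdown();
  }
}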

 On Wed, Sep 5, 2012 at 1:49 PM, Michael Della Bitta  
 michael.della.bi...@appinions.com wrote:

  Rohit:
 
  If it's easy, the easiest thing to do is to turn off your servlet 
  container, rm -r * inside of the data directory, and then restart 
  the container.
 
  Michael Della Bitta
 
  
  Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017 
  www.appinions.com Where Influence Isn't a Game
 
 
  On Wed, Sep 5, 2012 at 12:56 PM, Jack Krupansky 
  j...@basetechnology.com
 
  wrote:
   Check to make sure that you are not stumbling into SOLR-3432:
  deleteByQuery
   silently ignored if updateLog is enabled, but {{_version_}} field 
   does
  not
   exist in schema.
  
   See:
   https://issues.apache.org/jira/browse/SOLR-3432
  
   This could happen if you kept the new 4.0 solrconfig.xml, but 
   copied in
  your
   pre-4.0 schema.xml.
  
   -- Jack Krupansky
  
   -Original Message- From: Rohit Harchandani
   Sent: Wednesday, September 05, 2012 12:48 PM
   To: solr-user@lucene.apache.org
   Subject: Delete all documents in the index
  
  
   Hi,
   I am having difficulty deleting documents from the index using curl.
 The
   urls i tried were:
    curl "http://localhost:9020/solr/core1/update/?stream.body=<delete><query>*:*</query></delete>&commit=true"
    curl "http://localhost:9020/solr/core1/update/?commit=true" -H
    "Content-Type: text/xml" --data-binary '<delete><query>id:[* TO *]</query></delete>'
    curl "http://localhost:9020/solr/core1/update/?commit=true" -H
    "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
    I also tried:
    curl
    "http://localhost:9020/solr/core1/update/?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true"
   
   as suggested on some forums. I get a response with status=0 in all
 cases,
   but none of the above seem to work.
   When I run
    curl "http://localhost:9020/solr/core1/select?q=*:*&rows=0&wt=xml"
   I still get a value for numFound.
  
   I am currently using solr 4.0 beta version.
  
   Thanks for your help in advance.
   Regards,
   Rohit
 



Re: AW: AW: auto completion search with solr using NGrams in SOLR

2012-09-06 Thread aniljayanti
Hi,

Thanks,

I am getting the results with the below url:

*suggest/?q=michael b&df=title&defType=lucene&fl=title*

But I want the results in the spellcheck section.

I want to search with title or empname or both.

Aniljayanti



--
View this message in context: 
http://lucene.472066.n3.nabble.com/auto-completion-search-with-solr-using-NGrams-in-SOLR-tp3998559p4005812.html
Sent from the Solr - User mailing list archive at Nabble.com.


terms component search

2012-09-06 Thread Peter Kirk
Hi

I am trying to implement some auto suggest functionality, and am currently 
looking at the terms component (Solr 3.6).

For example, I can form a query like this:

http://solrhost/solr/mycore/terms?terms.fl=title_s&terms.sort=index&terms.limit=5&terms.prefix=Hotel+C

which searches in the title_s field for strings starting Hotel C. Results 
could be
Hotel Chicago, 2
Hotel California, 8
Hotel Cool, 4

Is it possible to get more info in the results from this component - like 
return data from other fields?

For example, along with the results from the title_s field, the corresponding 
data from the telephone field.

Or, maybe I simply should execute a normal wildcard search.

Thanks,
Peter
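
A minimal SolrJ sketch of that wildcard-search alternative, assuming the string
field title_s from above and a stored field called telephone (the telephone
field name is an assumption), on Solr 3.6:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class PrefixSuggest {
  public static void main(String[] args) throws Exception {
    // URL and core name are taken from the example query above.
    HttpSolrServer server = new HttpSolrServer("http://solrhost/solr/mycore");
    // Prefix query on the title field; the space in the prefix must be escaped.
    SolrQuery q = new SolrQuery("title_s:Hotel\\ C*");
    q.addField("title_s");
    q.addField("telephone"); // return the extra field alongside the title
    q.setRows(5);
    for (SolrDocument doc : server.query(q).getResults()) {
      System.out.println(doc.getFieldValue("title_s") + " / "
          + doc.getFieldValue("telephone"));
    }
  }
}

Unlike the terms component this returns whole stored documents, so the per-term
counts shown above would have to come from faceting instead.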



Re: Document Processing

2012-09-06 Thread Tanguy Moal
If your interest is in the real textual content of a web page, you
could try this: JReadability (https://github.com/ifesdjeen/jReadability ,
Apache 2.0 license), which wraps JSoup (as Lance suggested) and applies a
set of predefined rules to scrap the crap (nav, headers, footers, ...) off of
the content.

If you'd rather have the possibility to map portions of a webpage to
dedicated solr fields, using JSoup on its own could be a win. Read this :
https://norrisshelton.wordpress.com/2011/01/27/jsoup-java-html-parser/
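
A small sketch of that mapping idea with JSoup; the CSS selectors and Solr field
names here are made-up placeholders, not something prescribed by either library:

import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlToSolrDoc {
  public static SolrInputDocument convert(String html, String url) {
    Document page = Jsoup.parse(html, url);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", url);
    // Hypothetical selectors: map dedicated portions of the page to dedicated fields.
    doc.addField("title", page.select("h1.article-title").text());
    doc.addField("body", page.select("div#content p").text());
    return doc;
  }
}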

Hope this helps,

--
Tanguy

2012/9/6 Lance Norskog goks...@gmail.com

 There is another way to do this: crawl the mobile site!

 The Fennec browser from Mozilla talks Android. I often use it to get
 pagecrap off my screen.

 - Original Message -
 | From: Lance Norskog goks...@gmail.com
 | To: solr-user@lucene.apache.org
 | Sent: Wednesday, August 29, 2012 7:37:37 PM
 | Subject: Re: Document Processing
 |
 | I've seen the JSoup HTML parser library used for this. It worked
 | really well. The Boilerpipe library may be what you want. Its
 | schwerpunkt (*) is to separate boilerplate from wanted text in an
 | HTML
 | page. I don't know what fine-grained control it has.
 |
 | * raison d'être. There is no English word for this concept.
 |
 | On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili
 | tommaso.teof...@gmail.com wrote:
 |  Hello Michael,
 | 
 |  I can help you with using the UIMA UpdateRequestProcessor [1]; the
 |  current
 |  implementation uses in-memory execution of UIMA pipelines but since
 |  I was
 |  planning to add the support for higher scalability (with UIMA-AS
 |  [2]) that
 |  may help you as well.
 | 
 |  Tommaso
 | 
 |  [1] :
 | 
 http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
 |  [2] : http://uima.apache.org/doc-uimaas-what.html
 | 
 |  2011/12/5 Michael Kelleher mj.kelle...@gmail.com
 | 
 |  Hello Erik,
 | 
 |  I will take a look at both:
 | 
 |  org.apache.solr.update.**processor.**LangDetectLanguageIdentifierUp**
 |  dateProcessor
 | 
 |  and
 | 
 |  org.apache.solr.update.**processor.**TikaLanguageIdentifierUpdatePr**
 |  ocessor
 | 
 | 
 |  and figure out what I need to extend to handle processing in the
 |  way I am
 |  looking for.  I am assuming that component configuration is
 |  handled in a
 |  standard way such that I can configure my new UpdateProcessor in
 |  the same
 |  way I would configure any other UpdateProcessor component?
 | 
 |  Thanks for the suggestion.
 | 
 | 
 |  1 more question:  given that I am probably going to convert the
 |  HTML to
 |  XML so I can use XPath expressions to extract my content, do you
 |  think
 |  that this kind of processing will overload Solr?  This Solr
 |  instance will
 |  be used solely for indexing, and will only ever have a single
 |  ManifoldCF
 |  crawling job feeding it documents at one time.
 | 
 |  --mike
 | 
 |
 |
 |
 | --
 | Lance Norskog
 | goks...@gmail.com
 |



Re: terms component search

2012-09-06 Thread Tanguy Moal
Hi Peter,

Yes if you want to do complex things in suggest mode, you'd better rely on
the SearchComponent...

For example, this blog post is a good read
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/ ,
if you have complex requirements on the searched fields.

(Although your requirements seem to be more related to the results
extraction than query building)

Kind regards,

--
Tanguy

2012/9/6 Peter Kirk p...@alpha-solutions.dk

 Hi

 I am trying to implement some auto suggest functionality, and am
 currently looking at the terms component (Solr 3.6).

 For example, I can form a query like this:


 http://solrhost/solr/mycore/terms?terms.fl=title_s&terms.sort=index&terms.limit=5&terms.prefix=Hotel+C

 which searches in the title_s field for strings starting Hotel C.
 Results could be
 Hotel Chicago, 2
 Hotel California, 8
 Hotel Cool, 4

 Is it possible to get more info in the results from this component - like
 return data from other fields?

 For example, along with the results from the title_s field, the
 corresponding data from the telephone field.

 Or, maybe I simply should execute a normal wildcard search.

 Thanks,
 Peter




Re: solr indexing slows down after few minutes

2012-09-06 Thread amit
Commit is not too often, it's a batch of 100 records, takes 40 to 60 secs
before another commit.
No I am not indexing with multi threads. It uses a single thread executor.

I have seen steady performance for now after increasing the merge factor
from 10 to 25.
Will have to wait and watch if that reduces the search speed, but so far so
good.

Thanks
Amit

On Thu, Aug 30, 2012 at 10:53 PM, pravesh [via Lucene] 
ml-node+s472066n4004421...@n3.nabble.com wrote:

 Did you checked wiki:
 http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

 Do you commit often? Do you index with multiple threads? Also try
 experimenting with various available MergePolicies introduced from SOLR 3.4
 onwards

 Thanx
 Pravesh






--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-indexing-slows-down-after-few-minutes-tp4004337p4005864.html
Sent from the Solr - User mailing list archive at Nabble.com.

solr 3.6.1 tomcat 7.0 missing core name in path

2012-09-06 Thread amit
Hi 
I have installed solr 3.6.1 on tomcat 7.0 following the steps here. 
http://ralf.schaeftlein.de/2012/02/10/installing-solr-3-5-under-tomcat-7/

The solr home page loads fine but the admin page
(http://localhost:8080/solr/admin/) throws the error "missing core name in path".
I am installing a single core. This is the solr.xml:
<solr persistent="false">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="." />
  </cores>
</solr>

I have double-checked a lot of steps by searching on the net, but no luck.
If anyone has faced this, please suggest.

Thanks
Amit



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-3-6-1-tomcat-7-0-missing-core-name-in-path-tp4005868.html
Sent from the Solr - User mailing list archive at Nabble.com.


Facetting inside a custom component

2012-09-06 Thread Ralf Heyde
Hello,

I'm currently developing a custom component in Solr.
The component works fine. The problem I have is that I only have access to the 
searcher, which gives me the option to fire e.g. BooleanQueries.

This searcher gives me a result, which I have to iterate over to calculate 
information that could be delivered by Solr facets out of the box.

The problem I'm facing is that there is no option for faceting. There is 
an example for Lucene here: 
http://lucene.apache.org/core/4_0_0-BETA/facet/org/apache/lucene/facet/doc-files/userguide.html

The problem is that I have no taxonomyDir / I don't know where to get it:
TaxonomyReader taxo = new DirectoryTaxonomyReader(taxoDir);

Does anybody have an idea, how to gather facet information?

Thanks in advance,

Ralf


Re: Facetting inside a custom component

2012-09-06 Thread Ralf Heyde
Hi,

just found a solution, but you have to know what you want to count:

try {
 final SolrIndexSearcher s = rb.req.getSearcher();
 final SolrQueryParser qp = new SolrQueryParser(rb.req.getSchema(), null);
 final String queryString = "entity_type:RELEASE";
 final Query q = qp.parse(queryString);
 final DocListAndSet results = s.getDocListAndSet(q, (List<Query>) null,
     (Sort) null, rb.req.getStart(), rb.req.getLimit());
 final NamedList counts = new NamedList();
 for (String fc : ImmutableSet.of("entity_type:RELEASE",
     "entity_type:PRODUCT_NAME")) {
   counts.add(fc, s.numDocs(qp.parse(fc), results.docSet));
 }
 // in counts you have your facet-counts
}
catch (ParseException pe) {
 pe.printStackTrace();
}
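
If the counts should go back to the client, adding them to the response from
inside the component ought to work too -- a sketch, assuming rb is the
ResponseBuilder handed to process():

rb.rsp.add("my_facet_counts", counts);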


 Original-Nachricht 
 Datum: Thu, 06 Sep 2012 13:58:29 +0200
 Von: Ralf Heyde ralf.he...@gmx.de
 An: solr-user@lucene.apache.org
 Betreff: Facetting inside a custom component

 Hello,
 
 i'm currently devoloping a custom component in Solr.
 This component works fine. The problem I have is, I only have an access to
 the searcher which gives me the option to fire e.g. BooleanQueries. 
 
 This searcher gives me a result, which I have to iterate to calculate
 informations which could be delivered by solr facets out of the box.
 
 The problem i'm facing with is, that there is no option for facetting.
 There is an example on the Lucene here: 
 http://lucene.apache.org/core/4_0_0-BETA/facet/org/apache/lucene/facet/doc-files/userguide.html
 
 The problem is, that i have no taxonomyDir / i dont know, where to get it.
 TaxonomyReader taxo = new DirectoryTaxonomyReader(taxoDir);
 
 Does anybody have an idea, how to gather facet information?
 
 Thanks in advance,
 
 Ralf


Solr 4.0alpha: edismax complaints on certain characters

2012-09-06 Thread Alexandre Rafalovitch
Hello,

I was under the impression that edismax was supposed to be crash proof
and just ignore bad syntax. But I am either misconfiguring it or hit a
weird bug. I basically searched for text containing '/' and got this:

{
  'responseHeader'={
'status'=400,
'QTime'=9,
'params'={
  'qf'='TitleEN DescEN',
  'indent'='true',
  'wt'='ruby',
  'q'='foo/bar',
  'defType'='edismax'}},
  'error'={
'msg'='org.apache.lucene.queryparser.classic.ParseException:
Cannot parse \'foo/bar \': Lexical error at line 1, column 9.
Encountered: EOF after : /bar ',
'code'=400}}

Is that normal? If it is, is there a known list of characters I need
to escape or do I just have to catch the exception and tell user to
not do this again?

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


RE: Solr 4.0alpha: edismax complaints on certain characters

2012-09-06 Thread Yoni Amir
As far as I understand, / is a special character and needs to be escaped.
Maybe foo\/bar should work?

I found this when I looked at the code of ClientUtils.escapeQueryChars:

// These characters are part of the query syntax and must be escaped
  if (c == '\\' || c == '+' || c == '-' || c == '!'  || c == '(' || c == ')' || c == ':'
|| c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c == '}' || c == '~'
|| c == '*' || c == '?' || c == '|' || c == '&'  || c == ';' || c == '/'
|| Character.isWhitespace(c)) {
sb.append('\\');
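
A minimal sketch of using that helper to escape user input before it reaches the
query parser (ClientUtils is part of SolrJ):

import org.apache.solr.client.solrj.util.ClientUtils;

public class EscapeExample {
  public static void main(String[] args) {
    String raw = "foo/bar";
    // escapeQueryChars backslash-escapes every character in the list above,
    // so the slash no longer reaches the parser unescaped.
    String escaped = ClientUtils.escapeQueryChars(raw); // foo\/bar
    System.out.println(escaped);
  }
}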


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Thursday, September 06, 2012 4:35 PM
To: solr-user@lucene.apache.org
Subject: Solr 4.0alpha: edismax complaints on certain characters

Hello,

I was under the impression that edismax was supposed to be crash proof and just 
ignore bad syntax. But I am either misconfiguring it or hit a weird bug. I 
basically searched for text containing '/' and got this:

{
  'responseHeader'={
'status'=400,
'QTime'=9,
'params'={
  'qf'='TitleEN DescEN',
  'indent'='true',
  'wt'='ruby',
  'q'='foo/bar',
  'defType'='edismax'}},
  'error'={
'msg'='org.apache.lucene.queryparser.classic.ParseException:
Cannot parse \'foo/bar \': Lexical error at line 1, column 9.
Encountered: EOF after : /bar ',
'code'=400}}

Is that normal? If it is, is there a known list of characters I need to escape 
or do I just have to catch the exception and tell user to not do this again?

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. 
Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


Re: Solr 4.0alpha: edismax complaints on certain characters

2012-09-06 Thread Yonik Seeley
I believe this is caused by the regex support in
https://issues.apache.org/jira/browse/LUCENE-2039

It certainly seems wrong to interpret a slash in the middle of the
word as the start of a regex, so I've reopened the issue.

-Yonik
http://lucidworks.com


On Thu, Sep 6, 2012 at 9:34 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:
 Hello,

 I was under the impression that edismax was supposed to be crash proof
 and just ignore bad syntax. But I am either misconfiguring it or hit a
 weird bug. I basically searched for text containing '/' and got this:

 {
   'responseHeader'={
 'status'=400,
 'QTime'=9,
 'params'={
   'qf'='TitleEN DescEN',
   'indent'='true',
   'wt'='ruby',
   'q'='foo/bar',
   'defType'='edismax'}},
   'error'={
 'msg'='org.apache.lucene.queryparser.classic.ParseException:
 Cannot parse \'foo/bar \': Lexical error at line 1, column 9.
 Encountered: EOF after : /bar ',
 'code'=400}}

 Is that normal? If it is, is there a known list of characters I need
 to escape or do I just have to catch the exception and tell user to
 not do this again?

 Regards,
Alex.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)


AW: Website (crawler for) indexing

2012-09-06 Thread Lochschmied, Alexander
Thanks Rafał and Markus for your comments.

I think Droids has a serious problem with URL parameters in the current version 
(0.2.0) from Maven central:
https://issues.apache.org/jira/browse/DROIDS-144

I knew about Nutch, but I haven't been able to implement a crawler with it. 
Have you done that or seen an example application?
It's probably easy to call a Nutch jar and make it index a website and maybe I 
will have to do that.
But as we already have a Java implementation to index other sources, it would 
be nice if we could integrate the crawling part too.

Regards,
Alexander 



Hello!

You can implement your own crawler using Droids
(http://incubator.apache.org/droids/) or use Apache Nutch 
(http://nutch.apache.org/), which is very easy to integrate with Solr and is 
very powerful crawler.

--
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

 This may be a bit off topic: How do you index an existing website and 
 control the data going into index?

 We already have Java code to process the HTML (or XHTML) and turn it 
 into a SolrJ Document (removing tags and other things we do not want 
 in the index). We use SolrJ for indexing.
 So I guess the question is essentially which Java crawler could be useful.

 We used to use wget on command line in our publishing process, but we do no 
 longer want to do that.

 Thanks,
 Alexander



RE: deletedPkQuery not work in solr 3.3

2012-09-06 Thread Dyer, James
You have deletedPKQuery, but the correct spelling is deletedPkQuery 
(lowercase k).  Try that and see if it fixes your problem.  

Also, you can probably simplify this if you do it as 
command=full-import&clean=false, then use something like this for your query:

select product_id as '$deleteDocById' from modified_product 
where gmt_create > to_date('${dataimporter.last_index_time}','yyyy-mm-dd hh24:mi:ss') 
and modification = 'deleted'

See http://wiki.apache.org/solr/DataImportHandler#Special_Commands for more 
info on this technique.

Finally, you will want to be aware of 
https://issues.apache.org/jira/browse/SOLR-2492 , a bug which was fixed in Solr 
3.4.  DIH doesn't automatically do a commit in some cases if your import only 
does deletes.  You need to issue a commit manually after it completes.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: jun Wang [mailto:wangjun...@gmail.com] 
Sent: Wednesday, September 05, 2012 8:33 PM
To: solr-user@lucene.apache.org
Subject: deletedPkQuery not work in solr 3.3

I have a data-config.xml with 2 entities, like

<entity name="full" PK="ID" ...>
...
</entity>

and

<entity name="delta_build" PK="ID" ...>
...
</entity>

entity delta_build is for delta import; the query is

?command=full-import&entity=delta_build&clean=false

and I want to use deletedPkQuery to delete from the index. So I have added these to
entity delta_build:

deltaQuery="select -1 as ID from dual"

deltaImportQuery="select * from product where a.id='${dataimporter.delta.ID}'"

deletedPKQuery="select product_id as ID from modified_product where
gmt_create > to_date('${dataimporter.last_index_time}','yyyy-mm-dd
hh24:mi:ss') and modification = 'deleted'"

deltaQuery and deltaImportQuery are simply there to keep the delta import from
importing any records, because delta import has been implemented via full import,
and I just want to use delta for deleting from the index.

But when I hit the query

?command=delta-import

deltaQuery and deltaImportQuery can be found in the log, but deletedPKQuery is not.
Is there anything wrong in the config file?

-- 
from Jun Wang



Re: Solr 4.0alpha: edismax complaints on certain characters

2012-09-06 Thread Jack Krupansky
That's what I was thinking, but when I tried foo/bar in Solr 3.6 and 
4.0-BETA it was working fine - it split the term and generated the proper 
query without any error.


I think the problem is if you use the default Lucene query parser, not 
edismax. I removed defType=edismax from my query request and the problem 
reproduces.


My two test queries:
http://localhost:8983/solr/select/?debugQuery=true&defType=edismax&qf=features&q=foo/bar
http://localhost:8983/solr/select/?debugQuery=true&df=features&q=foo/bar

The first works; the second fails as reported (in 4.0-BETA, but works in 
3.6).


-- Jack Krupansky

-Original Message- 
From: Yonik Seeley

Sent: Thursday, September 06, 2012 9:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.0alpha: edismax complaints on certain characters

I believe this is caused by the regex support in
https://issues.apache.org/jira/browse/LUCENE-2039

It certainly seems wrong to interpret a slash in the middle of the
word as the start of a regex, so I've reopened the issue.

-Yonik
http://lucidworks.com


On Thu, Sep 6, 2012 at 9:34 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

Hello,

I was under the impression that edismax was supposed to be crash proof
and just ignore bad syntax. But I am either misconfiguring it or hit a
weird bug. I basically searched for text containing '/' and got this:

{
  'responseHeader'={
'status'=400,
'QTime'=9,
'params'={
  'qf'='TitleEN DescEN',
  'indent'='true',
  'wt'='ruby',
  'q'='foo/bar',
  'defType'='edismax'}},
  'error'={
'msg'='org.apache.lucene.queryparser.classic.ParseException:
Cannot parse \'foo/bar \': Lexical error at line 1, column 9.
Encountered: EOF after : /bar ',
'code'=400}}

Is that normal? If it is, is there a known list of characters I need
to escape or do I just have to catch the exception and tell user to
not do this again?

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book) 




Re: Solr 4.0alpha: edismax complaints on certain characters

2012-09-06 Thread Alexandre Rafalovitch
I am on 4.0 alpha. Maybe it was fixed in beta. But I am most
definitely seeing this in edismax. If I get rid of / and use
debugQuery, I get:
'responseHeader'={
'status'=0,
'QTime'=14,
'params'={
  'debugQuery'='true',
  'indent'='true',
  'q'='foobar',
  'qf'='TitleEN DescEN',
  'wt'='ruby',
  'defType'='edismax'}},
  'response'={'numFound'=0,'start'=0,'docs'=[]
  },
  'debug'={
'rawquerystring'='foobar',
'querystring'='foobar',
'parsedquery'='(+DisjunctionMaxQuery((DescEN:foobar |
TitleEN:foobar)))/no_coord',
'parsedquery_toString'='+(DescEN:foobar | TitleEN:foobar)',
'explain'={},
'QParser'='ExtendedDismaxQParser',


I'll check beta on my machine by tomorrow.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Sep 6, 2012 at 10:06 AM, Jack Krupansky j...@basetechnology.com wrote:
 That's what I was thinking, but when I tried foo/bar in Solr 3.6 and
 4.0-BETA it was working fine - it split the term and generated the proper
 query without any error.

 I think the problem is if you use the default Lucene query parser, not
 edismax. I removed defType==edismax from my query request and the problem
 reproduces.

 My two test queries:
 http://localhost:8983/solr/select/?debugQuery=true&defType=edismax&qf=features&q=foo/bar
 http://localhost:8983/solr/select/?debugQuery=true&df=features&q=foo/bar

 The first works; the second fails as reported (in 4.0-BETA, but works in
 3.6).

 -- Jack Krupansky

 -Original Message- From: Yonik Seeley
 Sent: Thursday, September 06, 2012 9:53 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr 4.0alpha: edismax complaints on certain characters


 I believe this is caused by the regex support in
 https://issues.apache.org/jira/browse/LUCENE-2039

 It certainly seems wrong to interpret a slash in the middle of the
 word as the start of a regex, so I've reopened the issue.

 -Yonik
 http://lucidworks.com


 On Thu, Sep 6, 2012 at 9:34 AM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:

 Hello,

 I was under the impression that edismax was supposed to be crash proof
 and just ignore bad syntax. But I am either misconfiguring it or hit a
 weird bug. I basically searched for text containing '/' and got this:

 {
   'responseHeader'={
 'status'=400,
 'QTime'=9,
 'params'={
   'qf'='TitleEN DescEN',
   'indent'='true',
   'wt'='ruby',
   'q'='foo/bar',
   'defType'='edismax'}},
   'error'={
 'msg'='org.apache.lucene.queryparser.classic.ParseException:
 Cannot parse \'foo/bar \': Lexical error at line 1, column 9.
 Encountered: EOF after : /bar ',
 'code'=400}}

 Is that normal? If it is, is there a known list of characters I need
 to escape or do I just have to catch the exception and tell user to
 not do this again?

 Regards,
Alex.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)




Re: AW: Website (crawler for) indexing

2012-09-06 Thread Rafał Kuć
Hello!

I think that really depends on what you want to achieve and what parts
of your current system you would like to reuse. If it is only HTML
processing I would let Nutch and Solr do that. Of course you can
extend Nutch (it has a plugin API) and implement the custom logic you
need as a Nutch plugin. There is even an example Nutch plugin
available (http://wiki.apache.org/nutch/WritingPluginExample), but it's
for Nutch 1.3.

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

 Thanks Rafał and Markus for your comments.

 I think Droids it has serious problem with URL parameters in
 current version (0.2.0) from Maven central:
 https://issues.apache.org/jira/browse/DROIDS-144

 I knew about Nutch, but I haven't been able to implement a crawler
 with it. Have you done that or seen an example application?
 It's probably easy to call a Nutch jar and make it index a website and maybe 
 I will have to do that.
 But as we already have a Java implementation to index other
 sources, it would be nice if we could integrate the crawling part too.

 Regards,
 Alexander 

 

 Hello!

 You can implement your own crawler using Droids
 (http://incubator.apache.org/droids/) or use Apache Nutch
 (http://nutch.apache.org/), which is very easy to integrate with
 Solr and is very powerful crawler.

 --
 Regards,
  Rafał Kuć
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

 This may be a bit off topic: How do you index an existing website and 
 control the data going into index?

 We already have Java code to process the HTML (or XHTML) and turn it 
 into a SolrJ Document (removing tags and other things we do not want 
 in the index). We use SolrJ for indexing.
 So I guess the question is essentially which Java crawler could be useful.

 We used to use wget on command line in our publishing process, but we do no 
 longer want to do that.

 Thanks,
 Alexander



Re: Solr 4.0alpha: edismax complaints on certain characters

2012-09-06 Thread Jack Krupansky
I do in fact see your problem with an earlier 4.0 build, but not with 
4.0-BETA.


-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Thursday, September 06, 2012 10:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.0alpha: edismax complaints on certain characters

I am on 4.0 alpha. Maybe it was fixed in beta. But I am most
definitely seeing this in edismax. If I get rid of / and use
debugQuery, I get:
'responseHeader'={
   'status'=0,
   'QTime'=14,
   'params'={
 'debugQuery'='true',
 'indent'='true',
 'q'='foobar',
 'qf'='TitleEN DescEN',
 'wt'='ruby',
 'defType'='edismax'}},
 'response'={'numFound'=0,'start'=0,'docs'=[]
 },
 'debug'={
   'rawquerystring'='foobar',
   'querystring'='foobar',
   'parsedquery'='(+DisjunctionMaxQuery((DescEN:foobar |
TitleEN:foobar)))/no_coord',
   'parsedquery_toString'='+(DescEN:foobar | TitleEN:foobar)',
   'explain'={},
   'QParser'='ExtendedDismaxQParser',


I'll check beta on my machine by tomorrow.

Regards,
  Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Sep 6, 2012 at 10:06 AM, Jack Krupansky j...@basetechnology.com 
wrote:

That's what I was thinking, but when I tried foo/bar in Solr 3.6 and
4.0-BETA it was working fine - it split the term and generated the proper
query without any error.

I think the problem is if you use the default Lucene query parser, not
edismax. I removed defType==edismax from my query request and the problem
reproduces.

My two test queries:
http://localhost:8983/solr/select/?debugQuery=true&defType=edismax&qf=features&q=foo/bar
http://localhost:8983/solr/select/?debugQuery=true&df=features&q=foo/bar

The first works; the second fails as reported (in 4.0-BETA, but works in
3.6).

-- Jack Krupansky

-Original Message- From: Yonik Seeley
Sent: Thursday, September 06, 2012 9:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.0alpha: edismax complaints on certain characters


I believe this is caused by the regex support in
https://issues.apache.org/jira/browse/LUCENE-2039

It certainly seems wrong to interpret a slash in the middle of the
word as the start of a regex, so I've reopened the issue.

-Yonik
http://lucidworks.com


On Thu, Sep 6, 2012 at 9:34 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:


Hello,

I was under the impression that edismax was supposed to be crash proof
and just ignore bad syntax. But I am either misconfiguring it or hit a
weird bug. I basically searched for text containing '/' and got this:

{
  'responseHeader'={
'status'=400,
'QTime'=9,
'params'={
  'qf'='TitleEN DescEN',
  'indent'='true',
  'wt'='ruby',
  'q'='foo/bar',
  'defType'='edismax'}},
  'error'={
'msg'='org.apache.lucene.queryparser.classic.ParseException:
Cannot parse \'foo/bar \': Lexical error at line 1, column 9.
Encountered: EOF after : /bar ',
'code'=400}}

Is that normal? If it is, is there a known list of characters I need
to escape or do I just have to catch the exception and tell user to
not do this again?

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)







RE: Website (crawler for) indexing

2012-09-06 Thread Markus Jelsma

-Original message-
 From:Lochschmied, Alexander alexander.lochschm...@vishay.com
 Sent: Thu 06-Sep-2012 16:04
 To: solr-user@lucene.apache.org
 Subject: AW: Website (crawler for) indexing
 
 Thanks Rafał and Markus for your comments.
 
 I think Droids it has serious problem with URL parameters in current version 
 (0.2.0) from Maven central:
 https://issues.apache.org/jira/browse/DROIDS-144
 
 I knew about Nutch, but I haven't been able to implement a crawler with it. 
 Have you done that or seen an example application?

We've been using it for some years now for our site search customers and are 
happy but it can be quite a beast to begin with. The Nutch tutorial will walk 
you through the first steps, crawling and indexing to Solr.

 It's probably easy to call a Nutch jar and make it index a website and maybe 
 I will have to do that.
 But as we already have a Java implementation to index other sources, it would 
 be nice if we could integrate the crawling part too.

You can control Nutch from within another application but i'd not recommend it, 
it's batch based and can take quite some time and resources to run. We usually 
prefer running a custom shell script controlling the process and call it via 
the cron.

 
 Regards,
 Alexander 
 
 
 
 Hello!
 
 You can implement your own crawler using Droids
 (http://incubator.apache.org/droids/) or use Apache Nutch 
 (http://nutch.apache.org/), which is very easy to integrate with Solr and is 
 very powerful crawler.
 
 --
 Regards,
  Rafał Kuć
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
 
  This may be a bit off topic: How do you index an existing website and 
  control the data going into index?
 
  We already have Java code to process the HTML (or XHTML) and turn it 
  into a SolrJ Document (removing tags and other things we do not want 
  in the index). We use SolrJ for indexing.
  So I guess the question is essentially which Java crawler could be useful.
 
  We used to use wget on command line in our publishing process, but we do no 
  longer want to do that.
 
  Thanks,
  Alexander
 
 


Re: Solr 4.0alpha: edismax complaints on certain characters

2012-09-06 Thread Jack Krupansky
The fix in edismax was made just a few days (6/28) before the formal 
announcement of 4.0-ALPHA (7/3), but unfortunately the fix came a few days 
after the cutoff for 4.0-ALPHA (6/25).


See:
https://issues.apache.org/jira/browse/SOLR-3467

(That issue should probably be annotated to indicate that it affects 
4.0-ALPHA.)


-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Thursday, September 06, 2012 10:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.0alpha: edismax complaints on certain characters

I am on 4.0 alpha. Maybe it was fixed in beta. But I am most
definitely seeing this in edismax. If I get rid of / and use
debugQuery, I get:
'responseHeader'={
   'status'=0,
   'QTime'=14,
   'params'={
 'debugQuery'='true',
 'indent'='true',
 'q'='foobar',
 'qf'='TitleEN DescEN',
 'wt'='ruby',
 'defType'='edismax'}},
 'response'={'numFound'=0,'start'=0,'docs'=[]
 },
 'debug'={
   'rawquerystring'='foobar',
   'querystring'='foobar',
   'parsedquery'='(+DisjunctionMaxQuery((DescEN:foobar |
TitleEN:foobar)))/no_coord',
   'parsedquery_toString'='+(DescEN:foobar | TitleEN:foobar)',
   'explain'={},
   'QParser'='ExtendedDismaxQParser',


I'll check beta on my machine by tomorrow.

Regards,
  Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Sep 6, 2012 at 10:06 AM, Jack Krupansky j...@basetechnology.com 
wrote:

That's what I was thinking, but when I tried foo/bar in Solr 3.6 and
4.0-BETA it was working fine - it split the term and generated the proper
query without any error.

I think the problem is if you use the default Lucene query parser, not
edismax. I removed defType==edismax from my query request and the problem
reproduces.

My two test queries:
http://localhost:8983/solr/select/?debugQuery=true&defType=edismax&qf=features&q=foo/bar
http://localhost:8983/solr/select/?debugQuery=true&df=features&q=foo/bar

The first works; the second fails as reported (in 4.0-BETA, but works in
3.6).

-- Jack Krupansky

-Original Message- From: Yonik Seeley
Sent: Thursday, September 06, 2012 9:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.0alpha: edismax complaints on certain characters


I believe this is caused by the regex support in
https://issues.apache.org/jira/browse/LUCENE-2039

It certainly seems wrong to interpret a slash in the middle of the
word as the start of a regex, so I've reopened the issue.

-Yonik
http://lucidworks.com


On Thu, Sep 6, 2012 at 9:34 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:


Hello,

I was under the impression that edismax was supposed to be crash proof
and just ignore bad syntax. But I am either misconfiguring it or hit a
weird bug. I basically searched for text containing '/' and got this:

{
  'responseHeader'={
'status'=400,
'QTime'=9,
'params'={
  'qf'='TitleEN DescEN',
  'indent'='true',
  'wt'='ruby',
  'q'='foo/bar',
  'defType'='edismax'}},
  'error'={
'msg'='org.apache.lucene.queryparser.classic.ParseException:
Cannot parse \'foo/bar \': Lexical error at line 1, column 9.
Encountered: EOF after : /bar ',
'code'=400}}

Is that normal? If it is, is there a known list of characters I need
to escape or do I just have to catch the exception and tell user to
not do this again?

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)







Re: Problem with verifying signature ?

2012-09-06 Thread Chris Hostetter

: gpg: Signature made 08/06/12 19:52:21 Pacific Daylight Time using RSA key
: ID 322
: D7ECA
: gpg: Good signature from Robert Muir (Code Signing Key) rm...@apache.org
: *gpg: WARNING: This key is not certified with a trusted signature!*
: gpg:  There is no indication that the signature belongs to the
: owner.
: Primary key fingerprint: 6661 9BA3 C030 DD55 3625  1303 817A E1DD 322D 7ECA
: 
: Is this acceptable ?

I guess it depends on what you mean by acceptable?

I'm not an expert on this, but as i understand it...

gpg is telling you that it confirmed the signature matches a known key 
named Robert Muir (Code Signing Key) which is in your keyring, but that 
there is no certified level of trust associated with that key.

Key Trust is a personal thing, specific to you, your keyring, and how you 
got the keys you put in that ring.  if you trust that the KEYS file you 
downloaded from apache.org is legitimate, and that all the keys in it 
should be trusted, you can tell gpg that.  (using the trust 
interactive command when using --edit-key)

Alternatively, you could tell gpg that you have a high level of trust in 
the key of some other person you have met personally -- ie: if you met Uwe 
at a conference and he physically handed you his key on a USB drive -- and 
then if Uwe has signed Robert's key with his own (i think it has, not sure 
off the top of my head), then gpg would extend an implicit transitive 
trust to Robert's key...

http://www.gnupg.org/gph/en/manual.html#AEN335


-Hoss


Re: Solr not allowing persistent HTTP connections

2012-09-06 Thread Chris Hostetter

: Some extra information. If I use curl and force it to use HTTP 1.0, it is more
: visible that Solr doesn't allow persistent connections:

a) solr has nothing to do with it, it's entirely something under the 
control of jetty & the client.

b) i think you are introducing confusion by trying to force an HTTP/1.0
connection -- Jetty supports Keep-Alive for HTTP/1.1, but maybe not for 
HTTP/1.0 ?

If you use curl to request multiple URLs and just let curl & jetty do 
their normal behavior (w/o trying to bypass anything or manually add 
headers) you can see that keep-alive is in fact working...

$ curl -v --keepalive 'http://localhost:8983/solr/select?q=*:*' 
'http://localhost:8983/solr/select?q=foo'
* About to connect() to localhost port 8983 (#0)
*   Trying 127.0.0.1... connected
> GET /solr/select?q=*:* HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: localhost:8983
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/xml; charset=UTF-8
< Transfer-Encoding: chunked
< 
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str></lst></lst><result name="response" numFound="0" start="0"/>
</response>
* Connection #0 to host localhost left intact
* Re-using existing connection! (#0) with host localhost
* Connected to localhost (127.0.0.1) port 8983 (#0)
> GET /solr/select?q=foo HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: localhost:8983
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/xml; charset=UTF-8
< Transfer-Encoding: chunked
< 
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int><lst name="params"><str name="q">foo</str></lst></lst><result name="response" numFound="0" start="0"/>
</response>
* Connection #0 to host localhost left intact
* Closing connection #0





-Hoss


Re: Solr not allowing persistent HTTP connections

2012-09-06 Thread Aleksey Vorona

Thank you. I did the test with curl the same way you did it and it works.

I still can not get ab (apache benchmark) to reuse connections to 
solr. I'll investigate this further.


$ ab -c 1 -n 100 -k 'http://localhost:8983/solr/select?q=*:*' | grep Alive
Keep-Alive requests:0

-- Aleksey

On 12-09-06 11:07 AM, Chris Hostetter wrote:

: Some extra information. If I use curl and force it to use HTTP 1.0, it is more
: visible that Solr doesn't allow persistent connections:

a) solr has nothing to do with it, it's entirely something under the
control of jetty  the client.

b) i think you are introducing confusion by trying to force an HTTP/1.0
connection -- Jetty supports Keep-Alive for HTTP/1.1, but maybe not for
HTTP/1.0 ?

If you use curl to request multiple URLs and just let curl  jetty do
their normal behavior (w/o trying to bypass anything or manually add
headers) you can see that keep-alive is in fact working...

$ curl -v --keepalive 'http://localhost:8983/solr/select?q=*:*' 
'http://localhost:8983/solr/select?q=foo'
* About to connect() to localhost port 8983 (#0)
*   Trying 127.0.0.1... connected

GET /solr/select?q=*:* HTTP/1.1
User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 
zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Host: localhost:8983
Accept: */*


 HTTP/1.1 200 OK
 Content-Type: application/xml; charset=UTF-8
 Transfer-Encoding: chunked

?xml version=1.0 encoding=UTF-8?
response
lst name=responseHeaderint name=status0/intint
name=QTime1/intlst name=paramsstr
name=q*:*/str/lst/lstresult name=response numFound=0
start=0/
/response
* Connection #0 to host localhost left intact
* Re-using existing connection! (#0) with host localhost
* Connected to localhost (127.0.0.1) port 8983 (#0)

GET /solr/select?q=foo HTTP/1.1
User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 
zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Host: localhost:8983
Accept: */*


 HTTP/1.1 200 OK
 Content-Type: application/xml; charset=UTF-8
 Transfer-Encoding: chunked

?xml version=1.0 encoding=UTF-8?
response
lst name=responseHeaderint name=status0/intint
name=QTime0/intlst name=paramsstr
name=qfoo/str/lst/lstresult name=response numFound=0
start=0/
/response
* Connection #0 to host localhost left intact
* Closing connection #0





-Hoss





Solr-Export

2012-09-06 Thread Helton Alponti
Hey Guys,

I created a program to export Solr index data to XML.

The url is https://github.com/eltu/Solr-Export

Tell me about any problem, please.

***  I only tested  with the Solr 3.6.1

Thanks,
Helton


Solr search not working after copying a new field to an existing Indexed Field

2012-09-06 Thread Mani
I have made a schema change to copy an existing field "name" (Source Field)
to an existing search field "text" (Destination Field).

Since I made the schema change, I updated all the documents, thinking the new
source field would be clubbed together with the text field. The search for
a specific name does not return results.

If I delete the document and then add the document back, it works just fine.

I thought the Add command with the default overwrite option would work as Delete and
Add.

Is this the only way to reindex the text field? Is there any other method?

I really appreciate your help on this!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-search-not-working-after-copying-a-new-field-to-an-existing-Indexed-Field-tp4005993.html
Sent from the Solr - User mailing list archive at Nabble.com.


NoHttpResponseException: The server failed to respond

2012-09-06 Thread srinir
We have a distributed solr setup with 8 servers and 8 cores on each server in
production. We see this error multiple times in our solr servers. We are
using solr 3.6.1. Has anyone seen this error before, and have you resolved
it?

2012-09-04 02:16:40,995 [http-nio-8080-exec-7] ERROR 
org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:275)
at
com.nextag.search.solr.handler.ProductSearchHandler.handleRequestBody(ProductSearchHandler.java:269)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
at
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1653)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.solr.client.solrj.SolrServerException:
org.apache.commons.httpclient.NoHttpResponseException: The server
solr2-vip.servername.com failed to respond
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:129)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:103)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
... 3 more
Caused by: org.apache.commons.httpclient.NoHttpResponseException: The server
solr2-vip.servername.com failed to respond
at
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1976)
at
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:419)

Here is the shardHandlerFactory setting in our config

<shardHandlerFactory class="HttpShardHandlerFactory">
  <int name="connTimeout">5000</int>
  <int name="socketTimeout">3</int>
</shardHandlerFactory>

We checked that Full GC is not running so frequently and the server is not
too busy.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/NoHttpResponseException-The-server-failed-to-respond-tp4006017.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: UnInvertedField limitations

2012-09-06 Thread Fuad Efendi
Hi Jack,


24bit = 16M possibilities, that's clear; just to confirm... the rest is
unclear: why can 4 bytes have 4 million cardinality? I thought it was 4
billion...


And, just to confirm: UnInvertedField allows 16M cardinality, correct?




On 12-08-20 6:51 PM, Jack Krupansky j...@basetechnology.com wrote:

It appears that there is a hard limit of 24-bits or 16M for the number of
bytes to reference the terms in a single field of a single document. It
takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes,
that 
would allow 16/4 or 4 million unique terms - per document. Do you have
such 
large documents? This appears to be a hard limit based of 24-bytes in a
Java 
int.

You can try facet.method=enum, but that may be too slow.

What release of Solr are you running?

-- Jack Krupansky

-Original Message-
From: Fuad Efendi
Sent: Monday, August 20, 2012 4:34 PM
To: Solr-User@lucene.apache.org
Subject: UnInvertedField limitations

Hi All,


I have a problem... (Yonik, please!) help me, what is Term count limits? I
possibly have 256,000,000 different terms in a field... or 16,000,000?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - :
org.apache.solr.common.SolrException: Too many values for UnInvertedField
faceting on field enrich_keywords_string_mv
at
org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
at
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField
.j
ava:668)
at
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
at
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java
:4
23)
at
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
at
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.ja
va
:85)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHa
nd
ler.java:204)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBas
e.
java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




-- 
Fuad Efendi
http://www.tokenizer.ca







Re: UnInvertedField limitations

2012-09-06 Thread Fuad Efendi
Hi Lance,


The use case is keyword extraction, and it could be 2- and 3-grams (2- and
3-word phrases); so theoretically we can have 10,000^3 = 1,000,000,000,000
3-grams for English alone... Of course my suggestion is to use statistics and
to build a dictionary of such 3-word combinations (remove the top, remove the
tail, using frequencies)... and to hard-limit this dictionary to 1,000,000...
That was a business requirement which is technically impossible to implement
(as a realtime query results); we don't even use word stemming etc...




-Fuad
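
A rough sketch of the dictionary-building idea described above (count word
trigrams, trim by frequency, cap the dictionary size); purely illustrative,
not the code actually used:

import java.util.*;

public class TrigramDictionary {
  public static Map<String, Integer> build(Iterable<String[]> docs, int maxSize) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String[] tokens : docs) {
      for (int i = 0; i + 2 < tokens.length; i++) {
        String gram = tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2];
        Integer c = counts.get(gram);
        counts.put(gram, c == null ? 1 : c + 1);
      }
    }
    // Sort by frequency and keep at most maxSize entries (e.g. 1,000,000);
    // trimming the very top and the long tail by thresholds would go here too.
    List<Map.Entry<String, Integer>> entries =
        new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        return b.getValue() - a.getValue();
      }
    });
    Map<String, Integer> dictionary = new LinkedHashMap<String, Integer>();
    for (Map.Entry<String, Integer> e
        : entries.subList(0, Math.min(maxSize, entries.size()))) {
      dictionary.put(e.getKey(), e.getValue());
    }
    return dictionary;
  }
}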




On 12-08-20 7:22 PM, Lance Norskog goks...@gmail.com wrote:

Is this required by your application? Is there any way to reduce the
number of terms?

A work around is to use shards. If your terms follow Zipf's Law each
shard will have fewer than the complete number of terms. For N shards,
each shard will have ~1/N of the singleton terms. For 2-count terms,
1/N or 2/N will have that term.

Now I'm interested but not mathematically capable: what is the general
probabilistic formula for splitting Zipf's Law across shards?

On Mon, Aug 20, 2012 at 3:51 PM, Jack Krupansky j...@basetechnology.com
wrote:
 It appears that there is a hard limit of 24-bits or 16M for the number
of
 bytes to reference the terms in a single field of a single document. It
 takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes,
that
 would allow 16/4 or 4 million unique terms - per document. Do you have
such
 large documents? This appears to be a hard limit based of 24-bytes in a
Java
 int.

 You can try facet.method=enum, but that may be too slow.

 What release of Solr are you running?

 -- Jack Krupansky

 -Original Message- From: Fuad Efendi
 Sent: Monday, August 20, 2012 4:34 PM
 To: Solr-User@lucene.apache.org
 Subject: UnInvertedField limitations


 Hi All,


 I have a problemŠ  (Yonik, please!) help me, what is Term count limits?
I
 possibly have 256,000,000 different terms in a fieldŠ or 16,000,000?

 Thanks!


 2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1]
- :
 org.apache.solr.common.SolrException: Too many values for
UnInvertedField
 faceting on field enrich_keywords_string_mv
at
 org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
at
 
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedFiel
d.j
 ava:668)
at
 
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
at
 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.jav
a:4
 23)
at
 
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206
)
at
 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.j
ava
 :85)
at
 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchH
and
 ler.java:204)
at
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBa
se.
 java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




 --
 Fuad Efendi
 http://www.tokenizer.ca






-- 
Lance Norskog
goks...@gmail.com




Re: UnInvertedField limitations

2012-09-06 Thread Yonik Seeley
It's actually limited to 24 bits to point to the term list in a
byte[], but there are 256 different arrays, so the maximum capacity is
4B bytes of un-inverted terms (256 * 2^24). Each bucket is limited to
4B/256, so the real limit can come in a little lower depending on how
documents happen to be spread across the buckets.

From the comments:

 *   There is a single int[maxDoc()] which either contains a pointer
into a byte[] for
 *   the termNumber lists, or directly contains the termNumber list if
it fits in the 4
 *   bytes of an integer.  If the first byte in the integer is 1, the
next 3 bytes
 *   are a pointer into a byte[] where the termNumber list starts.
 *
 *   There are actually 256 byte arrays, to compensate for the fact
that the pointers
 *   into the byte arrays are only 3 bytes long.  The correct byte
array for a document
 *   is a function of it's id.


-Yonik
http://lucidworks.com


On Thu, Sep 6, 2012 at 6:33 PM, Fuad Efendi f...@efendi.ca wrote:
 Hi Jack,


 24 bits = 16M possibilities, that's clear; just to confirm... the rest is
 unclear: why would 4 bytes give a cardinality of only 4 million? I thought it
 was 4 billion...


 And, just to confirm: UnInvertedField allows 16M cardinality, correct?




 On 12-08-20 6:51 PM, Jack Krupansky j...@basetechnology.com wrote:

It appears that there is a hard limit of 24-bits or 16M for the number of
bytes to reference the terms in a single field of a single document. It
takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes,
that
would allow 16/4 or 4 million unique terms - per document. Do you have
such
large documents? This appears to be a hard limit based on 24 bits in a Java int.

You can try facet.method=enum, but that may be too slow.

What release of Solr are you running?

-- Jack Krupansky

-Original Message-
From: Fuad Efendi
Sent: Monday, August 20, 2012 4:34 PM
To: Solr-User@lucene.apache.org
Subject: UnInvertedField limitations

Hi All,


I have a problem... (Yonik, please!) help me, what are the term count limits? I
possibly have 256,000,000 different terms in a field... or 16,000,000?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - :
org.apache.solr.common.SolrException: Too many values for UnInvertedField
faceting on field enrich_keywords_string_mv
at
org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
at
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField
.j
ava:668)
at
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
at
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java
:4
23)
at
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
at
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.ja
va
:85)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHa
nd
ler.java:204)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBas
e.
java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




--
Fuad Efendi
http://www.tokenizer.ca







Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-06 Thread kiran chitturi
Hi,

I am using Solr with DIH and started getting errors when the database
time/date fields are getting imported in to Solr. I have used the date as
the field type but when i looked up at the docs it looks like the date
field does not accept (Thu, 06 Sep 2012 22:32:33 +0000) or (1346976590)
formats.

Also, When i used field_type as 'text_ar' and indexed a line with arabic
text, Solr is displaying some non-ISO characters. It looks like the text is
not being unicoded.

Did anyone face a similar issue ? The Solr date field type does not support
a variety of formats.

Is there any work around to solve this kind of issues ?

Many Thanks,
-- 
Kiran Chitturi


Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-06 Thread Chris Hostetter

: I am using Solr with DIH and started getting errors when the database
: time/date fields are getting imported in to Solr. I have used the date as

what actual error are you getting?

If you are pulling dates from a SQL Date field, which the jdbc driver 
returns as java.util.Date objects, then you shouldn't need to do anything 
special; they should import just fine with solr's TrieDateField.

if you are importing from something where you really need to convert 
yourself (ie: XML files, or string columns in a DB), there is the DIH 
DateFormatTransformer...

https://wiki.apache.org/solr/DataImportHandler#DateFormatTransformer
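
As a rough sketch (the entity, table and column names below are just 
placeholders, not something from your setup), the relevant bit of 
data-config.xml would look something like:

<entity name="tweet" transformer="DateFormatTransformer"
        query="select id, created_at from tweets">
  <!-- hypothetical column; parses e.g. 'Thu, 06 Sep 2012 22:32:33 +0000' -->
  <field column="created_at" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z"/>
</entity>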

: Also, When i used field_type as 'text_ar' and indexed a line with arabic
: text, Solr is displaying some non-ISO characters. It looks like the text is
: not being unicoded.

unicode is not a verb, so i'm not sure what you mean by "displaying some 
non-ISO characters" and "text is not being unicoded" .. please provide a 
specific example of the problem you are facing, including details on what 
the source data is.


-Hoss


Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-06 Thread Hasan Diwan
http://www.electrictoolbox.com/article/mysql/format-date-time-mysql/ hth --
H
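
For the epoch-seconds column, one option (a sketch; the table and column
names are made up) is to let MySQL do the conversion inside the DIH query,
since FROM_UNIXTIME returns a DATETIME that the JDBC driver hands over as a
java.sql.Timestamp:

<entity name="post"
        query="SELECT id, FROM_UNIXTIME(created) AS created_dt FROM posts">
  <!-- created_dt arrives as a Timestamp, which Solr's date field accepts -->
  <field column="created_dt" name="created_dt"/>
</entity>
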
On 6 Sep 2012 17:23, kiran chitturi chitturikira...@gmail.com wrote:

 Hi,

 I am using Solr with DIH and started getting errors when the database
 time/date fields are getting imported in to Solr. I have used the date as
 the field type but when i looked up at the docs it looks like the date
 field does not accept (Thu, 06 Sep 2012 22:32:33 +0000) or (1346976590)
 formats.

 Also, When i used field_type as 'text_ar' and indexed a line with arabic
 text, Solr is displaying some non-ISO characters. It looks like the text is
 not being unicoded.

 Did anyone face a similar issue ? The Solr date field type does not support
 a variety of formats.

 Is there any work around to solve this kind of issues ?

 Many Thanks,
 --
 Kiran Chitturi



Re: solr issue with seaching words

2012-09-06 Thread Chris Hostetter

: I am facing a strange problem. I am searching for word jacke but solr also
: returns result where my description contains 'RCA-Jack/'. If i search
: jacka or jackc or jackd, it works fine and does not return me any
: result which is what i am expecting in this case.

you need to tell us what the analyzers in your fieldType are - if i had to 
guess, i would suspect that you are using a rules-based stemmer that 
converts "jacke" to "jack" in combination with something that splits on 
"-" ... which could be WordDelimiterFilter, or it could be something else.

devil is in the details
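
purely for illustration (this is a hypothetical chain, not your actual 
config), a fieldType along these lines would show exactly that behavior: 
WordDelimiterFilter splits "RCA-Jack" into "RCA" and "Jack", then the 
lowercase filter and stemmer turn a query for "jacke" into "jack", so the 
two match:

<fieldType name="text_stem_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits on the hyphen in RCA-Jack -->
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strips the trailing 'e' from jacke -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

running both the indexed text and the query through the analysis page will 
show where they collapse to the same token.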


-Hoss

Re: EdgeNgramTokenFilter and positions

2012-09-06 Thread Otis Gospodnetic
I don't know for sure, but I remember something around this being a problem, 
yes ... maybe https://issues.apache.org/jira/browse/LUCENE-3907 ?

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 



- Original Message -
 From: Walter Underwood wun...@chegg.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Cc: 
 Sent: Wednesday, September 5, 2012 1:51 PM
 Subject: EdgeNgramTokenFilter and positions
 
 In the analysis page, the n-grams produced by EdgeNgramTokenFilter are at 
 sequential positions. This seems wrong, because an n-gram is associated with 
 a 
 source token at a specific position. It also really messes up phrase matches.
 
 With the source text fleen, these positions and tokens are 
 generated:
 
 1,fl
 2,fle
 3,flee
 4,fleen
 
 Is this a known bug? Fixed? I'm running 3.3.
 
 wunder
 --
 Walter Underwood
 Search Guy
 wun...@chegg.commailto:wun...@chegg.com



Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-06 Thread kiran chitturi
Hi,

Thank you for your response.

The error I am getting is 'org.apache.solr.common.SolrException: Invalid
Date String: '1345743552'.

I think it was being saved as a string in the DB, so I will use the
DateFormatTransformer.

When i index a text field which has arabic and English like this tweet
“@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار الكرافته ؟؟”
#gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
with field_type as 'text_ar' and when i try to see the same field again in
solr, it is shown as below.
RT @AhmedWagih: لو معملناش حاجة �ي الزيادة
السكانية �ي مصر، هنتحول لدولة �قيرة
كثي�ة السكان زي بنجلادش #Egypt #EgyEconomy

The two lines are not the same text, but I have just placed them here as
an example. This is the problem I am facing.

Many Thanks,
Kiran

On Thu, Sep 6, 2012 at 8:29 PM, Chris Hostetter hossman_luc...@fucit.orgwrote:


 : I am using Solr with DIH and started getting errors when the database
 : time/date fields are getting imported in to Solr. I have used the date as

 what actual error are you getting?

 If you are pulling dates from a SQL Date field, that the jdbc driver
 returns as java.util.Date objects, then you shouldn't need to do anything
 special, they should import just fine with solr's TrieDateField.

 if you are importing from something where you really need to convert
 yourself (ie: XML files, or string columns in a DB), there is the DIH
 DateFormatTransformer...

 https://wiki.apache.org/solr/DataImportHandler#DateFormatTransformer

 : Also, When i used field_type as 'text_ar' and indexed a line with arabic
 : text, Solr is displaying some non-ISO characters. It looks like the text
 is
 : not being unicoded.

 unicode is not a verb, so i'm not sure what you mean by displaying some
 non-ISO characters and text is not being unicoded .. please provide a
 specific example of hte problem you are facing, including details on what
 the source data is.


 -Hoss




-- 
Kiran Chitturi


solrcloud setup using tomcat, single machine

2012-09-06 Thread JesseBuesking
Hey guys!

I've been attempting to get solrcloud set up on a ubuntu vm, but I believe
I'm stuck.

I've got tomcat set up, the solr war file in place, and when I browse to
localhost:port/solr, I can see solr.  CHECK

I've set the zoo.cfg to use port 5200.  I can start it up and see it's
running (ls / shows me [zookeeper]). CHECK

*Issues I'm running into*
1. I'm trying to get it so that the example in solr
(example/solr/collection1/conf) will load up; however, it doesn't look like
it's working (from posts online, it looks like I should see a *Cloud* tab
under localhost:port/solr, but it's not appearing).
2. Sometimes it looks like things are still trying to run on port 2181
(default zookeeper port).
3. Some commands I run look like they're trying to use jetty still, even
though I think I have tomcat set up correctly.

I must admit that my background is in C#, so calling java jars passing -D
everywhere is a bit new to me.

What I'd like to do is start up a solr node using zookeeper through tomcat,
but it seems like most guides use jetty and I'm having issues trying to
convert to tomcat.

I don't know what you might need to know to help me out, so I'm going to
give you as much info on my setup as I can.

For reference, the folder structure I've adopted (feel free to make
recommendations) is as follows:
/usr/solr
  /usr/solr/data/conf # conf files
  /usr/solr/solr4.0.0-BETA # extraction from the tar.gz
/usr/tomcat
  /usr/tomcat/tomcat7.0.30 #where tomcat lives
  /usr/tomcat/tomcat7.0.30/data/solr.war # war file from the extracted
tar.gz
  /usr/tomcat/tomcat7.0.30/conf/Catalina/localhost/solr.xml # contains the
following

<Context docBase="/usr/tomcat/tomcat7.0.30/data/solr.war" debug="0"
         crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/usr/solr/data/conf" override="true" />
</Context>

/usr/zookeeper
  /usr/zookeeper/zookeeper3.3.6 # zookeeper extraction
  /usr/zookeeper/zookeeper3.3.6/data # where the data will be stored
  /usr/zookeeper/zookeeper3.3.6/conf/zoo.cfg # contains the following

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/usr/zookeeper/data
# the port at which the clients will connect
clientPort=5200

I've created the file /etc/init.d/tomcat (it contains the following):

# Tomcat auto-start
#
# description: Auto-starts tomcat
# processname: tomcat
# pidfile: /var/run/tomcat.pid

export JAVA_HOME=/opt/java/64/jre1.7.0_07

case $1 in
start)
   JAVA_OPTS="$JAVA_OPTS -DnumShards=1"
   JAVA_OPTS="$JAVA_OPTS -Dbootstrap_confdir=/usr/solr/example/solr/collection1/conf"
   JAVA_OPTS="$JAVA_OPTS -DzkHost=localhost:5200 -DhostPort=8080"   # might not be useful?
   export JAVA_OPTS
   sh /usr/tomcat/tomcat7.0.30/bin/startup.sh
;;
stop)
sh /usr/tomcat/tomcat7.0.30/bin/shutdown.sh
;;
restart)
sh /usr/tomcat/tomcat7.0.30/bin/shutdown.sh
sh /usr/tomcat/tomcat7.0.30/bin/startup.sh
;;
esac
exit 0

I've been using some of these posts as references throughout the day (I've
been at this for several hours):
http://outerthought.org/blog/491-ot.html
http://blog.jesjobom.com/2012/08/configurando-solr-cloud-beta-tomcat-zookeeper-externo/
http://www.slideshare.net/lucenerevolution/how-solrcloud-changes-the-user-experience-in-a-sharded-environment
http://techspry.com/how-to/how-to-install-tomcat-7-and-solr-on-centos-5-5/
http://stackoverflow.com/questions/10026014/apache-solr-configuration-with-tomcat-6-0
... more, but I don't wanna make this any longer than it needs to be

*End goal for testing*
On a single box (for testing), get this to happen:
1. a single zookeeper instance running on port 5200
2. a single tomcat instance running on port 8080
3. a single solr node running, using configs stored in zookeeper

*Eventual production goal*
1. a 3-piece zookeeper ensemble, running on ports 5200,5201,5202
2. one of the following
a. 4 solr nodes, running replicated (to allow 1 failure)
b. 4 solr nodes, running replicated (to allow up to 2 failures)
*. both choices should allow for querying across 2-3 nodes for higher
volume, with potentially several shards per node in case data grows too big
for a single box (entire index doesn't fit on 1 node)

I know this is a lot to digest in a single post, but I'm trying to post what
I've done, what issues I've run into, and where I'm going with this so that
you have enough information to base suggestions/answers on.

Thanks!
- Jesse



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solrcloud-setup-using-tomcat-single-machine-tp4006041.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNgramTokenFilter and positions

2012-09-06 Thread Walter Underwood
Yes, that is exactly the bug. EdgeNgram should work like the synonym filter.
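
That is, with synonym-style behavior the analysis page would presumably show
all of the grams stacked at the position of the source token:

1,fl
1,fle
1,flee
1,fleen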

wunder

On Sep 6, 2012, at 5:51 PM, Otis Gospodnetic wrote:

 I don't know for sure, but I remember something around this being a problem, 
 yes ... maybe https://issues.apache.org/jira/browse/LUCENE-3907 ?
 
 Otis 
 
 Performance Monitoring for Solr / ElasticSearch / HBase - 
 http://sematext.com/spm 
 
 
 
 - Original Message -
 From: Walter Underwood wun...@chegg.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Cc: 
 Sent: Wednesday, September 5, 2012 1:51 PM
 Subject: EdgeNgramTokenFilter and positions
 
 In the analysis page, the n-grams produced by EdgeNgramTokenFilter are at 
 sequential positions. This seems wrong, because an n-gram is associated with 
 a 
 source token at a specific position. It also really messes up phrase matches.
 
 With the source text fleen, these positions and tokens are 
 generated:
 
 1,fl
 2,fle
 3,flee
 4,fleen
 
 Is this a known bug? Fixed? I'm running 3.3.
 
 wunder
 --
 Walter Underwood
 Search Guy
 wun...@chegg.commailto:wun...@chegg.com
 

--
Walter Underwood
wun...@wunderwood.org





Solr request/response lifecycle and logging full response time

2012-09-06 Thread Aaron Daubman
Greetings,

I'm looking to add some additional logging to a solr 3.6.0 setup to
allow us to determine actual time spent by Solr responding to a
request.

We have a custom QueryComponent that sometimes returns 1+ MB of data
and while QTime is always on the order of ~100ms, the response time at
the client can be longer than a second (as measured with JMeter
running on the same server using localhost).

The end goal is to be able to:
1) determine if this large variance in response time is due to Solr,
and if so where (to help determine if/how it can be optimized)
2) determine if the large variance is due to how jetty handles
connections, buffering, etc... (and if so, if/how we can optimize
there)
...or some combination of the two.

As it stands now, the second or so between when the actual query
finishes (as indicated by QTime), when solr gathers all the data to be
returned as requested by fl, and when the client actually receives the
data (even when the client is on localhost) is completely opaque.

My main question:
- Is there any documentation (a diagram / flowchart would be oh so
wonderful) on the lifecycle of a Solr request? So far I've attempted
to modify and rebuild solr, adding logging to SolrCore's execute()
method (this pretty much mirrors QTime), as well as add timing
calculations and logging to various different overridden methods in the
QueryComponent custom extension, all to no avail so far.

What I'm getting at is how to:
- start a stopwatch when solr receives the request from the client
- stop the stopwatch and log the elapsed time right before solr hands
the response body off to Jetty to be delivered back to the client.

Thanks, as always!
 Aaron


Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-06 Thread Gora Mohanty
On 7 September 2012 06:24, kiran chitturi chitturikira...@gmail.com wrote:
[...]

 When i index a text field which has arabic and English like this tweet
 “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار الكرافته ؟؟”
 #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
 with field_type as 'text_ar' and when i try to see the same field again in
 solr, it is shown as below.
 RT @AhmedWagih: لو معملناش حاجة �ي الزيادة
 السكانية �ي مصر، هنتحول لدولة �قيرة
 كثي�ة السكان زي بنجلادش #Egypt #EgyEconomy

 both of the lines do not mean the same, but i have just placed them here as
 an example. This was the problem i am facing.

[...]

The encoding of your input text is being mangled at some point.
Presuming that your original encoding is UTF-8, I would look at
how you are indexing into Solr, and the encoding settings on the
Java container. Solr itself handles UTF-8 perfectly fine, as do
most Java containers if configured properly, so my first suspicion
would be the indexing code.

As it looks like you are pulling from mysql using DIH, check that
the database character set is UTF-8, and that the connection uses
UTF-8.
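
For the DIH side that usually just means something like the following in
data-config.xml (the host, database and credentials here are placeholders):

<!-- useUnicode/characterEncoding are MySQL Connector/J settings;
     note the & has to be written as &amp; inside the XML attribute -->
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/mydb?useUnicode=true&amp;characterEncoding=UTF-8"
            user="solr" password="secret"/>

and, on the MySQL side, checking that the table/column character set is utf8.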

Regards,
Gora


Re: Solr request/response lifecycle and logging full response time

2012-09-06 Thread Aaron Daubman
I'd still love to see a query lifecycle flowchart, but, in case it
helps any future users or in case this is still incorrect, here's how
I'm tackling this:

1) Override default json responseWriter with my own in solrconfig.xml:
<queryResponseWriter name="json"
    class="com.mydomain.solr.component.JSONResponseWriterWithTiming"/>
2) Define JSONResponseWriterWithTiming as just extending
JSONResponseWriter and adding in a log statement:

public class JSONResponseWriterWithTiming extends JSONResponseWriter {
    private static final Logger logger =
            LoggerFactory.getLogger(JSONResponseWriterWithTiming.class);

    @Override
    public void write(Writer writer, SolrQueryRequest req,
                      SolrQueryResponse rsp) throws IOException {
        // write the response first, so the logged time includes serialization
        super.write(writer, req, rsp);
        if (logger.isInfoEnabled()) {
            final long st = req.getStartTime();
            logger.info(String.format("Total solr time for query with QTime: %d is: %d",
                    (int) (rsp.getEndTime() - st),
                    (int) (System.currentTimeMillis() - st)));
        }
    }
}

Please advise if:
- Flowcharts for any solr/lucene-related lifecycles exist
- There is a better way of doing this

Thanks,
  Aaron

On Thu, Sep 6, 2012 at 9:16 PM, Aaron Daubman daub...@gmail.com wrote:
 Greetings,

 I'm looking to add some additional logging to a solr 3.6.0 setup to
 allow us to determine actual time spent by Solr responding to a
 request.

 We have a custom QueryComponent that sometimes returns 1+ MB of data
 and while QTime is always on the order of ~100ms, the response time at
 the client can be longer than a second (as measured with JMeter
 running on the same server using localhost).

 The end goal is to be able to:
 1) determine if this large variance in response time is due to Solr,
 and if so where (to help determine if/how it can be optimized)
 2) determine if the large variance is due to how jetty handles
 connections, buffering, etc... (and if so, if/how we can optimize
 there)
 ...or some combination of the two.

 As it stands now, where the second or so between when the actual query
 finishes as indicated by QTime, when solr gathers all the data to be
 returned as requested by fl, and when the client actually receives the
 data (even when the client is on the localhost) is completely opaque.

 My main question:
 - Is there any documentation (a diagram / flowchart would be oh so
 wonderful) on the lifecycle of a Solr request? So far I've attempted
 to modify and rebuild solr, adding logging to SolrCore's execute()
 method (this pretty much mirrors QTime), as well as add timing
 calculations and logging to various different overridden methods in the
 QueryComponent custom extension, all to no avail so far.

 What I'm getting at is how to:
 - start a stopwatch when solr receives the request from the client
 - stop the stopwatch and log the elapsed time right before solr hands
 the response body off to Jetty to be delivered back to the client.

 Thanks, as always!
  Aaron


Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-06 Thread Lance Norskog
Also, your browser may use a platform default for the encoding instead of 
UTF-8. Some MacOS and Windows browsers have this problem.

Tomcat sometimes needs adjustment to use UTF-8. If you are on tomcat, check 
this:
http://find.searchhub.org/link?url=http://wiki.apache.org/solr/SolrTomcat
http://find.searchhub.org/?q=utf-8#%2Fp%3Asolr%2Fs%3Alucid%2Cwiki
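
If memory serves, the usual fix on Tomcat is adding URIEncoding="UTF-8" to the
HTTP connector in server.xml (the other attributes below are just the stock
example values):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>

Note this only changes how Tomcat decodes request URLs (e.g. queries typed into
the admin UI); it does not affect what DIH reads over the JDBC connection.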

- Original Message -
| From: Gora Mohanty g...@mimirtech.com
| To: solr-user@lucene.apache.org
| Sent: Thursday, September 6, 2012 7:13:40 PM
| Subject: Re: Importing of unix date format from mysql database and dates of 
format 'Thu, 06 Sep 2012 22:32:33 +0000'
| in Solr 4.0
| 
| On 7 September 2012 06:24, kiran chitturi chitturikira...@gmail.com
| wrote:
| [...]
| 
|  When i index a text field which has arabic and English like this
|  tweet
|  “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار
|  الكرافته ؟؟”
|  #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
|  with field_type as 'text_ar' and when i try to see the same field
|  again in
|  solr, it is shown as below.
|  RT @AhmedWagih: لو معملناش حاجة �ي الزيادة
|  السكانية �ي مصر، هنتحول لدولة �قيرة
|  كثي�ة السكان زي بنجلادش #Egypt #EgyEconomy
| 
|  both of the lines do not mean the same, but i have just placed them
|  here as
|  an example. This was the problem i am facing.
| 
| [...]
| 
| The encoding of your input text is being mangled at some point.
| Presuming that your original encoding is UTF-8, I would look at
| how you are indexing into Solr, and the encoding settings on the
| Java container. Solr itself handles UTF-8 perfectly fine, as do
| most Java containers if configured properly, so my first suspicion
| would be the indexing code.
| 
| As it looks like you are pulling from mysql using DIH, check that
| the database character set is UTF-8, and that the connection uses
| UTF-8.
| 
| Regards,
| Gora
| 


Re: Doubts in Result Grouping in solr 3.6.1

2012-09-06 Thread Erick Erickson
Grouping isn't defined for tokenized fields I don't think. See:
http://wiki.apache.org/solr/FieldCollapsing where it says for
group.field:
..The field must currently be single-valued...

Are you sure you don't want faceting?
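
If what you are after is the per-term counts (Company = 359, etc.), plain
field faceting gives you that directly, e.g. (using the core and port from
your URL; the facet values will be the lowercased tokens because of your
LowerCaseFilter):

http://localhost:8080/solr/core1/select?q=*:*&rows=0&facet=true&facet.field=grpValue&facet.limit=-1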

Best
Erick

On Tue, Sep 4, 2012 at 5:27 AM, mechravi25 mechrav...@yahoo.co.in wrote:
 Hi,

 I am currently using solr 3.6.1 version and for indexing data, i am using
 the data import handler for 3.5 because of the reason posted in the
 following forum link
 http://lucene.472066.n3.nabble.com/Dataimport-Handler-in-solr-3-6-1-td4001149.html

 I am trying to achieve result grouping based on a field grpValue which has
 value like this Name XYZ|Company. There are totally 359 docs that were
 indexed and the field grpValue in all the 359 docs contains the word
 Company in its value.

 I gave the following in my schema.xml for splitting the word while indexing
 and querying

 <fieldType name="groupField" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+|\|"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords_new.txt" enablePositionIncrements="true"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+|\|"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords_new.txt" enablePositionIncrements="true"/>
   </analyzer>
 </fieldType>


 I am trying to split the words if I have a single space or an “|” symbol in
 my data when i use the pattern=\s+|\| in PatternTokenizerFactory.

 When I gave the analyze option in solr, the sample value was split into 3
 words Name,XYZ,Company in both my index and query analyzer.

 When i gave the following url

 http://localhost:8080/solr/core1/select/?q=*%3A*&version=2.2&start=0&rows=359&indent=on&group=true&group.field=grpValue&group.limit=0

 I noticed that I have a grouping name called Company which has numFound as
 73 but the particular field grpValue has the word Company in its value
 in all the 359 docs. Ideally, i should have got 359 docs as numFound under
 my group

 <lst name="grouped">
   <lst name="grpValue">
     <int name="matches">359</int>
     <arr name="groups">
       <lst>
         <str name="groupValue">Company</str>
         <result name="doclist" numFound="73" start="0" />
       </lst>

 Please someone guide me as to why only 73 docs is present in that group
 instead of 359.

 I also noticed that when I counted the numFound in all the groups, it
 totalled upto 359.


 Please guide me on this and I am not sure what I am missing. Please let me
 know in case more details is needed.

 Thanks in advance.






 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Doubts-in-Result-Grouping-in-solr-3-6-1-tp4005239.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to preserve source column names in multivalue catch all field

2012-09-06 Thread Erick Erickson
Try using edismax to distribute the search across the fields rather
than using the catch-all field. There's no way that I know of to
reconstruct what field the source was.
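
As a sketch (the field names here are invented, substitute your own), the
edismax defaults could live in a handler in solrconfig.xml:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- list the individual source fields instead of the catch-all -->
    <str name="qf">title^2 description comments</str>
  </lst>
</requestHandler>

With that in place, turning on highlighting over the same fields
(hl=true&hl.fl=...) is one way to see which field a match came from.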

But storing the source fields without indexing them is OK too, it won't affect
searching speed noticeably...

Best
Erick

On Tue, Sep 4, 2012 at 11:52 AM, Kiran Jayakumar kiranjuni...@gmail.com wrote:
 Hi everyone,

 I have got a multivalue catch all field which captures all the text fields.
 Whats the best way to preserve the column information also ? In the UI, I
 need to show field : value type output. Right now, I am storing the
 source fields without indexing. Is there a better way to do it ?

 Thanks


Re: Best practices on managing facets with Code and Name

2012-09-06 Thread Erick Erickson
I don't know of any better way to do this. Conflating the fields is
not _that_ error prone, although it is annoying I agree. I think that
idea is better than storing them separately.

Best
Erick

On Tue, Sep 4, 2012 at 4:58 PM, Alexandre Rafalovitch
arafa...@gmail.com wrote:
 Hello,

 I have some fields that have codes during search and internal
 management but also have user presentable names. Those fields are used
 for facets and I have a problem figuring out the best way to index,
 store, and present them.

 The best example would be a country name. I store the selected
 countries in URL, so want to say ...countries=ag|bo|cd but want those
 names to show up to user and be searchable for by SOLR as Antigua and
 Barbuda, Bolivia (Plurinational State of), and Democratic Republic
 of the Congo. The additional challenge is that ideally I want those
 full names in several languages (localized).

 Currently I am storing country codes in a facetable field and store
 country names in a catch-all names multiValue field. I get the codes
 as facets with counts and then do lookups to match to original names.
 But that does mean I have to look it up during indexing and then,
 second time, during display. It is ok, but I feel there must be a
 better way.

 I also tried conflating the fields into one, e.g. ag Antigua and
 Barbuda which would give both names when I retrieve facets, but
 having to remember that the field is combined one is annoying and
 error prone.

 Similarly, I thought of requesting both codes and names as facets, but
 that probably has a performance impact of double counting.

 Any ideas or known best practices? This feels like a fairly common scenario.

 Thank you,
Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)


Re: Sorting on mutivalued fields still impossible?

2012-09-06 Thread Erick Erickson
And you've illustrated my viewpoint I think by saying
two obvious choices.

I may prefer the first, and you may prefer the second. Neither is
necessarily more correct IMO, it depends on the problem
space. Choosing either one will be unpopular with anyone
who likes the other

And I suspect that 99 times out of 100, someone wanting to sort on
fields with multiple tokens hasn't thought the problem through
carefully. So I favor forcing the person with the use-case where this
is actually _desired_ behavior to do the work to implement it, rather
than making everyone else deal with surprising orderings.

And duplicate entries in the result set get ugly. Say a user sorts
on a field containing 10,000 tokens. Now one doc is repeated
10,000 times in the result set. How many docs are set for
numFound? Faceting? Grouping?

I think your first option is at least easy to explain, but I don't see
it as compelling enough to put the work into it, although I confess
I don't know the guts of how much work it would take to find the
first (and last, don't forget specifying desc) token for each doc

Anyway, that's my story and I'm sticking to it G...

Best
Erick

On Wed, Sep 5, 2012 at 12:54 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 On Fri, 2012-08-31 at 13:35 +0200, Erick Erickson wrote:
 Imagine you have two entries, aardvark and emu in your
 multiValued field. How should that document sort relative to
 another doc with camel and zebra? Any heuristic
 you apply will be wrong for someone else

 I see two obvious choices here:

 1) Sort by the value that is ordered first by the comparator function.
 Doc1: aardvark, (emu)
 Doc2: camel, (zebra)
 This is what Uwe wants to do and it is normally done by preprocessing
 and collapsing to a single value.
 It could be implemented with an ordered multi-valued field cache by
 comparing on the first (or last, in the case of reverse sort) entry for
 each matching document.

 2) Make duplicate entries in the result set, one for each value.
 Doc1: aardvark, (emu)
 Doc2: camel, (zebra)
 Doc1: (aardvark), emu
 Doc2: (camel), zebra
 I have a hard time coming up with a real world use case for this.
 It could be implemented by using a multi-valued field cache as above and
 putting the same document ID into the sliding window sorter once for
 each field value.

 Collapsing this into a single algorithm:
 Step through all IDs. For each ID, give access to the list of field
 values and provide a callback for adding one or more (value, ID)-pairs
 to the sliding windows sorter.


 Are there some other realistic heuristics that I have missed?



Re: SOLR 4.0 / Jetty Security Set Up

2012-09-06 Thread Erick Erickson
Securing Solr pretty much universally requires that you only allow trusted
clients to access the machines directly, usually secured with a firewall
and allowed IP addresses; the admin handler is the least of your worries.

Consider: if you let me ping solr directly, I can do something really
annoying like:
http://localhost:8983/solr/update?stream.body=<delete><query>office:Bridgewater</query></delete>

Best
Erick

On Wed, Sep 5, 2012 at 2:51 AM, Paul Codman snoozes...@gmail.com wrote:
 First time Solr user and I am loving it! I have a standard Solr 4 set up
 running under Jetty. The instructions in the Wiki do not seem to apply to
 Solr 4 (eg mortbay references / section to uncomment not present in xml
 file / etc) - could someone please advise on steps required to secure Solr
 4 and can someone confirm that security operates in relation to new Admin
 interface. Thanks in advance.


Re: use of filter queries in Lucene/Solr Alpha40 and Beta4.0

2012-09-06 Thread Erick Erickson
Guenter:

Are you using SolrCloud or straight Solr? And were you updating in
batches (i.e. updating multiple docs at once from SolrJ by using the
server.add(doclist) form)?

There was a bug in this process that caused various docs to show up
in various shards differently. This has been fixed in 4x, any nightly
build should have the fix.

I'm absolutely grasping at straws here, but this was a weird case that
I happen to know about...

Hossman:
of course this all goes up in smoke if you can reproduce this with any
recent compilation of the code.

FWIW
Erick

On Wed, Sep 5, 2012 at 11:29 PM, guenter.hip...@unibas.ch
guenter.hip...@unibas.ch wrote:
 Hoss, I'm so happy you realized the problem because I was quite worried
 about it!!

 Let me know if I can provide support with testing it.
 The last two days I was busy with migrating a bunch of hosts which should
 -hopefully- be finished today.
 Then I have again the infrastructure for running tests

 Günter


 On 09/05/2012 11:19 PM, Chris Hostetter wrote:

 : Subject: Re: use of filter queries in Lucene/Solr Alpha40 and Beta4.0

 Günter, This is definitely strange

 The good news is, i can reproduce your problem.
 The bad news is, i can reproduce your problem - and i have no idea what's
 causing it.

 I've opened SOLR-3793 to try to get to the bottom of this, and included
 some basic steps to demonstrate the bug using the Solr 4.0-BETA example
 data, but i'm really not sure what the problem might be...

 https://issues.apache.org/jira/browse/SOLR-3793


 -Hoss



 --
 Universität Basel
 Universitätsbibliothek
 Günter Hipler
 Projekt SwissBib
 Schoenbeinstrasse 18-20
 4056 Basel, Schweiz
 Tel.: + 41 (0)61 267 31 12 Fax: ++41 61 267 3103
 e-mailguenter.hip...@unibas.ch
 URL:www.swissbib.org   /http://www.ub.unibas.ch/