Re: Distributed query: strange behavior.

2013-05-27 Thread Luis Cappa Banda
Hi, Erick!

That's it! I'm using a custom implementation of a SolrServer with
distributed behavior that routes queries and updates using an in-house
Round Robin method. The thing is that I'm doing this myself because
I've noticed that duplicate documents appear when using the LBHttpSolrServer
implementation. Last week I modified my implementation to avoid that with
these changes:


   - I have normalized the key field across all documents. Now every document
   indexed must include an *_id_* field that stores the selected key value.
   The value is set with a *copyField*.
   - When I index a new document, an *HttpSolrServer* from the shard list is
   selected using a Round Robin strategy. Then a field called *_shard_* is
   set on the *SolrInputDocument*. That field value records which shard the
   document belongs to.
   - If a document to be indexed/updated already includes the *_shard_* field,
   the shard it belongs to (*HttpSolrServer*) is selected automatically.
   - If a document to be indexed/updated does not include the *_shard_* field,
   the key value is read from the *_id_* field of the *SolrInputDocument*.
   With that key, a distributed search is executed to retrieve the *_shard_*
   field, and with it we can choose the correct shard (*HttpSolrServer*).
   It's not good practice and performance isn't the best, but it's safe.
   (A rough SolrJ sketch of this flow is below.)
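
A rough SolrJ sketch of the flow described above; class and field names are
illustrative, one HttpSolrServer per shard is assumed, and error handling is omitted:

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RoundRobinRouter {
    private final List<HttpSolrServer> shards;            // one client per shard
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinRouter(List<HttpSolrServer> shards) {
        this.shards = shards;
    }

    public void index(SolrInputDocument doc) throws Exception {
        Integer shard = (Integer) doc.getFieldValue("_shard_");
        if (shard == null) {
            // No _shard_ yet: look the key up across the shards; if unseen, pick one round-robin.
            shard = findShardByKey((String) doc.getFieldValue("_id_"));
            if (shard == null) {
                shard = next.getAndIncrement() % shards.size();
            }
            doc.setField("_shard_", shard);
        }
        shards.get(shard).add(doc);                        // always lands on the shard that owns the key
    }

    private Integer findShardByKey(String key) throws Exception {
        for (int i = 0; i < shards.size(); i++) {
            // Ask each shard whether it already holds this key (the expensive lookup step).
            if (shards.get(i).query(new SolrQuery("_id_:" + key).setRows(1))
                    .getResults().getNumFound() > 0) {
                return i;
            }
        }
        return null;
    }
}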

Best Regards,

- Luis Cappa


2013/5/26 Erick Erickson erickerick...@gmail.com

 Valery:

 I share your puzzlement. _If_ you are letting Solr do the document
 routing, and not doing any of the custom routing, then the same unique
 key should be going to the same shard and replacing the previous doc
 with that key.

 But if you're using custom routing, or if you've been experimenting with
 different configurations and didn't start over (in general, if your
 configuration is in an interesting state), this could happen.

 So in the normal case if you have a document with the same key indexed
 in multiple shards, that would indicate a bug. But there are many
 ways, especially when experimenting, that you could have this happen
 which are _not_ a bug. I'm guessing that Luis may be trying the custom
 routing option maybe?

 Best
 Erick

 On Fri, May 24, 2013 at 9:09 AM, Valery Giner valgi...@research.att.com
 wrote:
  Shawn,
 
  How is it possible for more than one document with the same unique key to
  appear in the index, even in different shards?
  Isn't it a bug by definition?
  What am I missing here?
 
  Thanks,
  Val
 
 
  On 05/23/2013 09:55 AM, Shawn Heisey wrote:
 
  On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
 
  I've queried each Solr shard server one by one and the total number of
  documents is correct. However, when I change the rows parameter from 10 to
  100,
  the total numFound changes:
 
  I've seen this problem on the list before, and each time the cause has been
  determined to be documents with the same uniqueKey
  value appearing in more than one shard.
 
  What I think happens here:
 
  With rows=10, you get the top ten docs from each of the three shards,
  and each shard sends its numFound for that query to the core that's
  coordinating the search.  The coordinator adds up numFound, looks
  through those thirty docs, and arranges them according to the requested
  sort order, returning only the top 10.  In this case, there happen to be
  no duplicates.
 
  With rows=100, you get a total of 300 docs.  This time, duplicates are
  found and removed by the coordinator.  I think that the coordinator
  adjusts the total numFound by the number of duplicate documents it
  removed, in an attempt to be more accurate.
 
  I don't know if adjusting numFound when duplicates are found in a
  sharded query is the right thing to do, I'll leave that for smarter
  people.  Perhaps Solr should return a message with the results saying
  that duplicates were found, and if a config option is not enabled, the
  server should throw an exception and return a 4xx HTTP error code.  One
  idea for a config parameter name would be allowShardDuplicates, but
  something better can probably be found.
 
  Thanks,
  Shawn
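
One way to sanity-check that explanation against a live index is to ask each
shard, non-distributed, how many documents carry a given uniqueKey; a minimal
SolrJ sketch where the shard URLs and the key value are placeholders:

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DuplicateKeyCheck {
    public static void main(String[] args) throws Exception {
        List<String> shardUrls = Arrays.asList(
                "http://shard1:8983/solr/core1",
                "http://shard2:8983/solr/core1",
                "http://shard3:8983/solr/core1");
        SolrQuery q = new SolrQuery("id:SOME_KEY");   // the uniqueKey value to check
        q.set("distrib", "false");                    // query only the core you hit, no fan-out
        q.setRows(0);
        for (String url : shardUrls) {
            long n = new HttpSolrServer(url).query(q).getResults().getNumFound();
            System.out.println(url + " -> " + n);     // a key counted on more than one shard is a duplicate
        }
    }
}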
 
 




-- 
- Luis Cappa


Re: Java heap space exception in 4.2.1

2013-05-27 Thread Jam Luo
I have the same problem. At 4.1, a Solr instance could hold 8,000,000,000
docs, but at 4.2.1 an instance can only hold 400,000,000 docs; it will OOM on a
facet query. The facet field is tokenized on whitespace.

May 27, 2013 11:12:55 AM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java
heap space
at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:350)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:851)
at
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.OutOfMemoryError: Java heap space
at
org.apache.lucene.index.DocTermOrds.uninvert(DocTermOrds.java:448)
at
org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
at
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)
at
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:426)
at
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:517)
at
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:252)
at
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:78)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1825)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at

Re: Java heap space exception in 4.2.1

2013-05-27 Thread Jam Luo
I am sorry about a typo: 8,000,000,000 should be 800,000,000


2013/5/27 Jam Luo cooljam2...@gmail.com

 I have the same problem. At 4.1, a Solr instance could hold 8,000,000,000
 docs, but at 4.2.1 an instance can only hold 400,000,000 docs; it will OOM on a
 facet query. The facet field is tokenized on whitespace.

 May 27, 2013 11:12:55 AM org.apache.solr.common.SolrException log
 SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java
 heap space
 at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
 at org.eclipse.jetty.server.Server.handle(Server.java:350)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:851)
 at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.lucene.index.DocTermOrds.uninvert(DocTermOrds.java:448)
 at
 org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
 at
 org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)
 at
 org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:426)
 at
 org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:517)
 at
 org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:252)
 at
 org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:78)
 at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1825)
 at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
 at
 

Re: Why would one not use RemoveDuplicatesTokenFilterFactory?

2013-05-27 Thread Dotan Cohen
On Sun, May 26, 2013 at 8:16 PM, Jack Krupansky j...@basetechnology.com wrote:
 The only comment I was trying to make here is the relationship between the
 RemoveDuplicatesTokenFilterFactory and the KeywordRepeatFilterFactory.

 No, stemmed terms are not considered the same text as the original word. By
 definition, they are a new value for the term text.



I see, for some reason I did not concentrate on this key quote of yours:
"...to remove the tokens that did not produce a stem..."

Now it makes perfect sense.

Thank you, Jack!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Indexing message module

2013-05-27 Thread Gora Mohanty
On 27 May 2013 12:58, Arkadi Colson ark...@smartbit.be wrote:
 Hi

 We would like to index our messaging system. We should be able to search for
 messages for specific recipients, due to performance issues on our databases.
 But the message is of course the same for all recipients, and the message
 text should be saved only once! Is it possible to have some kind of array
 field, usable in the search query, where all the recipients are stored? Or
 should we, for example, use a simple text field which is filled with the
 recipients like this: <field>_434_3432_432_6546_75_8678_</field>
[...]

Why couldn't you use a multi-valued string/int field for the
recipient IDs?

Regards,
Gora
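
A minimal SolrJ sketch of that suggestion; the core URL and the recipient_ids
field name are assumptions, and the field only needs multiValued="true" in schema.xml:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MessageIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/messages");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "msg-1");
        doc.addField("body", "The message text, stored once for all recipients");
        for (int recipient : new int[]{434, 3432, 432, 6546, 75, 8678}) {
            doc.addField("recipient_ids", recipient);   // one value per recipient
        }
        solr.add(doc);
        solr.commit();

        // Every message addressed to recipient 6546:
        System.out.println(solr.query(
                new SolrQuery("*:*").addFilterQuery("recipient_ids:6546")).getResults());
    }
}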


Indexing message module

2013-05-27 Thread Arkadi Colson

Hi

We would like to index our messaging system. We should be able to search
for messages for specific recipients, due to performance issues on our
databases. But the message is of course the same for all recipients, and
the message text should be saved only once! Is it possible to have some
kind of array field, usable in the search query, where all the
recipients are stored? Or should we, for example, use a simple text field
which is filled with the recipients like this:
<field>_434_3432_432_6546_75_8678_</field>


Does anybody have a good idea?

BR,
Arkadi


Re: Overlapping onDeckSearchers=2

2013-05-27 Thread heaven
Hi, thanks for the response. Seems like this is the case, because there are
no other applications that could fire commit/optimize calls. All commits
are triggered by Solr and the optimize is triggered by a cron task.

Because of all that, it looks like a bug in Solr. It probably should not run
commits while an optimize is in progress, or should commit before the
optimize. Or it should not run an optimize while a commit is in progress.
I'm not sure exactly which scenario happens.

Is there anything I can do to fix this? I don't have much memory on this
server, and since this could double the RAM used, it is a serious
issue for me.

Best,
Alex





Re: Indexing message module

2013-05-27 Thread Arkadi Colson

Yes indeed... Thx!

On 05/27/2013 09:33 AM, Gora Mohanty wrote:

On 27 May 2013 12:58, Arkadi Colson ark...@smartbit.be wrote:

Hi

We would like to index our messaging system. We should be able to search for
messages for specific recipients, due to performance issues on our databases.
But the message is of course the same for all recipients, and the message
text should be saved only once! Is it possible to have some kind of array
field, usable in the search query, where all the recipients are stored? Or
should we, for example, use a simple text field which is filled with the
recipients like this: <field>_434_3432_432_6546_75_8678_</field>

[...]

Why couldn't you use a multi-valued string/int field for the
recipient IDs?

Regards,
Gora






RE: Tika: How can I import automatically all metadata without specifying them explicitly

2013-05-27 Thread Gian Maria Ricci
Thanks for the help.

@Alexandre: Thanks for the suggestion, I'll try to use an
ExtractingRequestHandler, I thought that I was missing some DIH option :).

@Erick: I'm interested in knowing them all to do various forms of analysis. I
have documents coming from heterogeneous sources and I'm interested in
searching inside the content, but also in being able to extract all possible
metadata. I'm working in .NET, so it is useful to let Tika do everything
for me directly in Solr and then retrieve all metadata for matched
documents.

Thanks again to everyone. 

--
Gian Maria Ricci
Mobile: +39 320 0136949



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Sunday, May 26, 2013 5:30 PM
To: solr-user@lucene.apache.org; Gian Maria Ricci
Subject: Re: Tika: How can I import automatically all metadata without
specifying them explicitly

In addition to Alexandre's comment:

bq:  ...I'd like to import in my index all metadata

Be a little careful here, this isn't actually very useful in my experience.
Sure
it's nice to have all that data in the index, but... how do you search it
meaningfully?

Consider that some doc may have an "author" metadata field. Another may have
a "last editor" field. Yet another may have a "main author" field. If you
add all these under their own field names, what do you do to search for author?
Somehow you have to create a mapping between the various metadata names and
something that's searchable, so why not do this at index time?

Not to mention I've seen this done and the result may be literally hundreds
of different metadata fields which are not very useful.

All that said, it may be perfectly valid to index them all, but before going
there it's worth considering whether the result is actually _useful_.

Best
Erick


On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci
alkamp...@nablasoft.comwrote:

 Hi to everyone,

 I've configured import of a document folder with
 FileListEntityProcessor, and everything went smoothly on the first try, but
 I have a simple question. I'm able to map metadata without any
 problem, but I'd like to import all metadata into my index, not only
 the fields I've configured with field nodes. In this example I've imported
 Author and title, but I do not know in advance which metadata a
 document could have, and I wish to have all of them inside my
 index.

 Here is my import config. It is my first try at importing with Tika,
 and I'm probably missing something simple.

 <dataConfig>
   <dataSource type="BinFileDataSource" />
   <document>
     <entity name="files" dataSource="null" rootEntity="false"
             processor="FileListEntityProcessor"
             baseDir="c:/temp/docs"
             fileName=".*\.(doc)|(pdf)|(docx)"
             onError="skip"
             recursive="true">

       <field column="file" name="id" />
       <field column="fileAbsolutePath" name="path" />
       <field column="fileSize" name="size" />
       <field column="fileLastModified" name="lastModified" />

       <entity name="documentImport"
               processor="TikaEntityProcessor"
               url="${files.fileAbsolutePath}"
               format="text">
         <field column="file" name="fileName"/>
         <field column="Author" name="author" meta="true"/>
         <field column="title" name="title" meta="true"/>
         <field column="text" name="text"/>
       </entity>
     </entity>
   </document>
 </dataConfig>

 --

 Gian Maria Ricci

 Mobile: +39 320 0136949

 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
 http://www.linkedin.com/in/gianmariaricci
 https://twitter.com/alkampfer
 http://feeds.feedburner.com/AlkampferEng



Application connecting to SOLR cloud

2013-05-27 Thread sathish_ix
Hi,

We have set up SolrCloud with ZooKeeper.

 Zookeeper (localhost:8000)
 1 shard   (localhost:9000)
 2 Replica (localhost:9001,localhost:9002)

 Question:
  We load the Solr index from a relational DB using DIH. Based on the SolrCloud
documentation, the request to load the data will be forwarded to the leader.

Collection
  Shard1 - Leader1 (localhost:9000)
           |_ Replica1 (localhost:9001)
           |_ Replica2 (localhost:9002)

1. To identify the leader (here localhost:9000) from ZooKeeper, do we need to
read the clusterstate.json znode and look for leader=true by connecting to
ZooKeeper?

2. Do we need to keep an external load balancer for (localhost:9000, 9001, 9002)
to route requests?

Is there any other way?

Thanks,
Sathish

   






  

  






Re: Overlapping onDeckSearchers=2

2013-05-27 Thread Jack Krupansky
The intent is that optimize is obsolete and should no longer be used, 
especially with tiered merge policy running. In other words, merging should 
be occurring on the fly in Lucene now. What release of Solr are you running?


-- Jack Krupansky

-Original Message- 
From: heaven

Sent: Monday, May 27, 2013 3:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Overlapping onDeckSearchers=2

Hi, thanks for the response. Seems like this is the case, because there are
no other applications that could fire commit/optimize calls. All commits
are triggered by Solr and the optimize is triggered by a cron task.

Because of all that, it looks like a bug in Solr. It probably should not run
commits while an optimize is in progress, or should commit before the
optimize. Or it should not run an optimize while a commit is in progress.
I'm not sure exactly which scenario happens.

Is there anything I can do to fix this? I don't have much memory on this
server, and since this could double the RAM used, it is a serious
issue for me.

Best,
Alex






Re: How can I import automatically all metadata without specifying them explicitly

2013-05-27 Thread Jack Krupansky
Setting the uprefix parameter of SolrCell (ERH) to something like attr_
will result in all metadata attributes that are not named in the Solr
schema being indexed with the attr_ prefix prepended to their metadata
attribute names. For example,


curl "http://localhost:8983/solr/update/extract?literal.id=doc-1&commit=true&uprefix=attr_" -F "my.pdf=@my.pdf"

Once you figure out which of the metadata you want to keep, either add those
metadata attribute names to your schema, or
add explicit SolrCell field mappings for each piece of metadata:
fmap.my-field=metadata-name.


-- Jack Krupansky
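
Roughly the same request can be sent from SolrJ as well; a sketch assuming the
stock /update/extract handler, with the core URL, file name and literal id as placeholders:

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithUprefix {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("my.pdf"), "application/pdf");
        req.setParam("literal.id", "doc-1");
        req.setParam("uprefix", "attr_");   // unmapped metadata ends up in attr_* fields,
                                            // which the example schema catches with an attr_* dynamicField
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solr.request(req);
    }
}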

-Original Message- 
From: Gian Maria Ricci

Sent: Monday, May 27, 2013 4:21 AM
To: solr-user@lucene.apache.org
Subject: RE: Tika: How can I import automatically all metadata without 
specifying them explicitly


Thanks for the help.

@Alexandre: Thanks for the suggestion, I'll try to use an
ExtractingRequestHandler, I thought that I was missing some DIH option :).

@Erick: I'm interested in knowing them all to do various forms of analysis. I
have documents coming from heterogeneous sources and I'm interested in
searching inside the content, but also in being able to extract all possible
metadata. I'm working in .NET, so it is useful to let Tika do everything
for me directly in Solr and then retrieve all metadata for matched
documents.

Thanks again to everyone.

--
Gian Maria Ricci
Mobile: +39 320 0136949



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, May 26, 2013 5:30 PM
To: solr-user@lucene.apache.org; Gian Maria Ricci
Subject: Re: Tika: How can I import automatically all metadata without
specifying them explicitly

In addition to Alexandre's comment:

bq:  ...I'd like to import in my index all metadata

Be a little careful here, this isn't actually very useful in my experience.
Sure
it's nice to have all that data in the index, but... how do you search it
meaningfully?

Consider that some doc may have an "author" metadata field. Another may have
a "last editor" field. Yet another may have a "main author" field. If you
add all these under their own field names, what do you do to search for author?
Somehow you have to create a mapping between the various metadata names and
something that's searchable, so why not do this at index time?

Not to mention I've seen this done and the result may be literally hundreds
of different metadata fields which are not very useful.

All that said, it may be perfectly valid to index them all, but before going
there it's worth considering whether the result is actually _useful_.

Best
Erick


On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci
alkamp...@nablasoft.comwrote:


Hi to everyone,

I've configured import of a document folder with
FileListEntityProcessor, and everything went smoothly on the first try, but
I have a simple question. I'm able to map metadata without any
problem, but I'd like to import all metadata into my index, not only
the fields I've configured with field nodes. In this example I've imported
Author and title, but I do not know in advance which metadata a
document could have, and I wish to have all of them inside my
index.

Here is my import config. It is my first try at importing with Tika,
and I'm probably missing something simple.

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/temp/docs"
            fileName=".*\.(doc)|(pdf)|(docx)"
            onError="skip"
            recursive="true">

      <field column="file" name="id" />
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />

      <entity name="documentImport"
              processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}"
              format="text">
        <field column="file" name="fileName"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

--

Gian Maria Ricci

Mobile: +39 320 0136949

http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
http://www.linkedin.com/in/gianmariaricci

RE: Tika: How can I import automatically all metadata without specifying them explicitly

2013-05-27 Thread Alexandre Rafalovitch
Standalone Tika can also run in a network server mode. That increases data
round trips but gives you more options. Even from .NET.

Regards,
  Alex
On 27 May 2013 04:22, Gian Maria Ricci alkamp...@nablasoft.com wrote:

 Thanks for the help.

 @Alexandre: Thanks for the suggestion, I'll try to use an
 ExtractingRequestHandler, I thought that I was missing some DIH option :).

 @Erick: I'm interested in knowing them all to do various forms of analysis. I
 have documents coming from heterogeneous sources and I'm interested in
 searching inside the content, but also in being able to extract all possible
 metadata. I'm working in .NET, so it is useful to let Tika do everything
 for me directly in Solr and then retrieve all metadata for matched
 documents.

 Thanks again to everyone.

 --
 Gian Maria Ricci
 Mobile: +39 320 0136949



 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Sunday, May 26, 2013 5:30 PM
 To: solr-user@lucene.apache.org; Gian Maria Ricci
 Subject: Re: Tika: How can I import automatically all metadata without
 specifying them explicitly

 In addition to Alexandre's comment:

 bq:  ...I'd like to import in my index all metadata

 Be a little careful here, this isn't actually very useful in my experience.
 Sure
 it's nice to have all that data in the index, but... how do you search it
 meaningfully?

 Consider that some doc may have an "author" metadata field. Another may
 have
 a "last editor" field. Yet another may have a "main author" field. If you
 add all these under their own field names, what do you do to search for author?
 Somehow you have to create a mapping between the various metadata names and
 something that's searchable, so why not do this at index time?

 Not to mention I've seen this done and the result may be literally hundreds
 of different metadata fields which are not very useful.

 All that said, it may be perfectly valid to index them all, but before going
 there it's worth considering whether the result is actually _useful_.

 Best
 Erick


 On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci
 alkamp...@nablasoft.comwrote:

  Hi to everyone,

  I've configured import of a document folder with
  FileListEntityProcessor, and everything went smoothly on the first try, but
  I have a simple question. I'm able to map metadata without any
  problem, but I'd like to import all metadata into my index, not only
  the fields I've configured with field nodes. In this example I've imported
  Author and title, but I do not know in advance which metadata a
  document could have, and I wish to have all of them inside my
  index.

  Here is my import config. It is my first try at importing with Tika,
  and I'm probably missing something simple.
 
  <dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
      <entity name="files" dataSource="null" rootEntity="false"
              processor="FileListEntityProcessor"
              baseDir="c:/temp/docs"
              fileName=".*\.(doc)|(pdf)|(docx)"
              onError="skip"
              recursive="true">

        <field column="file" name="id" />
        <field column="fileAbsolutePath" name="path" />
        <field column="fileSize" name="size" />
        <field column="fileLastModified" name="lastModified" />

        <entity name="documentImport"
                processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}"
                format="text">
          <field column="file" name="fileName"/>
          <field column="Author" name="author" meta="true"/>
          <field column="title" name="title" meta="true"/>
          <field column="text" name="text"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

  --

  Gian Maria Ricci

  Mobile: +39 320 0136949

  http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
  http://www.linkedin.com/in/gianmariaricci
  https://twitter.com/alkampfer
  http://feeds.feedburner.com/AlkampferEng

Re: Java heap space exception in 4.2.1

2013-05-27 Thread Erick Erickson
400M docs is quite a large number of documents for a single piece of
hardware, and
if you're faceting over a large number of unique values, this will
chew up memory.

So it's not surprising that you're seeing OOMs; I suspect you just have too many
documents on a single machine.

Best
Erick
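
Not discussed in this thread, but when DocTermOrds/UnInvertedField is what runs
out of heap, facet.method=enum is one knob worth knowing: it walks the terms
against the filterCache instead of un-inverting the whole field into memory, at
the price of a set intersection per unique term. A hedged SolrJ sketch with an
assumed core URL and field name:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class EnumFacetQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core1");

        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("my_tokenized_field");      // the whitespace-tokenized facet field
        q.set("facet.method", "enum");              // term enumeration + filterCache, no UnInvertedField
        q.set("facet.enum.cache.minDf", "100");     // don't cache filters for very rare terms
        System.out.println(solr.query(q).getFacetField("my_tokenized_field").getValues());
    }
}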


On Mon, May 27, 2013 at 3:11 AM, Jam Luo cooljam2...@gmail.com wrote:
 I am sorry about a typo: 8,000,000,000 should be 800,000,000


 2013/5/27 Jam Luo cooljam2...@gmail.com

 I have the same problem. At 4.1, a Solr instance could hold 8,000,000,000
 docs, but at 4.2.1 an instance can only hold 400,000,000 docs; it will OOM on a
 facet query. The facet field is tokenized on whitespace.

 May 27, 2013 11:12:55 AM org.apache.solr.common.SolrException log
 SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java
 heap space
 at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
 at org.eclipse.jetty.server.Server.handle(Server.java:350)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:851)
 at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.lucene.index.DocTermOrds.uninvert(DocTermOrds.java:448)
 at
 org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
 at
 org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)
 at
 org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:426)
 at
 org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:517)
 at
 org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:252)
 at
 org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:78)
 at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1825)
 at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
 at
 

Re: Application connecting to SOLR cloud

2013-05-27 Thread Erick Erickson
There's no requirement to send documents to the leader; send updates to
any node in the system. The documents will be automatically forwarded to
the appropriate leaders.

You may be getting confused by the leader-aware Solr client stuff. It's
slightly more efficient to send updates to the leader directly and save the
extra hop, but it's not a requirement at all.


You don't need the external load balancer at all, internally Solr does its own
load balancing. That said, if your external app connects to a single node and
that node goes down, regardless of any internal load balancing, it's a single
point of failure so having the external load balancer can still make sense.

Best
Erick
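
A minimal sketch of the ZooKeeper-aware SolrJ client, which reads the cluster
state itself, so neither a manual leader lookup nor an extra load balancer is
needed for indexing; the ZooKeeper address and collection name are taken from
the setup above and may differ:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudClientExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("localhost:8000"); // ZooKeeper host:port, not a Solr URL
        solr.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        solr.add(doc);      // the client picks a live node from clusterstate.json and forwards as needed
        solr.commit();
        solr.shutdown();
    }
}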


On Mon, May 27, 2013 at 6:46 AM, sathish_ix skandhasw...@inautix.co.in wrote:
 Hi,

 We have setup the SOLR cloud with zookeeper.

  Zookeeper (localhost:8000)
  1 shard   (localhost:9000)
  2 Replica (localhost:9001,localhost:9002)

  Question:
   We load the Solr index from a relational DB using DIH. Based on the SolrCloud
 documentation, the request to load the data will be forwarded to the leader.

 Collection
   Shard1 - Leader1 (localhost:9000)
            |_ Replica1 (localhost:9001)
            |_ Replica2 (localhost:9002)

 1. To identify the leader (here localhost:9000) from ZooKeeper, do we need to
 read the clusterstate.json znode and look for leader=true by connecting to
 ZooKeeper?

 2. Do we need to keep an external load balancer for (localhost:9000, 9001, 9002)
 to route requests?

 Is there any other way?

 Thanks,
 Sathish

















Re: Note on The Book

2013-05-27 Thread Koji Sekiguchi

Hi Jack,

I'd like to ask as a person who contributed a case study article about
"Automatically acquiring synonym knowledge from Wikipedia" to the book.

(13/05/24 8:14), Jack Krupansky wrote:

To those of you who may have heard about the Lucene/Solr book that I and two 
others are writing on Lucene and Solr, some bad and good news. The bad news: 
The book contract with O’Reilly has been canceled. The good news: I’m going to 
proceed with self-publishing (possibly on Lulu or even Amazon) a somewhat 
reduced scope Solr-only Reference Guide (with hints of Lucene). The scope of 
the previous effort was too great, even for O’Reilly – a book larger than 800 
pages (or even 600) that was heavy on reference and lighter on “guide” just 
wasn’t fitting in with their traditional “guide” model. In truth, Solr is just 
too complex for a simple guide that covers it all, let alone Lucene as well.


Will the reduced Solr-only reference guide include my article?
If not (for now I think not, because my article is a Lucene case study,
not Solr), I'd like to put it out on my blog or somewhere.

BTW, those who want to know how to acquire synonym knowledge from Wikipedia,
the summary is available at slideshare:

http://www.slideshare.net/KojiSekiguchi/wikipediasolr

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


A strange RemoteSolrException

2013-05-27 Thread Hans-Peter Stricker
Hello,

I'm writing my first little SolrJ program, but can't get it running because of
a RemoteSolrException: Server at http://localhost:8983/solr returned non ok
status:404

The server is definitely running and the url works in the browser.

I am working with Solr 4.3.0.

This is my source code:

public static void main(String[] args) {

    String url = "http://localhost:8983/solr";
    SolrServer server;

    try {
        server = new HttpSolrServer(url);
        server.ping();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

with the stack trace:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at 
http://localhost:8983/solr returned non ok status:404, message:Not Found
 at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
 at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
 at org.apache.solr.client.solrj.request.SolrPing.process(SolrPing.java:62)
 at org.apache.solr.client.solrj.SolrServer.ping(SolrServer.java:293)
 at de.epublius.blogindexer.App.main(App.java:47)

If I call server.shutdown(), there is no such exception, but it occurs for
almost all other SolrServer methods.

What am I doing wrong?

Thanks in advance

Hans-Peter

Re: Note on The Book

2013-05-27 Thread Jack Krupansky
If you would like to Solr-ize your contribution, that would be great. The 
focus of the book will be hard-core Solr.


-- Jack Krupansky

-Original Message- 
From: Koji Sekiguchi

Sent: Monday, May 27, 2013 8:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Note on The Book

Hi Jack,

I'd like to ask as a person who contributed a case study article about
"Automatically acquiring synonym knowledge from Wikipedia" to the book.

(13/05/24 8:14), Jack Krupansky wrote:
To those of you who may have heard about the Lucene/Solr book that I and 
two others are writing on Lucene and Solr, some bad and good news. The bad 
news: The book contract with O’Reilly has been canceled. The good news: I’m 
going to proceed with self-publishing (possibly on Lulu or even Amazon) a 
somewhat reduced scope Solr-only Reference Guide (with hints of Lucene). 
The scope of the previous effort was too great, even for O’Reilly – a book 
larger than 800 pages (or even 600) that was heavy on reference and 
lighter on “guide” just wasn’t fitting in with their traditional “guide” 
model. In truth, Solr is just too complex for a simple guide that covers 
it all, let alone Lucene as well.


Will the reduced Solr-only reference guide include my article?
If not (for now I think it is not because my article is for Lucene case 
study,

not Solr), I'd like to put it out on my blog or somewhere.

BTW, those who want to know how to acquire synonym knowledge from Wikipedia,
the summary is available at slideshare:

http://www.slideshare.net/KojiSekiguchi/wikipediasolr

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html 



Re: index multiple files into one index entity

2013-05-27 Thread Alexandre Rafalovitch
You did not open source it by any chance? :-)

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sun, May 26, 2013 at 8:23 PM, Yury Kats yuryk...@yahoo.com wrote:
 That's exactly what happens. Each streams goes into a separate document.
 If all streams share the same unique id parameter, the last stream
 will overwrite everything.

 I've asked this same question last year. Got no responses and ended up
 writing my own UpdateRequestProcessor.

 See http://tinyurl.com/phhqsb4

 On 5/26/2013 11:15 AM, Alexandre Rafalovitch wrote:
 If I understand correctly, the issue is:
 1) The client provides multiple content streams and expects Tika to
 parse all of them and stick all the extracted content into one big
 SolrDoc.
 2) Tika (looking at the load() method of ExtractingDocumentLoader.java,
 Github link: http://bit.ly/12GsDl9 ) does not actually expect that
 its load method may be called multiple times and therefore happily
 submits the document at the end of that call. It probably submits a new
 document for each content source, which probably means it just
 overwrites the same doc over and over again.

 If I am right, then we have a bug in Tika handler's expectations (of
 single load() call). The next step would be to put together a very
 simple use case and open a Jira case with it.

 Regards,
Alex.
 P.s. I am not a Solr code wrangler, so this MAY be completely wrong.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)


 On Sun, May 26, 2013 at 10:46 AM, Erick Erickson
 erickerick...@gmail.com wrote:
 I'm still not quite getting the issue. Separate requests (i.e. any
 addition of a SolrInputDocument) are treated as a separate document.
 There's no notion of append the contents of one doc to another based
 on ID, unless you're doing atomic updates.

 And Tika takes some care to index separate files as separate documents.

 Now, if you don't need these to share the same uniqueKey, you might
 index them as separate documents and include a field that lets you
 associate these documents somehow (see the group/field collapsing Wiki
 page).

 But otherwise, I think I need a higher-level view of what you're
 trying to accomplish to make an intelligent comment.

 Best
 Erick

 On Thu, May 23, 2013 at 9:05 AM,  mark.ka...@t-systems.com wrote:
 Hello Erick,
 Thank you for your fast answer.

 Maybe I didn't express my question clearly.
 I want to index many files into one index entity. I want the same behavior
 as any other multivalued field, which can be indexed under one unique id.
 So I think every ContentStreamUpdateRequest represents one index entity,
 doesn't it? And with each addContentStream I would add one file to this
 entity.

 Thank you and with best Regards
 Mark




  -----Original Message-----
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Thursday, 23 May 2013 14:11
  To: solr-user@lucene.apache.org
  Subject: Re: index multiple files into one index entity

 I just skimmed your post, but I'm responding to the last bit.

 If you have uniqueKey defined as id in schema.xml then no, you cannot 
 have multiple documents with the same ID.
 Whenever a new doc comes in it replaces the old doc with that ID.

 You can remove the uniqueKey definition and do what you want, but there 
 are very few Solr installations with no uniqueKey and it's probably a 
 better idea to make your id's truly unique.

 Best
 Erick

 On Thu, May 23, 2013 at 6:14 AM,  mark.ka...@t-systems.com wrote:
 Hello solr team,

  I want to index multiple files into one Solr index entity, with the
  same id. We are using Solr 4.1.


  I tried it with the following source fragment:

  public void addContentSet(ContentSet contentSet) throws SearchProviderException {

      ...

      ContentStreamUpdateRequest csur =
          generateCSURequest(contentSet.getIndexId(), contentSet);
      String indexId = contentSet.getIndexId();

      ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
      server.request(csur);

      ...
  }

  private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
          throws IOException {
      ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());

      ModifiableSolrParams parameters = csur.getParams();
      if (parameters == null) {
          parameters = new ModifiableSolrParams();
      }

      parameters.set("literalsOverride", false);

      // maps the tika default content attribute to 

Re: index multiple files into one index entity

2013-05-27 Thread Yury Kats
No, the implementation was very specific to my needs.

On 5/27/2013 8:28 AM, Alexandre Rafalovitch wrote:
 You did not open source it by any chance? :-)
 
 Regards,
Alex.



using solr for web page classification

2013-05-27 Thread Rajesh Nikam
Hello,

I am working on the implementation of a system to categorize URLs/web pages.

I would have categories like:

Adult  Health Business
Arts   Home   Science

I am looking at how Lucene/Solr could help me achieve this.
I came across links that mention MoreLikeThis could be of help.

I found LucidWorks Search helpful, as it installs Jetty and Solr in a few
clicks.

Importing data and querying were also straightforward.

 My questions are:

 - I have a pre-defined list of categories, for each of which I would have web
pages + documents that could be stored in the Solr index, assigned with a category

 - I would have input processors run on each page, like:

     Text extractor (from HTML, PDF, Office formats)
     Text language detection
     Standard text processors - stemming, stopword removal, lowercasing, etc.
     Title extractor
     Summary extractor
     Field mapping
     Header and footer remover

 - All these documents could be processed and stored in the Solr index with a
known category

 - When a new request comes in, I need to run an MLT or Solr query based on the
content of the web page and get similar documents.
 Based on the results I could reply back with the top 3 categories.


 Please let me know if using Solr for this problem is the correct way?
 If yes, how should I go about forming the query based on the web page contents?

Thanks
Rajesh
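
A hedged SolrJ sketch of the MLT step, assuming an /mlt request handler is
enabled in solrconfig.xml and that the pre-classified documents carry a "text"
body field and a "category" label field; all names and the core URL are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ClassifyByMlt {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/pages");
        String pageText = "...cleaned text of the page to classify...";

        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/mlt");
        q.set("stream.body", pageText);      // classify raw text that is not in the index
        q.set("mlt.fl", "text");             // field(s) the similarity is computed on
        q.set("fl", "id,category,score");
        q.setRows(10);

        QueryResponse rsp = solr.query(q, SolrRequest.METHOD.POST);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.get("category") + " " + doc.get("score"));
            // tally the categories of the top hits here and answer with the top 3
        }
    }
}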


sourceId of JMX

2013-05-27 Thread 菅沼 嘉一
Hello

Our team faced a problem with the sourceId JMX attribute when getting JMX
information from the Tomcat manager.

Command:
curl http://localhost:${PORT}/manager/jmxproxy?qry=solr:type=documentCache,* 

Here is the error log (tomcat/manager log).
---
2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log
JMXProxy: Error getting attribute 
solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId 
javax.management.AttributeNotFoundException: sourceId
---

Solr ver. : 4.1.0

I think this error occurs when JMX cannot get the sourceId.


BTW Let's look at this issue.
https://issues.apache.org/jira/browse/SOLR-3329

It was decided to drop getSourceId() in that issue.

But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean,
staticStats.add("sourceId") is still defined at line 211.

http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202
--
l.211  staticStats.add("sourceId");
--

Maybe this error comes from this inconsistency.
This problem is not critical, but I think it is inconsistent.

1. Does anyone know why staticStats.add("sourceId") still remains in
SolrDynamicMBean?
Do you have any idea?

2. Has anyone faced such an error? How did you solve it?


Thank you.

Regards
suganuma



Solr/Lucene Analayzer That Writes To File

2013-05-27 Thread Furkan KAMACI
Hi;

I want to use Solr for academic research. One step of my project is to
store tokens in a file (I will store them in a database later) and I
don't want to index them. For that kind of purpose, should I use core
Lucene or Solr? Is there an example of writing a custom analyzer that just
stores tokens in a file?


Re: Solr/Lucene Analayzer That Writes To File

2013-05-27 Thread Rafał Kuć
Hello!

Take a look at custom postings formats. For example,
here is a nice post showing what you can do with the Lucene SimpleText
codec:
http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html

However, please remember that it is not advised to use that codec in a
production environment.

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

 Hi;

 I want to use Solr for academic research. One step of my project is to
 store tokens in a file (I will store them in a database later) and I
 don't want to index them. For that kind of purpose, should I use core
 Lucene or Solr? Is there an example of writing a custom analyzer that just
 stores tokens in a file?
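
For the "store tokens in a file, don't index them" part, a minimal Lucene-only
sketch; the analyzer, field name and output file are placeholders for whatever
the research needs:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenDumper {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);           // any analysis chain works here
        BufferedWriter out = new BufferedWriter(new FileWriter("tokens.txt")); // nothing is indexed
        TokenStream ts = analyzer.tokenStream("text", new StringReader("Some text to tokenize"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                          // mandatory before consuming the stream
        while (ts.incrementToken()) {
            out.write(term.toString());      // one token per line
            out.newLine();
        }
        ts.end();
        ts.close();
        out.close();
        analyzer.close();
    }
}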



Re: Overlapping onDeckSearchers=2

2013-05-27 Thread Yonik Seeley
On Mon, May 27, 2013 at 7:11 AM, Jack Krupansky j...@basetechnology.com wrote:
 The intent is that optimize is obsolete and should no longer be used

That's incorrect.

People need to understand the cost of optimize, and that its use is optional.
It's up to the developer to figure out if the benefits of calling
optimize outweigh the costs in their particular situation.

The wiki currently says:

An optimize is like a hard commit except that it forces all of the
index segments to be merged into a single segment first. Depending on
the use cases, this operation should be performed infrequently (like
nightly), if at all, since it is very expensive and involves reading
and re-writing the entire index. Segments are normally merged over
time anyway (as determined by the merge policy), and optimize just
forces these merges to occur immediately.


-Yonik
http://lucidworks.com


Re: A strange RemoteSolrException

2013-05-27 Thread Shalin Shekhar Mangar
I downloaded solr 4.3.0, started it up with java -jar start.jar (from
inside the example directory) and executed your program. No exceptions are
thrown. Is there something you did differently?


On Mon, May 27, 2013 at 5:45 PM, Hans-Peter Stricker
stric...@epublius.dewrote:

 Hello,

  I'm writing my first little SolrJ program, but can't get it running
  because of a RemoteSolrException: Server at
  http://localhost:8983/solr returned non ok status:404

 The server is definitely running and the url works in the browser.

 I am working with Solr 4.3.0.

 This is my source code:

  public static void main(String[] args) {

      String url = "http://localhost:8983/solr";
      SolrServer server;

      try {
          server = new HttpSolrServer(url);
          server.ping();
      } catch (Exception ex) {
          ex.printStackTrace();
      }
  }

 with the stack trace:

 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 Server at http://localhost:8983/solr returned non ok status:404,
 message:Not Found
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
  at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
  at org.apache.solr.client.solrj.request.SolrPing.process(SolrPing.java:62)
  at org.apache.solr.client.solrj.SolrServer.ping(SolrServer.java:293)
  at de.epublius.blogindexer.App.main(App.java:47)

 If I call server.shutdown(), there is no such exception, but for almost
 all other SolrServer-methods.

 What am I doing wrong?

 Thanks in advance

 Hans-Peter




-- 
Regards,
Shalin Shekhar Mangar.
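
One thing worth checking on the original setup: a 404 from ping() often just
means the base URL lacks a core name and no default core is configured (the
stock example answers at /solr because collection1 is its default core). A
minimal sketch pinging an explicit core, where collection1 is an assumption:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PingCore {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        System.out.println(server.ping().getStatus());   // 0 means the core answered
        server.shutdown();
    }
}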


RE: Overlapping onDeckSearchers=2

2013-05-27 Thread Markus Jelsma
forceMerge is very useful if you delete a significant portion of an index. It
can take a very long time before any merge policy decides to finally merge the
deletes away, especially for a static or infrequently changing index. Also, having
a lot of deleted docs in the index can be an issue if your similarity uses
maxDoc for IDF.

The cost is also much lower when using SSDs; forceMerging a 1 GB core is a
matter of seconds.

-Original message-
 From:Yonik Seeley yo...@lucidworks.com
 Sent: Mon 27-May-2013 15:47
 To: solr-user@lucene.apache.org
 Subject: Re: Overlapping onDeckSearchers=2
 
 On Mon, May 27, 2013 at 7:11 AM, Jack Krupansky j...@basetechnology.com 
 wrote:
  The intent is that optimize is obsolete and should no longer be used
 
 That's incorrect.
 
  People need to understand the cost of optimize, and that its use is optional.
  It's up to the developer to figure out if the benefits of calling
  optimize outweigh the costs in their particular situation.
 
 The wiki currently says:
 
 An optimize is like a hard commit except that it forces all of the
 index segments to be merged into a single segment first. Depending on
 the use cases, this operation should be performed infrequently (like
 nightly), if at all, since it is very expensive and involves reading
 and re-writing the entire index. Segments are normally merged over
 time anyway (as determined by the merge policy), and optimize just
 forces these merges to occur immediately.
 
 
 -Yonik
 http://lucidworks.com
 


Re: Overlapping onDeckSearchers=2

2013-05-27 Thread Jack Krupansky
As the wiki does say: "if at all ... Segments are normally merged over time
anyway (as determined by the merge policy), and optimize just forces these
merges to occur immediately."


So, the only real question here is whether the optimize really does lie outside
the "if at all" category and whether "Segments are normally merged over time
anyway" is in fact not good enough.


This is why I referred to the intent - whether the actual reality of a
specific Solr application does align with the expectation that "Segments are
normally merged over time anyway". But start with the presumption that
the merge policy does eliminate the need for optimize.


As a general proposition:

1. Try to avoid using optimize if at all possible. Let merge policy do its 
thing.
2. Try to take a server offline while optimizing if an optimize is 
really, absolutely needed.
3. Try to understand why #1 is not sufficient and resolve the cause(s), so 
that optimize is no longer needed.


-- Jack Krupansky

-Original Message- 
From: Yonik Seeley

Sent: Monday, May 27, 2013 9:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Overlapping onDeckSearchers=2

On Mon, May 27, 2013 at 7:11 AM, Jack Krupansky j...@basetechnology.com 
wrote:

The intent is that optimize is obsolete and should no longer be used


That's incorrect.

People need to understand the cost of optimize, and that its use is 
optional.

It's up to the developer to figure out if the benefits of calling
optimize outweigh the costs in their particular situations.

The wiki currently says:

An optimize is like a hard commit except that it forces all of the
index segments to be merged into a single segment first. Depending on
the use cases, this operation should be performed infrequently (like
nightly), if at all, since it is very expensive and involves reading
and re-writing the entire index. Segments are normally merged over
time anyway (as determined by the merge policy), and optimize just
forces these merges to occur immediately.


-Yonik
http://lucidworks.com 



Re: sourceId of JMX

2013-05-27 Thread Shalin Shekhar Mangar
This is a bug. The sourceId should have been removed from the
SolrDynamicMBean. I'll create an issue.


On Mon, May 27, 2013 at 6:39 PM, 菅沼 嘉一 yo_sugan...@waku-2.com wrote:

 Hello

 Our team faced the problem regarding the sourceId of JMX when getting the
 information of JMX from tomcat manager.

 Command:
 curl http://localhost:
 ${PORT}/manager/jmxproxy?qry=solr:type=documentCache,*

 Here is the error log (tomcat/manager log).
 ---
 2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log
 JMXProxy: Error getting attribute
 solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId
 javax.management.AttributeNotFoundException: sourceId
 ---

 Solr ver. : 4.1.0

 I think this error comes from when JMX cannot get the sourceId.


 BTW Let's look at this issue.
 https://issues.apache.org/jira/browse/SOLR-3329

 It is decided to drop getSourceId() in this issue.

 But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean,
 staticStats.add(sourceId) is still defined in SolrDynamicMBean at L211.


 http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202
 --
 l.211  staticStats.add(sourceId);
 --

 Maybe this error comes from this inconsistency.
 This problem is not critical, but I think this is inconsistent.

 1. Anyone knows why staticStats.add(sourceId) still remained in
 SolrDynamicMBean?
 Do you have any idea?

 2. Has anyone faced such error ? How did you solved it?


 Thank you.

 Regards
 suganuma




-- 
Regards,
Shalin Shekhar Mangar.


Re: Overlapping onDeckSearchers=2

2013-05-27 Thread heaven
I am on 4.2.1

@Yonik Seeley I do understand the cost and run it once per 24 hours and
perhaps later this interval will be increased up to a few days.

In general I am optimizing not to merge the fragments but to remove deleted
docs. My index refreshes quickly and the number of deleted docs can reach a
few million per week.

The question is: if optimize does the same thing a hard commit does, plus some
extra merging, why does Solr schedule a new commit while an optimize is in
progress? That's the problem.

Since I am optimizing once per day and all commits are scheduled by Solr
itself:
<autoCommit>
  <maxDocs>25000</maxDocs>
  <maxTime>30</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>

If Solr sees that an optimize is running, it could delay all scheduled hard
commits until the optimize is complete. Additionally, it could perform a soft
commit (for cases when the application needs to see the updated docs in the
index and fires commits) and run the delayed hard commit when the optimize is
complete. That would still warm a new searcher, but at least it would prevent
multiple searchers from warming simultaneously.

Best,
Alex



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Overlapping-onDeckSearchers-2-tp772556p4066267.html
Sent from the Solr - User mailing list archive at Nabble.com.


AW: Core admin action CREATE fails to persist some settings in solr.xml with Solr 4.3

2013-05-27 Thread André Widhani
I created SOLR-4862 ... I found no way to assign the ticket to somebody though 
(I guess it is under Workflow, but the button is greyed out).

Thanks,
André



Re: sourceId of JMX

2013-05-27 Thread Shalin Shekhar Mangar
I opened https://issues.apache.org/jira/browse/SOLR-4863


On Mon, May 27, 2013 at 7:35 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 This is a bug. The sourceId should have been removed from the
 SolrDynamicMBean. I'll create an issue.


 On Mon, May 27, 2013 at 6:39 PM, 菅沼 嘉一 yo_sugan...@waku-2.com wrote:

 Hello

 Our team faced the problem regarding the sourceId of JMX when getting the
 information of JMX from tomcat manager.

 Command:
 curl http://localhost:
 ${PORT}/manager/jmxproxy?qry=solr:type=documentCache,*

 Here is the error log (tomcat/manager log).

 ---
 2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log
 JMXProxy: Error getting attribute
 solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId
 javax.management.AttributeNotFoundException: sourceId

 ---

 Solr ver. : 4.1.0

 I think this error comes from when JMX cannot get the sourceId.


 BTW Let's look at this issue.
 https://issues.apache.org/jira/browse/SOLR-3329

 It is decided to drop getSourceId() in this issue.

 But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean,
 staticStats.add(sourceId) is still defined in SolrDynamicMBean at L211.


 http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202
 --
 l.211  staticStats.add(sourceId);
 --

 Maybe this error comes from this inconsistency.
 This problem is not critical, but I think this is inconsistent.

 1. Anyone knows why staticStats.add(sourceId) still remained in
 SolrDynamicMBean?
 Do you have any idea?

 2. Has anyone faced such error ? How did you solved it?


 Thank you.

 Regards
 suganuma




 --
 Regards,
 Shalin Shekhar Mangar.




-- 
Regards,
Shalin Shekhar Mangar.


Re: Distributed query: strange behavior.

2013-05-27 Thread Luis Cappa Banda
Hello, guys!

Well, I've done some tests and I think there is some kind of bug related to
distributed search. Currently I'm setting a key field that cannot be
duplicated, and I still see the same wrong behavior with the numFound value
when changing the rows parameter. Has anyone experienced the same?

Best regards,

- Luis Cappa


2013/5/27 Luis Cappa Banda luisca...@gmail.com

 Hi, Erick!

 That's it! I'm using a custom implementation of a SolrServer with
 distributed behavior that routes queries and updates using an in-house
 Round Robin method. But the thing is that I'm doing this myself because
 I've noticed that duplicated documents appears using LBHttpSolrServer
 implementation. Last week I modified my implementation to avoid that with
 this changes:


- I have normalized the key field to all documents. Now every document
indexed must include *_id_* field that stores the selected key value.
The value is setted with a *copyField*.
- When I index a new document a *HttpSolrServer* from the shard list
is selected using a Round Robin strategy. Then, a field called *_shard_
* is setted to *SolrInputDocument*. That field value includes a
relationship with the main shard selected.
- If a document wants to be indexed/updated and it includes *_shard_*field 
 to update it automatically the belonged shard (
*HttpSolrServer*) is selected.
- If a document wants to be indexed/updated and *_shard_* field is not
included then the key value from *_id_* is getted from *
SolrInputDocument*. With that key a distributed search query is
executed by it's key to retrieve *_shard_* field. With *_shard_* field
we can now choose the correct shard (*HttpSolrServer*). It's not a
good practice and performance isn't the best, but it's secure.

 Best Regards,

 - Luis Cappa


 2013/5/26 Erick Erickson erickerick...@gmail.com

 Valery:

 I share your puzzlement. _If_ you are letting Solr do the document
 routing, and not doing any of the custom routing, then the same unique
 key should be going to the same shard and replacing the previous doc
 with that key.

 But, if you're using custom routing, if you've been experimenting with
 different configurations and didn't start over, in general if you're
 configuration is in an interesting state this could happen.

 So in the normal case if you have a document with the same key indexed
 in multiple shards, that would indicate a bug. But there are many
 ways, especially when experimenting, that you could have this happen
 which are _not_ a bug. I'm guessing that Luis may be trying the custom
 routing option maybe?

 Best
 Erick

 On Fri, May 24, 2013 at 9:09 AM, Valery Giner valgi...@research.att.com
 wrote:
  Shawn,
 
  How is it possible for more than one document with the same unique key
 to
  appear in the index, even in different shards?
  Isn't it a bug by definition?
  What am I missing here?
 
  Thanks,
  Val
 
 
  On 05/23/2013 09:55 AM, Shawn Heisey wrote:
 
  On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
 
  I've query each Solr shard server one by one and the total number of
  documents is correct. However, when I change rows parameter from 10 to
  100
  the total numFound of documents change:
 
  I've seen this problem on the list before and the cause has been
  determined each time to be caused by documents with the same uniqueKey
  value appearing in more than one shard.
 
  What I think happens here:
 
  With rows=10, you get the top ten docs from each of the three shards,
  and each shard sends its numFound for that query to the core that's
  coordinating the search.  The coordinator adds up numFound, looks
  through those thirty docs, and arranges them according to the requested
  sort order, returning only the top 10.  In this case, there happen to
 be
  no duplicates.
 
  With rows=100, you get a total of 300 docs.  This time, duplicates are
  found and removed by the coordinator.  I think that the coordinator
  adjusts the total numFound by the number of duplicate documents it
  removed, in an attempt to be more accurate.
 
  I don't know if adjusting numFound when duplicates are found in a
  sharded query is the right thing to do, I'll leave that for smarter
  people.  Perhaps Solr should return a message with the results saying
  that duplicates were found, and if a config option is not enabled, the
  server should throw an exception and return a 4xx HTTP error code.  One
  idea for a config parameter name would be allowShardDuplicates, but
  something better can probably be found.
 
  Thanks,
  Shawn
 
 




 --
 - Luis Cappa




-- 
- Luis Cappa


Re: A strange RemoteSolrException

2013-05-27 Thread Hans-Peter Stricker
Yes, I started it up with java -Dsolr.solr.home=example-DIH/solr -jar 
start.jar.


Without the java options I don't get the exceptions either! (I should have 
checked.)


What now?

--
From: Shalin Shekhar Mangar shalinman...@gmail.com
Sent: Monday, May 27, 2013 3:58 PM
To: solr-user@lucene.apache.org
Subject: Re: A strange RemoteSolrException


I downloaded solr 4.3.0, started it up with java -jar start.jar (from
inside the example directory) and executed your program. No exceptions are
thrown. Is there something you did differently?


On Mon, May 27, 2013 at 5:45 PM, Hans-Peter Stricker
stric...@epublius.dewrote:


Hello,

I'm writing my first little Solrj program, but don't get it running
because of an RemoteSolrException: Server at 
http://localhost:8983/solrreturned non ok status:404


The server is definitely running and the url works in the browser.

I am working with Solr 4.3.0.

This is my source code:

public static void main(String[] args) {

String url = http://localhost:8983/solr;;
SolrServer server;

try {
server = new HttpSolrServer(url);
server.ping();
   } catch (Exception ex) {
ex.printStackTrace();
   }
}

with the stack trace:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Server at http://localhost:8983/solr returned non ok status:404,
message:Not Found
 at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
 at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
 at 
org.apache.solr.client.solrj.request.SolrPing.process(SolrPing.java:62)

 at org.apache.solr.client.solrj.SolrServer.ping(SolrServer.java:293)
 at de.epublius.blogindexer.App.main(App.java:47)

If I call server.shutdown(), there is no such exception, but for almost
all other SolrServer-methods.

What am I doing wrong?

Thanks in advance

Hans-Peter





--
Regards,
Shalin Shekhar Mangar.





Re: A strange RemoteSolrException

2013-05-27 Thread Shawn Heisey
On 5/27/2013 6:15 AM, Hans-Peter Stricker wrote:
 I'm writing my first little Solrj program, but don't get it running because 
 of an RemoteSolrException: Server at http://localhost:8983/solr returned non 
 ok status:404
 
 The server is definitely running and the url works in the browser.
 
 I am working with Solr 4.3.0.

Hans,

To use SolrJ against the URL that you provided, you must have a
defaultCoreName attribute in solr.xml that points at a core that exists.
 The defaultCoreName used in the old-style solr.xml (4.3.1 and earlier)
is collection1.  The new-style solr.xml (4.4 and later, when released)
does not define a default core name.

A far safer option for any Solr client API is to use a base URL that
includes the name of the core.  If you are using SolrCloud, you can
optionally use the collection name instead.  This option will be
required for the new style solr.xml.  Here's the format:

http://server:port/solr/corename

In the UI, the cores that exist will be in a left-side dropdown that
says Core Selector.  If you are using SolrCloud, you can click on the
Cloud option in the UI and then on Graph to see the collection names.
They will be on the left side of the graph.

NB: If you are using SolrCloud, it is better to use CloudSolrServer
instead of HttpSolrServer.
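
For illustration, a minimal SolrJ sketch of the safer form described above, assuming the stock example core name collection1:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PingCore {
    public static void main(String[] args) throws Exception {
        // The base URL includes the core name, so no defaultCoreName is needed.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        System.out.println("ping status: " + server.ping().getStatus());
        server.shutdown();
    }
}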

Thanks,
Shawn



Re: A strange RemoteSolrException

2013-05-27 Thread Shawn Heisey
On 5/27/2013 8:24 AM, Hans-Peter Stricker wrote:
 Yes, I started it up with java -Dsolr.solr.home=example-DIH/solr -jar
 start.jar.

That explains it.  See my other reply.  The solr.xml file for
example-DIH does not have a defaultCoreName attribute.

Thanks,
Shawn



Re: A strange RemoteSolrException

2013-05-27 Thread Hans-Peter Stricker

Dear Shawn, dear Shalin,

thanks for your valuable replies!

Could/should I have known better (by reading the manual more carefully)?

I'll try to fix it - and I am confident that it will work!

Best regards

Hans

--
From: Shawn Heisey s...@elyograg.org
Sent: Monday, May 27, 2013 4:29 PM
To: solr-user@lucene.apache.org
Subject: Re: A strange RemoteSolrException


On 5/27/2013 8:24 AM, Hans-Peter Stricker wrote:

Yes, I started it up with java -Dsolr.solr.home=example-DIH/solr -jar
start.jar.


That explains it.  See my other reply.  The solr.xml file for
example-DIH does not have a defaultCoreName attribute.

Thanks,
Shawn






RE: Tika: How can I import automatically all metadata without specifiying them explicitly

2013-05-27 Thread Gian Maria Ricci
Thanks a lot, other useful hints, and probably standalone Tika could be a 
solution.

I have another little question: how can I express filters in the DIH
configuration to run the import incrementally?

Actually I have two distinct scenarios.

In the first scenario the documents are stored inside a database, so I need to
write a DIH config to import data from the database; since I have a timestamp
column this is not a problem.

Second scenario: I need to monitor one folder and do incremental population
every 15 minutes. Usually with SQL DIH I use some column as a filter for
incremental population, but I wonder if it is possible to pass a filter to
BinFileDataSource, telling it to process only new files and those modified
after a timestamp (the last run).

Thanks again for all your precious suggestions.

--
Gian Maria Ricci
Mobile: +39 320 0136949



-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Monday, May 27, 2013 1:44 PM
To: solr-user@lucene.apache.org
Subject: RE: Tika: How can I import automatically all metadata without 
specifiying them explicitly

Standalone Tika can also run in a network server mode. That increases data 
round trips but gives you more options, even from .NET.
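
As a rough sketch (not from this thread) of what talking to that server can look like: this assumes tika-server is running on its default port 9998 and exposing the /meta endpoint (one metadata key/value pair per line); the port, endpoint and file path are assumptions, not something confirmed here.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TikaMetaClient {
    public static void main(String[] args) throws Exception {
        byte[] doc = Files.readAllBytes(Paths.get("c:/temp/docs/sample.pdf"));

        // PUT the raw document bytes and read the metadata back as text.
        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://localhost:9998/meta").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        conn.getOutputStream().write(doc);

        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);   // one metadata key/value pair per line
        }
        in.close();
    }
}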

Regards,
  Alex
On 27 May 2013 04:22, Gian Maria Ricci alkamp...@nablasoft.com wrote:

 Thanks for the help.

 @Alexandre: Thanks for the suggestion, I'll try to use an 
 ExtractingRequestHandler, I thought that I was missing some DIH option :).

 @Erik: I'm interested in knowing them all to do various form of 
 analysis. I have documents coming from heterogeneous sources and I'm 
 interested in searching inside the content, but also being able to 
 extract all possible metadata. I'm working in .Net so it is useful 
 letting tika doing everything for me directly in solr and then 
 retrieve all metadata for matched documents.

 Thanks again to everyone.

 --
 Gian Maria Ricci
 Mobile: +39 320 0136949



 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Sunday, May 26, 2013 5:30 PM
 To: solr-user@lucene.apache.org; Gian Maria Ricci
 Subject: Re: Tika: How can I import automatically all metadata without 
 specifiying them explicitly

 In addition to Alexandre's comment:

 bq:  ...I'd like to import in my index all metadata

 Be a little careful here, this isn't actually very useful in my experience.
 Sure
 it's nice to have all that data in the index, but... how do you search 
 it meaningfully?

 Consider that some doc may have an author metadata field. Another 
 may have a last editor field. Yet another may have a main author 
 field. If you add all these as their field name, what do you do to 
 search for author?
 Somehow you have to create a mapping between the various metadata 
 names and something that's searchable, why not do this at index time?

 Not to mention I've seen this done and the result may be literally 
 hundreds of different metadata fields which are not very useful.

 All that said, it may be perfectly valid to index them all, but before 
 going there it's worth considering whether the result is actually _useful_.

 Best
 Erick


 On Sat, May 25, 2013 at 4:44 AM, Gian Maria Ricci
 alkamp...@nablasoft.comwrote:

  Hi to everyone,
 
 
  I've configured import of a document folder with 
  FileListEntityProcessor, everything went smooth on the first try, 
  but I have a simple question. I'm able to map metadata without any 
  problem, but I'd like to import in my index all metadata, not only 
  those I've configured with field nodes. In this example I've 
  imported Author and title, but I does not know in advance which 
  metadata a document could have and I wish to have all of them inside 
  my
  index.
 
 
  Here is my import config. It is the first try with importing with 
  tika and probably I'm missing a simple stuff.
 
 
  <dataConfig>
      <dataSource type="BinFileDataSource" />
      <document>
          <entity name="files" dataSource="null" rootEntity="false"
                  processor="FileListEntityProcessor"
                  baseDir="c:/temp/docs"
                  fileName=".*\.(doc)|(pdf)|(docx)"
                  onError="skip"
                  recursive="true">
              <field column="file" name="id" />
              <field column="fileAbsolutePath" name="path" />
              <field column="fileSize" name="size" />
              <field column="fileLastModified" name="lastModified" />
          </entity>

[blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-27 Thread Koji Sekiguchi
Hello,

Sorry for the cross-post. I just wanted to announce that I've written a blog post on
how to create a synonyms.txt file automatically from Wikipedia:

http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

Hope that the article gives someone a good experience!

koji
-- 
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: A strange RemoteSolrException

2013-05-27 Thread Shawn Heisey
On 5/27/2013 8:34 AM, Hans-Peter Stricker wrote:
 Dear Shawn, dear Shalin,
 
 thanks for your valuable replies!
 
 Could/should I have known better (by reading more carefully the manual)?

I just looked at the wiki.  The SolrJ wiki page doesn't mention using
the core name, which I find surprising, because Solr has had multicore
capability for a REALLY long time, and it has been the default in the
example since the 3.x days.

The only wiki example code that has a URL with a core name is the code
in the database example:

http://wiki.apache.org/solr/Solrj#Reading_data_from_a_database

Oddly enough, I wrote and contributed that example, but it was a few
years ago and I haven't looked at it since.

When I find the time, I will go through the SolrJ wiki page and bring it
into this decade.  Multicore operation is very likely going to be
required on Solr 5.0 when that version comes out.

If anyone else wants to update the wiki, feel free.  If you don't
already have edit permission, just ask.  We can add your wiki username
to the contributors group.

Thanks,
Shawn



Re: Note on The Book

2013-05-27 Thread Koji Sekiguchi

Now my contribution can be read on soleami blog in English:

Automatically Acquiring Synonym Knowledge from Wikipedia
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

koji

(13/05/27 21:16), Jack Krupansky wrote:

If you would like to Solr-ize your contribution, that would be great. The focus 
of the book will be
hard-core Solr.

-- Jack Krupansky

-Original Message- From: Koji Sekiguchi
Sent: Monday, May 27, 2013 8:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Note on The Book

Hi Jack,

I'd like to ask as a person who contributed a case study article about
Automatically acquiring synonym knowledge from Wikipedia to the book.

(13/05/24 8:14), Jack Krupansky wrote:

To those of you who may have heard about the Lucene/Solr book that I and two 
others are writing on
Lucene and Solr, some bad and good news. The bad news: The book contract with 
O’Reilly has been
canceled. The good news: I’m going to proceed with self-publishing (possibly on 
Lulu or even
Amazon) a somewhat reduced scope Solr-only Reference Guide (with hints of 
Lucene). The scope of
the previous effort was too great, even for O’Reilly – a book larger than 800 
pages (or even 600)
that was heavy on reference and lighter on “guide” just wasn’t fitting in with 
their traditional
“guide” model. In truth, Solr is just too complex for a simple guide that 
covers it all, let alone
Lucene as well.


Will the reduced Solr-only reference guide include my article?
If not (for now I think it is not because my article is for Lucene case study,
not Solr), I'd like to put it out on my blog or somewhere.

BTW, those who want to know how to acquire synonym knowledge from Wikipedia,
the summary is available at slideshare:

http://www.slideshare.net/KojiSekiguchi/wikipediasolr

koji



--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Specifiy colums to return for mlt results

2013-05-27 Thread Achim Domma
Hi,

I'm executing a search and retrieving MoreLikeThis results. For the search 
results, I can specify the columns to be returned via the fl parameter. The 
mlt.fl parameter defines the columns to be used for the similarity calculation. 
The MLT results seem to return the columns specified in fl too. Is there a way 
to specify different columns for the search results and the MLT results?

kind regards,
Achim

Problems with DIH in Solrj

2013-05-27 Thread Hans-Peter Stricker
I start the SOLR example with 

java -Dsolr.solr.home=example-DIH/solr -jar start.jar

and run

public static void main(String[] args) {

    String url = "http://localhost:8983/solr/rss";
    SolrServer server;
    SolrQuery query;
    try {
        server = new HttpSolrServer(url);
        query = new SolrQuery();
        query.setParam(CommonParams.QT, "/dataimport");
        QueryRequest request = new QueryRequest(query);
        QueryResponse response = request.process(server);
        server.commit();
        System.out.println(response.toString());

    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

without exception and the response string as

{responseHeader={status=0,QTime=0},initArgs={defaults={config=rss-data-config.xml}},status=idle,importResponse=,statusMessages={},WARNING=This
 response format is experimental.  It is likely to change in the future.}

The Lucene index is touched but not really updated: there are only 
segments.gen and segments_a files of size 1Kb. If I execute /dataimport 
(full-import with option commit checked) from 
http://localhost:8983/solr/#/rss/dataimport//dataimport I get

{ responseHeader: { status: 0, QTime: 1 }, initArgs: [ defaults, [ 
config, rss-data-config.xml ] ], command: status, status: idle, 
importResponse: , statusMessages: { Total Requests made to DataSource: 
1, Total Rows Fetched: 10, Total Documents Skipped: 0, Full Dump 
Started: 2013-05-27 17:57:07, : Indexing completed. Added/Updated: 10 
documents. Deleted 0 documents., Committed: 2013-05-27 17:57:07, Total 
Documents Processed: 10, Time taken: 0:0:0.603 }, WARNING: This 
response format is experimental. It is likely to change in the future. }

What am I doing wrong?

Re: Problems with DIH in Solrj

2013-05-27 Thread Shalin Shekhar Mangar
Your program is not specifying a command. You need to add:

query.setParam("command", "full-import");


On Mon, May 27, 2013 at 9:31 PM, Hans-Peter Stricker
stric...@epublius.dewrote:

 I start the SOLR example with

 java -Dsolr.solr.home=example-DIH/solr -jar start.jar

 and run

 public static void main(String[] args) {

 String url = http://localhost:8983/solr/rss;;
 SolrServer server;
 SolrQuery query;
 try {
 server = new HttpSolrServer(url);
 query = new SolrQuery();
 query.setParam(CommonParams.QT,/dataimport);
 QueryRequest request = new QueryRequest(query);
 QueryResponse response = request.process(server);
 server.commit();
 System.out.println(response.toString());

 } catch (Exception ex) {
 ex.printStackTrace();
 }
 }

 without exception and the response string as

 {responseHeader={status=0,QTime=0},initArgs={defaults={config=rss-data-config.xml}},status=idle,importResponse=,statusMessages={},WARNING=This
 response format is experimental.  It is likely to change in the future.}

 The Lucene index is touched but not really updated: there are only
 segments.gen and segments_a files of size 1Kb. If I execute /dataimport
 (full-import with option commit checked) from
 http://localhost:8983/solr/#/rss/dataimport//dataimport I get

 { responseHeader: { status: 0, QTime: 1 }, initArgs: [ defaults,
 [ config, rss-data-config.xml ] ], command: status, status:
 idle, importResponse: , statusMessages: { Total Requests made to
 DataSource: 1, Total Rows Fetched: 10, Total Documents Skipped:
 0, Full Dump Started: 2013-05-27 17:57:07, : Indexing completed.
 Added/Updated: 10 documents. Deleted 0 documents., Committed:
 2013-05-27 17:57:07, Total Documents Processed: 10, Time taken:
 0:0:0.603 }, WARNING: This response format is experimental. It is
 likely to change in the future. }

 What am I doing wrong?




-- 
Regards,
Shalin Shekhar Mangar.


Re: Problems with DIH in Solrj

2013-05-27 Thread Hans-Peter Stricker

Marvelous!!

Once again: where could/should I have read this? What kind of 
concepts/keywords are "command" and "full-import"? (I couldn't find them in 
any config file. Where are they explained?)


Anyway: Now it works like a charm!

Thanks

Hans



--
From: Shalin Shekhar Mangar shalinman...@gmail.com
Sent: Monday, May 27, 2013 6:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems with DIH in Solrj


Your program is not specifying a command. You need to add:

query.setParam("command", "full-import");


On Mon, May 27, 2013 at 9:31 PM, Hans-Peter Stricker
stric...@epublius.dewrote:


I start the SOLR example with

java -Dsolr.solr.home=example-DIH/solr -jar start.jar

and run

public static void main(String[] args) {

String url = http://localhost:8983/solr/rss;;
SolrServer server;
SolrQuery query;
try {
server = new HttpSolrServer(url);
query = new SolrQuery();
query.setParam(CommonParams.QT,/dataimport);
QueryRequest request = new QueryRequest(query);
QueryResponse response = request.process(server);
server.commit();
System.out.println(response.toString());

} catch (Exception ex) {
ex.printStackTrace();
}
}

without exception and the response string as

{responseHeader={status=0,QTime=0},initArgs={defaults={config=rss-data-config.xml}},status=idle,importResponse=,statusMessages={},WARNING=This
response format is experimental.  It is likely to change in the future.}

The Lucene index is touched but not really updated: there are only
segments.gen and segments_a files of size 1Kb. If I execute /dataimport
(full-import with option commit checked) from
http://localhost:8983/solr/#/rss/dataimport//dataimport I get

{ responseHeader: { status: 0, QTime: 1 }, initArgs: [ 
defaults,

[ config, rss-data-config.xml ] ], command: status, status:
idle, importResponse: , statusMessages: { Total Requests made to
DataSource: 1, Total Rows Fetched: 10, Total Documents Skipped:
0, Full Dump Started: 2013-05-27 17:57:07, : Indexing completed.
Added/Updated: 10 documents. Deleted 0 documents., Committed:
2013-05-27 17:57:07, Total Documents Processed: 10, Time taken:
0:0:0.603 }, WARNING: This response format is experimental. It is
likely to change in the future. }

What am I doing wrong?





--
Regards,
Shalin Shekhar Mangar.





Re: Problems with DIH in Solrj

2013-05-27 Thread Shalin Shekhar Mangar
Details about the DataImportHandler are on the wiki:

http://wiki.apache.org/solr/DataImportHandler

In general, the SolrJ client just makes HTTP requests to the corresponding
Solr APIs so you need to learn about the http parameters for the
corresponding solr component. The solr wiki is your best bet.

http://wiki.apache.org/solr/FrontPage


On Mon, May 27, 2013 at 9:50 PM, Hans-Peter Stricker
stric...@epublius.dewrote:

 Marvelous!!

 Once again: where could/should I have read this? What kinds of
 concepts/keywords are command and full-import? (Couldn't find them in
 any config file. Where are they explained?)

 Anyway: Now it works like a charm!

 Thanks

 Hans



 --**
 From: Shalin Shekhar Mangar shalinman...@gmail.com
 Sent: Monday, May 27, 2013 6:09 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problems with DIH in Solrj


  Your program is not specifying a command. You need to add:

 query.setParam("command", "full-import");


 On Mon, May 27, 2013 at 9:31 PM, Hans-Peter Stricker
 stric...@epublius.dewrote:

  I start the SOLR example with

 java -Dsolr.solr.home=example-DIH/**solr -jar start.jar

 and run

 public static void main(String[] args) {

 String url = 
 http://localhost:8983/solr/**rsshttp://localhost:8983/solr/rss
 ;
 SolrServer server;
 SolrQuery query;
 try {
 server = new HttpSolrServer(url);
 query = new SolrQuery();
 query.setParam(CommonParams.**QT,/dataimport);
 QueryRequest request = new QueryRequest(query);
 QueryResponse response = request.process(server);
 server.commit();
 System.out.println(response.**toString());

 } catch (Exception ex) {
 ex.printStackTrace();
 }
 }

 without exception and the response string as

 {responseHeader={status=0,**QTime=0},initArgs={defaults={**
 config=rss-data-config.xml}},**status=idle,importResponse=,**
 statusMessages={},WARNING=This
 response format is experimental.  It is likely to change in the future.}

 The Lucene index is touched but not really updated: there are only
 segments.gen and segments_a files of size 1Kb. If I execute /dataimport
 (full-import with option commit checked) from
 http://localhost:8983/solr/#/**rss/dataimport//dataimporthttp://localhost:8983/solr/#/rss/dataimport//dataimportI
  get

 { responseHeader: { status: 0, QTime: 1 }, initArgs: [
 defaults,
 [ config, rss-data-config.xml ] ], command: status, status:
 idle, importResponse: , statusMessages: { Total Requests made to
 DataSource: 1, Total Rows Fetched: 10, Total Documents Skipped:
 0, Full Dump Started: 2013-05-27 17:57:07, : Indexing completed.
 Added/Updated: 10 documents. Deleted 0 documents., Committed:
 2013-05-27 17:57:07, Total Documents Processed: 10, Time taken:
 0:0:0.603 }, WARNING: This response format is experimental. It is
 likely to change in the future. }

 What am I doing wrong?





 --
 Regards,
 Shalin Shekhar Mangar.





-- 
Regards,
Shalin Shekhar Mangar.


Re: Problems with DIH in Solrj

2013-05-27 Thread Shawn Heisey
On 5/27/2013 10:20 AM, Hans-Peter Stricker wrote:
 Marvelous!!
 
 Once again: where could/should I have read this? What kinds of
 concepts/keywords are command and full-import? (Couldn't find them
 in any config file. Where are they explained?)
 
 Anyway: Now it works like a charm!

http://wiki.apache.org/solr/DataImportHandler#Commands

The CommonParams.QT syntax that you used only works with SolrJ 4.0 and
newer, and those versions have a shortcut that's slightly easier to read:

query.setRequestHandler(/dataimport);

The reason that there are no real examples of using DIH with SolrJ is
because if you are using SolrJ, it is expected that your application
will be doing the indexing itself, with the add method on the server
object.  I'll point you once again to the database example:

http://wiki.apache.org/solr/Solrj#Reading_data_from_a_database

I do use DIH on occasion - whenever I do a full rebuild of my index, DIH
does the job a lot faster than my own code.  I handle it from SolrJ.

Sending a full-import or delta-import command to Solr returns to SolrJ
immediately.  You will only see a failure on that request if something
major fails with the request itself, it won't tell you anything about
whether the import succeeded or failed.  You must periodically check the
status.

Interpreting the status in a program is a complicated endeavor, because
the status is human readable, not machine readable, and important
information is added or removed from the response at various success and
error stages.  There have been a number of issues on this.  I filed most
of them:

https://issues.apache.org/jira/browse/SOLR-2728
https://issues.apache.org/jira/browse/SOLR-2729
https://issues.apache.org/jira/browse/SOLR-3319
https://issues.apache.org/jira/browse/SOLR-3689
https://issues.apache.org/jira/browse/SOLR-4241

I do have SolrJ code that interprets DIH status, but it's tied up in a
larger work and will require some cleanup before I can share it.
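
In the meantime, a rough sketch of the basic kick-off-and-poll pattern (this is not the code referred to above; it only looks at the top-level status field, so it says nothing about success or failure; the core URL is just an example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.util.NamedList;

public class DihPoller {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/rss");

        // Kick off the import; this call returns immediately.
        SolrQuery start = new SolrQuery();
        start.setRequestHandler("/dataimport");
        start.setParam("command", "full-import");
        new QueryRequest(start).process(server);

        // Poll the handler until it reports that it is idle again.
        SolrQuery status = new SolrQuery();
        status.setRequestHandler("/dataimport");
        status.setParam("command", "status");
        while (true) {
            NamedList<Object> rsp = server.request(new QueryRequest(status));
            if ("idle".equals(rsp.get("status"))) {
                break;   // finished, but this alone does not mean it succeeded
            }
            Thread.sleep(5000);
        }
        server.shutdown();
    }
}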

Thanks,
Shawn



Unable to start solr 4.3

2013-05-27 Thread Gian Maria Ricci
I have a test VM where I usually test Solr installations. In that VM I already
configured Solr 4.0 and everything went well. Today I downloaded the 4.3
version, unpacked everything and configured Tomcat as I did for the 4.0
version, but the application does not start, and in the catalina log I find
only
 

May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext
startInternal

SEVERE: Error filterStart

May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext
startInternal

SEVERE: Context [/TestInstance43] startup failed due to previous errors

 

Where can I find more info on what is wrong? It seems that in the log file
there is no detailed information on why the application does not start.

 

--

Gian Maria Ricci

Mobile: +39 320 0136949

 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
http://www.linkedin.com/in/gianmariaricci
https://twitter.com/alkampfer   http://feeds.feedburner.com/AlkampferEng
skype://alkampferaok/ 

 

 



Re: Unable to start solr 4.3

2013-05-27 Thread Alexandre Rafalovitch
The usual answer (which may or may not be relevant) is that Solr 4.3 has
moved the logging libraries around and you need to copy specific library
implementations into your Tomcat lib directory. If that sounds like a possible
cause, search the mailing list for a number of detailed discussions on this topic.

Regards,
   Alex

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, May 27, 2013 at 1:34 PM, Gian Maria Ricci
alkamp...@nablasoft.comwrote:

 I’ve a test VM where I usually test solr installation. In that VM I
 already configured solr4.0 and everything went good. Today I download the
 4.3 version, unpack everything, configuring TOMCAT as I did for the 4.0
 version but the application does not start, and in catilina log I find only
 

 May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext
 startInternal

 SEVERE: Error filterStart

 May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext
 startInternal

 SEVERE: Context [/TestInstance43] startup failed due to previous errors

 Where can I find more info on what is wrong? It seems that in log file
 there is no detailed information on why the Application should not start.

 --

 Gian Maria Ricci

 Mobile: +39 320 0136949

 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
 http://www.linkedin.com/in/gianmariaricci
 https://twitter.com/alkampfer
 http://feeds.feedburner.com/AlkampferEng



Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
Hi.

Searching for terms with a wildcard at the start is solved with
ReversedWildcardFilterFactory. But what about terms with a wildcard at both
the start AND the end?

This query is heavy, and I want to disallow such queries from my users.

I'm looking for a way to cause these queries to fail.
I guess there is no built-in support for my need, so it is OK to write a
new solution.

My current plan is to create a search component (which will run before
QueryComponent). It should analyze the query string and drop the query
if too-heavy wildcards are found.

Another option is to create a query parser, which wraps the current
(specified or default) qparser, and does the same work as above.

These two options require an analysis of the query text, which might be
ugly work (just think about nested queries [using _query_], or even a lot
of more basic scenarios like quoted terms, etc.)

Am I missing a simple and clean way to do this?
What would you do?

P.S. if no simple solution exists, timeAllowed limit is the best
work-around I could think about. Any other suggestions?
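
For the second option (a query parser that wraps the default one), a rough sketch might look like the following. It is not from this thread, the class name is made up, and the regex check is deliberately crude, so it suffers from exactly the caveats above (quoted terms, nested _query_ clauses, escaping):

import java.util.regex.Pattern;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParserPlugin;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class NoDoubleWildcardQParserPlugin extends QParserPlugin {

    // Crude check: a single token with a wildcard at both ends.
    private static final Pattern DOUBLE_WILDCARD = Pattern.compile("[*?]\\S+[*?]");

    private final QParserPlugin delegate = new LuceneQParserPlugin();

    @Override
    public void init(NamedList args) {
        delegate.init(args);
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        if (qstr != null && DOUBLE_WILDCARD.matcher(qstr).find()) {
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                "Terms with a wildcard at both the start and the end are not allowed");
        }
        // Otherwise fall back to the stock Lucene query parser.
        return delegate.createParser(qstr, localParams, params, req);
    }
}

Registered in solrconfig.xml like any other queryParser plugin, it could then be selected with defType or a local-params prefix.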


Re: Unable to start solr 4.3

2013-05-27 Thread Shawn Heisey
On 5/27/2013 12:00 PM, Alexandre Rafalovitch wrote:
 The usual answer (which may or may not be relevant) is that Solr 4.3 has
 moved the logging libraries around and you need to copy specific library
 implementations to your Tomcat lib files. If that sounds as a possible,
 search the mailing list for a number of detailed discussions on this topic.

snip

 alkamp...@nablasoft.comwrote:
 I’ve a test VM where I usually test solr installation. In that VM I
 already configured solr4.0 and everything went good. Today I download the
 4.3 version, unpack everything, configuring TOMCAT as I did for the 4.0
 version but the application does not start, and in catilina log I find only

Alexandre has probably nailed the problem here.

A little more detail on how to fix it: The section of the SolrLogging
wiki page on how to use the example logging with another container will
be exactly what you need.

http://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty

This section is also followed by instructions for switching back to
java.util.logging.  Because the example setup takes logging control away
from tomcat, this is something that many tomcat users will want to do.

Thanks,
Shawn



RE: Unable to start solr 4.3

2013-05-27 Thread Gian Maria Ricci
Thanks, I'll check :)

--
Gian Maria Ricci
Mobile: +39 320 0136949



-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Monday, May 27, 2013 8:00 PM
To: solr-user@lucene.apache.org; alkamp...@nablasoft.com
Subject: Re: Unable to start solr 4.3

The usual answer (which may or may not be relevant) is that Solr 4.3 has moved 
the logging libraries around and you need to copy specific library 
implementations to your Tomcat lib files. If that sounds as a possible, search 
the mailing list for a number of detailed discussions on this topic.

Regards,
   Alex

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. 
Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, May 27, 2013 at 1:34 PM, Gian Maria Ricci
alkamp...@nablasoft.comwrote:

 I’ve a test VM where I usually test solr installation. In that VM I 
 already configured solr4.0 and everything went good. Today I download 
 the
 4.3 version, unpack everything, configuring TOMCAT as I did for the 
 4.0 version but the application does not start, and in catilina log I 
 find only
 

 May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext
 startInternal

 SEVERE: Error filterStart

 May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext
 startInternal

 SEVERE: Context [/TestInstance43] startup failed due to previous errors

 Where can I find more info on what is wrong? It seems that in log file
 there is no detailed information on why the Application should not start.

 --

 Gian Maria Ricci

 Mobile: +39 320 0136949

 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
 http://www.linkedin.com/in/gianmariaricci
 https://twitter.com/alkampfer
 http://feeds.feedburner.com/AlkampferEng



Re: Prevention of heavy wildcard queries

2013-05-27 Thread Roman Chyla
You are right that starting to parse the query before the query component
can soon get very ugly and complicated. You should take advantage of the
flex parser; it is already in lucene contrib - but if you are interested in
the better version, look at
https://issues.apache.org/jira/browse/LUCENE-5014

The way you can solve this is:

1. use the standard syntax grammar (which allows *foo*)
2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or
raise error etc

this way, you are changing semantics - but don't need to touch the syntax
definition; of course, you may also change the grammar and allow only one
instance of wildcard (or some combination) but for that you should probably
use LUCENE-5014

roman

On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi.

 Searching terms with wildcard in their start, is solved with
 ReversedWildcardFilterFactory. But, what about terms with wildcard in both
 start AND end?

 This query is heavy, and I want to disallow such queries from my users.

 I'm looking for a way to cause these queries to fail.
 I guess there is no built-in support for my need, so it is OK to write a
 new solution.

 My current plan is to create a search component (which will run before
 QueryComponent). It should analyze the query string, and to drop the query
 if too heavy wildcard are found.

 Another option is to create a query parser, which wraps the current
 (specified or default) qparser, and does the same work as above.

 These two options require an analysis of the query text, which might be an
 ugly work (just think about nested queries [using _query_], OR even a lot
 of more basic scenarios like quoted terms, etc.)

 Am I missing a simple and clean way to do this?
 What would you do?

 P.S. if no simple solution exists, timeAllowed limit is the best
 work-around I could think about. Any other suggestions?



RE: Unable to start solr 4.3

2013-05-27 Thread Gian Maria Ricci
Thanks a lot, it seems that Solr would not start because the logging 
libraries were missing. Once I copied all the needed logging libraries into 
c:\tomcat\libs, Solr started with no problem.

If other people are interested, here is the wiki link that describes the 
logging changes in Solr 4.3:

http://wiki.apache.org/solr/SolrLogging#What_changed

Thanks Alexandre for pointing me in the right direction :)


--
Gian Maria Ricci
Mobile: +39 320 0136949



-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Monday, May 27, 2013 8:00 PM
To: solr-user@lucene.apache.org; alkamp...@nablasoft.com
Subject: Re: Unable to start solr 4.3

The usual answer (which may or may not be relevant) is that Solr 4.3 has moved 
the logging libraries around and you need to copy specific library 
implementations to your Tomcat lib files. If that sounds as a possible, search 
the mailing list for a number of detailed discussions on this topic.

Regards,
   Alex

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. 
Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, May 27, 2013 at 1:34 PM, Gian Maria Ricci
alkamp...@nablasoft.comwrote:

 I’ve a test VM where I usually test solr installation. In that VM I 
 already configured solr4.0 and everything went good. Today I download 
 the
 4.3 version, unpack everything, configuring TOMCAT as I did for the 
 4.0 version but the application does not start, and in catilina log I 
 find only
 

 May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext
 startInternal

 SEVERE: Error filterStart

 May 27, 2013 7:31:54 PM org.apache.catalina.core.StandardContext
 startInternal

 SEVERE: Context [/TestInstance43] startup failed due to previous errors

 Where can I find more info on what is wrong? It seems that in log file
 there is no detailed information on why the Application should not start.

 --

 Gian Maria Ricci

 Mobile: +39 320 0136949

 http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635
 http://www.linkedin.com/in/gianmariaricci
 https://twitter.com/alkampfer
 http://feeds.feedburner.com/AlkampferEng



Re: Keeping a rolling window of indexes around solr

2013-05-27 Thread Otis Gospodnetic
Hi,

SolrCloud now has the same index aliasing as Elasticsearch. I can't look up
the link now, but Zoie from LinkedIn has Hourglass, which it uses for a
circular-buffer sort of index setup, if I recall correctly.

Otis
Solr  ElasticSearch Support
http://sematext.com/
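
For the SolrCloud aliasing route, a rough SolrJ sketch (not from this thread) of re-pointing a read alias at the newest time-window collection via the Collections API (available since Solr 4.2); the alias and collection names here are made up:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class RollAlias {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATEALIAS");
        params.set("name", "events");                  // what the clients query
        params.set("collections", "events_20130527");  // the newest window

        SolrRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        server.request(request);   // re-points the alias if it already exists
        server.shutdown();
    }
}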
On May 24, 2013 10:26 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:

 Hello Solr community folks,
 I am doing some investigative work around how to roll and manage indexes
 inside our solr configuration, to date I've come up with an architecture
 that separates a set of masters that are focused on writes and get
 replicated periodically and a set of slave shards strictly focused on
 reads, additionally for each master index the design contains partial
 purges which get performed on each of the slave shards as well as the
 master to keep the data current.   However the architecture seems a bit
 more complex than I'd like with a lot of moving pieces.  I was wondering if
 anyone has ever handled/designed an architecture around a conveyor belt
 or rolling window of indexes around n days of data and if there are best
 practices around this.  One thing I was thinking about was whether to keep
 a conveyor belt list of the slave shards and rotate them as needed and drop
 the master periodically and make its backup temporarily the master.


 Anyways would love to hear thoughts and usecases that are similar from the
 community.

 Regards


Re: Keeping a rolling window of indexes around solr

2013-05-27 Thread Alexandre Rafalovitch
But how is Hourglass going to help Solr? Or is it a portable implementation?

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, May 27, 2013 at 3:48 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Hi,

 SolrCloud now has the same index aliasing as Elasticsearch.  I can't lookup
 the link now but Zoie from LinkedIn has Hourglass, which is uses for
 circular buffer sort of index setup if I recall correctly.

 Otis
 Solr  ElasticSearch Support
 http://sematext.com/
 On May 24, 2013 10:26 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:

 Hello Solr community folks,
 I am doing some investigative work around how to roll and manage indexes
 inside our solr configuration, to date I've come up with an architecture
 that separates a set of masters that are focused on writes and get
 replicated periodically and a set of slave shards strictly focused on
 reads, additionally for each master index the design contains partial
 purges which get performed on each of the slave shards as well as the
 master to keep the data current.   However the architecture seems a bit
 more complex than I'd like with a lot of moving pieces.  I was wondering if
 anyone has ever handled/designed an architecture around a conveyor belt
 or rolling window of indexes around n days of data and if there are best
 practices around this.  One thing I was thinking about was whether to keep
 a conveyor belt list of the slave shards and rotate them as needed and drop
 the master periodically and make its backup temporarily the master.


 Anyways would love to hear thoughts and usecases that are similar from the
 community.

 Regards


Re: Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
Thanks Roman.
Based on some of your suggestions, will the steps below do the work?

* Create (and register) a new SearchComponent
* In its prepare method: Do for Q and all of the FQs (so this
SearchComponent should run AFTER QueryComponent, in order to see all of the
FQs)
* Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
with a special implementation of QueryNodeProcessorPipeline, which contains
my NodeProcessor in the top of its list.
* Set my analyzer into that StandardQueryParser
* My NodeProcessor will be called for each term in the query, so it can
throw an exception if a (basic) query node contains a wildcard at both the
start and the end of the term.

Do I have a way to avoid reimplementing the whole StandardQueryParser
class?
Will this work for both LuceneQParser and EdismaxQParser queries?

Any other solution/work-around? How do other production environments of
Solr overcome this issue?


On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com wrote:

 You are right that starting to parse the query before the query component
 can get soon very ugly and complicated. You should take advantage of the
 flex parser, it is already in lucene contrib - but if you are interested in
 the better version, look at
 https://issues.apache.org/jira/browse/LUCENE-5014

 The way you can solve this is:

 1. use the standard syntax grammar (which allows *foo*)
 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or
 raise error etc

 this way, you are changing semantics - but don't need to touch the syntax
 definition; of course, you may also change the grammar and allow only one
 instance of wildcard (or some combination) but for that you should probably
 use LUCENE-5014

 roman

 On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Hi.
 
  Searching terms with wildcard in their start, is solved with
  ReversedWildcardFilterFactory. But, what about terms with wildcard in
 both
  start AND end?
 
  This query is heavy, and I want to disallow such queries from my users.
 
  I'm looking for a way to cause these queries to fail.
  I guess there is no built-in support for my need, so it is OK to write a
  new solution.
 
  My current plan is to create a search component (which will run before
  QueryComponent). It should analyze the query string, and to drop the
 query
  if too heavy wildcard are found.
 
  Another option is to create a query parser, which wraps the current
  (specified or default) qparser, and does the same work as above.
 
  These two options require an analysis of the query text, which might be
 an
  ugly work (just think about nested queries [using _query_], OR even a lot
  of more basic scenarios like quoted terms, etc.)
 
  Am I missing a simple and clean way to do this?
  What would you do?
 
  P.S. if no simple solution exists, timeAllowed limit is the best
  work-around I could think about. Any other suggestions?
 



RE: sourceId of JMX

2013-05-27 Thread 菅沼 嘉一
Thank you, Shalin.
I'll see it.

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Monday, May 27, 2013 11:11 PM
To: solr-user@lucene.apache.org
Subject: Re: sourceId of JMX

I opened https://issues.apache.org/jira/browse/SOLR-4863


On Mon, May 27, 2013 at 7:35 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 This is a bug. The sourceId should have been removed from the
 SolrDynamicMBean. I'll create an issue.


 On Mon, May 27, 2013 at 6:39 PM, 菅沼 嘉一 yo_sugan...@waku-2.com
wrote:

 Hello

 Our team faced the problem regarding the sourceId of JMX when getting the
 information of JMX from tomcat manager.

 Command:
 curl http://localhost:
 ${PORT}/manager/jmxproxy?qry=solr:type=documentCache,*

 Here is the error log (tomcat/manager log).

 ---
 2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log
 JMXProxy: Error getting attribute
 solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId
 javax.management.AttributeNotFoundException: sourceId

 ---

 Solr ver. : 4.1.0

 I think this error comes from when JMX cannot get the sourceId.


 BTW Let's look at this issue.
 https://issues.apache.org/jira/browse/SOLR-3329

 It is decided to drop getSourceId() in this issue.

 But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean,
 staticStats.add(sourceId) is still defined in SolrDynamicMBean at L211.



http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/Jmx
MonitoredMap.SolrDynamicMBean.java.html#line.202
 --
 l.211  staticStats.add(sourceId);
 --

 Maybe this error comes from this inconsistency.
 This problem is not critical, but I think this is inconsistent.

 1. Anyone knows why staticStats.add(sourceId) still remained in
 SolrDynamicMBean?
 Do you have any idea?

 2. Has anyone faced such error ? How did you solved it?


 Thank you.

 Regards
 suganuma




 --
 Regards,
 Shalin Shekhar Mangar.




--
Regards,
Shalin Shekhar Mangar.


Solr 4.3.0 geo search with multiple coordinates

2013-05-27 Thread Eric Grobler
Hi Solr experts,

I have a solr 4.3 schema
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
  geo="true" distErrPct="0.025" maxDistErr="0.09" units="degrees" />

<field name="location_geo" type="location" indexed="true" stored="true"
  multiValued="true" />

and xml data
<field name="location_geo">51.1164,6.9612</field>
<field name="location_geo">52.3473,9.77564</field>

If I run this query:
fq={!geofilt pt=51.11,6.9 sfield=location_geo d=20}
I get no result.


But if I remove the second geo line and only have this geo coordinate it
works:
<field name="location_geo">51.1164,6.9612</field>

Thus it seems that the multi-valued index does not work, even though Solr
returns the doc values as:
<arr name="location_geo">
  <str>51.1164,6.9612</str>
  <str>52.3473,9.77564</str>
</arr>


Is my schema wrongly configured?

Thanks
Ericz


Re: Prevention of heavy wildcard queries

2013-05-27 Thread Roman Chyla
Hi Isaac,
it is as you say, with the exception that you create a QParserPlugin, not a
search component

* create a QParserPlugin, give it some name, e.g. 'nw'
* make a copy of the pipeline - your processor should be at the same place as,
or just above, the wildcard processor

also make sure you are setting your qparser for FQ queries, i.e.
fq={!nw}foo
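
A rough sketch of what such an 'nw' wrapper plugin could look like (the class name is invented and the Solr 4.x plugin API is assumed; a crude regex check on the raw query string stands in here for plugging the modified flexible-parser pipeline into the plugin):

import java.util.regex.Pattern;

import org.apache.lucene.search.Query;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class NoDoubleWildcardQParserPlugin extends QParserPlugin {

  // crude textual check for a term wildcarded at both ends, e.g. *foo*
  private static final Pattern DOUBLE_WILDCARD = Pattern.compile("\\*[^\\s*]+\\*");

  @Override
  public void init(NamedList args) {
  }

  @Override
  public QParser createParser(final String qstr, final SolrParams localParams,
                              final SolrParams params, final SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        if (qstr != null && DOUBLE_WILDCARD.matcher(qstr).find()) {
          throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
              "Terms with a wildcard at both ends are not allowed: " + qstr);
        }
        // hand the untouched query string to the default (lucene) parser
        return QParser.getParser(qstr, null, req).parse();
      }
    };
  }
}

Registered in solrconfig.xml via a <queryParser name="nw" class="..."/> entry pointing at it, both q={!nw}... and fq={!nw}... then go through the check before being parsed as usual.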


On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Thanks Roman.
 Based on some of your suggestions, will the steps below do the work?

 * Create (and register) a new SearchComponent
 * In its prepare method: Do for Q and all of the FQs (so this
 SearchComponent should run AFTER QueryComponent, in order to see all of the
 FQs)
 * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
 with a special implementation of QueryNodeProcessorPipeline, which contains
 my NodeProcessor in the top of its list.
 * Set my analyzer into that StandardQueryParser
 * My NodeProcessor will be called for each term in the query, so it can
 throw an exception if a (basic) querynode contains wildcard in both start
 and end of the term.

 Do I have a way to avoid from reimplementing the whole StandardQueryParser
 class?


you can try subclassing it, if it allows it


 Will this work for both LuceneQParser and EdismaxQParser queries?


this will not work for edismax; nothing but changing the edismax qparser
will do the trick



 Any other solution/work-around? How do other production environments of
 Solr overcome this issue?


you can also try modifying the standard Solr parser, or even the JavaCC-generated
classes. I believe many people do just that (or some sort of preprocessing).

roman




 On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  You are right that starting to parse the query before the query component
  can get soon very ugly and complicated. You should take advantage of the
  flex parser, it is already in lucene contrib - but if you are interested
 in
  the better version, look at
  https://issues.apache.org/jira/browse/LUCENE-5014
 
  The way you can solve this is:
 
  1. use the standard syntax grammar (which allows *foo*)
  2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or
  raise error etc
 
  this way, you are changing semantics - but don't need to touch the syntax
  definition; of course, you may also change the grammar and allow only one
  instance of wildcard (or some combination) but for that you should
 probably
  use LUCENE-5014
 
  roman
 
  On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com
  wrote:
 
   Hi.
  
   Searching terms with wildcard in their start, is solved with
   ReversedWildcardFilterFactory. But, what about terms with wildcard in
  both
   start AND end?
  
   This query is heavy, and I want to disallow such queries from my users.
  
   I'm looking for a way to cause these queries to fail.
   I guess there is no built-in support for my need, so it is OK to write
 a
   new solution.
  
   My current plan is to create a search component (which will run before
   QueryComponent). It should analyze the query string, and to drop the
  query
   if too heavy wildcard are found.
  
   Another option is to create a query parser, which wraps the current
   (specified or default) qparser, and does the same work as above.
  
   These two options require an analysis of the query text, which might be
  an
   ugly work (just think about nested queries [using _query_], OR even a
 lot
   of more basic scenarios like quoted terms, etc.)
  
   Am I missing a simple and clean way to do this?
   What would you do?
  
   P.S. if no simple solution exists, timeAllowed limit is the best
   work-around I could think about. Any other suggestions?
  
 



Re: Solr 4.3.0 geo search with multiple coordinates

2013-05-27 Thread Eric Grobler
I think I found the reason/bug:
the type was wrong, it should be
<field name="location_geo" type="location_rpt" indexed="true" stored="true"
  multiValued="true" />
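
With that corrected, a quick SolrJ sketch to confirm that either stored coordinate now matches (the core URL is a placeholder; the field name and points are the ones from this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GeoFiltCheck {
  public static void main(String[] args) throws Exception {
    // placeholder core URL
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // point near the first stored coordinate (51.1164,6.9612)
    SolrQuery near1 = new SolrQuery("*:*");
    near1.addFilterQuery("{!geofilt pt=51.11,6.9 sfield=location_geo d=20}");
    QueryResponse rsp1 = solr.query(near1);
    System.out.println("near 51.11,6.9  -> numFound=" + rsp1.getResults().getNumFound());

    // point near the second stored coordinate (52.3473,9.77564)
    SolrQuery near2 = new SolrQuery("*:*");
    near2.addFilterQuery("{!geofilt pt=52.34,9.77 sfield=location_geo d=20}");
    QueryResponse rsp2 = solr.query(near2);
    System.out.println("near 52.34,9.77 -> numFound=" + rsp2.getResults().getNumFound());

    solr.shutdown();
  }
}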


On Tue, May 28, 2013 at 1:37 AM, Eric Grobler impalah...@googlemail.com wrote:

 Hi Solr experts,

 I have a solr 4.3 schema
 <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
   geo="true" distErrPct="0.025" maxDistErr="0.09" units="degrees" />

 <field name="location_geo" type="location" indexed="true" stored="true"
   multiValued="true" />

 and xml data
 <field name="location_geo">51.1164,6.9612</field>
 <field name="location_geo">52.3473,9.77564</field>

 If I run this query:
 fq={!geofilt pt=51.11,6.9 sfield=location_geo d=20}
 I get no result.


 But if I remove the second geo line and only have this geo coordinate it
 works:
 <field name="location_geo">51.1164,6.9612</field>

 Thus it seems that the multi-valued index does not work, even though Solr
 returns the doc values as:
 <arr name="location_geo">
   <str>51.1164,6.9612</str>
   <str>52.3473,9.77564</str>
 </arr>


 Is my schema wrongly configured?

 Thanks
 Ericz






RE: sourceId of JMX

2013-05-27 Thread 菅沼 嘉一
Shalin,

We tried using it after removing staticStats.add(sourceId), and it seems to work
with no problem.
Do you know of any other side effects of removing it?

Regards
suganuma
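
To see why the stale staticStats entry shows up as javax.management.AttributeNotFoundException in the jmxproxy log quoted below, here is a small self-contained JMX illustration. This is not Solr's code, just a DynamicMBean with the same shape of inconsistency: its MBeanInfo advertises an attribute that getAttribute() can no longer resolve:

import java.lang.management.ManagementFactory;

import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class StaleAttributeDemo {

  static class Demo implements DynamicMBean {
    public Object getAttribute(String name) throws AttributeNotFoundException {
      // the attribute is advertised in getMBeanInfo() below, but nothing backs
      // it any more, so every lookup fails
      throw new AttributeNotFoundException(name);
    }
    public void setAttribute(Attribute attribute) {}
    public AttributeList getAttributes(String[] names) { return new AttributeList(); }
    public AttributeList setAttributes(AttributeList list) { return list; }
    public Object invoke(String op, Object[] params, String[] sig) { return null; }
    public MBeanInfo getMBeanInfo() {
      MBeanAttributeInfo stale = new MBeanAttributeInfo(
          "sourceId", "java.lang.String", "stale attribute", true, false, false);
      return new MBeanInfo(Demo.class.getName(), "demo",
          new MBeanAttributeInfo[] { stale }, null, null, null);
    }
  }

  public static void main(String[] args) throws Exception {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    ObjectName name = new ObjectName("demo:type=inconsistent");
    server.registerMBean(new Demo(), name);

    // this loop is roughly what a "dump all attributes" client like jmxproxy does
    for (MBeanAttributeInfo info : server.getMBeanInfo(name).getAttributes()) {
      try {
        server.getAttribute(name, info.getName());
      } catch (AttributeNotFoundException e) {
        System.out.println("Error getting attribute " + info.getName() + ": " + e);
      }
    }
  }
}

Walking the advertised attributes, as Tomcat's jmxproxy does, then prints an "Error getting attribute sourceId" style message for exactly that attribute.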

-Original Message-
From: 菅沼 嘉一 [mailto:yo_sugan...@waku-2.com]
Sent: Tuesday, May 28, 2013 9:30 AM
To: solr-user@lucene.apache.org
Subject: RE: sourceId of JMX

Thank you, Shalin.
I'll see it.

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Monday, May 27, 2013 11:11 PM
To: solr-user@lucene.apache.org
Subject: Re: sourceId of JMX

I opened https://issues.apache.org/jira/browse/SOLR-4863


On Mon, May 27, 2013 at 7:35 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 This is a bug. The sourceId should have been removed from the
 SolrDynamicMBean. I'll create an issue.


 On Mon, May 27, 2013 at 6:39 PM, 菅沼 嘉一 yo_sugan...@waku-2.com
wrote:

 Hello

 Our team faced the problem regarding the sourceId of JMX when getting the
 information of JMX from tomcat manager.

 Command:
 curl http://localhost:
 ${PORT}/manager/jmxproxy?qry=solr:type=documentCache,*

 Here is the error log (tomcat/manager log).

 ---
 2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log
 JMXProxy: Error getting attribute
 solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId
 javax.management.AttributeNotFoundException: sourceId

 ---

 Solr ver. : 4.1.0

 I think this error comes from when JMX cannot get the sourceId.


 BTW Let's look at this issue.
 https://issues.apache.org/jira/browse/SOLR-3329

 It is decided to drop getSourceId() in this issue.

 But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean,
 staticStats.add(sourceId) is still defined in SolrDynamicMBean at L211.



http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202
 --
 l.211  staticStats.add(sourceId);
 --

 Maybe this error comes from this inconsistency.
 This problem is not critical, but I think this is inconsistent.

 1. Anyone knows why staticStats.add(sourceId) still remained in
 SolrDynamicMBean?
 Do you have any idea?

 2. Has anyone faced such error ? How did you solved it?


 Thank you.

 Regards
 suganuma




 --
 Regards,
 Shalin Shekhar Mangar.




--
Regards,
Shalin Shekhar Mangar.


Re: Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
I don't want to affect the (correctness of the) real query parsing, so
creating a QParserPlugin is risky.
Instead, if I parse the query in my own search component, it will be
detached from the real query parsing (obviously this causes double
parsing, but assume that's OK)...
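
A rough sketch of that guard component (the class name and regex are illustrative; checking the raw q/fq parameter strings is a crude stand-in for a real second parse, so nested queries, quoted phrases and the like can still slip through or misfire):

import java.util.regex.Pattern;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class HeavyWildcardGuardComponent extends SearchComponent {

  // crude textual check for a term wildcarded at both ends, e.g. *foo*
  private static final Pattern DOUBLE_WILDCARD = Pattern.compile("\\*[^\\s*]+\\*");

  @Override
  public void prepare(ResponseBuilder rb) {
    check(rb.req.getParams().get(CommonParams.Q));
    String[] fqs = rb.req.getParams().getParams(CommonParams.FQ);
    if (fqs != null) {
      for (String fq : fqs) {
        check(fq);
      }
    }
  }

  private void check(String query) {
    if (query != null && DOUBLE_WILDCARD.matcher(query).find()) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "Terms with a wildcard at both ends are not allowed: " + query);
    }
  }

  @Override
  public void process(ResponseBuilder rb) {
    // nothing to do at process time; the check already ran in prepare()
  }

  @Override
  public String getDescription() {
    return "Rejects queries containing *foo* style double wildcards";
  }

  @Override
  public String getSource() {
    return "sketch";
  }
}

Registered in solrconfig.xml and added to the handler's component list (e.g. via first-components), its prepare() can reject the request before the query is actually executed.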


On Tue, May 28, 2013 at 3:52 AM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi Issac,
 it is as you say, with the exception that you create a QParserPlugin, not a
 search component

 * create QParserPlugin, give it some name, eg. 'nw'
 * make a copy of the pipeline - your component should be at the same place,
 or just above, the wildcard processor

 also make sure you are setting your qparser for FQ queries, ie.
 fq={!nw}foo


 On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Thanks Roman.
  Based on some of your suggestions, will the steps below do the work?
 
  * Create (and register) a new SearchComponent
  * In its prepare method: Do for Q and all of the FQs (so this
  SearchComponent should run AFTER QueryComponent, in order to see all of
 the
  FQs)
  * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
  with a special implementation of QueryNodeProcessorPipeline, which
 contains
  my NodeProcessor in the top of its list.
  * Set my analyzer into that StandardQueryParser
  * My NodeProcessor will be called for each term in the query, so it can
  throw an exception if a (basic) querynode contains wildcard in both start
  and end of the term.
 
  Do I have a way to avoid from reimplementing the whole
 StandardQueryParser
  class?
 

 you can try subclassing it, if it allows it


  Will this work for both LuceneQParser and EdismaxQParser queries?
 

 this will not work for edismax, nothing but changing the edismax qparser
 will do the trick


 
  Any other solution/work-around? How do other production environments of
  Solr overcome this issue?
 

 you can also try modifying the standard solr parser, or even the JavaCC
 generated classes
 I believe many people do just that (or some sort of preprocessing)

 roman


 
 
  On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   You are right that starting to parse the query before the query
 component
   can get soon very ugly and complicated. You should take advantage of
 the
   flex parser, it is already in lucene contrib - but if you are
 interested
  in
   the better version, look at
   https://issues.apache.org/jira/browse/LUCENE-5014
  
   The way you can solve this is:
  
   1. use the standard syntax grammar (which allows *foo*)
   2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case,
 or
   raise error etc
  
   this way, you are changing semantics - but don't need to touch the
 syntax
   definition; of course, you may also change the grammar and allow only
 one
   instance of wildcard (or some combination) but for that you should
  probably
   use LUCENE-5014
  
   roman
  
   On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com
   wrote:
  
Hi.
   
Searching terms with wildcard in their start, is solved with
ReversedWildcardFilterFactory. But, what about terms with wildcard in
   both
start AND end?
   
This query is heavy, and I want to disallow such queries from my
 users.
   
I'm looking for a way to cause these queries to fail.
I guess there is no built-in support for my need, so it is OK to
 write
  a
new solution.
   
My current plan is to create a search component (which will run
 before
QueryComponent). It should analyze the query string, and to drop the
   query
if too heavy wildcard are found.
   
Another option is to create a query parser, which wraps the current
(specified or default) qparser, and does the same work as above.
   
These two options require an analysis of the query text, which might
 be
   an
ugly work (just think about nested queries [using _query_], OR even a
  lot
of more basic scenarios like quoted terms, etc.)
   
Am I missing a simple and clean way to do this?
What would you do?
   
P.S. if no simple solution exists, timeAllowed limit is the best
work-around I could think about. Any other suggestions?
   
  
 



Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-27 Thread Rajesh Nikam
Hello Koji,

This seems like a pretty useful post on how to create a synonyms file.
Thanks a lot for sharing this!

Have you shared the source code / jar for it so that it could be used?

Thanks,
Rajesh



On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 Hello,

 Sorry for cross post. I just wanted to announce that I've written a blog
 post on
 how to create synonyms.txt file automatically from Wikipedia:


 http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

 Hope that the article gives someone a good experience!

 koji
 --

 http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html



Re: sourceId of JMX

2013-05-27 Thread Shalin Shekhar Mangar
Suganuma,

No, there shouldn't be any side effects.


On Tue, May 28, 2013 at 7:13 AM, 菅沼 嘉一 yo_sugan...@waku-2.com wrote:

 Shalin,

 We tried use it after removing staticStats.add(sourceId), it seems going
 with no problem.
 Do you know any other side effects by removing it ?

 Regards
 suganuma

 -Original Message-
 From: 菅沼 嘉一 [mailto:yo_sugan...@waku-2.com]
 Sent: Tuesday, May 28, 2013 9:30 AM
 To: solr-user@lucene.apache.org
 Subject: RE: sourceId of JMX
 
 Thank you, Shalin.
 I'll see it.
 
 -Original Message-
 From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
 Sent: Monday, May 27, 2013 11:11 PM
 To: solr-user@lucene.apache.org
 Subject: Re: sourceId of JMX
 
 I opened https://issues.apache.org/jira/browse/SOLR-4863
 
 
 On Mon, May 27, 2013 at 7:35 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:
 
  This is a bug. The sourceId should have been removed from the
  SolrDynamicMBean. I'll create an issue.
 
 
  On Mon, May 27, 2013 at 6:39 PM, 菅沼 嘉一 yo_sugan...@waku-2.com
 wrote:
 
  Hello
 
  Our team faced the problem regarding the sourceId of JMX when getting
 the
  information of JMX from tomcat manager.
 
  Command:
  curl http://localhost:
  ${PORT}/manager/jmxproxy?qry=solr:type=documentCache,*
 
  Here is the error log (tomcat/manager log).
 
 
 ---
  2013/05/27 0:04:01 org.apache.catalina.core.ApplicationContext log
  JMXProxy: Error getting attribute
  solr:type=documentCache,id=org.apache.solr.search.LRUCache sourceId
  javax.management.AttributeNotFoundException: sourceId
 
 
 ---
 
  Solr ver. : 4.1.0
 
  I think this error comes from when JMX cannot get the sourceId.
 
 
  BTW Let's look at this issue.
  https://issues.apache.org/jira/browse/SOLR-3329
 
  It is decided to drop getSourceId() in this issue.
 
  But in org.apache.solr.core.JmxMonitoredMap.SolrDynamicMBean,
  staticStats.add(sourceId) is still defined in SolrDynamicMBean at
 L211.
 
 
 
 
 http://javasourcecode.org/html/open-source/solr/solr-3.3.0/org/apache/solr/core/JmxMonitoredMap.SolrDynamicMBean.java.html#line.202
  --
  l.211  staticStats.add(sourceId);
  --
 
  Maybe this error comes from this inconsistency.
  This problem is not critical, but I think this is inconsistent.
 
  1. Anyone knows why staticStats.add(sourceId) still remained in
  SolrDynamicMBean?
  Do you have any idea?
 
  2. Has anyone faced such error ? How did you solved it?
 
 
  Thank you.
 
  Regards
  suganuma
 
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 
 
 
 --
 Regards,
 Shalin Shekhar Mangar.




-- 
Regards,
Shalin Shekhar Mangar.