Re: SOLRJ replace document

2013-10-18 Thread Jack Krupansky
By all means, please do file a support request with DataStax, either as an 
official support ticket or as a question on Stack Overflow.


But I do think the previous answer, avoiding the use of a Map object in 
your document, is likely the solution.


-- Jack Krupansky

-Original Message- 
From: Brent Ryan

Sent: Friday, October 18, 2013 10:21 PM
To: solr-user@lucene.apache.org
Subject: Re: SOLRJ replace document

So I think the issue might be related to the tech stack we're using, which
is Solr within DataStax Enterprise and doesn't support atomic updates.
But I think it must have some sort of bug around this, because it doesn't
appear to work correctly for this use case when using SolrJ... Anyway,
I've contacted support, so let's see what they say.


On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey  wrote:


On 10/18/2013 3:36 PM, Brent Ryan wrote:


My schema is pretty simple and has a string field called solr_id as my
unique key.  Once I get back to my computer I'll send some more details.



If you are trying to use a Map object as the value of a field, that is
probably why it is interpreting your add request as an atomic update.  If
this is the case, and you're doing it because you have a multivalued field,
you can use a List object rather than a Map.

If this doesn't sound like what's going on, can you share your code, or a
simplification of the SolrJ parts of it?

Thanks,
Shawn






Re: SOLRJ replace document

2013-10-18 Thread Jason Hellman
Keep in mind that DataStax has a custom update handler, and as such isn't 
exactly a vanilla Solr implementation (even though in many ways it still is).  
Since updates are co-written to Cassandra and Solr, you should always tread a 
bit carefully when operating slightly outside what they perceive to be the norms.


On Oct 18, 2013, at 7:21 PM, Brent Ryan  wrote:

> So I think the issue might be related to the tech stack we're using, which
> is Solr within DataStax Enterprise and doesn't support atomic updates.
> But I think it must have some sort of bug around this, because it doesn't
> appear to work correctly for this use case when using SolrJ... Anyway,
> I've contacted support, so let's see what they say.
> 
> 
> On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey  wrote:
> 
>> On 10/18/2013 3:36 PM, Brent Ryan wrote:
>> 
>>> My schema is pretty simple and has a string field called solr_id as my
>>> unique key.  Once I get back to my computer I'll send some more details.
>>> 
>> 
>> If you are trying to use a Map object as the value of a field, that is
>> probably why it is interpreting your add request as an atomic update.  If
>> this is the case, and you're doing it because you have a multivalued field,
>> you can use a List object rather than a Map.
>> 
>> If this doesn't sound like what's going on, can you share your code, or a
>> simplification of the SolrJ parts of it?
>> 
>> Thanks,
>> Shawn
>> 
>> 



Re: SOLRJ replace document

2013-10-18 Thread Brent Ryan
So I think the issue might be related to the tech stack we're using, which
is Solr within DataStax Enterprise and doesn't support atomic updates.
But I think it must have some sort of bug around this, because it doesn't
appear to work correctly for this use case when using SolrJ... Anyway,
I've contacted support, so let's see what they say.


On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey  wrote:

> On 10/18/2013 3:36 PM, Brent Ryan wrote:
>
>> My schema is pretty simple and has a string field called solr_id as my
>> unique key.  Once I get back to my computer I'll send some more details.
>>
>
> If you are trying to use a Map object as the value of a field, that is
> probably why it is interpreting your add request as an atomic update.  If
> this is the case, and you're doing it because you have a multivalued field,
> you can use a List object rather than a Map.
>
> If this doesn't sound like what's going on, can you share your code, or a
> simplification of the SolrJ parts of it?
>
> Thanks,
> Shawn
>
>


Re: SolrCloud Performance Issue

2013-10-18 Thread Otis Gospodnetic
Hi,

What happens if you have just 1 shard - no distributed search, like
before? SPM for Solr or any other monitoring tool that captures OS and
Solr metrics should help you find the source of the problem faster.
Is disk IO the same? utilization of caches? JVM version, heap, etc.?
CPU usage? network?  I'd look at each of these things side by side and
look for big differences.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
SOLR Performance Monitoring -- http://sematext.com/spm



On Fri, Oct 18, 2013 at 1:38 AM, shamik  wrote:
> I tried commenting out NOW in bq, but it didn't make any difference in
> performance. I do see a minor entry in the query filter cache rate, which
> is a meager 0.02.
>
> I'm really struggling to figure out the bottleneck. Any known pain points I
> should be checking?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-Performance-Issue-tp4095971p4096277.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: XLSB files not indexed

2013-10-18 Thread Otis Gospodnetic
Hi Roland,

It looks like:
Tika - yes
Solr - no?

Based on http://search-lucene.com/?q=xlsb

ODF != XLSB though, I think...

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Fri, Oct 18, 2013 at 7:36 AM, Roland Everaert  wrote:
> Hi,
>
> Can someone tell me if Tika is supposed to extract data from XLSB files
> (the new MS Office format in binary form)?
>
> If so, then it seems that Solr is not able to index them, just as it is not
> able to index ODF files (a JIRA is already open for ODF:
> https://issues.apache.org/jira/browse/SOLR-4809)
>
> Can someone confirm the problem, or tell me what to do to make Solr work
> with XLSB files?
>
>
> Regards,
>
>
> Roland.


Re: how to retrieve content page in solr

2013-10-18 Thread Otis Gospodnetic
Hi,

Ignore Nutch for a bit and just follow the Solr tutorial to learn about the
Solr side. Should be quick.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Oct 18, 2013 11:30 AM, "javozzo"  wrote:

> Hi Harshvardhan Ojha,
> I'm using Nutch 1.1 and Solr 3.6.0.
> I mean the whole document. I'm trying to create a search engine with Nutch
> and Solr, and I would like an interface like this:
>
> name1
> http://www.prova.com/name1.html
> first rows of content document
>
> name2
> http://www.prova.com/name2.html
> first rows of content document
>
> name3
> http://www.prova.com/name3.html
> first rows of content document
>
> any ideas?
> Thanks
> Danilo
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302p4096333.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr timeout after reboot

2013-10-18 Thread Otis Gospodnetic
Michael,

The servlet container controls timeouts, max threads and such. That's not a
high query rate, but yes, it could be that Solr or OS caches are cold. You
will be able to see all this in SPM for Solr while you hammer your poor Solr
servers :)

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Oct 18, 2013 11:38 AM, "michael.boom"  wrote:

> I have a SolrCloud environment with 4 shards, each having a replica and a
> leader. The index size is about 70M docs and 60Gb, running with Jetty +
> Zookeeper, on 2 EC2 instances, each with 4CPUs and 15G RAM.
>
> I'm using SolrMeter for stress testing.
> If I restart Jetty and then try to use SolrMeter to bomb an instance with
> queries at a rate of 3000 queries per minute, that Solr instance somehow
> times out and I need to restart it again.
> If instead of 3000 qpm I start up slowly with 200 for a minute or two,
> then 1800 and then 3000, everything is good.
>
> I assume this happens because Solr is not warmed up.
> What settings could I tweak so that Solr doesn't time out anymore when
> getting many requests? Is there a way to limit how many requests it can serve?
>
>
>
> -
> Thanks,
> Michael
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
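One common starting point for this (a sketch; the cache sizes and the warming
query are illustrative, not from the thread) is cache autowarming plus
firstSearcher/newSearcher warming queries in solrconfig.xml, so a restarted
node does some work before taking real traffic:

    <filterCache class="solr.FastLRUCache"
                 size="512" initialSize="512" autowarmCount="128"/>

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str></lst>
      </arr>
    </listener>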


Leader election fails at some point.

2013-10-18 Thread yriveiro
Hi,

In this screenshot I have a shard with two replicas without leader,

http://picpaste.com/qf2jdkj8.png

On the machine with the green shard I found this exception:

INFO  - dat5 - 2013-10-18 22:48:04.775;
org.apache.solr.handler.admin.CoreAdminHandler; Going to wait for
coreNodeName: 192.168.20.106:8983_solr_statistics-13_shard18_replica4,
state: recovering, checkLive: true, onlyIfLeader: true
ERROR - dat5 - 2013-10-18 22:48:04.775;
org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
We are not the leader
at
org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:824)
at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:192)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:655)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:246)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
--
at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)

On the machine with the shard in recovery state I found this exception:

INFO  - dat6 - 2013-10-18 22:48:44.131;
org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process
for shard shard18
INFO  - dat6 - 2013-10-18 22:48:44.137;
org.apache.solr.cloud.ShardLeaderElectionContext; Checking if I should try
and be the leader.
INFO  - dat6 - 2013-10-18 22:48:44.138;
org.apache.solr.cloud.ShardLeaderElectionContext; My last published State
was recovering, I won't be the leader.
INFO  - dat6 - 2013-10-18 22:48:44.139;
org.apache.solr.cloud.ShardLeaderElectionContext; There may be a better
leader candidate than us - going back into recovery
INFO  - dat6 - 2013-10-18 22:48:44.142;
org.apache.solr.update.DefaultSolrCoreState; Running recovery - first
canceling any ongoing recovery
WARN  - dat6 - 2013-10-18 22:48:44.142;
org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
zkNodeName=192.168.20.106:8983_solr_statistics-13_shard18_replica4core=statistics-13_shard18_replica4
INFO  - dat6 - 2013-10-18 22:48:45.131;
org.apache.solr.cloud.RecoveryStrategy; Finished recovery process.
core=statistics-13_shard18_replica4
INFO  - dat6 - 2013-10-18 22:48:45.131;
org.apache.solr.cloud.RecoveryStrategy; Starting recovery process. 
core=statistics-13_shard18_replica4 recoveringAfterStartup=false
INFO  - dat6 - 2013-10-18 22:48:45.131; org.apache.solr.cloud.ZkController;
publishing core=statistics-13_shard18_replica4 state=recovering
INFO  - dat6 - 2013-10-18 22:48:45.132; org.apache.solr.cloud.ZkController;
numShards not found on descriptor - reading it from system property
INFO  - dat6 - 2013-10-18 22:48:45.141;
org.apache.solr.client.solrj.impl.HttpClientUtil; Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
ERROR - dat6 - 2013-10-18 22:48:45.143;
org.apache.solr.common.SolrException; Error while trying to recover.
core=statistics-13_shard18_replica4:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
We are not the leader
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)

No leader means we can't index data, because a 503 HTTP status code is
returned.

Is this normal behaviour or a bug?



-
Best regards
--
View this message in context: 
http://lucene.472066.n3.nabb

Re: Seeking New Moderators for solr-user@lucene

2013-10-18 Thread Alexandre Rafalovitch
I'll be happy to moderate. I do it for some other lists already.

Regards,
Alex


Re: SOLRJ replace document

2013-10-18 Thread Shawn Heisey

On 10/18/2013 3:36 PM, Brent Ryan wrote:

My schema is pretty simple and has a string field called solr_id as my
unique key.  Once I get back to my computer I'll send some more details.


If you are trying to use a Map object as the value of a field, that is 
probably why it is interpreting your add request as an atomic update.  
If this is the case, and you're doing it because you have a multivalued 
field, you can use a List object rather than a Map.


If this doesn't sound like what's going on, can you share your code, or 
a simplification of the SolrJ parts of it?


Thanks,
Shawn
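To make the distinction concrete, a minimal SolrJ sketch (the multivalued
field name and the server URL here are illustrative, not from the thread):

    import java.util.Arrays;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AddDocExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("solr_id", "doc-1");
            // Multivalued field: pass a List; each element becomes one value.
            doc.addField("tags", Arrays.asList("red", "blue"));
            // Passing a Map here instead (e.g. {"set": "red"}) would be read
            // as an atomic-update instruction, not a plain field value.
            server.add(doc);   // re-adding the same solr_id replaces the doc
            server.commit();
        }
    }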



Re: Issues with Language detection in Solr

2013-10-18 Thread Jack Krupansky

Sorry, but Latin is not on the list of supported languages:

https://code.google.com/p/language-detection/wiki/LanguageList

-- Jack Krupansky

-Original Message- 
From: vibhoreng04

Sent: Friday, October 18, 2013 3:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Issues with Language detection in Solr

I agree with you, Jack. But note that this filter otherwise works perfectly
fine. Only in one case, where all the words are Latin, is the language
detected as German. My question is why, and how?
If it works perfectly for the other docs, what in this case is making it
behave abnormally?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433p4096443.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: SOLRJ replace document

2013-10-18 Thread Brent Ryan
My schema is pretty simple and has a string field called solr_id as my
unique key.  Once I get back to my computer I'll send some more details.

Brent

On Friday, October 18, 2013, Shawn Heisey wrote:

> On 10/18/2013 2:59 PM, Brent Ryan wrote:
>
>> How do I replace a document in solr using solrj library?  I keep getting
>> this error back:
>>
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> Atomic document updates are not supported unless <updateLog/> is
>> configured
>>
>> I don't want to do partial updates, I just want to replace it...
>>
>
> Replacing a document is done by simply adding the document, in the same
> way as if you were adding a new one.  If you have properly configured Solr,
> the old one will be deleted before the new one is inserted.  Properly
> configuring Solr means that you have a uniqueKey field in your schema, and
> that it is a simple type like string, int, long, etc, and is not
> multivalued. A TextField type that is tokenized cannot be used as the
> uniqueKey field.
>
> Thanks,
> Shawn
>
>


Re: loading djvu xml into solr

2013-10-18 Thread sara amato
Ah, thanks for the clarification - I was having a serious misunderstanding!  
(As you can tell I'm newly off the tutorial and blundering ahead...)

On Oct 18, 2013, at 2:22 PM, Upayavira wrote:

> 
> 
> On Fri, Oct 18, 2013, at 10:11 PM, Sara Amato wrote:
>> Does anyone have a schema they'd be willing to share for loading djvu xml
>> into solr?  
> 
> I assume that djvu XML is a particular XML format? In which case, there
> is no schema that can do it. That's not how Solr works.
> 
> You need to use the XML format expected by Solr. Or, you can add
> tr=.xsl to the URL, and use an XSL stylesheet to transform your XML
> into Solr's XML format.
> 
> The schema defines the fields that are present in the index, not the
> format of the XML used.
> 
> Upayavira



Re: SOLRJ replace document

2013-10-18 Thread Shawn Heisey

On 10/18/2013 2:59 PM, Brent Ryan wrote:

How do I replace a document in solr using solrj library?  I keep getting
this error back:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Atomic document updates are not supported unless <updateLog/> is configured

I don't want to do partial updates, I just want to replace it...


Replacing a document is done by simply adding the document, in the same 
way as if you were adding a new one.  If you have properly configured 
Solr, the old one will be deleted before the new one is inserted.  
Properly configuring Solr means that you have a uniqueKey field in 
your schema, and that it is a simple type like string, int, long, etc, 
and is not multivalued. A TextField type that is tokenized cannot be 
used as the uniqueKey field.


Thanks,
Shawn



Re: SOLRJ replace document

2013-10-18 Thread Brent Ryan
I wish that were the case, but calling addDoc() is what's triggering that
exception.

On Friday, October 18, 2013, Jack Krupansky wrote:

> To "replace" a Solr document, simply "add" it again using the same
> technique used to insert the original document. The "set" option for atomic
> update is only used when you wish to selectively update only some of the
> fields for a document, and that does require that the update log be enabled
> using <updateLog/>.
>
> -- Jack Krupansky
>
> -Original Message- From: Brent Ryan
> Sent: Friday, October 18, 2013 4:59 PM
> To: solr-user@lucene.apache.org
> Subject: SOLRJ replace document
>
> How do I replace a document in solr using solrj library?  I keep getting
> this error back:
>
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> Atomic document updates are not supported unless <updateLog/> is configured
>
> I don't want to do partial updates, I just want to replace it...
>
>
> Thanks,
> Brent
>


Re: SOLRJ replace document

2013-10-18 Thread Jack Krupansky
To "replace" a Solr document, simply "add" it again using the same technique 
used to insert the original document. The "set" option for atomic update is 
only used when you wish to selectively update only some of the fields for a 
document, and that does require that the update log be enabled using 
<updateLog/>.
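For reference, a minimal solrconfig.xml sketch of enabling the update log
(the dir property shown is the stock example value):

    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
    </updateHandler>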


-- Jack Krupansky

-Original Message- 
From: Brent Ryan

Sent: Friday, October 18, 2013 4:59 PM
To: solr-user@lucene.apache.org
Subject: SOLRJ replace document

How do I replace a document in solr using solrj library?  I keep getting
this error back:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Atomic document updates are not supported unless <updateLog/> is configured

I don't want to do partial updates, I just want to replace it...


Thanks,
Brent 



Re: Solr 4.3 Startup with Multiple Cores Hangs on "Registering Core"

2013-10-18 Thread Jonatan Fournier
Hello,

I still have this issue using Solr 4.4, removing firstSearcher queries did
make the problem go away.

Note that I'm using Tomcat 7 and that if I'm using my own Java application
launching an Embedded Solr Server pointing to the same Solr configuration
the server fully starts with no hang.

What is the XML tag syntax to set spellcheck=false for the firstSearcher
queries discussed above?

Cheers,

/jonatan
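Presumably something like the following, judging from the workaround in the
quoted thread below - the spellcheck param is added to each firstSearcher
warming query (the query string here is taken from the log output):

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">static firstSearcher warming in solrconfig.xml</str>
          <str name="spellcheck">false</str>
        </lst>
      </arr>
    </listener>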

--- HANG with Tomcat 7 (firstSearcher queries on) ---
<...>
2409 [coreLoadExecutor-3-thread-3] INFO
 org.apache.solr.handler.component.SpellCheckComponent  – No queryConverter
defined, using default converter
2409 [coreLoadExecutor-3-thread-3] INFO
 org.apache.solr.handler.component.QueryElevationComponent  – Loading
QueryElevation from: /var/lib/myapp/conf/elevate.xml
2415 [coreLoadExecutor-3-thread-3] INFO
 org.apache.solr.handler.ReplicationHandler  – Commits will be reserved for
 1
2415 [searcherExecutor-16-thread-1] INFO  org.apache.solr.core.SolrCore  –
QuerySenderListener sending requests to
Searcher@5c43ecf0main{StandardDirectoryReader(segments_3:23
_9(4.4):C57862)}
2417 [searcherExecutor-16-thread-1] INFO  org.apache.solr.core.SolrCore  –
[foo-20130912] webapp=null path=null
params={event=firstSearcher&q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false}
hits=0 status=0 QTime=1
2417 [searcherExecutor-16-thread-1] INFO  org.apache.solr.core.SolrCore  –
QuerySenderListener done.
2417 [searcherExecutor-16-thread-1] INFO
 org.apache.solr.handler.component.SpellCheckComponent  – Loading spell
index for spellchecker: default
2417 [searcherExecutor-16-thread-1] INFO
 org.apache.solr.handler.component.SpellCheckComponent  – Loading spell
index for spellchecker: wordbreak
2418 [searcherExecutor-16-thread-1] INFO  org.apache.solr.core.SolrCore  –
[foo-20130912] Registered new searcher
Searcher@5c43ecf0main{StandardDirectoryReader(segments_3:23
_9(4.4):C57862)}
2420 [coreLoadExecutor-3-thread-3] INFO  org.apache.solr.core.CoreContainer
 – registering core: foo-20130912

--- NO HANG EmbeddedSolrServer (firstSearcher queries on) ---
<...>
1797 [coreLoadExecutor-3-thread-1] INFO
 org.apache.solr.handler.component.SpellCheckComponent  – No queryConverter
defined, using default converter
1797 [coreLoadExecutor-3-thread-1] INFO
 org.apache.solr.handler.component.QueryElevationComponent  – Loading
QueryElevation from: /var/lib/myapp/conf/elevate.xml
1800 [coreLoadExecutor-3-thread-1] INFO
 org.apache.solr.handler.ReplicationHandler  – Commits will be reserved for
 1
1801 [searcherExecutor-15-thread-1] INFO  org.apache.solr.core.SolrCore  –
QuerySenderListener sending requests to
Searcher@27b104d7main{StandardDirectoryReader(segments_3:23
_9(4.4):C57862)}
1801 [searcherExecutor-15-thread-1] INFO  org.apache.solr.core.SolrCore  –
QuerySenderListener done.
1801 [searcherExecutor-15-thread-1] INFO
 org.apache.solr.handler.component.SpellCheckComponent  – Loading spell
index for spellchecker: default
1801 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.core.CoreContainer
 – registering core: foo-20130912
1801 [searcherExecutor-15-thread-1] INFO
 org.apache.solr.handler.component.SpellCheckComponent  – Loading spell
index for spellchecker: wordbreak
1801 [searcherExecutor-15-thread-1] INFO  org.apache.solr.core.SolrCore  –
[foo-20130912] Registered new searcher
Searcher@27b104d7main{StandardDirectoryReader(segments_3:23
_9(4.4):C57862)}


On Fri, Sep 6, 2013 at 4:29 PM, Austin Rasmussen wrote:

> : Do all of your cores have "newSearcher" event listeners configured or just
> : 2 (i'm trying to figure out if it's a timing fluke that these two are
> stalled, or if it's something special about the configs)
>
> All of my cores have both the "newSearcher" and "firstSearcher" event
> listeners configured. (The firstSearcher actually doesn't have any queries
> configured against it, so it probably should just be removed altogether)
>
> : Can you try removing the newSearcher listeners to confirm that that does
> in fact make the problem go away?
>
> Removing the "newSearcher" listeners does not make the problem go away;
> however, removing the "firstSearcher" listener (even if the "newSearcher"
> listener is still configured) does make the problem go away.
>
> : With the newSearcher listeners in place, can you try setting
> : "spellcheck=false" as a query param on the newSearcher listeners you have
> : configured and see if that works around the problem?
>
> Adding the "spellcheck=false" param to the "firstSearcher" listener does
> appear to work around the problem.
>
> : Assuming it's just 2 cores using these listeners: can you reproduce this
> problem with a simpler setup where only one of the affected cores is in use?
>
> Since it's not just these two cores, I'm not sure how to produce much of a
> simpler setup.  I did attempt to limit how many cores are loaded in the
> solr.xml, and found that if I cut it down to 56, it was able to load
> successfully (without any of the above config changed).
>
> If I cut i

Re: loading djvu xml into solr

2013-10-18 Thread Upayavira


On Fri, Oct 18, 2013, at 10:11 PM, Sara Amato wrote:
> Does anyone have a schema they'd be willing to share for loading djvu xml
> into solr?  

I assume that djvu XML is a particular XML format? In which case, there
is no schema that can do it. That's not how Solr works.

You need to use the XML format expected by Solr. Or, you can add
tr=.xsl to the URL, and use an XSL stylesheet to transform your XML
into Solr's XML format.

The schema defines the fields that are present in the index, not the
format of the XML used.

Upayavira
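A sketch of that second option (the stylesheet name is illustrative; in Solr
4.x the file would live in the core's conf/xslt/ directory):

    curl "http://localhost:8983/solr/update?commit=true&tr=djvu2solr.xsl" \
         -H "Content-Type: text/xml" --data-binary @mydoc.djvu.xml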


loading djvu xml into solr

2013-10-18 Thread Sara Amato
Does anyone have a schema they'd be willing to share for loading djvu xml into 
solr?  
 

Re: Check if dynamic columns exists and query else ignore

2013-10-18 Thread Utkarsh Sengar
Thanks Chris! That worked!
I overengineered my query!

Thanks,
-Utkarsh


On Fri, Oct 18, 2013 at 12:02 PM, Chris Hostetter
wrote:

>
> : I trying to do this:
> :
> : if (US_offers_i exists):
> :fq=US_offers_i:[1 TO *]
> : else:
> :fq=offers_count:[1 TO *]
>
> "if()" and "exist()" are functions, so you would have to explicitly use
> them
> in a function context (ie: {!func} parser, or {!frange} parser) and to use
> those nested queries inside of functions you'd need to use the "query()"
> function.
>
> but nothing about your problem description suggests that you really need
> to worry about this.
>
> If a document doesn't contain the "US_offers_i" then US_offers_i:[1 TO *]
> won't match that document, and neither will US_offers_i:[* TO *] -- so you
> can implement the logic you describe with a simple query...
>
> fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *]))
>
> Which you can read as "Match docs with 1 or more US offers, or: docs that
> have 1 or more offers but no US offer field at all"
>
> : Also, is there a heavy performance penalty for this condition? I am
> : planning to use this for all my queries.
>
> Any logic that you do at query time, which can be precomputed into a
> specific field in your index will *always* make the queries faster (at the
> expense of a little more time spent indexing and a little more disk used).
> If you know in advance that you are frequently going to want to restrict
> on this type of logic, then unless you index docs more often than you
> search docs, you should almost certainly index a "has_offers" boolean
> field that captures this logic.
>
>
> -Hoss
>



-- 
Thanks,
-Utkarsh


SOLRJ replace document

2013-10-18 Thread Brent Ryan
How do I replace a document in solr using solrj library?  I keep getting
this error back:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Atomic document updates are not supported unless  is configured

I don't want to do partial updates, I just want to replace it...


Thanks,
Brent


RE: Facet performance

2013-10-18 Thread Chris Hostetter

: >> 1. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
: >> 2. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
: >
: >> The only difference is an empty facet.prefix in the first query.

: >If your index was just opened when you issued your queries, the first 
: request will be notably slower than the second as the facet values might 
: not be in the disk cache.
: 
: I know but it shouldn't be orders of magnitude as in this example, should it?

In and of itself, it can be, if your index is large enough and none of the 
disk pages are in the file system buffer.

More significantly, however, depending on how big your filterCache is, the 
first request could easily be caching all of the filters needed for the 
second query -- at a minimum it's definitely caching your main query, which 
will be re-used and save a lot of time independent of the faceting.


-Hoss


Re: Facet performance

2013-10-18 Thread Otis Gospodnetic
DocValues is the new black
http://wiki.apache.org/solr/DocValues
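A schema.xml sketch of what enabling it might look like (the field and its
attributes are illustrative; note that in 4.x docValues requires a
non-tokenized type such as string or a Trie field, so a tokenized CONTENT
field would first need a suitable copy):

    <field name="CONTENT_dv" type="string" indexed="true" stored="false"
           docValues="true"/>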

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
SOLR Performance Monitoring -- http://sematext.com/spm



On Fri, Oct 18, 2013 at 12:30 PM, Lemke, Michael  SZ/HZA-ZSW
 wrote:
> Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote:
>>Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
>>> 1. 
>>> q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
>>> 2. 
>>> q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
>>
>>> The only difference is an empty facet.prefix in the first query.
>>
>>> The first query returns after some 20 seconds (QTime 2 in the result) 
>>> while
>>> the second one takes only 80 msec (QTime 80). Why is this?
>>
>>If your index was just opened when you issued your queries, the first request 
>>will be notably slower than the second as the facet values might not be in
> the disk cache.
>
> I know but it shouldn't be orders of magnitude as in this example, should it?
>
>>
>>Furthermore, for enum the difference between no prefix and some prefix is 
>>huge. As enum iterates values first (as opposed to fc that iterates hits 
>>first), limiting to only the values that starts with 'a' ought to speed up 
>>retrieval by a factor 10 or more.
>
> Thanks.  That is what we sort of figured, but it's good to know for sure.  Of 
> course, it raises the question: is there a way to speed this up?
>
>>
>>> And as side note: facet.method=fc makes the queries run 'forever' and 
>>> eventually
>>> fail with org.apache.solr.common.SolrException: Too many values for 
>>> UnInvertedField faceting on field CONTENT.
>>
>>An internal memory structure optimization in Solr limits the amount of 
>>possible unique values when using fc. It is not a bug as such, but more a 
>>consequence of a choice. Unfortunately the enum-solution is normally quite 
>>slow when there are enough unique values to trigger the "too many 
>>values"-exception. I know too little about the structures for DocValues to 
>>say if they will help here, but you might want to take a look at those.
>
> What is DocValues?  Haven't heard of it yet.  And yes, the fc method was 
> terribly slow in a case where it did work.  Something like 20 minutes whereas 
> enum returned within a few seconds.
>
> Michael
>


Re: Seeking New Moderators for solr-user@lucene

2013-10-18 Thread Rafał Kuć
Hello!

I can help with moderation. 

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch


> It looks like it's time to inject some fresh blood into the 
> solr-user@lucene moderation team.

> If you'd like to volunteer to be a moderator, please reply back to this
> thread and specify which email address you'd like to use as a moderator
> (if different from the one you use when sending the email)

> Being a moderator is really easy: you'll get some extra emails in your
> inbox with MODERATE in the subject, which you skim to see if they are spam
> -- if they are you delete them, if not you "reply all" to let them get
> sent to the list, and authorize that person to send future messages w/o
> moderation.

> Occasionally, you'll see an explicit email to solr-user-owner@lucene from
> a user asking for help related to their subscription (usually 
> unsubscribing problems) and you and the other moderators chime in with
> assistance when possible.

> More details can be found here...

> https://wiki.apache.org/solr/MailingListModeratorInfo

> (I'll wait ~72+ hours to see who responds, and then file the appropriate
> jira with INFRA)


> -Hoss



Re: Seeking New Moderators for solr-user@lucene

2013-10-18 Thread vibhoreng04
Hi Chris,

I would like to moderate and you can use the mail id vibhoren...@gmail.com
for this purpose .


Regards,
Vibhor Jaiswal



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Seeking-New-Moderators-for-solr-user-lucene-tp4096447p4096448.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Questions developing custom functionquery

2013-10-18 Thread Chris Hostetter

: Field-Type: org.apache.solr.schema.TextField
...
: 
DocTermsIndexDocValues.
: Calling "getVal()" on a DocTermsIndexDocValues does some really weird stuff
: that I really don't understand.

Your TextField is being analyzed in some way you haven't clarified, and 
the DocTermsIndexDocValues you get contains the details of each term in 
that TextField

: Its possible I'm going about this wrong and need to re-do my approach. I'm
: just currently at a loss for what that approach is.

Based on your initial goal, you are most certainly going about this in a 
much more complicated way than you need to...

: > > > My goal is to be able to implement a custom sorting technique.

: > > > Example: /some
: > > > example/data/here/2013/09/12/testing.text
: > > >
: > > > I would like to do a custom sort based on this resname field.
: > > > Basically, I would like to parse out that date there (2013/09/12) and
: > > sort
: > > > on that date.

You are going to be *MUCH* happier (both in terms of effort, and in terms 
of performance) if instead of writing a custom function to parse strings 
at query time when sorting, you implement the parsing logic when indexing 
the doc and index it up front as a date field that you can sort on.

I would suggest something like CloneFieldUpdateProcessorFactory + 
RegexReplaceProcessorFactory could save you the work of needing to 
implement any custom logic -- but as Jack pointed out in SOLR-4864 it 
doesn't currently allow you to do capture group replacements (but maybe 
you could contribute a patch to fix that instead of needing to write 
completely custom code for yourself)

Or maybe, as is, you could use RegexReplaceProcessorFactory to throw away 
non-digits - and then use ParseDateFieldUpdateProcessorFactory to get what 
you want?  (I'm not certain - I haven't played with 
ParseDateFieldUpdateProcessorFactory much)

https://issues.apache.org/jira/browse/SOLR-4864
https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html
https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html



-Hoss
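A sketch of what such a chain might look like in solrconfig.xml, under the
assumptions above (the chain and field names are invented, and stripping
non-digits only yields a clean yyyyMMdd value if the rest of the path
contains no digits):

    <updateRequestProcessorChain name="parse-resname-date">
      <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">resname</str>
        <str name="dest">resname_dt</str>
      </processor>
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">resname_dt</str>
        <str name="pattern">[^0-9]</str>
        <str name="replacement"></str>
      </processor>
      <processor class="solr.ParseDateFieldUpdateProcessorFactory">
        <str name="fieldName">resname_dt</str>
        <arr name="format">
          <str>yyyyMMdd</str>
        </arr>
      </processor>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>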


Re: Seeking New Moderators for solr-user@lucene

2013-10-18 Thread Anshum Gupta
Hey Hoss,

I'd be happy to moderate.

Sent from my iPhone

> On 19-Oct-2013, at 0:22, Chris Hostetter  wrote:
> 
> 
> It looks like it's time to inject some fresh blood into the solr-user@lucene 
> moderation team.
> 
> If you'd like to volunteer to be a moderator, please reply back to this 
> thread and specify which email address you'd like to use as a moderator (if 
> different from the one you use when sending the email)
> 
> Being a moderator is really easy: you'll get some extra emails in your 
> inbox with MODERATE in the subject, which you skim to see if they are spam -- 
> if they are you delete them, if not you "reply all" to let them get sent to 
> the list, and authorize that person to send future messages w/o moderation.
> 
> Occasionally, you'll see an explicit email to solr-user-owner@lucene from a 
> user asking for help related to their subscription (usually unsubscribing 
> problems) and you and the other moderators chime in with assistance when 
> possible.
> 
> More details can be found here...
> 
> https://wiki.apache.org/solr/MailingListModeratorInfo
> 
> (I'll wait ~72+ hours to see who responds, and then file the appropriate jira 
> with INFRA)
> 
> 
> -Hoss


Re: Issues with Language detection in Solr

2013-10-18 Thread vibhoreng04
I agree with you, Jack. But note that this filter otherwise works perfectly
fine. Only in one case, where all the words are Latin, is the language
detected as German. My question is why, and how?
If it works perfectly for the other docs, what in this case is making it
behave abnormally?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433p4096443.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Check if dynamic columns exists and query else ignore

2013-10-18 Thread Chris Hostetter

: I trying to do this:
: 
: if (US_offers_i exists):
:fq=US_offers_i:[1 TO *]
: else:
:fq=offers_count:[1 TO *]

"if()" and "exist()" are functions, so you would have to explicitly use 
them 
in a function context (ie: {!func} parser, or {!frange} parser) and to use 
those nested queries inside of functions you'd need to use the "query()" 
function.

but nothing about your problem description suggests that you really need 
to worry about this.

If a document doesn't contain the "US_offers_i" then US_offers_i:[1 TO *] 
won't match that document, and neither will US_offers_i:[* TO *] -- so you 
can implement the logic you describe with a simple query...

fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *]))

Which you can read as "Match docs with 1 or more US offers, or: docs that 
have 1 or more offers but no US offer field at all"
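As a full request this might look like (collection name and query borrowed
from the earlier messages; URL-encoding omitted for readability):

    http://solr_server/solr/col1/select?q=iphone+5s
        &fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *]))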

: Also, is there a heavy performance penalty for this condition? I am
: planning to use this for all my queries.

Any logic that you do at query time, which can be precomputed into a 
specific field in your index will *always* make the queries faster (at the 
expense of a little more time spent indexing and a little more disk used).  
If you know in advance that you are frequently going to want to restrict 
on this type of logic, then unless you index docs more often than you 
search docs, you should almost certainly index a "has_offers" boolean 
field that captures this logic.


-Hoss


Re: Switching indexes

2013-10-18 Thread Christopher Gross
I was able to get the new collections working dynamically (via Collections
RESTful calls).  I was having some other issues with my development
environment that I had to fix up to get it going.

I had to upgrade to 4.5 in order for the aliases to work at all though.
Not sure what the deal was with that.

Thanks Shawn -- I have a much better understanding of all this now.

-- Chris


On Thu, Oct 17, 2013 at 7:31 PM, Shawn Heisey  wrote:

> On 10/17/2013 12:51 PM, Christopher Gross wrote:
>
>> OK, super confused now.
>>
>> http://index1:8080/solr/admin/cores?action=CREATE&name=test2&collection=test2&numshards=1&replicationFactor=3
>>
>> Nets me this:
>> <response>
>>   <lst name="responseHeader">
>>     <int name="status">400</int>
>>     <int name="QTime">15007</int>
>>   </lst>
>>   <lst name="error">
>>     <str name="msg">Error CREATEing SolrCore 'test2': Could not find
>>       configName for collection test2 found:[xxx, xxx, , x, xx]</str>
>>     <int name="code">400</int>
>>   </lst>
>> </response>
>>
>> For that node (test2), in my solr data directory, I have a folder with the
>> conf files and an existing data dir (copied the index from another
>> location).
>>
>> Right now it seems like the only way that I can add in a collection is to
>> load the configs into zookeeper, stop tomcat, add it to the solr.xml file,
>> and restart tomcat.
>>
>
> The config does need to be loaded into zookeeper.  That's how SolrCloud
> works.
>
> Because you have existing collections, you're going to have at least one
> config set already uploaded, you may be able to use that directly.  You
> don't need to stop anything, though.  Michael Della Bitta's response
> indicates the part you're missing on your create URL - the
> collection.configName parameter.
>
> The basic way to get things done with collections is this:
>
> 1) Upload one or more named config sets to zookeeper.  This can be done
> with zkcli and its "upconfig" command, or with the bootstrap startup
> options that are intended to be used once.
>
> 2) Create the collection, referencing the proper collection.configName.
>
> You can have many collections that all share one config name.  You can
> also change which config an existing collection uses with the zkcli
> "linkconfig" command, followed by a collection reload.  If you upload a new
> configuration with an existing name, a collection reload (or Solr restart)
> is required to use the new config.
>
> For uploading configs, I find zkcli to be a lot cleaner than the bootstrap
> options - it doesn't require stopping Solr or giving it different startup
> options.  Actually, it doesn't even require Solr to be started - it talks
> only to zookeeper, and we strongly recommend standalone zookeeper, not the
> zk server that can be run embedded in Solr.
>
> Thanks,
> Shawn
>
>
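A sketch of that workflow (hosts, paths, and the config name are
illustrative):

    # 1) upload a named config set to ZooKeeper (zkcli.sh ships with Solr)
    ./zkcli.sh -zkhost zk1:2181 -cmd upconfig \
               -confdir /path/to/test2/conf -confname test2conf

    # 2) create the collection via the Collections API, naming that config
    http://index1:8080/solr/admin/collections?action=CREATE&name=test2&numShards=1&replicationFactor=3&collection.configName=test2conf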


Seeking New Moderators for solr-user@lucene

2013-10-18 Thread Chris Hostetter


It looks like it's time to inject some fresh blood into the 
solr-user@lucene moderation team.


If you'd like to volunteer to be a moderator, please reply back to this 
thread and specify which email address you'd like to use as a moderator 
(if different from the one you use when sending the email)


Being a moderator is really easy: you'll get some extra emails in your 
inbox with MODERATE in the subject, which you skim to see if they are spam 
-- if they are you delete them, if not you "reply all" to let them get 
sent to the list, and authorize that person to send future messages w/o 
moderation.


Occasionally, you'll see an explicit email to solr-user-owner@lucene from 
a user asking for help related to their subscription (usually 
unsubscribing problems) and you and the other moderators chime in with 
assistance when possible.


More details can be found here...

https://wiki.apache.org/solr/MailingListModeratorInfo

(I'll wait ~72+ hours to see who responds, and then file the appropriate 
jira with INFRA)



-Hoss


Re: Issues with Language detection in Solr

2013-10-18 Thread Jack Krupansky
I would say that in general you need at least 15 or 20 words in a text field 
for language to be detected reasonably well. Sure, sometimes it can work for 
8 to 12 words, but it's a coin flip how reliable it will be.


You haven't shown us any true text fields. I would say that language 
detection against simple name fields is a misuse of the language detection 
feature. I mean, it is designed for larger blocks of text, not very short 
phrases.


See some examples in my e-book.

-- Jack Krupansky

-Original Message- 
From: vibhoreng04

Sent: Friday, October 18, 2013 2:01 PM
To: solr-user@lucene.apache.org
Subject: Issues with Language detection in Solr

Hi All, I am trying to detect the language of the business name field and
the address field. I am using Solr's LangDetect (the Google library), not
Tika. It works OK in most of the cases, but in some it detects the language
wrongly. For example, for the document

"OrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
"StreetLine1": "19 GREENWOOD AVE",
"StreetLine2": "",
"SOrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
"StandardizedStreetLine1": "19 GREENWOOD AVE",
"language_s": ["de"]

the language is detected as German (de), which is wrong. Below is my
configuration (reconstructed from the stripped XML tags):

langid.fl = OrgName,StreetLine1,StreetLine2,SOrgName,StandardizedStreetLine1
langid.langField = language_s
langid.threshold = 0.9
langid.fallback = en

Why is there an issue? Why is the language detection wrong? Please help!
Vibhor



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Issues with Language detection in Solr

2013-10-18 Thread vibhoreng04
Hi All, I am trying to detect the language of the business name field and
the address field. I am using Solr's LangDetect (the Google library), not
Tika. It works OK in most of the cases, but in some it detects the language
wrongly. For example, for the document

"OrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
"StreetLine1": "19 GREENWOOD AVE",
"StreetLine2": "",
"SOrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
"StandardizedStreetLine1": "19 GREENWOOD AVE",
"language_s": ["de"]

the language is detected as German (de), which is wrong. Below is my
configuration (reconstructed from the stripped XML tags):

langid.fl = OrgName,StreetLine1,StreetLine2,SOrgName,StandardizedStreetLine1
langid.langField = language_s
langid.threshold = 0.9
langid.fallback = en

Why is there an issue? Why is the language detection wrong? Please help!
Vibhor



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Check if dynamic columns exists and query else ignore

2013-10-18 Thread Utkarsh Sengar
Bumping this one, any suggestions?
Looks like if() and exists() are meant to solve this problem, but I am
using them in the wrong way.

-Utkarsh


On Thu, Oct 17, 2013 at 1:16 PM, Utkarsh Sengar wrote:

> I trying to do this:
>
> if (US_offers_i exists):
>fq=US_offers_i:[1 TO *]
> else:
>fq=offers_count:[1 TO *]
>
> Where:
> US_offers_i is a dynamic field containing an int
> offers_count is a static field containing an int.
>
> I have tried this so far but it doesn't work:
>
> http://solr_server/solr/col1/select?
> q=iphone+5s &
> fq=if(exist(US_offers_i),US_offers_i:[1 TO *], offers_count:[1 TO *])
>
> Also, is there a heavy performance penalty for this condition? I am
> planning to use this for all my queries.
>
> --
> Thanks,
> -Utkarsh
>



-- 
Thanks,
-Utkarsh


RE: Facet performance

2013-10-18 Thread Lemke, Michael SZ/HZA-ZSW
Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote:
>Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
>> 1. 
>> q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
>> 2. 
>> q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
>
>> The only difference is an empty facet.prefix in the first query.
>
>> The first query returns after some 20 seconds (QTime 2 in the result) 
>> while
>> the second one takes only 80 msec (QTime 80). Why is this?
>
>If your index was just opened when you issued your queries, the first request 
>will be notably slower than the second as the facet values might not be in 
>the disk cache.

I know but it shouldn't be orders of magnitude as in this example, should it?

>
>Furthermore, for enum the difference between no prefix and some prefix is 
>huge. As enum iterates values first (as opposed to fc that iterates hits 
>first), limiting to only the values that starts with 'a' ought to speed up 
>retrieval by a factor 10 or more.

Thanks.  That is what we sort of figured, but it's good to know for sure.  Of 
course, it raises the question: is there a way to speed this up?

>
>> And as side note: facet.method=fc makes the queries run 'forever' and 
>> eventually
>> fail with org.apache.solr.common.SolrException: Too many values for 
>> UnInvertedField faceting on field CONTENT.
>
>An internal memory structure optimization in Solr limits the amount of 
>possible unique values when using fc. It is not a bug as such, but more a 
>consequence of a choice. Unfortunately the enum-solution is normally quite 
>slow when there are enough unique values to trigger the "too many 
>values"-exception. I know too little about the structures for DocValues to say 
>if they will help here, but you might want to take a look at those.

What is DocValues?  Haven't heard of it yet.  And yes, the fc method was 
terribly slow in a case where it did work.  Something like 20 minutes whereas 
enum returned within a few seconds.

Michael



Fwd: Searching within list of regions with 1:1 document-region mapping

2013-10-18 Thread Sandeep Gupta
Hi,

I have a Solr index of around 100 million documents, each assigned a region
id, growing at a rate of about 10 million documents per month - the average
document size being around 10KB of pure text. The total number of region ids
is itself in the range of 2.5 million.

I want to search for a query with a given list of region ids. The number of
region ids in this list is usually around 250-300 (most of the time), but
can be up to 500, with a maximum cap of around 2000 ids in one request.


What is the best way to model such queries, besides using an IN-style param
in the query, a filter query (fq), or some other means?


If it helps, the index is on a VM with 4 virtual cores and currently has
4GB of Java memory allocated out of the 16GB in the machine. The number of
queries does not exceed 1 per minute for now. If needed, we can throw more
hardware at the index - but the index will still be only on a single
machine for at least 6 months.

Best Regards,
Sandeep Gupta
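For reference, the plain filter-query form of such a request (the field name
region_id is assumed) is just a large boolean OR, which Solr caches as a
single filter entry - helpful only if the same list of ids repeats across
queries:

    fq=region_id:(17 42 103 ... 2087)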


Solr timeout after reboot

2013-10-18 Thread michael.boom
I have a SolrCloud environment with 4 shards, each having a replica and a
leader. The index size is about 70M docs and 60Gb, running with Jetty +
Zookeeper, on 2 EC2 instances, each with 4CPUs and 15G RAM.

I'm using SolrMeter for stress testing.
If I restart Jetty and then try to use SolrMeter to bomb an instance with
queries at a rate of 3000 queries per minute, that Solr instance somehow
times out and I need to restart it again.
If instead of 3000 qpm I start up slowly with 200 for a minute or two,
then 1800 and then 3000, everything is good.

I assume this happens because Solr is not warmed up.
What settings could I tweak so that Solr doesn't time out anymore when
getting many requests? Is there a way to limit how many requests it can serve?



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to retrieve content page in solr

2013-10-18 Thread javozzo
Hi Harshvardhan Ojha,
I'm using Nutch 1.1 and Solr 3.6.0.
I mean the whole document. I'm trying to create a search engine with Nutch
and Solr, and I would like an interface like this:

name1
http://www.prova.com/name1.html
first rows of content document

name2
http://www.prova.com/name2.html
first rows of content document

name3
http://www.prova.com/name3.html
first rows of content document

any ideas?
Thanks
Danilo



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302p4096333.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solrconfig.xml carrot2 params

2013-10-18 Thread youknowwho

Thanks, I'm new to the clustering libraries.  I finally made this connection 
when I started browsing through the Carrot2 source.  I had pulled down a 
smaller MM document collection from our test environment.  It was not ideal, 
as it was mostly structured, but it was small.  I foolishly thought I could 
cluster on the text copy field before realizing that it was indexed only.  Doh!
 
Our documents are indexed in SolrCloud, but stored in HBase.  I want to allow 
users to page through Solr hits, but would like to cluster on all (or at least 
several thousand) of the top search hits.  Now I'm puzzling over how to 
efficiently cluster over possibly several thousand Solr hits when the documents 
are in HBase.  I thought of an HBase coprocessor, but Carrot2 isn't designed 
for distributed computation.  Mahout, in the Hadoop M/R context, seems slow 
and heavy-handed for this scale; maybe I just need to dig deeper into their 
library.  Or I could just be missing something fundamental?  :)
 
-Original Message-
From: "Stanislaw Osinski" 
Sent: Friday, October 18, 2013 5:04am
To: solr-user@lucene.apache.org
Subject: Re: solrconfig.xml carrot2 params



Hi,

Out of curiosity -- what would you like to achieve by changing
Tokenizer.documentFields?
If you want to have clustering applied to more than one document field, you
can provide a comma-separated list of fields in the carrot.title and/or
carrot.snippet parameters.

Thanks,

Staszek

--
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com
http://carrotsearch.com


On Thu, Oct 17, 2013 at 11:49 PM, youknow...@heroicefforts.net <
youknow...@heroicefforts.net> wrote:

> Would someone help me out with the syntax for setting
> Tokenizer.documentFields in the ClusteringComponent engine definition in
> solrconfig.xml?  Carrot2 is expecting a Collection of Strings.  There's no
> schema definition for this XML file and a big TODO on the Wiki wrt init
> params.  Every permutation I have tried results in an error stating:
>  Cannot set java.util.Collection field ... to java.lang.String.
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
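A sketch of where those parameters go - the clustering request handler's
defaults in solrconfig.xml (handler name and field names are illustrative):

    <requestHandler name="/clustering" class="solr.SearchHandler">
      <lst name="defaults">
        <bool name="clustering">true</bool>
        <str name="carrot.title">title</str>
        <!-- comma-separated list of fields to cluster on -->
        <str name="carrot.snippet">title,abstract</str>
      </lst>
      <arr name="last-components">
        <str>clustering</str>
      </arr>
    </requestHandler>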

RE: Facet performance

2013-10-18 Thread Toke Eskildsen
Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
> 1. 
> q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
> 2. 
> q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

> The only difference is an empty facet.prefix in the first query.

> The first query returns after some 20 seconds (QTime 2 in the result) 
> while
> the second one takes only 80 msec (QTime 80). Why is this?

If your index was just opened when you issued your queries, the first request 
will be notably slower than the second as the facet values might not be in the 
disk cache.

Furthermore, for enum the difference between no prefix and some prefix is huge. 
As enum iterates values first (as opposed to fc that iterates hits first), 
limiting to only the values that starts with 'a' ought to speed up retrieval by 
a factor 10 or more.

> And as side note: facet.method=fc makes the queries run 'forever' and 
> eventually
> fail with org.apache.solr.common.SolrException: Too many values for 
> UnInvertedField faceting on field CONTENT.

An internal memory structure optimization in Solr limits the amount of possible 
unique values when using fc. It is not a bug as such, but more a consequence of 
a choice. Unfortunately the enum-solution is normally quite slow when there are 
enough unique values to trigger the "too many values"-exception. I know too 
little about the structures for DocValues to say if they will help here, but 
you might want to take a look at those.

- Toke Eskildsen

querying nested entity fields

2013-10-18 Thread sathish_ix
Hi ,

Can someone tell me if the query below is possible?

Schema:


A
product1
product2

B
product12
product23



Is it possible to query like this: q=tag.category:A AND
tag.category.product=product1 ?






--
View this message in context: 
http://lucene.472066.n3.nabble.com/querying-nested-entity-fields-tp4096382.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Concurrent indexing

2013-10-18 Thread Chris Geeringh
Erick, yes. Using SolrJ and CloudSolrServer - both 4.6 snapshots from 13 Oct


On 18 October 2013 12:17, Erick Erickson  wrote:

> Chris:
>
> OK, one of those stack traces does have the problem I referenced in the
> other thread. Are you sending updates to the server with SolrJ? And are you
> using CloudSolrServer? If you are, I'm surprised...
>
>  There are the important lines:
>
>1. - java.util.concurrent.Semaphore.acquire() @bci=5, line=317 (Compiled
>frame)
>2.  - org.apache.solr.util.AdjustableSemaphore.acquire() @bci=4, line=61
>(Compiled frame)
>3.  - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.
>update.SolrCmdDistributor$Request) @bci=22, line=418 (Compiled frame)
>4.  - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.
>client.solrj.request.UpdateRequest,
>
>
>
>
>
> On Wed, Oct 16, 2013 at 2:04 PM, Chris Geeringh 
> wrote:
>
> > Here's another jstack http://pastebin.com/8JiQc3rb
> >
> >
> > On 16 October 2013 11:53, Chris Geeringh  wrote:
> >
> > > Hi Erick, here is a paste from other thread (debugging update request)
> > > with my input as I am seeing errors too:
> > >
> > > I ran an import last night, and this morning my cloud wouldn't accept
> > > updates. I'm running the latest 4.6 snapshot. I was importing with
> latest
> > > solrj snapshot, and using java bin transport with CloudSolrServer.
> > >
> > > The cluster had indexed ~1.3 million docs before no further updates
> were
> > > accepted, querying still working.
> > >
> > > I'll run jstack shortly and provide the results.
> > >
> > > Here is my jstack output... Lots of blocked threads.
> > >
> > > http://pastebin.com/1ktjBYbf
> > >
> > >
> > >
> > > On 16 October 2013 11:46, Erick Erickson 
> > wrote:
> > >
> > >> Run jstack on the solr process (standard with Java) and
> > >> look for the word "semaphore". You should see your
> > >> servers blocked on this in the Solr code. That'll pretty
> > >> much nail it.
> > >>
> > >> There's an open JIRA to fix the underlying cause, see:
> > >> SOLR-5232, but that's currently slated for 4.6 which
> > >> won't be cut for a while.
> > >>
> > >> Also, there's a patch that will fix this as a side effect,
> > >> assuming you're using SolrJ, see. This is available in 4.5
> > >> SOLR-4816
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>
> > >>
> > >>
> > >> On Tue, Oct 15, 2013 at 1:33 PM, michael.boom 
> > >> wrote:
> > >>
> > >> > Here's some of Solr's last words (log content before it stopped
> > >> > accepting updates); maybe someone can help me interpret that.
> > >> > http://pastebin.com/mv7fH62H
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > View this message in context:
> > >> >
> > >>
> >
> http://lucene.472066.n3.nabble.com/Concurent-indexing-tp4095409p4095642.html
> > >> > Sent from the Solr - User mailing list archive at Nabble.com.
> > >> >
> > >>
> > >
> > >
> >
>


Re: Filter cache pollution during sharded edismax queries

2013-10-18 Thread Anca Kopetz

Hi Ken,

Have you managed to find out why these entries were stored in the filterCache, 
and whether they have an impact on the hit ratio?
We noticed the same problem; there are entries of this type in our filterCache: 
item_+(+(title:western^10.0 | ...

Thanks,
Anca

On 07/02/2013 09:01 PM, Ken Krugler wrote:

Hi all,

After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had 
dropped significantly.

Previously it was at 95+%, but now it's < 50%.

I enabled recording 100 entries for debugging, and in looking at them it seems 
that edismax (and faceting) is creating entries for me.

This is in a sharded setup, so it's a distributed search.

If I do a search for the string "bogus text" using edismax on two fields, I get 
an entry in each of the shards' filter caches that looks like:

item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

Is this expected?

I have a similar situation happening during faceted search, even though my 
fields are single-value/untokenized strings, and I'm not using the enum facet 
method.

But I'll get many, many entries in the filterCache for facet values, and they all look like 
"item_::"

The net result of the above is that even with a very big filterCache size of 
2K, the hit ratio is still only 60%.

Thanks for any insights,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr









Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris



Re: Proximity search with wildcard

2013-10-18 Thread sayeed
Generally in Solr, if we give "Company engage"~5, it will return results
containing "engage" within 5 words of "company".
So here I want to get the same kind of results when I give the query with a
wildcard, such as "Compa* engage"~5.



-
Sayeed
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285p4096354.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: feedback on Solr 4.x LotsOfCores feature

2013-10-18 Thread Soyez Olivier
15K cores is around 4 minutes: no network drive, just a spinning disk.
One important thing: to simulate a cold start (i.e., a useless Linux buffer
cache), I used the following command to empty the Linux buffer cache:
sync && echo 3 > /proc/sys/vm/drop_caches
Then I started Solr and found the result above.


Le 11/10/2013 13:06, Erick Erickson a écrit :


bq: sharing the underlying solrconfig object the configset introduced
in JIRA SOLR-4478 seems to be the solution for non-SolrCloud mode

SOLR-4478 will NOT share the underlying config objects; it simply
shares the underlying directory. Each core will, at least as presently
envisioned, simply read the files that exist there and create its
own solrconfig object. Schema objects may be shared, but not config
objects. It may turn out to be relatively easy to do in the configset
situation, but last time I looked at sharing the underlying config
object it was too fraught with problems.

bq: 15K cores is around 4 minutes

I find this very odd. On my laptop, spinning disk, I think I was
seeing 1k cores discovered/sec. You're seeing roughly 16x slower, so I
have no idea what's going on here. If this is just reading the files,
you should be seeing horrible disk contention. Are you on some kind of
networked drive?

bq: Doing that in the background and blocking on the request until core
discovery is complete would not work for us (due to the worst case).
What other choices are there? Either you have to do it up front or
with some kind of blocking. Hmmm, I suppose you could keep some kind
of custom store (DB? File? ZooKeeper?) that would keep the last known
layout. You'd still have some kind of worst-case situation where the
core you were trying to load wouldn't be in your persistent store and
you'd _still_ have to wait for the discovery process to complete.

bq: and we will use the cores Auto option to create-load or only-load
the core on
Interesting. I can see how this could all work without any core
discovery but it does require a very specific setup.

On Thu, Oct 10, 2013 at 11:42 AM, Soyez Olivier
 wrote:
> The corresponding patch for Solr 4.2.1 LotsOfCores can be found in SOLR-5316, 
> including the new Cores options :
> - "numBuckets" to create a subdirectory based on a hash on the corename % 
> numBuckets in the core Datadir
> - "Auto" with 3 differents values :
>   1) false : default behaviour
>   2) createLoad : create the core if it does not exist, and load it on the fly 
> on the first incoming request (update, select)
>   3) onlyLoad : load the core on the fly on the first incoming request 
> (update, select), if it exists on disk
>
> Concerning :
> - sharing the underlying solrconfig object, the configset introduced in JIRA 
> SOLR-4478 seems to be the solution for non-SolrCloud mode.
> We need to test it for our use case. If another solution exists, please tell 
> me. We are very interested in such functionality and to contribute, if we can.
>
> - the possibility of LotsOfCores in SolrCloud: we don't know in detail how 
> SolrCloud works.
> But one possible limit is the maximum number of entries that can be added to 
> a ZooKeeper node.
> Maybe a solution would be just a kind of hashing in the ZooKeeper tree.
>
> - the time to discover cores in Solr 4.4: with a spinning disk under Linux, 
> all cores with transient="true" and loadOnStartup="false", and the Linux 
> buffer cache emptied before starting Solr:
> 15K cores is around 4 minutes. It's linear in the number of cores, so for 50K 
> it's more than 13 minutes. In fact, it corresponds to the time to read all 
> core.properties files.
> Doing that in the background and blocking on the request until core discovery 
> is complete would not work for us (due to the worst case).
> So, we will just disable core discovery, because we don't need to know 
> all cores from the start. We will start Solr without any core entries in 
> solr.xml and use the cores Auto option to create-load or only-load the core 
> on the fly, based on the existence of the core on disk (absolute path 
> calculated from the core name).
>
> Thanks for your interest,
>
> Olivier
> 
> De : Erick Erickson [erickerick...@gmail.com]
> Date d'envoi : lundi 7 octobre 2013 14:33
> À : solr-user@lucene.apache.org
> Objet : Re: feedback on Solr 4.x LotsOfCores feature
>
> Thanks for the great writeup! It's always interesting to see how
> a feature plays out "in the real world". A couple of questions
> though:
>
> bq: We added 2 Cores options :
> Do you mean you patched Solr? If so, are you willing to share the code
> back? If both are "yes", please open a JIRA, attach the patch and assign
> it to me.
>
> bq:  the number of file descriptors, it used a lot (need to increase global
> max and per process fd)
>
> Right, this makes sense since you have a bunch of cores all with their
> own descriptors open. I'm 

Facet performance

2013-10-18 Thread Lemke, Michael SZ/HZA-ZSW
I am working with Solr facet fields and have come across a 
performance problem I don't understand. Consider these 
two queries:

1. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

2. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

The only difference is an empty facet.prefix in the first query.

The first query returns after some 20 seconds (QTime 2 in the result) while 
the second one takes only 80 msec (QTime 80). Why is this?

And as a side note: facet.method=fc makes the queries run 'forever' and 
eventually 
fail with org.apache.solr.common.SolrException: Too many values for 
UnInvertedField faceting on field CONTENT.

This is with Solr 1.4.




Re: ExtractRequestHandler, skipping errors

2013-10-18 Thread Guido Medina
Don't; commons compress 1.5 is broken, so either use 1.4.1 or a later release. 
Our app stopped compressing properly after a Maven update.


Guido.

On 18/10/13 12:40, Roland Everaert wrote:

I will open a JIRA issue; I suppose I just have to create an account
first?


Regards,


Roland.


On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi  wrote:


Hi,

I think the flag cannot ignore NoSuchMethodError. There may be something
wrong here?

... I've just checked my Solr 4.5 directories and found that the Tika version
is 1.4.

Tika 1.4 seems to use commons compress 1.5:

http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup

But I see commons-compress-1.4.1.jar in solr/contrib/extraction/lib/
directory.

Can you open a JIRA issue?

For now, you can get commons compress 1.5 and put it in the directory
(don't forget to remove the 1.4.1 jar file).

koji


(13/10/18 16:37), Roland Everaert wrote:


Hi,

We already configure the ExtractRequestHandler to ignore Tika exceptions,
but it is Solr that complains. The customer managed to reproduce the
problem. Following is the error from the solr.log. The file type that caused
this exception was WMZ. It seems that something is missing in a Solr class. We
use Solr 4.4.

ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException;
null:java.lang.RuntimeException: java.lang.NoSuchMethodError:
org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
[full stack trace quoted in the original message below]





On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi 
wrote:

  Hi Roland,


(13/10/17 20:44), Roland Everaert wrote:

  Hi,

I helped a customer deploy Solr+ManifoldCF and everything is going
quite smoothly, but every time Solr raises an exception, the ManifoldCF job
feeding Solr aborts. I would like to know if it is possible to configure the
ExtractRequestHandler to ignore errors, as seems to be possible with the
DataImportHandler and entity processors.

I know that it is possible to configure the ExtractRequestHandler to ignore
Tika exceptions (we already do that), but the errors that now stop the MCF
jobs are generated by Solr itself.

Re: ExtractRequestHandler, skipping errors

2013-10-18 Thread Roland Everaert
Here is the link to the issue:

https://issues.apache.org/jira/browse/SOLR-5365

Thanks for your help.


Roland Everaert.


On Fri, Oct 18, 2013 at 1:40 PM, Roland Everaert wrote:

> I will open a JIRA issue; I suppose I just have to create an account
> first?
>
>
> Regards,
>
>
> Roland.
>
>
> On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi wrote:
>
>> Hi,
>>
>> I think the flag cannot ignore NoSuchMethodError. There may be something
>> wrong here?
>>
>> ... I've just checked my Solr 4.5 directories and found that the Tika version
>> is 1.4.
>>
>> Tika 1.4 seems to use commons compress 1.5:
>>
>> http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup
>>
>> But I see commons-compress-1.4.1.jar in solr/contrib/extraction/lib/
>> directory.
>>
>> Can you open a JIRA issue?
>>
>> For now, you can get commons compress 1.5 and put it in the directory
>> (don't forget to remove the 1.4.1 jar file).
>>
>> koji
>>
>>
>> (13/10/18 16:37), Roland Everaert wrote:
>>
>>> Hi,
>>>
>>> We already configure the ExtractRequestHandler to ignore Tika exceptions,
>>> but it is Solr that complains. The customer managed to reproduce the
>>> problem. Following is the error from the solr.log. The file type that caused
>>> this exception was WMZ. It seems that something is missing in a Solr class.
>>> We use Solr 4.4.
>>>
>>> ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException;
>>> null:java.lang.RuntimeException: java.lang.NoSuchMethodError:
>>> org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
>>> [full stack trace quoted in the original message below]
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi 
>>> wrote:
>>>
>>>  Hi Roland,
>

Re: ExtractRequestHandler, skipping errors

2013-10-18 Thread Roland Everaert
I will open a JIRA issue; I suppose I just have to create an account
first?


Regards,


Roland.


On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi  wrote:

> Hi,
>
> I think the flag cannot ignore NoSuchMethodError. There may be something
> wrong here?
>
> ... I've just checked my Solr 4.5 directories and found that the Tika version
> is 1.4.
>
> Tika 1.4 seems to use commons compress 1.5:
>
> http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup
>
> But I see commons-compress-1.4.1.jar in solr/contrib/extraction/lib/
> directory.
>
> Can you open a JIRA issue?
>
> For now, you can get commons compress 1.5 and put it in the directory
> (don't forget to remove the 1.4.1 jar file).
>
> koji
>
>
> (13/10/18 16:37), Roland Everaert wrote:
>
>> Hi,
>>
>> We already configure the ExtractRequestHandler to ignore Tika exceptions,
>> but it is Solr that complains. The customer managed to reproduce the
>> problem. Following is the error from the solr.log. The file type that caused
>> this exception was WMZ. It seems that something is missing in a Solr class.
>> We use Solr 4.4.
>>
>> ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException;
>> null:java.lang.RuntimeException: java.lang.NoSuchMethodError:
>> org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
>> [full stack trace quoted in the original message below]
>>
>>
>>
>>
>>
>> On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi 
>> wrote:
>>
>>  Hi Roland,
>>>
>>>
>>> (13/10/17 20:44), Roland Everaert wrote:
>>>
>>>  Hi,

 I helped a customer deploy Solr+ManifoldCF and everything is going
 quite smoothly, but every time Solr raises an exception, the ManifoldCF job
 feeding Solr aborts. I would like to know if it is possible to configure the
 ExtractRequestHandler to ignore errors, as seems to be possible with the
 DataImportHandler and entity processors.

XLSB files not indexed

2013-10-18 Thread Roland Everaert
Hi,

Can someone tell me if Tika is supposed to extract data from XLSB files
(the new MS Office format in binary form)?

If so, then it seems that Solr is not able to index them, just as it is not
able to index ODF files (a JIRA is already open for ODF:
https://issues.apache.org/jira/browse/SOLR-4809).

Can someone confirm the problem, or tell me what to do to make Solr work
with XLSB files?
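
One way to check whether Tika itself can handle a given file, independent of
Solr: a minimal sketch using Tika's AutoDetectParser (the file path is a
placeholder):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();

        // Placeholder path to the file under test.
        try (InputStream in = new FileInputStream("/tmp/test.xlsb")) {
            parser.parse(in, handler, metadata);
        }
        System.out.println(metadata.get(Metadata.CONTENT_TYPE)); // detected type
        System.out.println(handler.toString());                  // extracted text
    }
}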


Regards,


Roland.


Re: measure result set quality

2013-10-18 Thread Erick Erickson
bq: How do you compare the quality of your
search result in order to decide which schema is better?

Well, that's actually a hard problem. There are the
various TREC data sets, but those are a generic solution, and most
every individual application of this generic thing called
"search" has its own version of "good" results.

Note that scores are NOT comparable across different
queries even in the same data set, so don't go down that
path.

I'd fire the question back at you, "Can you define what
good (or better) results are in such a way that you can
program an evaluation?" Often the answer is "no"...

One common technique is to have knowledgable users
do what's called A/B testing. You fire the query at two
separate Solr instances and display the results side-by-side,
and the user says "A is more relevant", or "B is more
relevant". Kind of like an eye doctor. In sophisticated A/B
testing, the program randomly changes which side the
results go, so you remove "sidedness" bias.
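
A minimal SolrJ sketch of such a side-by-side harness (the host URLs, core
names and judging workflow are all assumptions):

import java.util.Random;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class SideBySideEval {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs for the two schema configurations under test.
        HttpSolrServer a = new HttpSolrServer("http://hostA:8983/solr/coreA");
        HttpSolrServer b = new HttpSolrServer("http://hostB:8983/solr/coreB");

        SolrQuery q = new SolrQuery("example query");
        q.setRows(10);

        // Randomize which configuration renders on the left, so the judge
        // cannot develop a "sidedness" bias.
        boolean aOnLeft = new Random().nextBoolean();
        HttpSolrServer left = aOnLeft ? a : b;
        HttpSolrServer right = aOnLeft ? b : a;

        System.out.println("LEFT:");
        for (SolrDocument d : left.query(q).getResults()) {
            System.out.println("  " + d.getFieldValue("id"));
        }
        System.out.println("RIGHT:");
        for (SolrDocument d : right.query(q).getResults()) {
            System.out.println("  " + d.getFieldValue("id"));
        }
        // The judge's "left/right is better" vote is mapped back via aOnLeft.
    }
}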


FWIW,
Erick


On Thu, Oct 17, 2013 at 11:28 AM, Alvaro Cabrerizo wrote:

> Hi,
>
> Imagine the following situation. You have a corpus of documents and a list of
> queries extracted from a production environment. The corpus hasn't been
> manually annotated with relevant/non-relevant tags for every query. Then you
> configure various Solr instances, changing the schema (adding synonyms,
> stopwords...). After indexing, you prepare and execute the test over
> different schema configurations.  How do you compare the quality of your
> search result in order to decide which schema is better?
>
> Regards.
>


Re: Concurent indexing

2013-10-18 Thread Erick Erickson
Chris:

OK, one of those stack traces does have the problem I referenced in the
other thread. Are you sending updates to the server with SolrJ? And are you
using CloudSolrServer? If you are, I'm surprised...

 Here are the important lines:

   1. - java.util.concurrent.Semaphore.acquire() @bci=5, line=317 (Compiled
   frame)
   2.  - org.apache.solr.util.AdjustableSemaphore.acquire() @bci=4, line=61
   (Compiled frame)
   3.  - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.
   update.SolrCmdDistributor$Request) @bci=22, line=418 (Compiled frame)
   4.  - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.
   client.solrj.request.UpdateRequest,
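
The acquire() at the top of that list is a bounded semaphore gating outgoing
update requests; a stripped-down illustration of the pattern (not the actual
SolrCmdDistributor code):

import java.util.concurrent.Semaphore;

public class BoundedSubmit {
    // A small, fixed number of permits for in-flight distributed updates.
    private static final Semaphore PERMITS = new Semaphore(16);

    static void submit(Runnable request) throws InterruptedException {
        PERMITS.acquire();      // all threads block here once permits run out
        try {
            request.run();
        } finally {
            PERMITS.release();  // if a release is ever skipped, acquirers hang
        }
    }
}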





On Wed, Oct 16, 2013 at 2:04 PM, Chris Geeringh  wrote:

> Here's another jstack http://pastebin.com/8JiQc3rb
>
>
> On 16 October 2013 11:53, Chris Geeringh  wrote:
>
> > Hi Erick, here is a paste from other thread (debugging update request)
> > with my input as I am seeing errors too:
> >
> > I ran an import last night, and this morning my cloud wouldn't accept
> > updates. I'm running the latest 4.6 snapshot. I was importing with latest
> > solrj snapshot, and using java bin transport with CloudSolrServer.
> >
> > The cluster had indexed ~1.3 million docs before no further updates were
> > accepted, querying still working.
> >
> > I'll run jstack shortly and provide the results.
> >
> > Here is my jstack output... Lots of blocked threads.
> >
> > http://pastebin.com/1ktjBYbf
> >
> >
> >
> > On 16 October 2013 11:46, Erick Erickson 
> wrote:
> >
> >> Run jstack on the solr process (standard with Java) and
> >> look for the word "semaphore". You should see your
> >> servers blocked on this in the Solr code. That'll pretty
> >> much nail it.
> >>
> >> There's an open JIRA to fix the underlying cause, see:
> >> SOLR-5232, but that's currently slated for 4.6 which
> >> won't be cut for a while.
> >>
> >> Also, there's a patch that will fix this as a side effect,
> >> assuming you're using SolrJ, see. This is available in 4.5
> >> SOLR-4816
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>
> >>
> >> On Tue, Oct 15, 2013 at 1:33 PM, michael.boom 
> >> wrote:
> >>
> >> > Here's some of Solr's last words (log content before it stopped
> >> > accepting updates); maybe someone can help me interpret that.
> >> > http://pastebin.com/mv7fH62H
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >> >
> >>
> http://lucene.472066.n3.nabble.com/Concurent-indexing-tp4095409p4095642.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >> >
> >>
> >
> >
>


Re: Debugging update request

2013-10-18 Thread Erick Erickson
@Michael:

Yep, that's the bit that's addressed by the two patches I referenced. If
you can try this with 4.5 (or the soon-to-be-released 4.5.1), the problem
should go away.

@Chris:

I think you have a different issue. A very quick glance at your stack trace
doesn't really show anything outstanding. There are always a bunch of
threads waiting around for something to do that show up as "blocked". So
I'm pretty puzzled. Are your Solr logs showing anything when you try to
update after this occurs?


On Wed, Oct 16, 2013 at 11:32 AM, Chris Geeringh  wrote:

> Here is my jstack output... Lots of blocked threads.
>
> http://pastebin.com/1ktjBYbf
>
>
> On 16 October 2013 10:28, michael.boom  wrote:
>
> > I got the trace from jstack.
> > I found references to "semaphore" but not sure if this is what you meant.
> > Here's the trace:
> > http://pastebin.com/15QKAz7U
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Debugging-update-request-tp4095619p4095847.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: how to retireve content page in solr

2013-10-18 Thread Harshvardhan Ojha
Hi Danilo,

What do you mean by content information?
A whole document?
Metadata?
Do you keep it separate in some fields?
Or is it about Solr search queries?


Regards
Harshvardhan Ojha


On Fri, Oct 18, 2013 at 1:09 PM, javozzo  wrote:

> Hi, I'm new to Solr.
> I use Nutch 1.1 to crawl web pages.
> I use Solr to index these pages.
> My problem is: how do I retrieve the content information about a document
> "stored" in Solr?
>
> Example
> If I have a page http://www.prova.com/prova.html
> that contains the text "This is a web page"
>
> Is there a way to retrieve the text "This is a web page"?
> Any ideas?
> My application is written in java.
> Thanks
> Danilo
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: ExtractRequestHandler, skipping errors

2013-10-18 Thread Koji Sekiguchi

Hi,

I think the flag cannot ignore NoSuchMethodError. There may be something wrong 
here?

... I've just checked my Solr 4.5 directories and found that the Tika version is 1.4.

Tika 1.4 seems to use commons compress 1.5:

http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup

But I see commons-compress-1.4.1.jar in solr/contrib/extraction/lib/ directory.

Can you open a JIRA issue?

For now, you can get commons compress 1.5 and put it in the directory
(don't forget to remove the 1.4.1 jar file).

koji

(13/10/18 16:37), Roland Everaert wrote:

Hi,

We already configure the ExtractRequestHandler to ignore Tika exceptions,
but it is Solr that complains. The customer managed to reproduce the
problem. Following is the error from the solr.log. The file type that caused this
exception was WMZ. It seems that something is missing in a Solr class. We
use Solr 4.4.

ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException;
null:java.lang.RuntimeException: java.lang.NoSuchMethodError:
org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
[full stack trace quoted in the original message below]





On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi  wrote:


Hi Roland,


(13/10/17 20:44), Roland Everaert wrote:


Hi,

I helped a customer deploy Solr+ManifoldCF and everything is going
quite smoothly, but every time Solr raises an exception, the ManifoldCF job
feeding Solr aborts. I would like to know if it is possible to configure the
ExtractRequestHandler to ignore errors, as seems to be possible with the
DataImportHandler and entity processors.

I know that it is possible to configure the ExtractRequestHandler to ignore
Tika exceptions (we already do that), but the errors that now stop the MCF
jobs are generated by Solr itself.

While it would be interesting to have such an option in Solr, I plan to post to
the ManifoldCF mailing list anyway, to ask whether it is possible to configure
ManifoldCF to be less picky about Solr errors.



ignoreTikaException flag might help you?

https://issues.apache.org/jira/browse/SOLR-2480

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html






how to retireve content page in solr

2013-10-18 Thread javozzo
Hi, I'm new to Solr.
I use Nutch 1.1 to crawl web pages.
I use Solr to index these pages.
My problem is: how do I retrieve the content information about a document
"stored" in Solr?

Example
If I have a page http://www.prova.com/prova.html
that contains the text "This is a web page"

Is there a way to retrieve the text "This is a web page"?
Any ideas?
My application is written in java.
Thanks
Danilo
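
A minimal SolrJ sketch of that lookup, assuming the Nutch schema actually
stores the extracted page text (e.g. a content field with stored="true"; the
URL, core and field names here are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class FetchContent {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Look the page up by its URL.
        SolrQuery query = new SolrQuery("url:\"http://www.prova.com/prova.html\"");
        query.setFields("url", "content"); // content must be stored to come back

        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("content"));
        }
        server.shutdown();
    }
}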



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Proximity search with wildcard

2013-10-18 Thread Harshvardhan Ojha
Hi Sayeed,

You can use fuzzy search: comp engage~0.2.

Regards
harshvardhan ojha


On Fri, Oct 18, 2013 at 10:28 AM, sayeed  wrote:

> Hi,
> I am new to Solr. Is it possible to do a proximity search with a wildcard in Solr?
>
> For example
> "comp* engage"~5.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solrconfig.xml carrot2 params

2013-10-18 Thread Stanislaw Osinski
Hi,

Out of curiosity -- what would you like to achieve by changing
Tokenizer.documentFields?
If you want to have clustering applied to more than one document field, you
can provide a comma-separated list of fields in the carrot.title and/or
carrot.snippet parameters.
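
A minimal SolrJ sketch of passing those parameters per request (the handler
name "/clustering" and the field names are assumptions; they must match your
solrconfig.xml and schema):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ClusteringParams {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at your own core.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("example");
        q.setRequestHandler("/clustering");    // handler with the clustering component
        q.set("carrot.title", "title");
        q.set("carrot.snippet", "title,body"); // comma-separated list of fields
        System.out.println(server.query(q).getResponse());
        server.shutdown();
    }
}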

Thanks,

Staszek

--
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com
http://carrotsearch.com


On Thu, Oct 17, 2013 at 11:49 PM, youknow...@heroicefforts.net <
youknow...@heroicefforts.net> wrote:

> Would someone help me out with the syntax for setting
> Tokenizer.documentFields in the ClusteringComponent engine definition in
> solrconfig.xml?  Carrot2 is expecting a Collection of Strings.  There's no
> schema definition for this XML file and a big TODO on the Wiki wrt init
> params.  Every permutation I have tried results in an error stating:
>  Cannot set java.util.Collection field ... to java.lang.String.
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.


Complex Queries in solr

2013-10-18 Thread sayeed
Hi,
Is it possible to search complex queries like 
(consult* or advis*) NEAR(40) (fee or retainer or salary or bonus)
in Solr?
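
For what it's worth, this shape of query can be approximated at the Lucene
level with span queries; a minimal sketch (the field name is a placeholder,
and exposing this in Solr would likely need a custom query parser plugin):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class NearGroups {
    static SpanQuery prefix(String p) {
        return new SpanMultiTermQueryWrapper<PrefixQuery>(
                new PrefixQuery(new Term("text", p)));
    }

    static SpanQuery term(String t) {
        return new SpanTermQuery(new Term("text", t));
    }

    public static void main(String[] args) {
        SpanQuery left = new SpanOrQuery(prefix("consult"), prefix("advis"));
        SpanQuery right = new SpanOrQuery(
                term("fee"), term("retainer"), term("salary"), term("bonus"));
        // NEAR(40): the two groups within 40 positions of each other, any order
        SpanQuery near = new SpanNearQuery(new SpanQuery[] { left, right }, 40, false);
        System.out.println(near);
    }
}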




-
Sayeed
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Complex-Queries-in-solr-tp4096288.html
Sent from the Solr - User mailing list archive at Nabble.com.


Proximity search with wildcard

2013-10-18 Thread sayeed
Hi,
I am new to Solr. Is it possible to do a proximity search with a wildcard in Solr?

For example 
"comp* engage"~5.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: ExtractRequestHandler, skipping errors

2013-10-18 Thread Roland Everaert
Hi,

We already configure the ExtractRequestHandler to ignore Tika exceptions,
but it is Solr that complains. The customer managed to reproduce the
problem. Following is the error from the solr.log. The file type that caused this
exception was WMZ. It seems that something is missing in a Solr class. We
use Solr 4.4.

ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException;
null:java.lang.RuntimeException: java.lang.NoSuchMethodError:
org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:673)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:383)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
at
org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoSuchMethodError:
org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
at
org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:102)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
... 16 more





On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi  wrote:

> Hi Roland,
>
>
> (13/10/17 20:44), Roland Everaert wrote:
>
>> Hi,
>>
>> I helped a customer deploy Solr+ManifoldCF and everything is going
>> quite smoothly, but every time Solr raises an exception, the ManifoldCF job
>> feeding Solr aborts. I would like to know if it is possible to configure the
>> ExtractRequestHandler to ignore errors, as seems to be possible with the
>> DataImportHandler and entity processors.
>>
>> I know that it is possible to configure the ExtractRequestHandler to ignore
>> Tika exceptions (we already do that), but the errors that now stop the MCF
>> jobs are generated by Solr itself.
>>
>> While it would be interesting to have such an option in Solr, I plan to post
>> to the ManifoldCF mailing list anyway, to ask whether it is possible to
>> configure ManifoldCF to be less picky about Solr errors.
>>
>>
> ignoreTikaException flag might help you?
>
> https://issues.apache.org/jira/browse/SOLR-2480
>
> koji
> --
> http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
>