Re: SOLR-7191 SolrCloud 5 with thousands of collections

2015-10-19 Thread Damien Kamerman
OK, it turned out ZkStateReader.constructState() was only calling
ClusterState.getCollections()
for a log.debug() statement. I removed that, and the next bottleneck is
ZkStateReader.fetchCollectionState() talking to ZooKeeper.

"coreZkRegister-4-thread-14-processing-n:ftet1:8003_solr
x:t_1558_shard1_replica1 s:shard1 c:t_1558 r:core_node1" #151 prio=5
os_prio=64 tid=0x05568800 nid=0xc8 in Object.wait()
[0x7fefb117c000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at
org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
- locked <0x7fff50fadf70> (a
org.apache.zookeeper.ClientCnxn$Packet)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1153)
at
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:353)
at
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:350)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:350)
at
org.apache.solr.common.cloud.ZkStateReader.fetchCollectionState(ZkStateReader.java:1029)
at
org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:260)
- locked <0x7ff040b92270> (a
org.apache.solr.common.cloud.ZkStateReader)
at
org.apache.solr.cloud.ZkController.register(ZkController.java:979)
at
org.apache.solr.cloud.ZkController.register(ZkController.java:881)
at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:184)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


On 19 October 2015 at 15:59, Damien Kamerman  wrote:

> Hi All,
>
> I've had a first look at porting the patch I did for SOLR-7191 (SolrCloud
> with thousands of collections) in Solr 4.10 to the Solr trunk (1708905).
> Now I created 6,000 collections (3 nodes; 2 x replicas) and restarted the
> 3 nodes. What I noticed is that the cloud starts, but slowly. All the
> org.apache.solr.core.CoreContainer.create()
> threads are blocked in the ZkStateReader. I was hoping the change to
> clusterstate.json from global to per-collection state would reduce the
> contention. Comments appreciated.
>
> example jstacks:
> "coreLoadExecutor-6-thread-24-processing-n:ftet1:8003_solr" #70 prio=5
> os_prio=64 tid=0x00bcd800 nid=0x88 waiting for monitor entry
> [0x7fefb29bc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.solr.common.cloud.ZkStateReader.addCollectionWatch(ZkStateReader.java:1048)
> - waiting to lock <0x7ff0403ff020> (a
> org.apache.solr.common.cloud.ZkStateReader)
> at
> org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1561)
> at
> org.apache.solr.core.CoreContainer.create(CoreContainer.java:726)
> at
> org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:451)
> at
> org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:442)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> "zkCallback-4-thread-80-processing-n:ftet1:8003_solr" #268 prio=5
> os_prio=64 tid=0x02ee nid=0x134 in Object.wait()
> [0x7fefaed2d000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at
> org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
> - locked <0x7ff0be17e600> (a
> org.apache.zookeeper.ClientCnxn$Packet)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1153)
> at
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:353)
> at
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:350)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at
> org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:350)
> at
> org.apache.solr.common.cloud.ZkStateReader.fetchCollectionState(ZkStateReader.java:1030)
> at
> org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:1015)
> at
> 

solr4.7: truncated log output in grouping.CommandHandler?

2015-10-19 Thread Jeff Wu
Our Solr 4.7 server recently reported the WARN message below, which came with a
long GC pause after it. Sometimes it forces the Solr server to disconnect
from the ZK server.

On Solr 4.7.0, we got this warning message:
WARN  - 2015-10-19 02:23:24.503;
org.apache.solr.search.grouping.CommandHandler; Query: +(+owner:testUser)
+(+directoryUUID:x +softFlag:`^H^@^@^@^@); Elapsed time:
20Exceeded allowed search time: 1 ms.

The question is:
the log string has something like this: `^H^@^@^@^@), which looks like a
truncated string. Is this expected output? Or does it indicate something wrong
with this query?


Re: Problem with indexing chinese characters when using SolrJ

2015-10-19 Thread Zheng Lin Edwin Yeo
I found that it works if I put the code this way, without the URLEncoder:

req.setParam("literal.title", filename);

Is the URLEncoder doing the encoding from the Chinese characters to a string
of codes like this: "%E7%AB%8B%E9"?

Regards,
Edwin


On 19 October 2015 at 11:29, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm using the URLEncoder with "UTF-8" encoding in SolrJ (the code is written
> in Eclipse) to index the file name to Solr. The code looks like this.
>
> req.setParam("literal.title", URLEncoder.encode(filename, "UTF-8"));
>
> However, what is indexed in Solr is a string of codes like this:
> "%E7%AB%8B%E9". When I do a System.out.println of the filename in
> Eclipse, the full Chinese characters are shown.
>
> What could be the reason that this is happening?
>
> I'm using Solr 5.3.0, Java 1.8 and Eclipse Juno Service Release 2.
>
> Thank you for your help.
>
> Regards,
> Edwin
>


Re: PayloadTermQuery deprecated

2015-10-19 Thread Alan Woodward
Hi Bill,

This looks like an oversight on my part when migrating the payload scoring 
queries - can you open a JIRA ticket to add 'includeSpanScore' as an option to 
PayloadScoreQuery?

As a workaround, you should be able to use a custom similarity that returns 1 
for all scores (see IndexSearcher.NON_SCORING_SIMILARITY for an implementation 
that returns 0, you could just clone that and change SimScorer.score())
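
For illustration, a rough sketch of such a similarity against the Lucene 5.3
Similarity API (untested, and the class name is made up; it mirrors the
non-scoring similarity but returns 1f so the payload factor passes through
unscaled):

import java.io.IOException;

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.util.BytesRef;

public class ConstantOneSimilarity extends Similarity {

  @Override
  public long computeNorm(FieldInvertState state) {
    return 1;  // ignore field length
  }

  @Override
  public SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats,
                                 TermStatistics... termStats) {
    return new SimWeight() {
      @Override
      public float getValueForNormalization() {
        return 1f;
      }

      @Override
      public void normalize(float queryNorm, float topLevelBoost) {
        // no-op: nothing to normalize
      }
    };
  }

  @Override
  public SimScorer simScorer(SimWeight weight, LeafReaderContext context) throws IOException {
    return new SimScorer() {
      @Override
      public float score(int doc, float freq) {
        return 1f;  // the one change vs. the non-scoring similarity, which returns 0
      }

      @Override
      public float computeSlopFactor(int distance) {
        return 1f;
      }

      @Override
      public float computePayloadFactor(int doc, int start, int end, BytesRef payload) {
        return 1f;
      }
    };
  }
}

You would then register it for the payload field -- in Solr, via a similarity
entry on the field type in schema.xml -- so that PayloadScoreQuery picks it up.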

Alan Woodward
www.flax.co.uk


On 19 Oct 2015, at 00:39, William Bell wrote:

> Here is my first stab at it. Thoughts?
> 
> Question:
> 
> new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]), new
> AveragePayloadFunction(), false)
> 
> How do I handle the "false"  ? It means boolean includeSpanScore
> 
> 
> @Override
> public Query parse() throws SyntaxError {
> 
>     if (qstr == null || qstr.length() == 0) return null;
>     // BooleanQuery q = new BooleanQuery();
>     BooleanQuery.Builder q = new BooleanQuery.Builder();
>     q.setDisableCoord(true);
>     if (qstr.length() > 1 && qstr.startsWith("\"") && qstr.endsWith("\"")) {
>         qstr = qstr.substring(1, qstr.length() - 1);
>     }
>     String[] nvps = StringUtils.split(qstr, " ");
>     for (int i = 0; i < nvps.length; i++) {
>         String[] nv = StringUtils.split(nvps[i], ":");
>         if (nv.length > 1) {
>             if (nv[0].startsWith("+")) {
>                 SpanTermQuery sq = new SpanTermQuery(new Term(nv[0].substring(1), nv[1]));
>                 PayloadScoreQuery psq = new PayloadScoreQuery(sq, new AveragePayloadFunction());
>                 q.add(psq, Occur.MUST);
>                 // q.add(new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]),
>                 //     new AveragePayloadFunction(), false), Occur.MUST);
>             } else {
>                 // q.add(new PayloadTermQuery(new Term(nv[0], nv[1]),
>                 //     new AveragePayloadFunction(), false), Occur.SHOULD);
>                 SpanTermQuery sq = new SpanTermQuery(new Term(nv[0], nv[1]));
>                 PayloadScoreQuery psq = new PayloadScoreQuery(sq, new AveragePayloadFunction());
>                 q.add(psq, Occur.SHOULD);
>             }
>         }
>     }
>     // return q;
>     return q.build();
> }
> 
> 
> On Sun, Oct 18, 2015 at 4:46 PM, William Bell  wrote:
> 
>> Wondering how to change my payload based on example:
>> 
>> https://lucidworks.com/blog/2014/06/13/end-to-end-payload-example-in-solr/
>> 
>> PayloadTermQuery and BooleanQuery are deprecated in 5.3.x
>> 
>> @Override
>> public Query parse() throws SyntaxError {
>> 
>>if (qstr == null || qstr.length() == 0) return null;
>>BooleanQuery q = new BooleanQuery();
>>if (qstr.length() > 1 && qstr.startsWith("\"") && qstr.endsWith("\"")) {
>>qstr = qstr.substring(1,qstr.length()-1);
>>}
>>String[] nvps = StringUtils.split(qstr, " ");
>>for (int i = 0; i < nvps.length; i++) {
>>String[] nv = StringUtils.split(nvps[i], ":");
>>if (nv.length > 1) {
>>  if (nv[0].startsWith("+")) {
>>q.add(new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]),
>>  new AveragePayloadFunction(), false), Occur.MUST);
>>  } else {
>>q.add(new PayloadTermQuery(new Term(nv[0], nv[1]),
>>  new AveragePayloadFunction(), false), Occur.SHOULD);
>>  }
>>}
>>}
>>return q;
>> }
>> 
>> 
>> --
>> Bill Bell
>> billnb...@gmail.com
>> cell 720-256-8076
>> 
> 
> 
> 
> -- 
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076



Re: Problem with indexing chinese characters when using SolrJ

2015-10-19 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Thank you for the explanation.

Regards,
Edwin


On 19 October 2015 at 15:58, Shawn Heisey  wrote:

> On 10/19/2015 12:18 AM, Zheng Lin Edwin Yeo wrote:
> > I found that it works if I put the code this way, without the
> > URLEncoder
> >
> > req.setParam("literal.title", filename);
> >
> > Is the URLEncoder doing the encoding from the Chinese characters to a
> > string
> > of codes like this "%E7%AB%8B%E9"?
>
> Yes, that's what the URL encoder would do.
>
> SolrJ will automatically handle URL encoding of parameters.  Because it
> is an API that takes Java code and generates HTTP requests, that's part
> of its job.
>
> If you URL encode it yourself, then SolrJ will URL encode it again when
> it sends it (turning the percent signs into %25), and Solr will receive
> things like the "%E7%AB%8B%E9" string you mentioned, not the original
> UTF-8 characters.
>
> Thanks,
> Shawn
>
>


Re: Problem with indexing chinese characters when using SolrJ

2015-10-19 Thread Shawn Heisey
On 10/19/2015 12:18 AM, Zheng Lin Edwin Yeo wrote:
> I found that it works if I put the code this way, without the URLEncoder
> 
> req.setParam("literal.title", filename);
> 
> Is the URLEncoder doing the encoding from the Chinese characters to a
> string
> of codes like this "%E7%AB%8B%E9"?

Yes, that's what the URL encoder would do.

SolrJ will automatically handle URL encoding of parameters.  Because it
is an API that takes Java code and generates HTTP requests, that's part
of its job.

If you URL encode it yourself, then SolrJ will URL encode it again when
it sends it (turning the percent signs into %25), and Solr will receive
things like the "%E7%AB%8B%E9" string you mentioned, not the original
UTF-8 characters.
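
To see the effect concretely, a small standalone sketch (plain Java; the file
name is just an example):

import java.net.URLEncoder;

public class DoubleEncode {
    public static void main(String[] args) throws Exception {
        String filename = "立體.pdf";  // example file name with Chinese characters
        String once = URLEncoder.encode(filename, "UTF-8");
        String twice = URLEncoder.encode(once, "UTF-8");  // what SolrJ would then send
        System.out.println(once);   // %E7%AB%8B%E9%AB%94.pdf -- the pre-encoded value
        System.out.println(twice);  // %25E7%25AB%258B%25E9%25AB%2594.pdf -- percent signs become %25
    }
}

Solr decodes one level, so it ends up storing the first line rather than the
original characters.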

Thanks,
Shawn



Anyone users IBM J9 JVM with 32G max heap ? Tuning recommendations?

2015-10-19 Thread Jeff Wu
Hi all,

We are running Solr 4.7 on the IBM J9 JVM (Java 7), with max heap at 32G and
64G of system RAM.

JVM parameters: -Xgcpolicy:balanced -verbose:gc -Xms12228m -Xmx32768m
-XX:PermSize=128m -XX:MaxPermSize=512m

We faced one issue here: we set the zkClient timeout value to 30 seconds.
With the balanced GC policy, we sometimes hit a global GC pause of
>30 seconds, so the Solr server disconnected from ZK, and /update
requests on this Solr were disabled after the ZK disconnect. We had to restart
the Solr server to recover.

Staying with the IBM JVM, does anyone have recommendations on this? The
average heap usage on our Solr server is around 26G, so we'd like to stay
with a 32G max heap, but we want to better tune the JVM to get shorter global
GC pauses.


Re: Nested entities not imported / do not show up in search?

2015-10-19 Thread Mikhail Khludnev
On Mon, Oct 19, 2015 at 2:48 AM, Matthias Fischer <
matthias.fisc...@doubleslash.de> wrote:

> Ok, thanks for your advice so far. I can import companies with their
> nested entities (business branches) now. But I wonder whether there is a
> way to query for company name patterns and get the business branches nested
> inside the respective companies.


please check the [child] doc transformer at
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
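
For example, something along these lines (illustrative and unescaped; doc_type
is a hypothetical marker field you would add so the parentFilter matches only
the company/parent docs):

http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1:Must*&fl=*,[child parentFilter=doc_type:company]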


> Using the following query I only get the companies without their nested
> entities:
>
> http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1%3AMu*=xml=true
>
> I can use the firma_ebi_nr (the company id) and get the associated
> branches by issuing the following query:
>
> http://localhost:8983/solr/jcg/select?q={!child%20of=%22firma_ebi_nr:123123%22}firma_ebi_nr:123123
> This results in a flat list of associated business branches. However I
> would like to search a company by name and in the result I would like to
> see all associated business branches nested inside the respective company.
> Is this possible or do I need to issue the second query above for each
> company search result in order to get the nested entities?
>
> Example of what I would like to achieve:
>
>
> http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1%3AMust*=xml=true
>
> 
> 
> 0
> 1
> 
> firma_namenszeile_1:Must*
> true
> xml
> 
> 
> 
> 
> 123123
> Musterfirma
> ac8d5627-b17a--8926-8d5a80680ee4
> 1515205299087081472
>
> 
> 
> 6
> 
> 
> 43000
> 
> 
> 43900
> 
>
>  
>  
> 
>
>
> Is this possible? Or maybe there is a better way than nested entities? An
> alternative I could think of is to join companies and branches in the JDBC
> import. But this would result in duplicate companies in the search result
> (one for each associated branch). My goal is to have a suggest field where
> the user can type a company name pattern and gets a list of matching
> companies including the associated branches. Any suggestions?
>
> Kind regards,
> Matthias
>
> -Ursprüngliche Nachricht-
> Von: Andrea Gazzarini [mailto:a.gazzar...@gmail.com]
> Gesendet: Freitag, 16. Oktober 2015 17:24
> An: solr-user@lucene.apache.org
> Betreff: Re: Nested entities not imported / do not show up in search?
>
> Hi Matthias,
> I guess the company.id field is not unique so you need a "compound"
> uniqueKey on Solr, which is not strictly possible. As a consequence of that
> (company) UUID is probably created before the index phase by an
> UpdateRequestProcessor [1] so you should check your solrconfig.xml and, if
> I'm right, check if the same strategy could be used for the nested entities.
>
> Andrea
>
> [1]
>
> http://lucene.apache.org/solr/5_2_1/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html
>
> 2015-10-16 17:11 GMT+02:00 Matthias Fischer <
> matthias.fisc...@doubleslash.de
> >:
>
> > Thank you, Andrea, for answering so quickly.
> >
> > However I got further errors. I also had to change
> > "firma_ebi_nr" to "id".
> > But it still does not work properly. It seems that an id is auto
> > generated for the company documents but not for the nested ones (the
> business branches).
> > Any ideas how to fix this?
> >
> > 2015-10-16 12:49:29.650 WARN  (Thread-17) [   x:jcg] o.a.s.h.d.SolrWriter
> > Error creating document :
> > SolrInputDocument(
> > fields: [firma_ebi_nr=317709682, firma_namenszeile_1=Example
> > Company, id=3c7f7421-9d51-4056-a2a0-eebab87a546a,
> > _version_=1515192078460518400,
> _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a],
> > children: [
> >SolrInputDocument(fields: [branche_ebc_code=7,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47000,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47700,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47790,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47791,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a])])
> > org.apache.solr.common.SolrException: [doc=null] missing required field:
> id
> > at
> >
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:198)
> > at
> > org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:191)
> > at
> > org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:166)
> > at
> >
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:259)
> > at
> >
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:413)
> > at
> >
> 

AW: Nested entities not imported / do not show up in search?

2015-10-19 Thread Matthias Fischer
Ok, thanks for your advice so far. I can import companies with their nested 
entities (business branches) now. But I wonder whether there is a way to query 
for company name patterns and get the business branches nested inside the 
respective companies. Using the following query I only get the companies 
without their nested entities:
http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1%3AMu*=xml=true

I can use the firma_ebi_nr (the company id) and get the associated branches by 
issuing the following query:
http://localhost:8983/solr/jcg/select?q={!child%20of=%22firma_ebi_nr:123123%22}firma_ebi_nr:123123
This results in a flat list of associated business branches. However I would 
like to search a company by name and in the result I would like to see all 
associated business branches nested inside the respective company.
Is this possible or do I need to issue the second query above for each company 
search result in order to get the nested entities?

Example of what I would like to achieve:

http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1%3AMust*=xml=true



0
1

firma_namenszeile_1:Must*
true
xml




123123
Musterfirma
ac8d5627-b17a--8926-8d5a80680ee4
1515205299087081472



6


43000


43900


 
 



Is this possible? Or maybe there is a better way than nested entities? An 
alternative I could think of is to join companies and branches in the JDBC 
import. But this would result in duplicate companies in the search result (one 
for each associated branch). My goal is to have a suggest field where the user 
can type a company name pattern and gets a list of matching companies including 
the associated branches. Any suggestions?

Kind regards,
Matthias

-Ursprüngliche Nachricht-
Von: Andrea Gazzarini [mailto:a.gazzar...@gmail.com] 
Gesendet: Freitag, 16. Oktober 2015 17:24
An: solr-user@lucene.apache.org
Betreff: Re: Nested entities not imported / do not show up in search?

Hi Matthias,
I guess the company.id field is not unique so you need a "compound"
uniqueKey on Solr, which is not strictly possible. As a consequence of that
(company) UUID is probably created before the index phase by an 
UpdateRequestProcessor [1] so you should check your solrconfig.xml and, if I'm 
right, check if the same strategy could be used for the nested entities.

Andrea

[1]
http://lucene.apache.org/solr/5_2_1/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html

2015-10-16 17:11 GMT+02:00 Matthias Fischer :

> Thank you, Andrea, for answering so quickly.
>
> However I got further errors. I also had to change 
> "firma_ebi_nr" to "id". 
> But it still does not work properly. It seems that an id is auto 
> generated for the company documents but not for the nested ones (the business 
> branches).
> Any ideas how to fix this?
>
> 2015-10-16 12:49:29.650 WARN  (Thread-17) [   x:jcg] o.a.s.h.d.SolrWriter
> Error creating document :
> SolrInputDocument(
> fields: [firma_ebi_nr=317709682, firma_namenszeile_1=Example 
> Company, id=3c7f7421-9d51-4056-a2a0-eebab87a546a, 
> _version_=1515192078460518400, _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a],
> children: [
>SolrInputDocument(fields: [branche_ebc_code=7, 
> _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
>SolrInputDocument(fields: [branche_ebc_code=47000, 
> _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
>SolrInputDocument(fields: [branche_ebc_code=47700, 
> _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
>SolrInputDocument(fields: [branche_ebc_code=47790, 
> _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
>SolrInputDocument(fields: [branche_ebc_code=47791,
> _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a])])
> org.apache.solr.common.SolrException: [doc=null] missing required field: id
> at
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:198)
> at
> org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:191)
> at
> org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:166)
> at
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:259)
> at
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:413)
> at
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1316)
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:235)
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> at
> 

Re: AW: Nested entities not imported / do not show up in search?

2015-10-19 Thread Andrea Gazzarini
Most probably my answer makes no sense because I don't know the overall
context, but why don't you import flat branches and companies with a "type"
attribute ("company" or "branch") and an "owner" field that is
populated only for branches, holding the company id? Then you could
autocomplete on the company name (fq=type:"company"). Once a company is
selected, it would be just a matter of another query with two fq:
type:"branch", owner:<company id>

Andrea
On 19 Oct 2015 11:48, "Matthias Fischer" 
wrote:

> Ok, thanks for your advice so far. I can import companies with their
> nested entities (business branches) now. But I wonder whether there is a
> way to query for company name patterns and get the business branches nested
> inside the respective companies. Using the following query I only get the
> companies without their nested entities:
>
> http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1%3AMu*=xml=true
>
> I can use the firma_ebi_nr (the company id) and get the associated
> branches by issuing the following query:
>
> http://localhost:8983/solr/jcg/select?q={!child%20of=%22firma_ebi_nr:123123%22}firma_ebi_nr:123123
> This results in a flat list of associated business branches. However I
> would like to search a company by name and in the result I would like to
> see all associated business branches nested inside the respective company.
> Is this possible or do I need to issue the second query above for each
> company search result in order to get the nested entities?
>
> Example of what I would like to achieve:
>
>
> http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1%3AMust*=xml=true
>
> 
> 
> 0
> 1
> 
> firma_namenszeile_1:Must*
> true
> xml
> 
> 
> 
> 
> 123123
> Musterfirma
> ac8d5627-b17a--8926-8d5a80680ee4
> 1515205299087081472
>
> 
> 
> 6
> 
> 
> 43000
> 
> 
> 43900
> 
>
>  
>  
> 
>
>
> Is this possible? Or maybe there is a better way than nested entities? An
> alternative I could think of is to join companies and branches in the JDBC
> import. But this would result in duplicate companies in the search result
> (one for each associated branch). My goal is to have a suggest field where
> the user can type a company name pattern and gets a list of matching
> companies including the associated branches. Any suggestions?
>
> Kind regards,
> Matthias
>
> -Ursprüngliche Nachricht-
> Von: Andrea Gazzarini [mailto:a.gazzar...@gmail.com]
> Gesendet: Freitag, 16. Oktober 2015 17:24
> An: solr-user@lucene.apache.org
> Betreff: Re: Nested entities not imported / do not show up in search?
>
> Hi Matthias,
> I guess the company.id field is not unique so you need a "compound"
> uniqueKey on Solr, which is not strictly possible. As a consequence of that
> (company) UUID is probably created before the index phase by an
> UpdateRequestProcessor [1] so you should check your solrconfig.xml and, if
> I'm right, check if the same strategy could be used for the nested entities.
>
> Andrea
>
> [1]
>
> http://lucene.apache.org/solr/5_2_1/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html
>
> 2015-10-16 17:11 GMT+02:00 Matthias Fischer <
> matthias.fisc...@doubleslash.de
> >:
>
> > Thank you, Andrea, for answering so quickly.
> >
> > However I got further errors. I also had to change
> > "firma_ebi_nr" to "id".
> > But it still does not work properly. It seems that an id is auto
> > generated for the company documents but not for the nested ones (the
> business branches).
> > Any ideas how to fix this?
> >
> > 2015-10-16 12:49:29.650 WARN  (Thread-17) [   x:jcg] o.a.s.h.d.SolrWriter
> > Error creating document :
> > SolrInputDocument(
> > fields: [firma_ebi_nr=317709682, firma_namenszeile_1=Example
> > Company, id=3c7f7421-9d51-4056-a2a0-eebab87a546a,
> > _version_=1515192078460518400,
> _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a],
> > children: [
> >SolrInputDocument(fields: [branche_ebc_code=7,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47000,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47700,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47790,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47791,
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a])])
> > org.apache.solr.common.SolrException: [doc=null] missing required field:
> id
> > at
> >
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:198)
> > at
> > org.apache.solr.update.AddUpdateCommand$1.next(AddUpdateCommand.java:191)
> > at
> > 

RE: Recursively scan documents for indexing in a folder in SolrJ

2015-10-19 Thread Duck Geraint (ext) GBJH
"The problem for this is that it is indexing all the files regardless of the 
formats, instead of just those formats in post.jar. So I guess still have to 
"steal" some codes from there to detect the file format?"

If you've not worked it out yourself yet, try something like:
http://docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles(java.io.FilenameFilter)
http://stackoverflow.com/questions/5751335/using-file-listfiles-with-filenameextensionfilter
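
Putting those together, a minimal sketch (the extension allow-list below is a
stand-in, not post.jar's actual list -- adjust as needed):

import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;
import java.util.List;

public class FilteredCrawler {

    // Hypothetical allow-list; post.jar's real list lives in SimplePostTool.
    private static final List<String> EXTS = Arrays.asList(
        "xml", "json", "csv", "pdf", "doc", "docx", "ppt", "pptx", "txt", "html");

    // Accept directories (so recursion can continue) and allowed extensions.
    private static final FilenameFilter FILTER = new FilenameFilter() {
        @Override
        public boolean accept(File dir, String name) {
            if (new File(dir, name).isDirectory()) {
                return true;
            }
            int dot = name.lastIndexOf('.');
            return dot >= 0 && EXTS.contains(name.substring(dot + 1).toLowerCase());
        }
    };

    public static void showFiles(File[] files) {
        if (files == null) {
            return;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                showFiles(file.listFiles(FILTER)); // recurse with the same filter
            } else {
                System.out.println("File: " + file.getName()); // index this one
            }
        }
    }
}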

Geraint

Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: 17 October 2015 00:55
To: solr-user@lucene.apache.org
Subject: Re: Recursively scan documents for indexing in a folder in SolrJ

Thanks for your advice. I also found this method which so far has been able to 
traverse all the documents in the folder and index them in Solr.

public static void showFiles(File[] files) {
    for (File file : files) {
        if (file.isDirectory()) {
            System.out.println("Directory: " + file.getName());
            showFiles(file.listFiles()); // Calls same method again.
        } else {
            System.out.println("File: " + file.getName());
        }
    }
}

The problem with this is that it indexes all the files regardless of the
formats, instead of just those formats in post.jar. So I guess I still have to
"steal" some code from there to detect the file format?

As for files that contain non-English characters (e.g. Chinese characters),
it is currently not able to read the Chinese characters, and they are all read
as a series of "???". Any idea how to solve this problem?

Thank you.

Regards,
Edwin


On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH < 
geraint.d...@syngenta.com> wrote:

> Also, check this link for SolrJ example code (including the recursion):
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Geraint
>
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.d...@syngenta.com
>
> -Original Message-
> From: Jan Høydahl [mailto:jan@cominvent.com]
> Sent: 16 October 2015 12:14
> To: solr-user@lucene.apache.org
> Subject: Re: Recursively scan documents for indexing in a folder in
> SolrJ
>
> SolrJ does not have any file crawler built in.
> But you are free to steal code from SimplePostTool.java related to
> directory traversal, and then index each document found using SolrJ.
>
> Note that SimplePostTool.java tries to be smart about which endpoint to
> post files to: xml, csv and json content will be posted to /update,
> while office docs go to /update/extract.
>
> --
> Jan Høydahl, search solution architect Cominvent AS -
> www.cominvent.com
>
> > 16. okt. 2015 kl. 05.22 skrev Zheng Lin Edwin Yeo
> > >:
> >
> > Hi,
> >
> > I understand that in SimplePostTool (post.jar), there is this
> > command to automatically detect content types in a folder, and
> > recursively scan it for documents for indexing into a collection:
> > bin/post -c gettingstarted afolder/
> >
> > This has been useful for me to do mass indexing of all the files
> > > that are in the folder. Now that I'm moving to production, I plan
> > > to use SolrJ to do the indexing, as it can do more things like
> > > robustness checks and retries for indexing that fails.
> >
> > > However, I can't seem to find a way to do the same in SolrJ. Is it
> > possible for this to be done in SolrJ? I'm using Solr 5.3.0
> >
> > Thank you.
> >
> > Regards,
> > Edwin
>
>
> 
>
>
>





AW: AW: Nested entities not imported / do not show up in search?

2015-10-19 Thread Matthias Fischer
Thanks, Andrea, your answer does make sense! Obviously as a SOLR newbie I am 
still thinking too much in terms of traditional databases ;-)

Kind regards
Matthias

-Ursprüngliche Nachricht-
Von: Andrea Gazzarini [mailto:a.gazzar...@gmail.com] 
Gesendet: Montag, 19. Oktober 2015 12:05
An: solr-user@lucene.apache.org
Betreff: Re: AW: Nested entities not imported / do not show up in search?

Most probably my answer makes no sense because I don't know the overall
context, but why don't you import flat branches and companies with a "type"
attribute ("company" or "branch") and an "owner" field that is populated
only for branches, holding the company id? Then you could autocomplete on the
company name (fq=type:"company"). Once a company is selected, it would be just
a matter of another query with two fq:
type:"branch", owner:<company id>

Andrea
On 19 Oct 2015 11:48, "Matthias Fischer" 
wrote:

> Ok, thanks for your advice so far. I can import companies with their 
> nested entities (business branches) now. But I wonder whether there is 
> a way to query for company name patterns and get the business branches 
> nested inside the respective companies. Using the following query I 
> only get the companies without their nested entities:
>
> http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1%3AMu*=x
> ml=true
>
> I can use the firma_ebi_nr (the company id) and get the associated 
> branches by issuing the following query:
>
> http://localhost:8983/solr/jcg/select?q={!child%20of=%22firma_ebi_nr:1
> 23123%22}firma_ebi_nr:123123 This results in a flat list of associated 
> business branches. However I would like to search a company by name 
> and in the result I would like to see all associated business branches 
> nested inside the respective company.
> Is this possible or do I need to issue the second query above for each 
> company search result in order to get the nested entities?
>
> Example of what I would like to achieve:
>
>
> http://localhost:8983/solr/jcg/select?q=firma_namenszeile_1%3AMust*
> =xml=true
>
> 
> 
> 0
> 1
> 
> firma_namenszeile_1:Must*
> true
> xml
> 
> 
> 
> 
> 123123
> Musterfirma
> ac8d5627-b17a--8926-8d5a80680ee4
> 1515205299087081472
>
> 
> 
> 6
> 
> 
> 43000
> 
> 
> 43900
> 
>
>  
>  
> 
>
>
> Is this possible? Or maybe there is a better way than nested entities? 
> An alternative I could think of is to join companies and branches in 
> the JDBC import. But this would result in duplicate companies in the 
> search result (one for each associated branch). My goal is to have a 
> suggest field where the user can type a company name pattern and gets 
> a list of matching companies including the associated branches. Any 
> suggestions?
>
> Kind regards,
> Matthias
>
> -Ursprüngliche Nachricht-
> Von: Andrea Gazzarini [mailto:a.gazzar...@gmail.com]
> Gesendet: Freitag, 16. Oktober 2015 17:24
> An: solr-user@lucene.apache.org
> Betreff: Re: Nested entities not imported / do not show up in search?
>
> Hi Matthias,
> I guess the company.id field is not unique so you need a "compound"
> uniqueKey on Solr, which is not strictly possible. As a consequence of 
> that
> (company) UUID is probably created before the index phase by an 
> UpdateRequestProcessor [1] so you should check your solrconfig.xml 
> and, if I'm right, check if the same strategy could be used for the nested 
> entities.
>
> Andrea
>
> [1]
>
> http://lucene.apache.org/solr/5_2_1/solr-core/org/apache/solr/update/p
> rocessor/UUIDUpdateProcessorFactory.html
>
> 2015-10-16 17:11 GMT+02:00 Matthias Fischer < 
> matthias.fisc...@doubleslash.de
> >:
>
> > Thank you, Andrea, for answering so quickly.
> >
> > However I got further errors. I also had to change 
> > "firma_ebi_nr" to "id".
> > But it still does not work properly. It seems that an id is auto 
> > generated for the company documents but not for the nested ones (the
> business branches).
> > Any ideas how to fix this?
> >
> > 2015-10-16 12:49:29.650 WARN  (Thread-17) [   x:jcg] o.a.s.h.d.SolrWriter
> > Error creating document :
> > SolrInputDocument(
> > fields: [firma_ebi_nr=317709682, firma_namenszeile_1=Example 
> > Company, id=3c7f7421-9d51-4056-a2a0-eebab87a546a,
> > _version_=1515192078460518400,
> _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a],
> > children: [
> >SolrInputDocument(fields: [branche_ebc_code=7, 
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47000, 
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47700, 
> > _root_=3c7f7421-9d51-4056-a2a0-eebab87a546a]),
> >SolrInputDocument(fields: [branche_ebc_code=47790, 
> > 

Configuration

2015-10-19 Thread fabigol
Hi,
I've inherited an old Solr project that I need to configure.
I have the XML file for each entity, but I don't have the database.
Is there a way to find the table schema?
Do any tools exist to generate the tables from the XML file?
Here is the file:

[The attached data-config.xml was stripped by the list archive; a partial copy
survives in the quoted reply under "Re: Configuration" below.]
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuration-tp4235163.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: PayloadTermQuery deprecated

2015-10-19 Thread Alan Woodward
I opened https://issues.apache.org/jira/browse/LUCENE-6844

Alan Woodward
www.flax.co.uk


On 19 Oct 2015, at 08:49, Alan Woodward wrote:

> Hi Bill,
> 
> This looks like an oversight on my part when migrating the payload scoring 
> queries - can you open a JIRA ticket to add 'includeSpanScore' as an option 
> to PayloadScoreQuery?
> 
> As a workaround, you should be able to use a custom similarity that returns 1 
> for all scores (see IndexSearcher.NON_SCORING_SIMILARITY for an 
> implementation that returns 0, you could just clone that and change 
> SimScorer.score())
> 
> Alan Woodward
> www.flax.co.uk
> 
> 
> On 19 Oct 2015, at 00:39, William Bell wrote:
> 
>> Here is my first stab at it. Thoughts?
>> 
>> Question:
>> 
>> new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]), new
>> AveragePayloadFunction(), false)
>> 
>> How do I handle the "false"  ? It means boolean includeSpanScore
>> 
>> 
>> @Override
>> public Query parse() throws SyntaxError {
>> 
>>if (qstr == null || qstr.length() == 0) return null;
>>//BooleanQuery q = new BooleanQuery();
>>BooleanQuery.Builder q = new BooleanQuery.Builder();
>>q.setDisableCoord(true);
>>if (qstr.length() > 1 && qstr.startsWith("\"") && qstr.endsWith("\"")) {
>>qstr = qstr.substring(1,qstr.length()-1);
>>}
>>String[] nvps = StringUtils.split(qstr, " ");
>>for (int i = 0; i < nvps.length; i++) {
>>String[] nv = StringUtils.split(nvps[i], ":");
>>if (nv.length > 1) {
>>  if (nv[0].startsWith("+")) {
>>  SpanTermQuery sq = new SpanTermQuery(new
>> Term(nv[0].substring(1), nv[1]));
>>  PayloadScoreQuery psq = new PayloadScoreQuery(sq, new
>> AveragePayloadFunction());
>>  q.add(psq, Occur.MUST);
>>//q.add(new PayloadTermQuery(new Term(nv[0].substring(1),
>> nv[1]), new AveragePayloadFunction(), false), Occur.MUST);
>>  } else {
>>//q.add(new PayloadTermQuery(new Term(nv[0], nv[1]), new
>> AveragePayloadFunction(), false), Occur.SHOULD);
>>  SpanTermQuery sq = new SpanTermQuery(new Term(nv[0], nv[1]));
>>  PayloadScoreQuery psq = new PayloadScoreQuery(sq, new
>> AveragePayloadFunction());
>>  q.add(psq, Occur.SHOULD);
>>  }
>>}
>>}
>>// return q;
>>return q.build();
>> }
>> 
>> 
>> On Sun, Oct 18, 2015 at 4:46 PM, William Bell  wrote:
>> 
>>> Wondering how to change my payload based on example:
>>> 
>>> https://lucidworks.com/blog/2014/06/13/end-to-end-payload-example-in-solr/
>>> 
>>> PayloadTermQuery and BooleanQuery are deprecated in 5.3.x
>>> 
>>> @Override
>>> public Query parse() throws SyntaxError {
>>> 
>>>if (qstr == null || qstr.length() == 0) return null;
>>>BooleanQuery q = new BooleanQuery();
>>>if (qstr.length() > 1 && qstr.startsWith("\"") && qstr.endsWith("\"")) {
>>>qstr = qstr.substring(1,qstr.length()-1);
>>>}
>>>String[] nvps = StringUtils.split(qstr, " ");
>>>for (int i = 0; i < nvps.length; i++) {
>>>String[] nv = StringUtils.split(nvps[i], ":");
>>>if (nv.length > 1) {
>>>  if (nv[0].startsWith("+")) {
>>>q.add(new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]),
>>>  new AveragePayloadFunction(), false), Occur.MUST);
>>>  } else {
>>>q.add(new PayloadTermQuery(new Term(nv[0], nv[1]),
>>>  new AveragePayloadFunction(), false), Occur.SHOULD);
>>>  }
>>>}
>>>}
>>>return q;
>>> }
>>> 
>>> 
>>> --
>>> Bill Bell
>>> billnb...@gmail.com
>>> cell 720-256-8076
>>> 
>> 
>> 
>> 
>> -- 
>> Bill Bell
>> billnb...@gmail.com
>> cell 720-256-8076
> 



Re: File-based Spelling

2015-10-19 Thread Mark Fenbers
OK.  I removed it, started Solr, and refreshed the query, but my results 
are the same, indicating that queryAnalyzerFieldType has nothing to do 
with my problem.


New ideas??
Mark

On 10/19/2015 4:37 AM, Duck Geraint (ext) GBJH wrote:

"Yet, it claimed it found my misspelled word to be "fenber" without the "s""
I wonder if this is because you seem to be applying a stemmer to your dictionary 
words.

Try removing the <str name="queryAnalyzerFieldType">text_en</str> line from
your spellcheck search component definition.

Geraint


Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com


-Original Message-
From: Mark Fenbers [mailto:mark.fenb...@noaa.gov]
Sent: 16 October 2015 19:43
To: solr-user@lucene.apache.org
Subject: Re: File-based Spelling

On 10/13/2015 9:30 AM, Dyer, James wrote:

Mark,

The older spellcheck implementations create an n-gram sidecar index, which is 
why you're seeing your name split into 2-grams like this.  See the IR Book by 
Manning et al, section 3.3.4 for more information.  Based on the results you're 
getting, I think it is loading your file correctly.  You should now try a query 
against this spelling index, using words *not* in the file you loaded that are 
within 1 or 2 edits from something that is in the dictionary.  If it doesn't 
yield suggestions, then post the relevant sections of the solrconfig.xml, 
schema.xml and also the query string you are trying.

James Dyer
Ingram Content Group


James, I've already done this.   My query string was "fenbers". This is
my last name, which does *not* occur in the linux.words file.  It is only
1 edit distance from "fenders", which *is* in the linux.words file.  Yet, it
claimed it found my misspelled word to be "fenber" without the "s",
and it gave me these 8 suggestions:
f en be r
f e nb er
f en b er
f e n be r
f en b e r
f e nb e r
f e n b er
f e n b e r

So I'm attaching the entire solrconfig.xml and schema.xml that are in
effect.  These are in a single file with all the block comments removed.

I'm also puzzled that you say "older implementations create a sidecar index"... 
because I am using v5.3.0, which was the latest version as of my download a month or two 
ago.  So, with my implementation being recent, why is an n-gram sidecar index still 
(seemingly) being produced?

thanks for the help!
Mark










RE: File-based Spelling

2015-10-19 Thread Duck Geraint (ext) GBJH
"Yet, it claimed it found my misspelled word to be "fenber" without the "s""
I wonder if this is because you seem to be applying a stemmer to your dictionary 
words.

Try removing the <str name="queryAnalyzerFieldType">text_en</str> line from
your spellcheck search component definition.

Geraint


Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com


-Original Message-
From: Mark Fenbers [mailto:mark.fenb...@noaa.gov]
Sent: 16 October 2015 19:43
To: solr-user@lucene.apache.org
Subject: Re: File-based Spelling

On 10/13/2015 9:30 AM, Dyer, James wrote:
> Mark,
>
> The older spellcheck implementations create an n-gram sidecar index, which is 
> why you're seeing your name split into 2-grams like this.  See the IR Book by 
> Manning et al, section 3.3.4 for more information.  Based on the results 
> you're getting, I think it is loading your file correctly.  You should now 
> try a query against this spelling index, using words *not* in the file you 
> loaded that are within 1 or 2 edits from something that is in the dictionary. 
>  If it doesn't yield suggestions, then post the relevant sections of the 
> solrconfig.xml, schema.xml and also the query string you are trying.
>
> James Dyer
> Ingram Content Group
>
James, I've already done this.   My query string was "fenbers". This is
my last name, which does *not* occur in the linux.words file.  It is only
1 edit distance from "fenders", which *is* in the linux.words file.  Yet, it
claimed it found my misspelled word to be "fenber" without the "s",
and it gave me these 8 suggestions:
f en be r
f e nb er
f en b er
f e n be r
f en b e r
f e nb e r
f e n b er
f e n b e r

So I'm attaching the entire solrconfig.xml and schema.xml that are in
effect.  These are in a single file with all the block comments removed.

I'm also puzzled that you say "older implementations create a sidecar index"... 
because I am using v5.3.0, which was the latest version as of my download a 
month or two ago.  So, with my implementation being recent, why is an n-gram 
sidecar index still (seemingly) being produced?

thanks for the help!
Mark








Re: Recursively scan documents for indexing in a folder in SolrJ

2015-10-19 Thread Zheng Lin Edwin Yeo
Yes, I've managed to "steal" some code from post.jar so that only
rich-text document formats are sent to /update/extract.
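
The core of it looks roughly like this (a sketch -- the core URL and literal
fields are placeholders, adjust to your setup):

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractOne {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        File file = new File("afolder/example.pdf");
        req.addFile(file, "application/pdf");        // content stream plus MIME type
        req.setParam("literal.id", file.getName());  // unique key for the document
        req.setParam("literal.title", file.getName());
        client.request(req);                          // POST to /update/extract
        client.commit();
        client.close();
    }
}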

I've also changed the setting in Eclipse at Windows -> Preference ->
General -> Workspace. Under Text file encoding, select Other, and choose
UTF-8. Eclipse is now able to read the Chinese characters successfully.

Thank you for your help.

Regards,
Edwin



On 19 October 2015 at 16:33, Duck Geraint (ext) GBJH <
geraint.d...@syngenta.com> wrote:

> "The problem for this is that it is indexing all the files regardless of
> the formats, instead of just those formats in post.jar. So I guess still
> have to "steal" some codes from there to detect the file format?"
>
> If you've not worked it out yourself yet, try something like:
>
> http://docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles(java.io.FilenameFilter)
>
> http://stackoverflow.com/questions/5751335/using-file-listfiles-with-filenameextensionfilter
>
> Geraint
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.d...@syngenta.com
>
> -Original Message-
> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> Sent: 17 October 2015 00:55
> To: solr-user@lucene.apache.org
> Subject: Re: Recursively scan documents for indexing in a folder in SolrJ
>
> Thanks for your advice. I also found this method which so far has been
> able to traverse all the documents in the folder and index them in Solr.
>
> public static void showFiles(File[] files) {
>     for (File file : files) {
>         if (file.isDirectory()) {
>             System.out.println("Directory: " + file.getName());
>             showFiles(file.listFiles()); // Calls same method again.
>         } else {
>             System.out.println("File: " + file.getName());
>         }
>     }
> }
>
> The problem with this is that it indexes all the files regardless of
> the formats, instead of just those formats in post.jar. So I guess I still
> have to "steal" some code from there to detect the file format?
>
> As for files that contain non-English characters (e.g. Chinese
> characters), it is currently not able to read the Chinese characters, and
> they are all read as a series of "???". Any idea how to solve this problem?
>
> Thank you.
>
> Regards,
> Edwin
>
>
> On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH <
> geraint.d...@syngenta.com> wrote:
>
> > Also, check this link for SolrJ example code (including the recursion):
> > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >
> > Geraint
> >
> >
> > Geraint Duck
> > Data Scientist
> > Toxicology and Health Sciences
> > Syngenta UK
> > Email: geraint.d...@syngenta.com
> >
> > -Original Message-
> > From: Jan Høydahl [mailto:jan@cominvent.com]
> > Sent: 16 October 2015 12:14
> > To: solr-user@lucene.apache.org
> > Subject: Re: Recursively scan documents for indexing in a folder in
> > SolrJ
> >
> > SolrJ does not have any file crawler built in.
> > But you are free to steal code from SimplePostTool.java related to
> > directory traversal, and then index each document found using SolrJ.
> >
> > Note that SimplePostTool.java tries to be smart about which endpoint to
> > post files to: xml, csv and json content will be posted to /update,
> > while office docs go to /update/extract.
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS -
> > www.cominvent.com
> >
> > > 16. okt. 2015 kl. 05.22 skrev Zheng Lin Edwin Yeo
> > > > >:
> > >
> > > Hi,
> > >
> > > I understand that in SimplePostTool (post.jar), there is this
> > > command to automatically detect content types in a folder, and
> > > recursively scan it for documents for indexing into a collection:
> > > bin/post -c gettingstarted afolder/
> > >
> > > This has been useful for me to do mass indexing of all the files
> > > that are in the folder. Now that I'm moving to production, I plan
> > > to use SolrJ to do the indexing, as it can do more things like
> > > robustness checks and retries for indexing that fails.
> > >
> > > However, I can't seem to find a way to do the same in SolrJ. Is it
> > > possible for this to be done in SolrJ? I'm using Solr 5.3.0
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Edwin
> >
> >
> > 
> >
> >
> >
> 
>
>

Re: Anyone users IBM J9 JVM with 32G max heap ? Tuning recommendations?

2015-10-19 Thread Pushkar Raste
Do you have GC logging turned on? If yes, can you provide an excerpt from the
GC log for a pause that took >30 sec?

On 19 October 2015 at 04:16, Jeff Wu  wrote:

> Hi all,
>
> We are running Solr 4.7 on the IBM J9 JVM (Java 7), with max heap at 32G
> and 64G of system RAM.
>
> JVM parameters: -Xgcpolicy:balanced -verbose:gc -Xms12228m -Xmx32768m
> -XX:PermSize=128m -XX:MaxPermSize=512m
>
> We faced one issue here: we set the zkClient timeout value to 30 seconds.
> With the balanced GC policy, we sometimes hit a global GC pause of
> >30 seconds, so the Solr server disconnected from ZK, and /update
> requests on this Solr were disabled after the ZK disconnect. We had to
> restart the Solr server to recover.
>
> Staying with the IBM JVM, does anyone have recommendations on this? The
> average heap usage on our Solr server is around 26G, so we'd like to stay
> with a 32G max heap, but we want to better tune the JVM to get shorter
> global GC pauses.
>


RequestProcessor with IndexSearcher for Different Core

2015-10-19 Thread Kilian Woods
Hi All,

I am looking to solve a design problem of mine.

I want to create a SolrIndexSearcher for a *different* core inside a
RequestProcessor.

I know how to create a SolrIndexSearcher from the SolrQueryRequest but I
want to search a different core instead. I want to update the current
document with data from another core.

At the moment I am creating a separate HttpSolrServer for the different
core and querying that way but would like to simplify as this approach
seems inefficient.

Any ideas?

Kind Regards,
Kilian.


Re: RequestProcessor with IndexSearcher for Different Core

2015-10-19 Thread Mikhail Khludnev
Assuming you need to access a sibling core from an UpdateRequestHandler, you
can see how it's done in the cross-core join:
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/join/ScoreJoinQParserPlugin.java#L256
Don't forget to close all resources.
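
Trimmed down, the pattern from that class looks roughly like this (a sketch
against the Solr 5.x APIs; "otherCore" stands in for the sibling core's name):

import org.apache.solr.common.SolrException;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

// Inside your processor, with access to the SolrQueryRequest:
void querySiblingCore(SolrQueryRequest req) throws Exception {
  CoreContainer container = req.getCore().getCoreDescriptor().getCoreContainer();
  SolrCore fromCore = container.getCore("otherCore");  // increments the core's ref count
  if (fromCore == null) {
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "No such core: otherCore");
  }
  try {
    RefCounted<SolrIndexSearcher> holder = fromCore.getRegisteredSearcher();
    try {
      SolrIndexSearcher searcher = holder.get();
      // ... run your lookup against the sibling core and
      // ... enrich the document being processed ...
    } finally {
      holder.decref();  // release the searcher
    }
  } finally {
    fromCore.close();   // decrements the core's ref count
  }
}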



On Mon, Oct 19, 2015 at 7:07 AM, Kilian Woods 
wrote:

> Hi All,
>
> I am looking to solve a design problem of mine.
>
> I want to create a SolrIndexSearcher for a *different* core inside a
> RequestProcessor.
>
> I know how to create a SolrIndexSearcher from the SolrQueryRequest but I
> want to search a different core instead. I want to update the current
> document with data from another core.
>
> At the moment I am creating a separate HttpSolrServer for the different
> core and querying that way but would like to simplify as this approach
> seems inefficient.
>
> Any ideas?
>
> Kind Regards,
> Kilian.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Configuration

2015-10-19 Thread Alexandre Rafalovitch
Sounds like a mission impossible given the number of inner joins.

However, what are you _actually_ trying to do? Are you trying to
reindex the data? Do you actually have the data to reindex?

Regards,
  Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 19 October 2015 at 06:41, fabigol  wrote:
> Hi,
> I've inherited an old Solr project that I need to configure.
> I have the XML file for each entity, but I don't have the database.
> Is there a way to find the table schema?
> Do any tools exist to generate the tables from the XML file?
> Here is the file:
>
> <dataConfig>
>   <dataSource url="jdbc:postgresql://localhost:5432/myvialtis" user="postgres"
>     password="postgres" readOnly="true" autoCommit="false" batchSize="10"
>     transactionIsolation="TRANSACTION_READ_COMMITTED"
>     holdability="CLOSE_CURSORS_AT_COMMIT" />
>
>   <document>
>     <entity transformer="RegexTransformer"
>       query="select *
>         from bi_solr_transaction
>         inner join bi_solr_tiers on tiers_id_client = conso_id_client
>         inner join bi_solr_contrat on contrat_id_contrat = conso_id_contrat
>         left join bi_solr_carte on carte_id_carte = conso_id_carte
>         left join bi_solr_invoice on invoice_numero_facture=conso_numero_facture
>         where conso_type_transaction is null or conso_type_transaction != '1'">
>
>       <field column="conso_immatriculation" />
>       <field column="conso_classe_vehicule" />
>       <field column="carte_carte_active" />
>       <field column="conso_gare_entree" />
>       <field column="conso_gare_sortie" />
>       <field column="conso_autoroute_entree" />
>       <field column="conso_autoroute_sortie" />
>       <field column="conso_sca_entree" />
>       <field column="conso_date_entree" />
>       <field column="conso_date_sortie" />
>       <field column="conso_date_ddmm" />
>       <field column="conso_date_mm" />
>       <field column="conso_date_hh24mi" />
>       <field column="conso_date_dow" />
>       <field column="conso_gare_pkm_entree" />
>       <field column="conso_gare_pkm_sortie" />
>       <field column="conso_classe_tarif" />
>       <field column="conso_type_transaction" />
>       <field column="conso_sous_produit" />
>       <field splitBy="," />
>       <field column="tiers_id_pays_client" />
>       <field column="contrat_id_type_service" />
>       <field column="contrat_id_contrat_cadre" />
>       <field column="contrat_id_statut_contrat" />
>       <field column="contrat_nom_produit" />
>       <field column="contrat_id_beneficiaire_conso" />
>       <field column="contrat_id_beneficiaire_hono" />
>       <field column="contrat_id_payeur_conso" />
>       <field column="contrat_id_payeur_hono" />
>       <field column="contrat_id_partenaire" />
>       <field column="contrat_nom_partenaire" />
>       <field column="contrat_type_partenaire" />
>       <field column="contrat_nom_filiale" />
>       <field column="contrat_compte_web_possible" />
>       <field column="contrat_ref_contrat_tva" />
>       <field column="contrat_pays_depot_tva" />
>       <field column="invoice_numero_facture" />
>       <field column="invoice_id_beneficiaire" />
>       <field column="invoice_date_emission" />
>       <field column="invoice_date_emission_dd_mm_" />
>       <field column="invoice_date_emission_mm_" />
>       <field column="invoice_date_emission_" />
>       <field column="invoice_total_ht" 

Re: Anyone users IBM J9 JVM with 32G max heap ? Tuning recommendations?

2015-10-19 Thread Toke Eskildsen
Jeff Wu  wrote:
> Staying with the IBM JVM, does anyone have recommendations on this? The
> average heap usage on our Solr server is around 26G, so we'd like to stay
> with a 32G max heap, but we want to better tune the JVM to get shorter
> global GC pauses.

I am not sure if the IBM JVM works the same as the Oracle one in this aspect, 
but for the Oracle one, it is important to set Xmx _below_ 32GB instead of at 
exactly 32GB:
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
You might want to try the program at that page to check where the IBM limit is. 
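
For a quick local check of the reference size, a rough heuristic sketch (it
leans on sun.misc.Unsafe, and J9's compressed-references behaviour may differ
from HotSpot's, so treat the result as indicative only):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class RefSize {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        // Size of one slot in an Object[]: 4 bytes with compressed refs, 8 without.
        System.out.println("Reference size: "
            + unsafe.arrayIndexScale(Object[].class) + " bytes");
    }
}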

- Toke Eskildsen


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-19 Thread Scott Stults
Edwin,

Try setting hl.bs.language and hl.bs.country in your request or
requestHandler:

https://cwiki.apache.org/confluence/display/solr/FastVector+Highlighter#FastVectorHighlighter-UsingBoundaryScannerswiththeFastVectorHighlighter
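
For example, added to the select request (values assumed for Chinese, adjust to
your locale; this requires the FastVectorHighlighter, i.e. termVectors with
positions and offsets on the field):

hl=true&hl.useFastVectorHighlighter=true&hl.boundaryScanner=breakIterator&hl.bs.type=WORD&hl.bs.language=zh&hl.bs.country=CN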


-Scott

On Tue, Oct 13, 2015 at 5:04 AM, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
> Solr. It works fine with the segmentation when I'm using
> the Analysis function on the Solr Admin UI.
>
> However, when I tried to do the highlighting in Solr, it is not
> highlighting in the correct place. For example, when I search for 自然环境与企业本身,
> it highlights 认为自然环境与企业本身的
>
> Even when I search for an English word like responsibility, it highlights
>   *responsibilit*y.
>
> Basically, the highlighting goes off by 1 character/space consistently.
>
> This problem only happens in content field, and not in any other fields.
> Does anyone knows what could be causing the issue?
>
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>
>
> Regards,
> Edwin
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Tokenize ShingleFilterFactory results and apply filters to tokens

2015-10-19 Thread Steve Rowe
Hi Vitaliy,

I don’t know of any combination of built-in Lucene/Solr analysis components
that would do what you want, but there used to be a filter called
ShingleMatrixFilter that (if I understand both that filter and what you want
correctly) would do what you want when followed by an EdgeNGramFilter:


It was deprecated in v3.1 and removed in v4.0 (see 
) because it wasn’t being 
maintained by the original creator and nobody else understood it :).  Uwe 
Schindler put up a patch that rewrote it and fixed some problems on 
, but that was never 
finished/committed.

What you want could create a huge number of terms, depending on the # of 
documents, terms in the field, and term length.  What do you want to use these 
terms for?

Steve

> On Oct 17, 2015, at 10:33 AM, vitaly bulgakov  wrote:
> 
> /why don't you put EdgeNGramFilter just after ShingleFilter?/
> 
> Because it will do Edge Ngrams over a shingle as a string:
> for "Home Improvement" shingle it will do:  Hom, Home, Home , Home I,
> Home Im, Home Imp .. 
> 
> But I need:
> ... Hom Imp, Hom Impr ..
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Tokenize-ShingleFilterFactory-results-and-apply-filters-to-tokens-tp4234574p4234872.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: PayloadTermQuery deprecated

2015-10-19 Thread William Bell
Alan,

Does this code look equivalent? And how do I get PayloadScoreQuery to use a
custom Similarity?

PayloadScoreQuery psq = new PayloadScoreQuery(sq, new AveragePayloadFunction());

@Override
public Query parse() throws SyntaxError {
    if (qstr == null || qstr.length() == 0) return null;

    // BooleanQuery is deprecated in 5.3.x; build with BooleanQuery.Builder instead.
    BooleanQuery.Builder q = new BooleanQuery.Builder();
    q.setDisableCoord(true);

    // Strip surrounding double quotes, if any.
    if (qstr.length() > 1 && qstr.startsWith("\"") && qstr.endsWith("\"")) {
        qstr = qstr.substring(1, qstr.length() - 1);
    }

    String[] nvps = StringUtils.split(qstr, " ");
    for (int i = 0; i < nvps.length; i++) {
        String[] nv = StringUtils.split(nvps[i], ":");
        if (nv.length > 1) {
            if (nv[0].startsWith("+")) {
                // Required clause; replaces:
                // new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]),
                //                      new AveragePayloadFunction(), false)
                SpanTermQuery sq = new SpanTermQuery(new Term(nv[0].substring(1), nv[1]));
                PayloadScoreQuery psq = new PayloadScoreQuery(sq, new AveragePayloadFunction());
                q.add(psq, Occur.MUST);
            } else {
                // Optional clause; replaces the equivalent PayloadTermQuery form.
                SpanTermQuery sq = new SpanTermQuery(new Term(nv[0], nv[1]));
                PayloadScoreQuery psq = new PayloadScoreQuery(sq, new AveragePayloadFunction());
                q.add(psq, Occur.SHOULD);
            }
        }
    }
    return q.build();
}
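
For the custom Similarity, a minimal sketch of what Alan suggests below might
look like this (the class name is made up, and the method signatures follow my
reading of the Lucene 5.x Similarity API, so check them against your 5.3 jars):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.util.BytesRef;

// Scores 1 everywhere, so a PayloadScoreQuery's result is the payload
// function's value alone (mimicking the old includeSpanScore=false).
public class ConstantOneSimilarity extends Similarity {

  @Override
  public long computeNorm(FieldInvertState state) {
    return 1L; // ignore field-length normalization
  }

  @Override
  public SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats,
                                 TermStatistics... termStats) {
    return new SimWeight() {
      @Override public float getValueForNormalization() { return 1f; }
      @Override public void normalize(float queryNorm, float topLevelBoost) { /* no-op */ }
    };
  }

  @Override
  public SimScorer simScorer(SimWeight weight, LeafReaderContext context) {
    return new SimScorer() {
      @Override public float score(int doc, float freq) { return 1f; }
      @Override public float computeSlopFactor(int distance) { return 1f; }
      @Override public float computePayloadFactor(int doc, int start, int end, BytesRef payload) { return 1f; }
    };
  }
}

It would then need to be registered wherever the query runs, e.g. via
IndexSearcher.setSimilarity() or a <similarity> element in schema.xml.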


On Mon, Oct 19, 2015 at 1:49 AM, Alan Woodward  wrote:

> Hi Bill,
>
> This looks like an oversight on my part when migrating the payload scoring
> queries - can you open a JIRA ticket to add 'includeSpanScore' as an option
> to PayloadScoreQuery?
>
> As a workaround, you should be able to use a custom similarity that
> returns 1 for all scores (see IndexSearcher.NON_SCORING_SIMILARITY for an
> implementation that returns 0, you could just clone that and change
> SimScorer.score())
>
> Alan Woodward
> www.flax.co.uk
>
>
> On 19 Oct 2015, at 00:39, William Bell wrote:
>
> > Here is my first stab at it. Thoughts?
> >
> > Question:
> >
> > new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]), new
> > AveragePayloadFunction(), false)
> >
> > How do I handle the "false"? It means boolean includeSpanScore
> >
> >
> > @Override
> > public Query parse() throws SyntaxError {
> >
> >if (qstr == null || qstr.length() == 0) return null;
> >//BooleanQuery q = new BooleanQuery();
> >BooleanQuery.Builder q = new BooleanQuery.Builder();
> >q.setDisableCoord(true);
> >if (qstr.length() > 1 && qstr.startsWith("\"") &&
> qstr.endsWith("\"")) {
> >qstr = qstr.substring(1,qstr.length()-1);
> >}
> >String[] nvps = StringUtils.split(qstr, " ");
> >for (int i = 0; i < nvps.length; i++) {
> >String[] nv = StringUtils.split(nvps[i], ":");
> >if (nv.length > 1) {
> >  if (nv[0].startsWith("+")) {
> >  SpanTermQuery sq = new SpanTermQuery(new
> > Term(nv[0].substring(1), nv[1]));
> >  PayloadScoreQuery psq = new PayloadScoreQuery(sq, new
> > AveragePayloadFunction());
> >  q.add(psq, Occur.MUST);
> >//q.add(new PayloadTermQuery(new Term(nv[0].substring(1),
> > nv[1]), new AveragePayloadFunction(), false), Occur.MUST);
> >  } else {
> >//q.add(new PayloadTermQuery(new Term(nv[0], nv[1]), new
> > AveragePayloadFunction(), false), Occur.SHOULD);
> >  SpanTermQuery sq = new SpanTermQuery(new Term(nv[0],
> nv[1]));
> >  PayloadScoreQuery psq = new PayloadScoreQuery(sq, new
> > AveragePayloadFunction());
> >  q.add(psq, Occur.SHOULD);
> >  }
> >}
> >}
> >// return q;
> >return q.build();
> > }
> >
> >
> > On Sun, Oct 18, 2015 at 4:46 PM, William Bell 
> wrote:
> >
> >> Wondering how to change my payload based on example:
> >>
> >>
> https://lucidworks.com/blog/2014/06/13/end-to-end-payload-example-in-solr/
> >>
> >> PayloadTermQuery and BooleanQuery are deprecated in 5.3.x
> >>
> >> @Override
> >> public Query parse() throws SyntaxError {
> >>
> >>if (qstr == null || qstr.length() == 0) return null;
> >>BooleanQuery q = new BooleanQuery();
> >>if (qstr.length() > 1 && qstr.startsWith("\"") &&
> qstr.endsWith("\"")) {
> >>qstr = qstr.substring(1,qstr.length()-1);
> >>}
> >>String[] nvps = StringUtils.split(qstr, " ");
> >>for (int i = 0; i < nvps.length; i++) {
> >>String[] nv = StringUtils.split(nvps[i], ":");
> >>if (nv.length > 1) {
> >>  if (nv[0].startsWith("+")) {
> >>q.add(new PayloadTermQuery(new Term(nv[0].substring(1),
> nv[1]),
> >>  new AveragePayloadFunction(), false), Occur.MUST);
> >>  } else {
> >>q.add(new PayloadTermQuery(new Term(nv[0], nv[1]),
> >>  new 

Re: Tokenize ShingleFilterFactory results and apply filters to tokens

2015-10-19 Thread Alexandre Rafalovitch
This sounds like an attempt to create an auto-complete using n-grams
in text, in which case Ted Sullivan's writing might be of relevance:
http://lucidworks.com/blog/author/tedsullivan/

Regards,
   Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 15 October 2015 at 09:47, vitaly bulgakov  wrote:
> I want to rephrase the question I asked in another post.
> As far as I understand, ShingleFilterFactory creates shingles as
> strings.
> But I want to apply more filters (like EdgeNgrams) to each token of a
> shingle.
>
> For example from "Home Improvement Service" I have two shingles:
> "Home Improvement" and "Improvement Service".
>
> I want to apply EdgeNgram to be able to do exact match to:
> "Hom Improvem" and "Improvemen Servi" as new phrases.
>
> Any help or ideas are welcome and appreciated.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Tokenize-ShingleFilterFactory-results-and-apply-filters-to-tokens-tp4234574.html
> Sent from the Solr - User mailing list archive at Nabble.com.


auto deployment/setup of Solr & ZooKeeper on medium-large clusters

2015-10-19 Thread Susheel Kumar
Hi,

I am trying to find the best practices for setting up Solr on 20+ new
machines & ZK (5+), and for repeating the same setup on other environments.
What's the best way to download, extract, and set up Solr & ZK in an
automated way, along with other dependencies like Java?  Among shell
scripts, Puppet, Docker, or imaged VMs, what is being used and suggested
from a DevOps perspective?

Thanks,
Susheel


Re: SolrCloud - Replica is shown as "Recovery-Failed"

2015-10-19 Thread Jae Joo
Found the root cause: I had disabled the transaction log.

Thanks,

On Mon, Oct 19, 2015 at 1:07 PM, Jae Joo  wrote:

> Solr version: 5.3
>
> I just built a SolrCloud with 5 shards and replicationFactor=3 across 15
> nodes, so each shard and replica runs on its own server.
>
> When I see the Cloud page, I see that the status of replica is
> "recovery-failed".
> For testing, I downed the leader, but a replica couldn't be a leader
> because it's status was not active.
>
> INFO  - 2015-10-19 16:46:16.297;
> org.apache.solr.cloud.ShardLeaderElectionContext; My last published State
> was recovery_failed, I won't be the leader.
>
> There are no documents indexed yet.
>
> Any help?
>
> Jae
>


Re: SolrCloud - Replica is shown as "Recovery-Failed"

2015-10-19 Thread Shawn Heisey
On 10/19/2015 11:56 AM, Jae Joo wrote:
> Found the root cause. I disabled the transaction log.

SolrCloud requires the transaction log for proper operation.  Disabling
it might cause all sorts of future problems, including problems with
data replication and recovery.  Because it's so critical, there has been
some discussion of detecting a disabled transaction log as a startup
error in cloud mode.

Instead of disabling it, you can take steps to ensure that the
transaction logs do not become enormous.  A hard commit will close the
current transaction log and start a new one.  You can have this happen
automatically by configuring autoCommit with something like the following:



  
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>


Setting openSearcher to false ensures that these automatic commits do
not change document visibility -- whatever you are doing for change
visibility, unless it is using autoCommit, will continue to work as it
did before.  A maxTime of one minute means that it will probably not
happen frequently enough to cause a major performance issue.  You might
want to raise that to as much as five minutes (300000).

A larger discussion of this topic:

http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

I personally think doing hard commits after fifteen seconds as discussed
in the article above is too frequent, but many people use that config
and don't have any problems.

Thanks,
Shawn



Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-19 Thread Zheng Lin Edwin Yeo
Hi Scott,

Here's my schema.xml for content and title, which uses text_chinese. The
problem only occurs in content, and not in title.

[The text_chinese field type definition was stripped by the mail archive;
only empty tags survive, so it cannot be reproduced here.]

Here's my solrconfig.xml for the highlighting portion. (The mail archive has
stripped the XML tags here too, so only the parameter values survive:
explicit, 10, json, true, text,
"id, title, content_type, last_modified, url, score",
on, "id, title, content, author, tag", true, true, html, 200,
true, signature, true, 100.)

The boundaryScanner values (WORD, en, SG) also survive; in the stock
solrconfig.xml that section reads:

  <boundaryScanner name="breakIterator"
                   class="solr.highlight.BreakIteratorBoundaryScanner">
    <lst name="defaults">
      <str name="hl.bs.type">WORD</str>
      <str name="hl.bs.language">en</str>
      <str name="hl.bs.country">SG</str>
    </lst>
  </boundaryScanner>

Meanwhile, I'll take a look at the articles too.

Thank you.

Regards,
Edwin


On 20 October 2015 at 11:32, Scott Chu  wrote:

> Hi Edwin,
>
> I haven't used Jieba for Chinese (I use only CJK, very fundamental, I
> know), so I haven't run into this problem.
>
> I'd suggest you post your schema.xml so we can see how you define your
> content field and the field type it uses.
>
> In the meantime, refer to these articles; maybe the answer or a workaround
> can be deduced from them.
>
> https://issues.apache.org/jira/browse/SOLR-3390
>
> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>
> http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>
> Good luck!
>
>
>
>
> Scott Chu,scott@udngroup.com
> 2015/10/20
>
> - Original Message -
> *From: *Zheng Lin Edwin Yeo 
> *To: *solr-user 
> *Date: *2015-10-13, 17:04:29
> *Subject: *Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi,
>
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
> Solr. It works fine with the segmentation when I'm using
> the Analysis function on the Solr Admin UI.
>
> However, when I tried to do the highlighting in Solr, it is not
> highlighting in the correct place. For example, when I search for 自然環境与企業本身,
> it highlights 認為自然環境与企業本身的
>
> Even when I search for an English word like responsibility, it highlights
>   *responsibilit*y.
>
> Basically, the highlighting goes off by 1 character/space consistently.
>
> This problem only happens in the content field, and not in any other fields.
> Does anyone know what could be causing the issue?
>
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>
>
> Regards,
> Edwin
>
>
>
>
>


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-19 Thread Zheng Lin Edwin Yeo
Hi Scott,

Thank you for your reply.

I've tried setting those, and I also tried switching to the FastVector
Highlighter, but neither worked; I got the same highlighting results as
before.

Regards,
Edwin


On 19 October 2015 at 23:56, Scott Stults  wrote:

> Edwin,
>
> Try setting hl.bs.language and hl.bs.country in your request or
> requestHandler:
>
>
> https://cwiki.apache.org/confluence/display/solr/FastVector+Highlighter#FastVectorHighlighter-UsingBoundaryScannerswiththeFastVectorHighlighter
>
>
> -Scott
>
> On Tue, Oct 13, 2015 at 5:04 AM, Zheng Lin Edwin Yeo  >
> wrote:
>
> > Hi,
> >
> > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> > in Solr. It works fine with the segmentation when I'm using
> > the Analysis function on the Solr Admin UI.
> >
> > However, when I tried to do the highlighting in Solr, it is not
> > highlighting in the correct place. For example, when I search for
> > 自然环境与企业本身, it highlights 认为自然环境与企业本身的
> >
> > Even when I search for an English word like responsibility, it
> > highlights   *responsibilit*y.
> >
> > Basically, the highlighting goes off by 1 character/space consistently.
> >
> > This problem only happens in the content field, and not in any other fields.
> > Does anyone know what could be causing the issue?
> >
> > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> >
> >
> > Regards,
> > Edwin
> >
>
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
>


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-19 Thread Scott Chu
Hi Edwin,

I haven't used Jieba for Chinese (I use only CJK, very fundamental, I know),
so I haven't run into this problem.

I'd suggest you post your schema.xml so we can see how you define your content
field and the field type it uses.

In the meantime, refer to these articles; maybe the answer or a workaround can
be deduced from them.

https://issues.apache.org/jira/browse/SOLR-3390

http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words

http://qnalist.com/questions/667066/highlighting-marks-wrong-words

Good luck!




Scott Chu,scott@udngroup.com
2015/10/20 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-13, 17:04:29
Subject: Highlighting content field problem when using JiebaTokenizerFactory


Hi,

I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
Solr. It works fine with the segmentation when I'm using
the Analysis function on the Solr Admin UI.

However, when I tried to do the highlighting in Solr, it is not
highlighting in the correct place. For example, when I search for 自然環境与企業本身,
it highlights 認為自然環境与企業本身的

Even when I search for an English word like responsibility, it highlights
  *responsibilit*y.

Basically, the highlighting goes off by 1 character/space consistently.

This problem only happens in the content field, and not in any other fields.
Does anyone know what could be causing the issue?

I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.


Regards,
Edwin





Re: RequestProcessor with IndexSearcher for Different Core

2015-10-19 Thread Kilian Woods
Hi Mikhail,

Thank you very much, that looks very helpful indeed.
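
For the archives, the pattern in that linked code boils down to roughly this
sketch (the class, method, and core name are placeholders, following my
reading of the Solr 5.x APIs; note the paired decref()/close() calls for the
resources Mikhail mentions):

import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

public class SiblingCoreLookup {
  // Borrow a sibling core's registered searcher from inside a processor.
  void withSiblingSearcher(SolrQueryRequest req, String otherCoreName) {
    CoreContainer container = req.getCore().getCoreDescriptor().getCoreContainer();
    SolrCore otherCore = container.getCore(otherCoreName); // bumps the core's refcount
    try {
      RefCounted<SolrIndexSearcher> holder = otherCore.getRegisteredSearcher();
      try {
        SolrIndexSearcher searcher = holder.get();
        // ... query the sibling core's index here ...
      } finally {
        holder.decref(); // release the searcher reference
      }
    } finally {
      otherCore.close(); // balance container.getCore()
    }
  }
}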

Kilian.

On 19 October 2015 at 15:48, Mikhail Khludnev 
wrote:

> Assuming you need to access a sibling core from an UpdateRequestHandler,
> you can see how it's done in the cross-core join:
>
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/join/ScoreJoinQParserPlugin.java#L256
> don't forget to close all resources.
>
>
>
> On Mon, Oct 19, 2015 at 7:07 AM, Kilian Woods 
> wrote:
>
> > Hi All,
> >
> > I am looking to solve a design problem of mine.
> >
> > I want to create a SolrIndexSearcher for a *different* core inside a
> > RequestProcessor.
> >
> > I know how to create a SolrIndexSearcher from the SolrQueryRequest but I
> > want to search a different core instead. I want to update the current
> > document with data from another core.
> >
> > At the moment I am creating a separate HttpSolrServer for the other
> > core and querying that way, but I would like to simplify, as this
> > approach seems inefficient.
> >
> > Any ideas?
> >
> > Kind Regards,
> > Kilian.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


SolrCloud - Replica is shown as "Recovery-Failed"

2015-10-19 Thread Jae Joo
Solr version: 5.3

I just built a SolrCloud with 5 shards and replicationFactor=3 across 15
nodes, so each shard and replica runs on its own server.

When I see the Cloud page, I see that the status of replica is
"recovery-failed".
For testing, I downed the leader, but a replica couldn't be a leader
because it's status was not active.

INFO  - 2015-10-19 16:46:16.297;
org.apache.solr.cloud.ShardLeaderElectionContext; My last published State
was recovery_failed, I won't be the leader.

There are no documents indexed yet.

Any help?

Jae


Re: Autostart Zookeeper and Solr using scripting

2015-10-19 Thread Scott Stults
Hi Adrian,

I'd probably start with the expect command and "echo ruok | nc <host> <port>"
for a simple script. You might also want to try the Netflix Exhibitor REST
interface:

https://github.com/Netflix/exhibitor/wiki/REST-Cluster
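
A bare-bones sketch of that polling approach (host list, ports, paths, and
the start command are placeholders for your environment):

  #!/bin/sh
  ZK_HOSTS="zk1:2181 zk2:2181 zk3:2181"
  for hp in $ZK_HOSTS; do
    host=${hp%:*}; port=${hp#*:}
    # Block until this ZooKeeper node answers the four-letter health check.
    until [ "$(echo ruok | nc $host $port)" = "imok" ]; do
      echo "waiting for $hp ..."; sleep 5
    done
  done
  # All ZK nodes are up; (re)start Solr in cloud mode against the ensemble.
  /opt/solr/bin/solr restart -c -z "zk1:2181,zk2:2181,zk3:2181"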


k/r,
Scott

On Thu, Oct 15, 2015 at 2:01 AM, Adrian Liew 
wrote:

> Hi,
>
> I am trying to implement some scripting to detect whether all ZooKeeper
> nodes in a cluster have started, and then restart the Solr servers. Has
> anyone achieved this yet through scripting?
>
> I also saw there is a ZookeeperClient available for .NET via a NuGet
> package. I'm not sure whether this could also be used to check whether a
> ZooKeeper node is running.
>
> Any thoughts on anyone using a script to perform this?
>
> Regards,
> Adrian
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com