Re: Solr DIH using JDBC with TIKA

2017-07-04 Thread ANNAMANENI RAVEENDRA
Yes, it can be a local directory. Use a file URI with the full path:
file:///full/path
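
A minimal sketch of a DIH config for that case. The entity names, field names,
and path below are placeholders, not from the original thread:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/full/path/to/dir" fileName=".*" recursive="true"
            rootEntity="false" dataSource="null">
      <entity name="extract" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="bin" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

FileListEntityProcessor walks baseDir and hands each file's absolute path to
the inner Tika entity via ${files.fileAbsolutePath}.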


On Tue, 4 Jul 2017 at 10:25 PM, d0ct0r4r6a  wrote:

> For the URL param in the "extract" entity, can it be a local directory? If
> yes, how do you specify the path?


Re: Solr DIH using JDBC with TIKA

2017-07-04 Thread d0ct0r4r6a
For the URL param in the "extract" entity, can it be a local directory? If
yes, how do you specify the path?





Re: cdcr replication only 1 node gets data

2017-07-04 Thread Webster Homer
We too often end up with a shard looking like this:
{
  "name": "shard2",
  "range": "0-7fff",
  "state": "active",
  "replicas": [
    {
      "name": "core_node1",
      "core": "sial-catalog-gene_shard2_replica2",
      "baseUrl": "http://uc1f-ecom-msc01:8983/solr",
      "nodeName": "uc1f-ecom-msc01:8983_solr",
      "state": "active",
      "leader": true,
      "index": {
        "numDocs": 57376,
        "maxDocs": 86447,
        "deletedDocs": 29071,
        "size": "265.24 MB",
        "lastModified": "2017-07-04T18:27:13.853Z",
        "current": true,
        "version": 1333,
        "segmentCount": 18
      }
    },
    {
      "name": "core_node4",
      "core": "sial-catalog-gene_shard2_replica1",
      "baseUrl": "http://uc1f-ecom-msc02:8983/solr",
      "nodeName": "uc1f-ecom-msc02:8983_solr",
      "state": "active",
      "leader": false,
      "index": {
        "numDocs": 0,
        "maxDocs": 0,
        "deletedDocs": 0,
        "size": "101 bytes",
        "lastModified": "2017-06-30T19:40:02.936Z",
        "current": true,
        "version": 1148,
        "segmentCount": 0
      }
    }
  ]
}

On Tue, Jul 4, 2017 at 4:51 PM, Webster Homer 
wrote:

> I've seen this a number of times. We do cdcr replication to a cloud, and
> only the shard leader gets data.
>
> CDCR source has 2 nodes and we replicate to 2 clouds, each of which has 4
> nodes.
> Both source and targets have 2 shards.
>
> We frequently end up with collections where the target shard leader has
> data but the replica doesn't.
>



cdcr replication only 1 node gets data

2017-07-04 Thread Webster Homer
I've seen this a number of times. We do cdcr replication to a cloud, and
only the shard leader gets data.

CDCR source has 2 nodes and we replicate to 2 clouds, each of which has 4
nodes.
Both source and targets have 2 shards.

We frequently end up with collections where the target shard leader has
data but the replica doesn't.



Re: Did /export use to emit tuples and now does not?

2017-07-04 Thread Joel Bernstein
In the very early releases (5x) the /export handler had a different format
than the /search handler. Later the /export handler was changed to have the
same basic response format as the /search handler. This was done in
anticipation of unifying /search and /export at a later date.

The /export handler still powers the parallel relational algebra
expressions. In Solr 7.0 there is a shuffle expression that always uses the
/export handler to sort and partition result sets. In 6x the search
expression can be used with the qt=/export param to use the /export handler.
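
For illustration, a search expression routed through /export might look like
this (collection and field names are placeholders; /export also requires
docValues on every field used in fl and sort):

search(mycollection,
       q="*:*",
       fl="id,price_f",
       sort="id asc",
       qt="/export")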

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Jul 4, 2017 at 11:38 AM, Ronald Wood  wrote:

> 9 months ago I did a proof of concept for solr streaming using the /export
> handler. At that time, I got tuples back.
>
> Now when I try 6.x, I get results in a format similar to /search
> (including a count), instead of tuples (with an EOF).
>
> Did something change between 5.x and 6.x in this regard?
>
> I am trying to stream results in a non-cloud scenario, and I was under the
> impression that /export was the primitive handler for the more advanced
> streaming operations only possible under Solr Cloud.
>
> I am using official docker images for testing. I tried to retest under
> 5.5.4 but I need to do some more work as docValues aren’t the default when
> using the gettingstarted index.
>
> -Ronald Wood
>
>


Re: Unique() metrics not supported in Solr Streaming facet stream source

2017-07-04 Thread Susheel Kumar
Hello Joel,

I tried to create a patch to add UniqueMetric and it works, but I soon
realized we have UniqueStream as well and can't load both of them (like
below) when required, since both use the "unique" keyword.

Any advice on how we can handle this? Come up with a different keyword for
UniqueMetric, or rename UniqueStream, etc.?

    StreamFactory factory = new StreamFactory()
        .withCollectionZkHost(...)
        .withFunctionName("facet", FacetStream.class)
        .withFunctionName("sum", SumMetric.class)
        .withFunctionName("unique", UniqueStream.class)
        .withFunctionName("unique", UniqueMetric.class)
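
One possible workaround, sketched with an arbitrary placeholder name: register
the metric under a distinct function name so the two classes no longer collide:

    .withFunctionName("unique", UniqueStream.class)
    .withFunctionName("uniq", UniqueMetric.class)   // "uniq" is a hypothetical name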

On Thu, Jun 29, 2017 at 9:32 AM, Joel Bernstein  wrote:

> This is mainly due to focus on other things. It would be great to support
> all the aggregate functions in facet, rollup and timeseries expressions.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> > On Thu, Jun 29, 2017 at 8:23 AM, Zheng Lin Edwin Yeo wrote:
>
> > Hi,
> >
> > We are working on the Solr Streaming expression, using the facet stream
> > source.
> >
> > As the underlying structure is using JSON Facet, we would like to find out
> > why the unique() metric is not supported. Currently, it only supports
> > sum(col), avg(col), min(col), max(col), count(*).
> >
> > I'm using Solr 6.5.1
> >
> > Regards,
> > Edwin
> >
>


Re: cdcr bootstrap errors

2017-07-04 Thread Webster Homer
Restarting the ZooKeeper on the source cloud seems to have helped.

On Tue, Jul 4, 2017 at 3:42 PM, Webster Homer 
wrote:

> Another strange error message I'm seeing:
> 2017-07-04 18:59:40.585 WARN (cdcr-replicator-110-thread-4-processing-n:dfw-pauth-msc02:8983_solr) [   ] o.a.s.h.CdcrReplicator Failed to forward update request to target: sial-catalog-product
> org.apache.solr.common.SolrException: Could not load collection from ZK: sial-catalog-product
> at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:1093)
> at org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(ZkStateReader.java:638)
> at org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(ClusterState.java:212)
> at org.apache.solr.common.cloud.ClusterState.hasCollection(ClusterState.java:114)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.getCollectionNames(CloudSolrClient.java:1302)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:1024)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:997)
> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:166)
> at org.apache.solr.handler.CdcrReplicator.sendRequest(CdcrReplicator.java:135)
> at org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:115)
> at org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(CdcrReplicatorScheduler.java:81)
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/sial-catalog-product/state.json
>
> So is Zookeeper hosed? How do I tell?
>
> On Tue, Jul 4, 2017 at 3:27 PM, Webster Homer wrote:
>
>> We've been using cdcr for a while now. It seems to be pretty fragile.
>>
>> Currently we're seeing tons of errors like this:
>> 2017-07-04 14:41:27.015 ERROR (cdcr-bootstrap-status-51-thread-1-processing-n:dfw-pauth-msc02:8983_solr) [ ] o.a.s.h.CdcrReplicatorManager Exception during bootstrap status request
>>
>> In this case we have one server throwing the above errors a lot!
>>
>> The error isn't very informative; what can cause this?
>>
>> I also see these messages:
>> 2017-07-04 18:59:39.730 WARN (cdcr-replicator-122-thread-3-processing-n:dfw-pauth-msc02:8983_solr x:sial-catalog-gene_shard1_replica1 s:shard1 c:sial-catalog-gene r:core_node1) [c:sial-catalog-gene s:shard1 r:core_node1 x:sial-catalog-gene_shard1_replica1] o.a.s.h.CdcrReplicator Log reader for target sial-catalog-gene is not initialised, it will be ignored.
>>
>> 2017-07-04 18:59:39.730 INFO (cdcr-replicator-122-thread-1-processing-n:dfw-pauth-msc02:8983_solr x:sial-catalog-gene_shard1_replica1 s:shard1 c:sial-catalog-gene r:core_node1) [c:sial-catalog-gene s:shard1 r:core_node1 x:sial-catalog-gene_shard1_replica1] o.a.s.h.CdcrReplicator Forwarded 0 updates to target sial-catalog-gene
>>
>> 2017-07-04 18:59:39.975 WARN (cdcr-replicator-100-thread-3-processing-n:dfw-pauth-msc02:8983_solr) [ ] o.a.s.h.CdcrReplicator Failed to forward update request to target: bb-catalog-material java.lang.RuntimeException: Unknown type 17
>>
>> We are using Solr 6.2.
>> We have a 2 node cloud with multiple collections, all with 2 shards, replicating to two Solr clouds running in Google Cloud.
>> We noticed that some of the prod collections only had data in one of the shards.
>>
>> So how do we diagnose this issue?
>>
>



Re: cdcr bootstrap errors

2017-07-04 Thread Webster Homer
Another strange error message I'm seeing:
2017-07-04 18:59:40.585 WARN (cdcr-replicator-110-thread-4-processing-n:dfw-pauth-msc02:8983_solr) [   ] o.a.s.h.CdcrReplicator Failed to forward update request to target: sial-catalog-product
org.apache.solr.common.SolrException: Could not load collection from ZK: sial-catalog-product
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:1093)
at org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(ZkStateReader.java:638)
at org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(ClusterState.java:212)
at org.apache.solr.common.cloud.ClusterState.hasCollection(ClusterState.java:114)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getCollectionNames(CloudSolrClient.java:1302)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:1024)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:997)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:166)
at org.apache.solr.handler.CdcrReplicator.sendRequest(CdcrReplicator.java:135)
at org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:115)
at org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(CdcrReplicatorScheduler.java:81)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/sial-catalog-product/state.json

So is Zookeeper hosed? How do I tell?
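
(One quick way to check, with a placeholder host: ZooKeeper answers its
four-letter-word commands on the client port.)

echo ruok | nc zk-host 2181   # a healthy server replies "imok"
echo stat | nc zk-host 2181   # shows mode (leader/follower) and client connections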

On Tue, Jul 4, 2017 at 3:27 PM, Webster Homer 
wrote:

> We've been using cdcr for a while now. It seems to be pretty fragile.
>
> Currently we're seeing tons of errors like this:
> 2017-07-04 14:41:27.015 ERROR (cdcr-bootstrap-status-51-thread-1-processing-n:dfw-pauth-msc02:8983_solr) [ ] o.a.s.h.CdcrReplicatorManager Exception during bootstrap status request
>
> In this case we have one server throwing the above errors a lot!
>
> The error isn't very informative; what can cause this?
>
> I also see these messages:
> 2017-07-04 18:59:39.730 WARN (cdcr-replicator-122-thread-3-processing-n:dfw-pauth-msc02:8983_solr x:sial-catalog-gene_shard1_replica1 s:shard1 c:sial-catalog-gene r:core_node1) [c:sial-catalog-gene s:shard1 r:core_node1 x:sial-catalog-gene_shard1_replica1] o.a.s.h.CdcrReplicator Log reader for target sial-catalog-gene is not initialised, it will be ignored.
>
> 2017-07-04 18:59:39.730 INFO (cdcr-replicator-122-thread-1-processing-n:dfw-pauth-msc02:8983_solr x:sial-catalog-gene_shard1_replica1 s:shard1 c:sial-catalog-gene r:core_node1) [c:sial-catalog-gene s:shard1 r:core_node1 x:sial-catalog-gene_shard1_replica1] o.a.s.h.CdcrReplicator Forwarded 0 updates to target sial-catalog-gene
>
> 2017-07-04 18:59:39.975 WARN (cdcr-replicator-100-thread-3-processing-n:dfw-pauth-msc02:8983_solr) [ ] o.a.s.h.CdcrReplicator Failed to forward update request to target: bb-catalog-material java.lang.RuntimeException: Unknown type 17
>
> We are using Solr 6.2.
> We have a 2 node cloud with multiple collections, all with 2 shards, replicating to two Solr clouds running in Google Cloud.
> We noticed that some of the prod collections only had data in one of the shards.
>
> So how do we diagnose this issue?
>
>



cdcr bootstrap errors

2017-07-04 Thread Webster Homer
We've been using cdcr for a while now. It seems to be pretty fragile.

Currently we're seeing tons of errors like this:
2017-07-04 14:41:27.015 ERROR (cdcr-bootstrap-status-51-thread-1-processing-n:dfw-pauth-msc02:8983_solr) [ ] o.a.s.h.CdcrReplicatorManager Exception during bootstrap status request

In this case we have one server throwing the above errors a lot!

The error isn't very informative; what can cause this?

I also see these messages:
2017-07-04 18:59:39.730 WARN (cdcr-replicator-122-thread-3-processing-n:dfw-pauth-msc02:8983_solr x:sial-catalog-gene_shard1_replica1 s:shard1 c:sial-catalog-gene r:core_node1) [c:sial-catalog-gene s:shard1 r:core_node1 x:sial-catalog-gene_shard1_replica1] o.a.s.h.CdcrReplicator Log reader for target sial-catalog-gene is not initialised, it will be ignored.

2017-07-04 18:59:39.730 INFO (cdcr-replicator-122-thread-1-processing-n:dfw-pauth-msc02:8983_solr x:sial-catalog-gene_shard1_replica1 s:shard1 c:sial-catalog-gene r:core_node1) [c:sial-catalog-gene s:shard1 r:core_node1 x:sial-catalog-gene_shard1_replica1] o.a.s.h.CdcrReplicator Forwarded 0 updates to target sial-catalog-gene

2017-07-04 18:59:39.975 WARN (cdcr-replicator-100-thread-3-processing-n:dfw-pauth-msc02:8983_solr) [ ] o.a.s.h.CdcrReplicator Failed to forward update request to target: bb-catalog-material java.lang.RuntimeException: Unknown type 17

We are using Solr 6.2.
We have a 2 node cloud with multiple collections, all with 2 shards,
replicating to two Solr clouds running in Google Cloud.
We noticed that some of the prod collections only had data in one of the
shards.

So how do we diagnose this issue?
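
(For what it's worth, the CDCR API exposes monitoring actions that can help
here; a sketch, with the host and collection names as placeholders:)

curl "http://source-host:8983/solr/mycollection/cdcr?action=QUEUES"  # queue sizes per target
curl "http://source-host:8983/solr/mycollection/cdcr?action=ERRORS"  # error counts per target
curl "http://source-host:8983/solr/mycollection/cdcr?action=OPS"     # replication ops/sec per target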



Re: Admin-UI: multiple facet

2017-07-04 Thread d0ct0r4r6a
The only way I know to directly add multiple facet fields in the Admin UI is
to add them in the "Raw Query Parameters" box, like this:

facet.field=subject&facet.field=type&facet.field=organization&facet.field=region
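
Outside the UI, that's equivalent to a full request along these lines (the
collection name is a placeholder; facet=true is what the UI's facet checkbox adds):

http://localhost:8983/solr/mycollection/select?q=*:*&facet=true&facet.field=subject&facet.field=type&facet.field=organization&facet.field=region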





Re: Strange boolean query behaviour on 5.5.4

2017-07-04 Thread Erick Erickson
First of all, Solr doesn't implement strict boolean logic, see Hoss'
explanation here:
https://lucidworks.com/2011/12/28/why-not-and-or-and-not/

And bare "not" operators are a pain. There's actually _some_ trickery in
_some_ places to help you out (fq clauses come to mind) that prepends *:*
for you.

I think you'll get what you expect with something like:
(*:* -someField:Foo) AND (otherField: (Bar OR Baz))

Best,
Erick

On Tue, Jul 4, 2017 at 6:04 AM, Bram Van Dam  wrote:
> Hey folks,
>
> I'm experiencing some strange query behaviour, and it isn't immediately
> clear to me why this would happen. The definition of the query syntax
> on the wiki is a bit fuzzy, so my interpretation of the syntax might be off.
>
> This query does NOT work (no results, when results are expected):
>
> (-someField:Foo) AND (otherField: (Bar OR Baz))
>
> With debug enabled, Solr interprets the query as
>
> +(-someField:Foo) +(otherField:Bar otherField:Baz)
>
> This query DOES work, results are returned.
>
> -someField:Foo +(otherField:Bar otherField:Baz)
>
> With debug enabled:
>
> -someField:Foo +(otherField:Bar otherField:Baz)
>
>
> The only difference between these queries is the presence of parentheses
> around the field with a single NOT condition. From a boolean point of
> view, they are equivalent.
>
> To make matters stranger, if I add a *:* clause to the NOT field,
> everything works again.
>
> (-someField:Foo AND *:*) AND (otherField: (Bar OR Baz))
> and
> -someField:Foo AND *:* AND (otherField: (Bar OR Baz))
> both work.
>
> Is this a query parser bug? Or are parenthesized groups with a single
> negated expression not supported? :-/
>
> I've only tested this on 5.5.4 using the default query parser, I don't
> have access to any other versions at the moment.
>
> Thanks for any insights,
>
>  - Bram


Did /export use to emit tuples and now does not?

2017-07-04 Thread Ronald Wood
9 months ago I did a proof of concept for solr streaming using the /export 
handler. At that time, I got tuples back.

Now when I try 6.x, I get results in a format similar to /search (including a 
count), instead of tuples (with an EOF).

Did something change between 5.x and 6.x in this regard?

I am trying to stream results in a non-cloud scenario, and I was under the 
impression that /export was the primitive handler for the more advanced 
streaming operations only possible under Solr Cloud.

I am using official docker images for testing. I tried to retest under 5.5.4
but I need to do some more work as docValues aren’t the default when using the
gettingstarted index.
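
(Context: the /export handler requires docValues on every field used in fl and
sort, so in a 5.5.4 schema the fields would need to be declared like this, with
hypothetical names:)

<field name="price_f" type="float" indexed="true" stored="true" docValues="true"/>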

-Ronald Wood



Re: xml indexing

2017-07-04 Thread Alexandre Rafalovitch
You can set default values in the UpdateRequestProcessor chain:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/DefaultValueUpdateProcessorFactory.html

You can combine URPs with DIH. There is an example for that in the latest Solr:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.6.0/solr/example/example-DIH/solr/atom/conf/solrconfig.xml
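
A minimal sketch of such a chain in solrconfig.xml (the chain name and the
substitute value are placeholders; note that DefaultValueUpdateProcessorFactory
only fires when the field is missing from the document entirely):

<updateRequestProcessorChain name="add-defaults">
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">detailComment</str>
    <str name="value">NULL</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

To use it with DIH, point the /dataimport handler at the chain via
<str name="update.chain">add-defaults</str> in its defaults section.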

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 4 July 2017 at 11:15, txlap786 wrote:
> Hello everyone o/. I'm trying to index an XML file using DIH.
>
> It's mostly like this.
>
> EXAMPLE DIH CONFIG STRUCTURE
> [the config markup was stripped by the list archive; it defined an
> XPathEntityProcessor entity with forEach="/entryHeader" and xpath field
> mappings, including one for detailComment]
>
> EXAMPLE XML STRUCTURE
> [the sample XML was stripped by the list archive; each entryHeader
> contains several entryDetail blocks, and detailComment is missing from
> some of them]
>
> (in some entryDetail elements, detailComment doesn't exist!)
>
>  JSON return 
>
> "detailComment",
> [
> "100.01",
> "102.01",
> "102.02",
> "120.01",
> "120.02",
> "153.01",
> "320.01",
> null,
> null
> ]
>
>  INDEXED 
>
> "detailComment" : [
> "100.01",
> "102.01",
> "102.02",
> "120.01",
> "120.02",
> "153.01",
> "320.01"
> ]
>
>
> so, a field default doesn't work due to multiValued.
>
> How can I index those nulls as something visible, like "0", "null",
> "NULL" or "empty"?
>
> I want the indexed ones to be the same as the JSON return.
>
> Can I use an XPathEntityProcessor inside another XPathEntityProcessor to
> get those "entryDetail" elements? Then I wouldn't have to use multiValued
> fields anymore; I'd just set default values for each.
>


xml indexing

2017-07-04 Thread txlap786
Hello everyone o/. I'm trying to index an XML file using DIH.

It's mostly like this.

EXAMPLE DIH CONFIG STRUCTURE
[the config markup was stripped by the list archive; it defined an
XPathEntityProcessor entity with forEach="/entryHeader" and xpath field
mappings, including one for detailComment]

EXAMPLE XML STRUCTURE
[the sample XML was stripped by the list archive; each entryHeader contains
several entryDetail blocks, and detailComment is missing from some of them]

(in some entryDetail elements, detailComment doesn't exist!)

 JSON return 

"detailComment",
[
"100.01",
"102.01",
"102.02",
"120.01",
"120.02",
"153.01",
"320.01",
null,
null
]

 INDEXED 

"detailComment" : [
"100.01",
"102.01",
"102.02",
"120.01",
"120.02",
"153.01",
"320.01"  
]


so, a field default doesn't work due to multiValued.

How can I index those nulls as something visible, like "0", "null", "NULL"
or "empty"?

I want the indexed ones to be the same as the JSON return.

Can I use an XPathEntityProcessor inside another XPathEntityProcessor to
get those "entryDetail" elements? Then I wouldn't have to use multiValued
fields anymore; I'd just set default values for each.





Re: Solr 6.4. Can't index MS Visio vsdx files

2017-07-04 Thread Charlie Hull

On 11/04/2017 20:48, Allison, Timothy B. wrote:

It depends.  We've been trying to make parsers more, erm, flexible, but there 
are some problems from which we cannot recover.

Tl;dr there isn't a short answer.  :(

My sense is that DIH/ExtractingDocumentHandler is intended to get people up and 
running with Solr easily but it is not really a great idea for production.  See 
Erick's gem: https://lucidworks.com/2012/02/14/indexing-with-solrj/


+1. Tika extraction should happen *outside* Solr in production. A 
colleague even wrote a simple wrapper for Tika to help build this sort 
of thing: https://github.com/mattflax/dropwizard-tika-server
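
A bare-bones sketch of that pattern: run Tika in your own process and send
plain fields to Solr with SolrJ. The core URL, field names, and class name
below are placeholders, not from this thread:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExternalTikaIndexer {
    public static void main(String[] args) throws Exception {
        String path = args[0];
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(path));
             HttpSolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/mycore").build()) {
            // If Tika hangs or throws, it happens in this process, not inside Solr.
            parser.parse(in, handler, metadata);
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", path);
            doc.addField("content", handler.toString());
            solr.add(doc);
            solr.commit();
        }
    }
}

Running the parse client-side also makes it easy to wrap it in a timeout or a
separate worker process, which is the usual defence against the hangs and OOMs
Tim describes below.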


Charlie




As for the Tika portion... at the very least, Tika _shouldn't_ cause the 
ingesting process to crash.  At most, it should fail at the file level and not 
cause greater havoc.  In practice, if you're processing millions of files from 
the wild, you'll run into bad behavior and need to defend against permanent 
hangs, oom, memory leaks.

Also, at the least, if there's an exception with an embedded file, Tika should 
catch it and keep going with the rest of the file.  If this doesn't happen let 
us know!  We are aware that some types of embedded file stream problems were 
causing parse failures on the entire file, and we now catch those in Tika 
1.15-SNAPSHOT and don't let them percolate up through the parent file (they're 
reported in the metadata though).

Specifically for your stack traces:

For your initial problem with the missing class exceptions -- I thought we used 
to catch those in docx and log them.  I haven't been able to track this down, 
though.  I can look more if you have a need.

For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name 'PolylineTo' 
", this problem might go away if we implemented a pure SAX parser for vsdx.  We just 
did this for docx and pptx (coming in 1.15) and these are more robust to variation 
because they aren't requiring a match with the ooxml schema.  I haven't looked much at 
vsdx, but that _might_ help.

For "TODO Support v5 Pointers", this isn't supported and would require 
contributions.  However, I agree that POI shouldn't throw a Runtime exception.  Perhaps 
open an issue in POI, or maybe we should catch this special example at the Tika level?

For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team 
_might_ be able to modify the parser to ignore a stream if there's an exception, but 
that's often a sign that something needs to be fixed with the parser.  In short, the 
solution will come from POI.

Best,

 Tim

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
Sent: Tuesday, April 11, 2017 1:56 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Thanks for your responses.
Are there any possibilities to ignore parsing errors and continue indexing?
Because right now solr/tika stops parsing the whole document if it finds any exception.

On Apr 11, 2017 19:51, "Allison, Timothy B."  wrote:


You might want to drop a note to the dev or user's list on Apache POI.

I'm not extremely familiar with the vsd(x) portion of our code base.

The first item ("PolylineTo") may be caused by a mismatch btwn your
doc and the ooxml spec.

The second item appears to be an unsupported feature.

The third item may be an area for improvement within our codebase...I
can't tell just from the stacktrace.

You'll probably get more helpful answers over on POI.  Sorry, I can't
help with this...

Best,

   Tim

P.S.

 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar


You shouldn't need both. ooxml-schemas-1.3.jar should be a superset
of poi-ooxml-schemas-3.15.jar.










--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Strange boolean query behaviour on 5.5.4

2017-07-04 Thread Bram Van Dam
Hey folks,

I'm experiencing some strange query behaviour, and it isn't immediately
clear to me why this would happen. The definition of the query syntax
on the wiki is a bit fuzzy, so my interpretation of the syntax might be off.

This query does NOT work (no results, when results are expected):

(-someField:Foo) AND (otherField: (Bar OR Baz))

With debug enabled, Solr interprets the query as

+(-someField:Foo) +(otherField:Bar otherField:Baz)

This query DOES work, results are returned.

-someField:Foo +(otherField:Bar otherField:Baz)

With debug enabled:

-someField:Foo +(otherField:Bar otherField:Baz)


The only difference between these queries is the presence of parentheses
around the field with a single NOT condition. From a boolean point of
view, they are equivalent.

To make matters stranger, if I add a *:* clause to the NOT field,
everything works again.

(-someField:Foo AND *:*) AND (otherField: (Bar OR Baz))
and
-someField:Foo AND *:* AND (otherField: (Bar OR Baz))
both work.

Is this a query parser bug? Or are parenthesized groups with a single
negated expression not supported? :-/

I've only tested this on 5.5.4 using the default query parser, I don't
have access to any other versions at the moment.

Thanks for any insights,

 - Bram


Re: Is there any particular reason why ExternalFileField is read from data directory

2017-07-04 Thread apoorvqwerty
Thanks a lot.
For now I've written a listener to read from Redis instead,
but it might not scale well since the map is kept in memory.





Re: behind nginx

2017-07-04 Thread Rick Leir
Hi Walid
Is there any error occurring? If not, then do not change anything. Yes, that
version of Solr is old; if you get the chance you would want to upgrade.
Cheers -- Rick

On July 4, 2017 6:11:57 AM EDT, walid  wrote:
>Hi,
>I have multiple solr slaves reverse-proxied by nginx. Clients' browsers send
>requests to nginx, which round-robins them to the solr slaves. Clients use
>POST and deal directly with the solrs; I use "ajax-solr". With this
>configuration, if I have a lot of ecommerce users, is there a risk of having
>many tomcat sessions on the solr servers? If I find a solution with nginx to
>use the same session in the solr requests, will this cause solr response
>errors?
>My solr version is 1.4.1, I know this is too old.

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Work-around for "indexed without position data"

2017-07-04 Thread Susheel Kumar
Did you try to reproduce this on the latest Solr (6.6), just to rule out any
bug with that version (though less likely)? Please download and do a quick test.

On Mon, Jul 3, 2017 at 5:01 PM, Solr User  wrote:

> Not sure if it helps beyond the steps to reproduce that I supplied above,
> but I also see that "Omit Term Frequencies & Positions" is still set on the
> field according to the LukeRequestHandler:
>
> ITS--OF--
>
>
>
> On Mon, Jun 5, 2017 at 1:18 PM, Solr User  wrote:
>
> > Sorry for the delay.  I was able to reproduce this easily with my setup,
> > but reproducing this on a Solr example proved challenging.  Hopefully the
> > work that I did to find the situation in which this is produced will help
> > in resolving the problem.  The driving factor for this appears to be how
> > updates are sent to Solr.  When sending batches of updates with commits,
> > the problem is reproduced.  If the commit is held until after all updates
> > are sent, then no problem is produced.  This leads me to believe that this
> > issue has something to do with overlapping commits or index merges.  This
> > was reproducible regardless of running classic or managed schema and
> > regardless of running Solr core or SolrCloud.
> >
> > There are not many steps to reproduce this, but you will need a way to
> > send these updates.  I have included inline create.sh and create.pl
> > scripts to generate the data and send the updates.  You can index a
> > lastModified field or something to convince yourself that everything has
> > been re-indexed.  I left that out to keep the steps lean.  Also, this test
> > is using commit statements from the client sending the updates for
> > simplicity even though it is not a good practice.  My normal setup is
> > using Solrj with commitWithin to allow Solr to manage when the commits
> > take place, but the same error is produced either way.
> >
> >
> > *STEPS TO REPRODUCE*
> >    1. Install Solr 5.5.3 and change to that working directory
> >    2. bin/solr -e techproducts
> >    3. bin/solr stop [Why these next 3 steps? These are to start the
> >    index completely new without the 32 example documents, as opposed to a
> >    delete query. The documents are not posted after the core is detected
> >    the second time.]
> >    4. rm -rf ./example/techproducts/solr/techproducts/data/
> >    5. bin/solr -e techproducts
> >    6. ./create.sh
> >    7. curl -X POST -H 'Content-type:application/json' --data-binary '{
> >    "replace-field":{ "name":"cat", "type":"text_en_splitting",
> >    "indexed":true, "multiValued":true, "stored":true } }'
> >    http://localhost:8983/solr/techproducts/schema
> >    8. http://localhost:8983/solr/techproducts/select?q=cat:%22hard%20drive%22
> >    [error]
> >    9. ./create.sh
> >    10. http://localhost:8983/solr/techproducts/select?q=cat:%22hard%20drive%22
> >    [error even though all documents have been re-indexed]
> >
> > *create.sh*
> > #!/bin/bash
> > for i in {1..100}; do
> > echo "$i"
> > ./create.pl $i > ./create.xml$i
> > curl http://localhost:8983/solr/techproducts/update?commit=true -H
> > "Content-Type: text/xml" --data-binary @./create.xml$i
> > done
> >
> > *create.pl* [the XML tags inside the print statements were stripped by the
> > archive; the add/doc/field wrappers below are reconstructed from context]
> > #!/usr/bin/perl
> > my $S = $ARGV[0];
> > my $I = 100;
> > my $N = $S*$I + $I;
> > my $i;
> > print "<add>\n";
> > for($i=$S*$I; $i<$N; $i++) {
> >    print "<doc><field name=\"id\">SP${i}</field><field name=\"cat\">hard drive ${i}</field></doc>\n";
> > }
> > print "</add>\n";
> >
> >
> > On Fri, May 26, 2017 at 2:14 AM, Rick Leir  wrote:
> >
> >> Can you reproduce this error? What are the steps you take to reproduce
> >> it? (Simple is better.)
> >>
> >> cheers -- Rick
> >>
> >>
> >>
> >> On 2017-05-25 05:46 PM, Solr User wrote:
> >>
> >>> This is in regards to changing a field type from string to
> >>> text_en_splitting, re-indexing all documents, even optimizing to give the
> >>> index a chance to merge segments and rewrite itself entirely, and then
> >>> getting this error when running a phrase query:
> >>> java.lang.IllegalStateException: field "blah" was indexed without
> >>> position data; cannot run PhraseQuery
> >>>
> >>> I have encountered this issue before and have always done one of the
> >>> following as a work-around:
> >>> 1.  Instead of changing the field type on an existing field, just create
> >>> a new field and retire the old one.
> >>> 2.  Delete the index directory and start from scratch.
> >>>
> >>> These work-arounds are not always ideal.  Does anyone know what is
> >>> holding onto that old field type definition?  What thinks it is still a
> >>> string? Every document has been re-indexed and I am sure of this because
> >>> I have a time stamp indexed.  Is there any other way to get this to work?
> >>>
> >>> For what it is worth, I am running this in SolrCloud mode but I remember
> >>> seeing this issue before SolrCloud was released as well.
> >>>
> >>>
> >>
> >
>


behind nginx

2017-07-04 Thread walid
Hi,
I have multiple solr slaves reverse-proxied by nginx. Clients' browsers send
requests to nginx, which round-robins them to the solr slaves. Clients use
POST and deal directly with the solrs; I use "ajax-solr". With this
configuration, if I have a lot of ecommerce users, is there a risk of having
many tomcat sessions on the solr servers? If I find a solution with nginx to
use the same session in the solr requests, will this cause solr response
errors?
My solr version is 1.4.1, I know this is too old.


