How to update SOLR schema from continuous integration environment

2014-10-31 Thread Faisal Mansoor
Hi,

How do people usually update Solr configuration files from a continuous
integration environment like TeamCity or Jenkins?

We have multiple development and testing environments and use WebDeploy- and
AwsDeploy-type tools to remotely deploy code multiple times a day. To update
Solr I wrote a simple Node server which accepts a conf folder over HTTP,
updates the specified core's conf folder, and restarts the Solr service.

Does a standard tool exist for this use case? I know about the Schema REST
API, but I want to update all the files in the conf folder rather than just
updating a single file or adding or removing synonyms piecemeal.
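
(For concreteness, the update step I'm talking about boils down to something
like this rough SolrJ sketch: copy the new conf into the core's conf directory
and reload the core rather than restarting the service. The paths, core name,
and URL are made up; it assumes a non-cloud core whose conf lives on the Solr
node's filesystem, and commons-io for the directory copy.)

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class PushConfig {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths/names: adjust for your environment.
    File newConf  = new File("build/solr-conf");        // conf produced by the CI build
    File coreConf = new File("/var/solr/mycore/conf");  // the core's conf directory
    FileUtils.copyDirectory(newConf, coreConf);         // overwrite schema.xml, solrconfig.xml, synonyms, ...

    // Reload the core so Solr picks up the new config without a full restart.
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
    CoreAdminRequest.reloadCore("mycore", solr);
    solr.shutdown();
  }
}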

Here is the link to the Node server I mentioned, if anyone is interested.
https://github.com/faisalmansoor/UpdateSolrConfig


Thanks,
Faisal


Re: Consul instead of ZooKeeper anyone?

2014-10-31 Thread Walter Underwood
It looks like Consul solves a different problem than Zookeeper. Consul manages 
what servers are up and starts new ones as needed. Zookeeper doesn’t start 
servers, but does leader election when they fail.

I don’t see any way that Consul could replace Zookeeper, but it could solve 
another part of the problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Oct 31, 2014, at 5:15 PM, Erick Erickson  wrote:

> Not that I know of, but look before you leap. I took a quick look at
> Consul and it really doesn't look like any kind of drop-in replacement.
> Also, the Zookeeper usage in SolrCloud isn't really pluggable
> AFAIK, so there'll be lots of places in the Solr code that need to be
> reworked etc., especially in the realm of collections and sharding.
> 
> The Collections API will be challenging to port over I think.
> 
> Not to mention SolrJ and CloudSolrServer for clients who want to interact
> with SolrCloud through Java.
> 
> Not saying it won't work, I just suspect that getting it done would be
> a big job, and thereafter keeping those changes in sync with the
> changing SolrCloud code base would chew up a lot of time. So if
> I were putting my Product Manager hat on I'd ask "is the benefit
> worth the effort?".
> 
> All that said, go for it if you've a mind to!
> 
> Best,
> Erick
> 
> On Fri, Oct 31, 2014 at 4:08 PM, Greg Solovyev  wrote:
>> I am investigating a project to make SolrCloud run on Consul instead of 
>> ZooKeeper. So far, my research revealed no such efforts, but I wanted to 
>> check with this list to make sure I am not going to be reinventing the 
>> wheel. Has anyone attempted using Consul instead of ZK to coordinate 
>> SolrCloud nodes?
>> 
>> Thanks,
>> Greg



Re: Consul instead of ZooKeeper anyone?

2014-10-31 Thread Erick Erickson
Not that I know of, but look before you leap. I took a quick look at
Consul and it really doesn't look like any kind of drop-in replacement.
Also, the Zookeeper usage in SolrCloud isn't really pluggable
AFAIK, so there'll be lots of places in the Solr code that need to be
reworked etc., especially in the realm of collections and sharding.

The Collections API will be challenging to port over I think.

Not to mention SolrJ and CloudSolrServer for clients who want to interact
with SolrCloud through Java.

Not saying it won't work, I just suspect that getting it done would be
a big job, and thereafter keeping those changes in sync with the
changing SolrCloud code base would chew up a lot of time. So if
I were putting my Product Manager hat on I'd ask "is the benefit
worth the effort?".

All that said, go for it if you've a mind to!

Best,
Erick

On Fri, Oct 31, 2014 at 4:08 PM, Greg Solovyev  wrote:
> I am investigating a project to make SolrCloud run on Consul instead of 
> ZooKeeper. So far, my research revealed no such efforts, but I wanted to 
> check with this list to make sure I am not going to be reinventing the wheel. 
> Has anyone attempted using Consul instead of ZK to coordinate SolrCloud 
> nodes?
>
> Thanks,
> Greg


Consul instead of ZooKeeper anyone?

2014-10-31 Thread Greg Solovyev
I am investigating a project to make SolrCloud run on Consul instead of 
ZooKeeper. So far, my research revealed no such efforts, but I wanted to check 
with this list to make sure I am not going to be reinventing the wheel. Has 
anyone attempted using Consul instead of ZK to coordinate SolrCloud nodes? 

Thanks, 
Greg 


Re: exporting to CSV with solrj

2014-10-31 Thread Erick Erickson
@Will:

I can't tell you how many times questions like
"Why do you want to use CSV in SolrJ?" have
led to solutions different from what the original
question might imply. It's a question I frequently
ask in almost the exact same way; it's a
perfectly legitimate question IMO.

Best,
Erick



On Fri, Oct 31, 2014 at 1:25 PM, Chris Hostetter
 wrote:
>
> : "Why do you want to use CSV in SolrJ?"  Alexandre are you looking for a
>
> It's a legitimate question - part of providing good community support is
> making sure we understand *why* users are asking how to do something, so
> we can give good advice on other solutions people might not even have
> thought of -- teach a man to fish, vs give a man a fish, etc...
>
> https://people.apache.org/~hossman/#xyproblem
>
> ...if we understand *why* people ask questions, or approach problems in
> certain ways, we can not only offer the best possible suggestions, but
> also consider how the underlying use case (and other similar use cases like
> it) might be better served in the future.
>
> -Hoss
> http://www.lucidworks.com/


Re: prefix length in fuzzy search solr 4.10.1

2014-10-31 Thread Jack Krupansky
No, but it is a reasonable request, as a global default, a 
collection-specific default, a request-specific default, and on an 
individual fuzzy term.
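
(If you need it today and are building queries in custom code, the underlying
Lucene FuzzyQuery does take a prefix length programmatically. A rough, untested
sketch; the field and term are made-up examples:)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyPrefixSketch {
  public static void main(String[] args) {
    // maxEdits = 2 is the usual fuzzy default; prefixLength = 1 requires the
    // first character to match exactly, which cuts down the term expansion.
    FuzzyQuery q = new FuzzyQuery(new Term("name", "paris"), 2, 1);
    System.out.println(q);
  }
}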


-- Jack Krupansky

-Original Message- 
From: elisabeth benoit

Sent: Thursday, October 30, 2014 6:07 AM
To: solr-user@lucene.apache.org
Subject: prefix length in fuzzy search solr 4.10.1

Hello all,

Is there a parameter in the Solr 4.10.1 API allowing the user to set the
prefix length in fuzzy search?

Best regards,
Elisabeth 



[ANNOUNCE] Apache Solr 4.10.2 released

2014-10-31 Thread Michael McCandless
October 2014, Apache Solr™ 4.10.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.2 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.10.2 includes 10 bug fixes, as well as Lucene 4.10.2 and its 2 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Happy Halloween,

Mike McCandless

http://blog.mikemccandless.com


Re: exporting to CSV with solrj

2014-10-31 Thread Chris Hostetter

: "Why do you want to use CSV in SolrJ?"  Alexandre are you looking for a

It's a legitimate question - part of providing good community support is 
making sure we understand *why* users are asking how to do something, so 
we can give good advice on other solutions people might not even have 
thought of -- teach a man to fish, vs give a man a fish, etc...

https://people.apache.org/~hossman/#xyproblem

...if we understand *why* people ask questions, or approach problems in 
certain ways, we can not only offer the best possible suggestions, but 
also consider how the underlying use case (and other similar use cases like 
it) might be better served in the future.

-Hoss
http://www.lucidworks.com/


Re: exporting to CSV with solrj

2014-10-31 Thread Alexandre Rafalovitch
On 31 October 2014 14:58, will martin  wrote:
> "Why do you want to use CSV in SolrJ?"  Alexandre are you looking for a
> design gig. This kind of question really begs nothing but disdain.

Nope. Not looking for a design gig. I give that advice away for free:
http://www.airpair.com/solr/workshops/discovering-your-inner-search-engine,
http://www.bigdatamontreal.org/?p=310 , http://www.solr-start.com/,
etc. Though, in all fairness, I do charge for my Solr book:
https://www.packtpub.com/big-data-and-business-intelligence/instant-apache-solr-indexing-data-how-instant

In this particular case, there might have been two or three ways to
answer the question, depending on why Ted wanted to use CSV from SolrJ
as opposed to the more common command line approach, which is the
example given in the tutorial and online documentation. Depending on
his business-level goals, there might have been different types of
help offered. We in the Solr community sometimes call it an X-Y
problem.

However, if you, Will Martin of USA, take a second-hand offence on
behalf of another person, I do apologize to you. There certainly was
no intent to upset innocent bystanders caught in the cross-fire of
determining the right answer to a slightly-unusual question.

Regards,
   Alex.


Re: Solr index corrupt question

2014-10-31 Thread ku3ia
Erick Erickson wrote
> What version of Solr/Lucene?

First of all, it was Lucene/Solr 4.6, but later it was changed to Lucene/Solr
4.8. Later still, the _root_ field and child-document support were added to the
schema. A full data re-index was not done on each change. But not so long ago I
ran an optimize down to one segment, and there were no problems with the optimize.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810p4166908.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Peter Keegan
Yes, I was inadvertently sending them to a replica. When I sent them to the
leader, the leader reported (1000 adds) and the replica reported only 1 add
per document. So, it looks like the leader forwards the batched jobs
individually to the replicas.

On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson 
wrote:

> Internally, the docs are batched up into smaller buckets (10 as I
> remember) and forwarded to the correct shard leader. I suspect that's
> what you're seeing.
>
> Erick
>
> On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan 
> wrote:
> > Regarding batch indexing:
> > When I send batches of 1000 docs to a standalone Solr server, the log
> file
> > reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the
> > leader of a replicated index, the leader log file reports much smaller
> > numbers, usually "(12 adds)". Why do the batches appear to be broken up?
> >
> > Peter
> >
> > On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> NP, just making sure.
> >>
> >> I suspect you'll get lots more bang for the buck, and
> >> results much more closely matching your expectations if
> >>
> >> 1> you batch up a bunch of docs at once rather than
> >> sending them one at a time. That's probably the easiest
> >> thing to try. Sending docs one at a time is something of
> >> an anti-pattern. I usually start with batches of 1,000.
> >>
> >> And just to check.. You're not issuing any commits from the
> >> client, right? Performance will be terrible if you issue commits
> >> after every doc, that's totally an anti-pattern. Doubly so for
> >> optimizes Since you showed us your solrconfig  autocommit
> >> settings I'm assuming not but want to be sure.
> >>
> >> 2> use a leader-aware client. I'm totally unfamiliar with Go,
> >> so I have no suggestions whatsoever to offer there But you'll
> >> want to batch in this case too.
> >>
> >> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose 
> wrote:
> >> > Hi Erick -
> >> >
> >> > Thanks for the detailed response and apologies for my confusing
> >> > terminology.  I should have said "WPS" (writes per second) instead of
> QPS
> >> > but I didn't want to introduce a weird new acronym since QPS is well
> >> > known.  Clearly a bad decision on my part.  To clarify: I am doing
> >> > *only* writes
> >> > (document adds).  Whenever I wrote "QPS" I was referring to writes.
> >> >
> >> > It seems clear at this point that I should wrap up the code to do
> "smart"
> >> > routing rather than choose Solr nodes randomly.  And then see if that
> >> > changes things.  I must admit that although I understand that random
> node
> >> > selection will impose a performance hit, theoretically it seems to me
> >> that
> >> > the system should still scale up as you add more nodes (albeit at
> lower
> >> > absolute level of performance than if you used a smart router).
> >> > Nonetheless, I'm just theorycrafting here so the better thing to do is
> >> just
> >> > try it experimentally.  I hope to have that working today - will
> report
> >> > back on my findings.
> >> >
> >> > Cheers,
> >> > - Ian
> >> >
> >> > p.s. To clarify why we are rolling our own smart router code, we use
> Go
> >> > over here rather than Java.  Although if we still get bad performance
> >> with
> >> > our custom Go router I may try a pure Java load client using
> >> > CloudSolrServer to eliminate the possibility of bugs in our
> >> implementation.
> >> >
> >> >
> >> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson <
> erickerick...@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> I'm really confused:
> >> >>
> >> >> bq: I am not issuing any queries, only writes (document inserts)
> >> >>
> >> >> bq: It's clear that once the load test client has ~40 simulated users
> >> >>
> >> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support
> >> >> a higher QPS than 2 shards over 2 Solr nodes, right
> >> >>
> >> >> QPS is usually used to mean "Queries Per Second", which is different
> >> from
> >> >> the statement that "I am not issuing any queries". And what do
> the
> >> >> number of users have to do with inserting documents?
> >> >>
> >> >> You also state: " In many cases, CPU on the solr servers is quite
> low as
> >> >> well"
> >> >>
> >> >> So let's talk about indexing first. Indexing should scale nearly
> >> >> linearly as long as
> >> >> 1> you are routing your docs to the correct leader, which happens
> with
> >> >> SolrJ
> >> >> and the CloudSolrSever automatically. Rather than rolling your own, I
> >> >> strongly
> >> >> suggest you try this out.
> >> >> 2> you have enough clients feeding the cluster to push CPU
> utilization
> >> >> on them all.
> >> >> Very often "slow indexing", or in your case "lack of scaling" is a
> >> >> result of document
> >> >> acquisition or, in your case, your doc generator is spending all it's
> >> >> time waiting for
> >> >> the individual documents to get to Solr and come back.
> >> >>
> >> >> bq: "chooses a random solr server for e

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Erick Erickson
Internally, the docs are batched up into smaller buckets (10 as I
remember) and forwarded to the correct shard leader. I suspect that's
what you're seeing.

Erick

On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan  wrote:
> Regarding batch indexing:
> When I send batches of 1000 docs to a standalone Solr server, the log file
> reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the
> leader of a replicated index, the leader log file reports much smaller
> numbers, usually "(12 adds)". Why do the batches appear to be broken up?
>
> Peter
>
> On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson 
> wrote:
>
>> NP, just making sure.
>>
>> I suspect you'll get lots more bang for the buck, and
>> results much more closely matching your expectations if
>>
>> 1> you batch up a bunch of docs at once rather than
>> sending them one at a time. That's probably the easiest
>> thing to try. Sending docs one at a time is something of
>> an anti-pattern. I usually start with batches of 1,000.
>>
>> And just to check.. You're not issuing any commits from the
>> client, right? Performance will be terrible if you issue commits
>> after every doc, that's totally an anti-pattern. Doubly so for
>> optimizes Since you showed us your solrconfig  autocommit
>> settings I'm assuming not but want to be sure.
>>
>> 2> use a leader-aware client. I'm totally unfamiliar with Go,
>> so I have no suggestions whatsoever to offer there But you'll
>> want to batch in this case too.
>>
>> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose  wrote:
>> > Hi Erick -
>> >
>> > Thanks for the detailed response and apologies for my confusing
>> > terminology.  I should have said "WPS" (writes per second) instead of QPS
>> > but I didn't want to introduce a weird new acronym since QPS is well
>> > known.  Clearly a bad decision on my part.  To clarify: I am doing
>> > *only* writes
>> > (document adds).  Whenever I wrote "QPS" I was referring to writes.
>> >
>> > It seems clear at this point that I should wrap up the code to do "smart"
>> > routing rather than choose Solr nodes randomly.  And then see if that
>> > changes things.  I must admit that although I understand that random node
>> > selection will impose a performance hit, theoretically it seems to me
>> that
>> > the system should still scale up as you add more nodes (albeit at lower
>> > absolute level of performance than if you used a smart router).
>> > Nonetheless, I'm just theorycrafting here so the better thing to do is
>> just
>> > try it experimentally.  I hope to have that working today - will report
>> > back on my findings.
>> >
>> > Cheers,
>> > - Ian
>> >
>> > p.s. To clarify why we are rolling our own smart router code, we use Go
>> > over here rather than Java.  Although if we still get bad performance
>> with
>> > our custom Go router I may try a pure Java load client using
>> > CloudSolrServer to eliminate the possibility of bugs in our
>> implementation.
>> >
>> >
>> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson > >
>> > wrote:
>> >
>> >> I'm really confused:
>> >>
>> >> bq: I am not issuing any queries, only writes (document inserts)
>> >>
>> >> bq: It's clear that once the load test client has ~40 simulated users
>> >>
>> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support
>> >> a higher QPS than 2 shards over 2 Solr nodes, right
>> >>
>> >> QPS is usually used to mean "Queries Per Second", which is different
>> from
>> >> the statement that "I am not issuing any queries". And what do the
>> >> number of users have to do with inserting documents?
>> >>
>> >> You also state: " In many cases, CPU on the solr servers is quite low as
>> >> well"
>> >>
>> >> So let's talk about indexing first. Indexing should scale nearly
>> >> linearly as long as
>> >> 1> you are routing your docs to the correct leader, which happens with
>> >> SolrJ
>> >> and the CloudSolrSever automatically. Rather than rolling your own, I
>> >> strongly
>> >> suggest you try this out.
>> >> 2> you have enough clients feeding the cluster to push CPU utilization
>> >> on them all.
>> >> Very often "slow indexing", or in your case "lack of scaling" is a
>> >> result of document
>> >> acquisition or, in your case, your doc generator is spending all it's
>> >> time waiting for
>> >> the individual documents to get to Solr and come back.
>> >>
>> >> bq: "chooses a random solr server for each ADD request (with 1 doc per
>> add
>> >> request)"
>> >>
>> >> Probably your culprit right there. Each and every document requires that
>> >> you
>> >> have to cross the network (and forward that doc to the correct leader).
>> So
>> >> given
>> >> that you're not seeing high CPU utilization, I suspect that you're not
>> >> sending
>> >> enough docs to SolrCloud fast enough to see scaling. You need to batch
>> up
>> >> multiple docs, I generally send 1,000 docs at a time.
>> >>
>> >> But even if you do solve this, the inter-node routing will prevent
>> >> linear scaling.
>> >> When a doc (or a ba

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Peter Keegan
Regarding batch indexing:
When I send batches of 1000 docs to a standalone Solr server, the log file
reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the
leader of a replicated index, the leader log file reports much smaller
numbers, usually "(12 adds)". Why do the batches appear to be broken up?

Peter

On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson 
wrote:

> NP, just making sure.
>
> I suspect you'll get lots more bang for the buck, and
> results much more closely matching your expectations if
>
> 1> you batch up a bunch of docs at once rather than
> sending them one at a time. That's probably the easiest
> thing to try. Sending docs one at a time is something of
> an anti-pattern. I usually start with batches of 1,000.
>
> And just to check.. You're not issuing any commits from the
> client, right? Performance will be terrible if you issue commits
> after every doc, that's totally an anti-pattern. Doubly so for
> optimizes Since you showed us your solrconfig  autocommit
> settings I'm assuming not but want to be sure.
>
> 2> use a leader-aware client. I'm totally unfamiliar with Go,
> so I have no suggestions whatsoever to offer there But you'll
> want to batch in this case too.
>
> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose  wrote:
> > Hi Erick -
> >
> > Thanks for the detailed response and apologies for my confusing
> > terminology.  I should have said "WPS" (writes per second) instead of QPS
> > but I didn't want to introduce a weird new acronym since QPS is well
> > known.  Clearly a bad decision on my part.  To clarify: I am doing
> > *only* writes
> > (document adds).  Whenever I wrote "QPS" I was referring to writes.
> >
> > It seems clear at this point that I should wrap up the code to do "smart"
> > routing rather than choose Solr nodes randomly.  And then see if that
> > changes things.  I must admit that although I understand that random node
> > selection will impose a performance hit, theoretically it seems to me
> that
> > the system should still scale up as you add more nodes (albeit at lower
> > absolute level of performance than if you used a smart router).
> > Nonetheless, I'm just theorycrafting here so the better thing to do is
> just
> > try it experimentally.  I hope to have that working today - will report
> > back on my findings.
> >
> > Cheers,
> > - Ian
> >
> > p.s. To clarify why we are rolling our own smart router code, we use Go
> > over here rather than Java.  Although if we still get bad performance
> with
> > our custom Go router I may try a pure Java load client using
> > CloudSolrServer to eliminate the possibility of bugs in our
> implementation.
> >
> >
> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson  >
> > wrote:
> >
> >> I'm really confused:
> >>
> >> bq: I am not issuing any queries, only writes (document inserts)
> >>
> >> bq: It's clear that once the load test client has ~40 simulated users
> >>
> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support
> >> a higher QPS than 2 shards over 2 Solr nodes, right
> >>
> >> QPS is usually used to mean "Queries Per Second", which is different
> from
> >> the statement that "I am not issuing any queries". And what do the
> >> number of users have to do with inserting documents?
> >>
> >> You also state: " In many cases, CPU on the solr servers is quite low as
> >> well"
> >>
> >> So let's talk about indexing first. Indexing should scale nearly
> >> linearly as long as
> >> 1> you are routing your docs to the correct leader, which happens with
> >> SolrJ
> >> and the CloudSolrSever automatically. Rather than rolling your own, I
> >> strongly
> >> suggest you try this out.
> >> 2> you have enough clients feeding the cluster to push CPU utilization
> >> on them all.
> >> Very often "slow indexing", or in your case "lack of scaling" is a
> >> result of document
> >> acquisition or, in your case, your doc generator is spending all it's
> >> time waiting for
> >> the individual documents to get to Solr and come back.
> >>
> >> bq: "chooses a random solr server for each ADD request (with 1 doc per
> add
> >> request)"
> >>
> >> Probably your culprit right there. Each and every document requires that
> >> you
> >> have to cross the network (and forward that doc to the correct leader).
> So
> >> given
> >> that you're not seeing high CPU utilization, I suspect that you're not
> >> sending
> >> enough docs to SolrCloud fast enough to see scaling. You need to batch
> up
> >> multiple docs, I generally send 1,000 docs at a time.
> >>
> >> But even if you do solve this, the inter-node routing will prevent
> >> linear scaling.
> >> When a doc (or a batch of docs) goes to a random Solr node, here's what
> >> happens:
> >> 1> the docs are re-packaged into groups based on which shard they're
> >> destined for
> >> 2> the sub-packets are forwarded to the leader for each shard
> >> 3> the responses are gathered back and returned to the client.
> >>
> >> This set of operations will eventually de

Re: exporting to CSV with solrj

2014-10-31 Thread will martin
"Why do you want to use CSV in SolrJ?"  Alexandre are you looking for a
design gig. This kind of question really begs nothing but disdain.
Commodity search exists, not matter what Paul Nelson writes and part of
that problem is due to advanced users always rewriting the reqs and specs
of less experienced users. 

On Fri, Oct 31, 2014 at 1:05 PM, Alexandre Rafalovitch 
wrote:

> Why do you want to use CSV in SolrJ? You would just have to parse it again.
>
> You could just trigger that as a URL call from outside with cURL or as
> just an HTTP (not SolrJ) call from Java client.
>
> Regards,
>Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 31 October 2014 12:34, tedsolr  wrote:
> > Sure thing, but how do I get the results output in CSV format?
> > response.getResults() is a list of SolrDocuments.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166861.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: exporting to CSV with solrj

2014-10-31 Thread tedsolr
I think I'm getting the idea now. You either use the response writer via an
HTTP call, or you write your own exporter. Thanks to everyone for their
input.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166889.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Missing Records

2014-10-31 Thread Erick Erickson
Sorry to say this, but I don't think the numDocs/maxDoc numbers
are telling you anything, because it looks like you've optimized,
which purges any data associated with deleted docs, including
the internal IDs behind the numDocs/maxDoc figures. So if there
were deletions, we can't see any evidence of them.


Sigh.


On Fri, Oct 31, 2014 at 9:56 AM, AJ Lemke  wrote:
> I have run some more tests so the numbers have changed a bit.
>
> Index Results done on Node 1:
> Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. 
> (Duration: 31m 47s)
> Requests: 1 (0/s), Fetched: 903,993 (474/s), Skipped: 0, Processed: 903,993
>
> Node 1:
> Last Modified: 44 minutes ago
> Num Docs: 824216
> Max Doc: 824216
> Heap Memory Usage: -1
> Deleted Docs: 0
> Version: 1051
> Segment Count: 1
> Optimized: checked
> Current: checked
>
> Node 2:
> Last Modified: 44 minutes ago
> Num Docs: 824216
> Max Doc: 824216
> Heap Memory Usage: -1
> Deleted Docs: 0
> Version: 1051
> Segment Count: 1
> Optimized: checked
> Current: checked
>
> Search results are the same as the doc numbers above.
>
> Logs only have one instance of an error:
>
> ERROR - 2014-10-31 10:47:12.867; 
> org.apache.solr.update.StreamingSolrServers$1; error
> org.apache.solr.common.SolrException: Bad Request
>
>
>
> request: 
> http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Some info that may be of help
> This is on my local vm using jetty with the embedded zookeeper.
> Commands to start cloud:
>
> java -DzkRun -jar start.jar
> java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983 -jar start.jar
>
> sh zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir 
> ~/development/configs/inventory/ -confname config_ inventory
> sh zkcli.sh -zkhost localhost:9983 -cmd linkconfig -collection inventory 
> -confname config_ inventory
>
> curl 
> "http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4";
> curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name= 
> inventory "
>
> AJ
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, October 31, 2014 9:49 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Missing Records
>
> OK, that is puzzling.
>
> bq: If there were duplicates only one of the duplicates should be removed and 
> I still should be able to search for the ID and find one correct?
>
> Correct.
>
> Your bad request error is puzzling, you may be on to something there.
> What it looks like is that somehow some of the documents you're sending to 
> Solr aren't getting indexed, either being dropped through the network or 
> perhaps have invalid fields, field formats (i.e. a date in the wrong format,
> whatever) or some such. When you complete the run, what are the maxDoc and 
> numDocs numbers on one of the nodes?
>
> What else do you see in the logs? They're pretty big after that many adds, 
> but maybe you can grep for ERROR and see something interesting like stack 
> traces. Or even "org.apache.solr". This latter will give you some false hits, 
> but at least it's better than paging through a huge log file
>
> Personally, in this kind of situation I sometimes use SolrJ to do my indexing 
> rather than DIH, I find it easier to debug so that's another possibility. In 
> the worst case with SolrJ, you can send the docs one at a time
>
> Best,
> Erick
>
> On Fri, Oct 31, 2014 at 7:37 AM, AJ Lemke  wrote:
>> Hi Erick:
>>
>> All of the records are coming out of an auto numbered field so the ID's will 
>> all be unique.
>>
>> Here is the test I ran this morning:
>>
>> Indexing completed. Added/Updated: 903,993 documents. Deleted 0
>> documents. (Duration: 28m)
>> Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed:
>> 903,993 (538/s)
>> Started: 33 minutes ago
>>
>> Last Modified:4 minutes ago
>> Num Docs:903829
>> Max Doc:903829
>> Heap Memory Usage:-1
>> Deleted Docs:0
>> Version:1517
>> Segment Count:16
>> Optimized: checked
>> Current: checked
>>
>> If there were duplicates only one of the duplicates should be removed and I 
>> still should be able to search for the ID and find one correct?
>> As it is right now I am missing records that should be in the collection.
>>
>> I also noticed this:
>>
>> org.apache.solr.common.SolrException: Bad Request
>>
>>
>>
>> request: 
>> http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.fr

Re: Only copy string up to certain character symbol?

2014-10-31 Thread Erick Erickson
In addition to Alexandre's comment, your index chain looks suspect:

  <filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" side="front" />
  <filter class="solr.PatternReplaceFilterFactory" pattern="(\/.+?$)" replacement=""/>

So the pattern replace stuff happens on the grams, not the full input. You might
be better off with a

solr.PatternReplaceCharFilterFactory

which works on the entire input string before even tokenization is done.

That said, Alexandre's comment is spot on. If your evidence that the regex
isn't being respected is that the document returns the whole input, it's because
stored="true" stores the raw input and has nothing to do with the analysis
chain; the stored copy of the input is captured before any kind of
analysis processing.

On Fri, Oct 31, 2014 at 9:33 AM, Alexandre Rafalovitch
 wrote:
> copyField can copy only part of the string but it is defined by
> character count. If you want to use regular expressions, you may be
> better off to do the copy in the UpdateRequestProcessor chain instead:
> http://www.solr-start.com/info/update-request-processors/#RegexReplaceProcessorFactory
>
> What you are doing (RegEx in the chain) only affects the "indexed"
> representation of the text, not the stored content. I suspect that's
> not what you want.
>
> Regards,
>Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 31 October 2014 11:49, hschillig  wrote:
>> So I have a title field that commonly looks like this:
>>
>> Personal legal forms simplified : the ultimate guide to personal legal forms
>> / Daniel Sitarz.
>>
>> I made a copyField that is of type "title_only". I want to ONLY copy the
>> text "Personal legal forms simplified : the ultimate guide to personal legal
>> forms".. so everything before the "/" symbol. I have it like this in my
>> schema.xml:
>>
>> 
>> 
>> 
>> 
>> > maxGramSize="15" side="front" />
>> > pattern="(\/.+?$)" replacement=""/>
>> 
>> 
>> 
>> 
>> > pattern="(\/.+?$)" replacement=""/>
>> 
>> 
>>
>> My regex seems to be off though as the field still holds the entire value
>> when I reindex and restart Solr. Thanks for any help!
>>
>>
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Only-copy-string-up-to-certain-character-symbol-tp4166857.html
>> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr index corrupt question

2014-10-31 Thread Erick Erickson
What version of Solr/Lucene? There have been some instances of index
corruption; see the lucene/CHANGES.txt file for fixes that might account for it.
This is something of a stab in the dark,
though.

Because this is troubling...

Best,
Erick

On Fri, Oct 31, 2014 at 7:57 AM, ku3ia  wrote:
> Hi, Erick. Thanks for your response.
>
> I checked my index with the CheckIndex utility, and here is what I got:
>
> 3 of 41: name=_1ouwn docCount=518333
>   codec=Lucene46
>   compound=false
>   numFiles=11
>   size (MB)=431.564
>   diagnostics = {timestamp=1412166850391, os=Linux,
> os.version=3.2.0-68-generic, mergeFactor=10, source=merge,
> lucene.version=4.8-SNAPSHOT - root - 2014-09-04 12:30:45, os.arch=amd64,
> mergeMaxNumSegments=-1, java.version=1.7.0_67, java.vendor=Oracle
> Corporation}
>   has deletions [delGen=2260]
>   test: open reader.OK
>   test: check integrity.FAILED
>   WARNING: fixIndex() would remove reference to this segment; full
> exception:
> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware
> problem?) : expected=e240ae5a actual=12262037
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/data/solrcloud/node1/index.bak/_1ouwn_Lucene41_0.pos")))
>   at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
>   at
> org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
>   at
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.checkIntegrity(Lucene41PostingsReader.java:1556)
>   at
> org.apache.lucene.codecs.BlockTreeTermsReader.checkIntegrity(BlockTreeTermsReader.java:3018)
>   at
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity(PerFieldPostingsFormat.java:243)
>   at
> org.apache.lucene.index.SegmentReader.checkIntegrity(SegmentReader.java:587)
>   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:561)
>   at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1967)
>
> I have 3 dead segments, but there is one interesting thing: I have a backup
> of this segment, which I made after an optimize to one segment a month ago,
> naturally without the del-file. So when I replaced it, nothing changed.
>
> It is possible that my HDD is corrupted, but I checked it for bad sectors and
> found nothing.
>
> Maybe the del-file is corrupted? How can I check or restore it?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810p4166848.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: exporting to CSV with solrj

2014-10-31 Thread Chris Hostetter

: Sure thing, but how do I get the results output in CSV format?
: response.getResults() is a list of SolrDocuments.

Either use something like the NoOpResponseParser which will give you the 
entire response back as a single string, or implement your own 
ResponseParser along the lines of...

public class YourRawParser extends ResponseParser {

  public NamedList processResponse(InputStream body, String encoding) {
// do whatever you want with the data in the InputStream 
// as the data comes over the wire
doStuff(body);

// just ignore the result SolrServer gives you
return new NamedList();
  }
}
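
(And roughly how you would hook it up, as an untested sketch: the real
ResponseParser base class also requires getWriterType() and the Reader-based
processResponse() to be implemented; YourRawParser/doStuff are from the sketch
above, and "server" is whatever SolrServer instance you already have.)

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "*:*");
params.set("wt", "csv");                     // ask Solr to render CSV server-side
params.set("rows", 1000);

QueryRequest req = new QueryRequest(params);
req.setResponseParser(new YourRawParser());  // your parser streams the raw body
server.request(req);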





-Hoss
http://www.lucidworks.com/


Re: exporting to CSV with solrj

2014-10-31 Thread Alexandre Rafalovitch
Why do you want to use CSV in SolrJ? You would just have to parse it again.

You could just trigger that as a URL call from outside with cURL or as
just an HTTP (not SolrJ) call from Java client.
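
(A bare-bones sketch of that second option, untested; the URL, collection name,
and output path are made up, and you would add fl/sort/paging as needed:)

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class CsvDump {
  public static void main(String[] args) throws Exception {
    // wt=csv makes Solr itself render the response as CSV, so there is
    // nothing to re-parse on the client side.
    String url = "http://localhost:8983/solr/collection1/select"
        + "?q=*:*&wt=csv&rows=1000000";
    try (InputStream in = new URL(url).openStream()) {
      Files.copy(in, Paths.get("export.csv"), StandardCopyOption.REPLACE_EXISTING);
    }
  }
}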

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 31 October 2014 12:34, tedsolr  wrote:
> Sure thing, but how do I get the results output in CSV format?
> response.getResults() is a list of SolrDocuments.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166861.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: Missing Records

2014-10-31 Thread AJ Lemke
I have run some more tests so the numbers have changed a bit.

Index Results done on Node 1:
Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. 
(Duration: 31m 47s)
Requests: 1 (0/s), Fetched: 903,993 (474/s), Skipped: 0, Processed: 903,993

Node 1:
Last Modified: 44 minutes ago
Num Docs: 824216
Max Doc: 824216
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1051
Segment Count: 1
Optimized: checked
Current: checked

Node 2:
Last Modified: 44 minutes ago
Num Docs: 824216
Max Doc: 824216
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1051
Segment Count: 1
Optimized: checked
Current: checked

Search results are the same as the doc numbers above.

Logs only have one instance of an error:

ERROR - 2014-10-31 10:47:12.867; org.apache.solr.update.StreamingSolrServers$1; 
error
org.apache.solr.common.SolrException: Bad Request



request: 
http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Some info that may be of help
This is on my local vm using jetty with the embedded zookeeper.
Commands to start cloud:

java -DzkRun -jar start.jar
java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983 -jar start.jar

sh zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir 
~/development/configs/inventory/ -confname config_ inventory
sh zkcli.sh -zkhost localhost:9983 -cmd linkconfig -collection inventory 
-confname config_ inventory

curl 
"http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4";
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name= 
inventory "

AJ


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, October 31, 2014 9:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records

OK, that is puzzling.

bq: If there were duplicates only one of the duplicates should be removed and I 
still should be able to search for the ID and find one correct?

Correct.

Your bad request error is puzzling, you may be on to something there.
What it looks like is that somehow some of the documents you're sending to Solr 
aren't getting indexed, either being dropped through the network or perhaps 
have invalid fields, field formats (i.e. a date in the wrong format,
whatever) or some such. When you complete the run, what are the maxDoc and 
numDocs numbers on one of the nodes?

What else do you see in the logs? They're pretty big after that many adds, but 
maybe you can grep for ERROR and see something interesting like stack traces. 
Or even "org.apache.solr". This latter will give you some false hits, but at 
least it's better than paging through a huge log file

Personally, in this kind of situation I sometimes use SolrJ to do my indexing 
rather than DIH, I find it easier to debug so that's another possibility. In 
the worst case with SolrJ, you can send the docs one at a time

Best,
Erick

On Fri, Oct 31, 2014 at 7:37 AM, AJ Lemke  wrote:
> Hi Erick:
>
> All of the records are coming out of an auto numbered field so the ID's will 
> all be unique.
>
> Here is the the test I ran this morning:
>
> Indexing completed. Added/Updated: 903,993 documents. Deleted 0 
> documents. (Duration: 28m)
> Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: 
> 903,993 (538/s)
> Started: 33 minutes ago
>
> Last Modified:4 minutes ago
> Num Docs:903829
> Max Doc:903829
> Heap Memory Usage:-1
> Deleted Docs:0
> Version:1517
> Segment Count:16
> Optimized: checked
> Current: checked
>
> If there were duplicates only one of the duplicates should be removed and I 
> still should be able to search for the ID and find one correct?
> As it is right now I am missing records that should be in the collection.
>
> I also noticed this:
>
> org.apache.solr.common.SolrException: Bad Request
>
>
>
> request: 
> http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> AJ
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, October 30, 

Re: exporting to CSV with solrj

2014-10-31 Thread tedsolr
Sure thing, but how do I get the results output in CSV format?
response.getResults() is a list of SolrDocuments.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166861.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Only copy string up to certain character symbol?

2014-10-31 Thread Alexandre Rafalovitch
copyField can copy only part of the string but it is defined by
character count. If you want to use regular expressions, you may be
better off to do the copy in the UpdateRequestProcessor chain instead:
http://www.solr-start.com/info/update-request-processors/#RegexReplaceProcessorFactory

What you are doing (RegEx in the chain) only affects the "indexed"
representation of the text, not the stored content. I suspect that's
not what you want.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 31 October 2014 11:49, hschillig  wrote:
> So I have a title field that commonly looks like this:
>
> Personal legal forms simplified : the ultimate guide to personal legal forms
> / Daniel Sitarz.
>
> I made a copyField that is of type "title_only". I want to ONLY copy the
> text "Personal legal forms simplified : the ultimate guide to personal legal
> forms".. so everything before the "/" symbol. I have it like this in my
> schema.xml:
>
> 
> 
> 
> 
>  maxGramSize="15" side="front" />
>  pattern="(\/.+?$)" replacement=""/>
> 
> 
> 
> 
>  pattern="(\/.+?$)" replacement=""/>
> 
> 
>
> My regex seems to be off though as the field still holds the entire value
> when I reindex and restart Solr. Thanks for any help!
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Only-copy-string-up-to-certain-character-symbol-tp4166857.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Only copy string up to certain character symbol?

2014-10-31 Thread hschillig
So I have a title field that commonly looks like this:

Personal legal forms simplified : the ultimate guide to personal legal forms
/ Daniel Sitarz.

I made a copyField that is of type "title_only". I want to ONLY copy the
text "Personal legal forms simplified : the ultimate guide to personal legal
forms".. so everything before the "/" symbol. I have it like this in my
schema.xml:















My regex seems to be off though as the field still holds the entire value
when I reindex and restart Solr. Thanks for any help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Only-copy-string-up-to-certain-character-symbol-tp4166857.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: exporting to CSV with solrj

2014-10-31 Thread Jorge Luis Betancourt Gonzalez
When you fire a query against Solr with wt=csv, the response coming from 
Solr is *already* in CSV; the CSVResponseWriter is responsible for translating 
SolrDocument instances into CSV on the server side, so I don't see any 
reason to use it yourself. Solr already does the heavy lifting for you.

Regards,

On Oct 31, 2014, at 10:44 AM, tedsolr  wrote:

> I am trying to invoke the CSVResponseWriter to create a CSV file of all
> stored fields. There are millions of documents so I need to write to the
> file iteratively. I saw a snippet of code online that claimed it could
> effectively remove the SolrDocumentList wrapper and allow the docs to be
> retrieved in the actual format requested in the query. However, I get a null
> pointer from the CSVResponseWriter.write() method.
> 
> SolrQuery qry = new SolrQuery("*:*");
> qry.setParam("wt", "csv");
> // set other params
> SolrServer server = getSolrServer();
> try {
>   QueryResponse res = server.query(qry);
> 
>   CSVResponseWriter writer = new CSVResponseWriter();
>   Writer w = new StringWriter();
> SolrQueryResponse solrResponse = new SolrQueryResponse();
>   solrResponse.setAllValues(res.getResponse());
>try {
> SolrParams list = new MapSolrParams(new HashMap<String, String>());
> writer.write(w, new LocalSolrQueryRequest(null, list), 
> solrResponse);
>} catch (IOException e) {
>throw new RuntimeException(e);
>}
>System.out.print(w.toString());
> 
> } catch (SolrServerException e) {
>   e.printStackTrace();
> }
> 
> NPE snippet:
> org.apache.solr.response.CSVWriter.writeResponse(CSVResponseWriter.java:281)
> org.apache.solr.response.CSVResponseWriter.write(CSVResponseWriter.java:56)
> 
> Am I on the right track with the approach? I really don't want to roll my
> own document to CSV line convertor. Thanks!
> Solr 4.9
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr index corrupt question

2014-10-31 Thread ku3ia
Hi, Erick. Thanks for your response.

I checked my index with the CheckIndex utility, and here is what I got:

3 of 41: name=_1ouwn docCount=518333
  codec=Lucene46
  compound=false
  numFiles=11
  size (MB)=431.564
  diagnostics = {timestamp=1412166850391, os=Linux,
os.version=3.2.0-68-generic, mergeFactor=10, source=merge,
lucene.version=4.8-SNAPSHOT - root - 2014-09-04 12:30:45, os.arch=amd64,
mergeMaxNumSegments=-1, java.version=1.7.0_67, java.vendor=Oracle
Corporation}
  has deletions [delGen=2260]
  test: open reader.OK
  test: check integrity.FAILED
  WARNING: fixIndex() would remove reference to this segment; full
exception:
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware
problem?) : expected=e240ae5a actual=12262037
(resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/data/solrcloud/node1/index.bak/_1ouwn_Lucene41_0.pos")))
  at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
  at
org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
  at
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.checkIntegrity(Lucene41PostingsReader.java:1556)
  at
org.apache.lucene.codecs.BlockTreeTermsReader.checkIntegrity(BlockTreeTermsReader.java:3018)
  at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity(PerFieldPostingsFormat.java:243)
  at
org.apache.lucene.index.SegmentReader.checkIntegrity(SegmentReader.java:587)
  at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:561)
  at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1967)

I have 3 dead segments, but there is one interesting thing: I have a backup
of this segment, which I made after an optimize to one segment a month ago,
naturally without the del-file. So when I replaced it, nothing changed.

It is possible that my HDD is corrupted, but I checked it for bad sectors and
found nothing.

Maybe the del-file is corrupted? How can I check or restore it?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810p4166848.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Missing Records

2014-10-31 Thread Erick Erickson
OK, that is puzzling.

bq: If there were duplicates only one of the duplicates should be
removed and I still should be able to search for the ID and find one
correct?

Correct.

Your bad request error is puzzling, you may be on to something there.
What it looks like is that somehow some of the documents you're
sending to Solr aren't getting
indexed, either being dropped somewhere in the network or having
invalid fields or field formats (e.g. a date in the wrong format,
whatever) or some such. When you complete the run, what are the maxDoc
and numDocs numbers on one of the nodes?

What else do you see in the logs? They're pretty big after that many
adds, but maybe you can grep for ERROR and see something interesting
like stack traces. Or even "org.apache.solr". This latter will give
you some false hits, but at least it's better than paging through a
huge log file

Personally, in this kind of situation I sometimes use SolrJ to do my
indexing rather than DIH; I find it easier to debug, so that's another
possibility. In the worst case with SolrJ, you can send the docs one
at a time
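
(Something along these lines, as a rough untested sketch; the URL and field
names are made up, but one add per request plus a try/catch makes the
offending document obvious:)

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OneAtATime {
  public static void main(String[] args) throws Exception {
    // Point this at the real collection and build docs from the real source.
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/inventory");
    List<String[]> rows = Arrays.asList(
        new String[] {"750041421", "some item"},
        new String[] {"750041422", "another item"});
    for (String[] row : rows) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", row[0]);
      doc.addField("name", row[1]);
      try {
        solr.add(doc);                  // one doc per request
      } catch (Exception e) {
        // a "Bad Request" here points straight at the offending document
        System.err.println("Failed to index id=" + row[0] + ": " + e);
      }
    }
    solr.commit();
    solr.shutdown();
  }
}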

Best,
Erick

On Fri, Oct 31, 2014 at 7:37 AM, AJ Lemke  wrote:
> Hi Erick:
>
> All of the records are coming out of an auto numbered field so the ID's will 
> all be unique.
>
> Here is the test I ran this morning:
>
> Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. 
> (Duration: 28m)
> Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: 903,993 
> (538/s)
> Started: 33 minutes ago
>
> Last Modified:4 minutes ago
> Num Docs:903829
> Max Doc:903829
> Heap Memory Usage:-1
> Deleted Docs:0
> Version:1517
> Segment Count:16
> Optimized: checked
> Current: checked
>
> If there were duplicates only one of the duplicates should be removed and I 
> still should be able to search for the ID and find one correct?
> As it is right now I am missing records that should be in the collection.
>
> I also noticed this:
>
> org.apache.solr.common.SolrException: Bad Request
>
>
>
> request: 
> http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> AJ
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, October 30, 2014 7:08 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Missing Records
>
> First question: Is there any possibility that some of the docs have duplicate 
> IDs (uniqueKeys)? If so, then some of the docs will be replaced, which will 
> lower your returns.
> One way to figuring this out is to go to the admin screen and if numDocs < 
> maxDoc, then documents have been replaced.
>
> Also, if numDocs is smaller than 903,993 then you probably have some docs 
> being replaced. One warning, however: even if numDocs equals maxDoc, docs 
> could still have been deleted, because when segments are merged the deleted 
> docs are purged.
>
> Best,
> Erick
>
> On Thu, Oct 30, 2014 at 3:12 PM, S.L  wrote:
>> I am curious , how many shards do you have and whats the replication
>> factor you are using ?
>>
>> On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke  wrote:
>>
>>> Hi All,
>>>
>>> We have a SOLR cloud instance that has been humming along nicely for
>>> months.
>>> Last week we started experiencing missing records.
>>>
>>> Admin DIH Example:
>>> Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s) A
>>> *:* search claims that there are only 903,902 this is the first full
>>> index.
>>> Subsequent full indexes give the following counts for the *:* search
>>> 903,805
>>> 903,665
>>> 826,357
>>>
>>> All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0,
>>> Processed: 903,993 (x/s) every time. ---records per second is
>>> variable
>>>
>>>
>>> I found an item that should be in the index but is not found in a search.
>>>
>>> Here are the referenced lines of the log file.
>>>
>>> DEBUG - 2014-10-30 15:10:51.160;
>>> org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE
>>> add{,id=750041421}
>>> {{params(debug=false&optimize=true&indent=true&commit=true&clean=true
>>> &wt=json&command=full-import&entity=ads&verbose=false),defaults(confi
>>> g=data-config.xml)}}
>>> DEBUG - 2014-10-30 15:10:51.160;
>>> org.apache.solr.update.SolrCmdDistributor; sending update to
>>> http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0
>>> add{,id=750041421}
>>> params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.5
>>> 7%3A8983%2Fsolr%2Finventory_shard1_replica1%2F
>>>
>>> --- there are 746 lines of log between entries ---

exporting to CSV with solrj

2014-10-31 Thread tedsolr
I am trying to invoke the CSVResponseWriter to create a CSV file of all
stored fields. There are millions of documents so I need to write to the
file iteratively. I saw a snippet of code online that claimed it could
effectively remove the SolrDocumentList wrapper and allow the docs to be
retrieved in the actual format requested in the query. However, I get a null
pointer from the CSVResponseWriter.write() method.

SolrQuery qry = new SolrQuery("*:*");
qry.setParam("wt", "csv");
// set other params
SolrServer server = getSolrServer();
try {
    QueryResponse res = server.query(qry);

    CSVResponseWriter writer = new CSVResponseWriter();
    Writer w = new StringWriter();
    SolrQueryResponse solrResponse = new SolrQueryResponse();
    solrResponse.setAllValues(res.getResponse());
    try {
        SolrParams list = new MapSolrParams(new HashMap<String, String>());
        writer.write(w, new LocalSolrQueryRequest(null, list), solrResponse);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    System.out.print(w.toString());

} catch (SolrServerException e) {
    e.printStackTrace();
}

NPE snippet:
org.apache.solr.response.CSVWriter.writeResponse(CSVResponseWriter.java:281)
org.apache.solr.response.CSVResponseWriter.write(CSVResponseWriter.java:56)

Am I on the right track with this approach? I really don't want to roll my
own document-to-CSV-line converter. Thanks!
Solr 4.9
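
For context, the fallback I'm trying to avoid hand-rolling would look roughly
like this. It's a rough, untested sketch; the core URL, field list, and row
count are made up:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLEncoder;

public class CsvDump {
    public static void main(String[] args) throws Exception {
        // wt=csv makes Solr itself emit the stored fields as CSV
        String base = "http://localhost:8983/solr/mycore/select";
        String q = URLEncoder.encode("*:*", "UTF-8");
        // for a really large index I'd page with start/rows (or cursorMark) instead of one huge request
        URL url = new URL(base + "?q=" + q + "&wt=csv&rows=1000000&fl=id,field1,field2");
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(url.openStream(), "UTF-8"));
             PrintWriter out = new PrintWriter("export.csv", "UTF-8")) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);  // stream each CSV line straight to the local file
            }
        }
    }
}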



--
View this message in context: 
http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Erick Erickson
NP, just making sure.

I suspect you'll get lots more bang for the buck, and
results much more closely matching your expectations if

1> you batch up a bunch of docs at once rather than
sending them one at a time. That's probably the easiest
thing to try. Sending docs one at a time is something of
an anti-pattern. I usually start with batches of 1,000.

And just to check: you're not issuing any commits from the
client, right? Performance will be terrible if you issue commits
after every doc; that's totally an anti-pattern, and doubly so for
optimizes. Since you showed us your solrconfig autocommit
settings I'm assuming not, but I want to be sure.

2> use a leader-aware client. I'm totally unfamiliar with Go,
so I have no suggestions whatsoever to offer there, but you'll
want to batch in this case too.
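
If it helps, the Java version of what I mean looks roughly like this.
Untested sketch; the zkHost string, collection name, and fields are
placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // CloudSolrServer is ZooKeeper-aware, so each batch goes to the right leader
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("mycollection");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {   // send 1,000 docs at a time
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        // no explicit commit here; let autoCommit/autoSoftCommit handle visibility
        server.shutdown();
    }
}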

On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose  wrote:
> Hi Erick -
>
> Thanks for the detailed response and apologies for my confusing
> terminology.  I should have said "WPS" (writes per second) instead of QPS
> but I didn't want to introduce a weird new acronym since QPS is well
> known.  Clearly a bad decision on my part.  To clarify: I am doing
> *only* writes
> (document adds).  Whenever I wrote "QPS" I was referring to writes.
>
> It seems clear at this point that I should wrap up the code to do "smart"
> routing rather than choose Solr nodes randomly.  And then see if that
> changes things.  I must admit that although I understand that random node
> selection will impose a performance hit, theoretically it seems to me that
> the system should still scale up as you add more nodes (albeit at lower
> absolute level of performance than if you used a smart router).
> Nonetheless, I'm just theorycrafting here so the better thing to do is just
> try it experimentally.  I hope to have that working today - will report
> back on my findings.
>
> Cheers,
> - Ian
>
> p.s. To clarify why we are rolling our own smart router code, we use Go
> over here rather than Java.  Although if we still get bad performance with
> our custom Go router I may try a pure Java load client using
> CloudSolrServer to eliminate the possibility of bugs in our implementation.
>
>
> On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson 
> wrote:
>
>> I'm really confused:
>>
>> bq: I am not issuing any queries, only writes (document inserts)
>>
>> bq: It's clear that once the load test client has ~40 simulated users
>>
>> bq: A cluster of 3 shards over 3 Solr nodes *should* support
>> a higher QPS than 2 shards over 2 Solr nodes, right
>>
>> QPS is usually used to mean "Queries Per Second", which is different from
>> the statement that "I am not issuing any queries". And what do the
>> number of users have to do with inserting documents?
>>
>> You also state: " In many cases, CPU on the solr servers is quite low as
>> well"
>>
>> So let's talk about indexing first. Indexing should scale nearly
>> linearly as long as
>> 1> you are routing your docs to the correct leader, which happens with
>> SolrJ
>> and the CloudSolrSever automatically. Rather than rolling your own, I
>> strongly
>> suggest you try this out.
>> 2> you have enough clients feeding the cluster to push CPU utilization
>> on them all.
>> Very often "slow indexing", or in your case "lack of scaling" is a
>> result of document
>> acquisition or, in your case, your doc generator is spending all it's
>> time waiting for
>> the individual documents to get to Solr and come back.
>>
>> bq: "chooses a random solr server for each ADD request (with 1 doc per add
>> request)"
>>
>> Probably your culprit right there. Each and every document requires that
>> you
>> have to cross the network (and forward that doc to the correct leader). So
>> given
>> that you're not seeing high CPU utilization, I suspect that you're not
>> sending
>> enough docs to SolrCloud fast enough to see scaling. You need to batch up
>> multiple docs, I generally send 1,000 docs at a time.
>>
>> But even if you do solve this, the inter-node routing will prevent
>> linear scaling.
>> When a doc (or a batch of docs) goes to a random Solr node, here's what
>> happens:
>> 1> the docs are re-packaged into groups based on which shard they're
>> destined for
>> 2> the sub-packets are forwarded to the leader for each shard
>> 3> the responses are gathered back and returned to the client.
>>
>> This set of operations will eventually degrade the scaling.
>>
>> bq:  A cluster of 3 shards over 3 Solr nodes *should* support
>> a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
>> behind sharding.
>>
>> If we're talking search requests, the answer is no. Sharding is
>> what you do when your collection no longer fits on a single node.
>> If it _does_ fit on a single node, then you'll usually get better query
>> performance by adding a bunch of replicas to a single shard. When
>> the number of  docs on each shard grows large enough that you
>> no longer get good query performance, _then_ you shard. And
>> take th

RE: Missing Records

2014-10-31 Thread AJ Lemke
Hi Erick:

All of the records get their IDs from an auto-numbered field, so the IDs will
all be unique.

Here is the test I ran this morning:

Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. 
(Duration: 28m)
Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: 903,993 
(538/s)
Started: 33 minutes ago

Last Modified: 4 minutes ago
Num Docs: 903829
Max Doc: 903829
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1517
Segment Count: 16
Optimized: checked
Current: checked

If there were duplicates, only the duplicate copies would be dropped, and I
should still be able to search for the ID and find one, correct?
As it is right now I am missing records that should be in the collection.

I also noticed this:

org.apache.solr.common.SolrException: Bad Request



request: 
http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

AJ


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, October 30, 2014 7:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records

First question: Is there any possibility that some of the docs have duplicate 
IDs (s)? If so, then some of the docs will be replaced, which will 
lower your returns.
One way to figuring this out is to go to the admin screen and if numDocs < 
maxDoc, then documents have been replaced.

Also, if numDocs is smaller than 903,993 then you probably have some docs being 
replaced. One warning, however. Even if docs are deleted, then this could still 
be the case because when segments are merged the deleted docs are purged.

Best,
Erick

On Thu, Oct 30, 2014 at 3:12 PM, S.L  wrote:
> I am curious , how many shards do you have and whats the replication 
> factor you are using ?
>
> On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke  wrote:
>
>> Hi All,
>>
>> We have a SOLR cloud instance that has been humming along nicely for 
>> months.
>> Last week we started experiencing missing records.
>>
>> Admin DIH Example:
>> Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s) A 
>> *:* search claims that there are only 903,902 this is the first full 
>> index.
>> Subsequent full indexes give the following counts for the *:* search
>> 903,805
>> 903,665
>> 826,357
>>
>> All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0,
>> Processed: 903,993 (x/s) every time. ---records per second is 
>> variable
>>
>>
>> I found an item that should be in the index but is not found in a search.
>>
>> Here are the referenced lines of the log file.
>>
>> DEBUG - 2014-10-30 15:10:51.160;
>> org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE 
>> add{,id=750041421} 
>> {{params(debug=false&optimize=true&indent=true&commit=true&clean=true
>> &wt=json&command=full-import&entity=ads&verbose=false),defaults(confi
>> g=data-config.xml)}}
>> DEBUG - 2014-10-30 15:10:51.160;
>> org.apache.solr.update.SolrCmdDistributor; sending update to 
>> http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0 
>> add{,id=750041421} 
>> params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.5
>> 7%3A8983%2Fsolr%2Finventory_shard1_replica1%2F
>>
>> --- there are 746 lines of log between entries ---
>>
>> DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire;  >> 
>> "[0x2][0xc3][0xe0]¶ms[0xa2][0xe0].update.distrib(TOLEADER[0xe0],d
>> istrib.from?[0x17] 
>> http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]&delBy
>> Q[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zi
>> p%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower
>> 'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
>> Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.4
>> 8929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2
>> DivisionName_Lower,recreational[0xe0]&latlon042.4893,-96.3693[0xe0]*P
>> hotoCount!8[0xe0](HasVideo[0x2][0xe0]"ID)750041421[0xe0]&Engine
>> [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux
>> City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162"
>> Long Track
>> [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0
>> ]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[
>> 0xe0]+Description?VThis Bad boy will pull you through the deepest 
>> snow!With the 162" track and 1000cc of power you can fly up any 
>> hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission
>> [0xe0]*ModelFacet7Ski-Doo|Summit 
>> Highmark[0xe0]/DealerNameFacet9C

Re: The exact same query gets executed n times for the nth row when retrieving body (plaintext) from BLOB column with Tika Entity Processor

2014-10-31 Thread Erick Erickson
Your message looks like it's missing stuff (snapshots?); the
e-mail for this list generally strips attachments, so you'll
have to put them somewhere else and link to them if you
want us to see them.

Best,
Erick

On Fri, Oct 31, 2014 at 5:11 AM, 5ton3  wrote:
> Hi!
>
> Not sure if this is a problem or if I just don't understand the debug
> response, but it seems somewhat odd to me.
> The "main" entity can have multiple BLOB documents. I'm using Tika Entity
> Processor to retrieve the body (plaintext) from these documents and put the
> result in a multivalued field, "filedata".  The data-config looks like this:
>
>
> It seems to work properly, but when I debug the data import, it seems that
> the query on TABLE2 on the BLOB column ("FILEDATA_BIN") gets executed 1 time
> for document #1, which is correct, but 2 times for document #2, 3 times for
> document #3, and so on.
> I.e. for document #1:
>
> And for document #2:
>
> The result seems correct, ie. it doesn't duplicate the filedata. But why
> does it query the DB two times for document #2? Any ideas? Maybe something
> wrong in my config?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/The-exact-same-query-gets-executed-n-times-for-the-nth-row-when-retrieving-body-plaintext-from-BLOB-r-tp4166822.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr index corrupt question

2014-10-31 Thread Erick Erickson
Not quite sure what you mean by "destroy". I can
use a delete-by-query with *:* and mark all docs in
my index deleted. Search results will return nothing
but it's still a valid index, it just consists of all deleted
docs. All the segments may be removed even in the
absence of an optimize due to segment merging.

But it's still a perfectly valid index, it just has nothing
in it.
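
To make that concrete, this kind of thing (sketch only; the core URL is a
placeholder) leaves you with a perfectly healthy, empty index:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class WipeIndex {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        server.deleteByQuery("*:*");  // marks every doc deleted; merging purges them over time
        server.commit();
        server.shutdown();
    }
}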

Are you seeing a real problem here or are you just
wondering why all your segment files disappeared?

Best,
Erick

On Fri, Oct 31, 2014 at 3:33 AM, ku3ia  wrote:
> Hi folks!
> I'm curious: can a delete operation destroy a Solr index if the optimize
> command is never performed?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: Missing Records

2014-10-31 Thread AJ Lemke
I started this collection using this command:

http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4

So 1 shard and replicationFactor of 2

AJ

-Original Message-
From: S.L [mailto:simpleliving...@gmail.com] 
Sent: Thursday, October 30, 2014 5:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records

I am curious , how many shards do you have and whats the replication factor you 
are using ?

On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke  wrote:

> Hi All,
>
> We have a SOLR cloud instance that has been humming along nicely for 
> months.
> Last week we started experiencing missing records.
>
> Admin DIH Example:
> Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s) A *:* 
> search claims that there are only 903,902 this is the first full 
> index.
> Subsequent full indexes give the following counts for the *:* search
> 903,805
> 903,665
> 826,357
>
> All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0,
> Processed: 903,993 (x/s) every time. ---records per second is variable
>
>
> I found an item that should be in the index but is not found in a search.
>
> Here are the referenced lines of the log file.
>
> DEBUG - 2014-10-30 15:10:51.160;
> org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE 
> add{,id=750041421} 
> {{params(debug=false&optimize=true&indent=true&commit=true&clean=true&
> wt=json&command=full-import&entity=ads&verbose=false),defaults(config=
> data-config.xml)}}
> DEBUG - 2014-10-30 15:10:51.160;
> org.apache.solr.update.SolrCmdDistributor; sending update to 
> http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0 
> add{,id=750041421} 
> params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57
> %3A8983%2Fsolr%2Finventory_shard1_replica1%2F
>
> --- there are 746 lines of log between entries ---
>
> DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire;  >> 
> "[0x2][0xc3][0xe0]¶ms[0xa2][0xe0].update.distrib(TOLEADER[0xe0],di
> strib.from?[0x17] 
> http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]&delByQ
> [0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%
> 51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'sk
> i-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
> Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48
> 929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2Di
> visionName_Lower,recreational[0xe0]&latlon042.4893,-96.3693[0xe0]*Phot
> oCount!8[0xe0](HasVideo[0x2][0xe0]"ID)750041421[0xe0]&Engine
> [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux
> City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162"
> Long Track
> [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]
> 1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0x
> e0]+Description?VThis Bad boy will pull you through the deepest 
> snow!With the 162" track and 1000cc of power you can fly up any 
> hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission
> [0xe0]*ModelFacet7Ski-Doo|Summit 
> Highmark[0xe0]/DealerNameFacet9Certified
> Auto,
> Inc.|4150[0xe0])StateAbbr"IA[0xe0])ClassName+Snowmobiles[0xe0](DealerI
> D$4150[0xe0]&AdCode$DX1Q[0xe0]*DealerName4Certified
> Auto,
> Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorCol
> or+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000
> SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID"12[0xe0].F
> uelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certif
> ied Auto, Inc.|Sioux 
> City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit
> Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber&000105
> [0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit
> highmark[\n]"
> What could be the issue and how does one fix this issue?
>
> Thanks so much and if more information is needed I have preserved the 
> log files.
>
> AJ
>


Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Ian Rose
Hi Erick -

Thanks for the detailed response and apologies for my confusing
terminology.  I should have said "WPS" (writes per second) instead of QPS
but I didn't want to introduce a weird new acronym since QPS is well
known.  Clearly a bad decision on my part.  To clarify: I am doing
*only* writes
(document adds).  Whenever I wrote "QPS" I was referring to writes.

It seems clear at this point that I should wrap up the code to do "smart"
routing rather than choose Solr nodes randomly.  And then see if that
changes things.  I must admit that although I understand that random node
selection will impose a performance hit, theoretically it seems to me that
the system should still scale up as you add more nodes (albeit at lower
absolute level of performance than if you used a smart router).
Nonetheless, I'm just theorycrafting here so the better thing to do is just
try it experimentally.  I hope to have that working today - will report
back on my findings.

Cheers,
- Ian

p.s. To clarify why we are rolling our own smart router code, we use Go
over here rather than Java.  Although if we still get bad performance with
our custom Go router I may try a pure Java load client using
CloudSolrServer to eliminate the possibility of bugs in our implementation.


On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson 
wrote:

> I'm really confused:
>
> bq: I am not issuing any queries, only writes (document inserts)
>
> bq: It's clear that once the load test client has ~40 simulated users
>
> bq: A cluster of 3 shards over 3 Solr nodes *should* support
> a higher QPS than 2 shards over 2 Solr nodes, right
>
> QPS is usually used to mean "Queries Per Second", which is different from
> the statement that "I am not issuing any queries". And what do the
> number of users have to do with inserting documents?
>
> You also state: " In many cases, CPU on the solr servers is quite low as
> well"
>
> So let's talk about indexing first. Indexing should scale nearly
> linearly as long as
> 1> you are routing your docs to the correct leader, which happens with
> SolrJ
> and the CloudSolrSever automatically. Rather than rolling your own, I
> strongly
> suggest you try this out.
> 2> you have enough clients feeding the cluster to push CPU utilization
> on them all.
> Very often "slow indexing", or in your case "lack of scaling" is a
> result of document
> acquisition or, in your case, your doc generator is spending all it's
> time waiting for
> the individual documents to get to Solr and come back.
>
> bq: "chooses a random solr server for each ADD request (with 1 doc per add
> request)"
>
> Probably your culprit right there. Each and every document requires that
> you
> have to cross the network (and forward that doc to the correct leader). So
> given
> that you're not seeing high CPU utilization, I suspect that you're not
> sending
> enough docs to SolrCloud fast enough to see scaling. You need to batch up
> multiple docs, I generally send 1,000 docs at a time.
>
> But even if you do solve this, the inter-node routing will prevent
> linear scaling.
> When a doc (or a batch of docs) goes to a random Solr node, here's what
> happens:
> 1> the docs are re-packaged into groups based on which shard they're
> destined for
> 2> the sub-packets are forwarded to the leader for each shard
> 3> the responses are gathered back and returned to the client.
>
> This set of operations will eventually degrade the scaling.
>
> bq:  A cluster of 3 shards over 3 Solr nodes *should* support
> a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
> behind sharding.
>
> If we're talking search requests, the answer is no. Sharding is
> what you do when your collection no longer fits on a single node.
> If it _does_ fit on a single node, then you'll usually get better query
> performance by adding a bunch of replicas to a single shard. When
> the number of  docs on each shard grows large enough that you
> no longer get good query performance, _then_ you shard. And
> take the query hit.
>
> If we're talking about inserts, then see above. I suspect your problem is
> that you're _not_ "saturating the SolrCloud cluster", you're sending
> docs to Solr very inefficiently and waiting on I/O. Batching docs and
> sending them to the right leader should scale pretty linearly until you
> start saturating your network.
>
> Best,
> Erick
>
> On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose  wrote:
> > Thanks for the suggestions so for, all.
> >
> > 1) We are not using SolrJ on the client (not using Java at all) but I am
> > working on writing a "smart" router so that we can always send to the
> > correct node.  I am certainly curious to see how that changes things.
> > Nonetheless even with the overhead of extra routing hops, the observed
> > behavior (no increase in performance with more nodes) doesn't make any
> > sense to me.
> >
> > 2) Commits: we are using autoCommit with openSearcher=false
> (maxTime=6)
> > and autoSoftCommit (maxTime=15000).
> >
> 

The exact same query gets executed n times for the nth row when retrieving body (plaintext) from BLOB column with Tika Entity Processor

2014-10-31 Thread 5ton3
Hi!

Not sure if this is a problem or if I just don't understand the debug
response, but it seems somewhat odd to me.
The "main" entity can have multiple BLOB documents. I'm using Tika Entity
Processor to retrieve the body (plaintext) from these documents and put the
result in a multivalued field, "filedata".  The data-config looks like this:


It seems to work properly, but when I debug the data import, it seems that
the query on TABLE2 on the BLOB column ("FILEDATA_BIN") gets executed 1 time
for document #1, which is correct, but 2 times for document #2, 3 times for
document #3, and so on.
I.e. for document #1:

And for document #2:

The result seems correct, ie. it doesn't duplicate the filedata. But why
does it query the DB two times for document #2? Any ideas? Maybe something
wrong in my config?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-exact-same-query-gets-executed-n-times-for-the-nth-row-when-retrieving-body-plaintext-from-BLOB-r-tp4166822.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr index corrupt question

2014-10-31 Thread ku3ia
Hi folks!
I'm curious: can a delete operation destroy a Solr index if the optimize
command is never performed?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: issue related to blank value in datefield

2014-10-31 Thread Aman Tandon
Thanks Chris

With Regards
Aman Tandon

On Fri, Oct 31, 2014 at 5:45 AM, Chris Hostetter 
wrote:

>
> : I was just trying to index the fields returned by my msql and i found
> this
>
> If you are importing dates from MySql where you have -00-00T00:00:00Z
> as the default value, you should actaully be getting an error lsat time i
> checked, but this explains the right way to tell the MySQL JDBC driver not
> to give you those values ...
>
>
> https://wiki.apache.org/solr/DataImportHandlerFaq#Invalid_dates_.28e.g._.22-00-00.22.29_in_my_MySQL_database_cause_my_import_to_abort
>
> (even if you aren't using DIH to talk to MySQL, the same principle holds
> if you are using JDBC, if you are talking to MySQL from some other client
> langauge there should be a similar option)
>
> : Actually i just want to know why it is getting stored as '
> :  0002-11-30T00:00:00Z' on indexing the value -00-00T00:00:00Z.
>
> like i said: bugs. behavior with "Year " is ndefined in alot of the
> underlying date code.  as for what that speciic date? ... no idea.
>
>
> -Hoss
> http://www.lucidworks.com/
>
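
For anyone who finds this thread later, the connection-string change that FAQ
describes looks roughly like this in a DIH data-config (untested; driver,
host, db name and credentials are placeholders):

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://dbhost/mydb?zeroDateTimeBehavior=convertToNull"
            user="db_user" password="db_password"/>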


Re: Design optimal Solr Schema

2014-10-31 Thread tomas.kalas
Oh yes, I want to display the stored data in an HTML file. I have 2 pages: on one
page there is a form, and I show the results there.
Each result is a link (by ID) to the file with the whole conversation, shown on the
second page. And what did you mean by separating each conversation interaction? Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632p4166805.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Design optimal Solr Schema

2014-10-31 Thread tomas.kalas
Thanks for your help.
OK, I'll try to explain it one more time; sorry for my English.
I need a few functions in my searching.

1.) I will naturally have a lot of documents, and I want to find out whether a
phrase occurs, for example with its words up to 5 positions apart. I used
w:"Good morning"~5 (it works in the Solr example, but I don't know how to do it
in my project).

2.) Find some word (phrase) up to a certain time, for example "Good morning" up
to time 5.25.

3.) And, if it is possible, the order of the words. I'm using the Solarium client
for highlighting, and I want to highlight words in this order, "Hello How Are you"
for example; if the field contains the words *hello* you are *how are you* and a
searched word is not in order, then skip it. But that is not essential; my main
problem is with the first 2 points.

How do I make an ideal schema and parse the data from the source file?

I've done a demo with basic searching: on one page I have a form, and the results
are links to files by ID (I use the ID as the filename); when I click a link, I set
a query parameter, and on the result page I get the data necessary to display the
result.

The result file is a table with the whole transcribed interview, with the results
highlighted.

Thanks for help.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632p4166793.html
Sent from the Solr - User mailing list archive at Nabble.com.