Re: Issue serving concurrent requests to SOLR on PROD

2015-05-19 Thread Michael Della Bitta
Are you sure the requests are getting queued because the LB is detecting 
that Solr won't handle them?


The reason I'm asking is that I know ELB doesn't handle bursts well. 
The load balancer needs to "warm up," which essentially means it might 
be underpowered at the beginning of a burst. It will spool up more 
resources if the average load over the last minute is high, but for that 
first minute it will definitely not be able to handle a burst.


If you're testing infrastructure using a benchmarking tool that doesn't 
slowly ramp up traffic, you're definitely encountering this problem.


Michael


Jani, Vrushank 
2015-05-19 at 03:51

Hello,

We have production SOLR deployed on the AWS cloud. We currently have 4 
live SOLR servers running on m3.xlarge EC2 instances behind an ELB 
(Elastic Load Balancer). We run Apache SOLR in a Tomcat 
container which sits behind Apache httpd. Apache httpd uses the 
prefork MPM, and requests flow from the ELB to Apache httpd to 
Tomcat (via AJP).


Over the last few days, we have been seeing an increase in the requests 
hitting the LB, around 2 requests a minute. In effect we see the ELB Surge 
Queue Length continuously sitting around 100.
(Surge Queue Length represents the total number of requests pending 
submission to the instances, queued by the load balancer.)


This is causing latencies and timeouts in client applications. Our 
first reaction was that we didn't have enough max connections set 
in either HTTPD or Tomcat. What we saw, however, is that the servers are 
very lightly loaded, with very low CPU and memory utilisation. The Apache 
prefork settings are as below on each server, with keep-alive turned off.



StartServers 8
MinSpareServers 5
MaxSpareServers 20
ServerLimit 256
MaxClients 256
MaxRequestsPerChild 4000



Tomcat server.xml has the following settings.

maxThreads="500" connectionTimeout="6"/>
For HTTPD – we see lots of TIME_WAIT connections on the Apache 
port, around 7000+, but ESTABLISHED connections are only around 20.

For Tomcat – we see about 60 ESTABLISHED connections on the Tomcat AJP port.

So the servers and connections don't look fully utilised; there is 
no visible stress anywhere. However, we still get requests queued up 
on the LB because they cannot be served by the underlying servers.


Can you please help me resolve this issue? Can you see any apparent 
problem here? Am I missing any configuration or settings for SOLR?


Your help will be truly appreciated.

Regards
VJ






Vrushank Jani
Senior Java Developer
T 02 8312 1625 | E vrushank.j...@truelocal.com.au








Re: Applying Tokenizers and Filters to CopyFields

2015-03-26 Thread Michael Della Bitta
Glad you are sorted out!

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Thu, Mar 26, 2015 at 10:09 AM, Martin Wunderlich 
wrote:

> Thanks so much, Erick and Michael, for all the additional explanation.
> The crucial information in the end turned out to be the one about the
> Default Search Field („df“). In solrconfig.xml this parameter was set to point
> to the original text, which is why the expanded queries didn’t work. When I
> set the df parameter to one of the fields with the expanded text, the
> search works fine. I have also removed the copyField declarations.
>
> It’s all working as expected now. Thanks again for the help.
>
> Cheers,
>
> Martin
>
>
>
>
> > Am 25.03.2015 um 23:43 schrieb Erick Erickson :
> >
> > Martin:
> > Perhaps this would help
> >
> > indexed=true, stored=true
> > field can be searched. The raw input (not analyzed in any way) can be
> > shown to the user in the results list.
> >
> > indexed=true, stored=false
> > field can be searched. However, the field can't be returned in the
> > results list with the document.
> >
> > indexed=false, stored=true
> > The field cannot be searched, but the contents can be returned in the
> > results list with the document. There are some use-cases where this is
> > desirable behavior.
> >
> > indexed=false, stored=false
> > The entire field is thrown out, it's just as if you didn't send the
> > field to be indexed at all.
> >
> > And one other thing, the copyField gets the _raw_ data not the
> > analyzed data. Let's say you have two fields, "src" and "dst".
> > copying from src to dst in schema.xml is identical to sending
> >
> >   <doc>
> >     <field name="src">original text</field>
> >     <field name="dst">original text</field>
> >   </doc>
> >
> > that is, copyfield directives are not chained.
> >
> > Also, watch out for your query syntax. Michael's comments are spot-on,
> > I'd just add this:
> >
> >
> http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
> >
> > is kind of odd. Let's assume you mean "qf" rather than "fq". That
> > _only_ matters if your query parser is "edismax", it'll be ignored in
> > this case I believe.
> >
> > You'd want something like
> > q=src:Sprache
> > or
> > q=dst:Sprache
> > or even
> > http://localhost:8983/solr/windex/select?q=Sprache&df=src
> > http://localhost:8983/solr/windex/select?q=Sprache&df=dst
> >
> > where "df" is "default field" and the search is applied against that
> > field in the absence of a field qualification like my first two
> > examples.
> >
> > Best,
> > Erick
> >
> > On Wed, Mar 25, 2015 at 2:52 PM, Michael Della Bitta
> >  wrote:
> >> I agree the terminology is possibly a little confusing.
> >>
> >> Stored refers to values that are stored verbatim. You can retrieve them
> >> verbatim. Analysis does not affect stored values.
> >> Indexed values are tokenized/transformed and stored inverted. You can't
> >> recover the literal analyzed version (at least, not easily).
> >>
> >> If what you really want is to store and retrieve case folded versions of
> >> your data as well as the original, you need to use something like a
> >> UpdateRequestProcessor, which I personally am less familiar with.
> >>
> >>
> >> On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich 
> >> wrote:
> >>
> >>> So, the pre-processing steps are applied under .
> >>> And this point is not quite clear to me: Assuming that I have a simple
> >>> case-folding step applied to the target of the copyField: How or where
> are
> >>> the lower-case tokens stored, if the text isn’t added to the index?
> How is
> >>> the query supposed to retrieve the lower-case version?
> >>> (sorry, if this sounds like a naive question, but I have a feeling
> that I
> >>> am missing something really basic here).
> >>>
> >>
> >>
> >> Michael Della Bitta
> >>
> >> Senior Software Engineer
> >>
> >> o: +1 646 532 3062
> >>
> >> appinions inc.
> >>
> >> “The Science of Influence Marketing”
> >>
> >> 18 East 41st Street
> >>
> >> New York, NY 10017
> >>
> >> t: @appinions <https://twitter.com/Appinions> | g+:
> >> plus.google.com/appinions
> >> <
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> >> w: appinions.com <http://www.appinions.com/>
>
>


Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Michael Della Bitta
I agree the terminology is possibly a little confusing.

Stored refers to values that are stored verbatim. You can retrieve them
verbatim. Analysis does not affect stored values.
Indexed values are tokenized/transformed and stored inverted. You can't
recover the literal analyzed version (at least, not easily).

If what you really want is to store and retrieve case folded versions of
your data as well as the original, you need to use something like a
UpdateRequestProcessor, which I personally am less familiar with.


On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich 
wrote:

> So, the pre-processing steps are applied under .
> And this point is not quite clear to me: Assuming that I have a simple
> case-folding step applied to the target of the copyField: How or where are
> the lower-case tokens stored, if the text isn’t added to the index? How is
> the query supposed to retrieve the lower-case version?
> (sorry, if this sounds like a naive question, but I have a feeling that I
> am missing something really basic here).
>


Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


Re: Applying Tokenizers and Filters to CopyFields

2015-03-25 Thread Michael Della Bitta
Two other things I noticed:

1. You probably don't want to store your copyFields. That's literally going
to be the same information each time.

2. Your expectation "the pre-processed version of the text is added to the
index" may be incorrect. Anything done in  sections
actually happens at query time. Not sure if that's significant for you.


Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Wed, Mar 25, 2015 at 4:27 PM, Ahmet Arslan 
wrote:

> Hi Martin,
>
> fq means filter query. Maybe you want to use the qf (query fields) parameter
> of edismax?
>
>
>
> On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich 
> wrote:
> Hi all,
>
> I am wondering what the process is for applying Tokenizers and Filters (as
> defined in the FieldType definition) to field contents that result from
> CopyFields. To be more specific, in my Solr instance, I would like to
> support query expansion by two means: removing stop words and adding
> inflected word forms as synonyms.
>
> To use a specific example, let’s say I have the following sentence to be
> indexed (from a Wittgenstein manuscript):
>
> "Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken.“
>
>
> This sentence will be indexed in a field called „original“ that is defined
> as follows:
>
>  required="true“/>
>
>  positionIncrementGap="100">
>   
> 
>   
>   
> 
>   
> 
>
>
> Then, in order to create fields for the two types of query expansion, I
> have set up specific fields for this:
>
> - one field where stopwords are removed both on the indexed content and
> the query. So, if the users is searching for a phrase like „der Sprache“,
> Solr should still find the segment above, because the determiners („der“
> and „die“) are removed prior to indexing and prior to querying,
> respectively. This field is defined as follows:
>
>  indexed="true" stored="true" required="true“/>
>
>  positionIncrementGap="100">
>   
> 
>  words=„stopwords_de.txt" format="snowball"/>
> 
>   
>   
> 
>  words="stopwords_de.txt" format="snowball"/>
> 
>   
> 
>
>
> - a second field where synonyms are added to the query so that more
> segments will be found. For instance, if the user is searching for the
> plural form „Sprachen“, Solr should return the segment above, due to this
> entry in the synonyms file: "Sprache,Sprach,Sprachen“. This field is
> defined as follows:
>
>  required="true“/>expanded
>
>  positionIncrementGap="100">
>   
> 
>  words="stopwords_de.txt" format="snowball"/>
> 
>   
>   
> 
>  words="stopwords_de.txt" format="snowball"/>
>  synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
> 
>   
> 
>
> Finally, to avoid having to specify three fields with identical content in
> the import documents, I am defining the two fields for query expansion as
> copyFields:
>
>   
>   http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
>
> will return no matches. I would have expected that using the fq parameter the
> user can specify what type of search (s)he would like to carry out: A
> standard search (field original) or an expanded search (one of the other
> two fields).
>
> For debugging, I have checked the analysis and results seem ok (posted
> below).
> Apologies for the long post, but I am really a bit stuck here (even after
> doing a lot of reading and googling). It is probably something simple that
> I am missing.
> Thanks a lot in advance for any help.
>
> Cheers,
>
> Martin
>
>
> ST
> Was
> zum
> Wesen
>
> der
> Welt
> gehört
> kann
> die
> Sprache
> nicht
> ausdrücken
> SF
> Was
> zum
> Wesen
>
> Welt
> gehört
> kann
> die
> Sprache
> nicht
> ausdrücken
> LCF
> was
> zum
> wesen
>
> welt
> gehört
> kann
> die
> sprache
> nicht
> ausdrücken
>


Re: Solr and HDFS configuration

2015-03-24 Thread Michael Della Bitta
The ultimate answer is that you need to test your configuration with your
expected workflow.

However, the thing that mitigates the remote IO factor (hopefully) is that
the Solr HDFS stuff features a blockcache that should (when tuned
correctly) cache in RAM the blocks your Solr process needs the most.

Solr on HDFS currently doesn't have any sort of rack locality like there is
with, say, HBase colocated on the HDFS nodes. So you can expect that even
with Solr installed on the same nodes as your HDFS datanodes,
there will be remote IO.
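
As a rough sketch, the block cache is configured through the HdfsDirectoryFactory
in solrconfig.xml; the values and namenode address below are illustrative only
and need tuning for a real deployment:

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
    </directoryFactory>

When direct memory allocation is on, the JVM also needs enough direct memory
(-XX:MaxDirectMemorySize) to hold the slabs.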



Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Mar 24, 2015 at 2:47 PM, Joseph Obernberger  wrote:

> Hi All - does it make sense to run a solr shard on a node within an Hadoop
> cluster that is not a data node?  In that case all the data that node
> processes would need to come over the network, but you get the benefit of
> more CPU for things like faceting.
> Thank you!
>
> -Joe
>


Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Michael Della Bitta
I guess the place to start is the Reference Guide:

https://cwiki.apache.org/confluence/display/solr/SolrCloud

Generally speaking, when you start Solr with any sort of Zookeeper, you've
entered "cloud mode," which essentially means that Solr is now capable of
organizing cores into groups that represent shards, and groups of shards
are coordinated into collections. Additionally, Zookeeper allows multiple
Solr installations to be coordinated together to serve these collections
with high availability.

If you're just trying to gain parallelism on a single machine by using multiple
cores, you don't specifically need cloud mode or collections. You can
create multiple cores, distribute your documents manually to each core, and
then do a distributed search a la
https://wiki.apache.org/solr/DistributedSearch. The downside here is that
you're on your own in terms of distributing the documents at write time,
but on the other hand, you don't have to maintain a Zookeeper ensemble or
devote brain cells to understanding collections/shards/etc.
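
For what it's worth, a manual distributed search just lists the cores in the
shards parameter, something like this (host and core names are illustrative):

    http://localhost:8983/solr/core1/select?q=foo&shards=localhost:8983/solr/core1,localhost:8983/solr/core2,localhost:8983/solr/core3
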


Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Feb 24, 2015 at 3:21 PM, Benson Margulies 
wrote:

> On Tue, Feb 24, 2015 at 1:30 PM, Michael Della Bitta
>  wrote:
> > Benson:
> >
> > Are you trying to run independent invocations of Solr for every node?
> > Otherwise, you'd just want to create an 8-shard collection with
> > maxShardsPerNode set to 8 (or more I guess).
>
> Michael Della Bitta,
>
> I don't want to run multiple invocations. I just want to exploit
> hardware cores with shards. Can you point me at doc for the process
> you are referencing here? I confess to some ongoing confusion between
> cores and collections.
>
> --benson
>
>
> >
> > Michael Della Bitta
> >
> > Senior Software Engineer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions <https://twitter.com/Appinions> | g+:
> > plus.google.com/appinions
> > <
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> > w: appinions.com <http://www.appinions.com/>
> >
> > On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies  >
> > wrote:
> >
> >> With so much of the site shifted to 5.0, I'm having a bit of trouble
> >> finding what I need, and so I'm hoping that someone can give me a push
> >> in the right direction.
> >>
> >> On a big multi-core machine, I want to set up a configuration with 8
> >> (or perhaps more) nodes treated as shards. I have some very particular
> >> solrconfig.xml and schema.xml that I need to use.
> >>
> >> Could some kind person point me at a relatively step-by-step layout?
> >> This is all on Linux, I'm happy to explicitly run Zookeeper.
> >>
>


Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Michael Della Bitta
Benson:

Are you trying to run independent invocations of Solr for every node?
Otherwise, you'd just want to create an 8-shard collection with
maxShardsPerNode set to 8 (or more I guess).
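
Something along these lines against the Collections API, with illustrative
collection and config names:

    http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=8&replicationFactor=1&maxShardsPerNode=8&collection.configName=myconf
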

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies 
wrote:

> With so much of the site shifted to 5.0, I'm having a bit of trouble
> finding what I need, and so I'm hoping that someone can give me a push
> in the right direction.
>
> On a big multi-core machine, I want to set up a configuration with 8
> (or perhaps more) nodes treated as shards. I have some very particular
> solrconfig.xml and schema.xml that I need to use.
>
> Could some kind person point me at a relatively step-by-step layout?
> This is all on Linux, I'm happy to explicitly run Zookeeper.
>


Re: incorrect Java version reported in solr dashboard

2015-02-23 Thread Michael Della Bitta
You're probably launching Solr using the older version of Java somehow. You
should make sure your PATH and JAVA_HOME variables point at your Java 8
install from the point of view of the script or configuration that launches
Solr.
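
For example, something like this in the environment of whatever launches Solr
(the path is illustrative, and the exact start command depends on whether you
use the bundled scripts or a servlet container):

    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    export PATH="$JAVA_HOME/bin:$PATH"
    java -version   # should now report 1.8 before Solr is started
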

Hope that helps.

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Feb 23, 2015 at 9:19 AM, SolrUser1543  wrote:

> I have upgraded Java version from 1.7 to 1.8 on Linux server.
> After the upgrade,  if I run " Java -version " I can see that it really
> changed to the new one.
>
> But when I run Solr, it is still reporting the old version in dashboard JVM
> section.
>
> What could be the reason?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/incorrect-Java-version-reported-in-solr-dashboard-tp4188236.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: ignoring bad documents during index

2015-02-20 Thread Michael Della Bitta
At the layer right before you send that XML out, add a fallback option
on error: if there's a failure with the batch, send each document one
at a time.
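
A minimal sketch of that fallback pattern, assuming a SolrJ client and an
illustrative core URL (the same idea applies if you post raw XML over HTTP
yourself):

    import java.io.IOException;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchWithFallback {
        private final HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Try the whole batch first; if it fails, retry one document at a time
        // so a single bad document doesn't sink the rest of the batch.
        public void indexBatch(List<SolrInputDocument> batch) {
            try {
                solr.add(batch);
            } catch (SolrServerException | IOException batchFailure) {
                for (SolrInputDocument doc : batch) {
                    try {
                        solr.add(doc);
                    } catch (Exception badDoc) {
                        // log and skip just this document
                        System.err.println("Skipping bad document: " + badDoc.getMessage());
                    }
                }
            }
        }
    }
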

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Fri, Feb 20, 2015 at 10:26 AM, SolrUser1543  wrote:

> I am sending a bulk of XML via http request.
>
> The same way like indexing via " documents " in solr interface.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947p4187632.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr Logging files get high

2015-02-03 Thread Michael Della Bitta
If you're trying to do a bulk ingest of data, I recommend committing less
frequently. Don't soft commit at all until the end of the batch, and hard
commit every 60 seconds.
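
One way to express that in solrconfig.xml (a sketch; the soft commit could also
simply be left to an explicit commit at the end of the load):

    <autoCommit>
      <maxTime>60000</maxTime>            <!-- hard commit every 60 seconds -->
      <openSearcher>false</openSearcher>  <!-- don't open a new searcher on each hard commit -->
    </autoCommit>

    <autoSoftCommit>
      <maxTime>-1</maxTime>               <!-- effectively disables automatic soft commits -->
    </autoSoftCommit>
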

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Feb 3, 2015 at 12:51 AM, Nitin Solanki  wrote:

> Hi Michael Della and Michael Sokolov,
>
> *size of tlog :-*
> 56K/mnt/nitin/solr/node1/solr/wikingram_shard3_replica1/data/tlog/
> 56K/mnt/nitin/solr/node1/solr/wikingram_shard7_replica1/data/tlog/
> 56K/mnt/nitin/solr/node2/solr/wikingram_shard4_replica1/data/tlog/
> 52K/mnt/nitin/solr/node2/solr/wikingram_shard8_replica1/data/tlog/
> 52K/mnt/nitin/solr/node3/solr/wikingram_shard1_replica1/data/tlog/
> 52K/mnt/nitin/solr/node3/solr/wikingram_shard5_replica1/data/tlog/
> 56K/mnt/nitin/solr/node4/solr/wikingram_shard2_replica1/data/tlog/
> 48K/mnt/nitin/solr/node4/solr/wikingram_shard6_replica1/data/tlog/
>
> *Size of logs :-*
> 755M/mnt/nitin/solr/node1/logs/
> 729M/mnt/nitin/solr/node2/logs/
> 729M/mnt/nitin/solr/node3/logs/
> 729M/mnt/nitin/solr/node4/logs/
>
> Which log is reducing performance?  I am committing more frequent hard
> commits: after 1 second I perform a soft commit, and after 15 seconds
> I perform a hard commit. I indexed 2 GB of data and you can see the
> size of the tlog that I pasted above. Is this tlog size good for 2 GB of
> indexed data, or is it high? The main question is whether the size of the
> log will harm the performance of Solr.
>
>
>
>
> On Mon, Feb 2, 2015 at 10:27 PM, Michael Della Bitta <
> michael.della.bi...@appinions.com> wrote:
>
> > Good call, it could easily be the tlog Nitin is talking about.
> >
> > As for which definition of high, I was making assumptions as well. :)
> >
> > Michael Della Bitta
> >
> > Senior Software Engineer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions <https://twitter.com/Appinions> | g+:
> > plus.google.com/appinions
> > <
> >
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> > >
> > w: appinions.com <http://www.appinions.com/>
> >
> > On Mon, Feb 2, 2015 at 11:51 AM, Michael Sokolov <
> > msoko...@safaribooksonline.com> wrote:
> >
> > > I was tempted to suggest rehab -- but seriously it wasn't clear if
> Nitin
> > > meant the log files Michael is referring to, or the transaction log
> > > (tlog).  If it's the transaction log, the solution is more frequent
> hard
> > > commits.
> > >
> > > -Mike
> > >
> > > On 2/2/2015 11:48 AM, Michael Della Bitta wrote:
> > >
> > >> If you'd like to reduce the amount of lines Solr logs, you need to
> edit
> > >> the
> > >> file example/resources/log4j.properties in Solr's home directory.
> Change
> > >> lines that say INFO to WARN.
> > >>
> > >> Michael Della Bitta
> > >>
> > >> Senior Software Engineer
> > >>
> > >> o: +1 646 532 3062
> > >>
> > >> appinions inc.
> > >>
> > >> “The Science of Influence Marketing”
> > >>
> > >> 18 East 41st Street
> > >>
> > >> New York, NY 10017
> > >>
> > >> t: @appinions <https://twitter.com/Appinions> | g+:
> > >> plus.google.com/appinions
> > >> <https://plus.google.com/u/0/b/112002776285509593336/
> > >> 112002776285509593336/posts>
> > >> w: appinions.com <http://www.appinions.com/>
> > >>
> > >> On Mon, Feb 2, 2015 at 7:42 AM, Nitin Solanki 
> > >> wrote:
> > >>
> > >>  Hi,
> > >>>   My solr logs directory has been get high. It is seriously
> > >>> problem
> > >>> or It harms my solr performance in both cases indexing as well as
> > >>> searching.
> > >>>
> > >>>
> > >
> >
>


Re: Solr Logging files get high

2015-02-02 Thread Michael Della Bitta
Good call, it could easily be the tlog Nitin is talking about.

As for which definition of high, I was making assumptions as well. :)

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Feb 2, 2015 at 11:51 AM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> I was tempted to suggest rehab -- but seriously it wasn't clear if Nitin
> meant the log files Michael is referring to, or the transaction log
> (tlog).  If it's the transaction log, the solution is more frequent hard
> commits.
>
> -Mike
>
> On 2/2/2015 11:48 AM, Michael Della Bitta wrote:
>
>> If you'd like to reduce the amount of lines Solr logs, you need to edit
>> the
>> file example/resources/log4j.properties in Solr's home directory. Change
>> lines that say INFO to WARN.
>>
>> Michael Della Bitta
>>
>> Senior Software Engineer
>>
>> o: +1 646 532 3062
>>
>> appinions inc.
>>
>> “The Science of Influence Marketing”
>>
>> 18 East 41st Street
>>
>> New York, NY 10017
>>
>> t: @appinions <https://twitter.com/Appinions> | g+:
>> plus.google.com/appinions
>> <https://plus.google.com/u/0/b/112002776285509593336/
>> 112002776285509593336/posts>
>> w: appinions.com <http://www.appinions.com/>
>>
>> On Mon, Feb 2, 2015 at 7:42 AM, Nitin Solanki 
>> wrote:
>>
>>  Hi,
>>>   My solr logs directory has been getting very large. Is it a serious
>>> problem,
>>> or does it harm my solr performance in both cases, indexing as well as
>>> searching?
>>>
>>>
>


Re: Solr Logging files get high

2015-02-02 Thread Michael Della Bitta
If you'd like to reduce the amount of lines Solr logs, you need to edit the
file example/resources/log4j.properties in Solr's home directory. Change
lines that say INFO to WARN.
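
In a stock 4.x install the relevant line looks roughly like this; switching the
level from INFO to WARN quiets most of the per-request logging:

    # example/resources/log4j.properties
    log4j.rootLogger=WARN, file, CONSOLE
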

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Feb 2, 2015 at 7:42 AM, Nitin Solanki  wrote:

> Hi,
>  My solr logs directory has been getting very large. Is it a serious problem,
> or does it harm my solr performance in both cases, indexing as well as
> searching?
>


Re: OutOfMemoryError for PDF document upload into Solr

2015-01-14 Thread Michael Della Bitta
Yep, you'll have to increase the heap size for your Tomcat container.

http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly
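
Following that answer, a common way to do it is a setenv.sh next to catalina.sh;
the heap sizes below are illustrative and need to fit within the machine's RAM:

    # $CATALINA_HOME/bin/setenv.sh
    export CATALINA_OPTS="$CATALINA_OPTS -Xms2g -Xmx4g"
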

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Wed, Jan 14, 2015 at 12:00 PM,  wrote:

> Hello,
>
> Can someone pass on the hints to get around following error? Is there any
> Heap Size parameter I can set in Tomcat or in Solr webApp that gets
> deployed in Solr?
>
> I am running Solr webapp inside Tomcat on my local machine which has RAM
> of 12 GB. I have PDF document which is 4 GB max in size that needs to be
> loaded into Solr
>
>
>
>
> Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
> at java.util.AbstractCollection.toArray(Unknown Source)
> at java.util.ArrayList.<init>(Unknown Source)
> at
> org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
> at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
> at
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
> at
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
> at
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
> at
> org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
> at
> org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)
>
>
> Thanks
> Ganesh
>
>


Re: Solr limiting number of rows to indexed to 21500 every time.

2015-01-13 Thread Michael Della Bitta
Looks like you have an underlying JDBC problem. The socket representing
your database connection seems to be going away. Have you tried running
this query outside of Solr and iterating through all the results? How about
in a standalone Java program? Do you have a DBA you can consult to see if
there are any errors on the Oracle side?

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Jan 13, 2015 at 2:31 AM, Pankaj Sonawane 
wrote:

> Hi,
>
> I am using Solr DataImportHandler to index data from database
> table(Oracle). One of the column contains String representation of XML
> (Sample below).
>
> **
> *1*
>
> *2*
>
> *3*
> *.*
> *.*
> *.*
>
> * // can be 100-200*
>
> I want solr to index each 'name' in 'option' tag against its value
>
> ex. JSON for 1 row
> "docs": [ {
> "COL1": "F",
> "COL2": "ASDF", "COL3": "ATCC", "COL4": 29039757, "A_s": "1", "B_s": "2", "
> C_s": "3",
> .
> .
> .
> *  }*
> // appending '_s' to 'name' attribute for making dynamic fields.
>
>
> But while indexing data, *every time only 21500 rows get indexed*. After
> that many records are indexed, I get the following exception:
>
> *1320927 [Thread-15] ERROR
> org.apache.solr.handler.dataimport.EntityProcessorBase  û getNext() failed
> for query 'SELECT col1,col2,col3,col4,XMLSERIALIZE(col5 AS  CLOB) AS col5
> FROM
> tableName':org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.sql.SQLRecoverableException: No more data to read from socket*
> *at
>
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:63)*
> *at
>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:378)*
> *at
>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:258)*
> *at
>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:293)*
> *at
>
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:116)*
> *at
>
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)*
> *at
>
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)*
> *at
>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)*
> *at
>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)*
> *at
>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)*
> *at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)*
> *at
>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)*
> *at
>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)*
> *at
>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)*
> *Caused by: java.sql.SQLRecoverableException: No more data to read from
> socket*
> *at
> oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1200)*
> *at
> oracle.jdbc.driver.T4CMAREngine.unmarshalCLR(T4CMAREngine.java:1865)*
> *at
> oracle.jdbc.driver.T4CMAREngine.unmarshalCLR(T4CMAREngine.java:1757)*
> *at
> oracle.jdbc.driver.T4CMAREngine.unmarshalCLR(T4CMAREngine.java:1750)*
> *at
>
> oracle.jdbc.driver.T4CClobAccessor.handlePrefetch(T4CClobAccessor.java:543)*
> *at
>
> oracle.jdbc.driver.T4CClobAccessor.unmarshalOneRow(T4CClobAccessor.java:197)*
> *at oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:916)*
> *at oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:835)*
> *at oracle.jdbc.driver.T4C8Oall.readRXD(T4C8Oall.java:664)*
> *at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:328)*
> *at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:186)*
> *at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:521)*
> *at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:194)*
> *at oracle.jdbc.driver.T4CStat

Re: solrcloud nodes registering as 127.0.1.1

2015-01-12 Thread Michael Della Bitta
Another way of doing it is by setting the -Dhost=$hostname parameter when
you start Solr.
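
For the 4.x Jetty example layout, that might look like the following (hostnames
and ZooKeeper addresses are illustrative):

    java -Dhost=solr1.example.com -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar
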

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Jan 12, 2015 at 7:15 AM, Matteo Grolla 
wrote:

> Solved!
> ubuntu has an entry like this in /etc/hosts
>
> 127.0.1.1   
>
> to properly run solrcloud one must substitute 127.0.1.1 with a real
> (possibly permanent) ip address
>
>
>
> Il giorno 12/gen/2015, alle ore 12:47, Matteo Grolla ha scritto:
>
> > Hi,
> >   hope someone can help me troubleshoot this issue.
> > I'm trying to setup a solrcloud cluster with
> >
> > -zookeeper on 192.168.1.8 (osx mac)
> > -solr1 on 192.168.1.10(virtualized ubuntu running on mac)
> > -solr2 on 192.168.1.3 (ubuntu on another pc)
> >
> > the problem is that both nodes register on zookeeper as 127.0.1.1 so
> they appear as the same node
> > here's a message from solr log
> >
> > 5962 [zkCallback-2-thread-1] INFO
> org.apache.solr.cloud.DistributedQueue  – LatchChildWatcher fired on path:
> /overseer/queue state: SyncConnected type NodeChildrenChanged
> > 5965
> [OverseerStateUpdate-93130698829725696-127.0.1.1:8983_solr-n_00]
> INFO  org.apache.solr.cloud.Overseer  – Update state numShards=2 message={
> >  "operation":"state",
> >  "core_node_name":"core_node1",
> >  "numShards":"2",
> >  "shard":"shard1",
> >  "roles":null,
> >  "state":"active",
> >  "core":"collection1",
> >  "collection":"collection1",
> >  "node_name":"127.0.1.1:8983_solr",
> >  "base_url":"http://127.0.1.1:8983/solr"}
> >
> >
> > I'm able to run the cluster if I change jetty.port for one of the nodes,
> but I'd really like some help troubleshooting this issue.
> >
> > Thanks
>
>


Re: solrcloud without faceting, i.e. for failover only

2015-01-06 Thread Michael Della Bitta

The downsides that come to mind:

1. Every write gets amplified by the number of nodes in the cloud. 1000 
write requests end up creating 1000*N HTTP calls as the leader forwards 
those writes individually to all of the followers in the cloud. Contrast 
that with classical replication where only changed index segments get 
replicated asynchronously.


2. Slightly more complicated infrastructure in terms of having to run a 
zookeeper cluster.


#1 is a trade off against being possibly more available to writes in the 
case of a single down node. In the cloud case, you're still open for 
business. In the classical replication case, you're no longer available 
for writes if the downed node is the master.


My two cents.

On 1/6/15 16:30, Will Milspec wrote:

Hi all,

We have a smallish index that performs well for searches and are
considering using solrcloud --but just for high availability/redundancy,
i.e. without any sharding.

The indexes would be replicated, but not distributed.

I know that "there are no stupid questions..Only stupid people"...but here
goes:

-is solrcloud w/o sharding done?( I.e. "it's just not done!!" )
-any downside (i.e. aside from the lack of horizontal scalability )

will





Re: .htaccess / password

2015-01-06 Thread Michael Della Bitta
The Jetty servlet container that Solr uses doesn't understand those 
files. It would not use them to determine access, and would likely make 
them accessible to web requests in plain text.
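
If you go the route Otis suggests below, a sketch of fronting Solr with Apache
httpd and basic auth might look like this (paths, realm and the upstream URL are
illustrative, and mod_proxy and mod_auth_basic must be enabled):

    <Location /solr>
        AuthType Basic
        AuthName "Solr"
        AuthUserFile /etc/httpd/solr.htpasswd
        Require valid-user
        ProxyPass http://localhost:8983/solr
        ProxyPassReverse http://localhost:8983/solr
    </Location>
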


On 1/6/15 16:01, Craig Hoffman wrote:

Thanks Otis. Do you think a .htaccess / .passwd file in the Solr admin dir would 
interfere with its operation?
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman














On Jan 6, 2015, at 1:09 PM, Otis Gospodnetic  wrote:

Hi Craig,

If you want to protect Solr, put it behind something like Apache / Nginx /
HAProxy and put .htaccess at that level, in front of Solr.
Or try something like
http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman  wrote:


Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/
will Solr continue to function properly? One thing to note, I will have a
CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
looking for a way to secure this area.

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman


















Re: Running Multiple Solr Instances

2015-01-06 Thread Michael Della Bitta

I would do one of either:

1. Set a different Solr home for each instance. I'd use the 
-Dsolr.solr.home=/d/2 command line switch when launching Solr to do so.


2. RAID 10 the drives. If you expect the Solr instances to get uneven 
traffic, pooling the drives will allow a given Solr instance to share 
the capacity of all of them.
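
A sketch of option 1, assuming one Tomcat instance per port and per drive (paths
are illustrative): give each instance its own solr home via CATALINA_OPTS in that
instance's setenv.sh.

    # in the port-8080 instance's setenv.sh
    export CATALINA_OPTS="$CATALINA_OPTS -Dsolr.solr.home=/d/1/solr"

    # in the port-8081 instance's setenv.sh
    export CATALINA_OPTS="$CATALINA_OPTS -Dsolr.solr.home=/d/2/solr"
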


On 1/5/15 23:31, Nishanth S wrote:

Hi folks,

I am running multiple solr instances (Solr 4.10.3 on Tomcat 8). There are
3 physical machines and I have 4 solr instances running on each machine,
on ports 8080, 8081, 8082 and 8083. The setup is fine up to this point. Now I
want to point each of these instances to a different index directory. The
drives in the machines are mounted as /d/1, /d/2, /d/3, /d/4, etc. Now if I define
/d/1 as the solr home, all solr index directories are created in /d/1
whereas the other drives remain unused. So how do I configure solr to
make use of all the drives so that I can get maximum storage for solr? I
would really appreciate any help in this regard.

Thanks,
Nishanth





Re: Endless 100% CPU usage on searcherExecutor thread

2014-12-18 Thread Michael Della Bitta
I've been experiencing this problem. Running VisualVM on my instances 
shows that they spend a lot of time creating WeakReferences
(org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference that is). 
I think what's happening here is the heap's not big enough for Lucene's 
caches and it ends up thrashing.


You might try bumping up your heap some to see if that helps. It's made 
a difference for me, but mostly in delaying the onset and limiting the 
occurrence of this. Likely I just need an even larger heap.


Michael


On 12/18/14 17:36, heaven wrote:

Hi,

We have 2 shards, each one has 2 replicas and each Solr instance has a
single thread that constantly uses 100% of CPU:


After restart it is running normally for some time (approximately until Solr
comes close to the Xmx limit), then the mentioned thread starts consuming one
CPU. 4 solr instances = minus 4 CPU cores.

We do not commit manually and the search is not used too intensively.

{code}

   25000
   30
   false



   15000

{code}

So I was wondering if that's correct, if this is how it's supposed to be, or if
something is wrong with our configuration or with Solr.

Thanks,
Alex



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Endless-100-CPU-usage-on-searcherExecutor-thread-tp4175088.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-12 Thread Michael Della Bitta
Shawn:

I seem to remember being able to do something about errors with the
handleError method, but I must have had to do it in a custom subclass to
actually have visibility into what exactly went wrong.
On Dec 11, 2014 9:28 PM, "Shawn Heisey"  wrote:

> On 12/11/2014 9:19 AM, Michael Della Bitta wrote:
> > Only thing you have to worry about (in both the CUSS and the home grown
> > case) is a single bad document in a batch fails the whole batch. It's up
> > to you to fall back to writing them individually so the rest of the
> > batch makes it in.
>
> With CUSS, your program will never know that the batch failed, so your
> code won't know that it must retry documents individually.  All requests
> return with an apparent success even before the data is sent to Solr,
> and there's no way for exceptions thrown during the background indexing
> to be caught by user code.
>
> If your program must know whether your updates were indexed successfully
> by catching an exception when there's a problem, you'll need to write
> your own multi-threaded indexing application using an instance of
> HttpSolrServer.
>
> I filed an issue on this, and built an imperfect patch.  The patch can
> only tell you that there was a problem during indexing, it doesn't know
> which document or even which batch had the problem.
>
> https://issues.apache.org/jira/browse/SOLR-3284
>
> Thanks,
> Shawn
>
>


Re: Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-11 Thread Michael Della Bitta

Tom:

ConcurrentUpdateSolrServer isn't magic or anything. You could pretty 
trivially write something that takes batches of your XML documents and 
combines them into a single request (multiple <doc> tags in the <add> 
section) and sends them up to Solr, and achieve some of the same speed 
benefits.
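
For reference, the combined payload would just be a single add request with
several doc elements, roughly like this (field names and values are illustrative):

    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">first document</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="title">second document</field>
      </doc>
    </add>
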


If you use it, the JavaBin-based serialization in CUSS is lighter as a 
wire format, though: 
http://lucene.apache.org/solr/4_10_2/solr-solrj/org/apache/solr/client/solrj/impl/BinaryRequestWriter.html


Only thing you have to worry about (in both the CUSS and the home grown 
case) is a single bad document in a batch fails the whole batch. It's up 
to you to fall back to writing them individually so the rest of the 
batch makes it in.


Michael

On 12/11/14 11:04, Erick Erickson wrote:

I don't think so, it uses SolrInputDocuments and
lists thereof. So if you parse the xml and then
put things in SolrInputDocuments..

Or something like that.

Erick

On Thu, Dec 11, 2014 at 9:43 AM, Tom Burton-West  wrote:

Thanks Eric,

That is helpful.  We already have a process that works similarly.  Each
thread/process that sends a document to Solr waits until it gets a response
in order to make sure that the document was indexed successfully (we log
errors and retry docs that don't get indexed successfully), however we run
20-100 of these processes,depending on  throughput (i.e. we send documents
to Solr for indexing as fast as we can until they start queuing up on the
Solr end.)

Is there a way to use CUSS with XML documents?

ie my second question:

A related question, is how to use ConcurrentUpdateSolrServer with XML
documents

I have very large XML documents, and the examples I see all build

documents

by adding fields in Java code.  Is there an example that actually reads

XML

files from the file system?

Tom




Re: Question on Solr Caching

2014-12-04 Thread Michael Della Bitta

Hi, Manohar,


1. Do the posting list and term list of the index reside in memory? If

not, how can I load them into memory? I don't want to load the entire data, like
using DocumentCache. Nor do I want to use RAMDirectoryFactory, as the data
will be lost if you restart


If you use MMapDirectory, Lucene will map the files into memory off heap 
and the OS's disk cache will cache the files in memory for you. Don't 
use RAMDirectory, it's not better than MMapDirectory for any use I'm 
aware of.


> 2. For FilterCache, there is a way to specify whether the filter 
should be cached or not in the query.


If you add {!cache=false}  to your filter query, it will bypass the 
cache. I'm fairly certain it will not subsequently be cached.
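
For example (the field name is illustrative):

    http://localhost:8983/solr/collection1/select?q=*:*&fq={!cache=false}category:books
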


> Similarly, Is there a way where I can specify the list of stored 
fields to be loaded to Document Cache?


If you have lazy loading enabled, the DocumentCache will only have the 
fields you asked for in it.


> 3. Similarly, Is there a way I can specify list of fields to be 
cached for FieldCache? Thanks, Manohar


You basically don't have much control over the FieldCache in Solr other 
than warming it with queries.


You should check out this wiki page, it will probably answer some questions:

https://wiki.apache.org/solr/SolrCaching

I hope that helps!

Michael



Re: Dealing with bad apples in a SolrCloud cluster

2014-11-21 Thread Michael Della Bitta

Good discussion topic.

I'm wondering if Solr doesn't need some sort of "shoot the other node in 
the head" functionality.


We recently ran into one of those failure modes that only AWS can dream up, 
where for an extended amount of time, two nodes in the same placement 
group couldn't talk to one another, but they could both see Zookeeper, 
so nothing was marked as down.


I've written a basic monitoring script that periodically tries to access 
every node in the cluster from every other node, but I haven't gotten to 
the point that I've automated anything based on that. It does trigger 
now and again for brief moments of time.


It'd be nice if there was some way the cluster could achieve some 
consensus that a particular node is a bad apple, and evict it from 
collections that have other active replicas. Not sure what the logic 
would be that would allow it to rejoin those collections after the 
situation passed, however.


Michael

On 11/21/14 13:54, Timothy Potter wrote:

Just soliciting some advice from the community ...

Let's say I have a 10-node SolrCloud cluster and have a single collection
with 2 shards with replication factor 10, so basically each shard has one
replica on each of my nodes.

Now imagine one of those nodes starts getting into a bad state and starts
to be slow about serving queries (not bad enough to crash outright though)
... I'm sure we could ponder any number of ways a box might slow down
without crashing.

 From my calculations, about 2/10ths of the queries will now be affected
since

1/10 queries from client apps will hit the bad apple
   +
1/10 queries from other replicas will hit the bad apple (distrib=false)


If QPS is high enough and the bad apple is slow enough, things can start to
get out of control pretty fast, esp. since we've set max threads so high to
avoid distributed dead-lock.

What have others done to mitigate this risk? Anything we can do in Solr to
help deal with this? It seems reasonable that nodes can identify a bad
apple by keeping track of query times and looking for nodes that are
significantly outside (>=2 stddev) what the other nodes are doing. Then
maybe mark the node as being down in ZooKeeper so clients and other nodes
stop trying to send requests to it; or maybe a simple policy of just don't
send requests to that node for a few minutes.





Re: Handling growth

2014-11-20 Thread Michael Della Bitta
The collections we index under this multi-collection alias do not use 
real time get, no. We have other collections behind single-collection 
aliases where get calls seem to work, but I'm not clear whether the 
calls are real time. It seems like it would be easy for you to test, but 
just be aware that there are multiple things you'd have to prove:


1. Whether get calls are real-time
2. Whether they work against multi-collection aliases as opposed to 
single collection aliases.


Also be aware that there were some issues with alias visibility and 
solrj clients prior to ~4.5 or so, and I believe there were early issues 
with writing to aliases prior to then as well. I'd suggest using a 
relatively modern release.


Michael

On 11/19/14 19:56, Patrick Henry wrote:

Michael,

Interesting, I'm still unfamiliar with limitations (if any) of aliasing.
Does architecture utilize realtime get?
On Nov 18, 2014 11:49 AM, "Michael Della Bitta" <
michael.della.bi...@appinions.com> wrote:


We're achieving some success by treating aliases as collections and
collections as shards.

More specifically, there's a read alias that spans all the collections,
and a write alias that points at the 'latest' collection. Every week, I
create a new collection, add it to the read alias, and point the write
alias at it.

Michael

On 11/14/14 07:06, Toke Eskildsen wrote:


Patrick Henry [patricktheawesomeg...@gmail.com] wrote:

  I am working with a Solr collection that is several terabytes in size

over
several hundred millions of documents.  Each document is very rich, and
over the past few years we have consistently quadrupled the size our
collection annually.  Unfortunately, this sits on a single node with
only a
few hundred megabytes of memory - so our performance is less than ideal.


I assume you mean gigabytes of memory. If you have not already done so,
switching to SSDs for storage should buy you some more time.

  [Going for SolrCloud]  We are in a continuous adding documents and never

change
existing ones.  Based on that, one individual recommended for me to
implement custom hashing and route the latest documents to the shard with
the least documents, and when that shard fills up add a new shard and
index
on the new shard, rinse and repeat.


We have quite a similar setup, where we produce a never-changing shard
once every 8 days and add it to our cloud. One could also combine this
setup with a single live shard, for keeping the full index constantly up to
date. The memory overhead of running an immutable shard is smaller than a
mutable one and easier to fine-tune. It also allows you to optimize the
index down to a single segment, which requires a bit less processing power
and saves memory when faceting. There's a description of our setup at
http://sbdevel.wordpress.com/net-archive-search/

  From an administrative point of view, we like having complete control
over each shard. We keep track of what goes in it and in case of schema or
analyze chain changes, we can re-build each shard one at a time and deploy
them continuously, instead of having to re-build everything in one go on a
parallel setup. Of course, fundamental changes to the schema would require
a complete re-build before deploy, so we hope to avoid that.

- Toke Eskildsen







Re: Handling growth

2014-11-18 Thread Michael Della Bitta
We're achieving some success by treating aliases as collections and 
collections as shards.


More specifically, there's a read alias that spans all the collections, 
and a write alias that points at the 'latest' collection. Every week, I 
create a new collection, add it to the read alias, and point the write 
alias at it.
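
As a sketch with illustrative names, the weekly rollover is just a CREATE
followed by re-pointing the two aliases (CREATEALIAS overwrites an existing
alias):

    http://host:8983/solr/admin/collections?action=CREATE&name=docs_2014_47&numShards=2&collection.configName=docs
    http://host:8983/solr/admin/collections?action=CREATEALIAS&name=docs_read&collections=docs_2014_45,docs_2014_46,docs_2014_47
    http://host:8983/solr/admin/collections?action=CREATEALIAS&name=docs_write&collections=docs_2014_47
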


Michael

On 11/14/14 07:06, Toke Eskildsen wrote:

Patrick Henry [patricktheawesomeg...@gmail.com] wrote:


I am working with a Solr collection that is several terabytes in size over
several hundred millions of documents.  Each document is very rich, and
over the past few years we have consistently quadrupled the size our
collection annually.  Unfortunately, this sits on a single node with only a
few hundred megabytes of memory - so our performance is less than ideal.

I assume you mean gigabytes of memory. If you have not already done so, 
switching to SSDs for storage should buy you some more time.


[Going for SolrCloud]  We are in a continuous adding documents and never change
existing ones.  Based on that, one individual recommended for me to
implement custom hashing and route the latest documents to the shard with
the least documents, and when that shard fills up add a new shard and index
on the new shard, rinse and repeat.

We have quite a similar setup, where we produce a never-changing shard once 
every 8 days and add it to our cloud. One could also combine this setup with a 
single live shard, for keeping the full index constantly up to date. The memory 
overhead of running an immutable shard is smaller than a mutable one and easier 
to fine-tune. It also allows you to optimize the index down to a single 
segment, which requires a bit less processing power and saves memory when 
faceting. There's a description of our setup at 
http://sbdevel.wordpress.com/net-archive-search/

 From an administrative point of view, we like having complete control over 
each shard. We keep track of what goes in it and in case of schema or analyze 
chain changes, we can re-build each shard one at a time and deploy them 
continuously, instead of having to re-build everything in one go on a parallel 
setup. Of course, fundamental changes to the schema would require a complete 
re-build before deploy, so we hope to avoid that.

- Toke Eskildsen




Re: Can we query on _version_field ?

2014-11-13 Thread Michael Della Bitta
You could also find a natural key that doesn't look like an ID and 
create a name-based (Type 3) UUID out of it, with something like Java's 
nameUUIDFromBytes:


https://docs.oracle.com/javase/7/docs/api/java/util/UUID.html#nameUUIDFromBytes%28byte%5B%5D%29

Implementations of this exist in other languages as well.
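
A small sketch in Java, using an illustrative URL as the natural key; the same
input always produces the same UUID, so it works as a stable uniqueKey:

    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    public class NaturalKeyId {
        public static void main(String[] args) {
            // an illustrative natural key, e.g. the document's canonical URL
            String naturalKey = "http://example.com/doctors/12345";
            UUID id = UUID.nameUUIDFromBytes(naturalKey.getBytes(StandardCharsets.UTF_8));
            System.out.println(id);   // deterministic: same key -> same UUID
        }
    }
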

On 11/13/14 11:35, Shawn Heisey wrote:

On 11/12/2014 10:45 PM, S.L wrote:

We know that the _version_ field is a mandatory field in the SolrCloud schema.xml;
it is expected to be of type long, and it also seems to have a unique value in a
collection.

However, a query of the form
http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json
does not seem to return any record. Can we query on the _version_ field in
the schema.xml?

I've been watching your journey unfold on the mailing list.  The whole
thing seems like an XY problem.

If I'm reading everything correctly, you want to have a unique ID value
that can serve as the uniqueKey, as well as a way to quickly look up a
single document in Solr.

Is there one part of the URL that serves as a unique identifier that
doesn't contain special characters?  It seems insane that you would not
have a unique ID value for every entity in your system that is composed
of only "regular" characters.

Assuming that such an ID exists (and is likely used as one piece of that
doctorURL that you mentioned) ... if you can extract that ID value into
its own field (either in your indexing code or a custom update
processor), you could use that for both uniqueKey and single-document
lookups.  Having that kind of information in your index seems like a
generally good idea.

Thanks,
Shawn





Re: Lucene to Solrcloud migration

2014-11-11 Thread Michael Della Bitta
Yeah, Erick confused me a bit too, but I think what he's talking about 
takes for granted that you'd have your various indexes directly set up 
as individual collections.


If instead you're considering one big collection, or a few collections 
based on aggregations of your individual indexes, having big, 
multisharded collections using compositeId should work, unless there's a 
use case we're not discussing.


Michael

On 11/11/14 10:27, Michal Krajňanský wrote:

Hi Eric, Michael,

thank you both for your comments.

2014-11-11 5:05 GMT+01:00 Erick Erickson :


bq: - the documents are organized in "shards" according to date (integer)
and
language (a possibly extensible discrete set)

bq: - the indexes are disjunct

OK, I'm having a hard time getting my head around these two statements.

If the indexes are disjunct in the sense that you only search one at a
time,
then they are different "collections" in SolrCloud jargon.



I just meant that every document is contained in a single one of the
indexes. I have a lot of Lucene indexes for various [language X timespan],
but logically we are speaking about a single huge index. That is why I
thought it would be natural to represent it as a single SolrCloud
collection.

If, on the other hand, these are a big collection and you want to search

them all with a single query, I suggest that in SolrCloud land you don't
want them to be discrete shards. My reasoning here is that let's say you
have a bunch of documents for October, 2014 in Spanish. By putting these
all on a single shard, your queries all have to be serviced by that one
shard. You don't get any parallelism.



That is right. Actually the parallelization is not the main issue right
now. The queries are very sparse, currently our system does not support
load balancing at all. I imagined that in the future it could be achievable
via SolrCloud replication.

The main consideration is to be able to plug the indexes in and out on
demand. The total size of the data is in terabytes. We usually want to
search only the latest indexes but occasionally it is needed to plug in
one of the older ones.

Maybe (probably) I still have some misconceptions about the uses of
SolrCloud...

If it really does make sense in your case to route all the doc to a

single shard,
then Michael's comment is spot-on use compositeId router.



You confuse me here. I was not thinking about a single shard, on the
contrary, any [language X timespan] index would be itself a shard. I agree
that compositeId router seems to be natural for what I need. I am currently
searching for a way to convert my indexes in such a way that my document
IDs have the composite format. Currently these are just unique integers,
so I would like to prefix all the document IDs of an index with its
language and timespan. I do not know how, but I believe this should be
possible, as it is a constant operation that would not change the structure
of the index.
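
For reference, the compositeId router only cares that the routing key comes
before a '!' separator in the id, so new documents would carry ids along these
lines (a sketch with made-up values; "en_2014-11" stands for the language plus
timespan prefix):

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "en_2014-11!12345");  // old integer id prefixed with its routing key
    cloudSolrServer.add(doc);

Everything sharing the same prefix hashes to the same shard.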

Best,

Michal




Best,
Erick

On Mon, Nov 10, 2014 at 11:50 AM, Michael Della Bitta
 wrote:

Hi Michal,

Is there a particular reason to shard your collections like that? If it

was

mainly for ease of operations, I'd consider just using CompositeId to
prevent specific types of queries hotspotting particular nodes.

If your ingest rate is fast, you might also consider making each
"collection" an alias that points to many actual collections, and
periodically closing off a collection and starting a new one. This

prevents

cache churn and the impact of large merges.

Michael



On 11/10/14 08:03, Michal Krajňanský wrote:

Hi All,

I have been working on a project that has long employed Lucene indexer.

Currently, the system implements a proprietary document routing and

index

plugging/unplugging on top of Lucene and of course contains a great
body of indexes. Recently an idea came up to migrate from Lucene to
Solrcloud, which appears to be more powerful than our proprietary system.

Could you suggest the best way to seamlessly migrate the system to use
Solrcloud, when the reindexing is not an option?

- all the existing indexes represent a single collection in terms of
Solrcloud
- the documents are organized in "shards" according to date (integer)

and

language (a possibly extensible discrete set)
- the indexes are disjunct

I have been able to convert the existing indexes to the newest Lucene
version and plug them individually into the Solrcloud. However, there is
the question of routing, sharding etc.

Any insight appreciated.

Best,


Michal Krajnansky





Re: how do I stop queries from being logged in two different log files in Tomcat

2014-11-10 Thread Michael Della Bitta
I generally turn off the console logging when I install Tomcat. It 
flushes after every line, unlike the other handlers, and that's sort of 
a performance problem (although if you need that, you need that).


Basically, find logging.properties in Tomcat's conf directory, and 
change these two lines:


handlers = 1catalina.org.apache.juli.FileHandler, 
2localhost.org.apache.juli.FileHandler, 
3manager.org.apache.juli.FileHandler, 
4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler
.handlers = 1catalina.org.apache.juli.FileHandler, 
java.util.logging.ConsoleHandler


to:

handlers = 1catalina.org.apache.juli.FileHandler, 
2localhost.org.apache.juli.FileHandler, 
3manager.org.apache.juli.FileHandler, 
4host-manager.org.apache.juli.FileHandler

.handlers = 1catalina.org.apache.juli.FileHandler

This might be different depending on the version of Tomcat you're using, 
but you see the idea.


Michael


On 11/10/14 12:07, solr-user wrote:

hi all.

We have a number of solr 1.4x and solr 4.x installations running on tomcat

We are trying to standardize the content of our log files so that we can
automate log analysis; we don't want to use log4j at this time.

In our solr 1.4x installations, the following conf\logging.properties file
is correctly logging queries only to our localhost_access_log.xxx.txt files,
and tomcat type messages to our catalina.xxx.log files

However

in our solr 4.x installations, we are seeing solr queries being logged in
both our localhost_access_log.xxx.txt files and our catalina.xxx.log files.
We don't want the solr queries logged in catalina.xxx.log files since it more
than doubles the amount of logging being done and doubles the disk space
requirement (which can be huge).

Is there a way to configure logging, without using log4j (for now), to only
log solr queries to the localhost_access_log.xxx.txt files??

I have looked at various tomcat logging info and don't see how to do it.

Any help appreciated.



# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

handlers = 1catalina.org.apache.juli.FileHandler,
2localhost.org.apache.juli.FileHandler,
3manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

.handlers = 1catalina.org.apache.juli.FileHandler,
java.util.logging.ConsoleHandler


# Handler specific properties.
# Describes specific configuration info for Handlers.


1catalina.org.apache.juli.FileHandler.level = FINE
1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
1catalina.org.apache.juli.FileHandler.prefix = catalina.

2localhost.org.apache.juli.FileHandler.level = FINE
2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
2localhost.org.apache.juli.FileHandler.prefix = localhost.

3manager.org.apache.juli.FileHandler.level = FINE
3manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
3manager.org.apache.juli.FileHandler.prefix = manager.

java.util.logging.ConsoleHandler.level = WARNING
java.util.logging.ConsoleHandler.formatter =
java.util.logging.SimpleFormatter



# Facility specific properties.
# Provides extra control for each logger.


org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = INFO
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].handlers =
2localhost.org.apache.juli.FileHandler

org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].level
= INFO
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].handlers
= 3manager.org.apache.juli.FileHandler

# For example, set the org.apache.catalina.util.LifecycleBase logger to log
# each component that extends LifecycleBase changing state:
#org.apache.catalina.util.LifecycleBase.level = FINE




--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-do-I-stop-queries-from-being-logged-in-two-different-log-files-in-Tomcat-tp4168587.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Lucene to Solrcloud migration

2014-11-10 Thread Michael Della Bitta

Hi Michal,

Is there a particular reason to shard your collections like that? If it 
was mainly for ease of operations, I'd consider just using CompositeId 
to prevent specific types of queries hotspotting particular nodes.


If your ingest rate is fast, you might also consider making each 
"collection" an alias that points to many actual collections, and 
periodically closing off a collection and starting a new one. This 
prevents cache churn and the impact of large merges.


Michael


On 11/10/14 08:03, Michal Krajňanský wrote:

Hi All,

I have been working on a project that has long employed Lucene indexer.

Currently, the system implements a proprietary document routing and index
plugging/unplugging on top of Lucene and of course contains a great
body of indexes. Recently an idea came up to migrate from Lucene to
Solrcloud, which appears to be more powerful than our proprietary system.

Could you suggest the best way to seamlessly migrate the system to use
Solrcloud, when the reindexing is not an option?

- all the existing indexes represent a single collection in terms of
Solrcloud
- the documents are organized in "shards" according to date (integer) and
language (a possibly extensible discrete set)
- the indexes are disjunct

I have been able to convert the existing indexes to the newest Lucene
version and plug them individually into the Solrcloud. However, there is
the question of routing, sharding etc.

Any insight appreciated.

Best,


Michal Krajnansky





Re: Migrating shards

2014-11-07 Thread Michael Della Bitta
1. The new replica will not begin serving data until it's all there and 
caught up. You can watch the replica status on the Cloud screen to see 
it catch up; when it's green, you're done. If you're trying to automate 
this, you're going to look for the replica that says "recovering" in 
clusterstate.json and wait until it's "active" (a SolrJ sketch for this is below).


2. I believe this to be the case, but I'll wait for someone else to 
chime in who knows better. Also, I wonder if there's a difference 
between DELETEREPLICA and unloading the core directly.
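
For point 1, if you want to script the wait with SolrJ, something roughly like
this works (the collection and shard names are placeholders, and the exact
ZkStateReader API has shifted a little between 4.x releases, so treat it as a
sketch):

    CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    cloud.connect();
    ClusterState state = cloud.getZkStateReader().getClusterState();
    for (Replica replica : state.getSlice("collection1", "shard1").getReplicas()) {
        // poll until the new replica reports "active"
        System.out.println(replica.getName() + " -> " + replica.getStr(ZkStateReader.STATE_PROP));
    }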


Michael


On 11/7/14 10:24, Ian Rose wrote:

Howdy -

What is the current best practice for migrating shards to another machine?
I have heard suggestions that it is "add replica on new machine, wait for
it to catch up, delete original replica on old machine".  But I wanted to
check to make sure...

And if that is the best method, two follow-up questions:

1. Is there a best practice for knowing when the new replica has "caught
up" or do you just do a "*:*" query on both, compare counts, and call it a
day when they are the same (or nearly so, since the slave replica might lag
a little bit)?

2. When deleting the original (old) replica, since that one could be the
leader, is the replica deletion done in a safe manner such that no
documents will be lost (e.g. ones that were recently received by the leader
and not yet synced over to the slave replica before the leader is deleted)?

Thanks as always,
Ian





Re: Is there a way to stop some hyphenated terms from being tokenized

2014-11-05 Thread Michael Della Bitta

Pretty sure what you need is called KeywordMarkerFilterFactory.

<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />


On 11/5/14 17:24, Tang, Rebecca wrote:

Hi there,

For some hyphenated terms, I want them to stay as is instead of being 
tokenized.  For example: e-cigarette, e-cig, I-pad.  I don't want them to be 
split into e and cig or I and pad, because the single letters e and I produce
too many false positive matches.

Is there a way to tell the standard tokenizer to skip tokenizing some terms?

Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library
E: rebecca.t...@ucsf.edu





Re: Solr Cloud Management Tools

2014-11-04 Thread Michael Della Bitta
http://sematext.com/spm/

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Nov 4, 2014 at 3:01 PM, elangovan palani  wrote:

>
>
>
>
> Hello.
>
>
>
>
> Can someone suggest a SolrCloud management tool?
>
> I'm looking to gather collection/document/shard metrics and also
>
> to collect data about cluster usage (memory, reads/writes, etc.).
>
>
>
>
> Thanks
>
>
>
>
> Elan
>


Re: Automating Solr

2014-10-30 Thread Michael Della Bitta

You probably just need to put double quotes around the url.


On 10/30/14 15:27, Craig Hoffman wrote:

Thanks! One more question. WGET seems to be choking on my URL, in particular the #
and the & characters. What’s the best method of escaping them?

http://<host>:8983/solr/#/articles/dataimport//dataimport?command=full-import&clean=true&optimize=true
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman














On Oct 30, 2014, at 12:30 PM, Ramzi Alqrainy  wrote:

Simple add this line to your crontab with crontab -e command:

0,30 * * * * /usr/bin/wget
http://<solr_host>:8983/solr/<core name>/dataimport?command=full-import

This will run a full import every 30 minutes. Replace <solr_host> and <core name>
with your configuration.

*Using delta-import command*

Delta Import operation can be started by hitting the URL
http://localhost:8983/solr/dataimport?command=delta-import. This operation
will be started in a new thread and the status attribute in the response
should be shown busy now. Depending on the size of your data set, this
operation may take some time. At any time, you can hit
http://localhost:8983/solr/dataimport to see the status flag.

When delta-import command is executed, it reads the start time stored in
conf/dataimport.properties. It uses that timestamp to run delta queries and
after completion, updates the timestamp in conf/dataimport.properties.

Note: there is an alternative approach for updating documents in Solr, which
is in many cases more efficient and also requires less configuration
explained on DataImportHandlerDeltaQueryViaFullImport.

*Delta-Import Example*

We will use the same example database used in the full import example. Note
that the database schema has been updated and each table contains an
additional column last_modified of timestamp type. You may want to download
the database again since it has been updated recently. We use this timestamp
field to determine what rows in each table have changed since the last
indexed time.

Take a look at the following data-config.xml
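
A minimal data-config.xml along these lines (the driver, table and column names
are only illustrative; the point is the deltaQuery/deltaImportQuery pair):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <document>
    <entity name="item" pk="ID"
            query="SELECT * FROM item"
            deltaQuery="SELECT ID FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT * FROM item WHERE ID = '${dih.delta.ID}'">
      <field column="NAME" name="name"/>
    </entity>
  </document>
</dataConfig>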

Pay attention to the deltaQuery attribute which has an SQL statement capable
of detecting changes in the item table. Note the variable
${dataimporter.last_index_time} The DataImportHandler exposes a variable
called last_index_time which is a timestamp value denoting the last time
full-import 'or' delta-import was run. You can use this variable anywhere in
the SQL you write in data-config.xml and it will be replaced by the value
during processing.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Automating-Solr-tp4166696p4166707.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: solr-map-reduce API

2014-10-29 Thread Michael Della Bitta

Check this out:

http://www.slideshare.net/cloudera/solrhadoopbigdatasearch

On 10/29/14 16:31, Pritesh Patel wrote:

What exactly does this API do?

--Pritesh





Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread Michael Della Bitta
No you do not, although you may consider it, because you'd be getting a 
sort of integrated stack.


But really, the decision to switch to running Solr in HDFS should not be 
taken lightly. Unless you are on a team familiar with running a Hadoop 
stack, or you're willing to devote a lot of effort toward becoming 
proficient with one, I would recommend against it.


On 10/28/14 15:27, S.L wrote:

I m using Apache Hadoop and Solr , do I nee dto switch to Cloudera

On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:


We index directly from mappers using SolrJ. It does work, but you pay the
price of having to instantiate all those sockets vs. the way
MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer
directly in the Reduce task.

You don't *need* to use MapReduceIndexerTool, but it's more efficient, and
if you don't, you then have to make sure to appropriately tune your Hadoop
implementation to match what your Solr installation is capable of.

On 10/28/14 12:39, S.L wrote:


Will,

I think in one of your other emails (which I am not able to find) you had
asked if I was indexing directly from MapReduce jobs. Yes, I am indexing
directly from the map task, and that is done using SolrJ with a
CloudSolrServer initialized with the ZK ensemble URLs. Do I need to use
something like MapReduceIndexerTool, which I suppose writes to HDFS and
is then moved to the Solr index in a subsequent step? If so, why?

I don't use any soft commits and do an autocommit every 15 seconds; the
snippet from the configuration can be seen below.

    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    </autoSoftCommit>

    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>

I looked at the localhost_access.log file ,  all the GET and POST requests
have a sub-second response time.




On Tue, Oct 28, 2014 at 2:06 AM, Will Martin 
wrote:

  The easiest, and coarsest measure of response time [not service time in a

distributed system] can be picked up in your localhost_access.log file.
You're using tomcat write?  Lookup AccessLogValve in the docs and
server.xml. You can add configuration to report the payload and time to
service the request without touching any code.

Queueing theory is what Otis was talking about when he said you've
saturated your environment. In AWS people just auto-scale up and don't
worry about where the load comes from; its dumb if it happens more than 2
times. Capacity planning is tough, let's hope it doesn't disappear
altogether.

G'luck


-Original Message-
From: S.L [mailto:simpleliving...@gmail.com]
Sent: Monday, October 27, 2014 9:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
out of synch.

Good point about ZK logs , I do see the following exceptions
intermittently in the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
connection from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to
establish new session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO
[CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14949db9da40037 with negotiated timeout 1 for client
/xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client
sessionid
0x14949db9da40037, likely client has closed socket
  at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
  at

org.apache.zookeeper.server.NIOServerCnxnFactory.run(
NIOServerCnxnFactory.java:208)
  at java.lang.Thread.run(Thread.java:744)

For queuing theory, I don't know of any way to see how fast the requests
are being served by SolrCloud, or whether a queue builds up when the
service rate is slower than the rate of requests from the incoming multiple
threads.

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin 
wrote:

  2 naïve comments, of course.



-  Queuing theory

-  Zookeeper logs.



From: S.L [mailto:simpleliving...@gmail.com]
Sent: Monday, October 27, 2014 1:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
replicas out of synch.



Please find the clusterstate.json attached.

Also in this case at least the Shard1 replicas are out of sync, as can
be seen below.

Shard 1 replica 1 *does not* return a result with distrib=false.

Query
:http://server3.mydomain.com:8082/solr/dyCollection1/s

Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread Michael Della Bitta
We index directly from mappers using SolrJ. It does work, but you pay 
the price of having to instantiate all those sockets vs. the way 
MapReduceIndexerTool works, where you're writing to an 
EmbeddedSolrServer directly in the Reduce task.


You don't *need* to use MapReduceIndexerTool, but it's more efficient, 
and if you don't, you then have to make sure to appropriately tune your 
Hadoop implementation to match what your Solr installation is capable of.
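
For what it's worth, the "index straight from the mapper" shape looks roughly
like this (class, field and host names are made up; note that every task carries
its own client and sockets):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexingMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        private CloudSolrServer solr;

        @Override
        protected void setup(Context context) {
            solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", key.toString());
            doc.addField("text", value.toString());
            try {
                solr.add(doc);  // each add goes over this task's own HTTP connections
            } catch (SolrServerException e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void cleanup(Context context) {
            solr.shutdown();
        }
    }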


On 10/28/14 12:39, S.L wrote:

Will,

I think in one of your other emails (which I am not able to find) you had
asked if I was indexing directly from MapReduce jobs. Yes, I am indexing
directly from the map task, and that is done using SolrJ with a
CloudSolrServer initialized with the ZK ensemble URLs. Do I need to use
something like MapReduceIndexerTool, which I suppose writes to HDFS and
is then moved to the Solr index in a subsequent step? If so, why?

I don't use any soft commits and do an autocommit every 15 seconds; the
snippet from the configuration can be seen below.

  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
  </autoSoftCommit>

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>

I looked at the localhost_access.log file ,  all the GET and POST requests
have a sub-second response time.




On Tue, Oct 28, 2014 at 2:06 AM, Will Martin  wrote:


The easiest, and coarsest measure of response time [not service time in a
distributed system] can be picked up in your localhost_access.log file.
You're using tomcat write?  Lookup AccessLogValve in the docs and
server.xml. You can add configuration to report the payload and time to
service the request without touching any code.

Queueing theory is what Otis was talking about when he said you've
saturated your environment. In AWS people just auto-scale up and don't
worry about where the load comes from; its dumb if it happens more than 2
times. Capacity planning is tough, let's hope it doesn't disappear
altogether.

G'luck


-Original Message-
From: S.L [mailto:simpleliving...@gmail.com]
Sent: Monday, October 27, 2014 9:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
out of synch.

Good point about ZK logs , I do see the following exceptions
intermittently in the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
connection from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to
establish new session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO
[CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14949db9da40037 with negotiated timeout 1 for client
/xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x14949db9da40037, likely client has closed socket
 at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
 at

org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
 at java.lang.Thread.run(Thread.java:744)

For queuing theory, I don't know of any way to see how fast the requests
are being served by SolrCloud, or whether a queue builds up when the
service rate is slower than the rate of requests from the incoming multiple
threads.

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin  wrote:


2 naïve comments, of course.



-  Queuing theory

-  Zookeeper logs.



From: S.L [mailto:simpleliving...@gmail.com]
Sent: Monday, October 27, 2014 1:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
replicas out of synch.



Please find the clusterstate.json attached.

Also in this case at least the Shard1 replicas are out of sync, as can
be seen below.

Shard 1 replica 1 *does not* return a result with distrib=false.

Query
:http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:* <
http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%
28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debu
g=track&shards.info=true>
&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false
&debug=track&
shards.info=true



Result :

01*:*truefalsetrackxml(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)<
result name="response" numFound="0" start="0"/>



Shard1 replica 2 *does* return the result with distrib=false.

Query:
http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:* <
http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%
28id:9f4748c0-fe16-4

Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Michael Della Bitta

I'm curious, could you elaborate on the issue and the partial fix?

Thanks!

On 10/27/14 11:31, Markus Jelsma wrote:

It is an ancient issue. One of the major contributors to the issue was resolved 
some versions ago but we are still seeing it sometimes too, there is nothing to 
see in the logs. We ignore it and just reindex.

-Original message-

From:S.L 
Sent: Monday 27th October 2014 16:25
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of 
synch.

Thanks Otis,

I have checked the logs, in my case the default catalina.out, and I don't
see any OOMs or any other exceptions.

What other metrics do you suggest?

On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:


Hi,

You may simply be overwhelming your cluster-nodes. Have you checked
various metrics to see if that is the case?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




On Oct 26, 2014, at 9:59 PM, S.L  wrote:

Folks,

I have posted previously about this. I am using SolrCloud 4.10.1 and have a
sharded collection with 6 nodes, 3 shards and a replication factor of 2.

I am indexing into Solr using a Hadoop job. I have 15 map fetch tasks that
can each have up to 5 threads, so the load on the indexing side can get to
as high as 75 concurrent threads.

I am facing an issue where the replicas of a particular shard(s) are
consistently getting out of sync. Initially I thought this was because I
was using a custom component, but I did a fresh install, removed the
custom component and reindexed using the Hadoop job, and I still see the
same behavior.

I do not see any exceptions in my catalina.out, like OOMs or any other
exceptions. I suspect this could be because of the multi-threaded
indexing nature of the Hadoop job. I use CloudSolrServer from my Java
code to index, and initialize the CloudSolrServer with a 3 node ZK ensemble.

Does anyone know of any known issues with highly multi-threaded
indexing and SolrCloud?

Can someone help? This issue has been slowing things down on my end for a
while now.

Thanks and much appreciated!




Re: SolrCloud config question and zookeeper

2014-10-27 Thread Michael Della Bitta
You want external zookeepers. Partially because you don't want your Solr 
garbage collections holding up zookeeper availability, but also because 
you don't want your zookeepers going offline if you have to restart Solr 
for some reason.


Also, you want 3 or 5 zookeepers, not 4 or 8: ZooKeeper needs a strict majority
of the ensemble to stay up, so an even-sized ensemble tolerates no more failures
than the next smaller odd-sized one.

On 10/27/14 10:35, Bernd Fehling wrote:

While starting now with SolrCloud I tried to understand the sense
of external zookeeper.

Let's assume I want to split 1 huge collection accross 4 server.
My straight forward idea is to setup a cloud with 4 shards (one
on each server) and also have a replication of the shard on another
server.
server_1: shard_1, shard_replication_4
server_2: shard_2, shard_replication_1
server_3: shard_3, shard_replication_2
server_4: shard_4, shard_replication_3

In this configuration I always have all 4 shards available if
one server fails.

But now to zookeeper. I would start the internal zookeeper for
all shards including replicas. Does this make sense?


Or I only start the internal zookeeper for shard 1 to 4 but not
the replicas. Should be good enough, one server can fail, or not?


Or I follow the recommendations and install on all 4 server
an external seperate zookeeper, but what is the advantage against
having the internal zookeeper on each server?


I really don't get it at this point. Can anyone help me here?

Regards
Bernd




Re: Solr + HDFS settings

2014-10-27 Thread Michael Della Bitta
This doesn't answer your question, but unless something is changed, 
you're going to want to set this to false. It causes index corruption at 
the moment.


On 10/25/14 03:42, Norgorn wrote:

  true




Re: Solr replicas - stop replication and start again

2014-10-20 Thread Michael Della Bitta
Yes, that's what I'm suggesting. It seems a perfect fit for a single shard
collection with an offsite remote that you don't always want to write to.

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Oct 20, 2014 at 10:41 AM, andreic9203  wrote:

> Hello Michael,
>
> Do you want to say, the replication from solr, that with master-slave?
>
> Thank you,
> Andrei
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-replicas-stop-replication-and-start-again-tp4164931p4164965.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr replicas - stop replication and start again

2014-10-20 Thread Michael Della Bitta
Andrei,

I'm wondering if you've considered using Classic replication for this use
case. It seems better suited for it.

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Oct 20, 2014 at 9:53 AM, andreic9203  wrote:

> Another idea,
>
> I turned off the replica in which I want to insert data and then to process
> them, I started again, BUT, without -DzkHost, or -DzkRun, so the new
> started
> solr instance. I put my data into it, I stopped again, and I started with
> -DzkHost that points to my zoo keeper.
>
> But the problem is that the ZooKeeper doesn't know about the changes from
> the new replica, and voila, no replication, no nothing.
>
> Any idea?
>
> Thank you,
> Andrei
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-replicas-stop-replication-and-start-again-tp4164931p4164954.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Inconsistent response time

2014-10-03 Thread Michael Della Bitta
Hi Scott,

Any chance this could be an IPv6 thing? What if you start both server and 
client with this flag:

-Djava.net.preferIPv4Stack=true



Michael Della Bitta
Senior Software Engineer
o: +1 646 532 3062

appinions inc.
“The Science of Influence Marketing”

18 East 41st Street
New York, NY 10017
t: @appinions | g+: plus.google.com/appinions
w: appinions.com

On Oct 3, 2014, at 15:08, Scott Johnson  wrote:

> We are attempting to improve our Solr response time as our application uses
> Solr for large and time consuming queries. We have found a very inconsistent
> result in the time elapsed when pinging Solr. If we ping Solr from a desktop
> Windows 7 machine, there is usually a 5 ms elapsed time. But if we ping the
> same Solr instance from a Windows Server 2008 machine, it takes about 15 ms.
> This could be the difference between a 1 hour process and a 3 hour process,
> so it is something we would like to debug and fix if possible.
> 
> 
> 
> Does anybody have any ideas about why this might be? We get these same
> results pretty consistently (testing on multiple desktops and servers). One
> thing that seemed to have an impact is removing various additional JDKs that
> had been installed, and JDK 1.7u67 specifically seemed to make a difference.
> 
> 
> 
> Finally, the code we are using to test this is below. If there is a better
> test I would be curious to hear that as well.
> 
> 
> 
> Thanks,
> 
> 
> Scott
> 
> 
> 
> 
> 
> package solr;
> 
> 
> 
> import org.apache.commons.lang.StringUtils;
> 
> import org.apache.solr.client.solrj.SolrQuery;
> 
> import org.apache.solr.client.solrj.SolrRequest.METHOD;
> 
> import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
> 
> import org.apache.solr.client.solrj.impl.BinaryResponseParser;
> 
> import org.apache.solr.client.solrj.impl.HttpSolrServer;
> 
> import org.apache.solr.client.solrj.response.QueryResponse;
> 
> import org.apache.solr.client.solrj.response.SolrPingResponse;
> 
> import org.apache.solr.common.SolrDocumentList;
> 
> 
> 
> public class SolrTest {
> 
> 
> 
>private HttpSolrServer server;
> 
> 
> 
>/**
> 
>* @param args
> 
>* @throws Exception 
> 
> */
> 
>public static void main(String[] args) throws Exception {
> 
>SolrTest solr = new SolrTest(args);
> 
>// Run it a few times, the second time runs
> a lot faster.
> 
>for (int i=0; i<3; i++) {
> 
>solr.execute();
> 
>}
> 
>}
> 
> 
> 
>public SolrTest(String[] args) throws Exception {
> 
>String targetUrl = args[0];
> 
> 
> 
>System.out.println("=System
> properties=");
> 
>System.out.println("Start solr test " +
> targetUrl);
> 
> 
> 
>                server = new HttpSolrServer("http://" + targetUrl + ":8111/solr/search/");
> 
>server.setRequestWriter(new
> BinaryRequestWriter());
> 
>server.setParser(new
> BinaryResponseParser());
> 
>server.setAllowCompression(true);
> 
>server.setDefaultMaxConnectionsPerHost(128);
> 
>server.setMaxTotalConnections(128);
> 
> 
> 
>SolrPingResponse response = server.ping();
> 
>System.out.println("Ping time: " +
> response.getElapsedTime() + " ms");
> 
>System.out.println("Ping time: " +
> response.getElapsedTime() + " ms");
> 
>}
> 
> 
> 
>private void execute() throws Exception {
> 
>SolrQuery query = new SolrQuery();
> 
>query.setParam("start", "0");
> 
>query.setParam("rows", "1");
> 
> 
> 
>long startTime = System.currentTimeMillis();
> 
> 
> 
>QueryResponse queryResponse =
> server.query(query, METHOD.POST);
> 
> 
> 
>long

Re: Does Solr handle an sshfs mounted index

2014-10-02 Thread Michael Della Bitta
Grainne,

I would recommend that you do not do this. In fact, I would recommend you not 
use NFS as well, although that’s more likely to work, just not ideally. Solr’s 
going to do best when it’s working with fast, local storage that the OS can 
cache natively.

Michael Della Bitta
Senior Software Engineer
o: +1 646 532 3062

appinions inc.
“The Science of Influence Marketing”

18 East 41st Street
New York, NY 10017
t: @appinions | g+: plus.google.com/appinions
w: appinions.com

On Oct 2, 2014, at 14:44, Grainne  wrote:

> I am currently running Solr 4.4.0 on RHEL 6.  The index used to be mounted
> via nfs and it all worked perfectly fine.  For security reasons we switched
> the index to be sshfs mounted - and this seems to cause solr to fail after a
> while.  If we switch back to nfs it works again.
> 
> The behavior is strange - Solr starts up and issues an error:
> ...
> Oct 02, 2014 11:43:00 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
> ...
> Caused by: java.io.FileNotFoundException:
> /path/to/collection/data/index/_10_Lucene41_0.tim (Operation not permitted)
> ...
> 
> While Solr is running, if, as the same user, I look at the mounted path I
> get the same behavior:
> -bash-4.1$ ls /mounted/filesystem/path
> ls: reading directory /mounted/filesystem/path: Operation not permitted
> 
> When I shut down Solr it behaves as expected and I get the file listing. 
> The file is there and 
> 
> Several of us, including unix systems people, are looking at why this might
> be happening and have yet to figure it out.
> 
> Does anyone know if it possible to run Solr where the index is mounted via
> sshfs?  
> 
> Thanks for any advice,
> Grainne
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Does-Solr-handle-an-sshfs-mounted-index-tp4162375.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Upgrade from solr 4.4 to 4.10.1

2014-10-02 Thread Michael Della Bitta
Yes, you can just do something like curl
"http://mysolrserver:mysolrport/solr/mycollectionname/update?optimize=true";.
You should expect heavy disk activity while this completes. I wouldn't do
more than one collection at a time.

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Thu, Oct 2, 2014 at 12:55 PM, Grainne  wrote:

> Hi Michael,
>
> Thanks for the quick response. Running optimize on the index sounds like a
> good idea.  Do you know if  that is possible from the command line?
>
> I agree it is an omission to not be easily able to reindex files and that
> is
> a story I need to prioritize.
>
> Thanks again,
> Grainne
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Upgrade-from-solr-4-4-to-4-10-1-tp4162340p4162359.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Upgrade from solr 4.4 to 4.10.1

2014-10-02 Thread Michael Della Bitta
You should of course perform a test first to be sure, but you shouldn't
need to reindex. Running an optimize on your cores or collections will
upgrade them to the new format, or you could use Lucene's IndexUpgrader
tool. In the meantime, bringing up your data in 4.10.1 will work, it just
won't take advantage of some of the file format improvements.
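
If you would rather not run an optimize, Lucene's IndexUpgrader can be pointed at
each core's index directory while the core is offline, along these lines (the jar
name and paths depend on your install):

    java -cp lucene-core-4.10.1.jar org.apache.lucene.index.IndexUpgrader -verbose /path/to/core/data/index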

However, it is somewhat of a design smell that you can't reindex. In my
experience, it is extremely valuable to be able to reindex your data at
will.

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Thu, Oct 2, 2014 at 12:06 PM, Grainne  wrote:

> I need to upgrade from Solr 4.4 to version 4.10.1 and am not sure if I need
> to reindex.
>
> The following from http://wiki.apache.org/solr/Solr4.0 leads me to
> believe I
> don't:
> "The guarantee for this alpha release is that the index format will be the
> 4.0 index format, supported through the 5.x series of Lucene/Solr, unless
> there is a critical bug (e.g. that would cause index corruption) that would
> prevent this."
>
> I've been looking through the change logs and news and the following from
> http://lucene.apache.org/solr/solrnews.html makes me think that maybe I do
> need to reindex:
> "Solr 4.6 Release Highlights:
> ...
> New default index format: Lucene46Codec
> ..."
>
> It will not be an easy task to reindex the files so I am hoping the answer
> is that it is not necessary.
>
> Thanks for any advice,
> Grainne
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Upgrade-from-solr-4-4-to-4-10-1-tp4162340.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SolrCloud Slow to boot up

2014-09-25 Thread Michael Della Bitta
1. What version of Solr are you running?
2. Have you made substantial changes to solrconfig.xml?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Thu, Sep 25, 2014 at 7:19 AM, anand.mahajan  wrote:

> Hello all,
>
> Hosted a SolrCloud - 6 nodes - 36 shards x 3 replicas each -> 108 cores
> across 6 servers. Moved about 250M documents into this cluster. When I
> restart this cluster, only the leaders per shard come up live instantly
> (within a minute) and all the replicas are shown as Recovering on the Cloud
> screen, and all 6 servers are doing some processing (consuming about 4 CPUs
> at the back and doing a lot of network IO too). In essence it's not doing any
> reads or writes to the index and I don't see any replication/catch-up
> activity going on at the back, yet the RAM grows, consuming all 96GB
> available on each box. And all the Recovering replicas recover one by one in
> about an hour or so. Why is it taking so long to boot up, and what is it
> doing that is consuming so much CPU, RAM and network IO? All disks are
> reading at 100% on all servers during this boot-up. Is there a setting I
> might have missed that will help?
>
> FYI - The Zookeeper cluster is on the same 6 boxes.  Size of the Solr data
> dir is about 150GB per server and each box has 96GB RAM.
>
> Thanks,
> Anand
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-Slow-to-boot-up-tp4161098.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr and hadoop

2014-09-25 Thread Michael Della Bitta
Yes, there's SolrInputDocumentWritable and MapReduceIndexerTool, plus the
Morphline stuff (check out
https://github.com/markrmiller/solr-map-reduce-example).

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Thu, Sep 25, 2014 at 9:58 AM, Tom Chen  wrote:

> I wonder if Solr has InputFormat and OutputFormat like the EsInputFormat
> and EsOutputFormat that are provided by Elasticserach for Hadoop
> (es-hadoop).
>
> Is it possible for Solr to provide such integration with Hadoop?
>
> Best,
> Tom
>


Re: Performance of Unsorted Queries

2014-09-16 Thread Michael Della Bitta
Performance would be better getting them all at the same time, but the
behavior would kind of stink (long pause before a response, big results
stuck in memory, etc).

If you're using a relatively up-to-date version of Solr, you should check
out the "cursormark" feature:
https://wiki.apache.org/solr/CommonQueryParameters#Deep_paging_with_cursorMark

That's the magic knock that will get you what you want.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Sep 16, 2014 at 11:03 AM, Ilya Bernshteyn  wrote:

> If I query for IDs and I do not care about order, should I still expect
> better performance paging the results? (e.g. rows=1000 or rows=1) The
> use case is that I need to get all of the IDs regardless (there will be
> thousands, maybe 10s of thousands, but not millions)
>
> Example query:
>
>
> http://domain/solr/select?q=ACCT_ID%3A1153&fq=SOME_FIELD%3SomeKeyword%2C+SOME_FIELD_2%3ASomeKeyword&rows=1&fl=ID&wt=json
>
> With this kind of query, I notice that rows=10 returns in 5ms, while
> rows=1 (producing about 7000 results) returns in about 500ms.
>
> Another way to word my question, if I have 100k not ordered IDs to
> retrieve, is performance better getting 1k at a time or all 100k at the
> same time?
>
> Thanks,
>
> Ilya
>


Re: Moving to HDFS, How to merge indices from 8 servers ?‏‏

2014-09-15 Thread Michael Della Bitta
If all you need is better availability, I would start by trying out an
additional replica of each shard on a different box, so each box would be
serving the data for 2 shards and each shard would be available on 2 boxes.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Sep 15, 2014 at 1:29 PM, Amey - codeinventory <
ameyjad...@codeinventory.com> wrote:

> Well, I have 8 m1.large EC2 instances with 2 cores, 7GB RAM and 1TB EBS attached to
> each server for the index.
>
> In my case I don't expect the index to be stored in RAM, nor a quick reply, as
> it's not a real-time application; I just want fault tolerance in the application
> and availability of the full data.
>
>
> Is it good to use HDFS over normal solr cloud?
>
> Best,
> Amey
>
> --- Original Message ---
>
> From: "Michael Della Bitta" 
> Sent: September 15, 2014 9:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Moving to HDFS, How to merge indices from 8 servers ?‏‏
>
> There's not much about Solr Cloud or HDFS indexes that suggests you should
> only have one logical shard. If your goal is better uptime with a sharded
> index, you should add more replicas.
>
> If your collection is small enough that one machine can serve one query
> with acceptable performance, but you want to scale to many queries, then
> just adding mirrors of a single-sharded collection is fine. But that's a
> big "if."
>
> Switching to HDFS is an option if you have enough RAM for your whole
> collection, and have a lot of existing storage devoted to HDFS, or if you
> want to batch create indexes. It's not really aimed at preserving uptime as
> far as I know.
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions
> <
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> w: appinions.com <http://www.appinions.com/>
>
> On Mon, Sep 15, 2014 at 11:23 AM, Amey Jadiye <
> ameyjad...@codeinventory.com>
> wrote:
>
> > Thanks for reply Erik,
> > I think i have some misconfusion about how SOLR works with HDFS, and
> > solution i am thinking could be reorganised  by user community :)
> > Here is the actual solution/situation which is implemented by me
> > *Usecase* : I need a google like search engine which should be work in
> > distributed and fault tolerant mode, we are collecting the health related
> > URLs from a third party system in large amount, approx 1Million/hour. we
> > want to build an inventory which contains all of there detail. now i am
> > fetching that URL data breaking it in H1, P, Div like tags with help of
> > Jsoup lib and putting in Solr as a documents with different boost to
> > different fields.
> > Now after the putting this data, i have a custom program with which we
> > categorise all the data Example. All the cancer related pages, i am
> > querying the SOLR and fetching all URL related to cancer with CursorMark
> > and putting in a file for further use of our system.
> > *Old Solution* : For this i have build the 8 SOLR servers with 3
> > zookeepers on the individual AWS Ec2 instances with one collection:8
> shards
> > problem with this solution is whenever any instance go down i am loosing
> > that data for a moment. link of current solution
> > http://postimg.org/image/luli3ybtj/
> > *New _OR_ could be faulty solution* : I am thinking that if i use HDFS
> > which is virtually only one file system is better so if my server go down
> > that data is available through another server, below is steps i am
> thinking
> > to do.
> > 1 > I will merge all the 8 server  indices somewhere in to one.2 > Make
> > setting for HDFS on same 8 servers.3 > Put the merged index folder in
> HDFS
> > so it will be distributed in 8 servers physically it self.4 > Restart 8
> > servers pointing to HDFS on each instance.5 > and now i am ready to go
> for
> > putting data on 8 servers and fetching through any one of SOLR , if that
> is
> > down choose another so it will be guaranteed to get all the data.
> > So is this solution sounds good, OR you guys suggest me another bett

Re: Moving to HDFS, How to merge indices from 8 servers ?‏‏

2014-09-15 Thread Michael Della Bitta
There's not much about Solr Cloud or HDFS indexes that suggests you should
only have one logical shard. If your goal is better uptime with a sharded
index, you should add more replicas.

If your collection is small enough that one machine can serve one query
with acceptable performance, but you want to scale to many queries, then
just adding mirrors of a single-sharded collection is fine. But that's a
big "if."

Switching to HDFS is an option if you have enough RAM for your whole
collection, and have a lot of existing storage devoted to HDFS, or if you
want to batch create indexes. It's not really aimed at preserving uptime as
far as I know.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Sep 15, 2014 at 11:23 AM, Amey Jadiye 
wrote:

> Thanks for the reply Erick,
> I think I have some confusion about how SOLR works with HDFS, and the
> solution I am thinking of could be reorganised by the user community :)
> Here is the actual solution/situation which is implemented by me.
> *Use case*: I need a Google-like search engine which should work in
> distributed and fault-tolerant mode. We are collecting health-related
> URLs from a third-party system in large amounts, approx 1 million/hour. We
> want to build an inventory which contains all of their details. Right now I am
> fetching that URL data, breaking it into H1, P, Div like tags with the help of
> the Jsoup lib, and putting it in Solr as documents with different boosts on
> different fields.
> Now, after putting in this data, I have a custom program with which we
> categorise all the data. Example: for all the cancer-related pages, I am
> querying SOLR and fetching all URLs related to cancer with CursorMark
> and putting them in a file for further use by our system.
> *Old solution*: For this I have built the 8 SOLR servers with 3
> zookeepers on individual AWS EC2 instances with one collection: 8 shards.
> The problem with this solution is that whenever any instance goes down I am
> losing that data for a moment. Link to the current solution:
> http://postimg.org/image/luli3ybtj/
> *New _OR_ could-be-faulty solution*: I am thinking that if I use HDFS,
> which is virtually only one file system, it is better, so if my server goes
> down that data is available through another server. Below are the steps I am
> thinking of doing.
> 1 > I will merge all the 8 servers' indices somewhere into one.
> 2 > Make settings for HDFS on the same 8 servers.
> 3 > Put the merged index folder in HDFS so it will be distributed across the
> 8 servers physically itself.
> 4 > Restart the 8 servers pointing to HDFS on each instance.
> 5 > And now I am ready to go for putting data on the 8 servers and fetching
> through any one SOLR; if that one is down, choose another, so it is
> guaranteed to get all the data.
> So does this solution sound good, OR do you guys suggest another better
> solution?
> Regards, Amey
>
>
> > Date: Thu, 11 Sep 2014 14:41:48 -0700
> > Subject: Re: Moving to HDFS, How to merge indices from 8 servers ?‏‏
> > From: erickerick...@gmail.com
> > To: solr-user@lucene.apache.org
> >
> > Um, I really think this is pretty likely to not be a great solution.
> > When you say "merge indexes", I'm thinking you want to go from 8
> > shards to 1 shard. Now, this can be done with the "merge indexes" core
> > admin API, see:
> > https://wiki.apache.org/solr/MergingSolrIndexes
> >
> > BUT.
> > 1>  This will break all things SolrCloud-ish assuming you created your
> > 8 shards under SolrCloud.
> > 2> Solr is usually limited by memory, so trying to fit enough of your
> > single huge index into memory may be problematical.
> >
> > This feels like an XY problem, _why_ are you asking about this? What
> > is the use-case you want to handle by this?
> >
> > Best,
> > Erick
> >
> > On Thu, Sep 11, 2014 at 7:44 AM, Amey Jadiye
> >  wrote:
> > > FYI, I searched Google for this problem but didn't find any
> satisfactory answer. Here is the current situation: I have 8 shards in
> my Solr cloud backed by 3 zookeepers, all set up on AWS EC2
> instances; all 8 are leaders with no replicas. I have only 1 collection, say
> collection1, divided into 8 shards. I have configured the index and tlog
> folders on each server pointing to a 1TB EBS disk attached to each server;
> all 8 servers have around 100GB each for the index folder, so the total index
> files I have is ~80

Re: Solr Exceptions -- "immense terms"

2014-09-15 Thread Michael Della Bitta
I just came back to this because I figured out you're trying to just store
this text. Now I'm baffled. How big is it? :)

Not sure why an analyzer is running if you're just storing the content.
Maybe you should post your whole schema.xml... there could be a copyfield
that's dumping the text into a different field that has the keyword
tokenizer?
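
For illustration only (field names are made up), a fragment like this would
reproduce the symptom, because the copyField target is still indexed as one
giant term even though the source field isn't indexed:

  <field name="content" type="string" indexed="false" stored="true"/>
  <field name="content_exact" type="string" indexed="true" stored="false"/>
  <copyField source="content" dest="content_exact"/>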

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Sep 15, 2014 at 10:37 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> If you're using a String fieldtype, you're not indexing it so much as
> dumping the whole content blob in there as a single term for exact
> matching.
>
> You probably want to look at one of the text field types for textual
> content.
>
> That doesn't explain the difference in behavior between Solr versions, but
> my hunch is that you'll be happier in general with the behavior of a field
> type that does tokenizing and stemming for plain text search anyway.
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>
> On Mon, Sep 15, 2014 at 10:06 AM, Christopher Gross 
> wrote:
>
>> Solr 4.9.0
>> Java 1.7.0_49
>>
>> I'm indexing an internal Wiki site.  I was running on an older version of
>> Solr (4.1) and wasn't having any trouble indexing the content, but now I'm
>> getting errors:
>>
>> SCHEMA:
>> > required="true"/>
>>
>> LOGS:
>> Caused by: java.lang.IllegalArgumentException: Document contains at least
>> one immense term in field="content" (whose UTF8 encoding is longer than
>> the
>> max length 32766), all of which were skipped.  Please correct the analyzer
>> to not produce such terms.  The prefix of the first immense term is: '[60,
>> 33, 45, 45, 32, 98, 111, 100, 121, 67, 111, 110, 116, 101, 110, 116, 32,
>> 45, 45, 62, 10, 9, 9, 9, 60, 100, 105, 118, 32, 115]...', original
>> message:
>> bytes can be at most 32766 in length; got 183250
>> 
>> Caused by:
>> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes
>> can be at most 32766 in length; got 183250
>>
>> I was indexing it, but I switched that off (as you can see above) but it
>> still is having problems.  Is there a different type I should use, or a
>> different analyzer?  I imagine that there is a way to index very large
>> documents in Solr.  Any recommendations would be helpful.  Thanks!
>>
>> -- Chris
>>
>
>


Re: Solr Exceptions -- "immense terms"

2014-09-15 Thread Michael Della Bitta
If you're using a String fieldtype, you're not indexing it so much as
dumping the whole content blob in there as a single term for exact
matching.

You probably want to look at one of the text field types for textual
content.

That doesn't explain the difference in behavior between Solr versions, but
my hunch is that you'll be happier in general with the behavior of a field
type that does tokenizing and stemming for plain text search anyway.
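
As a rough sketch (type and field names here are just examples, not a
drop-in fix), a tokenized definition along these lines sidesteps the
single-huge-term limit and gives you normal full-text matching:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="content" type="text_general" indexed="true" stored="true"/>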

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Mon, Sep 15, 2014 at 10:06 AM, Christopher Gross 
wrote:

> Solr 4.9.0
> Java 1.7.0_49
>
> I'm indexing an internal Wiki site.  I was running on an older version of
> Solr (4.1) and wasn't having any trouble indexing the content, but now I'm
> getting errors:
>
> SCHEMA:
>  required="true"/>
>
> LOGS:
> Caused by: java.lang.IllegalArgumentException: Document contains at least
> one immense term in field="content" (whose UTF8 encoding is longer than the
> max length 32766), all of which were skipped.  Please correct the analyzer
> to not produce such terms.  The prefix of the first immense term is: '[60,
> 33, 45, 45, 32, 98, 111, 100, 121, 67, 111, 110, 116, 101, 110, 116, 32,
> 45, 45, 62, 10, 9, 9, 9, 60, 100, 105, 118, 32, 115]...', original message:
> bytes can be at most 32766 in length; got 183250
> 
> Caused by:
> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes
> can be at most 32766 in length; got 183250
>
> I was indexing it, but I switched that off (as you can see above) but it
> still is having problems.  Is there a different type I should use, or a
> different analyzer?  I imagine that there is a way to index very large
> documents in Solr.  Any recommendations would be helpful.  Thanks!
>
> -- Chris
>


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Michael Della Bitta
If that's your problem, I bet all you have to do is twiddle on one of the
catenate options, either catenateWords or catenateAll.
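
As a sketch only (this is not your actual schema), the index-time filter
might look like:

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          splitOnCaseChange="1"
          catenateWords="1"/>

With catenateWords on, "MacBook" emits "Mac", "Book" and the catenated
"MacBook" (lowercased later in the chain), so the query can match content
indexed either as "mac book" or as "macbook".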

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind  wrote:

> Thanks for the response.
>
> I understand the problem a little bit better after investigating more.
>
> Posting my full field definitions is, I think, going to be confusing, as
> they are long and complicated. I can narrow it down to an isolation case if
> I need to. My indexed field in question is relatively short strings.
>
> But what it's got to do with is the WordDelimiterFilter's default
> splitOnCaseChange=1 and generateWordParts=1, and the effects of such.
>
> Let's take a less confusing example, query "MacBook". With a
> WordDelimiterFilter followed by something that downcases everything.
>
> I think what the WDF (followed by case folding) is trying to do is make
> query "MacBook" match both indexed text "mac book" as well as "macbook" --
> either one should be a match. Is my understanding right of what
> WordDelimiterfilter with splitOnCaseChange=1 and generateWordParts=1 is
> intending to do?
>
> In my actual index, query "MacBook" is matching ONLY "mac book", and not
> "macbook".  Which is unexpected. I indeed want it to match both. (I realize
> I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or
> generateWordParts=0).
>
> It's possible this is happening as a side effect of other parts of my
> complex field definition, and I really do need to post hte whole thing
> and/or isolate it. But I wonder if there are known general problem cases
> that cause this kind of failure, or any known bugs in WordDelimiterFilter
> (in Solr 4.3?) that cause this kind of failure.
>
> And I wonder if WordDelimiter filter spitting out the token "MacBook" with
> position "2" rather than "1" is expected, irrelevant, or possibly a
> relevant problem.
>
> Thanks again,
>
> Jonathan
>
>
> On 9/2/14 12:59 PM, Michael Della Bitta wrote:
>
>> Hi Jonathan,
>>
>> Little confused by this line:
>>
>>  And, what I think it's trying to do, is match text indexed as "d elalain"
>>>
>> as well as text indexed by "delalain".
>>
>> In this case, I don't know how WordDelimiterFilter will help, as you're
>> likely tokenizing on spaces somewhere, and that input text has a space. I
>> could be wrong. It's probably best if you post your field definition from
>> your schema.
>>
>> Also, is this a free-text field, or something that's more like a short
>> string?
>>
>> Thanks,
>>
>>
>> Michael Della Bitta
>>
>> Applications Developer
>>
>> o: +1 646 532 3062
>>
>> appinions inc.
>>
>> “The Science of Influence Marketing”
>>
>> 18 East 41st Street
>>
>> New York, NY 10017
>>
>> t: @appinions <https://twitter.com/Appinions> | g+:
>> plus.google.com/appinions
>> <https://plus.google.com/u/0/b/112002776285509593336/
>> 112002776285509593336/posts>
>> w: appinions.com <http://www.appinions.com/>
>>
>>
>>
>> On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind 
>> wrote:
>>
>>  Hello, I'm running into a case where a query is not returning the results
>>> I expect, and I'm hoping someone can offer some explanation that might
>>> help
>>> me fine tune things or understand what's up.
>>>
>>> I am running Solr 4.3.
>>>
>>> My filter chain includes a WordDelimiterFilter and, later a filter that
>>> downcases everything for case-insensitive searching. It includes many
>>> other
>>> things too, but I think these are the pertinent facts.
>>>
>>> For query "dELALAIN", the WordDelimiterFilter splits into:
>>>
>>> text: d
>>> start: 0
>>> position: 1
>>>
>>> text: ELALAIN
>>> start: 1
>>> position: 2
>>>
>>> text: dELALAIN
>>> start: 0
>>> position: 2
>>>
>>> Note the duplication/overlap of the tokens -- one version with "

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Michael Della Bitta
Hi Jonathan,

Little confused by this line:

> And, what I think it's trying to do, is match text indexed as "d elalain"
as well as text indexed by "delalain".

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind  wrote:

> Hello, I'm running into a case where a query is not returning the results
> I expect, and I'm hoping someone can offer some explanation that might help
> me fine tune things or understand what's up.
>
> I am running Solr 4.3.
>
> My filter chain includes a WordDelimiterFilter and, later a filter that
> downcases everything for case-insensitive searching. It includes many other
> things too, but I think these are the pertinent facts.
>
> For query "dELALAIN", the WordDelimiterFilter splits into:
>
> text: d
> start: 0
> position: 1
>
> text: ELALAIN
> start: 1
> position: 2
>
> text: dELALAIN
> start: 0
> position: 2
>
> Note the duplication/overlap of the tokens -- one version with "d" and
> "ELALAIN" split into two tokens, and another with just one token.
>
> Later, all the tokens are lowercased by another filter in the chain.
> (actually an ICU filter which is doing something more complicated than just
> lowercasing, but I think we can consider it lowercasing for the purposes of
> this discussion).
>
> If I understand right what the WordDelimiterFilter is trying to do here,
> it's probably doing something special because of the lowercase "d" followed
> by an uppercase letter, a special case for that. (I don't get this behavior
> with other mixed case queries not beginning with 'd').
>
> And, what I think it's trying to do, is match text indexed as "d elalain"
> as well as text indexed by "delalain".
>
> The problem is, it's not accomplishing that -- it is NOT matching text
> that was indexed as "delalain" (one token).
>
> I don't entirely understand what the "position" attribute is for -- but I
> wonder if in this case, the position on "dELALAIN" is really supposed to be
> 1, not 2?  Could that be responsible for the bug?  Or is position
> irrelevant in this case?
>
> If that's not it, then I'm at a loss as to what may be causing this bug --
> or even if it's a bug at all, or I'm just not understanding intended
> behavior. I expect a query for "dELALAIN" to match text indexed as
> "delalain" (because of the forced lowercasing in the filter chain). But
> it's not doing so. Are my expectations wrong? Bug? Something else?
>
> Thanks for any advice,
>
> Jonathan
>


Re: Solr and HDFS

2014-08-29 Thread Michael Della Bitta
Are you sure you have your namenode URL set correctly? Usually the namenode
URL points at port 8020, whereas the RPC port is 8022.
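
For reference, a minimal directoryFactory block in solrconfig.xml usually
looks something like this (host and paths here are placeholders):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode.example.com:8020/solr</str>
    <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  </directoryFactory>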

To answer the original question, CDH 5 ships with a patched Solr 4.6. If
that's the version you want to run, you might as well use theirs. I have a
testing cluster with 4.8.1 pointing at a CDH 5 HDFS, and a production
cluster with 4.9 as well.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, Aug 29, 2014 at 12:59 PM, nagyMarcelo 
wrote:

> I'm having the same problem.
> This is what is thrown.
>
> RuntimeException: Problem creating directory:
> hdfs://
> 10.58.10.147:8022/usr/solr/FavorecidoHdfs/core_node1/BY9/lws-main/data/solr/cores/favorecidohdfs_6/data/FavorecidoHdfs_shard1_replica1
> at
> org.apache.solr.store.hdfs.HdfsDirectory.(HdfsDirectory.java:89)
> at
>
> org.apache.solr.core.HdfsDirectoryFactory.create(HdfsDirectoryFactory.java:148)
> at
>
> org.apache.solr.core.CachingDirectoryFactory.get(CachingDirectoryFactory.java:350)
> at org.apache.solr.core.SolrCore.getNewIndexDir(SolrCore.java:267)
> at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:477)
> at org.apache.solr.core.SolrCore.(SolrCore.java:772)
> ... 10 more
> Caused by: java.io.IOException: Failed on local exception:
> com.google.protobuf.InvalidProtocolBufferException: Message missing
> required
> fields: callId, status; Host Details : local host is:
> "scixd0021cld.itau/10.58.10.147"; destination host is:
> "scixd0021cld.itau":8022;
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-and-HDFS-tp4155470p4155872.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Copying a collection from one version of SOLR to another

2014-08-25 Thread Michael Della Bitta
Hi Philippe,

You can indeed copy an index like that. The problem probably arises because
4.9.0 is using core discovery by default. This wiki page will shed some
light:

https://wiki.apache.org/solr/Core%20Discovery%20%284.4%20and%20beyond%29

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Aug 25, 2014 at 4:31 AM,  wrote:

>
> Hello,
>
> is it possible to copy a collection created with SOLR 4.6.0 to a SOLR
> 4.9.0 server?
>
> I have just copied a collection called 'collection3', located in
> solr4.6.0/example/solr,  to solr4.9.0/example/solr, but to no avail,
> because my SOLR 4.9.0 Server's admin does not list it among the available
> cores.
>
> What am I doing wrong?
>
> Many thanks.
>
> Philippe
>
>


Re: Questions about caching and HDFSDirectory

2014-08-25 Thread Michael Della Bitta
Just in case someone else runs into this post, I think the following two
URLs have me sorted:

http://techkites.blogspot.com/2014/06/performance-tuning-and-optimization-for.html

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/1.1.0-beta2/Cloudera-Search-User-Guide/csug_tuning_solr.html

If anyone has anything to add or correct about these two resources, please
let me know!
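
To summarize what I took away from them, a configuration roughly like this
enables the block cache (the values are placeholders; the slab count and
-XX:MaxDirectMemorySize on the JVM command line have to be sized for your
own boxes):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode.example.com:8020/solr</str>
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <bool name="solr.hdfs.blockcache.global">true</bool>
    <int name="solr.hdfs.blockcache.slab.count">4</int>
    <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
  </directoryFactory>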


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, Aug 22, 2014 at 3:54 PM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> I'm looking at the Solr Reference Guide about Solr on HDFS, and it's
> bringing up a couple of quick questions for me. I guess I got spoiled by
> MMapDirectory and how magically it worked!
>
> 1. What is the minimum number of configuration parameters that enables
> HDFS block caching? It seems like I need to set XX:MaxDirectMemorySize when
> launching Solr, and then for every collection I want to be able to use
> caching with, I need to be sure that the Block Cache Settings are enabled
> based on defaults, save for
> solr.hdfs.blockcache.write.enabled should be false.
>
> 2. If I use solr.hdfs.blockcache.global, is the slab count still per core,
> or does it apply to everything, or is it no longer relevant?
>
> 3. Is there a sneaky way of ensuring a given collection or core loads
> first so no other cores accidentally override the global blockcache setting?
>
> 4. In terms of -XX:MaxDirectMemorySize and
> solr.hdfs.blockcache.slab.count, is there some percentage of system ram or
> some overall maximum beyond which it no longer achieves benefits, or can I
> actually just tune this to be nearly all of the ram minus the JVM's
> overhead and the RAM needed by the system? Or can it even be set higher
> than the overall RAM just to be sure?
>
> Thanks,
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>


Questions about caching and HDFSDirectory

2014-08-22 Thread Michael Della Bitta
I'm looking at the Solr Reference Guide about Solr on HDFS, and it's
bringing up a couple of quick questions for me. I guess I got spoiled by
MMapDirectory and how magically it worked!

1. What is the minimum number of configuration parameters that enables HDFS
block caching? It seems like I need to set -XX:MaxDirectMemorySize when
launching Solr, and then for every collection I want to be able to use
caching with, I need to be sure that the Block Cache Settings are enabled
based on defaults, except that
solr.hdfs.blockcache.write.enabled should be false.

2. If I use solr.hdfs.blockcache.global, is the slab count still per core,
or does it apply to everything, or is it no longer relevant?

3. Is there a sneaky way of ensuring a given collection or core loads first
so no other cores accidentally override the global blockcache setting?

4. In terms of -XX:MaxDirectMemorySize and solr.hdfs.blockcache.slab.count,
is there some percentage of system ram or some overall maximum beyond which
it no longer achieves benefits, or can I actually just tune this to be
nearly all of the ram minus the JVM's overhead and the RAM needed by the
system? Or can it even be set higher than the overall RAM just to be sure?

Thanks,

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


Re: Auto Complete

2014-08-06 Thread Michael Della Bitta
You'd still need to modify that schema to use the ASCII folding filter.

Alternatively, if you want something off the shelf, you might check out
Sematext's autocomplete product:
http://www.sematext.com/products/autocomplete/index.html

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Wed, Aug 6, 2014 at 10:56 AM, benjelloun  wrote:

> Hello, thanks for the tutorial. I tested all the schemas but it's not what I need.
> What I need is to autocomplete with autocorrection like I said before:
> q="gene" --> autocomplete "genève" with accent
>
>
> 2014-08-05 18:03 GMT+02:00 Michael Della Bitta-2 [via Lucene] <
> ml-node+s472066n4151261...@n3.nabble.com>:
>
> > In this case, I recommend using the approach that this tutorial uses:
> >
> >
> http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
> >
> > Basically the idea is you index the data a few different ways and then
> use
> > edismax to query them all with different boosts. You'd use the stored
> > version of you field for display, so your accented characters would not
> > get
> > stripped.
> >
> > Michael Della Bitta
> >
> > Applications Developer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions <https://twitter.com/Appinions> | g+:
> > plus.google.com/appinions
> > <
> >
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> >
> > w: appinions.com <http://www.appinions.com/>
> >
> >
> > On Tue, Aug 5, 2014 at 9:32 AM, benjelloun <[hidden email]
> > <http://user/SendEmail.jtp?type=node&node=4151261&i=0>> wrote:
> >
> > > yeah thats true i creat this index just for auto complete
> > > here is my schema:
> > >
> > >  > > required="false" multiValued="true"/>
> > >  > > required="false" multiValued="true"/>
> > >  > > required="false" multiValued="true"/>
> > >
> > > 
> > > 
> > > 
> > >
> > > the i use "suggestField" for autocomplet like i mentioned above
> > > do you have any other configuration which can do what i need ?
> > >
> > >
> > >
> > > 2014-08-05 15:19 GMT+02:00 Michael Della Bitta-2 [via Lucene] <
> > > [hidden email] <http://user/SendEmail.jtp?type=node&node=4151261&i=1
> >>:
> > >
> > > > Unless I'm mistaken, it seems like you've created this index
> > specifically
> > > > for autocomplete? Or is this index used for general search also?
> > > >
> > > > The easy way to understand this question: Is there one entry in your
> > > index
> > > > for each term you want to autocomplete? Or are there multiple entries
> > > that
> > > > might contain the same term?
> > > >
> > > > Michael Della Bitta
> > > >
> > > > Applications Developer
> > > >
> > > > o: +1 646 532 3062
> > > >
> > > > appinions inc.
> > > >
> > > > “The Science of Influence Marketing”
> > > >
> > > > 18 East 41st Street
> > > >
> > > > New York, NY 10017
> > > >
> > > > t: @appinions <https://twitter.com/Appinions> | g+:
> > > > plus.google.com/appinions
> > > > <
> > > >
> > >
> >
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> > > >
> > > >
> > > > w: appinions.com <http://www.appinions.com/>
> > > >
> > > >
> > > > On Tue, Aug 5, 2014 at 9:10 AM, benjelloun <[hidden email]
> > > > <http://user/SendEmail.jtp?type=node&node=4151216&i=0>> wrote:
> > > >
> > > > > hello,
> > > > >
> > > > > did you find any solution to this problem ?
> > > > >
> > > > > regards
> > > > >
> &

Re: solr over hdfs for accessing/ changing indexes outside solr

2014-08-05 Thread Michael Della Bitta
Probably the "most correct" way to modify the index would be to use the
Solr REST API to push your changes out.
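
For example, something like this posted to the collection's /update handler
(the id and field name are made up) sets a single field value; note that
atomic updates like this need the other fields to be stored so Solr can
rebuild the document, and changes should go through Solr rather than by
touching the index files on HDFS directly:

  <add>
    <doc>
      <field name="id">doc-123</field>
      <field name="category" update="set">oncology</field>
    </doc>
  </add>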

Another thing you might want to look at is Lily. Basically it's a way to
set up a Solr collection as an HBase replication target, so changes to your
HBase table would automatically propagate over to Solr.

http://www.ngdata.com/on-lily-hbase-hadoop-and-solr/

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Aug 5, 2014 at 9:04 AM, Ali Nazemian  wrote:

> Dear all,
> Hi,
> I changed solr 4.9 to write index and data on hdfs. Now I am going to
> connect to those data from the outside of solr for changing some of the
> values. Could somebody please tell me how that is possible? Suppose I am
> using HBase over HDFS to do these changes.
> Best regards.
>
> --
> A.Nazemian
>


Re: Auto Complete

2014-08-05 Thread Michael Della Bitta
In this case, I recommend using the approach that this tutorial uses:

http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/

Basically the idea is you index the data a few different ways and then use
edismax to query them all with different boosts. You'd use the stored
version of you field for display, so your accented characters would not get
stripped.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Aug 5, 2014 at 9:32 AM, benjelloun  wrote:

> Yeah, that's true, I created this index just for autocomplete.
> here is my schema:
>
>  required="false" multiValued="true"/>
>  required="false" multiValued="true"/>
>  required="false" multiValued="true"/>
>
> 
> 
> 
>
> Then I use "suggestField" for autocomplete like I mentioned above.
> Do you have any other configuration which can do what I need?
>
>
>
> 2014-08-05 15:19 GMT+02:00 Michael Della Bitta-2 [via Lucene] <
> ml-node+s472066n4151216...@n3.nabble.com>:
>
> > Unless I'm mistaken, it seems like you've created this index specifically
> > for autocomplete? Or is this index used for general search also?
> >
> > The easy way to understand this question: Is there one entry in your
> index
> > for each term you want to autocomplete? Or are there multiple entries
> that
> > might contain the same term?
> >
> > Michael Della Bitta
> >
> > Applications Developer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions <https://twitter.com/Appinions> | g+:
> > plus.google.com/appinions
> > <
> >
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> >
> > w: appinions.com <http://www.appinions.com/>
> >
> >
> > On Tue, Aug 5, 2014 at 9:10 AM, benjelloun <[hidden email]
> > <http://user/SendEmail.jtp?type=node&node=4151216&i=0>> wrote:
> >
> > > hello,
> > >
> > > did you find any solution to this problem ?
> > >
> > > regards
> > >
> > >
> > > 2014-08-04 16:16 GMT+02:00 Michael Della Bitta-2 [via Lucene] <
> > > [hidden email] <http://user/SendEmail.jtp?type=node&node=4151216&i=1
> >>:
> > >
> > > > How are you implementing autosuggest? I'm assuming you're querying an
> > > > indexed field and getting a stored value back. But there are a wide
> > > > variety
> > > > of ways of doing it.
> > > >
> > > > Michael Della Bitta
> > > >
> > > > Applications Developer
> > > >
> > > > o: +1 646 532 3062
> > > >
> > > > appinions inc.
> > > >
> > > > “The Science of Influence Marketing”
> > > >
> > > > 18 East 41st Street
> > > >
> > > > New York, NY 10017
> > > >
> > > > t: @appinions <https://twitter.com/Appinions> | g+:
> > > > plus.google.com/appinions
> > > > <
> > > >
> > >
> >
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> > > >
> > > >
> > > > w: appinions.com <http://www.appinions.com/>
> > > >
> > > >
> > > > On Mon, Aug 4, 2014 at 10:10 AM, benjelloun <[hidden email]
> > > > <http://user/SendEmail.jtp?type=node&node=4150990&i=0>> wrote:
> > > >
> > > > > hello you didnt enderstand well my probleme,
> > > > >
> > > > > i give exemple: i have document contain "genève" with accent
> > > > > when i do q="gene" --> autoSuggest "geneve" because of
> > > > > ASCIIFoldingFilterFactory preserveOriginal="true"
> > > > > when i do q="genè" --> autoSuggest "genève"
> > > > > but what i need to is:
> > > > > q="gene" without accent and get this result: "genève" with accent
> > > > >
>

Re: Auto Complete

2014-08-05 Thread Michael Della Bitta
Unless I'm mistaken, it seems like you've created this index specifically
for autocomplete? Or is this index used for general search also?

The easy way to understand this question: Is there one entry in your index
for each term you want to autocomplete? Or are there multiple entries that
might contain the same term?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Aug 5, 2014 at 9:10 AM, benjelloun  wrote:

> hello,
>
> did you find any solution to this problem ?
>
> regards
>
>
> 2014-08-04 16:16 GMT+02:00 Michael Della Bitta-2 [via Lucene] <
> ml-node+s472066n4150990...@n3.nabble.com>:
>
> > How are you implementing autosuggest? I'm assuming you're querying an
> > indexed field and getting a stored value back. But there are a wide
> > variety
> > of ways of doing it.
> >
> > Michael Della Bitta
> >
> > Applications Developer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions <https://twitter.com/Appinions> | g+:
> > plus.google.com/appinions
> > <
> >
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> >
> > w: appinions.com <http://www.appinions.com/>
> >
> >
> > On Mon, Aug 4, 2014 at 10:10 AM, benjelloun <[hidden email]
> > <http://user/SendEmail.jtp?type=node&node=4150990&i=0>> wrote:
> >
> > > hello you didnt enderstand well my probleme,
> > >
> > > i give exemple: i have document contain "genève" with accent
> > > when i do q="gene" --> autoSuggest "geneve" because of
> > > ASCIIFoldingFilterFactory preserveOriginal="true"
> > > when i do q="genè" --> autoSuggest "genève"
> > > but what i need to is:
> > > q="gene" without accent and get this result: "genève" with accent
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > >
> http://lucene.472066.n3.nabble.com/Auto-Complete-tp4150987p4150989.html
> >
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> >
> >
> > --
> >  If you reply to this email, your message will be added to the discussion
> > below:
> > http://lucene.472066.n3.nabble.com/Auto-Complete-tp4150987p4150990.html
> >  To unsubscribe from Auto Complete, click here
> > <
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=4150987&code=YW5hc3MuYm5qQGdtYWlsLmNvbXw0MTUwOTg3fC0xMDQyNjMzMDgx
> >
> > .
> > NAML
> > <
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Auto-Complete-tp4150987p4151211.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Stand alone Solr - no zookeeper?

2014-08-04 Thread Michael Della Bitta
Hi Joel,

You're sort of describing the classic replication scenario, which you can
get started on by reading this: http://wiki.apache.org/solr/SolrReplication

Although I believe this is handled in the reference guide, too.
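
At its simplest it's just the ReplicationHandler on both sides, roughly like
this sketch (URLs and file names are placeholders):

  On the master:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
  </requestHandler>

  On each slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master.example.com:8983/solr/collection1</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>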

Generally speaking, the sorts of issues you mention are general issues that
you have to deal with when using Solr at scale, no matter how you
replicate. Proper GC tuning is a must. You can seriously diminish the
impact of GC with some tuning.

Etsy has done some interesting things regarding implementing an API that's
resilient to garbage collecting nodes. Take a look at this:
http://www.lucenerevolution.org/sites/default/files/Living%20with%20Garbage.pdf


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, Aug 1, 2014 at 10:48 AM, Joel Cohen  wrote:

> Hi,
>
> We're in the development phase of a new application and the current dev
> team mindset leans towards running Solr (4.9) in AWS without Zookeeper. The
> theory is that we can add nodes quickly to our load balancer
> programmatically and get a dump of the indexes from another node and copy
> them over to the new one. A RESTful API would handle other applications
> talking to Solr without the need for each of them to have to use SolrJ.
> Data ingestion happens nightly in bulk by way of ActiveMQ which each server
> subscribes to and pulls its own copy of the indexes. Incremental updates
> are very few during the day, but we would have some mechanism of getting a
> new server to 'catch up' to the live servers before making it active in the
> load balancer.
>
> The only thing so far that I see as a hurdle here is the data set size vs.
> heap size. If the index grows too large, then we have to increase the heap
> size, which could lead to longer GC times. Servers could pop in and out of
> the load balancer if they are unavailable for too long when a major GC
> happens.
>
> Current stats:
> 11 Gb of data (and growing)
> 4 Gb java heap
> 4 CPU, 16 Gb RAM nodes (maybe more needed?)
>
> All thoughts are welcomed.
>
> Thanks.
> --
> *Joel Cohen*
> Devops Engineer
>
> *GrubHub Inc.*
> *jco...@grubhub.com *
> 646-527-7771
> 1065 Avenue of the Americas
> 15th Floor
> New York, NY 10018
>
> grubhub.com | *fb <http://www.facebook.com/grubhub>* | *tw
> <http://www.twitter.com/grubhub>*
> seamless.com | *fb <http://www.facebook.com/seamless>* | *tw
> <http://www.twitter.com/seamless>*
>


Re: Auto Complete

2014-08-04 Thread Michael Della Bitta
How are you implementing autosuggest? I'm assuming you're querying an
indexed field and getting a stored value back. But there are a wide variety
of ways of doing it.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Aug 4, 2014 at 10:10 AM, benjelloun  wrote:

> Hello, you didn't quite understand my problem.
>
> I'll give an example: I have a document containing "genève" with an accent.
> When I do q="gene" --> autoSuggest "geneve", because of
> ASCIIFoldingFilterFactory preserveOriginal="true".
> When I do q="genè" --> autoSuggest "genève".
> But what I need is:
> q="gene" without the accent, and to get this result: "genève" with the accent.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Auto-Complete-tp4150987p4150989.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Auto Complete

2014-08-04 Thread Michael Della Bitta
You need to use this filter in your analysis chain:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
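
A minimal sketch of such a chain (the type name is made up; adjust to your
own schema):

  <fieldType name="text_suggest" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

That way "gene" typed without the accent matches the folded index terms,
while the stored value still holds "genève" for display.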

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Aug 4, 2014 at 9:59 AM, benjelloun  wrote:

> Hello,
>
> I have an index which contains "genève".
> I need to do this query q="gene" and get this result in autocomplete:
> "genève"  (e -> è)
> I'm using StandardTokenizerFactory for the field and SpellCheckComponent for
> the searchComponent.
> All solutions are welcome,
>
> Thanks,
> Best regards,
> Anass BENJELLOUN
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Auto-Complete-tp4150987.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr boosting any perticular URL

2014-07-17 Thread Michael Della Bitta
Rahul,

Check out the relevancy FAQ. You probably want to boost that field value at
index time, or use the query elevation component.

http://wiki.apache.org/solr/SolrRelevancyFAQ
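
For the elevation route, elevate.xml is just a mapping from query text to
document ids (both are placeholders here; the id has to match your uniqueKey,
which for a crawl is often the URL), with the QueryElevationComponent
registered in solrconfig.xml:

  <elevate>
    <query text="solar panels">
      <doc id="http://www.example.com/solar-panels"/>
    </query>
  </elevate>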

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, Jul 17, 2014 at 10:28 AM, rahulmodi  wrote:

> Hi There,
>
> I am new to Solr. My client is asking me to boost a particular URL so that
> it should appear on the top of the results.
> I have already searched on various websites but I did not find boosting for
> a particular URL.
>
> Please tell me whether this feature is available or not, if available then
> how to achieve it.
>
> Thanks
> Rahul Modi
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-boosting-any-perticular-URL-tp4147657.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: java.net.SocketException: Connection reset

2014-07-07 Thread Michael Della Bitta
I don't see anything out of the ordinary thus far, except your heap looks a
little big. I usually run with 6-7gb. I'm wondering if maybe you're running
into a juliet pause and that's causing your sockets to time out.

Have you gathered any GC stats?

Also, what are you doing with respect to commits and optimizes?



Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, Jul 4, 2014 at 5:22 PM, heaven  wrote:

> Today this had happened again + this one:
> null:java.net.SocketException: Broken pipe
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
> at
>
> org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
> at
>
> org.apache.http.impl.io.ChunkedOutputStream.flushCache(ChunkedOutputStream.java:111)
> at
>
> org.apache.http.impl.io.ChunkedOutputStream.flush(ChunkedOutputStream.java:193)
> at
>
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner$1.writeTo(ConcurrentUpdateSolrServer.java:206)
> at
> org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:69)
> at
> org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
> at
>
> org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
> at
>
> org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
> at
>
> org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
> at
>
> org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
> at
>
> org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
> at
>
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
> at
>
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
> at
>
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
> Previously we had all 4 instances on a single node so I thought these
> errors
> might be a result of high load, like some request taking too long to
> complete or something like that. And we always had missing docs in the
> index,
> or vice versa, some docs remain in the index when they shouldn't (even
> though it is supposed to recover from the log, and our index queue never
> removes docs from it until it gets a successful response from Solr).
>
> But now we run shards and replicas on separate nodes with lots of resources
> and a very fast disk storage. And it still causes weird errors. It seems
> Solr is buggy as hell, that's my impression after a few years of usage. And
> it doesn't get better in this aspect, these errors follow us from the very
> beginning.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/java-net-SocketException-Connection-reset-tp4145519p4145675.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: java.net.SocketException: Connection reset

2014-07-03 Thread Michael Della Bitta
What's the %system load on your nodes? What servlet container are you
using? Are you writing a single document per update, or in batches? How
many clients are attached to your cloud?



Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, Jul 3, 2014 at 2:06 PM, heaven  wrote:

> Hi, trying DigitalOcean for Solr, everything seems well, except sometimes I
> see these errors:
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:196)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at
>
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
> at
>
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
> at
>
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
> at
>
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
> at
>
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
> at
>
> org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
> at
>
> org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
> at
>
> org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
> at
>
> org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
> at
>
> org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
> at
>
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
> at
>
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
> at
>
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
> <http://lucene.472066.n3.nabble.com/file/n4145519/Screenshot_794.png>
>
> Solr version is 4.8.1, on Ubuntu Linux. We have 2 nodes, one run 2 shards
> and another 2 replicas.
>
> Errors happen during indexing process. Does it require some
> tweaks/optimizations? I have no idea where to look to fix this. Any
> suggestions are welcome.
>
> Thank you,
> Alex
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/java-net-SocketException-Connection-reset-tp4145519.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: OCR - Saving multi-term position

2014-07-02 Thread Michael Della Bitta
I don't have first hand knowledge of how you implement that, but I bet a
look at the WordDelimiterFilter would help you understand how to emit
multiple terms with the same positions pretty easily.

I've heard of this "bag of word variants" approach to indexing poor-quality
OCR output before for findability reasons and I heard it works out OK.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand <
manuel.lenorm...@gmail.com> wrote:

> Hello,
> Many of our indexed documents are scanned and OCR'ed documents.
> Unfortunately we were not able to improve much the OCR quality (less than
> 80% word accuracy) for various reasons, a fact which badly hurts the
> retrieval quality.
>
> As we use an open-source OCR, we are thinking of changing every scanned term
> output to its main possible variations to get a higher level of
> confidence.
>
> Is there any analyser that supports this kind of need, or should I make up a
> syntax and analyser of my own, i.e. the payload syntax?
>
> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
>
> Thanks,
> Manuel
>


Re: Restriction on type of uniqueKey field?

2014-07-01 Thread Michael Della Bitta
Alex, maybe you're thinking of constraints put on shard keys?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Jul 1, 2014 at 7:05 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> No, you definitely can have an int or long uniqueKey. A lot of Solr's tests
> use such a uniqueKey. See
> solr/core/src/test-files/solr/collection1/conf/schema.xml
>
>
> On Tue, Jul 1, 2014 at 3:20 PM, Alexandre Rafalovitch 
> wrote:
>
> > Hello,
> >
> > I remember reading somewhere that id field (uniqueKey) must be String.
> > But I cannot find the definitive confirmation, just that it should be
> > non-analyzed.
> >
> > Can I use a single-valued TrieLongField type, with precision set to 0?
> > Or am I going to hit issues?
> >
> > Regards,
> >Alex.
> > Personal website: http://www.outerthoughts.com/
> > Current project: http://www.solr-start.com/ - Accelerating your Solr
> > proficiency
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Am I being dense? Or are real-time gets not exposed in SolrJ?

2014-06-25 Thread Michael Della Bitta
The subject line kind of says it all... this is the latest thing we have
noticed that doesn't seem to have made it in. Am I missing something?

Other awkwardness was doing a deleteByQuery against a collection other than
the defaultCollection, and trying to share a CloudSolrServer among
different objects that were writing and reading against multiple
collections.

We managed to hack around the former by doing it with an UpdateRequest. I'm
wondering if a valid solution to the latter is actually to create one
CloudSolrServer, rip the zkStateReader out of it, and stuff it in
subsequent ones. Is that a bad idea? It seems like there might be some
overhead to having several going in the same process that could be avoided,
but maybe I'm overcomplicating things.

Thanks,

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


Re: SolrCloud copy the index to another cluster.

2014-06-24 Thread Michael Della Bitta
So what I'm playing with now is creating a new collection on the target
cluster, turning off the target cluster, wiping the indexes, and manually
just copying the indexes over to the correct directories and starting
again. In the middle, you can run an optimize or use the Lucene index
upgrader tool to bring yourself up to the new version.

Part of this for me is a migration to HDFSDirectory so there's an added
level of complication there.

I would assume that since you only need to preserve reads, you could cut
over once your collections were created on the new cloud?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Jun 24, 2014 at 3:25 PM, heaven  wrote:

> Zero read downtime would be enough; we can safely stop index updates for a while.
> But we
> have some API endpoints where read downtime is very undesirable.
>
> Best,
> Alex
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-copy-the-index-to-another-cluster-tp4143759p4143795.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SolrCloud copy the index to another cluster.

2014-06-24 Thread Michael Della Bitta
I'm currently playing around with Solr Cloud migration strategies, too. I'm
wondering... when you say "zero downtime," do you mean zero *read*
downtime, or zero downtime altogether?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Jun 24, 2014 at 1:43 PM, heaven  wrote:

> I've just realized that old and new clusters do use different
> installations,
> configs and lib paths. So the nodes from the new cluster will probably
> simply refuse to start using configs from the old ZooKeeper.
>
> Only if there is a way to run them with their own ZooKeeper and then
> manually
> add them as replicas to the old cluster, so old and new clusters keep using
> their
> own ZooKeepers.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-copy-the-index-to-another-cluster-tp4143759p4143769.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Does one need to perform an optimize soon after doing a batch indexing using SolrJ ?

2014-06-24 Thread Michael Della Bitta
Hi,

You don't need to optimize just based on segment counts. Solr doesn't
optimize automatically because often it doesn't improve things enough to
justify the computational cost of optimizing. You shouldn't optimize unless
you do a benchmark and discover that optimizing improves performance.

If you're just worried about the segment count, you can tune that in
solrconfig.xml and Solr will merge down your index on the fly as it indexes.
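
For example, inside the <indexConfig> section of solrconfig.xml, something
like this (the numbers are illustrative, not recommendations) keeps the
segment count lower at the price of more merge I/O:

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">5</int>
  </mergePolicy>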

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Jun 24, 2014 at 8:32 AM, RadhaJayalakshmi <
rlakshminaraya...@inautix.co.in> wrote:

> I am using Solr 4.5.1. I have two collections:
> Collection 1 - 2 shards, 3 replicas (Size of Shard 1 - 115
> MB, Size of Shard 2 - 55 MB)
> Collection 2 - 2 shards, 3 replicas (Size of Shard 1 - 3.5
> GB, Size of Shard 2 - 1 GB)
>
> I have a batch process that performs indexing (full refresh) - once a week
> on the same index.
>
> Here is some information on how I index:
> a) I use SolrJ's bulk ADD API for indexing - CloudSolrServer.add(Collection
> docs).
> b) I have an autoCommit (hardcommit) setting for both my Collections
> (solrConfig.xml):
> 
> 10
>
> false
> 
> c) I do a programatic hardcommit at the end of the indexing cycle - with an
> open searcher of "true" - so that the documents show up on the Search
> Results.
> d) I neither programatically soft commit (nor have any autoSoftCommit
> seetings) during the batch indexing process
> e) When I re-index all my data again (the following week) into the same
> index - I don't delete existing docs. Rather, I just re-index into the same
> Collection.
> f) I am using the default mergefactor of 10 in my solrconfig.xml
> 10
>
> Here is what I am observing:
> 1) After a batch indexing cycle - the segment counts for each shard / core
> is pretty high. The Solr Dashboard reports segment counts between 8 - 30
> segments on the variousr cores.
> 2) Sometimes the Solr Dashboard shows the status of my Core as - NOT
> OPTIMIZED. This I find unusual - since I have just finished a Batch
> indexing
> cycle - and would assume that the Index should already be optimized - Is
> this happening because I don't delete my docs before re-indexing all my
> data
> ?
> 3) After I run an optimize on my Collections - the segment count does
> reduce
> to significantly - to 1 segment.
>
> Am I doing indexing the right way ? Is there a better strategy ?
>
> Is it necessary to perform an optimize after every batch indexing cycle ??
>
> The outcome I am looking for is that I need an optimized index after every
> major Batch Indexing cycle.
>
> Thanks!!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Does-one-need-to-perform-an-optimize-soon-after-doing-a-batch-indexing-using-SolrJ-tp4143686.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Restricting access to reading full text document field

2014-06-23 Thread Michael Della Bitta
Yes, that's the general model. Use a layer in between your clients and Solr
to restrict access to only what you wish to let people do.

Generally speaking, you should expose a SearchHandler that hardcodes the fl
param to prevent retrieval of your full text field, and uses a filter query
param to exclude documents you don't want to expose. Then
put a lightweight proxy in front of Solr that only accesses that handler,
and stick Solr behind a firewall. That way, you're not providing access to
the update or admin functions or some of the more compute-intensive query
functions.
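
As a rough sketch of what that middle layer might do (the core URL and the
field names "fulltext", "title" and "visibility" are just placeholders for
whatever your schema actually uses), the proxy builds the query itself and
never lets the client touch fl:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RestrictedSearch {
    private final HttpSolrServer solr =
            new HttpSolrServer("http://internal-solr:8983/solr/collection1");

    // userInput is the only thing the client controls; everything else is fixed here.
    public QueryResponse search(String userInput) throws SolrServerException {
        SolrQuery q = new SolrQuery(userInput);
        q.setFields("id", "title", "score");   // never return the stored full text
        q.addFilterQuery("visibility:public"); // hypothetical access restriction
        q.setHighlight(true);
        q.addHighlightField("fulltext");       // snippets only, not the whole field
        q.setHighlightSnippets(3);
        return solr.query(q);
    }
}

Whether you pin these parameters down as invariants on the handler or in the
proxy layer is mostly a matter of taste; doing both doesn't hurt.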

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Jun 23, 2014 at 9:12 AM, Bjørn Axelsen <
bjorn.axel...@fagkommunikation.dk> wrote:

> Thanks, Michael ... so if I plan to do client-side ajax, you would suggest
> to call back an ajax proxy rather than query the Solr instance directly?
>
> 2014-06-23 14:57 GMT+02:00 Michael Della Bitta <
> michael.della.bi...@appinions.com>:
>
> > Unfortunately, it's not really advisable to allow open access to Solr to
> > the open web.
> >
> > There are many avenues of DOSing a Solr install otherwise, and depending
> on
> > how it's configured, some more intrusive vulnerabilities.
> >
> > Michael Della Bitta
> >
> > Applications Developer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions <https://twitter.com/Appinions> | g+:
> > plus.google.com/appinions
> > <
> >
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> > >
> > w: appinions.com <http://www.appinions.com/>
> >
> >
> > On Mon, Jun 23, 2014 at 8:52 AM, Bjørn Axelsen <
> > bjorn.axel...@fagkommunikation.dk> wrote:
> >
> > > Dear Solr users,
> > >
> > > I am building a Solr 4.8 search engine that will hold documents
> > containing
> > > subscription-only content. We want potential customers to be able to
> > search
> > > the full content. And we also want to show them highlighted context
> > > snippets from the full contents.
> > >
> > > So, I have included the full text as a stored field in order to show
> the
> > > context snippets.
> > >
> > > For ease of implementation across multiple sites I prefer access to the
> > > Solr query URL to be open (no HTTP basic authentication etc.).
> > >
> > > However, we do not want to expose the full text to the public (paid
> > > content).
> > >
> > > What would be the most simple way to
> > >
> > > 1) provide highlighted context snippets from the full content field,
> > > 2) block access to read the full field contents?
> > >
> > > Regards,
> > >
> > > Bjørn Axelsen
> > > Web Consultant
> > >  Fagkommunikation   Webbureau som formidler viden
> > > Schillerhuset  ·  Nannasgade 28  ·  2200 København N  ·  +45 60660669
>  ·
> > > i...@fagkommunikation.dk  ·  fagkommunikation.dk
> > >
> >
>


Re: Restricting access to reading full text document field

2014-06-23 Thread Michael Della Bitta
Unfortunately, it's not really advisable to allow open access to Solr from
the open web.

There are many avenues of DOSing a Solr install otherwise, and depending on
how it's configured, some more intrusive vulnerabilities.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Jun 23, 2014 at 8:52 AM, Bjørn Axelsen <
bjorn.axel...@fagkommunikation.dk> wrote:

> Dear Solr users,
>
> I am building a Solr 4.8 search engine that will hold documents containing
> subscription-only content. We want potential customers to be able to search
> the full content. And we also want to show them highlighted context
> snippets from the full contents.
>
> So, I have included the full text as a stored field in order to show the
> context snippets.
>
> For ease of implementation across multiple sites I prefer access to the
> Solr query URL to be open (no HTTP basic authentication etc.).
>
> However, we do not want to expose the full text to the public (paid
> content).
>
> What would be the most simple way to
>
> 1) provide highlighted context snippets from the full content field,
> 2) block access to read the full field contents?
>
> Regards,
>
> Bjørn Axelsen
> Web Consultant
>  Fagkommunikation   Webbureau som formidler viden
> Schillerhuset  ·  Nannasgade 28  ·  2200 København N  ·  +45 60660669  ·
> i...@fagkommunikation.dk  ·  fagkommunikation.dk
>


Looking for migration stories to an HDFS-backed Solr Cloud

2014-06-18 Thread Michael Della Bitta
Hi everyone,

We're considering a migration to an HDFS-backed Solr Cloud, both from our
4.2-based Solr Cloud, and a legacy 3.6 classic replication setup. In the
end, we hope to unify these two and upgrade to 4.8.1, or 4.9 if that's out
in time.

I'm wondering how many of you have experience with migrating to HDFS, and
if you managed to do something a little more crafty than a bulk reindex
against a new installation.

For example, is it possible to do something like join some 4.8, HDFS-backed
nodes to your 4.2 setup, add replicas to the new nodes, have things sync
over, and then terminate the 4.2 nodes?

For our older setup, could I bodge together collections by simply copying
the index data into HDFS and building a single shard collection from each
one? Would the HDFSDirectoryFactory do OK against an index written using an
older codec and on a random access disk?

Any information or experiences you might be able to share would be helpful.
In the meantime, I'm going to start experimenting with some of these
approaches.

Thanks!

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


Re: solrj error

2014-06-17 Thread Michael Della Bitta
Clearly you're going to need to deposit 25 cents to make that call. :)

More seriously, I'm wondering if most of the issue is environment-related,
since it seems like it's looking for that file on your system based on the
path. I checked my machine and it doesn't have a
$JAVA_HOME/lib/currency.data file either. Is it possible that you have
somehow used a mismatched JAVA_HOME and tools.jar?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Jun 17, 2014 at 12:03 PM, Vivek Pathak  wrote:

> Hi
>
> I am using SolrJ 4.6 for accessing Solr 4.6. As a test case for my
> application, I created a servlet which holds the SolrJ connection via
> ZooKeeper.
>
> When I run the test, I am getting a weird stack trace.  The test fails
> because it cannot find Java's currency file.  This file, I believe, used to
> be present in Java 1.6.  Is SolrJ 4.6 somehow coupled with Java 1.6?  Any
> other ideas?
>
>
>Caused by: java.lang.InternalError
>at java.util.Currency$1.run(Currency.java:224)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.util.Currency.(Currency.java:192)
>at
> java.text.DecimalFormatSymbols.initialize(DecimalFormatSymbols
>.java:585)
>at java.text.DecimalFormatSymbols.(DecimalFormatSymbols.
>java:94)
>at java.text.DecimalFormatSymbols.getInstance(
>DecimalFormatSymbols.java:157)
>at java.text.NumberFormat.getInstance(NumberFormat.java:767)
>at java.text.NumberFormat.getIntegerInstance(NumberFormat.java:
>439)
>at java.text.SimpleDateFormat.initialize(SimpleDateFormat.java:
>664)
>at java.text.SimpleDateFormat.(SimpleDateFormat.java:585)
>at org.apache.solr.common.util.DateUtil$ThreadLocalDateFormat.<
>init>(DateUtil.java:187)
>at org.apache.solr.common.util.DateUtil.(DateUtil.java:
>179)
>at org.apache.solr.client.solrj.util.ClientUtils.(
>ClientUtils.java:193)
>at org.apache.solr.client.solrj.impl.CloudSolrServer.request(
>CloudSolrServer.java:565)
>at org.apache.solr.client.solrj.request.QueryRequest.process(
>QueryRequest.java:90)
>at
> org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:
>310)
>at com.qbase.gsn.SearchServlet.doGet(SearchServlet.java:121)
>... 21 more
>Caused by: java.io.FileNotFoundException: /opt/jdk1.7.0_25/lib/currency.
>data (No such file or directory)
>at java.io.FileInputStream.open(Native Method)
>at java.io.FileInputStream.(FileInputStream.java:138)
>at java.io.FileInputStream.(FileInputStream.java:97)
>at java.util.Currency$1.run(Currency.java:198)
>... 37 more
>
>
>
> Thanks
> Vivek
>
>
> P.S. : I tried to force /opt/jdk1.7 to be java.home thinking the execution
> path will change but the bug remained.  Also there is no java 1.6 on the
> machine
>


Re: Tomcat restart removes the Core.

2014-06-05 Thread Michael Della Bitta
Did you put that attribute on the root element, or somewhere else? The
beginning of solr.xml should look like this:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, Jun 5, 2014 at 3:52 PM, EXTERNAL Taminidi Ravi (ETI,
Automotive-Service-Solutions)  wrote:

> I updated persistent="true" in the solr.xml but still no change; after a
> restart the Cores are removed.
>
> -Original Message-----
> From: Michael Della Bitta [mailto:michael.della.bi...@appinions.com]
> Sent: Wednesday, June 04, 2014 2:54 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Tomcat restart removes the Core.
>
> Any chance you don't have a persistent="true" attribute in your solr.xml?
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions
> <
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> w: appinions.com <http://www.appinions.com/>
>
>
> On Wed, Jun 4, 2014 at 1:06 PM, EXTERNAL Taminidi Ravi (ETI,
> Automotive-Service-Solutions)  wrote:
>
> > All, can anyone help me with what is going wrong in my Tomcat? When I
> > restart Tomcat after a schema update, the Cores are removed.
> >
> > I need to add the cores manually to get them back working.
> >
> > Has anyone experienced anything like this?
> >
> > Thanks
> >
> > Ravi
> >
>


Re: Tomcat restart removes the Core.

2014-06-04 Thread Michael Della Bitta
Any chance you don't have a persistent="true" attribute in your solr.xml?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Wed, Jun 4, 2014 at 1:06 PM, EXTERNAL Taminidi Ravi (ETI,
Automotive-Service-Solutions)  wrote:

> All, can anyone help me with what is going wrong in my Tomcat? When I
> restart Tomcat after a schema update, the Cores are removed.
>
> I need to add the cores manually to get them back working.
>
> Has anyone experienced anything like this?
>
> Thanks
>
> Ravi
>


Re: Percolator feature

2014-05-29 Thread Michael Della Bitta
We've definitely looked at Luwak before... nice to hear it might be being
brought closer into the Solr ecosystem!


Re: search using Ngram.

2014-05-29 Thread Michael Della Bitta
Sounds like you are tokenizing your string when you don't really want to.

Either you want all queries to only search against prefixes of the whole
value without tokenization, or you need to produce several copyFields with
different analysis applied and use dismax to let Solr know which should
rank higher.
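
For example, the query side of the copyField + dismax route might look
roughly like this (the field names "name_prefix" and "name_edge" are
invented; they would be copyFields of your name field with different analysis
chains, the first un-tokenized and edge-ngrammed, the second tokenized and
edge-ngrammed):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SuggestQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("Bill M");
        q.set("defType", "edismax");
        // Matches on the whole-value prefix field outrank per-word prefix matches.
        q.set("qf", "name_prefix^10 name_edge^2");
        q.setRows(10);

        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults());
    }
}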

Or, you could use the Suggester component or one of the other bolt-on
autocomplete components instead.

Maybe you should post your current field definition and let us know
specifically what you're trying to achieve?


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, May 29, 2014 at 4:54 AM, Gurfan  wrote:

> Hi All,
>
> We are using EdgeNGramFilterFactory for searching with minGramSize="3"; as
> per business logic, autofill suggestions should appear on entering 3
> characters in the search filter. While searching for a contact named "Bill
> Moor", the value does not get listed when we type 'Bill M', but when
> we type 'Bill Moo' or 'Bill' it suggests 'Bill Moor'.
>
> Clearly, the tokens are not generated when there is a space in between. We
> cannot set minGramSize="1" as that will generate many tokens and slow
> performance. Do we have a solution without using NGram to generate
> tokens on entering 3 characters?
>
>
> Please suggest.
>
> Thanks,
> --Gurfan
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/search-using-Ngram-tp4138596.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: index a repository of documents(.doc) without using post.jar

2014-05-23 Thread Michael Della Bitta
There's an example of using curl to make a REST call to update a core on
this page:

https://wiki.apache.org/solr/UpdateXmlMessages

If that doesn't help, please let us know what error you're receiving.


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, May 23, 2014 at 10:42 AM, benjelloun  wrote:

> Hello,
>
> I looked at the source code of post.jar; that was very interesting.
> I also looked at Apache ManifoldCF, which was interesting too.
> But what I want to do is index some files using HTTP REST. This is my
> request, which doesn't work; maybe this way is the easiest to implement:
>
> put: localhost:8080/solr/update?commit=true
> 
>   
> khalid
> bouchna9 
> 23/05/2014 
>   
> 
>
> I'm using dev http client for test.
> Thanks,
> Anass BENJELLOUN
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/index-a-repository-of-documents-doc-without-using-post-jar-tp4137797p4137881.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to Disable Commit Option and Just Manage it via SolrConfig?

2014-05-22 Thread Michael Della Bitta
Just a thought: If your users can send updates and you can't trust them,
how can you keep them from deleting all your data?

I would consider using a servlet filter to inspect the request. That would
probably be non-trivial if you plan to accept javabin requests as well.
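
As a very rough sketch, a filter that just rejects the obvious URL parameters
might look like the code below. Note that this only catches commit flags
passed as request parameters; a commitWithin buried inside a JSON, XML or
javabin body would need the body to be parsed as well.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class BlockClientCommitsFilter implements Filter {

    public void init(FilterConfig config) throws ServletException {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        // Reject update requests that try to force a commit from the client side.
        if (request.getRequestURI().contains("/update")
                && (request.getParameter("commit") != null
                    || request.getParameter("softCommit") != null
                    || request.getParameter("commitWithin") != null
                    || request.getParameter("optimize") != null)) {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN,
                    "Commits are managed by the server");
            return;
        }
        chain.doFilter(req, res);
    }

    public void destroy() {}
}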

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, May 22, 2014 at 6:36 AM, Furkan KAMACI wrote:

> Hi All;
>
> I've designed a system that allows people to use a search service from
> SolrCloud. However, I think that I should disable the "commit" option for
> users to avoid performance issues (many users can send commit requests and
> this may cause performance issues). I'll configure the solr config file with
> autocommit and I'll not let people commit individually.
>
> I've done some implementation for it and people can not send commit request
> by GET as like:
>
> localhost:8983/solr/update*?commit=true*
>
> and they can not use:
>
> HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr
> ");
> solrServer*.commit();*
>
> I think that there is another way to send a commit request to Solr. It is
> something like:
>
> {"add":{ "doc":{"id":"change.me","title":"change.me
> "},"boost":1.0,"overwrite":true,"*commitWithin*":1000}}
>
> So, I want to stop that usage, and my current implementation does not
> prevent it.
>
> My question is: Is there any way I can close the commit option for Solr
> from "clients"/"outside the world of Solr" and manage that option only via
> solr config?
>
> Thanks;
> Furkan KAMACI
>


Re: solr problem after indexing, shutdown and startup

2014-05-21 Thread Michael Della Bitta
Two possibly unrelated things:

1. Don't commit until the end.

2. Consider not optimizing at all.

You might want to look at your autocommit settings in your solrconfig.xml.
You probably want soft commits set at something north of 10 seconds, and
hard commits set to openSearcher=false with a maxTime somewhat larger than
your soft commit setting, somewhere in the low minutes range.
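
Concretely, the indexing loop from your mail would end up looking roughly
like this (just a sketch; the batch size, URL and input-handling placeholders
are made up, and durability between batches is left to the hard autoCommit in
solrconfig.xml):

import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class Indexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

        for (int linenum = 0; hasMoreLines(); linenum++) {
            docs.add(buildDocument(linenum));   // build a doc from your input line
            if (docs.size() >= 32768) {         // flush in batches, but do NOT commit here
                server.add(docs);
                docs.clear();
            }
        }
        if (!docs.isEmpty()) {
            server.add(docs);
        }
        server.commit();                        // one commit at the very end; no optimize()
    }

    // Placeholders for your own input handling.
    static boolean hasMoreLines() { return false; }
    static SolrInputDocument buildDocument(int linenum) { return new SolrInputDocument(); }
}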


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Wed, May 21, 2014 at 9:51 AM, Cam Bazz  wrote:

> Hello,
>
> I am indexing some 20 million documents in a solr instance.
>
> After I do the indexing, then shut the instance down and bring it back up,
> it will not respond to queries,
> and it will not show collection stats.
>
> I am attaching the solr log to this email.
>
> The way that I do indexing is:
>
> // document list
> Collection<SolrInputDocument> docs = new ArrayList<>();
>
> // process files
> for(;;) {
>
>// create document and add it to docs list
> if(linenum%(256*128)==0) {
>// every once in a while add them to server
> try {
> server.add(docs);
> server.commit(true, false, false);
> docs.clear();
> } catch (SolrServerException ex) {
>
> Logger.getLogger(Indexer.class.getName()).log(Level.SEVERE, null, ex);
> }
> }
>   }
>
> // and at the end i optimize
> try {
> server.optimize(true, false);
> } catch (SolrServerException ex) {
>
> Logger.getLogger(Indexer.class.getName()).log(Level.SEVERE, null, ex);
> }
>
>
> I am suspecting this has something to do with background merges, etc.
>
> Any ideas/help/recommendations on this problem are greatly appreciated. I am
> using Solr 4.8.
>
> Best Regards,
> C.B.
>
>
>


Cloudera Manager install

2014-05-16 Thread Michael Della Bitta
Hi everyone,

I'm investigating migrating over to an HDFS-based Solr Cloud install.

We use Cloudera Manager here to maintain a few other clusters, so
maintaining our Solr cluster with it as well is attractive. However, just
from reading the documentation, it's not totally clear to me what
version(s) of Solr I can install and manage with Cloudera Manager. I saw in
one place in the documentation an indication that Cloudera Search uses 4.4,
but then elsewhere I see the opportunity to use custom versions, and
finally, one indication that Cloudera Manager uses the "latest version."

I'm wondering if anybody has experience with installing a fairly new
version of Solr, say 4.7 or 4.8, through Cloudera Manager.


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


Re: Storing tweets For WC2014

2014-05-16 Thread Michael Della Bitta
Some of the data providers for Twitter offer a search API. Depending on
what you're doing, you might not even need to host this yourself.

My company does do search and analytics over tweets, but by the time we end
up indexing them, we've winnowed down the initial set to 10% of what we've
initially ingested, which itself is a fraction of the total set of tweets
as our data provider has let us filter for the ones that have the keywords
we want.

Our news index approaches the size of what you're talking about within an
order of magnitude (where 'news' is really an index of sentences taken from
news reports, along with metadata about the document the news came from).
Overall, we're hosting about 310 million records (give or take depending
where in the sharding cycle we're on) in a cluster of 5 AWS i2.xlarge boxes.

This setup indexes from our feeds in real time, which means there's no mass
loading. Additionally, we generally do bulk data collection across only 3
days of data, so if you're looking to do a mess of reporting against your
full set, take that into consideration.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, May 9, 2014 at 1:39 PM, Cool Techi  wrote:

> Hi,
> We have a requirement from one of our customers to provide search and
> analytics on the upcoming Soccer World Cup. Given the sheer volume of
> tweets that would be generated at such an event, I cannot imagine what
> would be required to store this in Solr.
> It would be great if there could be some pointers on the scale or hardware
> required, number of shards that should be created, etc. Some requirements:
> all the tweets should be searchable (approximately 100 million tweets/day
> * 60 days of the event); all fields on tweets should be searchable, with
> faceting on numeric and date fields. Facets would be run on Twitter IDs
> (unique users), tweet creation date, location, and sentiment (some fields
> which we generate).
>
> If anyone has attempted anything like this it would be helpful.
> Regards,Rohit
>


Re: Solrj Default Data Format

2014-05-13 Thread Michael Della Bitta
Hi Furkan,

If I were to guess, the XML format is more cross-compatible with different
versions of SolrJ. But it might not be intentional.

In any case, feeding your SolrServer a BinaryResponseParser will switch it
over to javabin.
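
Roughly, something like this (SolrJ 4.x class names; the URL is a
placeholder). Note that if it's specifically the update payload you're seeing
as XML, the request writer is the piece that controls that, while the parser
controls the response format:

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.BinaryResponseParser;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class JavabinSetup {
    public static void main(String[] args) {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        server.setParser(new BinaryResponseParser());       // javabin responses
        server.setRequestWriter(new BinaryRequestWriter()); // javabin add/update requests
    }
}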

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, May 8, 2014 at 10:17 AM, Furkan KAMACI wrote:

> Hi;
>
> I found the reason of weird format at my previous mail. Now I capture the
> data with wireshark and I see that it is pure XML and content type is set
> to application/xml?
>
> Any ideas about why it is not javabin?
>
> Thanks;
> Furkan KAMACI
>
>
> 2014-05-07 22:16 GMT+03:00 Furkan KAMACI :
>
> > Hmmm, I see that it is like XML format but not. I have added three
> > documents but has something like that:
> >
> > 
> > 
> > id1
> > id2
> > id3
> > id4
> > d1
> > d2
> > d3
> > d4
> > 
> > 
> > 
> > 
> > 
> >
> > is this javabin format? I mean optimizing XML and having a first byte of
> > "2"?
> >
> > Thanks;
> > Furkan KAMACI
> >
> >
> > 2014-05-07 22:04 GMT+03:00 Furkan KAMACI :
> >
> > Hi;
> >>
> >> I am testing Solrj. I use Solr 4.5.1 and HttpSolrServer for my test. I
> >> just generate some SolrInputDocuments and call add method of server to
> add
> >> them. When  I track the request I see that data is at XML format
> instead of
> >> javabin. Do I miss anything?
> >>
> >> Thanks;
> >> Furkan KAMACI
> >>
> >
> >
>


Re: Solr interface

2014-04-07 Thread Michael Della Bitta
The speed of ingest via HTTP improves greatly once you do two things:

1. Batch multiple documents into a single request.
2. Index with multiple threads at once.
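
For example, a minimal sketch that does both at once with SolrJ's
ConcurrentUpdateSolrServer, which buffers documents and sends them in batches
on background threads (the URL, queue size, thread count and field names here
are arbitrary):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        // Queue up to 1000 docs and drain the queue with 4 parallel threads.
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1", 1000, 4);

        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "document " + i);
            server.add(doc);            // buffered and shipped in batches behind the scenes
        }

        server.blockUntilFinished();    // wait for the queue to drain
        server.commit();
        server.shutdown();
    }
}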

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins wrote:

> I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
> ~400M documents in total, with 4-way replication (so its quite a big
> setup!)  I had thought that HTTP would slow things down, so we recently
> trialed a JNI approach (clients are C++) so we could call SolrJ and get the
> benefits of JavaBin encoding for our indexing
>
> Once we had done benchmarks with both solutions, I think we saved about 1ms
> per document (on average) with JNI, so it wasn't as big a gain as we were
> expecting.  There are other benefits of SolrJ (zookeeper integration,
> better routing, etc) and we were doing local HTTP (so it was literally just
> a TCP port to localhost, no actual net traffic) but that just goes to prove
> what other posters have said here.  Check whether HTTP really *is* the
> bottleneck before you try to replace it!
>
>
> On 7 April 2014 17:05, Shawn Heisey  wrote:
>
> > On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
> >
> >> Do you mean to tell me that the people on this list that are indexing
> >> 100s of millions of documents are doing this over http?  I have been
> using
> >> custom Lucene code to index files, as I thought this would be faster for
> >> many documents and I wanted some non-standard OCR and index fields.  Is
> >> there a better way?
> >>
> >> To the OP: You can also use Lucene to locally index files for Solr.
> >>
> >
> > My sharded index has 94 million docs in it.  All normal indexing and
> > maintenance is done with SolrJ, over HTTP. Currently full rebuilds are
> done
> > with the dataimport handler loading from MySQL, but that is legacy.  This
> > is NOT a SolrCloud installation.  It is also not a replicated setup -- my
> > indexing program keeps both copies up to date independently, similar to
> > what happens behind the scenes with SolrCloud.
> >
> > The single-thread DIH is very well optimized, and is faster than what I
> > have written myself -- also single-threaded.
> >
> > The real reason that we still use DIH for rebuilds is that I can run the
> > DIH simultaneously on all shards.  A full rebuild that way takes about 5
> > hours.  A SolrJ process feeding all shards with a single thread would
> take
> > a lot longer.  Once I have time to work on it, I can make the SolrJ
> rebuild
> > multi-threaded, and I expect it will be similar to DIH in rebuild speed.
> >  Hopefully I can make it faster.
> >
> > There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
> > high enough to matter.
> >
> > Using Lucene to index files for Solr is an option -- but that requires
> > writing a custom Lucene application, and knowledge about how to turn the
> > Solr schema into Lucene code.  A lot of users on this list (me included)
> do
> > not have the skills required.  I know SolrJ reasonably well, but Lucene
> is
> > a nut that I haven't cracked.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: zookeeper reconnect failure

2014-03-28 Thread Michael Della Bitta
Hi, Jessica,

We've had a similar problem when DNS resolution of our Hadoop task nodes
has failed. They tend to take a dirt nap until you fix the problem
manually. Are you experiencing this in AWS as well?

I'd say the two things to do are to poll the node state via HTTP using a
monitoring tool so you get an immediate notification of the problem, and to
install some sort of caching server like nscd if you expect to have DNS
resolution failures regularly.
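
For the polling piece, even something as trivial as this is enough to hang a
cron job or monitoring check off of (it assumes the stock /admin/ping handler
is enabled in solrconfig.xml, and the URL is a placeholder):

import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPingCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/collection1/admin/ping");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        int status = conn.getResponseCode();
        System.out.println(status == 200 ? "OK" : "PROBLEM: HTTP " + status);
        // A non-zero exit code lets whatever wraps this check raise an alert.
        System.exit(status == 200 ? 0 : 1);
    }
}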



Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, Mar 28, 2014 at 4:27 PM, Jessica Mallet wrote:

> Hi,
>
> First off, I'd like to give a disclaimer that this probably is a very edge
> case issue. However, since it happened to us, I would like to get some
> advice on how to best handle this failure scenario.
>
> Basically, we had some network issue where we temporarily lost connection
> and DNS. The zookeeper client properly triggered the watcher. However, when
> trying to reconnect, the following Exception is thrown:
>
> 2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line
> 121) :java.net.UnknownHostException: : Name or
> service not known
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
> at
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
> at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
> at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> at java.net.InetAddress.getAllByName(InetAddress.java:1063)
> at
>
> org.apache.zookeeper.client.StaticHostProvider.(StaticHostProvider.java:60)
> at org.apache.zookeeper.ZooKeeper.(ZooKeeper.java:445)
> at org.apache.zookeeper.ZooKeeper.(ZooKeeper.java:380)
> at
> org.apache.solr.common.cloud.SolrZooKeeper.(SolrZooKeeper.java:41)
> at
>
> org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
> at
>
> org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
> at
>
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>
> I tried to look at the code and it seems that there'd be no further retries
> to connect to Zookeeper, and the node is basically left in a bad state and
> will not recover on its own. (Please correct me if I'm reading this wrong.)
> Thinking about it, this is probably fair, since normally you wouldn't
> expect retries to fix an "unknown host" issue--even though in our case it
> would have--but I'm wondering what we should do to handle this situation if
> it happens again in the future.
>
> Any advice is appreciated.
>
> Thanks,
> Jessica
>


Re: Solr Cloud collection keep going down?

2014-03-25 Thread Michael Della Bitta
What kind of load are the machines under when this happens? A lot of
writes? A lot of http connections?

Do your zookeeper logs mention anything about losing clients?

Have you tried turning on GC logging or profiling GC?

Have you tried running with a smaller max heap size, or
setting -XX:CMSInitiatingOccupancyFraction ?

Just a shot in the dark, since I'm not familiar with Jetty's logging
statements, but that looks like plain old dropped HTTP sockets to me.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Mar 25, 2014 at 1:13 PM, Software Dev wrote:

> Can anyone else chime in? Thanks
>
> On Mon, Mar 24, 2014 at 10:10 AM, Software Dev
>  wrote:
> > Shawn,
> >
> > Thanks for pointing me in the right direction. After consulting the
> > above document I *think* that the problem may be too large a heap,
> > which may be affecting GC and hence causing ZK
> > timeouts.
> >
> > We have around 20G of memory on these machines with a min/max of heap
> > at 6, 8 respectively (-Xms6G -Xmx10G). The rest was set
> > aside for disk cache. Why did we choose 6-10? No other reason than we
> > wanted to allot enough for disk cache, and then everything else was
> > thrown at Solr. Does this sound about right?
> >
> > I took some screenshots for VisualVM and our NewRelic reporting as
> > well as some relevant portions of our SolrConfig.xml. Any
> > thoughts/comments would be greatly appreciated.
> >
> > http://postimg.org/gallery/4t73sdks/1fc10f9c/
> >
> > Thanks
> >
> >
> >
> >
> > On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey  wrote:
> >> On 3/22/2014 1:23 PM, Software Dev wrote:
> >>> We have 2 collections with 1 shard each replicated over 5 servers in
> the
> >>> cluster. We see a lot of flapping (down or recovering) on one of the
> >>> collections. When this happens the other collection hosted on the same
> >>> machine is still marked as active. When this happens it takes a fairly
> long
> >>> time (~30 minutes) for the collection to come back online, if at all. I
> >>> find that its usually more reliable to completely shutdown solr on the
> >>> affected machine and bring it back up with its core disabled. We then
> >>> re-enable the core when its marked as active.
> >>>
> >>> A few questions:
> >>>
> >>> 1) What is the healthcheck in Solr-Cloud? Put another way, what is
> failing
> >>> that marks one collection as down but the other on the same machine as
> up?
> >>>
> >>> 2) Why does recovery take forever when a node goes down.. even if its
> only
> >>> down for 30 seconds. Our index is only 7-8G and we are running on
> SSD's.
> >>>
> >>> 3) What can be done to diagnose and fix this problem?
> >>
> >> Unless you are actually using the ping request handler, the healthcheck
> >> config will not matter.  Or were you referring to something else?
> >>
> >> Referencing the logs you included in your reply:  The EofException
> >> errors happen because your client code times out and disconnects before
> >> the request it made has completed.  That is most likely just a symptom
> >> that has nothing at all to do with the problem.
> >>
> >> Read the following wiki page.  What I'm going to say below will
> >> reference information you can find there:
> >>
> >> http://wiki.apache.org/solr/SolrPerformanceProblems
> >>
> >> Relevant side note: The default zookeeper client timeout is 15 seconds.
> >>  A typical zookeeper config defines tickTime as 2 seconds, and the
> >> timeout cannot be configured to be more than 20 times the tickTime,
> >> which means it cannot go beyond 40 seconds.  The default timeout value
> >> 15 seconds is usually more than enough, unless you are having
> >> performance problems.
> >>
> >> If you are not actually taking Solr instances down, then the fact that
> >> you are seeing the log replay messages indicates to me that something is
> >> taking so much time that the connection to Zookeeper times out.  When it
> >> finally responds, it will attempt to recover the index, which means
> >&g

Re: Replication (Solr Cloud)

2014-03-25 Thread Michael Della Bitta
No, don't disable replication!

The way shards ordinarily keep up with updates is by sending every document
to each member of the shard. However, if a shard goes offline for a period
of time and comes back, replication is used to "catch up" that shard. So
you really need it on.

If you created your collection with the collections API and the required
bits are in schema.xml and solrconfig.xml, you should be good to go. See
https://wiki.apache.org/solr/SolrCloud#Required_Config

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Mar 25, 2014 at 12:42 PM, Software Dev wrote:

> I see that by default in SolrCloud that my collections are
> replicating. Should this be disabled in SolrCloud as this is already
> handled by it?
>
> From the documentation:
>
> "The Replication screen shows you the current replication state for
> the named core you have specified. In Solr, replication is for the
> index only. SolrCloud has supplanted much of this functionality, but
> if you are still using index replication, you can use this screen to
> see the replication state:"
>
> I just want to make sure before I disable it that if we send an update
> to one server that the document will be correctly replicated across
> all nodes. Thanks
>


Re: Solr4 performance

2014-02-27 Thread Michael Della Bitta
You would get more room for disk cache by reducing your large heap.
Otherwise, you'd have to add more RAM to your systems or shard your index
to more nodes to gain more RAM that way.

The Linux VM subsystem actually has a number of tuning parameters (like
vm.bdflush, vm.swappiness and vm.pagecache), but I don't know if there's
any definitive information about how to set them appropriately for Solr.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, Feb 27, 2014 at 3:09 PM, Joshi, Shital  wrote:

> Hi Michael,
>
> If page cache is the issue, what is the solution?
>
> Thanks!
>
> -Original Message-
> From: Michael Della Bitta [mailto:michael.della.bi...@appinions.com]
> Sent: Monday, February 24, 2014 9:54 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr4 performance
>
> I'm not sure how you're measuring free RAM. Maybe this will help:
>
> http://www.linuxatemyram.com/play.html
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> "The Science of Influence Marketing"
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions<
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> w: appinions.com <http://www.appinions.com/>
>
>
> On Mon, Feb 24, 2014 at 5:35 PM, Joshi, Shital 
> wrote:
>
> > Thanks.
> >
> > We found some evidence that this could be the issue. We're monitoring
> > closely to confirm this.
> >
> > One question though: none of our nodes show more than 50% of physical
> > memory used. So there is enough memory available for memory mapped files.
> > Can this kind of pause still happen?
> >
> >
> > -Original Message-
> > From: Michael Della Bitta [mailto:michael.della.bi...@appinions.com]
> > Sent: Friday, February 21, 2014 5:28 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr4 performance
> >
> > It could be that your query is churning the page cache on that node
> > sometimes, so Solr pauses so the OS can drag those pages off of disk.
> Have
> > you tried profiling your iowait in top or iostat during these pauses?
> > (assuming you're using linux).
> >
> > Michael Della Bitta
> >
> > Applications Developer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > "The Science of Influence Marketing"
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions <https://twitter.com/Appinions> | g+:
> > plus.google.com/appinions<
> >
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> > >
> > w: appinions.com <http://www.appinions.com/>
> >
> >
> > On Fri, Feb 21, 2014 at 5:20 PM, Joshi, Shital 
> > wrote:
> >
> > > Thanks for your answer.
> > >
> > > We confirmed that it is not GC issue.
> > >
> > > The auto warming query looks good too and queries before and after the
> > > long running query comes back really quick. The only thing stands out
> is
> > > shard on which query takes long time has couple million more documents
> > than
> > > other shards.
> > >
> > > -Original Message-
> > > From: Michael Della Bitta [mailto:michael.della.bi...@appinions.com]
> > > Sent: Thursday, February 20, 2014 5:26 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Solr4 performance
> > >
> > > Hi,
> > >
> > > As for your first question, setting openSearcher to true means you will
> > see
> > > the new docs after every hard commit. Soft and hard commits only become
> > > isolated from one another with that set to false.
> > >
> > > Your second problem might be explained by your large heap and garbage
> > > collection. Walking a heap that large can take an appreciable amount of
> > > time. You might consider turning on the JVM options for logging GC and
> > > seeing if you can correlate your slow responses to times when your JVM
> is
> > > garbage collecting.
> > >
> > > Hope 

Re: Solr4 performance

2014-02-24 Thread Michael Della Bitta
I'm not sure how you're measuring free RAM. Maybe this will help:

http://www.linuxatemyram.com/play.html

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Feb 24, 2014 at 5:35 PM, Joshi, Shital  wrote:

> Thanks.
>
> We found some evidence that this could be the issue. We're monitoring
> closely to confirm this.
>
> One question though: none of our nodes show more than 50% of physical
> memory used. So there is enough memory available for memory mapped files.
> Can this kind of pause still happen?
>
>
> -Original Message-
> From: Michael Della Bitta [mailto:michael.della.bi...@appinions.com]
> Sent: Friday, February 21, 2014 5:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr4 performance
>
> It could be that your query is churning the page cache on that node
> sometimes, so Solr pauses so the OS can drag those pages off of disk. Have
> you tried profiling your iowait in top or iostat during these pauses?
> (assuming you're using linux).
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> "The Science of Influence Marketing"
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions<
> https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
> >
> w: appinions.com <http://www.appinions.com/>
>
>
> On Fri, Feb 21, 2014 at 5:20 PM, Joshi, Shital 
> wrote:
>
> > Thanks for your answer.
> >
> > We confirmed that it is not GC issue.
> >
> > The auto warming query looks good too and queries before and after the
> > long running query comes back really quick. The only thing stands out is
> > shard on which query takes long time has couple million more documents
> than
> > other shards.
> >
> > -Original Message-
> > From: Michael Della Bitta [mailto:michael.della.bi...@appinions.com]
> > Sent: Thursday, February 20, 2014 5:26 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Solr4 performance
> >
> > Hi,
> >
> > As for your first question, setting openSearcher to true means you will
> see
> > the new docs after every hard commit. Soft and hard commits only become
> > isolated from one another with that set to false.
> >
> > Your second problem might be explained by your large heap and garbage
> > collection. Walking a heap that large can take an appreciable amount of
> > time. You might consider turning on the JVM options for logging GC and
> > seeing if you can correlate your slow responses to times when your JVM is
> > garbage collecting.
> >
> > Hope that helps,
> > On Feb 20, 2014 4:52 PM, "Joshi, Shital"  wrote:
> >
> > > Hi!
> > >
> > > I have few other questions regarding Solr4 performance issue we're
> > facing.
> > >
> > > We're committing data to Solr4 every ~30 seconds (up to 20K rows). We
> use
> > > commit=false in update URL. We have only hard commit setting in Solr4
> > > config.
> > >
> > > 
> > >${solr.autoCommit.maxTime:60}
> > >10
> > >true
> > >  
> > >
> > >
> > > Since we're not using Soft commit at all (commit=false), the caches
> will
> > > not get reloaded for every commit and recently added documents will not
> > be
> > > visible, correct?
> > >
> > > What we see is queries which usually take few milli seconds, takes ~40
> > > seconds once in a while. Can high IO during hard commit cause queries
> to
> > > slow down?
> > >
> > > For some shards we see 98% full physical memory. We have 60GB machine
> (30
> > > GB JVM, 28 GB free RAM, ~35 GB of index). We're ruling out that high
> > > physical memory would cause queries to slow down. We're in process of
> > > reducing JVM size anyways.
> > >
> > > We have never run optimization till now. QA optimization didn't yield
> in
> > > performance gain.
> > >
> > > Thanks much for all help.
> > >
> > > -Original Message-
> > > From: Shawn Heisey [mailto:s...@elyograg.org]
> > > Sent: Tuesday, February 18, 2014 4:55 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Solr4 performance
> > >
> > > On 2/18/2014 2:14 PM, Joshi, Shital wrote:
> > > > Thanks much for all suggestions. We're looking into reducing
> allocated
> > > heap size of Solr4 JVM.
> > > >
> > > > We're using NRTCachingDirectoryFactory. Does it use MMapDirectory
> > > internally? Can someone please confirm?
> > >
> > > In Solr, NRTCachingDirectory does indeed use MMapDirectory as its
> > > default delegate.  That's probably also the case with Lucene -- these
> > > are Lucene classes, after all.
> > >
> > > MMapDirectory is almost always the most efficient way to handle on-disk
> > > indexes.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>


Re: Solr4 performance

2014-02-21 Thread Michael Della Bitta
It could be that your query is churning the page cache on that node
sometimes, so Solr pauses so the OS can drag those pages off of disk. Have
you tried profiling your iowait in top or iostat during these pauses?
(assuming you're using linux).

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, Feb 21, 2014 at 5:20 PM, Joshi, Shital  wrote:

> Thanks for your answer.
>
> We confirmed that it is not a GC issue.
>
> The auto-warming query looks good too, and queries before and after the
> long-running query come back really quick. The only thing that stands out is
> that the shard on which the query takes a long time has a couple million
> more documents than the other shards.
>
> -Original Message-
> From: Michael Della Bitta [mailto:michael.della.bi...@appinions.com]
> Sent: Thursday, February 20, 2014 5:26 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr4 performance
>
> Hi,
>
> As for your first question, setting openSearcher to true means you will see
> the new docs after every hard commit. Soft and hard commits only become
> isolated from one another with that set to false.
>
> Your second problem might be explained by your large heap and garbage
> collection. Walking a heap that large can take an appreciable amount of
> time. You might consider turning on the JVM options for logging GC and
> seeing if you can correlate your slow responses to times when your JVM is
> garbage collecting.
>
> Hope that helps,
> On Feb 20, 2014 4:52 PM, "Joshi, Shital"  wrote:
>
> > Hi!
> >
> > I have few other questions regarding Solr4 performance issue we're
> facing.
> >
> > We're committing data to Solr4 every ~30 seconds (up to 20K rows). We use
> > commit=false in update URL. We have only hard commit setting in Solr4
> > config.
> >
> > 
> >${solr.autoCommit.maxTime:60}
> >10
> >true
> >  
> >
> >
> > Since we're not using Soft commit at all (commit=false), the caches will
> > not get reloaded for every commit and recently added documents will not
> be
> > visible, correct?
> >
> > What we see is queries which usually take few milli seconds, takes ~40
> > seconds once in a while. Can high IO during hard commit cause queries to
> > slow down?
> >
> > For some shards we see 98% full physical memory. We have 60GB machine (30
> > GB JVM, 28 GB free RAM, ~35 GB of index). We're ruling out that high
> > physical memory would cause queries to slow down. We're in process of
> > reducing JVM size anyways.
> >
> > We have never run optimization till now. QA optimization didn't yield in
> > performance gain.
> >
> > Thanks much for all help.
> >
> > -Original Message-
> > From: Shawn Heisey [mailto:s...@elyograg.org]
> > Sent: Tuesday, February 18, 2014 4:55 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr4 performance
> >
> > On 2/18/2014 2:14 PM, Joshi, Shital wrote:
> > > Thanks much for all suggestions. We're looking into reducing allocated
> > heap size of Solr4 JVM.
> > >
> > > We're using NRTCachingDirectoryFactory. Does it use MMapDirectory
> > internally? Can someone please confirm?
> >
> > In Solr, NRTCachingDirectory does indeed use MMapDirectory as its
> > default delegate.  That's probably also the case with Lucene -- these
> > are Lucene classes, after all.
> >
> > MMapDirectory is almost always the most efficient way to handle on-disk
> > indexes.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Best way to get results ordered

2014-02-21 Thread Michael Della Bitta
Hi Metin,

How many IDs are you supplying in a single query? You could probably
accomplish this easily with boosts if it were few.
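
For a handful of IDs, something along these lines would do it; the boosts
just decrease in the order you want the documents back (the IDs are from your
example, the URL is a placeholder, and this assumes ordinary relevancy
sorting so the boosts dominate the score):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class OrderedIdQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Boosts decrease in the order we want the docs returned: A, then C, then B.
        SolrQuery q = new SolrQuery("id:A^3 OR id:C^2 OR id:B^1");
        q.setRows(3);

        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults());
    }
}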

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, Feb 21, 2014 at 1:25 PM, OSMAN Metin wrote:

> Hi all,
>
> we are using SolR 4.4.0 and planning to migrate to 4.6.1 very soon.
>
> We are looking for a way to get results ordered in a certain way.
>
> For example, we are doing query by ids this way : q=id=A OR id =C OR id=B
> and we want the results to be sorted as A,C,B.
>
> Is there a good way to do this with SolR or should we sort the items on
> the client application side ?
>
> Regards,
>
> Metin
>
>


RE: Solr4 performance

2014-02-20 Thread Michael Della Bitta
Hi,

As for your first question, setting openSearcher to true means you will see
the new docs after every hard commit. Soft and hard commits only become
isolated from one another with that set to false.

Your second problem might be explained by your large heap and garbage
collection. Walking a heap that large can take an appreciable amount of
time. You might consider turning on the JVM options for logging GC and
seeing if you can correlate your slow responses to times when your JVM is
garbage collecting.

Hope that helps,
On Feb 20, 2014 4:52 PM, "Joshi, Shital"  wrote:

> Hi!
>
> I have few other questions regarding Solr4 performance issue we're facing.
>
> We're committing data to Solr4 every ~30 seconds (up to 20K rows). We use
> commit=false in update URL. We have only hard commit setting in Solr4
> config.
>
> 
>${solr.autoCommit.maxTime:60}
>10
>true
>  
>
>
> Since we're not using Soft commit at all (commit=false), the caches will
> not get reloaded for every commit and recently added documents will not be
> visible, correct?
>
> What we see is that queries which usually take a few milliseconds take ~40
> seconds once in a while. Can high IO during a hard commit cause queries to
> slow down?
>
> For some shards we see 98% full physical memory. We have 60GB machine (30
> GB JVM, 28 GB free RAM, ~35 GB of index). We're ruling out that high
> physical memory would cause queries to slow down. We're in process of
> reducing JVM size anyways.
>
> We have never run optimization till now. QA optimization didn't yield a
> performance gain.
>
> Thanks much for all help.
>
> -Original Message-
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: Tuesday, February 18, 2014 4:55 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr4 performance
>
> On 2/18/2014 2:14 PM, Joshi, Shital wrote:
> > Thanks much for all suggestions. We're looking into reducing allocated
> heap size of Solr4 JVM.
> >
> > We're using NRTCachingDirectoryFactory. Does it use MMapDirectory
> internally? Can someone please confirm?
>
> In Solr, NRTCachingDirectory does indeed use MMapDirectory as its
> default delegate.  That's probably also the case with Lucene -- these
> are Lucene classes, after all.
>
> MMapDirectory is almost always the most efficient way to handle on-disk
> indexes.
>
> Thanks,
> Shawn
>
>


Re: Boost Query Example

2014-02-17 Thread Michael Della Bitta
Hi,

Filter queries don't affect score, so boosting won't have an effect there.
If you want those query terms to get boosted, move them into the q
parameter.

http://wiki.apache.org/solr/CommonQueryParameters#fq
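
Applied to the query in your mail, that would mean something like the
following (the boost values are just the ones you already had; tune them to
taste):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoostedSkuQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/ProductCollection");

        // The boosted SKU clauses live in q, not fq, so they can influence ranking.
        SolrQuery q = new SolrQuery("SKU:223-CL10V3^100 OR SKU:223-CL1^90");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults());
    }
}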

Hope that helps!

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Feb 17, 2014 at 3:49 PM, EXTERNAL Taminidi Ravi (ETI,
Automotive-Service-Solutions)  wrote:

>
> Hi, can someone help me with the Boost & Sort query example?
>
> http://localhost:8983/solr/ProductCollection/select?q=*%3A*&wt=json&indent=true&fq=SKU:223-CL10V3^100
> OR SKU:223-CL1^90
>
> There is no difference in the query order; let me know if I am missing
> something. Also, I would like to order with the exact match for SKU:223-CL10V3^100
>
> Thanks
>
> Ravi
>


Re: Best way to copy data from SolrCloud to standalone Solr?

2014-02-17 Thread Michael Della Bitta
I do know for certain that the backup command on a cloud core still works.
We have a script like this running on a cron to snapshot indexes:

curl -s '
http://localhost:8080/solr/#{core}/replication?command=backup&numberToKeep=4&location=/tmp
'

(not really using /tmp for this, parameters changed to protect the guilty)

The admin handler for replication doesn't seem to be there, but the actual
API seems to work normally.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Mon, Feb 17, 2014 at 2:02 PM, Shawn Heisey  wrote:

> On 2/17/2014 8:32 AM, Daniel Bryant wrote:
> > I have a production SolrCloud server which has multiple sharded indexes,
> > and I need to copy all of the indexes to a (non-cloud) Solr server
> > within our QA environment.
> >
> > Can I ask for advice on the best way to do this please?
> >
> > I've searched the web and found solr2solr
> > (https://github.com/dbashford/solr2solr), but the author states that
> > this is best for small indexes, and ours are rather large at ~20Gb each.
> > I've also looked at replication, but can't find a definite reference on
> > how this should be done between SolrCloud and Solr?
> >
> > Any guidance is very much appreciated.
>
> If the master index isn't changing at the time of the copy, and you're
> on a non-Windows platform, you should be able to copy the index
> directory directly.  On a Windows platform, whether you can copy the
> index while Solr is using it would depend on how Solr/Lucene opens the
> files.  A typical Windows file open will prevent anything else from
> opening them, and I do not know whether Lucene is smarter than that.
>
> SolrCloud requires the replication handler to be enabled on all configs,
> but during normal operation, it does not actually use replication.  This
> is a confusing thing for some users.
>
> I *think* you can configure the replication handler on slave cores with
> a non-cloud config that point at the master cores, and it should
> replicate the main Lucene index, but not the config files.  I have no
> idea whether things will work right if you configure other master
> options like replicateAfter and config files, and I also don't know if
> those options might cause problems for SolrCloud itself.  Those options
> shouldn't be necessary for just getting the data into a dev environment,
> though.
>
> Thanks,
> Shawn
>
>


RE: JVM heap constraints and garbage collection

2014-02-03 Thread Michael Della Bitta
> i2.xlarge looks vastly better than m2.2xlarge at about the same price, so
I must be missing something: Is it the 120 IPs that explains why anyone
would choose m2.2xlarge?

i2.xlarge is a relatively new instance type (December 2013). In our case,
we're partway through a yearlong reservation of m2.2xlarges and won't be up
for reconsidering that for a few months. I don't think that Amazon has ever
dropped a legacy instance type, so there's bound to be some overlap as they
roll out new ones. And I imagine someone setting up a huge memcached pool
might rather have the extra RAM over the SSD, so it still makes sense for
the m2.2xlarge to be around.

It can be kind of hard to understand how the various parameters that make
up an instance type get decided on, though. I have to consult that
ec2instances.info link all the time to make sure I'm not missing something
regarding what types we should be using.


On Feb 1, 2014 1:51 PM, "Toke Eskildsen"  wrote:

> Michael Della Bitta [michael.della.bi...@appinions.com] wrote:
> > Here at Appinions, we use mostly m2.2xlarges, but the new i2.xlarges look
> > pretty tasty primarily because of the SSD, and I'll probably push for a
> > switch to those when our reservations run out.
>
> > http://www.ec2instances.info/
>
> i2.xlarge looks vastly better than m2.2xlarge at about the same price, so
> I must be missing something: Is it the 120 IPs that explains why anyone
> would choose m2.2xlarge?
>
> Anyhow, it is good to see that Amazon now has 11 different setups with
> SSD. The IOPS looks solid at around 40K/s (estimated) for the i2.xlarge and
> they even have TRIM (
> http://aws.amazon.com/about-aws/whats-new/2013/12/19/announcing-the-next-generation-of-amazon-ec2-high-i/o-instance/).
>
> - Toke Eskildsen

