Re: SolrJ Socket Leak

2014-02-17 Thread Kiran Chitturi
Jared,

I faced a similar issue when using CloudSolrServer with Solr. As Shawn
pointed out, the 'TIME_WAIT' status happens when the connection is closed
by the HTTP client. The HTTP client closes a connection whenever it thinks
the connection is stale
(https://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html#d5e405).
Even the docs point out that stale connection checking cannot be made
completely reliable.

I see two ways to get around this:

1. Enable 'SO_REUSEADDR'
2. Disable stale connection checks.

Also, by default, when we create CSS it does not explicitly configure any
HTTP client parameters
(https://github.com/apache/lucene-solr/blob/trunk/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrServer.java#L124).
In this case, the default configuration parameters (max connections, max
connections per host) are used for the HTTP connection. You can explicitly
configure these params when creating CSS using HttpClientUtil:

ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 128);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 32);
params.set(HttpClientUtil.PROP_FOLLOW_REDIRECTS, false);
params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 3);

final HttpClient client = HttpClientUtil.createClient(params);
LBHttpSolrServer lb = new LBHttpSolrServer(client);
CloudSolrServer server = new CloudSolrServer(zkConnect, lb);


Currently, I am using HTTP client 4.3.2 and building the client when
creating the CSS. I also enable the 'SO_REUSEADDR' option and I haven't seen
the 'TIME_WAIT' issue after this (maybe because of better handling of stale
connections in 4.3.2, or because the 'SO_REUSEADDR' param is enabled). My
current HTTP client code looks like this (works only with HTTP client 4.3.2):

HttpClientBuilder httpBuilder = HttpClientBuilder.create();

SocketConfig.Builder socketConfig = SocketConfig.custom();
socketConfig.setSoReuseAddress(true);
socketConfig.setSoTimeout(1);
httpBuilder.setDefaultSocketConfig(socketConfig.build());
httpBuilder.setMaxConnTotal(300);
httpBuilder.setMaxConnPerRoute(100);
httpBuilder.disableRedirectHandling();
httpBuilder.useSystemProperties();

// build the client and hand it to SolrJ ('parser' is the ResponseParser for the LB server)
CloseableHttpClient httpClient = httpBuilder.build();
LBHttpSolrServer lb = new LBHttpSolrServer(httpClient, parser);
CloudSolrServer server = new CloudSolrServer(zkConnect, lb);


There should be a way to configure socket reuse with HTTP client 4.2.3 too;
you can try different configurations. I am surprised you still have
'TIME_WAIT' connections even after 30 minutes, because a 'TIME_WAIT'
connection should be closed by the OS after about 2 minutes by default, I think.
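
If you are stuck on the HttpClient 4.2.x line, a rough, untested sketch of
how those two workarounds could be applied through the legacy
org.apache.http.params API might look like the following (treat the exact
calls as an assumption to verify against your HttpClient version):

HttpClient httpClient = HttpClientUtil.createClient(new ModifiableSolrParams());
// Workaround 1: allow the local port to be reused while sockets sit in TIME_WAIT
HttpConnectionParams.setSoReuseaddr(httpClient.getParams(), true);
// Workaround 2: disable the (unreliable) stale connection check
HttpConnectionParams.setStaleCheckingEnabled(httpClient.getParams(), false);

LBHttpSolrServer lb = new LBHttpSolrServer(httpClient);
CloudSolrServer server = new CloudSolrServer(zkConnect, lb);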


HTH,

-- 
Kiran Chitturi,


On 2/13/14 12:38 PM, Jared Rodriguez jrodrig...@kitedesk.com wrote:

I am using Solr/SolrJ 4.6.1 along with Apache HttpClient 4.3.2 as part
of a web application which connects to the Solr server via SolrJ
using CloudSolrServer().  The web application is wired up with Guice, and
there is a single instance of the CloudSolrServer class used by all inbound
requests.  All this is running on Amazon.

Basically, everything looks and runs fine for a while, but even with
moderate concurrency, SolrJ starts leaving sockets open.  We are handling
only about 250 connections to the web app per minute and each of these
issues 3 - 7 requests to Solr.  Over a 30 minute period of this type
of use, we end up with many thousands of lingering sockets.  I can see these
when running netstat:

tcp        0      0 ip-10-80-14-26.ec2.in:41098  ip-10-99-145-47.ec2.i:glrpc  TIME_WAIT

All to the same target host, which is my solr server. There are no other
pieces of infrastructure on that box, just solr.  Eventually, the server
just dies as no further sockets can be opened and the opened ones are not
reused.

The Solr server itself is unfazed and running like a champ: an average
time per request of 0.126, as seen in the query handler stats in the Solr
admin UI.

Apache HttpClient had a bunch of leakage in the 4.2.x line that was
cleaned up and refactored in 4.3.x, which is why I upgraded.  Currently,
SolrJ makes use of the old leaky 4.2 classes for establishing connections
and using a connection pool.

http://www.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.3.x.txt



-- 
Jared Rodriguez



Re: DIH

2014-02-17 Thread Mikhail Khludnev
On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/14/2014 10:45 PM, William Bell wrote:
  On virtual cores the DIH handler is really slow. On a 12 core box it only
  uses 1 core while indexing.
 
  Does anyone know how to do Java threading from a SQL query into Solr?
  Examples?
 
  I can use SolrJ to do it, or I might be able to modify DIH to enable
  threading.
 
  At some point in 3.x threading was enabled in DIH, but it was removed
 since
  people were having issues with it (we never did).

 If you know how to fix DIH so it can do multiple indexing threads
 safely, please open an issue and upload a patch.

Please! Don't do it. Never again!
https://issues.apache.org/jira/browse/SOLR-3011

As far as I understand, the general idea is to find the DIH successor:
https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424



 I'm still using DIH for full rebuilds, but I'd actually like to replace
 it with a rebuild routine written in SolrJ.  I currently achieve decent
 speed by running DIH on all my shards at the same time.

 I do use SolrJ for once-a-minute index maintenance, but the code that
 I've written to pull data out of SQL and write it to Solr is not able to
 index millions of documents in a single thread as fast as DIH does.  I
 have been building a multithreaded design in my head, but I haven't had
 a chance to write real code and see whether it's actually a good design.
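
(A rough, untested sketch of the kind of multithreaded SolrJ feed being
described here -- a fixed pool of workers pushing document batches to the
shared SolrServer; the batch size, pool size, and the fetchBatch() helper
are made-up placeholders, not real code from this setup:

ExecutorService pool = Executors.newFixedThreadPool(8);
List<SolrInputDocument> batch;
// fetchBatch(n) is a hypothetical helper returning the next n documents
// from the database, or an empty list when the data is exhausted.
while (!(batch = fetchBatch(1000)).isEmpty()) {
  final List<SolrInputDocument> docs = batch;
  pool.submit(new Runnable() {
    @Override
    public void run() {
      try {
        server.add(docs);   // 'server' is the shared SolrServer instance
      } catch (Exception e) {
        // log and decide whether to retry this batch
      }
    }
  });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);
server.commit();

In practice the executor's work queue would need to be bounded so batches
don't pile up in memory while Solr is the bottleneck.)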

 For me, the bottleneck is definitely Solr, not the database.  I recently
 wrote a test program that uses my current SolrJ indexing method.  If I
 skip the server.add(docs) line, it can read all 91 million docs from
 the database and build SolrInputDocument objects for them in 2.5 hours
 or less, all with a single thread.  When I do a real rebuild with DIH,
 it takes a little more than 4.5 hours -- and that is inherently
 multithreaded, because it's doing all the shards simultaneously.  I have
 no idea how long it would take with a single-threaded SolrJ program.

 Thanks,
 Shawn




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr index filename doesn't match with Solr version

2014-02-17 Thread Nguyen Manh Tien
Thanks Shawn and Tri for your info and explanations.
Tien


On Mon, Feb 17, 2014 at 1:36 PM, Tri Cao tm...@me.com wrote:

 Lucene's main file formats actually don't change a lot in 4.x (or even 5.x),
 and the newer codecs just delegate to previous versions for most file
 types. The newer file types don't typically include Lucene's version in
 the file names.

 For example, the Lucene 4.6 codec basically delegates the stored fields and
 term vector file formats to 4.1, the doc format to 4.0, etc., and only
 implements the new segment info/field infos formats (the .si and .fnm files).


 https://github.com/apache/lucene-solr/blob/lucene_solr_4_6/lucene/core/src/java/org/apache/lucene/codecs/lucene46/Lucene46Codec.java#L50

 Hope this helps,
 Tri


 On Feb 16, 2014, at 08:52 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/16/2014 7:25 PM, Nguyen Manh Tien wrote:

 I upgraded recently from Solr 4.0 to Solr 4.6.

 I checked the Solr index folder and found these files:

 _aars_*Lucene41*_0.doc
 _aars_*Lucene41*_0.pos
 _aars_*Lucene41*_0.tim
 _aars_*Lucene41*_0.tip

 I don't know why they don't have *Lucene46* in the file name.


 This is an indication that this part of the index is using a file format
 introduced in Lucene 4.1.

 Here's what I have for one of my index segments on a Solr 4.6.1 server:

 _5s7_2h.del
 _5s7.fdt
 _5s7.fdx
 _5s7.fnm
 _5s7_Lucene41_0.doc
 _5s7_Lucene41_0.pos
 _5s7_Lucene41_0.tim
 _5s7_Lucene41_0.tip
 _5s7_Lucene45_0.dvd
 _5s7_Lucene45_0.dvm
 _5s7.nvd
 _5s7.nvm
 _5s7.si
 _5s7.tvd
 _5s7.tvx

 It shows the same pieces as your list, but I am also using docValues in
 my index, and those files indicate that they are using the format from
 Lucene 4.5. I'm not sure why there are not version numbers in *all* of
 the file extensions -- that happens in the Lucene layer, which is a bit
 of a mystery to me.

 Thanks,
 Shawn




Re: DIH

2014-02-17 Thread Alexandre Rafalovitch
There have been a couple of discussions about finding a DIH successor
(including on the HelioSearch list), but no real momentum as far as I can
tell.

I think somebody will have to really pitch in and do the same couple
of scenarios DIH does in several different frameworks (TodoMVC style).
That should get it going.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, Feb 17, 2014 at 7:40 PM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/14/2014 10:45 PM, William Bell wrote:
  On virtual cores the DIH handler is really slow. On a 12 core box it only
  uses 1 core while indexing.
 
  Does anyone know how to do Java threading from a SQL query into Solr?
  Examples?
 
  I can use SolrJ to do it, or I might be able to modify DIH to enable
  threading.
 
  At some point in 3.x threading was enabled in DIH, but it was removed
 since
   people were having issues with it (we never did).

 If you know how to fix DIH so it can do multiple indexing threads
 safely, please open an issue and upload a patch.

 Please! Don't do it. Never again!
 https://issues.apache.org/jira/browse/SOLR-3011

 As far as I understand the general idea is to find the DIH successor
  https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424



 I'm still using DIH for full rebuilds, but I'd actually like to replace
 it with a rebuild routine written in SolrJ.  I currently achieve decent
 speed by running DIH on all my shards at the same time.

 I do use SolrJ for once-a-minute index maintenance, but the code that
 I've written to pull data out of SQL and write it to Solr is not able to
 index millions of documents in a single thread as fast as DIH does.  I
 have been building a multithreaded design in my head, but I haven't had
 a chance to write real code and see whether it's actually a good design.

 For me, the bottleneck is definitely Solr, not the database.  I recently
 wrote a test program that uses my current SolrJ indexing method.  If I
 skip the server.add(docs) line, it can read all 91 million docs from
 the database and build SolrInputDocument objects for them in 2.5 hours
 or less, all with a single thread.  When I do a real rebuild with DIH,
 it takes a little more than 4.5 hours -- and that is inherently
 multithreaded, because it's doing all the shards simultaneously.  I have
 no idea how long it would take with a single-threaded SolrJ program.

 Thanks,
 Shawn




 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com


Re: DIH

2014-02-17 Thread Ahmet Arslan
Hi Mikhail,

Can you please elaborate on what you mean?
My understanding is that there is no multi-threading support in DIH, and that
for various reasons it won't get it. Am I correct?

Regarding Apache Flume, how can it be a DIH replacement? Can I index rich
documents on my disk using Flume? Can I fetch documents from
Wikipedia, Jira, Twitter, Dropbox, an RDBMS, RSS, or the file system with it?

Ahmet



On Monday, February 17, 2014 10:41 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:
On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/14/2014 10:45 PM, William Bell wrote:
  On virtual cores the DIH handler is really slow. On a 12 core box it only
  uses 1 core while indexing.
 
  Does anyone know how to do Java threading from a SQL query into Solr?
  Examples?
 
  I can use SolrJ to do it, or I might be able to modify DIH to enable
  threading.
 
  At some point in 3.x threading was enabled in DIH, but it was removed
 since
   people were having issues with it (we never did).

 If you know how to fix DIH so it can do multiple indexing threads
 safely, please open an issue and upload a patch.

Please! Don't do it. Never again!
https://issues.apache.org/jira/browse/SOLR-3011

As far as I understand the general idea is to find the DIH successor
https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424




 I'm still using DIH for full rebuilds, but I'd actually like to replace
 it with a rebuild routine written in SolrJ.  I currently achieve decent
 speed by running DIH on all my shards at the same time.

 I do use SolrJ for once-a-minute index maintenance, but the code that
 I've written to pull data out of SQL and write it to Solr is not able to
 index millions of documents in a single thread as fast as DIH does.  I
 have been building a multithreaded design in my head, but I haven't had
 a chance to write real code and see whether it's actually a good design.

 For me, the bottleneck is definitely Solr, not the database.  I recently
 wrote a test program that uses my current SolrJ indexing method.  If I
 skip the server.add(docs) line, it can read all 91 million docs from
 the database and build SolrInputDocument objects for them in 2.5 hours
 or less, all with a single thread.  When I do a real rebuild with DIH,
 it takes a little more than 4.5 hours -- and that is inherently
 multithreaded, because it's doing all the shards simultaneously.  I have
 no idea how long it would take with a single-threaded SolrJ program.

 Thanks,
 Shawn




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com




Re: DIH

2014-02-17 Thread Alexandre Rafalovitch
I haven't tried Apache Flume but the manual seems to suggest 'yes' to
a large number of your checklist items:
http://flume.apache.org/FlumeUserGuide.html

When you say 'rich document' indexing, the keyword you are looking for
is (Apache) Tika, as that's what is actually doing the job under the
covers.

Whether it can replicate your specific requirements is a question
only you can answer for yourself, of course. When you do, maybe let us
know, so we can learn too. :-)

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, Feb 17, 2014 at 8:11 PM, Ahmet Arslan iori...@yahoo.com wrote:
 Hi Mikhail,

 Can you please elaborate what do you mean?
 My understanding is that there is no multi-threading support in DIH. For some 
 reasons, it won't have. Am I correct?

 Regarding apache flume, how it can be dih replacement? Can I index rich 
 documents on my disk using flume? Can I fetch documents from 
 wikipedia,jira,twitter,dropbox,rdbms,rss,file system by using it?

 Ahmet



 On Monday, February 17, 2014 10:41 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:
 On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/14/2014 10:45 PM, William Bell wrote:
  On virtual cores the DIH handler is really slow. On a 12 core box it only
  uses 1 core while indexing.
 
  Does anyone know how to do Java threading from a SQL query into Solr?
  Examples?
 
  I can use SolrJ to do it, or I might be able to modify DIH to enable
  threading.
 
  At some point in 3.x threading was enabled in DIH, but it was removed
 since
   people were having issues with it (we never did).

 If you know how to fix DIH so it can do multiple indexing threads
 safely, please open an issue and upload a patch.

 Please! Don't do it. Never again!
 https://issues.apache.org/jira/browse/SOLR-3011

 As far as I understand the general idea is to find the DIH successor
  https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424




 I'm still using DIH for full rebuilds, but I'd actually like to replace
 it with a rebuild routine written in SolrJ.  I currently achieve decent
 speed by running DIH on all my shards at the same time.

 I do use SolrJ for once-a-minute index maintenance, but the code that
 I've written to pull data out of SQL and write it to Solr is not able to
 index millions of documents in a single thread as fast as DIH does.  I
 have been building a multithreaded design in my head, but I haven't had
 a chance to write real code and see whether it's actually a good design.

 For me, the bottleneck is definitely Solr, not the database.  I recently
 wrote a test program that uses my current SolrJ indexing method.  If I
 skip the server.add(docs) line, it can read all 91 million docs from
 the database and build SolrInputDocument objects for them in 2.5 hours
 or less, all with a single thread.  When I do a real rebuild with DIH,
 it takes a little more than 4.5 hours -- and that is inherently
 multithreaded, because it's doing all the shards simultaneously.  I have
 no idea how long it would take with a single-threaded SolrJ program.

 Thanks,
 Shawn




 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com




Re: Solr Load Testing Issues

2014-02-17 Thread Annette Newton
Sorry, I didn't make myself clear.  I have 20 machines in the configuration;
each shard/replica is on its own machine.


On 14 February 2014 19:44, Shawn Heisey s...@elyograg.org wrote:

 On 2/14/2014 5:28 AM, Annette Newton wrote:
  Solr Version: 4.3.1
  Number Shards: 10
  Replicas: 1
  Heap size: 15GB
  Machine RAM: 30GB
  Zookeeper timeout: 45 seconds
 
  We are continuing the fight to keep our Solr setup functioning.  As a
  result of this we have made significant changes to our schema to reduce the
  amount of data we write.  I set up a new cluster to reindex our data;
  initially I ran the import with no replicas and achieved quite impressive
  results.  Our peak was 60,000 new documents per minute, with no shard losses
  and no outages due to garbage collection (which is an issue we see in
  production).  At the end of the load the index stood at 97,000,000 documents
  and 20GB per shard.  During the highest insertion rate I would say that
  querying suffered, but that is not of concern right now.

 Solr 4.3.1 has a number of problems when it comes to large clouds.
 Upgrading to 4.6.1 would be strongly advisable, but that's only
 something to try after looking into the rest of what I have to say.

 If I read what you've written correctly, you are running all this on one
 machine.  To put it bluntly, this isn't going to work well unless you
 put a LOT more memory into that machine.

 For good performance, Solr relies on the OS disk cache, because reading
 from the disk is VERY expensive in terms of time.  The OS will
 automatically use RAM that's not being used for other purposes for the
 disk cache, so that it can avoid reading off the disk as much as possible.

 http://wiki.apache.org/solr/SolrPerformanceProblems

 Below is a summary of what that Wiki page says, with your numbers as I
 understand them.  If I am misunderstanding your numbers, then this
 advice may need adjustment.  Note that when I see one replica I take
 that to mean replicationFactor=1, so there is only one copy of the
 index.  If you actually mean that you have *two* copies, then you have
 twice as much data as I've indicated below, and your requirements will
 be even larger:

 With ten shards that are each 20GB in size, your total index size is
 200GB.  With 15 GB of heap, your ideal memory size for that server would
 be 215GB -- the 15GB heap plus enough extra to fit the entire 200GB
 index into RAM.

 In reality you probably don't need that much, but it's likely that you
 would need at least half the index to fit into RAM at any one moment,
 which adds up to 115GB.  If you're prepared to deal with
 moderate-to-severe performance problems, you **MIGHT** be able to get
 away with only 25% of the index fitting into RAM, which still requires
 65GB of RAM, but with SolrCloud, such performance problems usually mean
 that the cloud won't be stable, so it's not advisable to even try it.

 One of the bits of advice on the wiki page is to split your index into
 shards and put it on more machines, which drops the memory requirements
 for each machine.  You're already using a multi-shard SolrCloud, so you
 probably just need more hardware.  If you had one 20GB shard on a
 machine with 30GB of RAM, you could probably use a heap size of 4-8GB
 per machine and have plenty of RAM left over to cache the index very
 well.  You could most likely add another 50% to the index size and still
 be OK.

 Thanks,
 Shawn




-- 

Annette Newton

Database Administrator

ServiceTick Ltd



T:+44(0)1603 618326



Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ

www.servicetick.com

*www.sessioncam.com*

-- 
*This message is confidential and is intended to be read solely by the 
addressee. The contents should not be disclosed to any other person or 
copies taken unless authorised to do so. If you are not the intended 
recipient, please notify the sender and permanently delete this message. As 
Internet communications are not secure ServiceTick accepts neither legal 
responsibility for the contents of this message nor responsibility for any 
change made to this message after it was forwarded by the original author.*


Solrcloud: no registered leader found and new searcher error

2014-02-17 Thread sweety
I have configured solrcloud as follows,
http://lucene.472066.n3.nabble.com/file/n4117724/Untitled.png 

Solr.xml:
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}"
         hostPort="${jetty.port:}" hostContext="solr">
    <core loadOnStartup="true" instanceDir="document\" transient="false"
          name="document"/>
    <core loadOnStartup="true" instanceDir="contract\" transient="false"
          name="contract"/>
  </cores>
</solr>

I have added all the required config for SolrCloud, referring to this:
http://wiki.apache.org/solr/SolrCloud#Required_Config

I am adding data to the core 'document'.
Now when I try to index using SolrNet (solr.Add(doc)), I get this error:
SEVERE: org.apache.solr.common.SolrException: *No registered leader was
found, collection:document* slice:shard2
at
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:481)

and also this error:
SEVERE: null:java.lang.RuntimeException: *SolrCoreState already closed*
at
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:84)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:520)

I guess it is because the leader is from the core 'contract' and I am trying to
index into the core 'document'?
Is there a way to change the leader, and how?
How can I change the state of the shards from 'gone' to 'active'?

Also, when I try to query with q=*:*, this is shown:
org.apache.solr.common.SolrException: *Error opening new searcher at*
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1415) at

I read that if the number of commits is exceeded then this searcher error
appears, but I did not issue a commit command, so how can the commits be
exceeded? It also requires some warming settings, so I added this to
solrconfig.xml, but I still get the same error:

<query>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">solr</str>
        <str name="start">0</str>
        <str name="rows">10</str>
      </lst>
      <lst>
        <str name="q">rocks</str>
        <str name="start">0</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>

I have just started with SolrCloud, so please tell me if I am doing anything
wrong in the SolrCloud configuration.
Also, I did not find good material on SolrCloud on Windows 7 with Apache
Tomcat, so please suggest something for that too.
Thanks a lot.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solrcloud-no-registered-leader-found-and-new-searcher-error-tp4117724.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud how to spread out to multiple nodes

2014-02-17 Thread soodyogesh
Thanks, I'm going to give this a try.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-how-to-spread-out-to-multiple-nodes-tp4116326p4117728.html
Sent from the Solr - User mailing list archive at Nabble.com.


Facet cache issue when deleting documents from the index

2014-02-17 Thread Marius Dumitru Florea
Hi guys,

I'm using Solr 4.6.1 (embedded) and for some reason the facet cache is
not invalidated when documents are deleted from the index. Sadly, for
me, I cannot reproduce this issue with an integration test like this:

--8--
SolrInstance server = getSolrInstance();

SolrInputDocument document = new SolrInputDocument();
document.setField("id", "foo");
document.setField("locale", "en");
server.add(document);

server.commit();

document = new SolrInputDocument();
document.setField("id", "bar");
document.setField("locale", "en");
server.add(document);

server.commit();

SolrQuery query = new SolrQuery("*:*");
query.set("facet", "on");
query.set("facet.field", "locale");
QueryResponse response = server.query(query);

Assert.assertEquals(2, response.getResults().size());
FacetField localeFacet = response.getFacetField("locale");
Assert.assertEquals(1, localeFacet.getValues().size());
Count en = localeFacet.getValues().get(0);
Assert.assertEquals("en", en.getName());
Assert.assertEquals(2, en.getCount());

server.delete("foo");
server.commit();

response = server.query(query);

Assert.assertEquals(1, response.getResults().size());
localeFacet = response.getFacetField("locale");
Assert.assertEquals(1, localeFacet.getValues().size());
en = localeFacet.getValues().get(0);
Assert.assertEquals("en", en.getName());
Assert.assertEquals(1, en.getCount());
--8--

Nevertheless, when I do the 'same' on my real environment, the count
for the locale facet remains 2 after one of the documents is deleted.
The search result count is fine, so that's why I think it's a facet
cache issue. Note that the facet count remains 2 even after I restart
the server, so the cache is persisted on the file system.

Strangely, the facet count is updated correctly if I modify the
document instead of deleting it (i.e. removing a keyword from the
content so that it isn't matched by the search query any more). So it
looks like only delete triggers the issue.

Now, an interesting fact is that if, on my real environment, I delete
one of the documents and then add a new one, the facet count becomes
3. So the last commit to the index, which inserts a new document,
doesn't trigger a re-computation of the facet cache. The previous
facet cache is simply incremented, so the error is perpetuated. At
this point I don't even know how to fix the facet cache without
deleting the Solr data folder so that the full index is rebuilt.

I'm still trying to figure out what is the difference between the
integration test and my real environment (as I used the same schema
and configuration). Do you know what might be wrong?

Thanks,
Marius


Solr Suggester not working in sharding (distributed search)

2014-02-17 Thread aniket potdar
I have two Solr servers (Solr 4.5.1) which are running as shards.

I have implemented the Solr suggester using the SpellCheckComponent for
auto-suggest.

When I execute the suggest URL on an individual core, the suggestions come
back properly:

http://localhost:8986/solr/core1/suggest?spellcheck.q=city%20of and
http://localhost:8987/solr/core1/suggest?spellcheck.q=city%20of

When I fire the URL with reference to the Solr wiki
(https://wiki.apache.org/solr/SpellCheckComponent#Distributed_Search_Support),
no result comes back and the exception below occurs.


URL:
http://localhost:8986/solr/core1/select?shards=localhost:8986/solr/core1,localhost:8987/solr/core1&spellcheck.q=city%20of&shards.qt=%2Fsuggest&qt=suggest

java.lang.NullPointerException at
org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:843)
at
org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:649)
at
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:628)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368) at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:619)

 
For reference, below are my schema.xml and solrconfig.xml entries:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
    <str name="field">sugg</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler class="org.apache.solr.handler.component.SearchHandler"
                name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

<fieldType name="text_autocomplete" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I have a unique field, 'id', which is stored="true" in schema.xml.

Can anyone please suggest the

Re: DIH

2014-02-17 Thread Mikhail Khludnev
On Mon, Feb 17, 2014 at 1:11 PM, Ahmet Arslan iori...@yahoo.com wrote:

 My understanding is that there is no multi-threading support in DIH. For
 some reasons, it won't have. Am I correct?


The 'threads' parameter seemed to work in 3.6 or so, but it was removed from
4.x because it caused a lot of instability.

Regarding apache flume, how it can be dih replacement? Can I index rich
 documents on my disk using flume? Can I fetch documents from
 wikipedia,jira,twitter,


I don't know Flume, and I'm not even ready to propose a DIH replacement
candidate.
I'm personally considering an old-school ETL tool, because I'm mostly
interested in joining RDBMS tables.


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Facet cache issue when deleting documents from the index

2014-02-17 Thread Ahmet Arslan
Hi Marius,

Facets are computed from indexed terms. Can you commit with the
expungeDeletes=true flag?
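
A minimal SolrJ sketch of such a commit (untested; 'server' here stands for
your SolrServer instance) might look like this:

UpdateRequest commitReq = new UpdateRequest();
// waitFlush=true, waitSearcher=true
commitReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
// ask Solr to merge away segments that contain deleted documents
commitReq.setParam("expungeDeletes", "true");
commitReq.process(server);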

Ahmet



On Monday, February 17, 2014 12:17 PM, Marius Dumitru Florea 
mariusdumitru.flo...@xwiki.com wrote:
Hi guys,

I'm using Solr 4.6.1 (embedded) and for some reason the facet cache is
not invalidated when documents are deleted from the index. Sadly, for
me, I cannot reproduce this issue with an integration test like this:

--8--
SolrInstance server = getSolrInstance();

SolrInputDocument document = new SolrInputDocument();
document.setField(id, foo);
document.setField(locale, en);
server.add(document);

server.commit();

document = new SolrInputDocument();
document.setField(id, bar);
document.setField(locale, en);
server.add(document);

server.commit();

SolrQuery query = new SolrQuery(*:*);
query.set(facet, on);
query.set(facet.field, locale);
QueryResponse response = server.query(query);

Assert.assertEquals(2, response.getResults().size());
FacetField localeFacet = response.getFacetField(locale);
Assert.assertEquals(1, localeFacet.getValues().size());
Count en = localeFacet.getValues().get(0);
Assert.assertEquals(en, en.getName());
Assert.assertEquals(2, en.getCount());

server.delete(foo);
server.commit();

response = server.query(query);

Assert.assertEquals(1, response.getResults().size());
localeFacet = response.getFacetField(locale);
Assert.assertEquals(1, localeFacet.getValues().size());
en = localeFacet.getValues().get(0);
Assert.assertEquals(en, en.getName());
Assert.assertEquals(1, en.getCount());
--8--

Nevertheless, when I do the 'same' on my real environment, the count
for the locale facet remains 2 after one of the documents is deleted.
The search result count is fine, so that's why I think it's a facet
cache issue. Note that the facet count remains 2 even after I restart
the server, so the cache is persisted on the file system.

Strangely, the facet count is updated correctly if I modify the
document instead of deleting it (i.e. removing a keyword from the
content so that it isn't matched by the search query any more). So it
looks like only delete triggers the issue.

Now, an interesting fact is that if, on my real environment, I delete
one of the documents and then add a new one, the facet count becomes
3. So the last commit to the index, which inserts a new document,
doesn't trigger a re-computation of the facet cache. The previous
facet cache is simply incremented, so the error is perpetuated. At
this point I don't even know how to fix the facet cache without
deleting the Solr data folder so that the full index is rebuild.

I'm still trying to figure out what is the difference between the
integration test and my real environment (as I used the same schema
and configuration). Do you know what might be wrong?

Thanks,
Marius



Solr Suggester not working in sharding (distributed search)

2014-02-17 Thread Aniket Potdar

I have two Solr servers (Solr 4.5.1) which are running as shards.

I have implemented the Solr suggester using the SpellCheckComponent for
auto-suggest.

When I execute the suggest URL on an individual core, the suggestions come
back properly:

mysolr.com:8986/solr/core1/suggest?spellcheck.q=city%20of and
mysolr.com:8987/solr/core1/suggest?spellcheck.q=city%20of

When I fire the URL with reference to the Solr wiki
(wiki.apache.org/solr/SpellCheckComponent#Distributed_Search_Support),
no result comes back and the exception below occurs.

URL:
mysolr.com:8986/solr/core1/select?shards=mysolr.com:8986/solr/core1,mysolr.com:8987/solr/core1&spellcheck.q=city%20of&shards.qt=%2Fsuggest&qt=suggest


java.lang.NullPointerException at 
org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:843) 
at 
org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:649) 
at 
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:628) 
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311) 
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) 
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859) at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703) 
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406) 
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) 
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) 
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) 
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) 
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) 
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) 
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) 
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) 
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) 
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) 
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) 
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) 
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) 
at org.eclipse.jetty.server.Server.handle(Server.java:368) at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) 
at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) 
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942) 
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004) 
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) 
at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) 
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) 
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) 
at java.lang.Thread.run(Thread.java:619)


For reference, below are my schema.xml and solrconfig.xml entries:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
    <str name="field">sugg</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler class="org.apache.solr.handler.component.SearchHandler"
                name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

<fieldType name="text_autocomplete" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Re: Facet cache issue when deleting documents from the index

2014-02-17 Thread Ahmet Arslan
Hi,

Also, I noticed that in your code snippet you have server.delete(foo), which
does not exist; deleteById and deleteByQuery are the methods defined in the
SolrServer implementation.



On Monday, February 17, 2014 1:42 PM, Ahmet Arslan iori...@yahoo.com wrote:
Hi Marius,

Facets are computed from indexed terms. Can you commit with expungeDeletes=true 
flag?

Ahmet




On Monday, February 17, 2014 12:17 PM, Marius Dumitru Florea 
mariusdumitru.flo...@xwiki.com wrote:
Hi guys,

I'm using Solr 4.6.1 (embedded) and for some reason the facet cache is
not invalidated when documents are deleted from the index. Sadly, for
me, I cannot reproduce this issue with an integration test like this:

--8--
SolrInstance server = getSolrInstance();

SolrInputDocument document = new SolrInputDocument();
document.setField(id, foo);
document.setField(locale, en);
server.add(document);

server.commit();

document = new SolrInputDocument();
document.setField(id, bar);
document.setField(locale, en);
server.add(document);

server.commit();

SolrQuery query = new SolrQuery(*:*);
query.set(facet, on);
query.set(facet.field, locale);
QueryResponse response = server.query(query);

Assert.assertEquals(2, response.getResults().size());
FacetField localeFacet = response.getFacetField(locale);
Assert.assertEquals(1, localeFacet.getValues().size());
Count en = localeFacet.getValues().get(0);
Assert.assertEquals(en, en.getName());
Assert.assertEquals(2, en.getCount());

server.delete(foo);
server.commit();

response = server.query(query);

Assert.assertEquals(1, response.getResults().size());
localeFacet = response.getFacetField(locale);
Assert.assertEquals(1, localeFacet.getValues().size());
en = localeFacet.getValues().get(0);
Assert.assertEquals(en, en.getName());
Assert.assertEquals(1, en.getCount());
--8--

Nevertheless, when I do the 'same' on my real environment, the count
for the locale facet remains 2 after one of the documents is deleted.
The search result count is fine, so that's why I think it's a facet
cache issue. Note that the facet count remains 2 even after I restart
the server, so the cache is persisted on the file system.

Strangely, the facet count is updated correctly if I modify the
document instead of deleting it (i.e. removing a keyword from the
content so that it isn't matched by the search query any more). So it
looks like only delete triggers the issue.

Now, an interesting fact is that if, on my real environment, I delete
one of the documents and then add a new one, the facet count becomes
3. So the last commit to the index, which inserts a new document,
doesn't trigger a re-computation of the facet cache. The previous
facet cache is simply incremented, so the error is perpetuated. At
this point I don't even know how to fix the facet cache without
deleting the Solr data folder so that the full index is rebuild.

I'm still trying to figure out what is the difference between the
integration test and my real environment (as I used the same schema
and configuration). Do you know what might be wrong?

Thanks,
Marius



Solr cloud hangs

2014-02-17 Thread Pawel Rog
Hi,
I have a quite annoying problem with SolrCloud. I have a cluster with 8
shards, each with 2 replicas (Solr 4.6.1).
After some time the cluster doesn't respond to any update requests, and
restarting the cluster nodes doesn't help.

There are a lot of such stack traces (waiting for very long time):


   - sun.misc.Unsafe.park(Native Method)
   - java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
   - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
   - org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
   - org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
   - org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
   - org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
   - java.lang.Thread.run(Thread.java:722)


Do you have any idea where I could look?

--
Pawel


Re: Facet cache issue when deleting documents from the index

2014-02-17 Thread Marius Dumitru Florea
On Mon, Feb 17, 2014 at 2:00 PM, Ahmet Arslan iori...@yahoo.com wrote:
 Hi,


 Also I noticed that in your code snippet you have server.delete(foo); which 
 does not exists. deleteById and deleteByQuery methods are defined in 
 SolrServer implementation.

Yes, sorry, I have a wrapper over the SolrInstance that doesn't do
much. In the case of delete it just forwards the call to deleteById.
I'll check the expungeDeletes=true flag and post back the results.

Thanks,
Marius




 On Monday, February 17, 2014 1:42 PM, Ahmet Arslan iori...@yahoo.com wrote:
 Hi Marius,

 Facets are computed from indexed terms. Can you commit with 
 expungeDeletes=true flag?

 Ahmet




 On Monday, February 17, 2014 12:17 PM, Marius Dumitru Florea 
 mariusdumitru.flo...@xwiki.com wrote:
 Hi guys,

 I'm using Solr 4.6.1 (embedded) and for some reason the facet cache is
 not invalidated when documents are deleted from the index. Sadly, for
 me, I cannot reproduce this issue with an integration test like this:

 --8--
 SolrInstance server = getSolrInstance();

 SolrInputDocument document = new SolrInputDocument();
 document.setField(id, foo);
 document.setField(locale, en);
 server.add(document);

 server.commit();

 document = new SolrInputDocument();
 document.setField(id, bar);
 document.setField(locale, en);
 server.add(document);

 server.commit();

 SolrQuery query = new SolrQuery(*:*);
 query.set(facet, on);
 query.set(facet.field, locale);
 QueryResponse response = server.query(query);

 Assert.assertEquals(2, response.getResults().size());
 FacetField localeFacet = response.getFacetField(locale);
 Assert.assertEquals(1, localeFacet.getValues().size());
 Count en = localeFacet.getValues().get(0);
 Assert.assertEquals(en, en.getName());
 Assert.assertEquals(2, en.getCount());

 server.delete(foo);
 server.commit();

 response = server.query(query);

 Assert.assertEquals(1, response.getResults().size());
 localeFacet = response.getFacetField(locale);
 Assert.assertEquals(1, localeFacet.getValues().size());
 en = localeFacet.getValues().get(0);
 Assert.assertEquals(en, en.getName());
 Assert.assertEquals(1, en.getCount());
 --8--

 Nevertheless, when I do the 'same' on my real environment, the count
 for the locale facet remains 2 after one of the documents is deleted.
 The search result count is fine, so that's why I think it's a facet
 cache issue. Note that the facet count remains 2 even after I restart
 the server, so the cache is persisted on the file system.

 Strangely, the facet count is updated correctly if I modify the
 document instead of deleting it (i.e. removing a keyword from the
 content so that it isn't matched by the search query any more). So it
 looks like only delete triggers the issue.

 Now, an interesting fact is that if, on my real environment, I delete
 one of the documents and then add a new one, the facet count becomes
 3. So the last commit to the index, which inserts a new document,
 doesn't trigger a re-computation of the facet cache. The previous
 facet cache is simply incremented, so the error is perpetuated. At
 this point I don't even know how to fix the facet cache without
 deleting the Solr data folder so that the full index is rebuild.

 I'm still trying to figure out what is the difference between the
 integration test and my real environment (as I used the same schema
 and configuration). Do you know what might be wrong?

 Thanks,
 Marius



Best way to copy data from SolrCloud to standalone Solr?

2014-02-17 Thread Daniel Bryant

Hi all,

I have a production SolrCloud server which has multiple sharded indexes, 
and I need to copy all of the indexes to a (non-cloud) Solr server 
within our QA environment.


Can I ask for advice on the best way to do this please?

I've searched the web and found solr2solr 
(https://github.com/dbashford/solr2solr), but the author states that 
this is best for small indexes, and ours are rather large at ~20Gb each. 
I've also looked at replication, but can't find a definitive reference on
how this should be done between SolrCloud and standalone Solr.


Any guidance is very much appreciated.

Best wishes,

Daniel



--
*Daniel Bryant  |  Software Development Consultant  |  www.tai-dev.co.uk*
daniel.bry...@tai-dev.co.uk  |  +44 (0) 7799406399  |  Twitter: @taidevcouk


Re: Solr cloud hangs

2014-02-17 Thread Mark Miller
Can you share the full stack trace dump?

- Mark

http://about.me/markrmiller

On Feb 17, 2014, at 7:07 AM, Pawel Rog pawelro...@gmail.com wrote:

 Hi,
 I have quite annoying problem with Solr cloud. I have a cluster with 8
 shards and with 2 replicas in each. (Solr 4.6.1)
 After some time cluster doesn't respond to any update requests. Restarting
 the cluster nodes doesn't help.
 
 There are a lot of such stack traces (waiting for very long time):
 
 
   - sun.misc.Unsafe.park(Native Method)
   - java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
   -
   
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
   -
   org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
   -
   
 org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
   -
   
 org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
   -
   
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
   - java.lang.Thread.run(Thread.java:722)
 
 
 Do you have any idea where can I look for?
 
 --
 Pawel



Re: Solrcloud: no registered leader found and new searcher error

2014-02-17 Thread Erick Erickson
I think commits are not really the issue here. It _looks_ like
at least one node in your document collection is failing to
start, in fact your shard 2. On the Solr admin screen, the
cloud section on the left should show you the states of all
your nodes, make sure they're all green.

My guess is that if you look at your Solr logs on the nodes that
aren't coming up, you'll have a better idea of what's happening.

You need to get all the nodes running first before worrying about
messages like you're showing.

Best,
Erick


On Mon, Feb 17, 2014 at 1:28 AM, sweety sweetyshind...@yahoo.com wrote:

 I have configured solrcloud as follows,
 http://lucene.472066.n3.nabble.com/file/n4117724/Untitled.png

 Solr.xml:
  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}"
           hostPort="${jetty.port:}" hostContext="solr">
      <core loadOnStartup="true" instanceDir="document\" transient="false"
            name="document"/>
      <core loadOnStartup="true" instanceDir="contract\" transient="false"
            name="contract"/>
    </cores>
  </solr>

 I  have added all the required config for solrcloud, referred this :
 http://wiki.apache.org/solr/SolrCloud#Required_Config

 I am adding data to core:document.
 Now when i try to index using solrnet, (solr.Add(doc)) , i get this error :
 SEVERE: org.apache.solr.common.SolrException: *No registered leader was
 found, collection:document* slice:shard2
 at

 org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:481)

 and this error also:
 SEVERE: null:java.lang.RuntimeException: *SolrCoreState already closed*
 at

 org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:84)
 at

 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:520)

 I guess, it is because the leader is from core:contract and i am trying to
 index in core:document?
 Is there a way to change the leader, and how ?
 How can i change the state of shards from gone to active?

 Also when i try to query : q=*:* , this is shown
 org.apache.solr.common.SolrException: *Error opening new searcher at*
 org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1415) at

 I read that, if number of commits exceed then this searcher error comes,
 but
 i did not issue commit command,then how will the commit exceed. Also it
 requires some warming setting, so i added this to solrconfig.xml, but still
 i get the same error,

  <query>
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">solr</str>
          <str name="start">0</str>
          <str name="rows">10</str>
        </lst>
        <lst>
          <str name="q">rocks</str>
          <str name="start">0</str>
          <str name="rows">10</str>
        </lst>
      </arr>
    </listener>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>

 I have just started with solrcloud, please tell if I am doing anything
 wrong
 in solrcloud configurations.
 Also i did not good material for solrcloud in windows 7 with apache tomcat
 ,
 please suggest for that too.
 Thanks a lot.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solrcloud-no-registered-leader-found-and-new-searcher-error-tp4117724.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: block join and atomic updates

2014-02-17 Thread Mikhail Khludnev
Hello,

It sounds like you need to switch to query time join.
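
For reference, a rough, untested sketch of what a query-time join lookup
could look like from SolrJ -- the field names 'afId' and 'childField' are
made-up placeholders for the AF/AC_* idea below, not the actual schema:

// Find child docs matching a condition, then join from their foreign-key
// field ('afId') back to the parent documents whose 'id' matches.
SolrQuery q = new SolrQuery("{!join from=afId to=id}childField:value");
QueryResponse rsp = server.query(q);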
On 15.02.2014 at 21:57, m...@preselect-media.com wrote:

 Any suggestions?


 Zitat von m...@preselect-media.com:

  Yonik Seeley yo...@heliosearch.com:

 On Thu, Feb 13, 2014 at 8:25 AM,  m...@preselect-media.com wrote:

 Is there any workaround to perform atomic updates on blocks or do I
 have to
 re-index the parent document and all its children always again if I
 want to
 update a field?


 The latter, unfortunately.


 Is there any plan to change this behavior in near future?

  So, I'm thinking of alternatives without losing the benefit of block
 join.
 I try to explain an idea I just thought about:

 Let's say I have a parent document A with a number of fields I want to
 update regularly and a number of child documents AC_1 ... AC_n which are
 only indexed once and aren't going to change anymore.
 So, if I index A and AC_* in a block and I update A, the block is gone.
 But if I create an additional document AF which only contains something
  like a foreign key to A and index AF + AC_* as a block (not A + AC_*
 anymore), could I perform a {!parent ... } query on AF + AC_* and make an
 join from the results to get A?
  Does this make any sense and is it even possible? ;-)
 And if it's possible, how can I do it?

 Thanks,
 - Moritz







Re: Solrcloud: no registered leader found and new searcher error

2014-02-17 Thread sweety
How do I get them running?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solrcloud-no-registered-leader-found-and-new-searcher-error-tp4117724p4117830.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr cloud hangs

2014-02-17 Thread Pawel Rog
Hi,
Here is the whole stack trace: https://gist.github.com/anonymous/9056783

--
Pawel

On Mon, Feb 17, 2014 at 4:53 PM, Mark Miller markrmil...@gmail.com wrote:

 Can you share the full stack trace dump?

 - Mark

 http://about.me/markrmiller

 On Feb 17, 2014, at 7:07 AM, Pawel Rog pawelro...@gmail.com wrote:

  Hi,
  I have quite annoying problem with Solr cloud. I have a cluster with 8
  shards and with 2 replicas in each. (Solr 4.6.1)
  After some time cluster doesn't respond to any update requests.
 Restarting
  the cluster nodes doesn't help.
 
  There are a lot of such stack traces (waiting for very long time):
 
 
- sun.misc.Unsafe.park(Native Method)
-
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
-
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
-
 
 org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
-
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
-
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
-
 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
- java.lang.Thread.run(Thread.java:722)
 
 
  Do you have any idea where I can look?
 
  --
  Pawel




Re: Best way to copy data from SolrCloud to standalone Solr?

2014-02-17 Thread Shawn Heisey
On 2/17/2014 8:32 AM, Daniel Bryant wrote:
 I have a production SolrCloud server which has multiple sharded indexes,
 and I need to copy all of the indexes to a (non-cloud) Solr server
 within our QA environment.
 
 Can I ask for advice on the best way to do this please?
 
 I've searched the web and found solr2solr
 (https://github.com/dbashford/solr2solr), but the author states that
 this is best for small indexes, and ours are rather large at ~20Gb each.
 I've also looked at replication, but can't find a definite reference on
 how this should be done between SolrCloud and Solr?
 
 Any guidance is very much appreciated.

If the master index isn't changing at the time of the copy, and you're
on a non-Windows platform, you should be able to copy the index
directory directly.  On a Windows platform, whether you can copy the
index while Solr is using it would depend on how Solr/Lucene opens the
files.  A typical Windows file open will prevent anything else from
opening them, and I do not know whether Lucene is smarter than that.

SolrCloud requires the replication handler to be enabled on all configs,
but during normal operation, it does not actually use replication.  This
is a confusing thing for some users.

I *think* you can configure the replication handler on slave cores with
a non-cloud config that points at the master cores, and it should
replicate the main Lucene index, but not the config files.  I have no
idea whether things will work right if you configure other master
options like replicateAfter and config files, and I also don't know if
those options might cause problems for SolrCloud itself.  Those options
shouldn't be necessary for just getting the data into a dev environment,
though.
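A slave-side replication handler along these lines (a sketch only; the
master URL and core name below are placeholders) in the QA core's
solrconfig.xml is the sort of configuration being described here:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- point at the matching shard's core on the production SolrCloud node -->
    <str name="masterUrl">http://prod-host:8983/solr/collection1_shard1_replica1</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>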

Thanks,
Shawn



Re: Solr cloud hangs

2014-02-17 Thread Pawel Rog
There are also many errors in solr log like that one:

org.apache.solr.update.StreamingSolrServers$1; error
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for
connection from pool
at
org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:232)
at
org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:199)
at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:456)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:232)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


--
Pawel


On Mon, Feb 17, 2014 at 8:01 PM, Pawel Rog pawelro...@gmail.com wrote:

 Hi,
 Here is the whole stack trace: https://gist.github.com/anonymous/9056783

 --
 Pawel


 On Mon, Feb 17, 2014 at 4:53 PM, Mark Miller markrmil...@gmail.comwrote:

 Can you share the full stack trace dump?

 - Mark

 http://about.me/markrmiller

 On Feb 17, 2014, at 7:07 AM, Pawel Rog pawelro...@gmail.com wrote:

  Hi,
  I have quite an annoying problem with SolrCloud. I have a cluster with 8
  shards and 2 replicas of each (Solr 4.6.1).
  After some time the cluster doesn't respond to any update requests.
  Restarting the cluster nodes doesn't help.
 
  There are a lot of such stack traces (waiting for very long time):
 
 
- sun.misc.Unsafe.park(Native Method)
-
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
-
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
-
 
 org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
-
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
-
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
-
 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
- java.lang.Thread.run(Thread.java:722)
 
 
  Do you have any idea where I can look?
 
  --
  Pawel





Re: Solrcloud: no registered leader found and new searcher error

2014-02-17 Thread Erick Erickson
Well, first determine whether they are running or not.

Then look at the Solr log for that node when you try to start it up.

Then post the results if you're still puzzled.

You've given us no information about what the error (if any) is, so
I'm speculating here.

You might want to review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick


On Mon, Feb 17, 2014 at 10:27 AM, sweety sweetyshind...@yahoo.com wrote:

 How do i get them running?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solrcloud-no-registered-leader-found-and-new-searcher-error-tp4117724p4117830.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Autosuggest - Strange issue with leading numbers in query

2014-02-17 Thread Developer
Hi Erik,

Thanks a lot for your reply.

I expect it to return zero suggestions since the suggested keyword doesn't
actually start with numbers.

Expected results:
Searching for ga - returns galaxy
Searching for gal - returns galaxy
Searching for 12321312321312ga - should not return any suggestion since
no such keyword (or combination) exists in the index.

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Autosuggest-Strange-issue-with-leading-numbers-in-query-tp4116751p4117846.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Could not connect or ping a core after import a big data into it...

2014-02-17 Thread Eric_Peng
Sir, after some experimenting I found that if there are more than
(roughly) 1000 documents in the core, the problem shows up.

Then when I make a query in a command window it shows:

Exception in thread main
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error
opening new searcher. exceeded limit of maxWarmingSearchers=2, try again
later.
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at ExampleSolrJClient.handler(ExampleSolrJClient.java:107)
at ExampleSolrJClient.main(ExampleSolrJClient.java:53)




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-not-connect-or-ping-a-core-after-import-a-big-data-into-it-tp4117416p4117848.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ Socket Leak

2014-02-17 Thread Jared Rodriguez
Kiran  Shawn,

Thank you both for the info, and you are both absolutely correct.  The issue
was not that sockets were leaked, but that TIME_WAIT behavior is a killer.  I
ended up fixing the problem by changing the http.maxConnections system
property, which Apache HttpClient uses internally to set up the
PoolingClientConnectionManager.  Previously, this had no value and was
defaulting to 5.  That meant that any time there were more than 50
(maxConnections * maxPerRoute) concurrent connections to the Solr server,
non-reusable connections were being opened and closed and thus left sitting
in that idle state: too many sockets.

The fix was simply tuning the pool and setting http.maxConnections to a
higher value representing the number of concurrent users that I expect.
Problem fixed, and a modest speed improvement simply from higher socket
reuse.
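For anyone else hitting this, the change amounts to something like the
following (a sketch; the value 200 and the zkHost variable are illustrative,
and it assumes the HttpClient is built so that JVM-wide system properties
are honored, e.g. via useSystemProperties()):

// Set before any HttpClient / CloudSolrServer is created,
// or pass -Dhttp.maxConnections=200 on the command line instead.
System.setProperty("http.maxConnections", "200");

CloudSolrServer server = new CloudSolrServer(zkHost);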

Thank you both for the help!

Jared




On Mon, Feb 17, 2014 at 3:03 AM, Kiran Chitturi 
kiran.chitt...@lucidworks.com wrote:

 Jared,

 I faced a similar issue when using CloudSolrServer with Solr. As Shawn
 pointed out the 'TIME_WAIT' status happens when the connection is closed
 by the http client. HTTP client closes connection whenever it thinks the
 connection is stale
 (
 https://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html
 #d5e405). Even the docs point out the stale connection checking cannot be
 all reliable.

 I see two ways to get around this:

 1. Enable 'SO_REUSEADDR'
 2. Disable stale connection checks.

 Also by default, when we create CSS it does not explicitly configure any
 http client parameters
 (
 https://github.com/apache/lucene-solr/blob/trunk/solr/solrj/src/java/org/a
 pache/solr/client/solrj/impl/CloudSolrServer.java#L124). In this case, the
 default configuration parameters (max connections, max connections per
 host) are used for a http connection. You can explicitly configure these
 params when creating CSS using HttpClientUtil:

 ModifiableSolrParams params = new ModifiableSolrParams();
 params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 128);
 params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 32);
 params.set(HttpClientUtil.PROP_FOLLOW_REDIRECTS, false);
 params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 3);
 httpClient = HttpClientUtil.createClient(params);

 final HttpClient client = HttpClientUtil.createClient(params);
 LBHttpSolrServer lb = new LBHttpSolrServer(client);
 CloudSolrServer server = new CloudSolrServer(zkConnect, lb);


 Currently, I am using http client 4.3.2 and building the client when
 creating the CSS. I also use 'SO_REUSEADDR' option and I haven't seen the
 'TIME_WAIT'  after this (may be because of better handling of stale
 connections in 4.3.2 or because of 'SO_REUSEADDR' param enabled). My
 current http client code looks like this: (works only with http client
 4.3.2)

 HttpClientBuilder httpBuilder = HttpClientBuilder.create();

 Builder socketConfig =  SocketConfig.custom();
 socketConfig.setSoReuseAddress(true);
 socketConfig.setSoTimeout(1);
 httpBuilder.setDefaultSocketConfig(socketConfig.build());
 httpBuilder.setMaxConnTotal(300);
 httpBuilder.setMaxConnPerRoute(100);

 httpBuilder.disableRedirectHandling();
 httpBuilder.useSystemProperties();
 LBHttpSolrServer lb = new LBHttpSolrServer(httpClient, parser)
 CloudSolrServer server = new CloudSolrServer(zkConnect, lb);


 There should be a way to configure socket reuse with 4.2.3 too. You can
 try different configurations. I am surprised you have 'TIME_WAIT'
 connections even after 30 minutes because 'TIME_WAIT' connection should be
 closed by default in 2 mins by O.S I think.


 HTH,

 --
 Kiran Chitturi,


 On 2/13/14 12:38 PM, Jared Rodriguez jrodrig...@kitedesk.com wrote:

 I am using solr/solrj 4.6.1 along with the apache httpclient 4.3.2 as part
 of a web application which connects to the solr server via solrj
 using CloudSolrServer();  The web application is wired up with Guice, and
 there is a single instance of the CloudSolrServer class used by all
 inbound
 requests.  All this is running on Amazon.
 
 Basically, everything looks and runs fine for a while, but even with
 moderate concurrency, solrj starts leaving sockets open.  We are handling
 only about 250 connections to the web app per minute and each of these
 issues from 3 - 7 requests to solr.  Over a 30 minute period of this type
 of use, we end up with many 1000s of lingering sockets.  I can see these
 when running netstats
 
 tcp0  0 ip-10-80-14-26.ec2.in:41098
 ip-10-99-145-47.ec2.i:glrpc
 TIME_WAIT
 
 All to the same target host, which is my solr server. There are no other
 pieces of infrastructure on that box, just solr.  Eventually, the server
 just dies as no further sockets can be opened and the opened ones are not
 reused.
 
 The solr server itself is unphased and 

Re: Could not connect or ping a core after import a big data into it...

2014-02-17 Thread Eric_Peng
I found that in this strange situation I could import, update, or delete
data (using DIH or SolrJ),
but queries would wait forever.

So I deleted all the documents (or just reduced the document count), then
restarted the server, and the problem disappeared.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-not-connect-or-ping-a-core-after-import-a-big-data-into-it-tp4117416p4117852.html
Sent from the Solr - User mailing list archive at Nabble.com.


Is it possible to load new elevate.xml on the fly?

2014-02-17 Thread Developer
Hi,

I am trying to figure out a way to switch between multiple elevate.xml
files on the fly using query parameters.

We have a scenario where we need to elevate documents based on
authentication (same core) without creating a new search handler.
*
For authenticated customers
*
elevate documents based on elevate1.xml

*For non-authenticated customers*

elevate documents based on elevate2.xml

I am not sure if there is a way to implement this using any other method. 

Any help in this regard is appreciated.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-possible-to-load-new-elevate-xml-on-the-fly-tp4117856.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Could not connect or ping a core after import a big data into it...

2014-02-17 Thread Eric_Peng
I solved it; my mistake.
I was using the Solr 4.6.1 jars, but in my solrconfig.xml I used
luceneMatchVersion 4.5.
I had just copied it from my last project and didn't check it.
Really a stupid mistake on my part.
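For anyone hitting the same thing: the element in solrconfig.xml just needs
to match the Lucene/Solr version actually on the classpath, e.g. (the value
here is only an example for 4.6):

<luceneMatchVersion>4.6</luceneMatchVersion>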



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-not-connect-or-ping-a-core-after-import-a-big-data-into-it-tp4117416p4117859.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Best way to copy data from SolrCloud to standalone Solr?

2014-02-17 Thread Michael Della Bitta
I do know for certain that the backup command on a cloud core still works.
We have a script like this running on a cron to snapshot indexes:

curl -s '
http://localhost:8080/solr/#{core}/replication?command=backup&numberToKeep=4&location=/tmp
'

(not really using /tmp for this, parameters changed to protect the guilty)

The admin handler for replication doesn't seem to be there, but the actual
API seems to work normally.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

The Science of Influence Marketing

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinionshttps://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Mon, Feb 17, 2014 at 2:02 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/17/2014 8:32 AM, Daniel Bryant wrote:
  I have a production SolrCloud server which has multiple sharded indexes,
  and I need to copy all of the indexes to a (non-cloud) Solr server
  within our QA environment.
 
  Can I ask for advice on the best way to do this please?
 
  I've searched the web and found solr2solr
  (https://github.com/dbashford/solr2solr), but the author states that
  this is best for small indexes, and ours are rather large at ~20Gb each.
  I've also looked at replication, but can't find a definite reference on
  how this should be done between SolrCloud and Solr?
 
  Any guidance is very much appreciated.

 If the master index isn't changing at the time of the copy, and you're
 on a non-Windows platform, you should be able to copy the index
 directory directly.  On a Windows platform, whether you can copy the
 index while Solr is using it would depend on how Solr/Lucene opens the
 files.  A typical Windows file open will prevent anything else from
 opening them, and I do not know whether Lucene is smarter than that.

 SolrCloud requires the replication handler to be enabled on all configs,
 but during normal operation, it does not actually use replication.  This
 is a confusing thing for some users.

 I *think* you can configure the replication handler on slave cores with
  a non-cloud config that points at the master cores, and it should
 replicate the main Lucene index, but not the config files.  I have no
 idea whether things will work right if you configure other master
 options like replicateAfter and config files, and I also don't know if
 those options might cause problems for SolrCloud itself.  Those options
 shouldn't be necessary for just getting the data into a dev environment,
 though.

 Thanks,
 Shawn




Boost Query Example

2014-02-17 Thread EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)

Hi, can someone help me with a boost & sort query example?

http://localhost:8983/solr/ProductCollection/select?q=*%3A*&wt=json&indent=true&fq=SKU:223-CL10V3^100 OR SKU:223-CL1^90

There is no difference in the result order with this query; let me know if I am missing
something. Also, I would like the exact match for SKU:223-CL10V3^100 to be ordered first.

Thanks

Ravi


Re: Facet cache issue when deleting documents from the index

2014-02-17 Thread Marius Dumitru Florea
I tried to set the expungeDeletes flag but it didn't fix the problem.
The SolrServer doesn't expose a way to set this flag so I had to use:

new UpdateRequest().setAction(UpdateRequest.ACTION.COMMIT, true, true,
1, true).process(solrServer);

Any other hints?

Note that I managed to run my test in my real environment at runtime
and it passed, so it seems the behaviour depends on the size of the
documents that are committed (added to or deleted from the index).

Thanks,
Marius

On Mon, Feb 17, 2014 at 2:32 PM, Marius Dumitru Florea
mariusdumitru.flo...@xwiki.com wrote:
 On Mon, Feb 17, 2014 at 2:00 PM, Ahmet Arslan iori...@yahoo.com wrote:
 Hi,


 Also I noticed that in your code snippet you have server.delete("foo"),
 which does not exist. deleteById and deleteByQuery methods are defined in
 the SolrServer implementation.

 Yes, sorry, I have a wrapper over the SolrInstance that doesn't do
 much. In the case of delete it just forwards the call to deleteById.
 I'll check the expungeDeletes=true flag and post back the results.

 Thanks,
 Marius




 On Monday, February 17, 2014 1:42 PM, Ahmet Arslan iori...@yahoo.com wrote:
 Hi Marius,

 Facets are computed from indexed terms. Can you commit with 
 expungeDeletes=true flag?

 Ahmet




 On Monday, February 17, 2014 12:17 PM, Marius Dumitru Florea 
 mariusdumitru.flo...@xwiki.com wrote:
 Hi guys,

 I'm using Solr 4.6.1 (embedded) and for some reason the facet cache is
 not invalidated when documents are deleted from the index. Sadly, for
 me, I cannot reproduce this issue with an integration test like this:

 --8<--
 SolrInstance server = getSolrInstance();

 SolrInputDocument document = new SolrInputDocument();
 document.setField("id", "foo");
 document.setField("locale", "en");
 server.add(document);

 server.commit();

 document = new SolrInputDocument();
 document.setField("id", "bar");
 document.setField("locale", "en");
 server.add(document);

 server.commit();

 SolrQuery query = new SolrQuery("*:*");
 query.set("facet", "on");
 query.set("facet.field", "locale");
 QueryResponse response = server.query(query);

 Assert.assertEquals(2, response.getResults().size());
 FacetField localeFacet = response.getFacetField("locale");
 Assert.assertEquals(1, localeFacet.getValues().size());
 Count en = localeFacet.getValues().get(0);
 Assert.assertEquals("en", en.getName());
 Assert.assertEquals(2, en.getCount());

 server.delete("foo");
 server.commit();

 response = server.query(query);

 Assert.assertEquals(1, response.getResults().size());
 localeFacet = response.getFacetField("locale");
 Assert.assertEquals(1, localeFacet.getValues().size());
 en = localeFacet.getValues().get(0);
 Assert.assertEquals("en", en.getName());
 Assert.assertEquals(1, en.getCount());
 --8<--

 Nevertheless, when I do the 'same' on my real environment, the count
 for the locale facet remains 2 after one of the documents is deleted.
 The search result count is fine, so that's why I think it's a facet
 cache issue. Note that the facet count remains 2 even after I restart
 the server, so the cache is persisted on the file system.

 Strangely, the facet count is updated correctly if I modify the
 document instead of deleting it (i.e. removing a keyword from the
 content so that it isn't matched by the search query any more). So it
 looks like only delete triggers the issue.

 Now, an interesting fact is that if, on my real environment, I delete
 one of the documents and then add a new one, the facet count becomes
 3. So the last commit to the index, which inserts a new document,
 doesn't trigger a re-computation of the facet cache. The previous
 facet cache is simply incremented, so the error is perpetuated. At
 this point I don't even know how to fix the facet cache without
 deleting the Solr data folder so that the full index is rebuilt.

 I'm still trying to figure out what is the difference between the
 integration test and my real environment (as I used the same schema
 and configuration). Do you know what might be wrong?

 Thanks,
 Marius



Re: Boost Query Example

2014-02-17 Thread Michael Della Bitta
Hi,

Filter queries don't affect score, so boosting won't have an effect there.
If you want those query terms to get boosted, move them into the q
parameter.

http://wiki.apache.org/solr/CommonQueryParameters#fq
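For example (a sketch; it assumes the same SKU field and leaves the rest of
the request unchanged), the boosted clauses would move out of fq and into q:

http://localhost:8983/solr/ProductCollection/select?q=SKU:223-CL10V3^100 OR SKU:223-CL1^90&wt=json&indent=true

You can keep a separate fq (without boosts) purely for filtering if you
still need to restrict the result set.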

Hope that helps!

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

The Science of Influence Marketing

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinionshttps://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Mon, Feb 17, 2014 at 3:49 PM, EXTERNAL Taminidi Ravi (ETI,
Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com wrote:


 Hi can some one help me on the Boost  Sort query example.

  http://localhost:8983/solr/ProductCollection/select?q=*%3A*&wt=json&indent=true&fq=SKU:223-CL10V3^100
 OR SKU:223-CL1^90

 There is not different in the query Order, Let me know if I am missing
 something. Also I like to Order with the exact match for SKU:223-CL10V3^100

 Thanks

 Ravi



DIH and Tika

2014-02-17 Thread Teague James
Is there a way to specify the document types that Tika parses? In my DIH I
index the content of a SQL database which has a field that points to the SQL
record's binary file (which could be Word, PDF, JPG, MOV, etc.). Tika then
uses the document URL to index that document's content. However there are a
lot of document types that Tika cannot parse. I'd like to limit Tika to just
parsing Word and PDF documents so that I don't have to wait for Tika to
determine the document type and whether or not it can parse it. I suspect
that the number of exceptions being thrown over documents that Tika cannot
read is increasing my indexing time significantly. Any guidance is
appreciated.
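For context, the kind of entity definition I mean looks roughly like this
(a sketch only; entity, field, and data source names are illustrative):

<dataSource name="binUrl" type="BinURLDataSource"/>
<entity name="sqlDoc" query="SELECT id, title, fileUrl FROM documents">
  <entity name="tika" processor="TikaEntityProcessor"
          url="${sqlDoc.fileUrl}" dataSource="binUrl"
          format="text" onError="skip">
    <field column="text" name="content"/>
  </entity>
</entity>

The onError="skip" keeps unparsable files from aborting the import, but it
still pays the cost of handing every file to Tika first.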

-Teague



Escape \\n from getting highlighted - highlighter component

2014-02-17 Thread Developer
Hi,

When searching for text like 'talk n text', the highlighter component also
adds the <em> tags to special characters like \n. Is there a way to
avoid highlighting the special characters?

\\r\\n Family Messaging

is getting replaced with

\\r\\<em>n</em> Family Messaging



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Escape-n-from-getting-highlighted-highlighter-component-tp4117895.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Autosuggest - Strange issue with leading numbers in query

2014-02-17 Thread Erick Erickson
Ah, OK, I thought you were indexing things like 123412335ga, but not so.

Afraid I'm fresh out of ideas. Although I might try using TermsComponent
to examine the index and see if, somehow, there _are_ terms with leading
numbers in the output.
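Something along these lines (the field name is an assumption; point it at
whatever field backs your suggester) would show whether any indexed terms
really do start with digits:

http://localhost:8983/solr/collection1/terms?terms.fl=suggest_field&terms.regex=[0-9].*&terms.limit=100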

It's also possible that numbers are stripped when building the FST that
is used, but I don't know one way or the other.

Best,
Erick


On Mon, Feb 17, 2014 at 11:30 AM, Developer bbar...@gmail.com wrote:

 Hi Erik,

 Thanks a lot for your reply.

 I expect it to return zero suggestions since the suggested keyword doesn't
 actually start with numbers.

 Expected results:
 Searching for ga - returns galaxy
 Searching for gal - returns galaxy
 Searching for 12321312321312ga - should not return any suggestion since
 no such keyword (or combination) exists in the index.

 Thanks




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Autosuggest-Strange-issue-with-leading-numbers-in-query-tp4116751p4117846.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Could not connect or ping a core after import a big data into it...

2014-02-17 Thread Erick Erickson
Glad it's resolved, thanks for letting us know, it
removes some uncertainty.

Erick


On Mon, Feb 17, 2014 at 12:23 PM, Eric_Peng sagittariuse...@gmail.comwrote:

  I solved it; my mistake.
  I was using the Solr 4.6.1 jars, but in my solrconfig.xml I used
  luceneMatchVersion 4.5.
  I had just copied it from my last project and didn't check it.
  Really a stupid mistake on my part.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Could-not-connect-or-ping-a-core-after-import-a-big-data-into-it-tp4117416p4117859.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR suggester component - Get suggestion dump

2014-02-17 Thread bbi123
I started using terms component to view the terms and the counts...

terms?terms.fl=autocomplete_phrase&terms.regex=a.*&terms.limit=1000



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-suggester-component-Get-suggestion-dump-tp4110026p4117913.html
Sent from the Solr - User mailing list archive at Nabble.com.


Preventing multiple on-deck searchers without causing failed commits

2014-02-17 Thread Colin Bartolome
We're using Solr version 4.2.1, in case new functionality has helped with 
this issue.


We have our Solr servers doing automatic soft commits with maxTime=1000. 
We also have a scheduled job that triggers a hard commit every fifteen 
minutes. When one of these hard commits happens while a soft commit is 
already in progress, we get that ubiquitous warning:


PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Recently, we had an occasion to have a second scheduled job also issue a 
hard commit every now and then. Since our maxWarmingSearchers value was 
set to the default, 2, we occasionally had a hard commit trigger when two 
other searchers were already warming up, which led to this:


org.apache.solr.client.solrj.SolrServerException: No live SolrServers 
available to handle this request


as the servers started responding with a 503 HTTP response.

It seems like automatic soft commits wait until the hard commits are out 
of the way before they proceed. Is there a way to do the same for hard 
commits? Since we're passing waitSearcher=true in the update request that 
triggers the hard commits, I would expect the request to block until the 
server had enough headroom to service the commit. I did not expect that 
we'd start getting 503 responses.


Is there a way to pull this off, either via some extra request parameters 
or via some server-side configuration?


Slow 95th-percentile

2014-02-17 Thread Allan Carroll
Hi all,

I'm having trouble getting my Solr setup to deliver consistent performance. Average 
select latency is great, but the 95th percentile is dismal (10x the average). It's probably 
something slightly misconfigured. I’ve seen it have nice, low variance 
latencies for a few hours here and there, but can’t figure out what’s different 
during those times.


* I’m running 4.1.0 using SolrCloud. 3 replicas of 1 shard on 3 EC2 boxes 
(8proc, 30GB RAM, SSDs). Load peaks around 30 selects per second and about 150 
updates per second. 

* The index has about 11GB of data in 14M docs, the other 10MB of data in 3K 
docs. Stays around 30 segments.

* Soft commits after 10 seconds, hard commits after 120 seconds. Though, 
turning off the update traffic doesn’t seem to have any effect on the select 
latencies.

* I think GC latency is low. Running 3GB heaps with 1G new size. GC time is 
around 3ms per second.
 

Here’s a typical select query:

fl=*,sortScore:textScore&sort=textScore desc&start=0&q=text:((soccer OR MLS
OR "premier league" OR FIFA OR "world cup") OR (sorority OR fraternity OR
"greek life" OR dorm OR campus))&wt=json&fq=startTime:[139265640 TO
139271754]&fq={!frange l=2 u=3}timeflag(startTime)&fq={!frange
l=139265640 u=139269594
cache=false}timefix(startTime,-2160)&fq=privacy:OPEN&defType=edismax&rows=131


Anyone have any suggestions on where to look next? Or, if you know someone in 
the bay area that would consult for an hour or two and help me track it down, 
that’d be great too.

Thanks!

-Allan

Re: Preventing multiple on-deck searchers without causing failed commits

2014-02-17 Thread Shawn Heisey
On 2/17/2014 6:06 PM, Colin Bartolome wrote:
 We're using Solr version 4.2.1, in case new functionality has helped
 with this issue.
 
 We have our Solr servers doing automatic soft commits with maxTime=1000.
 We also have a scheduled job that triggers a hard commit every fifteen
 minutes. When one of these hard commits happens while a soft commit is
 already in progress, we get that ubiquitous warning:
 
 PERFORMANCE WARNING: Overlapping onDeckSearchers=2
 
 Recently, we had an occasion to have a second scheduled job also issue a
 hard commit every now and then. Since our maxWarmingSearchers value was
 set to the default, 2, we occasionally had a hard commit trigger when
 two other searchers were already warming up, which led to this:
 
 org.apache.solr.client.solrj.SolrServerException: No live SolrServers
 available to handle this request
 
 as the servers started responding with a 503 HTTP response.
 
 It seems like automatic soft commits wait until the hard commits are out
 of the way before they proceed. Is there a way to do the same for hard
 commits? Since we're passing waitSearcher=true in the update request
 that triggers the hard commits, I would expect the request to block
 until the server had enough headroom to service the commit. I did not
 expect that we'd start getting 503 responses.

Remember this mantra: Hard commits are about durability, soft commits
are about visibility.  You might already know this, but it is the key to
figuring out how to handle commits, whether they are user-triggered or
done automatically by the server.

With Solr 4.x, it's best to *always* configure autoCommit with
openSearcher=false.  This does a hard commit but does not open a new
searcher.  The result: Data is flushed to disk and the current
transaction log is closed.  New documents will not be searchable after
this kind of commit.  For maxTime and maxDocs, pick values that won't
result in huge transaction logs, which increase Solr startup time.

http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup

For document visibility, you can rely on autoSoftCommit, and you
indicated that you already have it configured. Decide how long you can
wait for new content that has just been indexed.  Do you *really* need
new data to be searchable within one second?  If so, you're good.  If
not, increase the maxTime value here.  Be sure to make the value at
least a little bit longer than the amount of time it takes for a soft
commit to finish, including cache warmup time.
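Putting that together, the relevant solrconfig.xml section would look
something like this (the times are only examples; tune them to your own
durability and visibility needs):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit: flush + close tlog -->
    <openSearcher>false</openSearcher> <!-- never open a searcher here -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>            <!-- new documents visible within ~5s -->
  </autoSoftCommit>
</updateHandler>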

Thanks,
Shawn



Re: Slow 95th-percentile

2014-02-17 Thread Shawn Heisey
On 2/17/2014 6:12 PM, Allan Carroll wrote:
 I'm having trouble getting my Solr setup to get consistent performance. 
 Average select latency is great, but 95% is dismal (10x average). It's 
 probably something slightly misconfigured. I’ve seen it have nice, low 
 variance latencies for a few hours here and there, but can’t figure out 
 what’s different during those times.
 
 
 * I’m running 4.1.0 using SolrCloud. 3 replicas of 1 shard on 3 EC2 boxes 
 (8proc, 30GB RAM, SSDs). Load peaks around 30 selects per second and about 
 150 updates per second. 
 
 * The index has about 11GB of data in 14M docs, the other 10MB of data in 3K 
 docs. Stays around 30 segments.
 
 * Soft commits after 10 seconds, hard commits after 120 seconds. Though, 
 turning off the update traffic doesn’t seem to have any effect on the select 
 latencies.
 
 * I think GC latency is low. Running 3GB heaps with 1G new size. GC time is 
 around 3ms per second.
  
 
 Here’s a typical select query:
 
 fl=*,sortScore:textScore&sort=textScore desc&start=0&q=text:((soccer OR
 MLS OR "premier league" OR FIFA OR "world cup") OR (sorority OR
 fraternity OR "greek life" OR dorm OR
 campus))&wt=json&fq=startTime:[139265640 TO 139271754]&fq={!frange
 l=2 u=3}timeflag(startTime)&fq={!frange l=139265640 u=139269594
 cache=false}timefix(startTime,-2160)&fq=privacy:OPEN&defType=edismax&rows=131

The first thing to say is that it's fairly normal for the 95th and 99th
percentile values to be quite a lot higher than the median and average
values.  I don't have actual values so I don't know if it's bad or not.

You're good on the most important performance-related resource, which is
memory for the OS disk cache.  The only thing that stands out as a
possible problem from what I know so far is garbage collection.  It
might be a case of full garbage collections happening too frequently, or
it might be a case of garbage collection pauses taking too long.  It
might even be a combination of both.

To fix frequent full collections, increase the heap size.  To fix the
other problem, use the CMS collector and tune it.
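As a starting point only (these flags are illustrative, not a prescription
for your particular setup), CMS tuning usually begins with something like:

-Xms3g -Xmx3g -XX:NewSize=1g -XX:MaxNewSize=1g
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+CMSParallelRemarkEnabled

and then gets adjusted based on what the GC logs show.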

Two bits of information will help with recommendations: Your java
startup options, and your solrconfig.xml.

You're using an option in your query that I've never seen before.  I
don't know if frange is slow or not.

One last thing that might cause problems is super-frequent commits.

I could also be completely wrong!

Thanks,
Shawn



Re: Preventing multiple on-deck searchers without causing failed commits

2014-02-17 Thread Colin Bartolome

On 02/17/2014 05:38 PM, Shawn Heisey wrote:

On 2/17/2014 6:06 PM, Colin Bartolome wrote:

We're using Solr version 4.2.1, in case new functionality has helped
with this issue.

We have our Solr servers doing automatic soft commits with maxTime=1000.
We also have a scheduled job that triggers a hard commit every fifteen
minutes. When one of these hard commits happens while a soft commit is
already in progress, we get that ubiquitous warning:

PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Recently, we had an occasion to have a second scheduled job also issue a
hard commit every now and then. Since our maxWarmingSearchers value was
set to the default, 2, we occasionally had a hard commit trigger when
two other searchers were already warming up, which led to this:

org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request

as the servers started responding with a 503 HTTP response.

It seems like automatic soft commits wait until the hard commits are out
of the way before they proceed. Is there a way to do the same for hard
commits? Since we're passing waitSearcher=true in the update request
that triggers the hard commits, I would expect the request to block
until the server had enough headroom to service the commit. I did not
expect that we'd start getting 503 responses.


Remember this mantra: Hard commits are about durability, soft commits
are about visibility.  You might already know this, but it is the key to
figuring out how to handle commits, whether they are user-triggered or
done automatically by the server.

With Solr 4.x, it's best to *always* configure autoCommit with
openSearcher=false.  This does a hard commit but does not open a new
searcher.  The result: Data is flushed to disk and the current
transaction log is closed.  New documents will not be searchable after
this kind of commit.  For maxTime and maxDocs, pick values that won't
result in huge transaction logs, which increase Solr startup time.

http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup

For document visibility, you can rely on autoSoftCommit, and you
indicated that you already have it configured. Decide how long you can
wait for new content that has just been indexed.  Do you *really* need
new data to be searchable within one second?  If so, you're good.  If
not, increase the maxTime value here.  Be sure to make the value at
least a little bit longer than the amount of time it takes for a soft
commit to finish, including cache warmup time.

Thanks,
Shawn



Increasing the maxTime value doesn't actually solve the problem, though; 
it just makes it a little less likely. Really, the soft commits aren't 
the problem here, as far as we can tell. It's that a request that 
triggers a hard commit simply fails when the server is already at 
maxWarmingSearchers. I would expect the request to queue up and wait 
until the server could handle it.


Re: Preventing multiple on-deck searchers without causing failed commits

2014-02-17 Thread Shawn Heisey
On 2/17/2014 7:06 PM, Colin Bartolome wrote:
 Increasing the maxTime value doesn't actually solve the problem, though;
 it just makes it a little less likely. Really, the soft commits aren't
 the problem here, as far as we can tell. It's that a request that
 triggers a hard commit simply fails when the server is already at
 maxWarmingSearchers. I would expect the request to queue up and wait
 until the server could handle it.

I think I put too much information in my reply.  Apologies.  Here's the
most important information to deal with first:

Don't send hard commits at all.  Configure autoCommit in your server
config, with the all-important openSearcher parameter set to false.
That will take care of all your hard commit needs, but those commits
will never open a new searcher, so they cannot cause an overlap with the
soft commits that DO open a new searcher.

Thanks,
Shawn



Re: Limit amount of search result

2014-02-17 Thread rachun
Hi Samee,

Thank you very much for your suggestion.
I got it working now ;)

Chun.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Limit-amount-of-search-result-tp4117062p4117952.html
Sent from the Solr - User mailing list archive at Nabble.com.