OOM on solr cloud 5.2.1, does not trigger oom_solr.sh

2015-10-22 Thread Raja Pothuganti
Hi,

Sometimes I see an OOM happen on replicas, but it does not trigger the script
oom_solr.sh, which was passed in as
-XX:OnOutOfMemoryError=/actualLocation/solr/bin/oom_solr.sh 8091.

These OOMs happened while DIH was importing data from the database. Is this a
known issue? Is there a quick fix?

Here are the stack traces from when the OOM occurred:


1)
org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:593)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:465)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space



2)
org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception writing document id R277453962 to the index; possible analysis error.
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
    at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:101)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:179)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:135)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:241)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:206)
    at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:126)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:186)
    at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:111)
    at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java

Re: Help on Out of memory when using Cursor with sort on Unique Key

2015-09-08 Thread Raja Pothuganti
Hi Naresh

1) For the 'sort by' field, have you considered setting docValues=true in the
schema definition?
If you change the schema definition, you will need to redo a full reindex
after backing up & deleting the current index from dataDir.
Also note that adding docValues=true will increase the size of the index.

2) >Each node memory parameter : -Xms2g, -Xmx4g
What is the basis for choosing those memory sizes? Have you observed the heap
through JConsole or VisualVM?
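
In case it is useful, here is a rough, minimal SolrJ sketch of the
cursor-based paging loop you describe, sorted on the unique key (the URL and
result handling are placeholders for your setup):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorPaging {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point this at your own collection.
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(100);
            q.setFields("tids");
            // The cursor requires a sort that includes the uniqueKey field.
            q.setSort("tids", SolrQuery.ORDER.desc);

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(q);
                // ... process rsp.getResults() here ...
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) {
                    break; // no more pages
                }
                cursor = next;
            }
            solr.shutdown();
        }
    }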

Raja
On 9/8/15, 8:57 AM, "Naresh Yadav"  wrote:

>Cluster details :
>
>Solr Version  : solr-4.10.4
>No of nodes : 2 each 16 GB RAM
>No of shards : 2
>Replication : 1
>Each node memory parameter : -Xms2g, -Xmx4g
>
>Collection details :
>
>No of docs in my collection : 12.31 million
>Indexed field per document : 2
>Unique key field : tids
>Stored field per document : varies 30-40
>Total index size node1+node2 = 13gb+13gb=26gb
>
>Query throwing Heap Space : /select?q=*:*&sort=tids+desc&rows=100&fl=tids
>
>Query working : /select?q=*:*&rows=100&fl=tids
>
>I am using sort on the unique key field tids for cursor-based pagination
>with a page size of 100.
>
>Already tried :
>
>I also tried tweaking Xmx but the problem was not solved.
>I also tried a q with criteria on an indexed field matching only 4200 hits;
>that also does not work when the sort parameter is included.
>
>Please help me here, as I am clueless why there is an OOM error when
>fetching 100 documents.
>
>Thanks
>Naresh



Re: spread index not equally each sharding

2015-07-31 Thread Raja Pothuganti
As far as I know, sharding is done on the basis of a hash of the unique key
(by default). So most of the time each shard will have an almost equal number
of documents. But each document may have a different size, which can show up
as a different index size per shard.
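
As a quick sanity check (assuming the default compositeId routing), you could
query each core directly with something like
http://host:8983/solr/<collection>_shardN_replicaM/select?q=*:*&rows=0&distrib=false
and compare numFound across the shards; if the counts are close, the size
difference comes from the documents themselves rather than from uneven
routing.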
Thanks

On 7/31/15, 5:49 AM, "wilanjar ."  wrote:

>hi folks,
>
>I'm a new joiner on this mailing list and have a question about the amount
>of index data on each shard.
>I have 3 shards in a collection, but the index size on each shard is not
>equal or even close across shards.
>Below is an example:
> du -sh shard1_replica12/
>1.1G    shard1_replica12/
> du -sh shard2_replica12/
>1.5G    shard2_replica12/
> du -sh shard3_replica12/
>841M    shard3_replica12/
>Shard2 is bigger than the others; maybe someone can give some enlightenment
>about it?
>
>Thanks



Re: Data Import Handler Stays Idle

2015-07-20 Thread Raja Pothuganti
>Yes the number of unimported matches (with IOExceptions)

What is the IOException about?

On 7/20/15, 5:10 PM, "Paden"  wrote:

>Yes, the number of unimported matches. No, I did not specify "false" for
>commit on any of my DataImportHandler requests. Since it defaults to true I
>really didn't take it into account though.
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp421825
>0p4218262.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Re: Data Import Handler Stays Idle

2015-07-20 Thread Raja Pothuganti
Regarding the number of IOExceptions: are they equal to the number of
un-imported/unprocessed documents?

By any chance, is commit set to false in the import request?
For example:
http://localhost:8983/solr/db/dataimport?command=full-import&commit=false


Thanks
Raja

On 7/20/15, 4:51 PM, "Paden"  wrote:

>I was consistently checking the logs to see if there were any errors that
>would explain the idling. There were no errors except for a few skipped
>documents due to some illegal IOExceptions from Tika, but none of those
>occurred around the time that Solr began idling. A lot of font warnings,
>but again, nothing but font warnings around the time of idling.
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp421825
>0p4218260.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to create indexes for files in hdfs using tika embeded in solr?

2015-07-18 Thread Raja Pothuganti
Would the MapReduceIndexerTool be an option?


http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html
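
The invocation is roughly along these lines (the jar path, ZooKeeper address,
collection name, morphline file and HDFS paths below are placeholders; see
the page above for the exact usage):

    hadoop jar /path/to/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
      --zk-host zk1:2181/solr \
      --collection my_collection \
      --morphline-file morphline.conf \
      --output-dir hdfs://namenode:8020/tmp/mrit-out \
      --go-live \
      hdfs://namenode:8020/path/to/input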



On 7/18/15, 9:38 AM, "步青云"  wrote:

>I need help. I have several hundred GB of files in HDFS and I want to
>create indexes for these files so that I can search quickly. How can I
>create indexes for these files in HDFS? I know Tika, embedded in Solr, can
>extract the content of files on the local file system and then Solr will
>create indexes for those files. What I need to do is set the path of the
>file; then ContentStreamUpdateRequest will extract the content of the file
>using Tika and create indexes in Solr. The Java code is as follows:
>  public void indexFilesSolrCell(FileBean fileBean)
>          throws IOException, SolrServerException {
>      try {
>          SolrServer solr = FtrsSolrServer.getServer();
>          ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
>          up.addFile(new File(fileBean.getLocalPath()), fileBean.getContentType()); // set the path of the file
>          up.setParam("literal.id", UUID.randomUUID().toString());
>          up.setParam("literal.create_time", fileBean.getCreateTime());
>          up.setParam("literal.title", fileBean.getTitle());
>          up.setParam("literal.creator", fileBean.getCreator());
>          up.setParam("literal.description", fileBean.getDescription());
>          up.setParam("literal.file_name", fileBean.getFileName());
>          up.setParam("literal.folder_path", fileBean.getFolderPath());
>          up.setParam("literal.fid", fileBean.getFid());
>          up.setParam("fmap.content", "content");
>          up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>
>          solr.request(up);
>      } catch (Exception e) {
>          e.printStackTrace();
>      }
>  }
>
>But the code above does not work for files in HDFS. When I set the file
>path as "hdfs://hadoop1:8020/.", errors occurred. The error message is
>something like "FileSystem should not start with 'hdfs://'". Can Tika not
>extract files stored in HDFS, or are there mistakes in my Java code? If
>Tika cannot extract them, how can I create indexes for the files in HDFS?
>Thanks for any reply. I urgently need help.
>Best wishes.



Re: copying data from one collection to another collection (solr cloud 521)

2015-07-15 Thread Raja Pothuganti
Hi Charles,
Thank you for the response. We will be using aliasing. Looking into ways to
avoid ingestion into each of the collections, as you mentioned: "For example,
would it be faster to make a file system copy of the most recent
collection ..."
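
For the alias flip itself, the Collections API call we expect to use is along
the lines of (host and collection names are placeholders):
http://host:8983/solr/admin/collections?action=CREATEALIAS&name=current_stuff&collections=stuff_20150719
Re-issuing CREATEALIAS with the same alias name simply re-points it at the
new collection.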

MapReduceIndexerTool is not an option at this point.


One option is to back up each shard of the current_stuff collection at the
end of the week to a particular location (say directory /opt/data/) and then
1) empty/delete the existing documents in the previous_stuff_1 collection
2) restore each corresponding shard from /opt/data/ into the previous_stuff_1
collection using backup & restore (sketched below), as suggested in
https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups+of+SolrCores
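
Per shard that would be roughly the following replication-handler calls
(host, core and snapshot names are placeholders; exact parameters as per the
page above):
http://host:8983/solr/current_stuff_shard1_replica1/replication?command=backup&location=/opt/data&name=shard1_wk
and then, against the corresponding core of previous_stuff_1:
http://host:8983/solr/previous_stuff_1_shard1_replica1/replication?command=restore&location=/opt/data&name=shard1_wk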


Trying to find out whether there are any better ways than the option above.

Thanks
Raja




On 7/15/15, 10:23 AM, "Reitzel, Charles" 
wrote:

>Since they want to explicitly search within a given "version" of the data,
>this seems like a textbook application for collection aliases.
>
>You could have N public collection names: current_stuff,
>previous_stuff_1, previous_stuff_2, ...   At any given time, these will
>be aliased to reference the "actual" collection names:
>   current_stuff -> stuff_20150712,
>   previous_stuff_1 -> stuff_20150705,
>   previous_stuff_2 -> stuff_20150628,
>   ...
>
>Every weekend, you create a new collection and index everything current
>into it.  Once done, reset all the aliases to point to the newest N
>collections and drop the oldest:
>   current_stuff -> stuff_20150719
>   previous_stuff_1 -> stuff_20150712,
>   previous_stuff_2 -> stuff_20150705,
>   ...
>
>Collections API: Create or modify an Alias for a Collection
>https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4
>
>Thus, you can keep the IDs the same and use them to compare to previous
>versions of any given document.   Useful, if only for debugging purposes.
>
>Curious if there are opportunities for optimization here.  For example,
>would it be faster to make a file system copy of the most recent
>collection and load only changed documents (assuming the delta is
>available from the source system)?
>
>-Original Message-
>From: Erick Erickson [mailto:erickerick...@gmail.com]
>Sent: Monday, July 13, 2015 11:55 PM
>To: solr-user@lucene.apache.org
>Subject: Re: copying data from one collection to another collection (solr
>cloud 521)
>
>bq: does offline
>
>No. I'm talking about "collection aliasing". You can create an entirely
>new collection, index to it however  you want then switch to using that
>new collection.
>
>bq: Any updates to EXISTING document in the LIVE collection should NOT be
>replicated to the previous week(s) snapshot(s)
>
>then give it a new ID maybe?
>
>Best,
>Erick
>
>On Mon, Jul 13, 2015 at 3:21 PM, Raja Pothuganti
> wrote:
>> Thank you Erick
>>>Actually, my question is why do it this way at all? Why not index
>>>directly to your "live" nodes? This is what SolrCloud is built for.
>>>You can use "implicit" routing to create shards, say, for each week and
>>>age out the ones that are "too old" as well.
>>
>>
>> Any updates to EXISTING document in the LIVE collection should NOT be
>> replicated to the previous week(s) snapshot(s). Think of the
>> snapshot(s) as an archive of sorts, searchable independently of LIVE.
>> We're aiming to support at most 2 archives of data in the past.
>>
>>
>>>Another option would be to use "collection aliasing" to keep an
>>>offline index up to date then switch over when necessary.
>>
>> Does offline indexing refer to this link:
>> https://github.com/cloudera/search/tree/0d47ff79d6ccc0129ffadcb50f9fe0
>> b271f
>> 102aa/search-mr
>>
>>
>> Thanks
>> Raja
>>
>>
>>
>> On 7/13/15, 3:14 PM, "Erick Erickson"  wrote:
>>
>>>Actually, my question is why do it this way at all? Why not index
>>>directly to your "live" nodes? This is what SolrCloud is built for.
>>>
>>>There's the new backup/restore functionality that's still a work in
>>>progress, see: https://issues.apache.org/jira/browse/SOLR-5750
>>>
>>>You can use "implicit" routing to create shards, say, for each week and
>>>age out the ones that are "too old" as well.
>>>
>>>Another option would be to use "collection aliasing" to keep an
>>>offline index up to date then switch over when necessary.
>>>
>>>I'd

Re: copying data from one collection to another collection (solr cloud 521)

2015-07-13 Thread Raja Pothuganti
Thank you Erick
>Actually, my question is why do it this way at all? Why not index
>directly to your "live" nodes? This is what SolrCloud is built for.
>You can use "implicit" routing to create shards, say, for each week and
>age out the ones that are "too old" as well.


Any updates to EXISTING document in the LIVE collection should NOT be
replicated to the previous week(s) snapshot(s). Think of the snapshot(s)
as an archive of sorts, searchable independently of LIVE. We're aiming to
support at most 2 archives of data in the past.


>Another option would be to use "collection aliasing" to keep an
>offline index up to date then switch over when necessary.

Does offline indexing refer to this link:
https://github.com/cloudera/search/tree/0d47ff79d6ccc0129ffadcb50f9fe0b271f
102aa/search-mr


Thanks
Raja



On 7/13/15, 3:14 PM, "Erick Erickson"  wrote:

>Actually, my question is why do it this way at all? Why not index
>directly to your "live" nodes? This is what SolrCloud is built for.
>
>There's the new backup/restore functionality that's still a work in
>progress, see: https://issues.apache.org/jira/browse/SOLR-5750
>
>You can use "implicit" routing to create shards, say, for each week and
>age out the ones that are "too old" as well.
>
>Another option would be to use "collection aliasing" to keep an
>offline index up to date then switch over when necessary.
>
>I'd really like to know this isn't an XY problem though, what's the
>high-level problem you're trying to solve?
>
>Best,
>Erick
>
>On Mon, Jul 13, 2015 at 12:49 PM, Raja Pothuganti
> wrote:
>>
>> Hi,
>> We are setting up a new SolrCloud environment with 5.2.1 on Ubuntu
>>boxes. We currently ingest data into a large collection, call it LIVE.
>>After the full ingest is done we then trigger a delta ingestion
>>every 15 minutes to get the documents & data that have changed into this
>>LIVE instance.
>>
>> In Solr 4.X using a Master / Slave setup we had slaves that would
>>periodically (weekly, or monthly) refresh their data from the Master
>>rather than every 15 minutes. We're now trying to figure out how to get
>>this same type of setup using SolrCloud.
>>
>> Question(s):
>> - Is there a way to copy data from one SolrCloud collection into
>>another quickly and easily?
>> - Is there a way to programmatically control when a replica receives
>>its data or possibly move it to another collection (without losing
>>data) that updates on a  different interval? It ideally would be another
>>collection name, call it Week1 ... Week52 ... to avoid a replica in the
>>same collection serving old data.
>>
>> One option we thought of was to create a backup and then restore that
>>into a new clean cloud. This has a lot of moving parts and isn't nearly
>>as neat as the Master / Slave controlled replication setup. It also has
>>the side effect of potentially taking a very long time to backup and
>>restore instead of just copying the indexes like the old M/S setup.
>>
>> Any ideas or thoughts? Thanks in advance for your help.
>> Raja



copying data from one collection to another collection (solr cloud 521)

2015-07-13 Thread Raja Pothuganti

Hi,
We are setting up a new SolrCloud environment with 5.2.1 on Ubuntu boxes. We 
currently ingest data into a large collection, call it LIVE. After the full 
ingest is done we then trigger a delta ingestion every 15 minutes to get 
the documents & data that have changed into this LIVE instance.

In Solr 4.X using a Master / Slave setup we had slaves that would periodically 
(weekly, or monthly) refresh their data from the Master rather than every 15 
minutes. We're now trying to figure out how to get this same type of setup 
using SolrCloud.

Question(s):
- Is there a way to copy data from one SolrCloud collection into another 
quickly and easily?
- Is there a way to programmatically control when a replica receives its data 
or possibly move it to another collection (without losing data) that updates on 
a  different interval? It ideally would be another collection name, call it 
Week1 ... Week52 ... to avoid a replica in the same collection serving old data.

One option we thought of was to create a backup and then restore that into a 
new clean cloud. This has a lot of moving parts and isn't nearly as neat as the 
Master / Slave controlled replication setup. It also has the side effect of 
potentially taking a very long time to backup and restore instead of just 
copying the indexes like the old M/S setup.

Any ideas or thoughts? Thanks in advance for your help.
Raja