Re: Fastest way to import big amount of documents in SolrCloud

2014-05-01 Thread Alexander Kanarsky
If you build your index in Hadoop, read this (it is about Cloudera
Search, but to my understanding it should also work with the Solr Hadoop
contrib since 4.7):
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html
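
For the SolrJ route raised in the quoted question below, a minimal sketch of
batched, parallel indexing into SolrCloud could look like the following
(ZooKeeper hosts, collection name, batch sizes and field names are
illustrative; assumes SolrJ 4.x with CloudSolrServer):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    final CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");           // illustrative collection name

    // Several client threads, each sending its own chunk of the daily import.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int chunk = 0; chunk < 4; chunk++) {
      final int chunkId = chunk;
      pool.submit(new Runnable() {
        public void run() {
          try {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {             // this chunk's documents
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "doc-" + chunkId + "-" + i);
              doc.addField("text", "...");
              batch.add(doc);
              if (batch.size() == 1000) {                  // send in batches, not one by one
                server.add(batch);
                batch.clear();
              }
            }
            if (!batch.isEmpty()) {
              server.add(batch);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(2, TimeUnit.HOURS);

    server.commit();                                       // one commit at the end of the import
    server.shutdown();
  }
}

Committing once at the end (and optimizing only if the index is otherwise
static) is generally cheaper than committing per batch.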


On Thu, May 1, 2014 at 1:47 PM, Costi Muraru  wrote:

> Hi guys,
>
> What would you say it's the fastest way to import data in SolrCloud?
> Our use case: each day do a single import of a big number of documents.
>
> Should we use SolrJ/DataImportHandler/other? Or perhaps is there a bulk
> import feature in SOLR? I came upon this promising link:
> http://wiki.apache.org/solr/UpdateCSV
> Any idea on how UpdateCSV is performance-wise compared with
> SolrJ/DataImportHandler?
>
> If SolrJ, should we split the data in chunks and start multiple clients at
> once? In this way we could perhaps take advantage of the multitude number
> of servers in the SolrCloud configuration?
>
> Either way, after the import is finished, should we do an optimize or a
> commit or none (
> http://wiki.solarium-project.org/index.php/V1:Optimize_command)?
>
> Any tips and tricks to perform this process the right way are gladly
> appreciated.
>
> Thanks,
> Costi
>


Re: Production Release process with Solr 3.5 implementation.

2012-11-01 Thread Alexander Kanarsky
Why not change the order to this:

3. Upgrade Solr Schema (Master) Replication is disabled.
4. Start Index Rebuild. (if step 3)
1. Pull up Maintenance Pages
2. Upgrade DB
5. Upgrade UI code
6. Index build complete ? Start Replication
7. Verify UI and Drop Maintenance Pages.

This way your slaves will continue to serve traffic until you're done with the
master index. Or does the master index also import from the same database?


On Thu, Nov 1, 2012 at 4:08 PM, Shawn Heisey  wrote:

> On 11/1/2012 2:46 PM, adityab wrote:
>
>> 1. Pull up Maintenance Pages
>> 2. Upgrade DB
>> 3. Upgrade Solr Schema (Master) Replication is disabled.
>> 4. Start Index Rebuild. (if step 3)
>> 5. Upgrade UI code
>> 6. Index build complete ? Start Replication
>> 7. Verify UI and Drop Maintenance Pages.
>>
>> As # 4 takes couple of hours compared to all other steps which run within
>> few minutes, we need to have down time for the duration of that.
>>
>
> What I do is a little bit different. I have two completely independent
> copies of my index, no replication.  The build system maintains each copy
> simultaneously, including managing independent rebuilds.  I used to run two
> copies of my build system, but I recently made it so that one copy manages
> multiple indexes.
>
> If I need to do an upgrade, I will first test everything out as much as
> possible on my test environment.  Then I will take one copy of my index
> offline, perform the required changes, and reindex.  The UI continues to
> send queries to the online index that hasn't been changed.  At that point,
> we initiate the upgrade sequence you've described, except that instead of
> step 4 taking a few hours, we just have to redirect traffic to the brand
> new index copy.  If everything works out, we then repeat with the other
> index copy.  If it doesn't work out, we revert everything and go back to
> the original index.
>
> Also, every index has a build core and a live core.  I currently maintain
> the same config in both cores, but it would be possible to change the
> config in the build core, reload or restart Solr, do your reindex, and
> simply do a core swap, which is almost instantaneous.  If you are doing
> replication, swapping cores on the master initiates full replication to the
> slave. Excerpt from my solr.xml:
>
> <core ... dataDir="../../data/s0_1"/>
> <core ... dataDir="../../data/s0_0"/>
>
> Thanks,
> Shawn
>
> P.S. Actually, I have three full copies of my index now -- I recently
> upgraded my test server so it has enough disk capacity to hold my entire
> index.  The test server runs a local copy of the build system which keeps
> it up to date with the two production copies.
>
>
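
For reference, the core swap Shawn mentions can be triggered through the
CoreAdmin API; a minimal sketch (host, port and core names are illustrative):

import java.io.InputStream;
import java.net.URL;

public class SwapCores {
  public static void main(String[] args) throws Exception {
    // CoreAdmin SWAP: the "build" core and the "live" core exchange names,
    // so queries start hitting the freshly built index almost instantly.
    URL url = new URL("http://localhost:8983/solr/admin/cores?action=SWAP"
        + "&core=build&other=live");
    InputStream in = url.openStream();   // fire the request; the response is an XML status
    in.close();
  }
}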


Re: 400 MB Fields

2011-06-08 Thread Alexander Kanarsky
Otis,

Not sure about Solr, but with Lucene it is certainly doable. I have
seen fields far bigger than 400 MB indexed, sometimes with a large set
of unique terms as well (think of something like a log file with lots of
alphanumeric tokens, a couple of gigabytes in size). While indexing and
querying such fields, I/O naturally could easily become a bottleneck.

-Alexander


Re: copyField generates "multiple values encountered for non multiValued field"

2011-05-31 Thread Alexander Kanarsky
Alexander,

I saw the same behavior in 1.4.x with non-multivalued fields when
"updating" a document in the index (i.e. obtaining the doc from the
index, modifying some fields and then adding the document with the same
id back). I do not know what causes this, but it looks like the
copyField logic completely bypasses the "multivalued" check and just
adds the value in addition to whatever is already there (instead of
replacing the value). So yes, Solr renders itself into an incorrect state
then (note that the index is still correct from Lucene's
standpoint).
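
A common workaround when re-adding a document fetched from the index is to
drop the copyField destination from the update, so copyField repopulates it
from scratch. A minimal sketch in SolrJ terms (the same idea applies to the
PHP client; field names are taken from the thread below, the URL and query
are illustrative, and this assumes a SolrJ version with HttpSolrServer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ReAddDocument {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

    // Fetch the stored document that should be updated.
    SolrDocument stored = server.query(new SolrQuery("id:288400")).getResults().get(0);

    SolrInputDocument update = new SolrInputDocument();
    for (String name : stored.getFieldNames()) {
      update.addField(name, stored.getFieldValue(name));
    }
    // Drop the copyField target so Solr regenerates it instead of appending to it.
    update.removeField("field2");

    server.add(update);
    server.commit();
  }
}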

-Alexander

 


On Wed, 2011-05-25 at 16:50 +0200, Alexander Golubowitsch wrote:
> Dear list,
>  
> hope somebody can help me understand/avoid this.
>  
> I am sending an "add" request with allowDuplicates=false to a Solr 1.4.1
> instance.
> This is for debugging purposes, so I am sending the exact same data that are
> already stored in Solr's index.
> I am using the PHP PECL libraries, which fail completely in giving me any
> hint on what goes wrong.
> 
> Only sending the same "add" request again gives me a proper
> "SolrClientException" that hints:
>  
> ERROR: [288400] multiple values encountered for non multiValued field
> "field2" [fieldvalue, fieldvalue]
> 
> The scenario:
> - "field1" is implicitly single value, type "text", indexed and stored
> - "field2" is generated via a copyField directive in schema.xml, implicitly
> single value, type "string", indexed and stored
> 
> What appears to happen:
> - On the first "add" (SolrClient::addDocuments(array(SolrInputDocument
> theDocument))), regular fields like "field1" get overwritten as intended
> - "field2", defined with a copyField, but still single value, gets
> _appended_ instead
> - When I retrieve the updated document in a query and try to add it again,
> it won't let me because of the inconsistent multi-value state
> - The PECL library, in addition, appears to hit some internal exception
> (that it doesn't handle properly) when encountering multiple values for a
> single value field. That gives me zero results querying a set that includes
> the document via PHP, while the document can be retrieved properly, though
> in inconsistent state, any other way.
> 
> But: Solr appears to be generating the corrupted state itsself via
> copyField?
> What's going wrong? I'm pretty confused...
> 
> Thank you,
>  Alex
> 




Re: Replication Clarification Please

2011-05-15 Thread Alexander Kanarsky
Ravi,

what is the replication configuration on both master and slave?
Also, could you list the files in the index folder on master and slave
before and after the replication?

-Alexander


On Fri, 2011-05-13 at 18:34 -0400, Ravi Solr wrote:
> Sorry guys spoke too soon I guess. The replication still remains very
> slow even after upgrading to 3.1 and setting the compression off. Now
> Iam totally clueless. I have tried everything that I know of to
> increase the speed of replication but failed. if anybody faced the
> same issue, can you please tell me how you solved it.
> 
> Ravi Kiran Bhaskar
> 
> On Thu, May 12, 2011 at 6:42 PM, Ravi Solr  wrote:
> > Thank you Mr. Bell and Mr. Kanarsky, as per your advise we have moved
> > from 1.4.1 to 3.1 and have made several changes to configuration. The
> > configuration changes have worked nicely till now and the replication
> > is finishing within the interval and not backing up. The changes we
> > made are as follows
> >
> > 1. Increased the mergeFactor from 10 to 15
> > 2. Increased ramBufferSizeMB to 1024
> > 3. Changed lockType to single (previously it was simple)
> > 4. Set maxCommitsToKeep to 1 in the deletionPolicy
> > 5. Set maxPendingDeletes to 0
> > 6. Changed caches from LRUCache to FastLRUCache as we had hit ratios
> > well over 75% to increase warming speed
> > 7. Increased the poll interval to 6 minutes and re-indexed all content.
> >
> > Thanks,
> >
> > Ravi Kiran Bhaskar
> >
> > On Wed, May 11, 2011 at 6:00 PM, Alexander Kanarsky
> >  wrote:
> >> Ravi,
> >>
> >> if you have what looks like a full replication each time even if the
> >> master generation is greater than slave, try to watch for the index on
> >> both master and slave the same time to see what files are getting
> >> replicated. You probably may need to adjust your merge factor, as Bill
> >> mentioned.
> >>
> >> -Alexander
> >>
> >>
> >>
> >> On Tue, 2011-05-10 at 12:45 -0400, Ravi Solr wrote:
> >>> Hello Mr. Kanarsky,
> >>> Thank you very much for the detailed explanation,
> >>> probably the best explanation I found regarding replication. Just to
> >>> be sure, I wanted to test solr 3.1 to see if it alleviates the
> >>> problems...I dont think it helped. The master index version and
> >>> generation are greater than the slave, still the slave replicates the
> >>> entire index form master (see replication admin screen output below).
> >>> Any idea why it would get the whole index everytime even in 3.1 or am
> >>> I misinterpreting the output ? However I must admit that 3.1 finished
> >>> the replication unlike 1.4.1 which would hang and be backed up for
> >>> ever.
> >>>
> >>> Masterhttp://masterurl:post/solr-admin/searchcore/replication
> >>>   Latest Index Version:null, Generation: null
> >>>   Replicatable Index Version:1296217097572, Generation: 12726
> >>>
> >>> Poll Interval 00:03:00
> >>>
> >>> Local Index   Index Version: 1296217097569, Generation: 12725
> >>>
> >>>   Location: /data/solr/core/search-data/index
> >>>   Size: 944.32 MB
> >>>   Times Replicated Since Startup: 148
> >>>   Previous Replication Done At: Tue May 10 12:32:42 EDT 2011
> >>>   Config Files Replicated At: null
> >>>   Config Files Replicated: null
> >>>   Times Config Files Replicated Since Startup: null
> >>>   Next Replication Cycle At: Tue May 10 12:35:41 EDT 2011
> >>>
> >>> Current Replication StatusStart Time: Tue May 10 12:32:41 EDT 2011
> >>>   Files Downloaded: 18 / 108
> >>>   Downloaded: 317.48 KB / 436.24 MB [0.0%]
> >>>   Downloading File: _ayu.nrm, Downloaded: 4 bytes / 4 bytes [100.0%]
> >>>   Time Elapsed: 17s, Estimated Time Remaining: 23902s, Speed: 18.67 
> >>> KB/s
> >>>
> >>>
> >>> Thanks,
> >>> Ravi Kiran Bhaskar
> >>>
> >>> On Tue, May 10, 2011 at 4:10 AM, Alexander Kanarsky
> >>>  wrote:
> >>> > Ravi,
> >>> >
> >>> > as far as I remember, this is how the replication logic works (see
> >>> > SnapPuller class, fetchLatestIndex method):
> >>> >
> >>> >> 1. Does the Slave get the whole index every time during replication or

Re: Replication Clarification Please

2011-05-11 Thread Alexander Kanarsky
Ravi,

if you see what looks like a full replication each time even though the
master's generation is greater than the slave's, try watching the index on
both master and slave at the same time to see which files are getting
replicated. You probably may need to adjust your merge factor, as Bill
mentioned.

-Alexander



On Tue, 2011-05-10 at 12:45 -0400, Ravi Solr wrote:
> Hello Mr. Kanarsky,
> Thank you very much for the detailed explanation,
> probably the best explanation I found regarding replication. Just to
> be sure, I wanted to test solr 3.1 to see if it alleviates the
> problems...I dont think it helped. The master index version and
> generation are greater than the slave, still the slave replicates the
> entire index form master (see replication admin screen output below).
> Any idea why it would get the whole index everytime even in 3.1 or am
> I misinterpreting the output ? However I must admit that 3.1 finished
> the replication unlike 1.4.1 which would hang and be backed up for
> ever.
> 
> Masterhttp://masterurl:post/solr-admin/searchcore/replication
>   Latest Index Version:null, Generation: null
>   Replicatable Index Version:1296217097572, Generation: 12726
> 
> Poll Interval 00:03:00
> 
> Local Index   Index Version: 1296217097569, Generation: 12725
> 
>   Location: /data/solr/core/search-data/index
>   Size: 944.32 MB
>   Times Replicated Since Startup: 148
>   Previous Replication Done At: Tue May 10 12:32:42 EDT 2011
>   Config Files Replicated At: null
>   Config Files Replicated: null
>   Times Config Files Replicated Since Startup: null
>   Next Replication Cycle At: Tue May 10 12:35:41 EDT 2011
> 
> Current Replication StatusStart Time: Tue May 10 12:32:41 EDT 2011
>   Files Downloaded: 18 / 108
>   Downloaded: 317.48 KB / 436.24 MB [0.0%]
>   Downloading File: _ayu.nrm, Downloaded: 4 bytes / 4 bytes [100.0%]
>   Time Elapsed: 17s, Estimated Time Remaining: 23902s, Speed: 18.67 KB/s
> 
> 
> Thanks,
> Ravi Kiran Bhaskar
> 
> On Tue, May 10, 2011 at 4:10 AM, Alexander Kanarsky
>  wrote:
> > Ravi,
> >
> > as far as I remember, this is how the replication logic works (see
> > SnapPuller class, fetchLatestIndex method):
> >
> >> 1. Does the Slave get the whole index every time during replication or
> >> just the delta since the last replication happened ?
> >
> >
> > It look at the index version AND the index generation. If both slave's
> > version and generation are the same as on master, nothing gets
> > replicated. if the master's generation is greater than on slave, the
> > slave fetches the delta files only (even if the partial merge was done
> > on the master) and put the new files from master to the same index
> > folder on slave (either index or index., see further
> > explanation). However, if the master's index generation is equals or
> > less than one on slave, the slave does the full replication by
> > fetching all files of the master's index and place them into a
> > separate folder on slave (index.). Then, if the fetch is
> > successfull, the slave updates (or creates) the index.properties file
> > and puts there the name of the "current" index folder. The "old"
> > index. folder(s) will be kept in 1.4.x - which was treated
> > as a bug - see SOLR-2156 (and this was fixed in 3.1). After this, the
> > slave does commit or reload core depending whether the config files
> > were replicated. There is another bug in 1.4.x that fails replication
> > if the slave need to do the full replication AND the config files were
> > changed - also fixed in 3.1 (see SOLR-1983).
> >
> >> 2. If there are huge number of queries being done on slave will it
> >> affect the replication ? How can I improve the performance ? (see the
> >> replications details at he bottom of the page)
> >
> >
> > >From my experience the half of the replication time is a time when the
> > transferred data flushes to the disk. So the IO impact is important.
> >
> >> 3. Will the segment names be same be same on master and slave after
> >> replication ? I see that they are different. Is this correct ? If it
> >> is correct how does the slave know what to fetch the next time i.e.
> >> the delta.
> >
> >
> > They should be the same. The slave fetches the changed files only (see
> > above), also look at SnapPuller code.
> >
> >> 4. When and why does the index. folder get created ? I see
> >> this type of folder getting created only on slave and the slav

Re: Replication Clarification Please

2011-05-10 Thread Alexander Kanarsky
Ravi,

as far as I remember, this is how the replication logic works (see
SnapPuller class, fetchLatestIndex method):

> 1. Does the Slave get the whole index every time during replication or
> just the delta since the last replication happened ?


It looks at the index version AND the index generation. If both the slave's
version and generation are the same as on the master, nothing gets
replicated. If the master's generation is greater than the slave's, the
slave fetches only the delta files (even if a partial merge was done
on the master) and puts the new files from the master into the same index
folder on the slave (either index or index.<timestamp>, see further
explanation). However, if the master's index generation is equal to or
less than the one on the slave, the slave does a full replication by
fetching all files of the master's index and placing them into a
separate folder on the slave (index.<timestamp>). Then, if the fetch is
successful, the slave updates (or creates) the index.properties file
and puts there the name of the "current" index folder. The "old"
index.<timestamp> folder(s) will be kept in 1.4.x - which was treated
as a bug, see SOLR-2156 (fixed in 3.1). After this, the
slave does a commit or a core reload, depending on whether the config files
were replicated. There is another bug in 1.4.x that fails replication
if the slave needs to do a full replication AND the config files were
changed - also fixed in 3.1 (see SOLR-1983).
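
Paraphrased as a simplified sketch (this is not the actual SnapPuller source;
the helper methods are placeholders for the steps described above):

// Simplified paraphrase of the decision logic described above -- not the real SnapPuller.
public class ReplicationSketch {

  void fetchLatestIndex(long masterVersion, long masterGen,
                        long slaveVersion, long slaveGen) {
    if (masterVersion == slaveVersion && masterGen == slaveGen) {
      return;                                    // nothing to replicate
    }
    boolean fullCopy = (masterGen <= slaveGen);  // master not ahead: full copy needed
    if (!fullCopy) {
      // Delta: fetch only the files missing on the slave, into the current index dir.
      downloadMissingFiles("index");
    } else {
      // Full: fetch everything into a fresh index.<timestamp> dir,
      // then point index.properties at it.
      String dir = "index." + System.currentTimeMillis();
      downloadAllFiles(dir);
      updateIndexProperties(dir);
    }
    commitOrReloadCore();                        // reload if config files were replicated too
  }

  // Placeholders for the actual file transfer and bookkeeping steps.
  void downloadMissingFiles(String dir) { }
  void downloadAllFiles(String dir) { }
  void updateIndexProperties(String dir) { }
  void commitOrReloadCore() { }
}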

> 2. If there are huge number of queries being done on slave will it
> affect the replication ? How can I improve the performance ? (see the
> replications details at he bottom of the page)


From my experience, half of the replication time is the time when the
transferred data flushes to disk, so the I/O impact is important.

> 3. Will the segment names be same be same on master and slave after
> replication ? I see that they are different. Is this correct ? If it
> is correct how does the slave know what to fetch the next time i.e.
> the delta.


They should be the same. The slave fetches only the changed files (see
above); also look at the SnapPuller code.

> 4. When and why does the index.<timestamp> folder get created ? I see
> this type of folder getting created only on slave and the slave
> instance is pointing to it.


See above.

> 5. Does replication process copy both the index and index.<timestamp>
> folder ?


The index.<timestamp> folder gets created only if a full replication has
happened at least once. Otherwise, the slave will use the index
folder.

> 6. what happens if the replication kicks off even before the previous
> invocation has not completed ? will the 2nd invocation block or will
> it go through causing more confusion ?


There is a lock (snapPullLock in ReplicationHandler) that prevents two
replications from running simultaneously. If there is no bug, the second
invocation should just return silently from the replication call. (I
personally never had a problem with this, so it looks like there is no bug :)

> 7. If I have to prep a new master-slave combination is it OK to copy
> the respective contents into the new master-slave and start solr ? or
> do I have have to wipe the new slave and let it replicate from its new
> master ?


If the new master has a different index, the slave will create a new
index.<timestamp> folder. There is no need to wipe it.

> 8. Doing an 'ls | wc -l' on index folder of master and slave gave 194
> and 17968 respectively...I slave has lot of segments_xxx files. Is
> this normal ?


No, it looks like in your case the slave has continued to replicate into the
same folder for a long period of time, but the old files are not getting
deleted for some reason. Try restarting the slave or doing a core reload on
it to see if the old segments are gone.

-Alexander



Re: Multicore Relaod Theoretical Question

2011-01-24 Thread Alexander Kanarsky
Em,

that's correct. You can use 'lsof' to see file handles still in use.
See 
http://0xfe.blogspot.com/2006/03/troubleshooting-unix-systems-with-lsof.html,
"Recipe #11".

-Alexander

On Sun, Jan 23, 2011 at 1:52 AM, Em  wrote:
>
> Hi Alexander,
>
> thank you for your response.
>
> You said that the old index files were still in use. That means Linux does
> not *really* delete them until Solr frees its locks from it, which happens
> while reloading?
>
>
>
> Thank you for sharing your experiences!
>
> Kind regards,
> Em
>
>
> Alexander Kanarsky wrote:
>>
>> Em,
>>
>> yes, you can replace the index (get the new one into a separate folder
>> like index.new and then rename it to the index folder) outside the
>> Solr, then just do the http call to reload the core.
>>
>> Note that the old index files may still be in use (continue to serve
>> the queries while reloading), even if the old index folder is deleted
>> - that is on Linux filesystems, not sure about NTFS.
>> That means the space on disk will be freed only when the old files are
>> not referenced by Solr searcher any longer.
>>
>> -Alexander
>>
>> On Sat, Jan 22, 2011 at 1:51 PM, Em  wrote:
>>>
>>> Hi Erick,
>>>
>>> thanks for your response.
>>>
>>> Yes, it's really not that easy.
>>>
>>> However, the target is to avoid any kind of master-slave-setup.
>>>
>>> The most recent idea i got is to create a new core with a data-dir
>>> pointing
>>> to an already existing directory with a fully optimized index.
>>>
>>> Regards,
>>> Em
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Multicore-Relaod-Theoretical-Question-tp2293999p2310709.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Multicore-Relaod-Theoretical-Question-tp2293999p2312778.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: old index files not deleted on slave

2011-01-22 Thread Alexander Kanarsky
I see the file

-rw-rw-r-- 1 feeddo feeddo0 Dec 15 01:19
lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock

was created on Dec. 15. At the end of the replication, as far as I
remember, the SnapPuller tries to open the writer to ensure the old
files are deleted, and in
your case it cannot obtain a lock on the index folder on Dec 16,
17,18. Can you reproduce the problem if you delete the lock file,
restart the slave
and try replication again? Do you have any other Writer(s) open for
this folder outside of this core?

-Alexander

On Sat, Jan 22, 2011 at 3:52 PM, feedly team  wrote:
> The file system checked out, I also tried creating a slave on a
> different machine and could reproduce the issue. I logged SOLR-2329.
>
> On Sat, Dec 18, 2010 at 8:01 PM, Lance Norskog  wrote:
>> This could be a quirk of the native locking feature. What's the file
>> system? Can you fsck it?
>>
>> If this error keeps happening, please file this. It should not happen.
>> Add the text above and also your solrconfigs if you can.
>>
>> One thing you could try is to change from the native locking policy to
>> the simple locking policy - but only on the child.
>>
>> On Sat, Dec 18, 2010 at 4:44 PM, feedly team  wrote:
>>> I have set up index replication (triggered on optimize). The problem I
>>> am having is the old index files are not being deleted on the slave.
>>> After each replication, I can see the old files still hanging around
>>> as well as the files that have just been pulled. This causes the data
>>> directory size to increase by the index size every replication until
>>> the disk fills up.
>>>
>>> Checking the logs, I see the following error:
>>>
>>> SEVERE: SnapPull failed
>>> org.apache.solr.common.SolrException: Index fetch failed :
>>>        at 
>>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
>>>        at 
>>> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
>>>        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
>>>        at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>>        at 
>>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>>>        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>>>        at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>>>        at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
>>>        at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
>>>        at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>        at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>        at java.lang.Thread.run(Thread.java:619)
>>> Caused by: org.apache.lucene.store.LockObtainFailedException: Lock
>>> obtain timed out:
>>> NativeFSLock@/var/solrhome/data/index/lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock
>>>        at org.apache.lucene.store.Lock.obtain(Lock.java:84)
>>>        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1065)
>>>        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:954)
>>>        at 
>>> org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:192)
>>>        at 
>>> org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
>>>        at 
>>> org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
>>>        at 
>>> org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
>>>        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
>>>        at 
>>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
>>>        ... 11 more
>>>
>>> lsof reveals that the file is still opened from the java process.
>>>
>>> I am running 4.0 rev 993367 with patch SOLR-1316. Otherwise, the setup
>>> is pretty vanilla. The OS is linux, the indexes are on local
>>> directories, write permissions look ok, nothing unusual in the config
>>> (default deletion policy, etc.). Contents of the index data dir:
>>>
>>> master:
>>> -rw-rw-r-- 1 feeddo feeddo  191 Dec 14 01:06 _1lg.fnm
>>> -rw-rw-r-- 1 feeddo feeddo  26M Dec 14 01:07 _1lg.fdx
>>> -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 14 01:07 _1lg.fdt
>>> -rw-rw-r-- 1 feeddo feeddo 474M Dec 14 01:12 _1lg.tis
>>> -rw-rw-r-- 1 feeddo feeddo  15M Dec 14 01:12 _1lg.tii
>>> -rw-rw-r-- 1 feeddo feeddo 144M Dec 14 01:12 _1lg.prx
>>> -rw-rw-r-- 1 feeddo feeddo 277M Dec 14 01:12 _1lg.frq
>>> -rw-rw-r-- 1 feeddo feeddo  311 Dec 14 01:12 segments_1ji
>>> -rw-rw-r-- 1 feeddo feeddo  23M Dec 14 01:12 _1lg.nrm
>>> -rw-rw-r-- 1 feeddo feeddo  191 Dec 18 01:11 _24e.fnm
>>> -rw-rw-r-- 1 feeddo feeddo  26M Dec 18 01:12 _24e.fdx
>>> -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 01

Re: Multicore Relaod Theoretical Question

2011-01-22 Thread Alexander Kanarsky
Em,

yes, you can replace the index (get the new one into a separate folder
like index.new and then rename it to the index folder) outside Solr,
then just make the HTTP call to reload the core.

Note that the old index files may still be in use (they continue to serve
queries while reloading), even after the old index folder is deleted
- that is on Linux filesystems; not sure about NTFS.
That means the space on disk will be freed only when the old files are
no longer referenced by the Solr searcher.
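
A minimal sketch of that sequence (paths, core name and URL are illustrative;
assumes the new index was built somewhere else and copied to index.new on the
Solr host):

import java.io.File;
import java.io.InputStream;
import java.net.URL;

public class SwapIndexAndReload {
  public static void main(String[] args) throws Exception {
    File dataDir = new File("/var/solr/data");

    // 1. Move the old index aside and rename the new one into place.
    new File(dataDir, "index").renameTo(new File(dataDir, "index.old"));
    new File(dataDir, "index.new").renameTo(new File(dataDir, "index"));

    // 2. Ask Solr to reload the core so it opens a searcher on the new index.
    URL reload = new URL("http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0");
    InputStream in = reload.openStream();
    in.close();

    // 3. The old files under index.old can be removed once the new searcher is open;
    //    the disk space is actually freed only when Solr no longer holds them (as noted above).
  }
}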

-Alexander

On Sat, Jan 22, 2011 at 1:51 PM, Em  wrote:
>
> Hi Erick,
>
> thanks for your response.
>
> Yes, it's really not that easy.
>
> However, the target is to avoid any kind of master-slave-setup.
>
> The most recent idea i got is to create a new core with a data-dir pointing
> to an already existing directory with a fully optimized index.
>
> Regards,
> Em
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Multicore-Relaod-Theoretical-Question-tp2293999p2310709.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Can I host TWO separate datasets in Solr?

2011-01-21 Thread Alexander Kanarsky
Igor,

you can set up two different Solr cores in solr.xml and search them separately.
See the multicore example in the Solr distribution.
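
Once the two cores are defined, each is addressed by its own URL and searched
independently; a minimal sketch (core names, URL and query are illustrative,
and this assumes a SolrJ version with HttpSolrServer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TwoCores {
  public static void main(String[] args) throws Exception {
    // One core per website, each with its own schema, config and index.
    HttpSolrServer siteA = new HttpSolrServer("http://localhost:8983/solr/siteA");
    HttpSolrServer siteB = new HttpSolrServer("http://localhost:8983/solr/siteB");

    System.out.println(siteA.query(new SolrQuery("laptop")).getResults().getNumFound());
    System.out.println(siteB.query(new SolrQuery("laptop")).getResults().getNumFound());
  }
}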

-Alexander

On Fri, Jan 21, 2011 at 3:51 PM, Igor Chudov  wrote:
> I would like to have two sets of data and search them separately (they are
> used for two different websites).
>
> How can I do it?
>
> Thanks!
>


Re: Solr + Hadoop

2011-01-13 Thread Alexander Kanarsky
Joan,

make sure that you are running the job on a Hadoop 0.21 cluster. (It
looks like you have compiled the apache-solr-hadoop jar against Hadoop
0.21 but are using it on a 0.20 cluster.)

-Alexander


Re: Creating Solr index from map/reduce

2011-01-03 Thread Alexander Kanarsky
Joan,

current version of the patch assumes the location and names for the
schema and solrconfig files ($SOLR_HOME/conf), it is hardcoded (see
the SolrRecordWriter's constructor). Multi-core configuration with
separate configuration locations via solr.xml is not supported as for
now.  As a workaround, you could link or copy the schema and
solrconfig files to follow the hardcoded assumption.

Thanks,
-Alexander

On Wed, Dec 29, 2010 at 2:50 AM, Joan  wrote:
> If I rename my custom schema file (schema-xx.xml), whitch is located in
> SOLR_HOME/schema/, and then I copy it to "conf" folder and finally I try to
> run CSVIndexer, it shows me an other error:
>
> Caused by: java.lang.RuntimeException: Can't find resource 'solrconfig.xml'
> in classpath or
> '/tmp/hadoop-root/mapred/local/taskTracker/archive/localhost/tmp/b7611d6d-9cc7-4237-a240-96ecaab9f21a.solr.zip/conf/'
>
> I dont't understand because I've a solr configuration file (solr.xml) where
> I define all core:
>
> <core ... instanceDir="solr-data/index"
>        config="solr/conf/solrconfig_xx.xml"
>        schema="solr/schema/schema_xx.xml"
>        properties="solr/conf/solrcore.properties"/>
>
> But I think that when I run CSVIndexer, it doesn't know that solr.xml exist,
> and it try to looking for schema.xml and solrconfig.xml by default in
> default folder (conf)
>
>
>
> 2010/12/29 Joan 
>
>> Hi,
>>
>> I'm trying generate Solr index from hadoop (map/reduce) so I'm using this
>> patch SOLR-301 , however
>> I don't get it.
>>
>> When I try to run CSVIndexer with some arguments: 
>> -solr  
>>
>> I'm runnig CSVIndexer:
>>
>> /bin/hadoop jar my.jar CSVIndexer  -solr
>> / 
>>
>> Before that I run CSVIndexer, I've put csv file into HDFS.
>>
>> My Solr home hasn't default files configurations, but which is divided
>> into multiple folders
>>
>> /conf
>> /schema
>>
>> I have custom solr file configurations so CSVIndexer can't find schema.xml,
>> obviously It won't be able to find it because this file doesn't exist, in my
>> case, this file is named "schema-xx.xml" and CSVIndexer is looking for it
>> inside "conf" folder and It don't know that schema folder exist. And I have
>> solr configuration file (solr.xml) where I configure multiple cores.
>>
>> I tried to modify solr's paths but It still not working .
>>
>> I understand that CSVIndexer copy Solr Home specified into HDFS
>> (/tmp/hadoop-user/mapred/local/taskTracker/archive/...) and when It try to
>> find "schema.xml" it doesn't exit:
>>
>> 10/12/29 10:18:11 INFO mapred.JobClient: Task Id :
>> attempt_201012291016_0002_r_00_1, Status : FAILED
>> java.lang.IllegalStateException: Failed to initialize record writer for
>> my.jar, attempt_201012291016_0002_r_00_1
>>         at
>> org.apache.solr.hadoop.SolrRecordWriter.<init>(SolrRecordWriter.java:253)
>>         at
>> org.apache.solr.hadoop.SolrOutputFormat.getRecordWriter(SolrOutputFormat.java:152)
>>         at
>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:553)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> Caused by: java.io.FileNotFoundException: Source
>> '/tmp/hadoop-guest/mapred/local/taskTracker/archive/localhost/tmp/e8be5bb1-e910-47a1-b5a7-1352dfec2b1f.solr.zip/conf/schema.xml'
>> does not exist
>>         at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:636)
>>         at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:606)
>>         at
>> org.apache.solr.hadoop.SolrRecordWriter.<init>(SolrRecordWriter.java:222)
>>         ... 4 more
>


Re: Searching with wrong keyboard layout or using translit

2010-10-28 Thread Alexander Kanarsky
Pavel,

it depends on size of your documents corpus, complexity and types of
the queries you plan to use etc. I would recommend you to search for
the discussions on synonyms expansion in Lucene (index time vs. query
time tradeoffs etc.) since your problem is quite similar to that
(think Moskva vs. Moskwa). Unless you have a small corpus, I would go
with the second approach and expand the terms during the query time.
However, the first approach might be useful, too: say, you may want to
boost the score for the documents that naturally contain the word
'Moskva', so such a documents will be at the top of the result list.
Having both forms indexed will allow you to achieve this easily by
utilizing Solr's dismax query (to boost the results from the field
with the original terms):
http://localhost:8983/solr/select/?q=Moskva&defType=dismax&qf=text^10.0+text_translit^0.1
('text' field has the original Cyrillic tokens, 'text_translit' is for
transliterated ones)
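
In SolrJ terms, the same dismax query could be built roughly like this (field
names as in the URL above; the server URL is illustrative and this assumes a
SolrJ version with HttpSolrServer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TranslitQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("Moskva");
    q.set("defType", "dismax");
    // Documents containing the original form score much higher than
    // documents matched only through the transliterated field.
    q.set("qf", "text^10.0 text_translit^0.1");

    System.out.println(server.query(q).getResults().getNumFound());
  }
}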

-Alexander


2010/10/28 Pavel Minchenkov :
> Alexander,
>
> Thanks,
> What variat has better performance?
>
>
> 2010/10/28 Alexander Kanarsky 
>
>> Pavel,
>>
>> I think there is no single way to implement this. Some ideas that
>> might be helpful:
>>
>> 1. Consider adding additional terms while indexing. This assumes
>> conversion of Russian text to both "translit" and "wrong keyboard"
>> forms and index converted terms along with original terms (i.e. your
>> Analyzer/Filter should produce Moskva and Vjcrdf for term Москва). You
>> may re-use the same field (if you plan for a simple term queries) or
>> create a separate fields for the generated terms (better for phrase,
>> proximity queries etc. since it keeps the original text positional
>> info). Then the query could use any of these forms to fetch the
>> document. If you use separate fields, you'll need to expand/create
>> your query to search for them, of course.
>> 2. If you have to index just an original Russian text, you might
>> generate all term forms while analyzing the query, then you could
>> treat the converted terms as a synonyms and use the combination of
>> TermQuery for all term forms or the MultiPhraseQuery for the phrases.
>> For Solr in this case you probably will need to add a custom filter
>> similar to SynonymFilter.
>>
>> Hope this helps,
>> -Alexander
>>
>> On Wed, Oct 27, 2010 at 1:31 PM, Pavel Minchenkov 
>> wrote:
>> > Hi,
>> >
>> > When I'm trying to search Google with wrong keyboard layout -- it
>> corrects
>> > my query, example: http://www.google.ru/search?q=vjcrdf (I typed word
>> > "Moscow" in Russian but in English keyboard layout).
>> > <http://www.google.ru/search?q=vjcrdf>Also, when I'm searching using
>> > translit, It does the same: http://www.google.ru/search?q=moskva
>> >
>> > What is the right way to implement this feature in Solr?
>> >
>> > --
>> > Pavel Minchenkov
>> >
>>
>
>
>
> --
> Pavel Minchenkov
>


Re: Searching with wrong keyboard layout or using translit

2010-10-28 Thread Alexander Kanarsky
Pavel,

I think there is no single way to implement this. Some ideas that
might be helpful:

1. Consider adding additional terms while indexing. This means
converting the Russian text to both "translit" and "wrong keyboard"
forms and indexing the converted terms along with the original terms (i.e. your
Analyzer/Filter should produce Moskva and Vjcrdf for the term Москва). You
may re-use the same field (if you plan for simple term queries) or
create separate fields for the generated terms (better for phrase,
proximity queries etc., since it keeps the original positional
info). Then the query could use any of these forms to fetch the
document. If you use separate fields, you'll need to expand/create
your query to search them, of course (see the sketch after this list).
2. If you have to index just the original Russian text, you might
generate all term forms while analyzing the query; then you could
treat the converted terms as synonyms and use a combination of
TermQuery for all term forms, or MultiPhraseQuery for phrases.
For Solr, in this case you will probably need to add a custom filter
similar to SynonymFilter.
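
For the first approach, a bare-bones index-time filter could stack a
transliterated form on top of each original token (position increment 0), in
the spirit of SynonymFilter. A sketch only: the transliterate() mapping is a
stand-in, and the class name is illustrative; assumes a Lucene version with
CharTermAttribute.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class TranslitSynonymFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private String pendingTerm;
  private State savedState;

  public TranslitSynonymFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingTerm != null) {
      restoreState(savedState);               // keep the original token's offsets
      termAtt.setEmpty().append(pendingTerm);
      posIncAtt.setPositionIncrement(0);      // stack on the same position
      pendingTerm = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String original = termAtt.toString();
    String translit = transliterate(original);
    if (!translit.equals(original)) {
      savedState = captureState();            // emit the extra form on the next call
      pendingTerm = translit;
    }
    return true;
  }

  // Placeholder: a real implementation would map Cyrillic to Latin
  // (and/or to the "wrong keyboard layout" form) character by character.
  private static String transliterate(String term) {
    return term.replace('\u043c', 'm');       // illustrative only
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingTerm = null;
    savedState = null;
  }
}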

Hope this helps,
-Alexander

On Wed, Oct 27, 2010 at 1:31 PM, Pavel Minchenkov  wrote:
> Hi,
>
> When I'm trying to search Google with wrong keyboard layout -- it corrects
> my query, example: http://www.google.ru/search?q=vjcrdf (I typed word
> "Moscow" in Russian but in English keyboard layout).
> Also, when I'm searching using
> translit, It does the same: http://www.google.ru/search?q=moskva
>
> What is the right way to implement this feature in Solr?
>
> --
> Pavel Minchenkov
>


Re: I was at a search vendor round table today...

2010-09-22 Thread Alexander Kanarsky
>  He said some other things about a huge petabyte hosted search collection 
> they have used by banks..

In the context of your discussion, this reference sounds really, really funny... :)

-Alexander

On Wed, Sep 22, 2010 at 1:17 PM, Grant Ingersoll  wrote:
>
> On Sep 22, 2010, at 2:04 PM, Smiley, David W. wrote:
>
>> (I don't twitter or blog so I thought I'd send this message here)
>>
>> Today at work (at MITRE outside DC) there was (is) a day of technical 
>> presentations about topics related to information dissemination and 
>> discovery (broad squishy words there, but mostly covered "search") at which 
>> I spoke about the value of faceting, and gave a quick Solr pitch.  There was 
>> an hour vendor panel in which a representative from Autonomy, Microsoft 
>> (i.e. FAST), Google, Vivisimo, and Endeca had the opportunity to espouse the 
>> virtues of their product, and fit in an occasional jab at their competitors 
>> next to them.  In the absence of a suitable representative for Solr (e.g. 
>> Lucid) I pointed out how open-source Solr has "democratized" (i.e. made 
>> free) search and faceting when it used to require paying lots of money.  And 
>> I asked them how their products have reacted to this new reality.  Autonomy 
>> acknowledged they used to make millions on simple engagements in the distant 
>> past but that isn't the case these days.  He said some other things about a 
>> huge petabyte hosted search collection they have used by banks... I forget 
>> what else he said.  I forgot what Google said.  Vivisimo quoted Steve 
>> Ballmer, saying "open source is as free as a free puppy" (not a bad point 
>> IMO).
>
> Too funny.  Hadn't heard that one before.  Presumably meaning you have to 
> care and feed it, despite the fact that you really do love it and it is cute 
> as hell?  The care and feeding is true of the commercial ones, too, 
> especially in terms of  for supporting features you never use, but love 
> (as in we love using this tool) is usually not a word I hear associated in 
> those respects too often, but of course that is likely self selecting.
>
>> Endeca claimed to be happy Solr exists because it raises the awareness of 
>> faceted search, but then claimed it would not scale and they should then 
>> upgrade to Endeca.  (!)  I found that claim ridiculous, of course.
>
> Having replaced all the above on a number of occasions w/ Solr at both a 
> significant cost savings on licensing, dev time, and hardware, I would agree 
> that claim is quite ridiculous.  Besides, in my experience, the scale claim 
> is silly.  Everyone (customers) says they need scale, but few of them really 
> know what scale is, so it is all relative.   For some, scale is 1M docs, for 
> others it's 1B+ docs;  for others it's 100K queries per day, for others it's 
> 100M per day.  (BTW, I've seen Lucene/Solr do both, just fine.  Not that it 
> is a free lunch, but neither are the other ones despite what they say.)
>
>>
>> Speaking of performance, on a large scale search project where we're using 
>> Solr in place of a MarkLogic prototype (because ML is so friggin expensive, 
>> for one reason), the search results were so fast (~150ms) vs. the ML's 
>> results of 2-3 seconds, that the UI engineers building the interface on top 
>> of the XML output thought Solr was broken because it was so fast.  The quote 
>> was "It's so fast, it's broken".    In other words, they were used to 2-3 
>> second response times and so if the results came back as fast as what Solr 
>> has been doing, then surely there's a bug.  There's no bug.  :)  Admittedly, 
>> I think it was a bit of an apples and oranges comparison but I love that 
>> quote nonetheless.
>
>
> I love it.  I have had the same experience where people think it's broken b/c 
> it's so fast.  Large vendor named above took 24 hours to index 4M records 
> (they weren't even doing anything fancy on the indexing side) and search was 
> slow too.  Solr took about 40 minutes to index all the content and search was 
> blazing.  Same content, faster indexing, better search results, a lot less 
> time.
>
> At any rate, enough of tooting our own horn.  Thanks for sharing!
>
> -Grant
>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
>


Re: Solr for statistical data

2010-09-20 Thread Alexander Kanarsky
Set up your JVM to produce heap dumps in case of OOM and try to
analyze them with a profiler like YourKit. This could give you some
ideas on what takes memory and what could potentially be reduced.
Sometimes the cache settings can be adjusted without a significant
performance toll, etc. See what could be downsized on the query side and
on the indexing side. In some cases you might need to modify the Lucene
source code to adjust the internal cache I/O buffer sizes, for
example. But look for low-hanging fruit first. Use a 32-bit JVM if
possible, of course.
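
For the heap dumps, adding something along these lines to the JVM options in
the Solr start script is usually enough (heap size and dump path are
illustrative):

java -Xmx4g -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/tmp/solr-oom.hprof \
     -jar start.jar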

-Alexander


On Mon, Sep 20, 2010 at 5:58 AM, Kjetil Ødegaard
 wrote:
> On Thu, Sep 16, 2010 at 11:48 AM, Peter Karich  wrote:
>
>> Hi Kjetil,
>>
>> is this custom component (which performes groub by + calcs stats)
>> somewhere available?
>> I would like to do something similar. Would you mind to share if it
>> isn't already available?
>>
>> The grouping stuff sounds similar to
>> https://issues.apache.org/jira/browse/SOLR-236
>>
>> where you can have mem problems too ;-) or see:
>> https://issues.apache.org/jira/browse/SOLR-1682
>>
>>
> Thanks for the links! These patches seem to provide somewhat similar
> functionality, I'll investigate if they're implemented in a similar way too.
>
> We've developed this component for a client, so while I'd like to share it I
> can't make any promises. Sorry.
>
>
>> > Any tips or similar experiences?
>>
>> you want to decrease memory usage?
>
>
> Yes. Specifically, I would like to keep the heap at 4 GB. Unfortunately I'm
> still seeing some OutOfMemoryErrors so I might have to up the heap size
> again.
>
> I guess what I'm really wondering is if there's a way to keep memory use
> down, while at the same time not sacrificing the performance of our queries.
> The queries have to run through all values for a field in order to calculate
> the sum, so it's not enough to just cache a few values.
>
> The code which fetches values from the index uses
> FieldCache.DEFAULT.getStringIndex for a field, and then indexes like this:
>
> FieldType fieldType = searcher.getSchema().getFieldType(fieldName);
> fieldType.indexedToReadable(stringIndex.lookup[stringIndex.order[documentId]]);
>
> Is there a better way to do this? Thanks.
>
>
> ---Kjetil
>