RE: Solr vs. Compass

2010-01-25 Thread Funtick


Minutello, Nick wrote:
 
 Maybe spend some time playing with Compass rather than speculating ;)
 

I spent a few weeks studying the Compass source code about three years ago,
and the Compass docs (3 years ago) said the same thing they say now:
Compass::Core provides support for two phase commits transactions
(read_committed and serializable), implemented on top of Lucene index
segmentations. The implementation provides fast commits (faster than
Lucene), though they do require the concept of Optimizers that will keep the
index at bay. Compass::Core comes with support for Local and JTA
transactions, and Compass::Spring comes with Spring transaction
synchronization. When only adding data to the index, Compass comes with the
batch_insert transaction, which is the same IndexWriter operation with the
same usual suspects for controlling performance and memory. 

It is just blatant advertising, a trick; even the JavaDocs remain unchanged...


The clever guys from Compass can re-apply a transaction log to Lucene in case of
a server crash (for instance, if the server was 'killed' _before_ Lucene flushed
a new segment to disk).

Internally, it is implemented as a background thread. Nothing in the docs says
that Lucene is part of the transaction; I studied the source - it is just
'speculating'.
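
Just to illustrate what 're-applying a transaction log' means - this is NOT
Compass's actual code; the class, log format and field names below are invented -
a crash-recovery replay into a Lucene index, against the Lucene 2.4-era API,
could look roughly like this:

// Hypothetical sketch only: nothing here is taken from Compass.
// It just shows a simple log being re-applied to a Lucene index after a crash.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class TransLogReplay {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory(new File("index")),
        new StandardAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    BufferedReader log = new BufferedReader(new FileReader("translog.txt"));
    String line;
    while ((line = log.readLine()) != null) {        // one "id|body" entry per line
      String[] parts = line.split("\\|", 2);
      Document doc = new Document();
      doc.add(new Field("id", parts[0], Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("body", parts[1], Field.Store.NO, Field.Index.ANALYZED));
      writer.updateDocument(new Term("id", parts[0]), doc); // replace any stale copy
    }
    log.close();
    writer.close();                                  // flush everything that was replayed
  }
}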




Minutello, Nick wrote:
 
 If it helps, on the project where I last used compass, we had what I
 consider to be a small dataset - just a few million documents. Nothing
 related to indexing/searching took more than a second or 2 - mostly it
 was 10's or 100's of milliseconds. That app has been live almost 3
 years.
 

I did the same, and I was happy with Compass: I got Lucene-powered search
without any development. But I ran into performance problems after a few weeks...
I needed about 300 TPS, and the Compass-based approach didn't work. With SOLR, I
have 4000 index updates per second.
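
For comparison, a minimal SolrJ indexing loop of the kind behind such numbers
could look roughly like the sketch below; the URL, field names and batch size
are assumptions, and throughput depends heavily on batching and on how often
you commit:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // assumed URL
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 100000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", String.valueOf(i));         // assumed field names
      doc.addField("text", "sample document " + i);
      batch.add(doc);
      if (batch.size() == 1000) {                    // send in batches, not one by one
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);
    }
    server.commit();                                 // one commit at the end, not per document
  }
}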


-Fuad
http://www.tokenizer.org

-- 
View this message in context: 
http://old.nabble.com/Solr-vs.-Compass-tp27259766p27317213.html
Sent from the Solr - User mailing list archive at Nabble.com.



SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...

2009-08-17 Thread Funtick

After running an application which heavily uses MD5 HEX-representation as
uniqueKey for SOLR v.1.4-dev-trunk:

1. After 30 hours: 
101,000,000 documents added

2. Commit: 
numDocs = 783,714 
maxDoc = 3,975,393

3. Upload new docs to SOLR during 1 hour(!!!), then commit, then
optimize:
numDocs=1,281,851
maxDocs=1,281,851

It looks _extremely_ strange that within an hour I got such a huge increase
with the same 'average' document set...

I suspect something is going wrong with the Lucene buffer flush / index merge,
OR with SOLR's unique ID handling...

According to my own estimates, I should have about 10,000,000 new documents
by now... I got 0.5 million within an hour, and 0.8 million within a day; same
'random' documents.

This morning the index size was about 4Gb, then it suddenly dropped below 0.5 Gb.
Why? I haven't issued any commit...

I am using ramBufferMB=8192
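
For reference, the MD5 HEX uniqueKey values mentioned above can be produced
with plain JDK classes. A minimal sketch - the input string is only an example:

import java.security.MessageDigest;

public class Md5Key {
  public static String md5Hex(String input) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] digest = md.digest(input.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder(32);
    for (byte b : digest) {
      hex.append(String.format("%02x", b & 0xff));   // two lowercase hex chars per byte
    }
    return hex.toString();                           // 32-character value for <uniqueKey>
  }

  public static void main(String[] args) throws Exception {
    System.out.println(md5Hex("http://example.com/product/123")); // example input only
  }
}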






-- 
View this message in context: 
http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017728.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: JVM Heap utilization & Memory leaks with Solr

2009-08-17 Thread Funtick

Can you please tell me how many non-tokenized single-valued fields your
schema uses, and how many documents?
Thanks,
Fuad


Rahul R wrote:
 
 My primary issue is not Out of Memory error at run time. It is memory
 leaks:
 heap space not being released after doing a force GC also. So after
 sometime
 as progressively more heap gets utilized, I start running out of
 memory
 The verdict however seems unanimous that there are no known memory leak
 issues within Solr. I am still looking at my application to analyse the
 problem. Thank you.
 
 On Thu, Aug 13, 2009 at 10:58 PM, Fuad Efendi f...@efendi.ca wrote:
 
 Most OutOfMemoryException (if not 100%) happening with SOLR are because
 of

 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/FieldCache.
 html
 - it is used internally in Lucene to cache Field value and document ID.

 My very long-term observations: SOLR can run without any problems few
 days/months and unpredictable OOM happens just because someone tried
 sorted
 search which will populate array with IDs of ALL documents in the index.

 The only solution: calculate exactly amount of RAM needed for
 FieldCache...
 For instance, for 100,000,000 documents single instance of FieldCache may
 require 8*100,000,000 bytes (8 bytes per document ID?) which is almost
 1Gb
 (at least!)


 I didn't notice any memory leaks after I started to use 16Gb RAM for SOLR
 instance (almost a year without any restart!)




 -Original Message-
 From: Rahul R [mailto:rahul.s...@gmail.com]
 Sent: August-13-09 1:25 AM
 To: solr-user@lucene.apache.org
  Subject: Re: JVM Heap utilization & Memory leaks with Solr

 *You should try to generate heap dumps and analyze the heap using a tool
 like the Eclipse Memory Analyzer. Maybe it helps spotting a group of
 objects holding a large amount of memory*

 The tool that I used also allows to capture heap snap shots. Eclipse had
 a
 lot of pre-requisites. You need to apply some three or five patches
 before
 you can start using it My observations with this tool were that
 some
 Hashmaps were taking up a lot of space. Although I could not pin it down
 to
 the exact HashMap. These would either be weblogic's or Solr's I will
 anyway give eclipse's a try and see how it goes. Thanks for your input.

 Rahul

 On Wed, Aug 12, 2009 at 2:15 PM, Gunnar Wagenknecht
 gun...@wagenknecht.orgwrote:

  Rahul R schrieb:
   I tried using a profiling tool - Yourkit. The trial version was free
 for
  15
   days. But I couldn't find anything of significance.
 
  You should try to generate heap dumps and analyze the heap using a tool
  like the Eclipse Memory Analyzer. Maybe it helps spotting a group of
  objects holding a large amount of memory.
 
  -Gunnar
 
  --
  Gunnar Wagenknecht
  gun...@wagenknecht.org
  http://wagenknecht.org/
 
 



 
 

-- 
View this message in context: 
http://www.nabble.com/JVM-Heap-utilization---Memory-leaks-with-Solr-tp24802380p25017767.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...

2009-08-17 Thread Funtick


But how to explain that within an hour (after a commit) I got about
500,000 new documents, while within 30 hours (after a commit) only 1,300,000?

Same _random_enough_ documents...

BTW, the SOLR Console was showing only a few hundred deletesById although I
don't use any deleteById explicitly; only updates with allowOverwrite and
uniqueId.
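
For reference, the overwrite-on-duplicate behaviour discussed in this thread
hinges on the uniqueKey declaration in schema.xml. A minimal sketch - the field
name 'id' is the usual convention, not necessarily the schema used here:

<!-- schema.xml: a document with an existing key overwrites the old one instead of duplicating it -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
...
<uniqueKey>id</uniqueKey>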




markrmiller wrote:
 
 I'd say you have a lot of documents that have the same id.
 When you add a doc with the same id, first the old one is deleted, then
 the
 new one is added (atomically though).
 
 The deleted docs are not removed from the index immediately though - the
 doc
 id is just marked as deleted.
 
 Over time though, as segments are merged due to hitting triggers while
 adding new documents, deletes are removed (which deletes depends on which
 segments have been merged).
 
 So if you add a ton of documents over time, many with the same ids, you
 would likely see this type of maxDoc, numDoc churn. maxDoc will include
 deleted docs while numDoc will not.
 
 
 -- 
 - Mark
 
 http://www.lucidimagination.com
 
 On Mon, Aug 17, 2009 at 11:09 PM, Funtick f...@efendi.ca wrote:
 

 After running an application which heavily uses MD5 HEX-representation as
 uniqueKey for SOLR v.1.4-dev-trunk:

 1. After 30 hours:
 101,000,000 documents added

 2. Commit:
 numDocs = 783,714
 maxDoc = 3,975,393

 3. Upload new docs to SOLR during 1 hour(!!!), then commit, then
 optimize:
 numDocs=1,281,851
 maxDocs=1,281,851

 It looks _extremely_ strange that within an hour I have such a huge
 increase
 with same 'average' document set...

 I am suspecting something goes wrong with Lucene buffer flush / index
 merge
 OR SOLR - Unique ID handling...

 According to my own estimates, I should have about 10,000,000 new
 documents
 now... I had 0.5 millions within an hour, and 0.8 mlns within a day; same
 'random' documents.

 This morning index size was about 4Gb, then suddenly dropped below 0.5
 Gb.
 Why? I haven't issued any commit...

 I am using ramBufferMB=8192






 --
 View this message in context:
 http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017728.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017826.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...

2009-08-17 Thread Funtick

One more hour, and I have +0.5 million more (after commit/optimize).

Something strange is happening with the SOLR buffer flush (if we have a single
segment???)... an explicit commit prevents it...

30 hours, with index flush, commit: 783,714
+ 1 hour, commit, optimize: 1,281,851
+ 1 hour, commit, optimize: 1,786,552

Same random docs retrieved from the web...



Funtick wrote:
 
 
 But how to explain that within an hour (after commit) I have had about
 500,000 new documents, and within 30 hours (after commit) only 783,714?
 
 Same _random_enough_ documents... 
 
 BTW, SOLR Console was showing only few hundreds deletesById although I
 don't use any deleteById explicitly; only update with allowOverwrite
 and uniqueId.
 
 
 
 
 markrmiller wrote:
 
 I'd say you have a lot of documents that have the same id.
 When you add a doc with the same id, first the old one is deleted, then
 the
 new one is added (atomically though).
 
 The deleted docs are not removed from the index immediately though - the
 doc
 id is just marked as deleted.
 
 Over time though, as segments are merged due to hitting triggers while
 adding new documents, deletes are removed (which deletes depends on which
 segments have been merged).
 
 So if you add a ton of documents over time, many with the same ids, you
 would likely see this type of maxDoc, numDoc churn. maxDoc will include
 deleted docs while numDoc will not.
 
 
 -- 
 - Mark
 
 http://www.lucidimagination.com
 
 On Mon, Aug 17, 2009 at 11:09 PM, Funtick f...@efendi.ca wrote:
 

 After running an application which heavily uses MD5 HEX-representation
 as
 uniqueKey for SOLR v.1.4-dev-trunk:

 1. After 30 hours:
 101,000,000 documents added

 2. Commit:
 numDocs = 783,714
 maxDoc = 3,975,393

 3. Upload new docs to SOLR during 1 hour(!!!), then commit, then
 optimize:
 numDocs=1,281,851
 maxDocs=1,281,851

 It looks _extremely_ strange that within an hour I have such a huge
 increase
 with same 'average' document set...

 I am suspecting something goes wrong with Lucene buffer flush / index
 merge
 OR SOLR - Unique ID handling...

 According to my own estimates, I should have about 10,000,000 new
 documents
 now... I had 0.5 millions within an hour, and 0.8 mlns within a day;
 same
 'random' documents.

 This morning index size was about 4Gb, then suddenly dropped below 0.5
 Gb.
 Why? I haven't issued any commit...

 I am using ramBufferMB=8192






 --
 View this message in context:
 http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017728.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017967.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: JVM Heap utilization & Memory leaks with Solr

2009-08-17 Thread Funtick

BTW, you should really prefer JRockit, which really rocks!!!

Mission Control has the necessary tooling, and JRockit produces a _nice_
exception stacktrace (explaining almost everything) even in case of OOM,
which the Sun JVM still fails to produce.


SolrServlet still catches Throwable:

} catch (Throwable e) {   // even an Error / OutOfMemoryError lands here and becomes a 500
  SolrException.log(log,e);
  sendErr(500, SolrException.toStr(e), request, response);
} finally {





Rahul R wrote:
 
 Otis,
 Thank you for your response. I know there are a few variables here but the
 difference in memory utilization with and without shards somehow leads me
 to
 believe that the leak could be within Solr.
 
 I tried using a profiling tool - Yourkit. The trial version was free for
 15
 days. But I couldn't find anything of significance.
 
 Regards
 Rahul
 
 
 On Tue, Aug 4, 2009 at 7:35 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com
 wrote:
 
 Hi Rahul,

 A) There are no known (to me) memory leaks.
 I think there are too many variables for a person to tell you what
 exactly
 is happening, plus you are dealing with the JVM here. :)

 Try jmap -histo:live PID-HERE | less and see what's using your memory.

 Otis
 --
 Sematext is hiring -- http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



 - Original Message 
  From: Rahul R rahul.s...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Tuesday, August 4, 2009 1:09:06 AM
  Subject: JVM Heap utilization & Memory leaks with Solr
 
  I am trying to track memory utilization with my Application that uses
 Solr.
  Details of the setup :
  -3rd party Software : Solaris 10, Weblogic 10, jdk_150_14, Solr 1.3.0
  - Hardware : 12 CPU, 24 GB RAM
 
  For testing during PSR I am using a smaller subset of the actual data
 that I
  want to work with. Details of this smaller sub-set :
  - 5 million records, 4.5 GB index size
 
  Observations during PSR:
  A) I have allocated 3.2 GB for the JVM(s) that I used. After all users
  logout and doing a force GC, only 60 % of the heap is reclaimed. As
 part
 of
  the logout process I am invalidating the HttpSession and doing a
 close()
 on
  CoreContainer. From my application's side, I don't believe I am holding
 on
  to any resource. I wanted to know if there are known issues surrounding
  memory leaks with Solr ?
  B) To further test this, I tried deploying with shards. 3.2 GB was
 allocated
  to each JVM. All JVMs had 96 % free heap space after start up. I got
 varying
  results with this.
  Case 1 : Used 6 weblogic domains. My application was deployed one 1
 domain.
  I split the 5 million index into 5 parts of 1 million each and used
 them
 as
  shards. After multiple users used the system and doing a force GC,
 around
 94
  - 96 % of heap was reclaimed in all the JVMs.
  Case 2: Used 2 weblogic domains. My application was deployed on 1
 domain.
 On
  the other, I deployed the entire 5 million part index as one shard.
 After
 multiple users used the system and doing a force GC, around 76 % of the
 heap
  was reclaimed in the shard JVM. And 96 % was reclaimed in the JVM where
 my
  application was running. This result further convinces me that my
  application can be absolved of holding on to memory resources.
 
  I am not sure how to interpret these results ? For searching, I am
 using
  Without Shards : EmbeddedSolrServer
  With Shards :CommonsHttpSolrServer
  In terms of Solr objects this is what differs in my code between normal
  search and shards search (distributed search)
 
  After looking at Case 1, I thought that the CommonsHttpSolrServer was
 more
  memory efficient but Case 2 proved me wrong. Or could there still be
 memory
  leaks in my application ? Any thoughts, suggestions would be welcome.
 
  Regards
  Rahul


 
 

-- 
View this message in context: 
http://www.nabble.com/JVM-Heap-utilization---Memory-leaks-with-Solr-tp24802380p25018165.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...

2009-08-17 Thread Funtick

UPDATE:

After a few more minutes (after the previous commit):
docsPending: about 7,000,000

After commit:
numDocs: 2,297,231

Increase = 2,297,231 - 1,281,851 = 1,000,000 (on average)

So I have about 7 docs with the same ID on average.

Having 100,000,000 documents added and then dropping below 1,000,000 is strange;
it is a bug somewhere... need to investigate ramBufferSize and MergePolicy,
including the SOLR uniqueId implementation...



Funtick wrote:
 
 After running an application which heavily uses MD5 HEX-representation as
 uniqueKey for SOLR v.1.4-dev-trunk:
 
 1. After 30 hours: 
 101,000,000 documents added
 
 2. Commit: 
 numDocs = 783,714 
 maxDoc = 3,975,393
 
 3. Upload new docs to SOLR during 1 hour(!!!), then commit, then
 optimize:
 numDocs=1,281,851
 maxDocs=1,281,851
 
 It looks _extremely_ strange that within an hour I have such a huge
 increase with same 'average' document set...
 
 I am suspecting something goes wrong with Lucene buffer flush / index
 merge OR SOLR - Unique ID handling...
 
 According to my own estimates, I should have about 10,000,000 new
 documents now... I had 0.5 millions within an hour, and 0.8 mlns within a
 day; same 'random' documents.
 
 This morning index size was about 4Gb, then suddenly dropped below 0.5 Gb.
 Why? I haven't issued any commit...
 
 I am using ramBufferMB=8192
 
 
 
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25018221.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR <uniqueKey> - extremely strange behavior! Documents disappeared...

2009-08-17 Thread Funtick

Sorry for the typo in the previous message; it should be:

Increase = 2,297,231 - 1,786,552 = 500,000 (on average)

RATE (non-unique-id : unique-id) = 7,000,000 : 500,000 = 14:1

but 125:1 (over the initial 30 hours) was very strange...



Funtick wrote:
 
 UPDATE:
 
 After few more minutes (after previous commit):
 docsPending: about 7,000,000
 
 After commit:
 numDocs: 2,297,231
 
 Increase = 2,297,231 - 1,281,851 = 1,000,000 (average)
 
 So that I have 7 docs with same ID in average.
 
 Having 100,000,000 and then dropping below 1,000,000 is strange; it is a
 bug somewhere... need to investigate ramBufferSize and MergePolicy,
 including SOLR uniqueId implementation...
 
 
 
 Funtick wrote:
 
 After running an application which heavily uses MD5 HEX-representation as
 uniqueKey for SOLR v.1.4-dev-trunk:
 
 1. After 30 hours: 
 101,000,000 documents added
 
 2. Commit: 
 numDocs = 783,714 
 maxDoc = 3,975,393
 
 3. Upload new docs to SOLR during 1 hour(!!!), then commit, then
 optimize:
 numDocs=1,281,851
 maxDocs=1,281,851
 
 It looks _extremely_ strange that within an hour I have such a huge
 increase with same 'average' document set...
 
 I am suspecting something goes wrong with Lucene buffer flush / index
 merge OR SOLR - Unique ID handling...
 
 According to my own estimates, I should have about 10,000,000 new
 documents now... I had 0.5 millions within an hour, and 0.8 mlns within a
 day; same 'random' documents.
 
 This morning index size was about 4Gb, then suddenly dropped below 0.5
 Gb. Why? I haven't issued any commit...
 
 I am using ramBufferMB=8192
 
 
 
 
 
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25018263.html
Sent from the Solr - User mailing list archive at Nabble.com.



Contributions Needed: Faceting Performance, SOLR Caching

2008-10-19 Thread Funtick

Users & Developers & Possible Contributors,


Hi,

Recently I did some code hacks and I am using frequency calcs from the TermVector
instead of the default out-of-the-box DocSet intersections. It improves
performance hundreds of times at the shopping engine http://www.tokenizer.org -
please check http://issues.apache.org/jira/browse/SOLR-711 - I feel the term
faceting (and the related architectural decision made for CNET several years
ago) is completely wrong. Default SOLR response times: 30-180 seconds; with
TermVector: 0.2 seconds (25 million documents, tokenized field). For a
non-tokenized field it also looks natural to use frequency calcs, but I
have not done it yet.
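
A rough sketch of the idea, against the Lucene 2.x API of that time - the field
name and the set of matching doc ids are assumptions, and this is not the
SOLR-711 patch itself:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorFacets {
  // Count facet terms over a set of matching doc ids using stored term vectors.
  // The field must be indexed with termVectors="true" for this to work.
  public static Map<String, Integer> count(IndexReader reader, int[] docIds, String field)
      throws Exception {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int docId : docIds) {
      TermFreqVector tfv = reader.getTermFreqVector(docId, field);
      if (tfv == null) continue;                     // doc has no term vector for this field
      for (String term : tfv.getTerms()) {
        Integer c = counts.get(term);
        counts.put(term, c == null ? 1 : c + 1);
      }
    }
    return counts;
  }
}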

Sorry... too busy with Liferay Portal contract assignments,
http://www.linkedin.com/in/liferay

Another possible performance improvement: create a safe & concurrent cache
for SOLR; you may check LingPipe, and also
http://issues.apache.org/jira/browse/SOLR-665 and
http://issues.apache.org/jira/browse/SOLR-667.

Lucene developers are doing a great job of removing synchronization in several
places too, such as the isDeleted() method call... it would be nice to have an
unsynchronized API version for read-only indexes.


Thanks!




-- 
View this message in context: 
http://www.nabble.com/Contributions-Needed%3A-Faceting-Performance%2C-SOLR-Caching-tp20058987p20058987.html
Sent from the Solr - User mailing list archive at Nabble.com.



background merge hit exception

2008-08-24 Thread Funtick

Is it a file-system error? I can commit, but I cannot optimize:

Exception in thread "main" org.apache.solr.common.SolrException: background
merge hit exception: _ztu:C14604370 _105b:C1690769 _105l:C340280
_105w:C336330 _1068:C336025 _106j:C330206 _106u:C338541 _1075:C337713
_1080:C463455 into _1081 [optimize]

java.io.IOException: background merge hit exception: _ztu:C14604370
_105b:C1690769 _105l:C340280 _105w:C336330 _1068:C336025 _106j:C330206
_106u:C338541 _1075:C337713 _1080:C463455 into _1081 [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2300)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2230)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
    at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:104)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:115)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:128)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1081)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:856)
    at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
    at org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor.run(AprEndpoint.java:1949)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: No space left on device
    at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java)
    at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:632)
    at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
    at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
    at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:109)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput.close(FSDirectory.java:639)
    at org.apache.lucene.index.FieldsWriter.close(FieldsWriter.java:133)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:361)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)

-- 
View this message in context: 
http://www.nabble.com/background-merge-hit-exception-tp19130991p19130991.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: background merge hit exception

2008-08-24 Thread Funtick

I found the answer: not enough free space in the filesystem. An optimize rewrites
all segments into a new single segment, so it temporarily needs free disk space
on the order of the current index size (or more) on top of the existing index.


Funtick wrote:
 
 Is it file-system error? I can commit and I can not optimize:
 
 Exception in thread main org.apache.solr.common.SolrException:
 background merge hit exception: _ztu:C14604370 _105b:C1690769
 _105l:C340280 _105w:C336330 _1068:C336025 _106j:C330206 _106u:C338541
 _1075:C337713 _1080:C463455 into _1081 [optimize]  java.io.IOException:
 background merge hit exception: _ztu:C14604370 _105b:C1690769
 _105l:C340280 _105w:C336330 _1068:C336025 _106j:C330206 _106u:C338541
 _1075:C337713 _1080:C463455 into _1081 [optimize] at
 org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2300)
 

-- 
View this message in context: 
http://www.nabble.com/background-merge-hit-exception-tp19130991p19131084.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-31 Thread Funtick


Yes, it should be extremely simple! I simply can't understand how you
describe it:

Britske wrote:
 
 Rows in solr represent productcategories. I will have up to 100k of them. 
 
 - Each product category can have 10k products each. These are encoded as
 the 10k columns / fields (all 10k fields are int values) 
   
 - At any given at most 1 product per productcategory is returned,
 (analoguous to selecting 1 out of 10k columns). (This is the requirements
 that makes this scheme possible) 
 
 -products in the same column have certain characteristics in common, which
 are encoded in the column name (using dynamic fields). So the combination
 of these characteristics uniquely determines 1 out of 10k columns. When
 the user hasn't supplied all characteristics good defaults for these
 characteristics can be chosen, so a column can always be determined. 
 
 - on top of that each row has 20 productcategory-fields (which all
 possible 10k products of that category share). 
 

1. You can't really define 10,000 columns; you are probably using a
multivalued field for that. (Sorry if I am not familiar with the newest-greatest
features of SOLR such as 'dynamic fields'.)

2. You are trying to pass 'normalized data' to Lucene
- but it is indeed the job of Lucene to normalize data!

3. All 10k fields are int values!? Lucene is designed for full-text
search... are you trying to use Lucene instead of a database?

Sorry if I don't understand your design...
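
For context: the 'dynamic fields' Britske mentions are declared once, by pattern,
in schema.xml. A minimal sketch - the name pattern and field type here are
illustrative, not Britske's actual schema:

<!-- any field whose name ends in _i becomes an indexed, stored integer field -->
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>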




Britske wrote:
 
 
 
 Funtick wrote:
 
 
 Britske wrote:
 
 - Rows in solr represent productcategories. I will have up to 100k of
 them. 
 - Each product category can have 10k products each. These are encoded as
 the 10k columns / fields (all 10k fields are int values) 
 
 
 You are using multivalued fields, you are not using 10k fields. And 10k
 is huge.
 
 Design is wrong... you should define two fields only: Category,
 Product. Lucene will do the rest.
 
 -Fuad
 
 
 ;-). Well I wish it was that simple. 
 

-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18756166.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Facet Performance

2008-07-31 Thread Funtick

Hoss,

This is still an extremely interesting area for possible improvements; I simply
don't want the topic to die:
http://www.nabble.com/Facet-Performance-td7746964.html

http://issues.apache.org/jira/browse/SOLR-665
http://issues.apache.org/jira/browse/SOLR-667
http://issues.apache.org/jira/browse/SOLR-669

I am currently using faceting on a single-valued _tokenized_ field with a huge
number of documents, with an _unsynchronized_ version of FIFOCache; 1.5 seconds
average response time (for faceted queries only!).

I think we can use an additional cache for facet results (to store the calculated
values!); Lucene's FieldCache can be used only for non-tokenized
single-valued non-boolean fields.
-Fuad



hossman_lucene wrote:
 
 
 : Unfortunately which strategy will be chosen is currently undocumented
 : and control is a bit oblique:  If the field is tokenized or multivalued
 : or Boolean, the FilterQuery method will be used; otherwise the
 : FieldCache method.  I expect I or others will improve that shortly.
 
 Bear in mind, what's provided out of the box is SimpleFacets ... it's
 designed to meet simple faceting needs ... when you start talking about
 100s or thousands of constraints per facet, you are getting outside the
 scope of what it was intended to serve efficiently.
 
 At a certain point the only practical thing to do is write a custom
 request handler that makes the best choices for your data.
 
 For the record: a really simple patch someone could submit would be to
 make add an optional field based param indicating which type of faceting
 (termenum/fieldcache) should be used to generate the list of terms and
 then make SimpleFacets.getFacetFieldCounts use that and call the
 appropriate method instead of calling getTermCounts -- that way you could
 force one or the other if you know it's better for your data/query.
 
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Facet-Performance-tp7746964p18756500.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-31 Thread Funtick



I do understand that, at first glance, it seems possible to use multivalued
fields, but with multivalued fields it's not possible to pinpoint the exact
value within the multivalued field that I need. 


I used a technique with a single document consisting of a single Category and
multiple Products (multi-valued), using 'Highlight'. The Highlight feature gives
you the exact subset of Products matching the queried terms, as a highlighted list...
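
As a sketch of that approach (the field names are assumptions, not the real
schema), the standard Solr highlighting parameters would look something like:

http://localhost:8983/solr/select?q=product:sony&hl=true&hl.fl=product&fl=id,category

The highlighting section of the response then lists, per matching document,
roughly only those Product values that contained the queried terms.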


-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18757269.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-31 Thread Funtick


But to answer the initial question... I think your documents are huge...



Funtick wrote:
 
 
 
 Britske wrote:
 
 I do understand that, at first glance, it seems possible to use
 multivalued fields, but with multivalued fields it's not possible to
 pinpoint the exact value within the multivalued field that I need. 
 
 
 I used a technique with a single document consisting of a single Category and
 multiple Products (multi-valued), using 'Highlight'. The Highlight feature
 gives you the exact subset of Products matching the queried terms, as a
 highlighted list...
 
 

-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18757342.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-31 Thread Funtick

Simple design with _single_ valued fields:

Id   Category   Product
001  TV         SONY 12345
002  Radio      Panasonic 54321
003  TV         Toshiba ABCD
004  Radio      ABCD Z-54321

We have 4 documents with single-valued fields. It's not necessary to store the
'Category' field in the index... The data is not 'normalized' from a DBA's viewpoint,
but it is what Lucene needs...
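
A sketch of the corresponding Solr update XML for one of those rows (the field
names are assumed to match the table above):

<add>
  <doc>
    <field name="id">001</field>
    <field name="category">TV</field>
    <field name="product">SONY 12345</field>
  </doc>
</add>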




Britske wrote:
 
 no, I'm using dynamic fields, they've been around for a pretty long time. 
 I use int-values in the 10k fields for filtering and sorting. On top of
 that I use a lot of full-text filtering on the other fields, as well as
 faceting, etc. 
 
 I do understand that, at first glance, it seems possible to use
 multivalued fields, but with multivalued fields it's not possible to
 pinpoint the exact value within the multivalued field that I need.
 Consider the case with 1 multi-valued field, category, as you called it,
 which would have at most 10k fields. The meaning of these values within
 the field are completely lost, although it is a requirement to fetch
 products (thus values in the multivalued field)  given a specific set of
 criteria. In other words, there is no way of getting a specific value from
 a multivalued field given a set of criteria.  Now, compare that with my
 current design in which these criteria pinpoint a specific field / column
 to use and the difference should be clear. 
 
 regards,
 Britske
 
 
 Funtick wrote:
 
 
 Yes, it should be extremely simple! I simply can't understand how you
 describe it:
 
 Britske wrote:
 
 Rows in solr represent productcategories. I will have up to 100k of
 them. 
 
 - Each product category can have 10k products each. These are encoded as
 the 10k columns / fields (all 10k fields are int values) 
   
 - At any given at most 1 product per productcategory is returned,
 (analoguous to selecting 1 out of 10k columns). (This is the
 requirements that makes this scheme possible) 
 
 -products in the same column have certain characteristics in common,
 which are encoded in the column name (using dynamic fields). So the
 combination of these characteristics uniquely determines 1 out of 10k
 columns. When the user hasn't supplied all characteristics good defaults
 for these characteristics can be chosen, so a column can always be
 determined. 
 
 - on top of that each row has 20 productcategory-fields (which all
 possible 10k products of that category share). 
 
 
 1. You can't really define 10.000 columns; you are probably using
 multivalued field for that. (sorry if I am not familiar with
 newest-greatest features of SOLR such as 'dynamic fields')
 
 2. You are trying to pass to Lucene 'normalized data'
 - But it is indeed the job of Lucene, to normalize data!
 
 3. All 10k fields are int values!? Lucene is designed for full-text
 search... are you trying to use Lucene instead of a database?
 
 Sorry if I don't understand your design...
 
 
 
 
 Britske wrote:
 
 
 
 Funtick wrote:
 
 
 Britske wrote:
 
 - Rows in solr represent productcategories. I will have up to 100k of
 them. 
 - Each product category can have 10k products each. These are encoded
 as the 10k columns / fields (all 10k fields are int values) 
 
 
 You are using multivalued fields, you are not using 10k fields. And 10k
 is huge.
 
 Design is wrong... you should define two fields only: Category,
 Product. Lucene will do the rest.
 
 -Fuad
 
 
 ;-). Well I wish it was that simple. 
 
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18757461.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-31 Thread Funtick

With a many-to-many relationship between Category and Product we can go with a
multivalued Category field, or we can even have repeated values in the
Category & Point-of-Interest fields (_single_ valued); it's not necessary to
store all fields in the index - you can store a pointer to the database Primary
Key, for instance.

001 Attraction  CN Tower
002 Hotel CN Tower
003 Hotel Sheraton
004 Restaurant CN Tower


Funtick wrote:
 
 Simple design with _single_ valued fields:
 
 Id   Category   Product
 001  TV         SONY 12345
 002  Radio      Panasonic 54321
 003  TV         Toshiba ABCD
 004  Radio      ABCD Z-54321
 

-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18757613.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-30 Thread Funtick


Britske wrote:
 
 When performing these queries I notice a big difference between qTime
 (which is mostly in the 15-30 ms range due to caching) and total time
 taken to return the response (measured through SolrJ's elapsedTime), which
 takes between 500-1600 ms. 
 Documents have a lot of stored fields (more than 10.000), but at any given
 query a maximum of say 20 are returned (through fl-field ) or used (as
 part of filtering, faceting, sorting)
 


Hi Britske, how do you manage 10,000 field definitions? Sorry, I didn't
understand...


Guys, I am constantly seeing the same problem, although I have just a few
small fields defined, lazyLoading is disabled, and memory is more than
enough (25Gb for SOLR, 7Gb for the OS, a 3Gb index).

Britske, do you see the difference with faceted queries only?


Yonik,

I suspect there is a _bug_ with SOLR faceting, so that the faceted query time
(qtime) is 10-20 ms while the elapsed time is huge; SOLR has a filterCache where
the key is a 'filter'; SOLR does not have any queryFacetResultCache where the key
is the 'query' and the value is the 'facets'...

Am I right?
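
For reference, the filterCache mentioned above is configured in solrconfig.xml;
a minimal sketch, with purely illustrative sizes:

<filterCache
  class="solr.LRUCache"
  size="16384"
  initialSize="4096"
  autowarmCount="4096"/>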

-Fuad

-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18736155.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-30 Thread Funtick


Britske wrote:
 
 - Rows in solr represent productcategories. I will have up to 100k of
 them. 
 - Each product category can have 10k products each. These are encoded as
 the 10k columns / fields (all 10k fields are int values) 
 

You are using multivalued fields; you are not using 10k fields. And 10k is
huge.

The design is wrong... you should define two fields only: Category, Product.
Lucene will do the rest.

-Fuad
-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18737748.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-30 Thread Funtick



Funtick wrote:
 
 
 Britske wrote:
 
 - Rows in solr represent productcategories. I will have up to 100k of
 them. 
 - Each product category can have 10k products each. These are encoded as
 the 10k columns / fields (all 10k fields are int values) 
 
 
 You are using multivalued fields, you are not using 10k fields. And 10k is
 huge.
 
 Design is wrong... you should define two fields only: Category, Product.
 Lucene will do the rest.
 
 -Fuad
 

Two _single_valued_ fields per document: Category, Product... (maybe some
additional fields such as ISDN, Price, a copy-field for faceting, etc.)

-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18737834.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Upgrade lucene to 2.3

2008-04-29 Thread Funtick

Special things:
- 2.3.1 fixes bugs with the 'autocommit' of version 2.3.0
- I am having OutOfMemoryError constantly; I can't understand where the
problem is yet... I didn't have it with the default SOLR 1.2 installation. It's
not memory-cache related; most probably it is a bug somewhere...


Yongjun Rong-2 wrote:
 
   It seems the latest lucene 2.3 has some improvement on performance.
 I'm just wondering if it is ok for us to easily upgrade the solr's
 lucene from 2.1 to 2.3. Is any special thing we need to know except just
 replace the lucene jars in the lib directory.
 

-- 
View this message in context: 
http://www.nabble.com/Uprade-lucene-to-2.3-tp16963107p16968012.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Problems querying Russian content

2007-06-28 Thread funtick

Hi Daniel,

Ensure that UTF-8 is everywhere... SOLR, WebServer, AppServer, HTTP  
Headers, etc.


And do not use
q=&#1041;&#1072;&#1084;&#1073;&#1072;&#1088;&#1073;&#1080;&#1072; &#1050;&#1080;&#1088;&#1082;&#1091;&#1076;&#1091;

use this instead (encoded URL):
q=%D0%91%D0%B0%D0%BC%D0%B1%D0%B0%D1%80%D0%B1%D0%B8%D0%B0+%D0%9A%D0%B8%D1%80%D0%BA%D1%83%D0%B4%D1%83
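
A minimal Java sketch of producing such a URL-encoded UTF-8 query; the query
string below is just the decoded form of the sample URL above:

import java.net.URLEncoder;

public class EncodeQuery {
  public static void main(String[] args) throws Exception {
    String q = "Бамбарбиа Киркуду";                 // the sample query, decoded
    String encoded = URLEncoder.encode(q, "UTF-8"); // percent-encodes the UTF-8 bytes, space -> '+'
    System.out.println("q=" + encoded);             // q=%D0%91%D0%B0%D0%BC...+%D0%9A...
  }
}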

http://www.tokenizer.org is a SOLR-powered search engine... I need to
add some large Internet shops from Russia to the crawler...


Quoting Daniel Alheiros:


Hi

I'm in trouble now about how to issue queries against Solr using in my q
parameter content in Russian (it applies to Chinese and Arabic as well).

The problem is I can't send any Russian special character in URL's because
they don't fit in ASCII domain, so I'm doing a POST to accomplish that.

My application gets the request and logs it (and the Russian characters
appear correctly on my logs) and then calls the Solr server and Solr is not
receiving it correctly... I can just see in the Solr log the special
characters as question marks...

Did anyone faced problems like that? My whole system is set to work in UTF-8
(browser, application servers).

Regards,
Daniel


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain   
personal views which are not the views of the BBC unless   
specifically stated.

If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in   
reliance on it and notify the sender immediately.

Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.







Re: To make sure XML is UTF-8

2007-06-08 Thread Funtick


Tiong Jeffrey wrote:
 
 Thought this is not directly related to Solr, but I have a XML output from
 mysql database, but during indexing the XML output is not working. And the
 problem is part of the XML output is not in UTF-8 encoding, how can I
 convert it to UTF-8 and how do I know what kind of coding it uses in the
 first place (the data I export from the mysql database). Thanks!
 

You won't have any problems with the standard JAXP and java.util.* etc. classes,
even with complex MySQL data (one column is LATIN1, another is LATIN2, another is
ASCII, ...).

In Java, use the standard classes: String, Long, Date. And use JAXP.
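
A minimal sketch of what 'use JAXP' means here: build the DOM from Java Strings
(which are Unicode internally) and let the Transformer write UTF-8. The field
name and sample value are illustrative:

import java.io.FileOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class UtfXmlExport {
  public static void main(String[] args) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element add = doc.createElement("add");
    doc.appendChild(add);

    Element d = doc.createElement("doc");
    Element f = doc.createElement("field");
    f.setAttribute("name", "title");                   // illustrative field name
    f.setTextContent("значение из MySQL");             // any Java String is Unicode internally
    d.appendChild(f);
    add.appendChild(d);

    Transformer t = TransformerFactory.newInstance().newTransformer();
    t.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); // the transformer handles byte encoding
    t.transform(new DOMSource(doc), new StreamResult(new FileOutputStream("docs.xml")));
  }
}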
-- 
View this message in context: 
http://www.nabble.com/To-make-sure-XML-is-UTF-8-tf3891427.html#a11032117
Sent from the Solr - User mailing list archive at Nabble.com.



Re: To make sure XML is UTF-8

2007-06-08 Thread funtick

Thought this is not directly related to Solr, but I have a XML output from
mysql database, but during indexing the XML output is not working. And the
problem is part of the XML output is not in UTF-8 encoding, how can I
convert it to UTF-8 and how do I know what kind of coding it uses in the
first place (the data I export from the mysql database). Thanks!


How do you generate the XML output? The output itself is usually a raw byte
array; it depends on the transport and the encoding. If you save it to a file
system and forget about the transport-layer encoding you will get some
new problems...



during indexing the XML output is not working

- what exactly happens, and what kind of error messages do you get?