Myn, please let us know the issue number once it is created, so we can share it with other projects that might be better suited for this infrastructure piece.
fwiw

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Thursday, June 11, 2015 11:27 AM
To: [email protected]
Subject: Re: this is a BUG?

myn,

Please file a JIRA issue: http://issues.apache.org/jira/browse/SOLR

On Mon, Jun 8, 2015 at 8:38 AM myn <[email protected]> wrote:

I also think HdfsDirectory is not a high-performance implementation, because reading and writing directly on HDFS is slower than the local filesystem.

Why not put a cache in front of HDFS so we can get local-filesystem speed? The cache could live on local disk: split each HDFS file into fixed-length blocks and keep the blocks on local disk, evicted by LRU.

We use HDFS for data reliability and the local filesystem for performance. That is how Hermes uses it, and that is our suggestion.

At 2015-06-08 20:28:34, "myn" <[email protected]> wrote:

In the Solr package org.apache.solr.store.blockcache, CustomBufferedIndexInput's slice method calls BufferedIndexInput.wrap(sliceDescription, this, offset, length).

I have upgraded from lucene 3.5 to lucene 5.1, and in my tests building an index on HDFS is quite slow. I found that when we use docvalues, the directory often calls the slice method to clone a field's input:

@Override
public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
    return BufferedIndexInput.wrap(sliceDescription, this, offset, length);
}

But the default buffer size is 1024, not the buffer size I configured. So I changed it as below, and index building became faster:

@Override
public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
    SlicedIndexInput rtn = new SlicedIndexInput(sliceDescription, this, offset, length);
    rtn.setBufferSize(this.bufferSize);
    return rtn; // was: BufferedIndexInput.wrap(sliceDescription, this, offset, length);
}
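For what it is worth, the local block cache suggested earlier in this thread (split HDFS files into fixed-length blocks and evict with LRU) could be sketched roughly as below. This is only a sketch: the class name, the in-memory map, and the 1 KB block size are assumptions taken from the mails in this thread, not Hermes or Solr code, and it keeps blocks on the heap rather than on local disk for simplicity.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of an LRU cache for fixed-length file blocks.
// A reader maps a file position to (blockIndex = pos / BLOCK_SIZE,
// offset = pos % BLOCK_SIZE), asks the cache first, and only goes to
// HDFS on a miss.
public class BlockLruCache {

    public static final int BLOCK_SIZE = 1024; // 1 KB blocks, as suggested above

    // Cache key: one fixed-length block of one file.
    private static final class BlockKey {
        final String fileName;
        final long blockIndex;

        BlockKey(String fileName, long blockIndex) {
            this.fileName = fileName;
            this.blockIndex = blockIndex;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof BlockKey)) return false;
            BlockKey other = (BlockKey) o;
            return blockIndex == other.blockIndex && fileName.equals(other.fileName);
        }

        @Override
        public int hashCode() {
            return 31 * fileName.hashCode() + (int) (blockIndex ^ (blockIndex >>> 32));
        }
    }

    private final Map<BlockKey, byte[]> blocks;

    public BlockLruCache(final int maxBlocks) {
        // An access-ordered LinkedHashMap gives LRU eviction for free.
        this.blocks = new LinkedHashMap<BlockKey, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<BlockKey, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    // Returns the cached block, or null on a miss (the caller then reads from HDFS).
    public synchronized byte[] get(String fileName, long blockIndex) {
        return blocks.get(new BlockKey(fileName, blockIndex));
    }

    // Stores a block that was just read from HDFS.
    public synchronized void put(String fileName, long blockIndex, byte[] block) {
        blocks.put(new BlockKey(fileName, blockIndex), block);
    }
}

The same (fileName, blockIndex) keying would also work for blocks spilled to local disk rather than kept on the heap; the Hermes write-up further down this thread describes the same idea with 1 KB blocks, an LRU over the blocks, and a second LRU over open files.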
At 2015-01-29 20:14:25, "myn" <[email protected]> wrote:

add attachment

On 2015-01-29 19:59:09, "yannianmu(母延年)" <[email protected]> wrote:

Dear Lucene dev,

We are the Hermes team. Hermes is a project based on lucene 3.5 and solr 3.5.

Hermes processes 100 billion documents per day, 2000 billion documents in total (two months of data). Today our largest single-cluster index is over 200 TB, and the total size is 600 TB. We use lucene to speed up a big-data warehouse and reduce analysis response time: for example, filters such as age=32 and keywords like 'lucene', or aggregations such as count, sum, order by and group by.

Hermes can filter data out of 1000 billion rows in 1 second. An order by over 10 billion rows takes 10 s, a group by over 10 billion rows takes 15 s, and sum/avg/max/min statistics over 10 billion rows take 30 s.

For those purposes we made many improvements on top of lucene and solr. Lucene has changed a great deal since version 4.10, so we do not plan to contribute our code back; we only want to describe the improvements we made on lucene 3.5 and how Hermes processes 100 billion documents per day on 32 physical machines. We hope it is helpful for people with a similar use case.

First level index (tii), loading on demand

Original:

1. The .tii file is loaded into RAM by TermInfosReaderIndex.
2. That can make the first open of an index quite slow.
3. The index has to be kept open persistently; once opened, it is never closed.
4. This limits the number of indexes. When we have thousands of indexes, it becomes impossible.

Our improvement:

1. Load on demand: not every field needs to be loaded into memory.
2. We changed getIndexOffset (the binary search) to work on disk instead of in memory, with an LRU cache to speed it up (a sketch follows below).
3. Doing getIndexOffset on disk saves a lot of memory and reduces the time needed to open an index.
4. Hermes often opens different indexes for different businesses; indexes that are rarely used are closed again (managed by LRU).
5. This way a single physical machine can hold more than 100,000 indexes.

Problems solved:

1. Hermes stores more than 1000 billion documents; we do not have enough memory to hold all the .tii files.
2. We have more than 100,000 indexes; if they were all open, they would use up more file descriptors than the file system allows.
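The "getIndexOffset on disk" change described above could look roughly like the following sketch. It assumes fixed-length index entries with a fixed-width term key, which is a simplification of the real variable-length .tii format; the class and field names are illustrative, not the actual Hermes patch to TermInfosReaderIndex.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Sketch: binary search over a term index stored on disk instead of in RAM.
// Assumes each index entry is a fixed-length record whose first TERM_BYTES
// bytes are the (zero-padded) term key; this only illustrates the access
// pattern, not the actual file format.
public class OnDiskTermIndex {

    private static final int TERM_BYTES = 32;   // fixed-width, zero-padded term key
    private static final int ENTRY_SIZE = 40;   // key plus an 8-byte offset into the .tis file

    private final RandomAccessFile file;
    private final long entryCount;

    public OnDiskTermIndex(RandomAccessFile file) throws IOException {
        this.file = file;
        this.entryCount = file.length() / ENTRY_SIZE;
    }

    // Returns the index of the greatest entry <= term, or -1 if term sorts
    // before the first entry (the equivalent of getIndexOffset).
    public long getIndexOffset(String term) throws IOException {
        long lo = 0, hi = entryCount - 1, result = -1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            String midTerm = readTerm(mid);      // one small disk read per probe
            if (midTerm.compareTo(term) <= 0) {
                result = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }

    // Reads the fixed-width term key of one entry. In Hermes these reads would
    // go through an LRU block cache (see the sketch earlier in this thread),
    // so the hot parts of the index stay in memory.
    private String readTerm(long entry) throws IOException {
        byte[] buf = new byte[TERM_BYTES];
        file.seek(entry * ENTRY_SIZE);
        file.readFully(buf);
        int len = 0;
        while (len < buf.length && buf[len] != 0) len++;  // strip zero padding
        return new String(buf, 0, len, StandardCharsets.UTF_8);
    }
}

Each lookup costs only O(log n) small reads, and with an LRU cache over those reads the top levels of the binary search stay resident, which is why the memory savings need not cost much latency.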
Build index on HDFS

1. We modified lucene 3.5 in 2013 so that we can build indexes directly on HDFS (lucene itself has supported HDFS since 4.0).
2. All offline data is indexed by MapReduce jobs on HDFS.
3. We moved all realtime indexes from local disk to HDFS.
4. We can ignore disk failures because the index lives on HDFS.
5. We can move a process from one machine to another, since the index is on HDFS.
6. We can recover an index quickly when a disk failure happens.
7. We do not need to copy data when a machine breaks (the index is so big that moving it would take hours); the process simply moves to another machine, driven by ZooKeeper heartbeats.
8. Everyone knows an index on HDFS is slower than on the local file system, but why? On the local file system the OS does a lot of optimization and uses plenty of cache to speed up random access, so we need the same kind of optimization on HDFS. That is why people often say HDFS indexes are slow: they simply have not optimized them.
9. We split each HDFS file into fixed-length blocks, 1 KB per block, and cache them with an LRU cache; the .tii file and frequent terms benefit the most.
10. Some HDFS files do not need to be closed immediately; we keep them in an LRU cache to reduce how often files are opened.

Improve solr so that one core can dynamically serve many indexes

Original:

1. A solr core (one process) serves only the 1~N indexes listed in the solr config.

Our improvement:

1. Use partitions, as in oracle or hadoop hive: instead of one big index, build many indexes partitioned by day (or month, year, or another key).
2. Dynamically create tables for new business as needed.

Problems solved:

1. An index that grows past Integer.MAX_VALUE documents overflows the docid space; partitioning avoids that.
2. Sometimes a search does not need all of the data, perhaps only the most recent 3 days.

Label mark technology for doc values

Original:

1. group by, sort, sum, max, min and avg all need to read the original values from the .tis file.
2. FieldCacheImpl loads every term value into memory for the solr fieldValueCache, even if we only need statistics on one record.
3. The first search is quite slow because it has to build the fieldValueCache and load all term values into memory.

Our improvement:

1. In practice the data has many repeated values, for example the sex field or the age field.
2. Storing the original value everywhere wastes a lot of storage, so we made a small change to TermInfosWriter and additionally write a new field called termNumber: the terms are written in sorted order, and each unique term is given a sequential number from first to last (much like solr's UnInvertedField).
3. We use the termNumber (we call it a label) instead of the Term. The labels are stored in a file called doctotm, ordered by docid and with a fixed length per entry, so the file can be read with random access (like the .fdx file) and does not need to be loaded into memory.
4. The labels have the same order as the terms, so for order by or group by we only read the labels; we never touch the original values.
5. Some fields, such as sex, have only 2 distinct values, so we store the label in 2 bits rather than 2 bytes, which saves a lot of disk IO.
6. Only when all calculation is finished do we translate labels back to Terms through a dictionary.
7. If many rows share the same original value, that value is stored once and read once.

Problems solved:

1. Hermes's data is so big that we cannot load all values into memory the way lucene FieldCacheImpl or solr UnInvertedField does.
2. In realtime mode the data changes frequently, so such caches are invalidated frequently by appends and updates; rebuilding FieldCacheImpl costs a lot of time and IO.
3. The original value is a lucene Term, i.e. a string; sorting or grouping on strings needs a lot of memory and CPU time for hashCode/compare/equals, while a label is just a number and is fast.
4. The label is a number whose type may be byte, short or integer, depending on the largest label.
5. Reading the original values costs a lot of IO, since it means iterating the .tis file even when we only need one document.
6. It removes the long pause when FieldCacheImpl is built for the first time.

Two-phase search

Original:

1. group by and order by work on original values; the real value may be a string and may be large, so reading it costs a lot of IO in the .tis and .frq files.
2. Comparing strings is slower than comparing integers.

Our improvement:

1. We split one search into a multi-phase search.
2. In the first phase we only search the fields used for order by and group by.
3. In the first phase we never read the original (real) values; we only read the docid and the label (see "Label mark technology for doc values") for the order by and group by.
4. Once the order by and group by are finished we usually only need the top N records, so we run a second search to fetch the original values of just those top N records (a sketch follows below).

Problems solved:

1. Less IO: reading original values takes a lot of disk IO.
2. Less network IO (for the merger).
3. Most fields have repeated values, and a repeated value only needs to be read once; the group-by field's original value is read only once per label when the results are displayed to the user.
4. Most searches only display the top N (N <= 100) results, so with two-phase search many original values can be skipped entirely.
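To make the label idea concrete, here is a minimal sketch of sorting by label and resolving only the top N labels back to terms. The arrays stand in for the doctotm file and the label dictionary and are assumed to be already loaded; none of the names below are Hermes code.

import java.util.Arrays;
import java.util.Comparator;

// Sketch: sort documents by a field using integer labels only, then resolve
// just the top N labels back to their original terms. labels[doc] is the
// label of the field value for document doc, and dictionary[label] is the
// original term.
public class LabelSortExample {

    public static String[] topNByField(final int[] labels, String[] dictionary, int n) {
        // Phase 1: sort docids by label. Labels have the same order as the
        // sorted terms, so comparing labels is equivalent to comparing terms,
        // but only costs an int comparison.
        Integer[] docs = new Integer[labels.length];
        for (int doc = 0; doc < docs.length; doc++) {
            docs[doc] = doc;
        }
        Arrays.sort(docs, new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return Integer.compare(labels[a], labels[b]);
            }
        });

        // Phase 2: only the top N docs need their original value, so only
        // those N labels are translated back through the dictionary.
        int limit = Math.min(n, docs.length);
        String[] topValues = new String[limit];
        for (int i = 0; i < limit; i++) {
            topValues[i] = dictionary[labels[docs[i]]];
        }
        return topValues;
    }

    public static void main(String[] args) {
        String[] dictionary = {"female", "male"};   // label 0 -> "female", 1 -> "male"
        int[] labels = {1, 0, 0, 1, 0};             // one label per docid
        System.out.println(Arrays.toString(topNByField(labels, dictionary, 3)));
        // prints [female, female, female]
    }
}

The first phase touches nothing but small integers, and the dictionary lookup in the second phase happens at most N times, which is the point of combining labels with two-phase search.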
Multi-phase indexing

1. Hermes does not update the index document by document; it indexes in batches.
2. The index area is split into four areas, called doclist => buffer index => ram index => disk/hdfs index.
3. The doclist only stores the SolrInputDocuments for a batch of updates or appends.
4. The buffer index is a RAMDirectory used to merge the doclist into an index.
5. The ram index is also a RAMDirectory, but it is bigger than the buffer index and can be searched by the user.
6. The disk/hdfs index is the persistent store used for the big index.
7. We also use a WAL, called the binlog (like the MySQL binlog), for recovery.

Two-phase commit for update

1. We do not update records one by one like solr does (solr searches by term, finds the document, deletes it, and then appends a new one); one by one is slow.
2. We need an atomic increment field, which solr cannot support; solr can only replace a field's value, while an atomic increment has to read the last value first and then increase it.
3. Hermes uses pre-marked deletes and batch commits to update a document.
4. While a document is only pre-marked, it can still be found by searches, until we commit. We modified SegmentReader to split deletedDocs into 3 parts: deletedDocstmp for the pre-marks (pending deletes), deletedDocs_forsearch for index search, and the original deletedDocs.
5. When we want to pending-delete a document, we set its bit in deletedDocstmp (an OpenBitSet) and append the new value to the doclist (buffer) area. Pending delete means users can still search the old value; being in the buffer area means users cannot yet search the new value. When we commit (in a batch), the old values are really dropped and the whole buffer area is flushed to the ram area, where it becomes searchable.
6. We call the pending delete a virtual delete; after the commit it becomes a physical delete.
7. Hermes usually virtually deletes a lot of documents and then commits once, which performs much better than one-by-one updates (a sketch follows below).
8. We also use a lot of caching to speed up the atomic increment field.
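A minimal sketch of the pre-mark / batch-commit scheme above, using java.util.BitSet in place of Lucene's OpenBitSet and leaving out the doclist/buffer/ram areas; the class and method names are illustrative, not the Hermes SegmentReader changes.

import java.util.BitSet;

// Sketch: two bitsets per segment. Searches consult only the committed set,
// so a pending-deleted document stays visible until commit; commit() then
// folds all pending deletes into the committed set in one batch.
public class PendingDeletes {

    private final BitSet pending = new BitSet();    // pre-marked (virtual) deletes
    private final BitSet committed = new BitSet();  // what searches actually treat as deleted

    // Virtual delete: mark the old document, but keep it searchable for now.
    public synchronized void pendingDelete(int docId) {
        pending.set(docId);
    }

    // Used by the search path: only committed deletes hide a document.
    public synchronized boolean isDeletedForSearch(int docId) {
        return committed.get(docId);
    }

    // Batch commit: all pending deletes become physical deletes at once.
    // In the scheme above this is also where the buffer area is flushed to
    // the searchable ram area.
    public synchronized void commit() {
        committed.or(pending);
        pending.clear();
    }
}

Committing many virtual deletes at once turns N document updates into one bitset merge plus one flush, instead of N search-delete-append round trips.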
Term data skew

Original:

1. lucene uses an inverted index to store each term and its doclist.
2. Some fields, like sex, have only two values (male or female), so "male" alone covers 50% of the doclists.
3. solr uses the filter cache to cache each FQ; an FQ is an OpenBitSet representing the doclist.
4. The first time an FQ is used (before it is cached), building the OpenBitSet reads a huge doclist and takes a lot of disk IO.
5. Most of the time we only need the top N of the doclist and do not care about score order.

Our improvement:

1. We usually combine the FQ with other filters and use the skip lists to jump over docids that cannot match (by calling the query's advance method).
2. We do not cache an OpenBitSet per FQ; instead we cache blocks of the .frq files in memory, so the frequently read places are fast.
3. Our index is so big that caching FQs as OpenBitSets would take far too much memory.
4. We modified IndexSearcher to support a true top-N search that ignores score sorting.

Problems solved:

1. Data skew no longer costs a lot of disk IO reading doclist entries that are not needed.
2. A 2000 billion document index is too big for a filter cache of OpenBitSets; it would take too much memory.
3. Most searches only need the top N results without score sorting, and this speeds them up.

Block-Buffer-Cache

OpenBitSet and fieldValueCache need to allocate big long[] or int[] arrays. The same pattern shows up in many caches, such as UnInvertedField, FieldCacheImpl and filterQueryCache, and most of the time most of the elements are zero (empty).

Original:

1. We allocated the big array directly, and when it was no longer needed we dropped it for the JVM GC to collect.

Our improvement:

1. We split the big array into fixed-length blocks; each block is a small array of fixed length 1024 (a sketch appears at the end of this mail).
2. If a block is almost entirely zero, we use a hashmap instead of an array.
3. If a block contains no non-zero values at all, we do not create the block array; we just keep a null.
4. When a block is no longer in use, we return the array to a pool and reuse it next time.

Problems solved:

1. Saves memory.
2. Reduces the JVM garbage collection, which otherwise takes a lot of CPU.

weakhashmap, hashmap and synchronized problems

1. FieldCacheImpl uses WeakHashMap to manage the field value cache; it has a memory leak bug.
2. SolrInputDocument uses a lot of HashMap/LinkedHashMap instances for its fields, which wastes a lot of memory.
3. AttributeSource uses WeakHashMap to cache class implementations behind a global synchronized block, which hurts performance.
4. AttributeSource is a base class; NumericField extends it, so each NumericField creates hashmaps it never uses.
5. Because of all this, the JVM GC carries a heavy burden for hashmaps that are never used.

Our improvement:

1. WeakHashMap is not high performance, so we use SoftReference instead.
2. We reuse NumericField instances to avoid creating AttributeSource objects frequently.
3. We do not use a global synchronized.

After this optimization our indexing sped up from 20,000 documents/s to 60,000 documents/s (1 KB per document).

Other GC optimizations

1. Reuse the byte[] arrays in the input and output buffers.
2. Reuse the byte[] arrays in RAMFile.
3. Remove some finalize methods that are not necessary.
4. Use StringHelper.intern to reuse the field names in SolrInputDocument.

Directory optimization

1. An index commit does not need to sync all the files.
2. We use a block cache on top of FSDirectory and HdfsDirectory to speed up reads.
3. We close indexes and index files that are rarely used, and we limit the maximum number of open indexes; the block cache is managed by LRU.

Network optimization

1. We tuned the thread pool in the SearchHandler class: connections do not always need to be kept alive, and we increased the timeout for large indexes.
2. We removed jetty and wrote the socket layer ourselves; importing data through jetty is not high performance.
3. We changed the data import from push mode to pull mode, much like apache storm.

Append-mode optimization

1. In append mode we do not store field values in the .fdt file; storing them would cost a lot of IO during index merges and is not needed.
2. We store the field data in a separate file in the hadoop sequence file format, compressed with LZO to save IO.
3. We keep a pointer from each docid into the sequence file.

Non-tokenized field optimization

1. For non-tokenized fields we do not store the field value in the .fdt file.
2. We read the field value from the label (see "Label mark technology for doc values").
3. Most fields have duplicate values, so this reduces the index size.

Multi-level merger servers

1. solr can only use one shard to act as the merger server.
2. We use multiple levels of merger servers to merge the results of all shards.
3. Shards on the same machine are merged first by that machine's own merger server.

(The original mail illustrated solr's single-level merger and hermes's multi-level merger with two attached diagrams, which are not reproduced here.)
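Returning to the Block-Buffer-Cache section above, a minimal sketch of a blocked long array that leaves all-zero blocks unallocated is shown below; the class name and the pooling detail are assumptions, not the Hermes implementation.

// Sketch of the Block-Buffer-Cache idea: a logical long[] split into
// fixed-length blocks of 1024 entries. Blocks that are entirely zero are
// never allocated (they stay null), which saves memory and GC work for
// sparse structures such as bitsets and field-value caches.
public class BlockedLongArray {

    private static final int BLOCK_SHIFT = 10;                 // 2^10 = 1024 entries per block
    private static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    private static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final long[][] blocks;

    public BlockedLongArray(long capacity) {
        int numBlocks = (int) ((capacity + BLOCK_SIZE - 1) >>> BLOCK_SHIFT);
        blocks = new long[numBlocks][];                        // all blocks start out null
    }

    public long get(long index) {
        long[] block = blocks[(int) (index >>> BLOCK_SHIFT)];
        return block == null ? 0L : block[(int) (index & BLOCK_MASK)];
    }

    public void set(long index, long value) {
        int b = (int) (index >>> BLOCK_SHIFT);
        long[] block = blocks[b];
        if (block == null) {
            if (value == 0L) return;                           // writing zero needs no block
            block = new long[BLOCK_SIZE];                      // Hermes would take this from a pool
            blocks[b] = block;
        }
        block[(int) (index & BLOCK_MASK)] = value;
    }
}

The hashmap-for-sparse-blocks and the block pooling mentioned in that section would be layered behind the same get/set interface.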
