RE: how to sampling search result

2016-09-28 Thread Yongtao Liu
Alexandre,

Thanks for the reply.
The use case is that the customer wants to review documents based on the search result.
But they do not want to review all of them, since it is costly.
So they want to pick a subset (from 1% to 100%) of the documents to review.
Users also ask for this function for statistics.
It is a fairly common requirement.
Do you know of any plan to implement this feature in the future?

A post filter should work, like the collapsing query parser; a sketch follows.
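Here is a minimal sketch of what such a sampling post filter could look like. Only the PostFilter, ExtendedQueryBase, and DelegatingCollector types are standard Solr APIs; the class itself, its parameters, and the sampling logic are illustrative assumptions, not an existing feature.

import java.io.IOException;
import java.util.Random;

import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

// Hypothetical post filter that keeps roughly (rate * 100)% of the matching
// documents, so downstream facets and stats only see the sampled subset.
public class SamplingPostFilter extends ExtendedQueryBase implements PostFilter {
  private final double rate;  // fraction of the result set to keep, e.g. 0.5
  private final long seed;    // fixed seed so repeated requests sample consistently

  public SamplingPostFilter(double rate, long seed) {
    this.rate = rate;
    this.seed = seed;
  }

  @Override
  public boolean getCache() { return false; }  // post filters must not be cached

  @Override
  public int getCost() { return 100; }         // cost >= 100 marks this as a post filter

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    final Random random = new Random(seed);
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        // Pass only a random subset of matches down to the delegate collectors.
        if (random.nextDouble() < rate) {
          super.collect(doc);
        }
      }
    };
  }

  @Override
  public String toString(String field) {
    return "SamplingPostFilter(rate=" + rate + ")";
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof SamplingPostFilter
        && ((SamplingPostFilter) other).rate == rate
        && ((SamplingPostFilter) other).seed == seed;
  }

  @Override
  public int hashCode() {
    return Double.hashCode(rate) * 31 + Long.hashCode(seed);
  }
}

In practice the filter would be produced by a custom QParserPlugin so it could be referenced from a filter query, for example fq={!sample rate=0.5} (the parser name is hypothetical).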

Thanks,
Yongtao
-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Tuesday, September 27, 2016 9:25 PM
To: solr-user
Subject: Re: how to sampling search result

I am not sure I understand what the business case is. However, you might be 
able to do something with a custom post-filter.

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 27 September 2016 at 22:29, Yongtao Liu <y...@commvault.com> wrote:
> Mikhail,
>
> Thanks for your reply.
>
> The random field is populated at index time.
> We want to do sampling based on the search result.
>
> For example, suppose the random field has values 1 - 100.
> The documents the query touches may all be in the range 90 - 100.
> So the random field will not help.
>
> Is it possible to sample based on the search result?
>
> Thanks,
> Yongtao
> -Original Message-
> From: Mikhail Khludnev [mailto:m...@apache.org]
> Sent: Tuesday, September 27, 2016 11:16 AM
> To: solr-user
> Subject: Re: how to sampling search result
>
> Perhaps you can apply a filter on a random field.
>
> On Tue, Sep 27, 2016 at 5:57 PM, googoo <liu...@gmail.com> wrote:
>
>> Hi,
>>
>> Is it possible to sample based on the "search result"?
>> For example, run a query first, and the search result returns 1 million documents.
>> With random sampling, 50% (500K) of the documents are returned for facets and stats.
>>
>> The sampling needs to be based on the "search result".
>>
>> Thanks,
>> Yongtao
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.
>> nabble.com/how-to-sampling-search-result-tp4298269.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev


RE: how to remove duplicate from search result

2016-09-27 Thread Yongtao Liu
Shamik,

Thanks a lot.
The collapsing query parser solves the issue.

Thanks,
Yongtao
-Original Message-
From: shamik [mailto:sham...@gmail.com] 
Sent: Tuesday, September 27, 2016 3:09 PM
To: solr-user@lucene.apache.org
Subject: RE: how to remove duplicate from search result

Did you take a look at the Collapsing Query Parser?

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
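For reference, a minimal SolrJ sketch of the collapsing approach applied to the guid case from this thread (the Solr URL, collection name, and facet field are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CollapseByGuidExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr URL and collection name; adjust to your setup.
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("*:*");
      // Keep one document per guid value; duplicates are removed from the
      // result set before facets and stats are computed.
      q.addFilterQuery("{!collapse field=guid}");
      q.setFacet(true);
      q.addFacetField("category");  // hypothetical facet field
      QueryResponse rsp = solr.query(q);
      System.out.println("deduplicated hits: " + rsp.getResults().getNumFound());
    }
  }
}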



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-remove-duplicate-from-search-result-tp4298272p4298305.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: how to remove duplicate from search result

2016-09-27 Thread Yongtao Liu
David,

Thanks for your reply.

Grouping cannot solve the issue.
We also need to run facets and stats based on the search result.
With grouping, facet and stats results still count duplicates; see the sketch below.
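For illustration, a sketch of the grouping request that was considered (the facet and stats fields are assumptions); with plain result grouping, facets and stats are still computed over every matching document rather than once per group, which is the problem described above:

import org.apache.solr.client.solrj.SolrQuery;

public class GroupingSketch {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("*:*");
    q.set("group", true);          // group the returned documents...
    q.set("group.field", "guid");  // ...by guid, one group per distinct value
    q.setFacet(true);
    q.addFacetField("category");   // hypothetical facet field
    q.set("stats", true);
    q.set("stats.field", "size");  // hypothetical stats field
    // Facet counts and stats above still include both doc1 and doc4 (same guid),
    // which is why grouping alone does not meet this requirement.
    System.out.println(q);
  }
}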

Thanks,
Yongtao
-Original Message-
From: David Santamauro [mailto:david.santama...@gmail.com] 
Sent: Tuesday, September 27, 2016 11:35 AM
To: solr-user@lucene.apache.org
Cc: david.santama...@gmail.com
Subject: Re: how to remove duplicate from search result

Have a look at

https://cwiki.apache.org/confluence/display/solr/Result+Grouping


On 09/27/2016 11:03 AM, googoo wrote:
> hi,
>
> We want to provide a function to remove duplicates from the search result.
>
> For example, we have the documents below.
> id(uniqueKey) guid
> doc1  G1
> doc2  G2
> doc3  G3
> doc4  G1
>
> A user runs a query and hits doc1, doc2, and doc4.
> The user wants to remove duplicates from the search result based on the guid field.
> Since doc1 and doc4 have the same guid, one of them should be dropped from the
> search result.
>
> How can we address this requirement?
>
> Thanks,
> Yongtao
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-remove-duplicate-from-search-result-tp4298272.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: how to sampling search result

2016-09-27 Thread Yongtao Liu
Mikhail,

Thanks for your reply.

The random field is populated at index time.
We want to do sampling based on the search result.

For example, suppose the random field has values 1 - 100.
The documents the query touches may all be in the range 90 - 100.
So the random field will not help.

Is it possible to sample based on the search result?

Thanks,
Yongtao
-Original Message-
From: Mikhail Khludnev [mailto:m...@apache.org] 
Sent: Tuesday, September 27, 2016 11:16 AM
To: solr-user
Subject: Re: how to sampling search result

Perhaps you can apply a filter on a random field.
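A quick sketch of that suggestion, assuming each document was indexed with a random integer from 0 to 99 in a field such as sample_bucket (the field name is an assumption); a range filter then keeps roughly half of the documents:

import org.apache.solr.client.solrj.SolrQuery;

public class RandomFieldSampleSketch {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("your query here");
    // Roughly 50% sample: keep documents whose index-time random value
    // falls in the lower half of the 0-99 range.
    q.addFilterQuery("sample_bucket:[0 TO 49]");
    System.out.println(q);
  }
}

As the reply above points out, the value is fixed at index time, so this samples across the whole index rather than within a specific search result.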

On Tue, Sep 27, 2016 at 5:57 PM, googoo  wrote:

> Hi,
>
> Is it possible to sample based on the "search result"?
> For example, run a query first, and the search result returns 1 million documents.
> With random sampling, 50% (500K) of the documents are returned for facets and stats.
>
> The sampling needs to be based on the "search result".
>
> Thanks,
> Yongtao
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/how-to-sampling-search-result-tp4298269.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev


remove user defined duplicate from search result

2016-09-26 Thread Yongtao Liu
Hi,

I am trying to remove user-defined duplicates from the search result.

For example, the documents below match the query.
When the query returns, I want to remove doc3 from the result since it has a
duplicate guid with doc1.

Id (uniqueKey)   guid
doc1             G1
doc2             G2
doc3             G1


To do this, I generate an exclude list based on the guid field terms.
For each term, we add every document after the first to the exclude list.
Then we add these docs to the QueryCommand filter.

Is there a better approach to handle this requirement?


Below is the code change in SolrIndexSearcher.java:

  // Cache of duplicate-document sets, keyed by the field used for deduplication.
  private TreeMap<String, BitDocSet> dupDocs = null;

  public QueryResult search(QueryResult qr, QueryCommand cmd) throws IOException {
    if (cmd.getUniqueField() != null)
    {
      // Collect every duplicate occurrence of the configured field's values and
      // fold it into the command's filter.
      DocSet filter = getDuplicateByField(cmd.getUniqueField());
      if (cmd.getFilter() != null) cmd.getFilter().addAllTo(filter);
      cmd.setFilter(filter);
    }

    getDocListC(qr, cmd);

    return qr;
  }

  private synchronized BitDocSet getDuplicateByField(String field) throws IOException
  {
    // Return the cached set if we have already computed it for this field.
    if (dupDocs != null && dupDocs.containsKey(field)) {
      return dupDocs.get(field);
    }

    if (dupDocs == null)
    {
      dupDocs = new TreeMap<String, BitDocSet>();
    }

    LeafReader reader = getLeafReader();

    BitDocSet res = new BitDocSet(new FixedBitSet(maxDoc()));

    Terms terms = reader.terms(field);

    if (terms == null)
    {
      dupDocs.put(field, res);
      return res;
    }

    // Walk every term of the field; for each term, all postings after the first
    // document are treated as duplicates.
    TermsEnum termEnum = terms.iterator();
    PostingsEnum docs = null;
    BytesRef term = null;
    while ((term = termEnum.next()) != null) {
      docs = termEnum.postings(docs, PostingsEnum.NONE);

      // skip the first document for this term so it is not marked as a duplicate
      docs.nextDoc();

      int docID = 0;
      while ((docID = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
      {
        res.add(docID);
      }
    }

    dupDocs.put(field, res);
    return res;
  }

Thanks,
Yongtao


RE: remove user defined duplicate from search result

2016-09-26 Thread Yongtao Liu
Sorry, the table was missing.
Below is the email updated with the table.

-Original Message-
From: Yongtao Liu [mailto:y...@commvault.com] 
Sent: Monday, September 26, 2016 10:47 AM
To: 'solr-user@lucene.apache.org'
Subject: remove user defined duplicate from search result

Hi,

I am trying to remove user-defined duplicates from the search result.

For example, the documents below match the query.
When the query returns, I want to remove doc3 from the result since it has a
duplicate guid with doc1.

id (uniqueKey)   guid
doc1             G1
doc2             G2
doc3             G1

To do this, I generate an exclude list based on the guid field terms.
For each term, we add every document after the first to the exclude list.
Then we add these docs to the QueryCommand filter.

Is there a better approach to handle this requirement?


Below is the code change in SolrIndexSearcher.java:

  private TreeMap<String, BitDocSet> dupDocs = null;

  public QueryResult search(QueryResult qr, QueryCommand cmd) throws 
IOException {
if (cmd.getUniqueField() != null)
{
  DocSet filter = getDuplicateByField(cmd.getUniqueField());
  if (cmd.getFilter() != null) cmd.getFilter().addAllTo(filter);
  cmd.setFilter(filter);
}

getDocListC(qr,cmd);

return qr;
  }

  private synchronized BitDocSet getDuplicateByField(String field) throws 
IOException
  {
if (dupDocs != null && dupDocs.containsKey(field)) {
  return dupDocs.get(field);
}

if (dupDocs == null)
{
  dupDocs = new TreeMap<String, BitDocSet>();
}

LeafReader reader = getLeafReader();

BitDocSet res = new BitDocSet(new FixedBitSet(maxDoc()));

Terms terms = reader.terms(field);

if (terms == null)
{
  dupDocs.put(field, res);
  return res;
}

TermsEnum termEnum = terms.iterator();
PostingsEnum docs = null;
BytesRef term = null;
while ((term = termEnum.next()) != null) {
  docs = termEnum.postings(docs, PostingsEnum.NONE);

  // slip first document
  docs.nextDoc();

  int docID = 0;
  while ((docID = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
  {
res.add(docID);
  }
}

dupDocs.put(field, res);
return res;
  }

Thanks,
Yongtao


RE: memory usage keep increase

2011-11-17 Thread Yongtao Liu
Erick,

Thanks for your reply.

Yes, virtual memory does not mean physical memory.
But when virtual memory usage exceeds physical memory, the system slows down,
since lots of paging requests happen.
Yongtao
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, November 15, 2011 8:37 AM
To: solr-user@lucene.apache.org
Subject: Re: memory usage keep increase

I'm pretty sure not. The phrase "virtual memory address space" is important
here; that's not physical memory...

Best
Erick

On Mon, Nov 14, 2011 at 11:55 AM, Yongtao Liu y...@commvault.com wrote:
 Hi all,

 I saw an issue where RAM usage keeps increasing when we run queries.
 After looking at the code, it looks like Lucene uses MMapDirectory to map index
 files into RAM.

 According to 
 http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/store/MMapDirectory.html
 comments, it will use a lot of memory.
 NOTE: memory mapping uses up a portion of the virtual memory address space in 
 your process equal to the size of the file being mapped. Before using this 
 class, be sure your have plenty of virtual address space, e.g. by using a 64 
 bit JRE, or a 32 bit JRE with indexes that are guaranteed to fit within the 
 address space.

 So, my understanding is that Solr requires physical RAM >= the index file size;
 is that right?

 Yongtao




memory usage keep increase

2011-11-14 Thread Yongtao Liu
Hi all,

I saw an issue where RAM usage keeps increasing when we run queries.
After looking at the code, it looks like Lucene uses MMapDirectory to map index
files into RAM.

According to 
http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/store/MMapDirectory.html
 comments, it will use a lot of memory.
NOTE: memory mapping uses up a portion of the virtual memory address space in 
your process equal to the size of the file being mapped. Before using this 
class, be sure your have plenty of virtual address space, e.g. by using a 64 
bit JRE, or a 32 bit JRE with indexes that are guaranteed to fit within the 
address space.

So, my understanding is that Solr requires physical RAM >= the index file size;
is that right?

Yongtao



Re: FW: MMapDirectory failed to map a 23G compound index segment

2011-09-21 Thread Yongtao Liu
I hit a similar issue recently.
Not sure if MMapDirectory is the right way to go.

When an index file is mapped into RAM, the JVM calls the OS file-mapping function.
The memory usage is in shared memory, so it may not be counted against the JVM
process space.

One problem I saw is that if the index file is bigger than physical RAM, and there
are a lot of queries that cause wide index file access, then the machine has no
available memory and the system becomes very slow.

What I did was change the Lucene code to disable MMapDirectory.

On Wed, Sep 21, 2011 at 1:26 PM, Yongtao Liu y...@commvault.com wrote:



 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, September 20, 2011 3:33 PM
 To: solr-user@lucene.apache.org
 Subject: Re: MMapDirectory failed to map a 23G compound index segment

 Since you hit OOME during mmap, I think this is an OS issue not a JVM
 issue.  Ie, the JVM isn't running out of memory.

 How many segments were in the unoptimized index?  It's possible the OS
 rejected the mmap because of process limits.  Run cat
 /proc/sys/vm/max_map_count to see how many mmaps are allowed.

 Or: is it possible you reopened the reader several times against the index
 (ie, after committing from Solr)?  If so, I think 2.9.x never unmaps the
 mapped areas, and so this would accumulate against the system limit.

  My memory of this is a little rusty but isn't mmap also limited by mem +
 swap on the box? What does 'free -g' report?

 I don't think this should be the case; you are using a 64 bit OS/JVM so in
 theory (except for OS system wide / per-process limits imposed) you should
 be able to mmap up to the full 64 bit address space.

 Your virtual memory is unlimited (from ulimit output), so that's good.

 Mike McCandless

 http://blog.mikemccandless.com

 On Wed, Sep 7, 2011 at 12:25 PM, Rich Cariens richcari...@gmail.com
 wrote:
  Ahoy ahoy!
 
  I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
  compound index segment file. The stack trace looks pretty much like
  every other trace I've found when searching for "OOM" and "map failed" [1].
  My configuration
  follows:
 
  Solr 1.4.1/Lucene 2.9.3 (plus SOLR-1969, https://issues.apache.org/jira/browse/SOLR-1969)
  CentOS 4.9 (Final)
  Linux 2.6.9-100.ELsmp x86_64 yada yada yada Java SE (build
  1.6.0_21-b06) Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
  ulimits:
 core file size (blocks, -c) 0
 data seg size(kbytes, -d) unlimited
 file size (blocks, -f) unlimited
 pending signals(-i) 1024
 max locked memory (kbytes, -l) 32
 max memory size (kbytes, -m) unlimited
 open files(-n) 256000
 pipe size (512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 stack size(kbytes, -s) 10240
 cpu time(seconds, -t) unlimited
 max user processes (-u) 1064959
 virtual memory(kbytes, -v) unlimited
 file locks(-x) unlimited
 
  Any suggestions?
 
  Thanks in advance,
  Rich
 
  [1]
  ...
  java.io.IOException: Map failed
   at sun.nio.ch.FileChannelImpl.map(Unknown Source)
   at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
   at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
   at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
   at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(Unknown Source)
   at org.apache.lucene.index.SegmentReader.get(Unknown Source)
   at org.apache.lucene.index.SegmentReader.get(Unknown Source)
   at org.apache.lucene.index.DirectoryReader.<init>(Unknown Source)
   at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(Unknown Source)
   at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
   at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown Source)
   at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
   at org.apache.lucene.index.IndexReader.open(Unknown Source)
  ...
  Caused by: java.lang.OutOfMemoryError: Map failed
   at sun.nio.ch.FileChannelImpl.map0(Native Method)
  ...
 