RE: duplicate records in index
You are adding the same doc twice. (See how you add acttime: writer.AddDocument(doc) is called once before the acttime field is added and once after it.)

DIGY

-----Original Message-----
From: Wen Gao [mailto:samuel.gao...@gmail.com]
Sent: Wednesday, February 16, 2011 11:35 AM
To: lucene-net-dev@lucene.apache.org
Subject: duplicate records in index

Hi,

I am creating an index from my database; however, the .cfs files contain duplicate records, e.g.:

    book1, 1, susan, 1
    book1, 1, susan, 1, 03/01/2010
    book2, 2, tom
    book2, 2, tom, 2, 03/02/2010
    ...

I got the data from several tables, and I am sure that the SQL generates only one record. Also, when I debug the code, each record is only added once, so I am confused about why the data is duplicated in the index. I build my index as follows:

    doc.Add(new Lucene.Net.Documents.Field(
        "lmname", readerreader1["lmname"].ToString(),
        //new System.IO.StringReader(readerreader["cname"].ToString()),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.TOKENIZED));
    // lmid
    doc.Add(new Lucene.Net.Documents.Field(
        "lmid", readerreader1["lmid"].ToString(),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
    // nick name of user
    doc.Add(new Lucene.Net.Documents.Field(
        "nickName", readerreader1["nickName"].ToString(),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
    // uid
    doc.Add(new Lucene.Net.Documents.Field(
        "uid", readerreader1["uid"].ToString(),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

    writer.AddDocument(doc);

    // acttime
    doc.Add(new Lucene.Net.Documents.Field(
        "acttime", readerreader1["acttime"].ToString(),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

    writer.AddDocument(doc);

Any ideas?

Thanks,
Wen Gao
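DIGY's diagnosis can be reproduced with a minimal stand-in (plain Python rather than the Lucene.Net API): an AddDocument-style call appends a record every time it is called, so calling it both before and after the acttime field is added produces two records for one logical document.

```python
# Toy model of an index: an AddDocument-style call simply appends a
# record; Lucene's IndexWriter.AddDocument has no duplicate detection
# either.
index = []

def add_document(index, doc):
    index.append(dict(doc))  # snapshot the fields at the time of the call

doc = {"lmname": "book1", "lmid": "1", "nickName": "susan", "uid": "1"}
add_document(index, doc)        # first call (before acttime): record 1
doc["acttime"] = "03/01/2010"
add_document(index, doc)        # second call: record 2, now with acttime

print(len(index))               # 2 records for one logical document
```

The fix is to add all fields first and call AddDocument exactly once per document.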
[jira] Issue Comment Edited: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995283#comment-12995283 ] michael herndon edited comment on LUCENENET-379 at 2/16/11 1:32 PM:

I think anything would be better than the current one (which would look cool if it was cleaned up and put on the side of a Chevelle, but I don't know how it would help brand Lucene.Net). I'd say keep doing a few more variations, and open it up for the public to make some submissions as well (giving credit to whoever's design is chosen, maybe even give them some social media love). The final one needs to work well in both RGB and CMYK color formats and in a scalable graphics format so that it can be resized cleanly. It should also have a visual element that can be turned into a decent 16x16 favicon (like the 3 yellow hexagons in the jpg). Keep basic color theory in mind, though: yellow is irritating on the eyes. It definitely grabs attention, but it's harder on the eyes over an extended period of time; green is the most relaxing. But above all else: keep moving forward towards something new.

:: edited due to posting this while on an empty stomach, never wise ::

Clean up Lucene.Net website
---
Key: LUCENENET-379
URL: https://issues.apache.org/jira/browse/LUCENENET-379
Project: Lucene.Net
Issue Type: Task
Reporter: George Aroush
Attachments: Lucene.zip, New Logo Idea.jpg, asfcms.zip, asfcms_1.patch

The existing Lucene.Net home page at http://lucene.apache.org/lucene.net/ is still based on the out-of-date incubation design. This JIRA task is to bring it up to date with other ASF projects' web pages. The existing website is here: https://svn.apache.org/repos/asf/lucene/lucene.net/site/ See http://www.apache.org/dev/project-site.html to get started. It would be best to start by cloning an existing ASF project's website and adapting it for Lucene.Net. Some examples: https://svn.apache.org/repos/asf/lucene/pylucene/site/ and https://svn.apache.org/repos/asf/lucene/java/site/

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: how can I get the similarity in fuzzy query
As far as I know, you'll need to calculate that manually; FuzzyQuery searches don't return any results like that.

On Wed, Feb 16, 2011 at 11:47 AM, Wen Gao samuel.gao...@gmail.com wrote:
> Hi,
> I think my situation is just to compare the similarity of strings: I want to calculate the similarity between the typed query and the returned results using *FuzzyQuery*. I have set the minimumSimilarity of FuzzyQuery to 0.5f; what I want to do is get the similarity instead of the score for every result that returns.
> Thanks for your time.
> Wen

2011/2/16 Christopher Currens currens.ch...@gmail.com
> I was going to post the link that Digy posted, which suggests not to determine a match that way. If my understanding is correct, the scores returned for a query are relative to which documents were retrieved by the search: if a document is deleted from the index, the scores will change even though the query did not, because the number of returned documents is different. If the only thing you want to do is calculate how similar a resulting string is to a search string, I suggest the Levenshtein distance algorithm (http://en.wikipedia.org/wiki/Levenshtein_distance)... but it doesn't seem like that's quite what you want to accomplish based on your question.
> Christopher

On Wed, Feb 16, 2011 at 10:55 AM, Wen Gao samuel.gao...@gmail.com wrote:
> Hi,
> I am using FuzzyQuery to get fuzzy matched results, and I want to get the similarity in percent for every matched record. For example, if I search for "databasd", it will return results such as "database", "database1", and "database11". I want to get the similarity in percent for every record, such as 87.5%, 75%, and 62.5%. How can I do this? Any ideas?
> Wen Gao
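To get the percentages Wen asked for, the Levenshtein distance Christopher mentions can be combined with the similarity formula Lucene's FuzzyTermEnum uses internally, 1 - distance / min(len(query), len(term)). A minimal sketch (plain Python, independent of Lucene):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, computed row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(query, term):
    # FuzzyTermEnum-style similarity: 1 - distance / min length
    return 1.0 - levenshtein(query, term) / min(len(query), len(term))

# reproduces the percentages from the question
for term in ("database", "database1", "database11"):
    print(term, similarity("databasd", term))   # 0.875, 0.75, 0.625
```

This only compares the strings themselves, which sidesteps the score-relativity problem Christopher describes.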
RE: how can I get the similarity in fuzzy query
Whether *fuzzy* or not, all queries are simple term queries in the end, and Lucene does not keep anything like a *similarity* per hit, just scores.

DIGY

-----Original Message-----
From: Wen Gao [mailto:samuel.gao...@gmail.com]
Sent: Wednesday, February 16, 2011 9:47 PM
To: lucene-net-dev@lucene.apache.org
Subject: Re: how can I get the similarity in fuzzy query

Hi,
I think my situation is just to compare the similarity of strings: I want to calculate the similarity between the typed query and the returned results using *FuzzyQuery*. I have set the minimumSimilarity of FuzzyQuery to 0.5f; what I want to do is get the similarity instead of the score for every result that returns.
Thanks for your time.
Wen
RE: how can I get the similarity in fuzzy query
Download the source from https://svn.apache.org/repos/asf/incubator/lucene.net/tags/Lucene.Net_2_9_2 using an SVN client (like TortoiseSVN), and open the project file with VS20XX.

DIGY

-----Original Message-----
From: Wen Gao [mailto:samuel.gao...@gmail.com]
Sent: Wednesday, February 16, 2011 9:58 PM
To: lucene-net-dev@lucene.apache.org
Subject: Re: how can I get the similarity in fuzzy query

OK, I get it. How can I recompile the Lucene source on Windows?
Thanks.
Wen

2011/2/16 Christopher Currens currens.ch...@gmail.com
> As far as I know, you'll need to calculate that manually. FuzzyQuery searches don't return any results like that.
Re: how can I get the similarity in fuzzy query
Thank you.
Wen

2011/2/16 Digy digyd...@gmail.com
> Download the source from https://svn.apache.org/repos/asf/incubator/lucene.net/tags/Lucene.Net_2_9_2 using an SVN client (like TortoiseSVN), and open the project file with VS20XX.
> DIGY
Re: Site
Off topic: can we get a [Lucene.NET] prefix for messages to the list?

On Wed, Feb 16, 2011 at 11:05 PM, Prescott Nasser geobmx...@hotmail.com wrote:
> Where does that site compile to? The incubator lucene.net site appears to be the older one.
Re: Site
So, currently we are only set up for working in the staging environment. Once we are ready to publish, we'll need to enter a new JIRA ticket for the infrastructure project and ask for the site to be set up for publishing. Once that's done, we will be able to self-publish whenever we'd like, either through the web UI for the CMS or by running the publish script on the server. Each time we publish, the changes will build and go public immediately.

The current staging site is here: http://lucene.net.staging.apache.org/lucene.net/
The CMS Web UI for our site is: https://cms.apache.org/lucene.net/

You can use the web-based editors to do most everything, and that's the preferred method for making site modifications. This provides a controlled, semi-WYSIWYG environment for editing and will perform SVN commits for you when you save. It's a pretty easy system to work with.

At first there were some issues with building the site and web UI, but Joe S in infrastructure got those taken care of today. I've cleaned up the other issues with the markdown and we've got a functioning version available at the staging site. Next steps are to edit content as a group and get it to where we are comfortable publishing it. Once we do that, we'll get set up for public publishing.

I found the #asfinfra IRC channel very helpful, as it allowed me to work with Joe in real time to get the issues resolved and get my questions answered. I suggest looking there for help on the site, as the documentation is a bit sparse and a number of aspects of the CMS design are shrouded in mystery at first because of that. Hopefully they'll get the documentation updated soon; til then, IRC and mailing lists... :)

Thanks,
Troy

On Wed, Feb 16, 2011 at 1:05 PM, Prescott Nasser geobmx...@hotmail.com wrote:
> Where does that site compile to? The incubator lucene.net site appears to be the older one.
[jira] Commented: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995595#comment-12995595 ] Troy Howard commented on LUCENENET-379:
---
The staging site and CMS Web UI are working now and are ready for us to get in there and edit content/layout/etc. I set this up with a really basic template copied from the Lucy project, which is itself copied from the default Apache site.

Browse here to see the staging site: http://lucene.net.staging.apache.org/lucene.net/
And here to edit content using the CMS Web UI: https://cms.apache.org/lucene.net/

Clean up Lucene.Net website
---
Key: LUCENENET-379
URL: https://issues.apache.org/jira/browse/LUCENENET-379
Project: Lucene.Net
Issue Type: Task
Reporter: George Aroush
Attachments: Lucene.zip, New Logo Idea.jpg, asfcms.zip, asfcms_1.patch
Re: subclassing Python classes in Java
On Feb 16, 2011, at 9:39, Bill Janssen jans...@parc.com wrote:
> How do I subclass a Python class in a JCC-wrapped Java module?

- define a Java class with native methods
- using the usual extension tricks, have a Python class implement these native methods
- define a subclass of that Java class so as to inherit these native implementations

Andi..

> In UpLib, I've got a class, uplib.ripper.Ripper, and I'd like to be able to create a Java subclass for that in my module. I presume I need a Java interface for that Python class, but how do I hook the two together so that the Java subclass can inherit from the Python class?
> Bill
[jira] Commented: (SOLR-1395) Integrate Katta
[ https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995212#comment-12995212 ] tom liu commented on SOLR-1395:
---
On the Katta slave node, my folder hierarchy is:
|/var/data|root|
|/var/data/hadoop|store hadoop data|
|/var/data/hdfszips|store zip tmp data, which is fetched from HDFS, then moved to katta's shards|
|/var/data/solr|root of solr core configs|
|/var/data/solr/seoproxy|store seoproxy's solr config, which is used by the sub-proxy|
|/var/data/katta/shards/nodename_2/seo0#seo0|store the seo0 shard, which is deployed from the master node|
|/var/data/zkdata|store zkserver data, which is zk logs and snapshots|

On the Katta master node, my folder hierarchy is:
|/var/data|root|
|/var/data/hadoop|store hadoop data|
|/var/data/hdfsfile|store solr tmp data, which is fetched from the solr dataimporter, then zipped and put to HDFS|
|/var/data/solr|root of solr core configs|
|/var/data/solr/seo|store seo's solr config, which is used by tomcat's webapp|
|/var/data/zkdata|store zkserver data, which is zk logs and snapshots|

So, my config comes from five folders:
|Master|/var/data/solr/seo|tomcat webapp's solrcore config|
|Slave|/var/data/solr/seoproxy|sub-proxy's solrcore config|
|Master|/var/data/hdfsfile|query-core's config, which is the config template|
|HDFS|http://hdfsname:9000/seo/seo0.zip|query-core seo0's zip file, which holds conf|
|Slave|/var/data/katta/shards/nodename_2/seo0#seo0/conf|query-core seo0's config, which is unzipped from seo0.zip on HDFS|

And /var/data/hdfsfile's structure is:
{noformat}
seo@seo-solr1:/var/data/hdfsfile$ ll
total 28
drwxr-xr-x 6 seo seo 4096 Oct 21 15:21 ./
drwxr-xr-x 4 seo seo 4096 Feb 16 15:49 ../
drwxr-xr-x 2 seo seo 4096 Oct  8 09:17 bin/
drwxr-xr-x 4 seo seo 4096 Jan 21 18:22 conf/
drwxr-xr-x 3 seo seo 4096 Oct 21 15:21 data/
drwxr-xr-x 2 seo seo 4096 Sep 29 14:01 lib/
-rw-r--r-- 1 seo seo 1320 Oct  8 09:20 solr.xml
{noformat}

Integrate Katta
---
Key: SOLR-1395
URL: https://issues.apache.org/jira/browse/SOLR-1395
Project: Solr
Issue Type: New Feature
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
Fix For: Next
Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, katta-solrcores.jpg, katta.node.properties, katta.zk.properties, log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, zkclient-0.1-dev.jar, zookeeper-3.2.1.jar
Original Estimate: 336h
Remaining Estimate: 336h

We'll integrate Katta into Solr so that:
* Distributed search uses Hadoop RPC
* Shard/SolrCore distribution and management
* Zookeeper based failover
* Indexes may be built using Hadoop

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Please mark distributed date faceting for 3.1
On Wed, Feb 16, 2011 at 12:06 AM, Smiley, David W. dsmi...@mitre.org wrote:
> I may have added a test just now, but I and others have been using this [simple] code for some time now. It has baked; it doesn't need more baking IMO.

I am sure people will say I am just being silly, but Hudson does a better job of testing these things than people playing with the code. For example, Hudson randomizes external variables (locale x timezone)... on the latest 1.6u23 there are 152 locales and 609 timezones (only 424 unique according to raw offset + rules). With Hudson selecting 1 of these ~65K possibilities 96 times a day, you can start to calculate how long a good baking takes for date-related functionality.

Someone can argue that because Solr insists on treating dates internally, this does not matter, but I have found and fixed timezone- and localization-related bugs in Lucene and Solr before, so that argument fails... not knowing the surrounding code, nothing makes me feel better than a couple of weeks of Hudson grinding on the code. Even then, sometimes a few weeks isn't enough... for example, if I remember right, SOLR-1821 was daylight-savings related (note: the issue was reported the very day daylight savings started in the United States, but in other timezones it had not yet, and would fail for some developers but not others).

> If this patch wasn't the biggest reason to not use distributed search (a key feature) then I wouldn't be here arguing my point. But I've apparently lost this argument already, so I give up... assign it for 3.2 if that's the best you can do, Rob. It's better than being unassigned, which is what it is now.

I don't think that would be the best, as it's not my area of expertise. If I see good patches being ignored because other devs are time-constrained, sometimes I will take the time to bring myself up to speed to get them committed, but I haven't yet given up on this patch :) Just so you know, it's nothing about your patch at all; I am just against any new features of any sort being added to 3.1 at this point.
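The locale/timezone arithmetic above can be made concrete. A back-of-the-envelope sketch (the counts come from the post; the coupon-collector estimate of exhaustive coverage is my own hypothetical addition):

```python
import math

locales = 152        # locales in 1.6u23, per the post
timezones = 424      # timezones unique by raw offset + rules
combos = locales * timezones
runs_per_day = 96    # Hudson runs per day, per the post

print(combos)        # 64448, i.e. the "~65K possibilities"

# Expected number of random draws to hit every (locale, timezone) pair
# at least once is about n * ln(n) (coupon collector), so exhaustive
# coverage takes years; a few weeks of runs samples only a fraction,
# which is why longer baking keeps turning up new combinations.
days = combos * math.log(combos) / runs_per_day
print(days > 365 * 10)   # True
```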
Re: strange problem of PForDelta decoder
Our recent experiments show that PFOR is not a good solution for AND queries. We tested it with our dataset and users' queries; in most cases PFOR is slower than VInt. We think the reason may be that most queries are very likely to contain a low-frequency term, so scoring time dominates while decoding does not. E.g., in our index the term "beijing"'s df is 2557916 and "park"'s is 2313201; both are high-frequency terms, but the count of documents containing both is only 1552. With VInt we only need to decode 1552 documents, while with PFOR we may need to decode many whole blocks. Most search engines use AND queries, so PFOR is only good for OR queries, and for AND queries whose terms are all high-frequency. So we have to give up on this in our application.

A partial decoder for PFOR? For all-high-frequency terms, use the normal PFOR decoder; for queries with low-frequency terms, use a partial decoder? A partial PFOR decoder may need many if/else branches and would be slower. Does anyone have a solution for this?

2010/12/27 Li Li fancye...@gmail.com:
> I integrated the PFOR codec into Lucene 2.9.3 and the search time comparison is as follows:
>
>                             single term (ms)   and query (ms)   or query (ms)
>   VINT in Lucene 2.9.3      11.2               36.5             38.6
>   PFor in Lucene 2.9.3       8.7               27.6             33.4
>   VINT in Lucene 4 branch   10.6               26.5             35.4
>   PFor in Lucene 4 branch    8.1               22.5             30.7
>
> My test terms are high-frequency terms because we are interested in the bad case. It seems the Lucene 4 branch's implementation of AND queries (conjunction queries) is well optimized: even with the VInt codec it's faster than PFor in Lucene 2.9.3. Could anyone tell me what optimization was done? Is storing docIDs and freqs separately making it faster? Or anything else?
>
> Another question: is anyone else interested in integrating the PFOR codec into Lucene 2.9.3 as I did (we have to use Lucene 2.9 and Solr 1.4)? And how do I contribute this patch?
2010/12/24 Michael McCandless luc...@mikemccandless.com:
> Well, an early patch somewhere was able to run PFor on trunk, but the performance wasn't great because the trunk bulk-read API is a bottleneck (this is why the bulk postings branch was created).
> Mike

On Wed, Dec 22, 2010 at 9:45 PM, Li Li fancye...@gmail.com wrote:
> I used the bulkpostings branch (https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene). Does trunk have a PForDelta decoder/encoder?

2010/12/23 Michael McCandless luc...@mikemccandless.com:
> Those are nice speedups! Did you use the 4.0 branch (i.e. trunk) or the bulkpostings branch for this test?
> Mike

On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote:
> Great improvement! I did a test on our data set. Doc count is about 2M+ and index size after optimization is about 13.3GB (including fdt). It seems Lucene 4's index format is better than Lucene 2.9.3's, and PFor gives good results. Besides the BlockEncoder for frq and pos, is there any other modification in Lucene 4?
>
>   decoder \ avg time        single word (ms)   and query (ms)   or query (ms)
>   VINT in Lucene 2.9        11.2               36.5             38.6
>   VINT in Lucene 4 branch   10.6               26.5             35.4
>   PFor in Lucene 4 branch    8.1               22.5             30.7

2010/12/21 Li Li fancye...@gmail.com:
> > OK we should have a look at that one still. We need to converge on a good default codec for 4.0. Fortunately it's trivial to take any int block encoder (fixed or variable block) and make a Lucene codec out of it!
>
> I suggest you not use this one; I fixed dozens of bugs but it still failed the random tests. Its code is hand-coded rather than generated by a program. But we may learn something from it.
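Li Li's decode-count argument can be sketched numerically (the df values come from the thread; the 128-integer block size and the simplified cost model are my assumptions):

```python
import math

BLOCK = 128                   # assumed PFOR block size
df_beijing = 2_557_916        # docs containing "beijing" (from the thread)
df_both = 1552                # docs containing both "beijing" and "park"

# VInt + skip lists: roughly one posting decoded per surviving candidate
vint_decoded = df_both

# Block codec: each candidate forces its entire block to be decoded;
# in the worst case every candidate lands in a distinct block
blocks_touched = min(df_both, math.ceil(df_beijing / BLOCK))
pfor_decoded = blocks_touched * BLOCK

print(vint_decoded, pfor_decoded)   # 1552 vs 198656 integers decoded
```

Under these assumptions the block codec decodes over a hundred times more integers for this conjunction, which matches the observation that PFOR only pays off when all query terms are high-frequency.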
Re: inverted index pruning
Great, but I think the patch is different from the method in that paper. My colleague tested this patch but didn't get good results (I don't know the details well; he just told me his experience).

2011/2/15 Andrzej Bialecki a...@getopt.org:
> On 2/15/11 11:57 AM, Li Li wrote:
> > hi all,
> > I recently read a paper, "Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee". Its idea is interesting and I have some questions I'd like to share with you.
>
> Please take a look at LUCENE-1812, LUCENE-2632 and my presentation from Apache EuroCon 2010 in Prague, "Munching and Crunching".
>
> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com  Contact: info at sigram dot com
[jira] Assigned: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen reassigned LUCENE-1812:
---
Assignee: Doron Cohen

Static index pruning by in-document term frequency (Carmel pruning)
---
Key: LUCENE-1812
URL: https://issues.apache.org/jira/browse/LUCENE-1812
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 2.9, 3.1
Reporter: Andrzej Bialecki
Assignee: Doron Cohen
Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch

This module provides tools to produce a subset of input indexes by removing postings data for those terms whose in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that, for common types of queries, returns nearly identical top-N results as compared with the original index, but with increased performance. Optionally, stored values and term vectors can also be removed; this functionality is largely independent, so it can be used without term pruning (when the term freq. threshold is set to 1).

As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: phrase recall in particular deteriorates significantly at higher threshold values.

The primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and to store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching.

NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document ids so that they stay in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed, to be quickly retrieved on-demand from the original index using the same internal document id.

Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold.

A command-line tool (PruningTool) is provided for convenience. At this moment it doesn't support all functionality available through the API.
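The thresholding described in the issue can be sketched in a few lines (plain Python, not the actual patch; per-field overrides and the field:text key format are omitted for brevity):

```python
def prune(postings, default_threshold, per_term=None):
    """Drop postings whose in-document term frequency is below the
    applicable threshold: a per-term override first, then the default.
    With the default threshold at 1, nothing is pruned.

    postings: {term: [(doc_id, tf), ...]}"""
    per_term = per_term or {}
    pruned = {}
    for term, plist in postings.items():
        threshold = per_term.get(term, default_threshold)
        kept = [(doc, tf) for doc, tf in plist if tf >= threshold]
        if kept:
            pruned[term] = kept
    return pruned

postings = {
    "apache": [(1, 5), (2, 1), (3, 2)],
    "lucene": [(1, 1), (2, 1)],
}
# with the default threshold at 2, "lucene" disappears entirely and
# the (2, 1) posting for "apache" is dropped
print(prune(postings, 2))
```

A per-term override, e.g. `prune(postings, 2, {"lucene": 1})`, keeps the "lucene" postings while the default still applies to "apache", mirroring the precedence order described above.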
[jira] Updated: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1812: Affects Version/s: (was: 3.1) (was: 2.9) Fix Version/s: 4.0 3.2 Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.2, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch This module provides tools to produce a subset of input indexes by removing postings data for those terms where their in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that for common types of queries returns nearly identical top-N results as compared with the original index, but with increased performance. Optionally, stored values and term vectors can also be removed. This functionality is largely independent, so it can be used without term pruning (when term freq. threshold is set to 1). As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: especially phrase recall deteriorates significantly at higher threshold values. Primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching. NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document id-s so that they are in sync with the original index. 
This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed, to be quickly retrieved on-demand from the original index using the same internal document id. Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then the per-field threshold if present, and finally the default threshold. A command-line tool (PruningTool) is provided for convenience. At this moment it doesn't support all functionality available through the API. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
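The per-term / per-field / default precedence described above can be sketched as a plain lookup (a minimal, hypothetical helper for illustration, not the pruning module's actual API; only the key shapes, "field" and "field:text", and the defaultThreshold fallback come from the description):

```java
import java.util.Map;

// Hypothetical sketch of the documented threshold precedence:
// per-term ("field:text") wins over per-field ("field"), which wins
// over the global defaultThreshold.
public class ThresholdResolver {
    static int resolve(Map<String, Integer> thresholds, int defaultThreshold,
                       String field, String text) {
        Integer t = thresholds.get(field + ":" + text); // per-term key
        if (t != null) return t;
        t = thresholds.get(field);                      // per-field key
        if (t != null) return t;
        return defaultThreshold;                        // global fallback
    }

    public static void main(String[] args) {
        Map<String, Integer> thresholds = Map.of("title", 2, "title:lucene", 5);
        if (resolve(thresholds, 1, "title", "lucene") != 5) // per-term wins
            throw new AssertionError();
        if (resolve(thresholds, 1, "title", "index") != 2)  // per-field next
            throw new AssertionError();
        if (resolve(thresholds, 1, "body", "index") != 1)   // default last
            throw new AssertionError();
        System.out.println("ok");
    }
}
```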
[jira] Updated: (SOLR-2105) RequestHandler param update.processor is confusing
[ https://issues.apache.org/jira/browse/SOLR-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2105: -- Attachment: SOLR-2105.patch Updated patch attached. * Use of update.processor is now deprecated but still works, logging a warning * Added test case which tests that both params work Patch is for trunk. RequestHandler param update.processor is confusing -- Key: SOLR-2105 URL: https://issues.apache.org/jira/browse/SOLR-2105 Project: Solr Issue Type: Improvement Components: update Affects Versions: 1.4.1 Reporter: Jan Høydahl Priority: Minor Attachments: SOLR-2105.patch, SOLR-2105.patch Today we reference a custom updateRequestProcessorChain using the update request parameter update.processor. See http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section This is confusing, since what we are really referencing is not an UpdateProcessor, but an updateRequestProcessorChain. I propose that update.processor is renamed as update.chain or similar -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
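The back-compat behavior the patch describes (new name preferred, old name still working but logging a warning) can be sketched like this. The helper is hypothetical, not Solr's actual SolrParams API; only the two parameter names, update.chain and update.processor, come from the issue:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed rename with a deprecation fallback:
// read "update.chain" first, fall back to the legacy "update.processor"
// with a warning, else return null (meaning: use the default chain).
public class ChainParamLookup {
    static String chainName(Map<String, String> params) {
        String chain = params.get("update.chain");      // new, preferred name
        if (chain != null) return chain;
        String legacy = params.get("update.processor"); // deprecated alias
        if (legacy != null) {
            System.err.println("WARN: update.processor is deprecated; use update.chain");
            return legacy;
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> p = new HashMap<>();
        p.put("update.processor", "dedupe");            // legacy still works
        if (!"dedupe".equals(chainName(p))) throw new AssertionError();
        p.put("update.chain", "mychain");               // new name wins
        if (!"mychain".equals(chainName(p))) throw new AssertionError();
        System.out.println("ok");
    }
}
```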
[jira] Commented: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995283#comment-12995283 ] michael herndon commented on LUCENENET-379: --- I think anything would be better than the current one (which would look cool if it was cleaned up and put on the side of a Chevelle, but I don't know how it would help brand lucene.net). I'd say keep doing a few more variations. Open it up for the public to make some submissions as well. (giving credit to whoever's design is chosen maybe even give them some social media love). The final one needs to work well with both RGB and CMYK color formats and in a scalable graphics format so that it can be resized cleanly. Also it should have a visual aspect of it that can be turned into a decent 16 x 16 favicon. (like the 3 yellow hexagons that are in the jpg). Though keep in mind basic color theory. Yellow is irritating on the eyes. It definitely grabs attention, but it's harder on the eyes for an extended period of time. Green is the most relaxing. But above all else: keep moving forward towards something new. Clean up Lucene.Net website --- Key: LUCENENET-379 URL: https://issues.apache.org/jira/browse/LUCENENET-379 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: Lucene.zip, New Logo Idea.jpg, asfcms.zip, asfcms_1.patch The existing Lucene.Net home page at http://lucene.apache.org/lucene.net/ is still based on the incubation, out-of-date design. This JIRA task is to bring it up to date with other ASF projects' web pages. The existing website is here: https://svn.apache.org/repos/asf/lucene/lucene.net/site/ See http://www.apache.org/dev/project-site.html to get started. It would be best to start by cloning an existing ASF project's website and adapting it for Lucene.Net. Some examples, https://svn.apache.org/repos/asf/lucene/pylucene/site/ and https://svn.apache.org/repos/asf/lucene/java/site/ -- This message is automatically generated by JIRA.
- For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Issue Comment Edited: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995283#comment-12995283 ] michael herndon edited comment on LUCENENET-379 at 2/16/11 1:21 PM: I think anything would be better than the current one (which would look cool if it was cleaned up and put on the side of a Chevelle, but I don't know how it would help brand lucene.net). I'd say keep doing a few more variations. Open it up for the public to make some submissions as well. (giving credit to whoever's design is chosen maybe even give them some social media love). The final one needs to work well with both RGB and CMYK color formats and in a scalable graphics format so that it can be resized cleanly. Also it should have a visual aspect of it that can be turned into a decent 16 x 16 favicon. (like the 3 yellow hexagons that are in the jpg). Though keep in mind basic color theory. Yellow is irritating on the eyes. It definitely grabs attention, but it's harder on the eyes for an extended period of time. Green is the most relaxing. But above all else: keep moving forward towards something new. was (Author: michaelherndon): I think anything would be better than the current one (which would look cool if was cleaned up and put on the side of a chevelle, but I don't know how would help brand lucene.net). I'd say keep doing a few more variations. open it up for the public to make some submissions as well. (giving credit to whoever's design is chosen maybe even give them some social media love). The final one needs to work well with both rgb and cymk color formats and in a scalable graphics format so that it can be resized cleanly. Also it should have a visual aspect of it that can be turned into a decent 16 x 16 favicon. (like the 3 yellow hexagons that is in the jpg). Though keep in mind basic color theory. Yellow is irritating on the eyes. Its definitely grabs attention, but its arder on the eyes for an extended period of time.
Green is the most relaxing. But above all else keep moving forward towards something new. Clean up Lucene.Net website --- Key: LUCENENET-379 URL: https://issues.apache.org/jira/browse/LUCENENET-379 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: Lucene.zip, New Logo Idea.jpg, asfcms.zip, asfcms_1.patch The existing Lucene.Net home page at http://lucene.apache.org/lucene.net/ is still based on the incubation, out-of-date design. This JIRA task is to bring it up to date with other ASF projects' web pages. The existing website is here: https://svn.apache.org/repos/asf/lucene/lucene.net/site/ See http://www.apache.org/dev/project-site.html to get started. It would be best to start by cloning an existing ASF project's website and adapting it for Lucene.Net. Some examples, https://svn.apache.org/repos/asf/lucene/pylucene/site/ and https://svn.apache.org/repos/asf/lucene/java/site/ -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Issue Comment Edited: (SOLR-2105) RequestHandler param update.processor is confusing
[ https://issues.apache.org/jira/browse/SOLR-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995282#comment-12995282 ] Jan Høydahl edited comment on SOLR-2105 at 2/16/11 1:24 PM: Updated patch attached. * Use of update.processor is now deprecated, logging a warning (instead of removing as in previous patch) * Added test case which tests that both params work Patch is for trunk. was (Author: janhoy): Updated patch attached. * Use of update.processor is not deprecated but still works, logging a warning * Added test case which tests that both params work Patch is for trunk. RequestHandler param update.processor is confusing -- Key: SOLR-2105 URL: https://issues.apache.org/jira/browse/SOLR-2105 Project: Solr Issue Type: Improvement Components: update Affects Versions: 1.4.1 Reporter: Jan Høydahl Priority: Minor Attachments: SOLR-2105.patch, SOLR-2105.patch Today we reference a custom updateRequestProcessorChain using the update request parameter update.processor. See http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section This is confusing, since what we are really referencing is not an UpdateProcessor, but an updateRequestProcessorChain. I propose that update.processor is renamed as update.chain or similar -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2903: --- Attachment: LUCENE-2903.patch Thanks Hao! The new patch looks great -- much leaner. I fixed a few things... new patch attached. To keep the comparison fair, I cutover BulkVInt back to Sep (it was Fixed (interleaved)). I also impl'd skipBlock in PFor4 (though this method is never called by Sep). I cutover PFor4 to var gap terms index. Finally I added back copyright headers (Simple16.java's had been stripped but other new sources were missing...). Also, we need to eventually remove the @author tags. One question: it looks like this PFOR impl can only handle up to 28 bit wide ints? Which means... it could fail in some cases? Though I suppose you would never see too many of these immense ints in one block, and so they'd always be encoded as exceptions and so it's actually safe...? Here are the results on Linux, MMapDir, 10M docs, unshuffled: ||Query||QPS BulkVInt||QPS PFor4||Pct diff|| |united states|13.66|11.63|{color:red}-14.9%{color}| |u*d|12.75|11.55|{color:red}-9.4%{color}| |un*d|24.71|22.46|{color:red}-9.1%{color}| |uni*|24.68|22.85|{color:red}-7.4%{color}| |unit*|41.22|39.25|{color:red}-4.8%{color}| |+nebraska +states|128.41|123.73|{color:red}-3.6%{color}| |spanFirst(unit, 5)|263.41|258.27|{color:red}-1.9%{color}| |+united +states|21.37|21.09|{color:red}-1.3%{color}| |title:.*[Uu]nited.*|5.70|5.66|{color:red}-0.6%{color}| |timesecnum:[1 TO 6]|15.01|14.96|{color:red}-0.4%{color}| |unit~0.7|41.78|43.44|{color:green}4.0%{color}| |united states~3|6.48|6.79|{color:green}4.8%{color}| |unit~0.5|24.61|25.83|{color:green}4.9%{color}| |spanNear([unit, state], 10, true)|52.34|55.67|{color:green}6.4%{color}| |united~0.6|11.36|12.18|{color:green}7.1%{color}| |united~0.75|15.96|17.58|{color:green}10.2%{color}| |states|53.41|61.03|{color:green}14.3%{color}| |united states|16.87|20.62|{color:green}22.2%{color}|
Very nice! Improvement of PForDelta Codec -- Key: LUCENE-2903 URL: https://issues.apache.org/jira/browse/LUCENE-2903 Project: Lucene - Java Issue Type: Improvement Reporter: hao yan Attachments: LUCENE-2903.patch, LUCENE-2903.patch There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. The FrameOfRef is a very basic one which is essentially a binary encoding (may result in huge index size). The PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature. The PatchedFrameOfRef2 is my previous implementation which is improved this time. (The Codec name is changed to NewPForDelta.) In particular, the changes are: 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the old PForDelta does not support very large exceptions (since the Simple16 does not support very large numbers). Now this has been fixed in the new LCPForDelta. 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other two PForDelta implementations in the bulk branch (FrameOfRef and PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the CodecProvider and PForDeltaFixedIntBlockCodec. 3. The performance test results are: 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for almost all kinds of queries, slightly worse than BulkVInt. 2) My NewPForDelta codec can result in the smallest index size among all 4 methods (including FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself). 3) All performance test results are achieved by running with -server instead of -client -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
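The safety question in the comment (can values wider than 28 bits break the encoder, or are they always spilled as exceptions?) hinges on the "patched" idea itself. A toy sketch of that idea, assumed for illustration only and not the LCPForDelta code from the patch: values that fit the block's chosen bit width are stored inline, and the rare wide ones go to a side list at full width.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (not the patch's code) of patched frame-of-reference:
// pick a small bit width b for the block, store values < 2^b inline, and
// spill larger values ("exceptions") with their positions to side lists.
public class PatchedBlockSketch {
    static int b = 4; // pretend 4-bit frame for the demo

    static int[] inline;                                 // low-width slots
    static List<Integer> exceptions = new ArrayList<>(); // full-width values
    static List<Integer> positions = new ArrayList<>();  // where they go back

    static void encode(int[] values) {
        inline = new int[values.length];
        exceptions.clear();
        positions.clear();
        for (int i = 0; i < values.length; i++) {
            if (values[i] < (1 << b)) {
                inline[i] = values[i];
            } else {            // too wide for the frame: record an exception
                inline[i] = 0;
                positions.add(i);
                exceptions.add(values[i]);
            }
        }
    }

    static int[] decode() {
        int[] out = inline.clone();
        for (int k = 0; k < positions.size(); k++)
            out[positions.get(k)] = exceptions.get(k);   // patch them back in
        return out;
    }

    public static void main(String[] args) {
        int[] vals = {3, 7, 1, 300_000_000, 2}; // one immense value
        encode(vals);
        int[] back = decode();
        for (int i = 0; i < vals.length; i++)
            if (back[i] != vals[i]) throw new AssertionError();
        System.out.println("exceptions=" + exceptions.size()); // prints exceptions=1
    }
}
```

The hunch in the comment corresponds to the exception path here: a value too wide for the frame never needs to fit the inline width, so the width limit only matters if the exception mechanism itself were bypassed.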
[jira] Created: (SOLR-2366) Facet Range Gaps
Facet Range Gaps Key: SOLR-2366 URL: https://issues.apache.org/jira/browse/SOLR-2366 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor Fix For: 3.2, 4.0 There really is no reason why the range gap for date and numeric faceting needs to be evenly spaced. For instance, if and when SOLR-1581 is completed and one were doing spatial distance calculations, one could facet by function into 3 different sized buckets: walking distance (0-5KM), driving distance (5KM-150KM) and everything else (150KM+), for instance. We should be able to quantize the results into arbitrarily sized buckets. I'd propose the syntax to be a comma separated list of sizes for each bucket. If only one value is specified, then it behaves as it currently does. Otherwise, it creates the different size buckets. If the number of buckets doesn't evenly divide up the space, then the size of the last bucket specified is used to fill out the remaining space (not sure on this) For instance, facet.range.start=0 facet.range.end=400 facet.range.gap=5,25,50,100 would yield buckets of: 0-5,5-30,30-80,80-180,180-280,280-380,380-400 -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
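The bucket arithmetic proposed above can be sketched as follows. The helper is hypothetical, not Solr code; it assumes the last listed gap repeats until the end is reached and the final bucket is clipped at facet.range.end, which reproduces the 0-5,5-30,30-80,80-180,180-280,280-380,380-400 example from the issue:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of variable-width facet range gaps: consume the gap
// list in order, repeat the last gap once the list is exhausted, and clip
// the final bucket at the range end.
public class VariableGaps {
    static List<int[]> buckets(int start, int end, int... gaps) {
        List<int[]> out = new ArrayList<>();
        int lo = start, i = 0;
        while (lo < end) {
            int gap = gaps[Math.min(i++, gaps.length - 1)]; // last gap repeats
            int hi = Math.min(lo + gap, end);               // clip at end
            out.add(new int[]{lo, hi});
            lo = hi;
        }
        return out;
    }

    public static void main(String[] args) {
        // The example from the issue: start=0, end=400, gap=5,25,50,100
        List<int[]> b = buckets(0, 400, 5, 25, 50, 100);
        int[][] expect = {{0,5},{5,30},{30,80},{80,180},{180,280},{280,380},{380,400}};
        if (b.size() != expect.length) throw new AssertionError();
        for (int i = 0; i < expect.length; i++)
            if (b.get(i)[0] != expect[i][0] || b.get(i)[1] != expect[i][1])
                throw new AssertionError();
        System.out.println("ok");
    }
}
```

The "not sure on this" question in the issue (what fills the remaining space) is resolved here by repeating the last gap, which is one of the plausible readings.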
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4967 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4967/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriter.testIndexingThenDeleting Error Message: flush happened too quickly during deleting count=1155 Stack Trace: junit.framework.AssertionFailedError: flush happened too quickly during deleting count=1155 at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115) at org.apache.lucene.index.TestIndexWriter.testIndexingThenDeleting(TestIndexWriter.java:2579) Build Log (for compile errors): [...truncated 3048 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2903: Attachment: for_pfor.patch Nice results Hao! One idea for the low-frequency multi-term queries (foo* etc) could be in the attached patch: I only implemented this for the existing FrameOfRef and PatchedFrameOfRef but perhaps you could steal/test the idea with your implementation. In these cases I switched them over to a single byte header instead of an int. This means less overhead per-block, a slightly smaller (maybe 1-2%?) index. It might be more useful if we switch your codec over from Sep layout to interleaved (Fixed) layout, to make a more efficient skipBlock()... but this interleaved layout is still a work in progress. Improvement of PForDelta Codec -- Key: LUCENE-2903 URL: https://issues.apache.org/jira/browse/LUCENE-2903 Project: Lucene - Java Issue Type: Improvement Reporter: hao yan Attachments: LUCENE-2903.patch, LUCENE-2903.patch, for_pfor.patch There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. The FrameOfRef is a very basic one which is essentially a binary encoding (may result in huge index size). The PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature. The PatchedFrameOfRef2 is my previous implementation which is improved this time. (The Codec name is changed to NewPForDelta.) In particular, the changes are: 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the old PForDelta does not support very large exceptions (since the Simple16 does not support very large numbers). Now this has been fixed in the new LCPForDelta. 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other two PForDelta implementations in the bulk branch (FrameOfRef and PatchedFrameOfRef).
The codec's name is NewPForDelta, as you can see in the CodecProvider and PForDeltaFixedIntBlockCodec. 3. The performance test results are: 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for almost all kinds of queries, slightly worse than BulkVInt. 2) My NewPForDelta codec can result in the smallest index size among all 4 methods (including FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself). 3) All performance test results are achieved by running with -server instead of -client -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2366) Facet Range Gaps
[ https://issues.apache.org/jira/browse/SOLR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-2366: -- Attachment: SOLR-2366.patch Adds variable width gap capabilities and some tests. Still needs some more tests for edge conditions, etc. but it is something that others can look at and comment on. Facet Range Gaps Key: SOLR-2366 URL: https://issues.apache.org/jira/browse/SOLR-2366 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2366.patch There really is no reason why the range gap for date and numeric faceting needs to be evenly spaced. For instance, if and when SOLR-1581 is completed and one were doing spatial distance calculations, one could facet by function into 3 different sized buckets: walking distance (0-5KM), driving distance (5KM-150KM) and everything else (150KM+), for instance. We should be able to quantize the results into arbitrarily sized buckets. I'd propose the syntax to be a comma separated list of sizes for each bucket. If only one value is specified, then it behaves as it currently does. Otherwise, it creates the different size buckets. If the number of buckets doesn't evenly divide up the space, then the size of the last bucket specified is used to fill out the remaining space (not sure on this) For instance, facet.range.start=0 facet.range.end=400 facet.range.gap=5,25,50,100 would yield buckets of: 0-5,5-30,30-80,80-180,180-280,280-380,380-400 -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995362#comment-12995362 ] Doug Steigerwald commented on SOLR-236: --- Has anyone successfully applied field collapsing to the branch_3x branch? Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: Next Attachments: DocSetScoreCollector.java, NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, SOLR-236-1_4_1-NPEfix.patch, SOLR-236-1_4_1-paging-totals-working.patch, SOLR-236-1_4_1.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-distinctFacet.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch, collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
quasidistributed.additional.patch, solr-236.patch This patch includes a new feature called Field collapsing. Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site are collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection. http://www.fastsearch.com/glossary.aspx?m=48&amid=299 The implementation adds 3 new query parameters (SolrParams): collapse.field to choose the field used to group results collapse.type normal (default value) or adjacent collapse.max to select how many continuous results are allowed before collapsing TODO (in progress): - More documentation (on source code) - Test cases Two patches: - field_collapsing.patch for current development version - field_collapsing_1.1.0.patch for Solr-1.1.0 P.S.: Feedback and misspelling correction are welcome ;-) -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Assigned: (SOLR-2105) RequestHandler param update.processor is confusing
[ https://issues.apache.org/jira/browse/SOLR-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller reassigned SOLR-2105: - Assignee: Mark Miller RequestHandler param update.processor is confusing -- Key: SOLR-2105 URL: https://issues.apache.org/jira/browse/SOLR-2105 Project: Solr Issue Type: Improvement Components: update Affects Versions: 1.4.1 Reporter: Jan Høydahl Assignee: Mark Miller Priority: Minor Attachments: SOLR-2105.patch, SOLR-2105.patch Today we reference a custom updateRequestProcessorChain using the update request parameter update.processor. See http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section This is confusing, since what we are really referencing is not an UpdateProcessor, but an updateRequestProcessorChain. I propose that update.processor is renamed as update.chain or similar -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-1191) NullPointerException in delta import
[ https://issues.apache.org/jira/browse/SOLR-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunnlaugur Thor Briem updated SOLR-1191: Attachment: SOLR-1191.patch Updated patch with unit test. NullPointerException in delta import Key: SOLR-1191 URL: https://issues.apache.org/jira/browse/SOLR-1191 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.3, 1.4 Environment: OS: Windows Linux. Java: 1.6 DB: MySQL SQL Server Reporter: Ali Syed Assignee: Noble Paul Fix For: 1.4 Attachments: SOLR-1191.patch, SOLR-1191.patch Seeing a few of these NullPointerExceptions during delta imports. Once this happens, delta import stops working and keeps giving the same error. java.lang.NullPointerException at org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:622) at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:240) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:159) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:337) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:376) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:355) Running delta import for a particular entity fixes the problem and delta import starts working again.
Here is the log just before and after the exception 05/27 11:59:29 86987686 INFO btpool0-538 org.apache.solr.core.SolrCore - [localhost] webapp=/solr path=/dataimport params={command=delta-import&optimize=false} status=0 QTime=0 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.SolrWriter - Read dataimport.properties 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.DataImporter - Starting Delta Import 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.SolrWriter - Read dataimport.properties 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Starting delta collection. 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: content 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: content rows obtained : 0 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: content rows obtained : 0 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed parentDeltaQuery for Entity: content 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: job 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity job with URL: jdbc:sqlserver://localhost;databaseName=TestDB 05/27 11:59:29 86987704 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 12 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: job rows obtained : 0 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: job rows obtained : 0
05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed parentDeltaQuery for Entity: job 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Delta Import completed successfully 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Starting delta collection. 05/27 11:59:29 86987709 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: user 05/27 11:59:29 86987709 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity user with URL: jdbc:sqlserver://localhost;databaseName=TestDB 05/27 11:59:29 86987716 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 7 05/27 11:59:29 86987873 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: user rows obtained : 46 05/27 11:59:29 86987873 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: user rows obtained : 0 05/27 11:59:29 86987873 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder
[jira] Created: (SOLR-2367) DataImportHandler unit tests are very noisy
DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Priority: Trivial Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2367) DataImportHandler unit tests are very noisy
[ https://issues.apache.org/jira/browse/SOLR-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunnlaugur Thor Briem updated SOLR-2367: Attachment: SOLR-2367.patch Patch to address this issue. Creates conf directories under work directory before test runs, and suppresses a warning. The console noise that remains is some XML parsing failure, which may or may not be meaningful (I don't know) — at least now it is visible. :) DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Priority: Trivial Attachments: SOLR-2367.patch Original Estimate: 5m Remaining Estimate: 5m Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
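The fix described above amounts to pre-creating the conf directory so the dataimport.properties write cannot fail and spam the test log. A minimal sketch of that idea, with paths invented for the demo (this is not the actual test harness layout or the patch's code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the SOLR-2367 idea: create the conf directory under the test
// work directory before the test runs, so the properties write succeeds
// instead of throwing and printing a stack trace.
public class PrepareTestConf {
    static Path ensureConfDir(Path workDir) throws IOException {
        // createDirectories is a no-op if the directory already exists
        return Files.createDirectories(workDir.resolve("conf"));
    }

    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("dih-test");
        Path conf = ensureConfDir(work);
        if (!Files.isDirectory(conf)) throw new AssertionError();
        // now a write like this one no longer fails noisily
        Files.writeString(conf.resolve("dataimport.properties"), "last_index_time=\n");
        System.out.println("ok");
    }
}
```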
[jira] Resolved: (SOLR-1553) extended dismax query parser
[ https://issues.apache.org/jira/browse/SOLR-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley resolved SOLR-1553. Resolution: Fixed Fix Version/s: (was: 4.0) (was: 1.5) Resolving. Improvements can be tracked in a new issue. extended dismax query parser Key: SOLR-1553 URL: https://issues.apache.org/jira/browse/SOLR-1553 Project: Solr Issue Type: New Feature Reporter: Yonik Seeley Assignee: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1553.patch, SOLR-1553.pf-refactor.patch, edismax.unescapedcolon.bug.test.patch, edismax.unescapedcolon.bug.test.patch, edismax.userFields.patch An improved user-facing query parser based on dismax -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2368) Improve extended dismax (edismax) parser
Improve extended dismax (edismax) parser Key: SOLR-2368 URL: https://issues.apache.org/jira/browse/SOLR-2368 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Improve edismax and replace dismax once it has all of the needed features. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (SOLR-2367) DataImportHandler unit tests are very noisy
[ https://issues.apache.org/jira/browse/SOLR-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995387#comment-12995387 ] Gunnlaugur Thor Briem edited comment on SOLR-2367 at 2/16/11 5:09 PM: -- Patch to address this issue. Creates conf directories under work directory before test runs, and suppresses a warning. The console noise that remains is some XML parsing failure, which may or may not be meaningful (I don't know) — at least now it is visible. :) This patch is against branch_3x as of just now. was (Author: gthb): Patch to address this issue. Creates conf directories under work directory before test runs, and suppresses a warning. The console noise that remains is some XML parsing failure, which may or may not be meaningful (I don't know) — at least now it is visible. :) DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Priority: Trivial Attachments: SOLR-2367.patch Original Estimate: 5m Remaining Estimate: 5m Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (SOLR-1191) NullPointerException in delta import
[ https://issues.apache.org/jira/browse/SOLR-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995381#comment-12995381 ] Gunnlaugur Thor Briem edited comment on SOLR-1191 at 2/16/11 5:09 PM: -- Updated patch with unit test, against current branch_3x. was (Author: gthb): Updated patch with unit test. NullPointerException in delta import Key: SOLR-1191 URL: https://issues.apache.org/jira/browse/SOLR-1191 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.3, 1.4 Environment: OS: Windows Linux. Java: 1.6 DB: MySQL SQL Server Reporter: Ali Syed Assignee: Noble Paul Fix For: 1.4 Attachments: SOLR-1191.patch, SOLR-1191.patch Seeing few of these NullPointerException during delta imports. Once this happens delta import stops working and keeps giving the same error. java.lang.NullPointerException at org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:622) at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:240) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:159) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:337) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:376) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:355) Running delta import for a particular entity fixes the problem and delta import start working again. 
Here is the log just before after the exception 05/27 11:59:29 86987686 INFO btpool0-538 org.apache.solr.core.SolrCore - [localhost] webapp=/solr path=/dataimport params={command=delta-importoptimize=false} status=0 QTime=0 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.SolrWriter - Read dataimport.properties 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.DataImporter - Starting Delta Import 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.SolrWriter - Read dataimport.properties 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Starting delta collection. 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: content 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: content rows obtained : 0 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: content rows obtained : 0 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed parentDeltaQuery for Entity: content 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: job 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity job with URL: jdbc:sqlserver://localhost;databaseName=TestDB 05/27 11:59:29 86987704 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 12 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: job rows obtained : 0 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: job rows obtained : 0 
05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed parentDeltaQuery for Entity: job 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Delta Import completed successfully 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Starting delta collection. 05/27 11:59:29 86987709 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: user 05/27 11:59:29 86987709 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity user with URL: jdbc:sqlserver://localhost;databaseName=TestDB 05/27 11:59:29 86987716 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 7 05/27 11:59:29 86987873 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: user rows obtained : 46 05/27 11:59:29 86987873 INFO Thread-4162
Re: duplicate records in index
I saw that. so careless.. Thanks. Wen Gao 2011/2/16 Digy digyd...@gmail.com You are adding the same doc twice. (See how you add acttime ) DIGY -Original Message- From: Wen Gao [mailto:samuel.gao...@gmail.com] Sent: Wednesday, February 16, 2011 11:35 AM To: lucene-net-...@lucene.apache.org Subject: duplicate records in index Hi, I am creating an index from my database, however, the record in .cfs files contains duplicate records, e.g. book1, 1, susan, 1 book1, 1,susan,1, 03/01/2010 book2, 2,tom, book2,2,tom, 2,03/02/2010 .. I got the data from several tables, and am sure that the sql only generate one record. Also, when I debug the code, the record is only added once. So I am confused whether data replicate in idex. I define my index as following format: doc.Add(new Lucene.Net.Documents.Field( lmname, readerreader1[lmname].ToString(), //new System.IO.StringReader(readerreader[cname].ToString()), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.TOKENIZED) ); //lmid doc.Add(new Lucene.Net.Documents.Field( lmid, readerreader1[lmid].ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED)); // nick name of user doc.Add(new Lucene.Net.Documents.Field( nickName, readerreader1[nickName].ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED)); // uid doc.Add(new Lucene.Net.Documents.Field( uid, readerreader1[uid].ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED)); writer.AddDocument(doc); // acttime doc.Add(new Lucene.Net.Documents.Field( acttime, readerreader1[acttime].ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED)); writer.AddDocument(doc); // Any ideas? Thanks, Wen Gao
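The fix DIGY points at can be sketched generically: add every field to the document first, then call AddDocument exactly once per record (the thread's code called it once before and once after adding acttime, hence the duplicates). This sketch uses a plain Java list as a stand-in for the real IndexWriter, with the field names from the message, just to show the corrected ordering:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SingleAddDemo {
    // Stand-in for IndexWriter.addDocument: simply collects documents.
    static final List<Map<String, String>> index = new ArrayList<>();

    static void addDocument(Map<String, String> doc) {
        index.add(doc);
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("lmname", "book1");
        doc.put("lmid", "1");
        doc.put("nickName", "susan");
        doc.put("uid", "1");
        doc.put("acttime", "03/01/2010"); // add acttime BEFORE the add
        addDocument(doc);                 // call addDocument exactly once

        System.out.println(index.size()); // one record per row, no duplicate
    }
}
```

In the original C# code, this means moving `writer.AddDocument(doc);` so it appears only once, after the acttime field has been added.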
[jira] Updated: (SOLR-2367) DataImportHandler unit tests are very noisy
[ https://issues.apache.org/jira/browse/SOLR-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-2367: -- Attachment: SOLR-2367.patch Thanks for the patch. I modified it, to just specify the absolute path to these directories. this way we don't have to make any useless directories underneath the CWD. Separately, as far as the exceptions, this is in the test TestErrorHandling, its 'expected exceptions'. I tried to modify this test to use the 'expected exception' logic in SolrTestCaseJ4, etc, but I could not make it work. I think this is because DIH throws DataImportHandlerExceptions (extends RuntimeException) instead of ones that extend SolrException? DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Assignee: Robert Muir Priority: Trivial Attachments: SOLR-2367.patch, SOLR-2367.patch Original Estimate: 5m Remaining Estimate: 5m Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995405#comment-12995405 ] Alex Thompson commented on LUCENENET-379: - The concept of the current logo isn't that bad, its just executed poorly (looks like someone did it in Paint). I don't mind if it changes but maybe keep a green color scheme to imply our loose connection with java lucene. Clean up Lucene.Net website --- Key: LUCENENET-379 URL: https://issues.apache.org/jira/browse/LUCENENET-379 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: Lucene.zip, New Logo Idea.jpg, asfcms.zip, asfcms_1.patch The existing Lucene.Net home page at http://lucene.apache.org/lucene.net/ is still based on the incubation, out of date design. This JIRA task is to bring it up to date with other ASF project's web page. The existing website is here: https://svn.apache.org/repos/asf/lucene/lucene.net/site/ See http://www.apache.org/dev/project-site.html to get started. It would be best to start by cloning an existing ASF project's website and adopting it for Lucene.Net. Some examples, https://svn.apache.org/repos/asf/lucene/pylucene/site/ and https://svn.apache.org/repos/asf/lucene/java/site/ -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
subclassing Python classes in Java
How do I subclass a Python class in a JCC-wrapped Java module? In UpLib, I've got a class, uplib.ripper.Ripper, and I'd like to be able to create a Java subclass for that in my module. I presume I need a Java interface for that Python class, but how do I hook the two together so that the Java subclass can inherit from the Python class? Bill
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995436#comment-12995436 ] hao yan commented on LUCENE-2903: - Thank you both! Thanks for testing my codec so quickly, Michael! RE: One question: it looks like this PFOR impl can only handle up to 28-bit-wide ints? Which means... could it fail on some cases? Though I suppose you would never see too many of these immense ints in one block, and so they'd always be encoded as exceptions and so it's actually safe...? Hao: This won't fail. In my PFOR impl, I first call checkBigNumbers() to see if there is any number >= 2^28; if there is, I force encoding of the lower 4 bits using the 128 4-bit slots. Thus, all exceptions left to Simple16 are < 2^28, which it can definitely handle. So there are no failure cases! :) BTW, my PFOR impl saves more index size than VInt and the other PFOR impls. Thus, if the use case is real-time search, which requires loading the index from disk to memory frequently, my PFOR impl may save even more. Improvement of PForDelta Codec -- Key: LUCENE-2903 URL: https://issues.apache.org/jira/browse/LUCENE-2903 Project: Lucene - Java Issue Type: Improvement Reporter: hao yan Attachments: LUCENE-2903.patch, LUCENE-2903.patch, for_pfor.patch There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. The FrameOfRef is a very basic one, essentially a plain binary encoding (which may result in huge index size). The PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature. The PatchedFrameOfRef2 is my previous implementation, which is improved this time. (The codec name is changed to NewPForDelta.) In particular, the changes are: 1.
I fixed the bug of my previous version (in Lucene-1410.patch), where the old PForDelta did not support very large exceptions (since Simple16 does not support very large numbers). This has now been fixed in the new LCPForDelta. 2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other two PForDelta implementations in the bulk branch (FrameOfRef and PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the CodecProvider and PForDeltaFixedIntBlockCodec. 3. The performance test results are: 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for almost all kinds of queries, and only slightly worse than BulkVInt. 2) My NewPForDelta codec results in the smallest index size among all 4 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself). 3) All performance test results were achieved by running with -server instead of -client.
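Hao's trick (force-encoding the low bits whenever any value reaches 2^28, so that Simple16 only ever sees small exceptions) sits on top of plain frame-of-reference bit packing. As a hedged sketch of that shared core step only (not the patch's actual code, and with simplified method names), here is how a block of small ints is packed into a fixed bit width and unpacked again; a real PForDelta codec would additionally record the positions of values that do not fit and patch them from an exception list:

```java
public class ForDemo {
    // Pack each value (all assumed < 2^bits) into a long[] at a fixed bit width.
    static long[] pack(int[] values, int bits) {
        long[] out = new long[(values.length * bits + 63) / 64];
        int bitPos = 0;
        for (int v : values) {
            int word = bitPos / 64, off = bitPos % 64;
            out[word] |= ((long) v) << off;
            if (off + bits > 64) { // value straddles a word boundary
                out[word + 1] |= ((long) v) >>> (64 - off);
            }
            bitPos += bits;
        }
        return out;
    }

    static int[] unpack(long[] packed, int count, int bits) {
        int[] out = new int[count];
        long mask = (1L << bits) - 1;
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int word = bitPos / 64, off = bitPos % 64;
            long v = packed[word] >>> off;
            if (off + bits > 64) {
                v |= packed[word + 1] << (64 - off);
            }
            out[i] = (int) (v & mask);
            bitPos += bits;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docDeltas = {3, 7, 1, 15, 2, 9, 4, 11};
        int bits = 4; // the largest value (15) fits in 4 bits
        int[] restored = unpack(pack(docDeltas, bits), docDeltas.length, bits);
        System.out.println(java.util.Arrays.equals(docDeltas, restored));
    }
}
```

The index-size win of such codecs comes from choosing `bits` per block: 128 doc-delta values at 4 bits occupy 64 bytes, versus at least 128 bytes for the same values as VInts.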
how can I get the similarity in fuzzy query
Hi, I am using FuzzyQuery to get fuzzy matched results. I want to get the similarity in percent for every matched record. For example, if I search for databasd, it will return results such as database, database1, and database11. I want to get the similarity in percent for every record, such as 87.5%, 75%, and 62.5%. How can I do this? Any ideas? Wen Gao
[jira] Commented: (SOLR-2367) DataImportHandler unit tests are very noisy
[ https://issues.apache.org/jira/browse/SOLR-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995466#comment-12995466 ] Gunnlaugur Thor Briem commented on SOLR-2367: - Oh, right, much neater. DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Assignee: Robert Muir Priority: Trivial Attachments: SOLR-2367.patch, SOLR-2367.patch Original Estimate: 5m Remaining Estimate: 5m Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: how can I get the similarity in fuzzy query
http://wiki.apache.org/lucene-java/ScoresAsPercentages DIGY -Original Message- From: Wen Gao [mailto:samuel.gao...@gmail.com] Sent: Wednesday, February 16, 2011 8:55 PM To: lucene-net-...@lucene.apache.org Subject: how can I get the similarity in fuzzy query Hi, I am using FuzzyQuery to get fuzzy mathed results. I want to get the similarity in percent for every matched record. for example, if i search for databasd, and it will return results such as database, database1, and database11. I want to get the similarity in percent for evey record, such as 87.5%, 75%, and 62.5%. How can I do this? Any ideas? Wen Gao
[jira] Updated: (SOLR-2367) DataImportHandler unit tests are very noisy
[ https://issues.apache.org/jira/browse/SOLR-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunnlaugur Thor Briem updated SOLR-2367: Attachment: SOLR-2367-extend-SolrException.patch If it helps, here's a patch that makes DataImportHandlerException extend SolrException (and deprecates a constructor that seems not to be used anywhere). All tests pass, but beyond that this has not been tried out at runtime (and maybe the change isn't even appropriate?) ... does this make the exception silencing work? DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Assignee: Robert Muir Priority: Trivial Attachments: SOLR-2367-extend-SolrException.patch, SOLR-2367.patch, SOLR-2367.patch Original Estimate: 5m Remaining Estimate: 5m Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2923) remove writer.optimize() from contrib/demo
remove writer.optimize() from contrib/demo -- Key: LUCENE-2923 URL: https://issues.apache.org/jira/browse/LUCENE-2923 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1, 4.0 I don't think we should include optimize in the demo; many people start from the demo and may think you must optimize to do searching, and that's clearly not the case. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2367) DataImportHandler unit tests are very noisy
[ https://issues.apache.org/jira/browse/SOLR-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995479#comment-12995479 ] Robert Muir commented on SOLR-2367: --- Thanks for the followup patch, I will try and see if i can use the exception ignores mechanism now with it... maybe this time it will work. DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Assignee: Robert Muir Priority: Trivial Attachments: SOLR-2367-extend-SolrException.patch, SOLR-2367.patch, SOLR-2367.patch Original Estimate: 5m Remaining Estimate: 5m Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: how can I get the similarity in fuzzy query
I was going to post the link that Digy posted, which suggests not to determine a match that way. If my understanding is correct, the scores returned for a query are relative to which documents were retrieved by the search: if a document is deleted from the index, the scores will change even though the query did not, because the number of returned documents is different. If the only thing you wanted to do was to calculate how close a resulting string is to a search string, I suggest the Levenshtein distance algorithm http://en.wikipedia.org/wiki/Levenshtein_distance ...but it doesn't seem like that's quite what you want to accomplish based on your question. Christopher On Wed, Feb 16, 2011 at 10:55 AM, Wen Gao samuel.gao...@gmail.com wrote: Hi, I am using FuzzyQuery to get fuzzy matched results. I want to get the similarity in percent for every matched record. For example, if I search for databasd, it will return results such as database, database1, and database11. I want to get the similarity in percent for every record, such as 87.5%, 75%, and 62.5%. How can I do this? Any ideas? Wen Gao
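The Levenshtein suggestion maps directly onto the percentages in the question. Assuming similarity is computed as 1 - editDistance / min(queryLength, termLength), which is roughly how Lucene's fuzzy matching scales edit distance (the exact normalization may differ by version), a small self-contained sketch reproduces 87.5%, 75%, and 62.5% for databasd against the three results:

```java
public class FuzzySimilarity {
    // Classic dynamic-programming Levenshtein edit distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Assumed normalization: 1 - editDistance / min(len(query), len(term)).
    static double similarity(String query, String term) {
        return 1.0 - (double) distance(query, term)
                     / Math.min(query.length(), term.length());
    }

    public static void main(String[] args) {
        for (String term : new String[] {"database", "database1", "database11"}) {
            System.out.println(term + " " + similarity("databasd", term));
        }
    }
}
```

This computes the string-to-string similarity independently of the hit list, so it avoids the relative-scoring caveat from the ScoresAsPercentages page.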
Re: how can I get the similarity in fuzzy query
Hi, I think my situation is just comparing the similarity of strings: I want to calculate the similarity between the typed query and the returned results using *FuzzyQuery*. I have set the minimumSimilarity of FuzzyQuery to 0.5f; what I want to do is get the similarity, instead of the score, for every result that is returned. Thanks for your time. Wen 2011/2/16 Christopher Currens currens.ch...@gmail.com I was going to post the link that Digy posted, which suggests not to determine a match that way. If my understanding is correct, the scores returned for a query are relative to which documents were retrieved by the search: if a document is deleted from the index, the scores will change even though the query did not, because the number of returned documents is different. If the only thing you wanted to do was to calculate how close a resulting string is to a search string, I suggest the Levenshtein distance algorithm http://en.wikipedia.org/wiki/Levenshtein_distance ...but it doesn't seem like that's quite what you want to accomplish based on your question. Christopher On Wed, Feb 16, 2011 at 10:55 AM, Wen Gao samuel.gao...@gmail.com wrote: Hi, I am using FuzzyQuery to get fuzzy matched results. I want to get the similarity in percent for every matched record. For example, if I search for databasd, it will return results such as database, database1, and database11. I want to get the similarity in percent for every record, such as 87.5%, 75%, and 62.5%. How can I do this? Any ideas? Wen Gao
[jira] Commented: (LUCENE-2923) cleanup contrib/demo
[ https://issues.apache.org/jira/browse/LUCENE-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995490#comment-12995490 ] Uwe Schindler commented on LUCENE-2923: --- Yeah, we should remove the optimize. Too many people tell me exactly that: they think they should optimize because they see it in almost every piece of demo code. With recent Lucene versions, optimizing is not needed anymore. It's hard to explain to people, so example code and books should never tell them to optimize. Books about Lucene should instead explain when optimizing is needed or useful, to keep people from always doing it. cleanup contrib/demo Key: LUCENE-2923 URL: https://issues.apache.org/jira/browse/LUCENE-2923 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1, 4.0 I don't think we should include optimize in the demo; many people start from the demo and may think you must optimize to do searching, and that's clearly not the case. I think we should also use a buffered reader in FileDocument? And... I'm tempted to remove IndexHTML (and the html parser) entirely. It's ancient, and we now have Tika to extract text from many doc formats.
[jira] Updated: (LUCENE-2923) cleanup contrib/demo
[ https://issues.apache.org/jira/browse/LUCENE-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2923: --- Attachment: LUCENE-2923.patch Patch. cleanup contrib/demo Key: LUCENE-2923 URL: https://issues.apache.org/jira/browse/LUCENE-2923 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2923.patch I don't think we should include optimize in the demo; many people start from the demo and may think you must optimize to do searching, and that's clearly not the case. I think we should also use a buffered reader in FileDocument? And... I'm tempted to remove IndexHTML (and the html parser) entirely. It's ancient, and we now have Tika to extract text from many doc formats. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2923) cleanup contrib/demo
[ https://issues.apache.org/jira/browse/LUCENE-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995513#comment-12995513 ] Mark Miller commented on LUCENE-2923: - bq. I think we should also use a buffered reader in FileDocument? And close the reader... cleanup contrib/demo Key: LUCENE-2923 URL: https://issues.apache.org/jira/browse/LUCENE-2923 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2923.patch I don't think we should include optimize in the demo; many people start from the demo and may think you must optimize to do searching, and that's clearly not the case. I think we should also use a buffered reader in FileDocument? And... I'm tempted to remove IndexHTML (and the html parser) entirely. It's ancient, and we now have Tika to extract text from many doc formats. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2923) cleanup contrib/demo
[ https://issues.apache.org/jira/browse/LUCENE-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995518#comment-12995518 ] Mark Miller commented on LUCENE-2923: - bq. I don't think we should include optimize in the demo; I wonder if it wouldn't be better to leave it, but commented out, with a short explanation. Optimizing is not necessary, but it clearly has benefits for query perf! If you are not updating often, I think it can make perfect sense. So I'm fine with just dropping it, but I'm not sure whether commenting it out and putting in something like: // for an index that is not updated often, we might optimize now (or some variation) wouldn't be better... cleanup contrib/demo Key: LUCENE-2923 URL: https://issues.apache.org/jira/browse/LUCENE-2923 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2923.patch I don't think we should include optimize in the demo; many people start from the demo and may think you must optimize to do searching, and that's clearly not the case. I think we should also use a buffered reader in FileDocument? And... I'm tempted to remove IndexHTML (and the html parser) entirely. It's ancient, and we now have Tika to extract text from many doc formats.
Re: how can I get the similarity in fuzzy query
If you are running in VS 2010, I'd advise saving yourself some trouble and just grabbing the 2.9.2 package off nuget. On Wed, Feb 16, 2011 at 3:13 PM, Wen Gao samuel.gao...@gmail.com wrote: Thanks you. Wen 2011/2/16 Digy digyd...@gmail.com Download the source from https://svn.apache.org/repos/asf/incubator/lucene.net/tags/Lucene.Net_2_9_2 using a svn client(like TortoiseSVN), and open the project file with VS20XX. DIGY -Original Message- From: Wen Gao [mailto:samuel.gao...@gmail.com] Sent: Wednesday, February 16, 2011 9:58 PM To: lucene-net-...@lucene.apache.org Subject: Re: how can I get the similarity in fuzzy query OK. i get it. how can I recompile a Lucene_src on Windows? Thanks. Wen 2011/2/16 Christopher Currens currens.ch...@gmail.com As far as i know, you'll need to calculate that manually. FuzzyQuery searches don't return any results like that. On Wed, Feb 16, 2011 at 11:47 AM, Wen Gao samuel.gao...@gmail.com wrote: Hi, I think my situation is just to compare the similarity of strings: I want to calculate the similarity between the typed results and the returned results using *FuzzyQuery*. I have set the minimumSimilarity of FuzzyQuery as 0.5f, what i want to do is get the similariy instead of score for every result that returns. Thanks for your time. Wen 2011/2/16 Christopher Currens currens.ch...@gmail.com I was going to post the link that Digy posted, which suggests not to determine a match that way. If my understanding is correct, the scores returned for a query are relative to which documents were retrieved by the search, in that if a document is deleted from the index, the scores will change even though the query did not, because the number of returned documents are different. 
If the only thing you wanted to do was to calculate how a resulting string was to a search string, I suggest the Levenshtein Distance algorithm http://en.wikipedia.org/wiki/Levenshtein_distance...but it doesn't seem like that's quite what you want to accomplish based on your question. Christopher On Wed, Feb 16, 2011 at 10:55 AM, Wen Gao samuel.gao...@gmail.com wrote: Hi, I am using FuzzyQuery to get fuzzy mathed results. I want to get the similarity in percent for every matched record. for example, if i search for databasd, and it will return results such as database, database1, and database11. I want to get the similarity in percent for evey record, such as 87.5%, 75%, and 62.5%. How can I do this? Any ideas? Wen Gao
[jira] Commented: (SOLR-2367) DataImportHandler unit tests are very noisy
[ https://issues.apache.org/jira/browse/SOLR-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995537#comment-12995537 ] Robert Muir commented on SOLR-2367: --- I tried to use your patch and silence the tests in various ways... I was unsuccessful. It's a mystery to me, really (because I don't understand the code that well), that all these exceptions are being thrown and nothing is failing... so I'm not sure how to silence them. Let's commit the first patch and fix 80% of the problem... maybe we can figure out the other exceptions in the future. I'll keep the issue open. DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Assignee: Robert Muir Priority: Trivial Attachments: SOLR-2367-extend-SolrException.patch, SOLR-2367.patch, SOLR-2367.patch Original Estimate: 5m Remaining Estimate: 5m Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2365) DIH should not be in the Solr war
[ https://issues.apache.org/jira/browse/SOLR-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995540#comment-12995540 ] David Smiley commented on SOLR-2365: Uwe; are you willing to put fix-for of 3.1 on this or is that a touchy subject? ;-P DIH should not be in the Solr war - Key: SOLR-2365 URL: https://issues.apache.org/jira/browse/SOLR-2365 Project: Solr Issue Type: Improvement Components: Build Reporter: David Smiley Priority: Minor Attachments: SOLR-2365_DIH_should_not_be_in_war.patch The DIH has a build.xml that puts itself into the Solr war file. This is the only contrib module that does this, and I don't think it should be this way. Granted there is a small dataimport.jsp file that would be most convenient to remain included, but the jar should not be. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2369) Zookeeper depends on log4j, thus also SolrCloud does
Zookeeper depends on log4j, thus also SolrCloud does Key: SOLR-2369 URL: https://issues.apache.org/jira/browse/SOLR-2369 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 3.1 Reporter: Jan Høydahl Reproduce: 1. Use default Solr example build (with JDK logging) 2. Run example C on http://wiki.apache.org/solr/SolrCloud 3. You get Exception: java.lang.NoClassDefFoundError: org/apache/log4j/jmx/HierarchyDynamicMBean at org.apache.zookeeper.jmx.ManagedUtil.registerLog4jMBeans(ManagedUtil.java:51) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:114) at org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:111) Probable reason: Zookeeper depends on log4j Quickfix: Switch to log4j logging (as you cannot include both the log4j bridge and log4j itself): * Remove log4j-over-slf4j-1.5.5.jar and slf4j-jdk14-1.5.5.jar * Add slf4j-log4j12.jar and log4j-1.2.16.jar Document the shortcoming in release notes Long term fix: Vote for the resolution of ZOOKEEPER-850, which switches ZK to slf4j logging
[jira] Commented: (SOLR-1553) extended dismax query parser
[ https://issues.apache.org/jira/browse/SOLR-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1299#comment-1299 ] David Smiley commented on SOLR-1553: I'm confused about why this cool query parser I've been using is experimental. Sure, there are opportunities for improvement, but it's already better than the original dismax which this makes obsolete. No? extended dismax query parser Key: SOLR-1553 URL: https://issues.apache.org/jira/browse/SOLR-1553 Project: Solr Issue Type: New Feature Reporter: Yonik Seeley Assignee: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1553.patch, SOLR-1553.pf-refactor.patch, edismax.unescapedcolon.bug.test.patch, edismax.unescapedcolon.bug.test.patch, edismax.userFields.patch An improved user-facing query parser based on dismax -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2366) Facet Range Gaps
[ https://issues.apache.org/jira/browse/SOLR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-2366: -- Attachment: SOLR-2366.patch Added more tests, cleaned up the patch, all tests pass. I think it is ready to commit and will do so in a day or two or maybe this weekend. Facet Range Gaps Key: SOLR-2366 URL: https://issues.apache.org/jira/browse/SOLR-2366 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2366.patch, SOLR-2366.patch There really is no reason why the range gap for date and numeric faceting needs to be evenly spaced. For instance, if and when SOLR-1581 is completed and one were doing spatial distance calculations, one could facet by function into 3 different sized buckets: walking distance (0-5KM), driving distance (5KM-150KM) and everything else (150KM+), for instance. We should be able to quantize the results into arbitrarily sized buckets. I'd propose the syntax to be a comma separated list of sizes for each bucket. If only one value is specified, then it behaves as it currently does. Otherwise, it creates the different size buckets. If the number of buckets doesn't evenly divide up the space, then the size of the last bucket specified is used to fill out the remaining space (not sure on this) For instance, facet.range.start=0 facet.range.end=400 facet.range.gap=5,25,50,100 would yield buckets of: 0-5,5-30,30-80,80-180,180-280,280-380,380-400 -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
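The variable-gap semantics proposed in the issue description (repeat the last gap until the end, clamping the final bucket) can be sketched outside Solr. This is a hypothetical helper written for this example, not actual Solr code or the committed patch.

```java
// Sketch of the proposed facet.range.gap=5,25,50,100 semantics:
// consume the listed gaps in order, then reuse the last gap to fill the
// remaining space, clamping the final bucket at `end`. Hypothetical code,
// not part of Solr.
import java.util.ArrayList;
import java.util.List;

public class RangeGaps {
    static List<int[]> buckets(int start, int end, int[] gaps) {
        List<int[]> out = new ArrayList<>();
        int lo = start, i = 0;
        while (lo < end) {
            int gap = gaps[Math.min(i++, gaps.length - 1)]; // last gap repeats
            int hi = Math.min(lo + gap, end);               // clamp at end
            out.add(new int[] { lo, hi });
            lo = hi;
        }
        return out;
    }

    public static void main(String[] args) {
        // Reproduces the example: 0-5,5-30,30-80,80-180,180-280,280-380,380-400
        for (int[] b : RangeGaps.buckets(0, 400, new int[] { 5, 25, 50, 100 })) {
            System.out.println(b[0] + "-" + b[1]);
        }
    }
}
```

With start=0, end=400, gaps=5,25,50,100 this yields the seven buckets listed in the issue, including the clamped final 380-400 bucket.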
[jira] Commented: (SOLR-756) Make DisjunctionMaxQueryParser generally useful by supporting all query types.
[ https://issues.apache.org/jira/browse/SOLR-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995567#comment-12995567 ] David Smiley commented on SOLR-756: --- Jan, you refer to the Extended Dismax QParser -- and the answer is no. I think you intended to comment on SOLR-758. This patch here, as I said in a comment above here https://issues.apache.org/jira/browse/SOLR-756?focusedCommentId=12630223page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12630223 only has to do with a specific improvement to SolrPluginUtils.java, that is an enabler for other improvements to DismaxQParser. According to Hoss, I need to add tests for this issue. Make DisjunctionMaxQueryParser generally useful by supporting all query types. -- Key: SOLR-756 URL: https://issues.apache.org/jira/browse/SOLR-756 Project: Solr Issue Type: Improvement Affects Versions: 1.3 Reporter: David Smiley Fix For: Next Attachments: SolrPluginUtilsDisMax.patch This is an enhancement to the DisjunctionMaxQueryParser to work on all the query variants such as wildcard, prefix, and fuzzy queries, and to support working in AND scenarios that are not processed by the min-should-match DisMax QParser. This was not in Solr already because DisMax was only used for a very limited syntax that didn't use those features. In my opinion, this makes a more suitable base parser for general use because unlike the Lucene/Solr parser, this one supports multiple default fields whereas other ones (say Yonik's {!prefix} one for example, can't do dismax). The notion of a single default field is antiquated and a technical under-the-hood detail of Lucene that I think Solr should shield the user from by on-the-fly using a DisMax when multiple fields are used. (patch to be attached soon) -- This message is automatically generated by JIRA. 
[jira] Commented: (SOLR-2358) Distributing Indexing
[ https://issues.apache.org/jira/browse/SOLR-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995570#comment-12995570 ] Alex Cowell commented on SOLR-2358: --- bq. Since this functionality is core to Solr and should always be present, it would be natural to either build it into the DirectUpdateHandler2 or to add this processor to the set of default UpdateProcessors that are executed if no update.processor parameter is specified. What advantage would we gain from moving this functionality into DirectUpdateHandler2? From what I understand, the UpdateHandler deals directly with the index whereas the DistributedUpdateRequestProcessor merely takes requests deemed to be distributed by the request handler and distributes them to a list of shards based on a distribution policy. Distributing Indexing - Key: SOLR-2358 URL: https://issues.apache.org/jira/browse/SOLR-2358 Project: Solr Issue Type: New Feature Reporter: William Mayor Priority: Minor Attachments: SOLR-2358.patch The first steps towards creating distributed indexing functionality in Solr -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (SOLR-1191) NullPointerException in delta import
[ https://issues.apache.org/jira/browse/SOLR-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley resolved SOLR-1191. Resolution: Fixed Fix Version/s: (was: 1.4) 3.1 Assignee: (was: Noble Paul) Thanks Gunnlaugur, I committed to trunk and 3x. NullPointerException in delta import Key: SOLR-1191 URL: https://issues.apache.org/jira/browse/SOLR-1191 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.3, 1.4 Environment: OS: Windows Linux. Java: 1.6 DB: MySQL SQL Server Reporter: Ali Syed Fix For: 3.1 Attachments: SOLR-1191.patch, SOLR-1191.patch Seeing few of these NullPointerException during delta imports. Once this happens delta import stops working and keeps giving the same error. java.lang.NullPointerException at org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:622) at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:240) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:159) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:337) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:376) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:355) Running delta import for a particular entity fixes the problem and delta import start working again. 
Here is the log just before after the exception 05/27 11:59:29 86987686 INFO btpool0-538 org.apache.solr.core.SolrCore - [localhost] webapp=/solr path=/dataimport params={command=delta-importoptimize=false} status=0 QTime=0 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.SolrWriter - Read dataimport.properties 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.DataImporter - Starting Delta Import 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.SolrWriter - Read dataimport.properties 05/27 11:59:29 86987687 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Starting delta collection. 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: content 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: content rows obtained : 0 05/27 11:59:29 86987690 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: content rows obtained : 0 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed parentDeltaQuery for Entity: content 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: job 05/27 11:59:29 86987692 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity job with URL: jdbc:sqlserver://localhost;databaseName=TestDB 05/27 11:59:29 86987704 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 12 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: job rows obtained : 0 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: job rows obtained : 0 
05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed parentDeltaQuery for Entity: job 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Delta Import completed successfully 05/27 11:59:29 86987707 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Starting delta collection. 05/27 11:59:29 86987709 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Running ModifiedRowKey() for Entity: user 05/27 11:59:29 86987709 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity user with URL: jdbc:sqlserver://localhost;databaseName=TestDB 05/27 11:59:29 86987716 INFO Thread-4162 org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 7 05/27 11:59:29 86987873 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed ModifiedRowKey for Entity: user rows obtained : 46 05/27 11:59:29 86987873 INFO Thread-4162 org.apache.solr.handler.dataimport.DocBuilder - Completed DeletedRowKey for Entity: user rows obtained : 0 05/27 11:59:29 86987873 INFO
[jira] Updated: (SOLR-2367) DataImportHandler unit tests are very noisy
[ https://issues.apache.org/jira/browse/SOLR-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunnlaugur Thor Briem updated SOLR-2367: Attachment: SOLR-2367-log-exceptions-through-SolrException.patch Here goes the remaining 20% — I'm attaching SOLR-2367-log-exceptions-through-SolrException.patch which makes {{DataImportHandler}} log exceptions through {{SolrException.log()}} instead of directly into the logger. This way the exception-ignoring mechanism gets a say in matters. Test output is nice and clean now. I addressed only those logger calls that were emitting exceptions in unit test runs. Note: this does *not* require {{DataImportHandlerException}} to extend {{SolrException}}, so the earlier SOLR-2367-extend-SolrException.patch is not needed. (Might still be worthwhile, I don't know — but not needed for this fix). DataImportHandler unit tests are very noisy --- Key: SOLR-2367 URL: https://issues.apache.org/jira/browse/SOLR-2367 Project: Solr Issue Type: Improvement Components: Build, contrib - DataImportHandler Reporter: Gunnlaugur Thor Briem Assignee: Robert Muir Priority: Trivial Attachments: SOLR-2367-extend-SolrException.patch, SOLR-2367-log-exceptions-through-SolrException.patch, SOLR-2367.patch, SOLR-2367.patch Original Estimate: 5m Remaining Estimate: 5m Running DataImportHandler unit tests emits a lot of console noise, mainly stacktraces because dataimport.properties can't be written. This makes it hard to scan the output for useful information. I'm attaching a patch to get rid of most of the noise by creating the conf directory before test runs so that the properties file write doesn't fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2365) DIH should not be in the Solr war
[ https://issues.apache.org/jira/browse/SOLR-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995583#comment-12995583 ] Uwe Schindler commented on SOLR-2365: - +1; who wants to set the touchy fix version? DIH should not be in the Solr war - Key: SOLR-2365 URL: https://issues.apache.org/jira/browse/SOLR-2365 Project: Solr Issue Type: Improvement Components: Build Reporter: David Smiley Priority: Minor Attachments: SOLR-2365_DIH_should_not_be_in_war.patch The DIH has a build.xml that puts itself into the Solr war file. This is the only contrib module that does this, and I don't think it should be this way. Granted there is a small dataimport.jsp file that would be most convenient to remain included, but the jar should not be. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2365) DIH should not be in the Solr war
[ https://issues.apache.org/jira/browse/SOLR-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated SOLR-2365: Fix Version/s: 4.0 3.1 DIH should not be in the Solr war - Key: SOLR-2365 URL: https://issues.apache.org/jira/browse/SOLR-2365 Project: Solr Issue Type: Improvement Components: Build Reporter: David Smiley Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2365_DIH_should_not_be_in_war.patch The DIH has a build.xml that puts itself into the Solr war file. This is the only contrib module that does this, and I don't think it should be this way. Granted there is a small dataimport.jsp file that would be most convenient to remain included, but the jar should not be. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2366) Facet Range Gaps
[ https://issues.apache.org/jira/browse/SOLR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995585#comment-12995585 ] Hoss Man commented on SOLR-2366: The use case of facet.range (and facet.date before it) was always about having ranges generated for you automatically using a fixed gap size. If you want variable gap sizes, it's just as easy to specify them using facet.query. I don't really understand how your proposal adds value over using facet.query for the ranges you want to have specific widths, and then using facet.range for the rest of the ranges you want generated automatically with a specific gap. It just seems like a more confusing way of expressing the same thing. Facet Range Gaps Key: SOLR-2366 URL: https://issues.apache.org/jira/browse/SOLR-2366 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2366.patch, SOLR-2366.patch There really is no reason why the range gap for date and numeric faceting needs to be evenly spaced. For instance, if and when SOLR-1581 is completed and one were doing spatial distance calculations, one could facet by function into 3 different sized buckets: walking distance (0-5KM), driving distance (5KM-150KM) and everything else (150KM+), for instance. We should be able to quantize the results into arbitrarily sized buckets. I'd propose the syntax to be a comma separated list of sizes for each bucket. If only one value is specified, then it behaves as it currently does. Otherwise, it creates the different size buckets. If the number of buckets doesn't evenly divide up the space, then the size of the last bucket specified is used to fill out the remaining space (not sure on this) For instance, facet.range.start=0 facet.range.end=400 facet.range.gap=5,25,50,100 would yield buckets of: 0-5,5-30,30-80,80-180,180-280,280-380,380-400
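Hoss's facet.query alternative can be written directly as request parameters. The field name `dist` and the bucket edges below are illustrative, taken from the walking/driving example in the issue description:

```text
q=*:*&facet=true
&facet.query=dist:[0 TO 5]
&facet.query=dist:[5 TO 150]
&facet.query=dist:[150 TO *]
```

Each facet.query returns its own count, so arbitrarily sized buckets come for free; facet.range can still be used alongside for the evenly spaced remainder.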
[jira] Commented: (SOLR-2368) Improve extended dismax (edismax) parser
[ https://issues.apache.org/jira/browse/SOLR-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995587#comment-12995587 ] Jan Høydahl commented on SOLR-2368: --- I agree with David's comments on SOLR-1553 that edismax is already good enough to replace dismax, as it is clearly better, more useful and also backward compatible. It may still need some tuning, but not replacing dismax now in 3.1 could be an example of perfect being the enemy of good :) In Cominvent, we've been using edismax as the main query parser on all customer projects for several months now, and it is clearly much better than the old dismax, which is not robust enough, nor does it allow the syntaxes which people have come to expect. We have not seen any bugs or instabilities on any of the sites where it is live: www.dn.no, www.libris.no, http://www.rechargenews.com/search?q=oil+AND+(usa+OR+eu) and many more. May I suggest the following for 3.1: * defType=dismax is changed to point to Extended DisMax * defType=basicdismax is pointed to the old Basic DisMax (to give people a way to revert if needed) * defType=edismax is dropped (or added as a temporary alias to dismax) * The wiki page http://wiki.apache.org/solr/DisMaxQParserPlugin is edited to reflect the changes, and specific parameters or features which are likely to change in the future are marked as experimental, may change to warn people. Improve extended dismax (edismax) parser Key: SOLR-2368 URL: https://issues.apache.org/jira/browse/SOLR-2368 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Improve edismax and replace dismax once it has all of the needed features.
[jira] Commented: (SOLR-1553) extended dismax query parser
[ https://issues.apache.org/jira/browse/SOLR-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995588#comment-12995588 ] Hoss Man commented on SOLR-1553: bq. I'm confused about why this cool query parser I've been using is experimental Because some of its current default behavior is less than ideal, particularly for people migrating from dismax (i.e. see comments about making field queries configurable), and in a few cases even broken compared to how it worked when the patch was initially committed (see recent comments about foo:bar when foo is *not* a field). In general, marking it experimental is a way to allow us to leave it in the 3.1 release but still have the flexibility to modify the default behavior moving forward. extended dismax query parser Key: SOLR-1553 URL: https://issues.apache.org/jira/browse/SOLR-1553 Project: Solr Issue Type: New Feature Reporter: Yonik Seeley Assignee: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1553.patch, SOLR-1553.pf-refactor.patch, edismax.unescapedcolon.bug.test.patch, edismax.unescapedcolon.bug.test.patch, edismax.userFields.patch An improved user-facing query parser based on dismax
[jira] Commented: (SOLR-2365) DIH should not be in the Solr war
[ https://issues.apache.org/jira/browse/SOLR-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995591#comment-12995591 ] Hoss Man commented on SOLR-2365: +1 We need to make sure to call this out at the top of CHANGES.txt so people upgrading from 1.x know they *must* modify their solrconfig.xml (to add the {{lib/}} directive) if they use DIH ... but yeah, if it doesn't need to be in the war for that JSP to work, then let's keep it as an isolated contrib jar. DIH should not be in the Solr war - Key: SOLR-2365 URL: https://issues.apache.org/jira/browse/SOLR-2365 Project: Solr Issue Type: Improvement Components: Build Reporter: David Smiley Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2365_DIH_should_not_be_in_war.patch The DIH has a build.xml that puts itself into the Solr war file. This is the only contrib module that does this, and I don't think it should be this way. Granted there is a small dataimport.jsp file that would be most convenient to remain included, but the jar should not be.
[jira] Commented: (SOLR-1581) Facet by Function
[ https://issues.apache.org/jira/browse/SOLR-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995594#comment-12995594 ] Hoss Man commented on SOLR-1581: could probably reuse the sort parsing code for this ... it does a pretty good job of doing a quick test for field names, then looking for a matching function, then falling back to an assumption of esoteric field names Facet by Function - Key: SOLR-1581 URL: https://issues.apache.org/jira/browse/SOLR-1581 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Fix For: Next It would be really great if we could execute a function and quantize it into buckets that could then be returned as facets. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2368) Improve extended dismax (edismax) parser
[ https://issues.apache.org/jira/browse/SOLR-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995596#comment-12995596 ] Hoss Man commented on SOLR-2368: {quote} May I suggest the following for 3.1: * defType=dismax is changed to point to Extended DisMax {quote} -1 Beyond the key value of don't break on malformed input that using edismax would bring to existing dismax users, edismax's default behavior changes too many things for me to want to recommend it to existing dismax users (or change the default out from under them). The code will be there in 3.1, and savvy users can use it, and we can fix the bugs and defaults as we move forward. Improve extended dismax (edismax) parser Key: SOLR-2368 URL: https://issues.apache.org/jira/browse/SOLR-2368 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Improve edismax and replace dismax once it has all of the needed features.
[jira] Commented: (SOLR-2368) Improve extended dismax (edismax) parser
[ https://issues.apache.org/jira/browse/SOLR-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995606#comment-12995606 ] Jan Høydahl commented on SOLR-2368: --- As much as I believe the known issues will affect only a tiny percentage of existing (or new) dismax users, I have no problem with a more phased approach. Trying to see what's best for the user community. On a humorous note, if I were the non-savvy user upgrading Solr from 1.4.1 to 3.1, I'd for sure read those release notes carefully and test it all, given the huge version leap :) It would really help a quicker resolution of this long-running issue if the current edismax features and params were documented on the Wiki for others to test, and if all known bugs and planned improvements were detailed here or linked to this issue, so that I and others may know how to contribute. Improve extended dismax (edismax) parser Key: SOLR-2368 URL: https://issues.apache.org/jira/browse/SOLR-2368 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Improve edismax and replace dismax once it has all of the needed features.
[jira] Commented: (SOLR-2348) No error reported when using a FieldCached backed ValueSource for a field Solr knows won't work
[ https://issues.apache.org/jira/browse/SOLR-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995612#comment-12995612 ] Hoss Man commented on SOLR-2348: Committed revision 1071459. - trunk Working on 3x backporting now. No error reported when using a FieldCached backed ValueSource for a field Solr knows won't work --- Key: SOLR-2348 URL: https://issues.apache.org/jira/browse/SOLR-2348 Project: Solr Issue Type: Bug Reporter: Hoss Man Assignee: Hoss Man Fix For: 3.1, 4.0 Attachments: SOLR-2348.patch, SOLR-2348.patch For the same reasons outlined in SOLR-2339, Solr FieldTypes that return FieldCache backed ValueSources should explicitly check for situations where it knows the FieldCache is meaningless.
Re: Solr-dev mailing list on Nabble
: As I mentioned on the solr-dev mailing list : http://lucene.472066.n3.nabble.com/wind-down-for-3-1-tp2414923p2483929.html, : David Smiley's responses to emails on dev@l.a.o have been going to : solr-dev@l.a.o. This is a problem, and it's not restricted to David's : emails. What problem does it cause? : I put up a support request on Nabble: : http://nabble-support.1.n2.nabble.com/solr-dev-mailing-list-td6023495.html : and the only response so far seems to indicate that mailing lists are : managed by admins associated with the project with which each mailing : list is associated. This is relatively new -- it's a change they made several years ago, but at the time the solr-dev and java-dev archives were set up, anyone could add/configure a list archive forum -- even the description was community editable (I know I remember writing the description on that page, but I just checked and I don't have a Nabble account). I've even received emails from Nabble telling me that forums I'm the admin of (i.e. I asked them to start archiving a mailing list) are scheduled for deletion due to inactivity, but when I try to log in or recover the password for the account they emailed, their system says I have no account. According to the People pages for solr-dev and java-dev, some guy named Hugo is the only administrator of those forums... http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=app_people&node=506503&filter=Administrators http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=app_people&node=564358&filter=Administrators -Hoss
[jira] Created: (SOLR-2370) Let some UpdateProcessors be default without explicitly configuring them
Let some UpdateProcessors be default without explicitly configuring them
------------------------------------------------------------------------

         Key: SOLR-2370
         URL: https://issues.apache.org/jira/browse/SOLR-2370
     Project: Solr
  Issue Type: Improvement
  Components: update
    Reporter: Jan Høydahl

Problem: Today the user needs to make sure that crucial UpdateProcessors like the Log- and Run UpdateProcessors are present when creating a new UpdateRequestProcessorChain. This is error prone, and when introducing a new core UpdateProcessor, as in SOLR-2358, all existing users need to insert the changes into all their pipelines. A custom-made pipeline should not need to care about distributed indexing, logging or anything else, and should be as slim as possible.

Proposal: The proposal is to borrow the first-components and last-components pattern used in RequestHandler configs. That way, we could let all core processors be included either first or last by default in all UpdateChains. To do this, we need a place to configure the defaults, e.g. via a default=true param:

{code:xml}
<updateRequestProcessorChain name="default" default="true">
  <first-processors>
    <processor class="solr.DistributedUpdateRequestProcessor"/>
  </first-processors>
  <last-processors>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </last-processors>
</updateRequestProcessorChain>
{code}

Next, the custom-made chain will be only the center part:

{code:xml}
<updateRequestProcessorChain name="mychain">
  <processor class="my.nice.DoSomethingProcessor"/>
  <processor class="my.nice.DoAnotherThingProcessor"/>
</updateRequestProcessorChain>
{code}

To override the core processors config for a particular chain, you would start a clean chain with the parameter reset=true:

{code:xml}
<updateRequestProcessorChain name="mychain" reset="true">
  <processor class="my.nice.DoSomethingProcessor"/>
  <processor class="my.nice.DoAnotherThingProcessor"/>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{code}

If you only need to make sure that one of your custom processors runs at the very beginning or the very end, you could use:

{code:xml}
<updateRequestProcessorChain name="mychain">
  <processor class="my.nice.DoSomethingProcessor"/>
  <processor class="my.nice.DoAnotherThingProcessor"/>
  <last-processors>
    <processor class="solr.MySpecialDebugProcessor" />
  </last-processors>
</updateRequestProcessorChain>
{code}

The default should be reset=false, but the example schema could keep the default chain commented out to provide backward compatibility for upgraders.
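The merge semantics proposed above can be sketched in a few lines of Java. This is a hypothetical illustration only: the `assemble` function and its string-based processor lists are invented for clarity, not part of Solr; `first-processors`, `last-processors`, and `reset` are the names used in the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChainAssembly {
    // Sketch of the proposed semantics: a custom chain inherits the default
    // chain's first-processors and last-processors unless it sets reset=true,
    // in which case only the processors it declares itself are used.
    static List<String> assemble(List<String> defaultFirst, List<String> defaultLast,
                                 List<String> custom, boolean reset) {
        if (reset) {
            return new ArrayList<>(custom); // reset=true: a clean chain
        }
        List<String> chain = new ArrayList<>(defaultFirst); // core processors first
        chain.addAll(custom);                               // the "center part"
        chain.addAll(defaultLast);                          // core processors last
        return chain;
    }

    public static void main(String[] args) {
        List<String> chain = assemble(
            Arrays.asList("solr.DistributedUpdateRequestProcessor"),
            Arrays.asList("solr.LogUpdateProcessorFactory", "solr.RunUpdateProcessorFactory"),
            Arrays.asList("my.nice.DoSomethingProcessor", "my.nice.DoAnotherThingProcessor"),
            false);
        System.out.println(chain);
    }
}
```

With reset=false the custom chain ends up wrapped by the defaults; with reset=true it is used verbatim, which is why the reset chain in the example above has to re-declare solr.RunUpdateProcessorFactory itself.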
[jira] Commented: (SOLR-2358) Distributing Indexing
[ https://issues.apache.org/jira/browse/SOLR-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995628#comment-12995628 ]

Jan Høydahl commented on SOLR-2358:
-----------------------------------

I'm not sure if DirectUpdateHandler2 is the right location either. My point is that the user should not need to manually make sure that the UpdateProcessor is present in all his UpdateChains for distributed indexing to work. See new issue SOLR-2370 for a suggestion on how to tackle this.

> Distributing Indexing
> ---------------------
>
>          Key: SOLR-2358
>          URL: https://issues.apache.org/jira/browse/SOLR-2358
>      Project: Solr
>   Issue Type: New Feature
>     Reporter: William Mayor
>     Priority: Minor
>  Attachments: SOLR-2358.patch
>
> The first steps towards creating distributed indexing functionality in Solr
[jira] Resolved: (SOLR-2348) No error reported when using a FieldCached backed ValueSource for a field Solr knows won't work
[ https://issues.apache.org/jira/browse/SOLR-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man resolved SOLR-2348.
----------------------------
    Resolution: Fixed

Committed revision 1071480. - 3x

> No error reported when using a FieldCached backed ValueSource for a field Solr knows won't work
> -----------------------------------------------------------------------------------------------
>
>          Key: SOLR-2348
>          URL: https://issues.apache.org/jira/browse/SOLR-2348
>      Project: Solr
>   Issue Type: Bug
>     Reporter: Hoss Man
>     Assignee: Hoss Man
>      Fix For: 3.1, 4.0
>  Attachments: SOLR-2348.patch, SOLR-2348.patch
>
> For the same reasons outlined in SOLR-2339, Solr FieldTypes that return FieldCache-backed ValueSources should explicitly check for situations where it knows the FieldCache is meaningless.
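The kind of explicit check SOLR-2348 calls for can be sketched as a fail-fast guard run before handing out a FieldCache-backed ValueSource. This is a hedged illustration, not Solr's actual API: the class and method names here are invented, and the two conditions (unindexed, multivalued) are common examples of fields the FieldCache cannot meaningfully serve.

```java
public class FieldCacheGuard {
    // Illustrative guard: reject field configurations for which a
    // FieldCache-backed ValueSource is known to be meaningless, instead of
    // silently returning wrong or empty values at query time.
    static void checkFieldCacheSource(String field, boolean indexed, boolean multiValued) {
        if (!indexed) {
            throw new IllegalStateException(
                "can not use FieldCache on a field which is not indexed: " + field);
        }
        if (multiValued) {
            throw new IllegalStateException(
                "can not use FieldCache on a multivalued field: " + field);
        }
    }

    public static void main(String[] args) {
        checkFieldCacheSource("price", true, false); // indexed, single-valued: accepted
        try {
            checkFieldCacheSource("tags", true, true); // multivalued: rejected up front
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the issue is exactly this trade-off: an immediate, descriptive error at ValueSource creation time is far easier to debug than meaningless FieldCache contents surfacing later in function queries.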
[jira] Commented: (SOLR-1553) extended dismax query parser
[ https://issues.apache.org/jira/browse/SOLR-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995642#comment-12995642 ]

Ryan McKinley commented on SOLR-1553:
-------------------------------------

The 'experimental' label is a flag to say that the behavior will likely change in the future -- since back compatibility is taken so seriously, this allows a way to add features before they are 100% cooked.

> extended dismax query parser
> ----------------------------
>
>          Key: SOLR-1553
>          URL: https://issues.apache.org/jira/browse/SOLR-1553
>      Project: Solr
>   Issue Type: New Feature
>     Reporter: Yonik Seeley
>     Assignee: Yonik Seeley
>      Fix For: 3.1
>  Attachments: SOLR-1553.patch, SOLR-1553.pf-refactor.patch, edismax.unescapedcolon.bug.test.patch, edismax.unescapedcolon.bug.test.patch, edismax.userFields.patch
>
> An improved user-facing query parser based on dismax
[jira] Commented: (SOLR-2366) Facet Range Gaps
[ https://issues.apache.org/jira/browse/SOLR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995652#comment-12995652 ]

Grant Ingersoll commented on SOLR-2366:
---------------------------------------

bq. it just seems like a more confusing way of expressing the same thing

I think it's a lot less confusing. You only have to express start, end and the size of the buckets you want. With facet.query, you have to write out each expression for every bucket and do the math on all the boundaries. I don't think it is just as easy to specify using facet.query. Not to mention that facet.query also involves a lot more parsing.

> Facet Range Gaps
> ----------------
>
>          Key: SOLR-2366
>          URL: https://issues.apache.org/jira/browse/SOLR-2366
>      Project: Solr
>   Issue Type: Improvement
>     Reporter: Grant Ingersoll
>     Priority: Minor
>      Fix For: 3.2, 4.0
>  Attachments: SOLR-2366.patch, SOLR-2366.patch
>
> There really is no reason why the range gap for date and numeric faceting needs to be evenly spaced. For instance, if and when SOLR-1581 is completed and one were doing spatial distance calculations, one could facet by function into 3 different-sized buckets: walking distance (0-5KM), driving distance (5KM-150KM) and everything else (150KM+), for instance. We should be able to quantize the results into arbitrarily sized buckets.
>
> I'd propose the syntax to be a comma-separated list of sizes for each bucket. If only one value is specified, then it behaves as it currently does. Otherwise, it creates the different-sized buckets. If the buckets don't evenly divide up the space, then the size of the last bucket specified is used to fill out the remaining space (not sure on this).
>
> For instance:
>   facet.range.start=0
>   facet.range.end=400
>   facet.range.gap=5,25,50,100
> would yield buckets of: 0-5, 5-30, 30-80, 80-180, 180-280, 280-380, 380-400
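The comma-separated gap proposal in the issue description can be sanity-checked with a short sketch (a hypothetical helper, not Solr code): each gap sizes one bucket, the last gap repeats once the list is exhausted, and the final bucket is clipped at facet.range.end.

```java
import java.util.ArrayList;
import java.util.List;

public class VariableGapBuckets {
    // Sketch of the proposed facet.range.gap=5,25,50,100 semantics from
    // SOLR-2366: consume the gap list one bucket at a time, repeat the last
    // gap for the remaining space, and clip the final bucket at 'end'.
    static List<int[]> buckets(int start, int end, int[] gaps) {
        List<int[]> out = new ArrayList<>();
        int lo = start;
        int i = 0;
        while (lo < end) {
            int gap = gaps[Math.min(i, gaps.length - 1)]; // last gap repeats
            int hi = Math.min(lo + gap, end);             // clip at range end
            out.add(new int[]{lo, hi});
            lo = hi;
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Reproduces the example from the issue: start=0, end=400, gap=5,25,50,100
        for (int[] b : buckets(0, 400, new int[]{5, 25, 50, 100})) {
            System.out.println(b[0] + "-" + b[1]);
        }
        // prints 0-5, 5-30, 30-80, 80-180, 180-280, 280-380, 380-400
    }
}
```

Under this reading the arithmetic in the issue checks out: the 100-wide last gap repeats twice after 80-180, and the remaining 20 units form the clipped 380-400 bucket.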