Big problem with Solr's memory usage on a production server

2010-04-19 Thread Ariel
Hi everybody:

I have a big problem with the amount of memory Solr is using on a server. I
would like to know how to configure it to use a limited amount of memory. I am
starting Solr with the java -jar start.jar command on an Ubuntu server; the
start.jar process is using 7 GB of memory, which is considerably hurting the
server's performance.
Could you help me, please?
Thanks in advance.
Regards
Ariel
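
The start.jar process will grow to whatever heap the JVM is allowed to use. Assuming the stock Jetty start.jar launcher, the usual way to cap it is to pass heap limits to the JVM on the command line; the figures below are only illustrative, not a recommendation:

```shell
# Cap the JVM heap for Solr/Jetty: start at 256 MB, never grow past 1024 MB.
# Tune the values to your machine and index size.
java -Xms256m -Xmx1024m -jar start.jar
```

Note that the resident size the OS reports also includes non-heap memory (and, with some directory implementations, memory-mapped index files), so the process can still look somewhat larger than -Xmx.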


How to delete documents from an index, and how to reset the remote MultiSearcher so deleted docs are not shown in the search results?

2009-09-11 Thread Ariel
Hi everybody:
I am using Lucene 2.3.2 to index and search my documents.
The problem is that I have a remote search server implemented this way:
[code]
Searcher parallelSearcher;
try {
    parallelSearcher = new ParallelMultiSearcher(searchables);
    parallelImpl = new RemoteSearchable(parallelSearcher);
    Naming.rebind(rmiUrlSearch, parallelImpl);
} catch (RemoteException e) {
    log.error("ERROR", e);
} catch (MalformedURLException e) {
    log.error("ERROR", e);
} catch (IOException e) {
    log.error("ERROR", e);
}
[/code]
Then a client on another host connects to the search server to obtain the
search results. But when a document is deleted from the indexes on the search
server, the deleted documents still appear in the search results; the only way
to make them stop appearing is to restart the RMI service on the search
server.
Could you please help me figure out how to keep documents out of the search
results as soon as they are deleted?
I hope you can help me.
Regards
Ariel
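
One common approach, sketched here as an assumption about how the server could be restructured rather than a drop-in fix: a Searcher only sees the index as of the moment its underlying readers were opened, so after deleting you need to open fresh Searchables, build a new ParallelMultiSearcher, and rebind it under the same RMI name. indexPaths and docId are invented placeholders; searchables and rmiUrlSearch come from the snippet above.

```java
// Lucene 2.3.x sketch: delete by term, then swap in a freshly opened searcher.
for (int i = 0; i < indexPaths.length; i++) {
    IndexReader reader = IndexReader.open(indexPaths[i]);
    reader.deleteDocuments(new Term("id", docId)); // mark matching docs deleted
    reader.close();                                // commit the deletions
}
// Searchers opened before the deletes still serve the old view of the index,
// so rebuild them from newly opened readers and rebind the RMI name.
Searchable[] fresh = new Searchable[indexPaths.length];
for (int i = 0; i < indexPaths.length; i++) {
    fresh[i] = new IndexSearcher(indexPaths[i]);
}
Naming.rebind(rmiUrlSearch, new RemoteSearchable(new ParallelMultiSearcher(fresh)));
```

The old searcher should be closed once any in-flight requests have drained.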

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Problem with ranking in Lucene

2009-04-09 Thread Ariel
Hi everybody:

I have a question about Lucene's ranking. Here is the problem: when I run a
search like bank OR transference, I get 10 results. The first two documents
returned have both terms in the content field, but the 3rd, 4th and 5th
contain only the word bank, and then the 6th is a document that has both
terms.
Why is this happening?
Isn't a search with the OR operator supposed to return first the documents
that contain both terms, and only then the documents that contain just one of
them?
I am indexing two fields and searching with MultiFieldQueryParser over both
fields, title and content, and I am using the same analyzer for indexing and
searching.

I hope you can help me.
Thanks in advance
Regards
Ariel
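
To see exactly why a single-term document outranks a two-term one (the coord factor competes with tf, idf and length norms), Lucene can print its scoring breakdown per hit. A sketch against the 2.3-era Hits API; pquery and searcher stand for the objects in your own code:

```java
// Print Lucene's scoring explanation for each of the top hits.
Hits hits = searcher.search(pquery);
for (int i = 0; i < Math.min(10, hits.length()); i++) {
    Explanation expl = searcher.explain(pquery, hits.id(i));
    System.out.println("doc " + hits.id(i) + ":\n" + expl.toString());
}
```

A short document containing only "bank" can legitimately score above a long document containing both terms, because the length norm boosts terms found in shorter fields.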


How can I make Lucene use the AND operator between terms by default?

2009-04-08 Thread Ariel
When I do a search, Lucene internally uses the OR operator between terms by
default. How can I change this so that Lucene uses the AND operator by
default?

Regards
Ariel
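
With QueryParser (and MultiFieldQueryParser, which extends it) this is a one-line setting. A minimal sketch against the 2.3-era API; the field name and analyzer are illustrative:

```java
// Make bare terms combine with AND instead of the default OR.
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Query q = parser.parse("bank transference"); // both terms are now required
```

Printing q.toString() before and after the change is an easy way to confirm the parser is doing what you expect.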


How can I make an analyzer that ignores the numbers in the texts?

2009-04-08 Thread Ariel
Hi everybody:

I would like to know how I can make an analyzer that ignores the numbers in
the texts, the same way stop words are ignored. For example, the terms 3.8,
100, 4.15 and 4,33 should not be added to the index.
How can I do that?

Regards
Ariel
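
One way to do it, offered as a sketch: wrap the analyzer's token stream in a custom TokenFilter that drops any token matching a numeric pattern, exactly as StopFilter drops stop words. The Lucene wiring is omitted here; the code below is only the pure-Java predicate such a filter would apply to each term (the class and method names are invented for illustration):

```java
import java.util.regex.Pattern;

// Predicate a number-dropping TokenFilter could use: matches integers and
// decimals written with either '.' or ',' (e.g. 3.8, 100, 4.15, 4,33).
public class NumberTokens {
    private static final Pattern NUMERIC = Pattern.compile("\\d+([.,]\\d+)?");

    public static boolean isNumber(String term) {
        return NUMERIC.matcher(term).matches();
    }

    public static void main(String[] args) {
        String[] terms = {"3.8", "100", "4,33", "bank"};
        for (String t : terms) {
            System.out.println(t + " -> " + (isNumber(t) ? "dropped" : "kept"));
        }
    }
}
```

Inside the filter's next() method you would skip tokens for which isNumber(token.termText()) returns true and return the next surviving token instead.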


Re: How to search a phrase using quotes in a query?

2009-04-07 Thread Ariel
 catch block
e.printStackTrace();
}

}
[/code]


English Analyzer is a custom analyzer built from these filters: SynonymFilter,
SnowballFilter, StopFilter, LowerCaseFilter, StandardFilter and
StandardTokenizer.

So I don't know why, when I search for "the bank of america", the search
results don't return the documents that contain the exact phrase "the bank of
america".
Could you help me please?
Regards
Ariel


On Mon, Apr 6, 2009 at 5:26 PM, Erick Erickson erickerick...@gmail.com wrote:

 If you have luke, you should be able to submit your query and use
 the explain functionality to gain some insights into what the query
 actually looks like as well

 Best
 Erick

 On Mon, Apr 6, 2009 at 5:24 PM, Ariel isaacr...@gmail.com wrote:

  Well I have luke lucene, the index has been build fine.
  The field where I am searching is the content field.
 
  I am using the same analyzer in query and indexing time: SnowBall English
  Analyzer.
 
  I am going to submit later the snippet code.
 
  Regards
  Ariel
 
 
  On Mon, Apr 6, 2009 at 4:37 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   We really need some more data. First, I *strongly* recommend you
   get a copy of Luke and examine your index to see what is
   *actually* there. Google lucene luke. That often answers
   many questions.
  
   Second, query.toString is your friend. For instance, if the query
   you provided below is all that you're submitting, it's going against
   the default field you might have specified when you instantiated
   your query parser.
  
   Third, what analyzers are you using at index and query time?
  
   Code snippets would also help.
  
   Best
   Erick
  
   On Mon, Apr 6, 2009 at 4:32 PM, Ariel isaacr...@gmail.com wrote:
  
Hi every body:
   
Why when I make a query with this search  query : the fool of the
  hill
doesn't appear documents in the search results that contains the
 entire
phrase the fool of the hill and it does exist documents that
 contain
   that
phrase, I am using snowball analyzer for English ???
Could you help with this please ???
Regards
Ariel
   
  
 



How to search a phrase using quotes in a query?

2009-04-06 Thread Ariel
Hi every body:

Why, when I search with the query "the fool of the hill", do the search
results not include documents that contain the entire phrase "the fool of the
hill", even though documents containing that phrase do exist? I am using the
Snowball analyzer for English.
Could you help me with this, please?
Regards
Ariel
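
One likely culprit, offered as a hypothesis rather than a diagnosis: if the analyzer chain includes a StopFilter, words such as "the" and "of" are removed at index time, so a quoted phrase containing them may no longer match at the recorded term positions. Printing what the parser actually builds makes this visible; yourAnalyzer stands for whatever analyzer you index with:

```java
// Inspect what a quoted phrase becomes after analysis (Lucene 2.3.x API).
QueryParser parser = new QueryParser("content", yourAnalyzer);
Query q = parser.parse("\"the fool of the hill\"");
// If stop words were stripped (or terms stemmed by the Snowball filter),
// toString() will show a different, shorter phrase than you typed.
System.out.println(q.toString("content"));
```

Comparing this output against the actual terms in the index (Luke shows them) usually explains the missing matches.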


Re: How to search a phrase using quotes in a query?

2009-04-06 Thread Ariel
Well, I have Luke; the index has been built fine.
The field where I am searching is the content field.

I am using the same analyzer at query and indexing time: the Snowball English
analyzer.

I will submit the code snippet later.

Regards
Ariel


On Mon, Apr 6, 2009 at 4:37 PM, Erick Erickson erickerick...@gmail.com wrote:

 We really need some more data. First, I *strongly* recommend you
 get a copy of Luke and examine your index to see what is
 *actually* there. Google lucene luke. That often answers
 many questions.

 Second, query.toString is your friend. For instance, if the query
 you provided below is all that you're submitting, it's going against
 the default field you might have specified when you instantiated
 your query parser.

 Third, what analyzers are you using at index and query time?

 Code snippets would also help.

 Best
 Erick

 On Mon, Apr 6, 2009 at 4:32 PM, Ariel isaacr...@gmail.com wrote:

  Hi every body:
 
  Why when I make a query with this search  query : the fool of the hill
  doesn't appear documents in the search results that contains the entire
  phrase the fool of the hill and it does exist documents that contain
 that
  phrase, I am using snowball analyzer for English ???
  Could you help with this please ???
  Regards
  Ariel
 



How to index correctly, taking synonyms into account, using WordNet?

2009-02-04 Thread Ariel
Hi every body:

I am using WordNet to index my documents, taking synonyms into account.
After I indexed the whole document collection I made a query with the word
snort, but documents that contain the word bird are retrieved. I don't
understand this, because snort and bird are not synonyms, so why are the
documents that contain bird retrieved?

Could you help me solve this problem?

How do you index your documents using WordNet?

Thanks in advance.
Regards
Ariel




Re: How to index correctly, taking synonyms into account, using WordNet?

2009-02-04 Thread Ariel
Well, I have Luke 0.8. I opened my index with that tool, but there is no sign
of synonyms in the field I indexed with the synonym analyzer.
I don't know how I can see the group of synonyms for each term; could somebody
tell me how to do that?



On Wed, Feb 4, 2009 at 5:09 PM, Erick Erickson erickerick...@gmail.com wrote:

 The first thing I'd do is get a copy of luke (google lucene luke) and
 examine your index to see what's actually there in the document
 you claim in incorrectly returned. If that doesn't
 enlighten you, you really have to provide more details and code
 examples, because your question is unanswerable as it
 stands.

 Best
 Erick

 On Wed, Feb 4, 2009 at 3:27 PM, Ariel isaacr...@gmail.com wrote:

  Hi every body:
 
  I am using wordnet to index my document taking in account the synonyms
  with wordnet.
  After I indexed the whole documents collections I made a query with
  the word snort but documents that contain the word bird are
  retrieved, I don't understand this because snort and bird are not
  synonyms then Why are the documents that contain bird retrieved ???
 
  Could help me to solve that problem ???
 
  How do you index your documents using wordnet ???
 
  Thanks in advance.
  Regards
  Ariel
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 



Re: How to index correctly, taking synonyms into account, using WordNet?

2009-02-04 Thread Ariel
How can I see the senses of a word with WordNet? And how could I select the
most popular ones?
Is there a way to make queries that ignore the synonyms I have added to the
index?

I hope you can help me.

Regards
Ariel


On Wed, Feb 4, 2009 at 7:46 PM, Manu Konchady mkonch...@yahoo.com wrote:




 --- On Wed, 4/2/09, Ariel isaacr...@gmail.com wrote:

  From: Ariel isaacr...@gmail.com
  I am using wordnet to index my document taking in account
  the synonyms
  with wordnet.
  After I indexed the whole documents collections I made a
  query with
  the word snort but documents that contain the
  word bird are
  retrieved, I don't understand this because snort and
  bird are not
  synonyms then Why are the documents that contain
  bird retrieved ???
 


 In WordNet, bird is one of the noun senses for a meaning of snort:

 Noun

 Sense 1: snicker, snort, snigger
 Description: a disrespectful laugh

 Sense 2: boo, hoot, Bronx cheer, hiss, raspberry, razzing, razz, snort,
 bird
 Description: a cry or noise made to express displeasure or contempt


 You may want to try and select just the synonyms of the most popular sense
 of the word.

  Regards,

  Manu






Re: Default and optimal use of RAMDirectory

2009-01-05 Thread Ariel
Do you mean that the people who think using RAMDirectory will speed up the
indexing process are wrong?

On Sun, Dec 21, 2008 at 10:22 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Let me add to that that I clearly recall having a hard time getting the
 tests for that particular section of LIA1 to clearly and consistently show
 that using the RAMDirectory buffering approach instead of vanilla
 IndexWriter yields faster indexing.  Even back then IndexWriter buffered
 indexed data in memory, though today's IndexWriter is much, much better at
 it.


 Otis --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
  From: Michael McCandless luc...@mikemccandless.com
  To: java-user@lucene.apache.org
  Sent: Saturday, December 20, 2008 4:25:13 AM
  Subject: Re: Default and optimal use of RAMDirectory
 
  Actually, things have improved since LIA1 was written a few years ago:
  IndexWriter now does a good job managing the RAM buffer you assign to
  it, so you should not see much benefit by doing your own buffering
  with RAMDirectory (and if you somehow do, I'd like to know about
  it!).
 
  Instead you should call IndexWriter.setRAMBufferSizeMB.
 
  Also, FSDirectory does no RAM buffering on its own.
 
  See here for further ways to tune for indexing throughput:
 
  http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
 
  Mike
 
  wrote:
 
  
   Hi all,
  
   First of I'd like to say I'm quite pleased to be a part of this mailing
   list - its even more exciting to know that we have Otis G. and Erik H.,
   authors of (at least in my opinion) the Lucene Bible - Lucene in
 Action,
   actively answering all these inquiries =)
  
   We're currently in the initial stages of implementing lucene as part of
 our
   product and one problem that we need to resolve is optimizing lucene.
  I've
   been reading Lucene in Action book and one of the tips for optimizing
   lucene indexing is by using RAMDirectory as a buffer before writing to
   FSDirectory.  According to the book, this is done internally and
   automatically when I use FSDirectory.  My questions are 1.) What's the
   default implementation/ computation used in allocating RAMdirectory
 when we
   implement FSDirectory and 2.) What's the optimal way of customizing
   RAMDirectory usage - any tips on how to do it.
  
   BTW, we're using Lucene 2.3.2
  
   Thanks for all the help
  
   Joseph
  
  
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
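
Mike's suggestion above can be sketched as follows; the directory, analyzer and the 48 MB figure are illustrative, not recommendations:

```java
// Lucene 2.3.x: let IndexWriter buffer added documents in RAM itself,
// flushing a segment to disk whenever the buffer reaches ~48 MB.
IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
writer.setRAMBufferSizeMB(48.0);
// ... writer.addDocument(...) calls ...
writer.close();
```

This replaces the manual RAMDirectory-then-copy staging described in LIA1.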




Re: How to search documents taking the dates into account?

2008-12-18 Thread Ariel
What I am doing is this:
code
Sort sort = new Sort();
sort.setSort("year", true);
hits = searcher.search(pquery, sort);
/code

How must I write my code to sort first by date and then by score?
Greetings
Ariel
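
Sort can take an array of SortField objects, applied in order, so year can be the primary key and relevance the tie-breaker. A sketch with the Lucene 2.3 API; the boolean flips the year order and is shown here as descending:

```java
// Sort by year first (true = reverse, i.e. newest first),
// breaking ties between equal years by relevance score.
Sort sort = new Sort(new SortField[] {
    new SortField("year", true),
    SortField.FIELD_SCORE
});
Hits hits = searcher.search(pquery, sort);
```

Pass false instead of true if you want the oldest year first.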


On Thu, Dec 18, 2008 at 4:48 AM, Ian Lea ian@gmail.com wrote:

 Lucene lets you sort by multiple fields, including score.  See the
 javadocs for Sort and SortField, specifically SortField.SCORE.


 --
 Ian.

 On Wed, Dec 17, 2008 at 8:15 PM, Ariel isaacr...@gmail.com wrote:
  Hi:
  This solution have a problem.
  the results are sorted bye the year criteria but I need that after sort
 by
  year criteria it sort by the scoring criteria two.
  How can I do this ???
 
  I hope you can help me.
  Greetings
  Ariel
 
  On Wed, Nov 19, 2008 at 5:28 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Well, MultiSearcher is just a Searcher, so you have available
  all of the search methods on Searcher. One of which is:
 
  search
 
  public TopFieldDocs
 
 
 file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/TopFieldDocs.html
  *search*(Query
  file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Query.html
  query,
Filter
  file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Filter.html
  filter,
int n,
Sort
  file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Sort.html
  sort)
 throws IOException
  http://java.sun.com/j2se/1.4/docs/api/java/io/IOException.html
 
  Expert: Low-level search implementation with arbitrary sorting. Finds
 the
  top n hits for query, applying filter if non-null, and sorting the hits
 by
  the criteria in sort.
 
 
  Best
  Erick
 
 
  On Wed, Nov 19, 2008 at 4:22 PM, Ariel isaacr...@gmail.com wrote:
 
   Well, this is what I am doing:
  
   queryString=year:[2003 TO 2005]
   [CODE]
  Query pquery = null;
  Hits hits = null;
  Analyzer analyzer = null;
  analyzer = new SnowballAnalyzer(English);
  try {
  pquery = MultiFieldQueryParser.parse(new String[] {queryString,
   queryString}, new String[] {title, content}, analyzer); //analyzer
  } catch (ParseException e1) {
  e1.printStackTrace();
  }
  MultiSearcher searcher = (MultiSearcher) searcherCache.get(name);
  
  try {
  hits = searcher.search(pquery);
  } catch (IOException e1) {
  e1.printStackTrace();
  }
   [/CODE]
  
   I don't know the methods that include sorting. I have made the sorting
 by
   the score criteria so far, I don-t know how to change it to the year
  field
   criteria.
   As you can see, I am using a multisearcher because I have several
  indexes.
  
   I hope you can help me.
   Regards
   Thanks in advance
   Ariel
  
  
  
   On Wed, Nov 19, 2008 at 3:58 PM, Ian Lea ian@gmail.com wrote:
  
Are you using one of the search methods that includes sorting?  If
not, then do.  If you are, then you need to tell us exactly what you
are doing and exactly what you reckon is going wrong.
   
   
--
Ian.
   
   
On Wed, Nov 19, 2008 at 6:23 PM, Ariel isaacr...@gmail.com wrote:
 it is supposed lucene make a lexicocraphic sorting but this is not
hapening,
 Could you tell me what I'm doing wrong ?
 I hope you can help me.
 Regards

 On Wed, Nov 19, 2008 at 11:56 AM, Ariel isaacr...@gmail.com
 wrote:

 Thanks, that was very helpful, but I have a question when I make
 the
 searches it does not sort the results according to the range, for
example:
 year: [2003 TO 2008] in the first page 2003 documents are showed,
 in
   the
 second 2005 documents, in the third page 2004 documents, I don't
 see
   any
 sort criteria.
 How could I fix that problem ???
 Greetings
 Ariel


 On Wed, Nov 19, 2008 at 11:09 AM, Ian Lea ian@gmail.com
  wrote:

 Hi - sounds like you need a range query.



   
  
 
 http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Range%20Searches


 --
 Ian.


 On Wed, Nov 19, 2008 at 4:02 PM, Ariel isaacr...@gmail.com
  wrote:
  Hi everybody:
 
  I need to make search with lucene 2.3.2, taking in account the
   dates,
  previously when I build the index I create a date field where
 I
stored
 the
  year in which the document was created, at the search moment I
   would
 like to
  retrieve documents that have been created before a Year or
 after
  a
Year,
 for
  example documents before 2002 year o after 2003 year.
  It is possible to do that with lucene ???
  Regards
  Ariel
 


  -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail:
 java-user-h...@lucene.apache.org

Re: How to search documents taking the dates into account?

2008-12-18 Thread Ariel
Thank you, it works very well.
Regards
Ariel

On Thu, Dec 18, 2008 at 8:22 AM, Erick Erickson erickerick...@gmail.com wrote:

 Use the setSort that takes an array of Sort objects...

 On Thu, Dec 18, 2008 at 8:11 AM, Ariel isaacr...@gmail.com wrote:

  What I am doing is this:
  code
 Sort sort = new Sort();
 sort.setSort(year, true);
 hits = searcher.search(pquery,sort);
  /code
 
  How I must put my code to sort first by date an then by score ???
  Greetings
  Ariel
 
 
  On Thu, Dec 18, 2008 at 4:48 AM, Ian Lea ian@gmail.com wrote:
 
   Lucene lets you sort by multiple fields, including score.  See the
   javadocs for Sort and SortField, specifically SortField.SCORE.
  
  
   --
   Ian.
  
   On Wed, Dec 17, 2008 at 8:15 PM, Ariel isaacr...@gmail.com wrote:
Hi:
This solution have a problem.
the results are sorted bye the year criteria but I need that after
 sort
   by
year criteria it sort by the scoring criteria two.
How can I do this ???
   
I hope you can help me.
Greetings
Ariel
   
On Wed, Nov 19, 2008 at 5:28 PM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
   
Well, MultiSearcher is just a Searcher, so you have available
all of the search methods on Searcher. One of which is:
   
search
   
public TopFieldDocs
   
   
  
 
 file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/TopFieldDocs.html
*search*(Query
   
 file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Query.html
query,
  Filter
   
  file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Filter.html
filter,
  int n,
  Sort
   
 file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Sort.html
sort)
   throws IOException
http://java.sun.com/j2se/1.4/docs/api/java/io/IOException.html
   
Expert: Low-level search implementation with arbitrary sorting.
 Finds
   the
top n hits for query, applying filter if non-null, and sorting the
  hits
   by
the criteria in sort.
   
   
Best
Erick
   
   
On Wed, Nov 19, 2008 at 4:22 PM, Ariel isaacr...@gmail.com wrote:
   
 Well, this is what I am doing:

 queryString=year:[2003 TO 2005]
 [CODE]
Query pquery = null;
Hits hits = null;
Analyzer analyzer = null;
analyzer = new SnowballAnalyzer(English);
try {
pquery = MultiFieldQueryParser.parse(new String[]
  {queryString,
 queryString}, new String[] {title, content}, analyzer);
  //analyzer
} catch (ParseException e1) {
e1.printStackTrace();
}
MultiSearcher searcher = (MultiSearcher)
 searcherCache.get(name);

try {
hits = searcher.search(pquery);
} catch (IOException e1) {
e1.printStackTrace();
}
 [/CODE]

 I don't know the methods that include sorting. I have made the
  sorting
   by
 the score criteria so far, I don-t know how to change it to the
 year
field
 criteria.
 As you can see, I am using a multisearcher because I have several
indexes.

 I hope you can help me.
 Regards
 Thanks in advance
 Ariel



 On Wed, Nov 19, 2008 at 3:58 PM, Ian Lea ian@gmail.com
 wrote:

  Are you using one of the search methods that includes sorting?
  If
  not, then do.  If you are, then you need to tell us exactly what
  you
  are doing and exactly what you reckon is going wrong.
 
 
  --
  Ian.
 
 
  On Wed, Nov 19, 2008 at 6:23 PM, Ariel isaacr...@gmail.com
  wrote:
   it is supposed lucene make a lexicocraphic sorting but this is
  not
  hapening,
   Could you tell me what I'm doing wrong ?
   I hope you can help me.
   Regards
  
   On Wed, Nov 19, 2008 at 11:56 AM, Ariel isaacr...@gmail.com
   wrote:
  
   Thanks, that was very helpful, but I have a question when I
  make
   the
   searches it does not sort the results according to the range,
  for
  example:
   year: [2003 TO 2008] in the first page 2003 documents are
  showed,
   in
 the
   second 2005 documents, in the third page 2004 documents, I
  don't
   see
 any
   sort criteria.
   How could I fix that problem ???
   Greetings
   Ariel
  
  
   On Wed, Nov 19, 2008 at 11:09 AM, Ian Lea ian@gmail.com
 
wrote:
  
   Hi - sounds like you need a range query.
  
  
  
 

   
  
 
 http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Range%20Searches
  
  
   --
   Ian.
  
  
   On Wed, Nov 19, 2008 at 4:02 PM, Ariel isaacr...@gmail.com
 
wrote:
Hi everybody:
   
I need to make search with lucene 2.3.2, taking in account
  the
 dates,
previously when I build the index I create a date

Re: How to search documents taking the dates into account?

2008-12-17 Thread Ariel
Hi:
This solution has a problem:
the results are sorted by the year criterion, but I need them, after the year
sort, to be sorted by score as well.
How can I do this?

I hope you can help me.
Greetings
Ariel

On Wed, Nov 19, 2008 at 5:28 PM, Erick Erickson erickerick...@gmail.com wrote:

 Well, MultiSearcher is just a Searcher, so you have available
 all of the search methods on Searcher. One of which is:

 search

 public TopFieldDocs

 file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/TopFieldDocs.html
 *search*(Query
 file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Query.html
 query,
   Filter
 file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Filter.html
 filter,
   int n,
   Sort
 file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Sort.html
 sort)
throws IOException
 http://java.sun.com/j2se/1.4/docs/api/java/io/IOException.html

 Expert: Low-level search implementation with arbitrary sorting. Finds the
 top n hits for query, applying filter if non-null, and sorting the hits by
 the criteria in sort.


 Best
 Erick


 On Wed, Nov 19, 2008 at 4:22 PM, Ariel isaacr...@gmail.com wrote:

  Well, this is what I am doing:
 
  queryString=year:[2003 TO 2005]
  [CODE]
 Query pquery = null;
 Hits hits = null;
 Analyzer analyzer = null;
 analyzer = new SnowballAnalyzer(English);
 try {
 pquery = MultiFieldQueryParser.parse(new String[] {queryString,
  queryString}, new String[] {title, content}, analyzer); //analyzer
 } catch (ParseException e1) {
 e1.printStackTrace();
 }
 MultiSearcher searcher = (MultiSearcher) searcherCache.get(name);
 
 try {
 hits = searcher.search(pquery);
 } catch (IOException e1) {
 e1.printStackTrace();
 }
  [/CODE]
 
  I don't know the methods that include sorting. I have made the sorting by
  the score criteria so far, I don-t know how to change it to the year
 field
  criteria.
  As you can see, I am using a multisearcher because I have several
 indexes.
 
  I hope you can help me.
  Regards
  Thanks in advance
  Ariel
 
 
 
  On Wed, Nov 19, 2008 at 3:58 PM, Ian Lea ian@gmail.com wrote:
 
   Are you using one of the search methods that includes sorting?  If
   not, then do.  If you are, then you need to tell us exactly what you
   are doing and exactly what you reckon is going wrong.
  
  
   --
   Ian.
  
  
   On Wed, Nov 19, 2008 at 6:23 PM, Ariel isaacr...@gmail.com wrote:
it is supposed lucene make a lexicocraphic sorting but this is not
   hapening,
Could you tell me what I'm doing wrong ?
I hope you can help me.
Regards
   
On Wed, Nov 19, 2008 at 11:56 AM, Ariel isaacr...@gmail.com wrote:
   
Thanks, that was very helpful, but I have a question when I make the
searches it does not sort the results according to the range, for
   example:
year: [2003 TO 2008] in the first page 2003 documents are showed, in
  the
second 2005 documents, in the third page 2004 documents, I don't see
  any
sort criteria.
How could I fix that problem ???
Greetings
Ariel
   
   
On Wed, Nov 19, 2008 at 11:09 AM, Ian Lea ian@gmail.com
 wrote:
   
Hi - sounds like you need a range query.
   
   
   
  
 
 http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Range%20Searches
   
   
--
Ian.
   
   
On Wed, Nov 19, 2008 at 4:02 PM, Ariel isaacr...@gmail.com
 wrote:
 Hi everybody:

 I need to make search with lucene 2.3.2, taking in account the
  dates,
 previously when I build the index I create a date field where I
   stored
the
 year in which the document was created, at the search moment I
  would
like to
 retrieve documents that have been created before a Year or after
 a
   Year,
for
 example documents before 2002 year o after 2003 year.
 It is possible to do that with lucene ???
 Regards
 Ariel

   
   
 -
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
   
   
   
   
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
 



Re: I would like to know more about the Lucene implementation in C++

2008-12-08 Thread Ariel
Thank you, very much.

On Thu, Dec 4, 2008 at 11:33 AM, Otis Gospodnetic 
[EMAIL PROTECTED] wrote:

 There is CLucene.  It's not a part of Apache, but lives on SourceForge,
 I think.


 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
  From: Ariel [EMAIL PROTECTED]
  To: lucene user java-user@lucene.apache.org
  Sent: Tuesday, December 2, 2008 2:13:08 PM
  Subject: I would want to know more about the lucene implementation in C++
 
  Hi everybody:
  I have seen the lucene project for C++ has been abandoned, could you tell
 me
  if there is another similar implementation of java lucene in C++ ???


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




I would like to know more about the Lucene implementation in C++

2008-12-02 Thread Ariel
Hi everybody:
I have seen that the Lucene project for C++ has been abandoned. Could you tell
me whether there is another similar implementation of Java Lucene in C++?


How to search documents taking the dates into account?

2008-11-19 Thread Ariel
Hi everybody:

I need to make search with lucene 2.3.2, taking in account the dates,
previously when I build the index I create a date field where I stored the
year in which the document was created, at the search moment I would like to
retrieve documents that have been created before a Year or after a Year, for
example documents before 2002 year o after 2003 year.
It is possible to do that with lucene ???
Regards
Ariel
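
Both directions can be expressed with RangeQuery, which accepts a null endpoint for an open-ended range (Lucene 2.3 API; only the field name "year" is taken from the post, the rest is illustrative):

```java
// Documents created before 2002: open lower bound, exclusive upper bound.
Query before2002 = new RangeQuery(null, new Term("year", "2002"), false);
// Documents created after 2003: exclusive lower bound, open upper bound.
Query after2003 = new RangeQuery(new Term("year", "2003"), null, false);
```

Because the comparison is lexicographic on the stored strings, store every year with the same width (always four digits) or the ordering will be wrong.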


Re: How to search documents taking the dates into account?

2008-11-19 Thread Ariel
Thanks, that was very helpful, but I have a question: when I search, the
results are not sorted according to the range. For example, with
year:[2003 TO 2008], 2003 documents are shown on the first page, 2005
documents on the second and 2004 documents on the third; I don't see any sort
criterion.
How could I fix that problem?
Greetings
Ariel

On Wed, Nov 19, 2008 at 11:09 AM, Ian Lea [EMAIL PROTECTED] wrote:

 Hi - sounds like you need a range query.

 http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Range%20Searches


 --
 Ian.


 On Wed, Nov 19, 2008 at 4:02 PM, Ariel [EMAIL PROTECTED] wrote:
  Hi everybody:
 
  I need to make search with lucene 2.3.2, taking in account the dates,
  previously when I build the index I create a date field where I stored
 the
  year in which the document was created, at the search moment I would like
 to
  retrieve documents that have been created before a Year or after a Year,
 for
  example documents before 2002 year o after 2003 year.
  It is possible to do that with lucene ???
  Regards
  Ariel
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: How to search documents taking the dates into account?

2008-11-19 Thread Ariel
Lucene is supposed to do a lexicographic sort, but this is not happening.
Could you tell me what I'm doing wrong?
I hope you can help me.
Regards

On Wed, Nov 19, 2008 at 11:56 AM, Ariel [EMAIL PROTECTED] wrote:

 Thanks, that was very helpful, but I have a question when I make the
 searches it does not sort the results according to the range, for example:
 year: [2003 TO 2008] in the first page 2003 documents are showed, in the
 second 2005 documents, in the third page 2004 documents, I don't see any
 sort criteria.
 How could I fix that problem ???
 Greetings
 Ariel


 On Wed, Nov 19, 2008 at 11:09 AM, Ian Lea [EMAIL PROTECTED] wrote:

 Hi - sounds like you need a range query.


 http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Range%20Searches


 --
 Ian.


 On Wed, Nov 19, 2008 at 4:02 PM, Ariel [EMAIL PROTECTED] wrote:
  Hi everybody:
 
  I need to make search with lucene 2.3.2, taking in account the dates,
  previously when I build the index I create a date field where I stored
 the
  year in which the document was created, at the search moment I would
 like to
  retrieve documents that have been created before a Year or after a Year,
 for
  example documents before 2002 year o after 2003 year.
  It is possible to do that with lucene ???
  Regards
  Ariel
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





Re: How to search documents taking the dates into account?

2008-11-19 Thread Ariel
Well, this is what I am doing:

queryString = "year:[2003 TO 2005]"
[CODE]
Query pquery = null;
Hits hits = null;
Analyzer analyzer = new SnowballAnalyzer("English");
try {
    pquery = MultiFieldQueryParser.parse(
            new String[] {queryString, queryString},
            new String[] {"title", "content"}, analyzer);
} catch (ParseException e1) {
    e1.printStackTrace();
}
MultiSearcher searcher = (MultiSearcher) searcherCache.get(name);

try {
    hits = searcher.search(pquery);
} catch (IOException e1) {
    e1.printStackTrace();
}
[/CODE]

I don't know which methods include sorting. I have sorted by the score
criterion so far; I don't know how to change it to sort by the year field.
As you can see, I am using a MultiSearcher because I have several indexes.

I hope you can help me.
Regards
Thanks in advance
Ariel



On Wed, Nov 19, 2008 at 3:58 PM, Ian Lea [EMAIL PROTECTED] wrote:

 Are you using one of the search methods that includes sorting?  If
 not, then do.  If you are, then you need to tell us exactly what you
 are doing and exactly what you reckon is going wrong.


 --
 Ian.


 On Wed, Nov 19, 2008 at 6:23 PM, Ariel [EMAIL PROTECTED] wrote:
  it is supposed lucene make a lexicocraphic sorting but this is not
 hapening,
  Could you tell me what I'm doing wrong ?
  I hope you can help me.
  Regards
 
  On Wed, Nov 19, 2008 at 11:56 AM, Ariel [EMAIL PROTECTED] wrote:
 
  Thanks, that was very helpful, but I have a question when I make the
  searches it does not sort the results according to the range, for
 example:
  year: [2003 TO 2008] in the first page 2003 documents are showed, in the
  second 2005 documents, in the third page 2004 documents, I don't see any
  sort criteria.
  How could I fix that problem ???
  Greetings
  Ariel
 
 
  On Wed, Nov 19, 2008 at 11:09 AM, Ian Lea [EMAIL PROTECTED] wrote:
 
  Hi - sounds like you need a range query.
 
 
 
 http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Range%20Searches
 
 
  --
  Ian.
 
 
  On Wed, Nov 19, 2008 at 4:02 PM, Ariel [EMAIL PROTECTED] wrote:
   Hi everybody:
  
   I need to make search with lucene 2.3.2, taking in account the dates,
   previously when I build the index I create a date field where I
 stored
  the
   year in which the document was created, at the search moment I would
  like to
   retrieve documents that have been created before a Year or after a
 Year,
  for
   example documents before 2002 year o after 2003 year.
   It is possible to do that with lucene ???
   Regards
   Ariel
  
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




What is the percent of size of lucene's index ?

2008-07-23 Thread Ariel
I need to know what percent of the original data size a Lucene index takes: I
have read articles that say if I index 120 GB of information the index will
grow to about 40 GB, which would be a ratio of roughly 33%. Could somebody
tell me how that figure can be verified?
Is there any official Apache Lucene document that states it?
I hope somebody can help me.
Thanks.
Ariel


How to make documents clustering and topic classification with lucene

2008-07-07 Thread Ariel
Hi everybody:
Do you have any idea how to do document clustering and topic classification
using Lucene? Is there any way to do this?
Please, I need help.
Thanks, everybody.
Ariel




Re: boosting relevance of certain documents

2008-04-25 Thread Jonathan Ariel
OK. I'm not an expert on the scoring algorithm, but based on tf*idf you can
tell that Doc2 is scored as more relevant because it has a higher term
frequency.

Using explain() you can see the following:

Doc 1
0.643841 = (MATCH) fieldWeight(searchable:fifa in 0), product of:
  1.0 = tf(termFreq(searchable:fifa)=1)
  1.287682 = idf(docFreq=2)
  0.5 = fieldNorm(field=searchable, doc=0)

Doc2
0.68289655 = (MATCH) fieldWeight(searchable:fifa in 1), product of:
  1.4142135 = tf(termFreq(searchable:fifa)=2)
  1.287682 = idf(docFreq=2)
  0.375 = fieldNorm(field=searchable, doc=1)
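The arithmetic behind those two explanations can be checked by hand: each
fieldWeight is the product tf * idf * fieldNorm, where the default Similarity
computes tf as the square root of the raw term frequency. A self-contained
check:

```java
// Reproduces the fieldWeight products from the explain() output above
// (default Similarity: tf = sqrt(rawFreq), fieldWeight = tf * idf * fieldNorm).
public class ExplainMath {
    public static double fieldWeight(int rawFreq, double idf, double fieldNorm) {
        return Math.sqrt(rawFreq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double idf = 1.287682;                       // idf(docFreq=2) from the output
        double doc1 = fieldWeight(1, idf, 0.5);      // ~0.643841
        double doc2 = fieldWeight(2, idf, 0.375);    // ~0.682897
        // Doc2 wins despite its smaller fieldNorm, because sqrt(2) > 1.
        System.out.println(doc1 + " vs " + doc2);
    }
}
```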

On Fri, Apr 25, 2008 at 2:30 PM, Daniel Freudenberger 
[EMAIL PROTECTED] wrote:

 I'm using the StandardAnalyzer - hope this answers your question (I'm
 quite
 new to the lucene thing)

 -Original Message-
 From: Jonathan Ariel [mailto:[EMAIL PROTECTED]
 Sent: Friday, April 25, 2008 6:59 PM
 To: java-user@lucene.apache.org
 Subject: Re: boosting relevance of certain documents

 How are you analyzing the searchable field?

 On Fri, Apr 25, 2008 at 12:49 PM, Daniel Freudenberger 
 [EMAIL PROTECTED] wrote:

  Hello,
 
 
 
  I'm using lucene within a new project and I'm not sure about how to
 solve
  the following problem: My index consists of the two attributes id and
  searchable. id is the id of a product and searchable is a
  combination
  of the product name and its category name.
 
 
 
   example:
 
   id searchable
 
   1 fifa 08 - playstation 3
 
   2 fifa 2003 fifa 03 - playstation 3
 
   3 playstation 60gb hdd - playstation 3
 
   4 playstation i like you - playstation 3
 
 
 
  When searching for fifa, lucene returns the product with id 2 at
 first,
  whereas id 1 (fifa 08) would be the much more relevant result (from
 the
  user side of view). the same problem arises when searching for
  playstation
  - the customer expects products having playstation in their names at
  first, ideally the console itself. in reality however, he gets all
  possible
  products which are in the playstation category as well.
 
 
 
  my idea was to introduce another attribute relevance, which may increase
  the
  relevance of an entry. the actual relevance shouldn't be suppressed
  completely though, but should only be taken into account with products
  that
  are similarly relevant for a specific search term.
 
 
 
  Does anybody have an idea on how to solve this problem?
 
 
 
  Thank you in advance,
 
  Daniel
 
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: MoreLikeThis over a subset of documents

2008-04-23 Thread Jonathan Ariel
Yes, it would be too much to do in real time, but it is a good idea, though.

I don't know if a vector of term frequencies is stored with the document,
because then I could search the index to get the subset of documents and take
the term frequencies from there. In that case I could change MoreLikeThis to
receive a set of term frequencies instead of an IndexReader, and use that to
do all the processing.

Does anyone know whether a document stores the term frequencies for its fields?
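For what it's worth, per-document term frequency vectors are only available if
the field was indexed with term vectors enabled. A hedged sketch (field names
are illustrative):

```java
// At index time: ask Lucene to store a term frequency vector for the field.
doc.add(new Field("content", text, Field.Store.NO,
        Field.Index.TOKENIZED, Field.TermVector.YES));

// At search time: read it back for one document of the subset.
TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
String[] terms = tfv.getTerms();          // distinct terms of that field
int[] freqs = tfv.getTermFrequencies();   // parallel array of frequencies
```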

On Wed, Apr 23, 2008 at 7:46 AM, Karl Wettin [EMAIL PROTECTED] wrote:

 Jonathan Ariel skrev:

  Smart idea, but it won't help me. I have almost 50 categories and
  eventually
  I would like to filter not just on category but maybe also on
  language,
  etc.
  Karl: what do you mean by measure the distance between the term vectors
  and
  cluster them in real time?
 

 I mean exactly what I say, that if your subsets are small enough you could
 evalute the cosine coefficient and group documents accordingly.

 2 million documents is however way to much data to do that in real time.

 I would probably create one index for each filter you want to use.


karl


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




MoreLikeThis patch to support boost factor

2008-04-23 Thread Jonathan Ariel
This is a patch I made to be able to boost the terms by a specific factor in
addition to the relevance returned by MoreLikeThis. This is helpful when you
have more than one MoreLikeThis in the query, so that words in field A (e.g.
title) can be boosted more than words in field B (e.g. description).

Any feedback?

Jonathan
Index: 
/home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java
===
--- 
/home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java
(revision 651048)
+++ 
/home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java
(working copy)
@@ -284,6 +284,11 @@
 private final IndexReader ir;
 
 /**
+ * Boost factor to use when boosting the terms
+ */
+private int boostFactor = 1;
+
+/**
  * Constructor requiring an IndexReader.
  */
 public MoreLikeThis(IndexReader ir) {
@@ -574,7 +579,7 @@
 }
 float myScore = ((Float) ar[2]).floatValue();
 
-tq.setBoost(myScore / bestScore);
+tq.setBoost(boostFactor * myScore / bestScore);
 }
 
 try {
@@ -921,6 +926,22 @@
 x = 1;
 }
 }
+
+/**
+ * Returns the boost factor used when boosting terms
+ * @return the boost factor used when boosting terms
+ */
+   public int getBoostFactor() {
+   return boostFactor;
+   }
+
+   /**
+* Sets the boost factor to use when boosting terms
+* @param boostFactor
+*/
+   public void setBoostFactor(int boostFactor) {
+   this.boostFactor = boostFactor;
+   }
 
 
 }
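Assuming the patch above is applied, a short usage sketch (field names, reader,
and docText are illustrative): title-derived terms get a larger factor than
description-derived terms, and the two queries are combined:

```java
// Sketch: boost title-derived terms 5x relative to description-derived terms.
MoreLikeThis titleMlt = new MoreLikeThis(reader);
titleMlt.setFieldNames(new String[] { "title" });
titleMlt.setBoost(true);        // existing switch: enable per-term boosting
titleMlt.setBoostFactor(5);     // added by this patch

MoreLikeThis descMlt = new MoreLikeThis(reader);
descMlt.setFieldNames(new String[] { "description" });
descMlt.setBoost(true);         // boostFactor stays at its default of 1

BooleanQuery combined = new BooleanQuery();
combined.add(titleMlt.like(new StringReader(docText)), BooleanClause.Occur.SHOULD);
combined.add(descMlt.like(new StringReader(docText)), BooleanClause.Occur.SHOULD);
```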

MoreLikeThis over a subset of documents

2008-04-22 Thread Jonathan Ariel
Is there any way to execute a MoreLikeThis over a subset of documents? I
need to retrieve a set of interesting keywords from a subset of documents
and not the entire index (imagine that my index has documents categorized as
A, B and C and I just want to work with those categorized as A). Right now
it is using docFreq from the IndexReader. So I looked into the
FilterIndexReader to see if I can override the docFreq behavior, but I'm not
sure if it's possible.

What do you think?

Jonathan


Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Jonathan Ariel
But that doesn't help me with my problem, because the interesting terms are
taken from the entire index and not a subset as I need.
On Tue, Apr 22, 2008 at 6:46 PM, Glen Newton [EMAIL PROTECTED] wrote:

 Instead of this:

 MoreLikeThis mlt = new MoreLikeThis(ir);
 Reader target = ... // orig source of doc you want to find similarities to
 Query query = mlt.like( target);
 Hits hits = is.search(query);

 do this:

 MoreLikeThis mlt = new MoreLikeThis(ir);
 Reader target = ... // orig source of doc you want to find similarities to
 Query moreQuery = mlt.like( target);
 BooleanQuery bq = new BooleanQuery();
 bq.add(moreQuery, BooleanClause.Occur.MUST);
  Query restrictQuery = new TermQuery(new Term("Category", "A"));
 bq.add(restrictQuery, BooleanClause.Occur.MUST);
 Hits hits = is.search(bq);

 -glen

 2008/4/22 Jonathan Ariel [EMAIL PROTECTED]:
  Is there any way to execute a MoreLikeThis over a subset of documents? I
   need to retrieve a set of interesting keywords from a subset of
 documents
   and not the entire index (imagine that my index has documents
 categorized as
   A, B and C and I just want to work with those categorized as A). Right
 now
   it is using docFreq from the IndexReader. So I looked into the
   FilterIndexReader to see if I can override the docFreq behavior, but
 I'm not
   sure if it's possible.
 
   What do you think?
 
   Jonathan
 



 --

 -

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Jonathan Ariel
I could have up to 2 million documents and growing.

On Tue, Apr 22, 2008 at 7:29 PM, Karl Wettin [EMAIL PROTECTED] wrote:

 Jonathan Ariel skrev:

  Is there any way to execute a MoreLikeThis over a subset of documents? I
  need to retrieve a set of interesting keywords from a subset of
  documents
  and not the entire index (imagine that my index has documents
  categorized as
  A, B and C and I just want to work with those categorized as A). Right
  now
  it is using docFreq from the IndexReader. So I looked into the
  FilterIndexReader to see if I can override the docFreq behavior, but I'm
  not
  sure if it's possible.
 
  What do you think?
 

 It might be tricky.

 How many documents do you have in the subset? Could you measure the
 distance between the term vectors and cluster them in real time?


 karl


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Jonathan Ariel
Smart idea, but it won't help me. I have almost 50 categories and eventually
I would like to filter not just on category but maybe also on language,
etc.
Karl: what do you mean by measure the distance between the term vectors and
cluster them in real time?

On Tue, Apr 22, 2008 at 7:39 PM, Glen Newton [EMAIL PROTECTED] wrote:

 Sorry, I misunderstood the problem. My mistake.

 While not optimal and rather expensive space-wise, you could have - in
 addition to existing keyword field - a field for each category.  If
 the document being indexed is in category A, only add the text to the
 catA field. Now do MoreLikeThis on catA. This assumes you know the
 categories at index time, of course.
 Redundant but workable.

 -Glen

 2008/4/22 Jonathan Ariel [EMAIL PROTECTED]:
  Is there any way to execute a MoreLikeThis over a subset of documents? I
   need to retrieve a set of interesting keywords from a subset of
 documents
   and not the entire index (imagine that my index has documents
 categorized as
   A, B and C and I just want to work with those categorized as A). Right
 now
   it is using docFreq from the IndexReader. So I looked into the
   FilterIndexReader to see if I can override the docFreq behavior, but
 I'm not
   sure if it's possible.
 
   What do you think?
 
   Jonathan
 



 --

 -

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




How to obtain the freq term vector of a field from a remote index ?

2008-02-28 Thread Ariel
Hi folks:

I need to know how to get the term frequency vector of a field from a remote
index on another host.
I know that IndexSearcher has a method, getIndexReader().getTermFreqVector(idDoc,
fieldName), that returns the term frequency vector of a given field, but I am
using RemoteSearchable, which is a Searcher, because my search functionality
lives in an RMI server. I access the RemoteSearchable from another host to
obtain the hits, but so far I haven't found a way to obtain the term frequency
vector of a field as well.
Do you know if it is possible to do that? How can I do it?
Any help is appreciated.
Greetings
Ariel
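RemoteSearchable only exposes the Searchable interface, so term vectors are
not reachable through it. One workaround is to export your own remote service
next to it; a hypothetical sketch (the interface and class names are
illustrative, not a Lucene API), returning plain arrays so everything crossing
RMI is serializable:

```java
import java.io.IOException;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import org.apache.lucene.index.IndexReader;

// Hypothetical remote interface exported alongside the RemoteSearchable.
interface RemoteTermVectorService extends Remote {
    String[] getTerms(int docId, String field) throws RemoteException;
    int[] getTermFrequencies(int docId, String field) throws RemoteException;
}

// Server side: delegate to the local IndexReader. Note that
// getTermFreqVector() returns null if the field was indexed without
// term vectors; a real implementation should handle that case.
class TermVectorServiceImpl extends UnicastRemoteObject
        implements RemoteTermVectorService {
    private final IndexReader reader;

    TermVectorServiceImpl(IndexReader reader) throws RemoteException {
        this.reader = reader;
    }

    public String[] getTerms(int docId, String field) throws RemoteException {
        try {
            return reader.getTermFreqVector(docId, field).getTerms();
        } catch (IOException e) {
            throw new RemoteException("term vector lookup failed", e);
        }
    }

    public int[] getTermFrequencies(int docId, String field) throws RemoteException {
        try {
            return reader.getTermFreqVector(docId, field).getTermFrequencies();
        } catch (IOException e) {
            throw new RemoteException("term vector lookup failed", e);
        }
    }
}
```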


MoreLikeThis jar doesn't contain classes

2008-02-22 Thread Jonathan Ariel
Hi,
I've downloaded Lucene 2.3.0 binaries and in the contrib folder I can see
the Similarity package, but inside the Jar there are no classes!
Downloading the sources I ran into the same issue.

Am I doing something wrong? Where should I get the MoreLikeThis classes
from?

Thanks!

Jonathan


MoreLikeThis queries

2008-02-22 Thread Jonathan Ariel
Hi, I'm trying to use MoreLikeThis, but I can't find out how to build a
MoreLikeThis query that returns documents related to a given document under
some extra conditions, e.g. the country field in the related documents should
be 1.

Is there any documentation on how to do this kind of queries?

Thanks,


Jonathan


Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
Thank you all for your answers; I am going to change a few things in my
application and run tests.
One thing: I haven't found another good PDF-to-text converter like PDFBox; do
you know any faster one?
Greetings
Thanks for your answers
Ariel

On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Ariel,

 I believe PDFBox is not the fastest thing and was built more to handle all
 possible PDFs than for speed (just my impression - Ben, PDFBox's author
 might still be on this list and might comment).  Pulling data from NFS to
 index seems like a bad idea.  I hope at least the indices are local and not
 on a remote NFS...

 We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
 and indexing overNFS was slooow.

 Otis

 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Ariel [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Wednesday, January 9, 2008 2:50:41 PM
 Subject: Why is lucene so slow indexing in nfs file system ?

 Hi:
 I have seen the post in
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
  and
 I am implementing a similar application in a distributed enviroment, a
 cluster of nodes only 5 nodes. The operating system I use is
  Linux(Centos)
 so I am using nfs file system too to access the home directory where
  the
 documents to be indexed reside and I would like to know how much time
  an
 application spends to index a big amount of documents like 10 Gb ?
 I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in
  every
 nodes, LAN: 1Gbits/s.

 The problem I have is that my application spends a lot of time to index
  all
 the documents, the delay to index 10 gb of pdf documents is about 2
  days (to
 convert pdf to text I am using pdfbox) that is of course a lot of time,
 others applications based in lucene, for instance ibm omnifind only
  takes 5
 hours to index the same amount of pdfs documents. I would like to find
  out
 why my application has this big delay to index, any help is welcome.
 Dou you know others distributed architecture application that uses
  lucene to
 index big amounts of documents ? How long time it takes to index ?
 I hope yo can help me
 Greetings




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
In a distributed environment the application has to make heavy use of the
network; there is no way to reach the documents in the remote repository other
than through the NFS file system.
One thing I must clarify: I index the documents in memory, using a
RAMDirectory; when the RAMDirectory reaches its limit (I have set about 10 MB)
I serialize the index to disk (NFS) and merge it with the central index (the
central index is on the NFS file system). Is that correct?
I hope you can help me.
I have taken the suggestions you made earlier into consideration; I am going
to run some tests.
Ariel
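The serialize-and-merge step described above is what IndexWriter.addIndexes
does; a minimal sketch of that pattern (directory paths are illustrative,
Lucene 2.2-era API), although later replies in this thread recommend skipping
the RAMDirectory stage entirely:

```java
// Worker side: build a small segment in RAM.
RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
// ... ramWriter.addDocument(doc) until the ~10 MB budget is reached ...
ramWriter.close();

// Merger side: fold the RAM segment into the central index on NFS.
IndexWriter central = new IndexWriter("/mnt/nfs/central-index", analyzer, false);
central.addIndexes(new Directory[] { ramDir });
central.close();
```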


On Jan 10, 2008 8:45 AM, Ariel [EMAIL PROTECTED] wrote:

 Thanks all you for yours answers, I going to change a few things in my
 application and make tests.
 One thing I haven't find another good pdfToText converter like pdfBox Do
 you know any other faster ?
 Greetings
 Thanks for yours answers
 Ariel


 On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED]
 wrote:

  Ariel,
 
  I believe PDFBox is not the fastest thing and was built more to handle
  all possible PDFs than for speed (just my impression - Ben, PDFBox's author
  might still be on this list and might comment).  Pulling data from NFS to
  index seems like a bad idea.  I hope at least the indices are local and not
  on a remote NFS...
 
  We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
  and indexing overNFS was slooow.
 
  Otis
 
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
  - Original Message 
  From: Ariel [EMAIL PROTECTED]
  To: java-user@lucene.apache.org
  Sent: Wednesday, January 9, 2008 2:50:41 PM
  Subject: Why is lucene so slow indexing in nfs file system ?
 
  Hi:
  I have seen the post in
  http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
   and
  I am implementing a similar application in a distributed enviroment, a
  cluster of nodes only 5 nodes. The operating system I use is
   Linux(Centos)
  so I am using nfs file system too to access the home directory where
   the
  documents to be indexed reside and I would like to know how much time
   an
  application spends to index a big amount of documents like 10 Gb ?
  I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in
   every
  nodes, LAN: 1Gbits/s.
 
  The problem I have is that my application spends a lot of time to index
   all
  the documents, the delay to index 10 gb of pdf documents is about 2
   days (to
  convert pdf to text I am using pdfbox) that is of course a lot of time,
  others applications based in lucene, for instance ibm omnifind only
   takes 5
  hours to index the same amount of pdfs documents. I would like to find
   out
  why my application has this big delay to index, any help is welcome.
  Dou you know others distributed architecture application that uses
   lucene to
  index big amounts of documents ? How long time it takes to index ?
  I hope yo can help me
  Greetings
 
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
I am indexing into RAM and then merging explicitly because my application
demands it: I designed it as a distributed environment, so many threads or
workers on different machines index into RAM and serialize to disk, and
another thread on another machine picks up the segment index and merges it
with the principal one. That should be faster than having just one thread
index the documents, shouldn't it?
Your suggestions are very useful.
I hope you can help me.
Greetings
Ariel

On Jan 10, 2008 10:21 AM, Erick Erickson [EMAIL PROTECTED] wrote:

 This seems really clunky. Especially if your merge step also optimizes.

 There's not much point in indexing into RAM then merging explicitly.
 Just use an FSDirectory rather than a RAMDirectory. There is *already*
 buffering built in to FSDirectory, and your merge factor etc. control
 how much RAM is used before flushing to disk. There's considerable
 discussion of this on the Wiki I believe, but in the mail archive for
 sure.
 And I believe there's a RAM usage based flushing policy somewhere.

 You're adding complexity where it's probably not necessary. Did you
 adopt this scheme because you *thought* it would be faster or because
 you were addressing a *known* problem? Don't *ever* write complex code
 to support a theoretical case unless you have considerable certainty
 that it really is a problem. It would be faster is a weak argument when
 you don't know whether you're talking about saving 1% or 95%. The
 added maintenance is just not worth it.

 There's a famous quote about that from Donald Knuth
 (paraphrasing Hoare) We should forget about small efficiencies,
 say about 97% of the time: premature optimization is the root of
 all evil. It's true.

 So the very *first* measurement I'd take is to get rid of the in-RAM
 stuff and just write the index to local disk. I suspect you'll be *far*
 better off doing this then just copying your index to the nfs mount.

 Best
 Erick

 On Jan 10, 2008 10:05 AM, Ariel [EMAIL PROTECTED] wrote:

  In a distributed enviroment the application should make an exhaustive
 use
  of
  the network and there is not another way to access to the documents in a
  remote repository but accessing in nfs file system.
  One thing I must clarify: I index the documents in memory, I use
  RAMDirectory to do that, then when the RAMDirectory reach the limit(I
 have
  put about 10 Mb) then I serialize to disk(nfs) the index to merge it
 with
  the central index(the central index is in nfs file system), is that
  correct?
  I hope you can help me.
  I have take in consideration the suggestions you have make me before, I
  going to do some things to test it.
  Ariel
 
 
  On Jan 10, 2008 8:45 AM, Ariel [EMAIL PROTECTED] wrote:
 
   Thanks all you for yours answers, I going to change a few things in my
   application and make tests.
   One thing I haven't find another good pdfToText converter like pdfBox
 Do
   you know any other faster ?
   Greetings
   Thanks for yours answers
   Ariel
  
  
   On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED]
   wrote:
  
Ariel,
   
I believe PDFBox is not the fastest thing and was built more to
 handle
all possible PDFs than for speed (just my impression - Ben, PDFBox's
  author
might still be on this list and might comment).  Pulling data from
 NFS
  to
index seems like a bad idea.  I hope at least the indices are local
  and not
on a remote NFS...
   
We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
  one)
and indexing overNFS was slooow.
   
Otis
   
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
   
- Original Message 
From: Ariel [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 2:50:41 PM
Subject: Why is lucene so slow indexing in nfs file system ?
   
Hi:
I have seen the post in
   
  http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
 and
I am implementing a similar application in a distributed enviroment,
 a
cluster of nodes only 5 nodes. The operating system I use is
 Linux(Centos)
so I am using nfs file system too to access the home directory where
 the
documents to be indexed reside and I would like to know how much
 time
 an
application spends to index a big amount of documents like 10 Gb ?
I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb
 in
 every
nodes, LAN: 1Gbits/s.
   
The problem I have is that my application spends a lot of time to
  index
 all
the documents, the delay to index 10 gb of pdf documents is about 2
 days (to
convert pdf to text I am using pdfbox) that is of course a lot of
  time,
others applications based in lucene, for instance ibm omnifind only
 takes 5
hours to index the same amount of pdfs documents. I would like to
 find
 out
why my application has this big delay to index, any help is welcome.
Dou you know

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
Thanks for your suggestions.

I'm sorry, I don't know those terms: what do you mean by SAN and FC?

Another thing: I have visited the Lucene home page and the 2.3 version is not
released there; could you tell me where the download link is?

Thanks in advance.
Ariel

On Jan 10, 2008 2:59 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Ariel,

 Comments inline.


 - Original Message 
 From: Ariel [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Thursday, January 10, 2008 10:05:28 AM
 Subject: Re: Why is lucene so slow indexing in nfs file system ?

 In a distributed enviroment the application should make an exhaustive
  use of
 the network and there is not another way to access to the documents in
  a
 remote repository but accessing in nfs file system.

 OG: What about SAN connected over FC for example?

 One thing I must clarify: I index the documents in memory, I use
 RAMDirectory to do that, then when the RAMDirectory reach the limit(I
  have
 put about 10 Mb) then I serialize to disk(nfs) the index to merge it
  with
 the central index(the central index is in nfs file system), is that
  correct?

 OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will
 do in-memory thing for you.  Make good use of your RAM and use 2.3 which
 gives you more control over RAM use during indexing.  Parallelizing indexing
 over multiple machines and merging at the end is faster, so that's a good
 approach.  Also, if your boxes have multiple CPUs write your code so that it
 has multiple worker threads that do indexing and feed docs to
 IndexWriter.addDocument(Document) to keep the CPUs fully utilized.

 OG: Oh, something faster than PDFBox?  There is (can't remember the name
 now... itextstream or something like that?), though it may not be free like
 PDFBox.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


 On Jan 10, 2008 8:45 AM, Ariel [EMAIL PROTECTED] wrote:

  Thanks all you for yours answers, I going to change a few things in
  my
  application and make tests.
  One thing I haven't find another good pdfToText converter like pdfBox
  Do
  you know any other faster ?
  Greetings
  Thanks for yours answers
  Ariel
 
 
  On Jan 9, 2008 11:08 PM, Otis Gospodnetic
  [EMAIL PROTECTED]
  wrote:
 
   Ariel,
  
   I believe PDFBox is not the fastest thing and was built more to
  handle
   all possible PDFs than for speed (just my impression - Ben,
  PDFBox's author
   might still be on this list and might comment).  Pulling data from
  NFS to
   index seems like a bad idea.  I hope at least the indices are local
  and not
   on a remote NFS...
  
   We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
  one)
   and indexing overNFS was slooow.
  
   Otis
  
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
   - Original Message 
   From: Ariel [EMAIL PROTECTED]
   To: java-user@lucene.apache.org
   Sent: Wednesday, January 9, 2008 2:50:41 PM
   Subject: Why is lucene so slow indexing in nfs file system ?
  
   Hi:
   I have seen the post in
  
  http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
and
   I am implementing a similar application in a distributed
  enviroment, a
   cluster of nodes only 5 nodes. The operating system I use is
Linux(Centos)
   so I am using nfs file system too to access the home directory
  where
the
   documents to be indexed reside and I would like to know how much
  time
an
   application spends to index a big amount of documents like 10 Gb ?
   I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb
  in
every
   nodes, LAN: 1Gbits/s.
  
   The problem I have is that my application spends a lot of time to
  index
all
   the documents, the delay to index 10 gb of pdf documents is about 2
days (to
   convert pdf to text I am using pdfbox) that is of course a lot of
  time,
   others applications based in lucene, for instance ibm omnifind only
takes 5
   hours to index the same amount of pdfs documents. I would like to
  find
out
   why my application has this big delay to index, any help is
  welcome.
   Dou you know others distributed architecture application that uses
lucene to
   index big amounts of documents ? How long time it takes to index ?
   I hope yo can help me
   Greetings
  
  
  
  
  
  -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Ariel
Hi:
I have seen the post at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and
I am implementing a similar application in a distributed environment: a
cluster of only 5 nodes. The operating system I use is Linux (CentOS), so I am
also using the NFS file system to access the home directory where the
documents to be indexed reside, and I would like to know how much time an
application should spend to index a large amount of documents, like 10 GB.
I use Lucene version 2.2.0; each node has a dual Xeon 2.4 GHz CPU and 512 MB
of RAM; LAN: 1 Gbit/s.

The problem I have is that my application spends a lot of time indexing all
the documents: the delay to index 10 GB of PDF documents is about 2 days (to
convert PDF to text I am using PDFBox), which is of course a lot of time.
Other applications based on Lucene, for instance IBM OmniFind, take only 5
hours to index the same amount of PDF documents. I would like to find out why
my application has this big indexing delay; any help is welcome.
Do you know of other distributed-architecture applications that use Lucene to
index big amounts of documents? How long do they take to index?
I hope you can help me.
Greetings


Re: How to build your custom termfreq vector an add it to the field ?

2007-11-08 Thread Ariel
Very interesting, the link you suggested, Mr. Grant Ingersoll.
Let's see if I understand how ranking in Lucene could be customized:
1. First I must create my own query class extending the abstract Query class.
The only method I must implement from that class is toString(String field).
Is that right?
2. I must implement the Weight interface inside my own query class, but I
really don't understand how this is going to let me change the ranking scoring.
3. Must I implement my own custom Scorer?
I don't know how to integrate all this. There are a lot of little pieces of
information, but nothing concrete.
Greetings
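For what it's worth, the three pieces connect like this: the custom Query's
createWeight() returns the Weight, and the Weight's scorer() returns the
Scorer that iterates over matching documents and assigns each one its score. A
bare, hedged skeleton of that chain (Lucene 2.2-era signatures; the matching
and scoring logic here is placeholder only):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;

// Skeleton of the Query -> Weight -> Scorer chain; not a working ranking.
public class MyQuery extends Query {
    public String toString(String field) { return "myquery"; }

    protected Weight createWeight(Searcher searcher) {
        return new MyWeight();
    }

    private class MyWeight implements Weight {
        private float value;
        public Query getQuery() { return MyQuery.this; }
        public float getValue() { return value; }
        public float sumOfSquaredWeights() { return 1.0f; }  // query-norm bookkeeping
        public void normalize(float norm) { value = norm; }
        public Scorer scorer(IndexReader reader) throws IOException {
            return new MyScorer(reader);
        }
        public Explanation explain(IndexReader reader, int doc) {
            return new Explanation(value, "custom score");
        }
    }

    private class MyScorer extends Scorer {
        private final IndexReader reader;
        private int docId = -1;
        MyScorer(IndexReader reader) {
            super(Similarity.getDefault());
            this.reader = reader;
        }
        public boolean next() throws IOException {
            // Placeholder: advance docId to the next matching document;
            // return false when there are no more matches.
            return ++docId < reader.maxDoc();
        }
        public int doc() { return docId; }
        public float score() { return 1.0f; }  // your custom ranking goes here
        public boolean skipTo(int target) throws IOException {
            docId = target - 1;
            return next();
        }
        public Explanation explain(int doc) {
            return new Explanation(1.0f, "custom score");
        }
    }
}
```

score() is where a custom ranking formula would live; next() and skipTo()
decide which documents match at all.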



On Nov 7, 2007 1:48 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:
 Term Vectors (specifically TermFreqVector) in Lucene are a storage
 mechanism for convenience and applications to use.  They are not an
 integral part of the scoring in the way you may be thinking of them in
 terms of the traditional Vector Space Model, thus there may be some
 confusion from the different usages of that terminology.  If you want
 to see examples of how to implement scorers have a look at classes
 like TermScorer, BoostingTermQuery, and any of the other classes that
 extend Scorer.  You might also find the file formats page (off of the
 Lucene Java website under Documentation) helpful for understanding
 what Lucene stores so that it can do scoring.

 There really isn't any tutorial on scoring, as it is not something
 that many people have expressed an interest in or no one has made it a
 high enough priority to write one.  Having written a Scorer (or maybe
 two, I forget) I can give advice on specific things, but I am not sure
 I could write a tutorial that is general enough to be useful at this
 point.

 One thought for associating a weight to a given term based on its
 cooccurring terms is to use the new Payload mechanism whereby you can
 store a byte array at each term which can then be used in scoring via
 things like the BoostingTermQuery (or your own implementation.)  If
 that is of interest, you can search the archives for payloads (I also
 think Michael Busch is presenting on Payloads, amongst other things,
 at ApacheCon in Atlanta) and have a look at the BoostingTermQuery.
 There certainly are other PayloadQueries that need to be implemented.
 See the Lucene wiki for some background and details on Payloads as well.

 I don't know that it is a big mistake to try this in Lucene.  The
 community hasn't put a huge priority on making altering the innards of
 scoring easier to deal with (if possible), but that doesn't mean we
 are not open to suggestions and patches. You may find 
 https://issues.apache.org/jira/browse/LUCENE-965
   to be informative for both the implementation and the discussion of
 things that need to happen to be accepted into Lucene.  This JIRA
 issue specifically attempts to provide Lucene with a new scoring
 mechanism.

 You might also have a look at Lemur (http://www.lemurproject.org/)
 which is much more academically focused.

 Cheers,
 Grant


 On Nov 7, 2007, at 12:49 PM, Ariel wrote:

  Then if I want to use another scoring formula I must to implement my
  own Query/Weigh/Scorer  ? For example instead of cousine distance
  leiderbage distance or .. another. I'm studying Query/Weigh/Scorer
  classes to find out how to do that but there is not much documentation
  about that.
 
  I have seen I could change similarity factors extending the simlarity
  class, but I have not seen any example about changing scoring formula
  and changing the weight by term in the term vector. Do you know any
  tutorial about this ?
 
  What I want to do changing frecuency in the terms vector is this: for
  example instead of take the tf term frecuency of the term and stored
  in the vector I want to consider the correlation of the term with the
  other terms of the documents and store that measure by term in the
  vector so later with my custom similarity formula calculate the
  ranking of a document against a query considering the correlation
  between terms.
  Dou you think is a big mistake try to do this with lucene ??? Is
  there any way ?
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Boot Camp Training:
 ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ









Re: How to build your custom termfreq vector an add it to the field ?

2007-11-07 Thread Ariel
Then if I want to use another scoring formula I must implement my own
Query/Weight/Scorer? For example, instead of cosine distance, Levenshtein
distance or some other measure. I'm studying the Query/Weight/Scorer
classes to find out how to do that, but there is not much documentation
about them.

I have seen that I could change similarity factors by extending the Similarity
class, but I have not seen any example of changing the scoring formula
or of changing the weight per term in the term vector. Do you know of any
tutorial about this?

What I want to do by changing the frequency in the term vectors is this: for
example, instead of taking the tf (term frequency) of a term and storing
it in the vector, I want to consider the correlation of the term with the
other terms of the document and store that measure per term in the
vector, so that later, with my custom similarity formula, I can calculate the
ranking of a document against a query considering the correlation
between terms.
Do you think it is a big mistake to try to do this with Lucene? Is there any way?
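The correlation idea can be prototyped outside Lucene first. In this stdlib sketch, each term's weight is its tf scaled by how many distinct terms it co-occurs with in the same sentence; the formula is purely illustrative, and plugging such weights into Lucene would still require the custom Query/Weight/Scorer discussed above:

```java
import java.util.*;

// Build a per-document term vector whose weights are not raw term
// frequencies: each term's tf is scaled by the fraction of the document's
// vocabulary it co-occurs with (an illustrative correlation proxy).
public class CorrelationVector {

    public static Map<String, Double> weights(List<List<String>> sentences) {
        Map<String, Integer> tf = new HashMap<>();
        Map<String, Set<String>> cooc = new HashMap<>();
        for (List<String> sent : sentences) {
            for (String t : sent) {
                tf.merge(t, 1, Integer::sum);
                Set<String> others = cooc.computeIfAbsent(t, k -> new HashSet<>());
                for (String u : sent)
                    if (!u.equals(t)) others.add(u);  // terms seen alongside t
            }
        }
        int vocab = tf.size();
        Map<String, Double> w = new HashMap<>();
        for (String t : tf.keySet()) {
            double corr = vocab > 1
                ? cooc.get(t).size() / (double) (vocab - 1) : 1.0;
            w.put(t, tf.get(t) * corr);  // tf scaled by correlation
        }
        return w;
    }
}
```

A term that appears often but always with the same neighbors ends up weighted lower than one that appears with many different terms, which is the kind of reweighting the thread is asking about.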




Re: How to change the similarity function of lucene

2007-09-28 Thread Ariel
Sorry for the delay.
What I want to do is change the term weights: I don't want a term's weight to
be the frequency with which it appears in the document; instead I want it to be
another, special measure, and with that to change the similarity function.
I don't know how to change the term weights in the term vector of a
document. How can I do it?

Greetings
Ariel

On 9/24/07, Grant Ingersoll [EMAIL PROTECTED] wrote:

 Perhaps you can explain in what way you want to make it more
 powerful?  There are possibilities to do:
 1. Change the Similarity class (a call back mechanism)
 2. Implement or extend Queries, Scorers, etc.
 3. Others???

 See http://lucene.apache.org/java/docs/scoring.html for some insights.

 In other words, it can be as complex as you want it to be...

 -Grant

 On Sep 24, 2007, at 5:24 PM, Ariel wrote:

  Hi every body:
 
 I would like to know how to change the similarity function of
  lucene to
  extends the posibilities of searching and make it more powefull. Have
  somebody made this before ?
  Could you help me please ? I don't know how complex might be this.
 
  I hope you can help me.
  Greetings
  Ariel

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ







How to change the similarity function of lucene

2007-09-24 Thread Ariel
Hi everybody:

   I would like to know how to change the similarity function of Lucene to
extend the possibilities of searching and make it more powerful. Has
anybody done this before?
Could you help me please? I don't know how complex this might be.

I hope you can help me.
Greetings
Ariel
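Extending Similarity mostly means swapping the per-factor formulas (tf, idf, lengthNorm) rather than replacing the whole model. A toy comparison of tf variants shows how much that one knob alone changes a score; only sqrt(freq) matches Lucene's DefaultSimilarity, the other two are illustrative alternatives:

```java
// Compare three tf formulas a Similarity subclass could return from tf().
public class TfVariants {
    static double tfDefault(int freq) { return Math.sqrt(freq); } // Lucene's default
    static double tfRaw(int freq)     { return freq; }            // linear in freq
    static double tfLog(int freq)     { return 1 + Math.log(freq); } // sublinear

    public static void main(String[] args) {
        for (int f : new int[] {1, 4, 16})
            System.out.printf("freq=%d sqrt=%.2f raw=%.2f log=%.2f%n",
                f, tfDefault(f), tfRaw(f), tfLog(f));
    }
}
```

Swapping the formula only requires overriding Similarity.tf(); replacing the scoring model itself (e.g. with a different vector-space measure) is what requires the Query/Weight/Scorer route from the other threads.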


How to get documents similar to other document ?

2007-09-11 Thread Ariel
Hi everybody:

My question is whether there is an API function in Lucene to obtain documents
similar to another document by comparing the term frequency vector of a field.
I suppose a lot of people have asked this before, but I haven't found the
answer either with Google or in the Lucene API docs.
This could be a very useful piece of functionality for the Lucene API. I am
using Lucene version 1.9.
I hope you can help me.
Greetings.
Ariel
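The idea behind the contrib MoreLikeThis class, reduced to stdlib code: treat each document as a term-frequency vector and rank candidates by cosine similarity to it. Method names below are illustrative, not the contrib API:

```java
import java.util.*;

// Rank corpus documents by cosine similarity of term-frequency vectors.
public class MoreLikeThisSketch {

    // Tokenize into a term -> frequency map.
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+"))
            if (!t.isEmpty()) v.merge(t, 1, Integer::sum);
        return v;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        double na = 0, nb = 0;
        for (int x : a.values()) na += (double) x * x;
        for (int x : b.values()) nb += (double) x * x;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Index of the corpus document most similar to `query`.
    public static int mostSimilar(String query, List<String> corpus) {
        Map<String, Integer> q = tf(query);
        int best = -1;
        double bestScore = -1;
        for (int i = 0; i < corpus.size(); i++) {
            double s = cosine(q, tf(corpus.get(i)));
            if (s > bestScore) { bestScore = s; best = i; }
        }
        return best;
    }
}
```

MoreLikeThis does the same thing more cleverly (it selects the most interesting terms of the source document and issues them as a query), which avoids scanning every document.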


Re: How to get documents similar to other document ?

2007-09-11 Thread Ariel
Excuse me, could you give more details?
Are you telling me that this functionality already exists?
Which class should I use for it?
I hope I'm not bothering you.
Greetings

On 9/11/07, Grant Ingersoll [EMAIL PROTECTED] wrote:

 See the MoreLikeThis functionality in the contrib package, also
 search this archive for MoreLikeThis.


 On Sep 11, 2007, at 11:50 AM, Ariel wrote:

  Hi every body:
 
  My question is if there is an api function of lucene to obtain similar
  documents to other document comparing the term frequence vector of
  a field
  ???
  I supposed a lot of people have asked this before but I haven't
  found the
  answer neither with google nor api lucene.
  This could be a very useful functionality of the lucene api. I am
  using
  lucene version 1.9
  I hope you can help me.
  Greetings.
  Ariel

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ







Indexing

2007-08-22 Thread Jonathan Ariel
Hi,
I'm new to this list, so first of all hello to everyone!

Right now I have a little issue I would like to discuss with you.
Suppose that you are in a really big application where the data in your
database is updated really fast. I reindex Lucene every 5 minutes, but since my
application lists everything from Lucene, there are up to 5 minutes (in the
worst case) where I don't see new stuff.
What do you think would be the best approach to this problem?

Thanks!

Jonathan


Re: Indexing

2007-08-22 Thread Jonathan Ariel
I'm not reindexing the entire index, I'm just committing the updates. But I'm
not sure how committing in real time would affect performance. Right now I
have about 10 updates per minute.

On 8/22/07, Erick Erickson [EMAIL PROTECTED] wrote:

 There are several approaches. First, is your index small
 enough to fit in RAM? You might consider just putting it all in
 RAM and searching that.

 A more complex solution would be to keep the increments
 in a separate RAMDir AND your FSDir, search both and
 keep things coordinated. Something like

 open FSDIr
 create RAMDir
 while (whatever) {
get request
if (modification) {
write to FSDir and RAMDir
   }
if (search) {
  search FSDir
  open RAMDir reader
  search RAMDir
  close RAMDir reader (but not writer!)
   }
 }

 close FSDIr
 close RAMDir
 start again from the top.



 Warning: I haven't done this, but it *should* work. The sticky
 part seems to me to be coordinating deletes since the
 open FSDir may contain documents also in the RAMDir,
  but that's an exercise for the reader <g>,

 You could also define the problem away and just live
 with a 5 minute latency.

 Best
 Erick

 On 8/22/07, Jonathan Ariel [EMAIL PROTECTED] wrote:
 
  Hi,
  I'm new to this list. So first of all Hello to everyone!
 
  So right now I have a little issue I would like to discuss with you.
  Suppose that your are in a really big application where the data in your
  database is updated really fast. I reindex lucene every 5 min but since
 my
  application lists everything from lucene there are like 5 minutes (in
 the
  worse case) where I don't see new staff.
  What do you think would be the best aproach to this problem?
 
  Thanks!
 
  Jonathan
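Erick's FSDir + RAMDir pattern can be simulated in miniature with two maps standing in for the Lucene directories: recent updates live in a small in-memory increment that searches consult alongside the main index, so they are visible before the next full reindex. A loose stdlib sketch of the idea, not Lucene code:

```java
import java.util.*;

// Two-tier index: "disk" is the last full reindex, "ram" holds updates since.
public class DualIndex {
    private final Map<Integer, String> disk = new HashMap<>(); // last full reindex
    private final Map<Integer, String> ram  = new HashMap<>(); // recent updates

    public void update(int id, String doc) {
        ram.put(id, doc);  // visible to searches immediately
    }

    // Search both tiers; hit ids are deduplicated and sorted.
    public List<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>();
        for (Map<Integer, String> idx : List.of(disk, ram))
            for (Map.Entry<Integer, String> e : idx.entrySet())
                if (e.getValue().contains(term)) hits.add(e.getKey());
        return new ArrayList<>(hits);
    }

    // Periodic reindex: fold the RAM increment into the main index.
    public void reindex() {
        disk.putAll(ram);
        ram.clear();
    }
}
```

The delete coordination Erick warns about is exactly what reindex() hides here: folding the increment in atomically, without dropping or duplicating documents, is the hard part in the real system.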
 



One index per user or one index per day?

2007-02-26 Thread ariel goldberg

Greetings,

I'm creating an application that requires the indexing of millions of
documents on behalf of a large group of users, and was hoping to get an
opinion on whether I should use one index per user or one index per day.

My application will have to handle the following:

- the indexing of about 1 million 5K documents per day, with each document
containing about 5 fields
- expiration of documents, since after a while, my hard drive would run out
of room
- queries that consist of boolean expressions (e.g., the body field contains
a AND b, and the title field contains c), as well as ranges (e.g., the
document needs to have been indexed between 2/25/07 10:00 am and 2/28/07
9:00 pm)
- permissions; in other words, user A might be able to search on documents X
and Y, but user B might be able to search on documents Y and Z
- up to 1,000 users

So, I was considering the following:

1) Using one index per user

This would entail creating and using up to 1,000 indices. Document Y in the
example above would have to be duplicated. Expiration is performed via
IndexWriter.deleteDocuments. The advantage here is that querying should be
reasonably quick, because each index would only contain tens of thousands of
documents, instead of millions. The disadvantages: I'm concerned about the
"too many open files" error, and I'm also concerned about the performance of
deleteDocuments.

2) Using one index per day

Each day, I create a new index. Again, document Y in the example above would
have to be duplicated (is there any way around this?) The advantage here is
that expiring documents means simply deleting the index corresponding to a
particular day. The disadvantage is the query performance, since the
queries, which are already very complex, would have to be performed using
MultiSearcher (if expiration is after 10 days, that's 10 indices to search
across).

Tough to know for sure which option is better without testing, but does
anyone have a gut reaction? Any advice would be greatly appreciated!

Thanks,

Ariel
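With the one-index-per-day option, both the range query and the expiration reduce to picking index names by date; bookkeeping like the following (the "idx-YYYY-MM-DD" naming is an assumption, not a Lucene convention) decides which indices a MultiSearcher opens and which can be deleted outright:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Map calendar dates to per-day index names.
public class DailyIndices {

    static String indexFor(LocalDate day) { return "idx-" + day; }

    // Indices a MultiSearcher would need for an inclusive date range.
    public static List<String> forRange(LocalDate from, LocalDate to) {
        List<String> names = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1))
            names.add(indexFor(d));
        return names;
    }

    // Expiration: whole indices older than keepDays can simply be deleted.
    public static List<String> expired(List<LocalDate> existing,
                                       LocalDate today, int keepDays) {
        List<String> gone = new ArrayList<>();
        for (LocalDate d : existing)
            if (d.isBefore(today.minusDays(keepDays - 1)))
                gone.add(indexFor(d));
        return gone;
    }
}
```

Deleting a whole directory is far cheaper than deleteDocuments over millions of documents, which is the main argument for option 2 despite the MultiSearcher cost.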






 




Re: Full disk space during indexing process with 120 gb of free disk space

2006-12-05 Thread Ariel Isaac Romero Cartaya

 Here is my source code where I convert PDF files to text for indexing. I
got this code from the Lucene in Action examples and adapted it for my
needs. I hope you can help me fix this problem; anyway, if you know a more
efficient way to do it, please tell me how:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.encryption.DecryptDocument;
import org.pdfbox.exceptions.CryptographyException;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;

import cu.co.cenatav.kernel.parser.DocumentHandler;
import cu.co.cenatav.kernel.parser.DocumentHandlerException;
import cu.co.cenatav.kernel.parser.schema.SchemaExtractor;

public class PDFBoxPDFHandler implements DocumentHandler {

  public static String password = "-password";

  public Document getDocument(InputStream is)
      throws DocumentHandlerException {

    COSDocument cosDoc = null;
    try {
      cosDoc = parseDocument(is);
    }
    catch (IOException e) {
      closeCOSDocument(cosDoc);
      throw new DocumentHandlerException(
          "Cannot parse PDF document", e);
    }

    // decrypt the PDF document, if it is encrypted
    try {
      if (cosDoc.isEncrypted()) {
        DecryptDocument decryptor = new DecryptDocument(cosDoc);
        decryptor.decryptDocument(password);
      }
    }
    catch (CryptographyException e) {
      closeCOSDocument(cosDoc);
      throw new DocumentHandlerException(
          "Cannot decrypt PDF document", e);
    }
    catch (InvalidPasswordException e) {
      closeCOSDocument(cosDoc);
      throw new DocumentHandlerException(
          "Cannot decrypt PDF document", e);
    }
    catch (IOException e) {
      closeCOSDocument(cosDoc);
      throw new DocumentHandlerException(
          "Cannot decrypt PDF document", e);
    }

    // extract the PDF document's textual content
    String bodyText = null;
    try {
      PDFTextStripper stripper = new PDFTextStripper();
      bodyText = stripper.getText(new PDDocument(cosDoc));
    }
    catch (IOException e) {
      closeCOSDocument(cosDoc);
      throw new DocumentHandlerException(
          "Cannot parse PDF document", e);
    }

    Document doc = new Document();
    if (bodyText != null) {

      PDDocument pdDoc = null;
      PDDocumentInformation docInfo = null;

      try {
        pdDoc = new PDDocument(cosDoc);
        docInfo = pdDoc.getDocumentInformation();
      }
      catch (Exception e) {
        closeCOSDocument(cosDoc);
        closePDDocument(pdDoc);
        System.err.println("Cannot extract metadata from PDF: " +
            e.getMessage());
      }

      SchemaExtractor schemaExtractor = new SchemaExtractor(bodyText);

      String author = null;
      if (docInfo != null)
        author = docInfo.getAuthor();

      if (author == null || author.equals("")) {

        // TODO build the schemaExtractor component
        List authors = schemaExtractor.getAuthor();
        Iterator it = authors.iterator();
        while (it.hasNext()) {
          String a = (String) it.next();
          doc.add(new Field("author", a, Field.Store.YES,
              Field.Index.TOKENIZED, Field.TermVector.YES));
        }
      } else {
        doc.add(new Field("author", author, Field.Store.YES,
            Field.Index.TOKENIZED, Field.TermVector.YES));
      }

      String title = null;
      if (docInfo != null)
        title = docInfo.getTitle();
      if (title == null || title.equals(""))
        title = schemaExtractor.getTitle();

      String keywords = null;
      if (docInfo != null)
        keywords = docInfo.getKeywords();
      if (keywords == null)
        keywords = "";

      String summary = null;
      if (docInfo != null)
        summary = docInfo.getProducer() + " " +
            docInfo.getCreator() + " " + docInfo.getSubject();
      if (summary == null || summary.equals(""))
        summary = schemaExtractor.getAbstract();

      String content = schemaExtractor.getContent();

      Field fieldTitle = new Field("title", title, Field.Store.YES,
          Field.Index.TOKENIZED, Field.TermVector.YES);
      //fieldTitle.setBoost(new Float(1.5));
      doc.add(fieldTitle);

      // Note: the field name "sumary" (sic) must match the name used at search time.
      Field fieldSumary = new Field("sumary", summary, Field.Store.YES,
          Field.Index.TOKENIZED, Field.TermVector.YES);
      //fieldSumary.setBoost(new Float(1.3));
      doc.add(fieldSumary);

      doc.add(new Field("content", content, Field.Store.YES,

Full disk space during indexing process with 120 gb of free disk space

2006-12-04 Thread Ariel Isaac Romero Cartaya

Hi everybody:

  I am running into a problem during the indexing process. I am indexing
large amounts of text, most of it in PDF format, using PDFBox version 0.6.
The free disk space before the indexing process begins is around 120 GB, but
incredibly, even though my Lucene index is not yet 300 MB, the hard disk runs
out of free space. Even more incredible: when I stop the indexing process,
the free disk space quickly climbs back to 120 GB. How can this happen if I
don't copy the documents to disk? I index on a Linux machine. I have been
thinking it could be temporary files from something, maybe PDFBox?
Could you help me please?
Greetings
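One way to confirm the temporary-file theory before blaming PDFBox: watch what accumulates in java.io.tmpdir while indexing runs. The ".tmp" suffix below is an assumption about the scratch-file names; adjust it to whatever actually shows up there:

```java
import java.io.File;

// Sum the size of files with a given suffix in a directory -- a quick
// diagnostic for scratch files that are only freed when documents are closed.
public class TempDirCheck {

    public static long bytesMatching(File dir, String suffix) {
        long total = 0;
        File[] files = dir.listFiles();
        if (files == null) return 0;  // not a directory or unreadable
        for (File f : files)
            if (f.isFile() && f.getName().endsWith(suffix))
                total += f.length();
        return total;
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(bytesMatching(tmp, ".tmp")
            + " bytes of .tmp files in " + tmp);
    }
}
```

If that number grows with each parsed PDF, the fix is to close each COSDocument/PDDocument as soon as its text is extracted, rather than at the end of the run.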


Re: Big problem with big indexes

2006-10-17 Thread Ariel Isaac Romero Cartaya

Here are pieces of my source code:

First of all, I search in all the indexes, given a query string, with a
parallel searcher. As you can see, I build a multi-field query. Then you can
see the index format I use; I store all the fields in the index. My index is
optimized.

 public Hits search(String query) throws IOException {

   AnalyzerHandler analizer = new AnalyzerHandler();
   Query pquery = null;

   try {
       pquery = MultiFieldQueryParser.parse(query, new String[]
           {"title", "sumary", "filename", "content", "author"},
           analizer.getAnalyzer());
   } catch (ParseException e1) {
       e1.printStackTrace();
   }

   Searchable[] searchables = new Searchable[IndexCount];

   for (int i = 0; i < IndexCount; i++) {
       searchables[i] = new IndexSearcher(
           RAMIndexsManager.getInstance().getDirectoryAt(i));
   }

   Searcher parallelSearcher = new ParallelMultiSearcher(searchables);

   return parallelSearcher.search(pquery);
 }

Then, in another method, I obtain the fragment where the term occurs. As you
can see, I use an EnglishAnalyzer that does stopword filtering, stemming,
synonym detection, etc.:

   public Vector getResults(Hits h, String string) throws IOException {

       Vector resultItems = new Vector();
       int cantHits = h.length();
       if (cantHits != 0) {

           QueryParser qparser = new QueryParser("content",
               new AnalyzerHandler().getAnalyzer());
           Query query1 = null;
           try {
               query1 = qparser.parse(string);
           } catch (ParseException e1) {
               e1.printStackTrace();
           }

           QueryScorer scorer = new QueryScorer(query1);

           Highlighter highlighter = new Highlighter(scorer);

           Fragmenter fragmenter = new SimpleFragmenter(150);

           highlighter.setTextFragmenter(fragmenter);


           for (int i = 0; i < cantHits; i++) {

               org.apache.lucene.document.Document doc = h.doc(i);

               String filename = doc.get("filename");

               filename = filename.substring(filename.indexOf("/") + 1);

               String filepath = doc.get("filepath");

               Integer id = new Integer(h.id(i));

               String score = h.score(i) + "";

               int fileSize = Integer.parseInt(doc.get("filesize"));

               String title = doc.get("title");
               String summary = doc.get("sumary");

               //fragment
               String body = h.doc(i).get("content");

               TokenStream stream = new EnglishAnalyzer()
                   .tokenStream("content", new StringReader(body));

               String[] fragment = highlighter.getBestFragments(stream,
                   body, 4);
               //fragment

               if (fragment.length == 0) {
                   fragment = new String[1];
                   fragment[0] = "";
               }

               StringBuilder buffer = new StringBuilder();

               for (int j = 0; j < fragment.length; j++) {
                   buffer.append(validateCad(fragment[j]) + "...\n");
               }

               String stringFragment = buffer.toString();

               ResultItem result = new ResultItem();
               result.setFilename(filename);
               result.setFilepath(filepath);
               result.setFilesize(fileSize);
               result.setScore(Double.parseDouble(score));
               result.setFragment(fragment);
               result.setId(id);
               result.setSummary(summary);
               result.setTitle(title);
               resultItems.add(result);
           }
       }

       return resultItems;
   }


So these are the principal methods that perform the search. Could you tell me
if I am doing something wrong or inefficient?
As you can see, I do a parallel search. I have a dual Xeon machine with two
hyper-threaded 2.4 GHz CPUs and 512 MB of RAM, but when I run the parallel
search I can see from the Linux command prompt that 3 of my 4 logical CPUs
are always idle while only one is working. Why does that happen, when the
parallel searcher should keep all the CPUs busy?

I hope you can help me.
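One thing to check: the shards can only saturate the CPUs if each one is searched on its own thread, and the per-hit highlighting in getResults (which loads and re-tokenizes the full stored content for every hit) runs single-threaded afterwards and can easily dominate. A stdlib sketch of the fan-out/merge shape, with simple score maps standing in for IndexSearchers:

```java
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Search each shard on its own thread and merge per-shard hits by score.
public class ShardSearch {

    // One shard = docId -> text; score = occurrences of the query term.
    static Map<String, Integer> scoreShard(Map<String, String> docs, String term) {
        Map<String, Integer> hits = new HashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            int n = 0, i = 0;
            while ((i = e.getValue().indexOf(term, i)) >= 0) {
                n++;
                i += term.length();
            }
            if (n > 0) hits.put(e.getKey(), n);
        }
        return hits;
    }

    public static List<String> search(List<Map<String, String>> shards,
                                      String term, int topK) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();
        for (Map<String, String> shard : shards)          // one thread per shard
            futures.add(pool.submit(() -> scoreShard(shard, term)));
        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        for (Future<Map<String, Integer>> f : futures)
            merged.addAll(f.get().entrySet());
        pool.shutdown();
        merged.sort((a, b) -> b.getValue() - a.getValue()); // best score first
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < Math.min(topK, merged.size()); i++)
            ids.add(merged.get(i).getKey());
        return ids;
    }
}
```

Independent of threading, highlighting only the page of hits actually shown (instead of all of them) is usually the single biggest win for the code above.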


Big problem with big indexes

2006-10-11 Thread Ariel Isaac Romero Cartaya

Hi everybody:

I have a big problem doing parallel searches over big indexes.
I have indexed over 60,000 articles with Lucene. I have distributed the
indexes across 10 computer nodes so that each index does not exceed 60 MB. I
run parallel searches over those indexes, but I get the search results after
40 MINUTES! Then I put the indexes in memory to do the parallel searches,
but I still get the results after 3 minutes, which is far too long to wait.
How can I reduce the search time?
Could you help me please?
I need help!

Greetings


RE: graphically representing an index

2006-09-01 Thread SOMMERIA KLEIN Ariel Ext VIACCESS-BU_DRM
Hi Andrzej,
Thanks for the tip, it does what I want. You are right, though, that it's of
limited use for helping the user access data. But I'm sure it will come in
handy for my own analysis.
Best,
Ariel

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 31 August 2006 15:49
To: java-user@lucene.apache.org
Subject: Re: graphically representing an index

SOMMERIA KLEIN Ariel Ext VIACCESS-BU_DRM wrote:
 Hi all,
 I'm a newbie with Lucene and I'm looking to implement the following:
 I want to index posts from a forum, and, rather than proposing a search
 on the contents, graphically represent the contents of the index. More
 precisely, I would like to have a list of the most popular words, with a
 number next to each indicating how often they occur. 
 The icing on the cake would be to be able to click on such a word and
 get a subset of the posts including that word. 
 Can Lucene be used for this? Has anyone already implemented it? Any
 links?
 I've dug around a bit without any success, but my apologies if this has
 already been dealt with

   

See http://www.getopt.org/luke for an example of such functionality. 
However, I must disappoint you - the most frequent words in a corpus are 
quite probably also most useless words. For English these are: the, a, 
to, for, by, in, can, I, ...
 So, you will need to eliminate them from the top of the list to get any 
useful results.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






-

Privileged/Confidential information may be contained in this e-mail 
and attachments. This e-mail, including attachments, constitutes non-public 
information intended to be conveyed only to the designated recipient(s). If you 
are not an intended recipient, please delete this e-mail, including 
attachments, and notify us immediately. The unauthorized use, dissemination, 
distribution or reproduction of this e-mail, including attachments, is 
prohibited and may be unlawful. In general, the content of this e-mail and 
attachments does not constitute any form of commitment by VIACCESS SA.

-


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



graphically representing an index

2006-08-31 Thread SOMMERIA KLEIN Ariel Ext VIACCESS-BU_DRM
Hi all,
I'm a newbie with Lucene and I'm looking to implement the following:
I want to index posts from a forum, and, rather than proposing a search
on the contents, graphically represent the contents of the index. More
precisely, I would like to have a list of the most popular words, with a
number next to each indicating how often they occur. 
The icing on the cake would be to be able to click on such a word and
get a subset of the posts including that word. 
Can Lucene be used for this? Has anyone already implemented it? Any
links?
I've dug around a bit without any success, but my apologies if this has
already been dealt with.
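The frequency list itself doesn't need a Lucene index at all for a first prototype; counting with a stopword filter (the stopword set below is a tiny illustrative sample, not a full list) already gives the word/count pairs to render:

```java
import java.util.*;

// Top-N word frequencies across a set of posts, minus stopwords.
public class WordCloud {
    static final Set<String> STOP = Set.of("the", "a", "to", "for", "in", "i");

    public static List<Map.Entry<String, Integer>> topWords(List<String> posts,
                                                            int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String post : posts)
            for (String w : post.toLowerCase().split("\\W+"))
                if (!w.isEmpty() && !STOP.contains(w))
                    freq.merge(w, 1, Integer::sum);
        List<Map.Entry<String, Integer>> top = new ArrayList<>(freq.entrySet());
        top.sort((x, y) -> y.getValue() - x.getValue()); // most frequent first
        return top.subList(0, Math.min(n, top.size()));
    }
}
```

With a Lucene index already built, the same numbers come from walking TermEnum/docFreq (as Luke does), and the "click a word to see its posts" part is then just a TermQuery.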




How to merge lucene indexes ???

2006-05-15 Thread Ariel Isaac Romero

 Hi everybody:

  I need to know how to merge one index into another.

  I have a master index to which indexes from other nodes are added. I want
to merge the indexes from the other nodes into the master index, so I wrote
this method:

 public void merge(String MasterIndexDir, String IndexToMerge) {

   FSDirectory fsDir;
   try {
       // Note: this must open the master index directory (the original
       // code opened an unrelated "IndexDir" here).
       fsDir = FSDirectory.getDirectory(MasterIndexDir, false);
       IndexReader indexToMerge = IndexReader.open(IndexToMerge);
       AnalyzerHandler analyzer = new AnalyzerHandler();
       IndexWriter fsWriter = new IndexWriter(fsDir,
           analyzer.getAnalyzer(), false);

       fsWriter.addIndexes(new IndexReader[] {indexToMerge});
       fsWriter.close();
       indexToMerge.close();

   } catch (IOException e) {
       System.err.println(e.getMessage());
       e.printStackTrace();
   }
 }

But with this method I get the following exception:

Lock obtain timed out: [EMAIL PROTECTED]:\DOCUME~1\a\LOCALS~1\Temp\lucene-
f9488d465badf2bf80c713184c580f65-write.lock
java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]
:\DOCUME~1\aromero\LOCALS~1\Temp\lucene-
f9488d465badf2bf80c713184c580f65-write.lock
   at org.apache.lucene.store.Lock.obtain(Lock.java:58)
   at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:223)
   at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:213)
   at cu.co.cenatav.kernel.indexing.MergeIndexes.merge(MergeIndexes.java:18)
   at cu.co.cenatav.kernel.indexing.MergeIndexes.main(MergeIndexes.java:36)

Could you help me? I don't know why this is happening.

Sorry for my English.
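As a side note, that lock timeout usually means another IndexWriter (or a previous one that was never closed) still holds the write lock on the master index. Conceptually, what addIndexes performs is a per-term postings merge; a stdlib sketch of just that step, with the docId remapping a real segment merge also does omitted:

```java
import java.util.*;

// Merge two term -> sorted-docId-list maps into one, as index merging does.
public class PostingsMerge {
    public static Map<String, List<Integer>> merge(
            Map<String, List<Integer>> a, Map<String, List<Integer>> b) {
        Map<String, List<Integer>> out = new TreeMap<>(); // terms stay sorted
        for (Map<String, List<Integer>> idx : List.of(a, b))
            for (Map.Entry<String, List<Integer>> e : idx.entrySet())
                out.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .addAll(e.getValue());
        for (List<Integer> postings : out.values())
            Collections.sort(postings);                   // docIds stay sorted
        return out;
    }
}
```

Because the merge rewrites postings, only one writer may hold the target index at a time, which is exactly what the write lock enforces.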


Re: How to merge lucene indexes ???

2006-05-15 Thread Ariel Isaac Romero

I don't think that solves my problem, because the line that throws the
exception is this one:

IndexWriter fsWriter = new IndexWriter(fsDir, analyzer.getAnalyzer(),
false);

Besides, if I create a new master index each time I merge, I would lose the
other indexes I had merged into the master index before; that's why I can't
pass true as the boolean parameter.
I really need help, please.
I'm open to any suggestion.


On 5/15/06, Daniel Naber [EMAIL PROTECTED] wrote:


On Montag 15 Mai 2006 19:51, Ariel Isaac Romero wrote:

 IndexReader indexToMerge =
 IndexReader.open(IndexToMerge); AnalyzerHandler analyzer = new
 AnalyzerHandler(); IndexWriter fsWriter = new IndexWriter(fsDir,
 analyzer.getAnalyzer(), false);

Don't open a reader, supply an array of Directories instead and use an
IndexWriter that creates a new index (true as last parameter).

Regards
Daniel

--
http://www.danielnaber.de
