Hi,
I've been trying to adjust the weightings for my searches (thanks
Chris for his replies on that thread), and have been using
ConstantScoreQuery to even out scores from portions in my query that I
want to match but not to contribute to the ranking of that result.
I convert a
can't access the file:
Forbidden
Remote Host: [62.172.205.164]
You do not have permission to access
http://cdoronc.20m.com/tmp/indexingThreads.zip
Data files must be stored on the same site they are linked from.
Thank you for using 20m.com
Yes, this can easily be done using the TokenStream class and hence getting
the BestTokens.
But of course you have to have this content in the index.
DONE
Ramesh Reddy
On Wed, 2006-07-12 at 12:43 +0100, Mike Streeton wrote:
The simplest solution is always the best - when storing the page,
Are you using the StandardAnalyzer at the time of Indexing?
Which one do you use at the time of Querying?
Ramesh Reddy
On Mon, 2006-07-10 at 18:37 -0700, Chris Hostetter wrote:
: I'm storing a field in an index with that option
: (Field.Index.UN_TOKENIZED).
the key to understanding your
I've done a bit of testing with accented characters (Croatian, to be
specific) and can't really explain what I see when I explore the index
with luke.
I've used accented characters in directory names, file names and file contents.
Now, in the list of terms (in Top ranking terms, Overview tab) I
On Wed, 2006-05-24 at 13:11 +0530, Vikas Khengare wrote:
So when I type “L” it will give me search options names which will
start from “L”. Then when I will type “Lu” then it should give me
options for names which are starting from “Lu”. so on ……
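The prefix behaviour described here ("L", then "Lu") can be sketched in plain Java without any Lucene dependency. This is only an illustration under assumptions: `PrefixSuggester` and its methods are hypothetical names, the lookup is case-sensitive, and the names are held in memory rather than read from an index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: prefix lookup over a sorted set of names.
final class PrefixSuggester {
    private final TreeSet<String> names = new TreeSet<>();

    void add(String name) {
        names.add(name);
    }

    // Returns up to 'limit' names that start with 'prefix', in sorted order.
    List<String> suggest(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        // tailSet jumps to the first entry >= prefix; stop once the prefix no longer matches.
        for (String candidate : names.tailSet(prefix, true)) {
            if (!candidate.startsWith(prefix) || out.size() >= limit) {
                break;
            }
            out.add(candidate);
        }
        return out;
    }
}
```

Each keystroke would call `suggest` again with the longer prefix, so the candidate list narrows as the user types.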
Vikas,
the Jira now contains code that does
Another option is to use Sun's free and soon to be open source Java Studio
Creator2. It's a great way to do JSF and provides an AJAX google suggest
type component. You can hook this component up to a lucene search and
*BOOM*...google suggest.
Here is a link to a did you mean tutorial as well (it
If you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument),
you are going to get a large String and may need a 1G heap.
If, however, you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText
Thanks.
I am using the getText(PDDocument) method of the PDFTextStripper. I will
try the other suggestion.
suba suresh.
Rob Staveley (Tom) wrote:
If you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument),
you are going to get
By 300MG I assume you mean 300MB.
You can also try extracting the text outside of lucene by using a
PDFBox command line app.
java org.pdfbox.ExtractText pdffile
You may need to increase the JVM heap size like this:
java -Xmx512m org.pdfbox.ExtractText pdffile
OR
java -Xmx1024m
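
To confirm that an -Xmx setting like the ones above actually took effect, a tiny stand-alone class can print the heap ceiling the JVM reports. This is a minimal sketch; `HeapCheck` is a hypothetical name and is not part of PDFBox or Lucene.

```java
// Minimal check of the heap ceiling the JVM is actually running with.
final class HeapCheck {
    // Maximum heap the JVM will attempt to use, in whole megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

Running it with the same flag as the extraction command (for example `java -Xmx512m HeapCheck`) should report a value close to the requested size; the exact figure can differ slightly between JVMs.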
Let us know how you get on. There are a lot of people fighting very similar
battles on this list.
-----Original Message-----
From: Suba Suresh [mailto:[EMAIL PROTECTED]
Sent: 13 July 2006 15:30
To: java-user@lucene.apache.org
Subject: Re: Out of memory error
Thanks.
I am using the
Definitely. Thanks for both the suggestions. Yes, it is 300MB (typo).
suba suresh.
Rob Staveley (Tom) wrote:
Let us know how you get on. There are a lot of people fighting very similar
battles on this list.
-----Original Message-----
From: Suba Suresh [mailto:[EMAIL PROTECTED]
Sent: 13 July
Bok Tomi,
What do you mean by terms are misrepresented? What should they be, and what
are you seeing?
What I'm not clear on is how can I see the problematic *terms* in the list of
terms, but not the documents they're stored in?
Are you saying that the content got indexed, but the file
can't access the file:
http://cdoronc.20m.com/tmp/indexingThreads.zip
Yes, this Web host sometimes behaves strangely when a link is clicked from a
mail program. Please try copying
cdoronc.20m.com/tmp
into the Web browser (e.g. Firefox) and pressing Enter.
This should show the content of that tmp folder,
Hi,
I am sure this is a question that has been asked before. :-) I have done some research
too, but still don't quite understand. I indexed 20 terms under the field name
mesh, and set the boost accordingly from 20 to 1 (just some arbitrary
numbers) But when I checked the index from Luke, the boosts all
On 7/13/06, Zhao, Xin [EMAIL PROTECTED] wrote:
Hi,
I am sure this is a question that has been asked before. :-) I have done some research too, but still don't quite
understand. I indexed 20 terms under the field name mesh, and set the boost accordingly from 20
to 1 (just some arbitrary numbers). But when I
Since I cannot seem to access the HTMLParser mailing list and I saw the
library recommended here, I thought someone here that has used it
successfully can help me out.
I have HTML text stored in a database field which I want to add to a
Lucene document, but I want to remove the HTML tags, so
Here is a use case I am trying to address.
I have two separate indexes, which contain sets of the same document
pool/corpus.
The two indexes have a different set of indexed fields.
One of the indexed fields is an external DocumentID.
I would like to perform searches, like a relational join,
Though I'm a newbie (which means I may be completely wrong), I don't
think this is possible out of the box. The quickest would be to
write a filter which looks up document id's in the first index and
applies this to the second index to get the desired subset to search
over.
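
The filter idea can be sketched in plain Java, independent of any Lucene version. Everything here is an assumption made for illustration: `IdJoinFilter` is a hypothetical name, the external DocumentIDs matched in the first index are assumed to be already collected into a set, and the second index is modelled as an array mapping internal document number to external DocumentID.

```java
import java.util.BitSet;
import java.util.Set;

// Hypothetical sketch of a join-style filter across two indexes.
final class IdJoinFilter {
    // externalIdsOfSecondIndex[doc] is the external DocumentID stored for
    // internal document number 'doc' in the second index.
    static BitSet allowedDocs(Set<String> idsMatchedInFirstIndex,
                              String[] externalIdsOfSecondIndex) {
        BitSet allowed = new BitSet(externalIdsOfSecondIndex.length);
        for (int doc = 0; doc < externalIdsOfSecondIndex.length; doc++) {
            // Keep only documents whose external ID also matched in the first index.
            if (idsMatchedInFirstIndex.contains(externalIdsOfSecondIndex[doc])) {
                allowed.set(doc);
            }
        }
        return allowed;
    }
}
```

The resulting bit set marks the subset of the second index that a search would then be restricted to. Building it costs one pass over the second index, so it is worth caching if the first query repeats.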
I may need this too,
I've never used HTMLParser, but if you have malformed, incomplete, or
optional HTML that would otherwise choke an HTML parser, you could use
Solr's HTMLStripping:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e
It's pretty stand-alone,
As I understand from earlier answers to my question,
one can create an index on machine A
and use it (search and merge with other indices) on machine B.
I was reading the file format today.
http://lucene.apache.org/java/docs/fileformats.html
The index has Byte UInt32 UInt64 in most
I think that I may be misreading the documentation.
I didn't see the description of the Long and Int type under the Primitive
Types section, while reading about the description of Byte, UInt32, UInt64,
VInt. So, for some reason I thought that Long and Int are byte
order sensitive.
Upon
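
For reference, the file formats page defines UInt32 as four bytes written high-order byte first, and VInt as a variable-length encoding with seven data bits per byte, low-order group first, where a set high bit means more bytes follow. Because the byte order is fixed by the format rather than by the CPU, the same bytes are produced on any machine. A minimal sketch of both encodings, where `IndexPrimitives` and its method names are hypothetical and this is not Lucene's own code:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of two primitive encodings from the file formats page.
final class IndexPrimitives {
    // UInt32: four bytes, high-order byte first, regardless of host CPU.
    static byte[] writeUInt32(int value) {
        return new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16), (byte) (value >>> 8), (byte) value
        };
    }

    // VInt: seven bits per byte, low-order group first; a set high bit means more bytes follow.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes a VInt from the start of 'bytes'.
    static int readVInt(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return result;
    }
}
```

Small values such as term frequency deltas fit in a single VInt byte, which is why the format uses it so widely.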