Hi,
The problem comes from PDFBox
(http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now.
However Tika doesn't yet use this version of PDFBox.
So for PDF text extraction, I don't use Tika but pdftotext.
Dominique
On 09/03/10 06:00, Robert Muir wrote:
it is an optional
Okay, I installed my Solr just like the wiki said and tried again. Here is one of
my two XML files:
/var/lib/conf/Catalina/localhost/suggest.xml
<Context docBase="/var/lib/tomcat5.5/solr.war" debug="0" crossContext="true">
<Environment name="solr/home" type="java.lang.String"
Sorry for the link to the wrong JIRA issue, I was looking at another issue.
It's here: https://issues.apache.org/jira/browse/SOLR-1813
Again you will need to apply it to trunk I think, as that's the only
place I have tested it.
--
Robert Muir
rcm...@gmail.com
Hi,
I have built an index of several million documents with all primitive type
fields, either String, text or int. I now have another multivalued field to
index for each document, which is a list of tags as a hashmap, i.e.
tags<key, value>, where key is a String and value is an int.
key is a given tag
Nor does the 3.8 version change anything!
On 3/9/10, Robert Muir rcm...@gmail.com wrote:
I think the problem is that Solr does not include the ICU4J jar, so it
won't work with Arabic PDF files.
Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
classpath.
On Mon, Mar 8,
I'm using 1.4 version of Solr
On 3/9/10, Robert Muir rcm...@gmail.com wrote:
On Tue, Mar 9, 2010 at 9:44 AM, Abdelhamid ABID aeh.a...@gmail.com
wrote:
I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with
ICU4J 3.8
Hello, what version of Solr are you using? I think
Understood. My solution was to convert any search terms with an asterisk to
lowercase prior to submitting to solr and it seems to be working correctly
now. Thanks for your help.
--
View this message in context:
http://old.nabble.com/Wildcard-questioncase-issue-tp27823332p27836740.html
Sent
I kind of suspected stemming to be the reason behind this.
But I consider stemming to be a good feature.
This is the side effect of stemming. Stemming increases recall while
harming precision.
This is a side effect of stemming, the way it is currently implemented in
Lucene. Stemming
this depends on what version of solr you are using, the trunk version
has a version of tika that supports this. See SOLR-1813
On Tue, Mar 9, 2010 at 3:59 AM, Dominique Bejean
dominique.bej...@eolya.fr wrote:
Hi,
The problem comes from PDFBox
(http://brutus.apache.org/jira/browse/PDFBOX-377)
I think Don is talking about Zoie - it requires a long uniqueKey.
On Tue, Mar 9, 2010 at 10:18 AM, Lance Norskog goks...@gmail.com wrote:
Solr unique ids can be any type. The QueryElevateComponent complains
if the unique id is not a string, but you can comment out the QEC. I
have one
Sounds like solr.HTMLStripCharFilter may work... except, I'm getting a couple
of problems:
1) HTML still seems to be getting into my content field
All I did was add <charFilter class="solr.HTMLStripCharFilterFactory"/> to the
index analyzer for my text fieldType.
2) Some it seems to have
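For context, a charFilter of the kind described above is declared ahead of the tokenizer in the index analyzer. A minimal sketch, where the fieldType name and the tokenizer/filter choices are illustrative, not taken from the poster's schema:

```xml
<!-- Illustrative fieldType; charFilters run on the raw input before the tokenizer -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that a charFilter only affects how text is analyzed for indexing and querying, not what is stored: the stored value of the content field will still contain the raw HTML, and documents indexed before the change keep their old terms until reindexed — either of which could explain HTML still showing up.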
Okay, I got it... I am stupid XD. I set my dataDir to /var/data/solr/... and
gave it the correct rights; now it runs.
Jens Kapitza-2 wrote:
On 08.03.2010 15:08, stocki wrote:
Hello.
I use 2 cores for Solr.
When I restart my Tomcat on Debian, Tomcat deletes my index.
you should
Ok I think I know where the problem is
@Deprecated
public SolrIndexWriter(String name, String path,
    DirectoryFactory dirFactory, boolean create, IndexSchema schema,
    SolrIndexConfig config) throws IOException {
  super(getDirectory(path, dirFactory, null),
On 09.03.2010 16:01 Ahmet Arslan wrote:
I kind of suspected stemming to be the reason behind this.
But I consider stemming to be a good feature.
This is the side effect of stemming. Stemming increases recall while harming
precision.
But most people want the best possible combination of
Hi,
I have indexed some documents that have title, content and keyword
(multi-value).
I want to *search* on title and content, and then, within these results *boost*
by keyword.
I have set up my qf as such:
<str name="qf">content^0.5 title^1.0</str>
And my bq as such:
Well, that's a matter of opinion, isn't it? If *your* application
requires this, you could always copy the field to a non-stemmed
field and apply boosts...
Erick
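A minimal sketch of the copy-to-unstemmed-field idea suggested above (all field and type names here are illustrative; `text_exact` is assumed to be a text type whose analyzer simply omits the stemming filter):

```xml
<!-- Stemmed field for recall, unstemmed copy for exact-match boosting -->
<field name="content" type="text" indexed="true" stored="true"/>
<field name="content_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="content" dest="content_exact"/>
```

With dismax, a qf such as `content^0.5 content_exact^2.0` would then rank documents containing the exact surface form above purely stemmed matches.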
On Tue, Mar 9, 2010 at 9:21 AM, abhishes abhis...@gmail.com wrote:
I kind of suspected stemming to be the reason behind this. But I
I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with
ICU4J 3.8
On 3/9/10, Robert Muir rcm...@gmail.com wrote:
I think the problem is that Solr does not include the ICU4J jar, so it
won't work with Arabic PDF files.
Try putting ICU4J 3.8
I kind of suspected stemming to be the reason behind this. But I consider
stemming to be a good feature.
The point is that if an exact match exists, then solr should report that
first and then stemmed results should be reported.
disabling stemming altogether would be a step in the wrong
On Tue, Mar 9, 2010 at 9:44 AM, Abdelhamid ABID aeh.a...@gmail.com wrote:
I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with
ICU4J 3.8
Hello, what version of Solr are you using? I think you will need to
use the trunk version.
I created a patch for this issue that you
I kind of suspected stemming to be the reason behind this.
But I consider stemming to be a good feature.
This is the side effect of stemming. Stemming increases recall while harming
precision.
I don't know about pdftotext; is it pluggable with Solr, or do we need to
hard-code the extraction step before Solr's turn?
On 3/9/10, Dominique Bejean dominique.bej...@eolya.fr wrote:
Hi,
The problem comes from PDFBox (
http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now.
On Tue, Mar 9, 2010 at 10:10 AM, Abdelhamid ABID aeh.a...@gmail.com wrote:
Nor does the 3.8 version change anything!
the patch (https://issues.apache.org/jira/browse/SOLR-1813) can only
work on Solr trunk. It will not work with Solr 1.4.
Solr 1.4 uses pdfbox-0.7.3.jar, which does not support
Nor does the 3.8 version change anything!
On 3/9/10, Robert Muir rcm...@gmail.com wrote:
I think the problem is that Solr does not include the ICU4J jar, so it
won't work with Arabic PDF files.
Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
classpath.
On Mon, Mar 8,
Please repost as a separate thread..
From:
http://people.apache.org/~hossman/#threadhijack
When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email. Even if you change the
subject line of your email, other mail headers still track
So right now I'm thinking that solr just doesn't like me.
I just noticed that the following document config doesn't work for me
<document>
  <entity name="product" query="select price_1,price_2,price_3,
    pv.product_id, pv.product_id as id,vendor_label_name,
On Tue, Mar 9, 2010 at 4:38 PM, abhishes abhis...@gmail.com wrote:
I am indexing a column in a database. I have chosen field type of text for
this column (this type was defined in the sample schema file which comes in
the Solr Example).
When I search for the word impress and top 3 results.
On Mon, Mar 8, 2010 at 9:39 PM, Lance Norskog goks...@gmail.com wrote:
... curl http://xen1.xcski.com:8080/solrChunk/nutch/select
that should be /update, not /select
Ah, that seems to have fixed it. Thanks.
--
http://www.linkedin.com/in/paultomblin
While attempting to work around my other issue, I'm trying to use an
embedded solr server to try to programmatically load data into solr.
It seems though that I can't deploy my app, as a result of this exception:
: java.lang.IllegalAccessError: tried to access field
tmp
Hey there,
I'm trying to figure out the best way to cut out all zeros of an input string
like 01.10. or 022.300...
Is there such a filter in Solr or anything similar that I can adapt to do the
task?
Thanks for any help
I'd read that too, but in the debug data queryBoosting is showing
matches on our int typed identifiers (though it does show it as
<str>123456</str>). Is the problem that it can match against an
integer, but it can't reorder them in the results? This seems unlikely
as using a standard query and
Dear all, I am trying to setup a master/slave index replication
with two slaves embedded in a tomcat cluster and a master kept in a separate
machine.
I would like to know if it is possible to configure slaves with a
ReplicationHandler able to access master
by starting an embedded server instead of
Yonik,
I have provided an image below that gives details on what is causing the blocked
HTTP thread. Is there any way to resolve this issue?
Thanks,
John
--
John Williams
System Administrator
37signals
inline: Screen shot 2010-03-09 at 11.22.20 AM.png
On Mar 9, 2010, at 10:41 AM, John Williams
Ah - loading the fieldcache - do you have a *lot* of unique terms in the
fields you are sorting/faceting on?
localhost:8983/solr/admin/luke is helpful for checking this.
--
- Mark
http://www.lucidimagination.com
On 03/09/2010 12:33 PM, John Williams wrote:
Yonik,
I have provided an
Ahhh, FieldCache loading... what version of Solr are you using?
It's interesting it would take that long to load too (and maxing out
one CPU - doesn't look particularly IO bound). How many documents are
in this index?
-Yonik
On Tue, Mar 9, 2010 at 12:33 PM, John Williams j...@37signals.com
The EmbeddedSolrServer doesn't have any HTTP layer, and so you can't use the HTTP
replication.
You can use the script-based replication if you're on UNIX. See:
http://wiki.apache.org/solr/CollectionDistribution
It would be worth looking at using Solr in a Jetty container and using the
http
Mark,
I am trying to load that url but its taking quite a while. I will let
you know if/when it loads.
-John
--
John Williams
System Administrator
37signals
On Mar 9, 2010, at 11:38 AM, Mark Miller wrote:
Ah - loading the fieldcache - do you have a *lot* of unique terms in the
Yonik,
We are on Solr 1.3. The total number of documents is 54173459. Let me
know if you need any additional info.
Thanks,
John
--
John Williams
System Administrator
37signals
On Mar 9, 2010, at 11:39 AM, Yonik Seeley wrote:
Ahhh, FieldCache loading... what version of Solr are you
OK Peter, regarding script-based replication: I forgot to mention I already
verified that mechanism.
When I configure the slave as follows
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8983/solr/admin/replication</str>
I'm trying to figure out the best way to cut out all zeros
of an input string like 01.10. or 022.300...
Is there such a filter in Solr or anything similar that I
can adapt to do the task?
With solr.MappingCharFilterFactory [1] you can replace all zeros with ""
before the tokenizer.
charFilter
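A sketch of what that could look like in the fieldType's analyzer (the mapping file name is illustrative):

```xml
<!-- Declared before the tokenizer; strips characters per the mapping file -->
<charFilter class="solr.MappingCharFilterFactory" mapping="remove-zeros.txt"/>
```

Here `remove-zeros.txt` (placed in the core's conf directory) would contain the single rule `"0" => ""` in the standard mapping-file syntax. Note this removes every literal `0` character, so `01.10.` becomes `1.1.` and `022.300` becomes `22.3` — whether that matches the intended "cut out all zeros" behavior depends on the poster's exact requirements.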
Thank you Mark for your help.
Now a few days later I am thinking, I need access to the SolrConfig object
in multiple classes. Maybe I should not be reloading it over and over? I
see that there is a getSolrConfig() method in the SolrCore class that will
return the SolrConfig object.
Should I
Yes - I think you should if you can. If you can make them SolrAware that
is - only certain plugin classes have the ability to do so (due to a
runtime check against a list of approved classes)
- Mark
On 03/09/2010 01:28 PM, Kimberly Kantola wrote:
Thank you Mark for your help.
Now a few days
Otis,
I've been thinking on it, and trying to figure out the different solutions
- Try to solve it doing a bridge between solr and clustering.
- Try to solve it before/during indexing
The second option, of course is better for performance, but how to do it??
I think a good option may be to
I always build the Solr index from scratch, so I have neither a pk
attribute in the entity tag (dataconfig.xml file) nor a UniqueKey in the index
schema. When I updated Solr from 1.3 to 1.4 I got the following exception
during solr initialization:
Hello all,
We have been indexing a large collection of OCR'd text. About 5 million books
in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error
rate creates a relatively large number of meaningless unique terms. (See
Can anyone suggest any practical solutions to removing some fraction of the
tokens containing OCR errors from our input stream?
one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812
and filter terms that only appear once in the document.
--
Robert Muir
Hi Dino,
I suppose you could write your own ReplicationHandler to do the replication
yourself, but I should think the effort involved would be better spent
deploying the existing Solr http replication or using a Hadoop-based
solution, or UNIX scripting.
By far, the easiest path to replication is
Hey All
I have indexed a whole bunch of documents and now I want to search against them.
My search is going great, all except highlighting.
I have these items set
hl=true
hl.snippets=2
hl.fl = attr_content
hl.fragsize=100
Everything works apart from the highlighted text found not being surrounded
Did you enable the highlighting component in solrconfig.xml? Try setting
debugQuery=true to see if the highlighting component is even being
called...
On Tue, Mar 9, 2010 at 12:23 PM, Lee Smith l...@weblee.co.uk wrote:
Hey All
I have indexed a whole bunch of documents and now I want to search
Yes it shows when I run the debug
<lst name="org.apache.solr.handler.component.HighlightComponent">
  <double name="time">0.0</double>
</lst>
Any other ideas ?
On 9 Mar 2010, at 21:06, Joe Calderon wrote:
Did you enable the highlighting component in solrconfig.xml? Try setting
debugQuery=true to see
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir rcm...@gmail.com wrote:
Can anyone suggest any practical solutions to removing some fraction of
the tokens containing OCR errors from our input stream?
one approach would be to try
http://issues.apache.org/jira/browse/LUCENE-1812
and filter
: Now a few days later I am thinking, I need access to the SolrConfig object
: in multiple classes. Maybe I should not be reloading it over and over? I
: see that there is a getSolrConfig() method in the SolrCore class that will
: return the SolrConfig object.
: Should I maybe just take all
: One technique to control commit times is to do automatic commits: you
: can configure a core to commit every N seconds (really milliseconds,
: but less than 5 minutes becomes difficult) and/or every N documents.
: This promotes a more fixed amount of work per commit.
...but increasing commit
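For reference, the automatic-commit settings described in the quoted text live in solrconfig.xml; a sketch with illustrative threshold values:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many pending documents -->
    <maxTime>60000</maxTime>  <!-- or after this many milliseconds, whichever comes first -->
  </autoCommit>
</updateHandler>
```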
Yes it shows when I run the debug
<lst name="org.apache.solr.handler.component.HighlightComponent">
  <double name="time">0.0</double>
</lst>
Any other ideas ?
Is the field attr_content stored? Are you querying this field? What happens
when you append hl.maxAnalyzedChars=-1 to your search
Unfortunately, I don't see how the KeywordTokenizerFactory could work given
the field in question is delimited text (paragraphs) and the
KeywordTokenizerFactory essentially does nothing to the inbound content.
Feel like I must be missing something . . . but can't figure out what.
Do I
: I do not believe the SOLR or LUCENE syntax allows this
At the lowest level, Solr and Lucene-Java both support any arbitrary
character you want in the field name -- it's just that several features
use syntax that doesn't play nicely with characters like whitespace in
field names.
when using
: Is there a way to remove duplicate values from the multivalued fields? I am
: using Solrj client with solr 1.4 version.
not trivially, but you could write an UpdateProcessor to do this fairly
easily, or implement it in the client.
-Hoss
: I need to implement a search where i should count the number of times
: the string appears on the search field,
:
: ie: only return articles that mention the word 'HP' at least 2x.
...
: Is there a way that SOLR does this type of operation for me?
you'd have to implement it in a
I *think* that you can use the same instanceDir for multiple cores, the
key issue being that you need to make sure they each have distinct
dataDirs (which as i recall can be done using property replacement with
the core name)
: The action CREATE creates a new core based on preexisting
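A sketch of the shared-instanceDir idea in solr.xml (names and paths are illustrative; `${solr.core.name}` is assumed here to be one of the implicit per-core properties available for substitution):

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- Both cores share one instanceDir but get distinct dataDirs
         via the core-name property -->
    <core name="core0" instanceDir="shared/" dataDir="/var/data/solr/${solr.core.name}"/>
    <core name="core1" instanceDir="shared/" dataDir="/var/data/solr/${solr.core.name}"/>
  </cores>
</solr>
```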
:
: I connected to one of my solr instances with Jconsole today and
: noticed that most of the mbeans under the solr hierarchy are missing.
: The only thing there was a Searcher, which I had no trouble seeing
: attributes for, but the rest of the statistics beans were missing.
: They all show up
I think you need to back up and tell us what you're
trying to accomplish from a higher level.
See Hossman's apache page:
Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about
: Ok I think I know where the problem is
...
: It's the constructor used by SolrCore in r772051
Ughhh... so to be clear: you haven't been using Solr 1.4 at any point in
this thread?
that explains why no one else could recreate the problem you were
describing.
For future reference:
Is there a digest mode to this list?
It's very active and helpful. I'm just not fully 'dove in' to using it yet.
Just need to look in the digests for answers to my questions.
Dennis Gearon
Signature Warning
EARTH has a Right To Life,
otherwise we all die.
Read 'Hot, Flat,
: A quick check did show me a couple of duplicates, but if I understand
: correctly, even if two different process send the same document, the last
: one should update the previous. If I send the same documents 10 times, in
: the end, it should only be in my index once, no?
it should yes ... i
: I just checked popular search services and it seems that neither
: lucidimagination search nor search-lucene support this:
it really depends on what you want to do ... most people i know who index
email want to include quoted portions in the message because it's part of
the context of the
P.S. although phrase queries with fields that do NOT
have stopwords removed feels kinda like what you're
hinting at.
Erick
On Tue, Mar 9, 2010 at 6:49 PM, Erick Erickson erickerick...@gmail.comwrote:
I think you need to back up and tell us what you're
trying to accomplish from a higher level.
Hi,
I'm trying to figure out if SOLR is the component I need and if so that
I'm asking the right questions :)
I need to index a large set of multilingual documents against a project
specific taxonomy.
From what I've read SOLR should be perfect for this.
However I'm not sure that my
I was wondering if someone could be so kind to give me some architectural
guidance.
A little about our setup. We are a RoR shop that is currently using Ferret (no
laughs please) as our search technology. Our indexing process at the moment
is quite poor as well as our search results. After some
Well, the LukeRequestHandler lets you peek at the
index, see:
http://wiki.apache.org/solr/LukeRequestHandler
warning: it'll take a bit for this to make lots of sense.
You can get a copy of Luke (google Lucene Luke) for
what the above is based on, point it at your index and
have at it.
One bit
Hello,
I wonder if anyone might have some insight/advice on index scaling for high
document count vs size deployments...
The nature of the incoming data is a steady stream of, on average, 4GB per
day. Importantly, the number of documents inserted during this time is
~7million (i.e. lots of small
: Mailing-List: contact solr-user-h...@lucene.apache.org; run by ezmlm
: Precedence: bulk
: List-Help: mailto:solr-user-h...@lucene.apache.org
...if you send mail to that address it should have info about subscribing
in digest mode.
And PS...
: Subject: digest
: In-Reply-To:
Stackoverflow.com is serving ads for open source projects:
http://meta.stackoverflow.com/questions/31913/open-source-advertising-sidebar-1h-2010
I think it would be good publicity for Solr to have a banner there... anyone
up for designing one? (if it's ok with the Solr dev team, of course)
: If I write a custom analyser that accept a specific attribut in the
: constructor
:
: public MyCustomAnalyzer(String myAttribute);
:
: Is there a way to dynamically send a value for this attribute from Solr at
: index time in the XML Message ?
:
: <add>
: <doc>
: <field name="content">
: I always build the Solr index from scratch, so I have neither a pk
: attribute in the entity tag (dataconfig.xml file) nor a UniqueKey in the
: index schema. When I updated Solr from 1.3 to 1.4 I got the following exception
: during solr initialization:
This is in fact a bug in Solr 1.4...
So the way I made my analyzer is the right one. Thank you.
hossman wrote:
: If I write a custom analyser that accept a specific attribut in the
: constructor
:
: public MyCustomAnalyzer(String myAttribute);
:
: Is there a way to dynamically send a value for this attribute from Solr
2010/3/9 Shalin Shekhar Mangar shalinman...@gmail.com
I think Don is talking about Zoie - it requires a long uniqueKey.
Yep; we're using UUIDs.