Re: questions about autocommit committing documents

2010-09-26 Thread MitchK

Hi Andy,


Andy-152 wrote:
 
 <autoCommit> 
   <maxDocs>10000</maxDocs>
   <maxTime>1000</maxTime> 
 </autoCommit>
 
 has been commented out.
 
 - With autoCommit commented out, does it mean that every new document
 indexed to Solr is being auto-committed individually? Or that they are not
 being auto-committed at all?
 
I am not sure whether there is a default value, but if not, commenting it out
would mean that you have to send a commit explicitly. 



 - If I enable autoCommit and set maxDocs at 10000, does it mean that
 my new documents won't be available for searching until 10,000 new
 documents have been added?
 
Yes, that's correct. However, you can do a commit explicitly, if you want to
do so. 
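A minimal SolrJ sketch of an explicit commit (the URL and field values are
made up for illustration; CommonsHttpSolrServer is the SolrJ client of that
era):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitExample {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            server.add(doc);
            // without this explicit commit, the document stays invisible to searches
            server.commit();
        }
    }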



 - When I add a new document to Solr, do I need to call commit explicitly?
 If so, how do I do that? 
 I look at the Solr tutorial (
 http://lucene.apache.org/solr/tutorial.html), the command used to index
 documents (java -jar post.jar solr.xml monitor.xml) doesn't include any
 explicit call to commit the documents. So I'm not sure if it's necessary.
 
 Thanks
 
Committing is necessary: an added document is not visible at query-time
until a commit has been issued. 

Kind regards,
Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p1582676.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: questions about autocommit committing documents

2010-09-26 Thread MitchK

First: Usually you do not use post.jar for updating your index. It's a simple
tool; normally you use features like the CSV- or XML-update RequestHandler.

Have a look at UpdateCSV and UpdateXMLMessages in the wiki.
There you can find examples on how to commit explicitly.

With post.jar you need to either set -Dcommit=yes or append a <commit/>
message, I think.
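For illustration, a standalone commit with the post.jar from the example
directory could look roughly like this (the flags vary by version, so treat
this as a sketch):

    java -Ddata=args -jar post.jar "<commit/>"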

Hope this helps.

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p1582846.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Swapping cores with SolrJ

2010-09-14 Thread MitchK

Hi Shaun,

I think it will be easier to fix this problem if we get more information
about what is going on in your application.
Please, could you provide the CoreAdminResponse returned by car.process()
for us?

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Swapping-cores-with-SolrJ-tp1472154p1473435.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr CoreAdmin create ignores dataDir Parameter

2010-09-10 Thread MitchK

Frank,

have a look at SOLR-646.

Do you think a workaround for the dataDir-tag in the solrconfig.xml can
help?
I think about something like <dataDir>${solr./data/corename}</dataDir> for
illustration.

Unfortunately I am not very skilled in working with solr's variables and
therefore I do not know what variables are available. 

If we find a solution, we should provide it as a suggestion at the wiki's
CoreAdmin-page.

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-CoreAdmin-create-ignores-dataDir-Parameter-tp1451665p1454705.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-07 Thread MitchK

What if we do not care about the version of a document at index-time?

When it comes to distributed search, we currently deduplicate aggregated
documents based on their uniqueKey. But what if we additionally decided on
uniqueKey plus indexingDate, so that we only aggregate the last indexed
version of a document?

The concept could look like this:
When Solr aggregates the documents for a response, it could record which
shard responded with an older version of document x. 

Now a crawler can crawl through our SolrCloud, asking each shard whether
it noticed a "shard y got an older version of doc x"-case.
The crawler aggregates this information. After it finishes crawling, it
sends delete-by-query requests to those shards which hold older versions of
documents than they should. 

For better understanding, I will call these stored document versions that
are older than the newest version ODVs (Old Document Versions). 

So, what can happen:
Before the crawler can visit shard A - which noticed that shard y stores an
ODV of doc x - shard A can go down. That's okay, because either another
shard noticed the same, or shard A will be available later on. If this
information is stored on disk, it will also be available. If it was
stored in RAM, the information is lost... however, you could replicate this
information over more than one shard, right? :-)

Another case:
Shard y can go down - so someone has to take care of storing the noticed
ODV-information, so that one can delete the document when shard y comes
back.
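A very rough Java sketch of the crawler's cleanup step described above
(OdvRecord and the indexingDate field are invented names for illustration;
deleteByQuery and commit are standard SolrJ calls):

    import java.util.List;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrServer;

    class OdvRecord {
        String shardUrl;    // shard that responded with the old version
        String uniqueKey;   // uniqueKey of the outdated document
        long newestVersion; // newest indexingDate seen for that key
    }

    class OdvCrawler {
        // delete every version that is older than the newest known one
        void cleanup(List<OdvRecord> records, Map<String, SolrServer> shards) throws Exception {
            for (OdvRecord r : records) {
                SolrServer shard = shards.get(r.shardUrl);
                shard.deleteByQuery("id:" + r.uniqueKey
                        + " AND indexingDate:[* TO " + (r.newestVersion - 1) + "]");
                shard.commit();
            }
        }
    }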

Pros:
- You can do something like consistent hashing in connection with a concept
where each node has to care for its neighbour-nodes. This is because only
the neighbour nodes can store ODVs.

- Using the described concept, you can do nightly batches, looking for ODVs
in the neighbour-nodes.

- ODVs will be found at request time, so we can avoid returning ODVs
instead of newer versions.

Cons:
- We are wasting disc space.

- This works only for smaller clusters, not for large ones where the number
of machines changes very frequently

... this is just another idea - and it is very very lazy.

I must emphasize that I assume that neighbour-machines do not go down very
frequently. Of course, it is not a question of whether a machine crashes, but
when it crashes - but I assume that the same server does not crash every
hour. :-)

Thoughts?

Kind regards


Andrzej Bialecki wrote:
 
 On 2010-09-06 16:41, Yonik Seeley wrote:
 On Mon, Sep 6, 2010 at 10:18 AM, MitchK <mitc...@web.de> wrote:
 [...consistent hashing...]
 But it doesn't solve the problem at all, correct me if I am wrong, but: If
 you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
 current resource X, then doc x will be indexed at IP3-1 - even if IP2-1
 holds the older version.
 Am I right?

 Right.  You still need code to handle migration.

 Consistent hashing is a way for everyone to be able to agree on the
 mapping, and for the mapping to change incrementally.  i.e. you add a
 node and it only changes the docid-node mapping of a limited percent
 of the mappings, rather than changing the mappings of potentially
 everything, as a simple MOD would do.
 
 Another strategy to avoid excessive reindexing is to keep splitting the 
 largest shards, and then your mapping becomes a regular MOD plus a list 
 of these additional splits. Really, there's an infinite number of ways 
 you could implement this...
 

 For SolrCloud, I don't think we'll end up using consistent hashing -
 we don't need it (although some of the concepts may still be useful).
 
 I imagine there could be situations where a simple MOD won't do ;) so I 
 think it would be good to hide this strategy behind an 
 interface/abstract class. It costs nothing, and gives you flexibility in 
 how you implement this mapping.
 
 -- 
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434329.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-07 Thread MitchK

I must add something to my last post:

When saying it could be used together with techniques like consistent
hashing, I mean it could be used at indexing time for indexing documents,
since I assumed that the number of shards does not change frequently and
therefore an ODV-case becomes relatively infrequent. Furthermore the
overhead of searching for and removing those ODV-documents is relatively
low. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434364.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: anyone use hadoop+solr?

2010-09-06 Thread MitchK

Thanks for your detailed feedback, Andrzej!

From what I understood, SOLR-1301 becomes obsolete once Solr becomes
cloud-ready, right?



 Looking into the future: eventually, when SolrCloud arrives we will be 
 able to index straight to a SolrCloud cluster, assigning documents to 
 shards through a hashing schema (e.g. 'md5(docId) % numShards')
 
Hm, let's say md5(docId) would reduce to a value of 7 (it won't, but let's
assume it).
If I got a constant number of shards, the doc will be published to the same
shard again and again.

i.e.: 7 % numShards(5) = 2 - the doc will be indexed at shard 2.

A few days later the rest of the cluster is available, now it looks like

7 % numShards(10) = 7 - the doc will be indexed at shard 7... and what
about the older version at shard 2? I am no expert when it comes to
cloud computing and the other stuff.
If you can point me to one or another reference where I can read about it,
it would help me a lot, since I only want to understand how it works at the
moment.
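A tiny Java sketch of that resharding problem (the hash value 7 is made up;
md5 would of course produce something much larger):

    public class ModMapping {
        public static void main(String[] args) {
            int hash = 7;                  // pretend md5(docId) reduced to 7
            System.out.println(hash % 5);  // 5 shards:  the doc lands on shard 2
            System.out.println(hash % 10); // 10 shards: the doc lands on shard 7
            // the old version on shard 2 stays there unless it is migrated or deleted
        }
    }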

The problem with Solr is its lack of documentation in some classes and the
lack of encapsulating some very complex things into different methods or
extra classes. Of course, this is because it costs some extra time to do so,
but it makes understanding and modifying things very complicated if you do
not understand what's going on from a theoretical point of view.

Since the cloud-feature will be complex, a lack of documentation and no
understanding of the theory behind the code will make contributing back
very, very complicated.

Thank you :-)
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1425986.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: anyone use hadoop+solr?

2010-09-06 Thread MitchK

Yonik,

are there any discussions about SolrCloud-indexing?

I would be glad to join them, if I find some interesting papers about that
topic.

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1426469.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread MitchK

Andrzej,

thank you for sharing your experiences.



 b) use consistent hashing as the mapping schema to assign documents to a 
 changing number of shards. There are many explanations of this schema on 
 the net, here's one that is very simple: 
 
Boom. 
With the given explanation, I understand it as the following:
You can use Hadoop and do some map-reduce jobs per CSV file.
At the reducer side, the reducer looks at the id of the current doc
and creates a hash of it.
Now it looks inside a SortedSet, picks the next-best server, and checks in a
map whether this server has free capacity or not. That's cool.
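A minimal Java sketch of that lookup, assuming the server positions on the
ring are precomputed hashes (the hash function itself and the capacity check
are left out):

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class Ring {
        private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();

        void addServer(int position, String server) {
            ring.put(position, server);
        }

        // pick the "next-best" server clockwise from the doc's hash
        String pickServer(int docHash) {
            SortedMap<Integer, String> tail = ring.tailMap(docHash);
            Integer key = tail.isEmpty() ? ring.firstKey() : tail.firstKey(); // wrap around
            return ring.get(key);
        }
    }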

But it doesn't solve the problem at all, correct me if I am wrong, but: If
you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
current resource X, then doc x will be indexed at IP3-1 - even if IP2-1
holds the older version. 
Am I right?

Thank you for sharing the paper. I will have a look for more like this. 



 In this case the lack of good docs and user-level API can be blamed on 
 the fact that this functionality is still under heavy development. 
 
I do not only mean documentation at the user level, but also inside a class
where some complicated stuff is going on. 

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1426728.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Show a facet filter All

2010-09-05 Thread MitchK

Peter,

take a close look at tagging and excluding filters:
http://wiki.apache.org/solr/SimpleFacetParameters#LocalParams_for_faceting

Another way would be to index your services_raw as 
services_raw/Exclusive rental
services_raw/Fotoreport
services_raw/Live music

In this case, you can use the facet.prefix param to get all the
services_raw/*-values.
I am not sure, but maybe even * is a valid prefix - then you do not need
such extra work.
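For illustration (parameter spelling taken from the SimpleFacetParameters
wiki page; the field name is from your mail):
...&facet=true&facet.field=services_raw&facet.prefix=services_raw/
should then return only the services_raw/*-values.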

If all your documents include a services_raw-field, then this facet wouldn't
make much sense, since it is applicable to all the documents, isn't it? 

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Show-a-facet-filter-All-tp1421248p1421539.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: anyone use hadoop+solr?

2010-09-04 Thread MitchK

Hi,

this topic started a few months ago; however, there are some questions from
my side that I couldn't answer by looking at the SOLR-1301 issue or the
wiki pages.

Let me try to explain my thoughts:
Given: a Hadoop-cluster, a solr-search-cluster and nutch as a
crawling-engine which also performs LinkRank and webgraph-related tasks.

Once a list of documents is created by Nutch, you put the list + the
LinkRank-values etc. into a Solr+Hadoop job as described in
SOLR-1301 to index or reindex the given documents.
When the shards are built, they will be sent over the network to the
solr-search-cluster.
Is this description correct?

What makes me think is:
Assume I got a document X on machine Y in shard Y... 
When I reindex that document X together with lots of other documents that
are present or not present in Shard Y... and I put the resulting shard on a
machine Z, how does machine Y notice that it has got an older version of
document X than machine Z?

Furthermore: Go on and assume that shard Y was replicated to three other
machines; how do they all notice that their version of document X is not
the newest available one?
In such an environment, we do not have a master (right?). So: how do we
keep the index as consistent as possible?

Thank you for clarifying. 

Kind regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1418140.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: full control over norm values?

2010-08-27 Thread MitchK

Hi Michael,

have a look at SweetSpotSimilarity (Lucene).

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/full-control-over-norm-values-tp1366910p1367462.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why it's boosted up?

2010-08-24 Thread MitchK

Hi Scott,



 (so  shorter fields are automatically boosted up).  
 
The theory behind that is the following (in easy words):
Let's say you got two documents, each doc containing one field (like it was
in my example).
Additionally we got a query that contains two words.
Let's say doc1 contains 10 words and doc2 contains 20 words.
The query matches both docs with both words.
The idea of boosting shorter fields more strongly than longer fields is the
following:
In doc1, 2/10 = 0.2 = 20% of the words are matching your query.
In doc2, 2/20 = 0.1 = 10% of the words are matching your query.

So doc1 should get a better score, because the rate of matching words vs the
total number of occurring words is greater than in doc2.
This is the idea of using norms as an index-time boosting factor. NOTE: This
does not mean that doc1 gets boosted by 20% and doc2 by 10%! It only
illustrates the idea behind such norms.

From the similarity-class's documentation of lengthNorm():



 Matches in longer fields are less precise, so implementations of this
 method usually return smaller values when numTokens is large, and larger
 values when numTokens is small.
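For reference, Lucene's DefaultSimilarity implements lengthNorm() as
1/sqrt(numTokens), so the 10-word doc above would get a norm of about 0.32
and the 20-word doc about 0.22.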
 

However, you, as the search-application developer, have the task of deciding
whether this theory applies to your application or not. In some
cases using norms makes no sense, in others it does. 
If you think that norms apply to your project, omitting them is no
good approach to saving disk space.
Furthermore: If you think the theory does apply to the business needs of
your application but its impact is currently too heavy, you can have a look
at the SweetSpotSimilarity in Lucene. 



 The request is from our business team, they wish user of our product can 
 type in partial string of a word that exists in title or body field.
 
You mean something like typing "note" and also getting results like
"notebook"?
The correct approach for something like that is not the ShingleFilter but
NGrams or edge NGrams.
Shingles do something like this:
"This is my shingle sentence" - "This is", "is my", "my shingle", "shingle
sentence" - it breaks up the sentence into smaller pieces. The benefit of
doing so is that, if a query matches one of these shingles, you have found
a short phrase without using the performance-consuming phraseQuery feature.
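For illustration (filter name and parameters assumed, check your version):
an edge-n-gram filter such as solr.EdgeNGramFilterFactory with minGramSize=2
would index "notebook" as no, not, note, noteb, notebo, noteboo, notebook -
so the query "note" matches.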

Kind regards,
- Mitch


scott chu wrote:
 
 In Lucene's web page, there's a paragraph:
 
 Indexing time boosts are preprocessed for storage efficiency and written
 to 
 the directory (when writing the document) in a single byte (!) as follows: 
 For each field of a document, all boosts of that field (i.e. all boosts 
 under the same field name in that doc) are multiplied. The result is 
 multiplied by the boost of the document, and also multiplied by a field 
 length norm value that represents the length of that field in that doc
 (so 
 shorter fields are automatically boosted up). 
 
 I thought the greater the value, the stronger the boost. Then why are short
 fields 
 boosted up? Isn't the norm value for short fields smaller?
 
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1306419.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr creates whitespace in dismax query

2010-08-24 Thread MitchK

Johann,

try to remove the wordDelimiterFilter from the query-analyzer of your
fieldType.
If your index-analyzer-wordDelimiterFilter is well configured, it will find
everything you want. 

Does this solve the problem?

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-creates-whitespace-in-dismax-query-tp1317196p1318759.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Doing Shingle but also keep special single word

2010-08-23 Thread MitchK

No, I mean that you use an additional field (indexed) for searching (i.e.
whitespace-tokenized, so every word - separated by whitespace - becomes a
token).
So you have got two fields (shingle-token-field and single-token-field)
and you can search across both fields.
This provides several benefits: e.g. you can boost the shingle-field at
query-time, since a match in a shingle-field means that an exact phrase
matched.

Additionally: You can search with single-word queries as well as
multi-word queries.
Furthermore you can apply synonyms to your single-token-field. 

If you want to keep your index as small as possible but as large as needed,
try to understand Lucene's similarity implementation to consider whether
you can set the field option omitNorms=true or
omitTermFreqAndPositions=true. 
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/Similarity.html
Keep in mind what happens, if you omit one of those options.

A small example of the consequences of setting omitNorms=true:
doc1: "this is a short example doc"
doc2: "this is a longer example doc for presenting the effect of omitNorms"

If you are searching for "doc" while omitNorms=false, your response will look
like this:
doc1,
doc2
This is because the norm-value for doc1 is larger than the norm-value for
doc2, since doc1 is shorter than doc2 (have a look at the provided link).

If omitNorms=true, the scores for both docs will be equal.

Kind regards,
- Mitch


scott chu wrote:
 
 I don't quite understand additional-field-way? Do you mean making another 
 field that stores special words particularly but no indexing for that
 field?
 
 Scott
 
 - Original Message - 
 From: MitchK mitc...@web.de
 To: solr-user@lucene.apache.org
 Sent: Sunday, August 22, 2010 11:48 PM
 Subject: Re: Doing Shingle but also keep special single word
 
 

 Hi,

 keepword-filter is no solution for this problem, since this would lead to
 the problem that one has to manage a word-dictionary. As explained, this
 would lead to too much effort.

 You can easily add outputUnigrams=true and check out the analysis.jsp for
 this field. So you can see how much bigger a single field will become
 with
 this option.
 However, I am quite sure that the difference between using
 outputUnigrams=true and indexing in a seperate field is not noteworthy.

 I would suggest you to do it the additionally-field-way, since this would
 lead to more flexibility in boosting the different fields.

 Unfortunately, I haven't understood your explanation about the use-case. 
 But
 it sounds a little bit like tagging?

 Kind regards,
 - Mitch


 iorixxx wrote:

 Isn't set outputUnigrams=true will
 make index size about twice than when it's set to false?

 Sure index will be bigger. I didn't know that this is problem for you. 
 But
 if you have a list of special single words that you want to keep,
 keepwordfilter can eliminate other tokens. So index size will be okey.


 Scott

 - Original Message - From: Ahmet Arslan iori...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Saturday, August 21, 2010 1:15 AM
 Subject: Re: Doing Shingle but also keep special single
 word


  I am building index with Shingle
  filter. We know it's minimum 2-gram but I also
 want keep
  some special single word, e.g. IBM, Microsoft,
 etc. i.e. I
  want to do a minimum 2-gram but also want to have
 these
  single word in my index, Is it possible?
 
  outputUnigrams=true parameter does not work for
 you?
 
   After that you can use <filter
  class="solr.KeepWordFilterFactory" words="keepwords.txt"
  ignoreCase="true"/> with keepwords.txt = IBM, Microsoft.
 
 
 
 







 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 
 
 
 
 
 
 No virus found in incoming message.
 Checked by AVG - www.avg.com
 Version: 9.0.851 / Virus Database: 271.1.1/3083 - Release Date: 08/20/10 
 14:35:00
 
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1300497.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to limit rows to which highlighting applies

2010-08-22 Thread MitchK

Alex,

it sounds like it would make sense.
Use cases could be i.e. clustering or similar techniques.
However, in my opinion the point of view for such a modification is not the
right.

I.e. one wants to have got several resultsets. I could imagine that one does
a primary-query (the query for the displayed results) and a query to compute
clustering-results. 
Now, you want to do different things with the result-sets.
The primary-query needs faceting, highlighting, spellcheck and much more,
wheareas the additional query only needs clustering or something like that.
In your case, you do not want to apply highlighting for the whole set, since
you do not need such information for every row.

This is a general problem, and I think a solution that makes it possible to
create more than one result set for a single Solr request would be
applicable to more general use cases.

What do you think?

Kind regards,
- Mitch


Alex Baranau wrote:
 
 Hello Solr users and devs!
 
 Is there a way to limit number of rows to which highlighting applies? I
 don't see any hl.rows or similar parameter description, so it looks like
 I
 need to enhance HighlightComponent to enable that. If it is not possible
 currently, do you think it's worth adding such possibility?
 
 JFI my case, when I need this: I display on results page 20, 10 or 5 rows
 only, but I need much more rows (100-500) to display additional data on
 the
 same page. Queries could be very complex and their execution time
 (QueryComponent) is quite big. So I do want to fetch things via single
 request. However, I noticed that with increasing number of rows, time
 spent
 in HighlightComponent increases dramatically. For those additional
 hundreds
 of rows I don't need highlighting at all.
 
 Actually, *ideally* it would be great to have the ability to specify
 fields
 returned for those extra rows as well. So I tend to think that adding this
 features should not be based on changing HighlightComponent behaviour, but
 changing QueryComponent or even bigger part somehow so that Solr query
 accepts specifying extra group(s) of rows for fetching along with params
 for
 them (which not influence the searching process, like
 formatting/highlighting, fields to return, etc.). Thus, we could execute
 *one* search query and fetch different data for different purposes.
 
 Does this all make sense to you guys?
 
 Thank you,
 Alex Baranau
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
 Lucene ecosystem search ::
 http://search-lucene.com/http://search-hadoop.com/
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-limit-rows-to-which-highlighting-applies-tp1274042p1275962.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Doing Shingle but also keep special single word

2010-08-22 Thread MitchK

Hi,

keepword-filter is no solution for this problem, since this would lead to
the problem that one has to manage a word-dictionary. As explained, this
would lead to too much effort.

You can easily add outputUnigrams=true and check out the analysis.jsp for
this field. So you can see how much bigger a single field will become with
this option.
However, I am quite sure that the difference between using
outputUnigrams=true and indexing in a seperate field is not noteworthy.

I would suggest you do it the additional-field way, since this would
lead to more flexibility in boosting the different fields.

Unfortunately, I haven't understood your explanation about the use-case. But
it sounds a little bit like tagging?

Kind regards,
- Mitch


iorixxx wrote:
 
 Isn't set outputUnigrams=true will
 make index size about twice than when it's set to false?
 
 Sure index will be bigger. I didn't know that this is problem for you. But
 if you have a list of special single words that you want to keep,
 keepwordfilter can eliminate other tokens. So index size will be okey.
 
 
 Scott
 
 - Original Message - From: Ahmet Arslan iori...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Saturday, August 21, 2010 1:15 AM
 Subject: Re: Doing Shingle but also keep special single
 word
 
 
  I am building index with Shingle
  filter. We know it's minimum 2-gram but I also
 want keep
  some special single word, e.g. IBM, Microsoft,
 etc. i.e. I
  want to do a minimum 2-gram but also want to have
 these
  single word in my index, Is it possible?
  
  outputUnigrams=true parameter does not work for
 you?
  
  After that you can use <filter
 class="solr.KeepWordFilterFactory" words="keepwords.txt"
 ignoreCase="true"/> with keepwords.txt = IBM, Microsoft.
  
  
  
  
 
 
 
 
   
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ Response + JSON

2010-08-02 Thread MitchK

Hi,

as I promised, I want to give some feedback on transforming SolrJ's output 
into JSON with the package from json.org:


I needed to make a small modification to the package: since it stores the 
JSON key-value pairs in a HashMap, I changed this to a LinkedHashMap to 
make sure that the order of the retrieved values is the same order in which 
they were inserted into the map.


The result looks very, very pretty.

It was very easy to transform the SolrJ's output into the desired 
JSON-format and I can add now whatever I want to the response.


Kind regards,
- Mitch


RE: Boosting DisMax queries with !boost component

2010-08-02 Thread MitchK


Jonathan Rochkind wrote:
 
 qf needs to have spaces in it, unfortunately the local query parser can
 not
 deal with that, as Erik Hatcher mentioned some months ago.
 
 By local query parser, you mean what I call the LocalParams stuff (for
 lack of being sure of the proper term)?  
 
Yes, that was what I meant.

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Boosting-DisMax-queries-with-boost-component-tp1011294p1015619.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boosting DisMax queries with !boost component

2010-08-01 Thread MitchK

Hi,

qf needs to have spaces in it; unfortunately the local query parser cannot
deal with that, as Erik Hatcher mentioned some months ago.

A solution would be to do something like that:

q={!dismax qf=$yourqf}yourQuery&yourqf=title^1.0 tags^2.0

Since you are using the dismax-query-parser, you can add the boosting query
via bq-param.
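For illustration (field names made up):

q={!dismax qf=$yourqf}yourQuery&yourqf=title^1.0 tags^2.0&bq=popularity:[10 TO *]^2.0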

Hope this helps,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Boosting-DisMax-queries-with-boost-component-tp1011294p1014242.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Nabble problems?

2010-07-29 Thread MitchK

I got some problems with Nabble, too.
Nabble sends some warnings that my posts are still pending at the
mailing list, while people were already answering my initial questions.

Did you send a message to the nabble-support?

Kind regards,
- Mitch 

kenf_nc wrote:
 
 The Nabble.com page for Solr - User seems to be broken. I haven't seen an
 update on it since early this morning. However I'm still getting email
 notifications so people are seeing and responding to posts. I'm just
 curious, are you just using email and responding to
 solr-u...@lucene.apache.org? Or is there a mirror site that *is* working
 for the Solr User forum?
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Nabble-problems-tp1004870p1004992.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrJ Response + JSON

2010-07-28 Thread MitchK

Hello community,

I need to transform SolrJ responses into JSON, after some computing on
those results by another application has finished.

I cannot do those computations on the Solr side.

So, I really have to translate SolrJ's output into JSON.

Any experiences how to do so without writing your own JSON-writer?

Thank you.
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-Response-JSON-tp1002024p1002024.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrJ Response + JSON

2010-07-28 Thread MitchK

Hello, 

Second try to send a mail to the mailing list... 

I need to translate SolrJ's response into JSON-response.
I cannot query Solr directly, because I need to do some math with the
returned data, before I show the results to the client.

Any experiences how to translate SolrJ's response into JSON without writing
your own JSON Writer?

Thank you. 
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-Response-JSON-tp1002115p1002115.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ Response + JSON

2010-07-28 Thread MitchK

Thank you Markus, Mark.

Seems to be a problem with Nabble, not with the mailing list. Sorry.

I can create a JSON response when I query Solr directly.
But I mean that I query Solr through a SolrJ client 
(CommonsHttpSolrServer).
That means my queries look a little bit like this: 
http://wiki.apache.org/solr/Solrj#Reading_Data_from_Solr

So the response is shown as an QueryResponse-object, not as a JSON-string.

Or do I miss something here?

On 28.07.2010 15:15, Markus Jelsma wrote:

Hi,

I got a response to your e-mail in my box 30 minutes ago. Anyway, enable the
JSONResponseWriter, if you haven't already, and query with wt=json. Can't get
much easier.

Cheers,

On Wednesday 28 July 2010 15:08:26 MitchK wrote:
   

Hello ,

Second try to send a mail to the mailing list...

I need to translate SolrJ's response into JSON-response.
I can not query Solr directly, because I need to do some math with the
responsed data, before I show the results to the client.

Any experiences how to translate SolrJ's response into JSON without writing
your own JSON Writer?

Thank you.
- Mitch

 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


   




Re: SolrJ Response + JSON

2010-07-28 Thread MitchK

Thank you, Chantal.

I have looked at this one: http://www.json.org/java/index.html

This seems to be an easy-to-understand-implementation.

However, I am wondering how to determine whether a SolrDocument's field 
is multiValued or not.
The JSONResponseWriter of Solr looks at the schema configuration. 
However, the client shouldn't do that.

How did you solve that problem?

Thanks for sharing ideas.

- Mitch


On 28.07.2010 15:35, Chantal Ackermann wrote:

You could use org.apache.solr.handler.JsonLoader.
That one uses org.apache.noggit.JSONParser internally.
I've used the JacksonParser with Spring.

http://json.org/ lists parsers for different programming languages.

Cheers,
Chantal

On Wed, 2010-07-28 at 15:08 +0200, MitchK wrote:
   

Hello ,

Second try to send a mail to the mailing list...

I need to translate SolrJ's response into JSON-response.
I can not query Solr directly, because I need to do some math with the
responsed data, before I show the results to the client.

Any experiences how to translate SolrJ's response into JSON without writing
your own JSON Writer?

Thank you.
- Mitch
 



   




Re: SolrJ Response + JSON

2010-07-28 Thread MitchK

Hi Chantal,

thank you for the feedback.
I did not see the wood for the trees!
The SolrDocument's javadoc says the following: 
http://lucene.apache.org/solr/api/org/apache/solr/common/SolrDocument.html


getFieldValue(String name)

  Get the value or collection of values for a given field.

The magical word here is that little "or". :-)

I will try that tomorrow and give you a feedback!
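Here is a minimal sketch of what I have in mind (assuming the json.org
classes and SolrJ's SolrDocument; multivalued fields arrive as a Collection):

    import java.util.Collection;
    import org.apache.solr.common.SolrDocument;
    import org.json.JSONArray;
    import org.json.JSONException;
    import org.json.JSONObject;

    public class DocToJson {
        // single values go in as-is; collections become JSON arrays
        static JSONObject toJson(SolrDocument doc) throws JSONException {
            JSONObject json = new JSONObject();
            for (String field : doc.getFieldNames()) {
                Object value = doc.getFieldValue(field); // value OR collection of values
                if (value instanceof Collection) {
                    json.put(field, new JSONArray((Collection<?>) value));
                } else {
                    json.put(field, value);
                }
            }
            return json;
        }
    }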


Are you sure that you cannot change the SOLR results at query time
according to your needs?


Unfortunately, it is not possible in this case.

Kind regards,
Mitch


On 28.07.2010 16:49, Chantal Ackermann wrote:

Hi Mitch

On Wed, 2010-07-28 at 16:38 +0200, MitchK wrote:
   

Thank you, Chantal.

I have looked at this one: http://www.json.org/java/index.html

This seems to be an easy-to-understand-implementation.

However, I am wondering how to determine whether a SolrDocument's field
is multiValued or not.
The JSONResponseWriter of Solr looks at the schema-configuration.
However, the client shouldn't do that.
How did you solve that problem?
 

I didn't. I'm not recreating JSON from the SolrJ results.

I would try to use the same classes that SolrJ uses, actually. (Writing
that without having a further look at the code.) I would avoid
recreating existing code as much as possible.
About multivalued fields: you need instanceof checks, I guess. The field
only contains a list if there really are multiple values. (That's what
works for my ScriptTransformer.)

Are you sure that you cannot change the SOLR results at query time
according to your needs? Maybe you should ask for that, first (ask for X
instead of Y...).

Cheers,
Chantal


   

Thanks for sharing ideas.

- Mitch


On 28.07.2010 15:35, Chantal Ackermann wrote:
 

You could use org.apache.solr.handler.JsonLoader.
That one uses org.apache.noggit.JSONParser internally.
I've used the JacksonParser with Spring.

http://json.org/ lists parsers for different programming languages.

Cheers,
Chantal

On Wed, 2010-07-28 at 15:08 +0200, MitchK wrote:

   

Hello ,

Second try to send a mail to the mailing list...

I need to translate SolrJ's response into JSON-response.
I can not query Solr directly, because I need to do some math with the
responsed data, before I show the results to the client.

Any experiences how to translate SolrJ's response into JSON without writing
your own JSON Writer?

Thank you.
- Mitch

 



   




   




Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread MitchK

Hi Chantal,



 However, with this approach indexing time went up from 20min to more 
 than 5 hours. 
 
This is 15x slower than the initial solution... wow.
From MySQL I know that IN ()-clauses are the embodiment of endlessness -
they perform very, very badly.

New idea:
Create a method which returns the query string:

String returnString(String theVIP)
{
    if (theVIP != null && !theVIP.equals(""))
    {
        return "<a query string to find the VIP>";
    }
    else
    {
        return "SELECT 1"; // you need to modify this, so that it
                           // matches your field-definition
    }
}

The main idea is to perform a blazing fast query, instead of a complex
IN-clause query.
Does this sound like a solution?



 The new approach is to query the solr index for that other database that 
 I've already setup. This is only a bit slower than the original query 
 (20min). (I'm using URLDataSource to be 1.4.1 conform.) 
 
Unfortunately I cannot follow you. 
You are querying a Solr index for a database?

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p998859.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread MitchK

Hi Chantal,

instead of:

<entity name="prog" ...> 
    <field name="vip" ... /> /* multivalued, not required */ 
    <entity name="ssc_entry" dataSource="ssc" onError="continue" 
        query="select SSC_VALUE from SSC_VALUE 
               where SSC_ATTRIBUTE_ID=1 
               and SSC_VALUE in (${prog.vip})"> 
        <field column="SSC_VALUE" name="vip_ssc" /> 
    </entity> 
</entity> 

you do:

<entity name="prog" ...> 
    <field name="vip" ... /> /* multivalued, not required */ 
    <entity name="ssc_entry" dataSource="ssc" onError="continue" 
        query="${yourCustomFunctionToReturnAQueryString(prog.vip, ..., ...)}"> 
        <field column="SSC_VALUE" name="vip_ssc" /> 
    </entity> 
</entity> 

The yourCustomFunctionToReturnAQueryString(vip, querystring1, querystring2)
could look like this:

String yourCustomFunctionToReturnAQueryString(String vip,
        String querystring1, String querystring2)
{
    if (vip != null && !vip.equals(""))
    {
        StringBuilder sb = new StringBuilder(50);
        sb.append(querystring1); // "SELECT SSC_VALUE from SSC_VALUE where
                                 //  SSC_ATTRIBUTE_ID=1 and SSC_VALUE in ("
        sb.append(vip);          // the VIP value
        sb.append(querystring2); // just the closing ")"
        return sb.toString();
    }
    else
    {
        return "SELECT '' AS yourFieldName";
    }
}

I expect that this method is called for every vip-value, if there is one.

Solr DIH uses the returned query string to query the database. So, if the
vip-value is empty or null, you can use a different query that is blazing
fast (i.e. SELECT '' AS yourFieldName - just an example to show the logic).
This query should return a row with an empty string. So Solr fills the
current field with an empty string.

I don't know how to prevent Solr from calling your ssc_entry-entity, when
vip is null or empty.
But this would be a solution to handle empty vip-strings as efficient as
possible. 



 I've realized 
 that I have to throw an exception and add the onError attribute to the 
 entity to make that work. 
 
I am curious:
Can you show how to make a method throw an exception that is accepted by
the onError-attribute?

I hope we do not talk past each other here. :-)

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p998950.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-26 Thread MitchK

Good morning,

https://issues.apache.org/jira/browse/SOLR-1632

- Mitch


Li Li wrote:
 
 where is the link of this patch?
 
 2010/7/24 Yonik Seeley yo...@lucidimagination.com:
 On Fri, Jul 23, 2010 at 2:23 PM, MitchK mitc...@web.de wrote:
 why do we do not send the output of TermsComponent of every node in the
 cluster to a Hadoop instance?
 Since TermsComponent does the map-part of the map-reduce concept, Hadoop
 only needs to reduce the stuff. Maybe we even do not need Hadoop for
 this.
 After reducing, every node in the cluster gets the current values to
 compute
 the idf.
 We can store this information in a HashMap-based SolrCache (or something
 like that) to provide constant-time access. To keep the values up to
 date,
 we can repeat that after every x minutes.

 There's already a patch in JIRA that does distributed IDF.
 Hadoop wouldn't be the right tool for that anyway... it's for batch
 oriented systems, not low-latency queries.

 If we got that, it does not care whereas we use doc_X from shard_A or
 shard_B, since they will all have got the same scores.

 That only works if the docs are exactly the same - they may not be.

 -Yonik
 http://www.lucidimagination.com

 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p995407.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Doc Lucene Doc !?

2010-07-26 Thread MitchK

Stockii,

Solr's index is a Lucene Index. Therefore, Solr documents are Lucene
documents.

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p995968.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-26 Thread MitchK

Hi Chantal,

did you try to write a custom DIH function
(http://wiki.apache.org/solr/DIHCustomFunctions)?
If not, I think this would be a solution.
Just check whether ${prog.vip} is an empty string or null.
If so, you need to replace it with a value that can never match anything.

So the vip-field will always be empty for such queries. 
Maybe that helps?

Hopefully, the variable resolver is able to resolve something like
${dih.functions.getReplacementIfNeeded(prog.vip)}.
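A rough sketch of such a function (assuming the 1.4-era DIH Evaluator API;
the class name and the fallback value are invented), registered in
data-config.xml via
<function name="getReplacementIfNeeded" class="my.pkg.GetReplacementIfNeeded"/>:

    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Evaluator;

    public class GetReplacementIfNeeded extends Evaluator {
        public String evaluate(String expression, Context context) {
            Object vip = context.getVariableResolver().resolve("prog.vip");
            if (vip == null || vip.toString().trim().length() == 0) {
                return "'__no_match__'"; // a value that can never match anything
            }
            return vip.toString();
        }
    }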

Kind regards,
- Mitch



Chantal Ackermann wrote:
 
 Hi,
 
 my use case is the following:
 
 In a sub-entity I request rows from a database for an input list of
 strings:
 <entity name="prog" ...>
   <field name="vip" ... /> /* multivalued, not required */
   <entity name="ssc_entry" dataSource="ssc" onError="continue"
       query="select SSC_VALUE from SSC_VALUE
              where SSC_ATTRIBUTE_ID=1
              and SSC_VALUE in (${prog.vip})">
     <field column="SSC_VALUE" name="vip_ssc" />
   </entity>
 </entity>
 /entity
 
 The root entity is prog and it has an optional multivalued field
 called vip. When the list of vip values is empty, the SQL for the
 sub-entity above throws an SQLException. (Working with Oracle which does
 not allow an empty expression in the in-clause.)
 
 Two things:
 (A) best would be not to run the query whenever ${prog.vip} is null or
 empty.
 (B) From the documentation, it is not clear that onError is only checked
 in the transformer runs but not checked when the SQL for the entity
 throws an exception. (Trunk version JdbcDataSource lines 250pp).
 
 IMHO, (A) is the better fix, and if so, (B) is the right decision. (If
 (A) is not easily fixable, making (B) work would be helpful.)
 
 Looking through the code, I've realized that the replacement of the
 variables is done in a very generic way. I've not yet seen an
 appropriate way to check on those variables in order to stop the
 processing of the entity if the variable is empty.
 Is there a way to do this? Or maybe there is a completely different way
 to get my use case working. Any help most appreciated!
 
 Thanks,
 Chantal
 
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p996446.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-24 Thread MitchK

Okay, but then LiLi did something wrong, right?

I mean, if the document exists only at one shard, it should get the same
score whenever one requests it, no?
Of course, this only applies if nothing gets changed between the requests.
The only remaining problem here would be that you need distributed IDF
(like in the mentioned JIRA issue) to normalize your results' scoring. 

But the mentioned problem at this mailing-list-posting has nothing to do
with that...

Regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p991907.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-23 Thread MitchK

Yonik,

why do we not send the output of TermsComponent of every node in the
cluster to a Hadoop instance?
Since TermsComponent does the map-part of the map-reduce concept, Hadoop
only needs to reduce the stuff. Maybe we do not even need Hadoop for this.
After reducing, every node in the cluster gets the current values to compute
the idf.
We can store this information in a HashMap-based SolrCache (or something
like that) to provide constant-time access. To keep the values up to date,
we can repeat that every x minutes.
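As a sketch of the reduce step in plain Java (names invented; the per-shard
counts would come from each node's TermsComponent output):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DfMerger {
        // merge per-shard document frequencies into global counts for the idf
        static Map<String, Long> merge(List<Map<String, Long>> perShard) {
            Map<String, Long> global = new HashMap<String, Long>();
            for (Map<String, Long> shard : perShard) {
                for (Map.Entry<String, Long> e : shard.entrySet()) {
                    Long old = global.get(e.getKey());
                    global.put(e.getKey(), (old == null ? 0L : old) + e.getValue());
                }
            }
            return global;
        }
    }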

If we got that, it does not matter whether we use doc_X from shard_A or
shard_B, since they will all have the same scores. 

Even if we got large indices with 10 million or more unique terms, this will
only need a few megabytes of network traffic.

Kind regards,
- Mitch


Yonik Seeley-2-2 wrote:
 
 As the comments suggest, it's not a bug, but just the best we can do
 for now since our priority queues don't support removal of arbitrary
 elements.  I guess we could rebuild the current priority queue if we
 detect a duplicate, but that will have an obvious performance impact.
 Any other suggestions?
 
 -Yonik
 http://www.lucidimagination.com
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990506.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-23 Thread MitchK

... Additionally to my previous posting:
To keep this in sync we could do two things:
Wait for every server, to make sure that everyone uses the same values to
compute the score, and then apply them.
Or: Let's say that we collect the new values every 15 minutes. To merge and
send them over the network, we declare that this will need 3 additional
minutes (we want to keep the network traffic for such actions very low, so
we do not send everything instantly).
Okay, and now we add 2 more minutes, in case 3 were not enough or
something needs a little bit more time than we thought. After those 2
minutes, every node has to apply the new values.
Pro: If one node breaks, we do not delay the application of the new
values.
Con: We need two HashMaps and both will have roughly the same size. That
means we will waste some RAM for this operation, if we do not write the
values to disk (which I do not suggest).

Thoughts?

- Mitch

MitchK wrote:
 
 Yonik,
 
 why do we not send the output of TermsComponent of every node in the
 cluster to a Hadoop instance?
 Since TermsComponent does the map-part of the map-reduce concept, Hadoop
 only needs to reduce the stuff. Maybe we do not even need Hadoop for this.
 After reducing, every node in the cluster gets the current values to
 compute the idf.
 We can store this information in a HashMap-based SolrCache (or something
 like that) to provide constant-time access. To keep the values up to date,
 we can repeat that every x minutes.
 
 If we got that, it does not matter whether we use doc_X from shard_A or
 shard_B, since they will all have the same scores. 
 
 Even if we got large indices with 10 million or more unique terms, this
 will only need a few megabytes of network traffic.
 
 Kind regards,
 - Mitch
 
 
 Yonik Seeley-2-2 wrote:
 
 As the comments suggest, it's not a bug, but just the best we can do
 for now since our priority queues don't support removal of arbitrary
 elements.  I guess we could rebuild the current priority queue if we
 detect a duplicate, but that will have an obvious performance impact.
 Any other suggestions?
 
 -Yonik
 http://www.lucidimagination.com
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990551.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-23 Thread MitchK


 That only works if the docs are exactly the same - they may not be. 

Ahm, what? Why? If the uniqueID is the same, the docs *should* be the same,
shouldn't they?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990563.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-21 Thread MitchK

Li Li,

this is the intended behaviour, not a bug.
Otherwise you could get back the same record in a response several
times, which may not be intended by the user.

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983675.html
Sent from the Solr - User mailing list archive at Nabble.com.


nested query and number of matched records

2010-07-21 Thread MitchK

Hello community,

I got a situation, where I know that some types of documents contain very
extensive information and other types are giving more general information.
Since I don't know whether a user searches for general or extensive
information (and I don't want to ask him when he uses the default search), I
want to give him a response back like this:

10 documents are type: short
1 document, if there is one, is type: extensive

An example query would look like this:
q={!dismax fq=type:short}my cool query OR {!dismax fq=type:extensive}my cool
query
The problem with this one will be that I cannot specify to retrieve up to
10 short-documents and at most one extensive one.

I think this will not work, and if I want to create such a search, I need to
do two different queries. But before I waste performance, I wanted to ask.

Thank you!
Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983756.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-21 Thread MitchK

Ah, okay. I understand your problem. Why should doc x be at position 1 when
searching for the first time, and when I search for the 2nd time it occurs
at position 8 - right?

I am not sure, but I think you can't prevent this without custom coding or
making a document's occurrence unique.

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983771.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: nested query and number of matched records

2010-07-21 Thread MitchK

Oh,... I just see, there is no direct question ;-).

How can I specify the number of returned documents in the desired way
*within* one request?

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983773.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-21 Thread MitchK

I don't know much about the code. 
Maybe you can tell me which file you are referring to?

However, from the comments one can see that the problem is known, but it was
decided to let it happen because of system requirements in the Java
version. 

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983880.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-21 Thread MitchK

It already was sorted by score.

The problem here is the following:
Shard_A and shard_B both contain doc_X.
If you are querying for something, doc_X could have a score of 1.0 at
shard_A and a score of 12.0 at shard_B.

You can never be sure which doc Solr sees first. In the bad case, Solr sees
doc_X first at shard_A and ignores it at shard_B. That means that the
doc may occur at page 10 in pagination, although it *should* occur
at page 1 or 2.

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p984743.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: nested query and number of matched records

2010-07-21 Thread MitchK

Thank you three for your feedback!

Chantal, unfortunately kenf is right. Faceting won't work in this special
case. 


 parallel calls.
 
Yes, this will be the solution. However, this would lead to a second
HTTP-request and I hoped to be able to avoid it.


Chantal Ackermann wrote:
 
 Sure SOLR supports this: use facets on the field type:
 
 add to your regular query:
 
 facet.query=truefacet.field=type
 
 see http://wiki.apache.org/solr/SimpleFacetParameters
 
 
 On Wed, 2010-07-21 at 15:48 +0200, kenf_nc wrote:
 parallel calls. simultaneously query for type:short rows=10  and
 type:extensive rows=1  and merge your results.  This would also let you
 separate your short docs from your extensive docs into different solr
 instances if you wished...depending on your document architecture this
 could
 speed up one or the other.
 
 
 
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p984750.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Autocomplete with NGrams

2010-07-20 Thread MitchK

It sounds like the best solution here, right.

However, I do not want to exclude the possibility of doing things in one
core that one *should* do in different cores with different configurations
and schema.xml.
I haven't completely read the lucidimagination article, but I would suggest
you do your work in different cores, since it would make managing and
configuring the different tasks easier.
Furthermore, configuration optimizations for task A (a normal index
where you search) may work badly or wastefully for task B. 
To prevent such situations, use a multicore setup.

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Autocomplete-with-NGrams-tp979312p980680.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Beginner question

2010-07-20 Thread MitchK

Here you can find params and their meanings for the dismax-handler.
You may not find anything in the wiki by searching for a parser ;).

Link: http://wiki.apache.org/solr/DisMaxRequestHandler (the
DisMaxRequestHandler wiki page)

Kind regards
- Mitch



Erik Hatcher-4 wrote:
 
 Consider using the dismax query parser instead.  It has more  
 sophisticated capability to spread user queries across multiple fields  
 with different weightings.
 
   Erik
 
 On Jul 20, 2010, at 4:34 AM, Bilgin Ibryam wrote:
 
 Hi all,

 I have two simple questions:

 I have an Item entity with id, name, category and description  
 fields. The
 main requirements is to be able to search in all the fields with the  
 same
 string and different priority per field, so matches in name appear  
 before
 category matches, and they appear before description field matches  
 in the
 result list.

 1. I think to create an index having the same fields, because each  
 field
 needs different priority during searching.

 2. And then do the search with a query like this:
 name:search_string^1.3 OR categpry:search_string^1.2 OR
 description:search_string^1.1

 Is this the right approach to model the index and search query?

 Thanks in advance.
 Bilgin
 
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Beginner-question-tp980695p980819.html
Sent from the Solr - User mailing list archive at Nabble.com.


Problem with Solr-Mailinglist

2010-07-19 Thread MitchK

Hello,

I have tried to post this message
http://lucene.472066.n3.nabble.com/Solr-in-an-extra-project-what-about-replication-scaling-etc-td977961.html#a977961
to the Solr mailing list four times now, and every time I get the following
response from the mailing list's server:



   solr-user@lucene.apache.org
 SMTP error from remote mail server after end of data:
 host mx1.eu.apache.org [192.87.106.230]: 552 spam score (7.8) exceeded
 threshold
 

Why is my posting declared as spam?! Has anyone else had such problems?

Thank you!
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-Solr-Mailinglist-tp978247p978247.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problem with Solr-Mailinglist

2010-07-19 Thread MitchK

Thank you both.

I will do what Hoss suggested tomorrow.
The mail was sent via the Nabble board and a second time via my Thunderbird
client, both with the same result. So there was no more HTML code in it than
in any of my other postings.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-Solr-Mailinglist-tp978247p979602.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Autocomplete with NGrams

2010-07-19 Thread MitchK

Frank,

have a look at Solr's example directory and look for 'multicore'. There
you can see an example configuration for a multicore environment.
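For reference, the relevant part of the multicore solr.xml looks roughly
like this (core names and paths made up):

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="search" instanceDir="search" />
    <core name="autocomplete" instanceDir="autocomplete" />
  </cores>
</solr>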

Kind regards,
- Mitch

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Autocomplete-with-NGrams-tp979312p979610.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr with hadoop

2010-07-05 Thread MitchK

I need to revive this discussion...

If you do distributed indexing correctly, what about updating the documents
and what about replicating them correctly?

Does this work? Or wasn't this an issue?

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p944413.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Wither field compresed=true ?

2010-06-29 Thread MitchK

David,

well, I am no committer, but I noticed that Lucene will no longer support
field compression (I think this was because of the trouble it caused), and
maybe this is the reason why Solr no longer offers this option.

Unfortunately, I do not have a link for it, but I think this was said in
some CHANGES.txt (at Nutch, I think).

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Wither-field-compresed-true-tp926288p929985.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How I can use score value for my function

2010-06-29 Thread MitchK

Britske, good workaround!
I had not thought about the possibility of using subqueries.

Regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-I-can-use-score-value-for-my-function-tp899662p931448.html
Sent from the Solr - User mailing list archive at Nabble.com.


Question about the mailinglist (junk on my behalf)

2010-06-28 Thread MitchK

Hello community,

for a few days now I have been receiving mails with suspicious content daily.
They say that some of my mails were rejected because of the file types of
their attachments, among other things.
This surprises me a lot, because I did not send any mails with attachments,
and even the email addresses that notify me about my rejected mails are
unknown to me.

This is the first mailing list I have joined, and I know that there are a lot
of bots out there crawling for email addresses to send junk. However, I
can't recognize any suspicious behaviour apart from those mails.

The number of mails making me aware of the mentioned thing is 10 in a few
days, maybe 15 but not more. And I do not get more junk than I normally get.

Does anyone else receive suspicious emails on my behalf?

Thank you.
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-about-the-mailinglist-junk-on-my-behalf-tp927461p927461.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MoreLikeThis (mlt) : use the match's maxScore for result score normalization

2010-06-25 Thread MitchK

Hi Chantal,

Munich? Germany seems to be soo small :-).


Chantal Ackermann wrote:
 
 I only want a way to show to the 
 user a kind of relevancy or similarity indicator (for example using a 
 range of 10 stars) that would give a hint on how similar the mlt hit is 
 to the input (match) item. 
 
Okay, that makes more sense.
Unfortunately, as far as I know, you cannot do that with Lucene in a way that
would fit your needs.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/MoreLikeThis-mlt-use-the-match-s-maxScore-for-result-score-normalization-tp919598p921942.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 1.4 - Image-Highlighting and Payloads

2010-06-24 Thread MitchK

Sebastian,

sounds like an exciting project.



 We've found the argument TokenGroup in method highlightTerm
 implemented in SimpleHtmlFormatter. TokenGroup provides the method
 getPayload(), but the returned value is always NULL. 
 
No, Token provides this method, not TokenGroup. But this might not be where
the mistake is.

Hm, since this approach is very special, I would suggest doing something
easier.
You already have tools to retrieve each word and its position from the
image, right?

What if you added a field to the schema.xml with a preprocessed
input string?

I.e., you have two fields:
the page's text and the page's word positions.

The word-positions field needs preprocessing outside of Solr, where you
add the coordinates of the words.

This preprocessing will be a little bit tricky.
If the 10th word is Solr and the 30th word is too, you do not want to have
solr two times with different coordinates.
In fact, you want to store both coordinates for the term solr.
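To illustrate, such a preprocessed field value could look like this (the
encoding is entirely made up - anything your client can parse back works):

solr|10:120,45;30:300,80 text|11:200,45

i.e. the term solr occurs as word 10 at x=120,y=45 and again as word 30 at
x=300,y=80.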

However, on the Solr-side you can add this preprocessed string to a field
with TermVectors.
If your query hits the page, you will get all the coordinates you want to
get.
Unfortunately, highlighting must be done on the client side.

Hope this helps
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-1-4-Image-Highlighting-and-Payloads-tp919266p919342.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MoreLikeThis (mlt) : use the match's maxScore for result score normalization

2010-06-24 Thread MitchK

Chantal,

have a look at
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html
to get an idea of what the MLT score is based on.

The problem is that you can't compare the scores.
The query for the normal result response was maybe something like
Bill Gates featuring Linus Torvald - The perfect OS song.
The user now picks one of the returned documents and says he wants More
like this - maybe because the topic was okay, but the content
was not enough, or whatever...
But the query MLT sends is totally different (as you can see in the link) - so
comparing the scores would be like comparing apples and oranges, since they
do not share the same basis.

What would be the use case? Why is score-normalization needed?

Kind regards from Germany,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/MoreLikeThis-mlt-use-the-match-s-maxScore-for-result-score-normalization-tp919598p919716.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr with hadoop

2010-06-22 Thread MitchK

I wanted to add a JIRA issue about exactly what Otis is asking here.
Unfortunately, I haven't had time for it because of my exams.

However, I'd like to add a question to Otis' ones:
If you distribute the indexing process this way, are you able to replicate
the different documents correctly?

Thank you.
- Mitch

Otis Gospodnetic-2 wrote:
 
 Stu,
 
 Interesting!  Can you provide more details about your setup?  By load
 balance the indexing stage you mean distribute the indexing process,
 right?  Do you simply take your content to be indexed, split it into N
 chunks where N matches the number of TaskNodes in your Hadoop cluster and
 provide a map function that does the indexing?  What does the reduce
 function do?  Does that call IndexWriter.addAllIndexes or do you do that
 outside Hadoop?
 
 Thanks,
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 - Original Message 
 From: Stu Hood stuh...@webmail.us
 To: solr-user@lucene.apache.org
 Sent: Monday, January 7, 2008 7:14:20 PM
 Subject: Re: solr with hadoop
 
 As Mike suggested, we use Hadoop to organize our data en route to Solr.
  Hadoop allows us to load balance the indexing stage, and then we use
  the raw Lucene IndexWriter.addAllIndexes method to merge the data to be
  hosted on Solr instances.
 
 Thanks,
 Stu
 
 
 
 -Original Message-
 From: Mike Klaas mike.kl...@gmail.com
 Sent: Friday, January 4, 2008 3:04pm
 To: solr-user@lucene.apache.org
 Subject: Re: solr with hadoop
 
 On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:
 
 I have huge index base (about 110 millions documents, 100 fields  
 each). But size of the index base is reasonable, it's about 70 Gb.  
 All I need is increase performance, since some queries, which match  
 big number of documents, are running slow.
 So I was thinking is any benefits to use hadoop for this? And if  
 so, what direction should I go? Is anybody did something for  
 integration Solr with Hadoop? Does it give any performance boost?

 Hadoop might be useful for organizing your data enroute to Solr, but  
 I don't see how it could be used to boost performance over a huge  
 Solr index.  To accomplish that, you need to split it up over two  
 machines (for which you might find hadoop useful).
 
 -Mike
 
 
 
 
 
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914589.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-17 Thread MitchK



 Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC
 score computed by Nutch into a Solr field and use it during scoring, if
 you want, say with a function query. 
 
Oh! Yes, that makes more sense than using the OPIC score as a doc-boost
value. :-)
Somewhere on the Lucene mailing lists I read that in the future it will be
possible to change a field's contents without reindexing the whole document.
If one stores the OPIC score (which is independent of the page's content)
in a field and uses a function query to influence the document's score, one
saves the effort of reindexing the whole doc if the content did not change.

Regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p902158.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-17 Thread MitchK

Otis,

you are right. I wasn't aware of this. At least not with such a large data
list (think of an index with 4 million docs; this would mean an external
file with 4 million records). But from what I've read at
search-lucene.com it seems to perform very well. Thanks for the idea!
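For reference, a rough sketch of such a setup (field, type and file names
and all attributes are made up):

In schema.xml:
<fieldType name="extScore" class="solr.ExternalFileField"
           keyField="id" defVal="0" indexed="false" stored="false"
           valType="pfloat"/>
<field name="opic_score" type="extScore"/>

In the data directory, a file external_opic_score with one line per doc:
doc1=0.83
doc2=0.15

Then something like bf=opic_score in a dismax request should pick the
values up.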

Btw: Otis, did you open a JIRA Issue for the distributed indexing ability of
Solr?
I would like to follow the issue, if it is open. 

Regards
- Mitch


Otis Gospodnetic-2 wrote:
 
 Mitch,
 
 Yes, one day.  But it sounds like you are not aware of ExternalFieldFile,
 which you can use today:
 
 http://search-lucene.com/?q=ExternalFileField&fc_project=Solr
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 - Original Message 
 From: MitchK mitc...@web.de
 To: solr-user@lucene.apache.org
 Sent: Thu, June 17, 2010 4:15:27 AM
 Subject: Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
 
 
 
 
 Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC
 score computed by Nutch into a Solr field and use it during scoring, if
 you want, say with a function query. 
 
 Oh! Yes, that makes more sense than using the OPIC as doc-boost-value. :-)
 Anywhere at the Lucene Mailing lists I read that in future it will be
 possible to change field's contents without reindexing the whole document.
 If one stores the OPIC-Score (which is independent from the page's content)
 in a field and uses functionQuery to influence the score of a document, one
 saves the effort of reindexing the whole doc, if the content did not change.
 
 Regards
 - Mitch
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p902158.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p903148.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Document boosting troubles

2010-06-17 Thread MitchK

Hi,

first of all, are you sure that row.put('$docBoost',docBoostVal) is correct?

I think it should be row.put($docBoost, docBoostVal); - unfortunately I am
not sure.

Hm, I think that until you can solve the problem with the docBoost itself,
you should use a function query.

Use div(1, rank) as boost function (bf).

The higher the rank value, the smaller the result.
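To illustrate with made-up numbers: with bf=div(1,rank), a doc with rank=1
gets 1.0 added to its score, rank=2 gets 0.5, and rank=10 only 0.1 - so
better-ranked docs float up. As a request this could look roughly like:

http://localhost:8983/solr/select?defType=dismax&q=red+door&qf=title&bf=div(1,rank)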

Hope this helps!
- Mitch

 
dbashford wrote:
 
 Brand new to this sort of thing so bear with me.
 
 For sake of simplicity, I've got a two field document, title and rank. 
 Title gets searched on, rank has values from 1 to 10.  1 being highest. 
 What I'd like to do is boost results of searches on title based on the
 documents rank.
 
 Because it's fairly cut and dry, I was hoping to do it during indexing.  I
 have this in my DIH transformer..
 
 var docBoostVal = 0;
 switch (rank) {
   case '1': 
   docBoostVal = 3.0;
   break;
   case '2': 
   docBoostVal = 2.6;
   break;
   case '3': 
   docBoostVal = 2.2;
   break;
   case '4': 
   docBoostVal = 1.8;
   break;
   case '5': 
   docBoostVal = 1.5;
   break;
   case '6': 
   docBoostVal = 1.2;
   break;
   case '7':
   docBoostVal = 0.9;
   break;
   case '8': 
   docBoostVal = 0.7;
   break;
   case '9': 
   docBoostVal = 0.5;  
   break;
 } 
 row.put('$docBoost',docBoostVal); 
 
 It's my understanding that with this, I can simply do the same /select
 queries I've been doing and expect documents to be boosted, but that
 doesn't seem to be happening because I'm seeing things like this in the
 results...
 
 {title:Some title 1,
 rank:10,
  score:0.11726039},
 {title:Some title 2,
  rank:7,
  score:0.11726039},
 
 Pretty much everything with the same score.  Whatever I'm doing isn't
 making its way through. (To cover my bases I did try the case statement
 with integers rather than strings, same result)
 
 
 
 
 
 With that not working I started looking at other options.  Starting
 playing with dismax.  
 
 I'm able to add this to a query string a get results I'm somewhat
 expecting...
 
 bq=rank:1^3.0 rank:2^2.6 rank:3^2.2 rank:4^1.8 rank:5^1.5 rank:6^1.2
 rank:7^0.9 rank:8^0.7 rank:9^0.5
 
 ...but I guess I wasn't expecting it to ONLY rank based on those factors. 
 That essentially gives me a sort by rank.  
 
 Trying to be super inclusive with the search, so while I'm fiddling my
 mm=11.  As expected, a q= like q=red door is returning everything that
 contains Red and door.  But I was hoping that items that matched red
 door exactly would sort closer to the top.  And if that exact match was a
 rank 7 that it's score wouldn't be exactly the same as all the other rank
 7s?  Ditto if I searched for q=The Tales Of, anything possessing all 3
 terms would sort closer to the top...and possessing two terms behind
 them...and possessing 1 term behind them, and within those groups weight
 heavily on by rank.
 
 I think I understand that the score is based entirely on the boosts I
 provide...so how do I get something more like what I'm looking for?
 
 
 
 
 Along those lines, I initially had put something like this in my
 defaults...
 
   <str name="bf">
  rank:1^10.0 rank:2^9.0 rank:3^8.0 rank:4^7.0 rank:5^6.0 rank:6^5.0
  rank:7^4.0 rank:8^3.0 rank:9^2.0
   </str>
 
 ...but that was not working, queries fail with a syntax exception. 
 Guessing this won't work?
 
 
 
 Thanks in advance for any help you can provide.
 
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p903190.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Document boosting troubles

2010-06-17 Thread MitchK

Sorry, I've overlooked your other question.



   <str name="bf">
  rank:1^10.0 rank:2^9.0 rank:3^8.0 rank:4^7.0 rank:5^6.0 rank:6^5.0
  rank:7^4.0 rank:8^3.0 rank:9^2.0
   </str>
 

This is wrong.
You need to change bf to bq.
bf = boost function
bq = boost query
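Spelled out next to each other (values made up):

bq=rank:1^3.0 rank:2^2.6 ...   (boost query: extra query clauses whose score is added)
bf=div(1,rank)                 (boost function: the function's result is added to the score)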
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p903208.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr multi-node

2010-06-17 Thread MitchK

Antonello,

here are a few links to the Solr Wiki:
Solr Replication: http://wiki.apache.org/solr/SolrReplication
Distributed Search Design: http://wiki.apache.org/solr/DistributedSearchDesign
Distributed Search: http://wiki.apache.org/solr/DistributedSearch
SolrCloud: http://wiki.apache.org/solr/SolrCloud

Hope this helps.
- Mitch

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-multi-node-tp903159p903228.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Master master?

2010-06-17 Thread MitchK

What is the use case for such an architecture?
Do you send indexing requests to two different masters, and is that why
they need to be synchronized?

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Master-master-tp884253p903233.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Document boosting troubles

2010-06-17 Thread MitchK

Hi,



 One problem down, two left!  =)  bf == bq did the trick, thanks.  Now at
 least if I can't get the DIH solution working I don't have to tack that on
 every query string. 
 
I would really recommend using a boost function. If your rank changes in
future implementations, you do not need to redefine the bq. Besides that,
I think this is not only more convenient, but also scales better.
The bq param is more for things like boost this category or boost docs of
an advertisement campaign, or something like that.

I am not sure, since I have never worked with the DIH this way, but my guess
is that the problem could be that you do not return the row. If you don't,
try again after adding return row to your script; a rough sketch of the same
logic as a Java transformer follows below.
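A hedged sketch, not your script: the same docBoost idea written as a Java
DIH transformer (the class name, the rank-to-boost rule and the rank column
are assumptions; wire it in via transformer="com.example.DocBoostTransformer"
on the entity):

package com.example;

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class DocBoostTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object rank = row.get("rank");
    if (rank != null) {
      try {
        // illustrative mapping only: smaller rank -> bigger boost
        double boost = 3.0 / Integer.parseInt(rank.toString());
        row.put("$docBoost", boost);
      } catch (NumberFormatException ignored) {
        // leave the row unboosted if rank is not a number
      }
    }
    return row; // returning the row hands the changes back to DIH
  }
}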

Otherwise I can't help you, since there are no more code examples available
on the mailing list (from what I have seen).

Maybe this mailing-list topic helps you:
http://lucene.472066.n3.nabble.com/Using-DIH-s-special-commands-Help-needed-td475695.html#a475695
(Using DIH's special commands - Help needed).
There are some suggestions,... however, it seems like he wasn't able to
solve the problem.



 And still can't figure out what I need to do with my dismax querying to
 get scores for quality of match. 
 
I don't really understand what you mean. Can you explain it a little bit
more?
What, apart from the $docBoost, does not work as it should?

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p904129.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DismaxRequestHandler

2010-06-17 Thread MitchK

Joe, 

please, can you provide an example of what you are thinking of?

Subqueries with Solr... I've never seen something like that before.

Thank you!

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p904142.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-17 Thread MitchK

Otis,

And again I wish I were registered.

I will check the JIRA and when I feel comfortable with it, I will open it.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p904145.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question on dynamic fields

2010-06-17 Thread MitchK

Barani,

without more background on your dynamic fields, I would say that the easiest
way would be to define a suffix for each of the fields you want to index
into the mentioned dynamic field and to redefine your dynamic-field
condition.

If a suffix does not work because of other dynamic-field declarations, use a
prefix.

Instead of *_bla to match myField_bla, you can use bla_* to match
bla_myField, as in the sketch below.
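For illustration, the prefix variant in the schema.xml (type name assumed):

<dynamicField name="bla_*" type="text" indexed="true" stored="true"/>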

Hope this helps,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-on-dynamic-fields-tp904053p904159.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread MitchK

Hello community, 

from several discussions about Solr and Nutch, I have some questions about a
virtual web search engine. 
I know I posted this message to the mailing list a few days ago, but the
thread got hijacked and I did not get any more postings on the topic, so I am
trying to reopen it; hopefully no one gets upset here :-).
Please bear with me. Thank you.

The requirements: 
I. I need a scalable solution for a growing index that becomes larger than
one machine can handle. If I add more hardware, I want performance to
improve linearly. 

II. I want to use technologies like the OPIC algorithm (the default algorithm
in Nutch) or PageRank or... whatever is out there to improve the ranking of
the webpages. 

III. I want to be able to easily add more fields to my documents. Imagine one
retrieves information from a webpage's content; then I want to make it
searchable. 

IV. While fetching my data, I want to make special searches possible. For
example, I want to retrieve pictures from a webpage and index
picture-related content into another search index, plus I want to save a
small thumbnail of the picture itself. Btw: this is (as far as I know) not
possible with Solr, because Solr was not intended for such special
indexing logic. 

V. I want to use filter queries (i.e. the main query christopher lee returns
1.5 million results; for the subquery action, the main query would become
a filter query and action would be the actual query, so a search within
search results would be easily available). 

VI. I want to be able to use different logic for different pages. Maybe I
have a pool of 100 domains that I know better than others, and I have
special scripts that retrieve more special information from those 100
domains. Then I want to apply my special logic to those 100 domains, but
every other domain should use the default logic. 

- 

The project is only virtual. So why am I asking? 
I want to learn more about web search and I would like to gain some new
experience. 

What do I know about Solr + Nutch: 
As it is said on lucidimagination.com, Solr + Nutch does not scale if the
index is too large. 
The article was a bit older, and I don't know whether this problem has been
fixed by the new distributed abilities of Solr. 

Furthermore, I don't want to index the pages with Nutch and reindex them
with Solr. 
The only exception would be: if the content of a webpage gets indexed by
Nutch, I want to use the already tokenized content of the body with some
Solr copyField operations to extend the search (i.e. making fuzzy search
possible). At the moment: I don't think this is possible. 

I don't know much about the Droids project and how well it is documented. 
But from what I can read in some posts by Otis, it seems to be usable as a
crawler framework.


Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it
is a scaling monster (from what I've read). 

Cons: the search is not as rich as what is possible with Solr. Extending
Nutch's search abilities *seems* to be more complicated than with Solr.
Furthermore, if I want to use Solr to search Nutch's index, looking at my
requirements I would need to reindex the whole thing - without the benefits
of Hadoop. 

What I don't know at the moment is how to use algorithms like those
mentioned in II. with Solr. 

I hope you understand the problem here - Solr *seems* to me not to be the
best solution for a web search engine, because of scaling reasons in
indexing. 


Where should I dive deeper? 
Solr + Droids? 
Solr + Nutch? 
Nutch + howToExtendNutchToMakeSearchBetter? 


Thanks for the discussion! 
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900069.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread MitchK

Thank you for the feedback, Otis.
Yes, I thought that such an approach is useful if the number of pages to
crawl is relatively low.

However, what about using Solr + Nutch?
Does the problem that this would not scale if the index becomes too large
still exist?

What about extending Nutch with features such as the DisMaxRequestHandler -
is the amount of work larger than it would be in Solr?

The big pro of Solr is that I can enhance the whole thing in a few minutes
if I need extra information to improve the search.
That makes it very easy to experiment with boosts, filters etc.
As far as I know, Nutch does not offer such great features.
Do you know a little bit more about that?

Probably I should ask such questions on the Nutch mailing list, but at the
moment I hope that I can achieve as much as possible with Solr, because I
have no experience with Hadoop, but Nutch seems to require it.

Thank you!
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900480.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread MitchK

Thanks, that really helps me find the right starting point for such a journey. :-)



 * Use Solr, not Nutch's search webapp 
 
As far as I have read, Solr can't scale if the index gets too large for one
server:



 The setup explained here has one significant caveat you also need to keep
 in mind: scale. You cannot use this kind of setup with vertical scale
 (collection size) that goes beyond one Solr box. The horizontal scaling
 (query throughput) is still possible with the standard Solr replication
 tools.
 
...from Lucidimagination.com

Is this still the case?
Furthermore, as far as I have understood this blogpost:
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
they index the whole thing with Nutch and reindex it into Solr - sounds like
a lot of redundant work.

Lucid, Sematext and the Nutch wiki are the only information sources where I
can find talk about Nutch and Solr, but no one seems to address these
facts - except this one blogpost.

If you say this is wrong or contingent on the shown setup, can you tell me
how to avoid these problems?

A lot of questions, but it's such an exciting topic...

Hopefully you can answer some of them.

Again, thank you for the feedback, Otis.

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900604.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Re: Re: Solr and Nutch/Droids - to use or not to use?

2010-06-16 Thread MitchK

Good morning!

Great feedback from you all. This really helped a lot to get an impression
of what is possible and what is not.

What is interesting to me are some detail questions.

Let's assume Solr could handle distributed indexing on its own, so that the
client does not need to know anything about shards etc.

What is interesting to me is:
I.
The scoring - Nutch uses special scoring implementations like the
OPIC algorithm. Can Solr use such improvements, or do I need to reimplement
them for Solr?

II.
The indexing.
At the moment it really sounds like Nutch would index the whole thing and
afterwards Solr does the job again.
Regarding indexing, it would make sense if Nutch computed things like the
document boost (I am not sure, but I think the results of the OPIC algorithm
were added to each document as a boost) and then sent an indexing request to
Solr.
However, if Nutch indexes the page's content and Solr does it too, I would
waste some time, no?
Is this the case, or am I misunderstanding something here?

III.
I am no Java expert.
However, in a few months I will start studying computer science at a
university. Maybe I will find some literature to learn more about
distributed software and how hashing needs to work, to do the job it should
do, to make distributed indexing work.
Maybe then I can help to implement this feature in Solr.
On the other hand, not much is known about Solr's distributed search concept
and which classes are responsible for it - but such things one could ask
on the mailing list, no? 

As far as I know Elastic Search already supports distributed indexing. 
Maybe one can reuse the responsible implementation for Solr.


Btw:
I think a great benefit of using Solr + Nutch would be extending the search.
I could create several Solr cores for different kinds of search - one for
picture search, one for video search etc. - *and* with the help of Nutch I can
index some of the needed content into special directories. So Solr does not
need to care about indexing a picture - Nutch already does the job. 

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p901943.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr and Nutch/Droids - to use or not to use?

2010-06-14 Thread MitchK

Just wanted to push the topic a little bit, because those questions come up
quite often and it's very interesting for me.

Thank you!

- Mitch


MitchK wrote:
 
 Hello community and a nice Saturday,
 
 from several discussions about Solr and Nutch, I got some questions for a
 virtual web-search-engine.
 
 The requirements:
 I. I need a scalable solution for a growing index that becomes larger than
 one machine can handle. If I add more hardware, I want performance to
 improve linearly.
 
 II. I want to use technologies like the OPIC-algorithm (default algorithm
 in Nutch) or PageRank or... whatever is out there to improve the ranking
 of the webpages. 
 
 III. I want to be able to easily add more fields to my documents. Imagine
 one retrieves information from a webpage's content; then I want to make it
 searchable.
 
 IV. While fetching my data, I want to make special-searches possible. For
 example I want to retrieve pictures from a webpage and want to index
 picture-related content into another search-index plus I want to save a
 small thumbnail of the picture itself. Btw: This is (as far as I know) not
 possible with solr, because solr was not intended to do such special
 indexing-logic.
 
 V. I want to use filter queries (i.e. main-query christopher lee returns
 1.5mio results, subquery action - the main-query would be a
 filter-query and action would be the actual query. So a search within
 search-results would be easily made available).
 
 VI. I want to be able to use different logics for different pages. Maybe I
 got a pool of 100 domains that I know better than others and I got special
 scripts that retrieve more special information from those 100 domains. Then
 I want to apply my special logic to those 100 domains, but every other
 domain should use the default logic.
 
 -
 
 The project is only virtual. So why am I asking?
 I want to learn more about websearch and I would like to make some new
 experiences.
 
 What do I know about Solr + Nutch:
 As it is said on lucidimagination.com, Solr + Nutch does not scale if the
 index is too large.
 The article was a little bit older and I don't know whether this problem
 gets fixed with the new distributed abilities of Solr.
 
 Furthermore I don't want to index the pages with nutch and reindex them
 with solr. 
 The only exception would be: if the content of a webpage gets indexed by
 nutch, I want to use the already tokenized content of the body with some
 Solr copyfield operations to extend the search (i.e. making fuzzy search
 possible). At the moment: I don't think this is possible.
 
 I don't know much about the droids project and how well it is documented.
 But from what I can read by some posts of Otis, it seems to be usable as a
 crawler-framework.
 
 
 Pros for Nutch are: It is very scalable! Thanks to hadoop and MapReduce it
 is a scaling-monster (from what I've read).
 
 Cons: The search is not as rich as what is possible with Solr. Extending
 Nutch's search abilities *seems* to be more complicated than with Solr.
 Furthermore, if I want to use Solr to search nutch's index, looking at my
 requirements I would need to reindex the whole thing - without the
 benefits of Hadoop. 
 
 What I don't know at the moment is, how it is possible to use algorithms
 like in II. mentioned with Solr.
 
 I hope you understand the problem here - Solr *seems* to me as it would
 not be the best solution for a web-search-engine, because of scaling
 reasons in indexing. 
 
 
 Where should I dive deeper? 
 Solr + Droids?
 Solr + Nutch?
 Nutch + howToExtendNutchToMakeSearchBetter?
 
 
 Thanks for the discussion!
 - Mitch
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p894391.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr DataConfig / DIH Question

2010-06-13 Thread MitchK

Guys???

You are in the wrong thread. Please send a new message to the mailing list;
do not reply to existing posts.

Thank you. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p892041.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr and Nutch/Droids - to use or not to use?

2010-06-12 Thread MitchK

Hello community and a nice Saturday,

from several discussions about Solr and Nutch, I have some questions about a
virtual web search engine.

The requirements:
I. I need a scalable solution for a growing index that becomes larger than
one machine can handle. If I add more hardware, I want performance to
improve linearly.

II. I want to use technologies like the OPIC algorithm (the default algorithm
in Nutch) or PageRank or... whatever is out there to improve the ranking of
the webpages. 

III. I want to be able to easily add more fields to my documents. Imagine one
retrieves information from a webpage's content; then I want to make it
searchable.

IV. While fetching my data, I want to make special searches possible. For
example, I want to retrieve pictures from a webpage and index
picture-related content into another search index, plus I want to save a
small thumbnail of the picture itself. Btw: this is (as far as I know) not
possible with Solr, because Solr was not intended for such special
indexing logic.

V. I want to use filter queries (i.e. the main query christopher lee returns
1.5 million results; for the subquery action, the main query would become
a filter query and action would be the actual query, so a search within
search results would be easily available).

VI. I want to be able to use different logic for different pages. Maybe I
have a pool of 100 domains that I know better than others, and I have
special scripts that retrieve more special information from those 100
domains. Then I want to apply my special logic to those 100 domains, but
every other domain should use the default logic.

-

The project is only virtual. So why am I asking?
I want to learn more about web search and I would like to gain some new
experience.

What do I know about Solr + Nutch:
As it is said on lucidimagination.com, Solr + Nutch does not scale if the
index is too large.
The article was a bit older, and I don't know whether this problem has been
fixed by the new distributed abilities of Solr.

Furthermore, I don't want to index the pages with Nutch and reindex them
with Solr. 
The only exception would be: if the content of a webpage gets indexed by
Nutch, I want to use the already tokenized content of the body with some
Solr copyField operations to extend the search (i.e. making fuzzy search
possible). At the moment: I don't think this is possible.

I don't know much about the Droids project and how well it is documented.
But from what I can read in some posts by Otis, it seems to be usable as a
crawler framework.


Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it
is a scaling monster (from what I've read).

Cons: the search is not as rich as what is possible with Solr. Extending
Nutch's search abilities *seems* to be more complicated than with Solr.
Furthermore, if I want to use Solr to search Nutch's index, looking at my
requirements I would need to reindex the whole thing - without the benefits
of Hadoop. 

What I don't know at the moment is how to use algorithms like those
mentioned in II. with Solr.

I hope you understand the problem here - Solr *seems* to me not to be the
best solution for a web search engine, because of scaling reasons in
indexing. 


Where should I dive deeper? 
Solr + Droids?
Solr + Nutch?
Nutch + howToExtendNutchToMakeSearchBetter?


Thanks for the discussion!
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p890640.html
Sent from the Solr - User mailing list archive at Nabble.com.


conditional Document Boost

2010-06-04 Thread MitchK

Hello out there,

I am searching for a solution for conditional document boosting.
While analyzing the fields of a document, I want to compute a document boost
based on some metrics.

There are three approaches:
First: I preprocess the data. The main problem with this is that I need to
take care of the preprocessing part myself and can't do it out of the box
(implement an analyzer, compute the boost value, and afterwards store those
values or send them to Solr).

Second: Using the UpdateRequestProcessor (does it work with DIH?). However,
this would also be custom work, plus taking care that the used params are
up to date. (A sketch of this approach follows below.)

Third: Setting the document boost while the analysis process is running,
with the help of a TokenFilter (is this possible?).

What would you do?
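A minimal sketch of the second approach (the class name, the metric field
and the boost rule are all made up; the factory that creates it and the
solrconfig.xml wiring are omitted):

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class ConditionalBoostProcessor extends UpdateRequestProcessor {

  public ConditionalBoostProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object quality = doc.getFieldValue("quality"); // assumed metric field
    if (quality != null) {
      // boost the whole document by its quality metric
      doc.setDocumentBoost(Float.parseFloat(quality.toString()));
    }
    super.processAdd(cmd); // pass the doc on to the next processor
  }
}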


I think what I want to do is quite similar to working with Mahout and Solr.
I have never worked with Mahout - but how can I use it to improve the user's
search experience?
Where can I use Mahout in Solr if I want to influence document boosts?
And where in general (i.e. for classification)?

References, ideas and whatever could be useful are welcome :-).

Thank you.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/conditional-Document-Boost-tp871108p871108.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: sort by function

2010-05-24 Thread MitchK

Where is your query?
You don't search for anything.
The q param is empty.

You have two options (untested): remove the q param or search for something
specific.
I think removing it is not a good idea. Instead, searching for *:* would
retrieve ALL results that match your filter query.
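For example (filter made up), something like

q=*:*&fq=inStock:true

returns every in-stock document, and your sort parameter can then order them.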

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p839167.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: IndexSearcher and Caches

2010-05-24 Thread MitchK

Ahh, now I understand.

No, you need no second IndexSearcher as long as the server is alive.
You can reuse your searcher for every user.

The only commands you execute per user are those that create a
search query.

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p840228.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: IndexSearcher and Caches

2010-05-24 Thread MitchK

Good question.
Well, I have never worked with SolrJ in production.

But two things:
First: as the documentation says, you *should* get your IndexSearcher
from your SolrQueryRequest object.
Second: as a developer of SolrJ I would do as much as I can automatically
behind the curtain. That means that if you do a commit, the index searcher
should be renewed automatically. But that's a guess.
I can't answer this question for you, sorry.
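For the server-side part, a minimal sketch of a custom SearchComponent that
takes the searcher from the request (everything here is illustrative; Solr
hands you the current searcher, so you never cache it yourself):

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.SolrIndexSearcher;

public class ExampleComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {}

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // the current per-commit searcher, managed entirely by Solr
    SolrIndexSearcher searcher = rb.req.getSearcher();
    rb.rsp.add("maxDoc", searcher.maxDoc());
  }

  public String getDescription() { return "example component"; }
  public String getSource() { return "$URL$"; }
  public String getSourceId() { return "$Id$"; }
  public String getVersion() { return "1.0"; }
}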

Maybe this link helps?
http://lucene.472066.n3.nabble.com/Solr-commit-issue-td770315.html#a770453
(searched with the following keywords: solrj commit searcher)

I am new to Java and the concept of Java Enterprise Edition's Servlets is
not yet fully clear to me. Please, let me ask a question.

Let me give you an example:
If I use a SolrServer inside my application (it's a servlet), I should
create it when I start the servlet.
Should I cache the instantiated SolrServer object with the help of the
servlet's context? And should my cache implementation provide a
getSolrServer() method?
Maybe this is a question more related to the JavaEE concept.
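A minimal sketch of that idea, assuming SolrJ 1.4 and a made-up attribute
name (register the listener in web.xml):

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SolrServerListener implements ServletContextListener {

  public void contextInitialized(ServletContextEvent e) {
    try {
      // one shared, thread-safe SolrServer for the whole webapp
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      e.getServletContext().setAttribute("solrServer", server);
    } catch (java.net.MalformedURLException ex) {
      throw new RuntimeException(ex);
    }
  }

  public void contextDestroyed(ServletContextEvent e) {}
}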

Thank you.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p840479.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: IndexSearcher and Caches

2010-05-23 Thread MitchK



 In my case, I have an index which will not be modified after creation.
 Does 
 this mean that in a multi-user scenario, I can have a static IndexSearcher 
 object that can be shared by multiple users ? 
 
I am not sure what you mean by a multi-user scenario. Can you tell me
what you have in mind?
If your index never changes, your IndexSearcher won't change.




 If the IndexSearcher object is threadsafe, then only issues related to 
 concurrency are addressed. What about the case where the IndexSearcher is 
 static? User 1 logs in to the system, queries with the static
 IndexSearcher, 
 logs out; and then User 2 logs in to the system, queries with the same 
 static IndexSearcher, logs out. In this case, the users 1 and 2 are not 
 querying concurrently but one after another. Will the query information 
 (filters or any other data) of User 1 be retained when User 2 uses this ? 
 
I am not sure about the benefit of a static IndexSearcher. What do you hope
to gain?

If user 1 uses a filter like fq=name:Samuel&q=somethingIWantToKnow and
user 2 queries fq=name:Samuel&q=whatIReallyWantToKnow, then they use
the same cached filter object, retrieved from Solr's internal cache (of
course you need a cache size that allows caching).



 The solr wiki states that the caches are per IndexSearcher object i.e if I 
 set my filterCache size to 1000 it means that 1000 entries can be assigned 
 for every IndexSearcher object. 
 
Yes. If a new searcher is created, then the new cache is warmed from the old
one.



 Is this true for queryResultsCache, 
 filterCache and documentCache ? 
 
For the filterCache it's true; for the queryResultCache (if I understand the
wiki right), too.
Please note that the documentCache's behaviour is different from the
ones already mentioned. 
The wiki says:


 Note: This cache cannot be used as a source for autowarming because
 document IDs will change when anything in the index changes so they can't
 be used by a new searcher.
 

The wiki says that the size of the document cache should be greater than the
number of _results_ * the number of _concurrent_ queries.
I never worked with the document cache, so maybe someone else can shed some
light on it.
But from what I have understood, it means the following:

If you show 10 results per request and you expect up to 500 concurrent
queries:
10 * 500 = 5000

But I want to emphasize that this is only a guess. I actually don't exactly
know more about this topic.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p838367.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: sort by function

2010-05-22 Thread MitchK

The score isn't computed when you try to access it. Furthermore, your
function query needs to become part of the score.

So what can you do???

The keyword is boosting.

Do:  {!func}product(0.88,rank)^x
where x is a boost factor based on your experience.

Keep in mind that the result of your product function query will be added to
the score.
That means if the result is e.g. 12 and the normal score would be 5.6,
then the final score for the document is 17.6. If your rank value or your
x value is too large, this will lead to unexpected results.

Hope this helps.
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p836471.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: IndexSearcher and Caches

2010-05-21 Thread MitchK

Rahul,

Solr's IndexSearcher is shared by every request between two commits.
That means one IndexSearcher plus its caches has a lifetime of one commit;
after every commit a new one is created.

Caching does not mean that filters are applied automatically. It means
that a filter from a query is cached, and whenever a user query requires the
same filtering criteria, the cached filter is used instead of creating a new
one on the fly.

I.e.: fq=inStock:true
The result of this filter criterion gets cached once. If another user issues
a query with fq=inStock:true, Solr reuses the already existing filter.
Since such filters are cached as bit sets, they are not large.
It does not matter what the user queries in the q param.

BTW: The IndexSearcher is threadsafe. So there is no problem with concurrent
usage.

Hope this helps???

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p833841.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Personalized Search

2010-05-20 Thread MitchK

Hi dc,



 - at query time, specify boosts for 'my items' items 
 
Do you mean something like document-boost or do you want to include
something like
OR myItemId:100^100
?

Can you tell us how you would specify document boosts at query time? Or
are you querying something like a boolean field (i.e. isFavorite:true^10) or
a numeric field?

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Personalized-Search-tp831070p832062.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: sort by function

2010-05-16 Thread MitchK

Can you please do some math to show the principle?

Do you want to do something like this: 
finalScore = score * rank
finalScore = rank

???

If the first is the case, then it is done by default (have a look at the
wiki example for making more recent documents more relevant).
If the second is the case, then I would say you need a new sort function
(I have never implemented something like that).

Hope this helps
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p821239.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Short DismaxRequestHandler Question

2010-05-15 Thread MitchK

Okay, I will do so in the future if another problem like this occurs.

At the moment, everything is fine after I followed your suggestions.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p820355.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: sort by function

2010-05-15 Thread MitchK

Can you provide some more information on what you really want to do?
As the examples in the wiki show, the value returned by the function query
is multiplied into the score - and you can boost the value returned by the
function query if you like. 

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p820359.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Short DismaxRequestHandler Question

2010-05-07 Thread MitchK

Okay, let me be more specific:
I have a custom StopWordFilter and a WordMarkingFilter.

The WordMarkingFilter is a simple implementation that determines which type a
word is.
The StopWordFilter (my implementation) removes specific types of words *and*
all markers from all words.

This leads to the deletion of some parts of sentences.

In my disMaxQuery I specified some fields with such filters and some
without.


a) what docs should *not* match the query you listed
In this case: docs where only Solr OR development occurs should not match.
It does not matter whether both words occur in different fields.


b) what queries should *not* match the doc you listed
Actually Solr Development Lucidworks should not match, for example
(assuming that lucidworks does not occur in a field like content).
In this case, the user searches for development work with Solr in relation
to LucidWorks.
Solr does not know about the relation; however, with the 100% mm definition I
can tell Solr something like this more easily.


c) what types of URLs you've already tried 
Those I have shown here. No more.

Let me make sure that I have understood your part about how the
DisMaxRequestHandler works.
If I have 4 fields:
name, colour, category, manufacturer

And an example doc like this:
title: iPhone
colour: black
category: smartphone
manufacturer: apple

And I have a dismax query like this:
q=apple iPhone&qf=title^5 manufacturer&mm=100%
Then the whole thing will match (assuming that iPhone and/or apple were not
stopwords)?

If yes, then the problem is my filter definition.
There were some threads discussing such problems with the
standard StopWordFilter.

Another example:
title: Solr in a production environment
cat: tutorial

At index-time, title is reduced to: Solr production environment.
A query like this using Solr in a production environment
will be reduced to Solr production environment.
This will work, as I have understood, because both: the indexed terms and
the query are the same.

However, if I got a content field, that indexes the content of the text
without my markerFilter, this won't work, because the parsed query-strings
are different??? I don't understand the problem 

example:
title: Solr in a production environment
cat: tutorial
content: here is some text about using Solr in production. This fieldType
consists of a lowerCaseFilter and a standard-StopWordFilter to delete all
words like 'the, and, in' etc.

Please note that environment does not occur in the content field.
So a parsed query string would look like:
using Solr in a production environment - using Solr production
environment (stopwords are removed).
This won't match, because the word environment does not occur in the
content field? And because of that, the whole doc does not match?

If you are confused about my examples and questions - I was trying to
understand the explanations that were described here:
http://lucene.472066.n3.nabble.com/DisMax-request-handler-doesn-t-work-with-stopwords-td478128.html#a478128

Thank you for help.

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p783063.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Short DismaxRequestHandler Question

2010-05-07 Thread MitchK

Btw: This thread helps a lot to understand the difference between qf and pf
:-)
http://lucene.472066.n3.nabble.com/Dismax-query-phrases-td489994.html#a489995
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p783379.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: increase(change) relevancy

2010-05-07 Thread MitchK

Hi Ramzesua,

take a look at the function query example that influences relevancy via the
popularity field of the example directory.

http://wiki.apache.org/solr/FunctionQuery#Using_FunctionQuery
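For instance, the example configuration boosts by popularity with something
roughly like this in a dismax handler (exact constants vary):

bf=recip(rord(popularity),1,1000,1000)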

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/increase-change-relevancy-tp783497p783750.html
Sent from the Solr - User mailing list archive at Nabble.com.


Short DismaxRequestHandler Question

2010-05-04 Thread MitchK

Hello community,

I need a minimum should match only on some fields, not on all.

Let me give you an example:
title: Breaking News: New information about Solr 1.5
category: development
tag: Solr News

If I am searching for Solr development, I want this doc returned,
although I defined a minimum should match of 100%, because 100% of the query
matches the *whole* document.
At the moment, 100% applies only if 100% of the query matches a single field.

Is this possible at the moment?
If not, are there any suggestions or practices to make this work?

Thank you.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p775913.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom SearchComponent to reset facet value counts after collapse

2010-05-04 Thread MitchK

When is the returned facet info the expected info for your multiValued
fields?
Before or after your collapse?
It could be that you need to facet on your multiValued fields before you
collapse in order to retrieve the right values.
If this is the case, you need to integrate the before-collapsing feature of
the collapsing patch into your own component; the rest is done by the patch
itself.

Hope this helps.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p776067.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Short DismaxRequestHandler Question

2010-05-04 Thread MitchK

Thank you for responding.

This would be possible. However, I would not like to do so, because a match
in title should be boosted higher than a match in category.


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p776238.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Short DismaxRequestHandler Question

2010-05-04 Thread MitchK

I have an idea:
If I concatenated all relevant fields into one large multiValued field (see
the copyField sketch below), I could query like this:
{!dismax qf='myLargeField^5'}solr development //mm is 1 (100%) if not set
In addition to that, I add a phrase query:

{!dismax qf='myLargeField^5'}solr development AND title:(solr
development)^10 OR category:(solr development)^2
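The concatenation itself could be done in the schema.xml roughly like this
(field type and attributes assumed):

<field name="myLargeField" type="text" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="title" dest="myLargeField"/>
<copyField source="category" dest="myLargeField"/>
<copyField source="tag" dest="myLargeField"/>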

Any other ideas are welcome.
Thank you for the discussion. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p776446.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom SearchComponent to reset facet value counts after collapse

2010-05-04 Thread MitchK

I would prefer extending the given CollapseComponent for performance
reasons. What you want to do sounds a bit like making things too
complicated.

There are two options I would prefer:
1. Get the schema information for every field you want to query against and
decide whether you want to facet before or after collapsing. As far as I
have understood: for multiValued fields you want to facet before collapsing,
because if you facet after collapsing, the returned counts are wrong.

2. As a developer, you know which of the queried fields is a multiValued
one. Knowing this, you create a new param that contains those fields you
always want to facet on BEFORE collapsing.

I want to emphasize that I have never looked at the source code of the patch.
However, I really think that you do not need to reimplement that many things.
You only need to implement the logic of when to facet which field; that is
everything.
And since the component seems to implement both, faceting before *and* after
collapsing, you can use the provided methods to make your logic work (see
the sketch below).
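
An untested sketch of the decision logic I mean; how you then trigger the
patch's before/after faceting depends on its API:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.common.params.FacetParams;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.schema.IndexSchema;

  // inside process(ResponseBuilder rb) of your component: split the
  // requested facet fields by whether they are multiValued in the schema
  IndexSchema schema = rb.req.getSchema();
  String[] facetFields = rb.req.getParams().getParams(FacetParams.FACET_FIELD);
  List<String> beforeCollapse = new ArrayList<String>();
  List<String> afterCollapse = new ArrayList<String>();
  for (String f : facetFields) {
      if (schema.getField(f).multiValued()) {
          beforeCollapse.add(f); // facet on the uncollapsed DocSet
      } else {
          afterCollapse.add(f);  // facet on the collapsed DocSet
      }
  }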

Just some thoughts. :)

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p776896.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How do I return all the results in an index?

2010-05-04 Thread MitchK

Did you clear the browser cache?
Maybe you need to restart (I am currently not sure whether Solr caches
HTTP requests even after a commit).
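
By the way, Solr's HTTP caching behaviour is configured in solrconfig.xml
under the requestDispatcher; with never304="true" (a sketch, check your own
config) Solr ignores cache validation and always computes a fresh response:

  <requestDispatcher handleSelect="true">
    <httpCaching never304="true"/>
  </requestDispatcher>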

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-do-I-return-all-the-results-in-an-index-tp777214p777353.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: synonym filter problem for string or phrase

2010-05-03 Thread MitchK

Just to keep the terminology clear: you mean field, not fieldType. A
fieldType is the definition of tokenizers, filters, etc.
You apply a fieldType to a field, and you query against a field, not against
a whole fieldType. :-)
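
For illustration, a minimal sketch (names are made up):

  <fieldType name="text_syn" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    </analyzer>
  </fieldType>
  <field name="title" type="text_syn" indexed="true" stored="true"/>

A query like q=title:foo goes against the field "title", which happens to
use the fieldType "text_syn".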

Kind regards
- Mitch


Marco Martinez-2 wrote:
 
 Hi Ranveer,
 
 If you don't specify a field type in the q parameter, the search will be
 done searching in your default search field defined in the solrconfig.xml,
 Is your default field a text_sync field?
 
 Regards,
 
 Marco Martínez Bautista
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/synonym-filter-problem-for-string-or-phrase-tp765242p773083.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom SearchComponent to reset facet value counts after collapse

2010-05-02 Thread MitchK

Kelly, did you have a look at the FacetComponent and SimpleFacets classes?
Why do you want to reset the counts? What is your use case?
What is the difference between the FacetComponent's return value and your
component's?

Kind regards
- Mitch


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p771260.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom SearchComponent to reset facet value counts after collapse

2010-05-02 Thread MitchK

Unfortunately this patch does not support multiValued fields (as stated by
the author and some others who have worked with the patch). I had a look at
others, but they seem to have the same problem.
What would I suggest, hmm...
Out of the box and at this time (it is late here in Germany) I have only one
simple idea:
Send a second request using the standard FacetComponent: do the same query
and facet on those fields that seem to have unexpected results. If I
understand you right, this would be the fastest solution.
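
Roughly like this (the field name is made up); rows=0 keeps the response
small, since you only need the counts:

  http://localhost:8983/solr/select?q=your+query
      &rows=0
      &facet=true
      &facet.field=myMultiValuedField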

However, I am not sure whether you really have a problem, since the
SimpleFacets implementation also sends several queries to get the count
per facet value.

Does it really kill your performance? Or do you have performance issues
even if you don't do so? How long does it take to compute a response?

Maybe you can provide the full code of your own implementation, so that we
can have a look together at your source code.

Hope this helps.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p772012.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom SearchComponent to reset facet value counts after collapse

2010-05-02 Thread MitchK

Good morning,

I do not have the time to read your full code very carefully at the moment;
I will do so later on. However: have a look at SimpleFacets and consider the
method that creates the facet counts. If I remember it correctly, it uses
the SolrIndexSearcher's numDocs(arg1, arg2) method.
That's what you need here, I *think* (I have never built such a feature).
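
From memory, the counting part looks roughly like this (untested; field and
value are just examples):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.solr.search.DocSet;
  import org.apache.solr.search.SolrIndexSearcher;

  // count the docs that match both the base DocSet and one facet value
  SolrIndexSearcher searcher = rb.req.getSearcher();
  DocSet baseDocs = rb.getResults().docSet;
  Query oneValue = new TermQuery(new Term("category", "development"));
  int count = searcher.numDocs(oneValue, baseDocs);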

There is one thing that may be tricky: which field to query against (in a
universal way; at the moment you need to handle that yourself when we are
talking about multiValued fields).



 If I use param (CollapseParams.COLLAPSE_FACET, after) I get accurate
 counts for some facet values, while other facet values (from multi-value
 fields) are completely missing. 
 
Is what you've said shown in your example? I just want to verify that we
are seeing the same problem.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p772544.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: thresholding results by percentage drop from maxScore in lucene/solr

2010-05-01 Thread MitchK

I am curious:
What is your use case, or what type of data is this? Web pages? Blog posts?
Product items?

Can you provide some real examples so that we can discuss other ideas than
doing it by the score?
Because I think this is not possible, or really difficult to achieve, since
you don't know what the highest score will be until every document that
matches the query has been found.
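
If you really need it, over-fetching and cutting off on the client side
might do. A rough SolrJ sketch (untested, exception handling omitted;
"server" is your SolrServer instance and the 50% threshold is arbitrary):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  SolrQuery query = new SolrQuery("your query");
  query.setRows(100);             // fetch more than you want to keep
  query.setFields("id", "score"); // the score must be requested explicitly
  QueryResponse rsp = server.query(query);
  SolrDocumentList docs = rsp.getResults();
  float maxScore = docs.getMaxScore();
  List<SolrDocument> kept = new ArrayList<SolrDocument>();
  for (SolrDocument doc : docs) {
      float score = (Float) doc.getFieldValue("score");
      if (score >= 0.5f * maxScore) { // drop everything below 50% of max
          kept.add(doc);
      }
  }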

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/thresholding-results-by-percentage-drop-from-maxScore-in-lucene-solr-tp768872p770063.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Any way to get top 'n' queries searched from Solr?

2010-04-30 Thread MitchK

The simplest way is to send the query string to your Solr client *and* to
your custom query store, which could be any database you like. Doing so,
you can count how often each query was sent, etc.
*And* you can make the queries searchable by exporting those datasets to
another Solr core.
Why an extra DB?
Because if a crash occurs, Solr gives you no guarantees. Keep in mind that
Solr is an index and search server, not a real database.

This is the easiest way to implement such a feature, I think.
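
A rough JDBC sketch of the logging part (table and column names are made
up; "connection" is an open java.sql.Connection and "queryString" the query
you received):

  import java.sql.PreparedStatement;

  // assumed table: CREATE TABLE query_log (q VARCHAR(500), ts TIMESTAMP)
  PreparedStatement ps = connection.prepareStatement(
      "INSERT INTO query_log (q, ts) VALUES (?, CURRENT_TIMESTAMP)");
  ps.setString(1, queryString);
  ps.executeUpdate();
  ps.close();

  // the top 10 queries are then one GROUP BY away:
  // SELECT q, COUNT(*) AS cnt FROM query_log
  //   GROUP BY q ORDER BY cnt DESC LIMIT 10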

Good luck.
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Any-way-to-get-top-n-queries-searched-from-Solr-tp767165p767489.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Elevation of of part match

2010-04-30 Thread MitchK

Gert, could you provide the solrconfig and schema specifications you have
made?
If the wiki really means what it says, the behaviour you want should be
possible.

But that is only a guess.

Btw: the standard field type for the elevation component is string in the
example directory. That means there is no tokenization, and accordingly a
partial match is not possible.
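
The definition in the example solrconfig.xml looks roughly like this:

  <searchComponent name="elevator" class="solr.QueryElevationComponent">
    <!-- pick a fieldType to analyze queries -->
    <str name="queryFieldType">string</str>
    <str name="config-file">elevate.xml</str>
  </searchComponent>

Switching queryFieldType to a tokenized type should change how the incoming
query is matched against the entries in elevate.xml, but I have not tried
that myself.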

Hope that helps
- Mitch 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Elevation-of-of-part-match-tp767139p767877.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Elevation of of part match

2010-04-30 Thread MitchK

The elevate.xml-example says:

<!-- If this file is found in the config directory, it will only be
     loaded once at startup.  If it is found in Solr's data
     directory, it will be re-loaded every commit.
-->

Did you restart Solr?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Elevation-of-of-part-match-tp767139p768120.html
Sent from the Solr - User mailing list archive at Nabble.com.

