Re: questions about autocommit committing documents
Hi Andy, Andy-152 wrote: <autoCommit><maxDocs>10000</maxDocs><maxTime>1000</maxTime></autoCommit> has been commented out. - With autoCommit commented out, does it mean that every new document indexed to Solr is being auto-committed individually? Or that they are not being auto-committed at all? I am not sure whether there is a default value, but if not, commenting it out means that you have to send a commit explicitly. - If I enable autoCommit and set maxDocs to 10000, does it mean that my new documents won't be available for searching until 10,000 new documents have been added? Yes, that's correct. However, you can still do a commit explicitly if you want to. - When I add a new document to Solr, do I need to call commit explicitly? If so, how do I do that? Looking at the Solr tutorial (http://lucene.apache.org/solr/tutorial.html), the command used to index documents (java -jar post.jar solr.xml monitor.xml) doesn't include any explicit call to commit the documents. So I'm not sure if it's necessary. Thanks Committing is necessary: an added document is not visible at query time until a commit has been issued. Kind regards, Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p1582676.html Sent from the Solr - User mailing list archive at Nabble.com.
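For reference, a sketch of what the autoCommit section in solrconfig.xml typically looks like, together with the explicit commit message you can POST to the update handler. The exact values are illustrative, not taken from Andy's configuration:

```xml
<!-- solrconfig.xml: auto-commit after 10,000 added docs or after 1,000 ms -->
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>

<!-- alternatively, POST an explicit commit message to the /update handler -->
<commit/>
```

With autoCommit disabled, nothing becomes searchable until a `<commit/>` (or a commit from a client such as SolrJ) is sent.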
Re: questions about autocommit committing documents
First: usually you do not use post.jar for updating your index. It's a simple tool; normally you use features like the CSV or XML update request handlers. Have a look at UpdateCSV and UpdateXMLMessages in the wiki - there you can find examples of how to commit explicitly. With post.jar you need to either set -Dcommit=yes or append a <commit/> message, I think. Hope this helps. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p1582846.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Swapping cores with SolrJ
Hi Shaun, I think it will be easier to fix this problem if we get more information about what is going on in your application. Could you please provide the CoreAdminResponse returned by car.process() for us? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Swapping-cores-with-SolrJ-tp1472154p1473435.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr CoreAdmin create ignores dataDir Parameter
Frank, have a look at SOLR-646. Do you think a workaround using the <dataDir> tag in solrconfig.xml could help? I am thinking about something like <dataDir>${solr./data/corename}</dataDir>, just for illustration. Unfortunately I am not very skilled in working with Solr's variables, and therefore I do not know which variables are available. If we find a solution, we should add it as a suggestion on the wiki's CoreAdmin page. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-CoreAdmin-create-ignores-dataDir-Parameter-tp1451665p1454705.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
What if we do not care about the version of a document at index time? When it comes to distributed search, we currently decide how to aggregate documents based on their uniqueKey. But what if we additionally decided on uniqueKey plus an indexingDate, so that we only aggregate the last indexed version of a document? The concept could look like this: when Solr aggregates the documents for a response, it could record which shard responded with an older version of document x. Now a crawler can crawl through our SolrCloud, asking each shard whether it noticed a "shard y got an older version of doc x" case. The crawler aggregates that information. After it has finished crawling, it sends delete-by-query requests to those shards which hold older versions of documents than they should. For better understanding, I will call these stored document versions that are older than the newest version ODVs (Old Document Versions). So, what can happen: before the crawler can visit shard A - which noticed that shard y stores an ODV of doc x - shard A can go down. That's okay, because either another shard noticed the same thing, or shard A will be available again later on. If that information is stored on disk, it will also still be available. If it was stored in RAM, the information is lost... however, you could replicate that information over more than one shard, right? :-) Another case: shard y can go down - so someone has to take care of storing the noticed ODV information, so that the document can be deleted when shard y comes back. Pros: - You can do something like consistent hashing in connection with a concept where each node has to care for its neighbour nodes, because only the neighbour nodes can store ODVs. - Using the described concept, you can run nightly batches looking for ODVs in the neighbour nodes. - ODVs will be found at request time, so we can avoid responding with ODVs instead of newer versions. Cons: - We are wasting disk space. 
- This works only for smaller clusters, not for large ones where the number of machines changes very frequently ... This is just another idea - and it is very, very lazy. I must emphasize that I assume that neighbour machines do not go down very frequently. Of course, it is not a question of whether a machine crashes, but when it crashes - but I assume that the same server does not crash every hour. :-) Thoughts? Kind regards Andrzej Bialecki wrote: On 2010-09-06 16:41, Yonik Seeley wrote: On Mon, Sep 6, 2010 at 10:18 AM, MitchK mitc...@web.de wrote: [...consistent hashing...] But it doesn't solve the problem at all, correct me if I am wrong, but: if you add a new server, let's call it IP3-1, and IP3-1 is nearer to the current resource X, then doc x will be indexed at IP3-1 - even if IP2-1 holds the older version. Am I right? Right. You still need code to handle migration. Consistent hashing is a way for everyone to be able to agree on the mapping, and for the mapping to change incrementally. I.e. you add a node and it only changes the docid-node mapping of a limited percentage of the mappings, rather than changing the mappings of potentially everything, as a simple MOD would do. Another strategy to avoid excessive reindexing is to keep splitting the largest shards, and then your mapping becomes a regular MOD plus a list of these additional splits. Really, there's an infinite number of ways you could implement this... For SolrCloud, I don't think we'll end up using consistent hashing - we don't need it (although some of the concepts may still be useful). I imagine there could be situations where a simple MOD won't do ;) so I think it would be good to hide this strategy behind an interface/abstract class. It costs nothing, and gives you flexibility in how you implement this mapping. -- Best regards, Andrzej Bialecki 
Information Retrieval, Semantic Web, Embedded Unix, System Integration - http://www.sigram.com - Contact: info at sigram dot com -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434329.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
I must add something to my last post: when saying it could be used together with techniques like consistent hashing, I mean that it could be used at indexing time for indexing documents, since I assume that the number of shards does not change frequently and therefore an ODV case occurs relatively infrequently. Furthermore, the overhead of searching for and removing those ODV documents is relatively low. -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434364.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: anyone use hadoop+solr?
Thanks for your detailed feedback, Andrzej! From what I understood, SOLR-1301 becomes obsolete once Solr becomes cloud-ready, right? Looking into the future: eventually, when SolrCloud arrives we will be able to index straight to a SolrCloud cluster, assigning documents to shards through a hashing schema (e.g. 'md5(docId) % numShards') Hm, let's say md5(docId) produces a value of 7 (it won't, but let's assume it). If I have a constant number of shards, the doc will be assigned to the same shard again and again, i.e.: 7 % numShards(5) = 2 - the doc will be indexed at shard 2. A few days later the rest of the cluster is available, and now it looks like 7 % numShards(10) = 7 - the doc will be indexed at shard 7... and what about the older version at shard 2? I am no expert when it comes to cloud computing and the other stuff. If you can point me to one or another reference where I can read about it, it would help me a lot, since at the moment I only want to understand how it works. The problem with Solr is its lack of documentation in some classes and the lack of encapsulating some very complex things into separate methods or extra classes. Of course, this is because it costs extra time to do so, but it makes understanding and modifying things very complicated if you do not understand what is going on from a theoretical point of view. Since the cloud feature will be complex, a lack of documentation and no understanding of the theory behind the code will make contributing back very, very complicated. Thank you :-) - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1425986.html Sent from the Solr - User mailing list archive at Nabble.com.
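The re-mapping problem described above can be sketched in a few lines. This is a toy illustration, not SolrCloud code - the class and method names are invented - but it shows how a plain 'md5(docId) % numShards' assignment moves a large fraction of documents to a different shard when the shard count changes, leaving stale copies behind:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Toy sketch (not SolrCloud code): shard assignment via md5(docId) % numShards,
// illustrating why a changing shard count re-maps many documents.
public class ModSharding {

    static int shardFor(String docId, int numShards) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                                .digest(docId.getBytes(StandardCharsets.UTF_8));
        // use the low 4 bytes of the digest as an int, then take a non-negative modulus
        int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
              | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        return Math.floorMod(h, numShards);
    }

    public static void main(String[] args) throws Exception {
        int moved = 0;
        for (int i = 0; i < 1000; i++) {
            String docId = "doc-" + i;
            if (shardFor(docId, 5) != shardFor(docId, 10)) moved++;
        }
        // when doubling from 5 to 10 shards, roughly half the docs change shards;
        // for non-doubling changes the moved fraction is even larger - and the
        // older copies stay on the old shards until someone deletes them
        System.out.println(moved + " of 1000 docs changed shards");
    }
}
```

Running it makes the point numerically: even the gentlest cluster growth (doubling) reroutes about half of all documents under a plain MOD mapping.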
Re: anyone use hadoop+solr?
Yonik, are there any discussions about SolrCloud indexing? I would be glad to join them if I can find some interesting papers about the topic. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1426469.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
Andrzej, thank you for sharing your experiences. b) use consistent hashing as the mapping schema to assign documents to a changing number of shards. There are many explanations of this schema on the net; here's one that is very simple: Boom. With the given explanation, I understand it as follows: you can use Hadoop and run some map-reduce jobs per CSV file. On the reducer side, the reducer has to take the id of the current doc and create a hash of it. Then it looks inside a SortedSet, picks the next-best server, and looks in a map to see whether this server has free capacity or not. That's cool. But it doesn't solve the problem at all - correct me if I am wrong, but: if you add a new server, let's call it IP3-1, and IP3-1 is nearer to the current resource X, then doc x will be indexed at IP3-1 - even if IP2-1 holds the older version. Am I right? Thank you for sharing the paper. I will look for more like this. In this case the lack of good docs and user-level API can be blamed on the fact that this functionality is still under heavy development. I do not only mean documentation at the user level but also inside a class, if there is some complicated stuff going on. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1426728.html Sent from the Solr - User mailing list archive at Nabble.com.
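The "sorted set, pick the next-best server" idea above can be sketched with a TreeMap acting as the hash ring. This is a toy consistent-hashing sketch, not Solr code, and the class, method, and server names are all invented. The key property it demonstrates: when a new server joins, a document either keeps its old server or moves to the new one - the old copy on the old server indeed stays behind, which is exactly the concern raised in the post:

```java
import java.util.TreeMap;

// Toy consistent-hashing ring (names invented, not Solr code): each server is
// placed on the ring at several virtual points; a doc goes to the first server
// clockwise from its own hash position.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int replicas; // virtual points per server, for smoother balance

    HashRing(int replicas) { this.replicas = replicas; }

    static int hash(String key) {
        // any reasonably uniform hash works for the sketch; hashCode is enough here
        return key.hashCode() & 0x7FFFFFFF;
    }

    void addServer(String server) {
        for (int i = 0; i < replicas; i++) ring.put(hash(server + "#" + i), server);
    }

    String serverFor(String docId) {
        // first ring point at or after the doc's hash; wrap to the start if none
        Integer k = ring.ceilingKey(hash(docId));
        return k == null ? ring.firstEntry().getValue() : ring.get(k);
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing(50);
        ring.addServer("IP1-1");
        ring.addServer("IP2-1");
        String before = ring.serverFor("doc-x");
        ring.addServer("IP3-1");
        String after = ring.serverFor("doc-x");
        // after is either unchanged or IP3-1 - never reshuffled to another old server
        System.out.println(before + " -> " + after);
    }
}
```

So consistent hashing limits how many docs move, but - as Yonik says in the quoted thread - it does not migrate or delete the old copies; that still needs separate code.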
Re: Show a facet filter All
Peter, take a close look at tagging and excluding filters: http://wiki.apache.org/solr/SimpleFacetParameters#LocalParams_for_faceting Another way would be to index your services_raw values as services_raw/Exclusive rental, services_raw/Fotoreport, services_raw/Live music. In this case, you can use the facet.prefix param to get all the services_raw/* values. I am not sure, but maybe even * is a valid prefix - then you would not need such extra work. If all your documents include a services_raw field, this facet wouldn't make much sense, since it would apply to all the documents, would it? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Show-a-facet-filter-All-tp1421248p1421539.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: anyone use hadoop+solr?
Hi, this topic was started a few months ago; however, there are some questions on my side that I couldn't answer by looking at the SOLR-1301 issue or the wiki pages. Let me try to explain my thoughts. Given: a Hadoop cluster, a Solr search cluster, and Nutch as a crawling engine which also performs LinkRank and webgraph-related tasks. Once a list of documents is created by Nutch, you put the list + the LinkRank values etc. into a Solr+Hadoop job as described in SOLR-1301 to index or reindex the given documents. When the shards are built, they are sent over the network to the Solr search cluster. Is this description correct? What makes me think is: assume I have a document X on machine Y in shard Y... When I reindex that document X together with lots of other documents that may or may not be present in shard Y, and I put the resulting shard on a machine Z, how does machine Y notice that it has an older version of document X than machine Z? Furthermore: go on and assume that shard Y was replicated to three other machines - how do they all notice that their version of document X is not the newest available one? In such an environment we do not have a master (right?), so: how to keep the index as consistent as possible? Thank you for clarifying. Kind regards -- View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1418140.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: full control over norm values?
Hi Michael, have a look at SweetSpotSimilarity (Lucene). Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/full-control-over-norm-values-tp1366910p1367462.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Why it's boosted up?
Hi Scott, (so shorter fields are automatically boosted up). The theory behind that is the following (in simple words): let's say you have two documents, and each doc consists of one field (as in my example). Additionally we have a query that contains two words. Let's say doc1 consists of 10 words and doc2 consists of 20 words. The query matches both docs with both words. The idea of boosting shorter fields more strongly than longer fields is this: in doc1, 2/10 = 0.2 = 20% of the words match your query. In doc2, 2/20 = 0.1 = 10% of the words match your query. So doc1 should get a better score, because its ratio of matching words to the total number of words is greater than doc2's. This is the idea of using norms as an index-time boosting factor. NOTE: This does not mean that doc1 gets boosted by 20% and doc2 by 10%! It only illustrates the idea behind such norms. From the Similarity class's documentation of lengthNorm(): Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large, and larger values when numTokens is small. However, you, as the developer of a search application, have to decide whether this theory applies to your application or not. In some cases using norms makes no sense; in others it does. If you think that norms do apply to your project, omitting them is not a good way to save disk space. Furthermore: if you think the theory does apply to the business needs of your application but its impact is currently too heavy, you can have a look at SweetSpotSimilarity in Lucene. The request is from our business team: they wish users of our product could type in a partial string of a word that exists in the title or body field. You mean something like typing note and also getting results like notebook? The correct approach for something like that is not a ShingleFilter but NGrams or edge NGrams. 
Shingles do something like this: This is my shingle sentence - This is, is my, my shingle, shingle sentence - the filter breaks the sentence up into smaller pieces. The benefit of doing so is that, if a query matches one of these shingles, you have found a short phrase without using the performance-consuming phrase-query feature. Kind regards, - Mitch scott chu wrote: In Lucene's web page, there's a paragraph: Indexing time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!) as follows: For each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a field length norm value that represents the length of that field in that doc (so shorter fields are automatically boosted up). I thought the greater the value, the stronger the boost. Then why are short fields boosted up? Isn't the norm value for short fields smaller? -- View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1306419.html Sent from the Solr - User mailing list archive at Nabble.com.
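The length-norm idea above can be made concrete with a few lines of arithmetic. Assuming the formula used by Lucene's DefaultSimilarity - roughly 1/sqrt(numTokens) - this sketch (class and method names invented for illustration) shows why the shorter field ends up with the larger norm and therefore the higher score:

```java
// Sketch of a Lucene-style length norm: 1 / sqrt(numTokens).
// Shorter fields get a larger norm, so - all else being equal - they score higher.
public class LengthNormDemo {

    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        float doc1 = lengthNorm(10); // 10-word field, 2 query words match
        float doc2 = lengthNorm(20); // 20-word field, same 2 words match
        // doc1's norm is larger, so doc1 ranks above doc2 for the same raw match
        System.out.println("doc1 norm: " + doc1 + ", doc2 norm: " + doc2);
    }
}
```

This also answers Scott's question directly: the norm value for the short field is larger, not smaller, which is exactly why short fields are "boosted up".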
Re: Solr creates whitespace in dismax query
Johann, try removing the WordDelimiterFilter from the query analyzer of your fieldType. If the WordDelimiterFilter in your index analyzer is configured well, it will find everything you want. Does this solve the problem? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-creates-whitespace-in-dismax-query-tp1317196p1318759.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Doing Shingle but also keep special single word
No, I mean that you use an additional field (indexed) for searching (e.g. whitespace-tokenized, so every word - separated by whitespace - becomes a token). So you have two fields (a shingle-token field and a single-token field), and you can search across both fields. This provides several benefits: e.g. you can boost the shingle field at query time, since a match in a shingle field means that an exact phrase matched. Additionally, you can search with single-word queries as well as multi-word queries. Furthermore, you can apply synonyms to your single-token field. If you want to keep your index as small as possible but as large as needed, try to understand Lucene's Similarity implementation to decide whether you can set the field options omitNorms=true or omitTermFreqAndPositions=true. http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/Similarity.html Keep in mind what happens if you omit one of those options. A small example of the consequences of setting omitNorms=true: doc1: this is a short example doc doc2: this is a longer example doc for presenting the effect of omitNorms If you are searching for doc while omitNorms=false, your response will look like this: doc1, doc2. This is because the norm value for doc1 is greater than the norm value for doc2, since doc1 is shorter than doc2 (have a look at the provided link). If omitNorms=true, the scores for both docs will be equal. Kind regards, - Mitch scott chu wrote: I don't quite understand the additional-field way? Do you mean making another field that stores the special words particularly, but no indexing for that field? Scott - Original Message - From: MitchK mitc...@web.de To: solr-user@lucene.apache.org Sent: Sunday, August 22, 2010 11:48 PM Subject: Re: Doing Shingle but also keep special single word Hi, the KeepWordFilter is no solution for this problem, since it would mean that one has to manage a word dictionary. 
As explained, this would be too much effort. You can easily add outputUnigrams=true and check analysis.jsp for this field; then you can see how much bigger a single field becomes with this option. However, I am quite sure that the difference between using outputUnigrams=true and indexing in a separate field is not noteworthy. I would suggest you do it the additional-field way, since this gives more flexibility in boosting the different fields. Unfortunately, I haven't understood your explanation of the use case. But it sounds a little bit like tagging? Kind regards, - Mitch iorixxx wrote: Doesn't setting outputUnigrams=true make the index size about twice what it is when it's set to false? Sure, the index will be bigger. I didn't know that this was a problem for you. But if you have a list of special single words that you want to keep, KeepWordFilter can eliminate the other tokens. So the index size will be okay. Scott - Original Message - From: Ahmet Arslan iori...@yahoo.com To: solr-user@lucene.apache.org Sent: Saturday, August 21, 2010 1:15 AM Subject: Re: Doing Shingle but also keep special single word I am building an index with the shingle filter. We know its minimum is 2-gram, but I also want to keep some special single words, e.g. IBM, Microsoft, etc. I.e. I want to do a minimum 2-gram but also want to have these single words in my index. Is it possible? Doesn't the outputUnigrams=true parameter work for you? After that you can add <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/> with keepwords.txt=IBM, Microsoft. -- View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1300497.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to limit rows to which highlighting applies
Alex, it sounds like it would make sense. Use cases could be e.g. clustering or similar techniques. However, in my opinion that point of view is not the right one for such a modification. E.g. one may want several result sets: I could imagine doing a primary query (the query for the displayed results) and a query to compute clustering results. Now you want to do different things with the result sets. The primary query needs faceting, highlighting, spellcheck and much more, whereas the additional query only needs clustering or something like that. In your case, you do not want to apply highlighting to the whole set, since you do not need that information for every row. This is a general problem, and I think a solution that makes it possible to create more than one result set for a single Solr request would be applicable to more general use cases. What do you think? Kind regards, - Mitch Alex Baranau wrote: Hello Solr users and devs! Is there a way to limit the number of rows to which highlighting applies? I don't see any hl.rows or similar parameter description, so it looks like I need to enhance HighlightComponent to enable that. If it is not possible currently, do you think it's worth adding such a possibility? JFI my case, where I need this: I display only 20, 10 or 5 rows on the results page, but I need many more rows (100-500) to display additional data on the same page. Queries can be very complex and their execution time (QueryComponent) is quite big, so I do want to fetch things via a single request. However, I noticed that with an increasing number of rows, the time spent in HighlightComponent increases dramatically. For those additional hundreds of rows I don't need highlighting at all. Actually, *ideally* it would be great to have the ability to specify the fields returned for those extra rows as well. 
So I tend to think that adding these features should not be based on changing HighlightComponent's behaviour, but on changing QueryComponent, or an even bigger part, somehow so that a Solr query accepts specifying extra group(s) of rows for fetching along with params for them (params which do not influence the searching process, like formatting/highlighting, fields to return, etc.). Thus we could execute *one* search query and fetch different data for different purposes. Does this all make sense to you guys? Thank you, Alex Baranau Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase Lucene ecosystem search :: http://search-lucene.com/ http://search-hadoop.com/ -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-limit-rows-to-which-highlighting-applies-tp1274042p1275962.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Doing Shingle but also keep special single word
Hi, the KeepWordFilter is no solution for this problem, since it would mean that one has to manage a word dictionary. As explained, this would be too much effort. You can easily add outputUnigrams=true and check analysis.jsp for this field; then you can see how much bigger a single field becomes with this option. However, I am quite sure that the difference between using outputUnigrams=true and indexing in a separate field is not noteworthy. I would suggest you do it the additional-field way, since this gives more flexibility in boosting the different fields. Unfortunately, I haven't understood your explanation of the use case. But it sounds a little bit like tagging? Kind regards, - Mitch iorixxx wrote: Doesn't setting outputUnigrams=true make the index size about twice what it is when it's set to false? Sure, the index will be bigger. I didn't know that this was a problem for you. But if you have a list of special single words that you want to keep, KeepWordFilter can eliminate the other tokens. So the index size will be okay. Scott - Original Message - From: Ahmet Arslan iori...@yahoo.com To: solr-user@lucene.apache.org Sent: Saturday, August 21, 2010 1:15 AM Subject: Re: Doing Shingle but also keep special single word I am building an index with the shingle filter. We know its minimum is 2-gram, but I also want to keep some special single words, e.g. IBM, Microsoft, etc. I.e. I want to do a minimum 2-gram but also want to have these single words in my index. Is it possible? Doesn't the outputUnigrams=true parameter work for you? After that you can add <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/> with keepwords.txt=IBM, Microsoft. -- View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html Sent from the Solr - User mailing list archive at Nabble.com.
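To illustrate what the outputUnigrams option changes, here is a toy sketch of what a 2-gram shingle stream looks like with and without unigrams. This is not the real ShingleFilter from Lucene, just invented demo code mimicking its token output:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration (not Lucene's ShingleFilter) of 2-gram shingle output,
// with and without the outputUnigrams option.
public class ShingleDemo {

    static List<String> shingles(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) out.add(tokens.get(i));               // single word
            if (i + 1 < tokens.size())
                out.add(tokens.get(i) + " " + tokens.get(i + 1));     // 2-gram shingle
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("IBM", "sells", "servers");
        System.out.println(shingles(tokens, false)); // [IBM sells, sells servers]
        System.out.println(shingles(tokens, true));  // [IBM, IBM sells, sells, sells servers, servers]
        // with outputUnigrams=true the single word "IBM" is searchable on its own,
        // at the cost of a larger index
    }
}
```

This is why outputUnigrams=true roughly doubles the token count: every unigram is emitted alongside the shingles, which is exactly the index-size trade-off discussed above.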
Re: SolrJ Response + JSON
Hi, as I promised, I want to give some feedback on transforming SolrJ's output into JSON with the package from json.org: I needed to make a small modification to the package. Since it stores the JSON key-value pairs in a HashMap, I changed this to a LinkedHashMap to make sure that the order of the retrieved values is the same order in which they were inserted into the map. The result looks very, very pretty. It was very easy to transform SolrJ's output into the desired JSON format, and I can now add whatever I want to the response. Kind regards, - Mitch
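The HashMap-to-LinkedHashMap change above boils down to one property of the Java collections: LinkedHashMap iterates in insertion order, plain HashMap does not. A small demo (the field names are just examples, not SolrJ output):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Why swapping HashMap for LinkedHashMap fixes JSON key order:
// LinkedHashMap preserves insertion order, so serialized keys come out
// in the order they were put in. (Field names here are just examples.)
public class OrderDemo {
    public static void main(String[] args) {
        Map<String, Object> ordered = new LinkedHashMap<>();
        ordered.put("responseHeader", 0);
        ordered.put("response", "docs...");
        ordered.put("score", 1.5);
        // iteration order == insertion order
        System.out.println(ordered.keySet()); // [responseHeader, response, score]

        Map<String, Object> unordered = new HashMap<>();
        unordered.put("responseHeader", 0);
        unordered.put("response", "docs...");
        unordered.put("score", 1.5);
        // iteration order depends on hashing, not on insertion
        System.out.println(unordered.keySet());
    }
}
```

A JSON object serializer that walks such a map therefore emits the keys in insertion order only when the backing map is a LinkedHashMap.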
RE: Boosting DisMax queries with !boost component
Jonathan Rochkind wrote: qf needs to have spaces in it; unfortunately the local query parser cannot deal with that, as Erik Hatcher mentioned some months ago. By local query parser, you mean what I call the LocalParams stuff (for lack of being sure of the proper term)? Yes, that was what I meant. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Boosting-DisMax-queries-with-boost-component-tp1011294p1015619.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Boosting DisMax queries with !boost component
Hi, qf needs to have spaces in it; unfortunately the local query parser cannot deal with that, as Erik Hatcher mentioned some months ago. A solution would be to do something like this: q={!dismax qf=$yourqf}yourQuery&yourqf=title^1.0 tags^2.0 Since you are using the dismax query parser, you can add the boosting query via the bq param. Hope this helps, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Boosting-DisMax-queries-with-boost-component-tp1011294p1014242.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Nabble problems?
I have some problems with Nabble, too. Nabble sends warnings that my posts to the mailing list are still pending, while people are already answering my initial questions. Did you send a message to the Nabble support? Kind regards, - Mitch kenf_nc wrote: The Nabble.com page for Solr - User seems to be broken. I haven't seen an update on it since early this morning. However, I'm still getting email notifications, so people are seeing and responding to posts. I'm just curious: are you just using email and responding to solr-u...@lucene.apache.org? Or is there a mirror site that *is* working for the Solr User forum? -- View this message in context: http://lucene.472066.n3.nabble.com/Nabble-problems-tp1004870p1004992.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrJ Response + JSON
Hello community, I need to transform SolrJ responses into JSON after another application has finished some computations on those results. I cannot do those computations on the Solr side, so I really have to translate SolrJ's output into JSON. Any experience with how to do so without writing your own JSON writer? Thank you. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-Response-JSON-tp1002024p1002024.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrJ Response + JSON
Hello, Second try to send a mail to the mailing list... I need to translate SolrJ's response into a JSON response. I cannot query Solr directly, because I need to do some math with the returned data before I show the results to the client. Any experience with how to translate SolrJ's response into JSON without writing your own JSON writer? Thank you. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-Response-JSON-tp1002115p1002115.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrJ Response + JSON
Thank you Markus, Mark. Seems to be a problem with Nabble, not with the mailing list. Sorry. I can create a JSON response when I query Solr directly. But what I mean is that I query Solr through a SolrJ client (CommonsHttpSolrServer). That means my queries look a little bit like this: http://wiki.apache.org/solr/Solrj#Reading_Data_from_Solr So the response comes back as a QueryResponse object, not as a JSON string. Or am I missing something here? On 28.07.2010 15:15, Markus Jelsma wrote: Hi, I got a response to your e-mail in my box 30 minutes ago. Anyway, enable the JSONResponseWriter, if you haven't already, and query with wt=json. Can't get much easier. Cheers, On Wednesday 28 July 2010 15:08:26 MitchK wrote: Hello, Second try to send a mail to the mailing list... I need to translate SolrJ's response into a JSON response. I cannot query Solr directly, because I need to do some math with the returned data before I show the results to the client. Any experience with how to translate SolrJ's response into JSON without writing your own JSON writer? Thank you. - Mitch Markus Jelsma - Technical Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: SolrJ Response + JSON
Thank you, Chantal. I have looked at this one: http://www.json.org/java/index.html This seems to be an easy-to-understand implementation. However, I am wondering how to determine whether a SolrDocument's field is multiValued or not. Solr's JSONResponseWriter looks at the schema configuration; however, the client shouldn't have to do that. How did you solve that problem? Thanks for sharing ideas. - Mitch On 28.07.2010 15:35, Chantal Ackermann wrote: You could use org.apache.solr.handler.JsonLoader. That one uses org.apache.noggit.JSONParser internally. I've used the JacksonParser with Spring. http://json.org/ lists parsers for different programming languages. Cheers, Chantal On Wed, 2010-07-28 at 15:08 +0200, MitchK wrote: Hello, Second try to send a mail to the mailing list... I need to translate SolrJ's response into a JSON response. I cannot query Solr directly, because I need to do some math with the returned data before I show the results to the client. Any experience with how to translate SolrJ's response into JSON without writing your own JSON writer? Thank you. - Mitch
Re: SolrJ Response + JSON
Hi Chantal, thank you for the feedback. I did not see the wood for the trees! The SolrDocument javadoc says the following: http://lucene.apache.org/solr/api/org/apache/solr/common/SolrDocument.html getFieldValue(String name) - Get the value or collection of values for a given field. The magical word here is that little or :-). I will try that tomorrow and give you feedback! Are you sure that you cannot change the SOLR results at query time according to your needs? Unfortunately, it is not possible in this case. Kind regards, Mitch On 28.07.2010 16:49, Chantal Ackermann wrote: Hi Mitch On Wed, 2010-07-28 at 16:38 +0200, MitchK wrote: Thank you, Chantal. I have looked at this one: http://www.json.org/java/index.html This seems to be an easy-to-understand implementation. However, I am wondering how to determine whether a SolrDocument's field is multiValued or not. Solr's JSONResponseWriter looks at the schema configuration; however, the client shouldn't do that. How did you solve that problem? I didn't. I'm not recreating JSON from the SolrJ results. I would try to use the same classes that SolrJ uses, actually. (Writing that without having a further look at the code.) I would avoid recreating existing code as much as possible. About multivalued fields: you need instanceof checks, I guess. The field only contains a list if there really are multiple values. (That's what works for my ScriptTransformer.) Are you sure that you cannot change the SOLR results at query time according to your needs? Maybe you should ask for that, first (ask for X instead of Y...). Cheers, Chantal Thanks for sharing ideas. - Mitch On 28.07.2010 15:35, Chantal Ackermann wrote: You could use org.apache.solr.handler.JsonLoader. That one uses org.apache.noggit.JSONParser internally. I've used the JacksonParser with Spring. http://json.org/ lists parsers for different programming languages. Cheers, Chantal On Wed, 2010-07-28 at 15:08 +0200, MitchK wrote: Hello, Second try to send a mail to the mailing list... I need to translate SolrJ's response into a JSON response. I can not query Solr directly, because I need to do some math with the response data before I show the results to the client. Any experience with translating SolrJ's response into JSON without writing your own JSON writer? Thank you. - Mitch
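Pulling the pieces of this thread together - getFieldValue returning "value or collection" plus Chantal's instanceof hint - a plain-Java sketch of the conversion could look like the following. It deliberately uses a Map instead of a SolrDocument and hand-rolls the JSON, so everything here is illustrative rather than SolrJ API:

```java
import java.util.*;

public class JsonSketch {
    // Build a JSON object string from a field map, treating Collection values
    // as JSON arrays -- mirroring the "value or collection of values"
    // behaviour of SolrDocument.getFieldValue described above.
    static String toJson(Map<String, Object> doc) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append('"').append(e.getKey()).append("\":");
            Object v = e.getValue();
            if (v instanceof Collection) {             // multi-valued field
                sb.append("[");
                boolean innerFirst = true;
                for (Object item : (Collection<?>) v) {
                    if (!innerFirst) sb.append(",");
                    innerFirst = false;
                    sb.append(encode(item));
                }
                sb.append("]");
            } else {                                   // single-valued field
                sb.append(encode(v));
            }
        }
        return sb.append("}").toString();
    }

    static String encode(Object v) {
        if (v instanceof Number || v instanceof Boolean) return v.toString();
        return '"' + v.toString().replace("\"", "\\\"") + '"';
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "doc1");
        doc.put("cat", Arrays.asList("a", "b"));
        System.out.println(toJson(doc)); // {"id":"doc1","cat":["a","b"]}
    }
}
```

In real code you would iterate over the SolrDocumentList from the QueryResponse instead of a hand-built map, but the instanceof Collection branch is the part that answers the multiValued question.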
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, However, with this approach indexing time went up from 20min to more than 5 hours. This is 15x slower than the initial solution... wow. From MySQL I know that IN ()-clauses are the embodiment of endlessness - they perform very, very badly. New idea: create a method which returns the query string: returnString(theVIP) { if (theVIP != null && !theVIP.equals("")) { return "a query string to find the VIP"; } else { return "SELECT ''"; // you need to modify this so that it matches your field definition } } The main idea is to perform a blazing fast query instead of a complex IN-clause query. Does this sound like a solution??? The new approach is to query the solr index for that other database that I've already setup. This is only a bit slower than the original query (20min). (I'm using URLDataSource to be 1.4.1 conform.) Unfortunately I cannot follow you. You are querying a Solr index for a database? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p998859.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, instead of:

<entity name="prog" ...>
  <field name="vip" ... /> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" name="vip_ssc" />
  </entity>
</entity>

you do:

<entity name="prog" ...>
  <field name="vip" ... /> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="${yourCustomFunctionToReturnAQueryString(prog.vip, ..., ...)}">
    <field column="SSC_VALUE" name="vip_ssc" />
  </entity>
</entity>

The yourCustomFunctionToReturnAQueryString(vip, querystring1, querystring2) {
  if (vip != null && !vip.equals("")) {
    StringBuilder sb = new StringBuilder(50);
    sb.append(querystring1); // "SELECT SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in ("
    sb.append(vip);          // the VIP value
    sb.append(querystring2); // just the closing ")"
    return sb.toString();
  } else {
    return "SELECT '' AS yourFieldName";
  }
}

I expect that this method is called for every vip value, if there is one. Solr DIH uses the returned query string to query the database. So, if the vip value is empty or null, you can use a different query that is blazing fast (i.e. SELECT '' AS yourFieldName - just an example to show the logic). This query should return a row with an empty string, so Solr fills the current field with an empty string. I don't know how to prevent Solr from calling your ssc_entry entity when vip is null or empty. But this would be a solution to handle empty vip strings as efficiently as possible. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: can you show how to make a method throw an exception that is accepted by the onError attribute? I hope we do not talk past each other here.
:-) Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p998950.html Sent from the Solr - User mailing list archive at Nabble.com.
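The branching logic sketched above, as compilable Java (the method name, parameter names, and the dummy query are made up for illustration; inside DIH this would live in a custom Evaluator, not a plain static method):

```java
public class QueryStringSketch {
    // Build the real IN-clause query only when a vip value is present;
    // otherwise fall back to a trivially cheap dummy query that yields
    // a single empty-string row (Oracle-style FROM DUAL is an assumption).
    static String buildQuery(String vip, String queryPrefix, String querySuffix) {
        if (vip != null && !vip.equals("")) {
            return queryPrefix + vip + querySuffix;
        }
        return "SELECT '' AS vip_ssc FROM DUAL";
    }

    public static void main(String[] args) {
        String prefix = "select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (";
        System.out.println(buildQuery("'A','B'", prefix, ")"));
        System.out.println(buildQuery(null, prefix, ")"));
    }
}
```

The point of the design is that the expensive IN-clause is only ever sent to the database when it can actually match something.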
Re: a bug of solr distributed search
Good morning, https://issues.apache.org/jira/browse/SOLR-1632 - Mitch Li Li wrote: where is the link of this patch? 2010/7/24 Yonik Seeley yo...@lucidimagination.com: On Fri, Jul 23, 2010 at 2:23 PM, MitchK mitc...@web.de wrote: why do we not send the output of TermsComponent of every node in the cluster to a Hadoop instance? Since TermsComponent does the map part of the map-reduce concept, Hadoop only needs to reduce the stuff. Maybe we do not even need Hadoop for this. After reducing, every node in the cluster gets the current values to compute the idf. We can store this information in a HashMap-based SolrCache (or something like that) to provide constant-time access. To keep the values up to date, we can repeat that every x minutes. There's already a patch in JIRA that does distributed IDF. Hadoop wouldn't be the right tool for that anyway... it's for batch-oriented systems, not low-latency queries. If we had that, it would not matter whether we use doc_X from shard_A or shard_B, since they would all have the same scores. That only works if the docs are exactly the same - they may not be. -Yonik http://www.lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p995407.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Doc Lucene Doc !?
Stockii, Solr's index is a Lucene Index. Therefore, Solr documents are Lucene documents. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p995968.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, did you try to write a custom DIH function ( http://wiki.apache.org/solr/DIHCustomFunctions )? If not, I think this will be a solution. Just check whether ${prog.vip} is an empty string or null. If so, you need to replace it with a value that can never match anything, so the vip field will always be empty for such queries. Maybe that helps? Hopefully, the variable resolver is able to resolve something like ${dih.functions.getReplacementIfNeeded(prog.vip)}. Kind regards, - Mitch Chantal Ackermann wrote: Hi, my use case is the following: In a sub-entity I request rows from a database for an input list of strings:

<entity name="prog" ...>
  <field name="vip" ... /> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" name="vip_ssc" />
  </entity>
</entity>

The root entity is prog and it has an optional multivalued field called vip. When the list of vip values is empty, the SQL for the sub-entity above throws an SQLException. (Working with Oracle, which does not allow an empty expression in the in-clause.) Two things: (A) best would be not to run the query whenever ${prog.vip} is null or empty. (B) From the documentation, it is not clear that onError is only checked in the transformer runs but not when the SQL for the entity throws an exception. (Trunk version JdbcDataSource, lines 250pp.) IMHO, (A) is the better fix, and if so, (B) is the right decision. (If (A) is not easily fixable, making (B) work would be helpful.) Looking through the code, I've realized that the replacement of the variables is done in a very generic way. I've not yet seen an appropriate way to check on those variables in order to stop the processing of the entity if the variable is empty. Is there a way to do this? Or maybe there is a completely different way to get my use case working. Any help most appreciated!
Thanks, Chantal -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-SQL-query-sub-entity-is-executed-although-variable-is-not-set-null-or-empty-list-tp995983p996446.html Sent from the Solr - User mailing list archive at Nabble.com.
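For completeness, a custom DIH function is wired up in data-config.xml roughly like this (the class and function names below are hypothetical placeholders; only the <function> element and the ${dih.functions...} call syntax come from the DIHCustomFunctions wiki page mentioned above):

```xml
<dataConfig>
  <!-- register the custom evaluator under a function name -->
  <function name="getReplacementIfNeeded" class="my.pkg.ReplacementEvaluator"/>
  <document>
    <entity name="prog" query="...">
      <entity name="ssc_entry" dataSource="ssc"
              query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1
                     and SSC_VALUE in (${dih.functions.getReplacementIfNeeded(prog.vip)})">
        <field column="SSC_VALUE" name="vip_ssc"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```
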
Re: a bug of solr distributed search
Okay, but then Li Li did something wrong, right? I mean, if the document exists only on one shard, it should get the same score whenever one requests it, no? Of course, this only applies if nothing changes between the requests. The only remaining problem here would be that you need distributed IDF (like in the mentioned JIRA issue) to normalize your results' scoring. But the problem mentioned in this mailing-list posting has nothing to do with that... Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p991907.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
Yonik, why do we not send the output of TermsComponent of every node in the cluster to a Hadoop instance? Since TermsComponent does the map part of the map-reduce concept, Hadoop only needs to reduce the stuff. Maybe we do not even need Hadoop for this. After reducing, every node in the cluster gets the current values to compute the idf. We can store this information in a HashMap-based SolrCache (or something like that) to provide constant-time access. To keep the values up to date, we can repeat that every x minutes. If we had that, it would not matter whether we use doc_X from shard_A or shard_B, since they would all have the same scores. Even with large indices of 10 million or more unique terms, this would only need a few megabytes of network traffic. Kind regards, - Mitch Yonik Seeley-2-2 wrote: As the comments suggest, it's not a bug, but just the best we can do for now since our priority queues don't support removal of arbitrary elements. I guess we could rebuild the current priority queue if we detect a duplicate, but that will have an obvious performance impact. Any other suggestions? -Yonik http://www.lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990506.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
... In addition to my previous posting: to keep this in sync we could do two things. Wait for every server, to make sure that everyone uses the same values to compute the score, and then apply them. Or: let's say that we collect the new values every 15 minutes. To merge and send them over the network, we declare that this will need 3 additional minutes (we want to keep the network traffic for such actions very low, so we do not send everything instantly). Okay, and now we add 2 additional minutes, in case 3 were not enough or something needs a little bit more time than we thought. After those 2 minutes, every node has to apply the new values. Pro: if one node breaks, we do not delay the application of the new values. Con: we need two HashMaps and both will have roughly the same size. That means we will waste some RAM for this operation, if we do not write the values to disk (which I do not suggest). Thoughts? - Mitch MitchK wrote: Yonik, why do we not send the output of TermsComponent of every node in the cluster to a Hadoop instance? Since TermsComponent does the map part of the map-reduce concept, Hadoop only needs to reduce the stuff. Maybe we do not even need Hadoop for this. After reducing, every node in the cluster gets the current values to compute the idf. We can store this information in a HashMap-based SolrCache (or something like that) to provide constant-time access. To keep the values up to date, we can repeat that every x minutes. If we had that, it would not matter whether we use doc_X from shard_A or shard_B, since they would all have the same scores. Even with large indices of 10 million or more unique terms, this would only need a few megabytes of network traffic. Kind regards, - Mitch Yonik Seeley-2-2 wrote: As the comments suggest, it's not a bug, but just the best we can do for now since our priority queues don't support removal of arbitrary elements.
I guess we could rebuild the current priority queue if we detect a duplicate, but that will have an obvious performance impact. Any other suggestions? -Yonik http://www.lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990551.html Sent from the Solr - User mailing list archive at Nabble.com.
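The reduce-and-distribute step described above can be sketched in plain Java: collect per-shard document frequencies (what TermsComponent can report), sum them, and compute a global idf per term. The idf formula used here is Lucene's classic 1 + ln(N / (df + 1)); the method and variable names are illustrative:

```java
import java.util.*;

public class GlobalIdfSketch {
    // Merge per-shard document frequencies (the "reduce" step) and compute
    // a global idf per term, which each node could then cache locally.
    static Map<String, Double> globalIdf(List<Map<String, Integer>> shardDfs, long totalDocs) {
        // sum document frequencies across shards
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> shard : shardDfs) {
            for (Map.Entry<String, Integer> e : shard.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // idf(term) = 1 + ln(N / (df + 1)), as in Lucene's DefaultSimilarity
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : merged.entrySet()) {
            idf.put(e.getKey(), 1.0 + Math.log((double) totalDocs / (e.getValue() + 1)));
        }
        return idf;
    }
}
```

The resulting map is exactly the kind of structure a HashMap-based SolrCache could hold to give constant-time access to global idf values between refreshes.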
Re: a bug of solr distributed search
That only works if the docs are exactly the same - they may not be. Ahm, what? Why? If the uniqueID is the same, the docs *should* be the same, shouldn't they? -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990563.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
Li Li, this is the intended behaviour, not a bug. Otherwise you could get back the same record several times in one response, which may not be what the user intends. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983675.html Sent from the Solr - User mailing list archive at Nabble.com.
nested query and number of matched records
Hello community, I have a situation where I know that some types of documents contain very extensive information and other types give more general information. Since I don't know whether a user searches for general or extensive information (and I don't want to ask him when he uses the default search), I want to give him a response like this: 10 documents of type: short, plus 1 document, if there is one, of type: extensive. An example query would look like this: q={!dismax fq=type:short}my cool query OR {!dismax fq=type:extensive}my cool query The problem with this one will be that I cannot specify that I want to retrieve up to 10 short documents and at most one extensive one. I think this will not work, and if I want to create such a search, I need to do two different queries. But before I waste performance, I wanted to ask. Thank you! Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983756.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
Ah, okay. I understand your problem. Why should doc x be at position 1 when searching for the first time, and at position 8 when I search a second time - right? I am not sure, but I think you can't prevent this without custom coding or making a document's occurrence unique. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983771.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: nested query and number of matched records
Oh,... I just noticed there is no direct question ;-). How can I specify the number of returned documents in the desired way *within* one request? - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983773.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
I don't know much about the code. Maybe you can tell me which file you are referring to? However, from the comments one can see that the problem is known, but it was decided to let it happen because of system requirements in the Java version. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983880.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
It was already sorted by score. The problem here is the following: shard_A and shard_B both contain doc_X. If you query for something, doc_X could have a score of 1.0 at shard_A and a score of 12.0 at shard_B. You can never be sure which copy Solr sees first. In the bad case, Solr sees doc_X first at shard_A and ignores it at shard_B. That means that the doc may occur at page 10 in the pagination, although it *should* occur at page 1 or 2. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p984743.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: nested query and number of matched records
Thank you three for your feedback! Chantal, unfortunately kenf is right: faceting won't work in this special case. Parallel calls: yes, this will be the solution. However, this would lead to a second HTTP request, and I had hoped to be able to avoid that. Chantal Ackermann wrote: Sure SOLR supports this: use facets on the field type: add to your regular query: facet=true&facet.field=type see http://wiki.apache.org/solr/SimpleFacetParameters On Wed, 2010-07-21 at 15:48 +0200, kenf_nc wrote: parallel calls. Simultaneously query for type:short with rows=10 and type:extensive with rows=1 and merge your results. This would also let you separate your short docs from your extensive docs into different Solr instances if you wished... depending on your document architecture this could speed up one or the other. -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p984750.html Sent from the Solr - User mailing list archive at Nabble.com.
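The client-side merge for kenf's parallel-calls approach is only a few lines. In this plain-Java sketch, doc ids stand in for real result objects, and the 10/1 split is simply the numbers from this thread:

```java
import java.util.*;

public class MergeSketch {
    // Merge the results of the two parallel queries: keep up to 10 docs
    // from the "short" result list and at most 1 from the "extensive" one.
    static List<String> merge(List<String> shortDocs, List<String> extensiveDocs) {
        List<String> out = new ArrayList<>(shortDocs.subList(0, Math.min(10, shortDocs.size())));
        if (!extensiveDocs.isEmpty()) {
            out.add(extensiveDocs.get(0));
        }
        return out;
    }
}
```

With SolrJ, shortDocs and extensiveDocs would come from two concurrent queries using fq=type:short&rows=10 and fq=type:extensive&rows=1 respectively.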
Re: Autocomplete with NGrams
It sounds like the best solution here, right. However, I do not want to exclude the possibility of doing things in one core that one *should* do in different cores with different configurations and schema.xml files. I haven't completely read the lucidimagination article, but I would suggest doing your work in different cores, since it would make managing and configuring the different tasks easier. Furthermore, the configuration optimized for task A (a normal index where you search) may work worse or be wasteful for task B. To prevent such situations you must use multicore setups. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Autocomplete-with-NGrams-tp979312p980680.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Beginner question
Here you can find the params and their meanings for the dismax handler. You may not find anything in the wiki by searching for a parser ;). Link: http://wiki.apache.org/solr/DisMaxRequestHandler Wiki: DisMaxRequestHandler Kind regards - Mitch Erik Hatcher-4 wrote: Consider using the dismax query parser instead. It has more sophisticated capability to spread user queries across multiple fields with different weightings. Erik On Jul 20, 2010, at 4:34 AM, Bilgin Ibryam wrote: Hi all, I have two simple questions: I have an Item entity with id, name, category and description fields. The main requirement is to be able to search in all the fields with the same string and different priority per field, so matches in name appear before category matches, and they appear before description field matches in the result list. 1. I think to create an index having the same fields, because each field needs a different priority during searching. 2. And then do the search with a query like this: name:search_string^1.3 OR category:search_string^1.2 OR description:search_string^1.1 Is this the right approach to model the index and search query? Thanks in advance. Bilgin -- View this message in context: http://lucene.472066.n3.nabble.com/Beginner-question-tp980695p980819.html Sent from the Solr - User mailing list archive at Nabble.com.
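For the use case quoted above, dismax puts the per-field weights into the qf parameter instead of a hand-built boolean query. A request could look roughly like this (the boost values simply mirror the ones from Bilgin's mail):

```
q=search_string&defType=dismax&qf=name^1.3 category^1.2 description^1.1
```

With dismax, q stays the raw user input and the per-field boosts live entirely in qf.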
Problem with Solr-Mailinglist
Hello, I have tried to post this message ( http://lucene.472066.n3.nabble.com/Solr-in-an-extra-project-what-about-replication-scaling-etc-td977961.html#a977961 ) to the Solr mailing list for the fourth time, and every time I get the following response from the mailing list's server: solr-user@lucene.apache.org SMTP error from remote mail server after end of data: host mx1.eu.apache.org [192.87.106.230]: 552 spam score (7.8) exceeded threshold Why is my posting declared as spam?! Has anyone else had such problems? Thank you! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-Solr-Mailinglist-tp978247p978247.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem with Solr-Mailinglist
Thank you both. I will do what Hoss suggested tomorrow. The mail was sent via the Nabble board and a second time via my Thunderbird client, both with the same result. So there was no more HTML code in it than in any of my other postings. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-Solr-Mailinglist-tp978247p979602.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Autocomplete with NGrams
Frank, have a look at Solr's example directory and look for 'multicore'. There you can see an example configuration for a multicore environment. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Autocomplete-with-NGrams-tp979312p979610.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr with hadoop
I need to revive this discussion... If you do distributed indexing correctly, what about updating the documents and what about replicating them correctly? Does this work? Or wasn't this an issue? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p944413.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Wither field compresed=true ?
David, well, I am no committer, but I noticed that Lucene no longer supports field compression (I think this was because of the trouble it caused), and maybe this is the reason why Solr no longer makes this option available. Unfortunately, I do not have a link for it, but I think this was said in some CHANGES.txt (at Nutch, I think). Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Wither-field-compresed-true-tp926288p929985.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How I can use score value for my function
Britske, good workaround! I did not think about the possibility of using subqueries. Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/How-I-can-use-score-value-for-my-function-tp899662p931448.html Sent from the Solr - User mailing list archive at Nabble.com.
Question about the mailinglist (junk on my behalf)
Hello community, for a few days now I have been receiving daily mails with suspicious content. They say that some of my mails were rejected because of the file types of their attachments, among other things. This surprises me a lot, because I didn't send any mails with attachments, and even the e-mail addresses that want to make me aware of my rejected mails are unknown to me. This is the first mailing list I have joined, and I know that there are a lot of bots out there crawling for e-mail addresses to send junk. However, I can't recognize any suspicious behaviour except those mails. The number of mails making me aware of this is 10 in a few days, maybe 15 but not more. And I do not get more junk than I normally get. Does anyone else receive suspicious e-mails on my behalf? Thank you. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-the-mailinglist-junk-on-my-behalf-tp927461p927461.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: MoreLikeThis (mlt) : use the match's maxScore for result score normalization
Hi Chantal, Munich? Germany seems to be soo small :-). Chantal Ackermann wrote: I only want a way to show to the user a kind of relevancy or similarity indicator (for example using a range of 10 stars) that would give a hint on how similar the mlt hit is to the input (match) item. Okay, that makes more sense. Unfortunately, as far as I know, you cannot do that with Lucene and get results that fit your needs. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/MoreLikeThis-mlt-use-the-match-s-maxScore-for-result-score-normalization-tp919598p921942.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 1.4 - Image-Highlighting and Payloads
Sebastian, sounds like an exciting project. We've found the argument TokenGroup in method highlightTerm implemented in SimpleHtmlFormatter. TokenGroup provides the method getPayload(), but the returned value is always NULL. No, Token provides this method, not TokenGroup. But this might not be the mistake. Hm, since this approach is very special, I would suggest doing something easier. You already have tools to retrieve the words and their positions from the image, right? What if you added a field to the schema.xml with a preprocessed input string? I.e., you get two fields: the page's text and the page's word positions. The word-positions field needs preprocessing outside of Solr, where you add the coordinates of the words. This preprocessing will be a little bit tricky: if the 10th word is Solr and the 30th word is too, you do not want to store solr twice with different coordinates. In fact, you want to store both coordinates for the term solr. On the Solr side you can then add this preprocessed string to a field with TermVectors. If your query hits the page, you will get all the coordinates you want. Unfortunately, highlighting must be done on the client side. Hope this helps - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-1-4-Image-Highlighting-and-Payloads-tp919266p919342.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: MoreLikeThis (mlt) : use the match's maxScore for result score normalization
Chantal, have a look at MoreLikeThis ( http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html ) to get an idea of what the MLT's score means. The problem is that you can't compare scores. The query for the normal result response was maybe something like Bill Gates featuring Linus Torvald - The perfect OS song. The user now picks one of the returned documents and says he wants More like this - maybe because the topic was okay, but the content was not enough, or whatever... But the query that gets sent is totally different (as you can see in the link) - so comparing the scores would be like comparing apples and oranges, since they do not use the same base. What would be the use case? Why is score normalization needed? Kind regards from Germany, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/MoreLikeThis-mlt-use-the-match-s-maxScore-for-result-score-normalization-tp919598p919716.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr with hadoop
I wanted to add a JIRA issue about exactly what Otis is asking here. Unfortunately, I don't have time for it because of my exams. However, I'd like to add a question to Otis' ones: if you distribute the indexing process this way, are you able to replicate the different documents correctly? Thank you. - Mitch Otis Gospodnetic-2 wrote: Stu, Interesting! Can you provide more details about your setup? By load balance the indexing stage you mean distribute the indexing process, right? Do you simply take your content to be indexed, split it into N chunks where N matches the number of TaskNodes in your Hadoop cluster and provide a map function that does the indexing? What does the reduce function do? Does that call IndexWriter.addAllIndexes or do you do that outside Hadoop? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Stu Hood stuh...@webmail.us To: solr-user@lucene.apache.org Sent: Monday, January 7, 2008 7:14:20 PM Subject: Re: solr with hadoop As Mike suggested, we use Hadoop to organize our data en route to Solr. Hadoop allows us to load balance the indexing stage, and then we use the raw Lucene IndexWriter.addAllIndexes method to merge the data to be hosted on Solr instances. Thanks, Stu -Original Message- From: Mike Klaas mike.kl...@gmail.com Sent: Friday, January 4, 2008 3:04pm To: solr-user@lucene.apache.org Subject: Re: solr with hadoop On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote: I have a huge index base (about 110 million documents, 100 fields each). But the size of the index base is reasonable, it's about 70 Gb. All I need is to increase performance, since some queries, which match a big number of documents, are running slow. So I was thinking: are there any benefits to using Hadoop for this? And if so, what direction should I go? Has anybody done something to integrate Solr with Hadoop? Does it give any performance boost?
Hadoop might be useful for organizing your data enroute to Solr, but I don't see how it could be used to boost performance over a huge Solr index. To accomplish that, you need to split it up over two machines (for which you might find hadoop useful). -Mike -- View this message in context: http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914589.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC score computed by Nutch into a Solr field and use it during scoring, if you want, say with a function query. Oh! Yes, that makes more sense than using the OPIC score as a doc-boost value. :-) Somewhere on the Lucene mailing lists I read that in the future it will be possible to change a field's contents without reindexing the whole document. If one stores the OPIC score (which is independent of the page's content) in a field and uses a function query to influence the score of a document, one saves the effort of reindexing the whole doc if the content did not change. Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p902158.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Otis, you are right. I wasn't aware of this. At least not with such a large data list (think of an index with 4 million docs; this would mean we'd get an external file with 4 million records). But from what I've read at search-lucene.com it seems to perform very well. Thanks for the idea! Btw: Otis, did you open a JIRA issue for the distributed indexing ability of Solr? I would like to follow the issue, if it is open. Regards - Mitch Otis Gospodnetic-2 wrote: Mitch, Yes, one day. But it sounds like you are not aware of ExternalFileField, which you can use today: http://search-lucene.com/?q=ExternalFileField&fc_project=Solr Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: MitchK mitc...@web.de To: solr-user@lucene.apache.org Sent: Thu, June 17, 2010 4:15:27 AM Subject: Re: Re: Re: Solr and Nutch/Droids - to use or not to use? Solr doesn't know anything about OPIC, but I suppose you can feed the OPIC score computed by Nutch into a Solr field and use it during scoring, if you want, say with a function query. Oh! Yes, that makes more sense than using the OPIC score as a doc-boost value. :-) Somewhere on the Lucene mailing lists I read that in the future it will be possible to change a field's contents without reindexing the whole document. If one stores the OPIC score (which is independent of the page's content) in a field and uses a function query to influence the score of a document, one saves the effort of reindexing the whole doc if the content did not change. Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p902158.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p903148.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Document boosting troubles
Hi, first of all, are you sure that row.put('$docBoost',docBoostVal) is correct? I think it should be row.put($docBoost,docBoostVal); - unfortunately I am not sure. Hm, I think, until you solve the problem with the docBoost itself, you should use a function query. Use div(1,rank) as boost function (bf). The higher the rank value, the smaller the result. Hope this helps! - Mitch dbashford wrote: Brand new to this sort of thing, so bear with me. For the sake of simplicity, I've got a two-field document: title and rank. Title gets searched on; rank has values from 1 to 10, 1 being highest. What I'd like to do is boost results of searches on title based on the document's rank. Because it's fairly cut and dried, I was hoping to do it during indexing. I have this in my DIH transformer:

var docBoostVal = 0;
switch (rank) {
  case '1': docBoostVal = 3.0; break;
  case '2': docBoostVal = 2.6; break;
  case '3': docBoostVal = 2.2; break;
  case '4': docBoostVal = 1.8; break;
  case '5': docBoostVal = 1.5; break;
  case '6': docBoostVal = 1.2; break;
  case '7': docBoostVal = 0.9; break;
  case '8': docBoostVal = 0.7; break;
  case '9': docBoostVal = 0.5; break;
}
row.put('$docBoost', docBoostVal);

It's my understanding that with this, I can simply do the same /select queries I've been doing and expect documents to be boosted, but that doesn't seem to be happening, because I'm seeing things like this in the results:

{title:"Some title 1", rank:10, score:0.11726039},
{title:"Some title 2", rank:7, score:0.11726039},

Pretty much everything with the same score. Whatever I'm doing isn't making its way through. (To cover my bases I did try the case statement with integers rather than strings; same result.) With that not working, I started looking at other options and started playing with dismax. I'm able to add this to a query string and get results I'm somewhat expecting... 
bq=rank:1^3.0 rank:2^2.6 rank:3^2.2 rank:4^1.8 rank:5^1.5 rank:6^1.2 rank:7^0.9 rank:8^0.7 rank:9^0.5 ...but I guess I wasn't expecting it to ONLY rank based on those factors. That essentially gives me a sort by rank. I'm trying to be super inclusive with the search, so while I'm fiddling my mm=11. As expected, a query like q=red door is returning everything that contains "red" and "door". But I was hoping that items that matched "red door" exactly would sort closer to the top, and if that exact match was a rank 7, that its score wouldn't be exactly the same as all the other rank 7s. Ditto if I searched for q=The Tales Of: anything possessing all 3 terms would sort closer to the top, anything possessing two terms behind them, anything possessing 1 term behind them, and within those groups weight heavily by rank. I think I understand that the score is based entirely on the boosts I provide... so how do I get something more like what I'm looking for? Along those lines, I initially had put something like this in my defaults... <str name="bf">rank:1^10.0 rank:2^9.0 rank:3^8.0 rank:4^7.0 rank:5^6.0 rank:6^5.0 rank:7^4.0 rank:8^3.0 rank:9^2.0</str> ...but that was not working; queries fail with a syntax exception. Guessing this won't work? Thanks in advance for any help you can provide. -- View this message in context: http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p903190.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Document boosting troubles
Sorry, I overlooked your other question. <str name="bf">rank:1^10.0 rank:2^9.0 rank:3^8.0 rank:4^7.0 rank:5^6.0 rank:6^5.0 rank:7^4.0 rank:8^3.0 rank:9^2.0</str> This is wrong: you need to change bf to bq. bf = boost function; bq = boost query. -- View this message in context: http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p903208.html Sent from the Solr - User mailing list archive at Nabble.com.
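For reference, the same defaults entry with bq instead of bf would look like this (a sketch; the weights are the original poster's):

```xml
<str name="bq">rank:1^10.0 rank:2^9.0 rank:3^8.0 rank:4^7.0 rank:5^6.0 rank:6^5.0 rank:7^4.0 rank:8^3.0 rank:9^2.0</str>
```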
Re: solr multi-node
Antonello, here are a few links to the Solr Wiki: http://wiki.apache.org/solr/SolrReplication Solr Replication http://wiki.apache.org/solr/DistributedSearchDesign Distributed Search Design http://wiki.apache.org/solr/DistributedSearch Distributed Search http://wiki.apache.org/solr/SolrCloud Solr Cloud Hope this helps. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/solr-multi-node-tp903159p903228.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Master master?
What is the use case for such an architecture? Do you send requests to two different masters for indexing, and that's why they need to be synchronized? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Master-master-tp884253p903233.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Document boosting troubles
Hi, One problem down, two left! =) bf == bq did the trick, thanks. Now at least if I can't get the DIH solution working, I don't have to tack that onto every query string. I would really recommend using a boost function. If your rank changes in future implementations, you do not need to redefine the bq. Besides that, I think this is not only more comfortable, but it also scales better. The bq param is more for things like "boost this category" or "boost docs of an advertisement campaign" or something like that. I am not sure, since I never worked with the DIH this way, but - from my logic - the problem could be that you do not return the row, right? If you don't, try it again after adding "return row" to your source code. Otherwise, I can't help you, since there are no more code examples available on the mailing list (from what I have seen). Maybe this mailing-list topic helps you: http://lucene.472066.n3.nabble.com/Using-DIH-s-special-commands-Help-needed-td475695.html#a475695 "Using DIH's special commands - Help needed". There are some suggestions... however, it seems like he wasn't able to solve the problem. "And still can't figure out what I need to do with my dismax querying to get scores for quality of match." I don't really understand what you mean. Can you explain it a little bit more? What, except the $docBoost, does not work as it should? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Document-boosting-troubles-tp902982p904129.html Sent from the Solr - User mailing list archive at Nabble.com.
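For completeness, here is how the transformer from the original post would look with the suggested "return row;" added. The plain-object fallback stands in for DIH's Java Map row; this is an untested sketch, not verified against a real DIH setup:

```javascript
// DIH script-transformer sketch: map rank to a $docBoost value and
// return the row so DIH actually picks the change up.
function applyDocBoost(row, rank) {
  var boosts = { '1': 3.0, '2': 2.6, '3': 2.2, '4': 1.8, '5': 1.5,
                 '6': 1.2, '7': 0.9, '8': 0.7, '9': 0.5 };
  var docBoostVal = boosts[rank] || 0;
  if (row.put) { row.put('$docBoost', docBoostVal); } // DIH's Java Map
  else { row['$docBoost'] = docBoostVal; }            // plain-object fallback
  return row; // without this line, DIH discards the modification
}
```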
Re: DismaxRequestHandler
Joe, please, can you provide an example of what you are thinking of? Subqueries with Solr... I've never seen something like that before. Thank you! Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p904142.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Re: Solr and Nutch/Droids - to use or not to use?
Otis, And again I wish I were registered. I will check JIRA and when I feel comfortable with it, I will open the issue. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p904145.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Question on dynamic fields
Barani, without more background on dynamic fields, I would say the easiest way would be to define a suffix for each of the fields you want to index into the mentioned dynamic field and to redefine your dynamic-field condition. If a suffix does not work because of other dynamic-field declarations, use a prefix: instead of *_bla to match myField_bla, you can use bla_* to match bla_myField. Hope this helps, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Question-on-dynamic-fields-tp904053p904159.html Sent from the Solr - User mailing list archive at Nabble.com.
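The prefix/suffix matching above behaves roughly like a single-wildcard glob. This is a simplified model of Solr's dynamic-field rule for illustration, not Solr's actual code:

```javascript
// A dynamic-field pattern contains one "*", at the start or the end;
// a field name matches when it carries the fixed suffix or prefix.
function matchesDynamicField(pattern, fieldName) {
  if (pattern.startsWith('*')) return fieldName.endsWith(pattern.slice(1));
  if (pattern.endsWith('*')) return fieldName.startsWith(pattern.slice(0, -1));
  return pattern === fieldName;
}
```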
Solr and Nutch/Droids - to use or not to use?
Hello community, from several discussions about Solr and Nutch, I have some questions for a virtual web search engine. I know I posted this message to the mailing list a few days ago, but the thread got hijacked and I did not get any more postings on the topic, so I am trying to reopen it - hopefully no one gets upset here :-). Please bear with me. Thank you. The requirements: I. I need a scalable solution for a growing index that becomes larger than one machine can handle. If I add more hardware, I want to improve performance linearly. II. I want to use technologies like the OPIC algorithm (the default algorithm in Nutch) or PageRank or... whatever is out there to improve the ranking of the webpages. III. I want to be able to easily add more fields to my documents. Imagine one retrieves information from a webpage's content; then I want to make it searchable. IV. While fetching my data, I want to make special searches possible. For example, I want to retrieve pictures from a webpage and index picture-related content into another search index, plus I want to save a small thumbnail of the picture itself. Btw: this is (as far as I know) not possible with Solr, because Solr was not intended for such special indexing logic. V. I want to use filter queries (i.e. the main query "christopher lee" returns 1.5 million results; for the sub-query "action", the main query would be a filter query and "action" would be the actual query - so a search within search results would be easily made available). VI. I want to be able to use different logic for different pages. Maybe I have a pool of 100 domains that I know better than others, and I have special scripts that retrieve more specific information from those 100 domains. Then I want to apply my special logic to those 100 domains, but every other domain should use the default logic. - The project is only virtual. So why am I asking? I want to learn more about web search and I would like to gain some new experience. 
What do I know about Solr + Nutch: As it is said on lucidimagination.com, Solr + Nutch does not scale if the index is too large. The article was a little bit older and I don't know whether this problem is fixed by the new distributed abilities of Solr. Furthermore, I don't want to index the pages with Nutch and reindex them with Solr. The only exception would be: if the content of a webpage gets indexed by Nutch, I want to use the already tokenized content of the body with some Solr copyField operations to extend the search (i.e. making fuzzy search possible). At the moment I don't think this is possible. I don't know much about the Droids project and how well it is documented, but from what I can read in some posts by Otis, it seems to be usable as a crawler framework. Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it is a scaling monster (from what I've read). Cons: the search is not as rich as it is with Solr, and extending Nutch's search abilities *seems* to be more complicated than with Solr. Furthermore, if I want to use Solr to search Nutch's index, looking at my requirements I would need to reindex the whole thing - without the benefits of Hadoop. What I don't know at the moment is how it is possible to use algorithms like those mentioned in II. with Solr. I hope you understand the problem here - Solr *seems* to me as if it would not be the best solution for a web search engine, because of scaling reasons in indexing. Where should I dive deeper? Solr + Droids? Solr + Nutch? Nutch + howToExtendNutchToMakeSearchBetter? Thanks for the discussion! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900069.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr and Nutch/Droids - to use or not to use?
Thank you for the feedback, Otis. Yes, I thought that such an approach is useful if the number of pages to crawl is relatively low. However, what about using Solr + Nutch? Does the problem that this would not scale if the index becomes too large still exist? What about extending Nutch with features such as the DisMaxRequestHandler - is the amount of work larger than it would be in Solr? The big pro of Solr is that I can enhance the whole thing in a few minutes if I need more extra information to improve the search. That makes it very easy to experiment with boostings, filters, etc. As far as I know, Nutch does not offer such great features. Do you know a little bit more about that? Probably I should ask such questions on the Nutch mailing list, but at the moment I hope that I can achieve as much as I can with Solr, because I have no experience with Hadoop, and Nutch seems to require it. Thank you! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900480.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr and Nutch/Droids - to use or not to use?
Thanks, that really helps to find the right beginning for such a journey. :-) "* Use Solr, not Nutch's search webapp" As far as I have read, Solr can't scale if the index gets too large for one server: "The setup explained here has one significant caveat you also need to keep in mind: scale. You cannot use this kind of setup with vertical scale (collection size) that goes beyond one Solr box. The horizontal scaling (query throughput) is still possible with the standard Solr replication tools." ...from Lucidimagination.com. Is this still the case? Furthermore, as far as I have understood this blogpost: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (Lucidimagination.com: Nutch and Solr), they index the whole stuff with Nutch and reindex it into Solr - sounds like a lot of redundant work. Lucid, Sematext and the Nutch wiki are the only information sources where I can find talks about Nutch and Solr, but no one seems to talk about these facts - except this one blogpost. If you say this is wrong or contingent on the shown setup, can you tell me how to avoid these problems? A lot of questions, but it's such an exciting topic... Hopefully you can answer some of them. Again, thank you for the feedback, Otis. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900604.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Re: Re: Solr and Nutch/Droids - to use or not to use?
Good morning! Great feedback from you all. This really helped a lot to get an impression of what is possible and what is not. What is interesting to me are some detail questions. Let's assume Solr is able to work on its own with distributed indexing, so that the client does not need to know anything about shards etc. What is interesting to me is: I. The scoring - Nutch uses special scoring implementations like the OPIC algorithm. Can Solr use such improvements, or do I need to reimplement them for Solr? II. The indexing. At the moment it really sounds like Nutch would index the whole stuff and afterwards Solr does the job again. Regarding indexing, it would make sense if Nutch computes things like the document boost (I am not sure, but I think the results of the OPIC algorithm were added to each document as a boost) and sends an indexing request to Solr afterwards. However, if Nutch indexes the page's content and Solr does it, too - I would waste some time, no? Is this the case, or did I misunderstand something here? III. I am no Java expert. However, in a few months I will start to study computer science at a university. Maybe I will find some literature to learn more about distributed software and how hashing needs to work to make distributed indexing do its job. Maybe then I can help to implement this feature in Solr. On the other hand, not much is known about Solr's distributed search concept and which classes are responsible for it - but such things one could ask on the mailing list, no? As far as I know, Elastic Search already supports distributed indexing. Maybe one can reuse the responsible implementation for Solr. Btw: I think a great benefit of using Solr + Nutch would be to extend the search. I could create several Solr cores for different kinds of search - one for picture search, one for video search etc. *and* with the help of Nutch I can index some of the needed content in special directories. 
So Solr does not need to care about indexing a picture - Nutch already does the job. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p901943.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr and Nutch/Droids - to use or not to use?
Just wanted to push the topic a little bit, because these questions come up quite often and it's very interesting to me. Thank you! - Mitch MitchK wrote: Hello community and a nice Saturday, from several discussions about Solr and Nutch, I have some questions for a virtual web search engine. The requirements: I. I need a scalable solution for a growing index that becomes larger than one machine can handle. If I add more hardware, I want to improve performance linearly. II. I want to use technologies like the OPIC algorithm (the default algorithm in Nutch) or PageRank or... whatever is out there to improve the ranking of the webpages. III. I want to be able to easily add more fields to my documents. Imagine one retrieves information from a webpage's content; then I want to make it searchable. IV. While fetching my data, I want to make special searches possible. For example, I want to retrieve pictures from a webpage and index picture-related content into another search index, plus I want to save a small thumbnail of the picture itself. Btw: this is (as far as I know) not possible with Solr, because Solr was not intended for such special indexing logic. V. I want to use filter queries (i.e. the main query "christopher lee" returns 1.5 million results; for the sub-query "action", the main query would be a filter query and "action" would be the actual query - so a search within search results would be easily made available). VI. I want to be able to use different logic for different pages. Maybe I have a pool of 100 domains that I know better than others, and I have special scripts that retrieve more specific information from those 100 domains. Then I want to apply my special logic to those 100 domains, but every other domain should use the default logic. - The project is only virtual. So why am I asking? I want to learn more about web search and I would like to gain some new experience. 
What do I know about Solr + Nutch: As it is said on lucidimagination.com, Solr + Nutch does not scale if the index is too large. The article was a little bit older and I don't know whether this problem is fixed by the new distributed abilities of Solr. Furthermore, I don't want to index the pages with Nutch and reindex them with Solr. The only exception would be: if the content of a webpage gets indexed by Nutch, I want to use the already tokenized content of the body with some Solr copyField operations to extend the search (i.e. making fuzzy search possible). At the moment I don't think this is possible. I don't know much about the Droids project and how well it is documented, but from what I can read in some posts by Otis, it seems to be usable as a crawler framework. Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it is a scaling monster (from what I've read). Cons: the search is not as rich as it is with Solr, and extending Nutch's search abilities *seems* to be more complicated than with Solr. Furthermore, if I want to use Solr to search Nutch's index, looking at my requirements I would need to reindex the whole thing - without the benefits of Hadoop. What I don't know at the moment is how it is possible to use algorithms like those mentioned in II. with Solr. I hope you understand the problem here - Solr *seems* to me as if it would not be the best solution for a web search engine, because of scaling reasons in indexing. Where should I dive deeper? Solr + Droids? Solr + Nutch? Nutch + howToExtendNutchToMakeSearchBetter? Thanks for the discussion! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p894391.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr DataConfig / DIH Question
Guys??? You are in the wrong thread. Please send a new message to the mailing list; do not reply to existing posts. Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p892041.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr and Nutch/Droids - to use or not to use?
Hello community and a nice Saturday, from several discussions about Solr and Nutch, I have some questions for a virtual web search engine. The requirements: I. I need a scalable solution for a growing index that becomes larger than one machine can handle. If I add more hardware, I want to improve performance linearly. II. I want to use technologies like the OPIC algorithm (the default algorithm in Nutch) or PageRank or... whatever is out there to improve the ranking of the webpages. III. I want to be able to easily add more fields to my documents. Imagine one retrieves information from a webpage's content; then I want to make it searchable. IV. While fetching my data, I want to make special searches possible. For example, I want to retrieve pictures from a webpage and index picture-related content into another search index, plus I want to save a small thumbnail of the picture itself. Btw: this is (as far as I know) not possible with Solr, because Solr was not intended for such special indexing logic. V. I want to use filter queries (i.e. the main query "christopher lee" returns 1.5 million results; for the sub-query "action", the main query would be a filter query and "action" would be the actual query - so a search within search results would be easily made available). VI. I want to be able to use different logic for different pages. Maybe I have a pool of 100 domains that I know better than others, and I have special scripts that retrieve more specific information from those 100 domains. Then I want to apply my special logic to those 100 domains, but every other domain should use the default logic. - The project is only virtual. So why am I asking? I want to learn more about web search and I would like to gain some new experience. What do I know about Solr + Nutch: As it is said on lucidimagination.com, Solr + Nutch does not scale if the index is too large. The article was a little bit older and I don't know whether this problem is fixed by the new distributed abilities of Solr. 
Furthermore, I don't want to index the pages with Nutch and reindex them with Solr. The only exception would be: if the content of a webpage gets indexed by Nutch, I want to use the already tokenized content of the body with some Solr copyField operations to extend the search (i.e. making fuzzy search possible). At the moment I don't think this is possible. I don't know much about the Droids project and how well it is documented, but from what I can read in some posts by Otis, it seems to be usable as a crawler framework. Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it is a scaling monster (from what I've read). Cons: the search is not as rich as it is with Solr, and extending Nutch's search abilities *seems* to be more complicated than with Solr. Furthermore, if I want to use Solr to search Nutch's index, looking at my requirements I would need to reindex the whole thing - without the benefits of Hadoop. What I don't know at the moment is how it is possible to use algorithms like those mentioned in II. with Solr. I hope you understand the problem here - Solr *seems* to me as if it would not be the best solution for a web search engine, because of scaling reasons in indexing. Where should I dive deeper? Solr + Droids? Solr + Nutch? Nutch + howToExtendNutchToMakeSearchBetter? Thanks for the discussion! - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p890640.html Sent from the Solr - User mailing list archive at Nabble.com.
conditional Document Boost
Hello out there, I am searching for a solution for conditional document boosting. While analyzing the fields of a document, I want to create a document boost based on some metrics. There are three approaches: First: I preprocess the data. The main problem with this is that I need to take care of the preprocessing part myself and can't do it out of the box (implementing an analyzer, computing the boosting value, and afterwards storing those values or sending them to Solr). Second: Using the UpdateRequestProcessor (does it work with DIH?). However, this would also be custom work, plus taking care that the used params are up to date. Third: Setting the document boost while the analyzing process is running, with the help of a TokenFilter (is this possible?). What would you do? I think what I want to do is quite the same as working with Mahout and Solr. I never worked with Mahout - but how can I use it to improve the user's search experience? Where can I use Mahout in Solr if I want to influence documents' boosts? And where in general (i.e. for classification)? References, ideas and whatever could be useful are welcome :-). Thank you. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/conditional-Document-Boost-tp871108p871108.html Sent from the Solr - User mailing list archive at Nabble.com.
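Whichever hook ends up computing it, the boost itself is just a function of the analyzed metrics. A toy sketch, with all metric names and weights invented for illustration:

```javascript
// Toy conditional document boost: combine hypothetical analysis
// metrics into one multiplicative boost, clamped so outliers can't
// dominate the ranking.
function computeDocBoost(metrics) {
  let boost = 1.0;
  if (metrics.titleMatchesQueryTerms) boost *= 2.0;
  if (metrics.inboundLinks > 100) boost *= 1.5;
  return Math.min(boost, 2.0); // clamp
}
```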
Re: sort by function
Where is your query? You don't search for anything - the q param is empty. You have two options (untested): remove the q param, or search for something specific. I think removing is not a good idea. Instead, searching for *:* would retrieve ALL documents that match your filter query. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p839167.html Sent from the Solr - User mailing list archive at Nabble.com.
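A sketch of what the amended request could look like; the fq value and the sort function here are illustrative, not taken from the original post:

```javascript
// Match-all query plus a filter query and a sort-by-function clause.
function buildSortByFunctionQuery(fq) {
  const params = new URLSearchParams({
    q: '*:*',                        // match every document...
    fq: fq,                          // ...restricted by the filter query
    sort: 'product(0.88,rank) desc', // illustrative sort function
  });
  return '/select?' + params.toString();
}
```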
Re: IndexSearcher and Caches
Ahh, now I understand. No, you need no second IndexSearcher as long as the server is alive. You can reuse your searcher for every user. The only commands you are executing per user are those that create a search query. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p840228.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: IndexSearcher and Caches
Good question. Well, I never worked productively with SolrJ. But two things: First: as the documentation says, you *should* get your IndexSearcher from your SolrQueryRequest object. Second: as a developer of SolrJ I would do as much as I can automatically behind the scenes. That means that if you do a commit, the index searcher should be renewed automatically. But that's a guess. I can't answer this question for you, sorry. Maybe this link helps? http://lucene.472066.n3.nabble.com/Solr-commit-issue-td770315.html#a770453 (searched with the following keywords: solrj commit searcher) I am new to Java, and the concept of Java Enterprise Edition's servlets is not yet fully clear to me. Please, let me ask a question. Let me give you an example: if I use a SolrServer inside my application (it's a servlet), I should create it when I start the servlet. Should I cache the instantiated SolrServer object with the help of the servlet's cache? And should my cache implementation provide a getSolrServer() method? Maybe this is a question more related to the JavaEE concept. Thank you. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p840479.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: IndexSearcher and Caches
"In my case, I have an index which will not be modified after creation. Does this mean that in a multi-user scenario, I can have a static IndexSearcher object that can be shared by multiple users?" I am not sure what you mean by a multi-user scenario. Can you tell me what you have in mind? If your index never changes, your IndexSearcher won't change. "If the IndexSearcher object is threadsafe, then only issues related to concurrency are addressed. What about the case where the IndexSearcher is static? User 1 logs in to the system, queries with the static IndexSearcher, logs out; and then User 2 logs in to the system, queries with the same static IndexSearcher, logs out. In this case, users 1 and 2 are not querying concurrently but one after another. Will the query information (filters or any other data) of user 1 be retained when user 2 uses this?" I am not sure about the benefit of a static IndexSearcher. What do you expect? If user 1 uses a filter like fq=name:Samuel&q=somethingIWantToKnow and user 2 queries for fq=name:Samuel&q=whatIReallyWantToKnow, then they use the same cached filter object, retrieved from Solr's internal cache (of course you need a cache size that allows caching). "The Solr wiki states that the caches are per IndexSearcher object, i.e. if I set my filterCache size to 1000 it means that 1000 entries can be assigned for every IndexSearcher object." Yes. If a new searcher is created, its new cache is built from the old one. "Is this true for queryResultCache, filterCache and documentCache?" For the filterCache it's true. For the queryResultCache (if I understand the wiki right), too. Please note that the documentCache's behaviour is different from the already mentioned ones. The wiki says: "Note: This cache cannot be used as a source for autowarming because document IDs will change when anything in the index changes so they can't be used by a new searcher." 
The wiki says that the size of the document cache should be at least the number of _results_ * the number of _concurrent_ queries. I never worked with the document cache, so maybe someone else can throw some light into the dark. But from what I have understood, it means the following: if you show 10 results per request and you expect up to 500 concurrent queries: 10 * 500 = 5000. But I want to emphasize that this is only a guess; I don't actually know more about this topic. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p838367.html Sent from the Solr - User mailing list archive at Nabble.com.
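The rule of thumb above is plain multiplication; as a sketch:

```javascript
// documentCache sizing rule of thumb from the wiki, as described above:
// results shown per request times the expected concurrent queries.
function documentCacheSize(resultsPerRequest, concurrentQueries) {
  return resultsPerRequest * concurrentQueries;
}
```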
Re: sort by function
The score isn't computed when you try to access it. Furthermore, your function query needs to become part of the score. So what can you do? The keyword is boosting. Do: {!func}product(0.88,rank)^x where x is a boosting factor based on your experience. Keep in mind that the result of your product function query will be added to the score. That means if the result is e.g. 12, and the normal score would be 5.6, then the final score for the document is 17.6. If your rank value or your x value is too large, this will lead to unexpected results. Hope this helps. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p836471.html Sent from the Solr - User mailing list archive at Nabble.com.
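As a sketch of the same idea with the dismax handler (the field name rank and the boost factor 0.5 are illustrative), the bf parameter adds the function result to the score:

```
q=somethingIWantToKnow&qt=dismax&bf=product(0.88,rank)^0.5
```

As discussed above, if product(0.88,rank) produces values much larger than typical relevancy scores, it will dominate the ranking, so the factor needs tuning against your data.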
Re: IndexSearcher and Caches
Rahul, the IndexSearcher in Solr is shared by every request between two commits. That means one IndexSearcher plus its caches has a lifetime of one commit cycle. After every commit, a new one is created. Caching does not mean that filters are applied automatically. It means that a filter from a query will be cached, and whenever a user query requires the same filtering criteria, the cached filter is used instead of creating a new one on the fly. E.g.: fq=inStock:true The result of this filtering criteria gets cached once. If another user issues a query with fq=inStock:true again, Solr reuses the already existing filter. Since such filters are cached as bit vectors, they are not large. It does not matter what the user is querying for in the q param. BTW: the IndexSearcher is threadsafe, so there is no problem with concurrent usage. Hope this helps. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/IndexSearcher-and-Caches-tp833567p833841.html Sent from the Solr - User mailing list archive at Nabble.com.
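To illustrate the filter reuse described above with two hypothetical requests (query terms are made up):

```
q=ipod&fq=inStock:true    first request: the inStock:true filter is computed and cached
q=zune&fq=inStock:true    second request: the cached filter is reused, only q differs
```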
Re: Personalized Search
Hi dc, - at query time, specify boosts for 'my items' items Do you mean something like a document boost, or do you want to include something like OR myItemId:100^100? Can you tell us how you would specify document boosts at query time? Or are you querying something like a boolean field (e.g. isFavorite:true^10) or a numeric field? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Personalized-Search-tp831070p832062.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: sort by function
Can you please show some math to illustrate the principle? Do you want to do something like this: finalScore = score * rank, or rather: finalScore = rank? If the first is the case, then it is done by default (have a look at the wiki example for making more recent documents more relevant). If the second is the case, then I would say you need a new sort function (I have never implemented something like that). Hope this helps - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p821239.html Sent from the Solr - User mailing list archive at Nabble.com.
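For the first case, the recency example from the FunctionQuery wiki page looks roughly like this (the date field name is illustrative; the reciprocal decays the boost as documents age):

```
bf=recip(ms(NOW,mydatefield),3.16e-11,1,1)
```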
Re: Short DismaxRequestHandler Question
Okay, I will do so in the future if another problem like this occurs. At the moment, everything is fine after I followed your suggestions. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p820355.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: sort by function
Can you provide us some more information on what you really want to do? As the examples in the wiki show, the returned value of the function query is multiplied with the score - and you can boost the value returned by the function query, if you like. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/sort-by-function-tp814380p820359.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Short DismaxRequestHandler Question
Okay, let me be more specific: I have a custom StopWordFilter and a WordMarkingFilter. The WordMarkingFilter is a simple implementation that determines which type a word is. The StopWordFilter (my implementation) removes specific types of words *and* all markers from all words. This leads to the deletion of some parts of sentences. In my disMax query I specified some fields with such filters and some without. a) what docs should *not* match the query you listed In this case: docs where only Solr OR development occurs should not match. It is not important whether both words occur in different fields. b) what queries should *not* match the doc you listed For example, Solr Development Lucidworks should not match (assuming that lucidworks does not occur in a field like content). In this case, the user searches for development work with Solr in relation to LucidWorks. Solr does not know about the relation, but with the 100% mm definition I can tell Solr something like this in an easier way. c) what types of URLs you've already tried Those I have shown here. No more. Let me make sure that I have understood your explanation of how the DisMaxRequestHandler works. Say I have 4 fields: title, colour, category, manufacturer and an example doc like this: title: iPhone colour: black category: smartphone manufacturer: apple And I have a dismax query like this: q=apple iPhone qf=title^5 manufacturer mm=100% Then the whole thing will match (assuming that iPhone and/or apple were not stopwords)? If yes, then the problem is my filter definition. There were some threads with discussions about such problems with the standard StopWordFilter. Another example: title: Solr in a production environment cat: tutorial At index time, title is reduced to: Solr production environment. A query like using Solr in a production environment will be reduced to Solr production environment. This will work, as I have understood, because the indexed terms and the query are the same. 
However, if I have a content field that indexes the content of the text without my markerFilter, this won't work, because the parsed query strings are different??? I don't understand the problem. Example: title: Solr in a production environment cat: tutorial content: here is some text about using Solr in production. This fieldType consists of a lowerCaseFilter and a standard StopWordFilter that deletes all words like 'the, and, in' etc. Please note that environment does not occur in the content field. So a parsed query string would look like: using Solr in a production environment → using Solr production environment (stopwords are removed). This won't match, because the word environment does not occur in the content field? And according to that, the whole doc does not match? If you are confused about my examples and questions - I was trying to understand the explanations described here: http://lucene.472066.n3.nabble.com/DisMax-request-handler-doesn-t-work-with-stopwords-td478128.html#a478128 Thank you for your help. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p783063.html Sent from the Solr - User mailing list archive at Nabble.com.
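The kind of analysis chain discussed above could be sketched in schema.xml like this (names are illustrative, not the poster's actual config; the factory classes are the standard Solr ones):

```xml
<!-- sketch: a text fieldType with lowercasing and stopword removal at index and query time -->
<fieldType name="text_stopped" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Mixing fields that use such a fieldType with fields that do not is exactly what makes mm=100% behave surprisingly, since the parsed per-field clauses differ.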
Re: Short DismaxRequestHandler Question
Btw: This thread helps a lot to understand the difference between qf and pf :-) http://lucene.472066.n3.nabble.com/Dismax-query-phrases-td489994.html#a489995 -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p783379.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: increase(change) relevancy
Hi Ramzesua, take a look at the example of a function query that influences relevancy via the popularity field in the example directory. http://wiki.apache.org/solr/FunctionQuery#Using_FunctionQuery Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/increase-change-relevancy-tp783497p783750.html Sent from the Solr - User mailing list archive at Nabble.com.
Short DismaxRequestHandler Question
Hello community, I need a minimum should match that applies only to some fields, not to all. Let me give you an example: title: Breaking News: New information about Solr 1.5 category: development tag: Solr News If I search for Solr development, I want this doc to be returned, although I defined a minimum should match of 100%, because 100% of the query matches the *whole* document. At the moment, 100% applies only if 100% of the query matches a single field. Is this possible at the moment? If not, are there any suggestions or practices to make this work? Thank you. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p775913.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Custom SearchComponent to reset facet value counts after collapse
When is the returned facet info the expected info for your multiValued fields? Before or after your collapse? It could be that you need to facet on your multiValued fields only before collapsing to retrieve the right values. If this is the case, you need to integrate the before-collapsing feature of the collapsing patch in your own component; the rest is done by the patch itself. Hope this helps. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p776067.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Short DismaxRequestHandler Question
Thank you for responding. That would be possible. However, I would not like to do so, because a match in title should be boosted higher than a match in category. -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p776238.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Short DismaxRequestHandler Question
I got an idea: if I concatenated all relevant fields into one large multiValued field, I could query like this: {!dismax qf='myLargeField^5'}solr development //mm is 1 (100%) if not set In addition to that, I could add a phrase query: {!dismax qf='myLargeField^5'}solr development AND title:(solr development)^10 OR category:(solr development)^2 Any other ideas are welcome. Thank you for the discussion. -- View this message in context: http://lucene.472066.n3.nabble.com/Short-DismaxRequestHandler-Question-tp775913p776446.html Sent from the Solr - User mailing list archive at Nabble.com.
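The catch-all field sketched above is typically built with copyField directives in schema.xml. A rough sketch, assuming the field names from the earlier example (myLargeField and the text type are illustrative):

```xml
<!-- sketch: concatenate relevant fields into one catch-all field via copyField -->
<field name="myLargeField" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="title"    dest="myLargeField"/>
<copyField source="category" dest="myLargeField"/>
<copyField source="tag"      dest="myLargeField"/>
```

One caveat of this design: mm then applies across the merged content, but per-field boosts are lost inside the catch-all field, hence the extra boosted clauses in the query above.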
Re: Custom SearchComponent to reset facet value counts after collapse
I would prefer extending the given CollapseComponent, for performance reasons. What you want to do sounds a bit like making things too complicated. There are two options I would prefer: 1. Get the schema information for every field you want to query against and define whether you want to facet before or after collapsing. As far as I have understood: for multiValued fields you want to facet before collapsing, because if you facet after collapsing, the returned counts are wrong. 2. As a developer, you know which of the queried fields is multiValued. Knowing this, you create a new param that contains those fields you always want to facet on BEFORE collapsing. I want to emphasize that I never had a look at the source code of the patch. However, I really think that you do not need to reimplement that many things. You only need to implement the logic of when to facet on which field. That's everything. And since the component seems to implement both, faceting before *and* after collapsing, you can use the provided methods to make your logic work. Just some thoughts. :) Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p776896.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How do I return all the results in an index?
Did you clear the browser cache? Maybe you need to restart (I am currently not sure whether Solr caches HTTP requests even after you did a commit). Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/How-do-I-return-all-the-results-in-an-index-tp777214p777353.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: synonym filter problem for string or phrase
Just for clear terminology: you mean field, not fieldType. A fieldType is the definition of tokenizers, filters etc. You apply a fieldType to a field, and you query against a field, not against a whole fieldType. :-) Kind regards - Mitch Marco Martinez-2 wrote: Hi Ranveer, If you don't specify a field type in the q parameter, the search will be done in your default search field defined in solrconfig.xml. Is your default field a text_sync field? Regards, Marco Martínez Bautista http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42 -- View this message in context: http://lucene.472066.n3.nabble.com/synonym-filter-problem-for-string-or-phrase-tp765242p773083.html Sent from the Solr - User mailing list archive at Nabble.com.
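The field vs. fieldType distinction above can be sketched in schema.xml terms (names here are illustrative; the factory classes are standard Solr ones):

```xml
<!-- sketch: the fieldType defines the analysis chain... -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- ...and the field applies it; queries target the field, e.g. q=description:foo -->
<field name="description" type="text_syn" indexed="true" stored="true"/>
```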
Re: Custom SearchComponent to reset facet value counts after collapse
Kelly, did you have a look at the FacetComponent and SimpleFacets classes? Why do you want to reset the counts? What is your use case? What is the difference between the FacetComponent's return value and your component's? Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p771260.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Custom SearchComponent to reset facet value counts after collapse
Unfortunately this patch does not support multiValued fields (as stated by the author and some others who worked with that patch). I had a look at others, but they seem to have the same problem. What would I suggest, hmm... Off the top of my head, and at this time (it's late here in Germany), I have only one simple idea: send a second request using the standard FacetComponent with the same query, and facet on those fields that seem to have unexpected results. If I understand you correctly, this would be the fastest solution. However, I am not sure whether you really have a problem, since the SimpleFacets implementation also sends several queries to get the count per facet value. Does it really kill your performance? Or do you have performance issues even if you don't do so? How long does it take to compute a response? Maybe you can provide the full code of your own implementation, so that we can have a look at your source code together. Hope this helps. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p772012.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Custom SearchComponent to reset facet value counts after collapse
Good morning, I do not have the time to read your full code very carefully at the moment. I will do so later on, however: have a look at SimpleFacets. Consider the method that creates the facet counts. If I remember correctly, the author uses the IndexSearcher's numDocs(arg1, arg2) method. That's what you need here, I *think* (I never created such a feature). There is one thing that may be tricky: which field to query against (in a universal way - at the moment you need to decide this yourself when we are talking about multiValued fields). If I use param (CollapseParams.COLLAPSE_FACET, after) I get accurate counts for some facet values, while other facet values (from multi-value fields) are completely missing. Is what you described shown in your example? I just want to verify that we see the same problem. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-SearchComponent-to-reset-facet-value-counts-after-collapse-tp770826p772544.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: thresholding results by percentage drop from maxScore in lucene/solr
I am curious: what is your use case, and what type of data is this? Web pages? Blog posts? Product items? Can you provide some real examples so that we can discuss ideas other than doing it by score? I think this is not possible, or really difficult to achieve, since you don't know what the highest score will be until every document that matches the query has been found. Kind regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/thresholding-results-by-percentage-drop-from-maxScore-in-lucene-solr-tp768872p770063.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Any way to get top 'n' queries searched from Solr?
The simplest way is to send the query string to your Solr client *and* to your custom query fetcher, which could be any database you like. Doing so, you can count how often each query was sent, etc. *And* you can make those queries searchable by exporting the datasets to another Solr core. Why an extra DB? Because if a crash occurs, you get no guarantees from Solr. Keep in mind that Solr is an index/search server, not a real database. This is the easiest way to implement such a feature, I think. Good luck. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Any-way-to-get-top-n-queries-searched-from-Solr-tp767165p767489.html Sent from the Solr - User mailing list archive at Nabble.com.
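A minimal sketch of the counting side of this idea. The class and method names are made up for illustration, and the counts live only in memory; a real setup would persist each query to a database, as suggested above, so the data survives a crash:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query logger: record each query string as it is sent to Solr,
// then ask for the top n most frequent queries.
public class QueryLogger {
    private final Map<String, Integer> counts = new HashMap<>();

    // Call this alongside every request you forward to Solr.
    public void log(String query) {
        counts.merge(query, 1, Integer::sum);
    }

    // Returns up to n query strings, most frequent first.
    public List<String> topQueries(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Exporting the accumulated counts to a separate Solr core, as the reply suggests, would then make the query log itself searchable.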
Re: Elevation of of part match
Gert, could you provide the solrconfig and schema specifications you have made? If the wiki really means what it says, the behaviour you want should be possible. But that's only a guess. Btw: in the example directory, the default field type used for the elevation component is string. That means there is no tokenization, and accordingly a partial match is not possible. Hope that helps - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Elevation-of-of-part-match-tp767139p767877.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Elevation of of part match
The elevate.xml example says: <!-- If this file is found in the config directory, it will only be loaded once at startup. If it is found in Solr's data directory, it will be re-loaded every commit. --> Did you restart Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/Elevation-of-of-part-match-tp767139p768120.html Sent from the Solr - User mailing list archive at Nabble.com.
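For reference, an elevate.xml entry in the stock example looks roughly like this (query text and document ids here follow the shipped example data; treat them as illustrative):

```xml
<!-- sketch: pin one doc to the top for the query "ipod" and exclude another -->
<elevate>
  <query text="ipod">
    <doc id="MA147LL/A"/>
    <doc id="IW-02" exclude="true"/>
  </query>
</elevate>
```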