Re: Re: How to properly use Levenstein distance with ~ in Java

2014-10-23 Thread karsten-solr
Hi Aleksander,
 
The fuzzy search operator '~' is not supported by dismax (defType=dismax):
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
 
You are using the spellchecker SearchComponent. This does not change the query
results.
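
For illustration, a very rough SolrJ sketch (field name, fuzzy threshold and the
URL are made up) that sends the fuzzy term through the standard lucene query
parser instead of dismax:

  SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
  SolrQuery query = new SolrQuery("name:Schmidt~0.7");  // fuzzy term query
  query.set("defType", "lucene");                       // not dismax/edismax
  QueryResponse rsp = server.query(query);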
 

btw: It looks like you are using path /select with qt=dismax. This normally
would throw an exception.
Is there a tag
  <requestHandler name="/dismax" ...>
inside your solrconfig.xml ?
 
Best regards
 
  Karsten
 
P.S. in Context: 
http://lucene.472066.n3.nabble.com/How-to-properly-use-Levenstein-distance-with-in-Java-td4164793.html
 

 On 20 October 2014 11:13, Aleksander Sadecki wrote:

 Ok, thank you for your response. But why I cannot use '~'?


Re: Best way to index Solr XML from w/in the same servlet container

2012-09-18 Thread karsten-solr
Hi Jay,

I would like to see the ZooKeeper Watcher as part of DIH in Solr.
Possibly you could extend org.apache.solr.handler.dataimport.DataSource.

If you want to call Solr without HTTP you can use SolrJ:
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer
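
A very rough sketch of the embedded variant (Solr 3.x API; the solr home path
and core name are placeholders):

  // assumes -Dsolr.solr.home points at a valid solr home with core "collection1"
  CoreContainer.Initializer initializer = new CoreContainer.Initializer();
  CoreContainer container = initializer.initialize();
  SolrServer server = new EmbeddedSolrServer(container, "collection1");
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "1");
  doc.addField("title", "indexed without HTTP");
  server.add(doc);
  server.commit();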

Best regards
  Karsten

 Original-Nachricht 
 Datum: Mon, 17 Sep 2012 13:29:53 -0700
 Von: Jay Hill jayallenh...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Best way to index Solr XML from w/in the same servlet container

 I've created a custom process in Solr that has a Zookeeper Watcher
 configured to pull Solr XML files from a znode. When I receive a file I
 can
 send the file to /update and get it indexed, but that seems inefficient. I
 could use SolrJ, but I believe that is still sending an HTTP request to
 /update. Is there a better way to do this, or is SolrJ running w/in the
 same servlet container the most efficient way to index SolrJ from w/in the
 same servlet container that is running Solr?
 
 Thanks,
 -Jay


Re: DataImport using last_indexed_id or getting max(id) quickly

2012-07-12 Thread karsten-solr
Hi Avenka,

you asked for a how-to for adding a field inverseID which allows calculating
max(id) from its first term:
If you do not use Solr you have to calculate 1 - id and store it in
an extra field inverseID.
If you fill Solr with your own code, add a TrieLongField inverseID and fill it
with the value -id.
If you only want to change schema.xml (and add some classes):
  * You need a new FieldType inverseLongType and a Field inverseID of type
inverseLongType
  * You need a line <copyField source="id" dest="inverseID"/>
   (see http://wiki.apache.org/solr/SchemaXml#Copy_Fields)

For inverseLongType I see two possibilities:
 a) use TextField and make your own filter to calculate 1 - id
 b) extend TrieLongField to a new FieldType InverseTrieLongField with:
  @Override
  public String readableToIndexed(String val) {
    return super.readableToIndexed(Long.toString(-Long.parseLong(val)));
  }
  @Override
  public Fieldable createField(SchemaField field, String externalVal, float boost) {
    return super.createField(field, Long.toString(-Long.parseLong(externalVal)), boost);
  }
  @Override
  public Object toObject(Fieldable f) {
    Object result = super.toObject(f);
    if (result instanceof Long) {
      return new Long(-((Long) result).longValue());
    }
    return result;
  }
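
To read max(id) back you can then ask the TermsComponent for the first term of
inverseID; a rough SolrJ sketch (assuming the /terms requestHandler from the
example solrconfig.xml and the readable variant a):

  SolrQuery query = new SolrQuery();
  query.set("qt", "/terms");
  query.set("terms", "true");
  query.set("terms.fl", "inverseID");
  query.set("terms.sort", "index");   // smallest inverseID first = largest id
  query.set("terms.limit", "1");
  QueryResponse rsp = server.query(query);
  String first = rsp.getTermsResponse().getTerms("inverseID").get(0).getTerm();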

Best regards
   Karsten

View this message in context:
http://lucene.472066.n3.nabble.com/DataImport-using-last-indexed-id-or-getting-max-id-quickly-tp3993763p3994560.html


 Original-Nachricht 
 Datum: Wed, 11 Jul 2012 20:59:10 -0700 (PDT)
 Von: avenka ave...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Re: DataImport using last_indexed_id or getting max(id) quickly

 Thanks. Can you explain more the first TermsComponent option to obtain
 max(id)? Do I have to modify schema.xml to add a new field? How exactly do
 I
 query for the lowest value of 1 - id?
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/DataImport-using-last-indexed-id-or-getting-max-id-quickly-tp3993763p3994560.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: NRT and multi-value facet - what is Solr's limit?

2012-07-12 Thread karsten-solr
Hi Andy,

as long as the cache for faceting is not per segment, there is no NRT together
with faceting.

This is what Jason told you in
http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html
and I agree.

Possibly you could use multiple cores.

Best regards
   Karsten

 Original-Nachricht 
 Datum: Thu, 12 Jul 2012 03:18:47 -0700 (PDT)
 Von: Andy angelf...@yahoo.com
 An: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Betreff: NRT and multi-value facet  - what is Solr\'s limit?

 Hi,
 
 I understand that the cache for multi-value facet is multi-segment. So
 every time a document is updated the entire cache needs to be rebuilt.
 
 Is there any rule of thumb on the highest update rate NRT can handle
 before this cache-rebuild-on-each-commit becomes too expensive? I know it
 depends, but I'm just looking for order-of-magnitude estimates. Are we talking
 about 10 updates/s? 100? 1,000?
 
 Thanks


Re: Nrt and caching

2012-07-12 Thread karsten-solr
Hi Andy,

Multi-value faceting is a special case of taxonomy. So it is covered by the
org.apache.lucene.facet package (lucene/facet).
This is not per segment but works without a per-IndexSearcher cache.

So imho the taxonomy faceting will work with NRT.

Because of the new TermsEnum#ord() method, the class UnInvertedField has already
lost half of its code lines. UnInvertedField would work per segment if the
ordinal position of a term did not change on a commit, which is the basic
idea of the taxonomy solution.

So I am quite sure that Solr will adopt this approach at some point.
I do not know about soon.

Best regards
   Karsten

in context:
http://lucene.472066.n3.nabble.com/Nrt-and-caching-tp3993612p3993700.html

 Original-Nachricht 
 Datum: Sat, 7 Jul 2012 17:32:52 -0700 (PDT)
 Von: Andy angelf...@yahoo.com
 An: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Betreff: Re: Nrt and caching

 Jason,
 
 If I just use stock Solr 4.0 without modifying the source code, does that
 mean multi-value faceting will be very slow when I'm constantly
 inserting/updating documents? 
 
 Which open source library are you referring to? Will Solr adopt this
 per-segment approach any time soon?
 
 Thanks
 


Re: Unable to determine why query won't return results

2011-11-10 Thread karsten-solr
Hi Kurt,

I took your fieldtype definition and could not reproduce your problem with Solr
3.4.

But I think you have a problem with the ampersand in "A. J. Johnson & Co."

Two comments:
In your analysis html-example there is a gap of two positions between "Johnson"
and "Co". This should not happen ("A. J. Johnson & Co." is indexed like "A J
Johnson Co").
Possibly you have an encoding problem with the ampersand? Do you use SolrJ for
URL generation?
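
For illustration (plain JDK, nothing Solr-specific; the field name is made up):
if the URL is built by hand, the ampersand inside the phrase has to be
percent-encoded, otherwise it ends the q parameter:

  String phrase = "\"A. J. Johnson & Co.\"";
  String encoded = URLEncoder.encode(phrase, "UTF-8");
  // encoded = %22A.+J.+Johnson+%26+Co.%22 - note %26 for '&'
  String url = "http://localhost:8983/solr/select?q=name:" + encoded;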

Best regards
  Karsten



 Original-Nachricht 
 Datum: Wed, 9 Nov 2011 21:47:20 +
 Von: Nordstrom, Kurt kurt.nordst...@unt.edu
 An: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Betreff: Unable to determine why query won\'t return results

 Hello all.
 
 I'm having an issue in regards to matching a quoted phrase in Solr, and
 I'm not certain what the issue at hand is.
 
 I have tried this on both Solr 1.3 (Our production system) and 3.3 (Our
 development system).
 
 The field is a text field, and has the following fieldType definition:
 http://pastebin.com/SkmmucUE
 
 In the case where the search is failing, the field is indexed with the
 following value: A. J. Johnson & Co.
 
 We are searching the field with the following string (in quotes): A. J.
 Johnson & Co.
 
 Unfortunately, we get a response of no results when searching the field in
 question with the above specified string. If we search merely for A. J.
 Johnson (with quotes), we get the desired result.  Using the full string,
 however, seems to cause the results not to match.
 
 I have attempted to use Solr's analyzer (without success) to trace the
 problem. The results of this are here:
 http://pastehtml.com/view/bdgpdrt0w.html
 
 Any suggestions?


Re: [Profiling] How to profile/tune Solr server

2011-11-04 Thread karsten-solr
Hi Spark,

In 2009 there was a monitoring tool from Lucid Imagination:
http://www.lucidimagination.com/about/news/releases/lucid-imagination-releases-performance-monitoring-utility-open-source-apache-lucene

A colleague of mine calls the Sematext monitor a "trojan" because SPM phones
home:
Easy in, easy out - if you try SPM and don't like it, simply stop and remove 
the small client-side piece that sends us your data
http://sematext.com/spm/solr-performance-monitoring/index.html

It looks like other people are using a real profiler like the YourKit Java Profiler:
http://forums.yourkit.com/viewtopic.php?f=3t=3850

There is also an article about Zabbix
http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/

In your case any profiler would do, but if you find a profiler with
Solr-specific default filters, let me know.



Best regards
  Karsten

P.S. eMail in context 
http://lucene.472066.n3.nabble.com/Profiling-How-to-profile-tune-Solr-server-td3467027.html

 Original-Nachricht 
 Datum: Mon, 31 Oct 2011 18:35:32 +0800
 Von: yu shen shenyu...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Re: [Profiling] How to profile/tune Solr server

 No idea so far, try to figure out.
 
 Spark
 
 2011/10/31 Jan Høydahl jan@cominvent.com
 
  Hi,
 
  There are no official tools other than looking at the built-in stats
 pages
  and perhaps using JConsole or similar JVM monitoring tools. Note that
  Solr's JMX capabilities may let you hook your enterprise's existing
  monitoring dashboard up with Solr.
 
  Also check out the new monitoring service from Sematext which will give
  you graphs and all. So far it's free evaluation:
  http://sematext.com/spm/index.html
 
  Do you have a clue for why the indexing is slow?
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
  Solr Training - www.solrtraining.com
 
  On 31. okt. 2011, at 04:59, yu shen wrote:
 
   Hi All,
  
   I am a solr newbie. I find solr documents easy to access and use,
 which
  is
   really good thing. While my problem is I did not find a solr home
 grown
   profiling/monitoring tool.
  
   I set up the server as a multi-core server, each core has
 approximately
  2GB
   index. And I need to update solr and re-generate index in a real time
   manner (In java code, using SolrJ). Sometimes the update operation is
  slow.
   And it is expected that in a year, the index size may increase to 4GB.
  And
   I need to do something to prevent performance downgrade.
  
   Is there any solr official monitoring  profiling tool for this?
  
   Spark
 
 


Re: Limit by score? sort by other field

2011-10-27 Thread karsten-solr
Hi Robert,

take a look at
http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-threshold-td3219064.html#a3219117
and
http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html

So will
sort=date+desc&q={!frange l=0.85}query($qq)
qq=the original relevancy query
help?
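
As a SolrJ sketch (same parameters; the 0.85 cut-off is just the example value):

  SolrQuery query = new SolrQuery("{!frange l=0.85}query($qq)");
  query.set("qq", "the original relevancy query");
  query.addSortField("date", SolrQuery.ORDER.desc);
  QueryResponse rsp = server.query(query);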


Best regards
  Karsten 

 Original-Nachricht 
 Datum: Thu, 27 Oct 2011 12:30:31 +0100
 Von: Robert Brown r...@intelcompute.com
 An: solr-user@lucene.apache.org
 Betreff: Limit by score? sort by other field

 When we display search results to our users we include a percentage 
 score.
 
 Top result being 100%, then all others normalised based on the 
 maxScore, calculated outside of Solr.
 
 We now want to limit returned docs with a percentage score higher than 
 say, 50%.
 
 e.g. We want to search but only return docs scoring above 80%, but 
 want to sort by date, hence not being able to just sort by score.
 


Re: data-import problem

2011-10-24 Thread karsten-solr
Hi Radha Krishna,

try command full-import instead of fullimport
see
http://wiki.apache.org/solr/DataImportHandler#Commands


Best regards
  Karsten

 Original-Nachricht 
 Datum: Mon, 24 Oct 2011 11:10:22 +0530
 Von: Radha Krishna Reddy radhakrishn...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: data-import problem

 Hi,
 
 I am trying to configure solr on an aws ubuntu instance. I have mysql on a
 different server.so i created a ssh tunnel for mysql on port 3309.
 
 Download the mysql jdbc driver and copied it to lib folder.
 
 *I edited the example/solr/conf/solrconfig.xml*
...
 *when i tried to import data.*
 
 http://myservername/solr/dataimport?command=fullimport
 
 i* am getting the following response*
 
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int
 name="QTime">5</int></lst><lst name="initArgs"><lst
 name="defaults"><str
 name="config">data-config.xml</str></lst></lst><str
 name="command">fullimport</str><str name="status">idle</str><str
 name="importResponse"/><lst name="statusMessages"/><str
 name="WARNING">This response format is experimental.  It is likely to
 change in the future.</str>
 </response>
 
 
 Can someone help me on this?Also where can i find the logs.
 
 Thanks and Regards,
 Radha Krishna.


Re: indexing key value pair into lucene solr index

2011-10-24 Thread karsten-solr
Hi Jame,

you can
 - generate one token for each pair (key, value) -> key_value (see the sketch below)
 - insert a gap between each pair and use phrase queries
 - use key as field-name (if you have a restricted set of keys)
 - wait for joins in Solr 4.0 (http://wiki.apache.org/solr/Join)
 - use position or payloads to connect key and value
 - tell the forum your exact use-case with examples
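
A rough SolrJ sketch of the first option (the multivalued string field "kv" and
the underscore as separator are just examples):

  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "42");
  // one combined token per (key, value) pair
  doc.addField("kv", "color_red");
  doc.addField("kv", "size_large");
  server.add(doc);
  // a search for kv:color_red then matches exactly this pair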

Best regards
  Karsten

 Original-Nachricht 
 Datum: Mon, 24 Oct 2011 17:11:49 +0530
 Von: jame vaalet jamevaa...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: indexing key value pair into lucene solr index

 hi,
 in my use case i have list of key value pairs in each document object, if
 i
 index them as separate index fields then in the result doc object i will
 get
 two arrays corresponding to my keys and values. The problem i face here is
 that there wont be any mapping between those keys and values.
 
 do we have any easy to index these data in solr ? thanks in advance ...
 
 -- 
 
 -JAME


Re: indexing key value pair into lucene solr index

2011-10-24 Thread karsten-solr
Hi Jame,

preserve order in index fields:

If you don't want to use phrase queries on key or value, this order is the
position.
If you use phrase queries but no value has more than 50 tokens, you could also
use positions and start each pair at position 100, 200, 300 ...
Otherwise you could use payloads.
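
For the position-gap variant, a rough Lucene sketch (assuming key and value end
up as single tokens in one combined field "pairs" and each pair starts at
positions 100, 200, 300, ...); the slop only has to stay below the gap:

  SpanQuery key   = new SpanTermQuery(new Term("pairs", "color"));
  SpanQuery value = new SpanTermQuery(new Term("pairs", "red"));
  // key and value must lie inside the same pair, i.e. closer than the gap
  Query q = new SpanNearQuery(new SpanQuery[] { key, value }, 99, true);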

Imho there is no standard way to connect the positions of two fields.
You have to write your own Query.
My tip:
 Take org.apache.lucene.search.spans.TermSpans as a starting point and use the
queryparser module.

btw:
normally there is a standard solution in Lucene for each problem.
So please tell us more about your use case and somebody will have an answer
that does not require programming it on your own.

Best regards
  Karsten



 Original-Nachricht 
 Datum: Mon, 24 Oct 2011 17:53:26 +0530
 Von: jame vaalet jamevaa...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Re: indexing key value pair into lucene solr index

 thanks karsten.
 can we preserve order within index field ? if yes, i can index them
 separately and map them using their order.
 
 On 24 October 2011 17:32, karsten-s...@gmx.de wrote:
 
  Hi Jame,
 
  you can
   - generate one token for each pair (key, value) -- key_value
   - insert a gap between each pair and us phrase queries
   - use key as field-name (if you have a restricted set of keys)
   - wait for joins in Solr 4.0 (http://wiki.apache.org/solr/Join)
   - use position or payloads to connect key and value
   - tell the forum your exact use-case with examples
 
  Best regrads
   Karsten
 
   Original-Nachricht 
   Datum: Mon, 24 Oct 2011 17:11:49 +0530
   Von: jame vaalet jamevaa...@gmail.com
   An: solr-user@lucene.apache.org
   Betreff: indexing key value pair into lucene solr index
 
   hi,
   in my use case i have list of key value pairs in each document object,
 if
   i
   index them as separate index fields then in the result doc object i
 will
   get
   two arrays corresponding to my keys and values. The problem i face
 here
  is
   that there wont be any mapping between those keys and values.
  
   do we have any easy to index these data in solr ? thanks in advance
 ...
  
   --
  
   -JAME
 
 
 
 
 -- 
 
 -JAME


Re: Can Solr handle large text files?

2011-10-21 Thread karsten-solr
Hi Peter,

highlighting in large text files can not be fast without dividing the original
text into small pieces.
So take a look at
http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
and at
http://www.lucidimagination.com/blog/2010/09/16/2446/

Which means that you should divide your files and use
Result Grouping / Field Collapsing
to list only one hit per original document.

(XTF would also solve your problem out of the box, but XTF does not use Solr.)
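
A rough SolrJ sketch of the grouping part (assuming each chunk carries the id of
its original file in a field like "source_id"; result grouping needs Solr 3.3+):

  SolrQuery query = new SolrQuery("error AND mail");
  query.set("group", "true");
  query.set("group.field", "source_id");   // one group per original document
  query.set("group.limit", "1");           // best chunk per document
  query.setHighlight(true);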

Best regards
  Karsten

 Original-Nachricht 
 Datum: Thu, 20 Oct 2011 17:59:04 -0700
 Von: Peter Spam ps...@mac.com
 An: solr-user@lucene.apache.org
 Betreff: Can Solr handle large text files?

 I have about 20k text files, some very small, but some up to 300MB, and
 would like to do text searching with highlighting.
 
 Imagine the text is the contents of your syslog.
 
 I would like to type in some terms, such as error and mail, and have
 Solr return the syslog lines with those terms PLUS two lines of context. 
 Pretty much just like Google's highlighting.
 
 1) Can Solr handle this?  I had extremely long query times when I tried
 this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I tried breaking
 the files into 1MB pieces, but searching would be wonky = return the wrong
 number of documents (ie. if one file had a term 5 times, and that was the
 only file that had the term, I want 1 result, not 5 results).  
 
 2) What sort of tokenizer would be best?  Here's what I'm using:
 
    <field name="body" type="text_pl" indexed="true" stored="true"
 multiValued="false" termVectors="true" termPositions="true"
 termOffsets="true" />
 
 <fieldType name="text_pl" class="solr.TextField">
   <analyzer>
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="0" generateNumberParts="0" catenateWords="0"
 catenateNumbers="0"
 catenateAll="0" splitOnCaseChange="0"/>
   </analyzer>
 </fieldType>
 
 
 Thanks!
 Pete


Re: Migration from Autonomy IDOL to SOLR

2011-08-16 Thread karsten-solr
Hi Arcadius,

currently we have a migration project from the Verity K2 search server to Solr.
I do not know IDOL, but Autonomy bought Verity before IDOL was released, so
possibly it is comparable?
Verity K2 works directly on XML files; as a result the query syntax is a little
bit like XPath, e.g. with "text1 IN zone2 IN zone1" instead of
contains(//zone1/zone2,'text1').

About verity query syntax:
http://gregconely.getmyip.com/dl/OTG%20Software/5.30.087%20Suite%20%28SP3%29/Disc%204%20-%20Verity/Verity%20K2%20Server%205.5/doc/docs/pdf/VerityQueryLanguage.pdf

Does IDOL work the same way?


Best regards
  Karsten

P.S. in Context:
http://lucene.472066.n3.nabble.com/Migration-from-Autonomy-IDOL-to-SOLR-td3255377.html

 Original-Nachricht 
 Datum: Mon, 15 Aug 2011 11:11:36 +0100
 Von: Arcadius Ahouansou arcad...@menelic.com
 An: solr-user@lucene.apache.org
 Betreff: Migration from Autonomy IDOL to SOLR

 Hello.
 
 We have a couple of application running on half a dozen Autonomy IDOL
 servers.
 Currently, all feature we need are supported by Solr.
 
 We have done some internal testing and realized that SOLR would do a
 better
 job.
 
 So, we are investigation all possibilities for a smooth migration from
 IDOL
 to SOLR.
 
 I am looking for advice from people who went through something similar.
 
 Ideally, we would like to keep most of our legacy code unchanged and have
 a
 kind of query-translation-layer plugged into our app if possible.
 
 -Is there lib available?
 
 -Any thought?
 
 Thanks.
 
 Arcadius.


Re: string cut-off filter?

2011-08-08 Thread karsten-solr
Hi Bernd,

I also searched for such a filter but did not find it.

Best regards
  Karsten

P.S. I am now using this filter:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CutMaxLengthFilter extends TokenFilter {

  public CutMaxLengthFilter(TokenStream in) {
    this(in, DEFAULT_MAXLENGTH);
  }

  public CutMaxLengthFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  public static final int DEFAULT_MAXLENGTH = 15;
  private final int maxLength;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  @Override
  public final boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // truncate the token in place if it exceeds maxLength
    int length = termAtt.length();
    if (maxLength > 0 && length > maxLength) {
      termAtt.setLength(maxLength);
    }
    return true;
  }
}

with this factory

import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class CutMaxLengthFilterFactory extends BaseTokenFilterFactory {

  private int maxLength;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    maxLength = getInt("maxLength", CutMaxLengthFilter.DEFAULT_MAXLENGTH);
  }

  public TokenStream create(TokenStream input) {
    return new CutMaxLengthFilter(input, maxLength);
  }
}
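
And a quick (untested) usage sketch of the filter on its own:

  TokenStream ts = new CutMaxLengthFilter(
      new KeywordTokenizer(new StringReader("a_very_long_keyword_value")), 10);
  CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
  while (ts.incrementToken()) {
    System.out.println(term.toString());   // prints "a_very_lon"
  }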



 Original-Nachricht 
 Datum: Mon, 08 Aug 2011 10:15:45 +0200
 Von: Bernd Fehling bernd.fehl...@uni-bielefeld.de
 An: solr-user@lucene.apache.org
 Betreff: string cut-off filter?

 Hi list,
 
 is there a string cut-off filter to limit the length
 of a KeywordTokenized string?
 
 So the string should not be dropped, only limitited to a
 certain length.
 
 Regards
 Bernd


Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-04 Thread karsten-solr
Hi Erick,

thanks a lot!
This looks like a good idea:
Our queries with the changeable fields fit the join idea from
https://issues.apache.org/jira/browse/SOLR-2272
because
 - we do not need relevance ranking
 - we can separate them into a conjunction of a query with the changeable fields
and a query with our other stable fields.
So we can use something like
q=stablefields:query1&fq={!join from=changeable_fields_doc_id
to=stable_fields_doc_id}changeablefields:query2

The only drawback compared to the solution with ParallelReader is that our stored
fields and term vectors will be divided over two Lucene docs, which is OK in our
use case.

Best regards
  Karsten

in context:
http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html

 Original-Nachricht 
 Datum: Wed, 3 Aug 2011 22:11:08 -0400
 Von: Erick Erickson erickerick...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Re: Update some fields for all documents: LUCENE-1879 vs. 
 ParallelReader .FilterIndex

 Hmmm, the only thing that comes to mind is the join feature being added
 to
 Solr 4.x, but I confess I'm not entirely familiar with that functionality
 so
 can't tell if it really solver your problem.
 
 Other than that I'm out of ideas, but the again it's late and I'm tired so
 maybe I'm not being very creative G...
 
 Best
 Erick
 On Aug 3, 2011 11:40 AM, karsten-s...@gmx.de wrote:


Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-03 Thread karsten-solr
Hi Erick,

our two changeable fields are used for linking between documents on the
application level.
From a Lucene point of view they are just two searchable fields, with a stored
term vector for one of them.
Our queries will use one of these fields and a couple of fields from the
stable fields.

So the question is really about updating two fields in an existing Lucene index
with more than fifty other fields.

Best regards
  Karsten

P.S. about our linking between documents:
Our two fields are called outgoingLinks and possibleIncomingLinks.

Our source documents have an abstract and a couple of metadata fields.
We are using regular expressions to find outgoing links in this abstract. This
means a couple of words which indicate
 1. that the author made a reference (like in "my previous work published as
'Very important Article' in Nature 2010, 12 page 7")
 2. that this reference contains metadata pointing to another document

Each of these links is transformed to a special key (2010NaturNr12Page7).
On the other side, we transform the metadata to all possible keys.
This key generation grows with our knowledge of possible link patterns.
For the Lucene indexer this is a black box: there is a service which produces
the keys for outgoing and possibleIncoming from our source (XML) documents;
these keys must be searchable in Lucene/Solr.

P.P.S. in Context:
http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html

 Original-Nachricht 
 Datum: Wed, 3 Aug 2011 09:57:03 -0400
 Von: Erick Erickson erickerick...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Re: Update some fields for all documents: LUCENE-1879 vs. 
 ParallelReader .FilterIndex

 How are these fields used? Because if they're not used for searching, you
 could
 put them in their own core and rebuild that index at your whim, then
 querying that
 core when you need the relationship information.
 
 If you have a DB backing your system, you could perhaps store the info
 there
 and query that (but I like the second core better G)..
 
 But if you could use a separate index just for the relationships, you
 wouldn't
 have to deal with the slow re-indexing of all the docs...
 
 Best
 Erick
 
 On Mon, Aug 1, 2011 at 4:12 AM,  karsten-s...@gmx.de wrote:
  Hi lucene/solr-folk,
 
  Issue:
  Our documents are stable except for two fields which are used for
 linking between the docs. So we like to update this two fields in a batch 
 once a
 month (possible once a week).
  We can not reindex all docs once a month, because we are using XeLDA in
 some fields for stemming (morphological analysis), and XeLDA is slow. We
 have 14 Mio docs (less than 100GByte Main-Index and 3 GByte for this two
 changable fields).
  In the next half year we will migrating our search engine from verity K2
 to solr; so we could wait for solr 4.0
  (
  btw any news about
 
 http://lucene.472066.n3.nabble.com/Release-schedule-Lucene-4-td2256958.html
  ?
  ).
 
  Solution?
 
  Our issue is exactly the purpose of ParallelReader.
  But Solr do not support ParallelReader (for a good reason:
 
 http://lucene.472066.n3.nabble.com/Vertical-Partitioning-advice-td494623.html#a494624
  ).
  So I see two possible ways to solve our issue:
  1. waiting for the new Parallel incremental indexing
  (
  https://issues.apache.org/jira/browse/LUCENE-1879
  ) and hoping that solr will integrate this.
  Pro:
   - nothing to do for us except waiting.
  Contra:
   - I did not found anything of the (old) patch in current trunk.
 
  2. Change lucene index below/without solr in a batch:
    a) Each month generate a new index only with our two changed fields
       (e.g. with DIH)
    b) Use FilterIndex and ParallelReader to mock a correct index
    c) “Merge” this mock index to a new Index
       (via IndexWriter.addIndexes(IndexReader...) )
  Pro:
   - The patch for https://issues.apache.org/jira/browse/LUCENE-1812
    should be a good example, how to do this.
  Contra:
   - relation between DocId and document index order is not an guaranteed
 feature of DIH, (e.g. we will have to split the main index to ensure that
 no merge will occur in/after DIH).
   - To run this batch, solr has to be stopped and restarted.
   - Even if we know, that our two field should change only for a subset
 of the docs, we nevertheless have to reindex this two fields for all the
 docs.
 
  Any comments, hints or tips?
  Is there a third (better) way to solve our issue?
  Is there already an working example of the 2. solution?
  Will LUCENE-1879 (Parallel incremental indexing) be part of solr 4.0?
 
  Best regards
   Karsten
 


Re: xpath expression not working

2011-08-02 Thread karsten-solr
Hi abhayd,

XPathEntityProcessor only supports a subset of XPath,
like div[@id=2] but not [id=2].
Take a look at
https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose

I solved this problem by using XSLT as a preprocessor (with full XPath support).

The drawback is a performance penalty, see
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards
  Karsten

 Original-Nachricht 
 Datum: Mon, 1 Aug 2011 23:21:45 -0700 (PDT)
 Von: abhayd ajdabhol...@hotmail.com
 An: solr-user@lucene.apache.org
 Betreff: xpath expression not working

 hi 
 I have a xml doc whichi would like to index using xpath entity processor.
 <add>
 <doc>
  <id>1</id>
  <details>xyz</details>
 </doc>
 <doc>
  <id>2</id>
  <details>xyz2</details>
 </doc>
 </add>
 
 if i want to just load document with id=2 how would that work? 
 
 I tried xpath expression that works with xpath tools but not in solr. 
 
 <dataConfig>
 <dataSource type="FileDataSource" />
 <document>
 <entity name="f" processor="FileListEntityProcessor"
 baseDir="c:\temp" fileName="promotions.xml"
 recursive="false" rootEntity="false" dataSource="null">
 <entity name="x" processor="XPathEntityProcessor"
 forEach="/add/doc" url="${f.fileAbsolutePath}" pk="id">
 <field column="id" xpath="/add/doc/[id=2]/id"/>
 </entity>
 </entity>
 </document>
 </dataConfig>
 
 Any help how i can do this?
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/xpath-expression-not-working-tp3218133p3218133.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Store complete XML record (DIH XPathEntityProcessor)

2011-08-02 Thread karsten-solr
Hi g, Hi Chantal

I had the same problem.
You can use XPathEntityProcessor but you have to insert an XSL stylesheet. The
drawback is a performance penalty, see
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards
  Karsten

 Original-Nachricht 
 Datum: Mon, 1 Aug 2011 12:17:45 +0200
 Von: Chantal Ackermann chantal.ackerm...@btelligent.de
 An: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Betreff: Re: Store complete XML record  (DIH  XPathEntityProcessor)

 Hi g,
 
 ok, I understand your problem, now. (Sorry for answering that late.)
 
 I don't think PlainTextEntityProcessor can help you. It does not take a
 regex. LineEntityProcessor does but your record elements probably do not
 come on their own line each and you wouldn't want to depend on that,
 anyway.
 
 I guess you would be best off writing your own entity processor - maybe
 by extending XPath EP if that gives you some advantage. You can of
 course also implement your own importer using SolrJ and your favourite
 XML parser framework - or any other programming language.
 
 If you are looking for a config-only solution - i'm not sure that there
 is one. Someone else might be able to comment on that?
 
 Cheers,
 Chantal
 
 
 On Thu, 2011-07-28 at 19:17 +0200, solruser@9913 wrote:
  Thanks Chantal
  I am ok with the second call and I already tried using that. 
 Unfortunatly
  It reads the whole file into a field.  My file is as below example
  <xml>
 <record>
 ...
 </record>

 <record>
 ...
 </record>
  
  <record>
 ...
 </record>
  
  </xml>
  
  Now the XPATH does the 'for each /record' part.  For each record I also
 need
  to store the raw log in there.  If I use the  PlainTextEntityProcessor
 then
  it gives me the whole file (from xml .. /xml ) and not each of the
  record /record
  
  Am I using the PlainTextEntityProcessor wrong?
  
  THanks
  g
  
  
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3207203.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 


Re: Matching queries on a per-element basis against a multivalued field

2011-08-02 Thread karsten-solr
Hi Suk-Hyun Cho,

if myFriend is the unit of retrieval you should use it as the Lucene document,
with the fields isCool, gender, bloodType, ...

If you really want to insert all myFriends into one field, like in your
myFriends = [
isCool=true SOME_JUNK_HERE gender=female bloodType=O,
isCool=false SOME_JUNK_HERE gender=male bloodType=AB
]
example, you can use SpanQueries

http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/

With SpanNotQuery you can search for all "isCool true" and "gender male" where
no other isCool is between both phrases.
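
A rough Lucene sketch (assuming the key=value strings survive analysis as single
tokens in the field myFriends, as in the example above):

  SpanQuery cool = new SpanTermQuery(new Term("myFriends", "isCool=true"));
  SpanQuery male = new SpanTermQuery(new Term("myFriends", "gender=male"));
  // isCool=true ... gender=male in order, reasonably close together
  SpanQuery pair = new SpanNearQuery(new SpanQuery[] { cool, male }, 10, true);
  // drop matches where an isCool=false lies in between (span crosses elements)
  Query q = new SpanNotQuery(pair,
      new SpanTermQuery(new Term("myFriends", "isCool=false")));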

Best regards
  Karsten


P.S. see in context
http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-td3217432.html


Re: How to cut off hits with score below threshold?

2011-08-02 Thread karsten-solr
Hi Otis,

is this the same question as
http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html
?

If yes, perhaps something like (http://search-lucene.com/m/4AHNF17wIJW1/)
q={!frange l=0.85}query($qq)
qq=the original relevancy query
will help?

(BTW, I also would like to be able to specify a custom Collector via the API in
Solr; possibly worth an issue?)

Best regards
  Karsten


in context:
http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-threshold-td3219064.html

 Original-Nachricht 
 If one wanted to cut off hits whose score is below some threshold (I know,
 I know, one doesn't typically want to do this), what are the most elegant
 options?


Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-01 Thread karsten-solr
Hi lucene/solr-folk,

Issue:
Our documents are stable except for two fields which are used for linking
between the docs. So we would like to update these two fields in a batch once a
month (possibly once a week).
We can not reindex all docs once a month, because we are using XeLDA in some
fields for stemming (morphological analysis), and XeLDA is slow. We have 14 Mio
docs (less than 100 GByte main index and 3 GByte for these two changeable fields).
In the next half year we will be migrating our search engine from Verity K2 to
Solr, so we could wait for Solr 4.0
(
btw any news about
http://lucene.472066.n3.nabble.com/Release-schedule-Lucene-4-td2256958.html
?
).

Solution?

Our issue is exactly the purpose of ParallelReader.
But Solr does not support ParallelReader (for a good reason:
http://lucene.472066.n3.nabble.com/Vertical-Partitioning-advice-td494623.html#a494624
).
So I see two possible ways to solve our issue:
1. waiting for the new "Parallel incremental indexing"
(
https://issues.apache.org/jira/browse/LUCENE-1879
) and hoping that Solr will integrate this.
Pro:
 - nothing to do for us except waiting.
Contra:
 - I did not find anything of the (old) patch in the current trunk.

2. Change the Lucene index below/without Solr in a batch (rough sketch below):
   a) Each month generate a new index only with our two changed fields
  (e.g. with DIH)
   b) Use FilterIndex and ParallelReader to mock a correct index
   c) “Merge” this mock index to a new index
  (via IndexWriter.addIndexes(IndexReader...) )
Pro:
 - The patch for https://issues.apache.org/jira/browse/LUCENE-1812
   should be a good example of how to do this.
Contra:
 - The relation between DocId and document index order is not a guaranteed feature
of DIH (e.g. we will have to split the main index to ensure that no merge will
occur in/after DIH).
 - To run this batch, Solr has to be stopped and restarted.
 - Even if we know that our two fields should change only for a subset of the
docs, we nevertheless have to reindex these two fields for all the docs.
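
A very rough Lucene 3.x sketch of steps b) and c) (directory names are
placeholders, the FilterIndexReader part from LUCENE-1812 is left out; it only
works if both indexes have exactly the same doc order and no deletions):

  ParallelReader parallel = new ParallelReader();
  parallel.add(IndexReader.open(FSDirectory.open(new File("main-index"))));
  parallel.add(IndexReader.open(FSDirectory.open(new File("two-fields-index"))));
  IndexWriter writer = new IndexWriter(FSDirectory.open(new File("merged-index")),
      new IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer()));
  writer.addIndexes(parallel);   // writes one self-contained index with all fields
  writer.close();
  parallel.close();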

Any comments, hints or tips?
Is there a third (better) way to solve our issue?
Is there already a working example of the 2nd solution?
Will LUCENE-1879 (Parallel incremental indexing) be part of Solr 4.0?

Best regards
  Karsten


Re: Solr Configuration with 404 error

2011-07-11 Thread karsten-solr
Hi rocco,

you did not stop jetty after your first attempt.
(You have to kill the task.)

Best regards
  Karsten

btw: How to change the port 8983:
http://lucene.472066.n3.nabble.com/How-to-change-a-port-td490375.html

 Original-Nachricht 
 Datum: Sun, 10 Jul 2011 20:11:54 -0700 (PDT)
 Von: rocco2004 steve.adams2...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Solr Configuration with 404 error

 I installed Solr using:
 
 java -jar start.jar
 
 However I downloaded the source code and didn't compile it (Didn't pay
 attention). And the error using:
 http://localhost:8983/solr/admin/ was:
 
 HTTP ERROR: 404 Problem accessing /solr/admin/. Reason: NOT_FOUND
 
 I realized that it was not configuring because the source code was not
 compiled. Then I downloaded the compiled version of solr but when trying
 to
 run the example configuration I'm getting exception: 
 
 java.net.BindException: Address already in use
 
 Is there a way to revert solr configuration and start from scratch? Looks
 like the configuration got messed up. I don't see anything related to it
 in
 the manual.
 
 Here is the error:
 
 2011-07-10 22:41:27.631:WARN::failed SocketConnector@0.0.0.0:8983:
 java.net.BindException: Address already in use 2011-07-10
 22:41:27.632:WARN::failed Server@c4e21db: java.net.BindException: Address
 already in use 2011-07-10 22:41:27.632:WARN::EXCEPTION
 java.net.BindException: Address already in use at
 java.net.PlainSocketImpl.socketBind(Native Method) at
 java.net.PlainSocketImpl.bind(PlainSocketImpl.java:383) at
 java.net.ServerSocket.bind(ServerSocket.java:328) at
 java.net.ServerSocket.<init>(ServerSocket.java:194) at
 java.net.ServerSocket.<init>(ServerSocket.java:150) at
 org.mortbay.jetty.bio.SocketConnector.newServerSocket(SocketConnector.java:80)
 at org.mortbay.jetty.bio.SocketConnector.open(SocketConnector.java:73) at
 org.mortbay.jetty.AbstractConnector.doStart(AbstractConnector.java:283) at
 org.mortbay.jetty.bio.SocketConnector.doStart(SocketConnector.java:147) at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at
 org.mortbay.jetty.Server.doStart(Server.java:235) at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at
 org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985) at
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597) at
 org.mortbay.start.Main.invokeMain(Main.java:194) at
 org.mortbay.start.Main.start(Main.java:534) at
 org.mortbay.start.Main.start(Main.java:441) at
 org.mortbay.start.Main.main(Main.java:119) Jul 10, 2011 10:41:27 PM
 org.apache.solr.core.SolrCore registerSearcher INFO: [] Registered new
 searcher Searcher@5b6b9e62 main
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Configuration-with-404-error-tp3157895p3157895.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Showing facet of first N docs

2011-06-16 Thread karsten-solr
Hi Tommaso,

the FacetComponent works with the DocListAndSet#docSet.
It should be easy to switch to DocListAndSet#docList, which contains all
documents for the result list (default: the top 10, but possibly 15-25 if
start=15, rows=11). This means changing the source code.

Instead of changing the source code, the easier way would be to send a second
request with a relevance filter (if your sort criterion is relevance):
 http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html

Best regards
  Karsten

http://lucene.472066.n3.nabble.com/Showing-facet-of-first-N-docs-td3071395.html
 Original-Nachricht 
 Datum: Thu, 16 Jun 2011 12:39:32 +0200
 Von: Tommaso Teofili tommaso.teof...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Showing facet of first N docs

 Hi all,
 Do you know if it is possible to show the facets for a particular field
 related only to the first N docs of the total number of results?
 It seems facet.limit doesn't help with it as it defines a window in the
 facet constraints returned.
 Thanks in advance,
 Tommaso


Re: AndQueryNode to NearSpanQuery

2011-06-14 Thread karsten-solr
Hi member of digitalsmiths,

I also implemented a SpanNearQueryNode and some QueryNodeProcessors.
Most probably you can solve your problem by using
QueryNode#setTag:
In QueryNodeProcessor#preProcessNode you can set, remove and reset a tag to
mark the AndNodes that should become SpanNodes;
after this you can use the QueryNodeProcessor#postProcessNode method to
substitute these AndNodes in your OrNodes.

(But be aware of https://issues.apache.org/jira/browse/LUCENE-3045 )
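
A very rough sketch of the tagging idea (the tag name, the extra condition and
the constructor of your SpanNearQueryNode are assumptions):

  public class AndToSpanNearProcessor extends QueryNodeProcessorImpl {
    private static final String TO_SPAN = "toSpanNear";

    @Override
    protected QueryNode preProcessNode(QueryNode node) throws QueryNodeException {
      if (node instanceof AndQueryNode /* plus your own condition */) {
        node.setTag(TO_SPAN, Boolean.TRUE);   // mark for later replacement
      }
      return node;
    }

    @Override
    protected QueryNode postProcessNode(QueryNode node) throws QueryNodeException {
      if (node.getTag(TO_SPAN) != null) {
        // replace the marked AndQueryNode by your own SpanNearQueryNode
        return new SpanNearQueryNode(node.getChildren(), 1, true);
      }
      return node;
    }

    @Override
    protected List<QueryNode> setChildrenOrder(List<QueryNode> children)
        throws QueryNodeException {
      return children;
    }
  }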

Best regards
  Karsten

 Original-Nachricht 
 Datum: Mon, 13 Jun 2011 19:45:49 -0700 (PDT)
 Von: mtraynham mtrayn...@digitalsmiths.com
 An: solr-user@lucene.apache.org
 Betreff: AndQueryNode to NearSpanQuery
 ...
 The SpanNearQueryNode is a class I made that implements FieldableNode
 and extends QueryNodeImpl (as I want all Fieldable children to be from 
 the same field, therefore just remembering the terms).  Plus it
  maintains a distance or slop factor and a inOrder boolean.
 
 The problem here is that I can't keep the children from getting
  manipulated further down the pipeline, because I want my 
 NearSpanQueryBuilder to use it's original children nodes and at the same 
 time be cloned/changed/etc.  QueryNodeImpl has many private and final 
 methods and you can't override setChildren, etc, etc., but I'd rather 
 stay away from monkey patching. 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/AndQueryNode-to-NearSpanQuery-tp3061286p3061607.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query on Synonyms feature in Solr

2011-06-13 Thread karsten-solr
Hi rajini,

multi-word synonyms like "private schools" normally cause problems.

See e.g. Solr 1.4 Enterprise Search Server, page 56:
"For multi-word synonyms to work, the analysis must be applied at
index-time and with expansion so that both the original words and the
combined word get indexed. ..."

Your problem:
The input of the SynonymFilter must be the exact !Token! "Private schools".

So WhitespaceTokenizerFactory generates two tokens ("private" and "schools"),
and for KeywordTokenizerFactory the whole text is one token.

Best regards
  Karsten



 Original-Nachricht 
 Datum: Mon, 13 Jun 2011 16:07:35 +0530
 Von: rajini maski rajinima...@gmail.com
 An: solr-user@lucene.apache.org
 Betreff: Query on Synonyms feature in Solr

 Synonyms feature to be enabled on documents in Solr.
 
 
 I have one field in solr that has the content of a document.( say field
 name
 : document_data).
 
 The data in that field is :
 
 Tamil Nadu state private school fee determination committee headed by
 Justice Raviraja has submitted the private schools fees structure to the
 district educational officers on Monday
 
 Synonyms for private school in synonym flat file are :
 
 Private schools,NGO Schools,Unaided schools
 
 
 Now when i search on this field as  document_data=unaided schools.  I need
 to get the results.
 
 What are the token, analyser filter that i can apply  to the
 document_dataFIELD in order to get the results above
 
 
 
 
 This is the indexed document :
 <add>
 <doc>
 <field name="ID">SOLR200</field>
 <field name="document_data">Tamil Nadu state private school fee
 determination committee headed by Justice Raviraja has submitted the
 private
 schools fees structure to the district educational officers on
 Monday</field>
 </doc>
 </add>
 
 
 Right now i tried for these 2 fields type.. And i couldn't get the above
 results
 
  <fieldType name="Synonym_document" class="solr.TextField"
 positionIncrementGap="100">
 <analyzer>
   <tokenizer class="solr.KeywordTokenizerFactory"/>
 <filter class="solr.SynonymFilter" synonyms="Taxonomy.txt"
 ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
 protected="protwords.txt"/>
   </analyzer>
 </fieldType>
 
 
  <fieldType name="Synonym_document" class="solr.TextField"
 positionIncrementGap="100">
 <analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.SynonymFilter" synonyms="Taxonomy.txt"
 ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
 protected="protwords.txt"/>
   </analyzer>
 </fieldType>
 
 
  <field name="document_data" type="Synonym_document" indexed="true"
 multiValued="true"/>
 
 Both didn't work for my query.
 Anyone please guide me with the token, analyser filter that i can apply 
 to
 the document_data FIELD in order to get the results above
 
 
 Regards,
 Rajani


Re: RE: Indexing Question for large dataset

2011-04-14 Thread karsten-solr
Hi Joshua,

what is the use case?
Do you need only the facets for one field (for each query)?
Do you need all facet values or only the first 10 in facet.sort=index
(FACET_SORT_INDEX / numeric order) / in facet.sort=count (FACET_SORT_COUNT)?
How many different facet values do you have per field?
Do you only need these fields for faceted search?


Your problem will be that Solr normally puts an int[searcher.maxDoc()] array in
main memory for each field with facets.
You can avoid this by using facet.method=enum, which probably does not fit your case.
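
As a back-of-the-envelope example (numbers taken from your post): with 150,000
products one such array costs about 150,000 * 4 bytes = 600 KB per faceted field;
with 1,000 to 5,000 priceList fields that is roughly 0.6 to 3 GB of field caches
if they all stay warm.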

Because you do not have multiple tokens per document, your facets will be computed
by SimpleFacets#getFieldCacheCounts. In version 3.1 you will find a TODO there that
fits your needs :-(
In this method you will also see that the method indirectly uses a WeakHashMap,
so if you only use 100 fields per hour you should not have a problem :-)
But there will be no warm-up for your application (the first facet search will take
a while).

From my point of view you should program your own Solr plugin for your
purpose. This is not so hard, I assure you.

Best regards
  Karsten



 Joshua

 Name equals the product name. 
 
 Each separate product can have 1 to n prices based upon pricelist.
 
 A single document represents that single product.
 
 <doc>
   <field name="id">1</field>
   <field name="name">The product name.</field>
   <field name="price">1.00</field>
   <field name="priceList1Price">0.99</field>
   <field name="priceList2Price">0.98</field>
   <field name="priceList1500Price">0.85</field>
 </doc>
 <doc>
   <field name="id">2</field>
   <field name="name">The product name.</field>
   <field name="price">1.10</field>
   <field name="priceList1Price">1.09</field>
   <field name="priceList2Price">1.08</field>
   <field name="priceList1500Price">1.05</field>
 </doc>
 
 Yes, the amount of pricelist could grow from 1000 to 5000 given the user
 base grows.
 
 There are currently about 150,000 products.
 
 We do need to index the products, since they change frequently.
 
 Thanks everyone for all your responses so far!
 
 -Original Message-
 From: kenf_nc [mailto:ken.fos...@realestate.com] 
 Sent: Wednesday, April 13, 2011 1:15 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing Question for large dataset
 
 Is NAME a product name? Why would it be multivalue? And why would it
 appear
 on more than one document?  Is each 'document' a package of products? And
 the pricing tiers are on the package, not individual pieces?
 
 So sounds like you could, potentially, have a PriceListX column for each
 user. As your User base grows, the number of columns you need may grow
 (you
 already bumped up from 2000 to 5000 in the space of a couple posts :) ).
 Is
 that right?
 
 How many products (or packages of products) do you have? Could you flip
 this
 on its ear and make a User the document. Then it could have just 3
 multivalue fields (beyond any you need to identify the user like user_id)
 product_id
 product_name
 product_price
 
 Downside is if a new product is introduced you have to re-index all users
 that have a price point on that product.  
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Indexing-Question-for-large-dataset-tp2816344p2816994.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 The recipient of this email should check this email and any attachments
 for the presence of viruses. 
 The Wasserstrom Companies accepts no liability for any damage caused by
 any virus transmitted by this email.
 
 This footnote also confirms that this email message has been scanned for
 the presence of computer viruses.
 
 The Wasserstrom Companies


Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread karsten-solr
Hi Lance,

you are right:
XPathEntityProcessor has the attribute "xsl", so I can use XSLT to generate an
XML file in the form of the standard Solr update schema.
I will check the performance of this.


Best regards
  Karsten


btw: flatten is an attribute of the field tag, not of XPathEntityProcessor
(as wrongly specified in the wiki)


 Lance
 There is an option somewhere to use the full XML DOM implementation
 for using xpaths. The purpose of the XPathEP is to be as simple and
 dumb as possible and handle most cases: RSS feeds and other open
 standards.
 
 Search for xsl(optional)
 
 http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
 
 Karsten
 On Sat, Apr 9, 2011 at 5:32 AM
  Hi Folks,
 
  does anyone improve DIH XPathRecordReader to deal with nested xpaths?
  e.g.
  data-config.xml with
   <entity .. processor="XPathEntityProcessor" ..>
   <field column="title" xpath="//body/h1"/>
   <field column="alltext" xpath="//body" flatten="true"/>
  and the XML stream contains
   /html/body/h1...
  will only fill field “alltext” but field “title” will be empty.
 
  This is a known issue from 2009
 
 https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
 
  So three questions:
  1. How to fill a “search over all”-Field without nested xpaths?
    (schema.xml <copyField source="*" dest="alltext"/> will not help,
 because we lose the original token order)
  2. Does anyone try to improve XPathRecordReader to deal with nested
 xpaths?
  3. Does anyone else need this feature?
 
 
  Best regards
   Karsten
 
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html


Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread karsten-solr
Hi Lance,

I used XPathEntityProcessor with the attribute "xsl" and generated an XML file in
the form of the standard Solr update schema.
I lost a lot of performance; it is a pity that XPathEntityProcessor only uses one
thread.

My tests with a collection of 350T documents:
1. use of XPathRecordReader without XSLT:  28min
2. use of XPathEntityProcessor with XSLT (standard solr.war / Xalan): 44min
3. use of XPathEntityProcessor with Saxon XSLT: 36min


Best regards
  Karsten



 Lance 
 There is an option somewhere to use the full XML DOM implementation
 for using xpaths. The purpose of the XPathEP is to be as simple and
 dumb as possible and handle most cases: RSS feeds and other open
 standards.
 
 Search for xsl(optional)
 
 http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
 
--karsten
  Hi Folks,
 
  does anyone improve DIH XPathRecordReader to deal with nested xpaths?
  e.g.
  data-config.xml with
   <entity .. processor="XPathEntityProcessor" ..>
   <field column="title" xpath="//body/h1"/>
   <field column="alltext" xpath="//body" flatten="true"/>
  and the XML stream contains
   /html/body/h1...
  will only fill field “alltext” but field “title” will be empty.
 
  This is a known issue from 2009
 
 https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
 
  So three questions:
  1. How to fill a “search over all”-Field without nested xpaths?
    (schema.xml <copyField source="*" dest="alltext"/> will not help,
 because we lose the original token order)
  2. Does anyone try to improve XPathRecordReader to deal with nested
 xpaths?
  3. Does anyone else need this feature?
 
 
  Best regards
   Karsten
 

http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html


DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-09 Thread karsten-solr
Hi Folks,

has anyone improved DIH's XPathRecordReader to deal with nested xpaths?
e.g.
data-config.xml with
 <entity .. processor="XPathEntityProcessor" ..>
  <field column="title" xpath="//body/h1"/>
  <field column="alltext" xpath="//body" flatten="true"/>
and the XML stream contains
  /html/body/h1...
will only fill field “alltext” but field “title” will be empty.

This is a known issue from 2009
https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose

So three questions:
1. How to fill a “search over all” field without nested xpaths?
   (schema.xml <copyField source="*" dest="alltext"/> will not help, because
we lose the original token order)
2. Has anyone tried to improve XPathRecordReader to deal with nested xpaths?
3. Does anyone else need this feature?


Best regards
  Karsten


Solr without Server / Search solutions with Solr on DVD (examples?)

2011-04-07 Thread karsten-solr
Hi folks,

we want to migrate our search portal to Solr.
But some of our customers search our information offline with a DVD version.
So we want to estimate the complexity of a Solr DVD version.
This means trimming Solr to work on small computers with the opposite of heavy
loads: no server optimizations, no caches, fewer facet terms in memory...

My question:
Does anyone know examples of solutions with Solr starting from DVD?

Is there a tutorial for “configuring a slow Solr for computers with little main
memory”?

Any best-practice tips of your own?


Best regards
  Karsten


Re: Solr without Server / Search solutions with Solr on DVD (examples?)

2011-04-07 Thread karsten-solr
Hi Ezequiel,

In Solr the performance of sorting and faceted search is mainly a question of
main memory,
e.g. Mike McCandless wrote in s.apache.org/OWK that sorting 5M Wikipedia
documents by the title field needs 674 MB of RAM.

But again: my main interest is an example of other companies/products who
delivered information on DVD with a stand-alone Solr.

Best regards
  Karsten

---Ezequiel

 Try setting a virtual machine and see its performance.
 
 I'm really not a java guy, so i really don't know how to tune it for
 performance...
 
 But afaik solr handles pretty well in ram if the index is static...
 
 On Thu, Apr 7, 2011 at 2:48 PM, Karsten Fissmer karsten-s...@gmx.de
 wrote:
 
  Hi yonik, Hi Ezequiel,
 
  Java is no problem for an DVD Version. We already have a DVD version
 with
  Servlet-Container (but this does currently not use Solr).
 
  Some of our customers work in public sector institutions and have less
 then
  1gb main memory, but they use MS Word and IE and..
 
  But let us say that we can set Xmx384m (we have 14m documents).
  Xmx384m with 14m UnitsOfRetrieval means e.g. that we do not allow the
 same
  fields for sorting as on server.
 
  My main interest is an example of other companies/product who delivered
  information on DVD with stand alone Solr.
 
  Best regards
   Karsten
 
   ---yonik
   Including a JRE on the DVD and a launch script that uses that JRE by
   default should be doable as well.
   -Yonik
   Jeffrey
   Even if you can ship your DVD with a jetty server, you'll still need
   JAVA
   installed on the customer machine...
  
   ---Karsten
   My question:
   Does anyone know examples of solutions with Solr starting from DVD?
   Is there a tutorial for “configure a slow Solr for Computer with
 little
  main memory”?
   Any best practice tips from yourself?
 
 
 
 
 -- 
 __
 Ezequiel.
 
 Http://www.ironicnet.com


sending a parsed query to solr (xml-query-parser, syntaxtree)

2011-03-21 Thread karsten-solr
Hi, 

I am working on a migration from Verity K2 to Solr.

At this point I have a parser for the Verity Query Language (the subset we use)
which generates a syntax tree.
I transform this into a couple of filters and one query. This fragmentation is the
reason why I can not use my parser inside Solr
(via QParserPlugin: http://wiki.apache.org/solr/SolrPlugins#QParserPlugin ).

Because I have a syntax tree I would like to use the QueryParser Lucene contrib
module.
Another reason is that we need our own PhraseQuery, so the normal Solr
query syntax will not work
(not even the nested queries
  http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/
  http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/
)

So let's say that I already have an
org.apache.lucene.queryParser.core.nodes.QueryNode

What is the proper way to send this to Solr?
1. serialize to XML and use xml-query-parser for deserialization
   (but support in Solr is not stable:
https://issues.apache.org/jira/browse/SOLR-839 )
2. serialize and deserialize with XStream
3. serialize and deserialize with NamedList (like SolrJ does in the other
direction)
4. other suggestions?

If 1.:
Does anyone use xml-query-parser with heavy loads?
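
For 1., a rough sketch of the receiving side (Lucene contrib xml-query-parser;
the XML snippet is just an assumed example of a serialized tree):

  String xml = "<BooleanQuery fieldName=\"body\">"
             + "<Clause occurs=\"must\"><TermQuery>verity</TermQuery></Clause>"
             + "<Clause occurs=\"must\"><TermQuery>migration</TermQuery></Clause>"
             + "</BooleanQuery>";
  CoreParser parser = new CoreParser("body", new StandardAnalyzer(Version.LUCENE_30));
  Query query = parser.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));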

If 2.:
Does anyone know why QueryNodeImpl.java lost its serialVersionUID from 3.x to
4.x?
Will it no longer implement Serializable?

If 3.:
Has anyone used NamedList to send information to Solr?
Has anyone used NamedList to represent a QueryNode (or syntax tree)?

Best regards 
  Karsten 

P.S. An small example why QueryParser is great: 
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html