Parallel SQL / calcite adapter

2015-11-19 Thread Kai Gülzau

We are currently evaluating Calcite as a SQL facade for different data sources:

-  JDBC

-  REST

-  SOLR

-  ...

I didn't find a "native" Calcite adapter for Solr
(http://calcite.apache.org/docs/adapter.html).

Is it a good idea to use the Parallel SQL feature (over JDBC) to connect
Calcite (or Apache Drill) to Solr?
Any suggestions?
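
To make the idea concrete, this is the kind of client code we have in mind. A
minimal sketch, assuming a Solr build that ships the Parallel SQL /sql handler
and its thin JDBC driver (driver class name and URL format as described in the
ref guide; host, collection and field names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SolrSqlDemo {
  public static void main(String[] args) throws Exception {
    // Thin JDBC driver on top of Solr's Parallel SQL interface
    Class.forName("org.apache.solr.client.solrj.io.sql.DriverImpl");
    // ZooKeeper address and collection are placeholders
    Connection con = DriverManager.getConnection(
        "jdbc:solr://localhost:9983?collection=mycollection");
    try {
      Statement stmt = con.createStatement();
      ResultSet rs = stmt.executeQuery(
          "SELECT id, title FROM mycollection LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("id") + " | " + rs.getString("title"));
      }
    } finally {
      con.close();
    }
  }
}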


Thanks,

Kai Gülzau


Keyword aware Tokenizer?

2013-05-17 Thread Kai Gülzau
Does anybody know of a tokenizer which can be configured with (multiple)
regular expressions to mark some parts of the input text as keywords
and behave like StandardTokenizer (or UAX29URLEmailTokenizer) otherwise?

Input:
Does my order 4711.0815!-somecode_and.other(stuff) arrive on friday?

Tokens:
does|my|order|4711.0815!-somecode_and.other(stuff)|arrive|on|friday


Any pointers? Or hints on how to code this?
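
To make the intended semantics concrete, here is a plain-Java sketch of the
splitting logic I'm after. It is not a Lucene Tokenizer; the keyword pattern is
a placeholder and the \W+ split only stands in for StandardTokenizer:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordAwareSplit {
  // Placeholder pattern for our order codes, e.g. 4711.0815!-somecode_and.other(stuff)
  private static final Pattern KEYWORD = Pattern.compile("\\d+\\.\\d+\\S*");

  public static List<String> tokenize(String input) {
    List<String> tokens = new ArrayList<String>();
    Matcher m = KEYWORD.matcher(input);
    int last = 0;
    while (m.find()) {
      addWords(input.substring(last, m.start()), tokens);
      tokens.add(m.group()); // keyword match is kept as one token
      last = m.end();
    }
    addWords(input.substring(last), tokens);
    return tokens;
  }

  // Stand-in for StandardTokenizer on the non-keyword spans
  private static void addWords(String span, List<String> tokens) {
    for (String t : span.split("\\W+")) {
      if (t.length() > 0) tokens.add(t);
    }
  }
}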

Regards,

Kai Gülzau






StandardTokenizer vs. hyphens

2013-05-17 Thread Kai Gülzau
Is there some StandardTokenizer implementation which does not break words on
hyphens?

I think it would be more flexible to retain hyphens and use a
WordDelimiterFilterFactory to split these tokens.


StandardTokenizer today:
doc1: email -> email
doc2: e-mail -> e|mail
doc3: e mail -> e|mail

query1: email -> doc1
query2: e-mail -> doc2,doc3
query3: e mail -> doc2,doc3


StandardTokenizer which keeps hyphens + WDF:
doc1: email -> email
doc2: e-mail -> e-mail|email|e|mail
doc3: e mail -> e|mail

query1: email -> doc1,doc2
query2: e-mail -> doc1,doc2,doc3
query3: e mail -> doc2,doc3


Any suggestions to configure or code the 2nd behavior?
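
A sketch of the second chain against the Lucene 4.x API (untested, and the
WordDelimiterFilter constructor signatures moved around between 4.x releases;
in schema.xml the equivalent should be WhitespaceTokenizerFactory followed by
WordDelimiterFilterFactory with generateWordParts=1, catenateWords=1,
preserveOriginal=1):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.util.Version;

public class HyphenKeepingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String field, Reader reader) {
    // Whitespace tokenization keeps "e-mail" as one token ...
    WhitespaceTokenizer src = new WhitespaceTokenizer(Version.LUCENE_41, reader);
    // ... WDF then adds e + mail (word parts), email (catenateWords)
    // and keeps e-mail itself (preserveOriginal)
    TokenStream tok = new WordDelimiterFilter(src,
        WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.CATENATE_WORDS
            | WordDelimiterFilter.PRESERVE_ORIGINAL,
        null); // no protected words
    return new TokenStreamComponents(src, tok);
  }
}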

Regards,

Kai Gülzau


RE: How to make this work with SOLR ( LUCENE-2899 : Add OpenNLP Analysis capabilities as a module)

2013-02-15 Thread Kai Gülzau
 I tried patching my Solr 4.1 source, as well as a freshly downloaded
 Solr trunk, to no avail. I guess I just need some tips on how and what
 to patch. I tried to patch the base directory as well as the lucene
 directory. If there's something I need to hack in the patch, do let
 me know.

Try to apply the patch to trunk within Eclipse.
There you can see each file diff and manually change it while patching.

I just ignored most of the javadoc and some other (nonfunctional) diffs and
was able to produce some jars which are running (for my tests) in Solr 4.1.


regards,

Kai


RE: which analyzer is used for facet.query?

2013-02-15 Thread Kai Gülzau
OK, problem solved...

In my tests I only reloaded the master core and queried the slave core.
So config changes on the slave were not in place :-\

Sorry guys!

Kai


RE: Term Frequencies for Query Result

2013-02-15 Thread Kai Gülzau
 i *think* you are saying that you want the sum of term frequencies for all 
 terms in all matching documents -- but i'm not sure, because i don't see 
 how TermVectorComponent is helping you unless you are iterating over every 
 doc in the result set (ie: deep paging) to get the TermVectors for every 
 doc ... it would help if you could explain what you mean by counting all 
 frequencies manually

You are good at guessing :-)
By "counting all frequencies manually" I mean collecting the term
frequency for each term while iterating over all documents.


 I am looking for a way to get the top terms for a query result.
 you have to elaborate on exactly what you mean ... how are you defining 
 top terms for a query result ?  Are you talking about the most common 
 terms in the entire result set of documents that match your query?

My goal is to show the most relevant keywords for some documents of the index.
So "top terms for a query result" should be the top nouns for a filtered query.

With faceting, "top" means sorted by the count of docs containing the term.

If I could get the sum of the term frequencies, my hope is to be able
to distinguish between too-common terms and more relevant terms.
Something like a score for a term based on a filtered query.
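
Roughly what I mean, as a plain-Java sketch (the per-document tf maps would be
filled from the TermVectorComponent response; all names are made up):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermScoreSketch {
  // Sum tf per term over all documents in the (filtered) result set
  public static Map<String, Long> sumFrequencies(List<Map<String, Long>> perDocTermFreqs) {
    Map<String, Long> sums = new HashMap<String, Long>();
    for (Map<String, Long> docTf : perDocTermFreqs) {
      for (Map.Entry<String, Long> e : docTf.entrySet()) {
        Long old = sums.get(e.getKey());
        sums.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
      }
    }
    // Terms with a high tf sum but a low doc count are candidates for "relevant"
    return sums;
  }
}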


regards,

Kai Gülzau


RE: which analyzer is used for facet.query?

2013-02-08 Thread Kai Gülzau
 So it seems that facet.query is using the analyzer of type index.
 Is it a bug or is there another analyzer type for the facet query?

Nobody?
Should I file a bug?

Kai

-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Tuesday, February 05, 2013 2:31 PM
To: solr-user@lucene.apache.org
Subject: which analyzer is used for facet.query?

Hi all,

which analyzer is used for the facet.query?


This is my schema.xml:

<fieldType name="uima_nouns_de" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceDEAE.xml"
      tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
      types="/uima/whitelist_de.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
...
<field name="albody_de" type="uima_nouns_de" indexed="true" stored="true"
  multiValued="false" omitTermFreqAndPositions="false" termVectors="true"
  termPositions="false" termOffsets="false"/>


When doing a faceting search like:

http://localhost:8983/solr/slave/select?q=*:*&fq=type:7&rows=0&wt=json&indent=true&facet=true&facet.query=albody_de:Klaus

The UIMA whitespace tokenizer logs some info:
Feb 05, 2013 2:23:06 PM WhitespaceTokenizer process Information: Whitespace 
tokenizer starts processing
Feb 05, 2013 2:23:06 PM WhitespaceTokenizer process Information: Whitespace 
tokenizer finished processing


So it seems that facet.query is using the analyzer of type index.
Is it a bug or is there another analyzer type for the facet query?

Regards,

Kai Gülzau





copy Field / postprocess Fields after analyze / dynamic analyzer config

2013-02-08 Thread Kai Gülzau
Is there a way to post-process a field after analysis?

By post-processing I mean renaming, moving, or appending fields.


Some more information:

My schema.xml contains several language-suffixed fields (nouns_de, ...).
Each of these is analyzed in a language-dependent way:

<fieldType name="nouns_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceDEAE.xml"
      tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
      types="/uima/whitelist_de.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

When I do a faceted search I have to include every field/language combination
since I do not know the language at query time:

http://localhost:8983/solr/master/select?q=*:*&rows=0&facet=true&facet.field=nouns_de&facet.field=nouns_en&facet.field=nouns_fr&facet.field=nouns_nl
 ...

So I have to merge all terms in my own business logic :-(
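
Roughly like this SolrJ sketch (untested; rsp is the response of the facet
request above):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetMerger {
  // Merge the per-language facet counts (nouns_de, nouns_en, ...) into one map
  public static Map<String, Long> merge(QueryResponse rsp) {
    Map<String, Long> merged = new HashMap<String, Long>();
    for (FacetField ff : rsp.getFacetFields()) {
      for (FacetField.Count c : ff.getValues()) {
        Long old = merged.get(c.getName());
        merged.put(c.getName(), old == null ? c.getCount() : old + c.getCount());
      }
    }
    return merged;
  }
}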


Any idea / pointer to rename fields after analysis?

This post says it's not possible with the current API:
http://lucene.472066.n3.nabble.com/copyField-after-analyzer-td3900337.html


Another approach would be to allow analyzer configuration depending on another 
field value (language).


regards,

Kai Gülzau



which analyzer is used for facet.query?

2013-02-05 Thread Kai Gülzau
Hi all,

which analyzer is used for the facet.query?


This is my schema.xml:

<fieldType name="uima_nouns_de" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceDEAE.xml"
      tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
      types="/uima/whitelist_de.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
...
<field name="albody_de" type="uima_nouns_de" indexed="true" stored="true"
  multiValued="false" omitTermFreqAndPositions="false" termVectors="true"
  termPositions="false" termOffsets="false"/>


When doing a faceting search like:

http://localhost:8983/solr/slave/select?q=*:*&fq=type:7&rows=0&wt=json&indent=true&facet=true&facet.query=albody_de:Klaus

The UIMA whitespace tokenizer logs some info:
Feb 05, 2013 2:23:06 PM WhitespaceTokenizer process Information: Whitespace 
tokenizer starts processing
Feb 05, 2013 2:23:06 PM WhitespaceTokenizer process Information: Whitespace 
tokenizer finished processing


So it seems that facet.query is using the analyzer of type index.
Is it a bug or is there another analyzer type for the facet query?

Regards,

Kai Gülzau





RE: Indexing nouns only with UIMA works - performance issue?

2013-02-05 Thread Kai Gülzau
So with https://issues.apache.org/jira/browse/LUCENE-4749 it's possible to set 
the ModelFile?

<tokenizer class="solr.UIMAAnnotationsTokenizerFactory"
  descriptorPath="/uima/AggregateSentenceAE.xml"
  tokenType="org.apache.uima.SentenceAnnotation" ngramsize="2"
  modelFile="file:german/TuebaModel.dat"/>

???

Thanks,

Kai 


-Original Message-
From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com] 
Sent: Monday, February 04, 2013 2:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing nouns only with UIMA works - performance issue?

see an example at
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/contrib/uima/src/test-files/uima/uima-tokenizers-schema.xml?view=diff&r1=1442116&r2=1442117&pathrev=1442117
where the 'ngramsize' parameter is set; that's defined in the
AggregateSentenceAE.xml descriptor and is then set with the given actual
value.
HTH,

Tommaso


Indexing nouns only with UIMA works - performance issue?

2013-02-01 Thread Kai Gülzau
I now use the stupid way to use the German corpus for UIMA: copy + paste :-)

I modified the Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus
...
<fileResourceSpecifier>
  <fileUrl>file:german/TuebaModel.dat</fileUrl>
</fileResourceSpecifier>
...
and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml


The next step is to replace every occurrence of HmmTagger in
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
with HmmTaggerDE and save it as
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml

This can be used in your schema.xml:
<fieldType name="uima_nouns_de" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceDEAE.xml"
      tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
      types="/uima/whitelist_de.txt"/>
  </analyzer>
</fieldType>

There should be a way to accomplish this via config though.



Last open issue: Performance!

First run via the Admin GUI analysis, index value "Klaus geht in das Haus und
sieht eine Maus.", empty query: ~5 seconds
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information: 
Whitespace tokenizer successfully initialized
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit Information: 
Whitespace tokenizer typesystem initialized
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process Information:
Whitespace tokenizer starts processing
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process Information:
Whitespace tokenizer finished processing
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information: 
Whitespace tokenizer successfully initialized
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit Information: 
Whitespace tokenizer typesystem initialized
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process Information:
Whitespace tokenizer starts processing
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process Information:
Whitespace tokenizer finished processing
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information: 
Whitespace tokenizer successfully initialized
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit Information: 
Whitespace tokenizer typesystem initialized
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process Information:
Whitespace tokenizer starts processing
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process Information:
Whitespace tokenizer finished processing

Second run via the Admin GUI analysis, "Klaus geht in das Haus und sieht eine
Maus.", empty query: ~4 seconds
Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize Information: 
Whitespace tokenizer successfully initialized
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit Information: 
Whitespace tokenizer typesystem initialized
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process Information:
Whitespace tokenizer starts processing
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process Information:
Whitespace tokenizer finished processing
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize Information: 
Whitespace tokenizer successfully initialized
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit Information: 
Whitespace tokenizer typesystem initialized
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process Information:
Whitespace tokenizer starts processing
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process Information:
Whitespace tokenizer finished processing
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize Information: 
Whitespace tokenizer successfully initialized
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit Information: 
Whitespace tokenizer typesystem initialized
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process Information:
Whitespace tokenizer starts processing
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process Information:
Whitespace tokenizer finished processing

Initialized 3 times?
I think some of the components are not reused while analyzing.

Is this a known issue?


Regards,

Kai Gülzau



-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, January 31, 2013 6:48 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing nouns only - UIMA vs. OpenNLP

UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted
token types :-)

<fieldType name="uima_nouns_en" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceAE.xml"
      tokenType="org.apache.uima.TokenAnnotation"
      featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt"/>
  </analyzer>
</fieldType>

Open issue - How to set

RE: Indexing nouns only - UIMA vs. OpenNLP

2013-02-01 Thread Kai Gülzau
Hi Lance,

 About removing non-nouns: the OpenNLP patch includes two simple 
 TokenFilters for manipulating terms with payloads. The 
 FilterPayloadFilter lets you keep or remove terms with given payloads.

yes, I used this already in the schema.xml
 <filter class="solr.FilterPayloadsFilterFactory"
   payloadList="NN,NNS,NNP,NNPS,FM" keepPayloads="true"/>
 <filter class="solr.StripPayloadsFilterFactory"/>

Works fine :-)
But as Robert Muir stated in LUCENE-4345, I also think using types (and storing
these optionally as payloads)
would be a better approach.

 http://code.google.com/p/universal-pos-tags/
Thanks for the pointer, used it to improve my English (Brown) whitelist for
UIMA :-)

Regards,

Kai Gülzau


Indexing nouns only - UIMA vs. OpenNLP

2013-01-31 Thread Kai Gülzau
Hi,

I am stuck trying to index only the nouns of German and English texts.
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example)


First try was to use UIMA with the HMMTagger:

<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"/>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>

- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via the Solr contrib/langid field mapping?
- How to remove non-nouns in the annotated field?


My second try is to use OpenNLP and to apply the patch
https://issues.apache.org/jira/browse/LUCENE-2899,
but the patch seems to be a bit out of date.
Currently I am trying to get it to work with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau



RE: Indexing nouns only - UIMA vs. OpenNLP

2013-01-31 Thread Kai Gülzau
UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted
token types :-)

<fieldType name="uima_nouns_en" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceAE.xml"
      tokenType="org.apache.uima.TokenAnnotation"
      featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt"/>
  </analyzer>
</fieldType>

Open issue - How to set the ModelFile for the Tagger to 
german/TuebaModel.dat ???



OpenNLP:

And a modified patch for https://issues.apache.org/jira/browse/LUCENE-2899 is
now working
with Solr 4.1 :-)

<fieldType name="nlp_nouns_de" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
      tokenizerModel="opennlp/de-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory"
      posTaggerModel="opennlp/de-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory"
      payloadList="NN,NNS,NNP,NNPS,FM" keepPayloads="true"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>



Any hints on which lib is more accurate at noun tagging?
Any performance or memory issues (I get some OOMs here while testing with 1 GB
via the Analyzer Admin GUI)?


Regards,

Kai Gülzau




-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, January 31, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Indexing nouns only - UIMA vs. OpenNLP

Hi,

I am stuck trying to index only the nouns of German and English texts.
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example)


First try was to use UIMA with the HMMTagger:

<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"/>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>

- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via the Solr contrib/langid field mapping?
- How to remove non-nouns in the annotated field?


My second try is to use OpenNLP and to apply the patch
https://issues.apache.org/jira/browse/LUCENE-2899,
but the patch seems to be a bit out of date.
Currently I am trying to get it to work with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau



Term Frequencies for Query Result

2013-01-30 Thread Kai Gülzau
Hi,

I am looking for a way to get the top terms for a query result.

Faceting does not work since counts are measured as documents containing a term 
and not as the overall count of a term in all found documents:

http://localhost:8983/solr/master/select?q=type%3A7&rows=1&wt=json&indent=true&facet=true&facet.query=type%3A7&facet.field=albody&facet.method=fc

  "facet_counts": {
    "facet_queries": {
      "type:7": 156},
    "facet_fields": {
      "albody": [
        "der", 73,
        "in", 68,
        "betreff", 63,
        ...


Using http://wiki.apache.org/solr/TermVectorComponent and counting all
frequencies manually seems to be the only solution for now:

http://localhost:8983/solr/tvrh/?q=type:7&tv.fl=albody&f.albody.tv.tf=true&wt=json&indent=true


"termVectors": [
  "uniqueKeyFieldName", "ukey",
  "798_7_0", [
    "uniqueKey", "798_7_0",
    "albody", [
      "der", [
        "tf", 5],
      "die", [
        "tf", 7],
      ...



Does anyone know a better and more efficient solution?


Regards,

Kai Gülzau



RE: How to update one field without losing the others?

2012-06-18 Thread Kai Gülzau
I'm currently playing around with a branch_4x version
(https://builds.apache.org/job/Solr-4.x/5/) but I don't get field updates to
work.

A simple GET test request
http://localhost:8983/solr/master/update/json?stream.body={"add":{"doc":{"ukey":"08154711","type":1,"nbody":{"set":"mycontent"}}}}

results in
{
  "ukey": "08154711",
  "type": 1,
  "nbody": "{set=mycontent}"}]
}

All fields are stored.
ukey is the unique key :-)
type is a required field.
nbody is a solr.TextField.


Is there any (wiki/readme) pointer on how to test and use this feature correctly?
What are the restrictions?
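
For comparison, the SolrJ route I would expect to work, as a sketch (untested;
if I understand SOLR-139 correctly, the updateLog and a _version_ field also
need to be enabled in the config):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateTest {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/master");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("ukey", "08154711");           // unique key selects the doc
    Map<String, Object> op = new HashMap<String, Object>();
    op.put("set", "mycontent");                 // atomic "set" operation
    doc.addField("nbody", op);                  // a Map value marks the field update
    server.add(doc);
    server.commit();
  }
}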

Regards,

Kai Gülzau

 
-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Saturday, June 16, 2012 4:47 PM
To: solr-user@lucene.apache.org
Subject: Re: How to update one field without losing the others?

Atomic update is a very new feature coming in 4.0 (i.e. grab a recent
nightly build to try it out).

It's not documented yet, but here's the JIRA issue:
https://issues.apache.org/jira/browse/SOLR-139?focusedCommentId=13269007&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13269007

-Yonik
http://lucidimagination.com


mailto: scheme aware tokenizer

2012-03-16 Thread Kai Gülzau
Is there any analyzer out there which handles the mailto: scheme?

UAX29URLEmailTokenizer seems to split at the wrong place:

mailto:t...@example.org ->
mailto:test
example.org

As a workaround I use

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="mailto:"
  replacement="mailto: "/>

Regards,

Kai Gülzau

novomind AG
__

Bramfelder Straße 121 • 22305 Hamburg

phone +49 (0)40 808071138 • fax +49 (0)40 808071-100
email kguel...@novomind.com • http://www.novomind.com

Vorstand : Peter Samuelsen (Vors.) • Stefan Grieben • Thomas Köhler
Aufsichtsratsvorsitzender: Werner Preuschhof
Gesellschaftssitz: Hamburg • HR B93508 Amtsgericht Hamburg


RE: DIH Strange Problem

2011-11-28 Thread Kai Gülzau
Do you use Java 6 update 29? There is a known issue with the latest MSSQL
driver:

http://blogs.msdn.com/b/jdbcteam/archive/2011/11/07/supported-java-versions-november-2011.aspx

In addition, there are known connection failure issues with Java 6 update 29, 
and the developer preview (non production) versions of Java 6 update 30 and 
Java 6 update 30 build 12.  We are in contact with Java on these issues and we 
will update this blog once we have more information.

Should work with update 28.

Kai

-Original Message-
From: Husain, Yavar [mailto:yhus...@firstam.com] 
Sent: Monday, November 28, 2011 1:02 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

I figured out the solution, and Microsoft, not Solr, is the problem here :):

I downloaded and built the latest Solr (3.4) from source and finally hit the
following line of code in Solr (where I put my debug statement):

if (url != null) {
    LOG.info("Yavar: getting handle to driver manager:");
    c = DriverManager.getConnection(url, initProps);
    LOG.info("Yavar: got handle to driver manager:");
}

The call to DriverManager was not returning. Here was the error!! The driver
we were using was the Microsoft Type 4 JDBC driver for SQL Server. I downloaded
another driver, the jTDS JDBC driver, and installed that. Problem got fixed!!!

So please follow the following steps:

1. Download the jTDS JDBC driver from http://jtds.sourceforge.net/
2. Put the driver jar file into your Solr/lib directory where you had put the
Microsoft JDBC driver.
3. In data-config.xml use this statement:
driver="net.sourceforge.jtds.jdbc.Driver"
4. Also in data-config.xml write the url like this:
url="jdbc:jtds:sqlserver://localhost:1433;databaseName=XXX"
5. Now run your indexing.

It should solve the problem.

-Original Message-
From: Husain, Yavar
Sent: Thursday, November 24, 2011 12:38 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

Hi

Thanks for your replies.

I carried out these 2 steps (it did not solve my problem):

1. I tried setting responseBuffering to adaptive. Did not work.
2. For checking the database connection I wrote a simple Java program to connect
to the database and fetch some results with the same driver that I use for Solr.
It worked, so it does not seem to be a problem with the connection.

Now I am stuck where the Tomcat log says "Creating a connection for entity ..."
and does nothing. I mean, after this log we usually get the "getConnection()
took x milliseconds" message, however I don't get that; I can just see the time
moving with no records getting fetched.

Original Problem listed again:


I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing
data. Indexing and all was working perfectly fine. However, today when I started
full indexing again, Solr halts/gets stuck at the line "Creating a connection
for entity". There are no further messages after that. I can see that DIH is
busy, and on the DIH console I can see "A command is still running", I can also
see total rows fetched = 0 and total requests made to datasource = 1, and the
time is increasing, however it is not doing anything. This is the exact
configuration that worked for me. I am not really able to understand the
problem here. Also in the index directory where I am storing the index there
are just 3 files: 2 segment files + 1 lucene*-write.lock file.
...
data-config.xml:

<dataSource type="JdbcDataSource"
  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
  url="jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders"
  user="testUser" password="password"/>
<document>
...

Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=11
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
   
commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: 
jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Wednesday, November 23, 2011 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Strange Problem

On 11/23/2011 5:21 AM, Chantal Ackermann wrote:
 

DIH - how to collect added/error unique keys?

2011-11-09 Thread Kai Gülzau
Hi *,

I am using DataImportHandler to do imports on an INDEX_QUEUE table (UKEY |
ACTION),
using a custom Transformer which adds fields from various sources depending on
the UKEY.

Indexing works fine this way.

But now I want to delete the rows from INDEX_QUEUE which were successfully 
updated.

- Is there a good API way to do this?

Right now I'm using a custom RequestProcessor which collects the UKEYs and calls
a method
on a singleton with access to the DB. It works, but I hate these global
singletons... :-(

public void processAdd(AddUpdateCommand cmd) throws IOException {
  SolrInputDocument doc = cmd.getSolrInputDocument();
  try {
    super.processAdd(cmd);
    addOK(doc);       // collect the key as successfully indexed
  } catch (IOException e) {
    addError(doc);    // collect the key as failed, then rethrow
    throw e;
  } catch (RuntimeException e) {
    addError(doc);
    throw e;
  }
}

Any other suggestions?
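
One direction I am considering to get rid of the singleton: let the factory own
the DB access and hand it to the processor via solrconfig.xml init args. A
rough, untested sketch (class names, the jdbcUrl parameter and the SQL are made
up):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class QueueCleanupProcessorFactory extends UpdateRequestProcessorFactory {
  private String jdbcUrl;

  @Override
  public void init(NamedList args) {
    jdbcUrl = (String) args.get("jdbcUrl"); // configured per processor in solrconfig.xml
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new QueueCleanupProcessor(jdbcUrl, next);
  }

  static class QueueCleanupProcessor extends UpdateRequestProcessor {
    private final String jdbcUrl;

    QueueCleanupProcessor(String jdbcUrl, UpdateRequestProcessor next) {
      super(next);
      this.jdbcUrl = jdbcUrl;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      super.processAdd(cmd);
      deleteQueueRow(cmd.getSolrInputDocument()); // only reached on success
    }

    // Naive: one connection per document; batching per commit would be better
    private void deleteQueueRow(SolrInputDocument doc) {
      try {
        Connection c = DriverManager.getConnection(jdbcUrl);
        try {
          PreparedStatement ps =
              c.prepareStatement("DELETE FROM INDEX_QUEUE WHERE UKEY = ?");
          ps.setString(1, doc.getFieldValue("ukey").toString());
          ps.executeUpdate();
        } finally {
          c.close();
        }
      } catch (SQLException e) {
        // log and leave the row in the queue for a retry
      }
    }
  }
}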

Regards,

Kai Gülzau



RE: Jetty logging

2011-11-03 Thread Kai Gülzau
Hi,

remove slf4j-jdk14-1.6.1.jar from the war and repack it with slf4j-log4j12.jar 
and log4j-1.2.14.jar instead.

-> http://wiki.apache.org/solr/SolrLogging

Regards,

Kai Gülzau

-Original Message-
From: darul [mailto:daru...@gmail.com] 
Sent: Thursday, November 03, 2011 11:26 AM
To: solr-user@lucene.apache.org
Subject: Jetty logging

Hello everybody,

I cannot find a solution on how to configure jetty with slf4j and a
log4j.properties file.

In  I have put :

- log4j-1.2.14.jar
- slf4j-api-1.3.1.jar

in  directory:
- log4j.properties



In the end, nothing happens when running jetty.

Do you have any ideas?

Thanks,

Julien







RE: document update / nested documents / document join

2011-10-17 Thread Kai Gülzau
Nobody?

SOLR-139 seems to be the most popular issue, but I don't think this will be
resolved in the near future (this year). Right?

So I will try SOLR-2272 as a workaround: split up my documents into "static"
and "frequently updated" parts
and join them at query time.

What is the exact join query to do a query like category:bugfixes AND
body:answer,
  matching category:bugfixes in doc1 and
  matching body:answer in doc3,
  with just returning doc 1?

I adapted the field names of
doc 3:
type: out
out_ticketid: 1001
out_body: this is my answer
out_category: other

q={!join+from=out_ticketid+to=ticketid}(category:bugfixes+OR+out_category:bugfixes)+AND+(body:answer+OR+out_body:answer)


Writing this, I doubt this syntax is even possible!?
Additionally I'm not sure if trunk with SOLR-2272 is production ready.

The only way to do what I want in a released 3.x version is to do several 
searches and joining the results manually.
e.g.
q=category:bugfixes -> doc1 -> ticketid: 1001
q=body:answers -> doc3 -> ticketid: 1001
-> result ticketid: 1001

This way I would lose benefits like faceted search etc. :-\

Any suggestions?


Regards,

Kai Gülzau

-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, October 13, 2011 4:52 PM
To: solr-user@lucene.apache.org
Subject: document update / nested documents / document join

Hi *,

I am a bit confused about what is the best way to achieve my requirements.

We have a mail ticket system. A ticket is created when a mail is received by 
the system:

doc 1:
uid: 1001_in
ticketid: 1001
type: in
body: I have a problem
category: bugfixes
date: 201110131955

This incoming document is static. While the ticket is in progress there is 
another document representing the current/last state of the ticket. Some fields 
of this document are updated frequently:

doc 2:
uid: 1001_out
ticketid: 1001
type: out
body:
category: bugfixes
date: 201110132015

a bit later (doc 2 is deleted/updated):
doc 3:
uid: 1001_out
ticketid: 1001
type: out
body: this is my answer
category: other
date: 201110140915

I would like to do a boolean search spanning multiple documents like 
category:bugfixes AND body:answer.

I think it's the same as what was proposed by:
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

So I dug into the depths of the Lucene and Solr tickets and now I am stuck
choosing the right way:

https://issues.apache.org/jira/browse/LUCENE-2454 Nested Document query support
https://issues.apache.org/jira/browse/LUCENE-3171 BlockJoinQuery/Collector
https://issues.apache.org/jira/browse/LUCENE-1879 Parallel incremental indexing
https://issues.apache.org/jira/browse/SOLR-139 Support updateable/modifiable 
documents
https://issues.apache.org/jira/browse/SOLR-2272 Join


If it were easily possible to update one field in a document I would just merge
the two logical documents into one representing the whole ticket. But I can't
see that this is already possible.

SOLR-2272 seems to be the best solution by now but feels like a workaround:
I can't update a document field, so I split it up into static and dynamic
content and join both at query time.

SOLR-2272 is committed to trunk/Solr 4.
Are there any planned release dates for Solr 4 or a possible backport of
SOLR-2272 to 3.x?


I would appreciate any suggestions.

Regards,

Kai Gülzau







RE: document update / nested documents / document join

2011-10-17 Thread Kai Gülzau
I just found another feature/ticket to be able to update fields:
https://issues.apache.org/jira/browse/SOLR-2753
https://issues.apache.org/jira/browse/LUCENE-1231

-> CSF (Column Stride Fields)

This should work well with simple fields like category/date/...!?

So I have 2 options:
1.)
Introduce rather complex logic on the client side to form the right join query
(or do the join manually),
which should, as you stated, work even with complex queries.

2.)
Or do it the straightforward way: combine all docs into one and WAIT for one of
the various update-field/doc
features to be realized.


I think I'll give 1.) a try and wait for 2.) if I get into trouble.


Regards,

Kai Gülzau
  

-Original Message-
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Monday, October 17, 2011 1:22 PM
To: solr-user@lucene.apache.org
Subject: Re: document update / nested documents / document join

Hi,

First, I'm not sure you know, but the join isn't like a join in a database;
it's more like
   select * from (set of documents that match query)
   where exists (set of documents that match join query)

I have some complex cases (multiple join fq) in one call and that is fine, so I
think this query may work also. Otherwise you could try something like:
other wise you could try something like:
q=*:*&fq={!join+from=out_ticketid+to=ticketid}(category:bugfixes+OR+out_category:bugfixes)&fq={!join+from=out_ticketid+to=ticketid}(body:answer+OR+out_body:answer)

My wish would also be that this were backported to 3.x. But if not we'll
probably go live on 4.x.

Thijs


On 17-10-2011 11:46, Kai Gülzau wrote:
 Nobody?

 SOLR-139 seems to be the most popular issue but I don’t think this will be 
 resolved in near future (this year). Right?

 So I will try SOLR-2272 as a workaround, split up my documents in static 
 and  frequently updated
 and join them at query time.

 What is the exact join query to do a query like category:bugfixes AND 
 body:answer
matching category:bugfixes in doc1 and
matching body:answer in doc3
with just returning doc 1??

 I adopted the fieldnames of
 doc 3:
 type: out
 out_ticketid: 1001
 out_body: this is my answer
 out_category: other

 q={!join+from=out_ticketid+to=ticketid}(category:bugfixes+OR+out_category:bugfixes)+AND+(body:answer+OR+out_body:answer)


 Writing this, I doubt this syntax is even possible!?
 Additionally I'm not sure if trunk with SOLR-2272 is production ready.

 The only way to do what I want in a released 3.x version is to do several 
 searches and joining the results manually.
 e.g.
 q=category:bugfixes -> doc1 -> ticketid: 1001
 q=body:answers -> doc3 -> ticketid: 1001
 -> result ticketid: 1001

 This I way I would lose benefits like faceted search etc. :-\

 Any suggestions?


 Regards,

 Kai Gülzau

 -Original Message-
 From: Kai Gülzau [mailto:kguel...@novomind.com]
 Sent: Thursday, October 13, 2011 4:52 PM
 To: solr-user@lucene.apache.org
 Subject: document update / nested documents / document join

 Hi *,

 i am a bit confused about what is the best way to achieve my requirements.

 We have a mail ticket system. A ticket is created when a mail is received by 
 the system:

 doc 1:
 uid: 1001_in
 ticketid: 1001
 type: in
 body: I have a problem
 category: bugfixes
 date: 201110131955

 This incoming document is static. While the ticket is in progress there is 
 another document representing the current/last state of the ticket. Some 
 fields of this document are updated frequently:

 doc 2:
 uid: 1001_out
 ticketid: 1001
 type: out
 body:
 category: bugfixes
 date: 201110132015

 a bit later (doc 2 is deleted/updated):
 doc 3:
 uid: 1001_out
 ticketid: 1001
 type: out
 body: this is my answer
 category: other
 date: 201110140915

 I would like to do a boolean search spanning multiple documents like 
 category:bugfixes AND body:answer.

 I think it's the same what was proposed by:
 http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

 So I dig into the deeps of Lucene and Solr tickets and now i am stuck 
 choosing the right way:

 https://issues.apache.org/jira/browse/LUCENE-2454 Nested Document 
 query support
 https://issues.apache.org/jira/browse/LUCENE-3171 
 BlockJoinQuery/Collector
 https://issues.apache.org/jira/browse/LUCENE-1879 Parallel incremental 
 indexing
 https://issues.apache.org/jira/browse/SOLR-139 Support 
 updateable/modifiable documents
 https://issues.apache.org/jira/browse/SOLR-2272 Join


 If it is easily possible to update one field in a document i would just merge 
 the two logical documents into one representing the whole ticket. But i can't 
 see this is already possible.

 SOLR-2272 seems to be the best solution by now but feels like workaround.
  I can't update a document field so i split it up in static and dynamic 
 content and join both at query time.

 SOLR-2272 is committed to trunk/solr 4.
 Are there any planned release dates for solr 4 or a possible backport for 
 SOLR-2272 in 3.x?


 I would appreciate any suggestions.

 Regards,

 Kai Gülzau








RE: Multiple indexes

2011-06-17 Thread Kai Gülzau
  (for example if you need separate TFs for each document type).
 
 I wonder if in this precise case it wouldn't be pertinent to 
 have a single index with the various document types each 
 having each their own fields set. Isn't TF calculated field by field ?

Oh, you are right :)
So I will start testing with one mixed-type index and
perhaps use an IndexReaderFactory afterwards for comparison.

Thanks,

Kai Gülzau

RE: Multiple indexes

2011-06-16 Thread Kai Gülzau
Are there any plans to support a kind of federated search
in a future Solr version?

I think there are reasons to use separate indexes for each document type
but do combined searches on these indexes
(for example if you need separate TFs for each document type).

I am aware of http://wiki.apache.org/solr/DistributedSearch
and a workaround to do federated search with sharding
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
but this seems to be too much network- and maintenance overhead.

Perhaps it is worth a try to use an IndexReaderFactory which
returns a Lucene MultiReader!?
Is the IndexReaderFactory still Experimental?
https://issues.apache.org/jira/browse/SOLR-1366


Regards,

Kai Gülzau

 

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
 Sent: Wednesday, June 15, 2011 8:43 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Multiple indexes
 
 Next, however, I predict you're going to ask how you do a 'join' or 
 otherwise query accross both these cores at once though. You can't do 
 that in Solr.
 
 On 6/15/2011 1:00 PM, Frank Wesemann wrote:
  You'll configure multiple cores:
  http://wiki.apache.org/solr/CoreAdmin
  Hi.
 
  How to have multiple indexes in SOLR, with different fields and
  different types of data?
 
  Thank you very much!
  Bye.
 
 
 

RE: Is there anything like MultiSearcher?

2011-06-15 Thread Kai Gülzau
Hi Roman,

have you solved your problem, and how?

Regards,

Kai Gülzau

 

 -Original Message-
 From: Roman Chyla [mailto:roman.ch...@gmail.com] 
 Sent: Saturday, February 05, 2011 4:50 PM
 To: solr-user@lucene.apache.org
 Subject: Is there anything like MultiSearcher?
 
 Dear Solr experts,
 
 Could you recommend some strategies or perhaps tell me if I approach
 my problem from a wrong side? I was hoping to use MultiSearcher to
 search across multiple indexes in Solr, but there is no such a thing
 and MultiSearcher was removed according to this post:
 http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html
 
 I though I had two use cases:
 
 1. maintenance - I wanted to build two separate indexes, one for
 fulltext and one for metadata (the docs have the unique ids) -
 indexing them separately would make things much simpler
 2. ability to switch indexes at search time (ie. for testing purposes
 - one fulltext index could be built by Solr standard mechanism, the
 other by a rather different process - independent instance of lucene)
 
 I think the recommended approach is to use the Distributed search - I
 found a nice solution here:
 http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
 - however it seems to me, that data are sent over HTTP (5M from one
 core, and 5M from the other core being merged by the 3rd solr core?)
 and I would like to do it only for local indexes and without the
 network overhead.
 
 Could you please shed some light if there already exist an optimal
 solution to my use cases? And if not, whether I could just try to
 build a new SolrQuerySearcher that is extending lucene MultiSearcher
 instead of IndexSearch - or you think there are some deeply rooted
 problems there and the MultiSearch-er cannot work inside Solr?
 
 Thank you,
 
   Roman