Re: Indexing TIKA extracted text. Are there some issues?

2009-07-29 Thread ashokc

Sure.

The java command I use with TIKA to extract text from a URL is:

java -jar tika-0.3-standalone.jar -t $url

I have also attached the screenshots of the web page, post documents
produced in the two different ways (Perl & Tika) for that web page, and the
screenshots of the search result for a string contained in that web page.
The index in each case contains just this one URL. To keep everything else
identical, I used the same instance for creating the index in each case.
First I posted the Tika document, checked for the results, emptied the
index, posted the Perl document, and checked the results.

Debug query for Tika:

<str name="parsedquery">
+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
| title:高通公司展现了海量的优质多媒体内容能^2.0 |
content_china:高通 通公 公司 司展 展现 现了 了海 海量
量的 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()
</str>

Debug query for Perl:

<str name="parsedquery">
+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
| title:高通公司展现了海量的优质多媒体内容能^2.0 |
content_china:高通 通公 公司 司展 展现 现了 了海 海量
量的 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()
</str>
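The content_china clause in both parsed queries shows CJKTokenizer's overlapping-bigram output: every adjacent pair of CJK characters becomes a token, so a match requires the indexed text to produce the same bigrams. A toy sketch of that scheme (an illustration only, not Lucene's actual CJKTokenizer code):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Produce overlapping two-character tokens, similar in spirit to
    // what CJKTokenizer does with runs of CJK characters.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "高通公司" -> [高通, 通公, 公司], matching the debug output above
        System.out.println(bigrams("高通公司"));
    }
}
```

If the indexed side of the field was fed mojibake instead of the real characters, its bigrams can never equal the query-side bigrams above, which is why a byte-level encoding mistake silently produces zero hits.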

The screenshots
http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx 

Perl extracted doc
http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml 

Tika extracted doc
http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml 


Grant Ingersoll-6 wrote:
 
 Hmm, looks very much like an encoding problem.  Can you post a sample  
 showing it, along with the commands you invoked?
 
 Thanks,
 Grant
 
 On Jul 28, 2009, at 6:14 PM, ashokc wrote:
 

 I am finding that the search results based on indexing Tika  
 extracted text
 are very different from results based on indexing the text extracted  
 via
 other means. This shows up for example with a Chinese web site that  
 I am
 trying to index.

 I created the documents (for posting to SOLR) in two ways. The  
 source text
 of the web pages are full of html entities like &#12345; and some  
 English
 characters mixed in.

 (a) Simple text extraction from the page source by a Perl script. The
 resulting content field looks like

 <field name="content_china">Who We Are
 &#20844;&#21496;&#21382;&#21490;
 &#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
 &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376;
 Innovation
 &#21019; etc... </field>

 I posted these documents to a SOLR instance

 (b) Used Tika (command line). The resulting content field looks like

 <field name="content_china">Who We Are å ¬å¸à
 ¥ÂŽÂ†Ã¥ÂÂ²
 您的成功æ¡
 ˆä¾‹ 领导团队  
 业务部门  Innovation à 
 ¥Â
 etc... </field>

 I posted these documents to a different instance

 When I search the first instance for a string (that I copied &
 pasted from
 the web site) I find a number of hits, including the page from which I
 copied the string from. But when I do the same on the instance with  
 Tika
 extracted text - I get nothing.

 Has anyone seen this? I believe it may have to do with encoding. In  
 both
 cases the posted documents were utf-8 compliant.

 Thanks for your insights.

 - ashok

 -- 
 View this message in context:
 http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
 using Solr/Lucene:
 http://www.lucidimagination.com/search
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Indexing TIKA extracted text. Are there some issues?

2009-07-29 Thread ashokc

Could very well be... I will rectify it and try again. Thanks

- ashok



Robert Muir wrote:
 
 it appears there is an encoding problem: in the screenshot I can see
 the title is mangled, and if I open up the URL in IE or Firefox, both
 browsers think it is iso-8859-1.
 
 I think this is why (from w3c validator):
 
 Character Encoding mismatch!
 
 The character encoding specified in the HTTP header (iso-8859-1) is
 different from the value in the meta element (utf-8). I will use the
 value from the HTTP header (iso-8859-1) for this validation.
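That header/meta mismatch is exactly the kind of thing that produces the mangled index text: bytes correctly encoded as UTF-8 but decoded under a single-byte Latin codepage. A minimal sketch of the effect (windows-1252 is used here, the codepage browsers commonly substitute for iso-8859-1; the original thread does not show this code):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "高通";  // first two characters of the page title
        // Encode correctly as UTF-8, then decode with the wrong Latin charset,
        // as a client honoring the iso-8859-1 HTTP header would do.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String mangled = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(mangled);  // é«˜é€š -- the same garbage seen in the Tika debug output
    }
}
```

Once fetched text is damaged this way, no amount of downstream tokenization can recover it, so the query-time bigrams never match the indexed ones.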
 
 On Wed, Jul 29, 2009 at 6:02 PM, ashokc ash...@qualcomm.com wrote:

 Sure.

 The java command I use with TIKA to extract text from a URL is:

 java -jar tika-0.3-standalone.jar -t $url

 I have also attached the screenshots of the web page, post documents
 produced in the two different ways (Perl & Tika) for that web page, and
 the
 screenshots of the search result for a string contained in that web page.
 The index in each case contains just this one URL. To keep everything
 else
 identical, I used the same instance for creating the index in each case.
 First I posted the Tika document, checked for the results, emptied the
 index, posted the Perl document, and checked the results.

 Debug query for Tika:

 <str name="parsedquery">
 +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
 的优质多媒体内容能^2.0
 | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
 content_china:高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
 é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()
 </str>

 Debug query for Perl:

 <str name="parsedquery">
 +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
 的优质多媒体内容能^2.0
 | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
 content_china:高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
 é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()
 </str>

 The screenshots
 http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx

 Perl extracted doc
 http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml

 Tika extracted doc
 http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml


 Grant Ingersoll-6 wrote:

 Hmm, looks very much like an encoding problem.  Can you post a sample
 showing it, along with the commands you invoked?

 Thanks,
 Grant

 On Jul 28, 2009, at 6:14 PM, ashokc wrote:


 I am finding that the search results based on indexing Tika
 extracted text
 are very different from results based on indexing the text extracted
 via
 other means. This shows up for example with a Chinese web site that
 I am
 trying to index.

 I created the documents (for posting to SOLR) in two ways. The
 source text
 of the web pages are full of html entities like &#12345; and some
 English
 characters mixed in.

 (a) Simple text extraction from the page source by a Perl script. The
 resulting content field looks like

 <field name="content_china">Who We Are
 &#20844;&#21496;&#21382;&#21490;
 &#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
 &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376;
 Innovation
 &#21019; etc... </field>

 I posted these documents to a SOLR instance

 (b) Used Tika (command line). The resulting content field looks like

 <field name="content_china">Who We Are Ã¥ ¬å ¸Ã
 ¥ÂŽÂ†Ã¥Â ²
 您的戠功æ¡
 ˆä¾‹ 领导团队
 业务部门 Â Innovation Ã
 ¥Â
 etc... </field>

 I posted these documents to a different instance

 When I search the first instance for a string (that I copied &
 pasted from
 the web site) I find a number of hits, including the page from which I
 copied the string from. But when I do the same on the instance with
 Tika
 extracted text - I get nothing.

 Has anyone seen this? I believe it may have to do with encoding. In
 both
 cases the posted documents were utf-8 compliant.

 Thanks for your insights.

 - ashok

 --
 View this message in context:
 http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
 using Solr/Lucene:
 http://www.lucidimagination.com/search




 --
 View this message in context:
 http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 
 -- 
 Robert Muir
 rcm...@gmail.com
 
 

-- 
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24729595.html
Sent from the Solr - User mailing list archive at Nabble.com.



Indexing TIKA extracted text. Are there some issues?

2009-07-28 Thread ashokc

I am finding that the search results based on indexing Tika extracted text
are very different from results based on indexing the text extracted via
other means. This shows up for example with a Chinese web site that I am
trying to index.

I created the documents (for posting to SOLR) in two ways. The source text
of the web pages are full of html entities like &#12345; and some English
characters mixed in.

(a) Simple text extraction from the page source by a Perl script. The
resulting content field looks like

<field name="content_china">Who We Are &#20844;&#21496;&#21382;&#21490;
&#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
&#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376; Innovation
&#21019; etc... </field>

I posted these documents to a SOLR instance

(b) Used Tika (command line). The resulting content field looks like

<field name="content_china">Who We Are å ¬å¸åŽ†å²
您的成功æ¡
ˆä¾‹ 领导团队 业务部门  Innovation Ã¥Â
etc... </field>

I posted these documents to a different instance

When I search the first instance for a string (that I copied & pasted from
the web site) I find a number of hits, including the page from which I
copied the string from. But when I do the same on the instance with Tika
extracted text - I get nothing.

Has anyone seen this? I believe it may have to do with encoding. In both
cases the posted documents were utf-8 compliant.

Thanks for your insights.

- ashok

-- 
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: CJKTokenizerFactory seems to work for Korea but not for China and Japan

2009-07-01 Thread ashokc

Yes, I reindexed the entire repository after each of my changes. Here is the
output with debug on.

== DEBUG OUTPUT BEGIN ==
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">83</int>
 <lst name="params">
  <str name="wt">standard</str>
  <str name="rows">10</str>
  <str name="explainOther"/>
  <str name="start">0</str>
  <str name="hl.fl">content</str>
  <str name="indent">on</str>
  <str name="fl">*,score</str>
  <str name="hl">on</str>
  <str name="q">创意或商业创新、</str>
  <str name="debugQuery">on</str>
  <str name="qt">dismax</str>
  <str name="version">2.2</str>
 </lst>
</lst>
<result name="response" numFound="0" start="0" maxScore="0.0"/>

<lst name="debug">
 <str name="rawquerystring">创意或商业创新、</str>
 <str name="querystring">创意或商业创新、</str>
 <str name="parsedquery">+DisjunctionMaxQuery((content:创意或商业创新、 |
urltext:创意或商业创新、^2.0 | title:创意或商业创新、^2.0)~0.01) ()</str>
 <str name="parsedquery_toString">+(content:创意或商业创新、 | urltext:创意或商业创新、^2.0
| title:创意或商业创新、^2.0)~0.01 ()</str>
 <lst name="explain"/>
 <str name="QParser">DismaxQParser</str>
 <null name="altquerystring"/>
 <null name="boostfuncs"/>
</lst>

== DEBUG OUTPUT END ==






 
 That is strange. Can you add a request parameter debugQuery=on and post
 the
 response? Also, whenever you change the field type (use a different
 tokenizer etc.), make sure you re-index the documents.
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/CJKTokenizerFactory-seems-to-work-for-Korea-but-not-for-China-and-Japan-tp24279927p24292975.html
Sent from the Solr - User mailing list archive at Nabble.com.



CJKTokenizerFactory seems to work for Korea but not for China and Japan

2009-06-30 Thread ashokc

Hi

I have the following fieldType that processes Korean/Chinese/Japanese text

<fieldType name="cjk_text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

When I supply Korean words/phrases in the query, I do get several expected
Korean URLs as search results, and my keywords are correctly highlighted
in the excerpt. But for Chinese & Japanese I almost always draw a blank -
i.e. no hits.

I ran sample Chinese/Japanese text through 'analysis'
(/search/admin/analysis.jsp) and it does highlight the matches it found for
the query words I supplied. But when I actually search for it
(/search/admin/form.jsp) I get no hits.

For Chinese text I have also tried

<fieldType name="cn_text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.ChineseTokenizerFactory"/>
    <filter class="solr.ChineseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ChineseTokenizerFactory"/>
    <filter class="solr.ChineseFilterFactory"/>
  </analyzer>
</fieldType>

Same behavior.

I am using SOLR for several other languages like
Russian/Spanish/Italian/French/German etc... (each with its own tokenizers &
stemmers too, where available) and I do get results that correctly highlight
the words I am supplying in the query. While I can't judge the meaningful
quality of the results, I am satisfied that SOLR is returning documents that
contain the query string(s).

Not sure what the problem may be with Chinese & Japanese. I have updated my
SOLR distribution to the latest nightly solr-2009-06-29.zip just in case.
It has not helped, of course. Thanks for your help. - ashok
-- 
View this message in context: 
http://www.nabble.com/CJKTokenizerFactory-seems-to-work-for-Korea-but-not-for-China-and-Japan-tp24279927p24279927.html
Sent from the Solr - User mailing list archive at Nabble.com.



copyfield and 'store' and highlighting

2009-06-10 Thread ashokc

Hi,
I copy 'field1' to 'field2' so that I can apply a different set of analyzers
& filters. Content-wise, they are identical. 'field2' has to be stored
because it is used for high-lighting. Do I have to declare 'field1' also to
be stored? 'field1' is never returned in the response. Thanks. - ashok
-- 
View this message in context: 
http://www.nabble.com/copyfield-and-%27store%27-and-highlighting-tp23967232p23967232.html
Sent from the Solr - User mailing list archive at Nabble.com.



qf boost Versus field boost for Dismax queries

2009-06-09 Thread ashokc

When 'dismax' queries are used, where is the best place to apply boost
values/factors? While indexing by supplying the 'boost' attribute to the
field, or in solrconfig.xml by specifying the 'qf' parameter with the same
boosts? What are the advantages/disadvantages to each? What happens if both
boosts are present? Do they get multiplied?

Thanks

- ashok
-- 
View this message in context: 
http://www.nabble.com/qf-boost-Versus-field-boost-for-Dismax-queries-tp23952323p23952323.html
Sent from the Solr - User mailing list archive at Nabble.com.



How to disable posting updates from a remote server

2009-06-04 Thread ashokc

Hi,

I find that I am freely able to post to my production SOLR server, from any
other host that can run the post command. So somebody can wipe out the whole
index by posting a delete query. Is there a way SOLR can be configured so
that it will take updates ONLY from the server on which it is running?
Thanks - ashok
-- 
View this message in context: 
http://www.nabble.com/How-to-disable-posting-updates-from-a-remote-server-tp23876170p23876170.html
Sent from the Solr - User mailing list archive at Nabble.com.



Highlighting and Field options

2009-06-01 Thread ashokc

Hi,

The 'content' field that I am indexing is usually large (e.g. a pdf doc of a
few Mb in size). I need highlighting to be on. This 'seems' to require that
I have to set the 'content' field to be STORED. This returns the whole
content field in the search result XML for each matching document. The
highlighted text also is returned in a separate block. But I do NOT need the
entire content field to display the search results. I only use the
highlighted segments to display a brief description of each hit. The fact
that SOLR returns entire content field, makes the returned XML unnecessarily
huge, and makes for larger response times. How can I have SOLR return ONLY
the highlighted text for each hit and NOT the entire 'content' field? Thanks
- ashok
-- 
View this message in context: 
http://www.nabble.com/Highlighting-and-Field-options-tp23818019p23818019.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Boosting by facets with standard query

2009-04-19 Thread ashokc

Thanks for the tip. Looks like a neat idea. I have never used the sort
feature, so I have to create a new numeric key with values 1 or 2 - value 1
for white_papers/pdfs & 2 for others?

The problem also is that the facets I need to boost can vary by query. That
is, if the query term is 'a', boost the facets 'facet1 & facet2'. If the query
term is 'b', then boost the facets 'facet4 & facet5'. Perhaps I can identify
the most frequently used boost orders, and create as many fields as there are
orders. That would be the way, right?
- ashok



Shalin Shekhar Mangar wrote:
 
 On Fri, Apr 17, 2009 at 11:32 AM, ashokc ash...@qualcomm.com wrote:
 

 What we need is for the white_papers & pdfs to be boosted, but if and only
 if such documents are valid results to the search term in question. How
 would I write my above 'q' to accomplish that?

 
 Thanks for explaining in detail.
 
 Basically, all you want to do is sort the results in the following order:
 1. White papers
 2. PDFs
 3. Others
 
 or maybe #1 and #2 are equivalent and can be intermingled.
 
 Easiest way to do this is to index a new field whose values (when sorted)
 give you the desired order. Then you can simply sort on that field and
 score.
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/Boosting-by-facets-with-standard-query-tp23084860p23123288.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Boosting by facets with standard query

2009-04-17 Thread ashokc

What you indicated here is for a different purpose, is it not? I already do
something similar with my 'q'. For example a sample query logged in
'catalina.out' looks like

webapp=/search path=/select
params={rows=15&start=0&q=(+(content:umts)+OR+(title:umts)^2+OR+(urltext:umts)^2)}

when the search term is umts. I am looking for this term umts in the
fields - (a) content, (b) title (boosted by a factor of 2) and (c) urltext
(boosted by a factor of 2). So the presence of the term umts in title or
url is weighed more than its presence in the regular content. So far so
good.

Now, I have other fields as well, like document type, file type etc... that
serve as facets to telescope down. Among the above set of search results, I
want to boost a specific document type 'white_papers' & a specific file type
pdf. By boosting I mean that these white_paper & pdf documents should
float to the top of the heap in the search results, if such documents are at
all present in the search results.

So would I simply add the following to the above q?

q=(+(content:umts)+OR+(title:umts)^2+OR+(urltext:umts)^2)+AND+(doctype:white_papers)^2+AND+(filetype:pdf)^2

But wouldn't the above give 0 results if there are no white_papers & pdfs
(because of the AND)? If I use OR, then the meaning of the query is lost
altogether.

What we need is for the white_papers & pdfs to be boosted, but if and only
if such documents are valid results to the search term in question. How
would I write my above 'q' to accomplish that?

Thanks

- ashok



Shalin Shekhar Mangar wrote:
 
 On Fri, Apr 17, 2009 at 1:03 AM, ashokc ash...@qualcomm.com wrote:
 

 I have a query that yields results binned in several facets. How can I
 boost
 the results that fall in certain facets over the rest of them that do not
 belong to those facets? I use the standard query format. Thank you
 
 
 I'm not sure what you mean by boosting by facet. Do you mean that you want
 to boost documents which match a term query?
 
 If yes, you can use your_field_name:value^2.0 in the q parameter.
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/Boosting-by-facets-with-standard-query-tp23084860p23091586.html
Sent from the Solr - User mailing list archive at Nabble.com.



Boosting by facets with standard query

2009-04-16 Thread ashokc

I have a query that yields results binned in several facets. How can I boost
the results that fall in certain facets over the rest of them that do not
belong to those facets? I use the standard query format. Thank you
- ashok
-- 
View this message in context: 
http://www.nabble.com/Boosting-by-facets-with-standard-query-tp23084860p23084860.html
Sent from the Solr - User mailing list archive at Nabble.com.



DIH uniqueKey

2009-04-14 Thread ashokc

Hi,

I have separate JDBC datasources (DS1 & DS2) that I want to index with DIH
in a single SOLR instance. The unique record keys for the two sources are
different. Do I have to synthesize a uniqueKey that spans both the
datasources? Something like this? That is, the uniqueKey values will be like
(+ indicating concatenation):

DS1 + primary key for DS1

DS2 + primary key for DS2
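One way to build such a concatenated key inside DIH is TemplateTransformer; a hedged sketch (the entity and column names below are placeholders, and note that DIH column lookups are case sensitive, so JDBC drivers like Oracle's may require the uppercase form):

<entity name="ds1_docs" dataSource="DS1" transformer="TemplateTransformer"
        query="SELECT PK, TITLE FROM docs">
  <!-- prefix the datasource name so keys from DS1 and DS2 cannot collide -->
  <field column="id" template="DS1-${ds1_docs.PK}"/>
</entity>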

Thanks
- ashok
-- 
View this message in context: 
http://www.nabble.com/DIH---uniqueKey-tp23042732p23042732.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: More than one language in the same document

2009-04-07 Thread ashokc

What I am doing right now is to capture all the content under content_korea
for example, use 'copyField' to duplicate that content to content_english.
content_korea gets processed with CJK analyzers, and content_english
gets processed with usual detailed index/query analyzers, filters, synonyms.
Some results do come up, but I have not been able to verify that this
approach is yielding better results.

A related question. What does 'copyField' actually do? Does it 'append'
content from the source field to the 'target' field? Or does it
replace/overwrite it? Thank you.

- ashok



hossman wrote:
 
 
 : I have documents where text from two languages, e.g. (english & korean) or
 : (english & german) are mixed up in a fairly intensive way. 20-30% of the
 
 if you search the list archives you'll find a lot of results for 
 languages ... it's not something i deal with much but i believe using 
 separate fields (or dynamic fields) for each language is considered the 
 best strategy.
 
 
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://www.nabble.com/More-than-one-language-in-the-same-document-tp22726478p22939331.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Oracle Clob column with DIH does not turn to String

2009-04-04 Thread ashokc

Yes, you are correct. But the documentation for DIH says the column names are
case insensitive. That should be fixed. Here is what it says:

=
A shorter data-config

In the above example, there are mappings of fields to Solr fields. It is
possible to totally avoid the field entries in entities if the names of the
fields are same (case does not matter) as those in Solr schema.



Noble Paul നോബിള്‍  नोब्ळ् wrote:
 
 it is very expensive to do a case-insensitive lookup. It must first
 convert all the keys to lower case and try looking up there, because
 it may not always be in uppercase; it can be in mixed case as well
 
 On Sat, Apr 4, 2009 at 12:58 AM, ashokc ash...@qualcomm.com wrote:

 Happy to report that it is working. Looks like we have to use UPPER CASE
 for
 all the column names. When I examined the map 'aRow', it had the column
 names in upper case, where as my config had lower case. No match was
 found
 so nothing happened. Changed my config and it works now. Thanks for your
 help. Perhaps this transformer can be modified to be case-insensitive for
 the column names. If you had written it perhaps it is a quick change for
 you?

 Noble Paul നോബിള്‍  नोब्ळ् wrote:

 I guess you can write a custom transformer which gets a String out of
 the oracle.sql.CLOB. I am clueless as to why this may happen. I
 even wrote a testcase and it seems to work fine
 --Noble

 On Fri, Apr 3, 2009 at 10:23 PM, ashokc ash...@qualcomm.com wrote:

 I downloaded the nightly build yesterday (2nd April), modified the
 ClobTransformer.java file with some prints, compiled it all (ant dist).
 It
 produced a war file, apache-solr-1.4-dev.war. That is what I am
 using.
 My
 modification  compilation has not affected the results. I was getting
 the
 same behavior with the 'war' that download came with. Thanks Noble.

 Noble Paul നോബിള്‍  नोब्ळ् wrote:

 and which version of Solr are u using?

 On Fri, Apr 3, 2009 at 10:09 PM, ashokc ash...@qualcomm.com wrote:

 Sure:

 data-config Xml
 ===

 <dataConfig>
    <dataSource driver="oracle.jdbc.driver.OracleDriver"
 url="jdbc:oracle:thin:@x" user="remedy" password="y"/>
    <document name="remedy">
            <entity name="log" transformer="ClobTransformer" query="SELECT
 mylog_ato, name_char, dsc FROM log_tbl">
                <field column="mylog_ato" name="log_no" />
                <field column="name_char" name="short_desc" />
                <field column="dsc" clob="true" name="description" />
            </entity>
    </document>
 </dataConfig>

 ===

 A search result on the field short_desc:
 --

 <doc>
 <float name="score">1.8670129</float>
 <str name="description">oracle.sql.c...@155e3ab</str>
 <int name="log_no">4486</int>
 <str name="short_desc">Develop Rating functionality for QIN</str>
 <date name="timestamp">2009-04-03T11:47:32.635Z</date>
 </doc>




 Noble Paul നോബിള്‍  नोब्ळ् wrote:

 There is something else wrong with your setup.

 can you just paste the whole data-config.xml

 --Noble

 On Fri, Apr 3, 2009 at 5:39 PM, ashokc ash...@qualcomm.com wrote:

 Noble,
 I put in a few 'System.out.println' statements in the
 ClobTransformer.java
 file & remade the war. But I see none of these prints coming up in
 my
 'catalina.out' file. Is that the right file to be looking at?

 As an aside, is 'catalina.out' the ONLY log file for SOLR? I turned
 on
 the
 logging to 'FINE' for everything. Also, these settings seem to go
 away
 when
 Tomcat is restarted.
 - ashok

 Noble Paul നോബിള്‍  नोब्ळ् wrote:

 yeah, ant dist will give you the .war file you may need . just
 drop
 it
 in and you are set to go. or if you can hook up a debugger to a
 running Solr that is the easiest
 --Noble

 On Fri, Apr 3, 2009 at 9:35 AM, ashokc ash...@qualcomm.com
 wrote:

 That would require me to recompile (with ant/maven scripts?) the
 source
 and
 replace the jar for DIH, right? I can try - for the first time.
 - ashok

 Noble Paul നോബിള്‍  नोब्ळ् wrote:

 This looks strange. Apparently the Transformer did not get
 applied.
 Is
 it possible for you to debug ClobTransformer
 adding(System.out.println
 into ClobTransformer may help)

 On Fri, Apr 3, 2009 at 6:04 AM, ashokc ash...@qualcomm.com
 wrote:

 Correcting my earlier post. It lost some lines somehow.

 Hi,

 I have set up to import some oracle clob columns with DIH. I am
 using
 the
 latest nightly release. My config says,


 <entity name="log" transformer="ClobTransformer"
 ...

    <field column="description" clob="true" name="description" />

 </entity>

 But it does not seem to turn this clob into a String. The
 search
 results
 show:

 <doc>
   <float name="score">1.8670129</float>
   <str name="description">oracle.sql.c...@aed3a5</str>
   <int name="log_no">4486</int>
 </doc>

 Any pointers on why I do not get the 'string' out of the clob
 for
 indexing?
 Is the nightly war NOT the right one to use?

 Thanks for your help.

 - ashok



 ashokc wrote:

 Hi,

 I have set up to import some oracle clob columns with DIH. I
 am
 using
 the
 latest nightly

Re: Multi-valued fields with DIH

2009-04-04 Thread ashokc

That worked. Thanks again.

Noble Paul നോബിള്‍  नोब्ळ् wrote:
 
 the column names are case sensitive try this
 
 <field column="PROJECT_AREA" name="projects" />
 <field column="PROJECT_VERSION" name="projects" />
 On Sat, Apr 4, 2009 at 3:58 AM, ashokc ash...@qualcomm.com wrote:

 Hi,
 I need to assign multiple values to a field, with each value coming from
 a
 different column of the sql query.

 My data config snippet has lines like

 <field column="project_area" name="projects" />
 <field column="project_version" name="projects" />

 where 'project_area'  'project_version' are output by the sql query to
 the
 datasource. The 'verbose-output' from dataimport.jsp does show that these
 columns have values returned by the query

 ===

 <lst name="verbose-output">
 <lst name="entity:log">
 <lst name="document#1">
 <str name="query">
 x
 </str>
 <str name="time-taken">0:0:0.142</str>
 <str>--- row #1-</str>
 <str name="PROJECT_AREA">MySource/Area/Admin</str>
 <str name="PROJECT_VERSION">MySource/Version/06.02</str>
 <date name="LAST_MODIFIED_DATE">2008-10-21T07:00:00Z</date>
 .

 ==

 But the resulting index has no data in the field 'projects'. Is it NOT
 possible to create multi-valued fields with DIH?

 Thanks
 --
 View this message in context:
 http://www.nabble.com/Multi-valued-fields-with-DIH-tp22877509p22877509.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 
 -- 
 --Noble Paul
 
 

-- 
View this message in context: 
http://www.nabble.com/Multi-valued-fields-with-DIH-tp22877509p22886586.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Oracle Clob column with DIH does not turn to String

2009-04-03 Thread ashokc

Noble,
I put in a few 'System.out.println' statements in the ClobTransformer.java
file & remade the war. But I see none of these prints coming up in my
'catalina.out' file. Is that the right file to be looking at?

As an aside, is 'catalina.out' the ONLY log file for SOLR? I turned on the
logging to 'FINE' for everything. Also, these settings seem to go away when
Tomcat is restarted.
- ashok

Noble Paul നോബിള്‍  नोब्ळ् wrote:
 
 yeah, ant dist will give you the .war file you may need. Just drop it
 in and you are set to go. Or if you can hook up a debugger to a
 running Solr that is the easiest
 --Noble
 
 On Fri, Apr 3, 2009 at 9:35 AM, ashokc ash...@qualcomm.com wrote:

 That would require me to recompile (with ant/maven scripts?) the source
 and
 replace the jar for DIH, right? I can try - for the first time.
 - ashok

 Noble Paul നോബിള്‍  नोब्ळ् wrote:

 This looks strange. Apparently the Transformer did not get applied. Is
 it possible for you to debug ClobTransformer adding(System.out.println
 into ClobTransformer may help)

 On Fri, Apr 3, 2009 at 6:04 AM, ashokc ash...@qualcomm.com wrote:

 Correcting my earlier post. It lost some lines somehow.

 Hi,

 I have set up to import some oracle clob columns with DIH. I am using
 the
 latest nightly release. My config says,


 <entity name="log" transformer="ClobTransformer"
 ...

    <field column="description" clob="true" name="description" />

 </entity>

 But it does not seem to turn this clob into a String. The search
 results
 show:

 <doc>
   <float name="score">1.8670129</float>
   <str name="description">oracle.sql.c...@aed3a5</str>
   <int name="log_no">4486</int>
 </doc>

 Any pointers on why I do not get the 'string' out of the clob for
 indexing?
 Is the nightly war NOT the right one to use?

 Thanks for your help.

 - ashok



 ashokc wrote:

 Hi,

 I have set up to import some oracle clob columns with DIH. I am using
 the
 latest nightly release. My config says,

 <entity name="description" transformer="ClobTransformer"> ...
    <field column="description" clob="true" />

 </entity>

 But it does not seem to turn this clob into a String. The search
 results
 show:

 <doc>
    <float name="score">1.8670129</float>
    <str name="description">oracle.sql.c...@aed3a5</str>
    <int name="log_no">4486</int>
 </doc>

 Any pointers on why I do not get the 'string' out of the clob for
 indexing? Is the nightly war NOT the right one to use?

 Thanks for your help.

 - ashok




 --
 View this message in context:
 http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859865.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 --Noble Paul



 --
 View this message in context:
 http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22861630.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 
 -- 
 --Noble Paul
 
 

-- 
View this message in context: 
http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22867161.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Oracle Clob column with DIH does not turn to String

2009-04-03 Thread ashokc

Sure:

data-config Xml
===

<dataConfig>
    <dataSource driver="oracle.jdbc.driver.OracleDriver"
        url="jdbc:oracle:thin:@x" user="remedy" password="y"/>
    <document name="remedy">
        <entity name="log" transformer="ClobTransformer" query="SELECT
            mylog_ato, name_char, dsc FROM log_tbl">
            <field column="mylog_ato" name="log_no" />
            <field column="name_char" name="short_desc" />
            <field column="dsc" clob="true" name="description" />
        </entity>
    </document>
</dataConfig>

===

A search result on the field short_desc:
--

<doc>
    <float name="score">1.8670129</float>
    <str name="description">oracle.sql.c...@155e3ab</str>
    <int name="log_no">4486</int>
    <str name="short_desc">Develop Rating functionality for QIN</str>
    <date name="timestamp">2009-04-03T11:47:32.635Z</date>
</doc>




Noble Paul നോബിള്‍  नोब्ळ् wrote:
 
 There is something else wrong with your setup.
 
 can you just paste the whole data-config.xml
 
 --Noble
 
 On Fri, Apr 3, 2009 at 5:39 PM, ashokc ash...@qualcomm.com wrote:

 Noble,
 I put in a few 'System.out.println' statements in the
 ClobTransformer.java
 file  remade the war. But I see none of these prints coming up in my
 'catalina.out' file. Is that the right file to be looking at?

 As an aside, is 'catalina.out' the ONLY log file for SOLR? I turned on
 the
 logging to 'FINE' for everything. Also, these settings seem to go away
 when
 Tomcat is restarted.
 - ashok

 Noble Paul നോബിള്‍  नोब्ळ् wrote:

 yeah, ant dist will give you the .war file you may need . just drop it
 in and you are set to go. or if you can hook up a debugger to a
 running Solr that is the easiest
 --Noble


 --
 --Noble Paul



 --
 View this message in context:
 http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22867161.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 
 -- 
 --Noble Paul
 
 

-- 
View this message in context: 
http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22872184.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Oracle Clob column with DIH does not turn to String

2009-04-03 Thread ashokc

I downloaded the nightly build yesterday (2nd April), modified the
ClobTransformer.java file with some prints, compiled it all (ant dist). It
produced a war file, apache-solr-1.4-dev.war. That is what I am using. My
modification & compilation have not affected the results. I was getting the
same behavior with the 'war' that the download came with. Thanks, Noble.

Noble Paul നോബിള്‍  नोब्ळ् wrote:
 
 And which version of Solr are you using?
 

 
 
 
 -- 
 --Noble Paul
 
 

-- 
View this message in context: 
http://www.nabble.com/Oracle-Clob-column-with-DIH

Re: Oracle Clob column with DIH does not turn to String

2009-04-03 Thread ashokc

Happy to report that it is working. Looks like we have to use UPPER CASE for
all the column names. When I examined the map 'aRow', it had the column
names in upper case, whereas my config had lower case. No match was found,
so nothing happened. I changed my config and it works now. Thanks for your
help. Perhaps this transformer could be modified to be case-insensitive about
the column names. Since you wrote it, perhaps it is a quick change for
you?
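The fix described here, matching column names case-insensitively and reading the CLOB out as a String, can be sketched in plain Java. This is an illustration of the idea only, not the actual ClobTransformer source, and every name below is invented:

```java
import java.io.Reader;
import java.sql.Clob;
import java.util.Map;
import javax.sql.rowset.serial.SerialClob;

public class ClobColumnHelper {

    // Oracle's JDBC driver reports column names in upper case, so a
    // lower-case name in data-config.xml finds no entry in the row map.
    // A case-insensitive lookup sidesteps that mismatch.
    static Object getIgnoreCase(Map<String, ?> row, String column) {
        for (Map.Entry<String, ?> e : row.entrySet()) {
            if (e.getKey().equalsIgnoreCase(column)) {
                return e.getValue();
            }
        }
        return null;
    }

    // Read a java.sql.Clob fully into a String.
    static String clobToString(Clob clob) {
        try (Reader r = clob.getCharacterStream()) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Demo: round-trip a String through an in-memory Clob.
    static String roundTrip(String s) {
        try {
            return clobToString(new SerialClob(s.toCharArray()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```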

Noble Paul നോബിള്‍  नोब्ळ् wrote:
 
 I guess you can write a custom transformer which gets a String out of
 the oracle.sql.CLOB. I am out of clues as to why this may happen. I
 even wrote a testcase and it seems to work fine.
 --Noble
 
 --
 --Noble Paul

Multi-valued fields with DIH

2009-04-03 Thread ashokc

Hi,
I need to assign multiple values to a field, with each value coming from a
different column of the sql query.

My data config snippet has lines like

<field column="project_area" name="projects" />
<field column="project_version" name="projects" />

where 'project_area' & 'project_version' are output by the sql query to the
datasource. The 'verbose-output' from dataimport.jsp does show that these
columns have values returned by the query

===

<lst name="verbose-output">
  <lst name="entity:log">
    <lst name="document#1">
      <str name="query">x</str>
      <str name="time-taken">0:0:0.142</str>
      <str>--- row #1 ---</str>
      <str name="PROJECT_AREA">MySource/Area/Admin</str>
      <str name="PROJECT_VERSION">MySource/Version/06.02</str>
      <date name="LAST_MODIFIED_DATE">2008-10-21T07:00:00Z</date>
      ...

==

But the resulting index has no data in the field 'projects'. Is it NOT
possible to create multi-valued fields with DIH?

Thanks
-- 
View this message in context: 
http://www.nabble.com/Multi-valued-fields-with-DIH-tp22877509p22877509.html
Sent from the Solr - User mailing list archive at Nabble.com.
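Two things usually have to line up for this to work; a hedged sketch follows, with field and type names that are assumptions rather than taken from the poster's schema. First, the target field must be declared multi-valued in schema.xml. Second, the verbose output above shows the driver returning upper-case column names (PROJECT_AREA) while the data-config uses lower case, so the mapping may never match:

```xml
<!-- schema.xml: without multiValued="true", only one of the mapped
     columns can survive in the field (the type name is illustrative) -->
<field name="projects" type="string" indexed="true" stored="true"
       multiValued="true"/>

<!-- data-config.xml: match the upper-case column names the driver
     actually returns, as seen in the verbose output -->
<field column="PROJECT_AREA" name="projects" />
<field column="PROJECT_VERSION" name="projects" />
```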



Oracle Clob column with DIH does not turn to String

2009-04-02 Thread ashokc

Hi,

I have set up to import some oracle clob columns with DIH. I am using the
latest nightly release. My config says,






But it does not seem to turn this clob into a String. The search results
show:


<doc>
    <float name="score">1.8670129</float>
    <str name="description">oracle.sql.c...@aed3a5</str>
    <int name="log_no">4486</int>
</doc>


Any pointers on why I do not get the 'string' out of the clob for indexing?
Is the nightly war NOT the right one to use?

Thanks for your help.

- ashok


-- 
View this message in context: 
http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859837.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Oracle Clob column with DIH does not turn to String

2009-04-02 Thread ashokc

Correcting my earlier post. It lost some lines somehow.

Hi,

I have set up to import some oracle clob columns with DIH. I am using the
latest nightly release. My config says,


<entity name="log" transformer="ClobTransformer" ...>
    <field column="description" clob="true" name="description" />
</entity>

But it does not seem to turn this clob into a String. The search results
show:

<doc>
    <float name="score">1.8670129</float>
    <str name="description">oracle.sql.c...@aed3a5</str>
    <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the clob for indexing?
Is the nightly war NOT the right one to use?

Thanks for your help.

- ashok



ashokc wrote:
 
 Hi,
 
 I have set up to import some oracle clob columns with DIH. I am using the
 latest nightly release. My config says,
 
 <entity name="description" transformer="ClobTransformer" ...>
     <field column="description" clob="true" />
 </entity>
 
 But it does not seem to turn this clob into a String. The search results
 show:
 
 <doc>
     <float name="score">1.8670129</float>
     <str name="description">oracle.sql.c...@aed3a5</str>
     <int name="log_no">4486</int>
 </doc>
 
 Any pointers on why I do not get the 'string' out of the clob for
 indexing? Is the nightly war NOT the right one to use?
 
 Thanks for your help.
 
 - ashok
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859865.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Oracle Clob column with DIH does not turn to String

2009-04-02 Thread ashokc

That would require me to recompile (with ant/maven scripts?) the source and
replace the jar for DIH, right? I can try - for the first time.
- ashok

Noble Paul നോബിള്‍  नोब्ळ् wrote:
 
 This looks strange. Apparently the Transformer did not get applied. Is
 it possible for you to debug ClobTransformer? (Adding System.out.println
 statements into ClobTransformer may help.)
 


 
 
 
 -- 
 --Noble Paul
 
 

-- 
View this message in context: 
http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22861630.html
Sent from the Solr - User mailing list archive at Nabble.com.



More than one language in the same document

2009-03-26 Thread ashokc

Hi,

I have documents where text from two languages, e.g. (English & Korean) or
(English & German), is mixed up in a fairly intensive way. 20-30% of the
text is in English and the rest in the other. Can somebody indicate how I
should set up the 'analyzers' and 'fields' in schema.xml? Should I have 2
fields with the same content, and 'analyze' them as English & non-English to
build the index? Will the analyzer for non-English corrupt the index while
processing the English text? And should my query look at both the fields to
fetch the results? Has somebody looked at this already? Thanks for your
help.

- ashok  
-- 
View this message in context: 
http://www.nabble.com/More-than-one-language-in-the-same-document-tp22726478p22726478.html
Sent from the Solr - User mailing list archive at Nabble.com.
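One common pattern is the two-field setup the poster suggests: index the same text once per language analysis and query both fields. A hedged schema.xml sketch follows, where every field and type name is invented and the language-specific types stand in for whatever analyzers are available:

```xml
<!-- two parallel copies of the same text, each with its own analyzer -->
<field name="content" type="string" indexed="false" stored="true"/>
<field name="content_en" type="text_en" indexed="true" stored="false"/>
<field name="content_other" type="text_other" indexed="true" stored="false"/>

<copyField source="content" dest="content_en"/>
<copyField source="content" dest="content_other"/>
```

Queries would then search both fields (e.g. dismax with qf=content_en content_other). An analyzer for the wrong language does not normally corrupt the index; it just tokenizes the foreign-language portion poorly, which is why each field only has to do well on its own language.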



Re: Highlighting Oddities

2009-02-04 Thread ashokc

I have seen some of these oddities that Chris is referring to. In my case,
terms that are NOT in the query get highlighted. For example searching for
'Intel' highlights 'Microsoft Corp' as well. I do not have them as synonyms
either. Do these filter factories add some extra intelligence to the index
in that if you search for 'Samsung' even 'LG' is considered a highlightable
term?

I believe this was not the case when I was working with an earlier
development version (from Nov or early Dec). Right now I am using
solr-2008-12-29.war.

- ashok



ryguasu wrote:
 
 I'm testing out the default (gap) fragmenter with some simple,
 single-word queries on a patched 1.3.0 release populated with some
 real-world data. (I think the primary quirk in my setup is that I'm
 using ShingleFilterFactory to put word bigrams (aka shingles) into my
 index. I was worried that this might mess up highlighting, but
 highlighting is *mostly* working.) There are some oddities here, and
 I'm wondering if people have any suggestions for debugging my setup
 and/or trying to make a good, reproducible test case.
 
 1. The main weird thing is that, the vast majority of the time, the
 highlighted term is the last term in the fragment. For example, if I
 search for "cat", then almost all my fragments look like this:
 
 fragment 1: to the *cat*
 fragment 2: with the *cat*
 fragment 3: it's what the *cat*
 fragment 4: Once upon a time the *cat*
 
 (My actual fragments are longer. The key to note is that all of these
 examples end in cat.)
 
 Sometimes "cat" will appear somewhere other than the last position,
 but this is rare. My expectation, in contrast, is that "cat" would
 tend to be more or less evenly distributed throughout fragment
 positions.
 
 Note: I tried to reproduce this on 1.3.0 with my patches applied but
 using the example dataset/schema from the Solr source tree rather than
 my own dataset/schema. With the example dataset this didn't seem to be
 an issue.
 
 I've experienced three other highlighting issues, which may or may not
 be related:
 
 2. Sometimes, if a term appears multiple times in a fragment, not just
 the term but all the words in between the two appearances will get
 highlighted too. For example, I searched for "fear", and got this as
 one of the snippets:
 
 SETTLEMENT AGREEMENT This Agreement (the "Agreement") is entered
 into this 18th day of August, 2008, by
 and between Cape <em>Fear Bank Corporation, a North Carolina
 corporation (the "Company"), and Cape Fear</em>
 
 In contrast, I would have expected
 
 SETTLEMENT AGREEMENT This Agreement (the "Agreement") is entered
 into this 18th day of August, 2008, by
 and between Cape <em>Fear</em> Bank Corporation, a North Carolina
 corporation (the "Company"), and Cape <em>Fear</em>
 
 3. My install seems to have a curiously liberal interpretation of
 hl.fragsize. Now if I put hl.fragsize=0, then things are as expected,
 i.e. it highlights the whole field. And it also seems more or less
 true (as it should) that as I increase hl.fragsize, the fragments get
 longer. However, I was surprised to see that when I put hl.fragsize=1
 or hl.fragsize=5, I can get fragments as long as this one:
 
 addition, we believe the wireless feature for our controller will
 facilitate exceptional customer services and
 response time. About GpsLatitude GpsLatitude, a Montreal-based
 company, is a provider of security
 solutions and tracking for mobile assets. It is also a developer
 of advanced "Videlocalisation", a cost-effective,
 integrated mobile digital <em>video</em>
 
 That seems shockingly long for something of size five.
 
 4. Very rarely I'll get a fragment that doesn't actually contain any
 of the search terms. For example, maybe I'll search for "cat", and
 I'll get back "three ounces of milk" as a snippet. I need to explore
 this more, though the last time this happened, when I opened the
 document I found that when I located "three ounces of milk" in the
 document text, the word "cat" did appear nearby; so maybe the document
 did contain "three ounces of milk for the cat".
 
 Obviously I'm not describing my setup in much detail. Let me know what
 you think would be helpful to know more about.
 
 Thanks,
 Chris
 
 

-- 
View this message in context: 
http://www.nabble.com/Highlighting-Oddities-tp20351015p21841992.html
Sent from the Solr - User mailing list archive at Nabble.com.
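For reference, a shingle field of the kind described might be configured roughly as below; all names and parameters are illustrative, not taken from Chris's schema. Keeping a second, unshingled copy of the text and pointing hl.fl at it is one way to rule the shingle tokens in or out as the cause of oddities like these:

```xml
<fieldType name="text_shingle" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>

<!-- plain copy used only for highlighting (query with hl.fl=content_plain) -->
<field name="content_plain" type="text" indexed="true" stored="true"/>
<copyField source="content" dest="content_plain"/>
```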



Single index - multiple SOLR instances

2009-01-12 Thread ashokc

Hello,

Is it possible to have the index created by a single SOLR instance, but have
several SOLR instances field the search queries? Or do I HAVE to replicate
the index for each SOLR instance that I want to answer queries? I need to
set up a fail-over instance. Thanks

- ashok
-- 
View this message in context: 
http://www.nabble.com/Single-index---multiple-SOLR-instances-tp21422543p21422543.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Single index - multiple SOLR instances

2009-01-12 Thread ashokc

Thanks, Otis. That is great, as I plan to place the index on NAS and make it
writable to a single solr instance (write load is not heavy) and readable by
many solr instances to handle fail-over and also share the query load (query
load can be high).

- ashok

Otis Gospodnetic wrote:
 
 Ashok,
 
 You can put your index on any kind of shared storage - SAN, NAS, NFS (this
 one is not recommended).  That will let you point all your Solr instances
 to a single copy of your index.  Of course, you will want to test
 performance to ensure the network is not slowing things down too much, if
 there is network in the picture.
 
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: ashokc ash...@qualcomm.com
 To: solr-user@lucene.apache.org
 Sent: Monday, January 12, 2009 3:05:40 PM
 Subject: Single index - multiple SOLR instances
 
 
 Hello,
 
 Is it possible to have the index created by a single SOLR instance, but
 have
 several SOLR instances field the search queries. Or do I HAVE to
 replicate
 the index for each SOLR instance that I want to answer queries? I need to
 set up a fail-over instance. Thanks
 
 - ashok
 -- 
 View this message in context: 
 http://www.nabble.com/Single-index---multiple-SOLR-instances-tp21422543p21422543.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Single-index---multiple-SOLR-instances-tp21422543p21423138.html
Sent from the Solr - User mailing list archive at Nabble.com.
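Concretely, each query-only instance can be pointed at the shared copy through solrconfig.xml; the mount path below is an example. Only the single writer should issue updates or commits, and the readers will only see newly committed segments when their searchers are reopened:

```xml
<!-- solrconfig.xml on the query-only instances (example NAS mount path) -->
<dataDir>/mnt/nas/solr/data</dataDir>
```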



Re: Boost a query by field at query time - Standard Request Handler

2008-12-09 Thread ashokc

Thanks for the reply. I figured there is no simple solution here. I am
parsing the query in my code, separating out negations, assertions and such,
and building the final SOLR query to issue. I simply use the boost as given
by the user. If none is given, I use a default boost for title & url matches.

- ashok



hossman wrote:
 
 
 : Query (can be quite complex, as it gets built from an advanced search
 form):
 : term1^2.0 OR term2 OR term3 term4
   ...
 : Any matches in the title or url fields should be weighed more. I can
 specify
 
 if i'm understanding you correctly: the client app can provide any 
 arbitrary lucene syntax query, and you want to (server side) specify 
 additional information about how specific fields (if specified) should be 
 boosted ... is that correct?
 
 there is no way to do this with standard request handler out of the box 
 ... but you could subclass the LuceneQParser, and use your own QueryParser 
 subclass that knows about your field boosts and adds them -- don't forget 
 you'll have to decide what to do when the client specifies a boost on a 
 field you've got a configured boost for (add them? ... multiply them? ...)
 
 
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Boost-a-query--by-field-at-query-time---Standard-Request-Handler-tp20842675p20920307.html
Sent from the Solr - User mailing list archive at Nabble.com.
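The client-side expansion described above can be mechanized. The sketch below is an illustration, not Solr code: it wraps the raw query once per field and appends the boost, roughly what dismax's qf achieves (dismax actually builds a DisjunctionMaxQuery rather than an OR). Field names and boosts are examples:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FieldBoostExpander {

    // Wrap the user's raw query in a grouped clause per field, boosting
    // each clause; field:(...)^boost is standard Lucene query syntax.
    static String expand(String userQuery, Map<String, Float> fieldBoosts) {
        StringJoiner or = new StringJoiner(" OR ");
        for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
            String clause = e.getKey() + ":(" + userQuery + ")";
            if (e.getValue() != 1.0f) {
                clause += "^" + e.getValue();
            }
            or.add(clause);
        }
        return or.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("content", 1.0f);
        boosts.put("title", 2.0f);
        boosts.put("url", 2.0f);
        System.out.println(expand("term1^2.0 OR term2", boosts));
    }
}
```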



Re: Merging Indices

2008-12-05 Thread ashokc

Thanks for the help, Yonik & Shalin. It really makes it easy for me if I do not
have to stop/start the SOLR app during the merge operations.

The reason I have to do this many times a day, is that I am implementing a
simple-minded entity-extraction procedure for the content I am indexing. I
have a user defined taxonomy into which the current documents, and any new
documents should be classified under. The taxonomy defines the nested facet
fields for SOLR. When a new document is posted, the user expects to have it
available in the right facet right away. My classification procedure is as
follows when a new document is added.

1. Create a new temporary index with that document (no taxonomy fields at
this time)
2. Search this index with each of the taxonomy terms (synonyms are employed
as well through synonyms.txt) and find out which of these categories is a
hit for this document.
3. Add a new <field ...> line into the document for each category that is a
match for this document.
4. Repost this updated document.

Now I have a new index that facets this document the same way the big index
does.

5. I merge these two indices now so that the new document is also part of the
big index.

6. Delete the temporary index

The reason for a new temporary index is that step 2 is A LOT quicker
with a single document (or a handful). If I simply posted this new doc into
the big index, and then tried to classify it, this search would take a while.
I have over 200 nested taxonomy fields to search over.

Are there better approaches?

Thanks

- ashok



Yonik Seeley wrote:
 
 On Thu, Dec 4, 2008 at 6:39 PM, ashokc [EMAIL PROTECTED] wrote:

 The SOLR wiki says

3. Make sure both indexes you want to merge are closed.

 What exactly does 'closed' mean?
 
 If you do a commit, and then prevent updates, the index should be
 closed (no open IndexWriter).
 
 1. Do I need to stop SOLR search on both indexes before running the merge
 command? So a brief downtime is required?
 Or do I simply prevent any 'updates/deletes' to these indices during the
 merge time so they can still serve up results (read only?) while I am
 creating a new merged index?
 
 Preventing updates/deletes should be sufficient.
 
 2. Before the new index replaces the old index, do I need to stop SOLR
 for
 that instance? Or can I simply move the old index out and place the new
 index in the same place, without having to stop SOLR
 
 Yes, simply moving the index should work if you are careful to avoid
 any updates since the last commit.
 
 3. If SOLR has to be stopped during the merge operation, can we work with
 a
 redundant/failover instance and stagger the merge so the search service
 will
 not go down? Any guidelines here are welcome.

 Thanks

 - ashok
 --
 View this message in context:
 http://www.nabble.com/Merging-Indices-tp20845009p20845009.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/Merging-Indices-tp20845009p20859513.html



Boost a query by field at query time - Standard Request Handler

2008-12-04 Thread ashokc

Here is the problem I am trying to solve. I have to use the Standard Request
Handler.

Query (can be quite complex, as it gets built from an advanced search form):
term1^2.0 OR term2 OR term3 term4

I have 3 fields - content (the default search field), title and url.

Any matches in the title or url fields should be weighed more. I can specify
index-time boosting for these two fields, but I would rather not, as it is a
heavy-handed solution. I need to make it user-configurable for advanced
search.

What should my query to SOLR be? Something like this?

content:term1^2.0 OR content:term2 OR content:term3 term4 OR
title:term1^2.0 OR title:term2 OR title:term3 term4 OR url:term1^2.0 OR
url:term2 OR url:term3 term4

Looks like it can get pretty long and error-prone. With the 'dismax' handler
I can simply specify

qf=content title^2 url^2

no matter how complex the 'q' parameter is.

Is there a similar easier way I can do query time boosting with Standard
Request Handler, that I am missing?
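One client-side workaround is to expand the raw query across the fields before sending it to the standard handler. A minimal Python sketch (field grouping `field:(...)` and the `^boost` operator are standard Lucene query syntax; the function name is made up here):

```python
def expand_query(q, field_boosts):
    """Wrap a raw user query in Lucene field-grouping syntax for each
    field, applying an optional boost, and OR the groups together."""
    parts = []
    for field, boost in field_boosts:
        group = f"{field}:({q})"
        if boost is not None:
            group += f"^{boost}"
        parts.append(group)
    return " OR ".join(parts)

q = expand_query("term1^2.0 OR term2",
                 [("content", None), ("title", 2.0), ("url", 2.0)])
print(q)
# content:(term1^2.0 OR term2) OR title:(term1^2.0 OR term2)^2.0 OR url:(term1^2.0 OR term2)^2.0
```

This keeps the user's query intact inside each group, no matter how complex it is, much like dismax's qf does on the server side.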

Thanks for your help

- ashok

-- 
View this message in context: 
http://www.nabble.com/Boost-a-query--by-field-at-query-time---Standard-Request-Handler-tp20842675p20842675.html



Merging Indices

2008-12-04 Thread ashokc

The SOLR wiki says

3. Make sure both indexes you want to merge are closed.

What exactly does 'closed' mean?

1. Do I need to stop SOLR search on both indexes before running the merge
command? So a brief downtime is required?
Or do I simply prevent any 'updates/deletes' to these indices during the
merge time so they can still serve up results (read only?) while I am
creating a new merged index?

2. Before the new index replaces the old index, do I need to stop SOLR for
that instance? Or can I simply move the old index out and place the new
index in the same place, without having to stop SOLR?

3. If SOLR has to be stopped during the merge operation, can we work with a
redundant/failover instance and stagger the merge so the search service will
not go down? Any guidelines here are welcome.

Thanks

- ashok
-- 
View this message in context: 
http://www.nabble.com/Merging-Indices-tp20845009p20845009.html



solrQueryParser does not take effect - nightly build

2008-11-20 Thread ashokc

Hi,

I have set

<solrQueryParser defaultOperator="AND"/>

but it is not taking effect; queries are still parsed with OR. I am working
with the latest nightly build (11/20/2008).

For a query like

term1 term2

Debug shows

<str name="parsedquery">content:term1 content:term2</str>

Bug?

Thanks

- ashok


-- 
View this message in context: 
http://www.nabble.com/solrQueryParser-does-not-take-effect---nightly-build-tp20609974p20609974.html