Re: Indexing data from multiple datasources

2011-06-09 Thread Tom Gross

it has been a feature request for ages ...

https://issues.apache.org/jira/browse/SOLR-139

On 06/09/2011 09:25 PM, Greg Georges wrote:

No, from what I understand, the way Solr does an update is to delete the 
document and then recreate all the fields; there is no partial updating of the 
document. Maybe because of performance issues or locking?

-Original Message-
From: David Ross [mailto:davidtr...@hotmail.com]
Sent: 9 juin 2011 15:23
To: solr-user@lucene.apache.org
Subject: RE: Indexing data from multiple datasources


This thread got me thinking a bit...
Does SOLR support the concept of "partial updates" to documents?  By this I 
mean updating a subset of fields in a document that already exists in the index, 
without having to resubmit the entire document.
An example would be storing/indexing user tags associated with documents. These 
tags will not be available when the document is initially presented to SOLR, 
and may or may not come along at a later time. When that time comes, can we 
just submit the tag data (and document identifier I'd imagine), or do we have 
to import the entire document?
new to SOLR...


Date: Thu, 9 Jun 2011 14:00:43 -0400
Subject: Re: Indexing data from multiple datasources
From: erickerick...@gmail.com
To: solr-user@lucene.apache.org

How are you using it? Streaming the files to Solr via HTTP? You can use Tika
on the client to extract the various bits from the structured documents, and
use SolrJ to assemble the bits of data Tika exposes into a Solr document
that you then send to Solr. At the point where you're transferring data from
the Tika parse to the Solr document, you could add any data from your database
that you wanted.

The result is that you'd be indexing the complete Solr document only once.
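
[A minimal sketch of that flow, using 3.x-era SolrJ and Tika; the field names
and the database lookup below are assumptions, not part of the mail:]

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class IndexWithTika {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        File file = new File(args[0]);
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler body = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        InputStream in = new FileInputStream(file);
        try {
            parser.parse(in, body, metadata, new ParseContext());
        } finally {
            in.close();
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getName());           // assumed uniqueKey
        doc.addField("text", body.toString());        // body text extracted by Tika
        doc.addField("title", metadata.get("title")); // one of Tika's metadata keys
        // this is the point where you can mix in data from your database:
        doc.addField("category", "category-from-db"); // placeholder for the DB lookup
        server.add(doc);
        server.commit();
    }
}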

You're right that updating a document in Solr overwrites the previous
version, and any data in the previous version is lost.

Best
Erick

On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges  wrote:

Hello Erick,

Thanks for the response. No, I am using the extract handler to extract the data 
from my text files. In your second approach, you say I could use DIH to 
update the index which would have been created by the extract handler in the 
first phase. My concern is that if I get info from the DB and update the 
index by document ID, I will overwrite the data and lose the initial data 
from the extract handler phase. Is that right? Thanks

Greg

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 9 juin 2011 12:15
To: solr-user@lucene.apache.org
Subject: Re: Indexing data from multiple datasources

Hmmm, when you say you use Tika, are you using some custom Java code? Because
if you are, the best thing to do is query your database at that point
and add whatever information you need to the document.

If you're using DIH to do the crawl, consider implementing a
Transformer to do the database querying and modify the document as
necessary. This is pretty simple to do; we can chat a bit more depending
on whether either approach makes sense.
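
[A rough sketch of such a Transformer; the DIH contract is transformRow(row,
context), while the JDBC lookup, table and column names below are placeholders:]

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// referenced from data-config.xml via transformer="com.example.DbEnrichTransformer"
public class DbEnrichTransformer extends Transformer {

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object id = row.get("id"); // assumes the entity query selects an "id" column
        try {
            Connection conn = getConnection(); // wire up your own JDBC connection
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT category, type FROM files WHERE file_id = ?");
            ps.setObject(1, id);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {
                // anything put into the row map ends up in the Solr document
                row.put("category", rs.getString("category"));
                row.put("type", rs.getString("type"));
            }
            rs.close();
            ps.close();
        } catch (Exception e) {
            throw new RuntimeException("DB enrichment failed for id " + id, e);
        }
        return row;
    }

    private Connection getConnection() {
        throw new UnsupportedOperationException("obtain a JDBC connection here");
    }
}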

Best
Erick



On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  wrote:

Hello all,

I have checked the forums to see if it is possible to create an index from 
multiple datasources. I have found references to SOLR-1358, but I don't think 
this fits my scenario. In all, we have an application where we upload files. On 
the file upload, I use the Tika extract handler to save metadata from the file 
(_attr, literal values, etc.). We also have a database which has information 
on the uploaded files, like the category, type, etc. I would like to update 
the index to include this information from the DB in the index for each 
document. If I run a DataImportHandler after the extract phase, I am afraid 
that updating the doc in the index by its ID will just cause me to overwrite 
the old information with the info from the DB (what I understand is that Solr 
updates its index by ID by deleting the document first and then recreating it).

Does anyone have any pointers? Is there a clean way to do this, or must I find 
a way to pass the DB metadata to the extract handler and save it as literal 
fields?

Thanks in advance

Greg

--
Author of the book "Plone 3 Multimedia" - http://amzn.to/dtrp0C

Tom Gross
email: ...@toms-projekte.de
skype: tom_gross
web:   http://toms-projekte.de
blog:  http://blog.toms-projekte.de



Re: Unique Results from Edgy Text

2011-06-09 Thread Ahmet Arslan


--- On Thu, 6/9/11, Jamie Johnson  wrote:

> From: Jamie Johnson 
> Subject: Unique Results from Edgy Text
> To: solr-user@lucene.apache.org
> Date: Thursday, June 9, 2011, 10:42 PM
> I am using the guide found here (
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/)
> to build an autocomplete search capability but in my data
> set I have some
> documents which have the same value for the field that is
> being returned, so
> for instance I have the following being returned:
> 
> A test document to see how this works
> A test document to see how this works
> A test document to see how this works
> A test document to see how this works
> A test document to see how this works
> 
> I'm wondering if there is something I can specify that I
> want only unique
> results to come back.  I know I can do some post
> processing of the results
> to make sure that only unique items come back, but I was
> hoping there was
> something that could be done to the query.  Any
> thoughts?

Maybe http://wiki.apache.org/solr/Deduplication ?

Or maybe you can somehow populate a database table with unique queries 
(outside of Solr) along with their counts.
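
[For reference, the Deduplication page wires up an update processor chain along
these lines; the signature and source field names here are assumptions:]

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">suggest_text</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Note that this deduplicates at index time, so existing duplicates would only
disappear after a reindex.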


Re: SolrCloud questions

2011-06-09 Thread Mohammad Shariq
I am also planning to move to SolrCloud; since it's still under development,
I am not sure about its behavior in production.
Please update us once you find it stable.


On 10 June 2011 03:56, Upayavira  wrote:

> I'm exploring SolrCloud for a new project, and have some questions based
> upon what I've found so far.
>
> The setup I'm planning is going to have a number of multicore hosts,
> with cores being moved between hosts, and potentially with cores merging
> as they get older (cores are time based, so once today has passed, they
> don't get updated).
>
> First question: The solr/conf dir gets uploaded to Zookeeper when you
> first start up, and using system properties you can specify a name to be
> associated with those conf files. How do you handle it when you have a
> multicore setup, and different configs for each core on your host?
>
> Second question: Can you query collections when using multicore? On
> single core, I can query:
>
>  http://localhost:8983/solr/collection1/select?q=blah
>
> On a multicore system I can query:
>
>  http://localhost:8983/solr/core1/select?q=blah
>
> but I cannot work out a URL to query collection1 when I have multiple
> cores.
>
> Third question: For replication, I'm assuming that replication in
> SolrCloud is still managed in the same way as non-cloud Solr, that is as
> ReplicationHandler config in solrconfig? In which case, I need a
> different config setup for each slave, as each slave has a different
> master (or can I delegate the decision as to which host/core is its
> master to zookeeper?)
>
> Thanks for any pointers.
>
> Upayavira
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>


-- 
Thanks and Regards
Mohammad Shariq


Re: ERROR on posting update request using CURL in php

2011-06-09 Thread Naveen

Hi,

Basically I need to POST something like this using curl in PHP.

Here is the PHP example explained in the earlier thread:

curl http://localhost:8983/solr/update?commit=true -H "Content-Type:
text/xml" --data-binary '<add><doc><field name="id">testdoc</field></doc></add>'

Do we need to create a temp file and use a PUT command,

or can we do it using POST?

Regards
Naveen 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/ERROR-on-posting-update-request-using-CURL-in-php-tp3047312p3047372.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: ERROR on posting update request using CURL in php

2011-06-09 Thread Naveen Gupta
Hi,


curl http://localhost:8983/solr/update?commit=true -H "Content-Type:
text/xml" --data-binary '<add><doc><field name="id">testdoc</field></doc></add>'

Regards
Naveen

On Fri, Jun 10, 2011 at 10:18 AM, Naveen Gupta  wrote:

> Hi
>
> This is my document
>
> in php
>
> $xmldoc = '<add><doc><field name="id">F_146</field><field
> name="userid">74</field><field name="email">gmail.com</field><field
> name="attachment_size">121</field><field
> name="attachment_name">sample.pptx</field></doc></add>';
>
>   $ch = curl_init("http://localhost:8080/solr/update");
>   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
>   curl_setopt($ch, CURLOPT_POST, 1);
>   curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml"));
>   curl_setopt($ch, CURLOPT_POSTFIELDS, $xmldoc);
>
>   $result = curl_exec($ch);
>   if (!curl_errno($ch)) {
>       $info = curl_getinfo($ch);
>       $header = substr($result, 0, $info['header_size']);
>       echo 'Took ' . $info['total_time'] . ' seconds to send a request to ' . $info['url'];
>   } else {
>       print_r('no idea');
>   }
>   curl_close($ch);
>   echo 'result of query' . ' -> ' . $result;
>
> It is throwing this error:
>
> Apache Tomcat/6.0.18 - Error report
> HTTP Status 400 - Unexpected character ''' (code 39) in
> prolog; expected '<'
>  at [row,col {unknown-source}]: [1,1]
> description: The request sent by the client was syntactically incorrect
> (Unexpected character ''' (code 39) in prolog; expected '<'
>  at [row,col {unknown-source}]: [1,1]).
>
>
> Thanks
> Naveen
>
>
>


Re: Multiple Values not getting Indexed

2011-06-09 Thread Gora Mohanty
On Fri, Jun 10, 2011 at 10:36 AM, Pawan Darira  wrote:
> it did not work :(
[...]

Please provide more details of what you tried, what was the error, and
any error messages that you got. Just saying that "it did not work" makes
it pretty much impossible for anyone to help you.

You might take a look at http://wiki.apache.org/solr/UsingMailingLists

Regards,
Gora


Re: Multiple Values not getting Indexed

2011-06-09 Thread Pawan Darira
it did not work :(

On Thu, Jun 9, 2011 at 12:53 PM, Bill Bell  wrote:

> You have to take the input and splitBy something like "," to get it into
> an array and repost it back to Solr...
>
> I believe others have suggested that?
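
[i.e., with DIH's RegexTransformer, along these lines - a sketch using the
column names from the example quoted below:]

<entity name="item" transformer="RegexTransformer" query="SELECT ...">
  <!-- split "150,178,461,..." into multiple values; the destination
       fields must be declared multiValued="true" in schema.xml -->
  <field column="Field1" splitBy=","/>
  <field column="Field2" splitBy=","/>
</entity>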
>
> On 6/8/11 10:14 PM, "Pawan Darira"  wrote:
>
> >Hi
> >
> >I am trying to index 2 fields with multiple values. BUT, it is only
> >putting
> >1 value for each & ignoring rest of the values after comma(,). I am
> >fetching
> >query through DIH. It works fine if i have only 1 value each of the 2
> >fields
> >
> >E.g. Field1 - 150,178,461,151,310,306,305,179,137,162
> >& Field2 - Chandigarh,Gurgaon,New
> >Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others
> >
> >*Schema.xml*
> >
> >[field definitions stripped by the mail archive]
> >
> >p.s. I tried multiValued=true but it was of no help.
> >
> >--
> >Thanks,
> >Pawan Darira
>
>
>


-- 
Thanks,
Pawan Darira


Re: how to Index and Search non-English Text in solr

2011-06-09 Thread Mohammad Shariq
Thanks Erick for your help.
I have another silly question.
Suppose I created multiple fieldTypes, e.g. news_English, news_Chinese,
news_Japanese etc.
After creating these fields, can I copy all of them into a copyField
"defaultquery" like below:

[copyField declarations stripped by the mail archive]

and my "defaultquery" field looks like:

[field declaration stripped by the mail archive]

Is this the right way to deal with multiple-language indexing and searching?
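
[A minimal sketch of that setup in schema.xml; the field names and copyField
wiring are assumptions, and note Erick's caveat below about mixing languages
in a single field, which also applies to the catch-all field's one analyzer:]

<field name="news_english"  type="news_English"  indexed="true" stored="true"/>
<field name="news_chinese"  type="news_Chinese"  indexed="true" stored="true"/>
<field name="news_japanese" type="news_Japanese" indexed="true" stored="true"/>
<!-- catch-all default search field, analyzed by its own single analyzer -->
<field name="defaultquery" type="text" indexed="true" stored="false"
       multiValued="true"/>

<copyField source="news_english"  dest="defaultquery"/>
<copyField source="news_chinese"  dest="defaultquery"/>
<copyField source="news_japanese" dest="defaultquery"/>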


On 9 June 2011 19:06, Erick Erickson  wrote:

> No, you'd have to create multiple fieldTypes, one for each language
>
> Best
> Erick
>
> On Thu, Jun 9, 2011 at 5:26 AM, Mohammad Shariq 
> wrote:
> > Can I specify multiple language in filter tag in schema.xml ???  like
> below
> >
> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> > protected="protwords.txt"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
> >   </analyzer>
> > </fieldType>
> >
> >
> > On 8 June 2011 18:47, Erick Erickson  wrote:
> >
> >> This page is a handy reference for individual languages...
> >> http://wiki.apache.org/solr/LanguageAnalysis
> >>
> >> But the usual approach, especially for Chinese/Japanese/Korean
> >> (CJK) is to index the content in different fields with language-specific
> >> analyzers then spread your search across the language-specific
> >> fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
> >> particularly give "surprising" results if you put words from different
> >> languages in the same field.
> >>
> >> Best
> >> Erick
> >>
> >> On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq 
> >> wrote:
> >> > Hi,
> >> > I had set up Solr (solr-1.4 on Ubuntu 10.10) for indexing news articles
> >> > in English, but my requirement extends to indexing the news of other
> >> > languages too.
> >> >
> >> > This is how my schema looks :
> >> > <field name="text" type="text" indexed="true" stored="true" required="false"/>
> >> >
> >> >
> >> > And the "text" Field in schema.xml looks like :
> >> >
> >> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >> >
> >> >   <analyzer type="index">
> >> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> > words="stopwords.txt" enablePositionIncrements="true"/>
> >> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >> > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >> > catenateAll="0" splitOnCaseChange="1"/>
> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> >> >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >> > protected="protwords.txt"/>
> >> >   </analyzer>
> >> >
> >> >   <analyzer type="query">
> >> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >> > ignoreCase="true" expand="true"/>
> >> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> > words="stopwords.txt" enablePositionIncrements="true"/>
> >> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >> > generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> >> > catenateAll="0" splitOnCaseChange="1"/>
> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> >> >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >> > protected="protwords.txt"/>
> >> >   </analyzer>
> >> > </fieldType>
> >> >
> >> >
> >> > My Problem is :
> >> > Now I want to index the news articles in other languages too, e.g.
> >> > Chinese, Japanese.
> >> > How can I modify my text field so that I can index the news in other
> >> > languages too and make it searchable ??
> >> >
> >> > Thanks
> >> > Shariq
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >> >
> >>
> >
> >
> >
> > --
> > Thanks and Regards
> > Mohammad Shariq
> >
>



-- 
Thanks and Regards
Mohammad Shariq


ERROR on posting update request using CURL in php

2011-06-09 Thread Naveen Gupta
Hi

This is my document

in php

$xmldoc = '<add><doc><field name="id">F_146</field><field
name="userid">74</field><field name="email">gmail.com</field><field
name="attachment_size">121</field><field
name="attachment_name">sample.pptx</field></doc></add>';

  $ch = curl_init("http://localhost:8080/solr/update");
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_POST, 1);
  curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml"));
  curl_setopt($ch, CURLOPT_POSTFIELDS, $xmldoc);

  $result = curl_exec($ch);
  if (!curl_errno($ch)) {
      $info = curl_getinfo($ch);
      $header = substr($result, 0, $info['header_size']);
      echo 'Took ' . $info['total_time'] . ' seconds to send a request to ' . $info['url'];
  } else {
      print_r('no idea');
  }
  curl_close($ch);
  echo 'result of query' . ' -> ' . $result;

It is throwing error

Apache Tomcat/6.0.18 - Error report
HTTP Status 400 - Unexpected character ''' (code 39) in
prolog; expected '<'
 at [row,col {unknown-source}]: [1,1]
description: The request sent by the client was syntactically incorrect
(Unexpected character ''' (code 39) in prolog; expected '<'
 at [row,col {unknown-source}]: [1,1]).


Thanks
Naveen


Re: tika integration exception and other related queries

2011-06-09 Thread Naveen Gupta
Hi Gary,

We are doing a similar thing, but we are not creating an XML doc; rather we
are leaving TIKA to extract the content and depend on dynamic fields. We
are not storing the text as well, but not sure if in future that would be
the case.

What about Microsoft Office 2007 and later attachments (.docx, .pptx)? Is this
working for you? We are always getting a NumberFormatException. I posted about
it to the community as well, but till now no response has come.

Thanks
Naveen

On Thu, Jun 9, 2011 at 6:43 PM, Gary Taylor  wrote:

> Naveen,
>
> Not sure our requirement matches yours, but one of the things we index is a
> "comment" item that can have one or more files attached to it.  To index the
> whole thing as a single Solr document we create a zipfile containing a file
> with the comment details in it and any additional attached files.  This is
> submitted to Solr as a TEXT field in an XML doc, along with other meta-data
> fields from the comment.  In our schema the TEXT field is indexed but not
> stored, so when we search and get a match back it doesn't contain all of the
> contents from the attached files etc., only the stored fields in our schema.
>   Admittedly, the user can therefore get back a "comment" match with no
> indication as to WHERE the match occurred (ie. was it in the meta-data or
> the contents of the attached files), but at the moment we're only interested
> in getting appropriate matches, not explaining where the match is.
>
> Hope that helps.
>
> Kind regards,
> Gary.
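
[A sketch of one way to implement Gary's zip approach, assuming the extracting
request handler and the zip-handling patches mentioned below; the ids and
parameter mappings are assumptions:]

import java.io.File;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class IndexCommentZip {
    public static void main(String[] args) throws Exception {
        // 1. bundle the comment text and its attachments into one zip
        File zip = File.createTempFile("comment", ".zip");
        ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip));
        out.putNextEntry(new ZipEntry("comment.txt"));
        out.write("the comment details".getBytes("UTF-8"));
        out.closeEntry();
        // ... add one ZipEntry per attached file in the same way ...
        out.close();

        // 2. send the zip to the extracting request handler as a single document
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(zip);
        req.setParam("literal.id", "comment-42"); // assumed uniqueKey
        req.setParam("fmap.content", "text");     // map the extracted body to "text"
        req.setParam("commit", "true");
        server.request(req);
    }
}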
>
>
>
>
> On 09/06/2011 03:00, Naveen Gupta wrote:
>
>> Hi Gary
>>
>> It started working .. though I did not test for Zip files, but for rar
>> files, it is working fine ..
>>
>> The only thing I wanted to do is to index the metadata (text mapped to
>> content), not store the data. Also, in the search results, I want to
>> filter the content ... and it started working fine .. I don't want to
>> show the extracted content to the end user, since the way it extracts
>> the information is not very helpful to the user .. although we can apply
>> a few of the analyzers and filters to remove the unnecessary tags, still
>> the information would not be of much help .. looking for your opinion:
>> what did you do in order to filter out the content, or are you showing
>> the extracted content to the end user?
>>
>> Even in the case where we are showing the text part to the end user, how
>> can I limit the number of characters while querying the search results?
>> Is there any feature where we can achieve this ... the concept of a
>> snippet kind of thing ...
>>
>> Thanks
>> Naveen
>>
>> On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylor  wrote:
>>
>>  Naveen,
>>>
>>> For indexing Zip files with Tika, take a look at the following thread :
>>>
>>>
>>>
>>> http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html
>>>
>>> I got it to work with the 3.1 source and a couple of patches.
>>>
>>> Hope this helps.
>>>
>>> Regards,
>>> Gary.
>>>
>>>
>>>
>>> On 08/06/2011 04:12, Naveen Gupta wrote:
>>>
>>>  Hi Can somebody answer this ...

 3. can somebody tell me an idea how to do indexing for a zip file ?

 1. while sending docx, we are getting following error.


>


Re: Where to find the Log file

2011-06-09 Thread Morris Mwanga
Here's help on how to set up logging:

http://skybert.wordpress.com/2009/07/22/how-to-get-solr-to-log-to-a-log-file/

-
Morris
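
[The short version of that post, as a sketch; Solr's example Jetty setup uses
java.util.logging, and the file paths here are assumptions:]

# etc/logging.properties
handlers = java.util.logging.FileHandler
.level = INFO
java.util.logging.FileHandler.pattern = logs/solr.log
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

# start Jetty so it picks up the config:
#   java -Djava.util.logging.config.file=etc/logging.properties -jar start.jar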

- Original Message -
From: "Ruixiang Zhang" 
To: solr-user@lucene.apache.org
Sent: Thursday, June 9, 2011 8:45:30 PM GMT -05:00 US/Canada Eastern
Subject: Where to find the Log file

Where can I find the log file of solr? Is it turned on by default? (I use
Jetty)

Thanks
Ruixiang


Re: Where to find the Log file

2011-06-09 Thread Jack Repenning

On Jun 9, 2011, at 5:45 PM, Ruixiang Zhang wrote:

> Where can I find the log file of solr?  (I use
> Jetty)

By default, it's in /solr/logs/solr.log

> Is it turned on by default?

Yes. Oh, yes. Very much so. Uh-huh, you betcha.

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep



Re: Boosting result on query.

2011-06-09 Thread Jeff Boul
Hi,

Thank you for your answer.

But... I cannot use a boost calculated offline, since the boost will change
depending on the query made.
Each query will boost the documents differently.

Any other ideas?

Jeff


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boosting-result-on-query-tp3037649p3046859.html
Sent from the Solr - User mailing list archive at Nabble.com.


Where to find the Log file

2011-06-09 Thread Ruixiang Zhang
Where can I find the log file of solr? Is it turned on by default? (I use
Jetty)

Thanks
Ruixiang


Re: Tokenising based on known words?

2011-06-09 Thread Mark Mandel
Thanks for the feedback! This definitely gives me some options to work on!

Mark

On Thu, Jun 9, 2011 at 11:21 PM, Steven A Rowe  wrote:

> Hi Mark,
>
> Are you familiar with shingles aka token n-grams?
>
>
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html
>
> Use the empty string for the tokenSeparator to get wordstogether style
> tokens in your index.
>
> I think you'll want to apply this filter only at index-time, since the
> users will supply the shingles all by themselves :).
>
> Steve
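
[A sketch of such a fieldType; the names are assumptions, and per Steve's note
the shingles are built only in the index-time analyzer:]

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "redjacket" gets indexed alongside "red" and "jacket" -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator=""/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>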
>
> > -Original Message-
> > From: Mark Mandel [mailto:mark.man...@gmail.com]
> > Sent: Thursday, June 09, 2011 8:37 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tokenising based on known words?
> >
> > Synonyms really wouldn't work for every possible combination of words in
> > our
> > index.
> >
> > Thanks for the idea though.
> >
> > Mark
> >
> > On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty  wrote:
> >
> > > On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel 
> > wrote:
> > > > Not sure if this possible, but figured I would ask the question.
> > > >
> > > > Basically, we have some users who do some pretty ridiculous things
> > ;o)
> > > >
> > > > Rather than writing "red jacket", they write "redjacket", which
> > obviously
> > > > returns no results.
> > > [...]
> > >
> > > Have you tried using synonyms,
> > >
> > >
> >
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymF
> > ilterFactory
> > > It seems like they should fit your use case.
> > >
> > > Regards,
> > > Gora
> > >
> >
> >
> >
> > --
> > E: mark.man...@gmail.com
> > T: http://www.twitter.com/neurotic
> > W: www.compoundtheory.com
> >
> > cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
> > http://www.cfobjective.com.au
> >
> > Hands-on ColdFusion ORM Training
> > www.ColdFusionOrmTraining.com
>



-- 
E: mark.man...@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com


SolrCloud questions

2011-06-09 Thread Upayavira
I'm exploring SolrCloud for a new project, and have some questions based
upon what I've found so far.

The setup I'm planning is going to have a number of multicore hosts,
with cores being moved between hosts, and potentially with cores merging
as they get older (cores are time based, so once today has passed, they
don't get updated).

First question: The solr/conf dir gets uploaded to Zookeeper when you
first start up, and using system properties you can specify a name to be
associated with those conf files. How do you handle it when you have a
multicore setup, and different configs for each core on your host?

Second question: Can you query collections when using multicore? On
single core, I can query:

 http://localhost:8983/solr/collection1/select?q=blah

On a multicore system I can query:

 http://localhost:8983/solr/core1/select?q=blah

but I cannot work out a URL to query collection1 when I have multiple
cores.

Third question: For replication, I'm assuming that replication in
SolrCloud is still managed in the same way as non-cloud Solr, that is as
ReplicationHandler config in solrconfig? In which case, I need a
different config setup for each slave, as each slave has a different
master (or can I delegate the decision as to which host/core is its
master to zookeeper?)

Thanks for any pointers.

Upayavira
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



Re: Processing/Indexing CSV

2011-06-09 Thread Ken Krugler

On Jun 9, 2011, at 2:21pm, Helmut Hoffer von Ankershoffen wrote:

> Hi,
> 
> btw: there seems to be somewhat of a mismatch between the effort to enhance
> DIH regarding the CSV format (James Dyer) and the effort to maintain the
> CSVLoader (Ken Krugler). How about merging your efforts and migrating the
> CSVLoader to a CSVEntityProcessor (cp. my initial email)? :-)

While I'm a CSVLoader user (and I've found/fixed one bug in it), I'm not 
involved in any active development/maintenance of that piece of code.

If James or you can make progress on merging support for CSV into DIH, that's 
great.

-- Ken


> On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen <
> helmut...@googlemail.com> wrote:
> 
>> 
>> 
>> On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler 
>> wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:
>>> 
 Hi,
 
 ... that would be an option if there is a defined set of field names and
>>> a
 single column/CSV layout. The scenario however is different csv files
>>> (from
 different shops) with individual column layouts (separators, encodings
 etc.). The idea is to map known field names to defined field names in
>>> the
 solr schema. If I understand the capabilities of the CSVLoader correctly
 (sorry, I am completely new to Solr, started work on it today) this is
>>> not
 possible - is it?
>>> 
>>> As per the documentation on
>>> http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
>>> names/positions of fields in the CSV file, and ignore fieldnames.
>>> 
>>> So this seems like it would solve your requirement, as each different
>>> layout could specify its own such mapping during import.
>>> 
>>> Sure, but the requirement (to keep the process of integrating new shops
>> efficient) is not to have one mapping per import (cp. the Email regarding
>> "more or less schema free") but to enhance one mapping that maps common
>> field names to defined fields disregarding order of known fields/columns. As
>> far as I understand that is not a problem at all with DIH, however DIH and
>> CSV are not a perfect match ,-)
>> 
>> 
>>> It could be handy to provide a fieldname map (versus the value map that
>>> UpdateCSV supports).
>> 
>> Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in
>> DIH ...
>> 
>> 
>>> Then you could use the header, and just provide a mapping from header
>>> fieldnames to schema fieldnames.
>>> 
>> That's the idea -)
>> 
> => What's the best way to progress? Either someone enhances the CSVLoader
> by a field mapper (with multiple input field names mapping to one field name
>> in the Solr schema) or someone enhances the DIH with a robust CSV loader
>> ,-). As I am completely new to this Community, please give me the direction
>> to go (or wait :-).
>> 
>> best regards
>> 
>> 
>>> -- Ken
>>> 
 On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley <
>>> yo...@lucidimagination.com>wrote:
 
> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
>  wrote:
>> Hi,
>> yes, it's about CSV files loaded via HTTP from shops to be fed into a
>> shopping search engine.
>> The CSV Loader cannot map fields (only field values) etc.
> 
> You can provide your own list of fieldnames and optionally ignore the
> first line of the CSV file (assuming it contains the field names).
> http://wiki.apache.org/solr/UpdateCSV#fieldnames
> 
> -Yonik
> http://www.lucidimagination.com
> 
>>> 
>>> --
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> custom data mining solutions
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions

RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Ahmet Arslan
> Yes, I asked the wrong question. What I was subconsciously
> getting at is
> this: how are you avoiding the possibility of getting hits
> in the HTML
> elements? Is that accomplished by putting tag names in your
> stopwords, or
> by some other mechanism?

HtmlStripCharFilter removes html tags; after it, only textual content remains.
It is the same as extracting text from html/xml.

admin/analysis.jsp is a great tool for visualizing the analysis chain. You can try it.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

btw: there seems to be somewhat of a mismatch between the effort to enhance
DIH regarding the CSV format (James Dyer) and the effort to maintain the
CSVLoader (Ken Krugler). How about merging your efforts and migrating the
CSVLoader to a CSVEntityProcessor (cp. my initial email)? :-)

Best Regards

On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen <
helmut...@googlemail.com> wrote:

>
>
> On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler 
> wrote:
>
>>
>> On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:
>>
>> > Hi,
>> >
>> > ... that would be an option if there is a defined set of field names and
>> a
>> > single column/CSV layout. The scenario however is different csv files
>> (from
>> > different shops) with individual column layouts (separators, encodings
>> > etc.). The idea is to map known field names to defined field names in
>> the
>> > solr schema. If I understand the capabilities of the CSVLoader correctly
>> > (sorry, I am completely new to Solr, started work on it today) this is
>> not
>> > possible - is it?
>>
>> As per the documentation on
>> http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
>> names/positions of fields in the CSV file, and ignore fieldnames.
>>
>> So this seems like it would solve your requirement, as each different
>> layout could specify its own such mapping during import.
>>
>> Sure, but the requirement (to keep the process of integrating new shops
> efficient) is not to have one mapping per import (cp. the Email regarding
> "more or less schema free") but to enhance one mapping that maps common
> field names to defined fields disregarding order of known fields/columns. As
> far as I understand that is not a problem at all with DIH, however DIH and
> CSV are not a perfect match ,-)
>
>
>> It could be handy to provide a fieldname map (versus the value map that
>> UpdateCSV supports).
>
> Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in
> DIH ...
>
>
>> Then you could use the header, and just provide a mapping from header
>> fieldnames to schema fieldnames.
>>
> That's the idea -)
>
> => What's the best way to progress? Either someone enhances the CSVLoader
> by a field mapper (with multiple input field names mapping to one field name
> in the Solr schema) or someone enhances the DIH with a robust CSV loader
> ,-). As I am completely new to this Community, please give me the direction
> to go (or wait :-).
>
> best regards
>
>
>> -- Ken
>>
>> > On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley <
>> yo...@lucidimagination.com>wrote:
>> >
>> >> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
>> >>  wrote:
>> >>> Hi,
>> >>> yes, it's about CSV files loaded via HTTP from shops to be fed into a
>> >>> shopping search engine.
>> >>> The CSV Loader cannot map fields (only field values) etc.
>> >>
>> >> You can provide your own list of fieldnames and optionally ignore the
>> >> first line of the CSV file (assuming it contains the field names).
>> >> http://wiki.apache.org/solr/UpdateCSV#fieldnames
>> >>
>> >> -Yonik
>> >> http://www.lucidimagination.com
>> >>
>>
>> --
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom data mining solutions
>>
>>
>>
>>
>>
>>
>>
>


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler wrote:

>
> On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:
>
> > Hi,
> >
> > ... that would be an option if there is a defined set of field names and
> a
> > single column/CSV layout. The scenario however is different csv files
> (from
> > different shops) with individual column layouts (separators, encodings
> > etc.). The idea is to map known field names to defined field names in the
> > solr schema. If I understand the capabilities of the CSVLoader correctly
> > (sorry, I am completely new to Solr, started work on it today) this is
> not
> > possible - is it?
>
> As per the documentation on
> http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
> names/positions of fields in the CSV file, and ignore fieldnames.
>
> So this seems like it would solve your requirement, as each different
> layout could specify its own such mapping during import.
>
> Sure, but the requirement (to keep the process of integrating new shops
efficient) is not to have one mapping per import (cp. the Email regarding
"more or less schema free") but to enhance one mapping that maps common
field names to defined fields disregarding order of known fields/columns. As
far as I understand that is not a problem at all with DIH, however DIH and
CSV are not a perfect match ,-)


> It could be handy to provide a fieldname map (versus the value map that
> UpdateCSV supports).

Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in DIH
...


> Then you could use the header, and just provide a mapping from header
> fieldnames to schema fieldnames.
>
That's the idea -)

=> What's the best way to progress? Either someone enhances the CSVLoader by
a field mapper (with multiple input field names mapping to one field name in
the Solr schema) or someone enhances the DIH with a robust CSV loader ,-).
As I am completely new to this Community, please give me the direction to go
(or wait :-).

best regards


> -- Ken
>
> > On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley <
> yo...@lucidimagination.com>wrote:
> >
> >> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
> >>  wrote:
> >>> Hi,
> >>> yes, it's about CSV files loaded via HTTP from shops to be fed into a
> >>> shopping search engine.
> >>> The CSV Loader cannot map fields (only field values) etc.
> >>
> >> You can provide your own list of fieldnames and optionally ignore the
> >> first line of the CSV file (assuming it contains the field names).
> >> http://wiki.apache.org/solr/UpdateCSV#fieldnames
> >>
> >> -Yonik
> >> http://www.lucidimagination.com
> >>
>
> --
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom data mining solutions
>
>
>
>
>
>
>


Re: Solr Indexing Patterns

2011-06-09 Thread Judioo
Very informative links and statements, Jonathan. Thank you.



On 6 June 2011 20:55, Jonathan Rochkind  wrote:

> This is a start, for many common best practices:
>
> http://wiki.apache.org/solr/SolrRelevancyFAQ
>
> Many of the questions in there have an answer that involves de-normalizing.
> As an example. It may be that even if your specific problem isn't in there,
>  I myself anyway found reading through there gave me a general sense of
> common patterns in Solr.
>
> ( It's certainly true that some things are hard to do in Solr.  It turns
> out that an RDBMS is a remarkably flexible thing -- but when it doesn't do
> something you need well, and you turn to a specialized tool instead like
> Solr, you certainly give up some things
>
> One of the biggest areas of limitation involves hieararchical or
> relationship data, definitely. There are a variety of features, some more
> fully baked than others, some not yet in a Solr release, meant to provide
> tools to get at different aspects of this. Including "pivot facetting",
>  "join" (https://issues.apache.org/jira/browse/SOLR-2272), and
> field-collapsing.  Each, IMO, is trying to deal with different aspects of
> dealing with hieararchical or multi-class data, or data that is entities
> with relationships. ).
>
>
> On 6/6/2011 3:43 PM, Judioo wrote:
>
>> I do think that Solr would be better served if there was a *best practice
>> section *of the site.
>>
>> Looking at the majority of emails to this list, they revolve around "how
>> do I do X?".
>>
>> Seems like tutorials with real world examples would serve Solr no end of
>> good.
>>
>> I still do not have an example of the best method to approach my problem,
>> although Erick has  help me understand the limitations of Solr.
>>
>> Just thought I'd say.
>>
>>
>>
>>
>>
>>
>> On 6 June 2011 20:26, Judioo  wrote:
>>
>>  Thanks
>>>
>>>
>>> On 6 June 2011 19:32, Erick Erickson  wrote:
>>>
>>>  #Everybody# (including me) who has any RDBMS background
 doesn't want to flatten data, but that's usually the way to go in
 Solr.

 Part of whether it's a good idea or not depends on how big the index
 gets, and unfortunately the only way to figure that out is to test.

 But that's the first approach I'd try.

 Good luck!
 Erick

 On Mon, Jun 6, 2011 at 11:42 AM, Judioo  wrote:

> On 5 June 2011 14:42, Erick Erickson  wrote:
>
>  See: http://wiki.apache.org/solr/SchemaXml
>>
>> By adding ' "multiValued="true" ' to the field, you can add
>> the same field multiple times in a doc, something like
>>
>> 
>> 
>>  value1
>>  value2
>> 
>> 
>>
> I can't see how that would work, as one would need to associate the
> right start / end dates and price.
> As I understand it, using multiValued and thus flattening the discounts
> would result in:
>
> {
>"name":"The Book",
>"price":"$9.99",
>"price":"$3.00",
>"price":"$4.00","synopsis":"thanksgiving special",
>"starts":"11-24-2011",
>"starts":"10-10-2011",
>"ends":"11-25-2011",
>"ends":"10-11-2011",
>"synopsis":"Canadian thanksgiving special",
>  },
>
> How does one differentiate the different offers?
>
>
>
>  But there's no real ability  in Solr to store "sub documents",
>> so you'd have to get creative in how you encoded the discounts...
>>
>>  This is what I'm asking :)
> What is the best / recommended / known patterns for doing this?
>
>
>
>  But I suspect a better approach would be to store each discount as
>> a separate document. If you're in the trunk version, you could then
>> group results by, say, ISBN and get responses grouped together...
>>
> This is an option but seems suboptimal. So say I store the discounts in
> multiple documents with ISBN as an attribute, and also store the title
> again with ISBN as an attribute.
>
> To get
> "all books currently discounted"
>
> requires 2 requests:
>
> * get all discounts currently active
> * get all books using ISBNs retrieved from the above search
>
> Not that bad. However, what happens when I want
> "all books that are currently on discount in the 'horror' genre,
> containing the word 'elm' in the title."
>
> The only way I can see of catering for the above search is to duplicate
> all searchable fields of my "book" document in my "discount" document.
> Coming from an RDBMS background this seems wrong.
>
> Is this the correct approach to take?
>
>
>
>  Best
>> Erick
>>
>> On Sat, Jun 4, 2011 at 1:42 AM, Judioo  wrote:
>>
>>> Hi,
>>> Discounts can change daily. Also there can be a lot of them (over
>>> time and ...

Re: Processing/Indexing CSV

2011-06-09 Thread Ken Krugler

On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:

> Hi,
> 
> ... that would be an option if there is a defined set of field names and a
> single column/CSV layout. The scenario however is different csv files (from
> different shops) with individual column layouts (separators, encodings
> etc.). The idea is to map known field names to defined field names in the
> solr schema. If I understand the capabilities of the CSVLoader correctly
> (sorry, I am completely new to Solr, started work on it today) this is not
> possible - is it?

As per the documentation on http://wiki.apache.org/solr/UpdateCSV#fieldnames, 
you can specify the names/positions of fields in the CSV file, and ignore 
fieldnames.

So this seems like it would solve your requirement, as each different layout 
could specify its own such mapping during import.

It could be handy to provide a fieldname map (versus the value map that 
UpdateCSV supports). Then you could use the header, and just provide a mapping 
from header fieldnames to schema fieldnames.

-- Ken
 
> On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley 
> wrote:
> 
>> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
>>  wrote:
>>> Hi,
>>> yes, it's about CSV files loaded via HTTP from shops to be fed into a
>>> shopping search engine.
>>> The CSV Loader cannot map fields (only field values) etc.
>> 
>> You can provide your own list of fieldnames and optionally ignore the
>> first line of the CSV file (assuming it contains the field names).
>> http://wiki.apache.org/solr/UpdateCSV#fieldnames
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions

RE: solr Invalid Date in Date Math String/Invalid Date String

2011-06-09 Thread Chris Hostetter

: Here is the error message:
: 
: Fieldtype: tdate (I use the default one in solr schema.xml)
: Field value(Index): 2006-12-22T13:52:13Z
: Field value(query): [2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z]   <<<
: with '[' and ']'
: 
: And it generates the result below:

i think the piece of info people were overlooking here is that you are 
describing input to the analysis.jsp page.

you can't enter arbitrary query expressions on this page -- just *values* 
for the analyzer of the specified field (or field type)

DateField doesn't know anything about the [... TO ...] syntax -- that is 
syntax of the query parser.

all the DateField knows is that what you have entered into the "Field 
Value" text box is not a date value, and it is not a date match value 
either.



-Hoss


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
> > OK, I think I see what you're up to. Might be pretty viable
> > for me as well.
> > Can you talk about anything in your mappings.txt files that
> > is an
> > important part of the solution?
>
> It is not important. I just copied it. Plus html strip char filter does
> not have mappings parameter. It was a copy paste mistake.

Yes, I asked the wrong question. What I was subconsciously getting at is
this: how are you avoiding the possibility of getting hits in the HTML
elements? Is that accomplished by putting tag names in your stopwords, or
by some other mechanism?

-- Bryan


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Ahmet Arslan
> OK, I think I see what you're up to. Might be pretty viable
> for me as well.
> Can you talk about anything in your mappings.txt files that
> is an
> important part of the solution?

It is not important. I just copied it. Plus html strip char filter does not 
have mappings parameter. It was a copy paste mistake.
 
> Also, isn't there another piece? Don't you need to force it
> to return the
> whole document, rather than its usual context chunks? 

Yes you are right.  &hl.fragsize=0 is needed.

> We have another requirement I forgot to mention, about
> wanting to
> associate a sequence number with each hit, but I imagine I
> can deal with
> that by putting some sort of identifiable char sequence in
> a custom prefix
> for the highlighting, then replacing that with a sequence
> number in
> postprocessing.
> 
> I'm also wondering about the performance of this approach
> with large
> documents, vs. something like what Ludovic is talking
> about, where you
> would just get positions back from Solr, and fetch the
> document separately
> from a filestore.

Highlighting large documents takes time. Storing termVectors can be used to
speed it up. I don't know the answer to the performance comparison. Perhaps
someone familiar with highlighting can answer this.
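
[For example, a field stored for highlighting with term vectors enabled - the
FastVectorHighlighter needs all three term* attributes - might be declared in
schema.xml as:]

<field name="content" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>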



Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

... that would be an option if there is a defined set of field names and a
single column/CSV layout. The scenario however is different csv files (from
different shops) with individual column layouts (separators, encodings
etc.). The idea is to map known field names to defined field names in the
solr schema. If I understand the capabilities of the CSVLoader correctly
(sorry, I am completely new to Solr, started work on it today) this is not
possible - is it?

Best Regards


On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley wrote:

> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
>  wrote:
> > Hi,
> > yes, it's about CSV files loaded via HTTP from shops to be fed into a
> > shopping search engine.
> > The CSV Loader cannot map fields (only field values) etc.
>
> You can provide your own list of fieldnames and optionally ignore the
> first line of the CSV file (assuming it contains the field names).
> http://wiki.apache.org/solr/UpdateCSV#fieldnames
>
> -Yonik
> http://www.lucidimagination.com
>


Re: Processing/Indexing CSV

2011-06-09 Thread Yonik Seeley
On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
 wrote:
> Hi,
> yes, it's about CSV files loaded via HTTP from shops to be fed into a
> shopping search engine.
> The CSV Loader cannot map fields (only field values) etc.

You can provide your own list of fieldnames and optionally ignore the
first line of the CSV file (assuming it contains the field names).
http://wiki.apache.org/solr/UpdateCSV#fieldnames

-Yonik
http://www.lucidimagination.com
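
[For example, a sketch of loading one shop's file; the file name, separator,
and field list are assumptions:]

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%3B&header=true&fieldnames=id,name,price' \
  --data-binary @shop1.csv -H 'Content-Type: text/plain; charset=utf-8'

(header=true together with an explicit fieldnames list makes Solr read and
discard the file's own header line.)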


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

yes, it's about CSV files loaded via HTTP from shops to be fed into a
shopping search engine.

The CSV Loader cannot map fields (only field values) etc. DIH is flexible
enough for building the importing part of such a thing but lacks elegant
handling of CSV data ...

Regards

On Thu, Jun 9, 2011 at 9:50 PM, Yonik Seeley wrote:

> On Thu, Jun 9, 2011 at 3:31 PM, Helmut Hoffer von Ankershoffen
>  wrote:
> > Hi,
> >
> > there seems to be no way to index CSV using the DataImportHandler.
>
> Looking over the features you want, it looks like you're starting from
> a CSV file (as opposed to CSV stored in a database).
> Is there a reason that you need to use DIH and can't directly use the
> CSV loader?
> http://wiki.apache.org/solr/UpdateCSV
>
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> > Using a combination of
> > LineEntityProcessor<
> http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor>
> >  and RegexTransformer<
> http://wiki.apache.org/solr/DataImportHandler#RegexTransformer>
> > as
> > proposed in
> >
> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is
> > not working for real world CSV files.
> >
> > E.g. many CSV files have double-quotes enclosing some but not all columns
> -
> > there is no elegant way to segment this using a simple regular
> expression.
> >
> > As CSV is still very common esp. in E-Commerce scenarios, I propose that
> > Solr provides a CSVEntityProcessor that:
> > 1) Handles the case of CSV files with/without and with some double-quote
> > enclosed columns
> > 2) Allows for a configurable column separator (';',',','\t' etc.)
> > 3) Allows for a leading row containing column headings
> > 4) If there is a leading row with column headings provides a possibility
> to
> > address columns by their column names and map them to Solr fields
> (similar
> > to the XPathEntityProcessor)
> > 5) Auto-detects encoding of the file (UTF-8 etc.)
> >
> > This would make it A LOT easier to use Solr for E-Commerce scenarios.
> >
> > If there is no such entity processor in the works i will develop one ...
> So
> > please let me know.
> >
> > Regards
> >
>


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Wednesday, June 08, 2011 11:56 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Displaying highlights in formatted HTML document
>
>
>
> --- On Thu, 6/9/11, Bryan Loofbourrow 
> wrote:
>
> > From: Bryan Loofbourrow 
> > Subject: Displaying highlights in formatted HTML document
> > To: solr-user@lucene.apache.org
> > Date: Thursday, June 9, 2011, 2:14 AM
> > Here is my use case:
> >
> >
> >
> > I have a large number of HTML documents, sizes in the
> > 0.5K-50M range, most
> > around, say, 10M.
> >
> >
> >
> > I want to be able to present the user with the formatted
> > HTML document, with
> > the hits tagged, so that he may iterate through them, and
> > see them in the
> > context of the document, with the document looking as it
> > would be presented
> > by a browser; that is, fully formatted, with its tables and
> > italics and font
> > sizes and all.
> >
> >
> >
> > This is something that the user would explicitly request
> > from within a set
> > of search results, not something I'd expect to have
> > returned from an initial
> > search - the initial search merely returns the snippets
> > around the hits. But
> > if the user wants to dive into one of the returned results
> > and see them in
> > context, I need to be able to go get that.
> >
> >
> >
> > We are currently solving this problem by using an entirely
> > separate search
> > engine (dtSearch), which performs the tagging of the hits
> > in the HTML just
> > fine. But the solution is unsatisfactory because there are
> > Solr searches
> > that dtSearch's capabilities cannot reasonably match.
> >
> >
> >
> > Can anyone suggest a good way to use Solr/Lucene for this
> > instead? I'm
> > thinking a separate core for this purpose might make sense,
> > so as not to
> > burden the primary search core with the full contents of
> > the document. But
> > after that, I'm stuck. How can I get Solr to express the
> > highlighting in the
> > context of the formatted HTML document?
> >
> >
> >
> > If Solr does not do this currently, and anyone can suggest
> > ways to add the
> > feature, any tips on how this might best be incorporated
> > into the
> > implementation would be welcome.
>
> I am doing the same thing (solr trunk) using the following field type:
>
> <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="solr.HTMLStripCharFilterFactory" mapping="mappings.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> In your separate core - which is queried when the user wants to dive
> into one of the returned results - feed your html files into this
> field.
>
> You may want to increase hl.maxAnalyzedChars too, e.g. to 2147483647.

OK, I think I see what you're up to. Might be pretty viable for me as well.
Can you talk about anything in your mappings.txt files that is an
important part of the solution?

Also, isn't there another piece? Don't you need to force it to return the
whole document, rather than its usual context chunks? Or are you somehow
able to map the returned chunks into the separately-stored documents?

We have another requirement I forgot to mention, about wanting to
associate a sequence number with each hit, but I imagine I can deal with
that by putting some sort of identifiable char sequence in a custom prefix
for the highlighting, then replacing that with a sequence number in
postprocessing.

I'm also wondering about the performance of this approach with large
documents, vs. something like what Ludovic is talking about, where you
would just get positions back from Solr, and fetch the document separately
from a filestore.

-- Bryan


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
s/provide and/provide any/ig ,-)

On Thu, Jun 9, 2011 at 10:01 PM, Helmut Hoffer von Ankershoffen <
helmut...@googlemail.com> wrote:

> Hi,
>
> just looked at your code. Definitely an improvement :-)
>
> The problem with the double-quotes is that the delimiter (let's say ',')
> might be part of the column value. The goal is to process something like
> this without any tricky configuration:
>
> name1,name2,name3
> val1,"val2,...",val3
> ...
>
> The user should not have to provide and before-hand knowledge regarding the
> column layout or the encoding of the CSV file. Ideally the only thing that
> has to be specified is firstLineHasFieldnames="true" separator=";".
> Autodetecting the separator and encoding would be even more elegant.
>
> If nobody else has this in the works I will start building such a patch
> next week.
>
> Best Regards
>
>
> On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James wrote:
>
>> Helmut,
>>
>> I recently submitted SOLR-2549 (
>> https://issues.apache.org/jira/browse/SOLR-2549) to handle both
>> fixed-width and delimited flat files.  To be honest, I only needed
>> fixed-width support for my app so this might not support everything you
>> mention for delimited files, but it should be a good start.
>>
>> In particular, you might need to enhance this to handle the double quotes
>> (I had thought a delimiter regex along these lines might handle it:
>>  (?:[\"]?[,]|[\"]$)  ... note this is a sample I just cooked up quick and no
>> doubt has errors, and maybe as you say a simple regex might not work at all
>> ) ... I also didn't do anything with encodings but I'm not sure this will be
>> an issue either...
>>
>> James Dyer
>> E-Commerce Systems
>> Ingram Content Group
>> (615) 213-4311
>>
>> -Original Message-
>> From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
>> Sent: Thursday, June 09, 2011 2:32 PM
>> To: solr-user@lucene.apache.org
>> Subject: Processing/Indexing CSV
>>
>> Hi,
>>
>> there seems to be no way to index CSV using the DataImportHandler.
>>
>> Using a combination of
>> LineEntityProcessor<
>> http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor>
>>  and RegexTransformer<
>> http://wiki.apache.org/solr/DataImportHandler#RegexTransformer>
>> as
>> proposed in
>>
>> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is
>> not working for real world CSV files.
>>
>> E.g. many CSV files have double-quotes enclosing some but not all columns
>> -
>> there is no elegant way to segment this using a simple regular expression.
>>
>> As CSV is still very common esp. in E-Commerce scenarios, I propose that
>> Solr provides a CSVEntityProcessor that:
>> 1) Handles the case of CSV files with/without and with some double-quote
>> enclosed columns
>> 2) Allows for a configurable column separator (';',',','\t' etc.)
>> 3) Allows for a leading row containing column headings
>> 4) If there is a leading row with column headings provides a possibility
>> to
>> address columns by their column names and map them to Solr fields (similar
>> to the XPathEntityProcessor)
>> 5) Auto-detects encoding of the file (UTF-8 etc.)
>>
>> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>>
>> If there is no such entity processor in the works i will develop one ...
>> So
>> please let me know.
>>
>> Regards
>>
>
>


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

just looked at your code. Definitely an improvement :-)

The problem with the double-quotes is that the delimiter (let's say ',')
might be part of the column value. The goal is to process something like
this without any tricky configuration:

name1,name2,name3
val1,"val2,...",val3
...

The user should not have to provide and before-hand knowledge regarding the
column layout or the encoding of the CSV file. Ideally the only thing that
has to be specified is firstLineHasFieldnames="true" separator=";".
Autodetecting the separator and encoding would be even more elegant.
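
[One possible shape for the autodetection piece, as a sketch only: it assumes
ICU4J's CharsetDetector for the encoding, and guesses the separator by counting
candidate characters in the header line.]

import com.ibm.icu.text.CharsetDetector;

public class CsvSniffer {

    /** Guess the charset of a CSV file from its first few kilobytes. */
    static String detectEncoding(byte[] sample) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(sample);
        return detector.detect().getName(); // e.g. "UTF-8", "ISO-8859-1"
    }

    /** Pick the candidate separator occurring most often in the header line. */
    static char detectSeparator(String headerLine) {
        char[] candidates = {',', ';', '\t', '|'};
        char best = ',';
        int bestCount = -1;
        for (char c : candidates) {
            int count = 0;
            for (int i = 0; i < headerLine.length(); i++) {
                if (headerLine.charAt(i) == c) count++;
            }
            if (count > bestCount) {
                bestCount = count;
                best = c;
            }
        }
        return best;
    }
}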

If nobody else has this in the works I will start building such a patch next
week.

Best Regards


On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James wrote:

> Helmut,
>
> I recently submitted SOLR-2549 (
> https://issues.apache.org/jira/browse/SOLR-2549) to handle both
> fixed-width and delimited flat files.  To be honest, I only needed
> fixed-width support for my app so this might not support everything you
> mention for delimited files, but it should be a good start.
>
> In particular, you might need to enhance this to handle the double quotes
> (I had thought a delimiter regex along these lines might handle it:
>  (?:[\"]?[,]|[\"]$)  ... note this is a sample I just cooked up quick and no
> doubt has errors, and maybe as you say a simple regex might not work at all
> ) ... I also didn't do anything with encodings but I'm not sure this will be
> an issue either...
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
> -Original Message-
> From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
> Sent: Thursday, June 09, 2011 2:32 PM
> To: solr-user@lucene.apache.org
> Subject: Processing/Indexing CSV
>
> Hi,
>
> there seems to be no way to index CSV using the DataImportHandler.
>
> Using a combination of
> LineEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor)
> and RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer)
> as proposed in
> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
> is not working for real world CSV files.
>
> E.g. many CSV files have double-quotes enclosing some but not all columns -
> there is no elegant way to segment this using a simple regular expression.
>
> As CSV is still very common esp. in E-Commerce scenarios, I propose that
> Solr provides a CSVEntityProcessor that:
> 1) Handles CSV files where all, none, or only some columns are enclosed in
> double-quotes
> 2) Allows for a configurable column separator (';',',','\t' etc.)
> 3) Allows for a leading row containing column headings
> 4) If there is a leading row with column headings, provides a possibility to
> address columns by their column names and map them to Solr fields (similar
> to the XPathEntityProcessor)
> 5) Auto-detects encoding of the file (UTF-8 etc.)
>
> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>
> If there is no such entity processor in the works I will develop one ... So
> please let me know.
>
> Regards
>


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread lboutros
I am not (yet) a Tika user; perhaps iorixxx's solution is good for
you.

We will share the highlighter module and 2 other developments soon. (I have
to see how to do that.)

Ludovic. 

-
Jouve
France.


Re: Processing/Indexing CSV

2011-06-09 Thread Yonik Seeley
On Thu, Jun 9, 2011 at 3:31 PM, Helmut Hoffer von Ankershoffen
 wrote:
> Hi,
>
> there seems to be no way to index CSV using the DataImportHandler.

Looking over the features you want, it looks like you're starting from
a CSV file (as opposed to CSV stored in a database).
Is there a reason that you need to use DIH and can't directly use the
CSV loader?
http://wiki.apache.org/solr/UpdateCSV
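
For completeness, a minimal SolrJ sketch of posting a file to that handler
(the file name is made up; separator, encapsulator and header are the
parameters documented on the wiki page above):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvLoad {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("products.csv"));  // made-up file name
        req.setParam("separator", ";");         // configurable column separator
        req.setParam("encapsulator", "\"");     // handles double-quoted columns
        req.setParam("header", "true");         // first line holds the field names
        req.setParam("commit", "true");
        server.request(req);
    }
}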


-Yonik
http://www.lucidimagination.com



> Using a combination of
> LineEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor)
> and RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer)
> as proposed in
> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
> is not working for real world CSV files.
>
> E.g. many CSV files have double-quotes enclosing some but not all columns -
> there is no elegant way to segment this using a simple regular expression.
>
> As CSV is still very common esp. in E-Commerce scenarios, I propose that
> Solr provides a CSVEntityProcessor that:
> 1) Handles CSV files where all, none, or only some columns are enclosed in
> double-quotes
> 2) Allows for a configurable column separator (';',',','\t' etc.)
> 3) Allows for a leading row containing column headings
> 4) If there is a leading row with column headings, provides a possibility to
> address columns by their column names and map them to Solr fields (similar
> to the XPathEntityProcessor)
> 5) Auto-detects encoding of the file (UTF-8 etc.)
>
> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>
> If there is no such entity processor in the works I will develop one ... So
> please let me know.
>
> Regards
>


RE: Displaying highlights in formatted HTML document

2011-06-09 Thread Bryan Loofbourrow
Ludovic,

>> how do you index your html files? I mean do you create fields for
different
parts of your document (for different stop word lists, stemming, etc.)?
with DIH or SolrJ or something else? <<

We are sending them over http, and using Tika to strip the HTML, at
present.

We do not split the document itself into separate fields, but what we
index includes a bunch of metadata that has been extracted by processes
earlier in the pipeline. These fields don't enter into the
HTML-hit-highlighting question.

>> I developed this week a new highlighter module which transfers the
fields
highlighting to the original document (xml in my case) (I use payloads to
store offsets and lengths of fields in the index). This way, I use the
good
analyzers to do the highlighting correctly and then, I replace the
different
field parts in the document by the highlighted parts. It is not finished
yet, but I already have some good results. <<

Yes, I have been thinking along very similar lines. If you arrive at
something you're happy with, I encourage you to share it.

>> This is a client request too. Let me know if iorixxx's solution is
not enough for your particular use case. <<

I'm enough of a Solr newb that I'll need to study his suggestion for a
bit, to figure out what it does and does not do. When I've done so, I'll
respond to his message.

Thanks,

-- Bryan


RE: Processing/Indexing CSV

2011-06-09 Thread Dyer, James
Helmut,

I recently submitted SOLR-2549 
(https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width 
and delimited flat files.  To be honest, I only needed fixed-width support for 
my app so this might not support everything you mention for delimited files, 
but it should be a good start.  

In particular, you might need to enhance this to handle the double quotes (I 
had thought a delimiter regex along these lines might handle it:
(?:[\"]?[,]|[\"]$)  ... note this is a sample I just cooked up quick and no 
doubt has errors, and maybe as you say a simple regex might not work at all ) 
... I also didn't do anything with encodings but I'm not sure this will be an 
issue either...
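
A quick throwaway test of that regex against a quoted column (just an
illustration, not part of SOLR-2549) shows where it breaks:

public class DelimiterRegexTest {
    public static void main(String[] args) {
        String line = "val1,\"val2,...\",val3";
        // the proposed delimiter regex: an optional quote before a comma,
        // or a closing quote at end of line
        for (String col : line.split("(?:[\"]?[,]|[\"]$)")) {
            System.out.println("[" + col + "]");
        }
        // prints [val1] ["val2] [...] [val3] -- the comma inside the
        // quoted value still splits it, and a stray quote is left behind
    }
}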

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com] 
Sent: Thursday, June 09, 2011 2:32 PM
To: solr-user@lucene.apache.org
Subject: Processing/Indexing CSV

Hi,

there seems to be no way to index CSV using the DataImportHandler.

Using a combination of
LineEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor)
and RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer)
as proposed in
http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
is not working for real world CSV files.

E.g. many CSV files have double-quotes enclosing some but not all columns -
there is no elegant way to segment this using a simple regular expression.

As CSV is still very common esp. in E-Commerce scenarios, I propose that
Solr provides a CSVEntityProcessor that:
1) Handles CSV files where all, none, or only some columns are enclosed in
double-quotes
2) Allows for a configurable column separator (';',',','\t' etc.)
3) Allows for a leading row containing column headings
4) If there is a leading row with column headings, provides a possibility to
address columns by their column names and map them to Solr fields (similar
to the XPathEntityProcessor)
5) Auto-detects encoding of the file (UTF-8 etc.)

This would make it A LOT easier to use Solr for E-Commerce scenarios.

If there is no such entity processor in the works I will develop one ... So
please let me know.

Regards


Unique Results from Edgy Text

2011-06-09 Thread Jamie Johnson
I am using the guide found here (
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/)
to build an autocomplete search capability but in my data set I have some
documents which have the same value for the field that is being returned, so
for instance I have the following being returned:

A test document to see how this works
 A test document to see how this works
 A test document to see how this works
A test document to see how this works
 A test document to see how this works

I'm wondering if there is something I can specify that I want only unique
results to come back.  I know I can do some post processing of the results
to make sure that only unique items come back, but I was hoping there was
something that could be done to the query.  Any thoughts?
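
If post-processing turns out to be the only option, it is at least small;
a sketch, where the list stands in for whatever values your client reads
out of the response:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupSuggestions {
    public static void main(String[] args) {
        List<String> returned = Arrays.asList(
            "A test document to see how this works",
            "A test document to see how this works",
            "A test document to see how this works");
        // LinkedHashSet keeps the original ranking order and drops repeats
        Set<String> unique = new LinkedHashSet<String>(returned);
        System.out.println(unique);
    }
}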


Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

to make my point more clear: if the CSV has a fixed schema / column layout,
using the RegexTransformer is of course a possibility (however awkward). But
if you want to implement a (more or less) schema-free shopping search engine
...

regards

On Thu, Jun 9, 2011 at 9:31 PM, Helmut Hoffer von Ankershoffen <
helmut...@googlemail.com> wrote:

> Hi,
>
> there seems to be no way to index CSV using the DataImportHandler.
>
> Using a combination of
> LineEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor)
> and RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer)
> as proposed in
> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
> is not working for real world CSV files.
>
> E.g. many CSV files have double-quotes enclosing some but not all columns -
> there is no elegant way to segment this using a simple regular expression.
>
> As CSV is still very common esp. in E-Commerce scenarios, I propose that
> Solr provides a CSVEntityProcessor that:
> 1) Handles CSV files where all, none, or only some columns are enclosed in
> double-quotes
> 2) Allows for a configurable column separator (';',',','\t' etc.)
> 3) Allows for a leading row containing column headings
> 4) If there is a leading row with column headings, provides a possibility to
> address columns by their column names and map them to Solr fields (similar
> to the XPathEntityProcessor)
> 5) Auto-detects encoding of the file (UTF-8 etc.)
>
> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>
> If there is no such entity processor in the works I will develop one ... So
> please let me know.
>
> Regards
>


Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi,

there seems to be no way to index CSV using the DataImportHandler.

Using a combination of
LineEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor)
and RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer)
as proposed in
http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
is not working for real world CSV files.

E.g. many CSV files have double-quotes enclosing some but not all columns -
there is no elegant way to segment this using a simple regular expression.

As CSV is still very common esp. in E-Commerce scenarios, I propose that
Solr provides a CSVEntityProcessor that:
1) Handles CSV files where all, none, or only some columns are enclosed in
double-quotes
2) Allows for a configurable column separator (';',',','\t' etc.)
3) Allows for a leading row containing column headings
4) If there is a leading row with column headings, provides a possibility to
address columns by their column names and map them to Solr fields (similar
to the XPathEntityProcessor)
5) Auto-detects encoding of the file (UTF-8 etc.)

This would make it A LOT easier to use Solr for E-Commerce scenarios.

If there is no such entity processor in the works I will develop one ... So
please let me know.

Regards


RE: Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
No, from what I understand, the way Solr does an update is to delete the
document, then recreate all the fields; there is no partial updating of the
file... maybe because of performance issues or locking?

-Original Message-
From: David Ross [mailto:davidtr...@hotmail.com] 
Sent: 9 juin 2011 15:23
To: solr-user@lucene.apache.org
Subject: RE: Indexing data from multiple datasources


This thread got me thinking a bit...
Does SOLR support the concept of "partial updates" to documents?  By this I 
mean updating a subset of fields in a document that already exists in the 
index, and without having to resubmit the entire document.
An example would be storing/indexing user tags associated with documents. These 
tags will not be available when the document is initially presented to SOLR, 
and may or may not come along at a later time. When that time comes, can we 
just submit the tag data (and document identifier I'd imagine), or do we have 
to import the entire document?
new to SOLR...

> Date: Thu, 9 Jun 2011 14:00:43 -0400
> Subject: Re: Indexing data from multiple datasources
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
> 
> How are you using it? Streaming the files to Solr via HTTP? You can use Tika
> on the client to extract the various bits from the structured documents, and
> use SolrJ to assemble various bits of that data Tika exposes into a
> Solr document
> that you then send to Solr. At the point you're transferring data from the
> Tika parse to the Solr document, you could add any data from your database 
> that
> you wanted.
> 
> The result is that you'd be indexing the complete Solr document only once.
> 
> You're right that updating a document in Solr overwrites the previous
> version and any
> data in the previous version is lost
> 
> Best
> Erick
> 
> On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges  wrote:
> > Hello Erick,
> >
> > Thanks for the response. No, I am using the extract handler to extract the 
> > data from my text files. In your second approach, you say I could use a DIH 
> > to update the index which would have been created by the extract handler in 
> > the first phase. I thought that, let's say I get info from the DB and update
> > the index with the document ID, will I overwrite the data and lose the 
> > initial data from the extract handler phase? Thanks
> >
> > Greg
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 9 juin 2011 12:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing data from multiple datasources
> >
> > Hmmm, when you say you use Tika, are you using some custom Java code? 
> > Because
> > if you are, the best thing to do is query your database at that point
> > and add whatever information
> > you need to the document.
> >
> > If you're using DIH to do the crawl, consider implementing a
> > Transformer to do the database
> > querying and modify the document as necessary This is pretty
> > simple to do, we can
> > chat a bit more depending on whether either approach makes sense.
> >
> > Best
> > Erick
> >
> >
> >
> > On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  
> > wrote:
> >> Hello all,
> >>
> >> I have checked the forums to see if it is possible to create and index 
> >> from multiple datasources. I have found references to SOLR-1358, but I 
> >> don't think this fits my scenario. In all, we have an application where we 
> >> upload files. On the file upload, I use the Tika extract handler to save 
> >> metadata from the file (_attr, literal values, etc..). We also have a 
> >> database which has information on the uploaded files, like the category, 
> >> type, etc.. I would like to update the index to include this information 
> >> from the db in the index for each document. If I run a dataimporthandler 
> >> after the extract phase I am afraid that updating the doc in the index
> >> by its id will just overwrite the old information with the
> >> info from the DB (what I understand is that Solr updates its index by ID 
> >> by deleting first then recreating the info).
> >>
> >> Anyone have any pointers, is there a clean way to do this, or must I find 
> >> a way to pass the db metadata to the extract handler and save it as 
> >> literal fields?
> >>
> >> Thanks in advance
> >>
> >> Greg
> >>
> >
  


RE: Indexing data from multiple datasources

2011-06-09 Thread David Ross

This thread got me thinking a bit...
Does SOLR support the concept of "partial updates" to documents?  By this I 
mean updating a subset of fields in a document that already exists in the 
index, and without having to resubmit the entire document.
An example would be storing/indexing user tags associated with documents. These 
tags will not be available when the document is initially presented to SOLR, 
and may or may not come along at a later time. When that time comes, can we 
just submit the tag data (and document identifier I'd imagine), or do we have 
to import the entire document?
new to SOLR...

> Date: Thu, 9 Jun 2011 14:00:43 -0400
> Subject: Re: Indexing data from multiple datasources
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
> 
> How are you using it? Streaming the files to Solr via HTTP? You can use Tika
> on the client to extract the various bits from the structured documents, and
> use SolrJ to assemble various bits of that data Tika exposes into a
> Solr document
> that you then send to Solr. At the point you're transferring data from the
> Tika parse to the Solr document, you could add any data from your database 
> that
> you wanted.
> 
> The result is that you'd be indexing the complete Solr document only once.
> 
> You're right that updating a document in Solr overwrites the previous
> version and any
> data in the previous version is lost
> 
> Best
> Erick
> 
> On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges  wrote:
> > Hello Erick,
> >
> > Thanks for the response. No, I am using the extract handler to extract the 
> > data from my text files. In your second approach, you say I could use a DIH 
> > to update the index which would have been created by the extract handler in 
> > the first phase. I thought that, let's say I get info from the DB and update
> > the index with the document ID, will I overwrite the data and lose the 
> > initial data from the extract handler phase? Thanks
> >
> > Greg
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 9 juin 2011 12:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing data from multiple datasources
> >
> > Hmmm, when you say you use Tika, are you using some custom Java code? 
> > Because
> > if you are, the best thing to do is query your database at that point
> > and add whatever information
> > you need to the document.
> >
> > If you're using DIH to do the crawl, consider implementing a
> > Transformer to do the database
> > querying and modify the document as necessary This is pretty
> > simple to do, we can
> > chat a bit more depending on whether either approach makes sense.
> >
> > Best
> > Erick
> >
> >
> >
> > On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  
> > wrote:
> >> Hello all,
> >>
> >> I have checked the forums to see if it is possible to create and index 
> >> from multiple datasources. I have found references to SOLR-1358, but I 
> >> don't think this fits my scenario. In all, we have an application where we 
> >> upload files. On the file upload, I use the Tika extract handler to save 
> >> metadata from the file (_attr, literal values, etc..). We also have a 
> >> database which has information on the uploaded files, like the category, 
> >> type, etc.. I would like to update the index to include this information 
> >> from the db in the index for each document. If I run a dataimporthandler 
> >> after the extract phase I am afraid that updating the doc in the index
> >> by its id will just overwrite the old information with the
> >> info from the DB (what I understand is that Solr updates its index by ID 
> >> by deleting first then recreating the info).
> >>
> >> Anyone have any pointers, is there a clean way to do this, or must I find 
> >> a way to pass the db metadata to the extract handler and save it as 
> >> literal fields?
> >>
> >> Thanks in advance
> >>
> >> Greg
> >>
> >
  

Re: Indexing data from multiple datasources

2011-06-09 Thread Erick Erickson
How are you using it? Streaming the files to Solr via HTTP? You can use Tika
on the client to extract the various bits from the structured documents, and
use SolrJ to assemble various bits of that data Tika exposes into a
Solr document
that you then send to Solr. At the point you're transferring data from the
Tika parse to the Solr document, you could add any data from your database that
you wanted.

The result is that you'd be indexing the complete Solr document only once.
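
A rough sketch of that flow with Tika and SolrJ (the file name, field names
and the database lookup are all placeholders):

import java.io.FileInputStream;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class IndexOnce {
    public static void main(String[] args) throws Exception {
        // 1) parse the file with Tika on the client
        BodyContentHandler text = new BodyContentHandler(-1);
        Metadata meta = new Metadata();
        new AutoDetectParser().parse(
            new FileInputStream("upload.pdf"), text, meta, new ParseContext());

        // 2) assemble one Solr document, mixing Tika output and DB data
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("text", text.toString());
        doc.addField("title", meta.get(Metadata.TITLE));
        doc.addField("category", lookupCategory("doc-42")); // your JDBC query here

        // 3) send the complete document to Solr once
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.add(doc);
        server.commit();
    }

    // stand-in for the real database lookup
    static String lookupCategory(String id) { return "some-category"; }
}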

You're right that updating a document in Solr overwrites the previous
version and any
data in the previous version is lost

Best
Erick

On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges  wrote:
> Hello Erick,
>
> Thanks for the response. No, I am using the extract handler to extract the 
> data from my text files. In your second approach, you say I could use a DIH 
> to update the index which would have been created by the extract handler in 
> the first phase. I thought that, let's say I get info from the DB and update
> the index with the document ID, will I overwrite the data and lose the 
> initial data from the extract handler phase? Thanks
>
> Greg
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 9 juin 2011 12:15
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing data from multiple datasources
>
> Hmmm, when you say you use Tika, are you using some custom Java code? Because
> if you are, the best thing to do is query your database at that point
> and add whatever information
> you need to the document.
>
> If you're using DIH to do the crawl, consider implementing a
> Transformer to do the database
> querying and modify the document as necessary This is pretty
> simple to do, we can
> chat a bit more depending on whether either approach makes sense.
>
> Best
> Erick
>
>
>
> On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  
> wrote:
>> Hello all,
>>
>> I have checked the forums to see if it is possible to create and index from 
>> multiple datasources. I have found references to SOLR-1358, but I don't 
>> think this fits my scenario. In all, we have an application where we upload 
>> files. On the file upload, I use the Tika extract handler to save metadata 
>> from the file (_attr, literal values, etc..). We also have a database which 
>> has information on the uploaded files, like the category, type, etc.. I 
>> would like to update the index to include this information from the db in 
>> the index for each document. If I run a dataimporthandler after the extract 
>> phase I am afraid that by updating the doc in the index by its id will just 
>> cause that I overwrite the old information with the info from the DB (what I 
>> understand is that Solr updates its index by ID by deleting first then 
>> recreating the info).
>>
>> Anyone have any pointers, is there a clean way to do this, or must I find a 
>> way to pass the db metadata to the extract handler and save it as literal 
>> fields?
>>
>> Thanks in advance
>>
>> Greg
>>
>


RE: Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
Hello Erick,

Thanks for the response. No, I am using the extract handler to extract the data 
from my text files. In your second approach, you say I could use a DIH to 
update the index which would have been created by the extract handler in the 
first phase. I thought that, let's say I get info from the DB and update the
index with the document ID, will I overwrite the data and lose the initial data 
from the extract handler phase? Thanks

Greg

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 9 juin 2011 12:15
To: solr-user@lucene.apache.org
Subject: Re: Indexing data from multiple datasources

Hmmm, when you say you use Tika, are you using some custom Java code? Because
if you are, the best thing to do is query your database at that point
and add whatever information
you need to the document.

If you're using DIH to do the crawl, consider implementing a
Transformer to do the database
querying and modify the document as necessary This is pretty
simple to do, we can
chat a bit more depending on whether either approach makes sense.

Best
Erick



On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  wrote:
> Hello all,
>
> I have checked the forums to see if it is possible to create and index from 
> multiple datasources. I have found references to SOLR-1358, but I don't think 
> this fits my scenario. In all, we have an application where we upload files. 
> On the file upload, I use the Tika extract handler to save metadata from the 
> file (_attr, literal values, etc..). We also have a database which has 
> information on the uploaded files, like the category, type, etc.. I would 
> like to update the index to include this information from the db in the index 
> for each document. If I run a dataimporthandler after the extract phase I am 
> afraid that updating the doc in the index by its id will just overwrite the
> old information with the info from the DB (what I understand is
> that Solr updates its index by ID by deleting first then recreating the info).
>
> Anyone have any pointers, is there a clean way to do this, or must I find a 
> way to pass the db metadata to the extract handler and save it as literal 
> fields?
>
> Thanks in advance
>
> Greg
>


Re: [Free Text] Field Tokenizing

2011-06-09 Thread Erick Erickson
The KeywordTokenizer doesn't do anything to break up the input stream,
it just treats the whole input to the field as a single token. So I don't think
you'll be able to "extract" anything starting with that tokenizer.

Look at the admin/analysis page to see a step-by-step breakdown of what
your analyzer chain does. Be sure to check the "verbose" checkbox

Best
Erick

On Thu, Jun 9, 2011 at 12:35 PM, Adam Estrada
 wrote:
> Erick,
>
> I totally understand that BUT the keyword tokenizer factory does a really
> good job extracting phrases (or what look like phrases) from my data. I
> don't know why exactly but it does do it. I am going to continue working
> through it to see if I can't figure it out ;-)
>
> Adam
>
> On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson 
> wrote:
>
>> The problem here is that none of the built-in filters or tokenizers
>> have a prayer
>> of recognizing what #you# think are phrases, since it'll be unique to
>> your situation.
>>
>> If you have a list of phrases you care about, you could substitute a
>> single token
>> for the phrases you care about...
>>
>> But the overriding question is what determines a phrase you're
>> interested in? Is it
>> a list or is there some heuristic you want to apply?
>>
>> Or could you just recognize them at query time and make them into a
>> literal phrase
>> (i.e. with quotation marks)?
>>
>> Best
>> Erick
>>
>> On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
>>  wrote:
>> > All,
>> >
>> > I am at a bit of a loss here so any help would be greatly appreciated. I
>> am
>> > using the DIH to grab data from a DB. The field that I am most interested
>> in
>> > has anywhere from 1 word to several paragraphs worth of free text. What I
>> > would really like to do is pull out phrases like "Joe's coffee shop"
>> rather
>> > than the 3 individual words. I have tried the KeywordTokenizerFactory and
>> > that does seem to do what I want it to do but it is not actually
>> tokenizing
>> > anything so it does what I want it to for the most part but it's not
>> > creating the tokens that I need for further analysis in apps like Mahout.
>> >
>> > We can play with the combination of tokenizers and filters all day long
>> and
>> > see what the results are after a quick reindex. I typically just view
>> them
>> > in Solritas as facets which may be the problem for me too. Does anyone
>> have
>> > an example fieldType they can share with me that shows how to
>> > extract phrases if they are there from the data I described earlier. Am I
>> > even going about this the right way? I am using today's trunk build of
>> Solr
>> > and here is what I have munged together this morning.
>> >
>> > <!-- element names inferred; the original tags were lost in the archive -->
>> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
>> >     autoGeneratePhraseQueries="true">
>> >   <analyzer>
>> >     <charFilter class="solr.MappingCharFilterFactory"
>> >         mapping="mapping-ISOLatin1Accent.txt"/>
>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >         words="stopwords.txt" enablePositionIncrements="true"/>
>> >     <filter class="solr.ShingleFilterFactory"
>> >         outputUnigrams="true" outputUnigramIfNoNgram="false"/>
>> >     <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> > Thanks,
>> > Adam
>> >
>>
>


Re: [Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
Erick,

I totally understand that BUT the keyword tokenizer factory does a really
good job extracting phrases (or what look like phrases) from my data. I
don't know why exactly but it does do it. I am going to continue working
through it to see if I can't figure it out ;-)

Adam

On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson wrote:

> The problem here is that none of the built-in filters or tokenizers
> have a prayer
> of recognizing what #you# think are phrases, since it'll be unique to
> your situation.
>
> If you have a list of phrases you care about, you could substitute a
> single token
> for the phrases you care about...
>
> But the overriding question is what determines a phrase you're
> interested in? Is it
> a list or is there some heuristic you want to apply?
>
> Or could you just recognize them at query time and make them into a
> literal phrase
> (i.e. with quotation marks)?
>
> Best
> Erick
>
> On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
>  wrote:
> > All,
> >
> > I am at a bit of a loss here so any help would be greatly appreciated. I
> am
> > using the DIH to grab data from a DB. The field that I am most interested
> in
> > has anywhere from 1 word to several paragraphs worth of free text. What I
> > would really like to do is pull out phrases like "Joe's coffee shop"
> rather
> > than the 3 individual words. I have tried the KeywordTokenizerFactory and
> > that does seem to do what I want it to do but it is not actually
> tokenizing
> > anything so it does what I want it to for the most part but it's not
> > creating the tokens that I need for further analysis in apps like Mahout.
> >
> > We can play with the combination of tokenizers and filters all day long
> and
> > see what the results are after a quick reindex. I typically just view
> them
> > in Solritas as facets which may be the problem for me too. Does anyone
> have
> > an example fieldType they can share with me that shows how to
> > extract phrases if they are there from the data I described earlier. Am I
> > even going about this the right way? I am using today's trunk build of
> Solr
> > and here is what I have munged together this morning.
> >
> > <!-- element names inferred; the original tags were lost in the archive -->
> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
> >     autoGeneratePhraseQueries="true">
> >   <analyzer>
> >     <charFilter class="solr.MappingCharFilterFactory"
> >         mapping="mapping-ISOLatin1Accent.txt"/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >         words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.ShingleFilterFactory"
> >         outputUnigrams="true" outputUnigramIfNoNgram="false"/>
> >     <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
> >   </analyzer>
> > </fieldType>
> >
> > Thanks,
> > Adam
> >
>


Re: [Free Text] Field Tokenizing

2011-06-09 Thread Erick Erickson
The problem here is that none of the built-in filters or tokenizers
have a prayer
of recognizing what #you# think are phrases, since it'll be unique to
your situation.

If you have a list of phrases you care about, you could substitute a
single token
for the phrases you care about...

But the overriding question is what determines a phrase you're
interested in? Is it
a list or is there some heuristic you want to apply?

Or could you just recognize them at query time and make them into a
literal phrase
(i.e. with quotation marks)?

Best
Erick

On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
 wrote:
> All,
>
> I am at a bit of a loss here so any help would be greatly appreciated. I am
> using the DIH to grab data from a DB. The field that I am most interested in
> has anywhere from 1 word to several paragraphs worth of free text. What I
> would really like to do is pull out phrases like "Joe's coffee shop" rather
> than the 3 individual words. I have tried the KeywordTokenizerFactory and
> that does seem to do what I want it to do but it is not actually tokenizing
> anything so it does what I want it to for the most part but it's not
> creating the tokens that I need for further analysis in apps like Mahout.
>
> We can play with the combination of tokenizers and filters all day long and
> see what the results are after a quick reindex. I typically just view them
> in Solritas as facets which may be the problem for me too. Does anyone have
> an example fieldType they can share with me that shows how to
> extract phrases if they are there from the data I described earlier. Am I
> even going about this the right way? I am using today's trunk build of Solr
> and here is what I have munged together this morning.
>
> <!-- element names inferred; the original tags were lost in the archive -->
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
>     autoGeneratePhraseQueries="true">
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory"
>         mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.ShingleFilterFactory"
>         outputUnigrams="true" outputUnigramIfNoNgram="false"/>
>     <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
>   </analyzer>
> </fieldType>
>
> Thanks,
> Adam
>


Re: Indexing data from multiple datasources

2011-06-09 Thread Erick Erickson
Hmmm, when you say you use Tika, are you using some custom Java code? Because
if you are, the best thing to do is query your database at that point
and add whatever information
you need to the document.

If you're using DIH to do the crawl, consider implementing a
Transformer to do the database
querying and modify the document as necessary This is pretty
simple to do, we can
chat a bit more depending on whether either approach makes sense.

Best
Erick



On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  wrote:
> Hello all,
>
> I have checked the forums to see if it is possible to create and index from 
> multiple datasources. I have found references to SOLR-1358, but I don't think 
> this fits my scenario. In all, we have an application where we upload files. 
> On the file upload, I use the Tika extract handler to save metadata from the 
> file (_attr, literal values, etc..). We also have a database which has 
> information on the uploaded files, like the category, type, etc.. I would 
> like to update the index to include this information from the db in the index 
> for each document. If I run a dataimporthandler after the extract phase I am 
> afraid that by updating the doc in the index by its id will just cause that I 
> overwrite the old information with the info from the DB (what I understand is 
> that Solr updates its index by ID by deleting first then recreating the info).
>
> Anyone have any pointers, is there a clean way to do this, or must I find a 
> way to pass the db metadata to the extract handler and save it as literal 
> fields?
>
> Thanks in advance
>
> Greg
>


Re: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Koji Sekiguchi

(11/06/10 0:14), Burton-West, Tom wrote:

Hi Koji,


Thank you for your reply.


It is a feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery and
DisjunctionMaxQuery,
and queries constructed from those queries.


Sorry, I'm not sure I understand.  Are you saying that FVH supports MultiTerm 
highlighting?


Tom,

I'm sorry but FVH doesn't cover MultiTermQuery.

koji
--
http://www.rondhuit.com/en/


Re: ExtractingRequestHandler - renaming tika generated fields

2011-06-09 Thread Jan Høydahl
One solution to this problem is to change the order of field operations
(http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations)
to first do fmap.*= processing, then add the fields from literal.*=. Why would
anyone want to rename a field they have just explicitly named anyway?

Another solution that would work for me is an option to let ALL Tika-generated
fields be prefixed, e.g. tprefix=tika_. But I need the Extracting handler to output
to fields which do not exist in schema.xml. This is because later in the
UpdateChain I do field choosing and renaming in another UpdateProcessor, so the
field names coming from ExtractingHandler are only temporary and will not be
sent to Solr. Thus, an option to skip the schema check would be useful, perhaps
in the form of a whitelist for uprefix,
&uprefix.whitelist=fielda,other-non-existing-field, causing uprefix not to rename
those.
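
To make the collision concrete: with today's ordering, a request like

/update/extract?literal.title=CMS-Title&fmap.title=tika_title

renames the trusted CMS title to tika_title right along with Tika's own
title, because fmap.* is applied after the literal.* fields have been added.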

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9. juni 2011, at 11.26, Jan Høydahl wrote:

> Hi,
> 
> I post a PDF from a CMS client, which has metadata about the document. One of 
> those metadata fields is the title. I trust the title of the CMS more than the title 
> extracted from the PDF, but I cannot find a way to both send 
> &literal.title=CMS-Title as well as changing the name of the title field 
> generated by Tika/SolrCell. If I do fmap.title=tika_title then my 
> literal.title also also changes name. Any ideas?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
> 



RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Burton-West, Tom
Hi Koji,


Thank you for your reply.

>> It is a feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery
>> and DisjunctionMaxQuery,
>> and queries constructed from those queries.

Sorry, I'm not sure I understand.  Are you saying that FVH supports MultiTerm 
highlighting?  

Tom



Re: [Mahout] Integration with Solr

2011-06-09 Thread Adam Estrada
Thanks for the reply, Tommaso! I would like to see tighter integration like
in the way Nutch integrates with Solr. There is a single param that you set
which points to the Solr instance. My interest in Mahout is with its
ability to handle large data and find frequency, co-location of data,
clustering, etc... All the algorithms that are in the core build are great
and I am just now wrapping my head around how to use them all.

Adam

On Thu, Jun 9, 2011 at 10:33 AM, Tommaso Teofili
wrote:

> Hello Adam,
> I've managed to create a small POC of integrating Mahout with Solr for a
> clustering task; do you want to use it for clustering only or possibly for
> other purposes/algorithms?
> More generally speaking, I think it'd be nice if Solr could be extended
> with
> a proper API for integrating clustering engines in it so that one can plug
> and exchange engines flawlessly (just need an Adapter).
> Regards,
> Tommaso
>
> 2011/6/9 Adam Estrada 
>
> > Has anyone integrated Mahout with Solr? I know that Carrot2 is part of
> the
> > core build but the docs say that it's not very good for very large
> indexes.
> > Anyone have thoughts on this?
> >
> > Thanks,
> > Adam
> >
>


[Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
All,

I am at a bit of a loss here so any help would be greatly appreciated. I am
using the DIH to grab data from a DB. The field that I am most interested in
has anywhere from 1 word to several paragraphs worth of free text. What I
would really like to do is pull out phrases like "Joe's coffee shop" rather
than the 3 individual words. I have tried the KeywordTokenizerFactory and
that does seem to do what I want it to do but it is not actually tokenizing
anything so it does what I want it to for the most part but it's not
creating the tokens that I need for further analysis in apps like Mahout.

We can play with the combination of tokenizers and filters all day long and
see what the results are after a quick reindex. I typically just view them
in Solritas as facets which may be the problem for me too. Does anyone have
an example fieldType they can share with me that shows how to
extract phrases if they are there from the data I described earlier. Am I
even going about this the right way? I am using today's trunk build of Solr
and here is what I have munged together this morning.

<!-- element names inferred; the original tags were lost in the archive -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
    autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
        mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory"
        outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Thanks,
Adam


Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
Hello all,

I have checked the forums to see if it is possible to create and index from 
multiple datasources. I have found references to SOLR-1358, but I don't think 
this fits my scenario. In all, we have an application where we upload files. On 
the file upload, I use the Tika extract handler to save metadata from the file 
(_attr, literal values, etc..). We also have a database which has information 
on the uploaded files, like the category, type, etc.. I would like to update 
the index to include this information from the db in the index for each 
document. If I run a dataimporthandler after the extract phase I am afraid that 
updating the doc in the index by its id will just overwrite the
old information with the info from the DB (what I understand is that Solr 
updates its index by ID by deleting first then recreating the info).

Anyone have any pointers, is there a clean way to do this, or must I find a way 
to pass the db metadata to the extract handler and save it as literal fields?

Thanks in advance

Greg


Re: [Mahout] Integration with Solr

2011-06-09 Thread Tommaso Teofili
Hello Adam,
I've managed to create a small POC of integrating Mahout with Solr for a
clustering task; do you want to use it for clustering only or possibly for
other purposes/algorithms?
More generally speaking, I think it'd be nice if Solr could be extended with
a proper API for integrating clustering engines in it so that one can plug
and exchange engines flawlessly (just need an Adapter).
Regards,
Tommaso

2011/6/9 Adam Estrada 

> Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the
> core build but the docs say that it's not very good for very large indexes.
> Anyone have thoughts on this?
>
> Thanks,
> Adam
>


Re: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Koji Sekiguchi

(11/06/09 4:24), Burton-West, Tom wrote:

We are trying to implement highlighting for wildcard (MultiTerm) queries.  This 
seems to work fine with the regular highlighter but when we try to use the 
fastVectorHighlighter we don't see any results in the  highlighting section of 
the response.  Appended below are the parameters we are using.


It is a feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery and
DisjunctionMaxQuery,
and queries constructed from those queries.

koji
--
http://www.rondhuit.com/en/


Re: Edismax sorting help

2011-06-09 Thread Denis Kuzmenok
Your solution seems to work fine; not perfect, but much better than
mine :)
Thanks!

>> If I do a query like "Samsung" I want to see the most relevant results
>> with isflag:true and bigger popularity first, but if I do a query like "Nokia
>> 6500" and there is isflag:false, then it should be higher because of the
>> exact match. I tried different combinations, but didn't find one that
>> suits me. I just got isflag/popularity sorting working, or
>> isflag/relevancy sorting.

> Multiplicative boosts tend to be more stable...

> Perhaps try replacing
>   bf=isflag sqrt(popularity)
> with
>   bq=isflag:true^10  // vary the boost to change how much
> isflag counts vs the relevancy score of the main query
>   boost=sqrt(popularity)  // this will multiply the result by
> sqrt(popularity)... assumes that every document has a non-zero
> popularity

> You could get more creative in trunk where booleans have better
> support in function queries.

> -Yonik
> http://www.lucidimagination.com






Re: Solr monitoring: Newrelic

2011-06-09 Thread Ken Krugler
It sounds like "roySolr" is running embedded Jetty, launching solr using the 
start.jar

If so, then there's no app container where Newrelic can be installed.

-- Ken

On Jun 9, 2011, at 2:28am, Sujatha Arun wrote:

> Try the RPM support accessed from the account support page, giving all
> details; they are very helpful.
> 
> Regards
> Sujatha
> 
> On Thu, Jun 9, 2011 at 2:33 PM, roySolr  wrote:
> 
>> Yes, that's the problem. There is no jetty folder.
>> I have tried the example/lib directory; it's not working. There is no jetty
>> war file, only
>> jetty-***.jar files
>> 
>> Same error, could not locate a jetty instance.
>> 
>> 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: Edismax sorting help

2011-06-09 Thread Yonik Seeley
2011/6/9 Denis Kuzmenok :
> Hi, everyone.
>
> I have fields:
> text fields: name, title, text
> boolean field: isflag (true / false)
> int field: popularity (0 to 9)
>
> Now I do this query:
> defType=edismax
> start=0
> rows=20
> fl=id,name
> q=lg optimus
> fq=
> qf=name^3 title text^0.3
> sort=score desc
> pf=name
> bf=isflag sqrt(popularity)
> mm=100%
> debugQuery=on
>
>
> If I do a query like "Samsung" I want to see the most relevant results
> with isflag:true and bigger popularity first, but if I do a query like "Nokia
> 6500" and there is isflag:false, then it should be higher because of the
> exact match. I tried different combinations, but didn't find one that
> suits me. I just got isflag/popularity sorting working, or
> isflag/relevancy sorting.

Multiplicative boosts tend to be more stable...

Perhaps try replacing
  bf=isflag sqrt(popularity)
with
  bq=isflag:true^10  // vary the boost to change how much
isflag counts vs the relevancy score of the main query
  boost=sqrt(popularity)  // this will multiply the result by
sqrt(popularity)... assumes that every document has a non-zero
popularity
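
Putting that together with the original parameters, the request would look
something like this (assuming, as noted, every document has a non-zero
popularity):

defType=edismax
q=lg optimus
qf=name^3 title text^0.3
pf=name
mm=100%
bq=isflag:true^10
boost=sqrt(popularity)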

You could get more creative in trunk where booleans have better
support in function queries.

-Yonik
http://www.lucidimagination.com


Re: how to Index and Search non-Eglish Text in solr

2011-06-09 Thread Erick Erickson
No, you'd have to create multiple fieldTypes, one for each language
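
For example, a sketch (field and type names made up; solr.CJKTokenizerFactory
is one option for Chinese/Japanese text):

<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_zh" type="text_zh" indexed="true" stored="true"/>

<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>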

Best
Erick

On Thu, Jun 9, 2011 at 5:26 AM, Mohammad Shariq  wrote:
> Can I specify multiple languages in the filter tag in schema.xml? Like below:
>
> <!-- element names inferred; the original tags were lost in the archive -->
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>         generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>         catenateAll="0" splitOnCaseChange="1"/>
>     <!-- ... one SnowballPorterFilterFactory per language? ... -->
>     <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
>   </analyzer>
> </fieldType>
>
> On 8 June 2011 18:47, Erick Erickson  wrote:
>
>> This page is a handy reference for individual languages...
>> http://wiki.apache.org/solr/LanguageAnalysis
>>
>> But the usual approach, especially for Chinese/Japanese/Korean
>> (CJK) is to index the content in different fields with language-specific
>> analyzers then spread your search across the language-specific
>> fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
>> particularly give "surprising" results if you put words from different
>> languages in the same field.
>>
>> Best
>> Erick
>>
>> On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq 
>> wrote:
>> > Hi,
>> > I had setup solr( solr-1.4 on Ubuntu 10.10) for indexing news articles in
>> > English, but my requirement extend to index the news of other languages
>> too.
>> >
>> > This is how my schema looks:
>> > <field name="text" type="text" indexed="true" stored="true" required="false"/>
>> >
>> > And the "text" field type in schema.xml looks like:
>> >
>> > <!-- element names inferred; the original tags were lost in the archive -->
>> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> >   <analyzer type="index">
>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >         words="stopwords.txt" enablePositionIncrements="true"/>
>> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> >         generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >         catenateAll="0" splitOnCaseChange="1"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
>> >   </analyzer>
>> >   <analyzer type="query">
>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >         ignoreCase="true" expand="true"/>
>> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >         words="stopwords.txt" enablePositionIncrements="true"/>
>> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> >         generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>> >         catenateAll="0" splitOnCaseChange="1"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> >
>> > My problem is:
>> > Now I want to index news articles in other languages too, e.g.
>> > Chinese, Japanese.
>> > How can I modify my text field so that I can index the news in other
>> > languages too and make it searchable?
>> >
>> > Thanks
>> > Shariq
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> Thanks and Regards
> Mohammad Shariq
>


Re: how can I return function results in my query?

2011-06-09 Thread Ahmet Arslan
> I want to be able to run a
> query  like idf(text, 'term') and have that data
> returned with my search results.  I've searched the
> docs, but I'm unable to
> find how to do it.  Is this possible and how can I do
> that?

http://wiki.apache.org/solr/FunctionQuery#idf


how can I return function results in my query?

2011-06-09 Thread Jason Toy
I want to be able to run a query  like idf(text, 'term') and have that data
returned with my search results.  I've searched the docs, but I'm unable to
find how to do it.  Is this possible and how can I do that?


RE: Tokenising based on known words?

2011-06-09 Thread Steven A Rowe
Hi Mark,

Are you familiar with shingles aka token n-grams?

http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html

Use the empty string for the tokenSeparator to get wordstogether style tokens 
in your index. 

I think you'll want to apply this filter only at index-time, since the users 
will supply the shingles all by themselves :).
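
For example, a minimal sketch (type name made up; tokenSeparator="" is the
key bit, and the shingling happens only on the index side):

<fieldType name="text_concat" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
        tokenSeparator="" outputUnigrams="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>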

Steve

> -Original Message-
> From: Mark Mandel [mailto:mark.man...@gmail.com]
> Sent: Thursday, June 09, 2011 8:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Tokenising based on known words?
> 
> Synonyms really wouldn't work for every possible combination of words in
> our
> index.
> 
> Thanks for the idea though.
> 
> Mark
> 
> On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty  wrote:
> 
> > On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel 
> wrote:
> > > Not sure if this possible, but figured I would ask the question.
> > >
> > > Basically, we have some users who do some pretty ridiculous things
> ;o)
> > >
> > > Rather than writing "red jacket", they write "redjacket", which
> obviously
> > > returns no results.
> > [...]
> >
> > Have you tried using synonyms,
> >
> >
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> > It seems like they should fit your use case.
> >
> > Regards,
> > Gora
> >
> 
> 
> 
> --
> E: mark.man...@gmail.com
> T: http://www.twitter.com/neurotic
> W: www.compoundtheory.com
> 
> cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
> http://www.cfobjective.com.au
> 
> Hands-on ColdFusion ORM Training
> www.ColdFusionOrmTraining.com


Re: [Mahout] Integration with Solr

2011-06-09 Thread Tomás Fernández Löbbe
I don't know much about it, but I know Grant Ingersoll posted about that:
http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/

On Thu, Jun 9, 2011 at 9:24 AM, Adam Estrada
wrote:

> Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the
> core build but the docs say that it's not very good for very large indexes.
> Anyone have thoughts on this?
>
> Thanks,
> Adam
>


Re: tika integration exception and other related queries

2011-06-09 Thread Gary Taylor

Naveen,

Not sure our requirement matches yours, but one of the things we index 
is a "comment" item that can have one or more files attached to it.  To 
index the whole thing as a single Solr document we create a zipfile 
containing a file with the comment details in it and any additional 
attached files.  This is submitted to Solr as a TEXT field in an XML 
doc, along with other meta-data fields from the comment.  In our schema 
the TEXT field is indexed but not stored, so when we search and get a 
match back it doesn't contain all of the contents from the attached 
files etc., only the stored fields in our schema.   Admittedly, the user 
can therefore get back a "comment" match with no indication as to WHERE 
the match occurred (ie. was it in the meta-data or the contents of the 
attached files), but at the moment we're only interested in getting 
appropriate matches, not explaining where the match is.


Hope that helps.

Kind regards,
Gary.



On 09/06/2011 03:00, Naveen Gupta wrote:

Hi Gary,

It started working... though I did not test Zip files, but for rar
files it is working fine.

The only thing I wanted to do is index the metadata (text mapped to
content), not store the data. Also, in the search results I want to filter
the content, and that started working fine. I don't want to show the extracted
content to the end user, since the way it extracts the information is not
very helpful to the user. Although we can apply a few of the analyzers and
filters to remove the unnecessary tags, the information would still not be
of much help. Looking for your opinion: what did you do to filter
out the content, or are you showing the extracted content to the end user?

Even in the case where we are showing the text part to the end user, how can I
limit the number of characters while querying the search results? Is there any
feature where we can achieve this... the concept of a snippet kind of thing?

Thanks
Naveen

On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylor  wrote:


Naveen,

For indexing Zip files with Tika, take a look at the following thread :


http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html

I got it to work with the 3.1 source and a couple of patches.

Hope this helps.

Regards,
Gary.



On 08/06/2011 04:12, Naveen Gupta wrote:


Hi, can somebody answer this...

3. Can somebody give me an idea how to index a zip file?

1. While sending docx, we are getting the following error.





Edismax sorting help

2011-06-09 Thread Denis Kuzmenok
Hi, everyone.

I have fields:
text fields: name, title, text
boolean field: isflag (true / false)
int field: popularity (0 to 9)

Now I do this query:
defType=edismax
start=0
rows=20
fl=id,name
q=lg optimus
fq=
qf=name^3 title text^0.3
sort=score desc
pf=name
bf=isflag sqrt(popularity)
mm=100%
debugQuery=on


If I do a query like "Samsung" I want to see the most relevant results
with isflag:true and bigger popularity first, but if I do a query like "Nokia
6500" and there is isflag:false, then it should be higher because of the
exact match. I tried different combinations, but didn't find one that
suits me. I just got isflag/popularity sorting working, or
isflag/relevancy sorting.



Re: Tokenising based on known words?

2011-06-09 Thread Mark Mandel
Synonyms really wouldn't work for every possible combination of words in our
index.

Thanks for the idea though.

Mark

On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty  wrote:

> On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel  wrote:
> > Not sure if this possible, but figured I would ask the question.
> >
> > Basically, we have some users who do some pretty ridiculous things ;o)
> >
> > Rather than writing "red jacket", they write "redjacket", which obviously
> > returns no results.
> [...]
>
> Have you tried using synonyms,
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> It seems like they should fit your use case.
>
> Regards,
> Gora
>



-- 
E: mark.man...@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com


[Mahout] Integration with Solr

2011-06-09 Thread Adam Estrada
Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the
core build but the docs say that it's not very good for very large indexes.
Anyone have thoughts on this?

Thanks,
Adam


Re: London open source search social - 13th June

2011-06-09 Thread Richard Marr
Just a quick reminder that we're meeting on Monday. Come along if you're
around.


On 1 June 2011 13:27, Richard Marr  wrote:

> Hi guys,
>
> Just to let you know we're meeting up to talk all-things-search on Monday
> 13th June. There's usually a good mix of backgrounds and experience levels
> so if you're free and in the London area then it'd be good to see you there.
>
> Details:
> 7pm - The Elgin - 96 Ladbroke Grove
> http://www.meetup.com/london-search-social/events/20387881/
>
> 
>
> Greetings search geeks!
>
> We've booked the next meetup for the 13th June. As usual, the plan is to
> meet up and geek out over a friendly beer.
>
> I know my co-organiser René has been working on some interesting search
> projects, and I've recently left Empora to work on my own project so by June
> I should hopefully have some war stories about using @elasticsearch in
> production. The format is completely open though so please bring your own
> topics if you've got them.
>
> Hope to see you there!
>
> --
> Richard Marr


Re: Boost or sort a query with range values

2011-06-09 Thread Jan Høydahl
Btw. your example is a simple boolean query, and this will also work:
&bq=(myfield1:0 AND myfield2:1)^100.0

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9. juni 2011, at 13.31, Jan Høydahl wrote:

> Check the new if() function in Trunk, SOLR-2136. You could then use it in 
> &bf= or &boost=
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
> 
> On 9. juni 2011, at 13.05, jlefebvre wrote:
> 
>> Thanks, that works.
>> 
>> Another question: how do I express a condition in &bq?
>> 
>> Something like &bq=iif(myfield1 = 0 AND myfield2 = 1;1;0)
>> 
>> thanks
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043406.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Re: Boost or sort a query with range values

2011-06-09 Thread Jan Høydahl
Check the new if() function in Trunk, SOLR-2136. You could then use it in &bf= 
or &boost=

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9. juni 2011, at 13.05, jlefebvre wrote:

> Thanks, that works.
> 
> Another question: how do I express a condition in &bq?
> 
> Something like &bq=iif(myfield1 = 0 AND myfield2 = 1;1;0)
> 
> thanks
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043406.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Boost or sort a query with range values

2011-06-09 Thread jlefebvre
Thanks, that works.

Another question: how do I express a condition in &bq?

Something like &bq=iif(myfield1 = 0 AND myfield2 = 1;1;0)

thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043406.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boost or sort a query with range values

2011-06-09 Thread lee carroll
[* TO *]^5
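
Applied to the field in the question, that bracket range syntax (note the
upper-case TO) would be:

&bq=myfield:[-1 TO 1]^5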

On 9 June 2011 11:31, jlefebvre  wrote:
> Hello
>
> I'm trying to boost a query with range values but I can't find the correct
> syntax:
> this is OK: &bq=myfield:"-1"^5, but I want to do something like this:
> &bq=myfield:"-1 to 1"^5
>
> Boost value from -1 to 1
>
> thanks
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043328.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Boost or sort a query with range values

2011-06-09 Thread jlefebvre
Hello

I'm trying to boost a query with range values but I can't find the correct
syntax:
this is OK: &bq=myfield:"-1"^5, but I want to do something like this:
&bq=myfield:"-1 to 1"^5

Boost value from -1 to 1

thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-or-sort-a-query-with-range-values-tp3043328p3043328.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tokenising based on known words?

2011-06-09 Thread lee carroll
We've played with HyphenationCompoundWordTokenFilterFactory; it works
better than maintaining a word dictionary for splitting (although we ended
up not using it, for reasons I can't recall).

see

http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html
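
For reference, a minimal configuration sketch; hyphenation.xml (a TeX-style
hyphenation grammar) and words.txt are illustrative file names, and the
size attributes are the usual defaults rather than tuned values:

<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="hyphenation.xml"
        dictionary="words.txt"
        minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
        onlyLongestMatch="true"/>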



On 9 June 2011 06:42, Gora Mohanty  wrote:
> On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel  wrote:
>> Not sure if this possible, but figured I would ask the question.
>>
>> Basically, we have some users who do some pretty ridiculous things ;o)
>>
>> Rather than writing "red jacket", they write "redjacket", which obviously
>> returns no results.
> [...]
>
> Have you tried using synonyms,
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> It seems like they should fit your use case.
>
> Regards,
> Gora
>


Re: AW: How to deal with many files using solr external file field

2011-06-09 Thread Martin Grotzke
Hi,

as I'm also involved in this issue (on Sven's side), I created a patch
that replaces the float array with a map storing scores by doc ID, so it
contains only as many entries as the external scoring file has lines, and
no more.
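
A rough sketch of the idea (my simplification for illustration, not the
actual patch code):

import java.util.HashMap;
import java.util.Map;

/** Sparse per-document scores instead of a dense float[maxDoc]. */
class SparseExternalScores {
    // Dense approach: float[maxDoc] costs 4 bytes per indexed document
    // (about 34 MB for 8.5 million docs), even for docs without a score.
    // Sparse approach: one entry per line of the external score file.
    private final Map<Integer, Float> scoreByDoc = new HashMap<Integer, Float>();
    private final float defaultScore;

    SparseExternalScores(float defaultScore) {
        this.defaultScore = defaultScore;
    }

    void put(int docId, float score) {
        scoreByDoc.put(docId, score);
    }

    float score(int docId) {
        Float s = scoreByDoc.get(docId);
        return (s != null) ? s : defaultScore;
    }
}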

I created an issue for this: https://issues.apache.org/jira/browse/SOLR-2583

It would be great if someone could have a look at it and comment.

Thanks for your feedback,
cheers,
Martin


On 06/08/2011 12:22 PM, Bohnsack, Sven wrote:
> Hi,
> 
> I could not provide a stack trace and IMHO it won't provide some useful 
> information. But we've made a good progress in the analysis.
> 
> We took a deeper look at what happened, when an "external-file-field"-Request 
> is sent to SOLR:
> 
> * SOLR looks if there is a file for the requested query, e.g. "trousers"
> * If so, then SOLR loads the "trousers"-file and generates a HashMap-Entry 
> consisting of a FileFloatSource-Object and a FloatArray with the size of the 
> number of documents in the SOLR-index. Every document matched by the query 
> gains the score-value, which is provided in the external-score-file. For 
> every(!) other document SOLR writes a zero in that FloatArray
> * if SOLR does not find a file for the query-Request, then SOLR still 
> generates a HashMapEntry with score zero for every document
> 
> In our case we have about 8.5 million documents in our index, and one of 
> those arrays occupies about 34MB of heap space. With e.g. 100 different 
> queries using external file fields to sort the results, SOLR occupies about 
> 3.4GB of heap space.
> 
> The problem might be the use of WeakHashMap [1]: entries are only removed 
> once their keys are no longer referenced, so as long as the keys stay in use 
> the cached arrays cannot be garbage-collected.
> 
> 
> What do you think could be a possible solution for this whole problem? 
> (except from "don't use external file fields" ;)
> 
> 
> Regards
> Sven
> 
> 
> [1]: "A hashtable-based Map implementation with weak keys. An entry in a 
> WeakHashMap will automatically be removed when its key is no longer in 
> ordinary use. More precisely, the presence of a mapping for a given key will 
> not prevent the key from being discarded by the garbage collector, that is, 
> made finalizable, finalized, and then reclaimed. When a key has been 
> discarded its entry is effectively removed from the map, so this class 
> behaves somewhat differently than other Map implementations."
> 
> -----Original Message-----
> From: mtnes...@gmail.com [mailto:mtnes...@gmail.com] On Behalf Of Simon 
> Rosenthal
> Sent: Wednesday, 8 June 2011 03:56
> To: solr-user@lucene.apache.org
> Subject: Re: How to deal with many files using solr external file field
> 
> Can you provide a stack trace for the OOM eexception ?
> 
> On Tue, Jun 7, 2011 at 4:25 PM, Bohnsack, Sven
> wrote:
> 
>> Hi all,
>>
>> we're using solr 1.4 and external file fields ([1]) for sorting our
>> search results. We have about 40,000 terms for which we use this sorting
>> option.
>> Currently we're running into massive OutOfMemory problems and are not
>> quite sure what the matter is. It seems that the garbage collector stops
>> working or some processes are going wild; solr starts to allocate more
>> and more RAM until we experience this OutOfMemory exception.
>>
>>
>> We noticed the following:
>>
>> For some terms one can see in the solr log that
>> java.io.FileNotFoundExceptions appear when solr tries to load an external
>> file for a term for which there is no such file, e.g. solr tries to load
>> the external score file for "trousers" but there is none in the
>> /solr/data folder.
>>
>> Question: is it possible that those exceptions are responsible for the
>> OutOfMemory problem, or could it be due to the large(?) number of 40k terms
>> for which we want to sort the results via external file fields?
>>
>> I'm looking forward to your answers, suggestions and ideas :)
>>
>>
>> Regards
>> Sven
>>
>>
>> [1]:
>> http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
>>

-- 
Martin Grotzke
http://twitter.com/martin_grotzke





Re: Solr monitoring: Newrelic

2011-06-09 Thread Sujatha Arun
Try the RPM support, accessed from the account support page, giving all the
details; they are very helpful.

Regards
Sujatha

On Thu, Jun 9, 2011 at 2:33 PM, roySolr  wrote:

> Yes, that's the problem. There is no jetty folder.
> I have tried the example/lib directory; it's not working. There is no jetty
> war file, only
> jetty-***.jar files
>
> Same error, could not locate a jetty instance.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3043080.html
>  Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: how to Index and Search non-Eglish Text in solr

2011-06-09 Thread Mohammad Shariq
Can I specify multiple languages in the filter tags in schema.xml, like below?

[schema.xml snippet stripped by the list archive]
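
A hedged sketch of the per-language-field approach Erick describes below;
all field and type names (text_en, text_cjk, title_en, title_zh) are
illustrative, not from the original mail:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_zh" type="text_cjk" indexed="true" stored="true"/>

Queries can then spread across the per-language fields, e.g. via
qf=title_en title_zh in a dismax handler.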

On 8 June 2011 18:47, Erick Erickson  wrote:

> This page is a handy reference for individual languages...
> http://wiki.apache.org/solr/LanguageAnalysis
>
> But the usual approach, especially for Chinese/Japanese/Korean
> (CJK) is to index the content in different fields with language-specific
> analyzers then spread your search across the language-specific
> fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
> particularly give "surprising" results if you put words from different
> languages in the same field.
>
> Best
> Erick
>
> On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq 
> wrote:
> > Hi,
> > I had set up solr (solr-1.4 on Ubuntu 10.10) for indexing news articles in
> > English, but my requirements extend to indexing news in other languages
> > too.
> >
> > This is how my schema looks:
> > <field name="text" type="text" ... required="false"/>
> >
> > And the "text" field type in schema.xml looks like:
> >
> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >             catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >             protected="protwords.txt"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >             ignoreCase="true" expand="true"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> >             catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >             protected="protwords.txt"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > My Problem is :
> > Now I want to index news articles in other languages too, e.g.
> > Chinese and Japanese.
> > How can I modify my text field so that I can index news in other
> > languages too and make it searchable?
> >
> > Thanks
> > Shariq
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>



-- 
Thanks and Regards
Mohammad Shariq


ExtractingRequestHandler - renaming tika generated fields

2011-06-09 Thread Jan Høydahl
Hi,

I post a PDF from a CMS client, which has metadata about the document. One of
those metadata fields is the title. I trust the title from the CMS more than
the title extracted from the PDF, but I cannot find a way to both send
&literal.title=CMS-Title and change the name of the title field generated by
Tika/SolrCell. If I do fmap.title=tika_title then my literal.title also
changes name. Any ideas?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
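
One possible (untested) workaround, assuming fmap is applied to literal
field names exactly as described above: send the CMS title under a
different, illustrative name (cms_title) and map it back, e.g.

curl "http://localhost:8983/solr/update/extract?fmap.title=tika_title&literal.cms_title=CMS-Title&fmap.cms_title=title" -F "file=@document.pdf"

fmap.cms_title=title would then route the literal into the title field,
while Tika's extracted title lands in tika_title.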



Re: Displaying highlights in formatted HTML document

2011-06-09 Thread Ahmet Arslan
> iorixxx, could you please explain a bit more your solution,
> because I don't
> see how your solution could give an "exact highlighting", I
> mean with the
> different fields analysis for each fields.

It does not work with your use case (e.g. different synonyms applied to
different parts of the html/xml, etc.).



Re: Solr monitoring: Newrelic

2011-06-09 Thread roySolr
Yes, that's the problem. There is no jetty folder.
I have tried the example/lib directory; it's not working. There is no jetty
war file, only
jetty-***.jar files

Same error, could not locate a jetty instance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3043080.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr monitoring: Newrelic

2011-06-09 Thread Sujatha Arun
There is no jetty folder in the standard package, but the jetty war file
is under the example/lib folder, so I guess that is where you need to put
the newrelic folder.

Regards
Sujatha

On Thu, Jun 9, 2011 at 2:03 PM, roySolr  wrote:

>
> I use Jetty; it's standard in the solr package. Where can I find
> the "jetty" folder?
>
> Then I can run this command:
> java -jar newrelic.jar install
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3042981.html
>  Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Displaying highlights in formatted HTML document

2011-06-09 Thread lboutros
Hi Bryan,

How do you index your html files? I mean, do you create fields for the
different parts of your document (for different stop word lists, stemming,
etc.)? With DIH, solrj, or something else?

iorixxx, could you please explain your solution a bit more, because I don't
see how it could give "exact highlighting" given the different analysis
applied to each field.

This week I developed a new highlighter module which transfers the field
highlighting back to the original document (xml in my case); I use payloads
to store the offsets and lengths of the fields in the index. This way, I use
the right analyzers to do the highlighting correctly and then replace the
different field parts in the document with the highlighted parts. It is not
finished yet, but I already have some good results.
This is a client request too. Let me know if iorixxx's solution is not
enough for your particular use case.

Ludovic.



-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Displaying-highlights-in-formatted-HTML-document-tp3041909p3042983.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr monitoring: Newrelic

2011-06-09 Thread roySolr

I use Jetty; it's standard in the solr package. Where can I find
the "jetty" folder?

Then I can run this command:
java -jar newrelic.jar install

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3042981.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr monitoring: Newrelic

2011-06-09 Thread Sujatha Arun
You need to install the newrelic folder under the "tomcat" folder, if your
app server is Tomcat.

Then, from the command line, you need to run the install command given on
the New Relic site from your newrelic folder.

Once this is done, restart the app server and, if all went well, you should
see a log file created under the newrelic folder.

Regards
Sujatha
On Thu, Jun 9, 2011 at 1:27 PM, roySolr  wrote:

> Hello,
>
> I found this tool to monitor solr querys, cache etc.
>
> http://newrelic.com/ http://newrelic.com/
>
> I have some problems with the installation of it. I get the following
> errors:
>
> Could not locate a Tomcat, Jetty or JBoss instance in /var/www/sites/royr
> Try re-running the install command from /newrelic.
> If that doesn't work, locate and edit the start script manually.
> Generated New Relic configuration file
> /var/www/sites/royr/newrelic/newrelic.yml
> * Install incomplete
>
> Does anybody have experience with Newrelic in combination with Solr?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3042889.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Solr monitoring: Newrelic

2011-06-09 Thread roySolr
Hello,

I found this tool to monitor solr querys, cache etc. 

http://newrelic.com/ http://newrelic.com/ 

I have some problems with the installation of it. I get the following
errors:

Could not locate a Tomcat, Jetty or JBoss instance in /var/www/sites/royr
Try re-running the install command from /newrelic.
If that doesn't work, locate and edit the start script manually.
Generated New Relic configuration file
/var/www/sites/royr/newrelic/newrelic.yml
* Install incomplete

Does anybody have experience with Newrelic in combination with Solr?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3042889.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple Values not getting Indexed

2011-06-09 Thread Bill Bell
You have to take the input and split it by something like "," to get it into
an array that is posted back to
Solr...

I believe others have suggested that?

On 6/8/11 10:14 PM, "Pawan Darira"  wrote:

>Hi
>
>I am trying to index 2 fields with multiple values, but it is only putting
>1 value in each and ignoring the rest of the values after the comma (,). I am
>fetching the query through DIH. It works fine if I have only 1 value in each
>of the 2 fields.
>
>E.g. Field1 - 150,178,461,151,310,306,305,179,137,162
>& Field2 - Chandigarh,Gurgaon,New
>Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others
>
>*Schema.xml*
>
>
>
>
>
>P.S. I tried multiValued=true but it was of no help.
>
>-- 
>Thanks,
>Pawan Darira




Re: Multiple Values not getting Indexed

2011-06-09 Thread Bill Bell
Is there a way to splitBy and trim the field after splitting?

I know I can do it with Javascript in DIH, but how about using the regex
parser?
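
One untested possibility: RegexTransformer's splitBy is itself a regular
expression, so the separator can swallow surrounding whitespace (the column
name here is illustrative):

<field column="field2" splitBy="\s*,\s*"/>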

On 6/9/11 1:18 AM, "Stefan Matheis"  wrote:

>Pawan,
>
>just separating multiple values by comma does not make them
>multi-value in solr-speak. But if you're already using DIH, you may
>try the http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
>to 'splitBy' the field and get the expected field-values
>
>Regards
>Stefan
>
>On Thu, Jun 9, 2011 at 6:14 AM, Pawan Darira 
>wrote:
>> Hi
>>
>> I am trying to index 2 fields with multiple values, but it is only putting
>> 1 value in each and ignoring the rest of the values after the comma (,). I am
>> fetching the query through DIH. It works fine if I have only 1 value in each
>> of the 2 fields.
>>
>> E.g. Field1 - 150,178,461,151,310,306,305,179,137,162
>> & Field2 - Chandigarh,Gurgaon,New
>> Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others
>>
>> *Schema.xml*
>>
>> 
>> 
>>
>>
>> P.S. I tried multiValued=true but it was of no help.
>>
>> --
>> Thanks,
>> Pawan Darira
>>




Re: Code for getting distinct facet counts across shards(Distributed Process).

2011-06-09 Thread Bill Bell
I have coded and tested this and it appears to work.

Are you having any problems?

On 6/9/11 12:35 AM, "rajini maski"  wrote:

> In solr 1.4.1, for getting the "distinct facet terms count" across shards:
>
>The piece of code added for getting the count of distinct facet terms across
>the distributed process is as follows:
>
>
>
>
>
>Class: FacetComponent.java
>
>Function: finishStage(ResponseBuilder rb)
>
>
>
>  for (DistribFieldFacet dff : fi.facets.values()) {
>    // just after this line of code:
>    else { // TODO: log error or throw exception?
>      counts = dff.getLexSorted();
>
>      int namedistint = 0;
>      namedistint = rb.req.getParams().getFieldInt(dff.getKey().toString(),
>          FacetParams.FACET_NAMEDISTINCT, 0);
>
>      if (namedistint == 0)
>        facet_fields.add(dff.getKey(), fieldCounts);
>
>      if (namedistint == 1)
>        facet_fields.add("numfacetTerms", counts.length);
>
>      if (namedistint == 2) {
>        NamedList resCount = new NamedList();
>        resCount.add("numfacetTerms", counts.length);
>        resCount.add("counts", fieldCounts);
>        facet_fields.add(dff.getKey(), resCount);
>      }
>
>Is this flow correct? I have worked through a few test cases and it worked
>fine, but I want to know if there are any bugs that could creep in here.
>(My concern is that this piece of code should not affect the rest of the
>logic.)
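
For reference, a request exercising this path might look like the following,
assuming the SOLR-2242 parameter name facet.numFacetTerms and illustrative
field/shard values:

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&f.category.facet.numFacetTerms=2&shards=host1:8983/solr,host2:8983/solr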
>
>
>
>
>*Code flow with comments for reference:*
>
>
> Function: finishStage(ResponseBuilder rb)
>
> // in this for loop,
> for (DistribFieldFacet dff : fi.facets.values()) {
>
>   // just after this line of code:
>   else { // TODO: log error or throw exception?
>     counts = dff.getLexSorted();
>
>     int namedistint = 0;  // default
>
>     // get the value of facet.numFacetTerms from the input query
>     namedistint = rb.req.getParams().getFieldInt(dff.getKey().toString(),
>         FacetParams.FACET_NAMEDISTINCT, 0);
>
>     // branch on facet.numFacetTerms == 0, 1 or 2
>
>     // get only the facet field counts
>     if (namedistint == 0) {
>       facet_fields.add(dff.getKey(), fieldCounts);
>     }
>
>     // get only the distinct facet term count
>     if (namedistint == 1) {
>       facet_fields.add("numfacetTerms", counts.length);
>     }
>
>     // get the facet field counts and the distinct term count
>     if (namedistint == 2) {
>       NamedList resCount = new NamedList();
>       resCount.add("numfacetTerms", counts.length);
>       resCount.add("counts", fieldCounts);
>       facet_fields.add(dff.getKey(), resCount);
>     }
>
>
>
>
>
>Regards,
>
>Rajani
>
>
>
>
>
>On Fri, May 27, 2011 at 1:14 PM, rajini maski 
>wrote:
>
>> No such issues. It integrated successfully with 1.4.1 and works across a
>> single index.
>>
>> With the f.2.facet.numFacetTerms=1 parameter it will give the distinct
>> count result.
>>
>> With the f.2.facet.numFacetTerms=2 parameter it will give the counts as
>> well as the results for facets.
>>
>> But this works only across a single index, not the distributed process.
>> The conditions you added in SimpleFacets.java ("if namedistinct count ==
>> int", the 0, 1 and 2 conditions): should they be added to the
>> distributed-process function to make it work across shards?
>>
>> Rajani
>>
>>
>>
>> On Fri, May 27, 2011 at 12:33 PM, Bill Bell  wrote:
>>
>>> I am pretty sure it does not yet support distributed shards..
>>>
>>> But the patch was written for 4.0... so there might be issues with
>>> running it on 1.4.1.
>>>
>>> On 5/26/11 11:08 PM, "rajini maski"  wrote:
>>>
>>> > The patch SOLR-2242 for getting the count of distinct facet terms
>>> > doesn't work for distributedProcess
>>> > (https://issues.apache.org/jira/browse/SOLR-2242)
>>> >
>>> > The error log says:
>>> >
>>> > HTTP ERROR 500
>>> > Problem accessing /solr/select. Reason:
>>> >
>>> > For input string: "numFacetTerms"
>>> >
>>> > java.lang.NumberFormatException: For input string: "numFacetTerms"
>>> > at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>> > at java.lang.Long.parseLong(Long.java:403)
>>> > at java.lang.Long.parseLong(Long.java:461)
>>> > at org.apache.solr.schema.TrieField.readableToIndexed(TrieField.java:331)
>>> > at org.apache.solr.schema.TrieField.toInternal(TrieField.java:344)
>>> > at org.apache.solr.handler.component.FacetComponent$DistribFieldFacet.add(FacetComponent.java:619)
>>> > at org.apache.solr.handler.component.FacetComponent.countFacets(FacetComponent.java:265)
>>> > at org.apache.solr.han

Re: Multiple Values not getting Indexed

2011-06-09 Thread Stefan Matheis
Pawan,

just separating multiple values by commas does not make them
multi-valued in solr-speak. But since you're already using DIH, you can
try the http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
to 'splitBy' the field and get the expected field values (see the sketch
below).
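
A hedged sketch of such a DIH entity (entity, query, and column names are
illustrative):

<entity name="doc" transformer="RegexTransformer"
        query="SELECT id, field1, field2 FROM items">
  <field column="field1" splitBy=","/>
  <field column="field2" splitBy=","/>
</entity>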

Regards
Stefan

On Thu, Jun 9, 2011 at 6:14 AM, Pawan Darira  wrote:
> Hi
>
> I am trying to index 2 fields with multiple values, but it is only putting
> 1 value in each and ignoring the rest of the values after the comma (,). I am
> fetching the query through DIH. It works fine if I have only 1 value in each of the 2 fields.
>
> E.g. Field1 - 150,178,461,151,310,306,305,179,137,162
> & Field2 - Chandigarh,Gurgaon,New
> Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others
>
> *Schema.xml*
>
> 
> 
>
>
> P.S. I tried multiValued=true but it was of no help.
>
> --
> Thanks,
> Pawan Darira
>


wrong index version of solr3.2?

2011-06-09 Thread Bernd Fehling


After switching to solr 3.2 and building a new index from scratch I ran 
check_index which reports:
Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1]

Why do I get FORMAT_3_1 and Lucene 3.1? Is anything wrong with my index?

from my schema.xml:
[snippet stripped by the list archive]

from my solrconfig.xml:
<luceneMatchVersion>LUCENE_32</luceneMatchVersion>

Regards,
Bernd