Re: Tokenize Sentence and Set Attribute

2013-05-08 Thread Edward Garrett
i find UpdateRequestProcessors (
http://wiki.apache.org/solr/UpdateRequestProcessor) a handy way to add and
remove NLP-related fields to a document as it is processed by Solr. this is
also how UIMA integrates with Solr (http://wiki.apache.org/solr/SolrUIMA).
you might want to take a look at UIMA as well.


On Mon, May 6, 2013 at 6:22 PM, Jack Krupansky j...@basetechnology.com wrote:

 Sounds like a very ambitious project. I'm sure you COULD do it in Solr,
 but not in very short order.

 Check out some discussion of simply searching within sentences:
 http://markmail.org/message/aoiq62a4mlo25zzk?q=apache#query:apache+page:1+mid:aoiq62a4mlo25zzk+state:results

 First, how do you expect to use/query the corpus?  In other words, what
 are your user requirements? They will determine what structure the Solr
 index, analysis chains, and custom search components will need.

 Also, check out the Solr OpenNLP wiki:
 http://wiki.apache.org/solr/OpenNLP

 And see LUCENE-2899: Add OpenNLP Analysis capabilities as a module:
 https://issues.apache.org/jira/browse/LUCENE-2899

 -- Jack Krupansky

 -Original Message- From: Rendy Bambang Junior
 Sent: Monday, May 06, 2013 11:41 AM
 To: solr-user@lucene.apache.org
 Subject: Tokenize Sentence and Set Attribute


 Hello,

 I am trying to use a part-of-speech tagger for Bahasa Indonesia to filter
 tokens in Solr.
 The tagger receives the word list of a sentence as input and returns a tag
 array.

 I think the process should be like this:
 - tokenize sentence
 - tokenize word
 - pass it into the tagger
 - set attribute using tagger output
 - pass it into a FilteringTokenFilter implementation
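
 The steps listed above can be tried outside Solr first, to check the logic
 before porting it to a Lucene analysis chain. What follows is a minimal
 Python sketch, not Solr or Lucene code. The regex sentence splitter, the
 `toy_tagger` function, its tiny lexicon, and the tag names are all
 hypothetical stand-ins for the real Indonesian tagger.

```python
import re

# Hypothetical lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {
    "saya": "PRON",
    "makan": "VERB",
    "nasi": "NOUN",
    "di": "PREP",
    "rumah": "NOUN",
}


def split_sentences(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Extract lowercased word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())


def toy_tagger(words):
    """Take the word list of one sentence and return a parallel list of tags."""
    return [TOY_LEXICON.get(word, "UNK") for word in words]


def filter_tokens(text, keep_tags):
    """Run the whole pipeline and keep only tokens whose tag is in keep_tags."""
    kept = []
    for sentence in split_sentences(text):
        words = split_words(sentence)
        tags = toy_tagger(words)  # one tagger call per sentence
        for word, tag in zip(words, tags):
            if tag in keep_tags:
                kept.append((word, tag))
    return kept
```

 In Lucene the per-token tag would presumably be carried on a token attribute
 set by an upstream filter, and a FilteringTokenFilter subclass would read it
 in its accept() method. The sentence-level buffering that the tagger needs is
 the part this sketch does not address.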

 Is it possible to do this in Solr/Lucene? If it is, how?

 I've read about a similar solution for Japanese, but since I don't
 understand Japanese, it didn't help much.

 --
 Regards,
 Rendy Bambang Junior
 Informatics Engineering '09
 Bandung Institute of Technology




-- 
edge


Re: indexing Text file in solr

2013-01-29 Thread Edward Garrett
i don't have experience with this but it looks like you could use, from DIH:

http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
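
if DIH turns out to be awkward for this, another route is to build the
documents client-side and send them to the update handler. a minimal Python
sketch of the "one line, one document" step follows. the field names `id` and
`text` and the `tweet-` id prefix are assumptions and would have to match your
schema; the actual HTTP post is left out, and whether your Solr version
accepts a plain JSON array of documents should be checked against its
documentation.

```python
import json


def lines_to_docs(lines, id_prefix="tweet"):
    """Turn each non-empty line into one document dict."""
    docs = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines
        # "id" and "text" are hypothetical field names.
        docs.append({"id": f"{id_prefix}-{number}", "text": text})
    return docs


def to_update_payload(docs):
    """Serialize documents as a JSON array, keeping Arabic text unescaped."""
    return json.dumps(docs, ensure_ascii=False)
```

for a real file you would pass an open file handle, e.g.
`lines_to_docs(open(path, encoding="utf-8"))`, and post the payload in batches
rather than all at once.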


On Sun, Jan 27, 2013 at 10:23 AM, hadyelsahar hadyelsa...@gmail.com wrote:
 i have a large Arabic text file of tweets, one tweet per line. i want to
 index it in solr so that each line of the file becomes a separate solr
 document.

 what i tried so far:

 i know how to index SQL database records in solr
 i know how to change the solr schema to fit the data and how to work with
 the Data Import Handler
 i know the queries used to index data in solr

 what i want is:

 to know how to index a text file in solr so that each line is treated as a
 separate solr document



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/indexing-Text-file-in-solr-tp4036496.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
edge


Re: Calculate a sum.

2013-01-14 Thread Edward Garrett
i've had perfectly fine performance with StatsComponent, but have only
tested with 50,000 documents. for example, i have a field syllables and a
numeric field syllables_count, and i sum the syllable counts over the
results of any search query. how many documents are you working with?
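
a request of this kind can be sketched as below. this is an illustration
only: the base URL is an assumption, `syllables_count` is the field from my
own index, and the two helper functions are made up for the example. the
`stats` and `stats.field` parameters are the ones StatsComponent documents.

```python
from urllib.parse import urlencode


def stats_query_url(base_url, query, field):
    """Build a select URL that asks StatsComponent for statistics on one field."""
    params = {
        "q": query,
        "rows": 0,  # only the statistics are needed, not the documents
        "stats": "true",
        "stats.field": field,
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


def extract_sum(response, field):
    """Pull the sum out of a parsed StatsComponent JSON response."""
    return response["stats"]["stats_fields"][field]["sum"]
```

setting rows=0 avoids the paging over stored values described in the original
question, since the sum is computed on the server.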

On Mon, Jan 14, 2013 at 10:54 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 Stored fields are known for their slowness, and they also require two I/O
 operations per doc. You can spend some heap on uninverting the index and
 utilize wiki.apache.org/solr/StatsComponent
 Let us know whether it works for you.
 On 14.01.2013 at 13:14, user stockii stock.jo...@googlemail.com wrote:

 hello.

 My problem is that i need to calculate a sum of amounts. this amount is in
 my index (stored=true). my php script gets all values with paging, but if a
 request takes too long, jetty kills the process and i get a broken pipe.

 Which is the best/fastest way to get the values of many fields from the
 index? Does a ResponseHandler for exports exist? Or which is the fastest?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Calculate-a-sum-tp4033091.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
edge


get a list of terms sorted by total term frequency

2012-11-07 Thread Edward Garrett
hi,

is there a simple way to get a list of all terms that occur in a field
sorted by their total term frequency within that field?

TermsComponent (http://wiki.apache.org/solr/TermsComponent) provides
fast field faceting over the whole index, but as counts it gives the
number of documents that each term occurs in (given a field or set of
fields). in place of document counts, i want total term frequency
counts. the ttf function
(http://wiki.apache.org/solr/FunctionQuery#totaltermfreq) provides
this, but only if you know what term to pass to the function.
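
to make the difference between the two counts concrete, here is a small
Python sketch over a toy in-memory corpus. it does not touch solr at all; the
function names and the sample documents are invented for illustration.

```python
from collections import Counter


def doc_freq_and_ttf(docs):
    """Return (document frequency, total term frequency) counters for tokenized docs."""
    doc_freq = Counter()
    total_term_freq = Counter()
    for tokens in docs:
        total_term_freq.update(tokens)  # every occurrence counts
        doc_freq.update(set(tokens))    # each document counts once per term
    return doc_freq, total_term_freq


def terms_by_ttf(docs):
    """List (term, ttf) pairs, highest total term frequency first, ties by term."""
    _, ttf = doc_freq_and_ttf(docs)
    return sorted(ttf.items(), key=lambda item: (-item[1], item[0]))
```

with the documents [a a a b], [b], [b c], the term "a" has a document count of
1 but a total term frequency of 3, so the two orderings can differ.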

edward


Re: get a list of terms sorted by total term frequency

2012-11-07 Thread Edward Garrett
i see... using the -t flag

it would be cool if TermsComponent had an option to sort by total term
frequency, something like

terms.sort={count|index|ttf}

surely that's a common enough use case


On Wed, Nov 7, 2012 at 6:17 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Lucene's misc module has HighFreqTerms tool.

 Mike McCandless

 http://blog.mikemccandless.com


 On Wed, Nov 7, 2012 at 1:15 PM, Edward Garrett heacu.mcint...@gmail.com 
 wrote:
 hi,

 is there a simple way to get a list of all terms that occur in a field
 sorted by their total term frequency within that field?

 TermsComponent (http://wiki.apache.org/solr/TermsComponent) provides
 fast field faceting over the whole index, but as counts it gives the
 number of documents that each term occurs in (given a field or set of
 fields). in place of document counts, i want total term frequency
 counts. the ttf function
 (http://wiki.apache.org/solr/FunctionQuery#totaltermfreq) provides
 this, but only if you know what term to pass to the function.

 edward



-- 
edge


Re: How to tell the highlighter not to escape?

2007-01-04 Thread Edward Garrett

just to add a note on this, the whole idea of inserting pseudo-markup into
XML text elements seems to be pretty much in disrepute, and certainly caused
many complaints about RSS 1.0, see e.g.

http://www.biglist.com/lists/xsl-list/archives/200505/msg00316.html

in xsl, you **can** use disable-output-escaping="yes" to convert
pseudo-markup to markup, but xslt processors are not required to support
this, and so some do not.

it sure seems to me that if SOLR is returning XML, it might as well return
XML with real markup through and through instead of exploiting
pseudo-markup. if there is concern about introducing validation errors, then
perhaps you could use namespaces in the XML and put the highlighting markup
in a non-SOLR namespace???
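
the difference is easy to see with any XML parser. a small Python sketch,
using only the standard library, with two made-up response fragments (the
`str` wrapper element and the sample strings are assumptions, not actual SOLR
output):

```python
import xml.etree.ElementTree as ET

ESCAPED = "<str>a &lt;em&gt;TERM&lt;/em&gt; b</str>"  # pseudo-markup
REAL = "<str>a <em>TERM</em> b</str>"                  # real markup


def describe(xml_text):
    """Return (text before the first child, list of child tag names) for the root."""
    root = ET.fromstring(xml_text)
    return root.text, [child.tag for child in root]
```

for ESCAPED the parser sees one flat string and no child elements; for REAL
it sees an em child that a stylesheet can match directly.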


Re: How to tell the highlighter not to escape?

2007-01-03 Thread Edward Garrett

for what it's worth, i wrote a recursive template in xsl that replaces the
escaped characters with actual elements. here, the variable $val would be
the tag, e.g. em. this has been working okay for me so far.

<xsl:template name="unescapeEm">
   <xsl:param name="val" select="''"/>
   <xsl:variable name="preEm" select="substring-before($val, '&lt;')"/>
   <xsl:choose>
   <xsl:when test="$preEm or starts-with($val, '&lt;')">
   <xsl:variable name="insideEm" select="substring-before($val, '&lt;/')"/>
   <xsl:value-of select="$preEm"/><em><xsl:value-of
select="substring($insideEm, string-length($preEm)+5)"/></em>
   <xsl:variable name="leftover" select="substring($val,
string-length($insideEm) + 6)"/>
   <xsl:if test="$leftover">
   <xsl:call-template name="unescapeEm">
   <xsl:with-param name="val" select="$leftover"/>
   </xsl:call-template>
   </xsl:if>
   </xsl:when>
   <xsl:otherwise>
   <xsl:value-of select="$val"/>
   </xsl:otherwise>
   </xsl:choose>
</xsl:template>
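
for anyone who wants to check the string handling outside xsl, the same idea
can be sketched in Python. this is not a line-by-line translation of the
template: it is an iterative variant, it assumes the highlights are delimited
by the literal strings `<em>` and `</em>`, and the function name
`unescape_em` is made up for illustration.

```python
def unescape_em(val):
    """Split val into (text, highlighted) segments around <em>...</em> pairs."""
    segments = []
    while val:
        start = val.find("<em>")
        end = val.find("</em>", start) if start != -1 else -1
        if start == -1 or end == -1:
            # No complete <em>...</em> pair left: keep the rest as plain text.
            segments.append((val, False))
            break
        if start > 0:
            segments.append((val[:start], False))      # text before the tag
        segments.append((val[start + 4:end], True))    # text inside the tag
        val = val[end + 5:]                            # continue after </em>
    return segments
```

the offsets 4 and 5 are the lengths of `<em>` and `</em>`, which is the same
arithmetic the template does with +5 and +6 in 1-based XPath positions.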

On 1/3/07, Thorsten Scherler [EMAIL PROTECTED] wrote:


On Wed, 2007-01-03 at 02:16 +, Edward Garrett wrote:
 thorsten,

 see the following for discussion. your case is indeed an annoyance--the
 thread below discusses motivations for it and ways of working around it. (i
 too confess that i wish it were not so.)

 http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

Thanks Edward. The problem with the suggestion in the above thread, namely:
"just create an XSL that
generates XML and unescapes the fields you know will contain wellformed
XML data -- then apply your second transform client side"

is that it is not possible with xsl. See e.g.
http://www.biglist.com/lists/xsl-list/archives/200109/msg00318.html
 How can I match the Cdata Section?!?

You can't, the XPath data model regards CDATA as merely an input shortcut,
not as an information-bearing part of the XML content. In other words,
<![CDATA[x]]> and x look exactly the same to the XSLT processor.

Mike Kay

Michael Kay is the xsl guru and I can say as well from my own experience
one would need to write a custom parser since <![CDATA[<em>TERM</em>]]>
is equal to &lt;em&gt;TERM&lt;/em&gt; and this in xsl is a string (XPath
would match text()).

IMO the highlighter should really return pure xml and not escape it.
I will have a look at the XmlResponseWriter; maybe I can find a way to change
this.

salu2



 -edward

 On 1/2/07, Mike Klaas [EMAIL PROTECTED] wrote:
 
  Hi Thorsten,
 
  The highlighter does not escape anything itself: you are seeing the
  results of solr's automatic escaping of xml data within its xml
  response.  This should be transparent (your xml decoder should
  un-escape the values on the way out).  I'm not really familiar with
  xslt so I'm unsure why that isn't so (perhaps it is automatically
  html-escaping the values after un-xml-escaping them?)
 
  Be careful of documents containing html fragments natively.
 
  cheers,
  -MIke
 
  On 1/2/07, Thorsten Scherler [EMAIL PROTECTED] wrote:
   Hi all,
  
   I am playing around with the highlighter and found that all highlight
   terms get escaped.
  
   I mean solr will return
   &lt;em&gt;TERM&lt;/em&gt; and not
   <em> TERM </em>
  
   I am not sure where this escaping is happening but I would need the
   highlighting to NOT escape the hl.simple.pre and hl.simple.post tag
   since it is a horror to work with cdata sections in xsl.
  
   I had a look in the lucene highlighter and it seems that it does not
   escape the tags.
  
   Can somebody point me to the code which is responsible for escaping and
   maybe give me a tip on how I can patch it to make it configurable.
  
   TIA
  
   salu2
  
  
 



--
thorsten

Together we stand, divided we fall!
Hey you (Pink Floyd)






--
Edward Garrett

Visiting Fellow (2006-07)
Endangered Languages Academic Programme
School of Oriental and African Studies
London, UK
0207 898 4536

Assistant Professor, Linguistics Program
Eastern Michigan University
612 Pray-Harrold Building
Ypsilanti, MI, USA


Re: How to tell the highlighter not to escape?

2007-01-02 Thread Edward Garrett

thorsten,

see the following for discussion. your case is indeed an annoyance--the
thread below discusses motivations for it and ways of working around it. (i
too confess that i wish it were not so.)

http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

-edward

On 1/2/07, Mike Klaas [EMAIL PROTECTED] wrote:


Hi Thorsten,

The highlighter does not escape anything itself: you are seeing the
results of solr's automatic escaping of xml data within its xml
response.  This should be transparent (your xml decoder should
un-escape the values on the way out).  I'm not really familiar with
xslt so I'm unsure why that isn't so (perhaps it is automatically
html-escaping the values after un-xml-escaping them?)

Be careful of documents containing html fragments natively.

cheers,
-MIke

On 1/2/07, Thorsten Scherler [EMAIL PROTECTED] wrote:
 Hi all,

 I am playing around with the highlighter and found that all highlight
 terms get escaped.

 I mean solr will return
 &lt;em&gt;TERM&lt;/em&gt; and not
 <em> TERM </em>

 I am not sure where this escaping is happening but I would need the
 highlighting to NOT escape the hl.simple.pre and hl.simple.post tag
 since it is a horror to work with cdata sections in xsl.

 I had a look in the lucene highlighter and it seems that it does not
 escape the tags.

 Can somebody point me to the code which is responsible for escaping and
 maybe give me a tip on how I can patch it to make it configurable.

 TIA

 salu2







--
Edward Garrett

Visiting Fellow (2006-07)
Endangered Languages Academic Programme
School of Oriental and African Studies
London, UK
0207 898 4536

Assistant Professor, Linguistics Program
Eastern Michigan University
612 Pray-Harrold Building
Ypsilanti, MI, USA