Re: Tokenize Sentence and Set Attribute
i find UpdateRequestProcessors (http://wiki.apache.org/solr/UpdateRequestProcessor) a handy way to add NLP-related fields to a document, or remove them, as it is processed by Solr. this is also how UIMA integrates with Solr (http://wiki.apache.org/solr/SolrUIMA). you might want to take a look at UIMA as well.

On Mon, May 6, 2013 at 6:22 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Sounds like a very ambitious project. I'm sure you COULD do it in Solr, but not in very short order.
>
> Check out some discussion of simply searching within sentences:
> http://markmail.org/message/aoiq62a4mlo25zzk?q=apache#query:apache+page:1+mid:aoiq62a4mlo25zzk+state:results
>
> First, how do you expect to use/query the corpus? In other words, what are your user requirements? They will determine what structure the Solr index, analysis chains, and custom search components will need.
>
> Also, check out the Solr OpenNLP wiki:
> http://wiki.apache.org/solr/OpenNLP
>
> And see LUCENE-2899: Add OpenNLP Analysis capabilities as a module:
> https://issues.apache.org/jira/browse/LUCENE-2899
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Rendy Bambang Junior
> Sent: Monday, May 06, 2013 11:41 AM
> To: solr-user@lucene.apache.org
> Subject: Tokenize Sentence and Set Attribute
>
> Hello,
>
> I am trying to use a part-of-speech tagger for Bahasa Indonesia to filter tokens in Solr. The tagger receives the word list of a sentence as input and returns a tag array.
>
> I think the process should be like this:
> - tokenize sentence
> - tokenize word
> - pass it into the tagger
> - set attribute using tagger output
> - pass it into a FilteringTokenFilter implementation
>
> Is it possible to do this in Solr/Lucene? If it is, how?
>
> I've read about a similar solution for Japanese, but since I lack an understanding of Japanese, it didn't help much.
>
> --
> Regards,
> Rendy Bambang Junior
> Informatics Engineering '09
> Bandung Institute of Technology

--
edge
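The five steps in the quoted question can be sketched outside Solr to make the data flow concrete. This is only an illustration in Python, not a Lucene TokenFilter: the splitting rules are deliberately naive, and `toy_tagger` with its `VB`/`NN` tags is a hypothetical stand-in for the real Bahasa Indonesia tagger, which is not shown in the thread.

```python
import re
from typing import Callable, List, Set


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break on whitespace that follows ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence: str) -> List[str]:
    """Naive word tokenizer: runs of letters, digits and underscores."""
    return re.findall(r"\w+", sentence)


def filter_by_tag(
    text: str,
    tagger: Callable[[List[str]], List[str]],
    keep_tags: Set[str],
) -> List[str]:
    """Tag each sentence as a whole, then keep only words with wanted tags."""
    kept: List[str] = []
    for sentence in split_sentences(text):
        words = split_words(sentence)
        tags = tagger(words)  # one tag per word, same order as the input
        kept.extend(w for w, t in zip(words, tags) if t in keep_tags)
    return kept


def toy_tagger(words: List[str]) -> List[str]:
    """Hypothetical tagger for the demo; not linguistically accurate."""
    verbs = {"makan", "membaca"}
    return ["VB" if w.lower() in verbs else "NN" for w in words]


if __name__ == "__main__":
    print(filter_by_tag("Saya makan nasi. Dia membaca buku.", toy_tagger, {"NN"}))
```

The point of the sketch is that the tagger needs a whole sentence at once, so whatever does this inside Solr has to buffer a sentence's tokens before it can attach tags and filter. That is one reason an UpdateRequestProcessor, which sees the full field value, can be simpler than a per-token filter.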
Re: indexing Text file in solr
i don't have experience with this, but it looks like you could use LineEntityProcessor from DIH: http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor

On Sun, Jan 27, 2013 at 10:23 AM, hadyelsahar <hadyelsa...@gmail.com> wrote:

> I have a large Arabic text file that contains tweets, one tweet per line. I want to index it in Solr such that each line of the file is indexed as a separate Solr document.
>
> What I have tried so far:
> - I know how to index SQL database records in Solr
> - I know how to change the Solr schema to fit the data, and how to work with the Data Import Handler
> - I know how the queries used to index data in Solr work
>
> What I want is to know how to index a text file in Solr so that each line is treated as a Solr document.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/indexing-Text-file-in-solr-tp4036496.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
edge
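As an alternative to DIH, the one-line-one-document mapping can also be done client-side. The following is a minimal sketch, assuming a schema with `id` and `text` fields and a JSON update endpoint; the field names, the `tweet-` id prefix and the URL in the usage note are all assumptions, not taken from the thread.

```python
import io
import json
from typing import Dict, Iterable, Iterator


def lines_to_docs(lines: Iterable[str], id_prefix: str = "tweet-") -> Iterator[Dict[str, str]]:
    """Yield one document per non-empty line; the id is the 1-based line number."""
    for n, line in enumerate(lines, start=1):
        text = line.strip()
        if text:
            yield {"id": f"{id_prefix}{n}", "text": text}


def build_update_body(docs: Iterable[Dict[str, str]]) -> bytes:
    """Serialize documents as a UTF-8 JSON array, keeping Arabic text unescaped."""
    return json.dumps(list(docs), ensure_ascii=False).encode("utf-8")


if __name__ == "__main__":
    sample = io.StringIO("first tweet\n\nsecond tweet\n")
    print(build_update_body(lines_to_docs(sample)).decode("utf-8"))
```

To send the result, one could POST the bytes with `Content-Type: application/json` to the core's update handler (for example something like `http://localhost:8983/solr/<core>/update?commit=true`, depending on the Solr version and configuration). For a large file it would be sensible to send the documents in batches rather than as one request.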
Re: Calculate a sum.
i've had perfectly fine performance with StatsComponent, but have only tested with 50,000 documents. for example, i have a field syllables and a numeric field syllables_count. then i sum the syllable count for any search query.

how many documents are you working with?

On Mon, Jan 14, 2013 at 10:54 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> Stored fields are famous for their slowness, and they require two I/O operations per doc. You can spend some heap on uninverting the index and use wiki.apache.org/solr/StatsComponent. Let us know whether it works for you.
>
> On 14.01.2013 at 13:14, user stockii <stock.jo...@googlemail.com> wrote:
>
> > Hello.
> >
> > My problem is that I need to calculate a sum of amounts. This amount is in my index (stored=true). My PHP script gets all the values with paging, but if a request takes too long, Jetty kills the process and I get a broken pipe.
> >
> > Which is the best/fastest way to get the values of many fields from the index? Is there a ResponseHandler for exports? Or which is the fastest?
> >
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/Calculate-a-sum-tp4033091.html

--
edge
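For reference, a StatsComponent request of the kind described above could be built as follows. This is a sketch under assumptions: the base URL and the `/select` handler path are placeholders, `syllables_count` is the field named in the reply, and the response is assumed to be requested as JSON with `wt=json`.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def stats_sum_url(base_url: str, query: str, field: str) -> str:
    """Build a select URL that asks StatsComponent for statistics on one field."""
    params = {
        "q": query,
        "rows": 0,             # only the statistics are needed, not the documents
        "stats": "true",
        "stats.field": field,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def extract_sum(response: Dict[str, Any], field: str) -> float:
    """Pull the sum for a field out of a parsed JSON response."""
    return response["stats"]["stats_fields"][field]["sum"]


if __name__ == "__main__":
    # Hypothetical host and core name, for illustration only.
    print(stats_sum_url("http://localhost:8983/solr/mycore", "*:*", "syllables_count"))
```

Because `rows=0` returns no stored fields, the sum is computed server-side and the client never pages through documents, which avoids the long-running export that led to the broken pipe.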
get a list of terms sorted by total term frequency
hi,

is there a simple way to get a list of all terms that occur in a field sorted by their total term frequency within that field?

TermsComponent (http://wiki.apache.org/solr/TermsComponent) provides fast field faceting over the whole index, but as counts it gives the number of documents that each term occurs in (given a field or set of fields). in place of document counts, i want total term frequency counts. the ttf function (http://wiki.apache.org/solr/FunctionQuery#totaltermfreq) provides this, but only if you know what term to pass to the function.

edward
Re: get a list of terms sorted by total term frequency
i see... using the -t flag.

it would be cool if TermsComponent had an option to sort by total term frequency, something like terms.sort={count|index|ttf}. surely that's a common enough use case.

On Wed, Nov 7, 2012 at 6:17 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> Lucene's misc module has a HighFreqTerms tool.
>
> Mike McCandless
> http://blog.mikemccandless.com
>
> On Wed, Nov 7, 2012 at 1:15 PM, Edward Garrett <heacu.mcint...@gmail.com> wrote:
>
> > hi,
> >
> > is there a simple way to get a list of all terms that occur in a field sorted by their total term frequency within that field?
> >
> > TermsComponent (http://wiki.apache.org/solr/TermsComponent) provides fast field faceting over the whole index, but as counts it gives the number of documents that each term occurs in (given a field or set of fields). in place of document counts, i want total term frequency counts. the ttf function (http://wiki.apache.org/solr/FunctionQuery#totaltermfreq) provides this, but only if you know what term to pass to the function.
> >
> > edward

--
edge
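The distinction at the heart of this thread, document frequency versus total term frequency, is easy to see on a toy corpus. The sketch below is plain Python over already-tokenized documents; it does not read a Lucene index, and the function names are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def doc_freq_and_ttf(docs: Iterable[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Return (document frequency, total term frequency) for tokenized docs."""
    df: Counter = Counter()
    ttf: Counter = Counter()
    for tokens in docs:
        ttf.update(tokens)      # every occurrence counts
        df.update(set(tokens))  # each term counts once per document
    return df, ttf


def terms_by_ttf(docs: Iterable[Sequence[str]]) -> List[Tuple[str, int]]:
    """Terms sorted by total term frequency, highest first, ties by term."""
    _, ttf = doc_freq_and_ttf(docs)
    return sorted(ttf.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    corpus = [["a", "a", "b"], ["a", "c"], ["b", "b", "b"]]
    df, ttf = doc_freq_and_ttf(corpus)
    print(dict(df), dict(ttf), terms_by_ttf(corpus))
```

In this corpus "a" and "b" both occur in two documents, so a document-count ordering cannot separate them, while the total term frequency puts "b" (four occurrences) ahead of "a" (three). That is the ordering the proposed ttf sort option would expose.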
Re: How to tell the highlighter not to escape?
just to add a note on this, the whole idea of inserting pseudo-markup into XML text elements seems to be pretty much in disrepute, and certainly caused many complaints about RSS 1.0, see e.g. http://www.biglist.com/lists/xsl-list/archives/200505/msg00316.html

in xsl, you **can** use disable-output-escaping="yes" to convert pseudo-markup to markup, but xslt processors are not required to support this, and so some do not.

it sure seems to me that if SOLR is returning XML, it might as well return XML with real markup through and through instead of exploiting pseudo-markup. if there is concern about introducing validation errors, then perhaps you could use namespaces in the XML and put the highlighting markup in a non-SOLR namespace?
Re: How to tell the highlighter not to escape?
for what it's worth, i wrote a recursive template in xsl that replaces the escaped characters with actual elements. here, the variable $val would be the tag, e.g. em. this has been working okay for me so far.

<xsl:template name="unescapeEm">
  <xsl:param name="val" select="''"/>
  <!-- text before the first escaped tag -->
  <xsl:variable name="preEm" select="substring-before($val, '&lt;')"/>
  <xsl:choose>
    <xsl:when test="$preEm or starts-with($val, '&lt;')">
      <!-- everything up to the first closing tag: preEm + "<em>" + highlighted text -->
      <xsl:variable name="insideEm" select="substring-before($val, '&lt;/')"/>
      <xsl:value-of select="$preEm"/><em><xsl:value-of select="substring($insideEm, string-length($preEm)+5)"/></em>
      <!-- skip the 5 characters of "</em>" and recurse on the remainder -->
      <xsl:variable name="leftover" select="substring($val, string-length($insideEm) + 6)"/>
      <xsl:if test="$leftover">
        <xsl:call-template name="unescapeEm">
          <xsl:with-param name="val" select="$leftover"/>
        </xsl:call-template>
      </xsl:if>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$val"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

On 1/3/07, Thorsten Scherler [EMAIL PROTECTED] wrote:

> On Wed, 2007-01-03 at 02:16 +, Edward Garrett wrote:
> > thorsten,
> >
> > see the following for discussion. your case is indeed an annoyance--the thread below discusses motivations for it and ways of working around it. (i too confess that i wish it were not so.)
> >
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html
>
> Thanks Edward, the problem with the suggestion in the above thread, namely:
>
> "just create an XSL that generates XML and unescapes the fields you know will contain wellformed XML data -- then apply your second transform client side"
>
> is that it is not possible with xsl. See e.g. http://www.biglist.com/lists/xsl-list/archives/200109/msg00318.html
>
> "How can I match the Cdata Section?!?"
>
> "You can't, the XPath data model regards CDATA as merely an input shortcut, not as an information-bearing part of the XML content. In other words, <![CDATA[x]]> and x look exactly the same to the XSLT processor."
> Mike Kay
>
> Michael Kay is the xsl guru, and I can say as well from my own experience that one would need to write a custom parser, since <![CDATA[<em>TERM</em>]]> is equal to &lt;em&gt;TERM&lt;/em&gt; and this in xsl is a string (XPath would match text()).
>
> IMO the highlighter should really return pure xml and not escape it. I will have a look at the XmlResponseWriter; maybe I will find a way to change this.
>
> salu2
>
> > -edward
> >
> > On 1/2/07, Mike Klaas [EMAIL PROTECTED] wrote:
> > > Hi Thorsten,
> > >
> > > The highlighter does not escape anything itself: you are seeing the results of solr's automatic escaping of xml data within its xml response. This should be transparent (your xml decoder should un-escape the values on the way out). I'm not really familiar with xslt so I'm unsure why that isn't so (perhaps it is automatically html-escaping the values after un-xml-escaping them?)
> > >
> > > Be careful of documents containing html fragments natively.
> > >
> > > cheers,
> > > -Mike
> > >
> > > On 1/2/07, Thorsten Scherler [EMAIL PROTECTED] wrote:
> > > > Hi all,
> > > >
> > > > I am playing around with the highlighter and found that all highlight terms get escaped. I mean solr will return &lt;em&gt;TERM&lt;/em&gt; and not <em>TERM</em>.
> > > >
> > > > I am not sure where this escaping is happening, but I would need the highlighting to NOT escape the hl.simple.pre and hl.simple.post tags, since it is a horror to work with cdata sections in xsl.
> > > >
> > > > I had a look at the lucene highlighter and it seems that it does not escape the tags.
> > > >
> > > > Can somebody point me to the code which is responsible for the escaping, and maybe give me a tip on how I can patch it to make it configurable?
> > > >
> > > > TIA
> > > > salu2
>
> --
> thorsten
> "Together we stand, divided we fall!"
> Hey you (Pink Floyd)

--
Edward Garrett
Visiting Fellow (2006-07)
Endangered Languages Academic Programme
School of Oriental and African Studies
London, UK
0207 898 4536

Assistant Professor, Linguistics Program
Eastern Michigan University
612 Pray-Harrold Building
Ypsilanti, MI, USA
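Outside XSLT, the point made in the quoted replies (the XML parser un-escapes the value, leaving a plain string that merely looks like markup) can be demonstrated with a second parse. This is a minimal Python sketch; the `<str>` wrapper imitates a Solr XML response value, and the `span` wrapper element is an arbitrary choice for the demo.

```python
import xml.etree.ElementTree as ET


def highlight_to_element(response_xml: str, wrapper: str = "span") -> ET.Element:
    """Turn an escaped highlight snippet into a real element tree.

    The first parse un-escapes the text, so &lt;em&gt; becomes the literal
    string "<em>". The second parse reads that string as markup.
    """
    escaped = ET.fromstring(response_xml)
    fragment = escaped.text or ""  # e.g. 'foo <em>TERM</em> bar', still a string
    return ET.fromstring(f"<{wrapper}>{fragment}</{wrapper}>")


if __name__ == "__main__":
    element = highlight_to_element("<str>foo &lt;em&gt;TERM&lt;/em&gt; bar</str>")
    print(ET.tostring(element, encoding="unicode"))
```

The second parse only succeeds when the un-escaped text is well-formed XML. A stored value that itself contains a bare "&" or "<" would raise a parse error, which is the practical side of the warning about documents that contain html fragments natively.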
Re: How to tell the highlighter not to escape?
thorsten,

see the following for discussion. your case is indeed an annoyance--the thread below discusses motivations for it and ways of working around it. (i too confess that i wish it were not so.)

http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

-edward

On 1/2/07, Mike Klaas [EMAIL PROTECTED] wrote:

> Hi Thorsten,
>
> The highlighter does not escape anything itself: you are seeing the results of solr's automatic escaping of xml data within its xml response. This should be transparent (your xml decoder should un-escape the values on the way out). I'm not really familiar with xslt so I'm unsure why that isn't so (perhaps it is automatically html-escaping the values after un-xml-escaping them?)
>
> Be careful of documents containing html fragments natively.
>
> cheers,
> -Mike
>
> On 1/2/07, Thorsten Scherler [EMAIL PROTECTED] wrote:
> > Hi all,
> >
> > I am playing around with the highlighter and found that all highlight terms get escaped. I mean solr will return &lt;em&gt;TERM&lt;/em&gt; and not <em>TERM</em>.
> >
> > I am not sure where this escaping is happening, but I would need the highlighting to NOT escape the hl.simple.pre and hl.simple.post tags, since it is a horror to work with cdata sections in xsl.
> >
> > I had a look at the lucene highlighter and it seems that it does not escape the tags.
> >
> > Can somebody point me to the code which is responsible for the escaping, and maybe give me a tip on how I can patch it to make it configurable?
> >
> > TIA
> > salu2

--
Edward Garrett
Visiting Fellow (2006-07)
Endangered Languages Academic Programme
School of Oriental and African Studies
London, UK
0207 898 4536

Assistant Professor, Linguistics Program
Eastern Michigan University
612 Pray-Harrold Building
Ypsilanti, MI, USA