Re: Tag generation

2010-07-16 Thread kenf_nc

Thanks for all the suggestions! I'm absorbing them as quickly as I can. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tag-generation-tp969888p973277.html
Sent from the Solr - User mailing list archive at Nabble.com.


Tag generation

2010-07-15 Thread kenf_nc

A colleague mentioned that he knew of services where you pass some content
and it spits out some suggested Tags or Keywords that would be best suited
to associate with that content.

Does anyone know if there is a contrib to Solr or Lucene that does something
like this? Or a third-party tool that can be given a Solr index or Solr
query and comes up with some good tag suggestions?


Re: Tag generation

2010-07-15 Thread Olivier Dobberkau

On 15.07.2010 at 17:34, kenf_nc wrote:

 A colleague mentioned that he knew of services where you pass some content
 and it spits out some suggested Tags or Keywords that would be best suited
 to associate with that content.
 
 Does anyone know if there is a contrib to Solr or Lucene that does something
 like this? Or a third-party tool that can be given a Solr index or Solr
 query and comes up with some good tag suggestions?

Hi

There is something from Zemanta (http://www.zemanta.com/)
and something from Basis Tech (http://www.basistech.com/).

I am not sure if this would help, but you could also have a look at

http://uima.apache.org/

greetings,

olivier

--

Olivier Dobberkau



Re: Tag generation

2010-07-15 Thread Markus Jelsma
Check out OpenCalais [1]. Maybe it works for your case and language.

[1]: http://www.opencalais.com/


Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Tag generation

2010-07-15 Thread Tommaso Teofili
Hi all,
UIMA has two components that wrap the OpenCalais [1] and AlchemyAPI [2][3]
services, which you could use. You could then add further steps to the
tagging pipeline (using existing components [4] or implementing your own
logic).
Hope this helps.
Tommaso

[1] : http://uima.apache.org/sandbox.html#opencalais.annotator
[2] : http://www.alchemyapi.com
[3] : http://svn.apache.org/repos/asf/uima/sandbox/trunk/AlchemyAPIAnnotator
[4] : http://uima.apache.org/sandbox.html


Re: TermVectorComponent for tag generation?

2008-11-01 Thread Grant Ingersoll
How do you propose to distinguish those words from the other ones?
The problem you are addressing is often called keyword extraction. In
general, it's a difficult problem, but you may have domain knowledge
that can help.



On Oct 31, 2008, at 6:35 PM, Jon Baer wrote:


Well, for example, in any given text (which is a field on a document):

While suitable for any application which requires full text  
indexing and searching capability, Lucene has been widely recognized  
for its utility in the implementation of Internet search engines and  
local, single-site searching.


At the core of Lucene's logical architecture is the idea of a  
document containing fields of text. This flexibility allows Lucene's  
API to be independent of file format. Text from PDFs, HTML,  
Microsoft Word documents, as well as many others can all be indexed  
so long as their textual information can be extracted.


I'd like to be able to say that the tags for this article should be
[Lucene, PDF, HTML, Microsoft Word] because they appear in field values
from other documents. Basically, how can I generate tags for a
single document based on other documents' field values?


- Jon


--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Re: TermVectorComponent for tag generation?

2008-11-01 Thread Jon Baer


On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:


How do you propose to distinguish those words from the other ones?


** They are field values from other documents

The problem you are addressing is often called keyword extraction.
In general, it's a difficult problem, but you may have domain
knowledge that can help.


** I'm finding it hard to believe that Lucene can do an amazing job at search
and yet has nothing to tell me whether a generated list of content is present
in a resulting document. The other options of the TVC are what piqued my
interest in the beginning ...


Other Options
* tv.fl - List of fields to get TV information from. Optional. If  
not specified, the fl parameter is used.
* tv.docIds - List of Lucene document ids (not the Solr Unique  
Key) to get term vectors for.


I'm pretty sure that might work for what I need.
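
As a rough illustration of the two options quoted above, a request URL could be assembled like this. Only tv.fl and tv.docIds come from the option list above; the base URL, the field name "body", the document id 42 and the tv=true switch are assumptions made for the example. The handler behind the URL would also need the TermVectorComponent configured.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust for the actual Solr setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def term_vector_url(lucene_doc_id, field):
    """Build a request URL asking for the term vector of one field of one document."""
    params = {
        "tv": "true",                      # assumed switch that turns the component on
        "tv.fl": field,                    # field to read term vectors from
        "tv.docIds": str(lucene_doc_id),   # internal Lucene id, not the Solr unique key
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(term_vector_url(42, "body"))
```

This only builds the URL. Sending it and parsing the response are left out on purpose, since the response layout depends on the Solr version.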

- Jon


Re: TermVectorComponent for tag generation?

2008-11-01 Thread Grant Ingersoll




On Nov 1, 2008, at 3:04 PM, Jon Baer wrote:



On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:


How do you propose to distinguish those words from the other ones?


** They are field values from other documents


But so are many other words from that document. What separates
[Lucene, PDF, HTML, Microsoft Word] from the rest? Your brain made
the distinction, but what information exists in that document such that a
computer can? (This is a leading question. I have some ideas, but I
think hearing it from you will help me better understand what you are
trying to do.)





The problem you are addressing is often called keyword extraction.
In general, it's a difficult problem, but you may have domain
knowledge that can help.


** I'm finding it hard to believe that Lucene can do an amazing job at search
and yet has nothing to tell me whether a generated list of content is present
in a resulting document.


I think it can. The thing I'm missing is where the generated
list comes from. Given the list, I think it's just another search,
right?


So, I suppose you could get the TV for your current document, along  
with the DF (doc freq) and know which terms occur in other documents,  
then you could go get those documents by searching for each of those  
terms.
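
The idea in the paragraph above can be tried outside Solr with a toy inverted index. This is only a sketch under stated assumptions: the documents, their ids and the helper related_docs are made up, and a plain set of terms stands in for a real term vector.

```python
from collections import defaultdict

# Made-up sample data: one news article plus one "player" and one "team" document.
docs = {
    "news1": "smith scored twice for the rovers",
    "player1": "smith",
    "team1": "rovers",
}

# Build a tiny inverted index: term -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)


def related_docs(doc_id):
    """Map each term of doc_id that has DF > 1 to the other documents containing it."""
    term_vector = set(docs[doc_id].split())
    return {
        term: sorted(index[term] - {doc_id})
        for term in sorted(term_vector)
        if len(index[term]) > 1  # DF > 1 means the term occurs in another document
    }


print(related_docs("news1"))
```

Terms that occur only in the article itself (DF of 1) drop out, which leaves the terms shared with the player and team documents as tag candidates.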


However, I still suspect I'm missing something, so I'd say give it a  
try!  Maybe trying it out in code would be the best way to articulate  
it.


-Grant


TermVectorComponent for tag generation?

2008-10-31 Thread Jon Baer

Hi,

So I'm looking to either use this or build a component that might do
what I'm looking for. I'd like to figure out whether it's possible to use a
single doc to generate tags based on the matches within that
document. For example:


1 News Doc - contains 5 Players and 8 Teams (show them as possible  
tags for this article)


In this case Players and Teams are also docs. It's almost like I want
to use MoreLikeThis with a different filter query than the one I'm using.


Is there any easy hack to get this going?

Thanks.

- Jon 


Re: TermVectorComponent for tag generation?

2008-10-31 Thread Grant Ingersoll

Hey Jon,

I'm not following how the TVC (TermVectorComponent) would help here. I
suppose you could use the most important terms, as defined by TF-IDF,
as suggested tags. MLT (MoreLikeThis) uses this to generate
query terms.
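
A minimal sketch of the TF-IDF idea, written in plain Python rather than against Lucene: it ranks the terms of one document against a toy in-memory corpus. The function name suggest_tags, the smoothed IDF formula and the sample corpus are assumptions made for illustration. This is not how Lucene or MoreLikeThis computes its scores internally.

```python
import math
from collections import Counter


def suggest_tags(doc_tokens, corpus, top_n=3):
    """Rank the terms of one document by TF-IDF against a small corpus."""
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each term occurs.
    df = Counter(term for doc in corpus for term in set(doc))
    tf = Counter(doc_tokens)
    scores = {
        # Smoothed IDF so a term missing from the corpus cannot divide by zero.
        term: count * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }
    # Sort by score (descending), then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


corpus = [
    "lucene indexes text from pdf html files".split(),
    "the search engine indexes text".split(),
    "the text is extracted from files".split(),
]
print(suggest_tags(corpus[0], corpus, top_n=3))
```

Terms that appear in every document (here "text") score zero, so the terms that are rare across the corpus rise to the top.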


However, I'm not following the different filter query piece. Can you
provide a bit more detail?


One thing you did make me think of, though, is that it might be interesting to
extend TermVectorMapper so that it can output a NamedList, and then
allow people to implement their own SolrTermVectorMapper to
customize the TV output...


Thanks,
Grant


Re: TermVectorComponent for tag generation?

2008-10-31 Thread Jon Baer

Well, for example, in any given text (which is a field on a document):

While suitable for any application which requires full text indexing  
and searching capability, Lucene has been widely recognized for its  
utility in the implementation of Internet search engines and local,  
single-site searching.


At the core of Lucene's logical architecture is the idea of a document  
containing fields of text. This flexibility allows Lucene's API to be  
independent of file format. Text from PDFs, HTML, Microsoft Word  
documents, as well as many others can all be indexed so long as their  
textual information can be extracted.


I'd like to be able to say that the tags for this article should be [Lucene,
PDF, HTML, Microsoft Word] because they appear in field values from other
documents. Basically, how can I generate tags for a single document
based on other documents' field values?
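
One way to read this requirement is as a dictionary lookup: collect the values already stored in other documents and check which of them occur in the text. The sketch below assumes that reading. The function name match_known_values, the optional plural "s" and the extra value "Solr" (added to show a non-match) are illustrative choices, not something Solr provides.

```python
import re


def match_known_values(text, known_values):
    """Return the known values that appear in the text as whole words or phrases."""
    matches = []
    for value in known_values:
        # Lookarounds act as word boundaries; the optional "s" lets "PDF" match "PDFs".
        pattern = r"(?<!\w)" + re.escape(value) + r"s?(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matches.append(value)
    return matches


text = (
    "At the core of Lucene's logical architecture is the idea of a document "
    "containing fields of text. Text from PDFs, HTML, Microsoft Word documents "
    "can all be indexed."
)
known = ["Lucene", "PDF", "HTML", "Microsoft Word", "Solr"]
print(match_known_values(text, known))
```

With a large set of known values, one regular expression per value would get slow, so a real component would more likely look the values up in the index itself.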


- Jon


Re: TermVectorComponent for tag generation?

2008-10-31 Thread Vaijanath N. Rao

Hi Jon,

Isn't this similar to what Grant just suggested, i.e. taking the topmost terms
(after removing the stop words)?


You would need to get the terms and their
frequencies, and mark any term whose frequency is above a certain threshold
as a member of the tag set.
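
A small sketch of the threshold approach, assuming raw term counts within a single text and a hand-picked stop word list. The names tags_above_threshold and STOP_WORDS, the sample sentence and the choice of "at least min_count" as the cut-off are all illustrative assumptions.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # illustrative only


def tags_above_threshold(text, min_count=2):
    """Return terms whose count in the text reaches min_count, stop words removed."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    counts = Counter(tokens)
    return sorted(term for term, count in counts.items() if count >= min_count)


text = "the index stores the terms of the document and the index is searched"
print(tags_above_threshold(text, min_count=2))
```

Without the stop word filter, "the" would be the most frequent term, which is why the filtering step comes before counting.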


One can also build a set of related entities or terms that
follow the current term, and then decide which of them should become
part of the tag set.


Is that the requirement, or am I missing something here?

-- Thanks and Regards
Vaijanath N. Rao
