DIH for multilingual index multiValued field?

2010-11-13 Thread Andy
I have a MySQL table:

CREATE TABLE documents (
id INT NOT NULL AUTO_INCREMENT,
language_code CHAR(2),
tags CHAR(30),
text TEXT,
PRIMARY KEY (id)
);

I have 2 questions about Solr DIH:

1) The langauge_code field indicates what language the text field is in. 
And depending on the language, I want to index text to different Solr fields.

# pseudo code

if langauge_code == en:
index text to Solr field text_en
elif langauge_code == fr:
index text to Solr field text_fr
elif langauge_code == zh:
index text to Solr field text_zh
...

Can DIH handle a usecase like this? How do I configure it to do so?

2) The tags field needs to be indexed into a Solr multiValued field. Multiple 
values are stored in a string, separated by a comma. For example, if `tags` 
contains the string blue, green, yellow then I want to index the 3 values 
blue, green, yellow into a Solr multiValued field.

How do I do that with DIH?

Thanks.


  


Re: DIH for multilingual index multiValued field?

2010-11-13 Thread Imran
I think a custom transformer would be of help in these scenarios
http://wiki.apache.org/solr/DIHCustomTransformer

http://wiki.apache.org/solr/DIHCustomTransformerCheers
-- Imran

On Sat, Nov 13, 2010 at 8:55 PM, Andy angelf...@yahoo.com wrote:

 I have a MySQL table:

CREATE TABLE documents (
id INT NOT NULL AUTO_INCREMENT,
language_code CHAR(2),
tags CHAR(30),
text TEXT,
PRIMARY KEY (id)
);

 I have 2 questions about Solr DIH:

 1) The langauge_code field indicates what language the text field is
 in. And depending on the language, I want to index text to different Solr
 fields.

# pseudo code

if langauge_code == en:
index text to Solr field text_en
elif langauge_code == fr:
index text to Solr field text_fr
elif langauge_code == zh:
index text to Solr field text_zh
...

 Can DIH handle a usecase like this? How do I configure it to do so?

 2) The tags field needs to be indexed into a Solr multiValued field.
 Multiple values are stored in a string, separated by a comma. For example,
 if `tags` contains the string blue, green, yellow then I want to index the
 3 values blue, green, yellow into a Solr multiValued field.

 How do I do that with DIH?

 Thanks.






Re: DIH for multilingual index multiValued field?

2010-11-13 Thread Ahmet Arslan
For (1) you probably need to write a custom transformer. Something like:
public Object transformRow(MapString, Object row) {
String language_code = row.get(language_code);
String text = row.get(text);
if(en.equals(language_code))
   row.put(text_en, text);
else if if(fr.equals(language_code))
   row.put(text_fr, text);

return row;
}


For (2), it doable with regex transformer. 

field column=mailId splitBy=, sourceColName=emailids/
The 'emailids' field in the table can be a comma separated value. So it ends up 
giving out one or more than one email ids and we expect the 'mailId' to be a 
multivalued field in Solr. [1]

[1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer


--- On Sat, 11/13/10, Andy angelf...@yahoo.com wrote:

 From: Andy angelf...@yahoo.com
 Subject: DIH for multilingual index  multiValued field?
 To: solr-user@lucene.apache.org
 Date: Saturday, November 13, 2010, 10:55 PM
 I have a MySQL table:
 
     CREATE TABLE documents (
         id INT NOT NULL
 AUTO_INCREMENT,
         language_code CHAR(2),
         tags CHAR(30),
         text TEXT,
         PRIMARY KEY (id)
     );
 
 I have 2 questions about Solr DIH:
 
 1) The langauge_code field indicates what language the
 text field is in. And depending on the language, I want to
 index text to different Solr fields.
 
     # pseudo code
 
     if langauge_code == en:
         index text to Solr field
 text_en
     elif langauge_code == fr:
         index text to Solr field
 text_fr
     elif langauge_code == zh:
         index text to Solr field
 text_zh
     ...
 
 Can DIH handle a usecase like this? How do I configure it
 to do so?
 
 2) The tags field needs to be indexed into a Solr
 multiValued field. Multiple values are stored in a string,
 separated by a comma. For example, if `tags` contains the
 string blue, green, yellow then I want to index the 3
 values blue, green, yellow into a Solr multiValued
 field.
 
 How do I do that with DIH?
 
 Thanks.
 
 
       
 





Re: DIH for multilingual index multiValued field?

2010-11-13 Thread Ken Stanley
On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan iori...@yahoo.com wrote:
 For (1) you probably need to write a custom transformer. Something like:
 public Object transformRow(MapString, Object row)     {
 String language_code = row.get(language_code);
 String text = row.get(text);
 if(en.equals(language_code))
       row.put(text_en, text);
 else if if(fr.equals(language_code))
       row.put(text_fr, text);

 return row;
 }


 For (2), it doable with regex transformer.

 field column=mailId splitBy=, sourceColName=emailids/
 The 'emailids' field in the table can be a comma separated value. So it ends 
 up giving out one or more than one email ids and we expect the 'mailId' to be 
 a multivalued field in Solr. [1]

 [1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer


In my opinion, I think that this is a bit of overkill. Since the DIH
supports multiple entities, with no real limit on the SQL queries, I
think that the easiest (and less involved) approach would be to create
three entities for the languages the OP wishes to index:

entity name=english query=SELECT * FROM documents WHERE
language_code='en' transformer=RegexTransformer
field column=text_en column=text /
field column=tags column=tags splitBy=, /
/entity

entity name=french query=SELECT * FROM documents WHERE
language_code='fr' transformer=RegexTransformer
field column=text_fr column=text /
field column=tags column=tags splitBy=, /
/entity

entity name=chinese query=SELECT * FROM documents WHERE
language_code='zh' transformer=RegexTransformer
field column=text_zh column=text /
field column=tags column=tags splitBy=, /
/entity

But, I admit that depending on future growth of languages, as well as
other factors (i.e., needing more specific logic, etc), a programmatic
approach might be warranted.

I would recommend, however, that the database table be a little more
normalized. Your definition for tags is quite limiting, and could be
better served using a many-to-many relationship. Something like the
following might serve you well:

   CREATE TABLE documents (
   id INT NOT NULL AUTO_INCREMENT,
   language_code CHAR(2),
   tags CHAR(30),
   text TEXT,
   PRIMARY KEY (id)
   );

   CREATE TABLE document_tags (
   id INT NOT NULL AUTO_INCREMENT,
   tag CHAR(30),
   PRIMARY KEY (id)
   );

   CREATE TABLE document_tag_lookup (
   document_id INT NOT NULL,
   tag_id INT NOT NULL,
   PRIMARY KEY (document_id, tag_id)
   );

Then in the DIH, you simply nest a second entity to look up the zero
or more tags that might be associated with your documents; take the
english entity from above:

entity name=english query=SELECT * FROM documents WHERE
language_code='en' transformer=RegexTransformer
field name=text_en column=text /

entity name=english_tags query=SELECT * FROM document_tags dt
INNER JOIN document_tag_lookup dtl ON (dtl.tag_id = dt.id AND
dtl.document_id='${english.id}')
field name=tags column=tag /
/entity
/entity

This would allow for growth, and is easy to maintain. Additionally, if
you wanted to implement a custom transformer of your own, you could.
As an aside, a sort of compromise, you could also use the
ScriptTransformer [1] to create a Javascript function that can do your
language logic and create the necessary fields, and not have to worry
about maintaining any custom Java code.

[1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

- Ken


Re: DIH for multilingual index multiValued field?

2010-11-13 Thread Ken Stanley
On Sat, Nov 13, 2010 at 5:59 PM, Ken Stanley doh...@gmail.com wrote:
   CREATE TABLE documents (
       id INT NOT NULL AUTO_INCREMENT,
       language_code CHAR(2),
       tags CHAR(30),
       text TEXT,
       PRIMARY KEY (id)
   );

I apologize, but I couldn't leave the typo in my last post without a
follow up; it might cause confusion. I copied the OP's original table
definition and forgot to remove the tags field. My purposed definition
for the documents table should be:

  CREATE TABLE documents (
  id INT NOT NULL AUTO_INCREMENT,
  language_code CHAR(2),
  text TEXT,
  PRIMARY KEY (id)
  );

- Ken