DIH for multilingual index multiValued field?
I have a MySQL table:

    CREATE TABLE documents (
      id INT NOT NULL AUTO_INCREMENT,
      language_code CHAR(2),
      tags CHAR(30),
      text TEXT,
      PRIMARY KEY (id)
    );

I have 2 questions about Solr DIH:

1) The language_code field indicates what language the text field is in. Depending on the language, I want to index text to different Solr fields.

    # pseudo code
    if language_code == en:
        index text to Solr field text_en
    elif language_code == fr:
        index text to Solr field text_fr
    elif language_code == zh:
        index text to Solr field text_zh
    ...

Can DIH handle a use case like this? How do I configure it to do so?

2) The tags field needs to be indexed into a Solr multiValued field. Multiple values are stored in a single string, separated by commas. For example, if `tags` contains the string

    blue, green, yellow

then I want to index the 3 values blue, green, yellow into a Solr multiValued field. How do I do that with DIH?

Thanks.
Re: DIH for multilingual index multiValued field?
I think a custom transformer would be of help in these scenarios:

http://wiki.apache.org/solr/DIHCustomTransformer

Cheers
--
Imran

On Sat, Nov 13, 2010 at 8:55 PM, Andy angelf...@yahoo.com wrote:
> I have 2 questions about Solr DIH: [...]
Re: DIH for multilingual index multiValued field?
For (1) you probably need to write a custom transformer. Something like:

    public Object transformRow(Map<String, Object> row) {
        // Column names come from the SQL query; values arrive as Objects.
        String languageCode = (String) row.get("language_code");
        String text = (String) row.get("text");
        if ("en".equals(languageCode))
            row.put("text_en", text);
        else if ("fr".equals(languageCode))
            row.put("text_fr", text);
        return row;
    }

For (2), it is doable with the RegexTransformer:

    <field column="mailId" splitBy="," sourceColName="emailids"/>

"The 'emailids' field in the table can be a comma separated value. So it ends up giving out one or more than one email ids and we expect the 'mailId' to be a multivalued field in Solr." [1]

[1] http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

--- On Sat, 11/13/10, Andy angelf...@yahoo.com wrote:

> From: Andy angelf...@yahoo.com
> Subject: DIH for multilingual index multiValued field?
> To: solr-user@lucene.apache.org
> Date: Saturday, November 13, 2010, 10:55 PM
>
> I have a MySQL table: [...]
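To make the idea in (1) and (2) concrete, here is a minimal, self-contained sketch of such a transformer that can be compiled and tried outside Solr. The class name LanguageFieldTransformer and the SUPPORTED set are hypothetical, not from the thread. It assumes, as the DIHCustomTransformer wiki page describes, that DIH only needs a class exposing a transformRow(Map<String, Object>) method, and that a List value placed in the row is indexed as multiple values. Splitting the tags in Java here is only an alternative to the RegexTransformer splitBy approach, not a replacement for it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class name; it would be referenced from the entity's
// transformer attribute by its fully qualified name.
class LanguageFieldTransformer {

    // Assumption: each code listed here has a matching text_<code> field in the schema.
    private static final Set<String> SUPPORTED = Set.of("en", "fr", "zh");

    // Copies "text" into text_<language_code> and splits "tags" on commas.
    public Object transformRow(Map<String, Object> row) {
        Object code = row.get("language_code");
        Object text = row.get("text");
        if (code != null && text != null && SUPPORTED.contains(code.toString())) {
            row.put("text_" + code.toString(), text);
        }

        Object tags = row.get("tags");
        if (tags != null) {
            List<String> values = new ArrayList<>();
            for (String tag : tags.toString().split(",")) {
                String trimmed = tag.trim();
                if (!trimmed.isEmpty()) {
                    values.add(trimmed);
                }
            }
            // A List value stands in for a multiValued field.
            row.put("tags", values);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("language_code", "fr");
        row.put("text", "Bonjour le monde");
        row.put("tags", "blue, green, yellow");

        new LanguageFieldTransformer().transformRow(row);

        System.out.println(row.get("text_fr")); // Bonjour le monde
        System.out.println(row.get("tags"));    // [blue, green, yellow]
    }
}
```

Rows with an unsupported or missing language_code are passed through without any text_<code> field, so the set of languages can grow by editing SUPPORTED alone.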
Re: DIH for multilingual index multiValued field?
On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan iori...@yahoo.com wrote:
> For (1) you probably need to write a custom transformer. [...]
>
> For (2), it doable with regex transformer. [...]
>
> [1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

In my opinion, this is a bit of overkill. Since the DIH supports multiple entities, with no real limit on the SQL queries, I think the easiest (and least involved) approach would be to create three entities, one for each language the OP wishes to index:

    <entity name="english"
            query="SELECT * FROM documents WHERE language_code='en'"
            transformer="RegexTransformer">
      <field name="text_en" column="text" />
      <field column="tags" splitBy="," />
    </entity>

    <entity name="french"
            query="SELECT * FROM documents WHERE language_code='fr'"
            transformer="RegexTransformer">
      <field name="text_fr" column="text" />
      <field column="tags" splitBy="," />
    </entity>

    <entity name="chinese"
            query="SELECT * FROM documents WHERE language_code='zh'"
            transformer="RegexTransformer">
      <field name="text_zh" column="text" />
      <field column="tags" splitBy="," />
    </entity>

But I admit that, depending on future growth in the number of languages, as well as other factors (e.g., needing more specific logic), a programmatic approach might be warranted.

I would recommend, however, that the database table be a little more normalized. Your definition for tags is quite limiting, and could be better served using a many-to-many relationship.
Something like the following might serve you well:

    CREATE TABLE documents (
      id INT NOT NULL AUTO_INCREMENT,
      language_code CHAR(2),
      tags CHAR(30),
      text TEXT,
      PRIMARY KEY (id)
    );

    CREATE TABLE document_tags (
      id INT NOT NULL AUTO_INCREMENT,
      tag CHAR(30),
      PRIMARY KEY (id)
    );

    CREATE TABLE document_tag_lookup (
      document_id INT NOT NULL,
      tag_id INT NOT NULL,
      PRIMARY KEY (document_id, tag_id)
    );

Then in the DIH, you simply nest a second entity to look up the zero or more tags that might be associated with each document. Taking the english entity from above:

    <entity name="english"
            query="SELECT * FROM documents WHERE language_code='en'"
            transformer="RegexTransformer">
      <field name="text_en" column="text" />
      <entity name="english_tags"
              query="SELECT * FROM document_tags dt INNER JOIN document_tag_lookup dtl ON (dtl.tag_id = dt.id AND dtl.document_id='${english.id}')">
        <field name="tags" column="tag" />
      </entity>
    </entity>

This would allow for growth, and is easy to maintain. Additionally, if you wanted to implement a custom transformer of your own, you could.

As an aside, and as a sort of compromise, you could also use the ScriptTransformer [1] to write a JavaScript function that handles your language logic and creates the necessary fields, without having to maintain any custom Java code.

[1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

- Ken
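For readers unsure what the nested entity above produces, the following is a small in-memory model of the result, not DIH code. It assumes that each row returned by the inner query contributes one value to the parent document's tags field, so zero rows means no tags and several rows mean a multi-valued field. The class and method names (NestedEntityModel, buildDocument) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Not part of Solr: a plain-Java model of how inner-entity rows
// end up as values on the parent document.
class NestedEntityModel {

    // parent:  one row from the outer "english" entity (columns id, text).
    // tagRows: rows the inner "english_tags" query returned for that id.
    static Map<String, Object> buildDocument(Map<String, Object> parent,
                                             List<Map<String, Object>> tagRows) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", parent.get("id"));
        doc.put("text_en", parent.get("text")); // models name="text_en" column="text"

        List<Object> tags = new ArrayList<>();
        for (Map<String, Object> tagRow : tagRows) {
            tags.add(tagRow.get("tag"));        // models name="tags" column="tag"
        }
        if (!tags.isEmpty()) {
            doc.put("tags", tags);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> parent = new HashMap<>();
        parent.put("id", 1);
        parent.put("text", "Hello world");

        List<Map<String, Object>> tagRows = new ArrayList<>();
        for (String tag : new String[] {"blue", "green", "yellow"}) {
            Map<String, Object> tagRow = new HashMap<>();
            tagRow.put("tag", tag);
            tagRows.add(tagRow);
        }

        Map<String, Object> doc = buildDocument(parent, tagRows);
        System.out.println(doc.get("tags")); // [blue, green, yellow]
    }
}
```

The point of the model is that no string splitting is involved: with the normalized tables, each tag is already its own row, so the RegexTransformer is no longer needed for tags.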
Re: DIH for multilingual index multiValued field?
On Sat, Nov 13, 2010 at 5:59 PM, Ken Stanley doh...@gmail.com wrote:
> CREATE TABLE documents (
>   id INT NOT NULL AUTO_INCREMENT,
>   language_code CHAR(2),
>   tags CHAR(30),
>   text TEXT,
>   PRIMARY KEY (id)
> );

I apologize, but I couldn't leave the typo in my last post without a follow-up; it might cause confusion. I copied the OP's original table definition and forgot to remove the tags field. My proposed definition for the documents table should be:

    CREATE TABLE documents (
      id INT NOT NULL AUTO_INCREMENT,
      language_code CHAR(2),
      text TEXT,
      PRIMARY KEY (id)
    );

- Ken