Why not take advantage of your use case? The characters of the two languages belong to different character classes.
You can index this field into a single Solr field (no copyField) and apply an analysis chain that includes the analysis for both languages: stopwords, stemmers, etc. Since every filter should apply only to its own language (e.g. an Arabic stemmer should not stem a Latin word), you can run cross-language searches on this single field.

On Mon, Apr 28, 2014 at 5:59 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> If you can throw money at the problem:
> http://www.basistech.com/text-analytics/rosette/language-identifier/ .
> Language Boundary Locator at the bottom of the page seems to be
> part/all of your solution.
>
> Otherwise, specifically for English and Arabic, you could play with
> Unicode ranges to try detecting text blocks:
> 1) Create an UpdateRequestProcessor chain that
> a) clones text into field_EN and field_AR.
> b) applies regular expression transformations that strip English or
> Arabic unicode text range correspondingly, so field_EN only has
> English characters left, etc. Of course, you need to decide what you
> want to do with occasional EN or neutral characters happening in the
> middle of Arabic text (numbers: Arabic or Indic? brackets, dashes,
> etc). But if you just index text, it might be ok even if it is not
> perfect.
> c) deletes empty fields, just in case not all of them have mix language
> 2) Use eDismax to search over both fields, each with its own processor.
>
> Regards,
> Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Fri, Apr 25, 2014 at 5:34 PM, Timothy Hill <timothy.d.h...@gmail.com>
> wrote:
> > This may not be a practically solvable problem, but the company I work for
> > has a large number of lengthy mixed-language documents - for example,
> > scholarly articles about Islam written in English but containing lengthy
> > passages of Arabic. Ideally, we would like users to be able to search both
> > the English and Arabic portions of the text, using the full complement of
> > language-processing tools such as stemming and stopword removal.
> >
> > The problem, of course, is that these two languages co-occur in the same
> > field. Is there any way to apply different processing to different words or
> > paragraphs within a single field through language detection? Is this to all
> > intents and purposes impossible within Solr? Or is another approach (using
> > language detection to split the single large field into
> > language-differentiated smaller fields, for example) possible/recommended?
> >
> > Thanks,
> >
> > Tim Hill
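
For illustration, the Unicode-range stripping described in steps 1a-1c of the quoted message can be sketched outside Solr as a small Python function. This is only a rough sketch, not an UpdateRequestProcessor: the character ranges are my own assumption about what counts as "Arabic" and "English/Latin", the function name split_by_script is hypothetical, and neutral characters (digits, punctuation) are simply left in both fields.

```python
import re

# Assumed ranges (not from the thread): the main Arabic blocks and
# basic/extended Latin letters. Arabic-Indic digits fall inside the Arabic
# block, so they are treated as Arabic here; ASCII digits are neutral.
ARABIC = r"\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF"
LATIN = r"A-Za-z\u00C0-\u024F"

_ARABIC_RUN = re.compile(f"[{ARABIC}]+")
_LATIN_RUN = re.compile(f"[{LATIN}]+")
_SPACES = re.compile(r"\s+")


def _strip(pattern, text):
    """Replace every run matched by `pattern` with a space, then tidy whitespace."""
    return _SPACES.sub(" ", pattern.sub(" ", text)).strip()


def split_by_script(text):
    """Clone `text` into field_EN / field_AR and strip the other script.

    A field is only created when the text contains at least one character
    of that script, which stands in for the "delete empty fields" step.
    """
    fields = {}
    if _LATIN_RUN.search(text):
        fields["field_EN"] = _strip(_ARABIC_RUN, text)
    if _ARABIC_RUN.search(text):
        fields["field_AR"] = _strip(_LATIN_RUN, text)
    return fields


if __name__ == "__main__":
    print(split_by_script("The word كتاب means book"))
```

Each resulting field could then be indexed with its own language-specific analysis and searched together, as in step 2. How to treat digits, brackets and dashes inside Arabic passages remains the open decision mentioned in the quoted message.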