Helmut,

I recently submitted SOLR-2549 (https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width and delimited flat files. To be honest, I only needed fixed-width support for my app, so it might not support everything you mention for delimited files, but it should be a good start.

In particular, you might need to enhance this to handle the double quotes. I had thought a delimiter regex along these lines might handle it:

(?:[\"]?[,]|[\"]$)

Note that this is a sample I just cooked up quickly and it no doubt has errors, and maybe, as you say, a simple regex might not work at all. I also didn't do anything with encodings, but I'm not sure this will be an issue either.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
Sent: Thursday, June 09, 2011 2:32 PM
To: solr-user@lucene.apache.org
Subject: Processing/Indexing CSV

Hi,

there seems to be no way to index CSV using the DataImportHandler. Using a combination of LineEntityProcessor<http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor> and RegexTransformer<http://wiki.apache.org/solr/DataImportHandler#RegexTransformer> as proposed in http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/ does not work for real-world CSV files. E.g. many CSV files have double quotes enclosing some but not all columns; there is no elegant way to segment this using a simple regular expression.

As CSV is still very common, esp. in E-Commerce scenarios, I propose that Solr provide a CSVEntityProcessor that:

1) Handles CSV files with no columns, all columns, or only some columns enclosed in double quotes
2) Allows for a configurable column separator (';', ',', '\t' etc.)
3) Allows for a leading row containing column headings
4) If there is a leading row with column headings, provides a way to address columns by their column names and map them to Solr fields (similar to the XPathEntityProcessor)
5) Auto-detects the encoding of the file (UTF-8 etc.)

This would make it A LOT easier to use Solr for E-Commerce scenarios.

If there is no such entity processor in the works, I will develop one, so please let me know.

Regards
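
For illustration only, here is a minimal sketch of the quote-aware splitting that points 1) to 4) of the proposal call for, and that a single delimiter regex has trouble expressing. It is written in Python for brevity and is not taken from SOLR-2549 or from any existing Solr class; the names split_csv_line and rows_by_header are hypothetical. It assumes one record per line (no line breaks inside quoted columns), treats a doubled quote inside a quoted column as a literal quote, and does nothing about encoding detection (point 5).

```python
from typing import Dict, List


def split_csv_line(line: str, sep: str = ",", quote: str = '"') -> List[str]:
    """Split one CSV line, honoring optionally quoted columns (hypothetical helper)."""
    fields: List[str] = []
    buf: List[str] = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    buf.append(quote)  # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)  # separators inside quotes are plain text
        elif ch == quote:
            in_quotes = True  # opening quote
        elif ch == sep:
            fields.append("".join(buf))  # separator ends the current column
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf))  # last column has no trailing separator
    return fields


def rows_by_header(lines: List[str], sep: str = ",") -> List[Dict[str, str]]:
    """Use the first line as column headings and key each row by column name."""
    header = split_csv_line(lines[0], sep)
    return [dict(zip(header, split_csv_line(line, sep))) for line in lines[1:]]
```

With this sketch, split_csv_line('1,"Smith, John",42') keeps the comma inside the quoted column, and rows_by_header(['id,name', '1,"Smith, John"']) yields rows keyed by 'id' and 'name', which could then be mapped to Solr field names.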