
I recently submitted SOLR-2549 
(https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width 
and delimited flat files.  To be honest, I only needed fixed-width support for 
my app so this might not support everything you mention for delimited files, 
but it should be a good start.  

In particular, you might need to enhance this to handle the double quotes. I 
had thought a delimiter regex along these lines might handle it:  
(?:[\"]?[,]|[\"]$)  
Note that this is a sample I just cooked up quickly and it no doubt has errors, 
and maybe, as you say, a simple regex might not work at all. I also didn't do 
anything with encodings, but I'm not sure this will be an issue either.
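
To make the concern about quoted columns concrete, here is a minimal sketch in Python (not the SOLR-2549 code, which is Java). It applies the sample delimiter regex above to a made-up line with a comma inside a quoted column and compares the result with Python's standard csv module. The sample line and the helper names naive_split and csv_split are assumptions for illustration only.

```python
import csv
import re

# The sample delimiter regex from the message, written as a Python raw string.
DELIMITER = re.compile(r'(?:["]?[,]|["]$)')


def naive_split(line):
    """Split a line on the delimiter regex, ignoring quoting rules."""
    return DELIMITER.split(line)


def csv_split(line):
    """Split a line with a real CSV parser that understands quoted columns."""
    return next(csv.reader([line]))


# Hypothetical input: the second column is quoted and contains a comma.
line = '1,"Widget, large",9.99'

print(naive_split(line))  # the quoted column is broken at its embedded comma
print(csv_split(line))    # the quoted column is kept as one value
```

The regex has no notion of being "inside" a quoted column, so it splits at the embedded comma as well. A parser that tracks quote state avoids this, which is probably the safer basis for a delimited-file processor.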

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com] 
Sent: Thursday, June 09, 2011 2:32 PM
To: solr-user@lucene.apache.org
Subject: Processing/Indexing CSV


There seems to be no way to index CSV using the DataImportHandler.

Using a combination of
proposed in
not working for real world CSV files.

E.g. many CSV files have double-quotes enclosing some but not all columns -
there is no elegant way to segment this using a simple regular expression.

As CSV is still very common esp. in E-Commerce scenarios, I propose that
Solr provides a CSVEntityProcessor that:
1) Handles CSV files in which all, none, or only some columns are enclosed
in double quotes
2) Allows for a configurable column separator (';',',','\t' etc.)
3) Allows for a leading row containing column headings
4) If there is a leading row with column headings, allows columns to be
addressed by their names and mapped to Solr fields (similar to the
XPathEntityProcessor)
5) Auto-detects encoding of the file (UTF-8 etc.)
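
The five requirements above can be sketched roughly as follows. This is a Python illustration of the behaviour only, not a DataImportHandler entity processor. The function names (decode_bytes, read_csv_rows), the field_map parameter, and the encoding fallback order are all assumptions, and the "auto-detection" is just a naive try-in-order heuristic.

```python
import csv
import io


def decode_bytes(raw, candidates=("utf-8-sig", "cp1252")):
    """Naive encoding guess: try each candidate in order (assumed order)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails.
    return raw.decode("latin-1")


def read_csv_rows(raw, separator=",", has_header=True, field_map=None):
    """Parse CSV bytes into a list of dicts keyed by target field name.

    separator  -- column separator such as ';', ',' or '\t'   (requirement 2)
    has_header -- whether the first row holds column headings  (requirement 3)
    field_map  -- hypothetical {column name: target field}     (requirement 4)
    """
    text = decode_bytes(raw)                                  # requirement 5
    # csv.reader handles quoted, unquoted, and mixed columns   (requirement 1)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=separator)
    rows = iter(reader)
    header = next(rows, None) if has_header else None
    docs = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        # Without a header row, fall back to positional names col0, col1, ...
        names = header if header is not None else [
            "col%d" % i for i in range(len(row))
        ]
        doc = dict(zip(names, row))
        if field_map:
            doc = {field_map.get(name, name): value for name, value in doc.items()}
        docs.append(doc)
    return docs


# Hypothetical sample data: semicolon-separated, only one column quoted.
sample = 'sku;title;price\n1;"Widget; large";9.99\n2;Gadget;4.50\n'.encode("utf-8")
for doc in read_csv_rows(sample, separator=";", field_map={"title": "name_s"}):
    print(doc)
```

Reliable encoding detection generally needs statistical analysis of the bytes, so a real implementation would likely let the encoding be configured explicitly and treat detection as a fallback.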

This would make it A LOT easier to use Solr for E-Commerce scenarios.

If there is no such entity processor in the works, I will develop one, so
please let me know.

