[ 
https://issues.apache.org/jira/browse/IO-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688086#action_12688086
 ] 

haruhiko nishi commented on IO-200:
-----------------------------------

I didn't mean to provoke every one by the last comment I made. I just came up 
with an idea and I wanted to share. Thats's all and nothing more.
I needed to work on CSV file with different multibyte characters that are 
superset of ASCII and csv libraries in sandbox has some issue handling them.
Even the RFC 4180 ABNF fails to cover non ascii characters(i.e. mulibyte 
characters) as possible values of CSV and I modifed in the code above.(I neede 
to use MS932 and JIS208 to filter out some platform independent characters) I 
have not looked at the flatfile sandbox, but I'll look at it.
Thank you Matt.(I could have used antlr for the Parsing part I did in the code 
above,but I think antlr is overkill if it could be done custom LL parser.
yours definately need it as it discussed something about DSL in document.)
 My approach may be totally wrong but I'd be prealsed if any part of this code 
snippet shared some idea
with comitters of apache commented here.


> CSV component
> -------------
>
>                 Key: IO-200
>                 URL: https://issues.apache.org/jira/browse/IO-200
>             Project: Commons IO
>          Issue Type: New Feature
>          Components: Utilities
>            Reporter: haruhiko nishi
>            Priority: Trivial
>
> TableBuilder is 'Builder ' that maps the CSV to a matrix and provides 
> interface that allows user to manipulate after it is build by parsing a csv 
> file to it parse() method.(There is only one method implemented and it is for 
> copying a column values to another position, as I could not  think of other 
> operation that may be useful)
> Within the TableBuilder, each column of the CSV is represented as byte[] and 
> each becomes a target to be validated against Rule,represented by the 
> interface that you find in the example code below. As TableBuilder 
> "buildTable" ,when parse() method is invoked, a byte[]  representation of the 
> value of each CSV cell  is passed to isValid() method of implementations of 
> Rules, which you apply to the TableBuilder instance through the addRule() 
> method. (you you can add as many Rule as you need.)
> Rule gets executed until the validation fails or succeeds. If any of the Rule 
> fails, then its replace() is called and the column value being processed gets 
> replaced by the retun value of this method.
> Another goodie is that it is possible to refer to the values of preceding 
> cell values of the row within a Rule.
> It is useful if you need to see the entries of the preceding cell when 
> validating the value in a Rule. An example would be,
> Given a csv,
> A,B,C
> 1,2,3
> in order for the value  3 of the column C is to be validated true, the Value 
> of A needs to be less than the value of C.
> TableBuilder is RFC 4180 compliant and therefore distinguishes NL exists by 
> itself and NL found in double quotes.
> So you can add Rule that practically removes all NL chars found in value 
> enclosed within doublequotes. 
> (useful when you need to remove CRLF in double quotes from  CSV exported from 
> Excel)
> Currently, TableBuilder implements a method called copyColumn with method 
> signature of,
> copyColumn(Rule rule,int from,int to, boolean override) which allows user to 
> manipulate the parsed csv.
> What it does is literarly copies column from that is specified  at 'from' and 
> to 'to' position of the matrix.
> If override is true, the copying colum is overriden else the column is right 
> shifted and inserted at the specified position.
> You can specify some kind of Rule here to rewrite the value being copied from 
> the origin.
> An example would be copy column value that all ends with .jpg or .gif and to 
> the position specified prefixing the column value with 
> "http://some.server.com/imanges."; after checking the image exists, after 
> checking that the named file exists at some location also by an 
> implementation of  another Rule.
> TableBuilder is just a "rough skecth" idea of CSV parsing.(The code below 
> works fine though) it still needs alot of refactoring and so.
> I appreciate any comment on this idea. What do you think? My code style sucks 
> I know! 
> Here is simple exampe to use TableBuilder.
> {code:title=TableBuilder|borderStyle=solid}
>     public static void main(String[] args)throws Exception{
>         TableBuilder tableBuilder=new TableBuilder("UTF-8",
>                 new MessageHandler(){
>                     public void handleMessage(String message) {
>                         System.err.println(message);
>                     }
>                 },0,true);
>         tableBuilder.addRule(3,new RemoveNLChars()); //removing NL cahracters 
> found in value.
>         tableBuilder.parse(new FileInputStream("test.txt"),TableBuilder.CSV);
>         List<Record> list=tableBuilder.getRowAsListOf(Record.class);
>         for(Record record:list)
>             System.out.println(record.getA());//TODO not implemented yet!
>         tableBuilder.writeTo(new 
> FileOutputStream("test_mod.txt"),TableBuilder.CSV);
>     }
> public class RemoveNLChars extends StringValueRuleAdapter {
>     protected boolean isValid(String columnValue) {
>         return !columnValue.contains(System.getProperty("line.separator"));
>     }
>     protected String replace(String columnValue) {
>         return 
> columnValue.replaceAll(System.getProperty("line.separator"),"");
>     }
>     public String getMessage() {
>         return "";
>     }
> }
> public interface Rule {
>     public void setRowReference(List<byte[]> rowReference);
>     public void setCharsetName(String charsetName);
>     boolean isValid(final byte[] columnValue);
>     byte[] replace(final byte[] columnValue);
>     String getMessage();
> }
> //StringValueruleAdapter is an adapter converts the byte[] representation of 
> the cell value.
> public  abstract class StringValueRuleAdapter implements Rule{
>     private String charsetName;
>     private List<byte[]> rowReference;
>     
>     public void setRowReference(List<byte[]> rowReference) {
>         this.rowReference=rowReference;
>     }
>     public void setCharsetName(String charsetName) {
>         this.charsetName=charsetName;
>     }
>     public final boolean isValid(final byte[] columnValue) {
>         String strValue;
>         try {
>             if(columnValue.length>0)
>                 strValue=(charsetName!=null) ? new 
> String(columnValue,charsetName) : new String(columnValue);
>             else
>                 strValue="";
>         } catch (UnsupportedEncodingException e) {
>             if(columnValue.length>0)
>                 strValue=new String(columnValue);
>             else
>                  strValue="";
>         }
>         return isValid(strValue);
>     }
>     public final byte[] replace(final byte[] columnValue) {
>         String strValue;
>         try {
>             if(columnValue.length>0)
>                 strValue=(charsetName!=null) ? new 
> String(columnValue,charsetName):new String(columnValue);
>             else
>                 strValue="";
>             return (charsetName!=null) ? 
> replace(strValue).getBytes(charsetName):replace(strValue).getBytes();
>         } catch (UnsupportedEncodingException e) {
>             if(columnValue.length>0)
>                 strValue=new String(columnValue);
>             else
>                 strValue="";
>             return replace(strValue).getBytes();
>         }
>     }
>     protected String getRowValue(int column) {
>         try {
>             return (charsetName!=null) ? new 
> String(rowReference.get(column),charsetName) :
>                     new String(rowReference.get(column));
>         } catch (UnsupportedEncodingException e) {
>             return new String(rowReference.get(column));
>         } catch(IndexOutOfBoundsException noListFound){
>             throw new IllegalArgumentException("no value exists at the 
> requested column.");
>         }
>     }
>     protected String getPrecedingRowValue(){
>         return getRowValue(rowReference.size()-1);
>     }
>     protected abstract boolean isValid(String columnValue);
>     protected abstract String replace(String columnValue);
> }
> public class TableBuilder {
>     public static int CSV=0x2c;
>     public static int TSV=0x09;
>     private Map<Integer,Set<Rule>> columnRule=new  
> HashMap<Integer,Set<Rule>>();
>     private Table currentTable;
>     private byte[] newLineChars;
>     private boolean endsWithNL;
>     private String charsetName;
>     private int rowOffset;
>     private boolean useFirstColumnAsRowName;
>     private MessageHandler msgHandler=new MessageHandler(){
>         public void handleMessage(String message) {
>             System.err.println(message);
>         }
>     };
>     public TableBuilder(String charsetName,MessageHandler msgHandler,int 
> rowOffset,boolean useFirstColumnAsRowName){
>         this.charsetName=charsetName;
>         this.rowOffset=rowOffset;
>         this.msgHandler=msgHandler;
>         this.useFirstColumnAsRowName=useFirstColumnAsRowName;
>     }
>     public TableBuilder(String charsetName){
>         this.charsetName=charsetName;
>     }
>     public TableBuilder(){
>         
>     }
>     public void addRule(int column, Rule rule){
>         Set<Rule> ruleset;
>         if((ruleset=columnRule.get(column))==null){
>             ruleset=new LinkedHashSet<Rule>();
>             columnRule.put(column,ruleset);
>         }
>         rule.setCharsetName(charsetName);
>         ruleset.add(rule);
>     }
>     public void parse(InputStream in, int delimiter)throws Exception{
>         int bytesRead;
>         byte buf[]=new byte[1024];
>         ByteArrayOutputStream outbuf=new ByteArrayOutputStream(buf.length);
>         while((bytesRead=in.read(buf,0,buf.length))!=-1)
>             outbuf.write(buf,0,bytesRead);
>         in.close();
>         ByteBuffer 
> bytebuffer=ByteBuffer.allocateDirect(outbuf.size()).put(outbuf.toByteArray());
>         bytebuffer.flip();
>         currentTable=buildTable(bytebuffer,delimiter);
>     }
>     private class Table {
>         private List<byte[]>[] columnMatrix;
>         private List<List<byte[]>> rowMatrix;
>         
>         Table(List<byte[]>[] columnMatrix,List<List<byte[]>> rowMatrix){
>             this.columnMatrix=columnMatrix;
>             this.rowMatrix=rowMatrix;
>         }
>         public int getNumOfColumns() {
>             return columnMatrix.length;
>         }
>         public int getNumOfRows(){
>             return rowMatrix.size();
>         }
>         public byte[] getValueAt(int row, int column) {
>             return columnMatrix[column].get(row);
>         }
>         public byte[] getColumnName(int column){
>             return columnMatrix[column].get(0);
>         }
>         public List<byte[]> getColumn(int column){
>             return columnMatrix[column];
>         }
>         public List<byte[]> getRow(int row){
>             return rowMatrix.get(row);
>         }
>         
>     }
>   //TODO  extract csv row as JavaBean
>     public <E> List<E> getRowAsListOf(final Class<E> clazz){
>         List<E> list=null;
>         Iterator<byte[]> header=currentTable.getRow(0).iterator();
>         for(int i=1;i<currentTable.getNumOfRows();i++){
>             try {
>                 E instance=clazz.newInstance();
>                 for(byte[] value:currentTable.getRow(i)){
>                     String name=new String(header.next());
>                     //BeanUtils.setProperty(instance,name,value);
>                 }
>                 if(list==null)
>                     list=new ArrayList<E>();
>                 list.add(instance);
>                 header=currentTable.getRow(0).iterator();
>             } catch (IllegalAccessException e) {
>                 e.printStackTrace();
>             } catch (InvocationTargetException e) {
>                 e.printStackTrace();
>             } catch (InstantiationException e) {
>                 e.printStackTrace();
>             }
>         }
>         return list;
>     }
>     public void writeTo(OutputStream out,int delimiter) throws IOException {
>         for(int i=0,j=0;i<currentTable.getNumOfRows();i++,j=0){
>             for(byte[] value:currentTable.getRow(i)){
>                 out.write(value);
>                 if(++j<currentTable.getNumOfColumns())
>                     out.write(delimiter);
>             }
>             if(i<currentTable.getNumOfRows()-1)
>                 out.write(newLineChars);
>             else{                                             
>                 if(endsWithNL)
>                     out.write(newLineChars);
>             }
>         }
>         out.close();
>     }
>      
>      public void copyColumn(Rule rule,int from,int to, boolean override) {
>             int numOfColumns=override ? 
> currentTable.getNumOfColumns():currentTable.getNumOfColumns()+1;
>             List<byte[]>[] columnMatrix=(List<byte[]>[])new 
> List[numOfColumns];
>             columnMatrix[to]=new ArrayList<byte[]>();
>             for(int i=0,j=0;i<columnMatrix.length;i++){
>                 if(i==to){
>                     for(int row=0;row<currentTable.getNumOfRows();row++){
>                         byte[] value;
>                         if(row>=rowOffset)
>                             value=currentTable.getValueAt(row,from);
>                         else
>                             value=new byte[0];
>                         if(rule!=null && row>rowOffset){
>                             rule.setCharsetName(charsetName);
>                             rule.setRowReference(currentTable.getRow(row));
>                             if(!rule.isValid(value)){
>                                 String columnName;
>                                 byte[] 
> columnNameByte=currentTable.getColumnName(from);
>                                 if(columnNameByte.length>0){
>                                     try {
>                                         if(charsetName!=null)
>                                             columnName="'"+new 
> String(columnNameByte,charsetName).trim()+"'";
>                                         else
>                                             columnName="'"+new 
> String(columnNameByte).trim()+"'";
>                                     } catch (UnsupportedEncodingException e) {
>                                         columnName="'"+new 
> String(columnNameByte).trim()+"'";
>                                     }
>                                 }else
>                                     columnName="''";
>                                 value=rule.replace(value);
>                                 String msg=rule.getMessage();
>                                     if(msg.length()>0)
>                                         try {
>                                             handleMessage(msg
>                                             .replace("${column_from}",""+from)
>                                             
> .replace("${columnName}",columnName)
>                                             .replace("${column_to}",""+(to+1))
>                                             
> .replace("${row}",useFirstColumnAsRowName ? new 
> String(currentTable.getRow(row).get(0),charsetName) : ""+(row+1)));
>                                         } catch (UnsupportedEncodingException 
> ignored) {
>                                             
>                                         }
>                             }
>                         }
>                     columnMatrix[i].add(value);
>                     if(override)
>                         currentTable.rowMatrix.get(row).remove(i);
>                     currentTable.rowMatrix.get(row).add(i,value);
>                     }
>                     if(override)
>                         ++j;
>                 }else
>                     columnMatrix[i]=currentTable.getColumn(j++);
>             }
>             currentTable=new Table(columnMatrix,currentTable.rowMatrix);
>     }
>     
>     private Table buildTable(ByteBuffer buf,int delimiter) throws 
> ParseException {
>         List<byte[]>[] columnMatrix=null;
>         List<List<byte[]>> rowMatrix=new ArrayList<List<byte[]>>();
>         int i=0,j,currentRow=0,rowIndex=0,column_count=0,column=0;
>         endsWithNL=true;
>         newLineChars=null;
>         int limit=buf.limit();
>         int pos=0;
>         while(i<limit && 
> ((j=(buf.get(i)&0xff))==0x0d||(j=(buf.get(i)&0xff))==0x0a)){
>             if(j==0x0a)
>                 ++currentRow;
>             pos=++i;
>         }
>         
>         int headRow=currentRow;
>         while(i<limit){
>             int tmp=buf.get(i) & 0xff;
>                 if(tmp==0x0a){
>                     int k=i;
>                     while(k>=0 
> &&((buf.get(k)&0xff)==0x0d||(buf.get(k)&0xff)==0x0a))
>                         --k;
>                     byte[] prev=new byte[++k-pos];
>                     
>                     buf.position(pos);
>                     buf.get(prev,0,prev.length);
>                     List<byte[]> row;
>                     try{
>                         row=rowMatrix.get(rowIndex);
>                     }catch(IndexOutOfBoundsException noListFound){
>                         rowMatrix.add(new ArrayList<byte[]>());
>                         row=rowMatrix.get(rowIndex);
>                     }
>                     if(currentRow==headRow){
>                         column_count=column;
>                         row.add(prev);
>                         columnMatrix=(List<byte[]>[])new ArrayList[column+1];
>                         Iterator<byte[]> itr;
>                         for(j=0,itr=row.iterator();j<columnMatrix.length;j++){
>                             columnMatrix[j]=new ArrayList<byte[]>();
>                             columnMatrix[j].add(itr.next());
>                         }
>                     }else if(column_count!=column){
>                         throw new ParseException("column count mismatch on 
> row ",currentRow+1);
>                     }else{
>                         Set<Rule> ruleset=columnRule.get(column);
>                         if(ruleset!=null && currentRow>rowOffset+headRow){
>                             byte[] 
> columnNameByte=rowMatrix.get(rowOffset).get(column);
>                             Rule rule=validate(ruleset,prev,row);
>                             if(rule!=null){
>                                 String columnName;
>                                 if(columnNameByte.length>0){
>                                     try {
>                                         if(charsetName!=null)
>                                             columnName="'"+new 
> String(columnNameByte,charsetName).trim()+"'";
>                                         else
>                                             columnName="'"+new 
> String(columnNameByte).trim()+"'";
>                                     } catch (UnsupportedEncodingException e) {
>                                         columnName="'"+new 
> String(columnNameByte).trim()+"'";
>                                     }
>                                 }else
>                                     columnName="''";
>                                 prev=rule.replace(prev);
>                                 String msg=rule.getMessage();
>                                 if(msg.length()>0)
>                                     try {
>                                         handleMessage(msg
>                                             .replace("${column}",""+column)
>                                             
> .replace("${columnName}",columnName.trim())
>                                             
> .replace("${row}",useFirstColumnAsRowName ? new 
> String(rowMatrix.get(rowIndex).get(0),charsetName) : ""+(currentRow+1)));
>                                     } catch (UnsupportedEncodingException 
> ignored) {
>                                     }
>                             }
>                         }
>                         columnMatrix[column].add(prev);
>                         row.add(prev);
>                     }
>                     if(newLineChars==null){
>                         newLineChars=new byte[++i-k];
>                         buf.position(k);
>                         buf.get(newLineChars,0,newLineChars.length);
>                     }else
>                         ++i;
>                 while(i<limit && 
> ((j=(buf.get(i)&0xff))==0x0d||(j=(buf.get(i)&0xff))==0x0a)){
>                     if(j==0x0a)
>                         ++currentRow;
>                     ++i;
>                 }
>                 column=0;
>                 ++currentRow;
>                 ++rowIndex;
>                 pos=i;
>             }else if(tmp==delimiter){
>                 List<byte[]> row;
>                 try{
>                     row=rowMatrix.get(rowIndex);
>                 }catch(IndexOutOfBoundsException noListFound){
>                     rowMatrix.add(new ArrayList<byte[]>());
>                     row=rowMatrix.get(rowIndex);
>                 }
>                 byte[] prev=new byte[i-pos];
>                 buf.position(pos);
>                 buf.get(prev,0,prev.length);
>                 if(currentRow==headRow)
>                     row.add(prev);
>                 else{
>                     Set<Rule> ruleset=columnRule.get(column);
>                     if(ruleset!=null && currentRow>rowOffset+headRow){
>                         byte[] 
> columnNameByte=rowMatrix.get(rowOffset).get(column);
>                             Rule rule=validate(ruleset,prev,row);
>                             if(rule!=null){
>                                 String columnName;
>                                 if(columnNameByte.length>0){
>                                     try {
>                                         if(charsetName!=null)
>                                             columnName="'"+new 
> String(columnNameByte,charsetName).trim()+"'";
>                                         else
>                                             columnName="'"+new 
> String(columnNameByte).trim()+"'";
>                                     } catch (UnsupportedEncodingException e) {
>                                         columnName="'"+new 
> String(columnNameByte).trim()+"'";
>                                     }
>                                 }else
>                                     columnName="''";
>                                 prev=rule.replace(prev);
>                         String msg=rule.getMessage();
>                         if(msg.length()>0)
>                             try {
>                                 handleMessage(msg
>                                         .replace("${column}",""+column)
>                                         
> .replace("${columnName}",columnName.trim())
>                                         
> .replace("${row}",useFirstColumnAsRowName ? new 
> String(rowMatrix.get(rowIndex).get(0),charsetName) : ""+(currentRow+1)));
>                             } catch (UnsupportedEncodingException ignored) {
>                             }
>                             }
>                     }
>                     columnMatrix[column].add(prev);
>                     row.add(prev);
>                 }
>                 ++column;
>                 pos=++i;
>             }else
>                 if((i=_ESCAPED(buf,i))==i)
>                     ++i;
>         }
>         if(pos!=limit){
>             endsWithNL=false;
>             byte[] remaining=new byte[limit-pos];
>             buf.position(pos);
>             buf.get(remaining,0,remaining.length);
>             
>             if(columnMatrix!=null){
>                 if(column_count!=column)
>                     throw new ParseException("column count mismatch on row 
> ",+1+currentRow);
>                 List<byte[]> row=rowMatrix.get(rowIndex);
>                 row.add(remaining);
>                 Set<Rule> ruleset=columnRule.get(column);
>                 if(ruleset!=null && currentRow>rowOffset+headRow){
>                     byte[] 
> columnNameByte=rowMatrix.get(rowOffset).get(column);
>                     Rule rule=validate(ruleset,remaining,row);
>                     if(rule!=null){
>                         String columnName;
>                         if(columnNameByte.length>0){
>                             try {
>                                 if(charsetName!=null)
>                                     columnName="'"+new 
> String(columnNameByte,charsetName).trim()+"'";
>                                 else
>                                     columnName="'"+new 
> String(columnNameByte).trim()+"'";
>                             } catch (UnsupportedEncodingException e) {
>                                 columnName="'"+new 
> String(columnNameByte).trim()+"'";
>                             }
>                         }else
>                             columnName="''";
>                         remaining=rule.replace(remaining);
>                         String msg=rule.getMessage();
>                         if(msg.length()>0)
>                             try {
>                                 handleMessage(msg
>                                 .replace("${column}",""+column)
>                                 .replace("${columnName}",columnName.trim())
>                                 .replace("${row}",useFirstColumnAsRowName ? 
> new String(rowMatrix.get(rowIndex).get(0),charsetName) : ""+(currentRow+1)));
>                             } catch (UnsupportedEncodingException ignored) {
>                             }
>                     }
>                 }
>                 columnMatrix[column].add(remaining);
>             }else{
>                 columnMatrix=(List<byte[]>[])new List[column+1];
>                 List<byte[]> row;
>                 try{
>                     row=rowMatrix.get(rowIndex);
>                 }catch(IndexOutOfBoundsException noListFound){
>                     rowMatrix.add(new ArrayList<byte[]>());
>                     row=rowMatrix.get(rowIndex);
>                 }
>                 row.add(remaining);
>                 Iterator<byte[]> itr;
>                 for(j=0,itr=row.iterator();j<columnMatrix.length;j++){
>                     columnMatrix[j]=new ArrayList<byte[]>(1);
>                     columnMatrix[j].add(itr.next());
>                 }
>             }
>         }
>         return new Table(columnMatrix,rowMatrix);
>     }
>     private int _ESCAPED(ByteBuffer src,int i){
>         int org=i;
>         if(i==src.limit())
>             return i;
>         int j;
>         if((j=_DQUOTE(src,i))==i)
>             return i;
>         
> for(i=j;(j=_TEXTDATA(src,i))>i||(j=_COMMA(src,i))>i||(j=_CR(src,i))>i||(j=_LF(src,i))>i||(j=_2DQUOTE(src,i))>i;)
>             i=j;
>         if(i==_DQUOTE(src,i))
>             return org;
>         return i;
>     }
>     private int _TEXTDATA(ByteBuffer src,int i){
>         if(i==src.limit())
>             return i;
>         if(_COMMA(src,i)==i && _CR(src,i)==i && _LF(src,i)==i && 
> _DQUOTE(src,i)==i)
>             return ++i;
>         return i;
>     }
>     private int _2DQUOTE(ByteBuffer src,int i) {
>         if(i==src.limit())
>             return i;
>         if(i==_DQUOTE(src,i))
>             return i;
>         if(i+1==_DQUOTE(src,i+1))
>             return i;
>         return i+2;
>     }
>     private int _DQUOTE(ByteBuffer src,int i) {
>         return _CHAR(src,i,0x22);
>     }
>     public int _LF(ByteBuffer src,int i) {
>         return _CHAR(src,i,0x0a);
>     }
>     private int _CR(ByteBuffer src,int i) {
>         return _CHAR(src,i,0x0d);
>     }
>     private int _COMMA(ByteBuffer src,int i) {
>         return _CHAR(src,i,0x2c);
>     }
>     private int _CHAR(ByteBuffer src,int i,int token){
>         if(i==src.limit())
>             return i;
>         if((src.get(i) & 0xff)==token)
>             ++i;
>         return i;
>     }
>      
>     private void handleMessage(String message) {
>         msgHandler.handleMessage(message);
>     }
>     
>     public Rule validate(Set<Rule> ruleset,byte[] value, List<byte[]> 
> rowReference) {
>         for(Rule rule:ruleset){
>             if(rule!=null){
>                 rule.setRowReference(rowReference);
>                 if(!rule.isValid(value))
>                     return rule;
>             }
>         }
>         return null;
>     }
> }
> {code} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to