Re: DIH transformers - sect 2 - SOLR-1033

2009-02-21 Thread Fergus McMenemie
I have created SOLR-1033 in JIRA to address this issue.

At 13:32 + 21/2/09, Fergus McMenemie wrote:
On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie fer...@twig.me.uk wrote:

  2) Having used TemplateTransformer to assign a value to an
 entity column, that column cannot be used in other
 TemplateTransformer operations. In my project I am
 attempting to reuse x.fileWebPath. To fix this, the
 last line of transformRow() in TemplateTransformer.java
 needs to be replaced with the following, which as well as
 'putting' the templated string in 'row' also saves it
 into the 'resolver'.

 **originally**
  row.put(column, resolver.replaceTokens(expr));
  }

 **new**
  String columnName = map.get(DataImporter.COLUMN);
  expr=resolver.replaceTokens(expr);
  row.put(columnName, expr);
  resolverMapCopy.put(columnName, expr);
  }

Isn't it better to write a custom transformer to achieve this? I did
not want a standard component to change the state of the
VariableResolver.

I am not sure what is the best way.


Noble, (Good to have email working :-)

Hmm not sure why this requires a custom transformer. Why is this not 
more in the nature of a bug fix? Also the current behavior temporarily
adds all the column names into the resolver for the duration of the 
TemplateTransformer's operation, removing them again at the end. I
do not think there is any permanent change to the state of the 
VariableResolver.

Surely if we have defined a value for a column, that value should be
temporarily available in subsequent template or regexp operations?
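
To make the point above concrete, here is a minimal sketch of the behaviour being argued for. It is not the real Solr code: TemplateSketch, replaceTokens() and transformRow() are hypothetical stand-ins, and variables are kept in one flat map keyed as "entity.column" rather than in the real VariableResolver. The resolver map is copied on entry, each templated value is added to the copy as it is produced, and the caller's map is left untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for DIH template handling; not the real Solr classes.
class TemplateSketch {
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${name} with its value from vars; unknown names become "".
    static String replaceTokens(String template, Map<String, String> vars) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // templates maps column name -> template, in config order.
    // outerVars is copied, so the caller's map is never modified.
    static Map<String, String> transformRow(String entity,
                                            Map<String, String> templates,
                                            Map<String, String> outerVars) {
        Map<String, String> scoped = new LinkedHashMap<>(outerVars);
        Map<String, String> row = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : templates.entrySet()) {
            String value = replaceTokens(e.getValue(), scoped);
            row.put(e.getKey(), value);
            // The proposed change: make the new value visible to later templates.
            scoped.put(entity + "." + e.getKey(), value);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> outer = new LinkedHashMap<>();
        outer.put("jc.fileAbsolutePath", "/content/a.xml");

        Map<String, String> templates = new LinkedHashMap<>();
        templates.put("fileAbsolutePath", "${jc.fileAbsolutePath}");
        templates.put("key", "${x.fileAbsolutePath}#1");

        System.out.println(transformRow("x", templates, outer));
        // The outer map still has no entry for x.fileAbsolutePath.
        System.out.println(outer.containsKey("x.fileAbsolutePath"));
    }
}
```

Because the values are added only to the scoped copy, the second template can see the first column's value while nothing leaks out once transformRow() returns.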

Fergus.



<dataConfig>
  <dataSource name="myfilereader" type="FileDataSource"/>
  <document>
    <entity name="jc"
            processor="FileListEntityProcessor"
            fileName="^.*\.xml$"
            newerThan="'NOW-1000DAYS'"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir="/Volumes/spare/ts/solr/content">

      <entity name="x"
              dataSource="myfilereader"
              processor="XPathEntityProcessor"
              url="${jc.fileAbsolutePath}"
              rootEntity="true"
              stream="false"
              forEach="/record | /record/mediaBlock"
              transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">

        <field column="fileAbsolutePath" template="${jc.fileAbsolutePath}" />
        <field column="fileWebPath" regex="${x.test}(.*)"
               replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
        <field column="title" xpath="/record/title" />
        <field column="para1" name="para" xpath="/record/sect1/para" />
        <field column="para2" name="para" xpath="/record/list/listitem/para" />
        <field column="pubdate"
               xpath="/record/metadata/da...@qualifier='pubDate']"
               dateTimeFormat="MMdd" />

        <field column="vurl"
               xpath="/record/mediaBlock/mediaObject/@vurl" />
        <field column="imgSrcArticle"
               template="${dataimporter.request.fordinstalldir}" />
        <field column="imgCpation" xpath="/record/mediaBlock/caption" />

        <field column="test"
               template="${dataimporter.request.contentinstalldir}" />
        <!-- **problem is that vurl is just a fragment of the info needed to access
             the picture. -->
        <field column="imgWebPathICON" regex="(.*)/.*"
               replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
        <field column="imgWebPathFULL" regex="(.*)/.*"
               replaceWith="$1/imagery/${x.vurl}.jpg" sourceColName="fileWebPath"/>
        <field column="vdkvgwkey"
               template="${jc.fileAbsolutePath}#${x.vurl}" />
      </entity>
    </entity>
  </document>
</dataConfig>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH transformers - sect 2

2009-02-17 Thread Fergus McMenemie
On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie fer...@twig.me.uk wrote:

  2) Having used TemplateTransformer to assign a value to an
 entity column, that column cannot be used in other
 TemplateTransformer operations. In my project I am
 attempting to reuse x.fileWebPath. To fix this, the
 last line of transformRow() in TemplateTransformer.java
 needs to be replaced with the following, which as well as
 'putting' the templated string in 'row' also saves it
 into the 'resolver'.

 **originally**
  row.put(column, resolver.replaceTokens(expr));
  }

 **new**
  String columnName = map.get(DataImporter.COLUMN);
  expr=resolver.replaceTokens(expr);
  row.put(columnName, expr);
  resolverMapCopy.put(columnName, expr);
  }

Isn't it better to write a custom transformer to achieve this? I did
not want a standard component to change the state of the
VariableResolver.

I am not sure what is the best way.


Noble, (Good to have email working :-)

Hmm not sure why this requires a custom transformer. Why is this not 
more in the nature of a bug fix? Also the current behavior temporarily
adds all the column names into the resolver for the duration of the 
TemplateTransformer's operation, removing them again at the end. I
do not think there is any permanent change to the state of the 
VariableResolver.

Surely if we have defined a value for a column, that value should be
temporarily available in subsequent template or regexp operations?

Fergus.





DIH transformers

2009-02-16 Thread Fergus McMenemie
Hello.

I have been beating my head around the data-config.xml listed
at the end of this message. It breaks in a few different ways.

  1) I have bodged TemplateTransformer to allow it to return
 when one of the variables is undefined. This ensures my
 uniqueKey is always defined. But thinking more on
 Noble's comments, there is use in having it work both ways,
 i.e. leaving the column undefined or replacing the variable
 with "". I still like my idea about using the default
 value of a Solr field from schema.xml, but I can't figure
 out how/where to best implement it.

  2) Having used TemplateTransformer to assign a value to an
 entity column, that column cannot be used in other
 TemplateTransformer operations. In my project I am
 attempting to reuse x.fileWebPath. To fix this, the
 last line of transformRow() in TemplateTransformer.java
 needs to be replaced with the following, which as well as
 'putting' the templated string in 'row' also saves it
 into the 'resolver'.

 **originally**
  row.put(column, resolver.replaceTokens(expr));
  }

 **new**
  String columnName = map.get(DataImporter.COLUMN);
  expr=resolver.replaceTokens(expr);
  row.put(columnName, expr);
  resolverMapCopy.put(columnName, expr);
  }

 As an aside I think I ran into the issues covered by
 SOLR-993. It took a while to figure out I could not add
 a single column name/value to the resolver. I had instead
 to add to the map that was already stored within the
 resolver.

  3) No entity column names can be used within RegexTransformer.
 I guess all the stuff that was added to TemplateTransformer
 to allow column names to be used in templates needs to be re-added
 into RegexTransformer. I am doing that now... but am confused
 by the fragment of code which copies from resolverMap into
 resolverMapCopy. As best I can see resolverMap is always
 empty; but I am barely able to follow the code! Can somebody
 explain when/why resolverMap would be populated?

 Also, I begin to understand comments made by Noble in
 SOLR-1001 about resolving entity attributes in
 ContextImpl.getEntityAttribute and I guess Shalin was
 right as well. However it also seems wrong that at the
 top of every transformer we are going to repeat the
 same code to load the resolver with information about the 
 entity.

  4) In that I am reusing template output within other templates,
 the order of execution becomes important. Can I assume that
 the explicitly listed columns in an entity are processed by
 the various transformers in the order they appear within
 data-config.xml? I *think* that the list of columns within
 an entity as returned by getAllEntityFields() is actually
 an ArrayList, which I think is order dependent. Is this
 correct?

  5) Should I raise this as a single JIRA issue?

  6) Having played with this stuff, I was going to add a bit
 more to the wiki highlighting some of the possibilities
 and issues with transformers. But want to check with the 
 list first!
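
Point 3 above could be approached roughly as follows. This is only a sketch under stated assumptions, not the RegexTransformer source: RegexSketch, resolveReplacement() and apply() are hypothetical names, and variables live in a flat map keyed as "entity.column". The idea is to resolve ${...} references in replaceWith first, keeping group references such as $1 intact, and only then apply the regex to the source column.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the real RegexTransformer is structured differently.
class RegexSketch {
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    // Substitute ${name} references in a replaceWith string.
    // Group references like $1 are not touched; substituted values are
    // escaped so that any '$' or '\' in them is treated literally later on.
    static String resolveReplacement(String replaceWith, Map<String, String> vars) {
        Matcher m = TOKEN.matcher(replaceWith);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(replaceWith, last, m.start());
            String value = vars.getOrDefault(m.group(1), "");
            out.append(Matcher.quoteReplacement(value));
            last = m.end();
        }
        out.append(replaceWith.substring(last));
        return out.toString();
    }

    // Resolve variables first, then run the regex replacement on the source value.
    static String apply(String regex, String replaceWith, String source,
                        Map<String, String> vars) {
        String replacement = resolveReplacement(replaceWith, vars);
        return Pattern.compile(regex).matcher(source).replaceAll(replacement);
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("x.vurl", "pic42"); // made-up value for illustration
        System.out.println(apply("(.*)/.*", "$1/imagery/${x.vurl}s.jpg",
                                 "/ford/docs/a.xml", vars));
    }
}
```

A variable used inside the regex attribute itself, as with ${x.test} in the config below, would need similar treatment, with the substituted value passed through Pattern.quote() so it is matched literally.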


   dataConfig
   dataSource name=myfilereader type=FileDataSource/
document
entity name=jc
   processor=FileListEntityProcessor
   fileName=^.*\.xml$
   newerThan='NOW-1000DAYS'
   recursive=true
   rootEntity=false
   dataSource=null
   baseDir=/Volumes/spare/ts/solr/content
   
entity name=x
  dataSource=myfilereader
  processor=XPathEntityProcessor
  url=${jc.fileAbsolutePath}
  rootEntity=true
  stream=false
  forEach=/record | /record/mediaBlock
  
transformer=DateFormatTransformer,TemplateTransformer,RegexTransformer

field column=fileAbsolutePath   template=${jc.fileAbsolutePath} /
field column=fileWebPathregex=${x.test}(.*) 
replaceWith=/ford$1 sourceColName=fileAbsolutePath/
field column=title  xpath=/record/title /
field column=para1 name=para  xpath=/record/sect1/para /
field column=para2 name=para  xpath=/record/list/listitem/para /
field column=pubdate
xpath=/record/metadata/da...@qualifier='pubDate'] dateTimeFormat=MMdd   
/

field column=vurl   
xpath=/record/mediaBlock/mediaObject/@vurl /
field column=imgSrcArticle  
template=${dataimporter.request.fordinstalldir} /
field column=imgCpation xpath=/record/mediaBlock/caption  /

field column=test   
template=${dataimporter.request.contentinstalldir} /
!-- **problem is that vurl is just a fragment of the info needed to access the 
picture. --
field column=imgWebPathICON regex=(.*)/.* 
replaceWith=$1/imagery/${x.vurl}s.jpg sourceColName=fileWebPath/
field column=imgWebPathFULL regex=(.*)/.* 

Re: DIH transformers

2009-02-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie fer...@twig.me.uk wrote:
 Hello.

 I have been beating my head around the data-config.xml listed
 at the end of this message. It breaks in a few different ways.

  1) I have bodged TemplateTransformer to allow it to return
 when one of the variables is undefined. This ensures my
 uniqueKey is always defined. But thinking more on
 Noble's comments, there is use in having it work both ways,
 i.e. leaving the column undefined or replacing the variable
 with "". I still like my idea about using the default
 value of a Solr field from schema.xml, but I can't figure
 out how/where to best implement it.
When a value is missing from the template we may end up
constructing a partial string, which may not be desired. If we leave it
out as empty, then Solr would automatically put in the default value
and it should be solved. Just in case you wish to know the
default value in the schema.xml, you can get it from the API:
fields = context.getAllEntityFields();
String defval = fields.get(0).get(defaultvalue);
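
The "leave it out" option described above might look something like the sketch below. It is an illustration only: MissingVariableSketch, resolveStrict() and putIfResolved() are hypothetical names, not part of the DIH API, and variables are held in a plain map. If any variable in the template is undefined, the column is simply not added to the row, so no partial string is produced and a schema default could then apply.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the real TemplateTransformer.
class MissingVariableSketch {
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    // Returns Optional.empty() as soon as one variable is undefined,
    // instead of building a partial string.
    static Optional<String> resolveStrict(String template, Map<String, String> vars) {
        Matcher m = TOKEN.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String value = vars.get(m.group(1));
            if (value == null) {
                return Optional.empty();
            }
            out.append(template, last, m.start()).append(value);
            last = m.end();
        }
        out.append(template.substring(last));
        return Optional.of(out.toString());
    }

    // Only set the column when every variable could be resolved.
    static void putIfResolved(Map<String, Object> row, String column,
                              String template, Map<String, String> vars) {
        resolveStrict(template, vars).ifPresent(v -> row.put(column, v));
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("jc.fileAbsolutePath", "/a.xml");
        Map<String, Object> row = new HashMap<>();
        // x.vurl is undefined here, so the column is left out of the row.
        putIfResolved(row, "vdkvgwkey", "${jc.fileAbsolutePath}#${x.vurl}", vars);
        System.out.println(row.containsKey("vdkvgwkey"));
    }
}
```

The other behaviour discussed in point 1, substituting an empty string, would replace the early return with an empty value; which of the two is preferable depends on whether the column is a uniqueKey.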

  2) Having used TemplateTransformer to assign a value to an
 entity column, that column cannot be used in other
 TemplateTransformer operations. In my project I am
 attempting to reuse x.fileWebPath. To fix this, the
 last line of transformRow() in TemplateTransformer.java
 needs to be replaced with the following, which as well as
 'putting' the templated string in 'row' also saves it
 into the 'resolver'.

 **originally**
  row.put(column, resolver.replaceTokens(expr));
  }

 **new**
  String columnName = map.get(DataImporter.COLUMN);
  expr=resolver.replaceTokens(expr);
  row.put(columnName, expr);
  resolverMapCopy.put(columnName, expr);
  }

Isn't it better to write a custom transformer to achieve this? I did
not want a standard component to change the state of the
VariableResolver.

I am not sure what is the best way.


 As an aside I think I ran into the issues covered by
 SOLR-993. It took a while to figure out I could not add
 a single column name/value to the resolver. I had instead
 to add to the map that was already stored within the
 resolver.

  3) No entity column names can be used within RegexTransformer.
 I guess all the stuff that was added to TemplateTransformer
 to allow column names to be used in templates needs to be re-added
 into RegexTransformer. I am doing that now... but am confused
 by the fragment of code which copies from resolverMap into
 resolverMapCopy. As best I can see resolverMap is always
 empty; but I am barely able to follow the code! Can somebody
 explain when/why resolverMap would be populated?

The behavior is like this: the expression ${currentEntity.colName}
does not work automatically, because the row is not added to the
VariableResolver. TemplateTransformer has hacked the stuff to make it
work.

We can think of modifying this behavior.

 Also, I begin to understand comments made by Noble in
 SOLR-1001 about resolving entity attributes in
 ContextImpl.getEntityAttribute and I guess Shalin was
 right as well. However it also seems wrong that at the
 top of every transformer we are going to repeat the
 same code to load the resolver with information about the
 entity.

  4) In that I am reusing template output within other templates,
 the order of execution becomes important. Can I assume that
 the explicitly listed columns in an entity are processed by
 the various transformers in the order they appear within
 data-config.xml? I *think* that the list of columns within
 an entity as returned by getAllEntityFields() is actually
 an ArrayList, which I think is order dependent. Is this
 correct?

IT IS CORRECT

  5) Should I raise this as a single JIRA issue?
Do not add ONE issue for all. If they are logically connected, put all
of them into one. If not, split them into as many issues as possible.

  6) Having played with this stuff, I was going to add a bit
 more to the wiki highlighting some of the possibilities
 and issues with transformers. But want to check with the
 list first!


   dataConfig
   dataSource name=myfilereader type=FileDataSource/
document
entity name=jc
   processor=FileListEntityProcessor
   fileName=^.*\.xml$
   newerThan='NOW-1000DAYS'
   recursive=true
   rootEntity=false
   dataSource=null
   baseDir=/Volumes/spare/ts/solr/content
   
entity name=x
  dataSource=myfilereader
  processor=XPathEntityProcessor
  url=${jc.fileAbsolutePath}
  rootEntity=true
  stream=false
  forEach=/record | /record/mediaBlock
  
 transformer=DateFormatTransformer,TemplateTransformer,RegexTransformer

 field