On Sep 21, 2010, at 2:54 PM, Itamar Syn-Hershko wrote:

> Addressing most of the recent discussion below...
> 
> On 16/9/2010 4:24 AM, Dan Cardin wrote:
>>    1. Should the Open Relevance viewer be capable of importing text and
>>    images?
>>   
> Corpora, IMO, should be text only and index-ready (i.e. no special parsing 
> required). This is what I assumed in Orev as well (see below).

I'm not sure it needs to be text/index-ready.  That can mean different things 
to different engines.  Our first requirement is that the corpus has a known 
revision/signature, so that everyone is working from the same raw content.  
What an engine chooses to do with that content is up to the engine.  Can we 
provide tools that help make it text/index-ready?  Of course, but that is not 
a requirement, in my view.
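
For example, a known revision/signature could be as simple as a content hash 
over the raw corpus archive. A minimal Python sketch (the function name is 
mine, not from any existing tool):

```python
import hashlib

def corpus_signature(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 signature of a raw corpus file, so everyone
    can verify they are judging exactly the same content."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large corpora don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Anyone who downloads the corpus can then compare hex digests before judging, 
regardless of what their engine does with the content afterwards.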

> 
>> Is the objective of the Open Relevance Viewer to provide a crowd sourcing
>> tool that can have its data annotated and then to use the annotated data for
>> determining the performance of machine learning techniques/algorithms? Or,
>> is it to provide a generic crowd sourcing tool for academics, government, and
>> industry to annotate data with? Or am I missing the point?
>>   
> This tool should be, as Grant and Mark mentioned, engine agnostic. It should 
> provide those interested with tools to be able to judge effectiveness of 
> different engines, and also different methods with the same engine.
> 
> Hence, the most basic implementation should be able to handle many corpora 
> and topics in more than one (natural) language, and the crowd-sourcing 
> portion of it is where a user creates judgments - e.g. views a document from 
> a corpus side by side with a topic, and marks it "Relevant", "Non-relevant" 
> (or "Skip this").
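
The judgment record such a tool would store can stay very small. A sketch, 
with hypothetical names (this is not Orev's actual schema):

```python
from dataclasses import dataclass
from enum import Enum

class Judgment(Enum):
    RELEVANT = "relevant"
    NON_RELEVANT = "non-relevant"
    SKIP = "skip"  # judge declined to decide; excluded from qrels export

@dataclass(frozen=True)
class JudgmentRecord:
    judge_id: str    # who judged (needed for inter-judge agreement later)
    corpus_id: str   # which corpus revision the document came from
    topic_id: str
    doc_id: str
    judgment: Judgment
```

Keeping the judge and the corpus revision on every record is what makes the 
resulting packages reusable by engines other than the one that collected them.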
> 
> This basic implementation, after several hundred human-hours, will result 
> in packages containing corpora, topics and judgments for several languages. 
> These can then be used as the basis for more sophisticated parts of the 
> project, where relevance ranking of actual query results, TREC-like testing, 
> MAP/MRR and user-behavior tracking are just examples. In other words, IMHO 
> Grant's view is a bit too far-reaching for this stage, where there's still a 
> lot of fundamental work to do.
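
For reference, once the judgment packages exist, a metric like MAP reduces to 
a few lines over them (a hypothetical Python sketch, not code from any of the 
tools discussed):

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """AP for one topic: mean of precision@k taken at each relevant hit,
    divided by the total number of relevant documents."""
    relevant = set(relevant_doc_ids)
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids), one pair per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

The point being that the hard, expensive part is collecting the judgments; 
the evaluation layer on top is comparatively trivial.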
> 
> Robert, from the discussion we had a while ago I gathered you are thinking 
> the same?
> 
> Once such data exists in a central system, importing corpora and topics, and 
> exporting them back with judgments in various formats (TREC, CLEF, FIRE) can 
> be done fairly easily. We just need to make sure the system stores all the 
> data correctly.
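
A sketch of the export side, assuming judgments are pulled from the DB as 
(topic, doc, relevance) tuples (names are mine). A standard TREC qrels line is 
"topic iteration docno relevance", with the iteration field conventionally 0:

```python
def export_qrels(judgments, out_path):
    """Write judgments as TREC qrels lines: 'topic 0 docno rel'.

    judgments: iterable of (topic_id, doc_id, relevance) tuples,
    with relevance 0 (non-relevant) or 1 (relevant)."""
    with open(out_path, "w") as out:
        for topic_id, doc_id, rel in judgments:
            out.write(f"{topic_id} 0 {doc_id} {int(rel)}\n")
```

CLEF/FIRE exporters would just be alternative serializations of the same 
stored tuples, which is why the internal storage format doesn't need to match 
any of them.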
> 
> Sorry for bringing this up again, but I think I pretty much did most of that 
> work already, so no need for redundant efforts. In Orev I have already spec'd 
> and implemented all the above. What is missing is some better GUI and user 
> management. I suggest you have a look at it and at its DB schema: 
> http://github.com/synhershko/Orev/blob/master/Orev.png

You should supply a patch and build instructions, etc.

> 
>> How are the annotations used for judgments obtained? In a separate file, 
>> specified by the user?
>>   
> If a tool like Orev is used, then this data can be pulled directly from 
> its DB by the actual test tools (if they are separate).
> 
>> Can you provide me with a direct link to the TREC format?
> 
> http://trec.nist.gov/pubs/trec1/papers/01.txt
> 
> But if we are not going to base data storage on the filesystem, there's no 
> need to stick to a particular format, except when exporting judgments...

Right
