Hey Zhiwei,

This is going in the right direction for this sub-task. Please ensure that
your proposal addresses the following questions:

- Roughly what would the intermediate input format look like? (For one
purely illustrative possibility, see the sketch after this list.)
- Would the intermediate file go straight into HDFS?
- How do you plan to limit the negative impact of this abstraction on the
indexing process? A small performance hit for indexing is probably
unavoidable, but this is an important issue: even with the current system
reading directly from the XML dumps, performance is still a problem, and
the English (the biggest) version still takes several hours on a
reasonably sized cluster.
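
For the first two questions, here is a purely illustrative sketch (in
Scala) of the kind of per-article record such an intermediate format might
carry. None of these names exist in the Spotlight code base; this is only
an assumption about the shape of the data, not a proposed design:

    // Hypothetical sketch: one record per Wikipedia article, carrying what
    // the indexing pipelines need so they no longer parse the XML dumps.
    case class LinkOccurrence(
      surfaceForm: String, // anchor text as it appears in the article
      targetUri: String,   // DBpedia resource the wiki link points to
      offset: Int          // character offset in the plain text
    )

    case class ArticleRecord(
      title: String,             // article title
      uri: String,               // DBpedia resource URI of the article
      plainText: String,         // article text with wiki markup stripped
      links: Seq[LinkOccurrence] // wiki links extracted from the article
    )

Whether such records would then be stored in HDFS as e.g. Avro, sequence
files or plain TSV is exactly the kind of choice the proposal should spell
out.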

Please make sure you understand the differences between the indexing
pipelines. The Lucene backend is distinct from the pignlproc-based backend
(what we sometimes call the DB-backed core).
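
Just to illustrate what I mean by keeping them distinct (again with purely
hypothetical names, reusing the ArticleRecord sketch above): a shared
reader for the intermediate records would still feed two separate
indexers.

    // Hypothetical: both backends consume the same intermediate records,
    // but each keeps its own indexing logic.
    trait ArticleRecordSource {
      def records: Iterator[ArticleRecord] // e.g. streamed from HDFS
    }

    class LuceneIndexer(source: ArticleRecordSource) {
      def run(): Unit = source.records.foreach { r =>
        // real code would build and add Lucene documents for r here
        println(s"Lucene-indexing ${r.uri}")
      }
    }

    class DbBackedIndexer(source: ArticleRecordSource) {
      def run(): Unit = source.records.foreach { r =>
        // real code would update surface form / URI counts for the
        // DB-backed core here
        println(s"Counting occurrences in ${r.uri}")
      }
    }

Your proposal should make clear how the new format would plug into each of
these pipelines.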

Other than that, I think this sub-task is very reasonably achievable, and
as such it is not, on its own, enough for a project of this size/funding.
It also has limited benefits for the performance of the system (it will
mainly improve the architecture and flexibility), so we would also like to
see some work on improving the general performance. Finishing the
integration of the graph-based methods (mentioned on the wiki page) would
be a fairly straightforward and manageable addition to this.

Best,
Jo