Hi, everyone.
While working with the tests and thinking about how I would use Tika in
"real life", it occurred to me that the simplest use case is getting the
full text of a document without accessing the document's metadata.
Reduced to its simplest form, that becomes:
    public static String getStrContent(URL documentUrl, URL configUrl)
Or, if one already has a LiusConfig object, an alternate form would be:
    public static String getStrContent(URL documentUrl, LiusConfig config)
Does this make sense to you? If so, would this go in the Utils class?
The code for these methods is below. It depends on some proposed patches
that have not yet been committed.
- Keith
------------------------------------------------------------------
/**
 * Gets the full text (but not other properties) of the document
 * at the specified URL.
 *
 * @param documentUrl URL of the resource to parse
 * @param configUrl URL of the Tika configuration
 * @return the document's full text, or null if documentUrl is null
 */
public static String getStrContent(URL documentUrl, URL configUrl)
        throws LiusException, IOException {
    return getStrContent(documentUrl,
            LiusConfig.getInstance(configUrl));
}
/**
 * Gets the full text (but not other properties) of the document
 * at the specified URL.
 *
 * @param documentUrl URL of the resource to parse
 * @param config Tika configuration object
 * @return the document's full text, or null if documentUrl is null
 */
public static String getStrContent(URL documentUrl, LiusConfig config)
        throws LiusException, IOException {
    String fulltext = null;
    if (documentUrl != null) {
        Parser parser = ParserFactory.getParser(documentUrl, config);
        fulltext = parser.getStrContent();
    }
    return fulltext;
}
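
As a rough illustration of how a caller might use these two overloads, here
is a self-contained sketch. It does not use the real Tika classes:
StubConfig, StubParser, the local getParser method, the url helper, and the
file paths are all hypothetical stand-ins invented so the example compiles
on its own. Only the delegation from the URL overload to the config overload
and the null handling mirror the methods above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class FullTextSketch {

    /** Hypothetical stand-in for LiusConfig; not the real Tika class. */
    static final class StubConfig {
        final URL source;
        StubConfig(URL source) { this.source = source; }
        static StubConfig getInstance(URL configUrl) { return new StubConfig(configUrl); }
    }

    /** Hypothetical stand-in for Parser, reduced to the one call used here. */
    interface StubParser {
        String getStrContent();
    }

    /** Hypothetical stand-in for ParserFactory.getParser; returns canned text. */
    static StubParser getParser(URL documentUrl, StubConfig config) {
        return () -> "full text of " + documentUrl;
    }

    /** Convenience overload: loads the configuration, then delegates. */
    static String getStrContent(URL documentUrl, URL configUrl) {
        return getStrContent(documentUrl, StubConfig.getInstance(configUrl));
    }

    /** Returns the document's text, or null when no document URL is given. */
    static String getStrContent(URL documentUrl, StubConfig config) {
        if (documentUrl == null) {
            return null;
        }
        return getParser(documentUrl, config).getStrContent();
    }

    /** Helper (assumption): builds a URL without a checked exception. */
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL doc = url("file:///tmp/example.pdf");      // hypothetical path
        URL cfg = url("file:///tmp/tika-config.xml");  // hypothetical path

        // Simplest use case: one call, no metadata involved.
        System.out.println(getStrContent(doc, cfg));

        // A null document URL yields null rather than an exception.
        System.out.println(getStrContent(null, cfg));
    }
}
```

One consequence of the null handling is that callers need to check the
result before using it. Whether a null return or an exception is the better
contract for a missing document URL is a design choice the proposal leaves
open.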
--
View this message in context:
http://www.nabble.com/Convenience-Method-for-Simplest-Parse-Use-Case-tf4445667.html#a12684897
Sent from the Apache Tika - Development mailing list archive at Nabble.com.