Here is one such URL:
http://www.yummly.com/recipe/Gabis-Low-Carb-Yeast-Bread-1073667?columns=4&position=1%2F74
I'm currently able to get meta tags using the following code:
HTTPDocumentSource doc = new HTTPDocumentSource(
        DefaultHTTPClient.createInitializedHTTPClient(), info.getId());
InputStream documentInputInputStream = doc.openInputStream();
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream,
        doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
NodeList nl = document.getElementsByTagName("meta");
for (int i = 0; i < nl.getLength(); i++) {
    //System.out.println(nl.item(i).getNodeType());
    //System.out.println(nl.item(i).getNodeName());
    Element e = (Element) nl.item(i);
    String name = e.getAttribute("property");
    if (name == null || name.trim().length() == 0) {
        name = e.getAttribute("name");
    }
    if (name == null || name.trim().length() == 0) {
        name = e.getAttribute("itemprop");
    }
    if (name != null && name.trim().length() > 0) {
        String value = e.getAttribute("content");
        logger.info(name + " " + value);
        info.addInfo("meta_" + name, value);
    }
}
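To make the lookup easier to test without hitting the network, the attribute fallback above (property, then name, then itemprop) can be pulled out into a small standalone sketch. This is only an illustration under some assumptions: the class and method names (MetaTagSketch, metaKey, collectMeta, parse) are hypothetical, the sample markup and its values are made up, and the JDK's own XML parser stands in for TagSoupParser.getDOM(), so the input has to be well-formed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class; not part of Any23.
class MetaTagSketch {

    // Attributes that can carry the key of a <meta> tag, in lookup order.
    private static final String[] KEY_ATTRIBUTES = {"property", "name", "itemprop"};

    /** Returns the first non-blank key attribute, or null if there is none. */
    static String metaKey(Element meta) {
        for (String attribute : KEY_ATTRIBUTES) {
            // Element.getAttribute returns "" (not null) for a missing attribute.
            String value = meta.getAttribute(attribute).trim();
            if (!value.isEmpty()) {
                return value;
            }
        }
        return null;
    }

    /** Collects "meta_" + key -> content for every keyed <meta> element. */
    static Map<String, String> collectMeta(Document document) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList metas = document.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            String key = metaKey(meta);
            if (key != null) {
                result.put("meta_" + key, meta.getAttribute("content"));
            }
        }
        return result;
    }

    /** Parses a well-formed snippet; a stand-in for TagSoupParser.getDOM(). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup, not taken from the page above.
        String sample = "<html><head>"
                + "<meta property=\"og:title\" content=\"Sample title\"/>"
                + "<meta name=\"description\" content=\"Sample description\"/>"
                + "<meta charset=\"utf-8\"/>"
                + "</head></html>";
        collectMeta(parse(sample)).forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```

Open Graph tags carry their key in the property attribute, so they should be picked up by the first lookup as long as the DOM actually contains them; a meta element with none of the three attributes (such as the charset one) is skipped.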
On Mon, Dec 7, 2015 at 10:59 PM, Lewis John Mcgibbney <
[email protected]> wrote:
> Hi Frank,
>
> On Mon, Dec 7, 2015 at 3:50 PM, <[email protected]> wrote:
>
>>
>> I'm trying to extract meta tags from webpages. I'm using the code below
>> but am finding that only a small subset of the meta tags is being returned.
>> Meta tags that I am interested in, such as those for Facebook Open Graph,
>> are not being returned.
>>
>
> By default, the Any23 configuration [0] specifies that HTML head meta tags
> should be extracted, so there is no need to change this behaviour:
> extraction of HTML meta tags 'should' already be happening.
> You are also correctly defining this within your code as below!
> Can you please post an example of a URL we can test against?
> Thanks
> Lewis
>
> [0]
> https://github.com/apache/any23/blob/master/api/src/main/resources/default-configuration.properties#L70