[ https://issues.apache.org/jira/browse/CONNECTORS-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127558#comment-13127558 ]
Karl Wright commented on CONNECTORS-256:
----------------------------------------

So are you saying that the content only contains the end of the text? Can you email me the XML response you got from the content fetch, or attach it to this ticket? I can add it to the test suite and debug why the standard SAX XML parsing is missing data. Maybe it's CDATA or something...

> Connector for crawling Wikis
> ----------------------------
>
>                 Key: CONNECTORS-256
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-256
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Wiki connector
>    Affects Versions: ManifoldCF 0.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.4
>
>
> People have been trying to crawl wikis with ManifoldCF, but using the generic
> crawler is not a good way to do this. Instead, it looks like we really could
> use a wiki connector, which would understand the wiki API and thus crawl wiki
> content quickly and effectively.
> Some pertinent API references follow:
> I don't know if it is possible to link to a wiki document with just the
> pageid, but it is possible to get the URL for the referring pageid via the API:
> http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=27697087&inprop=url
> It is possible to get the metadata of a document using the page's ID (instead
> of its title) directly:
> Title ->
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment|content
> PageID ->
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=27697087&rvprop=timestamp|user|comment|content
> - There needs to be some notion of an overall list of pages:
>   - http://www.mediawiki.org/wiki/API:Allpages
>   - Example:
>     http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=5
> - Metadata information (author and pub date) also needs to be separated out
>   in some way:
>   - http://www.mediawiki.org/wiki/API:Properties#Revisions:_Example
>   - Example:
>     http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=API|Main%20Page&rvprop=timestamp|user|comment|content
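
On the SAX question raised in the comment: one common way for a SAX handler to end up with only the tail of an element's text is to overwrite its buffer on every characters() callback, since a parser is allowed to deliver one text node in several chunks (entity references and CDATA sections are typical split points). Whether that is what happened here is not known from this ticket. The sketch below is in Python rather than the connector's own Java, purely to illustrate the accumulate-then-join pattern. The class name RevisionHandler, the function parse_revisions, and the sample XML (including its user, timestamp, and text) are hypothetical; the sample only assumes that the revisions response wraps page text in a <rev> element with user, timestamp, and comment attributes.

```python
import xml.sax


class RevisionHandler(xml.sax.ContentHandler):
    """Collects the text and metadata of every <rev> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.revisions = []
        self._attrs = None
        self._chunks = None  # None means "not inside a <rev> element"

    def startElement(self, name, attrs):
        if name == "rev":
            self._attrs = dict(attrs.items())
            self._chunks = []

    def characters(self, content):
        # May be called several times for one element: append, never overwrite.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "rev" and self._chunks is not None:
            self.revisions.append({
                "user": self._attrs.get("user"),
                "timestamp": self._attrs.get("timestamp"),
                "comment": self._attrs.get("comment"),
                "content": "".join(self._chunks),
            })
            self._attrs = None
            self._chunks = None


def parse_revisions(xml_bytes):
    """Parse an XML response body (bytes) and return a list of revision dicts."""
    handler = RevisionHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.revisions


# Made-up sample data shaped like a revisions response; not a real API reply.
SAMPLE = (
    '<?xml version="1.0"?>'
    '<api><query><pages>'
    '<page pageid="1" title="Sample">'
    '<revisions>'
    '<rev user="Example" timestamp="2011-10-14T00:00:00Z" comment="demo">'
    'Line one &amp; more\n<![CDATA[Line two <raw>]]>\nLine three'
    '</rev>'
    '</revisions></page></pages></query></api>'
)

if __name__ == "__main__":
    for rev in parse_revisions(SAMPLE.encode("utf-8")):
        print(rev["user"], rev["timestamp"])
        print(rev["content"])
```

Because the chunks are joined only in endElement, the result does not depend on how the parser happens to split the text, and CDATA content arrives through the same characters() callback as ordinary text.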