Hi Marvin,

Importing big datasets via the RESTful API is not the intended usage of the API. There are several things that make this slower than importing via the Entityhub Indexing Tool:

(1) the POST data is held in memory,
(2) the RDF data is first held as a Clerezza Graph and only afterwards converted to SolrInputDocuments,
(3) creating the Representations requires a query over the parsed RDF data to find all subjects used in the parsed graph,
(4) ID lookups are made to ensure that POST requests do not update existing Entities and that PUT requests do not create new ones.
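For occasional updates after the initial import the REST interface is fine, as long as each request stays small enough that the in-memory buffering does not hurt. A minimal sketch of such a per-chunk POST in Java is below; the base URL, the N-Triples media type and the file name are assumptions on my side, only the endpoint path is taken from your mail, so adapt them to your setup:

    // Sketch only: POST one small N-Triples chunk to the Entityhub REST endpoint.
    // "http://localhost:8080" and the media type are assumptions - adjust to your instance.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class EntityhubChunkImport {
        public static void main(String[] args) throws Exception {
            // hypothetical small chunk split out of one of the .nt files
            byte[] chunk = Files.readAllBytes(Paths.get("chunk-0001.nt"));
            URL url = new URL("http://localhost:8080/entityhub/entity?update=true");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "application/n-triples"); // assumed RDF media type
            con.setDoOutput(true);
            try (OutputStream out = con.getOutputStream()) {
                out.write(chunk); // one small request per chunk instead of one huge POST per file
            }
            System.out.println("HTTP " + con.getResponseCode());
            con.disconnect();
        }
    }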
For your use case I would suggest using the Indexing Tool for the initial import. After it finishes you can take the Solr index located under 'indexing/destination/indexes/default' and configure a ReferencedSolrServer [1] for it. Make sure to check the solr.xml, as the instanceDir uses an absolute path. After that you can configure a ManagedSite [2] as usual. For the SolrYard configuration you need to use "{ref-solr-server-name}:{core-name}" instead of "{core-name}" for the solrUri property (a concrete example is sketched below the quoted mail).

best
Rupert

[1] http://stanbol.apache.org/docs/trunk/utils/commons-solr#referencedsolrserver
[2] http://stanbol.apache.org/docs/trunk/components/entityhub/managedsite.html

On Sun, Mar 2, 2014 at 6:55 PM, Marvin Luchs <mar...@luchs.org> wrote:
> Hi,
>
> I want to use the Content Enhancement component of Stanbol with a custom
> vocabulary which contains about 1.5 million triples. I followed the
> instructions for creating a local index using the Entityhub Indexing Tool and
> everything worked as expected. However, as those 1.5 million triples are only
> the initial import and after that, the vocabulary should be managed via the
> REST API, I would prefer to have a Managed Site or use the Entityhub itself
> for storing my RDF data. I tried importing my triples (which are distributed
> over 15 .nt files) via the /entityhub/entity?update=true endpoint, however I
> ran into problems, most likely because of the size of the import. The Java
> application which sends the REST calls to the Stanbol API returns a
> "java.net.SocketException: Unexpected end of file from server" for each file
> and even after my program finished, Stanbol is processing the submitted data
> for hours. The error log states repeatedly "PERFORMANCE WARNING: Overlapping
> onDeckSearchers=2".
>
> What would you suggest is the best approach for importing such a large amount
> of triples into the Entityhub?
>
> And furthermore, could you please explain if there's any difference between
> using a Managed Site and the Entityhub itself? From what I understood from
> the documentation, the only advantage of a Managed Site is the fact that it
> can be used to separate multiple vocabularies from each other. Is there any
> other difference?
>
> Any help is much appreciated!
>
> Best regards,
> Marvin Luchs

--
| Rupert Westenthaler    rupert.westentha...@gmail.com
| Bodenlehenstraße 11    ++43-699-11108907
| A-5500 Bischofshofen
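To make the solrUri mapping above concrete (the server name here is hypothetical; the core name "default" is the one produced by the indexing tool): if you register the ReferencedSolrServer under the name "customServer" and keep the core name "default", the SolrYard would be configured with

    solrUri = "customServer:default"

instead of just "default".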