Hi Marvin,

Importing big datasets via the RESTful API is not the intended usage
of the API. Several things make this slower than importing via the
Entityhub Indexing Tool: (1) the POST data is held in memory; (2) the
RDF data is first held as a Clerezza Graph and only then converted to
SolrInputDocuments; (3) creating Representations requires a query over
the parsed RDF data to find all subjects used in the parsed graph;
(4) ID lookups are made to ensure that POST requests do not update
existing Entities and PUT requests do not create new ones.

For your use case I would suggest using the Indexing Tool for the
initial import. After it finishes you can take the Solr index located
under 'indexing/destination/indexes/default' and configure a
ReferencedSolrServer [1] for it. Make sure to check the solr.xml: the
instanceDir uses an absolute path, so it has to be adjusted if the
index is moved to a different location.
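
To make the solr.xml check concrete, here is a minimal sketch in
Python. It assumes the legacy solr.xml layout in which each core is a
<core name="..." instanceDir="..."/> element, and that relative
instanceDir values are resolved against the directory containing
solr.xml. The function name check_instance_dirs is made up for this
example and is not part of Stanbol or Solr.

```python
import os
import sys
import xml.etree.ElementTree as ET


def check_instance_dirs(solr_xml_path):
    """Return (core name, instanceDir, is_absolute, exists) per <core>.

    Assumes the legacy solr.xml format with <core> elements that carry
    'name' and 'instanceDir' attributes.
    """
    # Assumption: relative instanceDir values are relative to the
    # directory that contains solr.xml.
    base = os.path.dirname(os.path.abspath(solr_xml_path))
    results = []
    for core in ET.parse(solr_xml_path).getroot().iter("core"):
        instance_dir = core.get("instanceDir", "")
        is_abs = os.path.isabs(instance_dir)
        resolved = instance_dir if is_abs else os.path.join(base, instance_dir)
        results.append(
            (core.get("name"), instance_dir, is_abs, os.path.isdir(resolved))
        )
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one line per core so stale absolute paths stand out.
    for name, path, is_abs, exists in check_instance_dirs(sys.argv[1]):
        kind = "absolute" if is_abs else "relative"
        state = "found" if exists else "MISSING"
        print(f"{name}: {path} ({kind}, {state})")
```

Run it with the path to the solr.xml as the only argument. A core
reported as "absolute, MISSING" is most likely still pointing at the
directory the Indexing Tool was run in.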

After that you can configure a ManagedSite [2] as usual. In the
SolrYard configuration, use "{ref-solr-server-name}:{core-name}"
instead of "{core-name}" for the solrUri property.
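
To illustrate the two forms of the solrUri value, here is a small
Python sketch. The helper solr_uri and the names "entityhub-ref" and
"default" are assumptions made up for this example; use the name you
gave your ReferencedSolrServer and the core name from your solr.xml.

```python
def solr_uri(core_name, ref_server_name=None):
    """Build the value for the SolrYard solrUri property.

    With a ReferencedSolrServer the value is
    "{ref-solr-server-name}:{core-name}"; otherwise just "{core-name}".
    """
    if not core_name:
        raise ValueError("core_name must not be empty")
    if ref_server_name:
        return f"{ref_server_name}:{core_name}"
    return core_name


# Hypothetical names, for illustration only.
print(solr_uri("default", "entityhub-ref"))  # entityhub-ref:default
print(solr_uri("default"))                   # default
```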


best
Rupert

[1] http://stanbol.apache.org/docs/trunk/utils/commons-solr#referencedsolrserver
[2] http://stanbol.apache.org/docs/trunk/components/entityhub/managedsite.html



On Sun, Mar 2, 2014 at 6:55 PM, Marvin Luchs <mar...@luchs.org> wrote:
> Hi,
>
> I want to use the Content Enhancement component of Stanbol with a custom 
> vocabulary which contains about 1.5 million triples. I followed the 
> instructions for creating a local index using the Entityhub Indexing Tool and 
> everything worked as expected. However, as those 1.5 million triples are only 
> the initial import and after that, the vocabulary should be managed via the 
> REST API, I would prefer to have a Managed Site or use the Entityhub itself 
> for storing my RDF data. I tried importing my triples (which are distributed 
> over 15 .nt files) via the /entityhub/entity?update=true endpoint, however I 
> ran into problems, most likely because of the size of the import. The Java 
> application which sends the REST calls to the Stanbol API returns a
> "java.net.SocketException: Unexpected end of file from server" for each file 
> and even after my program finished, Stanbol is processing the submitted data 
> for hours. The error log states repeatedly "PERFORMANCE WARNING: Overlapping 
> onDeckSearchers=2".
>
> What would you suggest is the best approach for importing such a large amount 
> of triples into the Entityhub?
>
> And furthermore, could you please explain, if there's any difference between 
> using a Managed Site and the Entityhub itself? From what I understood from 
> the documentation, the only advantage of a Managed Site is the fact that it 
> can be used to separate multiple vocabularies from each other. Is there any 
> other difference?
>
> Any help is much appreciated!
>
> Best regards,
> Marvin Luchs



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
