[ 
https://issues.apache.org/jira/browse/STANBOL-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018791#comment-14018791
 ] 

Rafa Haro commented on STANBOL-1125:
------------------------------------

Hi Soroka, 

I worked on this some months ago. In fact, I coded most of the necessary pieces 
as an extension of the current indexing tool. As you have pointed out, I 
assumed that the dump was sorted by subject, so I was temporarily storing each 
entity in memory until the subject changed, at which point I considered the 
entity to have been completely read. I never committed it because the 
sorted-by-subject constraint sounded like a very tough one. 
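
To illustrate the sorted-by-subject assumption, here is a minimal Python sketch 
(not the code I wrote for the indexing tool). It assumes the dump has already 
been parsed into (subject, predicate, object) tuples; the function name 
entities_from_sorted and the sample identifiers are hypothetical. 

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), assumed already parsed


def entities_from_sorted(
    triples: Iterable[Triple],
) -> Iterator[Tuple[str, Dict[str, List[str]]]]:
    """Yield one (subject, fields) record per run of consecutive equal subjects.

    Only the entity currently being read is held in memory. The result is only
    correct if all triples of a subject are contiguous in the input.
    """
    for subject, group in groupby(triples, key=lambda t: t[0]):
        fields: Dict[str, List[str]] = {}
        for _, predicate, obj in group:
            fields.setdefault(predicate, []).append(obj)
        # The subject has changed (or the input ended): the entity is complete.
        yield subject, fields


if __name__ == "__main__":
    sorted_dump = [
        ("m.1", "name", "A"),
        ("m.1", "type", "T"),
        ("m.2", "name", "B"),
    ]
    for subject, fields in entities_from_sorted(sorted_dump):
        print(subject, fields)
```

If the triples of one subject are not contiguous, the same subject is emitted 
more than once with partial data, which is why the ordering constraint matters. 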

Now I'm actually thinking about another possible approach that would remove 
that constraint: using Solr Atomic Updates 
(https://wiki.apache.org/solr/Atomic_Updates), at least with the SolrYard. I 
would need to take a look at how the schema is managed for the SolrYard, because 
the problem with Atomic Updates is that all fields must be stored in order to 
avoid losing information.
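
As a rough sketch of the idea (an assumption about how it could look, not how 
the SolrYard works today), each triple could be turned into an atomic update 
that appends one value to a multi-valued field, so triples may arrive in any 
order. The "id" unique-key field and the field names below are hypothetical; 
the "add" modifier is the one described on the Atomic Updates wiki page. 

```python
import json
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), assumed already parsed


def atomic_update_doc(subject: str, predicate: str, obj: str,
                      id_field: str = "id") -> Dict[str, object]:
    """Build one Solr atomic-update document for a single triple.

    The "add" modifier appends a value to a multi-valued field instead of
    replacing the whole document. `id_field` is a hypothetical unique key.
    """
    return {id_field: subject, predicate: {"add": obj}}


def atomic_update_batch(triples: Iterable[Triple], id_field: str = "id") -> str:
    """Serialize a batch of triples as a JSON body for Solr's update handler."""
    docs = [atomic_update_doc(s, p, o, id_field) for s, p, o in triples]
    return json.dumps(docs)


if __name__ == "__main__":
    # Subjects are deliberately interleaved: no ordering is required.
    batch = [
        ("m.1", "name", "A"),
        ("m.2", "name", "B"),
        ("m.1", "type", "T"),
    ]
    print(atomic_update_batch(batch))
```

The resulting JSON would be posted to the core's update handler. Solr rebuilds 
the document from its stored fields on every atomic update, so any field that 
is not stored would be lost. 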

> Create a lightweight EntityHub Indexing Tool for Freebase
> ---------------------------------------------------------
>
>                 Key: STANBOL-1125
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1125
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Entityhub
>            Reporter: Rafa Haro
>
> Due to the enormous size of the dumps, the current Freebase indexing tool in 
> Stanbol can barely work on machines without several gigabytes of RAM and/or 
> SSD disks. The JenaTDB importer has been identified as the bottleneck of the 
> indexing process. Using an RDF database is mandatory in order to, for 
> instance, use LDPath programs at indexing time.
> The idea is to develop a lightweight indexing tool that streams data from the 
> dumps and pushes it directly to Solr. Despite losing some functionality, this 
> would make it possible for any user to generate Freebase EntityHub indexes 
> from any dump.


