Hi Amindri

> http://rdf.freebase.com/ns/.0432b)!
>
>
> It seems like the 'm' in the entity id is being dropped. I created a patch
> for that, so that preceding elements of the entity id are not dropped when
> the default name space is used.
>

This is definitely a bug. Where have you fixed it: in the
NamespacePrefixService or in the indexing tool? Can you please create
an issue and provide the patch?
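
To make clear what behaviour I would expect, here is a minimal Python
sketch. It is not the Stanbol code: the function name `to_full_uri` and
the mapping table are made up for illustration, and I assume the empty
prefix is mapped to the Freebase namespace. An id without a ':' should be
appended to the default namespace as a whole, so nothing gets dropped:

```python
# Hypothetical sketch, NOT the actual NamespacePrefixService implementation.
# Assumption: the empty prefix "" is mapped to the default namespace.
PREFIX_MAPPINGS = {
    "": "http://rdf.freebase.com/ns/",
    "key": "http://rdf.freebase.com/key/",
}


def to_full_uri(short_id, mappings=PREFIX_MAPPINGS):
    """Resolve 'prefix:local' or a bare local name to a full URI."""
    if "://" in short_id:
        # Already a full URI, nothing to resolve.
        return short_id
    prefix, separator, local_name = short_id.partition(":")
    if not separator:
        # No ':' present: the whole id is the local name in the
        # default namespace. No leading characters may be cut off.
        prefix, local_name = "", short_id
    if prefix not in mappings:
        raise KeyError("unknown namespace prefix: %r" % prefix)
    return mappings[prefix] + local_name
```

With this, `to_full_uri("m.0432b")` gives
`http://rdf.freebase.com/ns/m.0432b`, with the 'm' still in place.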

With this fixed, the log output of the second try looks fine. However, it
looks as if there is no data in the Jena TDB store.

> When I checked the code, this happens because
> indexingDataset.getDefaultGraph()
> (RdfIndexingSource.getEntityData(String id) - 406) returns an empty graph, so
> it cannot find the parsed entity in it. The indexing/resources/tdb folder,
> which is used to create the indexingDataset, exists with 26 data files.
>

The files are created as soon as you start Jena TDB. Depending on the
OS, the initial size of the files differs. On Mac they are several GByte
in size (as the OS allocates the full size of the memory-mapped files);
on Linux and Windows they are just a few kBytes.

With the Freebase data imported, the total size of the files in this
directory should be much larger. I still have a directory with a dump
I imported in April 2013. This one is about 70 GByte. Back then I was
using a machine with an SSD to import the RDF data. The process needed
about a week to complete.
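
A quick way to check this is to sum up the file sizes in the TDB folder.
The following is only a rough sketch in Python; the helper names are my
own and the path assumes you run it from the directory that contains the
"indexing" folder. Keep in mind that on Mac even an empty store already
shows several GByte, so the number is only a hint:

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files below `path` (0 if it is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total


def format_gib(num_bytes):
    """Format a byte count as GiB with two decimals."""
    return "%.2f GiB" % (num_bytes / float(1024 ** 3))


if __name__ == "__main__":
    # Assumed location, relative to the working directory of the indexing tool.
    tdb_dir = os.path.join("indexing", "resources", "tdb")
    print("%s: %s" % (tdb_dir, format_gib(directory_size_bytes(tdb_dir))))
```

If the result is far below the size of the dump you wanted to import, the
RDF data most likely never reached the store.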

To import the data you need to copy the compressed file with the
corrected RDF data (output of step (4) in the README) to the
"indexing/resources/rdfdata" folder. The indexing tool will import all
RDF files in this folder into the Jena TDB store. Imported files are
moved to "indexing/resources/imported" (to avoid importing them
again on follow-up executions).
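
To see in which state your setup is, you can list both folders. Again
just a small Python sketch with made-up helper names; only the two folder
names are the ones mentioned above:

```python
import os


def list_files(folder):
    """Return the sorted file names in `folder`, or [] if it does not exist."""
    if not os.path.isdir(folder):
        return []
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


def import_status(resources_dir):
    """Report which RDF files still wait for import and which are done."""
    return {
        # Files here are picked up on the next run of the indexing tool.
        "pending": list_files(os.path.join(resources_dir, "rdfdata")),
        # Files here were already loaded and are not imported again.
        "imported": list_files(os.path.join(resources_dir, "imported")),
    }


if __name__ == "__main__":
    status = import_status(os.path.join("indexing", "resources"))
    print("pending: ", status["pending"])
    print("imported:", status["imported"])
```

If both lists are empty, the tool never saw any RDF file, which would
explain an empty default graph.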

In addition, I recommend stopping the indexing tool after it has
finished importing all data and then starting it again. Experience showed
that restarting the indexing tool after importing nearly 2 billion triples
into Jena TDB speeds up the indexing quite a bit.

In my setup, importing the RDF data to Jena TDB needed about a week.
Indexing the imported data needed about 10 hours.

best
Rupert

On Thu, Feb 12, 2015 at 6:35 AM, Amindri Udugala
<amindriudug...@gmail.com> wrote:
> Hi Rupert,
>
> I was following the readme file, but the problem still exists.
>
> I enabled debug and saw the following lines
>
> 15:48:43,318 [Indexing: Entity Source Reader Deamon] DEBUG
> source.LineBasedEntityIterator - > line =     141 m.0432b
> 15:48:43,318 [Indexing: Entity Source Reader Deamon] DEBUG
> source.LineBasedEntityIterator -  - id = m.0432b
> 15:48:43,318 [Indexing: Entity Source Reader Deamon] DEBUG
> source.LineBasedEntityIterator -  - entity =
> http://rdf.freebase.com/ns/.0432b
> 15:48:43,318 [Indexing: Entity Source Reader Deamon] DEBUG
> source.LineBasedEntityIterator -  - score =
> 15:48:43,318 [Indexing: Entity Source Reader Deamon] DEBUG
> jenatdb.RdfIndexingSource - No Statements found for id
> http://rdf.freebase.com/key/.0432b (Node:
> http://rdf.freebase.com/ns/.0432b)!
>
>
> It seems like the 'm' in the entity id is being dropped. I created a patch
> for that, so that preceding elements of the entity id are not dropped when
> the default name space is used.
>
>
> However even after fixing this problem, I still get the same Debug log with
> the correct entity URL
>
> 15:48:43,319 [Indexing: Entity Source Reader Deamon] DEBUG
> impl.EntityIdBasedIndexingDaemon - unable to get Data for Entity
> http://rdf.freebase.com/key/m.041yjm (score=norm:0.314499|orig:141.0)
> 15:48:43,319 [Indexing: Entity Source Reader Deamon] DEBUG
> source.LineBasedEntityIterator - > line =     141 m.041s8n
> 15:48:43,319 [Indexing: Entity Source Reader Deamon] DEBUG
> source.LineBasedEntityIterator -  - id = m.041s8n
> 15:48:43,319 [Indexing: Entity Source Reader Deamon] DEBUG
> source.LineBasedEntityIterator -  - entity =
> http://rdf.freebase.com/ns/m.041s8n
> 15:48:43,319 [Indexing: Entity Source Reader Deamon] DEBUG
> source.LineBasedEntityIterator -  - score =
> 15:48:43,319 [Indexing: Entity Source Reader Deamon] DEBUG
> jenatdb.RdfIndexingSource - No Statements found for id
> http://rdf.freebase.com/ns/m.041s8n (Node:
> http://rdf.freebase.com/key/m.041s8n)!
> 15:48:43,319 [Indexing: Entity Source Reader Deamon] DEBUG
> impl.EntityIdBasedIndexingDaemon - unable to get Data for Entity
> http://rdf.freebase.com/ns/m.041s8n (score=norm:0.314499|orig:141.0)
>
> When I checked the code, this happens because
> indexingDataset.getDefaultGraph()
> (RdfIndexingSource.getEntityData(String id) - 406) returns an empty graph, so
> it cannot find the parsed entity in it. The indexing/resources/tdb folder,
> which is used to create the indexingDataset, exists with 26 data files.
>
>
> Do you have any idea why this happens?
>
> Thanks
> Amindri
>
>
>
> On 11 February 2015 at 23:03, Rupert Westenthaler
> <rupert.westentha...@gmail.com> wrote:
>>
>> Hi Amindri,
>>
>> The file to look at is the README.md file of the freebase indexer [1]. If
>> something is missing in this file please create an issue [2] and if
>> possible provide a patch.
>>
>> thx
>> Rupert
>>
>>
>> [1]
>> http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
>> [2] https://issues.apache.org/jira/browse/STANBOL
>>
>>
>> On Wed, Feb 11, 2015 at 1:00 AM, Amindri Udugala
>> <amindriudug...@gmail.com> wrote:
>> > Hi Rupert,
>> >
>> > Thanks for the informative reply.
>> > I was able to specify an empty String as the namespace prefix in the
>> > namespaceprefix.mapping file. Exactly as you mentioned, indexing started
>> > with no logging for quite some time. Then the process finished without
>> > indexing a single entity.
>> >
>> > I used all the default configuration files created by the init process.
>> > I'm trying to build a freebase index for multilingual FST linking. I would
>> > much appreciate it if you could point me to a resource where I can get the
>> > information to correctly configure the properties files.
>> >
>> > Thanks,
>> > Amindri
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11                              ++43-699-11108907
>> | A-5500 Bischofshofen
>> | REDLINK.CO
>> ..........................................................................
>> | http://redlink.co/
>
>
>
>
> --
> Regards
> Amindri Udugala
>
>



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO 
..........................................................................
| http://redlink.co/
