Re: Entityhub indexing for Freebase data

Rupert Westenthaler Tue, 26 May 2015 05:48:08 -0700

Hi

On Tue, May 26, 2015 at 2:13 PM,  <[email protected]> wrote:
> Hi Rupert,
>
> After the last failure, I am only using language=en and it still fails.
>


Can you provide some lines of the log from just before the OOM? I would like
to be sure that it really happens during the Solr optimization phase.
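
A minimal sketch of how those lines could be pulled out of a saved log. It assumes the console output was redirected to a file and that the failure appears as the standard JVM `OutOfMemoryError` text. The function name and the context size are made up for illustration; the `start finalisation` marker is the IndexerImpl message visible in the log quoted further down in this thread.

```python
from typing import List, Tuple

OOM_MARKER = "OutOfMemoryError"          # standard JVM error name
FINALISE_MARKER = "start finalisation"   # logged by impl.IndexerImpl before finalisation


def lines_before_oom(log_lines: List[str], context: int = 20) -> Tuple[List[str], bool]:
    """Return (lines up to and including the first OOM line, finalisation started?).

    Returns ([], False) when the log contains no OOM line.
    """
    finalising = False
    for i, line in enumerate(log_lines):
        if FINALISE_MARKER in line:
            finalising = True
        if OOM_MARKER in line:
            start = max(0, i - context)
            return log_lines[start:i + 1], finalising
    return [], False
```

If the second value is `True`, the OOM occurred after finalisation had started, which is the case where re-running only `finalise` should be enough.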

> Thanks for the timely answer. Just to double-check: I restarted the index 
> command this morning with a higher -Xmx option. Is it too late to run 
> finalise?

If the OOM exception really happened during the Solr optimization, calling

   java -jar -Xmx{higher-value}g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise

will use the data of the previous indexing call and just repeat the
finalization steps.

best
Rupert


> With best regards,
> Rajan
>
>> On May 26, 2015, at 7:47 AM, Rupert Westenthaler 
>> <[email protected]> wrote:
>>
>> Hi Rajan
>>
>>> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <[email protected]> wrote:
>>> Hi Rupert,
>>>
>>> Thanks for the reply.
>>>
>>> As per your suggestion, I made necessary changes however it failed with
>>> "OutOfMemory" errors. At present, I am running with -Xmx48g however at this
>>> point it's a trial and error approach with several days effort being
>>> wasted.
>>
>> I guess you are getting the OutOfMemory while optimizing the Solr
>> Index (right?). The README [1] explicitly notes that a high amount of
>> memory is needed by exactly this step of the indexing process.
>>
>> If the indexing fails at this step you can call the indexing tool with
>> the `finalise` command (instead of `index`) (see STANBOL-1047 [2]
>> for details). This prevents the indexing from being repeated and only
>> executes the finalization steps (optimizing the Solr Index and creating
>> the freebase.solrindex.zip file).
>>
>>
>>> I am just throwing out an idea, but wanted to see
>>>
>>> a. Is it possible to publish set of constraints and required parameters.
>>> i.e. with minimal set of entities within mappings.txt, one need to set
>>> these parameters?
>>
>> I do not understand this question. Do you want to filter entities
>> based on their information? If so you might want to have a look at the
>> `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`.
>> The generic RDF indexing tool has an example of how to use this
>> processor to filter entities based on their rdf:type values.
>>
>> See also the "Entity Filters" section of [3]
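
The idea of filtering entities on their rdf:type values can be illustrated with a small sketch. This is not the `FieldValueFilter` API or its configuration format; the function name, the dictionary layout, and the example type URIs in the usage note are assumptions made purely for illustration.

```python
from typing import Dict, List, Set

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def accept_entity(properties: Dict[str, List[str]], allowed_types: Set[str]) -> bool:
    """Keep an entity only if at least one of its rdf:type values is allowed.

    `properties` maps a property URI to the list of its values (hypothetical layout).
    """
    return any(t in allowed_types for t in properties.get(RDF_TYPE, []))
```

For example, with `allowed_types = {"http://rdf.freebase.com/ns/music.artist"}` an entity typed only as a (hypothetical) `.../ns/film.film` would be dropped before it reaches the Solr index.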
>>
>>>
>>> b. Is it possible to split the file based on subject? generate smaller
>>> index for each subject and merge afterwards?
>>
>> Yes. You can split up the dump (by subject) and import those parts in
>> different Indexing Tool instances (meaning different Jena TDB
>> instances). Importing 4*500 million triples to Jena TDB is supposed to
>> be much faster than 1*2 billion.
>>
>> If you still want to have all data in a single Entityhub Site you need
>> to script the indexing process.
>>
>> * call indexing for the first part
>> * after this finishes link the {part1}/indexing/destination/indexes
>> folder to {part2..n}/indexing/destination/indexes
>> * call indexing for the 2..n parts.
>>
>> As the indexing tool only adds additional information to the Solr
>> Index you will get the union over all parts at the end of the process.
>> All parts need to use the full incoming_links.txt file because
>> otherwise the rankings would not be correct.
>>
>> The "Indexing Datasets separately" section of [3] describes a similar
>> trick of creating a union index over multiple datasets.
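
Splitting the dump by subject could look roughly like the sketch below. It assumes N-Triples input (one triple per line, subject as the first token) and uses a stable CRC32 hash so that all triples of one subject end up in the same part. The function names are hypothetical and nothing here is part of the indexing tool.

```python
import zlib
from typing import Iterable, List


def subject_of(ntriples_line: str) -> str:
    """Return the subject term: the first whitespace-delimited token of the line."""
    return ntriples_line.split(None, 1)[0]


def part_index(subject: str, parts: int) -> int:
    """Map a subject to a part number in [0, parts) using a stable hash."""
    return zlib.crc32(subject.encode("utf-8")) % parts


def split_by_subject(lines: Iterable[str], parts: int) -> List[List[str]]:
    """Distribute triples so that triples sharing a subject land in the same part."""
    buckets: List[List[str]] = [[] for _ in range(parts)]
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        buckets[part_index(subject_of(line), parts)].append(line)
    return buckets
```

The in-memory lists only keep the example short. For a dump of this size one would stream the gzipped input and write each line straight to a per-part output file instead.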
>>
>>
>> best
>> Rupert
>>
>>> c. Work with BaseKB guys to also make it available at nominal charge?
>>>
>>> d. Maybe apply some Map/Reduce - extension of idea b
>>>
>>> With best regards,
>>> Rajan
>>
>>
>>
>> [1] 
>> http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
>> [2] https://issues.apache.org/jira/browse/STANBOL-1047
>> [3] 
>> http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md
>>
>>>
>>>
>>>
>>> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <
>>> [email protected]> wrote:
>>>
>>>> Hi Rajan,
>>>>
>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
>>>>
>>>> You have not indexed a single entity. So something in your indexing
>>>> configuration is wrong. Most likely you are not correctly building the
>>>> URIs of the entities from the incoming_links.txt file. Can you provide
>>>> me an example line of the 'incoming_links.txt' file and the contents
>>>> of the 'iditerator.properties' file. Those specify how Entity URIs are
>>>> built.
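
As a rough illustration of what "building the URI" means here: each line of incoming_links.txt carries a count and a local id, and the entity URI is expected to be the Freebase namespace followed by that id. The sketch below assumes exactly that and nothing more. How iditerator.properties actually controls the process is not shown in this thread, and the function names are hypothetical.

```python
from typing import Tuple

FREEBASE_NS = "http://rdf.freebase.com/ns/"  # namespace named in this thread


def parse_incoming_links_line(line: str) -> Tuple[int, str]:
    """Split a '<count> <local id>' line into (count, local id)."""
    count, local_id = line.split()
    return int(count), local_id


def entity_uri(local_id: str, namespace: str = FREEBASE_NS) -> str:
    """Build a full entity URI by prefixing the local id with the namespace."""
    return namespace + local_id
```

If the URIs produced this way do not match the subjects stored in Jena TDB, no entity will be found, which would be consistent with the "Indexed 0 items" log line.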
>>>>
>>>> Short answers to the other questions
>>>>
>>>>
>>>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <[email protected]> wrote:
>>>>> it ran for almost 3 days and generated index.
>>>>
>>>> That's good. It means you now have the Freebase dump in your Jena
>>>> TDB triple store. You will not need to repeat this (until you want to
>>>> use a newer dump). On the next call to the indexing tool it will
>>>> immediately start with the indexing step.
>>>>
>>>>
>>>>>
>>>>> Couple questions come to mind:
>>>>>
>>>>> a. Is there any particular log/error file the process generates besides
>>>>> printing out on stdout/stderr?
>>>>
>>>> The indexer writes a zip archive with the IDs of all the indexed
>>>> entities. It is in the indexing/destination folder.
>>>>
>>>>> b. Is it a must-have to have stanbol full launcher running all the time
>>>>> while indexing is going on?
>>>>
>>>> No Stanbol instance is needed by the indexing process.
>>>>
>>>>> c. Is it possible that, if the machine is not connected to internet for
>>>>> couple minutes could cause some issues?
>>>>
>>>> No Internet connectivity is needed during indexing. Only if you want
>>>> to use the namespace prefix mappings of prefix.cc you need to have
>>>> internet connectivity when starting the indexing tool.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>>>
>>>>> I would really appreciate, if you can shed some light on "what could be
>>>>> wrong" or "potential approach to nail down this issue"? If you need, I am
>>>>> happy to share any additional logs/properties.
>>>>>
>>>>> With best regards,
>>>>> Rajan
>>>>>
>>>>> *1. Configuration changes*
>>>>>
>>>>> a. set ns-prefix-state=false*
>>>>> [within /indexing/config/iditerator.properties]*
>>>>> b. add empty space mapping to   http://rdf.freebase.com/ns/*
>>>>> [within namespaceprefix.mappings]*
>>>>> c. enable bunch of properties within mappings.txt such as following
>>>>>
>>>>> fb:music.artist.genre
>>>>> fb:music.artist.label
>>>>> fb:music.artist.album
>>>>>
>>>>> *2. Contents of indexing/dist directory*
>>>>>
>>>>> -rw-r--r--  108899 May 22 05:11 freebase.solrindex.zip
>>>>> -rw-r--r--  3457 May 22 05:11
>>>>> org.apache.stanbol.data.site.freebase-1.0.0.jar
>>>>>
>>>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory*
>>>>>
>>>>> -rw-r--r--  1 31026810858 May 20 07:32 freebase.nt.gz
>>>>>
>>>>> *4. Contents of /tmp/freebase/indexing/resources directory*
>>>>>
>>>>> -rw-r--r--   1 1206745360 May 19 09:38 incoming_links.txt
>>>>>
>>>>> *5. The indexer log*
>>>>>
>>>>> *04:31:57,236 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add:
>>>>> 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
>>>>> *04:32:00,727 [Thread-3] INFO  jenatdb.RdfResourceImporter - Filtered:
>>>>> 2429800000 triples (80.97554853864854%)*
>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish
>>>>> triples data phase*
>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - ** Data:
>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per
>>>>> second]*
>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Start
>>>>> triples index phase*
>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish
>>>>> triples index phase*
>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish
>>>>> triples load*
>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - ** Completed:
>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per
>>>>> second]*
>>>>> 04:32:56,880 [Thread-3] INFO  source.ResourceLoader -    ... moving
>>>>> imported file freebase.nt.gz to imported/freebase.nt.gz
>>>>> 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -    - completed in
>>>>> 157675 seconds
>>>>> 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -  > loading
>>>>> '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
>>>>> 04:32:56,944 [Thread-3] WARN  jenatdb.RdfResourceImporter - ignore File {}
>>>>> because of unknown extension
>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -    - completed in 0
>>>>> seconds
>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 2 files imported
>>>>> in 157675 seconds
>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader - Loding 0 File ...
>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 0 files imported
>>>>> in 0 seconds
>>>>> 04:32:56,971 [main] INFO  impl.IndexerImpl -  ... delete existing
>>>>> IndexedEntityId file
>>>>> /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - Initialisation completed
>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl -   ... initialisation completed
>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - start indexing ...
>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - Indexing started ...
>>>>>
>>>>>
>>>>>
>>>>> 04:45:48,075 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'nsogi' valid , namespace '
>>>>> http://prefix.cc/nsogi:' invalid -> mapping ignored!
>>>>> 04:45:48,076 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'category' valid , namespace '
>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>>>>> 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'chebi' valid , namespace '
>>>>> http://bio2rdf.org/chebi:' invalid -> mapping ignored!
>>>>> 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'hgnc' valid , namespace '
>>>>> http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
>>>>> 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace '
>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
>>>>> 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'dbc' valid , namespace '
>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'pubmed' valid , namespace '
>>>>> http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'dbt' valid , namespace '
>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'dbrc' valid , namespace '
>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'call' valid , namespace '
>>>>> http://webofcode.org/wfn/call:' invalid -> mapping ignored!
>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'dbcat' valid , namespace '
>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>>>>> 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace '
>>>>> http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
>>>>> 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'bgcat' valid , namespace '
>>>>> http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored!
>>>>> 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl -
>>>>> Invalid Namespace Mapping: prefix 'condition' valid , namespace '
>>>>> http://www.kinjal.com/condition:' invalid -> mapping ignored!
>>>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO  impl.IndexerImpl
>>>>> - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO  impl.IndexerImpl
>>>>> -  > current sequence : 0
>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO  impl.IndexerImpl
>>>>> -  > new sequence: 1
>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO  impl.IndexerImpl
>>>>> - Send end-of-queue to Deamons with Sequence 1
>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO  impl.IndexerImpl -
>>>>> Indexing: Entity Processor Deamon completed (sequence=1) ...
>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO  impl.IndexerImpl -
>>>>>  > current sequence : 1
>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO  impl.IndexerImpl -
>>>>>  > new sequence: 2
>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO  impl.IndexerImpl -
>>>>> Send end-of-queue to Deamons with Sequence 2
>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO  impl.IndexerImpl -
>>>>> Indexing: Entity Perstisting Deamon completed (sequence=2) ...
>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO  impl.IndexerImpl -
>>>>>  > current sequence : 2
>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO  impl.IndexerImpl -
>>>>>  > new sequence: 3
>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO  impl.IndexerImpl -
>>>>> Send end-of-queue to Deamons with Sequence 3
>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
>>>>> processing:  -1.000ms/item | queue:  -1.000ms*
>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl -   - source   :  -1.000ms/item
>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl -   - processing:  -1.000ms/item
>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl -   - store     :  -1.000ms/item
>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed
>>>>> (sequence=3) ...
>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl -  > current sequence : 3
>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl -  > new sequence: 4
>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
>>>>> impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO  impl.IndexerImpl
>>>>> - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO  impl.IndexerImpl
>>>>> -  > current sequence : 4
>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... indexing completed
>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl - start post-processing ...
>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl - PostProcessing started ...
>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... post-processing finished
>>>>> ...
>>>>> 05:11:41,911 [main] INFO  impl.IndexerImpl - start finalisation....
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <[email protected]> wrote:
>>>>>>> Hi Rupert and Antonio,
>>>>>>>
>>>>>>> Thanks a lot for the reply.
>>>>>>>
>>>>>>> I start to follow Rupert's suggestion, however it failed again at
>>>>>>>
>>>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal
>>>>>>> escape sequence value: $ (0x24) -- Is there any way it can be resolved for
>>>>>>> the entire file?
>>>>>>
>>>>>> The indexing tool uses Apache Jena, and those are Jena parsing errors.
>>>>>> So the Jena Mailing lists would be the better place to look for
>>>>>> answers.
>>>>>> This specific issue looks like an invalid URI that is not fixed by the
>>>>>> fixit script.
>>>>>>
>>>>>>
>>>>>>> I requested an access to latest BaseKB bucket, as it doesn't seem to
>>>> be
>>>>>>> open.
>>>>>>>
>>>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/
>>>>>>> --add-header="x-amz-request-payer: requester"
>>>>>>> ERROR: Access to bucket 'basekb-now' was denied
>>>>>>>
>>>>>>>
>>>>>>> *Couple additional questions:*
>>>>>>>
>>>>>>> *1. indexing enhancements:*
>>>>>>> What settings/properties can one tweak to gain the most out of the indexing?
>>>>>>
>>>>>> In general you only want the information needed for your application
>>>>>> case in the index.
>>>>>> For EntityLinking only labels and type are required.
>>>>>> Additional properties will only be used for dereferencing Entities. So
>>>>>> this will depend on your application needs (your dereferencing
>>>>>> configuration).
>>>>>>
>>>>>> In general I try to exclude as much information as possible from the
>>>>>> index to keep the size of the Solr Index as small as possible.
>>>>>>
>>>>>>> a. for ex. domain specific such as Pharmaceutical, Law etc... within
>>>>>>> freebase
>>>>>>> b. potential optimizations to speed up the overall indexing
>>>>>>
>>>>>> Most of the time will be needed to load the Freebase dump into Jena
>>>>>> TDB. Even with an SSD equipped Server this will take several days.
>>>>>> Assigning more RAM will speed up this process as Jena TDB can cache
>>>>>> more things in RAM.
>>>>>>
>>>>>> Usually it is a good idea to cancel the indexing process after the
>>>>>> importing of the RDF data has finished (and the indexing of the
>>>>>> Entities has started). This is because after importing all the RAM will
>>>>>> be used by Jena TDB for caching stuff that is no longer needed in the
>>>>>> read-only operations during indexing. So a fresh start can speed up
>>>>>> the indexing part of the process.
>>>>>>
>>>>>> Also have a look at the Freebase Indexing Tool Readme
>>>>>>
>>>>>>>
>>>>>>> *2. demo:*
>>>>>>> I see that in recent GitHub commit(s) the eHealth and other demos have
>>>>>>> been commented out. How can I get the demo source code and other
>>>>>>> components for these demos? I prefer to build it myself to see the
>>>>>>> power of Stanbol.
>>>>>>
>>>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully
>>>>>> compatible with the trunk version.
>>>>>>
>>>>>>> *3. custom vocabulary:*
>>>>>>> Suppose I have a custom vocabulary in CSV format. Is there a preferred way
>>>>>>> to upload it to Stanbol and have it recognize my entities?
>>>>>>
>>>>>> Google Refine[2] with the RDF extension [3]. You can also try to use
>>>>>> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version
>>>>>> but AFAIK this combination is not so stable and might not work at all.
>>>>>>
>>>>>> * Google Refine allows you to import your CSV file.
>>>>>> * Clean it up (if necessary)
>>>>>> * The RDF extension allows you to map your CSV data to RDF
>>>>>> * based on this mapping you can save your data as RDF
>>>>>> * after that you can import the RDF data to Apache Stanbol
>>>>>>
>>>>>> hope this helps
>>>>>> best
>>>>>> Rupert
>>>>>>
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Rajan
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1]
>>>> http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
>>>>>> [2] https://code.google.com/p/google-refine/
>>>>>> [3] http://refine.deri.ie/
>>>>>> [4] http://openrefine.org/
>>>>>>
>>>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Rajan,
>>>>>>>>
>>>>>>>> I think this is because you named your file
>>>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF format
>>>>>>>> is not provided by the file extension. Renaming the file to
>>>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
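
The effect of the file name can be shown with a small sketch of extension-based format guessing. This is only an illustration of the behaviour described above, not Jena's actual lookup code. The mapping table is a deliberately small, assumed subset, and the function name is hypothetical.

```python
from typing import Dict

# Illustrative subset of extension-to-format names; not Jena's real table.
FORMATS: Dict[str, str] = {"nt": "N-Triples", "ttl": "Turtle", "rdf": "RDF/XML"}
COMPRESSION = {"gz", "bz2"}


def guess_rdf_format(filename: str, default: str = "RDF/XML") -> str:
    """Guess the RDF format from the extension, ignoring a compression suffix."""
    parts = filename.lower().split(".")
    # Drop a trailing compression suffix such as ".gz"
    if len(parts) > 1 and parts[-1] in COMPRESSION:
        parts = parts[:-1]
    ext = parts[-1] if len(parts) > 1 else ""
    return FORMATS.get(ext, default)
```

Under these assumptions `freebase-rdf-latest-fixed.gz` has no format extension left once `.gz` is stripped, so it falls back to RDF/XML, while `freebase-rdf-latest-fixed.nt.gz` resolves to N-Triples.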
>>>>>>>>
>>>>>>>> The suggestion of Antonio to use BaseKB is also a valid option.
>>>>>>>>
>>>>>>>> best
>>>>>>>> Rupert
>>>>>>>>
>>>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi Rajan
>>>>>>>>>
>>>>>>>>> The Freebase dump contains some things that do not fit very well with
>>>>>>>>> the indexer.
>>>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com),
>>>>>>>>> which is a curated Freebase dump.
>>>>>>>>> I did not have any problem indexing that dump.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <[email protected]>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am working on indexing Freebase data within EntityHub and observed the
>>>>>>>>>> following issue:
>>>>>>>>>>
>>>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or
>>>>>>>>>> attribute do not match QName production: QName::=(NCName':')?NCName.
>>>>>>>>>>
>>>>>>>>>> I would appreciate any help pertaining to this issue.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rajan
>>>>>>>>>>
>>>>>>>>>> *Steps followed:*
>>>>>>>>>>
>>>>>>>>>> *1. Initialization: *
>>>>>>>>>> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
>>>>>>>>>>
>>>>>>>>>> *2. Download the data:*
>>>>>>>>>> Download the data from
>>>>>>>>>> https://developers.google.com/freebase/data
>>>>>>>>>>
>>>>>>>>>> *3. Performed execution of fbrankings-uri.sh*
>>>>>>>>>> It generated incoming_links.txt under the resources directory as follows
>>>>>>>>>>
>>>>>>>>>> 10888430 m.0kpv11
>>>>>>>>>> 3741261 m.019h
>>>>>>>>>> 2667858 m.0775xx5
>>>>>>>>>> 2667804 m.0775xvm
>>>>>>>>>> 1875352 m.01xryvm
>>>>>>>>>> 1739262 m.05zppz
>>>>>>>>>> 1369590 m.01xrzlb
>>>>>>>>>>
>>>>>>>>>> *4. Performed execution of fixit script*
>>>>>>>>>>
>>>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
>>>>>>>>>>
>>>>>>>>>> *5. Rename the fixed file to freebase.rdf.gz and copy it *
>>>>>>>>>> to indexing/resources/rdfdata
>>>>>>>>>>
>>>>>>>>>> *6. config/iditer.properties file has following setting*
>>>>>>>>>> #id-namespace=http://freebase.com/
>>>>>>>>>> ns-prefix-state=false
>>>>>>>>>>
>>>>>>>>>> *7. Performed run of following command:*
>>>>>>>>>> java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar index
>>>>>>>>>>
>>>>>>>>>> The error dump on stdout is as follows:
>>>>>>>>>>
>>>>>>>>>> 01:37:32,884 [Thread-0] INFO  solryard.SolrYardIndexingDestination - ...
>>>>>>>>>> copy Solr Configuration form /private/tmp/freebase/indexing/config/freebase
>>>>>>>>>> to /private/tmp/freebase/indexing/destination/indexes/default/freebase
>>>>>>>>>> 01:37:32,895 [Thread-3] INFO  jenatdb.RdfResourceImporter -     - bulk
>>>>>>>>>> loading File freebase.rdf.gz using Format Lang:RDF/XML
>>>>>>>>>> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Start
>>>>>>>>>> triples data phase
>>>>>>>>>> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - ** Load empty
>>>>>>>>>> triples table
>>>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or
>>>>>>>>>> attribute do not match QName production: QName::=(NCName':')?NCName.*
>>>>>>>>>> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish
>>>>>>>>>> triples data phase
>>>>>>>>>> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish
>>>>>>>>>> triples load
>>>>>>>>>> 01:37:32,960 [Thread-3] INFO  source.ResourceLoader - Ignore Error for File
>>>>>>>>>> /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz and
>>>>>>>>>> continue
>>>>>>>>>>
>>>>>>>>>> Additional Reference Point:
>>>>>>>>>>
>>>>>>>>>> *Original Freebase dump size:*  31025015397 May 14 18:10
>>>>>>>>>> freebase-rdf-latest.gz
>>>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45
>>>>>>>>>> freebase-rdf-latest-fixed.gz
>>>>>>>>>> *Incoming Links size: *1206745360 May 17 00:42 incoming_links.txt
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> | Rupert Westenthaler             [email protected]
>>>>>>>> | Bodenlehenstraße 11                              ++43-699-11108907
>>>>>>>> | A-5500 Bischofshofen
>>>>>>>> | REDLINK.CO
>>>> ..........................................................................
>>>>>>>> | http://redlink.co/
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>
>>
>>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO 
..........................................................................
| http://redlink.co/
