Re: Entityhub indexing for Freebase data

Rajan Shah Thu, 28 May 2015 21:50:10 -0700

Hi,

Thanks again for your help.


Finally, I have the Freebase index and I can use it. I really appreciate your
continued help.

With best regards,
Rajan

On Thu, May 28, 2015 at 5:42 AM, Rupert Westenthaler <
rupert.westentha...@gmail.com> wrote:

> Hi,
>
> Please have a look at the Stanbol log file (./stanbol/log/error.log).
> The schema.xml of the Freebase indexing tool uses Solr analyzers that
> are not included in all Stanbol launchers. If any are missing, you
> will see the corresponding exceptions in the log.
>
> Installation extracts the index from the archive and copies it to
> ./stanbol/indexes, so depending on its size the installation may
> take some time.
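
A minimal sketch of the log check suggested above, assuming a POSIX shell and that missing analyzers show up as lines containing `Exception` or `ERROR`. The search pattern and the helper name `find_log_errors` are illustrative assumptions, not part of Stanbol; only the default log path comes from the thread.

```shell
#!/bin/sh
# Hypothetical helper: list log lines that mention an exception or error.
# The default path is the one named in the thread; the pattern is an assumption.
find_log_errors() {
    log_file="${1:-./stanbol/log/error.log}"
    # -n prefixes each match with its line number for easier lookup.
    grep -n -E 'Exception|ERROR' "$log_file"
}
```

Calling `find_log_errors` without arguments reads `./stanbol/log/error.log`; pass another path to inspect a different file.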
>
> best
> Rupert
>
>
> On Wed, May 27, 2015 at 3:01 PM, Rajan Shah <raja...@gmail.com> wrote:
> > Hi Rupert,
> >
> > Finally, I got the Freebase index after a 2-day run. For the English language
> > only, the size is roughly 28G.
> >
> > Surprisingly, after I installed it via the OSGi console it created a Referenced
> > Site and a Solr Yard. However, it's not visible within the Entityhub sites. I
> > did configure the following parameters within SolrYard:
> >
> > a. "Allow Initialization" - checked
> > b. Index configuration: freebase.solrindex.zip
> >
> > I also restarted a couple of times but no luck.
> >
> > Does it require any additional special configuration? I.e., do I need to
> > set a higher -Xmx parameter, or something else?
> >
> > With best regards,
> > Rajan
> >
> > On Tue, May 26, 2015 at 9:06 AM, <raja...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Accidentally, I wiped out logs for a clean start. At the same time, I am
> >> planning to run on a higher end AWS instance as well, so will keep you
> >> posted.
> >>
> >> Thanks again for your continuous help.
> >>
> >> With best regards,
> >> Rajan
> >>
> >> Sent from my iPhone
> >>
> >> > On May 26, 2015, at 8:47 AM, Rupert Westenthaler <
> >> rupert.westentha...@gmail.com> wrote:
> >> >
> >> > HI
> >> >
> >> >> On Tue, May 26, 2015 at 2:13 PM,  <raja...@gmail.com> wrote:
> >> >> Hi Rupert,
> >> >>
> >> >> After last failure, I am only using language=en and it still fails.
> >> >
> >> > Can you provide some lines of the log from before the OOM? I would like
> >> > to be sure that it really happens during the Solr optimization phase.
> >> >
> >> >> Thanks for the timely answer. Just to double-check: since I restarted
> >> >> the index command this morning with a higher -Xmx option, it is too
> >> >> late to run finalise, correct?
> >> >
> >> > If the OOM exception really happened during the Solr optimization, calling
> >> >
> >> >   java -jar -Xmx{higher-value}g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
> >> >
> >> > will use the data of the previous indexing call and just repeat the
> >> > finalization steps.
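
The retry described above could be wrapped in a small helper. This is only a sketch: the jar name and the `finalise` command are taken from the thread, while the helper name `finalise_cmd` and the heap size used in the usage note are assumptions. It prints the command instead of running it, so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the finalise invocation for a given heap size in GB.
# The jar name and the `finalise` command come from the thread.
finalise_cmd() {
    heap_gb="$1"
    jar="${2:-org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar}"
    printf 'java -jar -Xmx%sg %s finalise\n' "$heap_gb" "$jar"
}
```

For example, `finalise_cmd 64` prints the command for a 64 GB heap (an arbitrary example value, not a recommendation); run it from the directory that contains the `indexing` folder of the failed run.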
> >> >
> >> > best
> >> > Rupert
> >> >
> >> >
> >> >> With best regards,
> >> >> Rajan
> >> >>
> >> >> Sent from my iPhone
> >> >>
> >> >>> On May 26, 2015, at 7:47 AM, Rupert Westenthaler <
> >> rupert.westentha...@gmail.com> wrote:
> >> >>>
> >> >>> Hi Rajan
> >> >>>
> >> >>>> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <raja...@gmail.com>
> >> wrote:
> >> >>>> Hi Rupert,
> >> >>>>
> >> >>>> Thanks for the reply.
> >> >>>>
> >> >>>> As per your suggestion, I made the necessary changes; however, it failed
> >> >>>> with "OutOfMemory" errors. At present, I am running with -Xmx48g, but
> >> >>>> at this point it's a trial-and-error approach with several days of
> >> >>>> effort being wasted.
> >> >>>
> >> >>> I guess you are getting the OutOfMemory error while optimizing the Solr
> >> >>> index (right?). The README [1] explicitly notes that a large amount of
> >> >>> memory is needed by exactly this step of the indexing process.
> >> >>>
> >> >>> If the indexing fails at this step you can call the indexing tool with
> >> >>> the `finalise` command (instead of `index`) (see STANBOL-1047 [2]
> >> >>> for details). This prevents the indexing from being repeated and only
> >> >>> executes the finalization steps (optimizing the Solr index and creating
> >> >>> the freebase.solrindex.zip file).
> >> >>>
> >> >>>
> >> >>>> I am just throwing out some ideas, but wanted to see:
> >> >>>>
> >> >>>> a. Is it possible to publish a set of constraints and required
> >> >>>> parameters? I.e., with a minimal set of entities within mappings.txt,
> >> >>>> one needs to set these parameters?
> >> >>>
> >> >>> I do not understand this question. Do you want to filter entities
> >> >>> based on their information? If so you might want to have a look at the
> >> >>> `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`.
> >> >>> The generic RDF indexing tool has an example of how to use this
> >> >>> processor to filter entities based on their rdf:type values.
> >> >>>
> >> >>> See also the "Entity Filters" section of [3].
> >> >>>
> >> >>>>
> >> >>>> b. Is it possible to split the file based on subject, generate a smaller
> >> >>>> index for each subject, and merge afterwards?
> >> >>>
> >> >>> Yes. You can split up the dump (by subject) and import those parts in
> >> >>> different Indexing Tool instances (meaning different Jena TDB
> >> >>> instances). Importing 4*500 million triples to Jena TDB is supposed to
> >> >>> be much faster than 1*2 billion.
> >> >>>
> >> >>> If you still want to have all data in a single Entityhub Site you need
> >> >>> to script the indexing process:
> >> >>>
> >> >>> * call indexing for the first part
> >> >>> * after this finishes, link the {part1}/indexing/destination/indexes
> >> >>> folder to {part2..n}/indexing/destination/indexes
> >> >>> * call indexing for the 2..n parts.
> >> >>>
> >> >>> As the indexing tool only adds additional information to the Solr
> >> >>> index, you will get the union over all parts at the end of the process.
> >> >>> All parts need to use the full incoming_links.txt file because
> >> >>> otherwise the rankings would not be correct.
> >> >>>
> >> >>> The "Indexing Datasets separately" section of [3] describes a similar
> >> >>> trick for creating a union index over multiple datasets.
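
The linking step in the list above could be scripted roughly as follows. This is a sketch under assumptions: each part lives in its own directory with the `indexing/destination` layout named in the thread, a symbolic link is an acceptable way to "link" the folders, and the helper name `link_indexes` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: make a later part write into the first part's Solr index
# by symlinking {part1}/indexing/destination/indexes into that part.
link_indexes() {
    first_part="$1"
    other_part="$2"
    src="$first_part/indexing/destination/indexes"
    dst="$other_part/indexing/destination/indexes"
    [ -d "$src" ] || { echo "missing $src" >&2; return 1; }
    # Refuse to overwrite an existing folder or link in the later part.
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        echo "$dst already exists" >&2
        return 1
    fi
    mkdir -p "$other_part/indexing/destination"
    # Use an absolute path so the link stays valid wherever it is read from.
    ln -s "$(cd "$src" && pwd)" "$dst"
}
```

A driver script would run the indexing for part 1, call `link_indexes part1 partN` for each remaining part, and then run the indexing for those parts one after another.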
> >> >>>
> >> >>>
> >> >>> best
> >> >>> Rupert
> >> >>>
> >> >>>> c. Work with the BaseKB guys to also make it available at a nominal charge?
> >> >>>>
> >> >>>> d. Maybe apply some Map/Reduce - extension of idea b
> >> >>>>
> >> >>>> With best regards,
> >> >>>> Rajan
> >> >>>
> >> >>>
> >> >>>
> >> >>> [1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
> >> >>> [2] https://issues.apache.org/jira/browse/STANBOL-1047
> >> >>> [3] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md
> >> >>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <
> >> >>>> rupert.westentha...@gmail.com> wrote:
> >> >>>>
> >> >>>>> Hi Rajan,
> >> >>>>>
> >> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >> >>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
> >> >>>>>
> >> >>>>> You have not indexed a single entity, so something in your indexing
> >> >>>>> configuration is wrong. Most likely you are not correctly building the
> >> >>>>> URIs of the entities from the incoming_links.txt file. Can you provide
> >> >>>>> me an example line of the 'incoming_links.txt' file and the contents
> >> >>>>> of the 'iditerator.properties' file? Those specify how Entity URIs are
> >> >>>>> built.
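
One way to sanity-check this is to print the URIs that the first few lines of incoming_links.txt would produce and compare them with subjects in the dump. A minimal sketch, assuming each line has the form `<count> <id>` (as in the sample from the original post) and that the entity URI is simply the namespace followed by the id. The helper name `preview_entity_uris` is hypothetical, and how `iditerator.properties` actually builds URIs should be confirmed against the indexing tool's documentation.

```shell
#!/bin/sh
# Hypothetical helper: show the URIs implied by the first lines of a links file.
# Assumes "<count> <id>" lines and URI = namespace + id (both assumptions).
preview_entity_uris() {
    links_file="$1"
    namespace="${2:-http://rdf.freebase.com/ns/}"
    max_lines="${3:-5}"
    head -n "$max_lines" "$links_file" | awk -v ns="$namespace" '{ print ns $2 }'
}
```

If the printed URIs do not appear as subjects in the imported dump, the id-to-URI configuration is the likely culprit.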
> >> >>>>>
> >> >>>>> Short answers to the other questions
> >> >>>>>
> >> >>>>>
> >> >>>>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >>>>>> it ran for almost 3 days and generated index.
> >> >>>>>
> >> >>>>> That's good. It means you now have the Freebase dump in your Jena
> >> >>>>> TDB triple store. You will not need to repeat this (until you want to
> >> >>>>> use a newer dump). On the next call to the indexing tool it will
> >> >>>>> immediately start with the indexing step.
> >> >>>>>
> >> >>>>>
> >> >>>>>>
> >> >>>>>> Couple questions come to mind:
> >> >>>>>>
> >> >>>>>> a. Is there any particular log/error file the process generates besides
> >> >>>>>> printing to stdout/stderr?
> >> >>>>>
> >> >>>>> The indexer writes a zip archive with the IDs of all the indexed
> >> >>>>> entities. It's in the indexing/destination folder.
> >> >>>>>
> >> >>>>>> b. Is it a must-have to have the Stanbol full launcher running all the
> >> >>>>>> time while indexing is going on?
> >> >>>>>
> >> >>>>> No Stanbol instance is needed by the indexing process.
> >> >>>>>
> >> >>>>>> c. Is it possible that the machine not being connected to the internet
> >> >>>>>> for a couple of minutes could cause some issues?
> >> >>>>>
> >> >>>>> No Internet connectivity is needed during indexing. Only if you want
> >> >>>>> to use the namespace prefix mappings of prefix.cc do you need to have
> >> >>>>> internet connectivity when starting the indexing tool.
> >> >>>>>
> >> >>>>> best
> >> >>>>> Rupert
> >> >>>>>
> >> >>>>>>
> >> >>>>>> I would really appreciate it if you could shed some light on "what
> >> >>>>>> could be wrong" or a "potential approach to nail down this issue". If
> >> >>>>>> you need, I am happy to share any additional logs/properties.
> >> >>>>>>
> >> >>>>>> With best regards,
> >> >>>>>> Rajan
> >> >>>>>>
> >> >>>>>> *1. Configuration changes*
> >> >>>>>>
> >> >>>>>> a. set ns-prefix-state=false
> >> >>>>>> [within /indexing/config/iditerator.properties]
> >> >>>>>> b. add empty space mapping to http://rdf.freebase.com/ns/
> >> >>>>>> [within namespaceprefix.mappings]
> >> >>>>>> c. enable a bunch of properties within mappings.txt, such as the following
> >> >>>>>>
> >> >>>>>> fb:music.artist.genre
> >> >>>>>> fb:music.artist.label
> >> >>>>>> fb:music.artist.album
> >> >>>>>>
> >> >>>>>> *2. Contents of indexing/dist directory*
> >> >>>>>>
> >> >>>>>> -rw-r--r--  108899 May 22 05:11 freebase.solrindex.zip
> >> >>>>>> -rw-r--r--  3457 May 22 05:11 org.apache.stanbol.data.site.freebase-1.0.0.jar
> >> >>>>>>
> >> >>>>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory*
> >> >>>>>>
> >> >>>>>> -rw-r--r--  1 31026810858 May 20 07:32 freebase.nt.gz
> >> >>>>>>
> >> >>>>>> *4. Contents of /tmp/freebase/indexing/resources directory*
> >> >>>>>>
> >> >>>>>> -rw-r--r--   1 1206745360 May 19 09:38 incoming_links.txt
> >> >>>>>>
> >> >>>>>> *5. The indexer log*
> >> >>>>>>
> >> >>>>>> *04:31:57,236 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add: 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
> >> >>>>>> *04:32:00,727 [Thread-3] INFO  jenatdb.RdfResourceImporter - Filtered: 2429800000 triples (80.97554853864854%)*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish triples data phase*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - ** Data: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Start triples index phase*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish triples index phase*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish triples load*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - ** Completed: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
> >> >>>>>> 04:32:56,880 [Thread-3] INFO  source.ResourceLoader -    ... moving imported file freebase.nt.gz to imported/freebase.nt.gz
> >> >>>>>> 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -    - completed in 157675 seconds
> >> >>>>>> 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -  > loading '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
> >> >>>>>> 04:32:56,944 [Thread-3] WARN  jenatdb.RdfResourceImporter - ignore File {} because of unknown extension
> >> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -    - completed in 0 seconds
> >> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 2 files imported in 157675 seconds
> >> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader - Loding 0 File ...
> >> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 0 files imported in 0 seconds
> >> >>>>>> 04:32:56,971 [main] INFO  impl.IndexerImpl -  ... delete existing IndexedEntityId file /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
> >> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - Initialisation completed
> >> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl -   ... initialisation completed
> >> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - start indexing ...
> >> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - Indexing started ...
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> 04:45:48,075 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'nsogi' valid , namespace 'http://prefix.cc/nsogi:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,076 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'category' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'chebi' valid , namespace 'http://bio2rdf.org/chebi:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'hgnc' valid , namespace 'http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'pubmed' valid , namespace 'http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbt' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbrc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'call' valid , namespace 'http://webofcode.org/wfn/call:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbcat' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace 'http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'bgcat' valid , namespace 'http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace 'http://www.kinjal.com/condition:' invalid -> mapping ignored!
> >> >>>>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO  impl.IndexerImpl - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
> >> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO  impl.IndexerImpl -  > current sequence : 0
> >> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO  impl.IndexerImpl -  > new sequence: 1
> >> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO  impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 1
> >> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO  impl.IndexerImpl - Indexing: Entity Processor Deamon completed (sequence=1) ...
> >> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO  impl.IndexerImpl -  > current sequence : 1
> >> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO  impl.IndexerImpl -  > new sequence: 2
> >> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO  impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 2
> >> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO  impl.IndexerImpl - Indexing: Entity Perstisting Deamon completed (sequence=2) ...
> >> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO  impl.IndexerImpl -  > current sequence : 2
> >> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO  impl.IndexerImpl -  > new sequence: 3
> >> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO  impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 3
> >> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO  impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): processing:  -1.000ms/item | queue:  -1.000ms*
> >> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO  impl.IndexerImpl -   - source   :  -1.000ms/item
> >> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO  impl.IndexerImpl -   - processing:  -1.000ms/item
> >> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO  impl.IndexerImpl -   - store     :  -1.000ms/item
> >> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO  impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed (sequence=3) ...
> >> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO  impl.IndexerImpl -  > current sequence : 3
> >> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO  impl.IndexerImpl -  > new sequence: 4
> >> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO  impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
> >> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO  impl.IndexerImpl - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
> >> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO  impl.IndexerImpl -  > current sequence : 4
> >> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... indexing completed
> >> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl - start post-processing ...
> >> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl - PostProcessing started ...
> >> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... post-processing finished ...
> >> >>>>>> 05:11:41,911 [main] INFO  impl.IndexerImpl - start finalisation....
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <
> >> >>>>>> rupert.westentha...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >>>>>>>> Hi Rupert and Antonio,
> >> >>>>>>>>
> >> >>>>>>>> Thanks a lot for the reply.
> >> >>>>>>>>
> >> >>>>>>>> I started to follow Rupert's suggestion, however it failed again at
> >> >>>>>>>>
> >> >>>>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal escape sequence value: $ (0x24)
> >> >>>>>>>>
> >> >>>>>>>> Is there any way it can be resolved for the entire file?
> >> >>>>>>>
> >> >>>>>>> The indexing tool uses Apache Jena, and those are Jena parsing errors,
> >> >>>>>>> so the Jena mailing lists would be the better place to look for
> >> >>>>>>> answers.
> >> >>>>>>> This specific issue looks like an invalid URI that is not fixed by the
> >> >>>>>>> fixit script.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>> I requested access to the latest BaseKB bucket, as it doesn't seem to
> >> >>>>>>>> be open.
> >> >>>>>>>>
> >> >>>>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/ --add-header="x-amz-request-payer: requester"
> >> >>>>>>>> ERROR: Access to bucket 'basekb-now' was denied
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> *Couple additional questions:*
> >> >>>>>>>>
> >> >>>>>>>> *1. indexing enhancements:*
> >> >>>>>>>> What settings/properties can one tweak to get the most out of the
> >> >>>>>>>> indexing?
> >> >>>>>>>
> >> >>>>>>> In general you only want the information needed for your application
> >> >>>>>>> case in the index.
> >> >>>>>>> For EntityLinking only labels and type are required.
> >> >>>>>>> Additional properties will only be used for dereferencing Entities, so
> >> >>>>>>> this will depend on your application needs (your dereferencing
> >> >>>>>>> configuration).
> >> >>>>>>>
> >> >>>>>>> In general I try to exclude as much information as possible from the
> >> >>>>>>> index to keep the size of the Solr index as small as possible.
> >> >>>>>>>
> >> >>>>>>>> a. for example, domain specific, such as Pharmaceutical, Law etc.
> >> >>>>>>>> within Freebase
> >> >>>>>>>> b. potential optimizations to speed up the overall indexing
> >> >>>>>>>
> >> >>>>>>> Most of the time will be needed to load the Freebase dump into Jena
> >> >>>>>>> TDB. Even with an SSD-equipped server this will take several days.
> >> >>>>>>> Assigning more RAM will speed up this process as Jena TDB can cache
> >> >>>>>>> more things in RAM.
> >> >>>>>>>
> >> >>>>>>> Usually it is a good idea to cancel the indexing process after the
> >> >>>>>>> importing of the RDF data has finished (and the indexing of the
> >> >>>>>>> Entities has started). This is because after importing, all the RAM
> >> >>>>>>> will be used by Jena TDB for caching stuff that is no longer needed in
> >> >>>>>>> the read-only operations during indexing. So a fresh start can speed up
> >> >>>>>>> the indexing part of the process.
> >> >>>>>>>
> >> >>>>>>> Also have a look at the Freebase Indexing Tool README.
> >> >>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> *2. demo:*
> >> >>>>>>>> I see that in recent GitHub commit(s) the eHealth and other demos have
> >> >>>>>>>> been commented out. How can I get the demo source code and other
> >> >>>>>>>> components for these demos? I prefer to build it myself to see the
> >> >>>>>>>> power of Stanbol.
> >> >>>>>>>
> >> >>>>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully
> >> >>>>>>> compatible with the trunk version.
> >> >>>>>>>
> >> >>>>>>>> *3. custom vocabulary:*
> >> >>>>>>>> Suppose I have a custom vocabulary in CSV format. Is there a preferred
> >> >>>>>>>> way to upload it to Stanbol and have it recognize my entities?
> >> >>>>>>>
> >> >>>>>>> Google Refine [2] with the RDF extension [3]. You can also try to use
> >> >>>>>>> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version,
> >> >>>>>>> but AFAIK this combination is not so stable and might not work at all.
> >> >>>>>>>
> >> >>>>>>> * Google Refine allows you to import your CSV file.
> >> >>>>>>> * Clean it up (if necessary)
> >> >>>>>>> * The RDF extension allows you to map your CSV data to RDF
> >> >>>>>>> * based on this mapping you can save your data as RDF
> >> >>>>>>> * after that you can import the RDF data to Apache Stanbol
> >> >>>>>>>
> >> >>>>>>> hope this helps
> >> >>>>>>> best
> >> >>>>>>> Rupert
> >> >>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Thanks in advance,
> >> >>>>>>>> Rajan
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
> >> >>>>>>> [2] https://code.google.com/p/google-refine/
> >> >>>>>>> [3] http://refine.deri.ie/
> >> >>>>>>> [4] http://openrefine.org/
> >> >>>>>>>
> >> >>>>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <
> >> >>>>>>>> rupert.westentha...@gmail.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Hi Rajan,
> >> >>>>>>>>>
> >> >>>>>>>>> I think this is because you named your file
> >> >>>>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF format
> >> >>>>>>>>> is not provided by the file extension. Renaming the file to
> >> >>>>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
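
The rename suggested above could be sketched as follows, assuming a gzip-compressed N-Triples dump whose name ends in `.gz`. The helper name `with_nt_extension` is hypothetical; it only computes the new name and leaves the actual `mv` to the caller.

```shell
#!/bin/sh
# Hypothetical helper: compute a file name that carries an N-Triples extension.
with_nt_extension() {
    case "$1" in
        *.nt.gz) printf '%s\n' "$1" ;;               # already marked as N-Triples
        *.gz)    printf '%s\n' "${1%.gz}.nt.gz" ;;   # insert .nt before .gz
        *)       printf '%s\n' "$1.nt" ;;            # uncompressed file
    esac
}
```

A rename would then look like `mv "$f" "$(with_nt_extension "$f")"`, run before copying the file into indexing/resources/rdfdata.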
> >> >>>>>>>>>
> >> >>>>>>>>> The suggestion of Antonio to use BaseKB is also a valid option.
> >> >>>>>>>>>
> >> >>>>>>>>> best
> >> >>>>>>>>> Rupert
> >> >>>>>>>>>
> >> >>>>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales
> >> >>>>>>>>> <ape...@zaizi.com> wrote:
> >> >>>>>>>>>> Hi Rajan
> >> >>>>>>>>>>
> >> >>>>>>>>>> The Freebase dump contains some things that do not fit very well with
> >> >>>>>>>>>> the indexer.
> >> >>>>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com),
> >> >>>>>>>>>> which is a curated Freebase dump.
> >> >>>>>>>>>> I did not have any problem indexing it using that dump.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Regards
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I am working on indexing Freebase data within EntityHub and observed
> >> >>>>>>>>>>> the following issue:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I would appreciate any help pertaining to this issue.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks,
> >> >>>>>>>>>>> Rajan
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *Steps followed:*
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *1. Initialization:*
> >> >>>>>>>>>>> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *2. Download the data:*
> >> >>>>>>>>>>> Download the data from https://developers.google.com/freebase/data
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *3. Performed execution of fbrankings-uri.sh*
> >> >>>>>>>>>>> It generated incoming_links.txt under the resources directory as follows
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 10888430 m.0kpv11
> >> >>>>>>>>>>> 3741261 m.019h
> >> >>>>>>>>>>> 2667858 m.0775xx5
> >> >>>>>>>>>>> 2667804 m.0775xvm
> >> >>>>>>>>>>> 1875352 m.01xryvm
> >> >>>>>>>>>>> 1739262 m.05zppz
> >> >>>>>>>>>>> 1369590 m.01xrzlb
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *4. Performed execution of fixit script*
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *5. Rename the fixed file to freebase.rdf.gz and copy it*
> >> >>>>>>>>>>> to indexing/resources/rdfdata
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *6. config/iditerator.properties file has the following setting*
> >> >>>>>>>>>>> #id-namespace=http://freebase.com/
> >> >>>>>>>>>>> ns-prefix-state=false
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *7. Performed run of the following command:*
> >> >>>>>>>>>>> java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar index
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> The error dump on stdout is as follows:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 01:37:32,884 [Thread-0] INFO  solryard.SolrYardIndexingDestination - ... copy Solr Configuration form /private/tmp/freebase/indexing/config/freebase to /private/tmp/freebase/indexing/destination/indexes/default/freebase
> >> >>>>>>>>>>> 01:37:32,895 [Thread-3] INFO  jenatdb.RdfResourceImporter -     - bulk loading File freebase.rdf.gz using Format Lang:RDF/XML
> >> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Start triples data phase
> >> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - ** Load empty triples table
> >> >>>>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.*
> >> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish triples data phase
> >> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish triples load
> >> >>>>>>>>>>> 01:37:32,960 [Thread-3] INFO  source.ResourceLoader - Ignore Error for File /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz and continue
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Additional Reference Point:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *Original Freebase dump size:* 31025015397 May 14 18:10 freebase-rdf-latest.gz
> >> >>>>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45 freebase-rdf-latest-fixed.gz
> >> >>>>>>>>>>> *Incoming Links size:* 1206745360 May 17 00:42 incoming_links.txt
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
> >> >>>>>>>>> | Bodenlehenstraße 11                              ++43-699-11108907
> >> >>>>>>>>> | A-5500 Bischofshofen
> >> >>>>>>>>> | REDLINK.CO
> >> >>>>>>>>> ..........................................................................
> >> >>>>>>>>> | http://redlink.co/
>
