[ https://issues.apache.org/jira/browse/RYA-500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532874#comment-16532874 ]
ASF GitHub Bot commented on RYA-500:
------------------------------------

Github user kchilton2 commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/299#discussion_r200156492

--- Diff: extras/rya.manual/src/site/markdown/loaddata.md ---
@@ -92,29 +92,55 @@ The default "format" is RDF/XML, but these formats are supported : RDFXML, NTRIP
 
 ## Bulk Loading data
 
-Bulk loading data is done through Map Reduce jobs
+Bulk loading data is done through Map Reduce jobs.
 
 ### Bulk Load RDF data
 
-This Map Reduce job will read files into memory and parse them into statements. The statements are saved into the store. Here is an example for storing in Accumulo:
+This Map Reduce job will read files into memory and parse them into statements. The statements are saved into the triplestore.
+Here are the steps to prepare and run the job:
+
+ * Load the RDF data to HDFS. It can be single of multiple volumes and directories in them.
+ * Also load the `mapreduce/target/rya.mapreduce-<version>-shaded.jar` executable jar file to HDFS.
+ * Run the following sample command:
 
 ```
-hadoop jar target/rya.mapreduce-3.2.10-SNAPSHOT-shaded.jar org.apache.rya.accumulo.mr.RdfFileInputTool -Dac.zk=localhost:2181 -Dac.instance=accumulo -Dac.username=root -Dac.pwd=secret -Drdf.tablePrefix=triplestore_ -Drdf.format=N-Triples /tmp/temp.ntrips
+hadoop hdfs://volume/rya.mapreduce-<version>-shaded.jar org.apache.rya.accumulo.mr.tools.RdfFileInputTool -Dac.zk=localhost:2181 -Dac.instance=accumulo -Dac.username=root -Dac.pwd=secret -Drdf.tablePrefix=triplestore_ -Drdf.format=N-Triples hdfs://volume/dir1,hdfs://volume/dir2,hdfs://volume/file1.nt
 ```
 
 Options:
 
-- rdf.tablePrefix : The tables (spo, po, osp) are prefixed with this qualifier. The tables become: (rdf.tablePrefix)spo,(rdf.tablePrefix)po,(rdf.tablePrefix)osp
-- ac.* : Accumulo connection parameters
-- rdf.format : See RDFFormat from RDF4J, samples include (Trig, N-Triples, RDF/XML)
-- sc.use_freetext, sc.use_geo, sc.use_temporal, sc.use_entity : If any of these are set to true, statements will also be
+- **rdf.tablePrefix** - The tables (spo, po, osp) are prefixed with this qualifier.
+  The tables become: (rdf.tablePrefix)spo,(rdf.tablePrefix)po,(rdf.tablePrefix)osp
+- **ac.*** - Accumulo connection parameters
+- **rdf.format** - See RDFFormat from RDF4J, samples include (Trig, N-Triples, RDF/XML)
+- **sc.use_freetext, sc.use_geo, sc.use_temporal, sc.use_entity** - If any of these are set to true, statements will also be
 added to the enabled secondary indices.
-- sc.freetext.predicates, sc.geo.predicates, sc.temporal.predicates: If the associated indexer is enabled, these options specify
+- **sc.freetext.predicates, sc.geo.predicates, sc.temporal.predicates** - If the associated indexer is enabled, these options specify
 which statements should be sent to that indexer (based on the predicate). If not given, all indexers will attempt to index all statements.
-The argument is the directory/file to load. This file needs to be loaded into HDFS before running. If loading a directory, all files should have the same RDF
-format.
+The positional argument is a comma separated list of directories/files to load.
+They need to be loaded into HDFS before running. If loading a directory,
+all files should have the same RDF format.
+
+Once the data is loaded, it is actually a good practice to compact your tables.
+You can do this by opening the accumulo shell shell and running the compact

--- End diff --

    "the accumulo shell shell and"

> Make RdfFileInputTool to accept multiple input paths
> ----------------------------------------------------
>
>                 Key: RYA-500
>                 URL: https://issues.apache.org/jira/browse/RYA-500
>             Project: Rya
>          Issue Type: Improvement
>    Affects Versions: 3.2.12
>            Reporter: Maxim Kolchin
>            Priority: Trivial
>              Labels: mapreduce
>
> We store RDF files in multiple folders where each folder contains data about
> a specific type of entity (e.g. person, company, etc.). So it's not
> convenient that the RdfFileInputTool allows only a single input path.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
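The improvement the issue asks for amounts to treating RdfFileInputTool's single positional argument as a comma separated list of HDFS paths rather than one path. A minimal, self-contained sketch of that parsing step (this is illustrative only, not the actual Rya implementation; the class and method names are hypothetical, and in practice Hadoop's `FileInputFormat.setInputPaths(Job, String)` already accepts such a comma separated string):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Illustrative sketch for RYA-500: split the tool's positional argument
 * into multiple input paths. Class/method names are hypothetical; the
 * real tool would hand the resulting paths to Hadoop's FileInputFormat.
 */
public class InputPathsSketch {

    /** Split a comma separated path list, trimming whitespace and dropping empty entries. */
    public static List<String> parseInputPaths(String arg) {
        return Arrays.stream(arg.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Same positional argument shape as the sample command in the diff above.
        String positionalArg = "hdfs://volume/dir1,hdfs://volume/dir2,hdfs://volume/file1.nt";
        List<String> paths = parseInputPaths(positionalArg);
        System.out.println(paths.size());   // 3
        System.out.println(paths.get(2));   // hdfs://volume/file1.nt
    }
}
```

One caveat worth noting for a real implementation: this simple split assumes the paths themselves contain no commas, which is the usual convention Hadoop itself relies on for comma separated input path lists.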