Paolo Castagna wrote:
> Hi,
> I have a MapReduce job with a map function which parses a line from an
> N-Quads file:
> 
>   private static final Logger log = LoggerFactory.getLogger(FirstMapper.class);
>   private String inputFileName;
>   private MapReduceParserProfile profile;
>   private LabelToNode labelMapping;
> 
>   @Override
>   public void setup(Context context) throws IOException, InterruptedException {
>       // Scope blank node labels to the file this split comes from
>       inputFileName = context.getConfiguration().get("mapreduce.map.input.file");
>       Prologue prologue = new Prologue(null, IRIResolver.createNoResolve());
>       labelMapping = new MapReduceLabelToNode(inputFileName);
>       profile = new MapReduceParserProfile(prologue,
>           ErrorHandlerFactory.errorHandlerStd, labelMapping);
>   }
> 
>   @Override
>   public void map(LongWritable key, Text value, Context context)
>       throws IOException, InterruptedException {
>       if ( log.isDebugEnabled() ) log.debug("< ({}, {})", key, value);
>       // Parse a single N-Quads line and emit the resulting quads
>       SinkToContext sink = new SinkToContext(context);
>       Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(value.toString());
>       LangNQuads parser = new LangNQuads(tokenizer, profile, sink);
>       parser.parse();
>   }

FirstMapper.java is here:
https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/FirstMapper.java
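
For context, SinkToContext above is just an adapter that pushes each parsed
quad into the MapReduce context. It isn't shown here, so the following is a
minimal sketch, assuming Text/NullWritable output types and the Jena/ARQ
package names of the time (the real class may well differ):

  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.TaskInputOutputContext;
  import org.openjena.atlas.lib.Sink;

  import com.hp.hpl.jena.sparql.core.Quad;

  // Hypothetical adapter: forwards each parsed Quad to the MapReduce context.
  public class SinkToContext implements Sink<Quad> {
      private final TaskInputOutputContext<?, ?, Text, NullWritable> context;

      public SinkToContext(TaskInputOutputContext<?, ?, Text, NullWritable> context) {
          this.context = context;
      }

      @Override
      public void send(Quad quad) {
          try {
              // Emitting the quad's string form as the key is an assumption;
              // a proper QuadWritable would avoid the round-trip through text.
              context.write(new Text(quad.toString()), NullWritable.get());
          } catch (Exception e) {
              throw new RuntimeException(e);
          }
      }

      @Override
      public void flush() {}

      @Override
      public void close() {}
  }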

> 
> (A RecordReader<LongWritable, QuadWritable> would be better, but for now the
> snippet above does its job. Almost.)
> 
> The problem I have is with blank node labels.
> 
> With MapReduce, the same file is split into multiple file splits, which are
> parsed on different machines. Therefore, I would like to have my own
> LabelToNode implementation with an Allocator<String, Node> that takes the
> filename (or a hash of it) into account when it creates a new blank node.
> 
> Something along these lines:
> 
>   public Node create(String label) {
>       return Node.createAnon(new AnonId(filename + "-" + label));
>   }
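
Spelled out as a class (a sketch: it fills in only what the fragment above
implies; the Allocator import is omitted since its package depends on the
RIOT version, and if the interface declares anything beyond create(String),
those methods would need stubs as well):

  import com.hp.hpl.jena.graph.Node;
  import com.hp.hpl.jena.rdf.model.AnonId;

  // Prefix every blank node label with the input filename: the same label
  // in two different files yields two distinct blank nodes, while repeated
  // labels within one file keep mapping to the same node.
  public class MapReduceAllocator implements Allocator<String, Node> {
      private final String filename;

      public MapReduceAllocator(String filename) {
          this.filename = filename;
      }

      @Override
      public Node create(String label) {
          return Node.createAnon(new AnonId(filename + "-" + label));
      }
  }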
> 
> So, I have my MapReduceLabelToNode:
> 
> public class MapReduceLabelToNode extends LabelToNode {
> 
>     public MapReduceLabelToNode(String filename) {
>         super(new SingleScopePolicy(), new MapReduceAllocator(filename));
>     }
> 
>     ...
> 

MapReduceLabelToNode.java is here:
https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/MapReduceLabelToNode.java

> But the LabelToNode constructor is private.
> 
> Could we make it protected?
> 
> Or, alternatively, how can I construct a LabelToNode object that uses my
> MapReduceAllocator?
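
To make the request concrete, the change would look something like this (a
sketch only, not the actual LabelToNode source; the field names are guesses):

  public class LabelToNode {
      private final ScopePolicy scopePolicy;
      private final Allocator<String, Node> allocator;

      // The requested change: private -> protected, so a subclass such as
      // MapReduceLabelToNode above can call super(policy, allocator).
      protected LabelToNode(ScopePolicy scopePolicy, Allocator<String, Node> allocator) {
          this.scopePolicy = scopePolicy;
          this.allocator = allocator;
      }

      // Alternative that keeps the constructor private: a static factory
      // accepting a caller-supplied Allocator.
      public static LabelToNode create(ScopePolicy scopePolicy, Allocator<String, Node> allocator) {
          return new LabelToNode(scopePolicy, allocator);
      }
  }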
> 
> Thanks,
> Paolo
> 

Another thing I struggled with is using RIOT parsers in a QuadRecordReader:
https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/io/QuadRecordReader.java

The problem I have is that Hadoop splits a file at byte boundaries, so a line
is often truncated at the start of a split. Is there a way to tell RIOT, given
an InputStream, to skip ahead to the next quad/line and start parsing from
there?
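
For line-based syntaxes the usual trick is the one Hadoop's own
LineRecordReader plays: every non-initial split discards its first (possibly
truncated) line and reads one line past its own end, so each line is processed
by exactly one reader. A sketch of that idiom on the Hadoop side (the helper
class and method names here are mine, not Hadoop API):

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;
  import org.apache.hadoop.util.LineReader;

  public class SplitAlignment {
      // Position 'in' at the first line that belongs to this split and
      // return that byte offset. Lines are then consumed until the offset
      // passes split.getStart() + split.getLength(): the line straddling
      // the end belongs to this split, and the next split's reader skips it.
      public static long skipToFirstCompleteLine(FSDataInputStream in, FileSplit split)
              throws IOException {
          long start = split.getStart();
          if (start == 0)
              return 0; // first split: the file itself starts a fresh line
          // Back up one byte so a split that begins exactly after a '\n'
          // does not throw away a complete line, then discard everything up
          // to and including the next '\n'; that truncated first line is
          // read in full by the previous split's reader.
          in.seek(start - 1);
          LineReader reader = new LineReader(in);
          return start - 1 + reader.readLine(new Text());
      }
  }

Each complete line can then be handed to TokenizerFactory.makeTokenizerString
as in FirstMapper above, so the parser never sees a truncated quad.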

Thanks,
Paolo
