Re: seqdirectory command in MapReduce

2013-02-16 Thread Dan Filimon

Hi Claudio,

Could you be more specific? What does 'MapReduce style' mean?
seqdirectory should create sequence files from the documents in a
folder, where the keys are the document names and the values are the
documents' content.

What do you need it to do?

On Sat, Feb 16, 2013 at 5:55 PM, Claudio Reggiani nop...@gmail.com wrote:
 Hello,

 I have a text dataset. Running the seqdirectory command on it, I see it's not
 written in MapReduce style (looking at the source code of
 SequenceFilesFromDirectory confirms that).

 What if I have a big dataset stored in HDFS and I would like to convert it
 to SequenceFile format? Do I need to create my own custom job, or does
 seqdirectory do that?

 Thanks
 Claudio Reggiani
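
To make the key/value layout described above concrete, here is a minimal Python sketch of a sequential, single-process conversion. It is only an illustration, not Mahout's code: the names iter_documents and build_demo_records are made up for this example, the key format (path relative to the input folder) is an assumption, and the records are kept as plain tuples rather than written to a real Hadoop SequenceFile.

```python
import os
import tempfile


def iter_documents(root):
    """Yield (key, value) pairs: key is the path relative to root, value is the text."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Assumed key format: relative path with forward slashes.
            key = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, encoding="utf-8") as handle:
                # The whole document is read into memory as one value.
                yield key, handle.read()


def build_demo_records():
    """Create a tiny hypothetical input folder and convert it to records."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        with open(os.path.join(root, "a.txt"), "w", encoding="utf-8") as out:
            out.write("first document")
        with open(os.path.join(root, "sub", "b.txt"), "w", encoding="utf-8") as out:
            out.write("second document")
        return list(iter_documents(root))


if __name__ == "__main__":
    for key, value in build_demo_records():
        print(key, "->", value)
```

Note that in this sketch every document becomes exactly one value, so each individual file has to fit in the memory of the process doing the conversion.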

Re: seqdirectory command in MapReduce

2013-02-16 Thread Claudio Reggiani

Let's say the directory has only one big text file. Logically it's one file, but
on HDFS the data is actually distributed across the cluster. Suppose now the
big text file does not fit in memory (on any node of the cluster): does
seqdirectory still work?

If so, the only way is to run seqdirectory as a MapReduce job.

The output will be (logically) one key-value record, where (as you said)
the key is the file name and the value is the file content in vector format.

Sorry for my vagueness
Claudio



Re: seqdirectory command in MapReduce

2013-02-16 Thread Steve Chien

I think he meant that the code reads and converts the files from the input
directory as a standalone program, not as a MapReduce program.


Re: seqdirectory command in MapReduce

2013-02-16 Thread Claudio Reggiani

Yes, thank you, Steve. And sorry for my cryptic messages.

Claudio



Re: seqdirectory command in MapReduce

2013-02-16 Thread Dan Filimon

But why would this be a problem? As long as it's using HDFS to access
the files, it should be able to fetch the chunks from wherever they
might be in the cluster.

I don't see why it wouldn't work. Let us know if it works!


Re: seqdirectory command in MapReduce

2013-02-16 Thread Josh Patterson

Look at MAHOUT-833; this patch gives you this functionality.


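
As a rough picture of what a "MapReduce style" conversion means in this thread, here is a small Python sketch in which each input file is handled by an independent map step that emits one (file name, content) record, with no reduce step. It is not the MAHOUT-833 patch and does not use Hadoop; the function names map_file, convert_in_parallel and run_demo are hypothetical, and a thread pool merely stands in for map tasks running on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import tempfile


def map_file(path):
    """Map step: turn one input file into a single (key, value) record."""
    with open(path, encoding="utf-8") as handle:
        return os.path.basename(path), handle.read()


def convert_in_parallel(paths, workers=2):
    """Apply the map step to every file independently; no reduce step is needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input paths.
        return list(pool.map(map_file, paths))


def run_demo():
    """Create three small hypothetical documents and convert them in parallel."""
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for index in range(3):
            path = os.path.join(root, "doc%d.txt" % index)
            with open(path, "w", encoding="utf-8") as out:
                out.write("text %d" % index)
            paths.append(path)
        return convert_in_parallel(paths)


if __name__ == "__main__":
    for key, value in run_demo():
        print(key, "->", value)
```

Because every file is converted on its own, the work spreads across workers when there are many files. A single huge file is still one record handled by one worker in this sketch, so it does not address the one-big-file case raised earlier in the thread.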
