Re: File format question about Random forest.

deneche abdelhakim Fri, 15 Jul 2011 14:43:41 -0700

I think Hadoop can read files from the local file system if you prefix the
path with "file:///".
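
A minimal sketch of building such a URI, written in Python rather than with anything Mahout or Hadoop ships. The helper name `to_file_uri` and the example paths are hypothetical, and whether Describe accepts the resulting URIs for -p and -f has not been verified here.

```python
from pathlib import PurePosixPath


def to_file_uri(local_path: str) -> str:
    """Return a file:/// URI for an absolute POSIX-style path."""
    path = PurePosixPath(local_path)
    # as_uri() only works on absolute paths, so fail early with a clear message.
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {local_path!r}")
    return path.as_uri()


# Hypothetical locations, used only for illustration.
data_uri = to_file_uri("/home/user/data/train.csv")
info_uri = to_file_uri("/home/user/data/train.info")
print(data_uri)
print(info_uri)
```

The two printed URIs are the form that would be passed in place of plain paths, assuming the tool resolves its arguments through Hadoop's file system layer.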


On Fri, Jul 15, 2011 at 3:55 PM, Xiaobo Gu <[email protected]> wrote:

> Do the -p and -f options of org.apache.mahout.df.tools.Describe have to
> be HDFS URLs, or can they be local file system paths?
>
>
> On Fri, Jul 15, 2011 at 9:28 PM, Xiaobo Gu <[email protected]> wrote:
> > Can we generate the file descriptor as follows:
> >
> > 1. make a small csv file with the same format as the actual dataset,
> > say a CSV file with header and only one record,
> > 2. Use java weka.core.converters.CSVLoader filename.csv >
> > filename.arff to convert the small CSV into an ARFF file, see
> > http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
> > 3. Use org.apache.mahout.df.tools.Describe to generate a descriptor
> >
> >
> > The only concern here is: is a small CSV file with one record
> > sufficient to generate the ARFF file header, or do we have to
> > use the whole file to avoid losing information?
> >
> >
> > Xiaobo Gu
> >
> >
> >
> >
> > On Fri, Jul 15, 2011 at 9:10 PM, Xiaobo Gu <[email protected]> wrote:
> >> But if we use CSV files, how can we generate descriptors for datasets?
> >>
> >> Cheers
> >>
> >> Xiaobo Gu
> >>
> >> On Thu, Jul 14, 2011 at 1:27 AM, deneche abdelhakim <[email protected]> wrote:
> >>> I guess yes, as long as you don't use single or double quotes to
> >>> enclose the fields.
> >>>
> >>> On Wed, Jul 13, 2011 at 2:58 PM, Xiaobo Gu <[email protected]> wrote:
> >>>
> >>>> So for simple datasets, which only have numeric columns and
> >>>> categorical columns with character labels (without blanks), can we
> >>>> just use CSV tools to save them as standard CSV files without a header?
> >>>>
> >>>>
> >>>> On Wed, Jul 13, 2011 at 3:53 AM, deneche abdelhakim <[email protected]> wrote:
> >>>> > The current implementation doesn't support the ARFF format
> >>>> > out-of-the-box. As described in the wiki, you need to remove the
> >>>> > header of the file and leave only the data. Actually, this
> >>>> > implementation is fully compatible with UCI's datasets, which are
> >>>> > comma-separated text files. You'll also need to call the dataset
> >>>> > description tool (see the wiki) in order to generate a proper
> >>>> > description file (it contains the nature of each attribute:
> >>>> > Numerical or Categorical).
> >>>> >
> >>>> > Yes, you can use BuildForest and TestForest to generate and use
> >>>> > Random Forest models from the command line.
> >>>> >
> >>>> > On Tue, Jul 12, 2011 at 2:19 PM, Xiaobo Gu <[email protected]> wrote:
> >>>> >
> >>>> >> Hi,
> >>>> >>
> >>>> >> The Random Forest partial implementation described at
> >>>> >> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
> >>>> >> uses the ARFF file format. Is ARFF the only supported file format
> >>>> >> when using the BuildForest and TestForest programs, and are
> >>>> >> BuildForest and TestForest the official tools for building Random
> >>>> >> Forest models from the command line?
> >>>> >>
> >>>> >> Regards,
> >>>> >>
> >>>> >> Xiaobo Gu
> >>>> >>
> >>>> >
> >>>>
> >>>
> >>
> >
>
