I think Hadoop can read files from the local file system if you prefix the path with "file:///".
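
As a rough sketch of what such a path looks like, the plain Java snippet below turns a local path into a "file:///" URI string. The class name LocalFileUri and the sample path are made up for illustration, and I have not verified that the -p and -f options of Describe accept the result, so treat it as an assumption to test rather than a confirmed recipe.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Hadoop or Mahout.
final class LocalFileUri {

    // Converts a local path into an absolute URI string with the "file" scheme.
    static String toFileUri(String localPath) {
        Path absolute = Paths.get(localPath).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // The sample path is an assumption; substitute your own dataset location.
        System.out.println(toFileUri("/tmp/mahout-demo/glass.data"));
    }
}
```

The resulting string is what you would try passing in place of an HDFS URL.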

On Fri, Jul 15, 2011 at 3:55 PM, Xiaobo Gu wrote:
> Do the -p and -f options of org.apache.mahout.df.tools.Describe have to
> be HDFS URLs, or can they be local file system paths?
>
> On Fri, Jul 15, 2011 at 9:28 PM, Xiaobo Gu wrote:
> > Can we make the file descriptor as follows:
> >
> > 1. Make a small CSV file with the same format as the actual dataset,
> > say a CSV file with a header and only one record.
> > 2. Use java weka.core.converters.CSVLoader filename.csv > filename.arff
> > to convert the small CSV into an ARFF file, see
> > http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
> > 3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.
> >
> > The only concern here is: is the small CSV file with one record
> > sufficient to generate the ARFF file header, or do we have to
> > use the whole file to avoid losing information?
> >
> > Xiaobo Gu
> >
> > On Fri, Jul 15, 2011 at 9:10 PM, Xiaobo Gu wrote:
> >> But if we use CSV files, how can we generate descriptors for datasets?
> >>
> >> Cheers
> >>
> >> Xiaobo Gu
> >>
> >> On Thu, Jul 14, 2011 at 1:27 AM, deneche abdelhakim wrote:
> >>> I guess yes, as long as you don't use single or double quotes to
> >>> embed the fields.
> >>>
> >>> On Wed, Jul 13, 2011 at 2:58 PM, Xiaobo Gu wrote:
> >>>
> >>>> So for simple datasets, which only have numeric columns and
> >>>> category columns with character labels (without blanks), can we
> >>>> just use CSV tools to save them as a standard CSV file without a header?
> >>>>
> >>>> On Wed, Jul 13, 2011 at 3:53 AM, deneche abdelhakim wrote:
> >>>> > The current implementation doesn't support the ARFF format
> >>>> > out-of-the-box. As described in the wiki, you need to remove the
> >>>> > header of the file and leave only the data. Actually, this
> >>>> > implementation is fully compatible with UCI's datasets, which are
> >>>> > comma-separated text files. You'll also need to call the dataset
> >>>> > description tool (see the wiki) in order to generate a proper
> >>>> > description file (it contains the nature of each attribute:
> >>>> > Numerical or Categorical).
> >>>> >
> >>>> > Yes, you can use BuildForest and TestForest to generate and use
> >>>> > Random Forest models from the command line.
> >>>> >
> >>>> > On Tue, Jul 12, 2011 at 2:19 PM, Xiaobo Gu wrote:
> >>>> >
> >>>> >> Hi,
> >>>> >>
> >>>> >> The Random Forest partial implementation in
> >>>> >> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
> >>>> >> uses the ARFF file format. Is ARFF the only supported file format
> >>>> >> when using the BuildForest and TestForest programs, and are
> >>>> >> BuildForest and TestForest the official tools to build Random
> >>>> >> Forest models from the command line?
> >>>> >>
> >>>> >> Regards,
> >>>> >>
> >>>> >> Xiaobo Gu
