Re: File format question about Random forest.

Xiaobo Gu Fri, 15 Jul 2011 21:58:40 -0700

But if I just use a CSV file, how can I generate the descriptor file?
Must a descriptor file be supplied for BuildForest and TestForest?



On Sat, Jul 16, 2011 at 5:39 AM, deneche abdelhakim <[email protected]> wrote:
> You don't need to convert the CSV file to ARFF; you can use it right away.
>
> You can use a small dataset as long as all values of the categorical attributes
> are present in the dataset.
>
> On Fri, Jul 15, 2011 at 2:28 PM, Xiaobo Gu <[email protected]> wrote:
>
>> Can we generate the descriptor file as follows:
>>
>> 1. Make a small CSV file with the same format as the actual dataset,
>> say a CSV file with a header and only one record.
>> 2. Use java weka.core.converters.CSVLoader filename.csv >
>> filename.arff to convert the small CSV into an ARFF file, see
>> http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
>> 3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.
>>
>>
>> The only concern here is: is the small CSV file with one record
>> sufficient to generate the ARFF file header, or do we have to
>> use the whole file to avoid losing information?
>>
>>
>> Xiaobo Gu
>>
>>
>>
>>
>> On Fri, Jul 15, 2011 at 9:10 PM, Xiaobo Gu <[email protected]> wrote:
>> > But if we use CSV files, how can we generate descriptors for datasets?
>> >
>> > Cheers
>> >
>> > Xiaobo Gu
>> >
>> > On Thu, Jul 14, 2011 at 1:27 AM, deneche abdelhakim <[email protected]> wrote:
>> >> I guess so, as long as you don't use quotes or double quotes to embed
>> >> the fields.
>> >>
>> >> On Wed, Jul 13, 2011 at 2:58 PM, Xiaobo Gu <[email protected]> wrote:
>> >>
>> >>> So for simple datasets, which only have numeric columns and categorical
>> >>> columns with character labels (without blanks), can we just use CSV tools to
>> >>> save them as a standard CSV file without a header?
>> >>>
>> >>>
>> >>> On Wed, Jul 13, 2011 at 3:53 AM, deneche abdelhakim <[email protected]> wrote:
>> >>> > The current implementation doesn't support the ARFF format out of the box;
>> >>> > as described in the wiki, you need to remove the header of the file and leave
>> >>> > only the data. Actually, this implementation is fully compatible with UCI's
>> >>> > datasets, which are comma-separated text files. You'll also need to call the
>> >>> > dataset description tool (see the wiki) in order to generate a proper
>> >>> > description file (it contains the nature of each attribute: Numerical or
>> >>> > Categorical).
>> >>> >
>> >>> > Yes, you can use BuildForest and TestForest to generate and use Random forest
>> >>> > models from the command line.
>> >>> >
>> >>> > On Tue, Jul 12, 2011 at 2:19 PM, Xiaobo Gu <[email protected]> wrote:
>> >>> >
>> >>> >> Hi,
>> >>> >>
>> >>> >> The Random Forest partial implementation described at
>> >>> >> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
>> >>> >> uses the ARFF file format. Is ARFF the only supported file format when
>> >>> >> using the BuildForest and TestForest programs, and are BuildForest and
>> >>> >> TestForest the official tools to build Random Forest models
>> >>> >> from the command line?
>> >>> >>
>> >>> >> Regards,
>> >>> >>
>> >>> >> Xiaobo Gu
>> >>> >>
>> >>> >
>> >>>
>> >>
>> >
>>
>
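
Two points from the thread lend themselves to a quick check before running the description tool: each attribute is either Numerical or Categorical, and a small sample is only safe if it contains every value of the categorical attributes. The following is a minimal sketch of that check, not part of Mahout. The helper names (read_rows, infer_kinds, missing_categories), the 'N'/'C' shorthand, and the toy data are all assumptions made for illustration, and the input is assumed to be a headerless, comma-separated file without quoted fields.

```python
import csv
import io


def read_rows(text):
    """Parse headerless, comma-separated text into a list of rows."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def is_number(value):
    """Return True if the field parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_kinds(rows):
    """Guess a kind per column: 'N' if every value is numeric, else 'C'.

    This is only a heuristic: a categorical attribute coded with
    digits would be guessed as 'N' and must be corrected by hand.
    """
    return ["N" if all(is_number(v) for v in col) else "C"
            for col in zip(*rows)]


def missing_categories(sample_rows, full_rows, kinds):
    """Map column index -> categorical values absent from the sample."""
    missing = {}
    for i, kind in enumerate(kinds):
        if kind != "C":
            continue
        gap = {r[i] for r in full_rows} - {r[i] for r in sample_rows}
        if gap:
            missing[i] = gap
    return missing


# Hypothetical toy dataset: one numeric column, two categorical columns.
full = read_rows("5.1,red,yes\n4.9,blue,no\n6.3,green,yes\n")
sample = full[:1]  # the "one record" sample proposed in the thread
kinds = infer_kinds(full)
print(kinds)
print(missing_categories(sample, full, kinds))
```

With the toy data, the one-record sample lacks several categorical values, which illustrates why a single record is generally not enough unless each categorical attribute happens to have only one value.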
