I've successfully run the vectorization process on the Reuters dataset.

Now I'm trying to vectorize the wikidataset (10.6 GB), and I'm getting an OutOfMemoryError.

Any help?

Thanks.

 


aroha@aroha-laptop:~/workspace/mahout$ bin/mahout seqdirectory -c UTF-8 -i /media/F89A6F359A6EF014/wiki/wikidataset/ -o /media/F89A6F359A6EF014/wiki/wiki_sequences/
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
11/12/01 11:12:50 INFO common.AbstractJob: Command line arguments:
{--charset=UTF-8, --chunkSize=64, --endPhase=2147483647,
--fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter,
--input=/media/F89A6F359A6EF014/wiki/wikidataset/, --keyPrefix=,
--output=/media/F89A6F359A6EF014/wiki/wiki_sequences/, --startPhase=0,
--tempDir=temp}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
    at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:769)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:791)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)

 


--------------------------------------------------------------------------------

Then I ran it again after unsetting the MAHOUT_LOCAL environment variable:

 

aroha@aroha-laptop:~/workspace/mahout$ unset MAHOUT_LOCAL



aroha@aroha-laptop:~/workspace/mahout$ bin/mahout seqdirectory -c UTF-8 -i /media/F89A6F359A6EF014/wiki/wikidataset/ -o /media/F89A6F359A6EF014/wiki/wiki_sequences/
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/aroha/Desktop/Aroha/hadoop-0.20.203.0
HADOOP_CONF_DIR=/home/aroha/Desktop/Aroha/hadoop-0.20.203.0/conf
MAHOUT-JOB: /home/aroha/workspace/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/01 11:13:21 INFO common.AbstractJob: Command line arguments:
{--charset=UTF-8, --chunkSize=64, --endPhase=2147483647,
--fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter,
--input=/media/F89A6F359A6EF014/wiki/wikidataset/, --keyPrefix=,
--output=/media/F89A6F359A6EF014/wiki/wiki_sequences/, --startPhase=0,
--tempDir=temp}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
    at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:769)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:791)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
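Both runs die at the same point, while the driver JVM is still listing and filtering the input files, so the next thing I plan to try (untested, and based only on my reading of the bin/mahout script, so please correct me if this is wrong) is giving that client JVM a bigger heap via MAHOUT_HEAPSIZE before rerunning:

```shell
# Untested guess: bin/mahout appears to turn MAHOUT_HEAPSIZE (in MB)
# into a -Xmx flag for the JVM it launches, so raise it before rerunning.
export MAHOUT_HEAPSIZE=4096

bin/mahout seqdirectory -c UTF-8 \
  -i /media/F89A6F359A6EF014/wiki/wikidataset/ \
  -o /media/F89A6F359A6EF014/wiki/wiki_sequences/
```

If that still runs out of memory, I suppose splitting the 10.6 GB input into smaller subdirectories and running seqdirectory per chunk would be another thing to test.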

Regards,

Faizan
