Hi everybody,

Here are some crawling performance results using Nutch 2.2.1 with
Cassandra 1.2.8.
I deployed a single Nutch 2.2.1 instance on Ubuntu 12.04 LTS, on a single
m2.2xlarge AWS machine [4 vCPU, 13 ECU, 34.2 GiB RAM, 850 GB storage, moderate
network performance (what does "moderate" mean?)].
I installed Cassandra 1.2.8 on the same machine.

TEST 1
Cassandra is started using sudo cassandra -f [see A - CASSANDRA LAUNCH
INFORMATION at the end of the mail].
Nutch is started using sudo ./bin/nutch crawl tf1/tf1Seed/ -threads 10 -depth 20
[see B - NUTCH LAUNCH INFORMATION at the end of the mail].
I used the standard configuration except for the following:
      - nutch-site.xml : fetcher.server.delay = 0.4
      - regex-urlfilter.txt : +^http://www.tf1.fr(/[a-zA-Z0-9-_&\?=%]*\.)*
      - domain-urlfilter.txt : com \n org \n net \n edu \n gov
      [I know there is a simpler way to ignore external links: overriding
the db.ignore.external.links parameter.]
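
A minimal sketch of how the nutch-site.xml override listed above might be written. The fetcher.server.delay value is the one used in this test. The db.ignore.external.links entry is only an illustration of the simpler alternative mentioned in the note: its value of true is an assumption and it was not part of this test's configuration.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml (sketch): overrides for the defaults in nutch-default.xml -->
<configuration>
  <!-- Delay in seconds between successive requests to the same server (value from TEST 1) -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.4</value>
  </property>
  <!-- Assumed alternative to the URL filter files: drop outlinks that leave the source host.
       Not part of the tested configuration. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```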

RESULTS 1
The crawl failed after around 24 hours of crawling www.tf1.fr (and only this domain).
During the 3rd round there were more than 56K URLs to fetch. When the parsing
step came, it threw a heap space exception [see C - HEAP SPACE
EXCEPTION at the end of the mail].

TEST 2
I tried the same configuration but with the Gora properties gora.buffer.read.limit and
gora.buffer.write.limit set to 1000 instead of 10000.
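
A sketch of what this change might look like, assuming the two limits are set as ordinary properties in nutch-site.xml (the file actually edited is not stated above, so the location is an assumption). The values are the ones used in this test.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml (sketch, assumed location for these properties) -->
<configuration>
  <!-- Number of rows read per query batch by the Gora record reader (10000 in TEST 1) -->
  <property>
    <name>gora.buffer.read.limit</name>
    <value>1000</value>
  </property>
  <!-- Number of buffered writes before the Gora record writer flushes (10000 in TEST 1) -->
  <property>
    <name>gora.buffer.write.limit</name>
    <value>1000</value>
  </property>
</configuration>
```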

RESULTS 2
It seems to work, as it is still running. The parse step passed. It is now
running the next fetching round (the 4th one, I think).
The top command shows:
 6817 root      20   0 11.4g 6.8g 420m S    0 20.3  25:43.37 java [cassandra]
 7029 root      20   0 3073m 501m  17m S    1  1.5  24:53.21 java [nutch]
The RAM consumed by Cassandra is relatively constant, but
close to the 8 GB heap limit.

TO CONCLUDE
- I will tell you the final results as soon as the crawl ends.
- I will run the same benchmark using Nutch 2.2.1 deployed on a 2-machine
Hadoop cluster.
- I will also try the same benchmark using HBase instead of Cassandra
with these configurations:
      - Hadoop 0.20.205 + HBase 0.90.5
      - Hadoop 0.20-append-r1056497 + HBase 0.90.4 (the HBase 0.90.4 pom
says it uses Hadoop 0.20-append-r1056497)

If you have recommendations about Hadoop/HBase with Gora 0.3, please tell
me. I am unsure which Hadoop/HBase versions to use with Gora 0.3.
[Gora 0.3 has a dependency on hadoop-core 1.0.1]

Bye!

A - CASSANDRA LAUNCH INFORMATION
 6817 pts/0    Sl    25:45  |   \_ java -ea
-javaagent:./bin/../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42 -Xms8192M -Xmx8192M -Xmn400M
-XX:+HeapDumpOnOutOfMemoryError -Xss280k -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark
-Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dlog4j.configuration=log4j-server.properties
-Dlog4j.defaultInitOverride=true -Dcassandra-foreground=yes -cp
./bin/../conf:./bin/../build/classes/main:./bin/../build/classes/thrift:./bin/../lib/antlr-3.2.jar:./bin/../lib/apache-cassandra-1.2.8.jar:./bin/../lib/apache-cassandra-clientutil-1.2.8.jar:./bin/../lib/apache-cassandra-thrift-1.2.8.jar:./bin/../lib/avro-1.4.0-fixes.jar:./bin/../lib/avro-1.4.0-sources-fixes.jar:./bin/../lib/commons-cli-1.1.jar:./bin/../lib/commons-codec-1.2.jar:./bin/../lib/commons-lang-2.6.jar:./bin/../lib/compress-lzf-0.8.4.jar:./bin/../lib/concurrentlinkedhashmap-lru-1.3.jar:./bin/../lib/guava-13.0.1.jar:./bin/../lib/high-scale-lib-1.1.2.jar:./bin/../lib/jackson-core-asl-1.9.2.jar:./bin/../lib/jackson-mapper-asl-1.9.2.jar:./bin/../lib/jamm-0.2.5.jar:./bin/../lib/jbcrypt-0.3m.jar:./bin/../lib/jline-1.0.jar:./bin/../lib/json-simple-1.1.jar:./bin/../lib/libthrift-0.7.0.jar:./bin/../lib/log4j-1.2.16.jar:./bin/../lib/lz4-1.1.0.jar:./bin/../lib/metrics-core-2.0.3.jar:./bin/../lib/netty-3.5.9.Final.jar:./bin/../lib/servlet-api-2.5-20081211.jar:./bin/../lib/slf4j-api-1.7.2.jar:./bin/../lib/slf4j-log4j12-1.7.2.jar:./bin/../lib/snakeyaml-1.6.jar:./bin/../lib/snappy-java-1.0.5.jar:./bin/../lib/snaptree-0.1.jar
org.apache.cassandra.service.CassandraDaemon

B - NUTCH LAUNCH INFORMATION
7029 pts/1    Sl+   24:56      \_ /usr/lib/jvm/java-7-oracle//bin/java
-Xmx1000m
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
-Dhadoop.log.dir=/home/ubuntu/apache-nutch-2.2.1/runtime/local/logs
-Dhadoop.log.file=hadoop.log
-Djava.library.path=/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/native/Linux-amd64-64
-classpath
/home/ubuntu/apache-nutch-2.2.1/runtime/local:/home/ubuntu/apache-nutch-2.2.1/runtime/local/conf:/usr/lib/jvm/java-7-oracle//lib/tools.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/activation-1.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/aopalliance-1.0.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/apache-nutch-2.2.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/asm-3.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/avro-1.3.3.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/cassandra-thrift-1.1.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-beanutils-1.7.0.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-beanutils-core-1.8.0.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-cli-1.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-codec-1.4.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-collections-3.2.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-configuration-1.6.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-digester-1.8.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-el-1.0.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-httpclient-3.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-io-2.4.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-lang-2.6.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-logging-1.1.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-math-2.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-net-1.4.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/commons-pool-1.5.3.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/core-3.1.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/crawler-commons-0.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/cxf-api-2.5.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/cxf-common-utilities-2.5.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/cxf-rt-bindings-xml-2.5.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/cxf-rt-core-2.5.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/cxf-rt-frontend-jaxrs-2.5.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/cxf-rt-transports-common-2.5.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/cxf-rt-transports-http-2.5.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/dom4j-1.6.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/elasticsearch-0.19.4.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/ftplet-api-1.0.0.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/ftpserver-core-1.0.0.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/ftpserver-deprecated-1.0.0-M2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/geronimo-javamail_1.4_spec-1.7.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/geronimo-stax-api_1.0_spec-1.0.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/gora-cassandra-0.3.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/gora-core-0.3.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/guava-11.0.2.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/hadoop-core-1.2.0.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/hamcrest-core-1.3.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/hector-core-1.1-0.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/hsqldb-2.2.8.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/httpclient-4.1.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/httpcore-4.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/icu4j-4.0.1.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/jackson-core-asl-1.8.8.jar:/home/ubuntu/apache-nutch-2.2.1/runtime/local/lib/jackson-jax

C - HEAP SPACE EXCEPTION
2013-10-09 14:19:03,902 INFO  parse.ParserJob - Parsing http://www.tf1.fr/auto-moto/skoda/versions/yeti-1-2-tsi-105-active-2010-6177686.html
2013-10-09 14:19:07,347 ERROR connection.HConnectionManager - MARK HOST AS DOWN TRIGGERED for host localhost(127.0.0.1):9160
2013-10-09 14:19:07,348 ERROR connection.HConnectionManager - Pool state on shutdown: <ConcurrentCassandraClientPoolByHost>:{localhost(127.0.0.1):9160}; IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: 49
2013-10-09 14:19:07,348 INFO  connection.ConcurrentHClientPool - Shutdown triggered on <ConcurrentCassandraClientPoolByHost>:{localhost(127.0.0.1):9160}
2013-10-09 14:19:07,348 INFO  connection.ConcurrentHClientPool - Shutdown complete on <ConcurrentCassandraClientPoolByHost>:{localhost(127.0.0.1):9160}
2013-10-09 14:19:07,348 INFO  connection.CassandraHostRetryService - Host detected as down was added to retry queue: localhost(127.0.0.1):9160
2013-10-09 14:19:07,350 WARN  connection.HConnectionManager - Could not fullfill request on this host CassandraClient<localhost:9160-5>
2013-10-09 14:19:07,350 WARN  connection.HConnectionManager - Exception:
me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException: Read a negative frame size (-2147418110)!
      at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:33)
      at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:264)
      at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
      at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
      at me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:77)
      at org.apache.gora.cassandra.store.HectorUtils.insertSubColumn(HectorUtils.java:70)
      at org.apache.gora.cassandra.store.CassandraClient.addSubColumn(CassandraClient.java:220)
      at org.apache.gora.cassandra.store.CassandraClient.addSubColumn(CassandraClient.java:225)
      at org.apache.gora.cassandra.store.CassandraClient.addStatefulHashMap(CassandraClient.java:302)
      at org.apache.gora.cassandra.store.CassandraStore.addOrUpdateField(CassandraStore.java:384)
      at org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:228)
      at org.apache.gora.cassandra.store.CassandraStore.close(CassandraStore.java:95)
      at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:56)
      at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
      at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1793)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
      at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.thrift.transport.TTransportException: Read a negative frame size (-2147418110)!
      at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:133)
      at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
      at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
      at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
      at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
      at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
      at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
      at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:922)
      at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:908)
      at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
      at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
      at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
      at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
      ... 21 more
2013-10-09 14:19:07,351 INFO  connection.HConnectionManager - Client CassandraClient<localhost:9160-5> released to inactive or dead pool. Closing.
2013-10-09 14:19:07,352 INFO  connection.HConnectionManager - Client CassandraClient<localhost:9160-5> released to inactive or dead pool. Closing.
2013-10-09 14:19:07,352 INFO  mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while closing datastore.All host pools marked down. Retry burden pushed out to client.
2013-10-09 14:19:07,353 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-10-09 14:19:07,353 WARN  mapred.LocalJobRunner - job_local625438788_0016
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: Java heap space
      at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:140)
      at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
      at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
      at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
      at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
      at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
      at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
      at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:692)
      at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:676)
      at me.prettyprint.cassandra.service.KeyspaceServiceImpl$3.execute(KeyspaceServiceImpl.java:151)
      at me.prettyprint.cassandra.service.KeyspaceServiceImpl$3.execute(KeyspaceServiceImpl.java:145)
      at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
      at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
      at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
      at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getRangeSlices(KeyspaceServiceImpl.java:167)
      at me.prettyprint.cassandra.model.thrift.ThriftRangeSlicesQuery$1.doInKeyspace(ThriftRangeSlicesQuery.java:66)
      at me.prettyprint.cassandra.model.thrift.ThriftRangeSlicesQuery$1.doInKeyspace(ThriftRangeSlicesQuery.java:62)
      at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
      at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
      at me.prettyprint.cassandra.model.thrift.ThriftRangeSlicesQuery.execute(ThriftRangeSlicesQuery.java:61)
      at org.apache.gora.cassandra.store.CassandraClient.execute(CassandraClient.java:361)
      at org.apache.gora.cassandra.store.CassandraStore.addSubColumns(CassandraStore.java:158)
      at org.apache.gora.cassandra.store.CassandraStore.execute(CassandraStore.java:146)
      at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:71)
      at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:68)
      at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:110)
      at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
      at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
      at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
2013-10-09 14:19:07,361 INFO  connection.HConnectionManager - Added host localhost(127.0.0.1):9160 to pool