[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-03-14 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193561#comment-15193561
 ] 

Adnane B. commented on NUTCH-:
--

Hi Lewis,

Is there any news on this, please?


> re-fetch deletes all  metadata except _csh_ and _rs_
> 
>
>                 Key: NUTCH-
>                 URL: https://issues.apache.org/jira/browse/NUTCH-
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 2.3.1
>         Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and hbase-0.98.8-hadoop2
>            Reporter: Adnane B.
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3.2
>
> Attachments: TestReFetch.java, index.html
>
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000   # batch ID changes for all existing pages
> bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce it easily, add the following to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code}
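
The reproduction steps quoted above can be collected into a small driver. The following is a minimal sketch in Java (the language of the attached TestReFetch.java), not code from the issue: the class name ReFetchRepro and both of its methods are hypothetical, and the path to the bin/nutch script is passed in rather than assumed.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: builds the command lines from the report.
class ReFetchRepro {

    // Commands for one crawl round; only the first round injects the seed URLs.
    static List<List<String>> round(String nutchScript, boolean firstRound) {
        List<List<String>> commands = new ArrayList<>();
        if (firstRound) {
            commands.add(Arrays.asList(nutchScript, "inject", "urls/"));
        }
        commands.add(Arrays.asList(nutchScript, "generate", "-topN", "1000"));
        commands.add(Arrays.asList(nutchScript, "fetch", "-all"));
        commands.add(Arrays.asList(nutchScript, "parse", "-force", "-all"));
        commands.add(Arrays.asList(nutchScript, "updatedb", "-all"));
        return commands;
    }

    // Runs each command in workDir and stops at the first non-zero exit code.
    static void run(List<List<String>> commands, File workDir)
            throws IOException, InterruptedException {
        for (List<String> command : commands) {
            Process process = new ProcessBuilder(command)
                    .directory(workDir)
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException("exit code " + exitCode + " from " + command);
            }
        }
    }
}
```

Running round(script, true), waiting longer than the 60-second db.fetch.interval.default, and then running round(script, false) would correspond to the two rounds in the report; the wait is an assumption based on the interval setting. Passing an absolute path for the script avoids depending on how a relative path is resolved against workDir.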



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-03-03 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178745#comment-15178745
 ] 

Adnane B. edited comment on NUTCH- at 3/3/16 10:43 PM:
---

Hi [~lewismc]

The problem seems to happen only with a persistent Gora store such as MongoStore; with MemStore it works fine.

Please install MongoDB, copy index.html into the build/test/data/refetch-test-site 
directory, then run the test. Both files are attached.

Please let me know whether this test is enough for you to debug the issue.
{code}
// Comment out this line to test with org.apache.gora.memory.store.MemStore
conf.set("storage.data.store.class", "org.apache.gora.mongodb.store.MongoStore");
{code}
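
The attached TestReFetch.java is not reproduced in this thread. As a rough illustration of the kind of check such a test could make, the sketch below compares a page's metadata keys before and after the re-fetch. It is an assumption, not the attached test: the class MetadataDiff is hypothetical, and how the two maps are read from the data store is deliberately left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch or of the attached test.
class MetadataDiff {

    // Returns the keys that were present before the re-fetch but are missing afterwards.
    static Set<String> lostKeys(Map<String, ?> before, Map<String, ?> after) {
        Set<String> lost = new TreeSet<>(before.keySet());
        lost.removeAll(after.keySet());
        return lost;
    }
}
```

With the behaviour described in this issue, the returned set would contain every metadata key other than _csh_ and _rs_; an empty set would mean nothing was lost.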



was (Author: abenjell):
Hi [~lewismc]

The problem seems to happen only with a persistent Gora store such as MongoStore; with MemStore it works fine.

Please install MongoDB, copy index.html into the build/test/data/refetch-test-site 
directory, then run the test. Both files are attached.

Please let me know whether this test is enough for you to debug the issue.






[jira] [Comment Edited] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-03-03 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178745#comment-15178745
 ] 

Adnane B. edited comment on NUTCH- at 3/3/16 10:42 PM:
---

Hi [~lewismc]

The problem seems to happen only with a persistent Gora store such as MongoStore; with MemStore it works fine.

Please install MongoDB, copy index.html into the build/test/data/refetch-test-site 
directory, then run the test. Both files are attached.

Please let me know whether this test is enough for you to debug the issue.


was (Author: abenjell):
Hi [~lewismc]

The problem seems to happen only with a persistent Gora store such as MongoStore; with MemStore it works fine.

Please install MongoDB, copy index.html into the build/test/data/refetch-test-site 
directory, then run the test.

Please let me know whether this test is enough for you to debug the issue.




[jira] [Comment Edited] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-03-03 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178745#comment-15178745
 ] 

Adnane B. edited comment on NUTCH- at 3/3/16 10:41 PM:
---

Hi [~lewismc]

The problem seems to happen only with a persistent Gora store such as MongoStore; with MemStore it works fine.

Please install MongoDB, copy index.html into the build/test/data/refetch-test-site 
directory, then run the test.

Please let me know whether this test is enough for you to debug the issue.


was (Author: abenjell):
Hi [~lewismc]

The problem seems to happen only with a persistent Gora store such as MongoStore; with MemStore it works fine.

Please install MongoDB, copy index.html into the build/test/data/refetch-test-site 
directory, then run the test.

Please let me know whether this test is enough for you to debug the problem.




[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-03-03 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Attachment: index.html
TestReFetch.java

Hi [~lewismc]

The problem seems to happen only with a persistent Gora store such as MongoStore; with MemStore it works fine.

Please install MongoDB, copy index.html into the build/test/data/refetch-test-site 
directory, then run the test.

Please let me know whether this test is enough for you to debug the problem.




[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-03-02 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177088#comment-15177088
 ] 

Adnane B. commented on NUTCH-:
--

I wrote a testReFetch test, but it ignores the MongoDB configuration in 
nutch-site.xml. Any ideas, please?

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>
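
One way to narrow this down is to check what a given nutch-site.xml actually declares for storage.data.store.class, independently of Nutch. The sketch below is an illustration only: SitePropertyReader is a hypothetical class that uses the JDK's DOM parser and assumes the usual Hadoop-style layout of property elements with name and value children. It reports what one file says; it cannot tell which nutch-site.xml the test picks up from its classpath.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch or Hadoop.
class SitePropertyReader {

    // Returns the <value> of the first <property> whose <name> matches, or null if none does.
    static String read(InputStream xml, String propertyName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse configuration XML", e);
        }
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            if (propertyName.equals(childText(property, "name"))) {
                return childText(property, "value");
            }
        }
        return null;
    }

    // Text of the first descendant element with the given tag, or null if there is none.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

Calling read on a stream opened from the nutch-site.xml that the test is expected to use (its location is an assumption that depends on the build layout) shows whether the MongoStore setting is present in that file at all.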




[jira] [Comment Edited] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-03-02 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177088#comment-15177088
 ] 

Adnane B. edited comment on NUTCH- at 3/3/16 3:23 AM:
--

I wrote a testReFetch test, but it ignores the MongoDB configuration in 
nutch-site.xml. Any ideas, please?
{code}
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>
{code}


was (Author: abenjell):
I wrote a testReFetch test, but it ignores the MongoDB configuration in 
nutch-site.xml. Any ideas, please?

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>




[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-03-02 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175528#comment-15175528
 ] 

Adnane B. commented on NUTCH-:
--

Hello,
Can you please tell me whether there is a workaround? My deadline is very tight and 
my project depends on Nutch.
Best regards,



[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-29 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171889#comment-15171889
 ] 

Adnane B. commented on NUTCH-:
--

Thank you very much!



[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-27 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170580#comment-15170580
 ] 

Adnane B. commented on NUTCH-:
--

Please let me know whether there is any other persistent storage 
configuration that does not have this issue.



[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-26 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170226#comment-15170226
 ] 

Adnane B. commented on NUTCH-:
--

Hello,
Were you able to reproduce this issue?
Please let me know if I can help to reproduce it.




[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000   # batch ID changes for all existing pages
bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.

It happens only if the page has not changed.

To reproduce it easily, add the following to nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>



  was:
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000   # batch ID changes for all existing pages
bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I reproduced it with mongodb and hbase-0.98.8-hadoop2.

It happens only if the page has not changed.

To reproduce it easily, add the following to nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>






[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Environment: 
Centos 6, mongodb 2.6 and mongodb 3.0 and hbase-0.98.8-hadoop2


  was:
Centos 6, mongodb 2.6 and mongodb 3.0 





[jira] [Issue Comment Deleted] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Comment: was deleted

(was: To reproduce easily, please add to nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>
)



[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000   # batch ID changes for all existing pages
bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I reproduced it with mongodb and hbase-0.98.8-hadoop2.

It happens only if the page has not changed.

To reproduce it easily, add the following to nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>



  was:
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000   # batch ID changes for all existing pages
bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I reproduced it with mongodb and hbase-0.98.8-hadoop2.

It happens only if the page has not changed.

To reproduce it easily, add the following to nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>

Reply




[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000   # batch ID changes for all existing pages
bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I reproduced it with mongodb and hbase-0.98.8-hadoop2.

It happens only if the page has not changed.

To reproduce it easily, add the following to nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>

Reply

  was:
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000   # batch ID changes for all existing pages
bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I reproduced it with mongodb and hbase-0.98.8-hadoop2.

It happens only if the page has not changed.





[jira] [Commented] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15160086#comment-15160086
 ] 

Adnane B. commented on NUTCH-:
--

To reproduce easily, please add to nutch-site.xml : 


<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>
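
The steps from this report, combined with the 60-second interval above, could be sketched as a small shell script. This is only an illustration: the function names run_round and reproduce and the NUTCH and WAIT_SECONDS variables are assumptions made for this example and are not part of Nutch; the bin/nutch commands themselves are the ones listed in the issue description.

```shell
#!/usr/bin/env bash
# Sketch of the two crawl rounds described in the report.
# NUTCH, WAIT_SECONDS, run_round and reproduce are hypothetical names
# introduced for this example only.
NUTCH="${NUTCH:-bin/nutch}"

run_round() {
  # One generate/fetch/parse/updatedb round, as listed in the report.
  $NUTCH generate -topN 1000 &&
  $NUTCH fetch -all &&
  $NUTCH parse -force -all &&
  $NUTCH updatedb -all
}

reproduce() {
  $NUTCH inject urls/ || return 1
  run_round || return 1            # first crawl
  # Wait past db.fetch.interval.default (60 seconds in the config above)
  # so that the pages become due for a re-fetch.
  sleep "${WAIT_SECONDS:-61}"
  run_round                        # re-fetch: the round where metadata is reported lost
}
```

Calling reproduce from the Nutch runtime directory should run both rounds; checking the stored metadata of an unchanged page before and after the second fetch is left to the tools of whichever store (mongodb or hbase) is in use.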


> fetch deletes all  metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 
>Reporter: Adnane B.
>
> This problem happens the second time I crawl a page
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> Second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> I reproduce it with mongodb and hbase-0.98.8-hadoop2
> It happens only if the page has not changed





[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I reproduce it with mongodb and hbase-0.98.8-hadoop2

It happens only if the page has not changed


  was:
This problem happens the second time I crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I reproduce it with mongodb and hbase

It happens only if the page has not changed





[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I reproduce it with mongodb and hbase

It happens only if the page has not changed


  was:
This problem happens the second time I crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb






[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb



  was:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb






[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb



  was:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> bachid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb






[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000 --> bachid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb



  was:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb






[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

Second time:

bin/nutch generate -topN 1000
bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb



  was:
This problem happens at the second update on a crawled page **that has not changed** with the -all option
bin/nutch updatedb -all
Not tested with other options
I'm using mongodb






[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Summary: fetch deletes all  metadata except _csh_ and _rs_  (was: updatedb 
deletes all  metadata except _csh_ and _rs_)

> fetch deletes all  metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 
>Reporter: Adnane B.
>
> This problem happens at the second update on a crawled page **that has not changed** with the -all option
> bin/nutch updatedb -all
> Not tested with other options
> I'm using mongodb





[jira] [Updated] (NUTCH-2222) updatedb deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens at the second update on a crawled page **that has not changed** with the -all option
bin/nutch updatedb -all
Not tested with other options
I'm using mongodb



  was:
This problem happens at the second update on a crawled page with the -all option
bin/nutch updatedb -all
Not tested with other options
I'm using mongodb




[jira] [Created] (NUTCH-2222) updatedb deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)
Adnane B. created NUTCH-:


 Summary: updatedb deletes all  metadata except _csh_ and _rs_
 Key: NUTCH-
 URL: https://issues.apache.org/jira/browse/NUTCH-
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 2.3.1
 Environment: Centos 6, mongodb 2.6 and mongodb 3.0 

Reporter: Adnane B.


This problem happens at the second update on a crawled page with the -all option
bin/nutch updatedb -all
Not tested with other options
I'm using mongodb


