[jira] [Commented] (NUTCH-1773) Solr Indexer fails

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998333#comment-13998333 ] Lewis John McGibbney commented on NUTCH-1773: - bq. hduser@bl4ck1c3:~/nutch-2.3

[jira] [Updated] (NUTCH-926) Redirections from META tag don't get filtered

2014-05-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-926: -- Attachment: NUTCH-926-trunk.patch Patch for current trunk: * meta refresh redirects are filtered

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999417#comment-13999417 ] Lewis John McGibbney commented on NUTCH-1714: - Excellent @jnioche and @alparsl

[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread kaveh minooie (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1780: - Attachment: NUTCH-1780.patch there is really no good default value for gc_grace_seconds. we can u

[jira] [Updated] (NUTCH-1770) Nutch is failing to parse all PDFs

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1770: - Affects Version/s: (was: 2.3) > Nutch is failing to parse all PDFs >

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998563#comment-13998563 ] Julien Nioche commented on NUTCH-1714: -- Hi [~kaveh], Please open a separate issue f

[jira] [Commented] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2014-05-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998593#comment-13998593 ] Sebastian Nagel commented on NUTCH-207: --- Hi Julien, I've just observed that the log l

[jira] [Created] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1780: Summary: ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file Key: NUTCH-1780 URL: https://issues.apache.org/jira/browse/NUTCH-1780 Pr

[jira] [Commented] (NUTCH-1676) Add rudimentary SSL support to protocol-http

2014-05-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999894#comment-13999894 ] Markus Jelsma commented on NUTCH-1676: -- thanks jul for taking over! > Add rudimentar

[jira] [Resolved] (NUTCH-1772) Injector does not need merging if no pre-existing crawldb

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1772. -- Resolution: Fixed Fix Version/s: 1.9 Committed revision 1595137. thanks for the reviews

[jira] [Commented] (NUTCH-1772) Injector does not need merging if no pre-existing crawldb

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998586#comment-13998586 ] Julien Nioche commented on NUTCH-1772: -- Thanks Diaa. I will have a look at it a bit l

[jira] [Commented] (NUTCH-1768) port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0)

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998581#comment-13998581 ] Julien Nioche commented on NUTCH-1768: -- Any more testers for this one? > port NUTCH-

[jira] [Resolved] (NUTCH-1676) Add rudimentary SSL support to protocol-http

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1676. -- Resolution: Fixed trunk => Committed revision 1595193 2.x => Committed revision 1595196. Thank

[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent

2014-05-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1718: --- Attachment: NUTCH-1718-trunk.v2.patch Updated patch: * for backward compatibility: take care

[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols

2014-05-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998632#comment-13998632 ] Sebastian Nagel commented on NUTCH-1613: For cookie support there exists already N

[jira] [Commented] (NUTCH-1774) Crawling from REST API giving NullPointerException

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999415#comment-13999415 ] Lewis John McGibbney commented on NUTCH-1774: - I have no issue with committing

[jira] [Commented] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998575#comment-13998575 ] Julien Nioche commented on NUTCH-1709: -- [NUTCH-1714] has been committed without Lewis

Re: Clean up in case of error is not handled

2014-05-16 Thread Diaa Abdallah
Thanks! Created a JIRA issue with the patch: https://issues.apache.org/jira/browse/NUTCH-1783
On Tue, May 13, 2014 at 12:19 AM, Markus Jelsma wrote:
> Hi Diaa,
>
> Yes, you can open an issue for these fixes and attach patches if you can.
>
> Cheers,
> Markus
>
> Diaa Abdallah wrote:
> >

[jira] [Created] (NUTCH-1783) Cleanup temp folders in case of failures

2014-05-16 Thread Diaa (JIRA)
Diaa created NUTCH-1783: --- Summary: Cleanup temp folders in case of failures Key: NUTCH-1783 URL: https://issues.apache.org/jira/browse/NUTCH-1783 Project: Nutch Issue Type: Bug Affects Versions: 1.

[jira] [Updated] (NUTCH-1782) NodeWalker to return current node

2014-05-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1782: - Attachment: NUTCH-1782-trunk.patch Patch! > NodeWalker to return current node >

[jira] [Updated] (NUTCH-1776) Log incorrect plugin.folder file path

2014-05-16 Thread Diaa (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diaa updated NUTCH-1776: Attachment: (was: PluginManifestParser.java.patch) > Log incorrect plugin.folder file path > --

[jira] [Created] (NUTCH-1782) NodeWalker to return current node

2014-05-16 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1782: Summary: NodeWalker to return current node Key: NUTCH-1782 URL: https://issues.apache.org/jira/browse/NUTCH-1782 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999102#comment-13999102 ] Julien Nioche commented on NUTCH-207: - Hi Sebastian, Not really. I will revert it back

[jira] [Updated] (NUTCH-1718) redefine http.robots.agent as "additional agent names"

2014-05-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1718: --- Summary: redefine http.robots.agent as "additional agent names" (was: update description of

[jira] [Resolved] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1714. -- Resolution: Fixed Committed revision 1594812. Thanks to Alparslan and everyone involved! > Nu

[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread kaveh minooie (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1780: - Attachment: (was: NUTCH-1780.patch) > ttl and gc_grace_seconds attributes are missing from >

[jira] [Commented] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999673#comment-13999673 ] Julien Nioche commented on NUTCH-207: - Fix log level in revision 1595135. > Bandwidth

[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2014-05-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998578#comment-13998578 ] Sebastian Nagel commented on NUTCH-1605: Tested with a few dozen documents

[jira] [Updated] (NUTCH-1770) Nutch is failing to parse all PDFs

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1770: - Fix Version/s: (was: 2.3) > Nutch is failing to parse all PDFs >

[jira] [Commented] (NUTCH-1776) Log incorrect plugin.folder file path

2014-05-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998527#comment-13998527 ] Sebastian Nagel commented on NUTCH-1776: +1 (looks ok) Is there a reason why a new

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993574#comment-13993574 ] Julien Nioche commented on NUTCH-1714: -- Ralf - your question is not directly related

[jira] [Commented] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999434#comment-13999434 ] Lewis John McGibbney commented on NUTCH-1709: - Yep... I'll update this to refl

[jira] [Updated] (NUTCH-1783) Cleanup temp folders in case of failures

2014-05-16 Thread Diaa (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diaa updated NUTCH-1783: Attachment: cleanup temp folders.patch > Cleanup temp folders in case of failures > ---

[jira] [Commented] (NUTCH-1622) Create Outlinks with metadata

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994218#comment-13994218 ] Julien Nioche commented on NUTCH-1622: -- Lewis - this has already been committed in tr

[jira] [Updated] (NUTCH-1774) Crawling from REST API giving NullPointerException

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1774: Fix Version/s: 2.4 > Crawling from REST API giving NullPointerException > -

[jira] [Commented] (NUTCH-1772) Injector does not need merging if no pre-existing crawldb

2014-05-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998598#comment-13998598 ] Sebastian Nagel commented on NUTCH-1772: ?? What about committing this one as a te

[jira] [Created] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-05-16 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1778: Summary: Generator not logging number of URLs in batch correctly Key: NUTCH-1778 URL: https://issues.apache.org/jira/browse/NUTCH-1778 Project: Nutch Issue T

[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread kaveh minooie (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1780: - Attachment: NUTCH-1780.patch > ttl and gc_grace_seconds attributes are missing from > gora-cassa

Inject auto generated urls

2014-05-16 Thread Diaa Abdallah
Hi,
In some cases when you crawl a webpage you already know many page URLs that have a similar structure. For example, on IMDb, entertainment artists have the following link structure:
http://www.imdb.com/name/nm1/
http://www.imdb.com/name/nm2/
http://www.imdb.com/name/nm6499112/
How about allowing

[jira] [Updated] (NUTCH-1776) Log incorrect plugin.folder file path

2014-05-16 Thread Diaa (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diaa updated NUTCH-1776: Attachment: Logging file path error.patch @ [~wastl-nagel] changed level to warn and removed new logger. Do you thi

[jira] [Resolved] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2014-05-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1674. -- Resolution: Fixed Committed revision 1594813. Thanks everyone! > Use batchId filter to enable

[jira] [Commented] (NUTCH-1779) Apply formatting to the code

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000571#comment-14000571 ] Lewis John McGibbney commented on NUTCH-1779: - final patch to be committed to

[jira] [Commented] (NUTCH-1774) Crawling from REST API giving NullPointerException

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000569#comment-14000569 ] Lewis John McGibbney commented on NUTCH-1774: - [~sreemanth] Crawler class no l

[jira] [Closed] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-1780. --- > ttl and gc_grace_seconds attributes are missing from > gora-cassandra-mapping.xml file

[jira] [Comment Edited] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994098#comment-13994098 ] Lewis John McGibbney edited comment on NUTCH-1714 at 5/10/14 1:19 AM: --

[jira] [Created] (NUTCH-1777) Fetcher not getting all the entries in input

2014-05-16 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1777: Summary: Fetcher not getting all the entries in input Key: NUTCH-1777 URL: https://issues.apache.org/jira/browse/NUTCH-1777 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1780: Description: After upgrading to Gora 0.4 (NUTCH-1714) we need extra properties in

[jira] [Created] (NUTCH-1781) Update gora-*-mapping.xml and gora.proeprties to reflect Gora 0.4

2014-05-16 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1781: --- Summary: Update gora-*-mapping.xml and gora.proeprties to reflect Gora 0.4 Key: NUTCH-1781 URL: https://issues.apache.org/jira/browse/NUTCH-1781 Project

[jira] [Updated] (NUTCH-1779) Apply formatting to the code

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1779: Fix Version/s: 2.3 > Apply formatting to the code > >

[jira] [Created] (NUTCH-1779) Apply formatting to the code

2014-05-16 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1779: Summary: Apply formatting to the code Key: NUTCH-1779 URL: https://issues.apache.org/jira/browse/NUTCH-1779 Project: Nutch Issue Type: Task Affects Versi

[jira] [Updated] (NUTCH-1774) Crawling from REST API giving NullPointerException

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1774: Fix Version/s: (was: 2.4) 2.3 > Crawling from REST API givin

[jira] [Resolved] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1780. - Resolution: Fixed Committed @revision 1595398 in 2.X HEAD Thank you [~kaveh] very

[jira] [Updated] (NUTCH-1776) Log incorrect plugin.folder file path

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1776: Fix Version/s: 2.3 > Log incorrect plugin.folder file path > --

[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1780: Fix Version/s: 2.3 > ttl and gc_grace_seconds attributes are missing from > gora-c