Re: [Nutch-dev] Plugins initialized all the time!
Hi,

On 5/29/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
>> Which job causes the problem? Perhaps we can find out what keeps
>> creating a conf object over and over.
>>
>> Also, I have tried what you suggested (better caching for the plugin
>> repository) and it really seems to make a difference. Can you try
>> this patch(*) to see if it solves your problem?
>>
>> (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch
>
> Some comments about your patch. The approach seems nice: you only check
> the parameters that affect plugin loading. But keep in mind that the
> plugins themselves will configure themselves with many other
> parameters, so to keep things safe there should be a PluginRepository
> for each set of parameters (including all of them).
>
> Besides, remember that CACHE is a WeakHashMap and you are creating
> ad-hoc PluginProperty objects as keys. Something doesn't look right:
> the lifespan of those objects will be much shorter than you require.
> Perhaps you should be using SoftReferences instead, or a simple LRU
> cache (LinkedHashMap provides that easily).

My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there :)

I don't really worry about the WeakHashMap vs. LinkedHashMap stuff. But
your approach is simple and should be faster, so I guess it's OK.

You are right about per-plugin parameters, but I think it will be very
difficult to keep the PluginProperty class in sync with plugin
parameters. I mean, if a plugin defines a new parameter, we have to
remember to update PluginProperty. Perhaps we can force plugins to
declare the configuration options they will use in, say, their
plugin.xml file, but that will be very error-prone too. I don't want to
compare entire configuration objects, because changing irrelevant
options like fetcher.store.content shouldn't force loading plugins
again, though it seems it may be inevitable.

> Anyway, I'll try to build my own Nutch to test your patch. Thanks!
-- Doğacan Güney - This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ___ Nutch-developers mailing list Nutch-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-developers
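For readers following the LinkedHashMap suggestion in this message, here is a minimal sketch of an LRU cache, not code from the patch under discussion. The class name LruCache and the "confA"/"repositoryA" strings are made up for illustration; in the real case the values would presumably be PluginRepository instances.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache sketch built on LinkedHashMap's access-order mode. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every put; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("confA", "repositoryA");
        cache.put("confB", "repositoryB");
        cache.get("confA");                // touch confA, so confB is now the eldest
        cache.put("confC", "repositoryC"); // exceeds the limit and evicts confB
        System.out.println(cache.keySet()); // prints [confA, confC]
    }
}
```

Unlike a WeakHashMap with throwaway keys, entries here stay alive until they are pushed out by newer ones. LinkedHashMap is not thread-safe, so a shared cache would need external synchronization, for example Collections.synchronizedMap.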
Re: [Nutch-dev] Plugins initialized all the time!
Doğacan Güney wrote:
> My patch is just a draft to see if we can create a better caching
> mechanism. There are definitely some rough edges there :)

One important piece of information: in future versions of Hadoop the
method Configuration.setObject() is deprecated and will later be
removed, so we have to grow our own caching mechanism anyway: either
use a singleton cache, or change nearly all APIs to pass around a
user/job/task context. So we will face this problem pretty soon, with
the next upgrade of Hadoop.

> You are right about per-plugin parameters, but I think it will be very
> difficult to keep the PluginProperty class in sync with plugin
> parameters. I mean, if a plugin defines a new parameter, we have to
> remember to update PluginProperty. Perhaps we can force plugins to
> declare the configuration options they will use in, say, their
> plugin.xml file, but that will be very error-prone too. I don't want
> to compare entire configuration objects, because changing irrelevant
> options like fetcher.store.content shouldn't force loading plugins
> again, though it seems it may be inevitable.

Let me see if I understand this ... In my opinion this is a non-issue.

Child tasks are started in separate JVMs, so the only context
information that they have is what they can read from job.xml (which is
a superset of all properties from config files + job-specific data +
task-specific data). This context is currently instantiated as a
Configuration object, and we (ab)use it also as a local per-JVM cache
for plugin instances and other objects.

Once we instantiate the plugins, they exist unchanged throughout the
lifecycle of the JVM (== lifecycle of a single task), so we don't have
to worry about having different sets of plugins with different
parameters for different jobs (or even tasks).

In other words, it seems to me that there is no situation in which we
have to reload plugins within the same JVM, but with different
parameters.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
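The "singleton cache" option mentioned in this message could look roughly like the sketch below. It is only an illustration under stated assumptions: JvmObjectCache and the "plugin-repository" key are hypothetical names, not existing Nutch or Hadoop classes, and a plain Object stands in for whatever would actually be cached.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical per-JVM object cache; not an existing Nutch or Hadoop class. */
final class JvmObjectCache {
    private static final JvmObjectCache INSTANCE = new JvmObjectCache();

    private final ConcurrentMap<String, Object> objects = new ConcurrentHashMap<>();

    private JvmObjectCache() {
    }

    static JvmObjectCache get() {
        return INSTANCE;
    }

    /** Returns the object cached under key, creating it at most once per JVM. */
    @SuppressWarnings("unchecked")
    <T> T getOrCreate(String key, Supplier<T> factory) {
        return (T) objects.computeIfAbsent(key, k -> factory.get());
    }

    public static void main(String[] args) {
        // A plain Object stands in for an expensive-to-build plugin repository.
        Object first = get().getOrCreate("plugin-repository", Object::new);
        Object second = get().getOrCreate("plugin-repository", Object::new);
        System.out.println(first == second); // prints true
    }
}
```

Because each child task runs in its own JVM, a static cache like this lives exactly as long as one task, which matches the lifecycle argument made above.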
Re: [Nutch-dev] running nutch without http proxy
This seems to be the default: out of the box, Nutch is configured to
run without a proxy. If anything, you should expect some problems only
if you do want to use a proxy.

Cheers,
Marcin

On 5/29/07, prem kumar [EMAIL PROTECTED] wrote:
> Is it possible to run nutch without using a http proxy to search the
> internet? If so, what are the configurations needed? I don't want to
> use a socks proxy either. All I have is a direct connection to the
> internet.
>
> Thanks
> Prem
> --
> http://premsden.blogspot.com/
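To illustrate what "no proxy by default" means in practice, here is a small sketch using only java.net. It assumes, without having checked the Nutch source, that the proxy host comes from a setting such as http.proxy.host that is empty by default; ProxyChoice and fromSettings are hypothetical names and proxy.example.com is a placeholder.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

/** Hypothetical helper: picks a direct connection when no proxy host is set. */
class ProxyChoice {
    static Proxy fromSettings(String host, int port) {
        if (host == null || host.trim().isEmpty()) {
            // Empty setting: connect directly, which is the out-of-the-box case.
            return Proxy.NO_PROXY;
        }
        // createUnresolved avoids a DNS lookup while the Proxy object is built.
        return new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(host.trim(), port));
    }

    public static void main(String[] args) {
        System.out.println(fromSettings("", 8080).type());                  // prints DIRECT
        System.out.println(fromSettings("proxy.example.com", 3128).type()); // prints HTTP
    }
}
```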
Re: [Nutch-dev] Plugins initialized all the time!
On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Doğacan Güney wrote:
>> My patch is just a draft to see if we can create a better caching
>> mechanism. There are definitely some rough edges there :)
>
> One important piece of information: in future versions of Hadoop the
> method Configuration.setObject() is deprecated and will later be
> removed, so we have to grow our own caching mechanism anyway: either
> use a singleton cache, or change nearly all APIs to pass around a
> user/job/task context. So we will face this problem pretty soon, with
> the next upgrade of Hadoop.

Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own cache mechanism.

>> You are right about per-plugin parameters, but I think it will be
>> very difficult to keep the PluginProperty class in sync with plugin
>> parameters. I mean, if a plugin defines a new parameter, we have to
>> remember to update PluginProperty. Perhaps we can force plugins to
>> declare the configuration options they will use in, say, their
>> plugin.xml file, but that will be very error-prone too. I don't want
>> to compare entire configuration objects, because changing irrelevant
>> options like fetcher.store.content shouldn't force loading plugins
>> again, though it seems it may be inevitable.
>
> Let me see if I understand this ... In my opinion this is a non-issue.
>
> Child tasks are started in separate JVMs, so the only context
> information that they have is what they can read from job.xml (which
> is a superset of all properties from config files + job-specific data
> + task-specific data). This context is currently instantiated as a
> Configuration object, and we (ab)use it also as a local per-JVM cache
> for plugin instances and other objects.
>
> Once we instantiate the plugins, they exist unchanged throughout the
> lifecycle of the JVM (== lifecycle of a single task), so we don't have
> to worry about having different sets of plugins with different
> parameters for different jobs (or even tasks).
>
> In other words, it seems to me that there is no situation in which we
> have to reload plugins within the same JVM, but with different
> parameters.

The problem is that someone might get a little too smart. For example,
someone may write a new job that has two IndexingFilters, each created
from a completely different configuration object, and then filter some
documents with the first filter and others with the second. I agree
that this is a bit of a reach, but it is possible.

--
Doğacan Güney
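The trade-off debated in this thread can be made concrete with a small sketch. It builds a cache key from only the properties assumed to affect plugin loading, so unrelated options do not force a reload. Everything here is illustrative: PluginCacheKey is a hypothetical class, a plain Map stands in for the Configuration object, the list of relevant property names is an assumption, and "myfilter.option" is a made-up plugin-private setting.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical cache key derived from a subset of configuration properties. */
class PluginCacheKey {
    // Assumption: only these properties influence which plugins get loaded.
    static final List<String> RELEVANT =
            Arrays.asList("plugin.folders", "plugin.includes", "plugin.excludes");

    static String of(Map<String, String> conf) {
        return RELEVANT.stream()
                .map(name -> name + "=" + conf.getOrDefault(name, ""))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("plugin.includes", "protocol-http|parse-html");
        a.put("fetcher.store.content", "true");

        Map<String, String> b = new HashMap<>(a);
        b.put("fetcher.store.content", "false"); // irrelevant to plugin loading

        Map<String, String> c = new HashMap<>(a);
        c.put("plugin.includes", "protocol-http|parse-html|index-basic");

        Map<String, String> d = new HashMap<>(a);
        d.put("myfilter.option", "42"); // made-up plugin-private parameter

        System.out.println(of(a).equals(of(b))); // prints true: repository can be shared
        System.out.println(of(a).equals(of(c))); // prints false: different plugin set
        System.out.println(of(a).equals(of(d))); // prints true: the risky case
    }
}
```

The last line shows the weakness raised in the thread: two configurations that differ only in a plugin-private parameter get the same key, so a plugin configured from the first would be silently reused for the second.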
[Nutch-dev] Committer
Hi Folks,

I'd just like to throw out my +1 for Doğacan Güney's committer status.
I've been impressed by several of his contributions, and the guy just
keeps them coming and coming. I'm not a member of the Lucene PMC, so I
don't have official voting rights; however, I would like to express my
support for his elevation to committer status.

Cheers,
Chris

__________________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Key Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
__________________________________________________
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
__________________________________________________
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
[Nutch-dev] [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500133 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Hi Guys,

Okay, here is the way that I currently see this issue, and what to do
with it. There are three options:

1. keep parse-rss and parse-feed (worst, but still doable)
2. gut parse-rss with new code from parse-feed (probably the best choice)
3. blow away parse-rss and create a new plugin in sources called
   parse-feed (also a good choice)

So, the plan I am going to follow is:

if (parse-feed contains a superset of the functionality of parse-rss) {
    choose option 2
} else {
    choose option 3 if and only if parse-feed is equivalent to parse-rss
    choose option 1 otherwise
}

I've been lagging on this. I'll make some progress on getting a patch
ready this week.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>         Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current
> library (feedparser) has the following issues:
> - OutOfMemory when parsing 100k feeds, since it has to convert the
>   feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
>
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
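As a rough idea of what the "custom implementation based on Stax" alternative listed in the issue might involve, the sketch below pulls item titles out of a feed with the JDK's streaming parser, so no in-memory tree (jdom or otherwise) is built. It is not taken from parse-rss or parse-feed; StaxFeedTitles and itemTitles are hypothetical names, and the handling is deliberately naive (namespaces are ignored, and a title with nested markup would make getElementText throw).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical streaming reader for RSS item / Atom entry titles. */
class StaxFeedTitles {
    static List<String> itemTitles(Reader feed) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted servers, so refuse DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(feed);
        List<String> titles = new ArrayList<>();
        boolean inItem = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item") || name.equals("entry")) {
                        inItem = true;
                    } else if (inItem && name.equals("title")) {
                        // Reads the text and leaves the cursor on </title>.
                        titles.add(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item") || name.equals("entry")) {
                        inItem = false;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String rss = "<rss><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(new StringReader(rss))); // prints [First, Second]
    }
}
```

Only one element is held in memory at a time, which is the property that matters for the OutOfMemory complaint about large feeds.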
[Nutch-dev] OutOfMemoryError - Why should the while(1) loop stop?
Time and again I get this error, and as a result the segment remains
incomplete. This wastes one iteration of the for() loop in which I am
doing generate, fetch and update.

Can someone please tell me what measures I can take to avoid this
error? And isn't it possible to make some code changes so that the
whole fetch doesn't have to stop suddenly when this error occurs? Can't
we do something in the code so that the fetch still continues, as in
the case of SocketException, where the fetch while(1) loop continues?

If that is not possible, please tell me how I can prevent this error
from happening.

- ERROR - fetch of http://telephony/register.asp failed with: java.lang.OutOfMemoryError: Java heap space
java.lang.NullPointerException
    at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
    at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
    ..
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
    at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
    ...
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
Re: [Nutch-dev] OutOfMemoryError - Why should the while(1) loop stop?
You can change the -Xms and -Xmx settings in the mapred.child.java.opts
variable in your hadoop-site.xml file to allow more memory for your
tasks.

Are you trying to parse extremely big pages or files such as PDFs? If
you are, you can also set maximum size limits for downloaded content
using the file.content.limit and ftp.content.limit options in your
nutch-site.xml file.

Dennis Kubes

Manoharam Reddy wrote:
> Time and again I get this error, and as a result the segment remains
> incomplete. This wastes one iteration of the for() loop in which I am
> doing generate, fetch and update.
>
> Can someone please tell me what measures I can take to avoid this
> error? And isn't it possible to make some code changes so that the
> whole fetch doesn't have to stop suddenly when this error occurs?
> Can't we do something in the code so that the fetch still continues,
> as in the case of SocketException, where the fetch while(1) loop
> continues?
>
> If that is not possible, please tell me how I can prevent this error
> from happening.
>
> - ERROR - fetch of http://telephony/register.asp failed with: java.lang.OutOfMemoryError: Java heap space
> java.lang.NullPointerException
>     at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
>     at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
>     ..
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
> fetcher caught:java.lang.NullPointerException
> java.lang.NullPointerException
>     at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
>     at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
>     ...
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
> fetcher caught:java.lang.NullPointerException
> Fetcher: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>     at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
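To show why a content limit helps against heap exhaustion, here is a minimal sketch of a bounded read: it stops buffering once a byte limit is reached instead of holding an arbitrarily large document in memory. This is not the Nutch protocol code; BoundedRead and readUpTo are hypothetical names, and treating a negative limit as "no limit" is an assumption about how such options usually behave. Since the failing URL above is an http:// one, the HTTP counterpart of the limits Dennis names (likely http.content.limit, worth checking in nutch-default.xml) may be the setting that matters here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper that buffers at most a fixed number of bytes. */
class BoundedRead {
    /** Reads at most limit bytes from in; a negative limit means no limit. */
    static byte[] readUpTo(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (limit < 0 || out.size() < limit) {
            // Never ask for more than the remaining allowance.
            int want = limit < 0
                    ? buffer.length
                    : Math.min(buffer.length, limit - out.size());
            int n = in.read(buffer, 0, want);
            if (n == -1) {
                break; // end of stream before the limit was reached
            }
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = new byte[10_000]; // stands in for a large download
        byte[] kept = readUpTo(new ByteArrayInputStream(page), 1024);
        System.out.println(kept.length); // prints 1024
    }
}
```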
[Nutch-dev] [jira] Resolved: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki resolved NUTCH-61.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

Applied with some modifications in rev. 542903.

> Adaptive re-fetch interval. Detecting umodified content
> -------------------------------------------------------
>
>                 Key: NUTCH-61
>                 URL: https://issues.apache.org/jira/browse/NUTCH-61
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>             Fix For: 1.0.0
>         Attachments: 20050606.diff, 20051230.txt, 20060227.txt, nutch-61-417287.patch, nutch-61-492176.patch
>
> Currently Nutch doesn't automatically adjust its re-fetch period, no
> matter whether individual pages change seldom or frequently. The goal
> of these changes is to extend the current codebase to support various
> possible adjustments to re-fetch times and intervals, and specifically
> a re-fetch schedule which tries to adapt the period between
> consecutive fetches to the period of content changes.
>
> Also, these patches implement checking whether the content has changed
> since the last fetch; protocol plugins are also changed to make use of
> this information, so that if content is unmodified it doesn't have to
> be fetched and processed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
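The idea of adapting the re-fetch period to how often a page changes can be sketched as follows. This is not the code applied in rev. 542903; AdaptiveInterval, the two rate constants and the min/max bounds are illustrative assumptions. The interval shrinks when the page was found modified and grows when it was not, clamped to a range.

```java
/** Hypothetical adaptive re-fetch interval; rates and bounds are illustrative. */
class AdaptiveInterval {
    static final double INC_RATE = 0.4; // grow by 40% when content is unchanged
    static final double DEC_RATE = 0.2; // shrink by 20% when content has changed

    /** Returns the next fetch interval in seconds, clamped to [min, max]. */
    static long next(long currentSeconds, boolean changed, long min, long max) {
        double adjusted = changed
                ? currentSeconds * (1.0 - DEC_RATE)
                : currentSeconds * (1.0 + INC_RATE);
        return Math.max(min, Math.min(max, Math.round(adjusted)));
    }

    public static void main(String[] args) {
        long day = 86_400;
        long min = 3_600;      // one hour
        long max = 30 * day;   // thirty days
        System.out.println(next(day, false, min, max)); // prints 120960
        System.out.println(next(day, true, min, max));  // prints 69120
        System.out.println(next(min, true, min, max));  // prints 3600 (clamped)
    }
}
```

Deciding whether a page "changed" would typically rely on comparing a content signature or an HTTP not-modified response between fetches, which is the second half of what the issue describes.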