date:20070530

Re: [Nutch-dev] Plugins initialized all the time!

2007-05-30 Thread Doğacan Güney

Hi,

On 5/29/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:

  Which job causes the problem? Perhaps, we can find out what keeps
  creating a conf object over and over.
 
  Also, I have tried what you have suggested (better caching for plugin
  repository) and it really seems to make a difference. Can you try with
  this patch(*) to see if it solves your problem?
 
  (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

 Some comments about your patch. The approach seems nice: you only check
 the parameters that affect plugin loading. But keep in mind that the
 plugins themselves will configure themselves with many other parameters,
 so to keep things safe there should be a PluginRepository for each set
 of parameters (including all of them). Besides, remember that CACHE is a
 WeakHashMap and you are creating ad-hoc PluginProperty objects as keys.
 Something doesn't look right: the lifespan of those objects will be
 much shorter than you require. Perhaps you should be using
 SoftReferences instead, or a simple LRU cache (LinkedHashMap provides
 that easily).

My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there:)

I don't really worry about WeakHashMap-LinkedHashMap stuff. But your
approach is simple and should be faster so I guess it's OK.
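
For reference, the LRU option mentioned above can be sketched with LinkedHashMap in access-order mode. This is only a minimal illustration, not code from the patch: the class names, the capacity of 2, and the string keys and values are all made up for the example, and the map is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache sketch built on LinkedHashMap's access-order mode. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration runs from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Consulted after each put(); returning true evicts the eldest entry.
        return size() > maxEntries;
    }
}

class LruCacheDemo {
    public static void main(String[] args) {
        // Hypothetical keys and values standing in for configs and repositories.
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("confA", "repositoryA");
        cache.put("confB", "repositoryB");
        cache.get("confA");                 // touch confA, so confB is now eldest
        cache.put("confC", "repositoryC");  // evicts confB
        System.out.println(cache.keySet()); // prints [confA, confC]
    }
}
```

Unlike a WeakHashMap keyed on throwaway objects, entries here stay alive until the capacity is exceeded, so the lifetime of a cached value no longer depends on who else holds the key.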

You are right about per-plugin parameters but I think it will be very
difficult to keep PluginProperty class in sync with plugin parameters.
I mean, if a plugin defines a new parameter, we have to remember to
update PluginProperty. Perhaps, we can force plugins to define the
configuration options they will use in, say, their plugin.xml file, but
that will be very error-prone too. I don't want to compare entire
configuration objects, because changing irrelevant options, like
fetcher.store.content shouldn't force loading plugins again, though it
seems it may be inevitable.
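
The idea of keying the cache on only the plugin-loading options can be sketched as below. This is a rough illustration under stated assumptions, not the contents of the patch: a plain Map stands in for the Hadoop Configuration object, the class and method names are hypothetical, and the list of key properties is a guess at which options matter for plugin loading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PluginCacheKeySketch {
    // Assumption: only these properties influence which plugins are loaded.
    static final List<String> KEY_PROPERTIES = Arrays.asList(
            "plugin.folders", "plugin.includes",
            "plugin.excludes", "plugin.auto-activation");

    /** Builds a value-comparable cache key from the plugin-loading properties only. */
    static List<String> cacheKey(Map<String, String> conf) {
        List<String> key = new ArrayList<>();
        for (String name : KEY_PROPERTIES) {
            key.add(conf.get(name)); // null when the property is unset
        }
        return key;
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("plugin.includes", "protocol-http|parse-(text|html)");
        a.put("fetcher.store.content", "true");

        Map<String, String> b = new HashMap<>(a);
        b.put("fetcher.store.content", "false"); // unrelated option changes

        Map<String, String> c = new HashMap<>(a);
        c.put("plugin.includes", "protocol-http|parse-rss"); // plugin set changes

        System.out.println(cacheKey(a).equals(cacheKey(b))); // true
        System.out.println(cacheKey(a).equals(cacheKey(c))); // false
    }
}
```

The weakness discussed in this thread is visible here too: any option a plugin reads for itself but that is missing from KEY_PROPERTIES would not trigger a reload.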


 Anyway, I'll try to build my own Nutch to test your patch.

 Thanks!




-- 
Doğacan Güney
-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers

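[Nutch-dev] RE: Reply

2007-05-30 Thread 张豪兴

Dear company leadership (manager/finance), hello!

Each month our company has a number of computer-printed VAT invoices and
ordinary merchandise sales tax invoices (national and local tax) available.
We can issue them on your behalf at a discount or cooperate with you. Our
rates are low, and the discount rate can be negotiated according to the
volume and amount involved.

Our company solemnly promises that all invoices used are genuine. We hope to
have the opportunity to work with your company. Payment is made after the
invoices are verified. Integrity and confidentiality guaranteed. If your
company has any need, you are welcome to call us.

   Contact phone: 13590116835

   Contact: 张豪兴

E-MAIL [EMAIL PROTECTED]

Address: International Culture Building, Shennan Middle Road, Shenzhen

  Note: (This information is valid long-term; please keep it. We apologize
  for any disturbance.)

Respectfully yours,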



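[Nutch-dev] Were you looking for me?

2007-05-30 Thread 代办税票

Dear person in charge, hello!

Our company is an A-rated enterprise that pays its taxes normally, with a
presence in large, medium, and small cities nationwide. All cooperation with
customers and organizations follows national regulations, and we are willing
to bear responsibility for any violation. To expand our market
competitiveness, we offer customers convenient, flexible, and discounted
handling of business taxes, and can offer your company discounted tax
payment. We can issue the following invoices on behalf of customers:

I. Ordinary national tax invoices

1. Commercial sales (verifiable online)  2. Unified goods sales  3. Industrial (enterprise) sales

II. Ordinary local tax invoices

1. Transport (computer-printed transport, freight forwarding, loading and unloading, intermodal transport, sea freight, etc.)

2. Other services (advertising, accommodation, conference, consulting fees, etc.)

3. Construction and installation; processing and repair

4. Customs verification forms for sale at favorable prices, with convenient handover

5. Other special-purpose invoices (leasing, administrative and institutional use, motor vehicle sales, real estate transactions, tax agency), and so on.

The tax rate on all of the above invoices is between 0.5% and 1.5%, currently among the lowest in the country.

If needed, please call:

   Mobile: 13826592593

   Contact: Mr. Liu

   E-mail: [EMAIL PROTECTED]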
  
 




Re: [Nutch-dev] Plugins initialized all the time!

2007-05-30 Thread Andrzej Bialecki

Doğacan Güney wrote:

 My patch is just a draft to see if we can create a better caching
 mechanism. There are definitely some rough edges there:)

One important piece of information: in future versions of Hadoop the method
Configuration.setObject() is deprecated and will then be removed, so we
have to grow our own caching mechanism anyway: either use a singleton
cache, or change nearly all APIs to pass around a user/job/task context.

So, we will face this problem pretty soon, with the next upgrade of Hadoop.



 You are right about per-plugin parameters but I think it will be very
 difficult to keep PluginProperty class in sync with plugin parameters.
 I mean, if a plugin defines a new parameter, we have to remember to
 update PluginProperty. Perhaps, we can force plugins to define the
 configuration options they will use in, say, their plugin.xml file, but
 that will be very error-prone too. I don't want to compare entire
 configuration objects, because changing irrelevant options, like
 fetcher.store.content shouldn't force loading plugins again, though it
 seems it may be inevitable.

Let me see if I understand this ... In my opinion this is a non-issue.

Child tasks are started in separate JVMs, so the only context 
information that they have is what they can read from job.xml (which is 
a superset of all properties from config files + job-specific data + 
task-specific data). This context is currently instantiated as a 
Configuration object, and we (ab)use it also as a local per-JVM cache 
for plugin instances and other objects.

Once we instantiate the plugins, they exist unchanged throughout the 
lifecycle of JVM (== lifecycle of a single task), so we don't have to 
worry about having different sets of plugins with different parameters 
for different jobs (or even tasks).

In other words, it seems to me that there is no such situation in which 
we have to reload plugins within the same JVM, but with different 
parameters.
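
The "load once per JVM" view described above could look roughly like the sketch below. It is only an illustration under assumptions: the holder class, its method names, and the use of a String as a stand-in for the job configuration are all hypothetical and not part of Nutch or Hadoop.

```java
import java.util.function.Function;

/** Loads a value on first use and returns the same instance afterwards. */
final class LoadOnceHolder<C, T> {
    private final Function<C, T> loader;
    private T instance;
    private int loadCount;

    LoadOnceHolder(Function<C, T> loader) {
        this.loader = loader;
    }

    /** The configuration is only consulted on the first call. */
    synchronized T get(C conf) {
        if (instance == null) {
            instance = loader.apply(conf);
            loadCount++;
        }
        return instance;
    }

    synchronized int loadCount() {
        return loadCount;
    }
}

class LoadOnceDemo {
    // One holder per JVM; the loader stands in for plugin instantiation.
    static final LoadOnceHolder<String, Object> PLUGINS =
            new LoadOnceHolder<>(conf -> new Object());

    public static void main(String[] args) {
        Object first = PLUGINS.get("job.xml as read by this task");
        Object second = PLUGINS.get("a different configuration");
        System.out.println(first == second);     // true
        System.out.println(PLUGINS.loadCount()); // 1
    }
}
```

The second call shows the trade-off of this design: a configuration passed after the first load is silently ignored, which is only safe if a single JVM never needs plugins built from two different configurations.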

-- 
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Nutch-dev] running nutch without http proxy

2007-05-30 Thread Marcin Okraszewski

It seems this is the default. You are more likely to run into problems if
you want to use a proxy. The default configuration is without a proxy.

Cheers,
Marcin

On 5/29/07, prem kumar [EMAIL PROTECTED] wrote:
 Is it possible to run nutch  without using a http proxy to search the
 internet? If so, what are the configurations needed ?
 I don't want to use a socks proxy either. All I have is a direct connection
 to the internet.

 Thanks
 Prem


 --
 http://premsden.blogspot.com/



Re: [Nutch-dev] Plugins initialized all the time!

2007-05-30 Thread Doğacan Güney

On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Doğacan Güney wrote:

  My patch is just a draft to see if we can create a better caching
  mechanism. There are definitely some rough edges there:)

 One important piece of information: in future versions of Hadoop the method
 Configuration.setObject() is deprecated and will then be removed, so we
 have to grow our own caching mechanism anyway: either use a singleton
 cache, or change nearly all APIs to pass around a user/job/task context.

 So, we will face this problem pretty soon, with the next upgrade of Hadoop.

Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own cache
mechanism.




  You are right about per-plugin parameters but I think it will be very
  difficult to keep PluginProperty class in sync with plugin parameters.
  I mean, if a plugin defines a new parameter, we have to remember to
  update PluginProperty. Perhaps, we can force plugins to define the
  configuration options they will use in, say, their plugin.xml file, but
  that will be very error-prone too. I don't want to compare entire
  configuration objects, because changing irrelevant options, like
  fetcher.store.content shouldn't force loading plugins again, though it
  seems it may be inevitable.

 Let me see if I understand this ... In my opinion this is a non-issue.

 Child tasks are started in separate JVMs, so the only context
 information that they have is what they can read from job.xml (which is
 a superset of all properties from config files + job-specific data +
 task-specific data). This context is currently instantiated as a
 Configuration object, and we (ab)use it also as a local per-JVM cache
 for plugin instances and other objects.

 Once we instantiate the plugins, they exist unchanged throughout the
 lifecycle of JVM (== lifecycle of a single task), so we don't have to
 worry about having different sets of plugins with different parameters
 for different jobs (or even tasks).

 In other words, it seems to me that there is no such situation in which
 we have to reload plugins within the same JVM, but with different
 parameters.

The problem is that someone might get a little too clever. For example,
one could write a new job that has two IndexingFilters, each created from
a completely different configuration object, and then filter some
documents with the first filter and others with the second. I agree
that this is a bit of a reach, but it is possible.



 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney

[Nutch-dev] Committer

2007-05-30 Thread Chris Mattmann

Hi Folks,

 I'd just like to throw out my +1 for Doğacan Güney's committer status. I've
been impressed by several of his contributions and the guy just keeps them
coming and coming. I'm not a member of the Lucene PMC, so I don't have
official voting rights, however, I would like to express my support for his
elevation to committer status.

Cheers,
  Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Key Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




[Nutch-dev] [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-30 Thread Chris A. Mattmann (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500133
]

Chris A. Mattmann commented on NUTCH-444:
-

Hi Guys,

Okay, here is the way that I currently see this issue, and what to do with
this. There are three options:

1. keep parse-rss and parse-feed (worst, but still doable)
2. gut parse-rss with new code from parse-feed (probably the best choice)
3. blow away parse-rss and create new plugin in sources called parse-feed (also
a good choice)

So, the plan I am going to do is:

if (parse-feed contains a superset of the functionality of parse-rss) {
    choose option 2
}
else {
    choose option 3 if and only if parse-feed is equivalent to parse-rss
    choose option 1 otherwise
}

I've been lagging on this. I'll make some progress on getting a patch ready
this week.

Possibly use a different library to parse RSS feed for improved performance
and compatibility
-

Key: NUTCH-444
URL: https://issues.apache.org/jira/browse/NUTCH-444
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.0.0

Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2,
parse-feed.tar.bz2

As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current
library (feedparser) has the following issues:
- OutOfMemory when parsing 100k feeds, since it has to convert the feed to
jdom first
- no support for Atom 1.0
- there has been no development in the last year
Alternatives are:
- Rome
- Informa
- custom implementation based on Stax
- ??

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[Nutch-dev] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Manoharam Reddy

Time and again I get this error and as a result the segment remains
incomplete. This wastes one iteration of the for() loop in which I am
doing generate, fetch and update.

Can someone please tell me what are the measures I can take to avoid
this error? And isn't it possible to make some code changes so that
the whole fetch doesn't have to stop suddenly when this error occurs.
Can't we do something in the code so that, the fetch still continues
like in case of SocketException, in which case the fetch while(1) loop
continues.

If it is not possible, please tell me how can I prevent this error
from happening?

- ERROR -

fetch of http://telephony/register.asp failed with:
java.lang.OutOfMemoryError: Java heap space
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
..
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
...
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
Fetcher: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
  at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)


Re: [Nutch-dev] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Dennis Kubes

You can change the -Xms and -Xmx settings in the mapred.child.java.opts
variable in your hadoop-site.xml file to allow more memory for your
tasks.  Are you trying to parse extremely big pages or files such as
PDFs?  If you are, you can also set maximum size limits for downloaded
content using the file.content.limit and ftp.content.limit options in
your nutch-site.xml file.
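
As a rough illustration of the settings named above, the fragments below show where they would go. The property names are the ones mentioned in this message; the heap sizes and byte limits are placeholder values chosen for the example, not recommendations, and each fragment is assumed to sit inside the file's existing <configuration> element.

```xml
<!-- hadoop-site.xml: JVM options for child task processes (example values) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xms256m -Xmx1024m</value>
</property>
```

```xml
<!-- nutch-site.xml: cap on downloaded content size, assumed to be in bytes (example values) -->
<property>
  <name>file.content.limit</name>
  <value>65536</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>65536</value>
</property>
```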

Dennis Kubes

Manoharam Reddy wrote:
 Time and again I get this error and as a result the segment remains
 incomplete. This wastes one iteration of the for() loop in which I am
 doing generate, fetch and update.
 
 Can someone please tell me what are the measures I can take to avoid
 this error? And isn't it possible to make some code changes so that
 the whole fetch doesn't have to stop suddenly when this error occurs.
 Can't we do something in the code so that, the fetch still continues
 like in case of SocketException, in which case the fetch while(1) loop
 continues.
 
 If it is not possible, please tell me how can I prevent this error
 from happening?
 
 - ERROR -
 
 fetch of http://telephony/register.asp failed with:
 java.lang.OutOfMemoryError: Java heap space
 java.lang.NullPointerException
 at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
 at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
 ..
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
 fetcher caught:java.lang.NullPointerException
 java.lang.NullPointerException
 at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
 at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
 ...
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
 fetcher caught:java.lang.NullPointerException
 Fetcher: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
  at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)


[Nutch-dev] [jira] Resolved: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-05-30 Thread Andrzej Bialecki (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Andrzej Bialecki resolved NUTCH-61.

Resolution: Fixed
Fix Version/s: 1.0.0

Applied with some modifications in rev. 542903.

Adaptive re-fetch interval. Detecting umodified content
---

Key: NUTCH-61
URL: https://issues.apache.org/jira/browse/NUTCH-61
Project: Nutch
Issue Type: New Feature
Components: fetcher
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 1.0.0

Attachments: 20050606.diff, 20051230.txt, 20060227.txt,
nutch-61-417287.patch, nutch-61-492176.patch

Currently Nutch doesn't automatically adjust its re-fetch period, no matter
whether individual pages change seldom or frequently. The goal of these changes is
to extend the current codebase to support various possible adjustments to
re-fetch times and intervals, and specifically a re-fetch schedule which
tries to adapt the period between consecutive fetches to the period of
content changes.
Also, these patches implement checking if the content has changed since last
fetching; protocol plugins are also changed to make use of this information,
so that if content is unmodified it doesn't have to be fetched and processed.
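
The general idea of an adaptive schedule, lengthening the interval when a page is found unmodified and shortening it when the page has changed, can be sketched as follows. This is a simplified illustration and not the code applied in rev. 542903: the class name, the rates, and the bounds are all assumptions picked for the example.

```java
class AdaptiveIntervalSketch {
    // All constants are illustrative assumptions.
    static final double INC_RATE = 0.4;   // grow interval by 40% when unmodified
    static final double DEC_RATE = 0.2;   // shrink interval by 20% when modified
    static final long MIN_INTERVAL_SEC = 60L;
    static final long MAX_INTERVAL_SEC = 365L * 24 * 3600;

    /** Returns the next re-fetch interval in seconds, clamped to the bounds. */
    static long nextInterval(long currentSec, boolean modified) {
        double next = modified
                ? currentSec * (1.0 - DEC_RATE)
                : currentSec * (1.0 + INC_RATE);
        long rounded = Math.round(next);
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, rounded));
    }

    public static void main(String[] args) {
        long oneDay = 86400L;
        System.out.println(nextInterval(oneDay, false)); // 120960
        System.out.println(nextInterval(oneDay, true));  // 69120
    }
}
```

Applied repeatedly, the interval drifts toward the observed rate of content change while staying inside the configured minimum and maximum.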

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[Nutch-dev] Security of personal data

2007-05-30 Thread Poste Italiane S . p . A

Title: Poste Italiane

Dear Poste.it customer,

Please review this email, which describes the new security measures, immediately and with the utmost seriousness. The security department of our bank informs you that measures have been taken to raise the security level of online banking, in response to frequent attempts to access bank accounts illegally. To gain access to the more secure version of the customer area, please provide your authorization.

CLICK HERE TO GO TO THE AUTHORIZATION PAGE »

Best regards,

The security department

CONFIDENTIAL!

This email contains confidential information and is intended for the authorized recipient only. If you are not an authorized recipient, please return the email to us and then delete it from your computer and mail server.

You may neither use nor publish any email, including links, nor make them accessible to third parties in any way.

Thank you for your cooperation. Poste Italiane S.p.A.

