[Nutch Wiki] Trivial Update of "FlorinePa" by FlorinePa

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "FlorinePa" page has been changed by FlorinePa:
http://wiki.apache.org/nutch/FlorinePa

New page:
People who you worst hits were definitely plus capacity used teens. It is not easy access to find relief 
from our past, so here are some useful inspirational ideas.<>
<>
Feel free to visit my page :: 
[[http://www.abercrombiefitchoutlet-freeshipping.com/|Abercrombie & Fitch]]


[jira] [Updated] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2013-03-25 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1389:


Fix Version/s: 2.2

> parsechecker and indexchecker to report truncated content
> -
>
> Key: NUTCH-1389
> URL: https://issues.apache.org/jira/browse/NUTCH-1389
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, parser
>Affects Versions: nutchgora, 1.5
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.7, 2.2
>
>
> ParserChecker and IndexingFiltersChecker should report when a document is 
> truncated due to {http,file,ftp}.content.limit.
> Truncated content may cause text and metadata extraction to fail for PDF and 
> other binary document formats.
> A hint that truncation (and not a broken plugin) is the possible reason would 
> be useful.
> See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.
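
To illustrate the kind of hint being requested, here is a minimal sketch of one plausible truncation check: compare the length declared by the server with the number of bytes actually stored. The class and method names are hypothetical and this is not the code of ParseSegment.isTruncated; it only assumes that a Content-Length value is available as a string.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Nutch. */
final class TruncationCheck {

    /**
     * Returns true when the declared length exceeds the stored length.
     * declaredLength is the raw Content-Length value, or null if the header is absent.
     */
    static boolean isLikelyTruncated(String declaredLength, byte[] storedContent) {
        if (declaredLength == null || declaredLength.trim().isEmpty()) {
            return false; // no header, nothing to compare against
        }
        try {
            long declared = Long.parseLong(declaredLength.trim());
            return declared > storedContent.length;
        } catch (NumberFormatException e) {
            return false; // malformed header: we cannot tell
        }
    }

    public static void main(String[] args) {
        byte[] stored = "partial".getBytes(StandardCharsets.UTF_8); // 7 bytes
        System.out.println(isLikelyTruncated("65536", stored));
        System.out.println(isLikelyTruncated("7", stored));
    }
}
```

A checker tool could print a warning such as "content may be truncated by a content.limit setting" whenever this returns true, before blaming the parser plugin.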

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1419) parsechecker and indexchecker to report protocol status

2013-03-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613376#comment-13613376
 ] 

Lewis John McGibbney commented on NUTCH-1419:
-

Same as before. I am +1 for this issue to be implemented in both trunk and 2.x 
branches. Please say whether you can commit or not, Seb. Thank you.

> parsechecker and indexchecker to report protocol status
> ---
>
> Key: NUTCH-1419
> URL: https://issues.apache.org/jira/browse/NUTCH-1419
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, parser
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1419-1.patch, NUTCH-1419-2.x.patch, 
> NUTCH-1419-trunk.patch
>
>
> Parsechecker and indexchecker should report the protocol status when the 
> fetch was not successful (status other than 200/ok).
> In case of a redirect, the protocol status contains the URL a redirect points 
> to. Usually, this URL should be checked instead of the original one which is 
> not indexed. The content of a redirect response is less useful (and often 
> empty):
> {code}
> % nutch indexchecker http://lucene.apache.org/nutch/
> fetching: http://lucene.apache.org/nutch/
> parsing: http://lucene.apache.org/nutch/
> contentType: text/html
> content :   301 Moved Permanently Moved Permanently The document has 
> moved here . Apache/2.4.1 (Unix) OpenSSL/1.
> title : 301 Moved Permanently
> host :  lucene.apache.org
> tstamp :Tue Jul 03 13:27:32 CEST 2012
> url :   http://lucene.apache.org/nutch/
> {code}
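
The point about redirects can be sketched as a small decision function: when the response is a redirect and carries a target, check the target instead of the original URL. This is only an illustration under stated assumptions; the class name, the method and the example.org URLs are hypothetical, and it works on a plain HTTP status code and Location value rather than on a Nutch protocol status.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not part of Nutch. */
final class RedirectTarget {

    /** Returns the URL that should be parsed/indexed for the given response. */
    static String urlToCheck(String originalUrl, int httpStatus, String locationHeader) {
        // 3xx responses other than 304 (Not Modified) point somewhere else.
        boolean redirect = httpStatus >= 300 && httpStatus < 400 && httpStatus != 304;
        if (redirect && locationHeader != null && !locationHeader.isEmpty()) {
            // Location may be relative, so resolve it against the original URL.
            return URI.create(originalUrl).resolve(locationHeader).toString();
        }
        return originalUrl;
    }

    public static void main(String[] args) {
        System.out.println(urlToCheck("http://example.org/old/", 301, "/new/"));
        System.out.println(urlToCheck("http://example.org/old/", 200, null));
    }
}
```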



[jira] [Commented] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2013-03-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613373#comment-13613373
 ] 

Lewis John McGibbney commented on NUTCH-1038:
-

I am +1 for commit, Seb. Please commit when you are ready. If you cannot do so 
then I will happily do it.

> Port IndexingFiltersChecker to 2.0
> --
>
> Key: NUTCH-1038
> URL: https://issues.apache.org/jira/browse/NUTCH-1038
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.2
>
> Attachments: NUTCH-1038.patch, NUTCH-1038v2.patch
>
>




[jira] [Commented] (NUTCH-1419) parsechecker and indexchecker to report protocol status

2013-03-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613277#comment-13613277
 ] 

Sebastian Nagel commented on NUTCH-1419:


Hi Lewis,

+1 for NUTCH-1419-trunk.patch (parsechecker and indexchecker).

For NUTCH-1419-2.x.patch (parsechecker only): the error message
{code}
2013-03-25 23:29:40,000 ERROR parse.ParserChecker - Fetch failed with protocol 
status: org.apache.nutch.storage.ProtocolStatus@1b7d0 {
  "code":"12"
  "args":"[http://www.apachecon.eu/]";
  "lastModified":"0"
}
{code}
could be improved using ProtocolStatusUtils.getName and .getMessage, cf. the 
patch for indexchecker in NUTCH-1038. A "moved" or "moved(12)" is more 
informative.
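
As a rough sketch of the suggested improvement, the following formats a numeric status code as a readable label such as "moved(12)". It does not use ProtocolStatusUtils; the class, its methods and the output layout are hypothetical, and only the mapping of code 12 to "moved" is taken from the message above.

```java
/** Hypothetical formatter for illustration; not the Nutch ProtocolStatusUtils. */
final class StatusLabel {

    /** Maps a status code to a short name. Only code 12 is taken from the thread. */
    static String name(int code) {
        switch (code) {
            case 12:
                return "moved";
            default:
                return "unknown"; // other codes deliberately left out of this sketch
        }
    }

    /** Builds a one-line description, e.g. for an error log message. */
    static String describe(int code, String args) {
        String label = name(code) + "(" + code + ")";
        return (args == null || args.isEmpty()) ? label : label + ", args=" + args;
    }

    public static void main(String[] args) {
        System.out.println("Fetch failed with protocol status: "
                + describe(12, "[http://www.apachecon.eu/]"));
    }
}
```

A single line like this is easier to scan than the multi-line object dump shown in the error message.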


> parsechecker and indexchecker to report protocol status
> ---
>
> Key: NUTCH-1419
> URL: https://issues.apache.org/jira/browse/NUTCH-1419
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, parser
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1419-1.patch, NUTCH-1419-2.x.patch, 
> NUTCH-1419-trunk.patch
>
>
> Parsechecker and indexchecker should report the protocol status when the 
> fetch was not successful (status other than 200/ok).
> In case of a redirect, the protocol status contains the URL a redirect points 
> to. Usually, this URL should be checked instead of the original one which is 
> not indexed. The content of a redirect response is less useful (and often 
> empty):
> {code}
> % nutch indexchecker http://lucene.apache.org/nutch/
> fetching: http://lucene.apache.org/nutch/
> parsing: http://lucene.apache.org/nutch/
> contentType: text/html
> content :   301 Moved Permanently Moved Permanently The document has 
> moved here . Apache/2.4.1 (Unix) OpenSSL/1.
> title : 301 Moved Permanently
> host :  lucene.apache.org
> tstamp :Tue Jul 03 13:27:32 CEST 2012
> url :   http://lucene.apache.org/nutch/
> {code}



[jira] [Commented] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker

2013-03-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613264#comment-13613264
 ] 

Sebastian Nagel commented on NUTCH-1501:


2.x does not log to stdout. Add to 2.x log4j.properties (as in trunk):
{code}
log4j.logger.org.apache.nutch.parse.ParserChecker=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexingFiltersChecker=INFO,cmdstdout
{code}


> Harmonize behavior of parsechecker and indexchecker
> ---
>
> Key: NUTCH-1501
> URL: https://issues.apache.org/jira/browse/NUTCH-1501
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, parser
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 2.2
>
>
> Behaviour of ParserChecker and IndexingFiltersChecker has diverged between 
> trunk and 2.x
> - missing in 2.x: NUTCH-1320, NUTCH-1207
> - open issue to be also applied to 2.x: NUTCH-1419, NUTCH-1389



Re: Google Summer of Code 2013 - Giraph implementation of Nutch LinkRank Algorithm

2013-03-25 Thread Lewis John Mcgibbney
Hi Ahmet,

On Mon, Mar 25, 2013 at 6:05 AM,  wrote:

> Google Summer of Code 2013 - Giraph implementation of Nutch LinkRank
> Algorithm
> 22613 by: Lewis John Mcgibbney
> 22616 by: Mattmann, Chris A (388J)
> 22628 by: Ahmet Emre Aladağ
>

So are you interested in picking up the Google Summer of Code project? Are
you able to do so, e.g. are you a student and do you have time for
this?


> I've just started learning Nutch and would like to implement this feature
> on 2.x branch. Would it differ too much implementing for trunk and 2.x
> branches?
>

Currently the LinkRank implementation is contained within the core Nutch
code [0]. To answer your question, in short I don't know. That is why I
posted the GSoC project: so we could find out these types of things.

>
> I'd like to hear suggestions for a learning path from scratch to implement
> this feature.
>

Start by learning the Nutch code base and how the LinkRank algorithm works,
and how scoring filters can extend the ScoringFilter extension point [1], etc.
Other examples of how this works can be seen in the tld plugin, and of
course the other scoring plugins available in trunk and 2.x branches.
We currently have some wiki pages dedicated to learning how to write
plugins for Nutch [2]. This is as good a starting point as any to get you
familiar with the plugin extension point mechanism.


> I'm currently following the Nutch 1.x wiki for Becoming a developer.
>

Excellent.


[0]
http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/
[1]
http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilter.java
[2] http://wiki.apache.org/nutch/PluginCentral


[Nutch Wiki] Trivial Update of "PhilCase" by PhilCase

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "PhilCase" page has been changed by PhilCase:
http://wiki.apache.org/nutch/PhilCase

New page:
Competition in many contemporary markets is growing ever stronger, 
essentially under the influence of sector globalization.<>
As a result, markets, even local ones, are becoming increasingly 
saturated, which translates into difficulties in acquiring customers. It is 
necessary to conduct wide-ranging marketing activities in order to 
raise one's chances of successfully acquiring new customers, as well as to 
guard against a shrinking number of loyal customers who use 
the company's products and services.<>
As part of marketing activities it is possible to use forms of advertising or 
promotion that are now easier to carry out. This results 
from the possibility of producing advertisements or promotional materials more cheaply, 
[[http://www.visitdecaturgeorgia.com/redirect.aspx?url=http://fajnadrukarnia.com|visitdecaturgeorgia.com]][[http://www.slonecznik.nielotki.cieszyn.pl/artykul-5204/Drukarnia_Warszawa_fajnadrukarnia_com.html|Leaflet printing Warsaw]] 
such as catalogues Warsaw, thanks to the use of modern 
forms of printing.<>
In this way such work can be used even by 
small and medium-sized companies. When preparing promotional or 
advertising materials for marketing campaigns, a coherent 
advertising concept is extremely important. If we want to use such materials, 
such as brochures Warsaw, we can nevertheless consider standing 
contracts 
[[http://www.boemre.gov:8765/help/urlstatusgo.html?col=boemre&url=http://fajnadrukarnia.com|Leaflet printing Warsaw]] 
with printing houses.<>
With such contracts we can count on lower costs of using 
individual materials over a longer period of time. Another solution, when 
ordering commercial printing in Warsaw, is to order larger 
quantities of materials, which also reduces the cost of individual 
items ordered from printing houses, especially when we decide to 
use online printing houses.


[jira] [Commented] (NUTCH-1532) Replace 'segment' mapping field with batchId

2013-03-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613100#comment-13613100
 ] 

Lewis John McGibbney commented on NUTCH-1532:
-

I am +1 for you to commit this Feng. It passes all of my tests and I now get 
back Id when indexing to Solr server.

> Replace 'segment' mapping field with batchId
> 
>
> Key: NUTCH-1532
> URL: https://issues.apache.org/jira/browse/NUTCH-1532
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1532.patch, NUTCH-1532-v2.patch
>
>
> As described here [0], the segment field in solr-mapping.xml should be 
> replaced with the batchId. This reflects the different architecture in 2.x.
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08793.html



[Nutch Wiki] Trivial Update of "ContributorsGroup" by LewisJohnMcgibbney

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "ContributorsGroup" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/ContributorsGroup?action=diff&rev1=3&rev2=4

   * AdminGroup
   * ElisabethAdler
   * EdwardDrapkin
-  * subhankarray
  


Re: [Nutch Wiki] Trivial Update of "PGOSimone" by PGOSimone

2013-03-25 Thread Lewis John Mcgibbney
Hi,

On Mon, Mar 25, 2013 at 6:05 AM,  wrote:

>
>  Hey Julien,
>
>  I heard on #asfinfra that many of our MoinMoin wikis have been attacked
> recently by SPAM.
>
>  I think we may want to contact infra and ask for specific
> ContributorsGroup only Nutch wiki access.
>
>  http://wiki.apache.org/general/OurWikiFarm
>
>
>
Right now we have
http://wiki.apache.org/nutch/AdminGroup with the following names:

Gavin McDonald (INFRA), Sebastian, Julien, Markus, Ferdy, Kiran and myself

and
http://wiki.apache.org/nutch/ContributorsGroup with

AdminGroup, ElisabethAdler, EdwardDrapkin, subhankarray

I've never heard from or communicated with subhankarray on the Nutch list. I
therefore propose to remove this person's access to the wiki.

This conversation is also going on over in java-user@lucene.

Lewis


[jira] [Commented] (NUTCH-1533) Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage

2013-03-25 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613072#comment-13613072
 ] 

Lewis John McGibbney commented on NUTCH-1533:
-

So are you happy with the patch which has been applied?
I see you resolved the issue.
Do you have a commit number, please? It really helps to end Jira issues with a 
simple message saying that person X committed to branch Y at commit number Z.
Thanks Feng.

> Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and 
> setBatchId() accessors in o.a.n.storage.WebPage
> 
>
> Key: NUTCH-1533
> URL: https://issues.apache.org/jira/browse/NUTCH-1533
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch, NUTCH-1533-v3.patch
>
>
> NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to 
> indexing. This is currently not available as we do not store the information 
> in the WebPage. Additionally, we do not store the other ModifiedTime's but 
> incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule.
> All the above accessors should be implemented.
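
For readers unfamiliar with the request, the accessors in question would look roughly like the sketch below. This is a plain Java stand-in, not the actual o.a.n.storage.WebPage class; the field types (long timestamps, String batch id) and the sample values are assumptions made for illustration.

```java
/** Plain-Java stand-in for illustration; not the real o.a.n.storage.WebPage. */
final class WebPageSketch {
    private long prevModifiedTime; // assumed: epoch milliseconds
    private String batchId;        // assumed: opaque identifier of a crawl batch

    long getPrevModifiedTime() {
        return prevModifiedTime;
    }

    void setPrevModifiedTime(long prevModifiedTime) {
        this.prevModifiedTime = prevModifiedTime;
    }

    String getBatchId() {
        return batchId;
    }

    void setBatchId(String batchId) {
        this.batchId = batchId;
    }

    public static void main(String[] args) {
        WebPageSketch page = new WebPageSketch();
        page.setBatchId("batch-0001");          // hypothetical value
        page.setPrevModifiedTime(1364169600000L); // hypothetical timestamp
        System.out.println(page.getBatchId() + " " + page.getPrevModifiedTime());
    }
}
```

With such accessors in place, an indexer could read the batch id from the stored page and add it to the document before sending it to Solr, which is what NUTCH-1532 needs.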



[Nutch Wiki] Trivial Update of "MaybellSD" by MaybellSD

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "MaybellSD" page has been changed by MaybellSD:
http://wiki.apache.org/nutch/MaybellSD

New page:
Rayford Lockett is his identify and his wife will not like it at all.<>
One of his favorite hobbies is to cook dinner but he will not have the time 
these days. For many years he's been residing in Virgin Islands. Taking care of 
animals is how he supports his family. If you want to locate uot much more 
check out out his web site: [[http://www.disableddatingsingles.com|disabled 
singles]]


[Nutch Wiki] Trivial Update of "HenryCyr" by HenryCyr

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "HenryCyr" page has been changed by HenryCyr:
http://wiki.apache.org/nutch/HenryCyr

New page:
Not much to say about myself I think.<>
Yes! Im a part of this community.<>
I just wish I'm useful in one way here.<>
<>
my webpage :: 
[[http://www.hotels-booking.com/dubai-hotels-where-to-get-the-best-deals/|please
 click www.hotels-booking.com]]


[Nutch Wiki] Trivial Update of "AletheaSc" by AletheaSc

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "AletheaSc" page has been changed by AletheaSc:
http://wiki.apache.org/nutch/AletheaSc

New page:
There is nothing to write about me at all.<>
Hurrey Im here and a member of apache.org.<>
I just hope I'm useful at all<>
<>
Check out my web blog; 
[[http://www.da0zero.com/index.php?do=/blog/3212/understanding-aspects-of-learning-english/|the most effective way to learn English]]


[Nutch Wiki] Trivial Update of "GailPrath" by GailPrath

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "GailPrath" page has been changed by GailPrath:
http://wiki.apache.org/nutch/GailPrath

New page:
My name is Gail Prather. I life in Claro (Switzerland).<>
<>
<>
Also visit my blog: 
[[http://wikipedia.fsw.leidenuniv.nl:8080/IclonWiki/index.php?title=Gebruiker:Tania9545|Nicer
 Dicer Zellers]]


[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-03-25 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1545:
--

Attachment: NUTCH-1545.patch

remove references to segments in 2.x crawl script.

> capture batchId and remove references to segments in 2.x crawl script.
> --
>
> Key: NUTCH-1545
> URL: https://issues.apache.org/jira/browse/NUTCH-1545
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1545.patch
>
>
> The concept of segment is replaced by batchId in 2.x.
> I'm currently getting rid of segment references in 2.x.
> This issue was flagged up and is separate from NUTCH-1532, which I am working on.



[jira] [Updated] (NUTCH-1532) Replace 'segment' mapping field with batchId

2013-03-25 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1532:
--

Attachment: NUTCH-1532-v2.patch

Add small replacements in the TestbedProxy and Benchmark classes.

> Replace 'segment' mapping field with batchId
> 
>
> Key: NUTCH-1532
> URL: https://issues.apache.org/jira/browse/NUTCH-1532
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1532.patch, NUTCH-1532-v2.patch
>
>
> As described here [0], the segment field in solr-mapping.xml should be 
> replaced with the batchId. This reflects the different architecture in 2.x.
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08793.html



Re: [Nutch Wiki] Trivial Update of "PGOSimone" by PGOSimone

2013-03-25 Thread Mattmann, Chris A (388J)
Hey Julien,

I heard on #asfinfra that many of our MoinMoin wikis have been attacked recently 
by SPAM.

I think we may want to contact infra and ask for specific ContributorsGroup 
only Nutch wiki access.

http://wiki.apache.org/general/OurWikiFarm

Cheers,
Chris


From: Julien Nioche <lists.digitalpeb...@gmail.com>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Monday, March 25, 2013 1:55 AM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: [Nutch Wiki] Trivial Update of "PGOSimone" by PGOSimone

I thought we had to have a login / password to modify the Wiki. If so how come 
we got so much spam lately?

Julien

On 25 March 2013 04:26, Apache Wiki <wikidi...@apache.org> wrote:
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "PGOSimone" page has been changed by PGOSimone:
http://wiki.apache.org/nutch/PGOSimone
[..snip..]


--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [Nutch Wiki] Trivial Update of "PGOSimone" by PGOSimone

2013-03-25 Thread kiran chitturi
I have a feeling someone's account got compromised or spammers found a new
way. I am not sure how they are getting in.


On Mon, Mar 25, 2013 at 4:55 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> I thought we had to have a login / password to modify the Wiki. If so how
> come we got so much spam lately?
>
> Julien
>
>
> On 25 March 2013 04:26, Apache Wiki  wrote:
>
>> Dear Wiki user,
>>
>> You have subscribed to a wiki page or wiki category on "Nutch Wiki" for
>> change notification.
>>
>> The "PGOSimone" page has been changed by PGOSimone:
>> http://wiki.apache.org/nutch/PGOSimone
>>
>> New page:
>> Pleased to meet up with you! My title is Audria Pumphrey.<>
>> One particular of the incredibly finest factors in the earth for me is
>> doing aerobics but I haven't manufactured a dime with it. Illinois is the
>> place I have often been residing but I will have to transfer in a yr or
>> two. My working day career is a postal assistance employee but shortly I am
>> going to be on my own.<>
>> <>
>> Here is my homepage ... [[
>> http://Velocar.dirkhennig.de/index.php?title=Pattaya_Hotels_-_Family_Friendly_Choices|visitthe
>>  following page]]
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Kiran Chitturi




Re: Deploy Nutch Project

2013-03-25 Thread feng lu
Hi,

Currently Hadoop does not support ":" characters in filenames. Check whether any
of your jars or supplementary dist-cache files carries such a filename. See
https://issues.apache.org/jira/browse/HADOOP-3257
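
The exception text in the question can be reproduced with the JDK alone, which may help to see why a name containing ":" is a problem. The sketch below assumes that the text before the colon ("rsrc") ends up being treated as a URI scheme and the rest as a path; how Hadoop arrives at that split internally is an assumption here, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical demo for illustration; it does not use Hadoop. */
final class ColonPathDemo {

    /** Tries to build a URI from a scheme and a path, returning the result or the error text. */
    static String tryBuild(String scheme, String path) {
        try {
            // With a scheme present, java.net.URI requires the path to be empty or start with '/'.
            return new URI(scheme, null, path, null, null).toString();
        } catch (URISyntaxException e) {
            return "error: " + e.getMessage();
        }
    }

    /** Simple pre-flight check for file names that may trip up path parsing. */
    static boolean hasColon(String fileName) {
        return fileName.indexOf(':') >= 0;
    }

    public static void main(String[] args) {
        System.out.println(tryBuild("rsrc", "apache-nutch-2.1.jar"));
        System.out.println(tryBuild("file", "/opt/nutch/apache-nutch-2.1.jar"));
        System.out.println(hasColon("rsrc:apache-nutch-2.1.jar"));
    }
}
```

The "/opt/nutch" location is only an example path. The first call fails because the path after the scheme is relative, while the second succeeds because it is absolute.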



On Mon, Mar 25, 2013 at 2:36 AM, raviksingh wrote:

> Hi,
> This may be a very silly question, but I have already tried hard. I
> created a Nutch Java project as per my requirements. Now I want to deploy
> it on a Linux server. I created a runnable jar. Initially it gave me a path
> error for "seed.txt". I checked the jar and changed the path to remove the
> error. However, I get this error now:
>
>
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative
> path in absolute URI: rsrc:apache-nutch-2.1.jar.
>
>
> This project runs well in Eclipse. Is there any other way of deployment
> than making a jar? I am new to Nutch and Java.
>
> Thanks in advance
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Deploy-Nutch-Project-tp4050896.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>



-- 
Don't Grow Old, Grow Up... :-)


[jira] [Resolved] (NUTCH-1533) Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage

2013-03-25 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1533.
---

Resolution: Fixed

> Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and 
> setBatchId() accessors in o.a.n.storage.WebPage
> 
>
> Key: NUTCH-1533
> URL: https://issues.apache.org/jira/browse/NUTCH-1533
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch, NUTCH-1533-v3.patch
>
>
> NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to 
> indexing. This is currently not available as we do not store the information 
> in the WebPage. Additionally, we do not store the other ModifiedTime's but 
> incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule.
> All the above accessors should be implemented.



[jira] [Commented] (NUTCH-1533) Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage

2013-03-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612583#comment-13612583
 ] 

lufeng commented on NUTCH-1533:
---

Hi Lewis, 
I also found a problem when I committed this patch, but I cannot find what 
causes it. Thanks Lewis. 

> Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and 
> setBatchId() accessors in o.a.n.storage.WebPage
> 
>
> Key: NUTCH-1533
> URL: https://issues.apache.org/jira/browse/NUTCH-1533
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch, NUTCH-1533-v3.patch
>
>
> NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to 
> indexing. This is currently not available as we do not store the information 
> in the WebPage. Additionally, we do not store the other ModifiedTime's but 
> incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule.
> All the above accessors should be implemented.



[Nutch Wiki] Trivial Update of "SamiraChr" by SamiraChr

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "SamiraChr" page has been changed by SamiraChr:
http://wiki.apache.org/nutch/SamiraChr

New page:
Company call this guy Theo Dawe. He works as some sort of accounting 
representative and so he will not necessarily quite change it all anytime 
today.<>
One of the particular things your guy loves the majority is participating in 
dominoes but he might be struggling at find occasion for this. Vermont is 
really the fit he takes pleasure in most and as well as his family loves the 
idea. Check out the latest headlines on his very own website: http://wp00.<>
sakuraworks.org/sns/groups/no-no-hair-elimination-for-pubic-hair-wmv/


[Nutch Wiki] Trivial Update of "NellApoda" by NellApoda

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NellApoda" page has been changed by NellApoda:
http://wiki.apache.org/nutch/NellApoda

New page:
Got nothing to write about me really.<>
Enjoying to be a part of this community.<>
I really wish I am useful at all<>
<>
Feel free to surf to my blog post ... 
[[http://Www.vehicle-maintenance.com/category/vehicle-maintenance/|automobile 
insurance]]


[Nutch Wiki] Trivial Update of "TerriBolt" by TerriBolt

2013-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "TerriBolt" page has been changed by TerriBolt:
http://wiki.apache.org/nutch/TerriBolt

New page:
Not much to write about myself I think.<>
Lovely to be a part of website.<>
I just hope I am useful in some way here.<>
<>
Also visit my blog post ... 
[[http://xn--kama-ota.pl/praca-za-granica-bez-znajomosci-jezyka/|caregiver for the elderly, Switzerland]]


Re: [Nutch Wiki] Trivial Update of "PGOSimone" by PGOSimone

2013-03-25 Thread Julien Nioche
I thought we had to have a login / password to modify the Wiki. If so how
come we got so much spam lately?

Julien

On 25 March 2013 04:26, Apache Wiki  wrote:

> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Nutch Wiki" for
> change notification.
>
> The "PGOSimone" page has been changed by PGOSimone:
> http://wiki.apache.org/nutch/PGOSimone
>
> New page:
> Pleased to meet up with you! My title is Audria Pumphrey.<>
> One particular of the incredibly finest factors in the earth for me is
> doing aerobics but I haven't manufactured a dime with it. Illinois is the
> place I have often been residing but I will have to transfer in a yr or
> two. My working day career is a postal assistance employee but shortly I am
> going to be on my own.<>
> <>
> Here is my homepage ... [[
> http://Velocar.dirkhennig.de/index.php?title=Pattaya_Hotels_-_Family_Friendly_Choices|visitthe
>  following page]]
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Google Summer of Code 2013 - Giraph implementation of Nutch LinkRank Algorithm

2013-03-25 Thread Ahmet Emre Aladağ

Hi,

I've just started learning Nutch and would like to implement this 
feature on 2.x branch. Would it differ too much implementing for trunk 
and 2.x branches?


I'd like to hear suggestions for a learning path from scratch to 
implement this feature. I'm currently following the Nutch 1.x wiki for 
Becoming a developer.


Ahmet Emre Aladağ

On 03/24/2013 09:38 PM, Lewis John Mcgibbney wrote:

Hi All,

After some discussion and drumming up of interest within the Giraph 
community, I've logged a Google Summer of Code issue [0] for this topic.
We are looking for interested students to come forward and participate 
in the effort.
I logged this over in Giraph as there was no GSoC effort already going 
on there; we already have an issue for the Wicket-based User Interface 
implementation in Nutch.
I would be very happy if people (users and developers) could chime in 
on the thread so we can get the project started with the right 
direction and intention in mind.

I propose this for Nutch TRUNK.

Thanks for now

Best

Lewis

[0] https://issues.apache.org/jira/browse/GIRAPH-584

--
/Lewis/