[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-23 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688562#action_12688562 ]

Hudson commented on NUTCH-722:
--

Integrated in Nutch-trunk #762 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/762/])
 remove JAI libs


> Nutch contains jars that we cannot redistribute
> ---
>
> Key: NUTCH-722
> URL: https://issues.apache.org/jira/browse/NUTCH-722
> Project: Nutch
> Issue Type: Bug
> Reporter: Sami Siren
> Priority: Blocker
> Fix For: 1.0.0
>
>
> It seems that we have some jars (as part of pdf parser) that we cannot 
> redistribute.
> Jukka's comment from email:
> "
> The release contains the Java Advanced Imaging libraries (jai_core.jar and 
> jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
> redistribute those libraries.
> "

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Problems compiling Nutch in Eclipse

2009-03-23 Thread Ninad Raut
inverted index - A sequence of (key, pointer) pairs where each pointer
points to a record in a database which contains the key value in some
particular field. The index is sorted on the key values to allow rapid
searching for a particular key value, using e.g. binary search. The
index is "inverted" in the sense that the key value is used to find
the record rather than the other way round.
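To make the definition concrete, here is a minimal Java sketch of inverting a few records. It is only an illustration, not how Nutch or Lucene implement it: the class name InvertedIndexSketch, the methods invert and lookup, and the sample records are all made up for this example. A TreeMap keeps the keys sorted, which stands in for the sorted (key, pointer) list searched by binary search, and the record ids play the role of the pointers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration only; not Nutch or Lucene code.
class InvertedIndexSketch {

    // Builds term -> list of ids of the records that contain the term.
    // The TreeMap keeps terms sorted, so a lookup takes logarithmic time.
    static TreeMap<String, List<Integer>> invert(String[] records) {
        TreeMap<String, List<Integer>> index = new TreeMap<>();
        for (int id = 0; id < records.length; id++) {
            for (String term : records[id].toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
                // Store each record id once per term, even if the term repeats.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != id) {
                    postings.add(id);
                }
            }
        }
        return index;
    }

    // Returns the ids of the records containing the term, or an empty list.
    static List<Integer> lookup(TreeMap<String, List<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        // Forward view: record id (array position) -> text.
        String[] records = {
            "nutch crawls the web",
            "lucene builds the index",
            "nutch uses lucene"
        };
        TreeMap<String, List<Integer>> index = invert(records);
        System.out.println("nutch  -> " + lookup(index, "nutch"));
        System.out.println("lucene -> " + lookup(index, "lucene"));
    }
}
```

The records array is the "forward" direction (record to words); the map returned by invert is the inverted direction (word to records).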

In Nutch, the index is built from the output of these steps:

 - parse, for title, metadata, etc.
 - parse, for text
 - invert, for anchors
 - fetch, for fetch date

Check out the indexes folder after crawling.


On Mon, Mar 23, 2009 at 7:56 PM, Rodrigo Reyes C. wrote:

> Ninad
>
> I've been reading your blog, specifically the article named "Nutch
> Architecture". I posted a comment there but I am not sure you have noticed
> it so I will post it here too.
>
> What do you mean by:
>
> *"The index is the inverted index of all of the pages the system has
> retrieved, and is created by merging all of the individual segment indexes.
> *"
>
> Can you give us an example of what the original segment index looks like and
> how it is inverted? Thanx
>
> Rodrigo
>
> 2009/3/21 Ninad Raut 
>
>> Check out my blog :
>>
>> http://j2eewebsearch.blogspot.com/
>>
>> Check out the third point...
>>
>> Let me know if you get it all right. Your comments will be
>> appreciated.
>>
>> Regards,
>> Ninad
>>
>>
>> On Sat, Mar 21, 2009 at 6:32 AM, Rodrigo Reyes C. 
>> wrote:
>>
>>> Hi
>>>
>>> I have configured my eclipse project as stated here
>>>
>>> http://wiki.apache.org/nutch/RunNutchInEclipse0.9
>>>
>>> Still, I am getting the following errors:
>>>
>>>    - The return type is incompatible with Parser.getParse(Content)
>>>      RTFParseFactory.java
>>>      nutch/src/plugin/parse-rtf/src/java/org/apache/nutch/parse/rtf, line 52
>>>      Java Problem
>>>    - Type mismatch: cannot convert from ParseResult to Parse
>>>      TestRTFParser.java
>>>      nutch/src/plugin/parse-rtf/src/test/org/apache/nutch/parse/rtf, line 78
>>>      Java Problem
>>>
>>> Any ideas on what could be wrong? I already included both
>>> http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and
>>> http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ jars.
>>>
>>> Thanks in advance
>>>
>>> --
>>> Rodrigo Reyes C.
>>>
>>>
>>
>
>


Re: [VOTE] Release Apache Nutch 1.0

2009-03-23 Thread Doğacan Güney
Another non-binding +1 from me.

Hope this one is a keeper :D

On Mon, Mar 23, 2009 at 22:28, Sami Siren  wrote:

> Hello,
>
> I have packaged the third release candidate for Apache Nutch 1.0 release at
> http://people.apache.org/~siren/nutch-1.0/rc2/
>
> See the CHANGES.txt[1] file for details on release contents and latest
> changes. The release was made from tag:
> http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/
>
> The following issues that were discovered during the review of last rc have
> been fixed:
>
> https://issues.apache.org/jira/browse/NUTCH-722
> https://issues.apache.org/jira/browse/NUTCH-723
> https://issues.apache.org/jira/browse/NUTCH-725
> https://issues.apache.org/jira/browse/NUTCH-726
> https://issues.apache.org/jira/browse/NUTCH-727
>
> Please vote on releasing this package as Apache Nutch 1.0. The vote is open
> for the next 72 hours. Only votes from Lucene PMC members are binding, but
> everyone is welcome to check the release candidate and voice their approval
> or disapproval. The vote passes if at least three binding +1 votes are
> cast.
>
> [ ] +1 Release the packages as Apache Nutch 1.0
> [ ] -1 Do not release the packages because...
>
> Here's my +1
>
>
> Thanks!
>
>
> [1]
> http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511
> --
> Sami Siren
>



-- 
Doğacan Güney


[VOTE] Release Apache Nutch 1.0

2009-03-23 Thread Sami Siren

Hello,

I have packaged the third release candidate for Apache Nutch 1.0 release 
at http://people.apache.org/~siren/nutch-1.0/rc2/


See the CHANGES.txt[1] file for details on release contents and latest 
changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/


The following issues that were discovered during the review of last rc 
have been fixed:


https://issues.apache.org/jira/browse/NUTCH-722
https://issues.apache.org/jira/browse/NUTCH-723
https://issues.apache.org/jira/browse/NUTCH-725
https://issues.apache.org/jira/browse/NUTCH-726
https://issues.apache.org/jira/browse/NUTCH-727

Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote passes if at least three 
binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1] 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511

--
Sami Siren


How do I prioritise URLs to be fetched?

2009-03-23 Thread Rodrigo Reyes C.
Hi all

I am relatively new to Nutch and I am trying to understand how it crawls
websites and, more specifically, how it creates and prioritises its fetch
list. So I have a few questions I would like to ask:

   1. What are Nutch's crawl URL sources? I think they are both the WebDB and
   the segments, but I am not sure.
   2. How does Nutch prioritise crawling? By content expiration date only?
   3. Is there some way to affect the way Nutch orders URLs to be fetched? I've
   been reading the Generator class but haven't found an extension point for
   this.

Thanks in advance...

Rodrigo


Re: Problems compiling Nutch in Eclipse

2009-03-23 Thread Rodrigo Reyes C.
Ninad

I've been reading your blog, specifically the article named "Nutch
Architecture". I posted a comment there but I am not sure you have noticed
it so I will post it here too.

What do you mean by:

*"The index is the inverted index of all of the pages the system has
retrieved, and is created by merging all of the individual segment indexes.*
"

Can you give us an example of what the original segment index looks like and
how it is inverted? Thanx

Rodrigo

2009/3/21 Ninad Raut 

> Check out my blog :
> http://j2eewebsearch.blogspot.com/
>
> Check out the third point...
>
> Let me know if you get it all right. Your comments will be appreciated.
>
> Regards,
> Ninad
>
>
> On Sat, Mar 21, 2009 at 6:32 AM, Rodrigo Reyes C. 
> wrote:
>
>> Hi
>>
>> I have configured my eclipse project as stated here
>>
>> http://wiki.apache.org/nutch/RunNutchInEclipse0.9
>>
>> Still, I am getting the following errors:
>>
>>    - The return type is incompatible with Parser.getParse(Content)
>>      RTFParseFactory.java
>>      nutch/src/plugin/parse-rtf/src/java/org/apache/nutch/parse/rtf, line 52
>>      Java Problem
>>    - Type mismatch: cannot convert from ParseResult to Parse
>>      TestRTFParser.java
>>      nutch/src/plugin/parse-rtf/src/test/org/apache/nutch/parse/rtf, line 78
>>      Java Problem
>>
>> Any ideas on what could be wrong? I already included both
>> http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and
>> http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ jars.
>>
>> Thanks in advance
>>
>> --
>> Rodrigo Reyes C.
>>
>>
>


Re: NUTCH-722 is resolved

2009-03-23 Thread Andrzej Bialecki

Sami Siren wrote:
> I think we are good to go for rc2 and it also seems that the smartest
> thing to do with the package contents at this point is "do not touch them".

I agree.

> I will roll out the new rc later today.

Great, thanks.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com