[jira] [Commented] (NUTCH-2201) Remove loops program from webgraph package

2016-01-19 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106667#comment-15106667
 ] 

Dennis Kubes commented on NUTCH-2201:
-

+1 on this.  

The loops program, IIRC, has factorial running time.  Beyond a depth of around 3, 
depending on resources and input, the time it takes to run is excessive.  It 
does find cycles in the webgraph and that can be useful as that is one way 
people try to game the search, but there have to be better algorithms.
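
For comparison, groups of pages that link to each other in cycles can be found in time linear in the number of pages and links. The following is a minimal sketch, not code from the webgraph package: it uses Tarjan's strongly connected components algorithm, and the class name LinkCycles, the integer page ids, and the adjacency-list input are all assumptions made for illustration. It reports groups of mutually reachable pages rather than enumerating every individual cycle, and it ignores pages that only link to themselves.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration, not part of Nutch: reports groups of nodes that
// lie on common cycles (strongly connected components with more than one node).
class LinkCycles {
  private final List<List<Integer>> adj;
  private final int[] index;
  private final int[] low;
  private final boolean[] onStack;
  private final Deque<Integer> stack = new ArrayDeque<>();
  private final List<List<Integer>> cyclicGroups = new ArrayList<>();
  private int counter = 0;

  private LinkCycles(List<List<Integer>> adj) {
    this.adj = adj;
    int n = adj.size();
    index = new int[n];
    low = new int[n];
    onStack = new boolean[n];
    Arrays.fill(index, -1); // -1 marks a node that has not been visited yet
  }

  // adj.get(v) lists the nodes that node v links to.
  static List<List<Integer>> find(List<List<Integer>> adj) {
    LinkCycles lc = new LinkCycles(adj);
    for (int v = 0; v < adj.size(); v++) {
      if (lc.index[v] == -1) lc.visit(v);
    }
    return lc.cyclicGroups;
  }

  // Recursive depth-first visit; a very deep graph would need an iterative version.
  private void visit(int v) {
    index[v] = low[v] = counter++;
    stack.push(v);
    onStack[v] = true;
    for (int w : adj.get(v)) {
      if (index[w] == -1) {
        visit(w);
        low[v] = Math.min(low[v], low[w]);
      } else if (onStack[w]) {
        low[v] = Math.min(low[v], index[w]);
      }
    }
    if (low[v] == index[v]) { // v is the root of a component
      List<Integer> component = new ArrayList<>();
      int w;
      do {
        w = stack.pop();
        onStack[w] = false;
        component.add(w);
      } while (w != v);
      if (component.size() > 1) cyclicGroups.add(component);
    }
  }

  public static void main(String[] args) {
    // 0 -> 1 -> 2 -> 0 forms a cycle; node 3 only links into it.
    List<List<Integer>> adj =
        List.of(List.of(1), List.of(2), List.of(0), List.of(0));
    System.out.println(find(adj));
  }
}
```

Each node and each link is handled a constant number of times, so the cost grows with the size of the graph rather than with the search depth.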

> Remove loops program from webgraph package
> --
>
> Key: NUTCH-2201
> URL: https://issues.apache.org/jira/browse/NUTCH-2201
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
>
> Recently Dennis mentioned that the loops program is a bad program. As the developer 
> of the package, he recommends not using it.
> {quote}
> 2. Crawl the pages for 1 shard.  Update the WebGraph and Linkrank as
>described here.  https://wiki.apache.org/nutch/NewScoring. Don't use
>Loops.  It was a bad program with a bad algorithm and I never should
>have put it in.  Live and learn.
> {quote}
> See: https://www.mail-archive.com/user@nutch.apache.org/msg14164.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Moving to Git

2016-01-10 Thread Dennis Kubes

+1

On 01/08/2016 02:46 AM, Chris Mattmann wrote:

Hi Everyone,

I proposed this earlier, and we said we’d wait until after the
1.11 release. So it’s time to VOTE to move Nutch to Git. So
far, the following people have expressed +1s and if I don’t hear
otherwise, I will implicitly count their VOTE from the DISCUSS
thread:

+1 PMC

Chris Mattmann*
Sebastian Nagel*
Michael Joyce*
Asitang Mishra*
Dennis Kubes*
BlackIce

Everyone else (or those above that would like to amend their VOTE),
please VOTE below. I will leave the VOTE open for at least 72 hours.

[x ] +1 Move the Nutch SCM to Writeable Git repositories at the ASF.
[ ] +0 No opinion.
[ ] -1 Don’t move the Nutch SCM to Writeable Git repositories at the
ASF because…

Please note, I created a page for Tika that is worth checking out and
perhaps copying over to the Nutch wiki:

http://wiki.apache.org/tika/UsingGit

Please have a look as I think it will help with our workflows too.

Cheers,
Chris




-Original Message-
From: jpluser <chris.a.mattm...@jpl.nasa.gov>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Wednesday, November 18, 2015 at 7:39 PM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: [DISCUSS] Moving to Git


Hi All,

I propose that we consider moving to ASF supported writeable git
repos for Nutch. This would entail moving Nutch’s canonical repo
from:

https://svn.apache.org/repos/asf/nutch

TO

https://git-wip-us.apache.org/repos/asf/nutch.git


We are already accepting PRs and so forth from Github and I think
many of us are using Git in our regular day to day workflows.

Thoughts?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++









Re: [VOTE] Moving to Git

2016-01-10 Thread Dennis Kubes

So sad. :)

On 01/08/2016 03:28 AM, Julien Nioche wrote:

+1 to move to Git

Note : I don't think Dennis is on the PMC anymore

Ju


--
Open Source Solutions for Text Engineering

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>




Re: [DISCUSS] Moving to Git

2015-11-21 Thread Dennis Kubes

I know I don't post much on the list these days, but I do watch.

+1 from me on moving to git as well.

Dennis

On 11/20/2015 05:38 PM, Michael Joyce wrote:

+1 from me


-- Jimmy

On Thu, Nov 19, 2015 at 1:32 PM, Sebastian Nagel wrote:


+1 from me

But, please, after 1.11 and 2.3.1 have been finally released.
There is still a little work to do, and we should keep the focus on the
releases first.

Sebastian

On 11/19/2015 04:39 AM, Mattmann, Chris A (3980) wrote:
> Hi All,
>
> I propose that we consider moving to ASF supported writeable git
> repos for Nutch. This would entail moving Nutch’s canonical repo
> from:
>
> https://svn.apache.org/repos/asf/nutch
>
> TO
>
> https://git-wip-us.apache.org/repos/asf/nutch.git
>
>
> We are already accepting PRs and so forth from Github and I think
> many of us are using Git in our regular day to day workflows.
>
> Thoughts?
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov 
> WWW: http://sunset.usc.edu/~mattmann/

> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>






Re: [VOTE] Move 2.0 out of trunk

2011-09-18 Thread Dennis Kubes

+1

On 09/18/2011 04:21 AM, Julien Nioche wrote:

Hi,

Following the discussions [1] on the dev-list about the future of 
Nutch 2.0, I would like to call for a vote on moving Nutch 2.0 from 
the trunk to a separate branch, promote 1.4 to trunk and consider 2.0 
as unmaintained. The arguments for / against can be found in the 
thread I mentioned.


The vote is open for the next 72 hours.

[ ] +1 : Shelve 2.0 and move 1.4 to trunk
[] 0 : No opinion
[] -1 : Bad idea.  Please give justification.

Thanks

Julien

[1] http://www.mail-archive.com/gora-dev@incubator.apache.org/msg00483.html
http://mail-archives.apache.org/mod_mbox/nutch-dev/201109.mbox/%3cca+-fm0tj2kvuco0wwkxbj6hsamxx5819ujv7lco2vo2kd2z...@mail.gmail.com%3E


--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


LinkedIn Group

2011-01-06 Thread Dennis Kubes
Do any of the committers / nutch users want to become an admin to the 
Nutch LinkedIn group?  I haven't been very good about maintaining it so 
it would be good to have more people involved.


Dennis


Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Dennis Kubes

 +1  good to go

On 09/24/2010 01:40 PM, Mattmann, Chris A (388J) wrote:
Thanks Andrzej, appreciate it. I know you've been really vigilant with 
the other RCs I've thrown up about testing and I appreciate it. Other 
Nutch PMC'ers: just need one more VOTE. Help, please? :)


Cheers,
Chris


On 9/24/10 11:38 AM, Andrzej Bialecki a...@getopt.org wrote:

On 2010-09-24 04:38, Mattmann, Chris A (388J) wrote:
 Hi Nutch PMC:

 /nudge

 Anyone get a chance to review this yet? I have some free cycles tomorrow
 and would really think it's cool if I could finally push out the 1.2 RC.

I had little time this week, but I'm testing it now... I should be done
tomorrow.


--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] Created: (NUTCH-908) Infinite Loop and Null Pointer Bugs in Searching

2010-09-16 Thread Dennis Kubes (JIRA)
Infinite Loop and Null Pointer Bugs in Searching


 Key: NUTCH-908
 URL: https://issues.apache.org/jira/browse/NUTCH-908
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1, 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.2


It is possible for the NutchBean to drop into an infinite loop while trying to 
optimize a query to re-search for more results.  There are also two null 
pointer bugs in the search process: one in NutchBean, where there was an 
incorrect loop assignment, and a second in DistributedSegmentBean when a 
segment is null (this shouldn't happen, but it should still be handled).  A 
patch is available for both.
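
As a generic illustration of guarding this kind of re-search loop, the sketch below caps the number of rounds, stops as soon as a round yields no more usable hits than the previous one, and tolerates a null result. It is not the NutchBean code or the attached patch; the class BoundedReSearch, the collect method, and the idea of modelling the search as a function from request size to number of usable hits are all assumptions made for illustration.

```java
import java.util.function.IntFunction;

// Generic sketch, not the NutchBean implementation: re-run a search with a
// growing request size, but always terminate.
class BoundedReSearch {

  // usableHitsFor.apply(n) stands in for "search for n raw hits and return how
  // many usable hits came back"; it may return null.
  static int collect(IntFunction<Integer> usableHitsFor, int wanted, int maxRounds) {
    int request = wanted;
    int found = 0;
    for (int round = 0; round < maxRounds; round++) {
      Integer result = usableHitsFor.apply(request);
      int now = (result == null) ? 0 : result;      // handle a null result
      if (now >= wanted) return wanted;             // enough hits, done
      if (round > 0 && now <= found) return found;  // no progress, give up
      found = now;
      // Double the request size without overflowing int.
      request = (request > Integer.MAX_VALUE / 2) ? Integer.MAX_VALUE : request * 2;
    }
    return found;                                   // round cap reached
  }

  public static void main(String[] args) {
    // A source that never yields more than 7 usable hits cannot satisfy 10.
    System.out.println(collect(n -> Math.min(n / 2, 7), 10, 20));
  }
}
```

The two exit conditions are independent: the progress check ends the loop early when asking for more cannot help, and the round cap bounds the work even if the progress check is never triggered.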

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-908) Infinite Loop and Null Pointer Bugs in Searching

2010-09-16 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-908:
---

Attachment: NUTCH-908-1-20100916.patch

Fixes infinite loop and null pointer bugs.

 Infinite Loop and Null Pointer Bugs in Searching
 

 Key: NUTCH-908
 URL: https://issues.apache.org/jira/browse/NUTCH-908
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0, 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.2

 Attachments: NUTCH-908-1-20100916.patch

   Original Estimate: 4h
  Remaining Estimate: 4h

 It is possible for the NutchBean to drop into an infinite loop while trying 
 to optimize a query to re-search for more results.  There are also two null 
 pointer bugs in the search process: one in NutchBean, where there was an 
 incorrect loop assignment, and a second in DistributedSegmentBean when a 
 segment is null (this shouldn't happen, but it should still be handled).  A 
 patch is available for both.




[jira] Created: (NUTCH-877) Allow setting of slop values for non-quote phrase queries on query-basic plugin

2010-08-09 Thread Dennis Kubes (JIRA)
Allow setting of slop values for non-quote phrase queries on query-basic plugin
---

 Key: NUTCH-877
 URL: https://issues.apache.org/jira/browse/NUTCH-877
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.2
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.2


Patch adds a configuration variable for setting slop values on phrase queries.  
The default slop value, which currently can't be changed through configuration, 
is Integer.MAX_VALUE.  It produces something like this, which doesn't seem 
right to me.  If you are searching for a phrase you usually want it within a 
certain distance:

2.9141337E-4 = weight(content:my phrase~2147483647 in 1029), product of:

* 0.07163286 = queryWeight(content:my phrase~2147483647), product of:
  o 9.657982 = idf(content: my=13470 phrase=534)
  o 0.0074169594 = queryNorm

This patch adds the query.phrase.slop configuration value to the 
nutch-default.xml file.  It has a default setting of 5.
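
To make the effect of the slop value concrete, here is a simplified sketch. It is not Lucene's or Nutch's matching code: it only handles two terms in their original order and treats slop as the maximum number of other tokens allowed between them, whereas Lucene's sloppy phrase matching is more general. The class SlopDemo and the method matchesInOrder are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration of phrase slop; not the Lucene algorithm.
class SlopDemo {

  // True if `second` occurs after `first` with at most `slop` other tokens
  // between the two terms.
  static boolean matchesInOrder(List<String> tokens, String first, String second, int slop) {
    for (int i = 0; i < tokens.size(); i++) {
      if (!tokens.get(i).equals(first)) continue;
      for (int j = i + 1; j < tokens.size(); j++) {
        if (tokens.get(j).equals(second) && (j - i - 1) <= slop) return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> far = Arrays.asList(
        "my", "dog", "ate", "a", "very", "long", "and", "boring", "phrase");
    List<String> near = Arrays.asList("my", "favourite", "phrase");

    // With an unbounded slop, any document containing both terms in order matches.
    System.out.println(matchesInOrder(far, "my", "phrase", Integer.MAX_VALUE));
    // With a slop of 5, the terms must be reasonably close together.
    System.out.println(matchesInOrder(far, "my", "phrase", 5));
    System.out.println(matchesInOrder(near, "my", "phrase", 5));
  }
}
```

In the first list the two terms are separated by seven other tokens, so only the unbounded slop accepts it; the second list passes with a slop of 5.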




Re: Benchmark of Nutch trunk

2010-07-30 Thread Dennis Kubes

Very nice.

On 07/30/2010 05:07 PM, Andrzej Bialecki wrote:

Hi,

We have a simple crawling benchmark now in trunk. Here's how to use it:

* in one console execute 'ant proxy'. This will start on port 8181 a 
proxy server that produces fake pages.


* in another console execute 'ant benchmark'. This will run 5 rounds 
of fetching (~16,000 pages) using that proxy server.


There are already some interesting issues I noticed. First, on 
reasonably good hardware in local mode I was able to fetch and process 
(NOTE: this includes ALL steps, i.e. generate, fetch, parse, crawldb 
update and invertlinks) 16k pages in 400 sec. This means a total 
crawling throughput of 40 pages/sec. This is in local mode, so in 
distributed mode I guess we would be getting this number times the 
number of tasks.


Secondly, it seems that Fetcher has some synchronization issues in its 
queue management - even if other queues are non-empty, but one of the 
queues blocks, the Fetcher will spin-wait all threads until an item 
becomes available on that queue, and then it starts to happily consume 
items from all non-blocking queues (including this one). The process 
then repeats - one queue blocks, and all threads stop getting items 
from other queues... At the moment I can't figure out where this 
lock-up is happening, but the symptoms are obvious when you look at 
the logs in real-time.


More stuff to come on this subject - at least we have a tool to 
experiment with :)




Re: [jira] Created: (NUTCH-857) DistributedBeans should not close their RPC counterparts

2010-07-22 Thread Dennis Kubes

If nobody objects I am going to commit this in the next 24 hours.

Dennis

On 07/19/2010 04:00 PM, Dennis Kubes (JIRA) wrote:

DistributedBeans should not close their RPC counterparts


  Key: NUTCH-857
  URL: https://issues.apache.org/jira/browse/NUTCH-857
  Project: Nutch
   Issue Type: Bug
 Affects Versions: 1.1
  Environment: All
 Reporter: Dennis Kubes
 Assignee: Dennis Kubes
  Fix For: 1.2
  Attachments: NUTCH-857-1-20100619.patch

DistributedSearch and Segment Beans currently call close on their RPC 
counterparts from their own close methods.  This results in killing (closing) 
all distributed servers when the main bean (website, application, etc.) is 
shut down.  DistributedSearchServer (SegmentServer) instances are run 
independently of the main NutchBean or website calling those servers in 
shard-type environments.  With the current code the distributed servers are 
closed and any further search requests throw IndexAlreadyClosed exceptions.  
The distributed servers have to be restarted before searching can resume.  
Obviously this doesn't work in a large distributed search where multiple beans 
could be calling the distributed servers and where distributed servers could 
be coming up and down frequently.

The solution is simple though.  The Distributed beans shouldn't call close on 
their RPC counterparts.  Patch is attached.
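
The ownership rule described above can be sketched as follows. This is an in-process stand-in, not the Nutch classes or the attached patch: RemoteSearchServer and ClientBean are hypothetical names, and a real deployment would reach the server over RPC rather than through a direct object reference.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Demo entry point: two clients share one long-running server.
class CloseDemo {
  public static void main(String[] args) {
    RemoteSearchServer server = new RemoteSearchServer();
    ClientBean website = new ClientBean(List.of(server));
    ClientBean otherApp = new ClientBean(List.of(server));

    website.close(); // shutting down one client...
    System.out.println(otherApp.search("nutch")); // ...must not break the other
  }
}

// Hypothetical stand-in for a distributed search server with its own lifecycle.
class RemoteSearchServer implements Closeable {
  private boolean open = true;

  String search(String query) {
    if (!open) throw new IllegalStateException("index already closed");
    return "results for " + query;
  }

  boolean isOpen() {
    return open;
  }

  @Override
  public void close() {
    open = false;
  }
}

// Hypothetical stand-in for a client-side distributed bean.
class ClientBean implements Closeable {
  private final List<RemoteSearchServer> servers;
  private boolean closed = false;

  ClientBean(List<RemoteSearchServer> servers) {
    this.servers = new ArrayList<>(servers);
  }

  String search(String query) {
    if (closed) throw new IllegalStateException("client bean is closed");
    return servers.get(0).search(query);
  }

  @Override
  public void close() {
    // Release only client-side state. The servers are not owned by this bean,
    // so server.close() is deliberately not called here.
    closed = true;
    servers.clear();
  }
}
```

The design choice is that whoever starts a server is responsible for stopping it; a client dropping its references leaves the server available to every other caller.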

   


Re: [VOTE] Apache Nutch 1.1 Release Candidate #4

2010-06-14 Thread Dennis Kubes

+1

On 06/14/2010 10:30 AM, Mattmann, Chris A (388J) wrote:

Hey Nutch PMC’ers:

*nudge*

We currently have 2 PMC binding +1's on this VOTE:

Chris Mattmann
Doğacan Güney

Would be great to wrap up the 1.1 release and get another PMC check on
this...thanks!

Cheers,
Chris


On 6/6/10 10:58 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov  wrote:

   

Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The
source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc4/

The major differences between this release and rc #3 are the application of:

---
* NUTCH-818 Parse-tika uses minorCodes instead of majorCodes in ParseStatus,
* NUTCH-819 Included Solr schema.xml and solrindex-mapping.xml don't play
together.
---

based on feedback from prior release candidates.

For more detailed information, see the included CHANGES.txt file for details
on release contents and latest changes. The release was made using the Nutch
release process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

Note:
In response to several user requests during the last RC cycle, I've also
included *binary* releases (labeled as apache-nutch-1.1-bin.tar.gz and
apache-nutch-1.1-bin.zip). This addresses Sami Siren's request that the
tutorial be updated to reflect the fact that this release is a source-only
release.

Sami also requested to integrate RAT into the build; however, in the
interest of getting this 1.1 out and getting going on the Nutch TLP, my
proposal is:

* run RAT and integrate into the build on releases post 1.1

Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours.

Only votes from Nutch PMC are binding, but folks are welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

P.S. Here is my +1.


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



 




   


[jira] Commented: (NUTCH-828) Fetch Filter

2010-06-09 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876970#action_12876970
 ] 

Dennis Kubes commented on NUTCH-828:


Nice.  I didn't realize the signature update would do that.  I am assuming 
since ParseUtil doesn't interact with the CrawlDatum we are going to have to 
call the FetchFilters (I am ok with renaming this btw) twice, once in the 
fetcher and once in the ParseSegment, both dealing with their respective 
CrawlDatum needs?

 Fetch Filter
 

 Key: NUTCH-828
 URL: https://issues.apache.org/jira/browse/NUTCH-828
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch


 Adds a Nutch extension point for a fetch filter.  The fetch filter allows 
 filtering content and parse data/text after it is fetched but before it is 
 written to segments.  The filter returns true if content is to be written 
 or false if it is not.  
 Some use cases for this filter would be topical search engines that only want 
 to fetch/index certain types of content, for example a news-only or 
 sports-only search engine.  In these types of situations the only way to 
 determine if content belongs to a particular set is to fetch the page and 
 then analyze the content.  If the content passes, meaning it belongs to the 
 set of, say, sports pages, then we want to include it.  If it doesn't, then 
 we want to ignore it, never fetch that same page in the future, and ignore 
 any urls on that page.  If content is rejected due to a fetch filter then its 
 status is written to the CrawlDb as gone and its content is ignored and not 
 written to segments.  This effectively stops crawling along the crawl path of 
 that page and the urls from that page.  An example filter, fetch-safe, is 
 provided that allows fetching content that does not contain any word from a 
 list of bad words.




[jira] Created: (NUTCH-828) Fetch Filter

2010-06-07 Thread Dennis Kubes (JIRA)
Fetch Filter


 Key: NUTCH-828
 URL: https://issues.apache.org/jira/browse/NUTCH-828
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1
 Attachments: NUTCH-828-1-20100608.patch

Adds a Nutch extension point for a fetch filter.  The fetch filter allows 
filtering content and parse data/text after it is fetched but before it is 
written to segments.  The filter returns true if content is to be written or 
false if it is not.  

Some use cases for this filter would be topical search engines that only want 
to fetch/index certain types of content, for example a news-only or 
sports-only search engine.  In these types of situations the only way to 
determine if content belongs to a particular set is to fetch the page and then 
analyze the content.  If the content passes, meaning it belongs to the set of, 
say, sports pages, then we want to include it.  If it doesn't, then we want to 
ignore it, never fetch that same page in the future, and ignore any urls on 
that page.  If content is rejected due to a fetch filter then its status is 
written to the CrawlDb as gone and its content is ignored and not written to 
segments.  This effectively stops crawling along the crawl path of that page 
and the urls from that page.  An example filter, fetch-safe, is provided that 
allows fetching content that does not contain any word from a list of bad 
words.
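
The accept-or-reject contract described above can be sketched roughly as follows. This is an illustration under assumptions, not the extension point from the patch: the interface ContentFilter, the class BadWordsFilter, the statusFor helper, and the plain strings used for the status are all hypothetical, and the real filter would work with Nutch's content, parse, and CrawlDatum objects.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a bad-words fetch filter; not the patch's API.
class BadWordsFilter implements ContentFilter {
  private final Set<String> badWords = new HashSet<>();

  BadWordsFilter(Set<String> words) {
    for (String w : words) badWords.add(w.toLowerCase(Locale.ROOT));
  }

  // Returns true if the page may be written to segments, false to reject it.
  @Override
  public boolean accept(String url, String parseText) {
    for (String token : parseText.toLowerCase(Locale.ROOT).split("\\W+")) {
      if (badWords.contains(token)) return false;
    }
    return true;
  }

  // Models the outcome described in the issue: a rejected page is recorded as
  // gone, so it is not written to segments and its outlinks are not followed.
  static String statusFor(ContentFilter filter, String url, String parseText) {
    return filter.accept(url, parseText) ? "fetched" : "gone";
  }

  public static void main(String[] args) {
    ContentFilter filter = new BadWordsFilter(Set.of("casino"));
    System.out.println(statusFor(filter, "http://example.com/a", "Local football results"));
    System.out.println(statusFor(filter, "http://example.com/b", "Best CASINO bonus"));
  }
}

// Hypothetical filter contract: decide after fetching and parsing whether the
// page belongs in the crawl.
interface ContentFilter {
  boolean accept(String url, String parseText);
}
```

A topical filter, such as one for a sports-only engine, would implement the same interface with a classifier in place of the word list.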
