[jira] Commented: (NUTCH-873) Ivy configuration settings don't include Gora

2010-11-05 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
 ] 

Alexis commented on NUTCH-873:
--

It did not work as seamlessly for me. The Gora build created a 
~/.ivy2/local/org.gora directory.

Should we just hack this setting to build Nutch trunk for now, while waiting 
for Gora to be properly transitioned to Apache?

[nutch]$ svn diff ivy/ivysettings.xml
Index: ivy/ivysettings.xml
===
--- ivy/ivysettings.xml (revision 1031723)
+++ ivy/ivysettings.xml (working copy)
@@ -83,7 +83,7 @@
 rather than look for them online.
 -->
 
-
+
 
 
   

> Ivy configuration settings don't include Gora
> -
>
> Key: NUTCH-873
> URL: https://issues.apache.org/jira/browse/NUTCH-873
> Project: Nutch
>  Issue Type: Bug
>  Components: build
> Environment: Nutch trunk (formerly Nutchbase)
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 2.0
>
>
> The Nutch 2.0 trunk now requires Gora, and even though it's not available in 
> any repository, we should still configure Ivy to depend on it so that the 
> build will work provided you follow the Gora instructions here:
> http://github.com/enis/gora
> I've fixed it locally and will commit an update shortly that takes care of 
> it. In order to compile Nutch trunk now (before we get Gora into a repo), 
> here are the steps (copied from http://github.com/enis/gora):
> {noformat}
> $ git clone git://github.com/enis/gora.git
> $ cd gora 
> $ ant
> {noformat}
> This will install Gora into your local Ivy repo. Then from there on out, just 
> update your Ivy resolver (or alternatively just the Nutch build post this 
> issue being resolved) and you're good.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (NUTCH-873) Ivy configuration settings don't include Gora

2010-11-05 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
 ] 

Alexis edited comment on NUTCH-873 at 11/5/10 3:48 PM:
---

It did not work as seamlessly for me. The Gora build created a 
~/.ivy2/local/org.gora directory, which is not under the org.apache.gora namespace.

  was (Author: alexis779):
It did not work as seamless for me. The gora build created a 
~/.ivy2/local/org.gora directory.

Just hack this setting to build nutch trunk right now, waiting for Gora to be 
properly transitioned to Apache?

[nutch]$ svn diff ivy/ivysettings.xml
Index: ivy/ivysettings.xml
===
--- ivy/ivysettings.xml (revision 1031723)
+++ ivy/ivysettings.xml (working copy)
@@ -83,7 +83,7 @@
 rather than look for them online.
 -->
 
-
+
 
 
   
  



[jira] Issue Comment Edited: (NUTCH-873) Ivy configuration settings don't include Gora

2010-11-05 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
 ] 

Alexis edited comment on NUTCH-873 at 11/5/10 3:51 PM:
---

It did not work as seamlessly for me. The Gora build created a 
~/.ivy2/local/org.gora directory, which is not under the org.apache.gora namespace.

I guess you need to move away from GitHub and use the Apache repository:
$ svn co http://svn.apache.org/repos/asf/incubator/gora/trunk gora
$ cd gora
$ ant

  was (Author: alexis779):
It did not work as seamless for me. The gora build created a 
~/.ivy2/local/org.gora directory, not with org.apache.gora namespace.
  



[jira] Issue Comment Edited: (NUTCH-873) Ivy configuration settings don't include Gora

2010-11-05 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
 ] 

Alexis edited comment on NUTCH-873 at 11/5/10 3:52 PM:
---

It did not work as seamlessly for me. The Gora build created a 
~/.ivy2/local/org.gora directory, which is not under the org.apache.gora namespace.

I guess you need to move away from GitHub and use the Apache repository:
{noformat} 
$ svn co http://svn.apache.org/repos/asf/incubator/gora/trunk gora
$ cd gora
$ ant
{noformat} 

  was (Author: alexis779):
It did not work as seamless for me. The gora build created a 
~/.ivy2/local/org.gora directory, not with org.apache.gora namespace.

I guess you need to move away from Github and go to Apache:
$ svn co http://svn.apache.org/repos/asf/incubator/gora/trunk gora
$ cd gora
$ ant
  



[jira] Commented: (NUTCH-880) REST API for Nutch

2010-11-05 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928896#action_12928896
 ] 

Alexis commented on NUTCH-880:
--

This revision introduced a bug in the nutch inject command. It now throws a 
NullPointerException.

Please take a look at:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/InjectorJob.java?annotate=1028235&pathrev=1028235

Make sure the first element in the array is not null:

{noformat}
Index: src/java/org/apache/nutch/crawl/InjectorJob.java
===
--- src/java/org/apache/nutch/crawl/InjectorJob.java (revision 1031881)
+++ src/java/org/apache/nutch/crawl/InjectorJob.java (working copy)
@@ -242,6 +242,7 @@
 job.setReducerClass(Reducer.class);
 job.setNumReduceTasks(0);
 job.waitForCompletion(true);
+jobs[0] = job;

 job = new NutchJob(getConf(), "inject-p2 " + args[0]);
 StorageUtils.initMapperJob(job, FIELDS, String.class,
{noformat}
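
To illustrate why the one-line fix matters, here is a minimal, self-contained sketch. It does not use any Nutch or Hadoop classes: FakeJob, runBuggy, runFixed and firstJobIsSet are hypothetical names, and the two-element array is an assumption about how the jobs are collected. It only shows that a slot which is never assigned stays null, so any caller that dereferences jobs[0] would hit a NullPointerException.

```java
public class InjectJobsSketch {
    /** Hypothetical stand-in for a job object; not a real Nutch class. */
    static final class FakeJob {
        final String name;
        FakeJob(String name) { this.name = name; }
    }

    /** Mimics the reported bug: the first job is created but never stored. */
    static FakeJob[] runBuggy() {
        FakeJob[] jobs = new FakeJob[2];
        FakeJob job = new FakeJob("inject-p1");
        // The first job would run here, but jobs[0] is left unassigned (null).
        job = new FakeJob("inject-p2");
        jobs[1] = job;
        return jobs;
    }

    /** Mimics the fix: the first job is stored before the variable is reused. */
    static FakeJob[] runFixed() {
        FakeJob[] jobs = new FakeJob[2];
        FakeJob job = new FakeJob("inject-p1");
        jobs[0] = job; // corresponds to the line added by the diff
        job = new FakeJob("inject-p2");
        jobs[1] = job;
        return jobs;
    }

    /** True when the first slot can be dereferenced safely. */
    static boolean firstJobIsSet(FakeJob[] jobs) {
        return jobs[0] != null;
    }

    public static void main(String[] args) {
        System.out.println("buggy, first job set: " + firstJobIsSet(runBuggy()));
        System.out.println("fixed, first job set: " + firstJobIsSet(runFixed()));
    }
}
```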


> REST API for Nutch
> --
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: API-2.patch, API.patch
>
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.




[jira] Commented: (NUTCH-899) java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1

2010-12-10 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970336#action_12970336
 ] 

Alexis commented on NUTCH-899:
--

I ran into the exact same issue with MySQL. The BLOB column type can only 
store a value whose length L is strictly less than 2^16 = 65536 bytes, so the 
maximum is 65535 bytes. 
See http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html

I believe you just need to decrement http.content.limit from 65536 to 65535 in 
conf/nutch-default.xml...
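
The off-by-one can be checked with a few lines of Java. This is only a sketch of the arithmetic, not Nutch or Gora code; the class name, constant name and fitsInBlob helper are hypothetical.

```java
public class BlobLimitSketch {
    /** Largest payload a MySQL BLOB column accepts: 2^16 - 1 bytes. */
    static final int MYSQL_BLOB_MAX_BYTES = (1 << 16) - 1;

    /** True when content of the given length fits into a BLOB column. */
    static boolean fitsInBlob(int contentLength) {
        return contentLength >= 0 && contentLength <= MYSQL_BLOB_MAX_BYTES;
    }

    public static void main(String[] args) {
        // A limit of 65536 allows content that is one byte too long.
        System.out.println("65536 fits: " + fitsInBlob(65536));
        // A limit of 65535 is the largest value that still fits.
        System.out.println("65535 fits: " + fitsInBlob(65535));
    }
}
```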


> java.sql.BatchUpdateException: Data truncation: Data too long for column 
> 'content' at row 1
> ---
>
> Key: NUTCH-899
> URL: https://issues.apache.org/jira/browse/NUTCH-899
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.0
> Environment: ubuntu 10.04
> JVM : 1.6.0_20
> nutch 2.0 (trunk)
> Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed 
>Reporter: Faruk Berksöz
>Priority: Minor
>
> When I try to fetch a web page (e.g. 
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html ) with the MySQL storage 
> definition, I see the following error in my Hadoop logs (there is no error 
> with HBase):
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too 
> long for column 'content' at row 1
> at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
> at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
> at org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> The type of the column 'content' is BLOB.
> It may be important for the next developments of Gora.




[jira] Updated: (NUTCH-899) java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1

2010-12-18 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-899:
-

Attachment: httpContentLimit.patch

We stick with the default Gora schema for the MySQL backend, which declares the 
content field as "bytes" in the Avro definition; this is translated into "blob" 
in MySQL. From src/gora/webpage.avsc:
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
{"name": "content", "type": "bytes"},
   ]
}


There is a potential bug in protocol-http. The http.content.limit value might be 
exceeded by a few bytes, hence the error saying that the value is too big for the 
MySQL blob column type, even though we explicitly force http.content.limit to 
the 65535 max size.

I tried to come up with a unit test for this, which is rather imperfect. Please 
see it in the attached patch. It changes http.content.limit from 65536 to 65535 
when fetching a URL whose body is big enough. The first test should see 
the error, the second should not.

Ideally we want to generate the content with a local server for the unit test 
instead of using a random internet url. That remains to be implemented in the 
test.
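
As an illustration of how a content limit can be enforced strictly, here is a minimal sketch of a bounded read. It is not the protocol-http code and not the attached patch; readAtMost and the class name are hypothetical, and an in-memory stream stands in for the HTTP response body. The point is that each read requests no more than the remaining budget, so the result can never exceed the limit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class BoundedReadSketch {
    /** Reads at most 'limit' bytes from the stream and returns them. */
    static byte[] readAtMost(InputStream in, int limit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int remaining = limit;
        try {
            while (remaining > 0) {
                // Never ask for more than the remaining budget.
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n == -1) {
                    break; // end of stream before the limit was reached
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = new byte[70000]; // stand-in for a large response body
        byte[] kept = readAtMost(new ByteArrayInputStream(body), 65535);
        System.out.println("bytes kept: " + kept.length);
    }
}
```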




[jira] Created: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-01 Thread Alexis (JIRA)
Content-Length limit, URL filter and few minor issues
-

 Key: NUTCH-950
 URL: https://issues.apache.org/jira/browse/NUTCH-950
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
Reporter: Alexis


1. crawl command (nutch1.patch)

The class was renamed to Crawler but the references to it were not updated.


2. URL filter (nutch2.patch)

This avoids an NPE on bogus URLs whose host does not have a suffix.


3. Content-Length limit (nutch3.patch)

This is related to NUTCH-899.
The patch prevents the entire flush operation on the Gora datastore from 
crashing when the MySQL blob limit is exceeded by a few bytes. Both the 
protocol-http and protocol-httpclient plugins were affected.


4. Ivy configuration (nutch4.patch)
- Change xercesImpl and restlet versions. These 2 version changes are required: 
the current xercesImpl version makes a JUnit test crash, and the current restlet 
version is missing from the default Maven repository.

- Add gora-hbase and zookeeper (an HBase dependency). Add the MySQL connector. 
These jars are necessary to run Gora with HBase or MySQL datastores. (This is 
more a suggestion than a requirement.)

- Add com.jcraft/jsch, which is a protocol-sftp plugin dependency. 





[jira] Updated: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-01 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-950:
-

Attachment: nutch4.patch




[jira] Updated: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-01 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-950:
-

Attachment: nutch3.patch
nutch2.patch
nutch1.patch




[jira] Created: (NUTCH-955) Ivy configuration

2011-01-10 Thread Alexis (JIRA)
Ivy configuration
-

 Key: NUTCH-955
 URL: https://issues.apache.org/jira/browse/NUTCH-955
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Alexis


As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to 
help set up the Gora backend more easily.
If the user does not want to stick with the default HSQL database, other 
alternatives exist, such as MySQL and HBase.

org.restlet and xercesImpl versions should be changed as well.




[jira] Updated: (NUTCH-955) Ivy configuration

2011-01-10 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-955:
-

Attachment: ivy.patch

In the patch, the required dependencies for MySQL and HBase are included in the 
Ivy config, but commented out. It's up to the user to choose their own backend 
to store the data.

The following 3 points are minor issues, but fixing them makes the build behave 
better under Eclipse:

- The call to "nutch.root" property set in build.xml for ant should be replaced 
in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" ivy property.
- The 2.0.1 version of restlet dependency does not exist in the maven 
repository, so you want to manually change it to 2.0.0.
- The xerces (XML parser) implementation needs to be upgraded from 2.6.2 to 
2.9.1, otherwise you'll see exceptions while running a JUnit test.




[jira] Issue Comment Edited: (NUTCH-955) Ivy configuration

2011-01-10 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979525#action_12979525
 ] 

Alexis edited comment on NUTCH-955 at 1/10/11 5:27 AM:
---

In the patch, the required dependencies for MySQL and HBase are included in the 
Ivy config, but commented out as suggested in Julien's comment. It's up to the 
user to choose their own backend to store the data.

The following 3 points are minor issues, but fixing them makes the build behave 
better under Eclipse:

- The call to "nutch.root" property set in build.xml for ant should be replaced 
in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" ivy property.
- The 2.0.1 version of restlet dependency does not exist in the maven 
repository, so you want to manually change it to 2.0.0.
- The xerces (XML parser) implementation needs to be upgraded from 2.6.2 to 
2.9.1, otherwise you'll see exceptions while running a JUnit test.

  was (Author: alexis779):
In the patch, the required dependencies for MySQL and HBase are included in 
the Ivy config, but commented out. It's up to the user to use his own backend 
to store the data.

Following 3 points are minor issues but the fixes allow to play more nicely 
under Eclipse:

- The call to "nutch.root" property set in build.xml for ant should be replaced 
in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" ivy property.
- The 2.0.1 version of restlet dependency does not exist in the maven 
repository, so you want to manually change it to 2.0.0.
- The xerces (XML parser) implementation needs to be upgraded from 2.6.2 to 
2.9.1, otherwise you'll see exceptions while running a JUnit test.
  



[jira] Resolved: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-10 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis resolved NUTCH-950.
--

   Resolution: Fixed
Fix Version/s: 2.0

Sorry I missed the Ivy configuration file in the plugin directory.

See NUTCH-955 for the new Ivy issue.




[jira] Created: (NUTCH-956) soldindex issues

2011-01-13 Thread Alexis (JIRA)
soldindex issues


 Key: NUTCH-956
 URL: https://issues.apache.org/jira/browse/NUTCH-956
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.0
Reporter: Alexis


I ran into a few caveats with the solrindex command when trying to index documents.
Please refer to 
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
describes my tests.




[jira] Updated: (NUTCH-956) soldindex issues

2011-01-13 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-956:
-

Attachment: solr.patch

Here are the changes:

- Avoid multiple values for id field. (NUTCH-819)
- Allow multiple values for tag field. Add tld (Top Level Domain) field.
- Get the content-type from WebPage object's member. Otherwise, you will see 
NullPointerExceptions.
- Compare strings with equals(). This is a somewhat unrelated change, but it 
avoids some surprises.
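
For the string comparison bullet, here is a small sketch of the kind of surprise that equals() avoids. It assumes the replaced code compared strings with ==, which the comment does not state explicitly. The class and method names are hypothetical and unrelated to the patch.

```java
public class StringEqualsSketch {
    /** Reference comparison: true only when both variables point to the same object. */
    static boolean sameByReference(String a, String b) {
        return a == b;
    }

    /** Value comparison: true when both strings hold the same characters. */
    static boolean sameByValue(String a, String b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "text/html";
        // new String(...) forces a distinct object with the same characters,
        // similar to a value read from a stored record at runtime.
        String built = new String("text/html");
        System.out.println("== : " + sameByReference(literal, built));
        System.out.println("equals : " + sameByValue(literal, built));
    }
}
```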




[jira] Updated: (NUTCH-956) solrindex issues

2011-01-13 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-956:
-

Summary: solrindex issues  (was: soldindex issues)




[jira] Commented: (NUTCH-955) Ivy configuration

2011-01-18 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983125#action_12983125
 ] 

Alexis commented on NUTCH-955:
--

Sorry, please disregard the first bullet (about nutch.root) in the previous 
comment and in the patch. That change would break the build: the basedir 
variable holds the plugin's base directory ("Nutch2.0/src/plugin/protocol-sftp"), 
so with this patch the build fails with an error saying 
ivy/ivy-configurations.xml is not found.

I need to figure out how to load this nutch.root variable in the Ivy plugin in 
Eclipse.




[jira] Created: (NUTCH-965) Parsing takes up 100% CPU

2011-02-08 Thread Alexis (JIRA)
Parsing takes up 100% CPU
-

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis


The issue you're likely to run into when parsing truncated FLV files is 
described here:
http://www.mail-archive.com/user@nutch.apache.org/msg01880.html

The parser library gets stuck in an infinite loop when it encounters corrupted 
data, for example when big binary files are truncated at fetch time.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-965) Parsing takes up 100% CPU

2011-02-08 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-965:
-

Attachment: parserJob.patch

In the parser mapper, compare the Content-Length header to the size of the 
content buffer to see if they match.

If this HTTP header is available and the file was truncated, skip the parsing 
step so that the parser does not get stuck in an infinite loop taking up all 
the CPU resources.
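
The check described above could look roughly like the following. This is a 
minimal standalone sketch, not the code from parserJob.patch: the class name 
TruncationCheck, the method isTruncated and the plain Map of headers are 
hypothetical stand-ins for whatever the parser mapper actually has at hand. It 
assumes that "truncated" means the buffer holds fewer bytes than the header 
declares, and it looks the header up case-sensitively for simplicity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of Nutch).
class TruncationCheck {

    /**
     * Returns true only when a Content-Length header is present, parses as a
     * number, and declares more bytes than the content buffer actually holds.
     */
    static boolean isTruncated(Map<String, String> headers, byte[] content) {
        String declared = headers.get("Content-Length");
        if (declared == null) {
            return false; // header not available: nothing to compare against
        }
        long expected;
        try {
            expected = Long.parseLong(declared.trim());
        } catch (NumberFormatException e) {
            return false; // malformed header: do not skip parsing because of it
        }
        int actual = (content == null) ? 0 : content.length;
        return actual < expected;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Length", "4527822");
        byte[] fetched = new byte[63980]; // sizes taken from the log excerpt in this comment

        if (isTruncated(headers, fetched)) {
            System.out.println("skipped. Content of size " + headers.get("Content-Length")
                    + " was truncated to " + fetched.length);
        } else {
            System.out.println("parse as usual");
        }
    }
}
```

When the header is missing or malformed the sketch falls back to parsing as 
usual, which matches the "if this HTTP header is available" condition above.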


Before, in the logs, we would see:

{noformat}
2011-02-07 14:03:34,693 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb1.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:03:34,693 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb1.flv of type video/x-flv
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/dtj.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/dtj.flv of type video/x-flv
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb2.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb2.flv of type video/x-flv
{noformat}

After:

{noformat}
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb1.flv skipped. Content of size 4527822 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/dtj.flv skipped. Content of size 2692082 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb2.flv skipped. Content of size 35496213 was truncated to 61058
{noformat}




> Parsing takes up 100% CPU
> -
>
> Key: NUTCH-965
> URL: https://issues.apache.org/jira/browse/NUTCH-965
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: Alexis
> Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is 
> described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters 
> corrupted data, for example when big binary files are truncated at fetch time.





[jira] Updated: (NUTCH-965) Skip parsing for truncated documents

2011-02-10 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-965:
-

Summary: Skip parsing for truncated documents  (was: Parsing takes up 100% 
CPU)

> Skip parsing for truncated documents
> 
>
> Key: NUTCH-965
> URL: https://issues.apache.org/jira/browse/NUTCH-965
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: Alexis
> Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is 
> described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters 
> corrupted data, for example when big binary files are truncated at fetch time.





[jira] [Commented] (NUTCH-956) solrindex issues

2011-07-12 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064148#comment-13064148
 ] 

Alexis commented on NUTCH-956:
--

I do get the NPE when indexing the URL http://www.truveo.com/ (its 
Content-Type header is "Content-Type: text/html; charset=utf-8") without the 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java 
patch.

{code}
java.lang.NullPointerException
    at org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:204)
    at org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:78)
    at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
    at org.apache.nutch.indexer.IndexerReducer.reduce(IndexerReducer.java:73)
    at org.apache.nutch.indexer.IndexerReducer.reduce(IndexerReducer.java:1)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
{code}
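
The trace above does not show which reference was null at 
MoreIndexingFilter.java:204, so the following is only a generic sketch of 
handling a possibly missing Content-Type value defensively. It is not the 
content of the attached patch; the class ContentTypeGuard and the method 
mimeType are hypothetical names made up for illustration.

```java
// Hypothetical helper, for illustration only (not part of Nutch).
class ContentTypeGuard {

    /**
     * Returns the bare MIME type (for example "text/html") from a Content-Type
     * header value, or null when the value is missing or empty. Parameters
     * such as "; charset=utf-8" are dropped.
     */
    static String mimeType(String contentTypeValue) {
        if (contentTypeValue == null) {
            return null; // guard: callers must handle a missing header
        }
        int semicolon = contentTypeValue.indexOf(';');
        String type = (semicolon >= 0)
                ? contentTypeValue.substring(0, semicolon)
                : contentTypeValue;
        type = type.trim();
        return type.isEmpty() ? null : type;
    }

    public static void main(String[] args) {
        System.out.println(mimeType("text/html; charset=utf-8")); // text/html
        System.out.println(mimeType(null));                       // null
    }
}
```

A caller would then check the result for null before adding a type field to 
the document, instead of dereferencing it directly.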


See attached patch "solr.patch2".
If you have time, can you please go ahead and run the entire test suite as well:

1 InjectorJob
2 GeneratorJob
3 FetcherJob
4 ParserJob
5 DbUpdaterJob
6 SolrIndexerJob
(Finally, check the index with 
http://localhost:8983/solr/select/?q=video&indent=on in the browser)

at least on this seed URL:
- http://www.truveo.com/


Regarding the String comparison in Java, I believe people usually call 
String.equals instead of using the == operator, which compares object 
references rather than contents.
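
To illustrate the point about String comparison, here is a small standalone 
sketch. The class StringCompare and the method sameContent are hypothetical 
names used only for this example; they do not come from the patch.

```java
import java.util.Objects;

// Hypothetical example class, for illustration only.
class StringCompare {

    /** Null-safe content comparison; Objects.equals delegates to String.equals. */
    static boolean sameContent(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        String literal = "text/html";
        String built = new String("text/html"); // distinct object, same characters

        System.out.println(literal == built);            // false: different references
        System.out.println(literal.equals(built));       // true: same contents
        System.out.println(sameContent(literal, built)); // true
        System.out.println(sameContent(null, built));    // false, and no NPE
    }
}
```

Using == on strings can appear to work when both sides happen to be interned 
literals, which is why such a bug may go unnoticed until a value is built at 
runtime.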

> solrindex issues
> 
>
> Key: NUTCH-956
> URL: https://issues.apache.org/jira/browse/NUTCH-956
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.0
>Reporter: Alexis
> Fix For: 1.4, 2.0
>
> Attachments: solr.patch
>
>
> I ran into a few caveats with the solrindex command when trying to index 
> documents.
> Please refer to 
> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
> describes my tests.





[jira] [Updated] (NUTCH-956) solrindex issues

2011-07-12 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-956:
-

Attachment: solr.patch2

- NPE related to content-type field
- tld field in Solr schema
- string comparison in Java

> solrindex issues
> 
>
> Key: NUTCH-956
> URL: https://issues.apache.org/jira/browse/NUTCH-956
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.0
>Reporter: Alexis
> Fix For: 1.4, 2.0
>
> Attachments: solr.patch, solr.patch2
>
>
> I ran into a few caveats with the solrindex command when trying to index 
> documents.
> Please refer to 
> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
> describes my tests.
