Re: [ANNOUNCE] Web Crawler

2013-05-23 Thread Dominique Bejean

Hi,

Release 3.0.3 was tested with:

* Oracle Java 6, but it should work fine with version 7
* Tomcat 5.5, 6, and 7
* PHP 5.2.x and 5.3.x
* Apache 2.2.x
* MongoDB 64-bit 2.2 (known issue with 2.4)

The new release 4.0.0-alpha-2 is available on GitHub:
https://github.com/bejean/crawl-anywhere


The prerequisites are:

* Oracle Java 6 or higher
* Tomcat 5.5 or higher
* Apache 2.2 or higher
* PHP 5.2.x, 5.3.x, or 5.4.x
* MongoDB 64-bit 2.2 or higher
* Solr 3.x or higher (configuration files provided for Solr 4.3.0)

The up-to-date installation instructions are here:
http://www.crawl-anywhere.com/installation-v400/


Please read the GitHub project home page; all the information is provided there.

Regards.

Dominique




On 23/05/13 07:38, Rajesh Nikam wrote:

Hi,

Crawl Anywhere seems to be using old versions of Java, Tomcat, etc.

http://www.crawl-anywhere.com/installation-v300/

Will it work with newer versions of the required software?

Is there an updated installation guide available?

Thanks
Rajesh





--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com



Re: [ANNOUNCE] Web Crawler

2013-05-22 Thread Rajesh Nikam
Hi,

Crawl Anywhere seems to be using old versions of Java, Tomcat, etc.

http://www.crawl-anywhere.com/installation-v300/

Will it work with newer versions of the required software?

Is there an updated installation guide available?

Thanks
Rajesh





On Wed, May 22, 2013 at 6:48 PM, Dominique Bejean  wrote:

> Hi,
>
> Crawl-Anywhere is now open-source - https://github.com/bejean/crawl-anywhere
>
> Best regards.


Re: [ANNOUNCE] Web Crawler

2013-05-22 Thread Dominique Bejean

Hi,

I just saw this message (again). Please use the new dedicated
Crawl-Anywhere forum for your next questions:

https://groups.google.com/forum/#!forum/crawl-anywhere

Did you solve your problem?

Thank you

Dominique



On 29/01/13 09:28, SivaKarthik wrote:

Hi,
  I resolved the issue "Access denied for user 'crawler'@'localhost' (using
password: YES)".
  The MySQL user crawler/crawler was created and privileges were added as
mentioned in the tutorial.
  Thank you.







--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com
www.mysolrserver.com



Re: [ANNOUNCE] Web Crawler

2013-05-22 Thread Dominique Bejean

Hi,

Crawl-Anywhere is now open-source - https://github.com/bejean/crawl-anywhere

Best regards.


On 02/03/11 10:02, findbestopensource wrote:

Hello Dominique Bejean,

Good job.

We have identified almost 8 open-source web crawlers:
http://www.findbestopensource.com/tagged/webcrawler. I don't know how
far yours differs from the rest.


Your license states that it is not open source but that it is free for
personal use.


Regards
Aditya
www.findbestopensource.com



--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com
www.mysolrserver.com



Re: [ANNOUNCE] Web Crawler

2013-01-29 Thread SivaKarthik
Hi,
 I resolved the issue "Access denied for user 'crawler'@'localhost' (using
password: YES)".
 The MySQL user crawler/crawler was created and privileges were added as
mentioned in the tutorial.
 Thank you.
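
For anyone hitting the same error, the grants boil down to statements like
the following sketch (MySQL 5.x syntax). The crawler/crawler user/password
pair comes from this message; the database name is an assumption, so check
the Crawl-Anywhere tutorial for the real names.

    -- Create the application user and give it rights on its database.
    -- 'crawler'@'localhost' matches the host in the error message above.
    CREATE DATABASE IF NOT EXISTS crawler;
    GRANT ALL PRIVILEGES ON crawler.* TO 'crawler'@'localhost'
        IDENTIFIED BY 'crawler';
    FLUSH PRIVILEGES;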
  





Re: [ANNOUNCE] Web Crawler

2013-01-29 Thread SivaKarthik
Klein,
 Thank you for your reply.

 I hosted the application on the Apache 2 server
 and am able to access http://localhost/search/.

 But while accessing http://localhost/crawler/login.php,
 it shows the error message
 "Access denied for user 'crawler'@'localhost' (using
password: YES)"

 I tried to access
   http://localhost/crawler/log.php
   http://localhost/crawler/display.php
 but they all throw the same error message
 "Access denied for user 'crawler'@'localhost' (using
password: YES)"

 For testing purposes,
I created test1.html and test2.php under the /opt/crawler/web/crawler/pub
folder
and I succeeded in accessing them:
 http://localhost/crawler/test2.php
 http://localhost/crawler/test1.html

I'm not completely sure why login.php gives the access-denied error.
Any idea?

Regards





Re: [ANNOUNCE] Web Crawler

2013-01-27 Thread O. Klein
This is actually showing that it works.

crawlerws is used by the Crawl Anywhere UI, which will pass it the correct
arguments when needed.




SivaKarthik wrote
> Hii,
>  I'm trying to configure Crawl-Anywhere 3.0.3 on my local system.
>  I'm following the steps from the page
> http://www.crawl-anywhere.com/installation-v300/
>  but crawlerws is failing and throwing the error below in the
> browser at http://localhost:8080/crawlerws/ :
>
> 1
> Missing action
>
> Not sure where I'm going wrong.. could you please help me resolve the
> problem.. thank you.







Re: [ANNOUNCE] Web Crawler

2013-01-27 Thread SivaKarthik
Hii,
 I'm trying to configure Crawl-Anywhere 3.0.3 on my local system.
 I'm following the steps from the page
http://www.crawl-anywhere.com/installation-v300/
 but crawlerws is failing and throwing the error below in the
browser at http://localhost:8080/crawlerws/ :

   1
   Missing action

Not sure where I'm going wrong.. could you please help me resolve the
problem.. thank you.





Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Nestor Oviedo
Hi everyone!
I've been following this thread, and I realized we've built something
similar to "Crawl Anywhere". The main difference is that our project is
oriented toward the digital libraries and digital repositories context,
specifically metadata collection from multiple sources, information
enrichment, and storage in multiple destinations.
So far, I can only share an article about the project, because the code is on
our development machines and testing servers. If everything goes well, we plan
to make it open source in the near future.
I'd be glad to hear your comments and opinions about it. There is no need to
be polite.
Thanks in advance.

Best regards.
Nestor



On Wed, Mar 2, 2011 at 11:46 AM, Dominique Bejean  wrote:

> Hi,
>
> No, it doesn't. It looks like an Apache HttpClient 3.x limitation.
> https://issues.apache.org/jira/browse/HTTPCLIENT-579
>
> Dominique


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Hi,

No, it doesn't. It looks like an Apache HttpClient 3.x limitation.
https://issues.apache.org/jira/browse/HTTPCLIENT-579
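
For reference, the 4.1+ line of HttpClient ships its own NTLM engine that
handles NTLMv2. A minimal sketch, not Crawl-Anywhere code; the host, port,
and credentials below are placeholders:

    import org.apache.http.auth.AuthScope;
    import org.apache.http.auth.NTCredentials;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;

    public class NtlmExample {
        public static void main(String[] args) throws Exception {
            // HttpClient 4.x negotiates NTLM (including NTLMv2) itself;
            // the 3.x line cannot, which is what HTTPCLIENT-579 tracks.
            DefaultHttpClient client = new DefaultHttpClient();
            client.getCredentialsProvider().setCredentials(
                    new AuthScope("intranet.example.com", 80),
                    new NTCredentials("user", "secret", "workstation", "DOMAIN"));
            client.execute(new HttpGet("http://intranet.example.com/"));
            client.getConnectionManager().shutdown();
        }
    }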

Dominique

On 02/03/11 15:04, Thumuluri, Sai wrote:

Dominique, does your crawler support NTLM2 authentication? We have content
under SiteMinder which uses NTLM2, and that is posing challenges with Nutch.



RE: [ANNOUNCE] Web Crawler

2011-03-02 Thread Thumuluri, Sai
Dominique, does your crawler support NTLM2 authentication? We have content
under SiteMinder which uses NTLM2, and that is posing challenges with Nutch.

-Original Message-
From: Dominique Bejean [mailto:dominique.bej...@eolya.fr] 
Sent: Wednesday, March 02, 2011 6:22 AM
To: solr-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Aditya,

The crawler is not open source and won't be in the near future. Anyway,
I have to change the license, because it can be used for any personal or
commercial project.

Sincerely,

Dominique



Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Paul Libbrecht
Viewing the indexing result, which is part of what you are describing I
think, is a nice job for such an indexing framework.

Do you guys know whether such a feature is already out there?

paul


On 2 March 2011 at 12:20, Geert-Jan Brits wrote:

> Hi Dominique,
> 
> This looks nice.
> In the past, I've been interested in (semi-)automatically inducing a
> scheme/wrapper from a set of example webpages (often called 'wrapper
> induction' in the scientific field).
> This would allow for fast scheme creation, which could be used as a basis for
> extraction.
> 
> Lately I've been looking for crawlers that incorporate this technology but
> without success.
> Any plans on incorporating this?
> 
> Cheers,
> Geert-Jan
> 



Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Hi,

The crawler comes with an extensible document processing pipeline. If you
know of Java libraries or web services for 'wrapper induction' processing,
it is possible to implement a dedicated stage in the pipeline.


Dominique

On 02/03/11 12:20, Geert-Jan Brits wrote:

Hi Dominique,

This looks nice.
In the past, I've been interested in (semi-)automatically inducing a
scheme/wrapper from a set of example webpages (often called 'wrapper
induction' in the scientific field).
This would allow for fast scheme creation, which could be used as a
basis for extraction.

Lately I've been looking for crawlers that incorporate this technology
but without success.

Any plans on incorporating this?

Cheers,
Geert-Jan



Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Aditya,

The crawler is not open source and won't be in the near future. Anyway,
I have to change the license, because it can be used for any personal or
commercial project.


Sincerely,

Dominique

On 02/03/11 10:02, findbestopensource wrote:

Hello Dominique Bejean,

Good job.

We have identified almost 8 open-source web crawlers:
http://www.findbestopensource.com/tagged/webcrawler. I don't know how
far yours differs from the rest.


Your license states that it is not open source but that it is free for
personal use.


Regards
Aditya
www.findbestopensource.com




Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Lukas,

I am thinking about it, but there is no decision yet.

Anyway, in the next release, I will provide the source code of pipeline
stages and connectors as samples.


Dominique

On 02/03/11 10:01, Lukáš Vlček wrote:

Hi,

is there any plan to open source it?

Regards,
Lukas

[OT] I tried HuriSearch, typed "Java" into the search field, and it returned a
lot of references to ColdFusion error pages. Maybe a recrawl would help?




Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Geert-Jan Brits
Hi Dominique,

This looks nice.
In the past, I've been interested in (semi-)automatically inducing a
scheme/wrapper from a set of example webpages (often called 'wrapper
induction' in the scientific field).
This would allow for fast scheme creation, which could be used as a basis for
extraction.

Lately I've been looking for crawlers that incorporate this technology but
without success.
Any plans on incorporating this?

Cheers,
Geert-Jan

2011/3/2 Dominique Bejean 

> Rosa,
>
> In the pipeline, there is a stage that extracts the text from the original
> document (PDF, HTML, ...).
> It is possible to plug in scripts (Java 6 compliant) in order to keep only
> the relevant parts of the document.
> See
> http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
>
> Dominique
>


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Rosa,

In the pipeline, there is a stage that extracts the text from the
original document (PDF, HTML, ...).
It is possible to plug in scripts (Java 6 compliant) in order to keep only
the relevant parts of the document.
See 
http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
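
To illustrate what such a script can do about Rosa's XPath request, here is
a minimal sketch using only JDK 6 APIs. Nothing below is Crawl-Anywhere's
actual API, and it assumes the page has already been tidied into
well-formed XHTML:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class XPathExtract {
        public static void main(String[] args) throws Exception {
            // Parse the (already cleaned) page into a DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse("page.xhtml");
            // Pull one specific piece of data out of the page.
            XPath xpath = XPathFactory.newInstance().newXPath();
            String title = xpath.evaluate("string(//h1[1])", doc);
            System.out.println(title);
        }
    }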


Dominique

On 02/03/11 09:36, Rosa (Anuncios) wrote:

Nice job!

It would be good to be able to extract specific data from a given page
via XPath though.


Regards,




Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

David,

The UI was not the only reason that made me choose to write a totally new
crawler. After eliminating candidate crawlers for various reasons
(inactive projects, ...), Nutch and Heritrix were the two crawlers on my
short list of possible candidates.

In my mind, the crawler and the pipeline have to be totally decoupled
from the target repository (Solr, ...). This ruled out Nutch.
In the end, I found Heritrix too far from the architecture of the solution
I had imagined.


Dominique


On 02/03/11 05:41, David Smiley (@MITRE.org) wrote:

Dominique,
The obvious number one question is of course why you re-invented this wheel
when there are several existing crawlers to choose from.  Your website says
the reason is that the UIs on existing crawlers (e.g. Nutch, Heritrix, ...)
weren't sufficiently user-friendly or lacked the site-specific configuration
you wanted. Well, if that is the case, why didn't you add/enhance such
capabilities for an existing crawler?

~ David Smiley

-
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread findbestopensource
Hello Dominique Bejean,

Good job.

We have identified almost 8 open-source web crawlers:
http://www.findbestopensource.com/tagged/webcrawler. I don't know how far
yours differs from the rest.

Your license states that it is not open source but that it is free for
personal use.

Regards
Aditya
www.findbestopensource.com




Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Lukáš Vlček
Hi,

is there any plan to open source it?

Regards,
Lukas

[OT] I tried HuriSearch, typed "Java" into the search field, and it returned a
lot of references to ColdFusion error pages. Maybe a recrawl would help?



Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Rosa (Anuncios)

Nice job!

It would be good to be able to extract specific data from a given page
via XPath though.


Regards,





Re: [ANNOUNCE] Web Crawler

2011-03-01 Thread David Smiley (@MITRE.org)
Dominique,
The obvious number one question is of course why you re-invented this wheel
when there are several existing crawlers to choose from.  Your website says
the reason is that the UIs on existing crawlers (e.g. Nutch, Heritrix, ...)
weren't sufficiently user-friendly or lacked the site-specific configuration
you wanted. Well, if that is the case, why didn't you add/enhance such
capabilities for an existing crawler?

~ David Smiley

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book


[ANNOUNCE] Web Crawler

2011-03-01 Thread Dominique Bejean

Hi,

I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java web
crawler. It includes:


   * a crawler
   * a document processing pipeline
   * a solr indexer

The crawler has a web administration UI for managing the web sites to be
crawled. Each web site crawl is configured with a lot of possible
parameters (not all mandatory):


   * number of simultaneous items crawled by site
   * recrawl period rules based on item type (html, PDF, …)
   * item type inclusion / exclusion rules
   * item path inclusion / exclusion / strategy rules
   * max depth
   * web site authentication
   * language
   * country
   * tags
   * collections
   * ...

The pipeline includes various ready-to-use stages (text extraction,
language detection, a Solr ready-to-index XML writer, ...).

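For reference, "Solr ready to index XML" is simply Solr's standard update
message; a minimal hand-written example (the field names here are
illustrative, not the schema shipped with the product):

    <add>
      <doc>
        <field name="id">http://www.example.com/page.html</field>
        <field name="title">Example page title</field>
        <field name="text">Extracted page text...</field>
      </doc>
    </add>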

Everything is very configurable and extensible, either by scripting or Java coding.

With scripting, you can help the crawler handle JavaScript links, or help
the pipeline extract the relevant title and clean up the HTML pages
(removing menus, headers, footers, ...).


With Java coding, you can develop your own pipeline stages.

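Since the stage API itself is documented in the wiki, here is a purely
hypothetical sketch of the shape a custom stage can take. CrawlDocument
and process() below are invented stand-ins, not the real interface:

    // Hypothetical stand-in for the document object a stage receives.
    class CrawlDocument {
        private String content;
        CrawlDocument(String content) { this.content = content; }
        String getContent() { return content; }
        void setContent(String content) { this.content = content; }
    }

    public class BoilerplateRemovalStage {
        // Strip <nav> blocks so site menus do not pollute the indexed text.
        public CrawlDocument process(CrawlDocument doc) {
            doc.setContent(doc.getContent().replaceAll("(?is)<nav.*?</nav>", ""));
            return doc;
        }

        public static void main(String[] args) {
            CrawlDocument doc = new CrawlDocument("<nav>menu</nav><p>body</p>");
            System.out.println(new BoilerplateRemovalStage().process(doc).getContent());
        }
    }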
The Crawl Anywhere web site provides good explanations and screenshots.
Everything is documented in a wiki.


The current version is 1.1.4. You can download and try it out from here:
www.crawl-anywhere.com



Regards

Dominique