Hi list,
I am now facing a problem in scientific computing.
We have 5G of data (mainly matrices/vectors) that we collected for
some surveys, and we now plan to do some data mining on it.
Honestly, I do not know Hadoop/MapReduce very well. The question
may seem quite simple to you
Hi all
RegExp is widely used in Nutch, and I am wondering whether the JDK/Jakarta
classes are fast enough.
Here are the benchmarks I found on the web:
http://tusker.org/regex/regex_benchmark.html
It seems dk.brics.automaton.RegExp is the fastest among the libraries.
/Jack
--
Keep Discovering ... ...
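For anyone who wants to check this on their own patterns, here is a minimal timing sketch. It only exercises java.util.regex from the JDK; the URL pattern and the sample text are made up for illustration and are not taken from Nutch, and a loop like this gives a rough feel rather than a rigorous benchmark.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTimingSketch {
    // Counts non-overlapping matches of a precompiled pattern in the input.
    public static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical URL-like pattern, compiled once and reused.
        Pattern p = Pattern.compile("https?://[^\\s\"<>]+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("see http://example.org/a and https://example.org/b for details ");
        }
        String text = sb.toString();

        // Warm-up pass so the JIT has a chance to compile the hot path.
        countMatches(p, text);

        long start = System.nanoTime();
        int total = 0;
        for (int i = 0; i < 200; i++) {
            total += countMatches(p, text);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(total + " matches in " + elapsedMs + " ms");
    }
}
```

Running the same loop against another library's matcher on the same text would give a like-for-like comparison.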
Hi Stefan
Can you explain a little more? I mean I cannot find any evidence in
the source code...
Thanks
/Jack
On 2/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi Jack,
the summary is only created from all hits displayed on one page.
Stefan
On 23.02.2006 at 02:45, Jack Tang wrote
the question?
On 24.02.2006 at 02:51, Jack Tang wrote:
Hi Stefan
Can you explain a little more? I mean I cannot find any evidence in
the source code...
Thanks
/Jack
On 2/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi Jack,
the summary is only created from all hits displayed
Yes, you're right :) I found the answer.
Thanks.
On 2/24/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Isn't HitDetails.length == hitsPerPage?
This happens in search.jsp.
On 24.02.2006 at 03:09, Jack Tang wrote:
I don't think so.
Let's take non-DFS as an example.
NutchBean.getSummary
On 2/23/06, Doug Cutting [EMAIL PROTECTED] wrote:
Jack Tang wrote:
In the FetchedSegments class, the code below shows how to get the hit summaries.
public String[] getSummary(HitDetails[] details, Query query)
throws IOException {
SummaryThread[] threads = new SummaryThread
Hi All
I don't know whether Nutch will support only JDK 1.5 or both JDK 1.4 and 1.5 in
the future. If the former, would it be better to adopt the JDK 1.5 concurrency
framework for threads (say, the fetcher and summary threads)? And here is an
IBM tutorial on the new classes in Tiger.
/Jack
--
Keep Discovering ... ...
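To make the question concrete, here is a rough sketch of what summary generation on a java.util.concurrent thread pool could look like. It is not Nutch code: the class name, summarizeOne, and the truncation it performs are hypothetical stand-ins for the real summarizer, and the sketch uses lambdas and List.of, so it needs a JDK newer than 1.5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SummaryPoolSketch {
    // Hypothetical stand-in for real summary generation: keeps the first maxLen characters.
    public static String summarizeOne(String text, int maxLen) {
        return text.length() <= maxLen ? text : text.substring(0, maxLen) + "...";
    }

    // Runs one summary task per input on a fixed-size pool and returns results in input order.
    public static List<String> summarizeAll(List<String> texts, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (final String text : texts) {
                tasks.add(() -> summarizeOne(text, 20));
            }
            List<String> summaries = new ArrayList<>();
            // invokeAll blocks until every task has finished; futures come back in task order.
            for (Future<String> f : pool.invokeAll(tasks)) {
                summaries.add(f.get());
            }
            return summaries;
        } finally {
            // Always release the worker threads, even if a task failed.
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(summarizeAll(
            List.of("short text", "a much longer piece of page content"), 2));
    }
}
```

The point of the pool is that the number of worker threads stays fixed no matter how many hits are on the page, instead of one new thread per hit.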
Hi Guys
In the FetchedSegments class, the code below shows how to get the hit summaries.
public String[] getSummary(HitDetails[] details, Query query)
throws IOException {
SummaryThread[] threads = new SummaryThread[details.length];
for (int i = 0; i < threads.length; i++) {
Hi All
Nutch currently supports highlighting only for the content field. Any
suggestions for enabling multi-field highlighting, say for hits in anchor
text and URL (like Google), etc.? I know one simple but crude way is to get the
hit details first and then invoke the summarizer threads. Any smarter ideas?
Thanks.
/Jack
On 2/9/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi Folks,
I hope, and it looks like, we are close to getting metadata support for
CrawlDatum (CrawlDB) into the sources soon.
At this point we can store and read but not 'process' (meaning creation
or inheritance etc. [someone knows a better
Hi Jérôme
On 1/21/06, Jérôme Charron [EMAIL PROTECTED] wrote:
I am wondering whether the Analyzer of Nutch in svn trunk is chosen by the
languageidentifier plugin or not (I know it was in Nutch 0.7.1-dev).
It's not really chosen by the languageidentifier, but chosen regarding the
value of the lang
Hi All
I am wondering whether the Analyzer of Nutch in svn trunk is chosen by the
languageidentifier plugin or not (I know it was in Nutch 0.7.1-dev).
In org.apache.nutch.indexer.Indexer.class line 104
writer.addDocument((Document)((ObjectWritable)value).get());
It should be
NutchAnalyzer analyzer =
On 1/21/06, Jack Tang [EMAIL PROTECTED] wrote:
Hi All
I am wondering whether the Analyzer of Nutch in svn trunk is chosen by the
languageidentifier plugin or not (I know it was in Nutch 0.7.1-dev).
In org.apache.nutch.indexer.Indexer.class line 104
writer.addDocument((Document)((ObjectWritable)value).get
Hi Guys
I have just updated the source code from svn head. However, I cannot
find the org.apache.nutch.protocol.http.api.HttpBase class. Did you miss
it?
Thanks
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Hi
I think it is reasonable that PluginManifestParser should implement the
NutchConfigurable interface. As the NutchConfigurable interface
describes, PluginManifestParser needs NutchConf.
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Hi
I am going to feed the nutch-0.8-dev crawler with seeds in XML format. I
have read Nutch's TextInputFormat/InputFormatBase. It seems Nutch now
breaks plain text files into chars and parses them. My question
is how to support an XmlInputFormat; in my view, the XML format is not
character-based but
Hi Andrzej
The idea brings vertical search into Nutch and it is definitely great :)
I think Nutch should add an information retrieval layer to the whole
architecture, and export some abstract interfaces, say
UrlBasedInformationRetrieve (you can implement your URL grouping idea
here?),
).
On 20.12.2005 at 10:29, Jack Tang wrote:
Hi Guys
Is it possible to dump a suggestion list from the Nutch index in order to
implement Ajax auto-complete?
Google suggestion: http://www.google.com/webhp?complete=1&hl=en
Regards
/Jack
--
Keep Discovering ... ...
http
will want to have
exclusive read-access to the live index without someone writing stuff
(locking it) sometimes. Each low-traffic period, copy the built-up
statistical index, optimize() it, and replace the current live index with
the new copy.
Good luck,
Fredrik
On 12/12/05, Jack Tang [EMAIL
for suggestion.
Fredrik
On 9/29/05, Jack Tang [EMAIL PROTECTED] wrote:
Hi
I really like Google's "Did you mean" and I notice that Nutch does
not currently provide this function.
In this article http://today.java.net/lpt/a/211 , author Tim White
implemented suggestions using n-grams to generate
Stefan
It seems your patch is missing the
org.apache.nutch.protocol.ContentProperties class, right?
/Jack
On 12/10/05, Stefan Groschupf (JIRA) [EMAIL PROTECTED] wrote:
[
http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ]
Stefan Groschupf commented on NUTCH-135:
Hi
I am going to standardize some fields which I stored in my parser
plugin. But I found that sometimes
parse.getData().getMetadata().get(propertyName) is NULL. In fact,
when I stepped into the source code, the value of propertyName was not
NULL.
So can someone explain this? Thanks
/Jack
--
Keep
Guys
My fault! I forgot to copy the segments dir. Sorry for that. Please ignore
this message.
/Jack
On 12/8/05, Jack Tang [EMAIL PROTECTED] wrote:
Hi All
I recently updated my Nutch from 0.7 to 0.8-dev (svn version) and came
across a question about the searcher.
I wrote my own indexer and searcher
Hi
I checked out the latest source code from svn and played with NDFS according
to the tutorial (http://wiki.apache.org/nutch/NutchDistributedFileSystem).
I tested my NDFS using TestClient. It was odd that whenever I input
a command, the NameNode would throw an exception:
051206 003714 Server connection
Hi Doug
1. How to deal with dead URLs? If I remove a URL after Nutch's first
crawl, should Nutch keep the dead URLs and never fetch them
again?
2. Should Nutch export dedup as an extension point? In my project, we
add an information extraction layer to Nutch; I think it is a good idea to
export
Thanks for your explanation, Andrzej.
I am going to read some NFS source code and ask smarter questions later.
Thanks again.
Regards
/Jack
On 11/9/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Jack Tang wrote:
Hi Andrzej
In the document, Michael said:
I'd strongly recommend using the system
Hi Doug
On 11/10/05, Doug Cutting [EMAIL PROTECTED] wrote:
Jack Tang wrote:
Below is the Google architecture as I picture it:
DataNode A
Master DataNode B GoogleCrawler
DataNode C
..
GoogleCrawler is kept running all
[
http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ]
Jack Tang commented on NUTCH-36:
Kerang Lv's solution did well in NutchAnalysis but there are still some bugs in
Summarizer. Say here is one Chinese string (c1)(c2)(c3)(c4
Environment: all
Reporter: Jack Tang
Priority: Minor
I customized a query filter using test as my field. When I try to
search test:(c1)(c2)(c3), the query object generated by
NutchAnalysis is wrong. Now the result is
test:(c1)(c2) [DEFAULT](c2)(c3).
However
Hi
I really like Google's "Did you mean" and I notice that Nutch does
not currently provide this function.
In this article http://today.java.net/lpt/a/211, author Tim White
implemented suggestions using n-grams to generate a suggestion index. Do
you think it would be good for Nutch? I mean the index in Nutch will
with |query,frequency|
tuples (updated nightly, weekly, or whatever), and simply search this index
with a FuzzyQuery with some defined similarity, and pick the most frequent
query for suggestion.
Fredrik
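A rough in-memory sketch of the idea described above: keep |query,frequency| tuples, find the logged queries that are close to the user's input, and pick the most frequent one. This is not Lucene code. The plain Levenshtein distance below is only a stand-in for what a FuzzyQuery would do against a real index, and the class name, method names, and the maxDistance parameter are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class DidYouMeanSketch {
    // Classic dynamic-programming Levenshtein edit distance, using two rolling rows.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the most frequent logged query within maxDistance edits of the input,
    // skipping the input itself; returns null if nothing qualifies.
    public static String suggest(Map<String, Integer> queryFrequency, String input, int maxDistance) {
        String best = null;
        int bestFreq = -1;
        for (Map.Entry<String, Integer> e : queryFrequency.entrySet()) {
            String candidate = e.getKey();
            if (candidate.equals(input)) {
                continue;
            }
            if (editDistance(input, candidate) <= maxDistance && e.getValue() > bestFreq) {
                best = candidate;
                bestFreq = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> log = new HashMap<>();
        log.put("lucene", 120);
        log.put("lucent", 15);
        log.put("nutch", 90);
        System.out.println(suggest(log, "lucenr", 1));
    }
}
```

Scanning every logged query is fine for a small log; a real index would narrow the candidates first, which is the part the fuzzy search is meant to handle.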
On 9/29/05, Jack Tang [EMAIL PROTECTED] wrote:
Hi
I really like Google's "Did you mean"
Hi AJ
I guess it is the growing number of threads.
You can show the thread ID in the log. I think that makes sense
Regards
/Jack
On 9/29/05, AJ Chen [EMAIL PROTECTED] wrote:
I started the crawler with about 2000 sites. The fetcher could achieve
7 pages/sec initially, but the performance gradually dropped to
/NUTCH-36
Project: Nutch
Type: Improvement
Components: indexer, searcher
Environment: all
Reporter: Jack Tang
Priority: Minor
Attachments: #26700
Nutch now supports Chinese in a very simple way: NutchAnalysis segments CJK
terms word-by-word.
So, if I search
Hi Kerang
I have tested the query; no problem in summary highlighting. It is really
amazing. It's the solution for Chinese bi-gram segmentation.
Regards
/Jack
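For readers new to the term, bi-gram segmentation turns a run of CJK characters into overlapping two-character tokens. The sketch below only illustrates that idea; it is not Kerang's patch or the NutchAnalysis grammar, the class and method names are made up, and it assumes every character fits in one Java char (no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Splits a run of characters into overlapping two-character tokens.
    // A single-character run is returned as one token; an empty run yields no tokens.
    public static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.length() == 1) {
            tokens.add(run);
            return tokens;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters written as Unicode escapes so the file compiles under any encoding.
        String run = "\u4e2d\u6587\u641c\u7d22";
        // Yields three overlapping tokens: chars 1-2, chars 2-3, chars 3-4.
        System.out.println(bigrams(run));
    }
}
```

Because a query is tokenized the same way as the indexed text, a phrase of three characters matches on its two overlapping pairs rather than on three unrelated single characters.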
On 9/22/05, Jack Tang [EMAIL PROTECTED] wrote:
Hi Kerang
Pretty nice hack!
I will test highlighting in the query summary now...
see you
Hi Nutchers
I hope this email is not noise in this community. I am now working on
something like a hyperbolic browser (
http://www.acm.org/sigchi/chi96/proceedings/videos/Lamping/hb-video.html
). And I remember that there were some APIs written in Java. I got
it by clicking the blog address in
Hi All
Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
when I try to do breadth-first crawling; I set the depth to 3.
Any comments?
Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Hi Andrzej
First of all, thanks for your quick response.
On 9/7/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Jack Tang wrote:
Hi All
Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
when I try to do breadth-first crawling; I set the depth to 3.
Any comments?
Yes
, the
db.max.outlinks.per.page must be set to a number that is larger than the
number of outlinks on the page. If this is true, then the max number
has to be determined in real time since the number of outlinks varies
from page to page.
Is my understanding correct?
AJ
Jack Tang wrote:
Hi All
Hi Guys
Did someone install parse-rss and try to fetch rss feeds?
It failed on my side. I enabled the plugin and it fetched, but the RSS
parser did not work.
My feed is http://www.craigslist.org/evs/index.rss
Here is the error:
org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but
Hi All
It has taken me a long time to think about embedding an improved
CJKAnalysis into NutchAnalysis. I got nothing but some failed
experiments, which I share with you; maybe you can hack it (well, I am not
going to give up).
I have written several Chinese word segmentation implementations, some are dictionary