Distributed Matrix Computing on Hadoop

2006-07-21 Thread Jack Tang
Hi list, I am facing a problem in scientific computing. We have 5G of data (mainly matrices/vectors) that we collected for some surveys, and we now plan to do some data mining on it. Honestly, I do not know Hadoop/MapReduce very well. The question seems quite simple to you

Much faster RegExp lib needed in nutch?

2006-03-11 Thread Jack Tang
Hi all, RegExp is widely used in Nutch, and I am wondering whether the JDK/Jakarta classes are fast enough. Here are the benchmarks I found on the web: http://tusker.org/regex/regex_benchmark.html It seems dk.brics.automaton.RegExp is the fastest among the libs. /Jack -- Keep Discovering ... ...

Re: Summarier threads in nutch

2006-02-23 Thread Jack Tang
Hi Stefan, can you explain a little more? I cannot find any evidence of that in the source code... Thanks /Jack On 2/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Jack, the summary is only created from all hits displayed on one page. Stefan On 23.02.2006 at 02:45, Jack Tang wrote

Re: Summarier threads in nutch

2006-02-23 Thread Jack Tang
the question? On 24.02.2006 at 02:51, Jack Tang wrote: Hi Stefan, can you explain a little more? I cannot find any evidence of that in the source code... Thanks /Jack On 2/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Jack, the summary is only created from all hits displayed

Re: Summarier threads in nutch

2006-02-23 Thread Jack Tang
Yes, you're right :) I found the answer. Thanks. On 2/24/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Isn't HitDetails.length == hitsPerPage? This happens in search.jsp. On 24.02.2006 at 03:09, Jack Tang wrote: I don't think so. Let's take non-DFS as an example. NutchBean.getSummary

Re: Summarier threads in nutch

2006-02-22 Thread Jack Tang
On 2/23/06, Doug Cutting [EMAIL PROTECTED] wrote: Jack Tang wrote: In the FetchedSegments class, the code below shows how the hit summaries are obtained. public String[] getSummary(HitDetails[] details, Query query) throws IOException { SummaryThread[] threads = new SummaryThread

Thread in nutch

2006-02-20 Thread Jack Tang
Hi all, I don't know whether Nutch will support only JDK 1.5 or both JDK 1.4 and 1.5 in the future. If the former, would it be better to adopt the JDK 1.5 concurrency framework for threads (say, the fetcher and summary threads)? And here is an IBM tutorial on the new classes in Tiger. /Jack -- Keep Discovering ... ...
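To make the question above concrete, here is a minimal sketch of what summary generation on the JDK 1.5 concurrency framework could look like. It is not Nutch code: the class PooledSummaries, the method getSummaries, the plain String hits, and the summarize stand-in are all hypothetical names chosen for illustration. Only the java.util.concurrent classes are real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PooledSummaries {
    // Hypothetical stand-in for the real per-hit summary work.
    static String summarize(String hit) {
        return "summary of " + hit;
    }

    // Submits one task per hit to a fixed-size pool and collects the
    // results in input order, instead of starting one raw thread per hit.
    static String[] getSummaries(String[] hits, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String hit : hits) {
                futures.add(pool.submit(new Callable<String>() {
                    public String call() {
                        return summarize(hit);
                    }
                }));
            }
            String[] results = new String[hits.length];
            for (int i = 0; i < results.length; i++) {
                try {
                    results[i] = futures.get(i).get(); // blocks until task i is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                } catch (ExecutionException e) {
                    throw new RuntimeException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown(); // already-submitted tasks still run to completion
        }
    }
}
```

Calling PooledSummaries.getSummaries(hits, 4) would bound the number of worker threads at four regardless of how many hits are on the page, which is the main practical difference from a thread-per-hit design.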

Summarier threads in nutch

2006-02-19 Thread Jack Tang
Hi guys, in the FetchedSegments class, the code below shows how the hit summaries are obtained. public String[] getSummary(HitDetails[] details, Query query) throws IOException { SummaryThread[] threads = new SummaryThread[details.length]; for (int i = 0; i < threads.length; i++) {

How to support multi-fields highlight?

2006-02-16 Thread Jack Tang
Hi all, Nutch currently supports highlighting only in the content field. Any suggestions for enabling multi-field highlighting, say for hits in the anchor text and URL (like Google), etc.? I know the simplest but clumsy way is to get the hit details first and then invoke the summarizer threads. Any smarter ideas? Thanks. /Jack

Re: process/create/hand over: crawl meta data

2006-02-08 Thread Jack Tang
On 2/9/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi folks, I hope, and it looks like, we are close to getting metadata support for crawlDatum (CrawlDB) into the sources soon. At this point we can store and read but not 'process' (meaning creation or inheritance etc. [someone knows a better

Re: lang identifier and nutch analyzer in trunk

2006-01-21 Thread Jack Tang
Hi Jérôme On 1/21/06, Jérôme Charron [EMAIL PROTECTED] wrote: I am wondering whether the Analyzer of Nutch in svn trunk is chosen by the languageidentifier plugin or not (I know it was in Nutch 0.7.1-dev). It's not really chosen by the languageidentifier, but chosen according to the value of the lang

lang identifier and nutch analyzer in trunk

2006-01-20 Thread Jack Tang
Hi all, I am wondering whether the Analyzer of Nutch in svn trunk is chosen by the languageidentifier plugin or not (I know it was in Nutch 0.7.1-dev). In org.apache.nutch.indexer.Indexer.class line 104: writer.addDocument((Document)((ObjectWritable)value).get()); It should be NutchAnalyzer analyzer =

Re: lang identifier and nutch analyzer in trunk

2006-01-20 Thread Jack Tang
On 1/21/06, Jack Tang [EMAIL PROTECTED] wrote: Hi all, I am wondering whether the Analyzer of Nutch in svn trunk is chosen by the languageidentifier plugin or not (I know it was in Nutch 0.7.1-dev). In org.apache.nutch.indexer.Indexer.class line 104: writer.addDocument((Document)((ObjectWritable)value).get

Where is org.apache.nutch.protocol.http.api.HttpBase?

2006-01-12 Thread Jack Tang
Hi guys, I have just updated the source code from the svn head version. However, I cannot find the org.apache.nutch.protocol.http.api.HttpBase class. Did you miss it? Thanks /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

PluginManifestParser should be NutchConfigurable

2006-01-11 Thread Jack Tang
Hi, I think it is reasonable that PluginManifestParser should implement the NutchConfigurable interface. As the NutchConfigurable interface describes, PluginManifestParser needs a NutchConf. /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

XmlInputFormat ?

2006-01-10 Thread Jack Tang
Hi, I am going to feed the nutch-0.8-dev crawler with seeds in XML format, and I have read Nutch's TextInputFormat/InputFormatBase. It seems Nutch currently breaks the plain text files into chars and parses them. My question is how to support an XmlInputFormat. In my view, the XML format is not character-based but

Re: Per-page crawling policy

2006-01-06 Thread Jack Tang
Hi Andrzej, the idea brings vertical search into Nutch, and it is definitely great :) I think Nutch should add an information retrieval layer to the whole architecture and export some abstract interface, say UrlBasedInformationRetrieve (you can implement your url grouping idea here?),

Re: nutch and google suggestion

2005-12-20 Thread Jack Tang
). On 20.12.2005 at 10:29, Jack Tang wrote: Hi guys, is it possible to dump a suggestion list from the Nutch index in order to implement Ajax auto-complete? Google suggestion: http://www.google.com/webhp?complete=1&hl=en Regards /Jack -- Keep Discovering ... ... http

Re: Hot Search! Re: Nutch Suggestion? (Google like did you mean)

2005-12-12 Thread Jack Tang
will want to have exclusive read-access to the live index without someone writing stuff (locking it) sometimes. Each low-traffic period, copy the built-up statistical index, optimize() it, and replace the current live index with the new copy. Good luck, Fredrik On 12/12/05, Jack Tang [EMAIL

Hot Search! Re: Nutch Suggestion? (Google like did you mean)

2005-12-11 Thread Jack Tang
for suggestion. Fredrik On 9/29/05, Jack Tang [EMAIL PROTECTED] wrote: Hi, I really like Google's "Did you mean", and I notice that Nutch does not currently provide this function. In this article, http://today.java.net/lpt/a/211 , author Tim White implemented suggestion using n-grams to generate

Re: [jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-10 Thread Jack Tang
Stefan, it seems your patch is missing the org.apache.nutch.protocol.ContentProperties class, right? /Jack On 12/10/05, Stefan Groschupf (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ] Stefan Groschupf commented on NUTCH-135:

parse.getData().getMetadata().get(propName) is NULL?

2005-12-09 Thread Jack Tang
Hi, I am going to standardize some fields that I store in my parser plugin. But I found that sometimes parse.getData().getMetadata().get(propertyName) is NULL. In fact, when I stepped into the source code, the value of propertyName was not NULL. Can someone explain this? Thanks /Jack -- Keep

Re: Nutch 0.8 update issue

2005-12-07 Thread Jack Tang
Guys, my fault! I forgot to copy the segments dir. Sorry for that. Please ignore this message. /Jack On 12/8/05, Jack Tang [EMAIL PROTECTED] wrote: Hi all, I am currently updating my Nutch from 0.7 to 0.8-dev (svn version) and have come across one question on the searcher. I wrote my own indexer and searcher

NDFS Connection reset

2005-12-05 Thread Jack Tang
Hi, I checked out the latest source code from svn and tried out NDFS according to the tutorial (http://wiki.apache.org/nutch/NutchDistributedFileSystem). I tested my NDFS using TestClient. Oddly, every time I entered a command, the NameNode threw an exception: 051206 003714 Server connection

Re: incremental crawling

2005-12-01 Thread Jack Tang
Hi Doug, 1. How should dead urls be dealt with? Suppose I remove the url after Nutch's first crawl. Should Nutch keep the dead urls and never fetch them again? 2. Should Nutch export dedup as an extension point? In my project, we add an information extraction layer to Nutch, and I think it is a good idea to export

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Thanks for your explanation, Andrzej. I am going to read some NFS source code and ask smarter questions later. Thanks again. Regards /Jack On 11/9/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Jack Tang wrote: Hi Andrzej, in the document, Michael said: I'd strongly recommend using the system

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Hi Doug On 11/10/05, Doug Cutting [EMAIL PROTECTED] wrote: Jack Tang wrote: Below is the Google architecture as I picture it: DataNode A Master DataNode B GoogleCrawler DataNode C .. GoogleCrawler is kept running all

[jira] Commented: (NUTCH-36) Chinese in Nutch

2005-10-05 Thread Jack Tang (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ] Jack Tang commented on NUTCH-36: Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs in Summarizer. Say here is one Chinese string (c1)(c2)(c3)(c4

[jira] Created: (NUTCH-104) Nutch query parser does not support CJK bi-gram segmentation.

2005-10-05 Thread Jack Tang (JIRA)
Environment: all Reporter: Jack Tang Priority: Minor I customized a query filter using test as my field. When I try to search test:(c1)(c2)(c3), the query object generated by NutchAnalysis is wrong. The current result is test:(c1)(c2) [DEFAULT](c2)(c3). However
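For readers unfamiliar with bi-gram segmentation, the following is a minimal sketch of the idea behind the report above, not the NutchAnalysis implementation. The class BigramSketch and both methods are hypothetical. The fieldClauses helper also rests on an assumption the truncated report does not confirm: that every bi-gram of a fielded term is meant to keep that field rather than fall back to the default field.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {
    // Splits a run of characters into overlapping two-character tokens.
    // A one-character run yields itself; an empty run yields nothing.
    // Works on code points so supplementary characters stay intact.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<String>();
        int n = run.codePointCount(0, run.length());
        if (n == 0) {
            return tokens;
        }
        if (n == 1) {
            tokens.add(run);
            return tokens;
        }
        for (int i = 0; i + 1 < n; i++) {
            int start = run.offsetByCodePoints(0, i);
            int end = run.offsetByCodePoints(start, 2);
            tokens.add(run.substring(start, end));
        }
        return tokens;
    }

    // Assumption: each bi-gram keeps the field of the original term.
    static List<String> fieldClauses(String field, String run) {
        List<String> clauses = new ArrayList<String>();
        for (String token : bigrams(run)) {
            clauses.add(field + ":" + token);
        }
        return clauses;
    }
}
```

Under that assumption, a three-character term in field test would produce two clauses, both in field test, rather than one in test and one in the default field.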

Nutch Suggestion? (Google like did you mean)

2005-09-29 Thread Jack Tang
Hi, I really like Google's "Did you mean", and I notice that Nutch does not currently provide this function. In this article, http://today.java.net/lpt/a/211, author Tim White implemented suggestion using n-grams to generate a suggestion index. Do you think it would be good for Nutch? I mean index in nutch will

Re: Nutch Suggestion? (Google like did you mean)

2005-09-29 Thread Jack Tang
with |query,frequency| tuples (updated nightly, weekly, or whatever), and simply search this index with a FuzzyQuery with some defined similarity, and pick the most frequent query for suggestion. Fredrik On 9/29/05, Jack Tang [EMAIL PROTECTED] wrote: Hi, I really like Google's "Did you mean"
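The approach Fredrik describes (fuzzy-match the input against logged queries, then pick the most frequent match) can be sketched without any index at all. The code below is an illustration only: it swaps Lucene's FuzzyQuery for a plain Levenshtein distance over an in-memory map, and the names SuggestSketch, editDistance, suggest, and maxDistance are hypothetical.

```java
import java.util.Map;

final class SuggestSketch {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Returns the most frequent logged query within maxDistance edits of
    // the input, or null if none qualifies. The input itself is skipped.
    static String suggest(String input, Map<String, Integer> queryFrequency, int maxDistance) {
        String best = null;
        int bestFreq = -1;
        for (Map.Entry<String, Integer> e : queryFrequency.entrySet()) {
            String candidate = e.getKey();
            if (candidate.equals(input)) {
                continue;
            }
            if (editDistance(input, candidate) <= maxDistance && e.getValue() > bestFreq) {
                best = candidate;
                bestFreq = e.getValue();
            }
        }
        return best;
    }
}
```

A linear scan like this only makes sense for a small query log. The index-based variant in the message above exists precisely to avoid comparing the input against every stored query.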

Re: what contibute to fetch slowing down

2005-09-28 Thread Jack Tang
Hi AJ, my guess is the growing number of threads. You can show the thread id in the log. I think it makes sense. Regards /Jack On 9/29/05, AJ Chen [EMAIL PROTECTED] wrote: I started the crawler with about 2000 sites. The fetcher could achieve 7 pages/sec initially, but the performance gradually dropped to

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

2005-09-22 Thread Jack Tang
/NUTCH-36 Project: Nutch Type: Improvement Components: indexer, searcher Environment: all Reporter: Jack Tang Priority: Minor Attachments: #26700 Nutch currently supports Chinese in a very simple way: NutchAnalysis segments CJK terms word-by-word. So, if I search

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

2005-09-22 Thread Jack Tang
Hi Kerang, I have tested the query; there is no problem in summary highlighting. It is really amazing. It's the solution for Chinese bi-gram segmentation. Regards /Jack On 9/22/05, Jack Tang [EMAIL PROTECTED] wrote: Hi Kerang, pretty nice hack! I will test highlighting in the query summary now... see you

hyperbolic browser api (I missed)

2005-09-21 Thread Jack Tang
Hi Nutchers, I hope this email is not noise in this community. I am now working on something like a hyperbolic browser ( http://www.acm.org/sigchi/chi96/proceedings/videos/Lamping/hb-video.html ). I remember that there were some APIs written in Java. I got to them by clicking the blog address in

Nutch crawler is breadth-first ?

2005-09-07 Thread Jack Tang
Hi all, is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost when I try to do breadth-first crawling with the depth set to 3. Any comments? Regards /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

Re: Nutch crawler is breadth-first ?

2005-09-07 Thread Jack Tang
Hi Andrzej, first of all, thanks for your quick response. On 9/7/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Jack Tang wrote: Hi all, is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost when I try to do breadth-first crawling with the depth set to 3. Any comments? Yes

Re: db.max.outlinks.per.page is misunderstood?

2005-09-07 Thread Jack Tang
, the db.max.outlinks.per.page must be set to a number that is larger than the number of outlinks on the page. If this is true, then the max number has to be determined in real time, since the number of outlinks varies from page to page. Is my understanding correct? AJ Jack Tang wrote: Hi All

RSS Parser Bug!?

2005-09-07 Thread Jack Tang
Hi guys, has anyone installed parse-rss and tried to fetch RSS feeds? It failed on my side. I enabled the plugin and the feed was fetched, but the RSS parser did not work. My feed is http://www.craigslist.org/evs/index.rss Here is the error: org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but

NutchAnalysis and CJK

2005-07-14 Thread Jack Tang
Hi all, I have spent a long time thinking about embedding an improved CJKAnalysis into NutchAnalysis. I have nothing but some failed attempts, which I will share with you; maybe you can hack it (well, I am not going to give up). I have written several Chinese word segmenters, some are dictionary