Hi Nana,

Maybe you can use the URL regex filter to exclude these. The following regex
rule will allow only http(s) links to be crawled.

+^https?://
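
To illustrate what such a rule does, here is a minimal Java sketch, not Nutch's own code. It checks a few strings against an http(s)-only pattern and shows that java.net.URL rejects a data: URI in the way your stack trace reports. The class and method names are made up for the example. In a stock Nutch setup the rule would usually go in conf/regex-urlfilter.txt, where a trailing catch-all "+." line would still accept other schemes unless it is removed; please treat that as an assumption about your configuration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

// Hypothetical demo class; it is not part of Nutch.
class HttpOnlyFilterSketch {
    // http(s)-only pattern. The leading '+' in the filter file marks an
    // "accept" rule and is not regex syntax, so it is left out here.
    private static final Pattern HTTP_ONLY = Pattern.compile("^https?://");

    // True if the pattern is found in the URL string. Using find() here
    // is an assumption about how the filter applies its rules.
    static boolean accepts(String url) {
        return HTTP_ONLY.matcher(url).find();
    }

    // True if java.net.URL can parse the string. Schemes without a
    // registered handler (such as "data") raise MalformedURLException.
    @SuppressWarnings("deprecation")
    static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/",
            "https://example.com/page",
            "data:image/png;base64,AAAA",
            "ftp://example.com/file"
        };
        for (String s : samples) {
            System.out.println(s + " accepted=" + accepts(s)
                    + " parses=" + parsesAsUrl(s));
        }
    }
}
```

With a filter like this in place, data: links should be dropped before they reach the step that failed in updatedb.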

Thanks & Regards,
Karanjeet Singh
USC

On Mon, Jun 6, 2016 at 7:13 PM, Nana Pandiawan <
nana.pandia...@solusi247.com.invalid> wrote:

> Hi Furkan,
> Thanks for your response.
>
> Does the error occur when Nutch finds a data URI scheme like the one below?
>
> <img  src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
> AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
> 9TXL0Y4OHwAAAABJRU5ErkJggg=="  alt="Red dot"  />
>
> I just crawled a random page and got the error.
> How can I skip it so that Nutch can continue the crawling process?
>
> On 06/06/16 17:25, Furkan KAMACI wrote:
>
>> Hi Nana,
>>
>> It seems that your problem may be related to base64 data. Here is a link
>> about it:
>>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_questions_12458390_embed-2Djava-2Dapplet-2Dthrough-2Durl-2Ddata&d=DQICaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=hqok1xhQzZJQQMUShFCwJlH6xLwK-lHBPuRLfzb1UMU&s=8fdlI6GMFRLCE37HOq1zs3Xm-sNs7ol0BxzvGxCFm5A&e=
>>
>> Could you share the pages that you get the error for?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Mon, Jun 6, 2016 at 4:26 AM, Nana Pandiawan <
>> nana.pandia...@solusi247.com.invalid> wrote:
>>
>>> Hi All,
>>>
>>> I'm getting the following errors when running updatedb. Can someone tell
>>> me what's going wrong and how to solve it?
>>> Thanks.
>>>
>>> 16/06/04 00:58:42 INFO mapreduce.Job:  map 0% reduce 0%
>>> 16/06/04 00:59:27 INFO mapreduce.Job: Task Id : attempt_1464314319848_0309_m_000000_0, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: t00
>>>          at java.net.URL.<init>(URL.java:603)
>>>          at java.net.URL.<init>(URL.java:493)
>>>          at java.net.URL.<init>(URL.java:442)
>>>          at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>>>          at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>>>          at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>>>          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>>          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>>          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>          at java.security.AccessController.doPrivileged(Native Method)
>>>          at javax.security.auth.Subject.doAs(Subject.java:415)
>>>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>> 16/06/04 01:00:14 INFO mapreduce.Job: Task Id : attempt_1464314319848_0309_m_000000_1, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: t00
>>>          at java.net.URL.<init>(URL.java:603)
>>>          at java.net.URL.<init>(URL.java:493)
>>>          at java.net.URL.<init>(URL.java:442)
>>>          at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>>>          at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>>>          at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>>>          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>>          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>>          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>          at java.security.AccessController.doPrivileged(Native Method)
>>>          at javax.security.auth.Subject.doAs(Subject.java:415)
>>>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>> 16/06/04 01:00:42 INFO mapreduce.Job: Task Id : attempt_1464314319848_0309_m_000001_0, Status : FAILED
>>> Error: java.net.MalformedURLException: unknown protocol: data
>>>
>>> I use Apache Nutch 2.3.1 with HBase as the backend.
>>> Regards,
>>>
>>>
>