Re: No space left on device

2012-05-28 Thread yingnan.ma
OK, I found it: the JobTracker server's local disk is full.
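
For anyone who hits the same thing, a quick way to confirm which local disk is filling up is to check the usable space of the directories the JobTracker and TaskTrackers write to. The following is only a minimal sketch in plain Java; the paths are placeholders, so substitute the hadoop.tmp.dir, mapred.local.dir and job history locations from your own configuration.

    import java.io.File;

    // Minimal sketch: report free vs. total space for a list of local
    // directories. The default paths below are placeholders only.
    public class LocalDiskCheck {
        public static void main(String[] args) {
            String[] dirs = args.length > 0 ? args
                    : new String[] { "/tmp/hadoop", "/var/lib/hadoop/mapred/local" };
            for (String dir : dirs) {
                File f = new File(dir);
                long totalGb = f.getTotalSpace() / (1024L * 1024 * 1024);
                long freeGb = f.getUsableSpace() / (1024L * 1024 * 1024);
                System.out.printf("%s: %d GB free of %d GB%n", dir, freeGb, totalGb);
            }
        }
    }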


2012-05-28 



yingnan.ma 



From: yingnan.ma 
Sent: 2012-05-28 13:01:56 
To: common-user 
Cc: 
Subject: No space left on device 
 
Hi,
I am encountering the following problem:
 Error - Job initialization failed:
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:201)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:348)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
    at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1344)
..
So I think that HDFS is full or something similar, but I cannot find a way to 
address the problem. If you have any suggestions, please let me know. Thank you.
Best Regards


RE: Splunk + Hadoop

2012-05-28 Thread Shreya.Pal
Hi Abhishek,

I am looking at a scenario where a customer representative needs to respond 
to customers while on a call.
They need to search over huge data sets and respond within a few seconds.

Thanks and Regards,
Shreya Pal
Architect Technology
Cognizant Technology Pvt Ltd
Vnet - 205594
Mobile - +91-9766310680


-Original Message-
From: Abhishek Pratap Singh [mailto:manu.i...@gmail.com]
Sent: Tuesday, May 22, 2012 2:44 AM
To: common-user@hadoop.apache.org
Subject: Re: Splunk + Hadoop

I have used both Hadoop and Splunk. Can you please let me know what your 
requirement is?
Real-time processing with Hadoop depends on what defines "real time" in your 
particular scenario. Depending on the requirement, real time (or near real time) 
can be achieved.

~Abhishek

On Fri, May 18, 2012 at 3:58 PM, Russell Jurney russell.jur...@gmail.com wrote:

 Because that isn't Cube.

 Russell Jurney
 twitter.com/rjurney
 russell.jur...@gmail.com
 datasyndrome.com

 On May 18, 2012, at 2:01 PM, Ravi Shankar Nair
 ravishankar.n...@gmail.com wrote:

  Why not HBase with Hadoop?
  It's the best bet.
  Rgds, Ravi
 
  Sent from my Beethoven
 
 
  On May 18, 2012, at 3:29 PM, Russell Jurney russell.jur...@gmail.com wrote:
 
   I'm playing with using Hadoop and Pig to load MongoDB with data for Cube to consume. Cube (https://github.com/square/cube/wiki) is a realtime tool... but we'll be replaying events from the past. Does that count? It is nice to batch backfill metrics into 'real-time' systems in bulk.
 
  On Fri, May 18, 2012 at 12:11 PM, shreya@cognizant.com wrote:
 
   Hi,
  
   Has anyone used Hadoop and Splunk, or any other real-time processing tool over Hadoop?
 
  Regards,
  Shreya
 
 
 
 
  Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
 datasyndrome.com



Re: Splunk + Hadoop

2012-05-28 Thread Nitin Pawar
Hi Shreya,

If you are looking at data locality, then you may or may not be able to use Hadoop
out of the box.
It will all depend on how you design the data layout on top of HDFS and how
you implement search over the customer queries.

A good idea might be to put an intermediate queryable database such as MySQL in
between, where you store the results of the data processed on Hadoop, and
then use a Solr index on top for fast access and search.
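
For example (just a rough sketch, assuming the processed results have already been indexed into Solr and that a SolrJ version providing HttpSolrServer is on the classpath; the host, core and field names below are placeholders, not anything from an existing setup):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class CustomerLookup {
        public static void main(String[] args) throws SolrServerException {
            // Placeholder URL and core name; point this at your own Solr instance.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/customers");

            // Fetch the pre-aggregated records for one customer, computed by the
            // Hadoop jobs and indexed afterwards.
            SolrQuery query = new SolrQuery("customer_id:12345");
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("summary")); // placeholder field
            }
        }
    }

This keeps the interactive, seconds-level search off Hadoop entirely; the MapReduce jobs only refresh the index in batch.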

Thanks,
Nitin

On Mon, May 28, 2012 at 12:41 PM, shreya@cognizant.com wrote:





-- 
Nitin Pawar


Help with DFSClient Exception.

2012-05-28 Thread akshaymb

Hi,

We are frequently observing the exception

  java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2. Giving up.

on our cluster. The exception occurs while writing a file. We are using
Hadoop 0.20.2. It is a ~250-node cluster, and on average one box goes down every
3 days.

Detailed stack trace:
12/05/27 23:26:54 INFO mapred.JobClient: Task Id : attempt_201205232329_28133_r_02_0, Status : FAILED
java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2. Giving up.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
    at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Our investigation:
We have the min replication factor set to 2. As mentioned here
(http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html): "A call
to complete() will not return true until all the file's blocks have been
replicated the minimum number of times. Thus, DataNode failures may cause a
client to call complete() several times before succeeding"; in other words, the
client should retry complete() several times.
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() does call
complete() and retries it up to 20 times, but in spite of that the file
blocks are not replicated the minimum number of times. The retry count is not
configurable. Changing the min replication factor to 1 is also not a good idea,
since jobs are continuously running on our cluster.
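
For files that our own code writes to HDFS directly, one workaround we are considering at the application level is to rewrite the whole file when the final close() gives up, in the hope that the new write pipeline avoids the failed DataNode. This is only a sketch, not a verified fix; the helper below is ours, not part of Hadoop.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class RetryingWriter {
        // Rewrite the whole file when close() (and therefore complete()) gives up.
        public static void writeWithRetries(FileSystem fs, Path path, byte[] data,
                                            int maxAttempts) throws IOException {
            IOException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                FSDataOutputStream out = null;
                try {
                    out = fs.create(path, true);  // overwrite any partial file
                    out.write(data);
                    out.close();                  // complete() retries happen here
                    return;
                } catch (IOException e) {
                    last = e;
                    IOUtils.closeStream(out);     // best-effort cleanup of the failed stream
                    fs.delete(path, false);       // drop the partial file before retrying
                }
            }
            throw last != null ? last : new IOException("no write attempts were made");
        }
    }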

Is there a better solution or workaround for this problem?

What min replication factor is generally used in industry?

Let me know if any further input is required.

Thanks,
-Akshay



-- 
View this message in context: 
http://old.nabble.com/Help-with-DFSClient-Exception.-tp33918949p33918949.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Help with DFSClient Exception.

2012-05-28 Thread Nitin Pawar
What's the block size?
Also, are you experiencing any slowness in the network?

I am guessing you are using EC2.

These issues normally come with network problems.

On Mon, May 28, 2012 at 3:57 PM, akshaymb akshaybhara...@gmail.com wrote:






-- 
Nitin Pawar


Re: No space left on device

2012-05-28 Thread Marcos Ortiz

Do you have the JobTracker and the NameNode on the same node?
Look at Lars Francke's post here:
http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html
It is a very good walkthrough of how to install Hadoop manually; look at the
configuration he used for the name and data directories.
If these directories are on the same disk and you don't have enough
space for them, you can hit that exception.


My recommendation is to split these directories across separate disks, with a
layout very similar to Lars's configuration.

Another recommendation is to check the Hadoop logs. Read about that here:
http://www.cloudera.com/blog/2010/11/hadoop-log-location-and-retention/

Regards

On 05/28/2012 02:20 AM, yingnan.ma wrote:



--
Marcos Luis Ortíz Valmaseda
 Data Engineer & Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




HBase (BigTable) many to many with students and courses

2012-05-28 Thread Em
Hello list,

I have some time now to try out HBase and want to use it for a private
project.

Questions like "How do I transfer one-to-many or many-to-many relations
from my RDBMS schema to HBase?" seem to be common.

I hope we can pull together all the best practices that are out there in this
thread.

As the wiki states, one should create two tables: one for students, another
for courses.

In the students table, one adds one column per selected course, keyed by
the course_id, alongside some columns for the student itself
(name, birthday, sex, etc.).

In the courses table, one adds one column per enrolled student_id,
alongside some columns that describe the course itself (name,
teacher, begin, end, year, location, etc.).
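
A rough sketch of creating that schema with the HBase Java client follows. I am assuming the 0.9x-era admin API here, and the family names "info", "courses" and "students" are only placeholders for whatever one actually picks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateSchema {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // "info" holds each entity's own attributes; "courses"/"students"
            // hold the cross-references, one qualifier per related row key.
            HTableDescriptor students = new HTableDescriptor("students");
            students.addFamily(new HColumnDescriptor("info"));
            students.addFamily(new HColumnDescriptor("courses"));
            admin.createTable(students);

            HTableDescriptor courses = new HTableDescriptor("courses");
            courses.addFamily(new HColumnDescriptor("info"));
            courses.addFamily(new HColumnDescriptor("students"));
            admin.createTable(courses);
        }
    }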

So far, so good.

How do I access these tables efficiently?

A common case would be to show all courses for a given student.

To do so, one accesses the students table and reads all of the student's
course columns.
Let's say their qualifiers are prefixed course ids: one strips the prefix and
then accesses the courses table to get all the courses and their
metadata (name, teacher, location, etc.).

How do I do this kind of operation efficiently?
The naive, brute-force approach seems to be using one Get object per
course and fetching the necessary data.
Another approach seems to be to use the HTable class and leverage
multigets via the batch() method.
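
Based on the layout above, my current idea of the read path looks like this (just a sketch: the row keys, family names and the "course_" prefix are assumptions, and I use HTable.get(List<Get>) for the multiget; batch() should work similarly):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CoursesPerStudent {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable students = new HTable(conf, "students");
            HTable courses = new HTable(conf, "courses");

            // 1. Read the student row and collect the referenced course row keys.
            //    Qualifiers in the "courses" family are assumed to look like "course_<id>".
            Result studentRow = students.get(new Get(Bytes.toBytes("student_42")));
            List<Get> courseGets = new ArrayList<Get>();
            NavigableMap<byte[], byte[]> courseCols =
                    studentRow.getFamilyMap(Bytes.toBytes("courses"));
            if (courseCols != null) {
                for (byte[] qualifier : courseCols.keySet()) {
                    String courseId = Bytes.toString(qualifier).replaceFirst("^course_", "");
                    courseGets.add(new Get(Bytes.toBytes(courseId)));
                }
            }

            // 2. Fetch all course rows in one multiget instead of one RPC per Get.
            Result[] courseRows = courses.get(courseGets);
            for (Result course : courseRows) {
                byte[] name = course.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(name == null ? "(missing)" : Bytes.toString(name));
            }
        }
    }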

All of the information above, including the sketch, is theoretical; I have not
run any of it yet (I am currently learning more about the fundamentals of HBase).

That's why I am putting the question to you: how do you handle this kind of
operation with HBase?

Kind regards,
Em


Eclipse Plugin removed from contrib folder, where to find it?

2012-05-28 Thread Varad Meru
Hi All,

I just downloaded hadoop-1.0.3 from Apache's download page, but to my surprise I 
could not find the Eclipse plugin that normally ships with Hadoop in the 
contrib folder. I could find the source for building the Hadoop Eclipse 
plugin in the src/contrib/eclipse-plugin folder, but building it with ant 
did not produce any JAR to work with. 

I have plugins from the 0.20.203 releases, but in my opinion they won't work due 
to the new API support. Also, the plugin found at 
http://code.google.com/edu/parallel/tools/hadoopvm/hadoop-eclipse-plugin.jar 
supports a very old version of Hadoop. How would a newbie get started with the 
Eclipse plugin in the 1.0.x era?

Please let me know if I am doing the steps right or missing something.

Thanks in Advance,
Varad