Hi Christian,
thank you very much for sharing the log, and please accept my apologies for
the late response.

Looking closely at your log, I can confirm that it's the S3 file system that
is creating the files in /tmp, not Sqoop itself.

> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' writing to tempfile
> '/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp'

Taking a brief look at the source code [1], it seems that it's the method
newBackupFile(), defined on line 195, that is responsible for creating the
temporary file. It also seems that its behaviour can be altered via the
fs.s3.buffer.dir property. Would you mind trying it in your Sqoop execution?

  sqoop import -Dfs.s3.buffer.dir=/custom/path ...
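
If that works, you don't have to pass the property on every invocation - it
should also be possible to set it permanently in core-site.xml. Just a
sketch, assuming the usual configuration layout (the path below is a
placeholder, not taken from your environment):

  <!-- core-site.xml: buffer directory used by the native S3 file system -->
  <!-- /data/s3-buffer is a hypothetical path on a larger partition -->
  <property>
    <name>fs.s3.buffer.dir</name>
    <value>/data/s3-buffer</value>
  </property>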

I've also noticed that you're using the LocalJobRunner, which suggests that
Sqoop is executing all jobs locally on your machine rather than on your
Hadoop cluster. I would recommend checking your Hadoop configuration if your
intention is to run the data transfer in parallel.
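
A quick thing to check (assuming MRv1, which the mapred.JobClient lines in
your log suggest) is the mapred.job.tracker property in mapred-site.xml; if
it's missing or set to "local", Hadoop falls back to the LocalJobRunner.
Something along these lines, with a placeholder host and port:

  <!-- mapred-site.xml: point jobs at the real JobTracker -->
  <!-- jobtracker.example.com:8021 is a placeholder, not from your setup -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>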

Jarcec

Links:
1: 
http://hadoop.apache.org/docs/r2.0.3-alpha/api/src-html/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html

On Tue, Apr 02, 2013 at 11:38:35AM +0100, Christian Prokopp wrote:
> Hi Jarcec,
> 
> I am running the command on the CLI of a cluster node. It appears to run a
> local MR job writing the results to /tmp before sending it to S3:
> 
> [..]
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Beginning mysqldump fast path import
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Performing import of table image from database some_db
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Converting data to use specified delimiters.
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: (For
> the fastest possible import, use
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> --mysql-delimiters to specify the same field
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> delimiters as are used by mysqldump.)
> [hostaddress] out: 13/04/02 01:52:54 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:52:55 INFO mapred.JobClient:  map 100%
> reduce 0%
> [hostaddress] out: 13/04/02 01:52:57 INFO mapred.LocalJobRunner:
> [..]
> [hostaddress] out: 13/04/02 01:53:03 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
> Transfer loop complete.
> [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
> Transferred 668.9657 MB in 113.0105 seconds (5.9195 MB/sec)
> [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:42 INFO s3native.NativeS3FileSystem:
> OutputStream for key
> 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
> closed. Now beginning upload
> [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:45 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:55:31 INFO s3native.NativeS3FileSystem:
> OutputStream for key
> 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
> upload complete
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task:
> Task:attempt_local555455791_0001_m_000000_0 is done. And is in the process
> of commiting
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task
> attempt_local555455791_0001_m_000000_0 is allowed to commit now
> [hostaddress] out: 13/04/02 01:55:36 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:56:03 WARN output.FileOutputCommitter:
> Failed to delete the temporary output directory of task:
> attempt_local555455791_0001_m_000000_0 - s3n://secret@bucketsomewhere
> /some_table/_temporary/_attempt_local555455791_0001_m_000000_0
> [hostaddress] out: 13/04/02 01:56:03 INFO output.FileOutputCommitter: Saved
> output of task 'attempt_local555455791_0001_m_000000_0' to
> s3n://secret@bucketsomewhere/some_table
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.Task: Task
> 'attempt_local555455791_0001_m_000000_0' done.
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Finishing
> task: attempt_local555455791_0001_m_000000_0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Map task
> executor complete.
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' writing to tempfile
> '/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp'
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' closed. Now beginning upload
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' upload complete
> [...deleting cached jars...]
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Job complete:
> job_local555455791_0001
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Counters: 23
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   File System
> Counters
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of bytes read=6471451
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of bytes written=6623109
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of bytes read=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of bytes written=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of bytes read=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of bytes written=773081963
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   Map-Reduce
> Framework
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map input
> records=1
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map output
> records=14324124
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Input split
> bytes=87
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Spilled
> Records=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     CPU time
> spent (ms)=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Physical
> memory (bytes) snapshot=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Virtual
> memory (bytes) snapshot=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Total
> committed heap usage (bytes)=142147584
> [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
> Transferred 0 bytes in 201.4515 seconds (0 bytes/sec)
> [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
> Retrieved 14324124 records.
> 
> On Thu, Mar 28, 2013 at 9:49 PM, Jarek Jarcec Cecho <[email protected]> wrote:
> 
> > Hi Christian,
> > would you mind describing a bit more the behaviour you're observing?
> >
> > Sqoop should be touching /tmp only on the machine where you've executed it,
> > for generating and compiling code (<1MB!). The data transfer itself is done
> > on your Hadoop cluster from within a mapreduce job, and the output is
> > stored directly in your destination folder. I'm not familiar with the S3
> > file system implementation, but could it be the S3 library that is storing
> > the data in /tmp?
> >
> > Jarcec
> >
> > On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> > > Thanks for the idea, Alex. I considered this, but it would mean changing
> > > my cluster setup for Sqoop (a last-resort solution). I'd much rather
> > > point Sqoop to existing large disks.
> > >
> > > Cheers,
> > > Christian
> > >
> > >
> > > On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz
> > > <[email protected]> wrote:
> > >
> > > > You could mount a bigger disk on /tmp - or symlink /tmp to another
> > > > directory which has enough space.
> > > >
> > > > Best
> > > > - Alex
> > > >
> > > > On Mar 28, 2013, at 4:35 PM, Christian Prokopp
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am using sqoop to copy data from MySQL to S3:
> > > > >
> > > > > (Sqoop 1.4.2-cdh4.2.0)
> > > > > $ sqoop import --connect jdbc:mysql://server:port/db \
> > > > >     --username user --password pass --table tablename \
> > > > >     --target-dir s3n://xyz@somewhere/a/b/c \
> > > > >     --fields-terminated-by='\001' -m 1 --direct
> > > > >
> > > > > My problem is that Sqoop temporarily stores the data in /tmp, which
> > > > > is not big enough for the data. I am unable to find a configuration
> > > > > option to point Sqoop to a bigger partition/disk. Any suggestions?
> > > > >
> > > > > Cheers,
> > > > > Christian
> > > > >
> > > >
> > > > --
> > > > Alexander Alten-Lorenz
> > > > http://mapredit.blogspot.com
> > > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > *Christian Prokopp*
> > > Data Scientist, PhD
> > > Rangespan Ltd. <http://www.rangespan.com/>
> >
> 
> 
> 
> -- 
> Best regards,
> 
> *Christian Prokopp*
> Data Scientist, PhD
> Rangespan Ltd. <http://www.rangespan.com/>
