Hi Jarcec,

Perfect solution. Thank you very much!
Cheers,
Christian

On Sat, Apr 6, 2013 at 6:05 AM, Jarek Jarcec Cecho <[email protected]> wrote:

> Hi Christian,
> thank you very much for sharing the log, and please accept my apologies
> for the late response.
>
> Looking closely at your exception, I can confirm that it is the S3 file
> system that is creating the files in /tmp, not Sqoop itself:
>
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_SUCCESS' writing to tempfile '/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp'
>
> Taking a brief look at the source code [1], it seems that it is the
> method newBackupFile(), defined on line 195, that is responsible for
> creating the temporary file, and that its behaviour can be altered via
> the fs.s3.buffer.dir property. Would you mind trying it in your Sqoop
> execution?
>
>   sqoop import -Dfs.s3.buffer.dir=/custom/path ...
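>
> For example, a minimal sketch (untested; /data/sqoop-s3-buffer is only a
> placeholder for a directory on a partition with enough free space, and
> the connection details are the ones from your original command):
>
>   # Create a buffer directory on a big disk, writable by the user
>   # running Sqoop (your log suggests that is "jenkins"):
>   sudo mkdir -p /data/sqoop-s3-buffer
>   sudo chown jenkins:jenkins /data/sqoop-s3-buffer
>
>   # Generic -D options must come directly after "import", before the
>   # tool-specific arguments:
>   sqoop import -Dfs.s3.buffer.dir=/data/sqoop-s3-buffer \
>     --connect jdbc:mysql://server:port/db \
>     --username user --password pass \
>     --table tablename \
>     --target-dir s3n://xyz@somehwere/a/b/c \
>     --fields-terminated-by='\001' -m 1 --direct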
>
> I've also noticed that you're using the LocalJobRunner, which suggests
> that Sqoop is executing all jobs locally on your machine and not on
> your Hadoop cluster. I would recommend checking your Hadoop
> configuration in case your intention is to run the data transfer in
> parallel.
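>
> A quick way to check (a sketch, assuming an MRv1 client configuration
> in /etc/hadoop/conf; the JobTracker host and port are placeholders):
>
>   # Jobs fall back to the LocalJobRunner when mapred.job.tracker is
>   # unset or set to "local" (under YARN the analogous property is
>   # mapreduce.framework.name):
>   grep -A 1 mapred.job.tracker /etc/hadoop/conf/mapred-site.xml
>
>   # As a one-off test, point the job at the JobTracker explicitly:
>   sqoop import -Dmapred.job.tracker=jobtracker.example.com:8021 ...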
>
> Jarcec
>
> Links:
> 1: http://hadoop.apache.org/docs/r2.0.3-alpha/api/src-html/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
>
> On Tue, Apr 02, 2013 at 11:38:35AM +0100, Christian Prokopp wrote:
> > Hi Jarcec,
> >
> > I am running the command on the CLI of a cluster node. It appears to
> > run a local MR job, writing the results to /tmp before sending them
> > to S3:
> >
> > [..]
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: Beginning mysqldump fast path import
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: Performing import of table image from database some_db
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: Converting data to use specified delimiters.
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: (For the fastest possible import, use
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: --mysql-delimiters to specify the same field
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: delimiters as are used by mysqldump.)
> > [hostaddress] out: 13/04/02 01:52:54 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:52:55 INFO mapred.JobClient: map 100% reduce 0%
> > [hostaddress] out: 13/04/02 01:52:57 INFO mapred.LocalJobRunner:
> > [..]
> > [hostaddress] out: 13/04/02 01:53:03 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper: Transfer loop complete.
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper: Transferred 668.9657 MB in 113.0105 seconds (5.9195 MB/sec)
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:42 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000' closed. Now beginning upload
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:45 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:55:31 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000' upload complete
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task:attempt_local555455791_0001_m_000000_0 is done. And is in the process of commiting
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task attempt_local555455791_0001_m_000000_0 is allowed to commit now
> > [hostaddress] out: 13/04/02 01:55:36 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:56:03 WARN output.FileOutputCommitter: Failed to delete the temporary output directory of task: attempt_local555455791_0001_m_000000_0 - s3n://secret@bucketsomewhere/some_table/_temporary/_attempt_local555455791_0001_m_000000_0
> > [hostaddress] out: 13/04/02 01:56:03 INFO output.FileOutputCommitter: Saved output of task 'attempt_local555455791_0001_m_000000_0' to s3n://secret@bucketsomewhere/some_table
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.Task: Task 'attempt_local555455791_0001_m_000000_0' done.
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Finishing task: attempt_local555455791_0001_m_000000_0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Map task executor complete.
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_SUCCESS' writing to tempfile '/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp'
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_SUCCESS' closed. Now beginning upload
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_SUCCESS' upload complete
> > [...deleting cached jars...]
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Job complete: job_local555455791_0001
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Counters: 23
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   File System Counters
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of bytes read=6471451
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of bytes written=6623109
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of bytes read=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of bytes written=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of bytes read=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of bytes written=773081963
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   Map-Reduce Framework
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map input records=1
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map output records=14324124
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Input split bytes=87
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Spilled Records=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     CPU time spent (ms)=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Total committed heap usage (bytes)=142147584
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 201.4515 seconds (0 bytes/sec)
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase: Retrieved 14324124 records.
> >
> > On Thu, Mar 28, 2013 at 9:49 PM, Jarek Jarcec Cecho <[email protected]> wrote:
> >
> > > Hi Christian,
> > > would you mind describing the behaviour you're observing in a bit
> > > more detail?
> > >
> > > Sqoop should touch /tmp only on the machine where you executed it,
> > > for generating and compiling code (<1 MB!). The data transfer
> > > itself is done on your Hadoop cluster from within a MapReduce job,
> > > and the output is stored directly in your destination folder. I'm
> > > not familiar with the S3 file system implementation, but could it
> > > be the S3 library that is storing the data in /tmp?
> > >
> > > Jarcec
> > >
> > > On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> > > > Thanks for the idea, Alex. I considered this, but it would mean
> > > > changing my cluster setup for Sqoop (a last-resort solution).
> > > > I'd much rather point Sqoop at existing large disks.
> > > >
> > > > Cheers,
> > > > Christian
> > > >
> > > > On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz <[email protected]> wrote:
> > > >
> > > > > You could mount a bigger disk into /tmp - or symlink /tmp to
> > > > > another directory which has enough space.
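> > > > >
> > > > > For example, a rough sketch (untested; /data/bigdisk is a
> > > > > placeholder, and on a live node this should be done with the
> > > > > services that use /tmp stopped):
> > > > >
> > > > >   # Option 1: bind-mount a directory on a bigger disk over /tmp
> > > > >   sudo mkdir -p /data/bigdisk/tmp
> > > > >   sudo chmod 1777 /data/bigdisk/tmp  # world-writable + sticky bit, like /tmp
> > > > >   sudo mount --bind /data/bigdisk/tmp /tmp
> > > > >
> > > > >   # Option 2: replace /tmp with a symlink to the bigger disk
> > > > >   sudo mv /tmp /tmp.orig
> > > > >   sudo ln -s /data/bigdisk/tmp /tmp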
> > > > >
> > > > > Best
> > > > > - Alex
> > > > >
> > > > > On Mar 28, 2013, at 4:35 PM, Christian Prokopp <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am using Sqoop to copy data from MySQL to S3:
> > > > > >
> > > > > > (Sqoop 1.4.2-cdh4.2.0)
> > > > > > $ sqoop import --connect jdbc:mysql://server:port/db --username user --password pass --table tablename --target-dir s3n://xyz@somehwere/a/b/c --fields-terminated-by='\001' -m 1 --direct
> > > > > >
> > > > > > My problem is that Sqoop temporarily stores the data in /tmp,
> > > > > > which is not big enough for the data. I am unable to find a
> > > > > > configuration option to point Sqoop to a bigger
> > > > > > partition/disk. Any suggestions?
> > > > > >
> > > > > > Cheers,
> > > > > > Christian
> > > > >
> > > > > --
> > > > > Alexander Alten-Lorenz
> > > > > http://mapredit.blogspot.com
> > > > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > Christian Prokopp
> > > > Data Scientist, PhD
> > > > Rangespan Ltd. <http://www.rangespan.com/>
> >
> > --
> > Best regards,
> >
> > Christian Prokopp
> > Data Scientist, PhD
> > Rangespan Ltd. <http://www.rangespan.com/>

--
Best regards,

Christian Prokopp
Data Scientist, PhD
Rangespan Ltd. <http://www.rangespan.com/>
