Hi Jarcec,

Perfect solution. Thank you very much!
Cheers,
Christian

On Sat, Apr 6, 2013 at 6:05 AM, Jarek Jarcec Cecho <[email protected]> wrote:

> Hi Christian,
> thank you very much for sharing the log, and please accept my apologies
> for the late response.
>
> Looking closely at your exception, I can confirm that it is the S3 file
> system that is creating the files in /tmp, not Sqoop itself:
>
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_SUCCESS' writing to tempfile '/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp'
>
> Taking a brief look at the source code [1], it seems that it is the
> method newBackupFile(), defined on line 195, that is responsible for
> creating the temporary file, and that its behaviour can be altered via
> the fs.s3.buffer.dir property. Would you mind trying it in your Sqoop
> execution?
>
>   sqoop import -Dfs.s3.buffer.dir=/custom/path ...
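>
> For example, a minimal sketch (untested; /data/sqoop-s3-buffer is only a
> placeholder for a directory on a partition with enough free space, and
> the connection details are the ones from your original command):
>
>   # Create a buffer directory on a big disk, writable by the user
>   # running Sqoop (your log suggests that is "jenkins"):
>   sudo mkdir -p /data/sqoop-s3-buffer
>   sudo chown jenkins:jenkins /data/sqoop-s3-buffer
>
>   # Generic -D options must come directly after "import", before the
>   # tool-specific arguments:
>   sqoop import -Dfs.s3.buffer.dir=/data/sqoop-s3-buffer \
>     --connect jdbc:mysql://server:port/db \
>     --username user --password pass \
>     --table tablename \
>     --target-dir s3n://xyz@somehwere/a/b/c \
>     --fields-terminated-by='\001' -m 1 --direct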
>
> I've also noticed that you're using the LocalJobRunner, which suggests
> that Sqoop is executing all jobs locally on your machine and not on
> your Hadoop cluster. I would recommend checking your Hadoop
> configuration in case your intention is to run the data transfer in
> parallel.
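>
> A quick way to check (a sketch, assuming an MRv1 client configuration
> in /etc/hadoop/conf; the JobTracker host and port are placeholders):
>
>   # Jobs fall back to the LocalJobRunner when mapred.job.tracker is
>   # unset or set to "local" (under YARN the analogous property is
>   # mapreduce.framework.name):
>   grep -A 1 mapred.job.tracker /etc/hadoop/conf/mapred-site.xml
>
>   # As a one-off test, point the job at the JobTracker explicitly:
>   sqoop import -Dmapred.job.tracker=jobtracker.example.com:8021 ...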
>
> Jarcec
>
> Links:
> 1: http://hadoop.apache.org/docs/r2.0.3-alpha/api/src-html/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
>
> On Tue, Apr 02, 2013 at 11:38:35AM +0100, Christian Prokopp wrote:
> > Hi Jarcec,
> >
> > I am running the command on the CLI of a cluster node. It appears to
> > run a local MR job, writing the results to /tmp before sending them
> > to S3:
> >
> > [..]
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: Beginning mysqldump fast path import
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: Performing import of table image from database some_db
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: Converting data to use specified delimiters.
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: (For the fastest possible import, use
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: --mysql-delimiters to specify the same field
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: delimiters as are used by mysqldump.)
> > [hostaddress] out: 13/04/02 01:52:54 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:52:55 INFO mapred.JobClient: map 100% reduce 0%
> > [hostaddress] out: 13/04/02 01:52:57 INFO mapred.LocalJobRunner:
> > [..]
> > [hostaddress] out: 13/04/02 01:53:03 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper: Transfer loop complete.
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper: Transferred 668.9657 MB in 113.0105 seconds (5.9195 MB/sec)
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:42 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000' closed. Now beginning upload
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:45 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:55:31 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000' upload complete
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task:attempt_local555455791_0001_m_000000_0 is done. And is in the process of commiting
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task attempt_local555455791_0001_m_000000_0 is allowed to commit now
> > [hostaddress] out: 13/04/02 01:55:36 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:56:03 WARN output.FileOutputCommitter: Failed to delete the temporary output directory of task: attempt_local555455791_0001_m_000000_0 - s3n://secret@bucketsomewhere/some_table/_temporary/_attempt_local555455791_0001_m_000000_0
> > [hostaddress] out: 13/04/02 01:56:03 INFO output.FileOutputCommitter: Saved output of task 'attempt_local555455791_0001_m_000000_0' to s3n://secret@bucketsomewhere/some_table
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.Task: Task 'attempt_local555455791_0001_m_000000_0' done.
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Finishing task: attempt_local555455791_0001_m_000000_0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Map task executor complete.
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_SUCCESS' writing to tempfile '/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp'
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_SUCCESS' closed. Now beginning upload
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem: OutputStream for key 'some_table/_SUCCESS' upload complete
> > [...deleting cached jars...]
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Job complete: job_local555455791_0001
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Counters: 23
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   File System Counters
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of bytes read=6471451
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of bytes written=6623109
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE: Number of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of bytes read=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of bytes written=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS: Number of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of bytes read=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of bytes written=773081963
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   Map-Reduce Framework
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map input records=1
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map output records=14324124
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Input split bytes=87
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Spilled Records=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     CPU time spent (ms)=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Total committed heap usage (bytes)=142147584
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 201.4515 seconds (0 bytes/sec)
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase: Retrieved 14324124 records.
> >
> > On Thu, Mar 28, 2013 at 9:49 PM, Jarek Jarcec Cecho <[email protected]> wrote:
> >
> > > Hi Christian,
> > > would you mind describing the behaviour you're observing in a bit
> > > more detail?
> > >
> > > Sqoop should touch /tmp only on the machine where you executed it,
> > > for generating and compiling code (<1 MB!). The data transfer
> > > itself is done on your Hadoop cluster from within a MapReduce job,
> > > and the output is stored directly in your destination folder. I'm
> > > not familiar with the S3 file system implementation, but could it
> > > be the S3 library that is storing the data in /tmp?
> > >
> > > Jarcec
> > >
> > > On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> > > > Thanks for the idea, Alex. I considered this, but it would mean
> > > > changing my cluster setup for Sqoop (a last-resort solution).
> > > > I'd much rather point Sqoop at existing large disks.
> > > >
> > > > Cheers,
> > > > Christian
> > > >
> > > > On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz <[email protected]> wrote:
> > > >
> > > > > You could mount a bigger disk into /tmp - or symlink /tmp to
> > > > > another directory which has enough space.
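> > > > >
> > > > > For example, a rough sketch (untested; /data/bigdisk is a
> > > > > placeholder, and on a live node this should be done with the
> > > > > services that use /tmp stopped):
> > > > >
> > > > >   # Option 1: bind-mount a directory on a bigger disk over /tmp
> > > > >   sudo mkdir -p /data/bigdisk/tmp
> > > > >   sudo chmod 1777 /data/bigdisk/tmp  # world-writable + sticky bit, like /tmp
> > > > >   sudo mount --bind /data/bigdisk/tmp /tmp
> > > > >
> > > > >   # Option 2: replace /tmp with a symlink to the bigger disk
> > > > >   sudo mv /tmp /tmp.orig
> > > > >   sudo ln -s /data/bigdisk/tmp /tmp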
> > > > >
> > > > > Best
> > > > > - Alex
> > > > >
> > > > > On Mar 28, 2013, at 4:35 PM, Christian Prokopp <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am using Sqoop to copy data from MySQL to S3:
> > > > > >
> > > > > > (Sqoop 1.4.2-cdh4.2.0)
> > > > > > $ sqoop import --connect jdbc:mysql://server:port/db --username user --password pass --table tablename --target-dir s3n://xyz@somehwere/a/b/c --fields-terminated-by='\001' -m 1 --direct
> > > > > >
> > > > > > My problem is that Sqoop temporarily stores the data in /tmp,
> > > > > > which is not big enough for the data. I am unable to find a
> > > > > > configuration option to point Sqoop to a bigger
> > > > > > partition/disk. Any suggestions?
> > > > > >
> > > > > > Cheers,
> > > > > > Christian
> > > > >
> > > > > --
> > > > > Alexander Alten-Lorenz
> > > > > http://mapredit.blogspot.com
> > > > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > Christian Prokopp
> > > > Data Scientist, PhD
> > > > Rangespan Ltd. <http://www.rangespan.com/>
> >
> > --
> > Best regards,
> >
> > Christian Prokopp
> > Data Scientist, PhD
> > Rangespan Ltd. <http://www.rangespan.com/>

--
Best regards,

Christian Prokopp
Data Scientist, PhD
Rangespan Ltd. <http://www.rangespan.com/>
