Hey Chintu,

Thanks much. One more thing you could try to speed things up:

1. Subclass LocalDataTransfer, or roll your own class -- call it
MoveBasedLocalDataTransfer
2. In that class, replace the calls to FileUtils.copyFile or
FileUtils.moveFile with calls to ExecHelper.execute("cp ...") and ("mv ...")
3. In your calls to the crawler, pass --dataTransferFactory for your new
MoveBased... one

See if that improves it at all. If you want, file a JIRA issue too and I
could try and wire up such a transferer for you.
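
A minimal sketch of steps 1 and 2 above, under some loud assumptions: it
does not implement OODT's actual data transfer interface, and it uses the
JDK's ProcessBuilder as a stand-in for ExecHelper, whose exact signature is
not shown in this thread. The method names moveFile and copyFile are
illustrative only. It just shows the idea of shelling out to "mv" and "cp"
on a UNIX-like system:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/**
 * Sketch only: NOT wired into OODT. A real version would plug into the
 * File Manager's data transfer mechanism and use ExecHelper.
 */
public class MoveBasedLocalDataTransfer {

    /** Runs a command, waits for it, and returns its exit status. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).inheritIO().start();
        return p.waitFor();
    }

    /** Moves src to dest by invoking the system "mv" command. */
    public static void moveFile(Path src, Path dest)
            throws IOException, InterruptedException {
        transfer("mv", src, dest);
    }

    /** Copies src to dest by invoking the system "cp" command. */
    public static void copyFile(Path src, Path dest)
            throws IOException, InterruptedException {
        transfer("cp", src, dest);
    }

    private static void transfer(String tool, Path src, Path dest)
            throws IOException, InterruptedException {
        // Make sure the archive directory exists before calling the tool.
        Path parent = dest.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // "--" stops option parsing, so file names starting with "-" are safe.
        int exit = run(Arrays.asList(tool, "--", src.toString(), dest.toString()));
        if (exit != 0) {
            throw new IOException(tool + " exited with status " + exit + " for " + src);
        }
    }
}
```

Passing the arguments as a list (rather than one "mv a b" string) avoids
shell quoting problems with odd file names. Whether this beats the
Java-level copy on GPFS is something only a timing test will show.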

Thanks!

Cheers,
Chris

On 12/14/12 5:23 AM, "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES
AND SERVICES INC]" <chintu.mis...@nasa.gov> wrote:

>Thank you for the example.
>
>In our case, the file size can vary from 10K to 200MB. About 12000 files
>make up the 262GB of data.
>
>We are using IBM GPFS for our storage, which is supposed to be faster for
>this kind of activity. The parallel performance that we are seeing in our
>test case is far from what an isolated filesystem test shows (with simple
>copy and move without OODT).
>
>So far the best combination I could find is to use "move" and keep an
>almost 1:1 ratio of FM and CR. Will still dig more into it.
>
>Thanks
>--
>Chintu Mistry
>NASA Goddard Space Flight Center
>Bldg L40B, Room S776
>Office: 240 684 0477
>Mobile: 770 310 1047
>
>From: Cameron Goodale <good...@apache.org>
>Date: Friday, December 14, 2012 12:03 AM
>To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>Cc: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
><chintu.mis...@nasa.gov>
>Subject: Re: OODT 0.3 branch
>
>Chintu,
>
>I see that your test data volume is 262GB, but I am curious about the
>makeup of the data.  On average, what is your file size and how many
>files are there?
>
>The reason I ask is because the process of extraction and ingestion can
>vary wildly.  On the LMMP project I was ingesting 12GB DEMs over NFS and
>it was a slow process.  It was basically serial with 1CR+1FM, but we
>didn't have a requirement to push large volumes of data.
>
>On our recent Snow Data System I am processing 160 workflow jobs in
>parallel and OODT could handle the load; it turned out the filesystem was
>our major bottleneck.  We used a SAN initially when doing development,
>but when we increased the number of jobs in parallel the I/O became so
>bad we moved to GlusterFS. GlusterFS had speed improvements over the SAN,
>but we had to be careful about heavy writing, moving, and deleting since
>the clustering would try to replicate the data.  Turns out Gluster is
>great for heavy writing OR heavy reading, but not both at the same time.
>Finally we are using NAS and it works great.
>
>My point is the file system plays a major role in performance when
>ingesting data.  The ultimate speed test would be if you could actually
>write the data into the final archive directory and basically do an
>ingestion in place (skip data transfer entirely), but I know that is
>rarely possible.
>
>This is an interesting challenge to see what configuration will yield the
>best throughput/performance.  I look forward to hearing more about your
>progress on this.
>
>
>Best Regards,
>
>
>
>Cameron
>
>
>On Wed, Dec 12, 2012 at 7:28 PM, Mattmann, Chris A (388J)
><chris.a.mattm...@jpl.nasa.gov> wrote:
>Hi Chintu,
>
>From: "Mistry, Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC]
>(GSFC-586.0)" <chintu.mis...@nasa.gov>
>Date: Wednesday, December 12, 2012 12:02 PM
>To: jpluser <chris.a.mattm...@jpl.nasa.gov>,
>"dev@oodt.apache.org" <dev@oodt.apache.org>
>Subject: Re: OODT 0.3 branch
>
>If you are saying that FM can handle multiple connections at one time,
>
>Yep I'm saying that it can.
>
>then multiple crawlers pointing to same FM should increase performance
>significantly.
>
>Well that really depends to be honest. It sounds like you guys are
>hitting an IO bottleneck potentially in data transfer? What file sizes
>are you transferring? If you are IO bound on the data transfer part, the
>product isn't fully ingested until:
>
>
>  1.  its entry is added to the catalog
>  2.  the data transfer finishes
>
>Are you checking the FM for status along the way? Also realize that the
>FM will never be faster than the file system, so if it takes the file
>system X minutes to transfer a file F1, Y to transfer F2, and Z to
>transfer F3, then you still have to wait at least max(X,Y,Z) minutes for
>the 3 ingestions to complete, regardless of how many FMs you run.
>
>But that's not what we saw in our tests.
>
>For example,
>I saw barely a 2-minute performance difference between 2FM-6CR and 3FM-6CR.
>
>1) 2 hours  6 minutes to process 262G   (1FM 3CR - 3CR to 1FM)
>2) 1 hour  58 minutes to process 262G   (1FM 6CR - 6CR to 1FM)
>3) 1 hour  39 minutes to process 262G   (2FM 6CR - 3CR to 1FM)
>4) 1 hour  39 minutes to process 262G   (2FM 9CR - 4+CR to 1FM)
>5) 1 hour  37 minutes to process 262G   (3FM 9CR - 3CR to 1FM)
>6) 2 hours            to process 262G   (3FM 20CR - 6+CR to 1FM)
>7)         28 minutes to process 262G   (6FM 9CR - 1+CR to 1FM)   => This is
>my latest test, and this is a good number.
>
>What would be interesting is simply looking at the speed for how long it
>takes to cp the files (which I bet is what's happening) versus mv'ing the
>files by hand. If mv is faster, I'd:
>
>
>  1.  Implement a Data Transfer implementation that simply replaces the
>calls to FileUtils.copyFile or .moveFile with system calls (see ExecHelper
>from oodt-commons) to the UNIX equivalents.
>  2.  Plug that data transfer into your crawler invocations via the cmd
>line.
>
>HTH!
>
>Cheers,
>Chris
>
>
>From: "Mattmann, Chris A" <chris.a.mattm...@jpl.nasa.gov>
>Date: Wednesday, December 12, 2012 2:51 PM
>To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
><chintu.mis...@nasa.gov>,
>"dev@oodt.apache.org" <dev@oodt.apache.org>
>Subject: Re: OODT 0.3 branch
>
>Hey Chintu,
>
>From: "Mistry, Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC]
>(GSFC-586.0)" <chintu.mis...@nasa.gov>
>Date: Tuesday, December 11, 2012 2:41 PM
>To: jpluser <chris.a.mattm...@jpl.nasa.gov>,
>"dev@oodt.apache.org" <dev@oodt.apache.org>
>Subject: Re: OODT 0.3 branch
>
>Answers inline below.
>
>---snip
>
>Gotcha, so you are using different product types. So, each crawler is
>crawling various product types in each one of the staging area dirs, that
>looks like e.g.,
>
>/STAGING_AREA_BASE
>  /dir1 - 1st crawler
>   - file1 of product type 1
>   - file2 of product type 3
>
>  /dir2 - 2nd crawler
>   - file3 of product type 3
>
>  /dir3 - 3rd crawler
>   - file4 of product type 2
>
>Is that what the staging area looks like? - YES
>
>And then your FM is ingesting all 3 product types (I just picked 3
>arbitrarily; it could have been N) into:
>
>ARCHIVE_BASE/{ProductTypeName}/{YYYYMMDD}
>
>Correct?  - YES
>
>If so, I would imagine that using FM1, FM2, and FM3 would actually speed
>up the ingestion process compared to just using 1 FM with 1, 2, or 3
>crawlers all talking to it.
>
>Let me ask a few more questions:
>
>Do you see e.g., in the above example that file4 is ingested before
>file2? What about file3 before file2? If not, there is something wiggy
>going on.
>       - I have not checked that. I guess I can check that. Can FM handle
>multiple connections at the same time?
>
>
>Yep FM can handle multiple connections at one time up to a limit (I think
>hard defaulted to ~100-200 by the underlying XMLRPC 2.1 library). We're
>using an old library currently but have a goal to upgrade to the latest
>version where I think this # is configurable.
>
>Cheers,
>Chris
>
>
