Hey Chintu,

Thanks much. One more thing you could try to speed things up:

1. Subclass, or roll your own LocalDataTransfer class -- call it
2. In that class, replace the calls to FileUtils.copyFile or
FileUtils.moveFile with calls to ExecHelper.execute("cp ...") and ("mv ...")
3. In your calls to the crawler, pass --dataTransferFactory for your new

See if that improves it at all. If you want, file a JIRA issue too and I
could try and wire up such a transferer for you.
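
Before building the transferer, it may be worth timing cp against mv directly on the storage mount. Below is a rough Python sketch of that check, not anything that ships with OODT. The function names (timed, compare_cp_mv), the scratch file names, and the 8 MB default size are made up for illustration, and it assumes a Unix-like system with cp and mv on the PATH.

```python
import os
import subprocess
import tempfile
import time


def timed(cmd):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def compare_cp_mv(staging_dir, archive_dir, size_bytes=8 * 1024 * 1024):
    """Time `cp` and `mv` of one scratch file from staging_dir to archive_dir."""
    # Hypothetical scratch file names; nothing here is an OODT convention.
    src = os.path.join(staging_dir, "transfer_test.dat")
    copied = os.path.join(archive_dir, "transfer_test_cp.dat")
    moved = os.path.join(archive_dir, "transfer_test_mv.dat")
    with open(src, "wb") as f:
        f.write(os.urandom(size_bytes))
    try:
        cp_seconds = timed(["cp", src, copied])
        mv_seconds = timed(["mv", src, moved])
    finally:
        # Clean up whatever scratch files exist, even if a command failed.
        for path in (src, copied, moved):
            if os.path.exists(path):
                os.remove(path)
    return cp_seconds, mv_seconds


if __name__ == "__main__":
    # Temp dirs stand in for the real staging and archive directories.
    with tempfile.TemporaryDirectory() as staging, \
            tempfile.TemporaryDirectory() as archive:
        cp_s, mv_s = compare_cp_mv(staging, archive)
        print(f"cp: {cp_s:.3f}s  mv: {mv_s:.3f}s")
```

To get meaningful numbers, point staging_dir and archive_dir at the real staging and archive locations and try sizes across your 10K to 200MB range. A single run on one file is easily skewed by caching, so repeat it a few times.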



On 12/14/12 5:23 AM, "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES
AND SERVICES INC]" <chintu.mis...@nasa.gov> wrote:

>Thank you for the example.
>In our case, the file size can vary from 10K to 200MB. About 12000 files
>make up 262GB of data.
>We are using IBM GPFS for our storage, which is supposed to be faster for
>this kind of activity. The parallel performance that we are seeing in our
>test case is far from the isolated filesystem test (with simple copy and
>move without OODT).
>So far the best combination I could find is to use "move" and keep an
>almost 1:1 ratio of FM and CR. I will still dig more into it.
>Chintu Mistry
>NASA Goddard Space Flight Center
>Bldg L40B, Room S776
>Office: 240 684 0477
>Mobile: 770 310 1047
>From: Cameron Goodale <good...@apache.org<mailto:good...@apache.org>>
>Date: Friday, December 14, 2012 12:03 AM
>To: "dev@oodt.apache.org<mailto:dev@oodt.apache.org>"
>Subject: Re: OODT 0.3 branch
>I see that your test data volume is 262GB, but I am curious about the
>makeup of the data.  On average what is your file size and how many
>The reason I ask is because the process of extraction and ingestion can
>vary wildly.  On the LMMP project I was ingesting 12GB DEMs over NFS and
>it was a slow process.  It was basically serial with 1CR+1FM, but we
>didn't have a requirement to push large volumes of data.
>On our recent Snow Data System I am processing 160 workflow jobs in
>parallel and OODT could handle the load; it turned out the filesystem was
>our major bottleneck.  We used a SAN initially when doing development,
>but when we increased the number of jobs in parallel the I/O became so
>bad we moved to GlusterFS. GlusterFS had speed improvements over the SAN,
>but we had to be careful about heavy writing, moving, deleting since the
>clustering would try to replicate the data.  Turns out Gluster is great
>for heavy writing OR heavy reading, but not both at the same time.
>Finally we are using NAS and it works great.
>My point is the file system plays a major role in performance when
>ingesting data.  The ultimate speed test would be if you could actually
>write the data into the final archive directory and basically do an
>ingestion in place (skip data transfer entirely), but I know that is
>rarely possible.
>This is an interesting challenge to see what configuration will yield the
>best throughput/performance.  I look forward to hearing more about your
>progress on this.
>Best Regards,
>On Wed, Dec 12, 2012 at 7:28 PM, Mattmann, Chris A (388J)
>Hi Chintu,
>Date: Wednesday, December 12, 2012 12:02 PM
>To: jpluser 
>Subject: Re: OODT 0.3 branch
>If you are saying that FM can handle multiple connections at one time,
>Yep I'm saying that it can.
>then multiple crawlers pointing to same FM should increase performance
>Well that really depends to be honest. It sounds like you guys are
>hitting an IO bottleneck potentially in data transfer? What file sizes
>are you transferring? If you are IO bound on the data transfer part, the
>product isn't fully ingested until:
>  1.  Its entry is added to the catalog
>  2.  The data transfer finishes
>Are you checking the FM for status along the way? Also realize that the
>FM will never be faster than the file system, so if it takes the file
>system X minutes to transfer a file F1, Y to transfer F2, and Z to
>transfer F3, then you still have to wait at least max(X,Y,Z) for the 3
>ingestions to complete, regardless.
>But that's not what we saw in our tests.
>For example,
>I saw barely a 2-minute performance difference between 2FM-6CR and 3FM-6CR.
>1) 2 hours  6 minutes to process 262G   (1FM 3CR - 3CR to 1FM)
>2) 1 hour  58 minutes to process 262G   (1FM 6CR - 6CR to 1FM)
>3) 1 hour  39 minutes to process 262G   (2FM 6CR - 3CR to 1FM)
>4) 1 hour  39 minutes to process 262G   (2FM 9CR - 4+CR to 1FM)
>5) 1 hour  37 minutes to process 262G   (3FM 9CR - 3CR to 1FM)
>6) 2 hours            to process 262G   (3FM 20CR - 6+CR to 1FM)
>7)         28 minutes to process 262G   (6FM 9CR - 1+CR to 1FM)   => This is
>my latest test and this is a good number.
>What would be interesting is simply looking at the speed for how long it
>takes to cp the files (which I bet is what's happening) versus mv'ing the
>files by hand. If mv is faster, I'd:
>  1.  Implement a Data Transfer implementation that simply replaces the
>calls to FileUtils.copyFile or .moveFile with system calls (see ExecHelper
>from oodt-commons) to UNIX equivalents.
>  2.  Plug that data transfer in to your crawler invocations via the cmd
>From: <Mattmann>, Chris A
>Date: Wednesday, December 12, 2012 2:51 PM
>Subject: Re: OODT 0.3 branch
>Hey Chintu,
>Date: Tuesday, December 11, 2012 2:41 PM
>To: jpluser 
>Subject: Re: OODT 0.3 branch
>Answers inline below.
>Gotcha, so you are using different product types. So, each crawler is
>crawling various product types in each one of the staging area dirs, that
>looks like e.g.,
>  /dir1 - 1st crawler
>   - file1 of product type 1
>   - file2 of product type 3
>  /dir2 - 2nd crawler
>   - file3 of product type 3
>  /dir3 - 3rd crawler
>   - file4 of product type 2
>Is that what the staging area looks like? - YES
>And then your FM is ingesting all 3 product types (I just picked 3
>arbitrarily; it could have been N) into:
>Correct?  - YES
>If so, I would imagine that FM1, FM2, and FM3 would actually speed up the
>ingestion process compared to just using 1 FM with 1, 2, or 3 crawlers
>all talking to it.
>Let me ask a few more questions:
>Do you see e.g., in the above example that file4 is ingested before
>file2? What about file3 before file2? If not, there is something wiggy
>going on.
>       - I have not checked that. I guess I can check that. Can FM handle
>multiple connections at the same time ?
>Yep FM can handle multiple connections at one time up to a limit (I think
>hard defaulted to ~100-200 by the underlying XMLRPC 2.1 library). We're
>using an old library currently but have a goal to upgrade to the latest
>version where I think this # is configurable.
