Re: OODT 0.3 branch

2012-12-14 Thread Mattmann, Chris A (388J)
Hey Chintu,

Thanks much. One thing you could try, to speed things up as well, would be
to:

1. Subclass, or roll your own, LocalDataTransfer class -- call it
MoveBasedLocalDataTransfer
2. In that class, replace the calls to FileUtils.copyFile or
FileUtils.moveFile with calls to ExecHelper.execute("cp ...") and ("mv ...")
3. In your calls to the crawler, pass --dataTransferFactory for your new
MoveBased...one

See if that improves it at all. If you want, file a JIRA issue too and I
could try and wire up such a transferer for you.
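
To make that concrete, here's a rough cut of what I mean -- consider it an
untested, hypothetical sketch rather than a stock class: the package name is
made up, the DataTransfer method set is from memory of the 0.3-era filemgr
API (a real implementation may need to cover the interface exactly), and
ProcessBuilder stands in for ExecHelper so as not to guess at ExecHelper's
exact signature:

package org.example.oodt; // hypothetical package

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.Reference;
import org.apache.oodt.cas.filemgr.structs.exceptions.DataTransferException;

public class MoveBasedLocalDataTransfer implements DataTransfer {

  public void setFileManagerUrl(URL url) {
    // purely local transfer: nothing to remember about the FM
  }

  public void transferProduct(Product product)
      throws DataTransferException, IOException {
    for (Object r : product.getProductReferences()) {
      Reference ref = (Reference) r;
      File src;
      File dest;
      try {
        src = new File(new URI(ref.getOrigReference()));
        dest = new File(new URI(ref.getDataStoreReference()));
      } catch (URISyntaxException e) {
        throw new DataTransferException("Bad reference URI: " + e.getMessage());
      }
      dest.getParentFile().mkdirs();
      // Shell out to mv: on the same filesystem this is an atomic
      // rename rather than a byte-for-byte copy.
      Process p = new ProcessBuilder("mv", src.getAbsolutePath(),
          dest.getAbsolutePath()).redirectErrorStream(true).start();
      try {
        if (p.waitFor() != 0) {
          throw new DataTransferException("mv failed for " + src);
        }
      } catch (InterruptedException e) {
        throw new DataTransferException("Interrupted while moving " + src);
      }
    }
  }

  public void retrieveProduct(Product product, File directory)
      throws DataTransferException, IOException {
    // not needed for ingest-side moves; left unimplemented in this sketch
    throw new DataTransferException("retrieveProduct not supported");
  }
}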

Thanks!

Cheers,
Chris

On 12/14/12 5:23 AM, "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES
AND SERVICES INC]"  wrote:

>Thank you for example.
>
>In our case, file sizes can vary from 10K to 200MB. About 12,000 files
>make up the 262GB of data.
>
>We are using IBM GPFS for our storage, which is supposed to be faster for
>this kind of activity. The parallel performance that we are seeing in our
>test case is far from the isolated filesystem test (simple copy and move
>without OODT).
>
>So far the best combination I could find is to use "move" and keep an
>almost 1:1 ratio of FM to CR. Will still dig more into it.
>
>Thanks
>--
>Chintu Mistry
>NASA Goddard Space Flight Center
>Bldg L40B, Room S776
>Office: 240 684 0477
>Mobile: 770 310 1047
>
>From: Cameron Goodale <good...@apache.org>
>Date: Friday, December 14, 2012 12:03 AM
>To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>Cc: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
><chintu.mis...@nasa.gov>
>Subject: Re: OODT 0.3 branch
>
>Chintu,
>
>I see that your test data volume is 262GB, but I am curious about the
>makeup of the data.  On average, what is your file size and how many
>files?
>
>The reason I ask is because the process of extraction and ingestion can
>vary wildly.  On the LMMP project I was ingesting 12GB DEMs over NFS and
>it was a slow process.  It was basically serial with 1CR+1FM, but we
>didn't have a requirement to push large volumes of data.
>
>On our recent Snow Data System I am processing 160 workflow jobs in
>parallel and OODT could handle the load, it turned out the filesystem was
>our major bottleneck.  We used a SAN initially when doing development,
>but when we increased the number of jobs in parallel the I/O became so
>bad we moved to GlusterFS. GlusterFS had speed improvements over the SAN,
>but we had to be careful about heavy writing, moving, deleting since the
>clustering would try to replicate the data.  Turns out Gluster is great
>for heavy writing OR heavy reading, but not both at the same time.
>Finally we are using NAS and it works great.
>
>My point is the file system plays a major role in performance when
>ingesting data.  The ultimate speed test would be if you could actually
>write the data into the final archive directory and basically do an
>ingestion in place (skip data transfer entirely), but I know that is
>rarely possible.
>
>This is an interesting challenge to see what configuration will yield the
>best throughput/performance.  I look forward to hearing more about your
>progress on this.
>
>
>Best Regards,
>
>
>
>Cameron
>
>
>On Wed, Dec 12, 2012 at 7:28 PM, Mattmann, Chris A (388J)
><chris.a.mattm...@jpl.nasa.gov> wrote:
>Hi Chintu,
>
>From: "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC]
>(GSFC-586.0)" <chintu.mis...@nasa.gov>
>Date: Wednesday, December 12, 2012 12:02 PM
>To: jpluser <chris.a.mattm...@jpl.nasa.gov>,
>"dev@oodt.apache.org" <dev@oodt.apache.org>
>Subject: Re: OODT 0.3 branch
>
>If you are saying that FM can handle multiple connections at one time,
>
>Yep I'm saying that it can.
>
>then multiple crawlers pointing to same FM should increase performance
>significantly.
>
>Well that really depends to be honest. It sounds like you guys are
>hitting an IO bottleneck potentially in data transfer? What file sizes
>are you transferring? If you are IO bound on the data transfer part, the
>product isn't fully ingested until:
>
>
>  1.  its entry is added to the catalog
>  2.  The data transfer finishes
>
>Are you checking the FM for status along the way? Also realize that the
>FM will never be faster than the file system, so if it takes the file
>system X minutes to transfer a file F1, Y to transfer F2, and Z to
>transfer F3, then you still have to wait at least the max(X,Y,Z) time,
>regardless, for the 3 ingestions to complete.
Re: OODT 0.3 branch

2012-12-14 Thread Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]
Thank you for example.

In our case, file sizes can vary from 10K to 200MB. About 12,000 files make
up the 262GB of data.

We are using IBM GPFS for our storage, which is supposed to be faster for this
kind of activity. The parallel performance that we are seeing in our test case
is far from the isolated filesystem test (simple copy and move without OODT).

So far the best combination I could find is to use "move" and keep an almost
1:1 ratio of FM to CR. Will still dig more into it.

Thanks
--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047

From: Cameron Goodale <good...@apache.org>
Date: Friday, December 14, 2012 12:03 AM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Cc: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
<chintu.mis...@nasa.gov>
Subject: Re: OODT 0.3 branch

Chintu,

I see that your test data volume is 262GB, but I am curious about the makeup
of the data.  On average, what is your file size and how many files?

The reason I ask is because the process of extraction and ingestion can vary 
wildly.  On the LMMP project I was ingesting 12GB DEMs over NFS and it was a 
slow process.  It was basically serial with 1CR+1FM, but we didn't have a 
requirement to push large volumes of data.

On our recent Snow Data System I am processing 160 workflow jobs in parallel 
and OODT could handle the load, it turned out the filesystem was our major 
bottleneck.  We used a SAN initially when doing development, but when we 
increased the number of jobs in parallel the I/O became so bad we moved to 
GlusterFS. GlusterFS had speed improvements over the SAN, but we had to be 
careful about heavy writing, moving, deleting since the clustering would try to 
replicate the data.  Turns out Gluster is great for heavy writing OR heavy 
reading, but not both at the same time.  Finally we are using NAS and it works 
great.

My point is the file system plays a major role in performance when ingesting 
data.  The ultimate speed test would be if you could actually write the data 
into the final archive directory and basically do an ingestion in place (skip 
data transfer entirely), but I know that is rarely possible.

This is an interesting challenge to see what configuration will yield the best 
throughput/performance.  I look forward to hearing more about your progress on 
this.


Best Regards,



Cameron


On Wed, Dec 12, 2012 at 7:28 PM, Mattmann, Chris A (388J)
<chris.a.mattm...@jpl.nasa.gov> wrote:
Hi Chintu,

From: "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)"
<chintu.mis...@nasa.gov>
Date: Wednesday, December 12, 2012 12:02 PM
To: jpluser <chris.a.mattm...@jpl.nasa.gov>,
"dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: OODT 0.3 branch

If you are saying that FM can handle multiple connections at one time,

Yep I'm saying that it can.

then multiple crawlers pointing to same FM should increase performance 
significantly.

Well that really depends to be honest. It sounds like you guys are hitting an 
IO bottleneck potentially in data transfer? What file sizes are you 
transferring? If you are IO bound on the data transfer part, the product isn't 
fully ingested until:


  1.  its entry is added to the catalog
  2.  The data transfer finishes

Are you checking the FM for status along the way? Also realize that the FM will 
never be faster than the file system, so if it takes the file system X minutes 
to transfer a file F1, Y to transfer F2, and Z to transfer F3, then you still 
have to wait at least the max(X,Y,Z) time, regardless for the 3 ingestions to 
complete.

But that’s not what we saw in our tests.

For example,
I saw barely 2 minutes performance difference between 2FM-6CR and 3FM-6CR.

1) 2 hours  6 minutes to process 262G   (1FM 3CR - 3CR to 1FM)
2) 1 hour  58 minutes to process 262G   (1FM 6CR - 6CR to 1FM)
3) 1 hour  39 minutes to process 262G   (2FM 6CR - 3CR to 1FM)
4) 1 hour  39 minutes to process 262G   (2FM 9CR - 4+CR to 1FM)
5) 1 hour  37 minutes to process 262G   (3FM 9CR - 3CR to 1FM)
6) 2 hours            to process 262G   (3FM 20CR - 6+CR to 1FM)
7) 28 minutes         to process 262G   (6FM 9CR - 1+CR to 1FM)   => This is my
latest test and this is a good number.

What would be interesting is simply looking at the speed for how long it takes
to cp the files (which I bet is what's happening) versus mv'ing the files by
hand. If mv is faster, I'd:


  1.  Implement a DataTransfer implementation that simply replaces the calls
to FileUtils.copyFile or .moveFile with system calls (see ExecHelper from
oodt-commons) to UNIX equivalents.
  2.  Plug that data transfer in to your crawler invocations via the cmd line.

Re: OODT 0.3 branch

2012-12-13 Thread Mattmann, Chris A (388J)
Thanks Cam, for the use cases, and insight.

Cheers,
Chris

On 12/13/12 9:03 PM, "Cameron Goodale"  wrote:

>Chintu,
>
>I see that your test data volume is 262GB, but I am curious about the makeup
>of the data.  On average, what is your file size and how many files?
>
>The reason I ask is because the process of extraction and ingestion can
>vary wildly.  On the LMMP project I was ingesting 12GB DEMs over NFS and
>it
>was a slow process.  It was basically serial with 1CR+1FM, but we didn't
>have a requirement to push large volumes of data.
>
>On our recent Snow Data System I am processing 160 workflow jobs in
>parallel and OODT could handle the load, it turned out the filesystem was
>our major bottleneck.  We used a SAN initially when doing development, but
>when we increased the number of jobs in parallel the I/O became so bad we
>moved to GlusterFS. GlusterFS had speed improvements over the SAN, but we
>had to be careful about heavy writing, moving, deleting since the
>clustering would try to replicate the data.  Turns out Gluster is great
>for heavy writing OR heavy reading, but not both at the same time.  Finally
>we are using NAS and it works great.
>
>My point is the file system plays a major role in performance when
>ingesting data.  The ultimate speed test would be if you could actually
>write the data into the final archive directory and basically do an
>ingestion in place (skip data transfer entirely), but I know that is
>rarely
>possible.
>
>This is an interesting challenge to see what configuration will yield the
>best throughput/performance.  I look forward to hearing more about your
>progress on this.
>
>
>Best Regards,
>
>
>
>Cameron
>
>
>On Wed, Dec 12, 2012 at 7:28 PM, Mattmann, Chris A (388J) <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Hi Chintu,
>>
>> From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC]
>> (GSFC-586.0)" mailto:chintu.mis...@nasa.gov>>
>> Date: Wednesday, December 12, 2012 12:02 PM
>> To: jpluser > chris.a.mattm...@jpl.nasa.gov>>, "dev@oodt.apache.org> dev@oodt.apache.org>" mailto:dev@oodt.apache.org>>
>> Subject: Re: OODT 0.3 branch
>>
>> If you are saying that FM can handle multiple connections at one time,
>>
>> Yep I'm saying that it can.
>>
>> then multiple crawlers pointing to same FM should increase performance
>> significantly.
>>
>> Well that really depends to be honest. It sounds like you guys are
>>hitting
>> an IO bottleneck potentially in data transfer? What file sizes are you
>> transferring? If you are IO bound on the data transfer part, the product
>> isn't fully ingested until:
>>
>>
>>   1.  its entry is added to the catalog
>>   2.  The data transfer finishes
>>
>> Are you checking the FM for status along the way? Also realize that the
>>FM
>> will never be faster than the file system, so if it takes the file
>>system X
>> minutes to transfer a file F1, Y to transfer F2, and Z to transfer F3,
>>then
>> you still have to wait at least the max(X,Y,Z) time, regardless for the
>>3
>> ingestions to complete.
>>
>> But that's not what we saw in our tests.
>>
>> For example,
>> I saw barely 2 minutes performance difference between 2FM-6CR and
>>3FM-6CR.
>>
>> 1) 2 hours  6 minutes to process 262G   (1FM 3CR - 3CR to 1FM)
>> 2) 1 hour  58 minutes to process 262G   (1FM 6CR - 6CR to 1FM)
>> 3) 1 hour  39 minutes to process 262G   (2FM 6CR - 3CR to 1FM)
>> 4) 1 hour  39 minutes to process 262G   (2FM 9CR - 4+CR to 1FM)
>> 5) 1 hour  37 minutes to process 262G   (3FM 9CR - 3CR to 1FM)
>> 6) 2 hours            to process 262G   (3FM 20CR - 6+CR to 1FM)
>> 7) 28 minutes         to process 262G   (6FM 9CR - 1+CR to 1FM)   => This is
>> my latest test and this is a good number.
>>
>> What would be interesting is simply looking at the speed for how long it
>> takes to cp the files (which I bet is what's happening) versus mv'ing
>>the
>> files by hand. If mv is faster, I'd:
>>
>>
>>   1.  Implement a DataTransfer implementation that simply replaces the
>> calls to FileUtils.copyFile or .moveFile with system calls (see ExecHelper
>> from oodt-commons) to UNIX equivalents.
>>   2.  Plug that data transfer in to your crawler invocations via the cmd
>> line.
>>
>> HTH!
>>
>> Cheers,
>> Chris
>>
>>
>> From: Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>
>> Date: Wednesday, December 12, 2012 2:51 PM
>> To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
>> <chintu.mis...@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>

Re: OODT 0.3 branch

2012-12-13 Thread Cameron Goodale
Chintu,

I see that your test data volume is 262GB, but I am curious about the makeup
of the data.  On average, what is your file size and how many files?

The reason I ask is because the process of extraction and ingestion can
vary wildly.  On the LMMP project I was ingesting 12GB DEMs over NFS and it
was a slow process.  It was basically serial with 1CR+1FM, but we didn't
have a requirement to push large volumes of data.

On our recent Snow Data System I am processing 160 workflow jobs in
parallel and OODT could handle the load, it turned out the filesystem was
our major bottleneck.  We used a SAN initially when doing development, but
when we increased the number of jobs in parallel the I/O became so bad we
moved to GlusterFS. GlusterFS had speed improvements over the SAN, but we
had to be careful about heavy writing, moving, deleting since the
clustering would try to replicate the data.  Turns out Gluster is great for
heavy writing OR heavy reading, but not both at the same time.  Finally we
are using NAS and it works great.

My point is the file system plays a major role in performance when
ingesting data.  The ultimate speed test would be if you could actually
write the data into the final archive directory and basically do an
ingestion in place (skip data transfer entirely), but I know that is rarely
possible.

This is an interesting challenge to see what configuration will yield the
best throughput/performance.  I look forward to hearing more about your
progress on this.


Best Regards,



Cameron


On Wed, Dec 12, 2012 at 7:28 PM, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Chintu,
>
> From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC]
> (GSFC-586.0)" mailto:chintu.mis...@nasa.gov>>
> Date: Wednesday, December 12, 2012 12:02 PM
> To: jpluser  chris.a.mattm...@jpl.nasa.gov>>, "dev@oodt.apache.org dev@oodt.apache.org>" mailto:dev@oodt.apache.org>>
> Subject: Re: OODT 0.3 branch
>
> If you are saying that FM can handle multiple connections at one time,
>
> Yep I'm saying that it can.
>
> then multiple crawlers pointing to same FM should increase performance
> significantly.
>
> Well that really depends to be honest. It sounds like you guys are hitting
> an IO bottleneck potentially in data transfer? What file sizes are you
> transferring? If you are IO bound on the data transfer part, the product
> isn't fully ingested until:
>
>
>   1.  its entry is added to the catalog
>   2.  The data transfer finishes
>
> Are you checking the FM for status along the way? Also realize that the FM
> will never be faster than the file system, so if it takes the file system X
> minutes to transfer a file F1, Y to transfer F2, and Z to transfer F3, then
> you still have to wait at least the max(X,Y,Z) time, regardless for the 3
> ingestions to complete.
>
> But that’s not what we saw in our tests.
>
> For example,
> I saw barely 2 minutes performance difference between 2FM-6CR and 3FM-6CR.
>
> 1) 2 hours  6 minutes to process 262G   (1FM 3CR - 3CR to 1FM)
> 2) 1 hour  58 minutes to process 262G   (1FM 6CR - 6CR to 1FM)
> 3) 1 hour  39 minutes to process 262G   (2FM 6CR - 3CR to 1FM)
> 4) 1 hour  39 minutes to process 262G   (2FM 9CR - 4+CR to 1FM)
> 5) 1 hour  37 minutes to process 262G   (3FM 9CR - 3CR to 1FM)
> 6) 2 hours            to process 262G   (3FM 20CR - 6+CR to 1FM)
> 7) 28 minutes         to process 262G   (6FM 9CR - 1+CR to 1FM)   => This is my
> latest test and this is a good number.
>
> What would be interesting is simply looking at the speed for how long it
> takes to cp the files (which I bet is what's happening) versus mv'ing the
> files by hand. If mv is faster, I'd:
>
>
>   1.  Implement a DataTransfer implementation that simply replaces the
> calls to FileUtils.copyFile or .moveFile with system calls (see ExecHelper
> from oodt-commons) to UNIX equivalents.
>   2.  Plug that data transfer in to your crawler invocations via the cmd
> line.
>
> HTH!
>
> Cheers,
> Chris
>
>
> From: Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>
> Date: Wednesday, December 12, 2012 2:51 PM
> To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
> <chintu.mis...@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
> Subject: Re: OODT 0.3 branch
>
> Hey Chintu,
>
> From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC]
> (GSFC-586.0)" mailto:chintu.mis...@nasa.gov>>
> Date: Tuesday, December 11, 2012 2:41 PM
> To: jpluser  chris.a.mattm...@jpl.nasa.gov>>, "dev@oodt.apache.org dev@oodt.apache.org>" mailto:dev@oodt.apach

Re: OODT 0.3 branch

2012-12-12 Thread Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]
If you are saying that FM can handle multiple connections at one time, then 
multiple crawlers pointing to same FM should increase performance 
significantly. But that’s not what we saw in our tests.

For example,
I saw barely 2 minutes performance difference between 2FM-6CR and 3FM-6CR.

1) 2 hours  6 minutes to process 262G   (1FM 3CR - 3CR to 1FM)
2) 1 hour  58 minutes to process 262G   (1FM 6CR - 6CR to 1FM)
3) 1 hour  39 minutes to process 262G   (2FM 6CR - 3CR to 1FM)
4) 1 hour  39 minutes to process 262G   (2FM 9CR - 4+CR to 1FM)
5) 1 hour  37 minutes to process 262G   (3FM 9CR - 3CR to 1FM)
6) 2 hours            to process 262G   (3FM 20CR - 6+CR to 1FM)
7) 28 minutes         to process 262G   (6FM 9CR - 1+CR to 1FM)   => This is my
latest test and this is a good number.

Regards
--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047

From: Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>
Date: Wednesday, December 12, 2012 2:51 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
<chintu.mis...@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: OODT 0.3 branch

Hey Chintu,

From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 2:41 PM
To: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>, 
"dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Subject: Re: OODT 0.3 branch

Answers inline below.

---snip

Gotcha, so you are using different product types. So, each crawler is crawling
various product types in each one of the staging area dirs, which looks
something like this:

/STAGING_AREA_BASE
  /dir1 – 1st crawler
   - file1 of product type 1
   - file2 of product type 3

 /dir2 – 2nd crawler
   - file3 of product type 3

 /dir3 – 3rd crawler
   - file4 of product type 2

Is that what the staging area looks like? - YES

And then your FM is ingesting all 3 product types (I just picked 3 arbitrarily;
could have been N) into:

ARCHIVE_BASE/{ProductTypeName}/{MMDD}

Correct?  - YES

If so, I would imagine that FM1 and FM2 and FM3 would actually speed up the
ingestion process compared to just using 1 FM with 1, 2, or 3 crawlers all
talking to it.

Let me ask a few more questions:

Do you see e.g., in the above example that file4 is ingested before file2? What 
about file3 before file2? If not, there is something wiggy going on.
   - I have not checked that. I guess I can check that. Can FM handle 
multiple connections at the same time?


Yep FM can handle multiple connections at one time up to a limit (I think hard 
defaulted to ~100-200 by the underlying XMLRPC 2.1 library). We're using an old 
library currently but have a goal to upgrade to the latest version where I 
think this # is configurable.

Cheers,
Chris



Re: OODT 0.3 branch

2012-12-12 Thread Mattmann, Chris A (388J)
Hi Chintu,

From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Wednesday, December 12, 2012 12:02 PM
To: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>, 
"dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Subject: Re: OODT 0.3 branch

If you are saying that FM can handle multiple connections at one time,

Yep I'm saying that it can.

then multiple crawlers pointing to same FM should increase performance 
significantly.

Well that really depends to be honest. It sounds like you guys are hitting an 
IO bottleneck potentially in data transfer? What file sizes are you 
transferring? If you are IO bound on the data transfer part, the product isn't 
fully ingested until:


  1.  its entry is added to the catalog
  2.  The data transfer finishes

Are you checking the FM for status along the way? Also realize that the FM will 
never be faster than the file system, so if it takes the file system X minutes 
to transfer a file F1, Y to transfer F2, and Z to transfer F3, then you still 
have to wait at least the max(X,Y,Z) time, regardless for the 3 ingestions to 
complete.

But that’s not what we saw in our tests.

For example,
I saw barely 2 minutes performance difference between 2FM-6CR and 3FM-6CR.

1) 2 hours  6 minutes to process 262G   (1FM 3CR - 3CR to 1FM)
2) 1 hour  58 minutes to process 262G   (1FM 6CR - 6CR to 1FM)
3) 1 hour  39 minutes to process 262G   (2FM 6CR - 3CR to 1FM)
4) 1 hour  39 minutes to process 262G   (2FM 9CR - 4+CR to 1FM)
5) 1 hour  37 minutes to process 262G   (3FM 9CR - 3CR to 1FM)
6) 2 hours            to process 262G   (3FM 20CR - 6+CR to 1FM)
7) 28 minutes         to process 262G   (6FM 9CR - 1+CR to 1FM)   => This is my
latest test and this is a good number.

What would be interesting is simply looking at the speed for how long it takes 
to cp the files (which I bet is what's happening) versus mv'ing the files by 
hand. If mv is faster, I'd:


  1.  Implement a DataTransfer implementation that simply replaces the calls
to FileUtils.copyFile or .moveFile with system calls (see ExecHelper from
oodt-commons) to UNIX equivalents.
  2.  Plug that data transfer in to your crawler invocations via the cmd line,
as sketched below.
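
For step 2, the crawler just needs the factory class name on the cmd line.
Roughly like this -- again an untested sketch, assuming the factory contract
is the usual no-arg createDataTransfer(); class and package names are made up
to match the transferer sketched earlier in the thread:

package org.example.oodt; // hypothetical package

import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
import org.apache.oodt.cas.filemgr.datatransfer.DataTransferFactory;

// The crawler would instantiate this by name, via something like:
//   --dataTransferFactory org.example.oodt.MoveBasedLocalDataTransferFactory
public class MoveBasedLocalDataTransferFactory implements DataTransferFactory {
  public DataTransfer createDataTransfer() {
    // MoveBasedLocalDataTransfer is the hypothetical "mv"-based transferer
    return new MoveBasedLocalDataTransfer();
  }
}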

HTH!

Cheers,
Chris


From: Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>
Date: Wednesday, December 12, 2012 2:51 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
<chintu.mis...@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: OODT 0.3 branch

Hey Chintu,

From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 2:41 PM
To: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>, 
"dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Subject: Re: OODT 0.3 branch

Answers inline below.

---snip

Gotcha, so you are using different product types. So, each crawler is crawling
various product types in each one of the staging area dirs, which looks
something like this:

/STAGING_AREA_BASE
  /dir1 – 1st crawler
   - file1 of product type 1
   - file2 of product type 3

 /dir2 – 2nd crawler
   - file3 of product type 3

 /dir3 – 3rd crawler
   - file4 of product type 2

Is that what the staging area looks like? - YES

And then your FM is ingesting all 3 product types (I just picked 3 arbitrarily;
could have been N) into:

ARCHIVE_BASE/{ProductTypeName}/{MMDD}

Correct?  - YES

If so, I would imagine that FM1 and FM2 and FM3 would actually speed up the
ingestion process compared to just using 1 FM with 1, 2, or 3 crawlers all
talking to it.

Let me ask a few more questions:

Do you see e.g., in the above example that file4 is ingested before file2? What 
about file3 before file2? If not, there is something wiggy going on.
   - I have not checked that. I guess I can check that. Can FM handle 
multiple connections at the same time?


Yep FM can handle multiple connections at one time up to a limit (I think hard 
defaulted to ~100-200 by the underlying XMLRPC 2.1 library). We're using an old 
library currently but have a goal to upgrade to the latest version where I 
think this # is configurable.

Cheers,
Chris



Re: OODT 0.3 branch

2012-12-12 Thread Mattmann, Chris A (388J)
Hey Chintu,

From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 2:41 PM
To: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>, 
"dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Subject: Re: OODT 0.3 branch

Answers inline below.

---snip

Gotcha, so you are using different product types. So, each crawler is crawling
various product types in each one of the staging area dirs, which looks
something like this:

/STAGING_AREA_BASE
  /dir1 – 1st crawler
   - file1 of product type 1
   - file2 of product type 3

 /dir2 – 2nd crawler
   - file3 of product type 3

 /dir3 – 3rd crawler
   - file4 of product type 2

Is that what the staging area looks like? - YES

And then your FM is ingesting all 3 product types (I just picked 3 arbitrarily;
could have been N) into:

ARCHIVE_BASE/{ProductTypeName}/{MMDD}

Correct?  - YES

If so, I would imagine that FM1 and FM2 and FM3 would actually speed up the
ingestion process compared to just using 1 FM with 1, 2, or 3 crawlers all
talking to it.

Let me ask a few more questions:

Do you see e.g., in the above example that file4 is ingested before file2? What 
about file3 before file2? If not, there is something wiggy going on.
   - I have not checked that. I guess I can check that. Can FM handle 
multiple connections at the same time?


Yep FM can handle multiple connections at one time up to a limit (I think hard 
defaulted to ~100-200 by the underlying XMLRPC 2.1 library). We're using an old 
library currently but have a goal to upgrade to the latest version where I 
think this # is configurable.

Cheers,
Chris



Re: OODT 0.3 branch

2012-12-12 Thread Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]

--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047

From: Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>
Date: Tuesday, December 11, 2012 6:25 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
<chintu.mis...@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: OODT 0.3 branch

Hey Chintu,


From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 2:41 PM
To: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>, 
"dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Subject: Re: OODT 0.3 branch

Answers inline below.

We will share information on apache.org at one point, but we are not there yet.

Thanks, OK,  please see inline below:

--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047

From: Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>
Date: Tuesday, December 11, 2012 5:23 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
<chintu.mis...@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: OODT 0.3 branch

Hey Chintu,

Thanks for reaching out! Replies inline below:

From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 1:50 PM
To: "dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Cc: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>
Subject: OODT 0.3 branch

Hi Chris,

We are trying to measure how fast filemanager+crawler performs.

Here is what we are trying to do:

 *   Total data to process: 262GB
 *   3 file managers and 9 crawlers,
     where 3 crawlers are sending file locations to each file manager to
     process the files
 *   We have our own schema running on a PostgreSQL database
 *   Custom H5 Extractor using the h5dump utility

Cool this sounds like an awesome test. Would you be willing to share some of 
the info on the OODT wiki?

https://cwiki.apache.org/confluence/display/OODT/Home

Questions:
1) I have tried using FileUtils.copyFile vs FileUtils.moveFile, but I don't see
any difference in processing time. Both my LandingZone and Archive Area are
located on the same filesystem (GPFS). It is roughly taking 100 minutes to
process 262G of data. Can you shed any light on why we don't see any
performance change?

This may have to do with the way that the JDK (what version are you using?) 
implements the actual arraycopy methods, and how the apache commons-io library 
wraps those methods. Let me know what JDK version you're using and we can 
investigate it.

- java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

OK thanks. I found this article:

http://stackoverflow.com/questions/300559/move-copy-file-operations-in-java

It doesn't really go into too much detail but the nice thing is that if you 
need a different, or faster DataTransfer, you can always sub-class or implement 
your own that makes a call to e.g., "mv" or "cp" at the UNIX level if you think 
it'll speed it up.

Looking at:
http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html

http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html#copyFile(java.io.File,%20java.io.File)
http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html#moveFile(java.io.File,%20java.io.File)

Note for moveFile:

"When the destination file is on another file system, do a 'copy and delete'."

I wonder how it detects that? I wonder if it always thinks it's on another
filesystem using the JDK and GPFS? If so, that might explain what you are
seeing, in that there is no difference between copyFile and moveFile.
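
If you want to test that theory outside of OODT, a quick throwaway probe like
this would tell you (class name made up; if I'm remembering the commons-io
implementation right, moveFile tries File.renameTo() first and only falls back
to copy-and-delete when renameTo returns false):

import java.io.File;

public class RenameProbe {
  public static void main(String[] args) {
    File src = new File(args[0]);
    File dest = new File(args[1]);
    // renameTo() returns false whenever an atomic rename is not possible
    // (e.g. across mount points) -- the same check that decides whether a
    // move degenerates into a copy-and-delete
    if (src.renameTo(dest)) {
      System.out.println("atomic rename worked -- moveFile should be cheap");
    } else {
      System.out.println("rename refused -- moveFile would copy and delete");
    }
  }
}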

2) The other thing is that I don't see any performance gain between running
2 FM or 3 FM. I thought that I would see some performance gain due to
concurrency. Same goes for multiple crawlers. I was hoping to see a pretty
obvious performance change if I increase the number of crawlers. What are your
thoughts on running things in parallel to increase performance?

How are you situating the additional file managers? Are you having 1 crawler
ingest to 3? Or is there a 1:1 correspondence between each crawler and FM?

Re: OODT 0.3 branch

2012-12-11 Thread Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]
Answers inline below.

We will share information on apache.org at one point, but we are not there yet.

--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047

From: Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>
Date: Tuesday, December 11, 2012 5:23 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
<chintu.mis...@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: OODT 0.3 branch

Hey Chintu,

Thanks for reaching out! Replies inline below:

From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 1:50 PM
To: "dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Cc: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>
Subject: OODT 0.3 branch

Hi Chris,

We are trying to measure how fast filemanager+crawler performs.

Here is what we are trying to do:

 *   Total data to process: 262GB
 *   3 file managers and 9 crawlers,
     where 3 crawlers are sending file locations to each file manager to
     process the files
 *   We have our own schema running on a PostgreSQL database
 *   Custom H5 Extractor using the h5dump utility

Cool this sounds like an awesome test. Would you be willing to share some of 
the info on the OODT wiki?

https://cwiki.apache.org/confluence/display/OODT/Home

Questions:
1) I have tried using FileUtils.copyFile vs FileUtils.moveFile, but I don't see
any difference in processing time. Both my LandingZone and Archive Area are
located on the same filesystem (GPFS). It is roughly taking 100 minutes to
process 262G of data. Can you shed any light on why we don't see any
performance change?

This may have to do with the way that the JDK (what version are you using?) 
implements the actual arraycopy methods, and how the apache commons-io library 
wraps those methods. Let me know what JDK version you're using and we can 
investigate it.

- java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

2) The other thing is that I don't see any performance gain between running
2 FM or 3 FM. I thought that I would see some performance gain due to
concurrency. Same goes for multiple crawlers. I was hoping to see a pretty
obvious performance change if I increase the number of crawlers. What are your
thoughts on running things in parallel to increase performance?

How are you situating the additional file managers? Are you having 1 crawler
ingest to 3? Or is there a 1:1 correspondence between each crawler and FM? And,
what do you mean by no performance gain? Do you mean that you don't see 3x
speed in terms of, e.g., product ingestion of met into the catalog? Or of file
transfer speed?

- All 3 FM are running on one machine. Each crawler instance is crawling a
different directory. 3 crawlers are connected to the 1st FM, another 3 are
connected to the second FM, and the last 3 crawlers are connected to the third
FM. When I say there is no performance difference between 2 and 3 FM, I mean
they take the same amount of time to process the same amount of data
concurrently. I would love to see 3x speed if I run 3 FM. I was talking about
the whole ingest process from start to end for one file, which involves
extracting metadata, inserting records into the database, and transferring the
file to the archive location.

Are the 3 crawlers crawling the same staging area concurrently? Or are they 
separated out by buckets? And, which crawler are you using? The 
MetExtractorProductCrawler or the AutoDetectCrawler? Also, what is the 
versioning policy for the FM on a per product basis? Are all products being 
ingested of the same ProductType and ultimately of the same versioner and 
ultimate disk location?

- We are using StdProductCrawler. We don't have a versioning requirement.
Products are of different ProductTypes. We are trying to process 1 orbit full
of data. They all get archived at the "ARCHIVE_BASE/{ProductType}/MMDD"
location.

3) Like I said earlier, we are running the crawler to push data to the file
manager. If I run it that way, then "data transfer (copy or move)" is happening
on the crawler side. I cannot find any way to let the file manager handle "data
transfer" using one of your runtime options. Please let me know if you guys
know how to do that?

If you want the FM to handle the transfer you have to use the low level File 
Manager Client and omit the clientTransfer option:

[chipotle:local/filemgr/bin] mattmann% ./filemgr-client
filemgr-client --url <url> --operation [<operation> [params]]
operations:
--addProductType --typeName <name> --typeDesc <description> --repository <path>
--versionClass <class>
--ingestProduct --productName <name> --productStructure <structure>
--productTypeName <type> --metadataFile <file>
[--clientTransfer --dataTransfer <class>]
--refs <ref>...

Re: OODT 0.3 branch

2012-12-11 Thread Mattmann, Chris A (388J)
Hey Chintu,


From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 2:41 PM
To: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>, 
"dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Subject: Re: OODT 0.3 branch

Answers inline below.

We will share information on apache.org at one point, but we are not there yet.

Thanks, OK,  please see inline below:

--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047

From: Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>
Date: Tuesday, December 11, 2012 5:23 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]"
<chintu.mis...@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: OODT 0.3 branch

Hey Chintu,

Thanks for reaching out! Replies inline below:

From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 1:50 PM
To: "dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Cc: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>
Subject: OODT 0.3 branch

Hi Chris,

We are trying to measure how fast filemanager+crawler performs.

Here is what we are trying to do:

  *   Total data to process: 262GB
  *   3 file managers and 9 crawlers,
      where 3 crawlers are sending file locations to each file manager to
      process the files
  *   We have our own schema running on a PostgreSQL database
  *   Custom H5 Extractor using the h5dump utility

Cool this sounds like an awesome test. Would you be willing to share some of 
the info on the OODT wiki?

https://cwiki.apache.org/confluence/display/OODT/Home

Questions:
1) I have tried using FileUtils.copyFile vs FileUtils.moveFile, but I don't see
any difference in processing time. Both my LandingZone and Archive Area are
located on the same filesystem (GPFS). It is roughly taking 100 minutes to
process 262G of data. Can you shed any light on why we don't see any
performance change?

This may have to do with the way that the JDK (what version are you using?) 
implements the actual arraycopy methods, and how the apache commons-io library 
wraps those methods. Let me know what JDK version you're using and we can 
investigate it.

- java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

OK thanks. I found this article:

http://stackoverflow.com/questions/300559/move-copy-file-operations-in-java

It doesn't really go into too much detail but the nice thing is that if you 
need a different, or faster DataTransfer, you can always sub-class or implement 
your own that makes a call to e.g., "mv" or "cp" at the UNIX level if you think 
it'll speed it up.

Looking at:
http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html

http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html#copyFile(java.io.File,%20java.io.File)
http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html#moveFile(java.io.File,%20java.io.File)

Note for moveFile:

"When the destination file is on another file system, do a 'copy and delete'."

I wonder how it detects that? I wonder if it always thinks it's on another
filesystem using the JDK and GPFS? If so, that might explain what you are
seeing, in that there is no difference between copyFile and moveFile.

2) The other thing is that I don't see any performance gain between running
2 FM or 3 FM. I thought that I would see some performance gain due to
concurrency. Same goes for multiple crawlers. I was hoping to see a pretty
obvious performance change if I increase the number of crawlers. What are your
thoughts on running things in parallel to increase performance?

How are you situating the additional file managers? Are you having 1 crawler
ingest to 3? Or is there a 1:1 correspondence between each crawler and FM? And,
what do you mean by no performance gain? Do you mean that you don't see 3x
speed in terms of, e.g., product ingestion of met into the catalog? Or of file
transfer speed?

- All 3 FM are running on one machine. Each crawler instance is crawling a
different directory. 3 crawlers are connected to the 1st FM, another 3 are
connected to the second FM, and the last 3 crawlers are connected to the third
FM. When I say there is no performance difference between 2 and 3 FM, I mean
they take the same amount of time to process the same amount of data
concurrently.

OODT 0.3 branch

2012-12-11 Thread Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]
Hi Chris,

We are trying to measure how fast filemanager+crawler performs.

Here is what we are trying to do:

 *   Total data to process: 262GB
 *   3 file managers and 9 crawlers,
     where 3 crawlers are sending file locations to each file manager to
     process the files
 *   We have our own schema running on a PostgreSQL database
 *   Custom H5 Extractor using the h5dump utility

Questions:
1) I have tried using FileUtils.copyFile vs FileUtils.moveFile, but I don't see
any difference in processing time. Both my LandingZone and Archive Area are
located on the same filesystem (GPFS). It is roughly taking 100 minutes to
process 262G of data. Can you shed any light on why we don't see any
performance change?

2) The other thing is that I don't see any performance gain between running
2 FM or 3 FM. I thought that I would see some performance gain due to
concurrency. Same goes for multiple crawlers. I was hoping to see a pretty
obvious performance change if I increase the number of crawlers. What are your
thoughts on running things in parallel to increase performance?

3) Like I said earlier, we are running the crawler to push data to the file
manager. If I run it that way, then "data transfer (copy or move)" is happening
on the crawler side. I cannot find any way to let the file manager handle "data
transfer" using one of your runtime options. Please let me know if you guys
know how to do that?

We have enough processing power to run multiple FMs and Crawlers for
scalability. But for some reason the crawler is not scaling enough.


Regards
--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047


Re: OODT 0.3 branch

2012-12-11 Thread Mattmann, Chris A (388J)
Hey Chintu,

Thanks for reaching out! Replies inline below:

From: , "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
mailto:chintu.mis...@nasa.gov>>
Date: Tuesday, December 11, 2012 1:50 PM
To: "dev@oodt.apache.org<mailto:dev@oodt.apache.org>" 
mailto:dev@oodt.apache.org>>
Cc: jpluser 
mailto:chris.a.mattm...@jpl.nasa.gov>>
Subject: OODT 0.3 branch

Hi Chris,

We are trying to measure how fast filemanager+crawler performs.

Here is what we are trying to do:

  *   Total data to process: 262GB
  *   3 file managers and 9 crawlers,
      where 3 crawlers are sending file locations to each file manager to
      process the files
  *   We have our own schema running on a PostgreSQL database
  *   Custom H5 Extractor using the h5dump utility

Cool this sounds like an awesome test. Would you be willing to share some of 
the info on the OODT wiki?

https://cwiki.apache.org/confluence/display/OODT/Home

Questions:
1) I have tried using FileUtils.copyFile vs FileUtils.moveFile, but I don't see
any difference in processing time. Both my LandingZone and Archive Area are
located on the same filesystem (GPFS). It is roughly taking 100 minutes to
process 262G of data. Can you shed any light on why we don't see any
performance change?

This may have to do with the way that the JDK (what version are you using?) 
implements the actual arraycopy methods, and how the apache commons-io library 
wraps those methods. Let me know what JDK version you're using and we can 
investigate it.

2) The other thing is that I don't see any performance gain between running
2 FM or 3 FM. I thought that I would see some performance gain due to
concurrency. Same goes for multiple crawlers. I was hoping to see a pretty
obvious performance change if I increase the number of crawlers. What are your
thoughts on running things in parallel to increase performance?

How are you situating the additional file managers? Are you having 1 crawler
ingest to 3? Or is there a 1:1 correspondence between each crawler and FM? And,
what do you mean by no performance gain? Do you mean that you don't see 3x
speed in terms of, e.g., product ingestion of met into the catalog? Or of file
transfer speed?

Are the 3 crawlers crawling the same staging area concurrently? Or are they 
separated out by buckets? And, which crawler are you using? The 
MetExtractorProductCrawler or the AutoDetectCrawler? Also, what is the 
versioning policy for the FM on a per product basis? Are all products being 
ingested of the same ProductType and ultimately of the same versioner and 
ultimate disk location?

3) Like I said earlier, we are running the crawler to push data to the file
manager. If I run it that way, then "data transfer (copy or move)" is happening
on the crawler side. I cannot find any way to let the file manager handle "data
transfer" using one of your runtime options. Please let me know if you guys
know how to do that?

If you want the FM to handle the transfer you have to use the low level File 
Manager Client and omit the clientTransfer option:

[chipotle:local/filemgr/bin] mattmann% ./filemgr-client
filemgr-client --url <url> --operation [<operation> [params]]
operations:
--addProductType --typeName <name> --typeDesc <description> --repository <path>
--versionClass <class>
--ingestProduct --productName <name> --productStructure <structure>
--productTypeName <type> --metadataFile <file>
[--clientTransfer --dataTransfer <class>]
--refs <ref>...
--hasProduct --productName <name>
--getProductTypeByName --productTypeName <type>
--getNumProducts --productTypeName <type>
--getFirstPage --productTypeName <type>
--getNextPage --productTypeName <type> --currentPageNum <num>
--getPrevPage --productTypeName <type> --currentPageNum <num>
--getLastPage --productTypeName <type>
--getCurrentTransfer
--getCurrentTransfers
--getProductPctTransferred --productId <id> --productTypeName <type>
--getFilePctTransferred --origRef <ref>

[chipotle:local/filemgr/bin] mattmann%

That is just a cmd-line exposure of the underlying FM client Java API, which
lets you do server-side transfers on ingest by passing clientTransfer == false
to this method:

http://oodt.apache.org/components/maven/xref/org/apache/oodt/cas/filemgr/system/XmlRpcFileManagerClient.html#1168
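
In Java, that server-side variant looks roughly like this -- a rough,
untested sketch: the Product/Metadata setters are from memory of the filemgr
structs API, the product and met values are made up, and only the
ingestProduct(product, metadata, clientTransfer) signature comes from the
xref above:

import java.net.URL;

import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;
import org.apache.oodt.cas.metadata.Metadata;

public class ServerSideIngest {
  public static void main(String[] args) throws Exception {
    XmlRpcFileManagerClient fm =
        new XmlRpcFileManagerClient(new URL("http://localhost:9000"));

    Product product = new Product();               // setters assumed
    product.setProductName("example.h5");          // hypothetical product
    product.setProductStructure(Product.STRUCTURE_FLAT);
    // ... product type and references filled in from your policy here ...

    Metadata met = new Metadata();
    met.addMetadata("ProductType", "ExampleType"); // hypothetical met key

    // clientTransfer == false => the File Manager itself performs the data
    // transfer, rather than the crawler/client side doing the cp/mv
    fm.ingestProduct(product, met, false);
  }
}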

We have enough processing power to run multiple FMs and Crawlers for
scalability. But for some reason the crawler is not scaling enough.


We'll get it scaling out for ya. Can you please provide answers to the above 
questions and we'll go from there? Thanks!

Thanks!

Cheers,
Chris




Regards
--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047