Thanks Aaron. I will test it out today and let you know the result.

Regards,
Dibyendu

On Wed, Oct 29, 2014 at 6:59 AM, Aaron McCurry <[email protected]> wrote:

> Dibyendu you were right.  After looking into the problem (and hitting it
> myself) the maybeMerge call was missing.  Here's the commit that fixes the
> issue.
>
>
> https://github.com/apache/incubator-blur/commit/ce2179ed5cb5a534275678ad0982fef4e99fbb49
>
> Also we have made a few fixes to resource management over the past few days
> so when you get a chance I would recommend that you upgrade to the latest
> version on master.  Thanks!
>
> Aaron
>
> On Sun, Oct 26, 2014 at 9:29 PM, Aaron McCurry <[email protected]> wrote:
>
> >
> >
> > On Sun, Oct 26, 2014 at 1:48 PM, Dibyendu Bhattacharya <
> > [email protected]> wrote:
> >
> >> I am using the following setting when indexing via Spark:
> >>
> >> BlurOutputFormat.setIndexLocally(conf, false);
> >>
> >> Is the merge not happening because of this? I can see that the
> >> GenericBlurRecordWriter flush method calls maybeMerge only when it is
> >> using the local temp index.
> >>
> >
> > The index locally setting is used to index directly in HDFS from MR.  The
> > normal behavior is to index the update locally through the
> > BlurOutputFormat.  Once complete, the index is then copied and optimized
> > in flight to HDFS where the Blur table is being served.  So when Index
> > Local = false, one or more segments are added to the index.  When Index
> > Local = true, only one segment is added to the index regardless of size.
> >
> > Aaron
> >
> >
> >>
> >> Dibyendu
> >>
> >>
> >>
> >>
> >>
> >> On Sun, Oct 26, 2014 at 9:47 PM, Aaron McCurry <[email protected]>
> >> wrote:
> >>
> >> > On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya <
> >> > [email protected]> wrote:
> >> >
> >> > > Some more observations.
> >> > >
> >> > > As I said, when I index using the Spark RDD saveAsHadoop* API, a bunch
> >> > > of .lnk files and inuse folders are created which never get
> >> > > merged/deleted. But when I stopped the Spark job which uses the Hadoop
> >> > > API and started another Spark job for the same table which uses the
> >> > > Blur Thrift enqueue mutate call, I could see that all the previous .lnk
> >> > > files and inuse folders were eventually merged and deleted. The index
> >> > > count is fine and new documents also keep getting added to the index.
> >> > >
> >> >
> >> > Ok, we are in the process of testing this issue.  We will let you know
> >> > what we find.
> >> >
> >> >
> >> > >
> >> > > So I do not think there is any issue with permissions. Probably the
> >> > > merge logic is not getting started when indexing happens through
> >> > > BlurOutputFormat.
> >> > >
> >> >
> >> > Not sure when the maybeMerge call was integrated into the external index
> >> > loading, but it should be merging the segments.
> >> >
> >> > Aaron
> >> >
> >> >
> >> > >
> >> > > Regards,
> >> > > Dibyendu
> >> > >
> >> > > On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
> >> > > wrote:
> >> > >
> >> > >> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
> >> > >> wrote:
> >> > >>
> >> > >> > Didn't reply to all.
> >> > >> >
> >> > >> > ---------- Forwarded message ----------
> >> > >> > From: Aaron McCurry <[email protected]>
> >> > >> > Date: Fri, Oct 24, 2014 at 3:47 PM
> >> > >> > Subject: Re: Some Performance number of Spark Blur Connector
> >> > >> > To: Dibyendu Bhattacharya <[email protected]>
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
> >> > >> > [email protected]> wrote:
> >> > >> >
> >> > >> >> Hi Aaron,
> >> > >> >>
> >> > >> >> here are some performance numbers comparing enqueue mutate and RDD
> >> > >> >> saveAsHadoopFile, both using Spark Streaming.
> >> > >> >>
> >> > >> >> The setup I used is not very optimized, but it can give an idea about
> >> > >> >> both methods of indexing via Spark Streaming.
> >> > >> >>
> >> > >> >> I used a 4-node EMR M1.Xlarge cluster, and installed Blur as 1
> >> > >> >> Controller and 3 Shard Servers. My Blur table has 9 partitions.
> >> > >> >>
> >> > >> >> On the same cluster, I was running Spark with 1 Master and 3 Workers.
> >> > >> >> This is not a good setup but anyway, here are the numbers.
> >> > >> >>
> >> > >> >> The enqueueMutate index rate is around 800 messages/second.
> >> > >> >>
> >> > >> >> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
> >> > >> >>
> >> > >> >> This is about 15 times faster, roughly one order of magnitude.
> >> > >> >>
> >> > >> >
> >> > >> > That's awesome, thank you for sharing!
> >> > >> >
> >> > >> >
> >> > >> >>
> >> > >> >>
> >> > >> >> Not sure if this is an issue with the saveAsHadoopFile approach, but I
> >> > >> >> can see that the Shard folder in HDFS has lots of small Lucene *.lnk
> >> > >> >> files getting created (probably one for each saveAsHadoopFile call)
> >> > >> >> and there are that many "inuse" folders, as you can see in the
> >> > >> >> screenshot.
> >> > >> >>
> >> > >> >> And these entries keep increasing to a huge number if this Spark
> >> > >> >> Streaming job keeps running for some time. Not sure if this has any
> >> > >> >> impact on indexing and search performance?
> >> > >> >>
> >> > >> >
> >> > >> > They should be merged and removed over time; however, if there is a
> >> > >> > permission problem Blur might not be able to remove the inuse folders.
> >> > >> >
> >> > >>
> >> > >> In a situation where permissions are the problem the .lnk files are
> >> > >> properly cleaned while the .inuse dirs hang around.  If both are hanging
> >> > >> around I suspect it's not permissions.  A lesson freshly learned over
> >> > >> here:)
> >> > >>
> >> > >> --tim
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
