Re: Some Performance number of Spark Blur Connector

Aaron McCurry Sun, 26 Oct 2014 18:30:07 -0700

On Sun, Oct 26, 2014 at 1:48 PM, Dibyendu Bhattacharya <
[email protected]> wrote:


> I am using the following setting when indexing via Spark:
>
> BlurOutputFormat.setIndexLocally(conf, false);
>
> Is this why the merge is not happening? I can see that the
> GenericBlurRecordWriter flush method calls maybeMerge only when it is
> using the local temp index.
>
>

Turning the index locally setting off makes MR index directly in HDFS. The
normal behavior (Index Local = true) is for the BlurOutputFormat to build
the update as a local index. Once complete, that index is copied and
optimized in flight to HDFS, where the Blur table is being served. So when
Index Local = false, one or more segments are added to the index. When
Index Local = true, only one segment is added to the index, regardless of
size.
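
A rough sketch of the two modes, as an illustration only and not code taken
from Blur itself. The only Blur call used is the
BlurOutputFormat.setIndexLocally(conf, boolean) call quoted above. The class
name IndexLocallyModes, the helper configFor, and the import package for
BlurOutputFormat are assumptions and should be checked against your Blur
version. It needs the Hadoop and Blur jars on the classpath.

```java
import org.apache.blur.mapreduce.lib.BlurOutputFormat; // assumed package; check your Blur version
import org.apache.hadoop.conf.Configuration;

public class IndexLocallyModes {

  // Hypothetical helper: builds a configuration for one of the two modes.
  static Configuration configFor(boolean indexLocally) {
    Configuration conf = new Configuration();
    // true  (normal): build the index locally, then copy it to HDFS,
    //                 which adds a single segment to the table index.
    // false:          index directly in HDFS, which adds one or more segments.
    BlurOutputFormat.setIndexLocally(conf, indexLocally);
    return conf;
  }

  public static void main(String[] args) {
    Configuration direct = configFor(false); // the setting used in this thread
    Configuration local = configFor(true);   // the normal behavior
    System.out.println("Configurations built: " + (direct != null && local != null));
  }
}
```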

Aaron


>
> Dibyendu
>
>
>
>
>
> On Sun, Oct 26, 2014 at 9:47 PM, Aaron McCurry <[email protected]> wrote:
>
> > On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya <
> > [email protected]> wrote:
> >
> > > Some more observations.
> > >
> > > As I said, when I index using the Spark RDD saveAsHadoop* API, a
> > > bunch of .lnk files and inuse folders get created that never get
> > > merged or deleted. But when I stopped the Spark job that uses the
> > > Hadoop API and started another Spark job for the same table that uses
> > > the Blur Thrift enqueue mutate call, I could see all the previous .lnk
> > > files and inuse folders eventually being merged and deleted. The index
> > > count is fine, and new documents also keep being added to the index.
> > >
> >
> > OK, we are in the process of testing this issue. We will let you know
> > what we find.
> >
> >
> > >
> > > So I do not think there is any issue with permissions. Probably the
> > > merge logic is not getting started when indexing happens through
> > > BlurOutputFormat.
> > >
> >
> > Not sure when the maybeMerge call was integrated into the external index
> > loading, but it should be merging the segments.
> >
> > Aaron
> >
> >
> > >
> > > Regards,
> > > Dibyendu
> > >
> > > On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
> > > wrote:
> > >
> > >> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
> > >> wrote:
> > >>
> > >> > Didn't reply to all.
> > >> >
> > >> > ---------- Forwarded message ----------
> > >> > From: Aaron McCurry <[email protected]>
> > >> > Date: Fri, Oct 24, 2014 at 3:47 PM
> > >> > Subject: Re: Some Performance number of Spark Blur Connector
> > >> > To: Dibyendu Bhattacharya <[email protected]>
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
> > >> > [email protected]> wrote:
> > >> >
> > >> >> Hi Aaron,
> > >> >>
> > > >> >> Here are some performance numbers comparing enqueue mutate and RDD
> > > >> >> saveAsHadoopFile, both using Spark Streaming.
> > > >> >>
> > > >> >> The setup I used is not well optimized, but it can give an idea of
> > > >> >> how both methods of indexing via Spark Streaming perform.
> > > >> >>
> > > >> >> I used a 4-node EMR m1.xlarge cluster and installed Blur as 1
> > > >> >> controller and 3 shard servers. My Blur table has 9 partitions.
> > > >> >>
> > > >> >> On the same cluster, I was running Spark with 1 master and 3
> > > >> >> workers. This is not a good setup, but anyway, here are the numbers.
> > > >> >>
> > > >> >> The enqueue mutate index rate is around 800 messages/second.
> > > >> >>
> > > >> >> The RDD saveAsHadoopFile index rate is around 12,000
> > > >> >> messages/second.
> > >> >>
> > > >> >> That is about 15 times faster, roughly one order of magnitude.
> > >> >>
> > >> >
> > >> > That's awesome, thank you for sharing!
> > >> >
> > >> >
> > >> >>
> > >> >>
> > > >> >> Not sure if this is an issue with the saveAsHadoopFile approach, but
> > > >> >> I can see that in the shard folder in HDFS lots of small Lucene
> > > >> >> *.lnk files are being created (probably one for each
> > > >> >> saveAsHadoopFile call), along with as many "inuse" folders, as you
> > > >> >> can see in the screenshot.
> > > >> >>
> > > >> >> These entries grow to a huge number if the Spark Streaming job
> > > >> >> keeps running for some time. Does this have any impact on indexing
> > > >> >> and search performance?
> > >> >>
> > >> >
> > > >> > They should be merged and removed over time. However, if there is a
> > > >> > permission problem, Blur might not be able to remove the inuse
> > > >> > folders.
> > >> >
> > >>
> > > >> In a situation where permissions are the problem, the .lnk files are
> > > >> properly cleaned up while the .inuse dirs hang around. If both are
> > > >> hanging around, I suspect it's not permissions. A lesson freshly
> > > >> learned over here :)
> > >>
> > >> --tim
> > >>
> > >
> > >
> >
>
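
To put the two rates quoted in this thread side by side: about 800
messages/second for enqueue mutate against about 12,000 messages/second for
RDD saveAsHadoopFile is a factor of 15, or a little over one order of
magnitude. A minimal Java sketch of that arithmetic follows. The class and
method names and the 1,000,000-message sample size are hypothetical, chosen
only for illustration.

```java
class IndexRateComparison {

    /** Returns how many times faster the second rate is than the first. */
    static double speedup(double baselinePerSec, double fasterPerSec) {
        return fasterPerSec / baselinePerSec;
    }

    /** Returns the seconds needed to index the given number of messages. */
    static double secondsToIndex(long messages, double ratePerSec) {
        return messages / ratePerSec;
    }

    public static void main(String[] args) {
        double enqueueMutate = 800.0;       // messages/second, from the thread
        double saveAsHadoopFile = 12000.0;  // messages/second, from the thread
        long sample = 1_000_000L;           // hypothetical batch size

        double factor = speedup(enqueueMutate, saveAsHadoopFile);
        System.out.println("Speedup factor: " + factor);
        System.out.println("Orders of magnitude: " + Math.log10(factor));
        System.out.println("Seconds via enqueue mutate: "
                + secondsToIndex(sample, enqueueMutate));
        System.out.println("Seconds via saveAsHadoopFile: "
                + secondsToIndex(sample, saveAsHadoopFile));
    }
}
```

Both rates come from a single unoptimized 4-node test, so the ratio is
indicative rather than a benchmark.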
