Dibyendu, you were right.  After looking into the problem (and hitting it
myself), I found that the maybeMerge call was missing.  Here's the commit
that fixes the issue.

https://github.com/apache/incubator-blur/commit/ce2179ed5cb5a534275678ad0982fef4e99fbb49

Also, we have made a few fixes to resource management over the past few
days, so when you get a chance I would recommend that you upgrade to the
latest version on master.  Thanks!

Aaron

On Sun, Oct 26, 2014 at 9:29 PM, Aaron McCurry <[email protected]> wrote:

>
>
> On Sun, Oct 26, 2014 at 1:48 PM, Dibyendu Bhattacharya <
> [email protected]> wrote:
>
>> I am using the following setting when indexing via Spark...
>>
>> BlurOutputFormat.setIndexLocally(conf, false);
>>
>> Is this why the merge is not happening? I can see that the
>> GenericBlurRecordWriter flush method calls maybeMerge only when it is
>> using the local temp index.
>>
>
> The index locally setting controls whether MR indexes directly in HDFS.
> The normal behavior is to index the update locally through the
> BlurOutputFormat.  Once complete, the index is copied and optimized in
> flight to HDFS, where the Blur table is being served.  So when Index Local =
> false, one or more segments are added to the index.  When Index Local = true,
> only one segment is added to the index, regardless of size.
>
> Aaron
>
>
>>
>> Dibyendu
>>
>>
>>
>>
>>
>> On Sun, Oct 26, 2014 at 9:47 PM, Aaron McCurry <[email protected]>
>> wrote:
>>
>> > On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya <
>> > [email protected]> wrote:
>> >
>> > > Some more observations.
>> > >
>> > > As I said, when I index using the Spark RDD saveAsHadoop* API, a
>> > > bunch of .lnk files and inuse folders get created that never get
>> > > merged or deleted. But when I stopped the Spark job that uses the
>> > > Hadoop API and started another Spark job for the same table that uses
>> > > the Blur Thrift enqueue mutate call, I could see that all the previous
>> > > .lnk files and inuse folders were eventually merged and deleted. The
>> > > index counts are fine and new documents also keep being added to the
>> > > index.
>> > >
>> >
>> > Ok, we are in the process of testing this issue.  We will let you know
>> > what we find.
>> >
>> >
>> > >
>> > > So I do not think there is any issue with permissions. Probably the
>> > > merge logic is not getting started when indexing happens through
>> > > BlurOutputFormat.
>> > >
>> >
>> > Not sure when the maybeMerge call was integrated into the external index
>> > loading, but it should be merging the segments.
>> >
>> > Aaron
>> >
>> >
>> > >
>> > > Regards,
>> > > Dibyendu
>> > >
>> > > On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
>> > > wrote:
>> > >
>> > >> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
>> > >> wrote:
>> > >>
>> > >> > Didn't reply to all.
>> > >> >
>> > >> > ---------- Forwarded message ----------
>> > >> > From: Aaron McCurry <[email protected]>
>> > >> > Date: Fri, Oct 24, 2014 at 3:47 PM
>> > >> > Subject: Re: Some Performance number of Spark Blur Connector
>> > >> > To: Dibyendu Bhattacharya <[email protected]>
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
>> > >> > [email protected]> wrote:
>> > >> >
>> > >> >> Hi Aaron,
>> > >> >>
>> > >> >> Here are some performance numbers comparing enqueue mutate and RDD
>> > >> >> saveAsHadoopFile, both using Spark Streaming.
>> > >> >>
>> > >> >> The setup I used is not very optimized, but it can give an idea of
>> > >> >> both methods of indexing via Spark Streaming.
>> > >> >>
>> > >> >> I used a 4-node EMR m1.xlarge cluster and installed Blur as 1
>> > >> >> controller and 3 shard servers. My Blur table has 9 partitions.
>> > >> >>
>> > >> >> On the same cluster, I was running Spark with 1 master and 3
>> > >> >> workers. This is not a good setup, but anyway, here are the numbers.
>> > >> >>
>> > >> >> The enqueue mutate index rate is around 800 messages/second.
>> > >> >>
>> > >> >> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
>> > >> >>
>> > >> >> This is roughly 15 times faster, about one order of magnitude.
>> > >> >>
>> > >> >
>> > >> > That's awesome, thank you for sharing!
>> > >> >
>> > >> >
>> > >> >>
>> > >> >>
>> > >> >> Not sure if this is an issue with the saveAsHadoopFile approach, but
>> > >> >> I can see that the shard folder in HDFS has lots of small Lucene
>> > >> >> *.lnk files getting created (probably for each saveAsHadoopFile
>> > >> >> call), and there are that many "inuse" folders, as you can see in
>> > >> >> the screenshot.
>> > >> >>
>> > >> >> And these entries keep increasing to a huge number if the Spark
>> > >> >> Streaming job keeps running for some time. Not sure if this has any
>> > >> >> impact on indexing and search performance?
>> > >> >>
>> > >> >
>> > >> > They should be merged and removed over time; however, if there is a
>> > >> > permission problem, Blur might not be able to remove the inuse
>> > >> > folders.
>> > >> >
>> > >>
>> > >> In a situation where permissions are the problem, the .lnk files are
>> > >> properly cleaned while the .inuse dirs hang around.  If both are
>> > >> hanging around, I suspect it's not permissions.  A lesson freshly
>> > >> learned over here :)
>> > >>
>> > >> --tim
>> > >>
>> > >
>> > >
>> >
>>
>
>
