On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya <[email protected]> wrote:
> Some more observations.
>
> As I said, when I index using the Spark RDD saveAsHadoop* API, a bunch of
> .lnk files and inuse folders get created that never get merged/deleted.
> But when I stopped the Spark job that uses the Hadoop API and started
> another Spark job for the same table that uses the Blur Thrift enqueue
> mutate call, I could see all the previous .lnk files and inuse folders
> eventually being merged and deleted. The index counts are fine and new
> documents also keep getting added to the index.
>

Ok, we are in the process of testing this issue. We will let you know what
we find.

> So I do not think there is any issue with permissions. Probably the merge
> logic is not getting started when indexing happens through
> BlurOutputFormat.
>

Not sure when the mergeMaybe call was integrated into the external index
loading, but it should be merging the segments.

Aaron

> Regards,
> Dibyendu
>
> On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
> wrote:
>
>> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
>> wrote:
>>
>> > Didn't reply to all.
>> >
>> > ---------- Forwarded message ----------
>> > From: Aaron McCurry <[email protected]>
>> > Date: Fri, Oct 24, 2014 at 3:47 PM
>> > Subject: Re: Some Performance number of Spark Blur Connector
>> > To: Dibyendu Bhattacharya <[email protected]>
>> >
>> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
>> > [email protected]> wrote:
>> >
>> >> Hi Aaron,
>> >>
>> >> Here are some performance numbers comparing enqueue mutate and RDD
>> >> saveAsHadoopFile, both using Spark Streaming.
>> >>
>> >> The setup I used is not a very optimized one, but it can give an idea
>> >> about both methods of indexing via Spark Streaming.
>> >>
>> >> I used a 4-node EMR m1.xlarge cluster and installed Blur as 1
>> >> controller and 3 shard servers. My Blur table has 9 partitions.
>> >>
>> >> On the same cluster I was running Spark with 1 master and 3 workers.
>> >> This is not a good setup, but anyway, here are the numbers.
>> >>
>> >> The enqueueMutate index rate is around 800 messages/second.
>> >>
>> >> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
>> >>
>> >> That is more than an order of magnitude faster.
>> >
>> > That's awesome, thank you for sharing!
>> >
>> >> Not sure if this is an issue with the saveAsHadoopFile approach, but I
>> >> can see that the shard folder in HDFS has lots of small Lucene *.lnk
>> >> files getting created (probably one per saveAsHadoopFile call), and
>> >> there are that many "inuse" folders, as you can see in the screenshot.
>> >>
>> >> These entries keep increasing to a huge number if the Spark Streaming
>> >> job keeps running for some time. Not sure if this has any impact on
>> >> indexing and search performance?
>> >
>> > They should be merged and removed over time; however, if there is a
>> > permission problem Blur might not be able to remove the inuse folders.
>>
>> In a situation where permissions are the problem, the .lnk files are
>> properly cleaned while the .inuse dirs hang around. If both are hanging
>> around, I suspect it's not permissions. A lesson freshly learned over
>> here :)
>>
>> --tim
>
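For readers following along, here is a rough sketch of the "enqueue mutate" path the thread compares against the bulk loader. It is written in Scala against the Blur Thrift client; the class and method names (BlurClient, Blur.Iface#enqueueMutate, RowMutation, RecordMutation, Record, Column) are recalled from the Blur 0.2.x generated Thrift API and should be checked against the version in use, and the controller address, table name, and field values are placeholders.

```scala
// Sketch only: Blur 0.2.x Thrift class/method names are assumed, not verified.
import org.apache.blur.thrift.BlurClient
import org.apache.blur.thrift.generated._

object EnqueueMutateSketch {
  def main(args: Array[String]): Unit = {
    // Connect to a controller; host:port is a placeholder for this example.
    val client: Blur.Iface = BlurClient.getClient("controller1:40010")

    // Build one row containing a single record with one column.
    val record = new Record()
    record.setRecordId("record-1")
    record.setFamily("fam")
    record.addToColumns(new Column("message", "hello blur"))

    val recordMutation = new RecordMutation()
    recordMutation.setRecordMutationType(RecordMutationType.REPLACE_ENTIRE_RECORD)
    recordMutation.setRecord(record)

    val rowMutation = new RowMutation()
    rowMutation.setTable("my_table")   // placeholder table name
    rowMutation.setRowId("row-1")
    rowMutation.setRowMutationType(RowMutationType.REPLACE_ROW)
    rowMutation.addToRecordMutations(recordMutation)

    // enqueueMutate hands the write to the shard server's internal queue,
    // which is the per-message call path measured at ~800 messages/second
    // in the thread above.
    client.enqueueMutate(rowMutation)
  }
}
```

In a Spark Streaming job this call would typically be made from foreachPartition on each micro-batch, one Thrift client per partition, which is why the per-row round trips dominate compared to the bulk path.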

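For comparison, a minimal sketch of the bulk path, writing an RDD of mutations through BlurOutputFormat with Spark's new-API Hadoop save call. This is an assumption-heavy illustration, not the connector code discussed in the thread: BlurOutputFormat.setupJob, BlurRecord, BlurMutate and its MUTATE_TYPE enum are recalled from the Blur MapReduce module and may differ in signature, and the input path, output path, table name, and parsing are placeholders.

```scala
// Sketch only: BlurOutputFormat/BlurMutate/BlurRecord signatures are assumed.
import org.apache.blur.mapreduce.lib.{BlurMutate, BlurOutputFormat, BlurRecord}
import org.apache.blur.thrift.BlurClient
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair RDD implicits (Spark 1.x)

object BulkIndexSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("blur-bulk-index"))

    // Fetch the table layout from the controller so the output format knows
    // the shard layout; describe() is part of the Blur Thrift API.
    val client = BlurClient.getClient("controller1:40010")
    val tableDescriptor = client.describe("my_table")

    val job = Job.getInstance(sc.hadoopConfiguration)
    BlurOutputFormat.setupJob(job, tableDescriptor)   // assumed helper

    // Placeholder input; in the streaming case this would be each batch RDD.
    val mutations = sc.textFile("hdfs:///input/messages").map { line =>
      val record = new BlurRecord()
      record.setRowId(line.hashCode.toString)
      record.setRecordId(line.hashCode.toString)
      record.setFamily("fam")
      record.addColumn("message", line)   // assumed BlurRecord helper
      val mutate = new BlurMutate(BlurMutate.MUTATE_TYPE.REPLACE, record)
      (new Text(record.getRowId), mutate)
    }

    // Each save produces a fresh set of index files that the shard servers
    // pick up and must later merge, which matches the growing .lnk/inuse
    // entries described in the thread when this runs per micro-batch.
    mutations.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/blur-output",          // placeholder output path
      classOf[Text],
      classOf[BlurMutate],
      classOf[BlurOutputFormat],
      job.getConfiguration)
  }
}
```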