On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]> wrote:

> Didn't reply to all.
>
> ---------- Forwarded message ----------
> From: Aaron McCurry <[email protected]>
> Date: Fri, Oct 24, 2014 at 3:47 PM
> Subject: Re: Some Performance number of Spark Blur Connector
> To: Dibyendu Bhattacharya <[email protected]>
>
>
>
>
> On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
> [email protected]> wrote:
>
>> Hi Aaron,
>>
>> Here are some performance numbers comparing enqueueMutate and RDD
>> saveAsHadoopFile, both using Spark Streaming.
>>
>> The setup I used is not a very optimized one, but it can give an idea
>> about both methods of indexing via Spark Streaming.
>>
>> I used a 4-node EMR m1.xlarge cluster and installed Blur as 1 controller
>> and 3 shard servers. My Blur table has 9 partitions.
>>
>> On the same cluster, I was running Spark with 1 master and 3 workers. This
>> is not a good setup, but anyway, here are the numbers.
>>
>> The enqueueMutate index rate is around 800 messages/second.
>>
>> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
>>
>> This is more than an order of magnitude faster.
>>
>
> That's awesome, thank you for sharing!
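
For anyone following along, here is a rough Scala sketch of the two paths
being compared. The Spark Streaming scaffolding is standard, but the
Blur-specific parts are placeholders: BlurIndexClient / openBlurClient stand
in for the Blur Thrift client behind enqueueMutate, and TextOutputFormat plus
the output path stand in for Blur's own Hadoop output format and table
directory, purely to keep the sketch self-contained. This is not the
connector's actual code.

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BlurIndexingSketch {

  // Placeholder for a Blur Thrift client; the real connector would wrap the
  // generated Thrift client here instead of this no-op stub.
  trait BlurIndexClient {
    def enqueueMutate(table: String, record: String): Unit
  }
  def openBlurClient(controller: String): BlurIndexClient =
    new BlurIndexClient {
      def enqueueMutate(table: String, record: String): Unit = ()
    }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("blur-spark-indexing"), Seconds(10))
    val messages = ssc.socketTextStream("localhost", 9999) // any DStream[String]

    // Path 1: one enqueueMutate call per record (~800 msg/s in the test).
    // Throughput is bounded by per-record RPC and queueing overhead.
    messages.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val client = openBlurClient("blur-controller:40010") // illustrative address
        records.foreach(r => client.enqueueMutate("my_table", r))
      }
    }

    // Path 2: write each micro-batch in bulk through a Hadoop OutputFormat
    // (~12,000 msg/s in the test). Every saveAsHadoopFile call produces a
    // fresh set of output files, which is where the per-batch small files
    // discussed below come from. In practice you would use one path or the
    // other, not both.
    messages.foreachRDD { (rdd, time) =>
      rdd.map(r => (new Text(r), new Text(r))).saveAsHadoopFile(
        s"hdfs:///tmp/blur-batches/${time.milliseconds}", // illustrative path
        classOf[Text], classOf[Text],
        classOf[TextOutputFormat[Text, Text]])
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The bulk path wins because each micro-batch goes through a single Hadoop
write job instead of one RPC per record, at the cost of producing a new set
of index files every batch.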
>
>
>>
>>
>> Not sure if this is an issue with the saveAsHadoopFile approach, but I can
>> see that the shard folder in HDFS has lots of small Lucene *.lnk files
>> getting created (probably one for each saveAsHadoopFile call), and there
>> are that many ".inuse" folders, as you can see in the screenshot.
>>
>> And these entries keep growing to a huge number if the Spark Streaming job
>> keeps running for some time. Not sure if this has any impact on indexing
>> and search performance?
>>
>
> They should be merged and removed over time; however, if there is a
> permission problem, Blur might not be able to remove the .inuse folders.
>

In a situation where permissions are the problem, the .lnk files are
properly cleaned up while the .inuse dirs hang around.  If both are hanging
around, I suspect it's not permissions.  A lesson freshly learned over here :)

--tim
