On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]> wrote:
> Didn't reply to all.
>
> ---------- Forwarded message ----------
> From: Aaron McCurry <[email protected]>
> Date: Fri, Oct 24, 2014 at 3:47 PM
> Subject: Re: Some Performance number of Spark Blur Connector
> To: Dibyendu Bhattacharya <[email protected]>
>
> On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <[email protected]> wrote:
>
>> Hi Aaron,
>>
>> Here are some performance numbers comparing enqueueMutate and RDD saveAsHadoopFile, both using Spark Streaming.
>>
>> The setup I used is not a very optimized one, but it can give an idea about both methods of indexing via Spark Streaming.
>>
>> I used a 4-node EMR m1.xlarge cluster and installed Blur as 1 controller and 3 shard servers. My Blur table has 9 partitions.
>>
>> On the same cluster, I was running Spark with 1 master and 3 workers. This is not a good setup, but anyway, here are the numbers.
>>
>> The enqueueMutate index rate is around 800 messages/second.
>>
>> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
>>
>> This is more than an order of magnitude faster.
>
> That's awesome, thank you for sharing!
>
>> Not sure if this is an issue with the saveAsHadoopFile approach, but I can see that the shard folder in HDFS has lots of small Lucene *.lnk files getting created (probably one for each saveAsHadoopFile call), and there are that many "inuse" folders, as you can see in the screenshot.
>>
>> These entries keep increasing to a huge number if the Spark Streaming job keeps running for some time. Not sure if this has any impact on indexing and search performance?
>
> They should be merged and removed over time; however, if there is a permission problem, Blur might not be able to remove the inuse folders.

In a situation where permissions are the problem, the .lnk files are properly cleaned while the .inuse dirs hang around. If both are hanging around, I suspect it's not permissions. A lesson freshly learned over here :)

--tim
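
For anyone following along, here is a minimal sketch of the two code paths being compared above, written against the standard Spark Streaming API (foreachRDD, foreachPartition, saveAsHadoopFile). The Blur-side names (BlurClient.getClient, enqueueMutate, BlurMutate, BlurOutputFormat) are assumptions taken from this thread, not a verified copy of the connector, and their imports are omitted. Depending on whether the connector's output format targets the old mapred or the new mapreduce Hadoop API, saveAsNewAPIHadoopFile may be the right call instead of saveAsHadoopFile.

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.dstream.DStream
// Blur classes (RowMutation, BlurMutate, BlurClient, BlurOutputFormat) are
// assumed to be on the classpath; their imports are not shown here.

// Path 1: one Thrift enqueueMutate call per record (~800 messages/sec above).
def indexViaEnqueueMutate(stream: DStream[RowMutation], connection: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { mutations =>
      // One client per partition so the Thrift connection is never serialized.
      val client = BlurClient.getClient(connection)
      mutations.foreach(m => client.enqueueMutate(m))
    }
  }
}

// Path 2: write each micro-batch out through a Hadoop OutputFormat
// (~12,000 messages/sec above). Each batch lands a new set of Lucene
// segment files, which lines up with the per-call *.lnk and .inuse entries
// observed in the shard folder; Blur is expected to merge and remove them
// over time.
def indexViaSaveAsHadoopFile(stream: DStream[(Text, BlurMutate)],
                             conf: JobConf,
                             basePath: String): Unit = {
  stream.foreachRDD { (rdd, time) =>
    // Write each batch to its own directory so FileOutputFormat's
    // existing-directory check does not reject the second batch.
    rdd.saveAsHadoopFile(s"$basePath/${time.milliseconds}",
      classOf[Text], classOf[BlurMutate], classOf[BlurOutputFormat], conf)
  }
}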
