On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya <[email protected]> wrote:
> Some more observations.
>
> As I said, when I index using the Spark RDD saveAsHadoop* API, a bunch of
> .lnk files and inuse folders get created that never get merged/deleted.
> But when I stopped the Spark job that uses the Hadoop API and started
> another Spark job for the same table that uses the Blur Thrift enqueue
> mutate call, I could see all the previous .lnk files and inuse folders
> eventually being merged and deleted. The index counts are fine and new
> documents also keep getting added to the index.
>

Ok, we are in the process of testing this issue. We will let you know what
we find.

> So I do not think there is any issue with permissions. Probably the merge
> logic is not getting started when indexing happens through
> BlurOutputFormat.
>

Not sure when the mergeMaybe call was integrated into the external index
loading, but it should be merging the segments.

Aaron

> Regards,
> Dibyendu
>
> On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
> wrote:
>
>> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
>> wrote:
>>
>> > Didn't reply to all.
>> >
>> > ---------- Forwarded message ----------
>> > From: Aaron McCurry <[email protected]>
>> > Date: Fri, Oct 24, 2014 at 3:47 PM
>> > Subject: Re: Some Performance number of Spark Blur Connector
>> > To: Dibyendu Bhattacharya <[email protected]>
>> >
>> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
>> > [email protected]> wrote:
>> >
>> >> Hi Aaron,
>> >>
>> >> Here are some performance numbers comparing enqueue mutate and RDD
>> >> saveAsHadoopFile, both using Spark Streaming.
>> >>
>> >> The setup I used is not a very optimized one, but it can give an idea
>> >> about both methods of indexing via Spark Streaming.
>> >>
>> >> I used a 4-node EMR m1.xlarge cluster and installed Blur as 1
>> >> controller and 3 shard servers. My Blur table has 9 partitions.
>> >>
>> >> On the same cluster I was running Spark with 1 master and 3 workers.
>> >> This is not a good setup, but anyway, here are the numbers.
>> >>
>> >> The enqueueMutate index rate is around 800 messages/second.
>> >>
>> >> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
>> >>
>> >> That is more than an order of magnitude faster.
>> >
>> > That's awesome, thank you for sharing!
>> >
>> >> Not sure if this is an issue with the saveAsHadoopFile approach, but I
>> >> can see that the shard folder in HDFS has lots of small Lucene *.lnk
>> >> files getting created (probably one per saveAsHadoopFile call), and
>> >> there are that many "inuse" folders, as you can see in the screenshot.
>> >>
>> >> These entries keep increasing to a huge number if the Spark Streaming
>> >> job keeps running for some time. Not sure if this has any impact on
>> >> indexing and search performance?
>> >
>> > They should be merged and removed over time; however, if there is a
>> > permission problem Blur might not be able to remove the inuse folders.
>>
>> In a situation where permissions are the problem, the .lnk files are
>> properly cleaned while the .inuse dirs hang around. If both are hanging
>> around, I suspect it's not permissions. A lesson freshly learned over
>> here :)
>>
>> --tim
>
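For readers following along, here is a rough sketch of the "enqueue mutate" path the thread compares against the bulk loader. It is written in Scala against the Blur Thrift client; the class and method names (BlurClient, Blur.Iface#enqueueMutate, RowMutation, RecordMutation, Record, Column) are recalled from the Blur 0.2.x generated Thrift API and should be checked against the version in use, and the controller address, table name, and field values are placeholders.

```scala
// Sketch only: Blur 0.2.x Thrift class/method names are assumed, not verified.
import org.apache.blur.thrift.BlurClient
import org.apache.blur.thrift.generated._

object EnqueueMutateSketch {
  def main(args: Array[String]): Unit = {
    // Connect to a controller; host:port is a placeholder for this example.
    val client: Blur.Iface = BlurClient.getClient("controller1:40010")

    // Build one row containing a single record with one column.
    val record = new Record()
    record.setRecordId("record-1")
    record.setFamily("fam")
    record.addToColumns(new Column("message", "hello blur"))

    val recordMutation = new RecordMutation()
    recordMutation.setRecordMutationType(RecordMutationType.REPLACE_ENTIRE_RECORD)
    recordMutation.setRecord(record)

    val rowMutation = new RowMutation()
    rowMutation.setTable("my_table")   // placeholder table name
    rowMutation.setRowId("row-1")
    rowMutation.setRowMutationType(RowMutationType.REPLACE_ROW)
    rowMutation.addToRecordMutations(recordMutation)

    // enqueueMutate hands the write to the shard server's internal queue,
    // which is the per-message call path measured at ~800 messages/second
    // in the thread above.
    client.enqueueMutate(rowMutation)
  }
}
```

In a Spark Streaming job this call would typically be made from foreachPartition on each micro-batch, one Thrift client per partition, which is why the per-row round trips dominate compared to the bulk path.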

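For comparison, a minimal sketch of the bulk path, writing an RDD of mutations through BlurOutputFormat with Spark's new-API Hadoop save call. This is an assumption-heavy illustration, not the connector code discussed in the thread: BlurOutputFormat.setupJob, BlurRecord, BlurMutate and its MUTATE_TYPE enum are recalled from the Blur MapReduce module and may differ in signature, and the input path, output path, table name, and parsing are placeholders.

```scala
// Sketch only: BlurOutputFormat/BlurMutate/BlurRecord signatures are assumed.
import org.apache.blur.mapreduce.lib.{BlurMutate, BlurOutputFormat, BlurRecord}
import org.apache.blur.thrift.BlurClient
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair RDD implicits (Spark 1.x)

object BulkIndexSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("blur-bulk-index"))

    // Fetch the table layout from the controller so the output format knows
    // the shard layout; describe() is part of the Blur Thrift API.
    val client = BlurClient.getClient("controller1:40010")
    val tableDescriptor = client.describe("my_table")

    val job = Job.getInstance(sc.hadoopConfiguration)
    BlurOutputFormat.setupJob(job, tableDescriptor)   // assumed helper

    // Placeholder input; in the streaming case this would be each batch RDD.
    val mutations = sc.textFile("hdfs:///input/messages").map { line =>
      val record = new BlurRecord()
      record.setRowId(line.hashCode.toString)
      record.setRecordId(line.hashCode.toString)
      record.setFamily("fam")
      record.addColumn("message", line)   // assumed BlurRecord helper
      val mutate = new BlurMutate(BlurMutate.MUTATE_TYPE.REPLACE, record)
      (new Text(record.getRowId), mutate)
    }

    // Each save produces a fresh set of index files that the shard servers
    // pick up and must later merge, which matches the growing .lnk/inuse
    // entries described in the thread when this runs per micro-batch.
    mutations.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/blur-output",          // placeholder output path
      classOf[Text],
      classOf[BlurMutate],
      classOf[BlurOutputFormat],
      job.getConfiguration)
  }
}
```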