Dibyendu, you were right. After looking into the problem (and hitting it
myself), the maybeMerge call was missing. Here's the commit that fixes the
issue:
https://github.com/apache/incubator-blur/commit/ce2179ed5cb5a534275678ad0982fef4e99fbb49

Also, we have made a few fixes to resource management over the past few
days, so when you get a chance I would recommend that you upgrade to the
latest version on master.

Thanks!

Aaron

On Sun, Oct 26, 2014 at 9:29 PM, Aaron McCurry <[email protected]> wrote:
>
> On Sun, Oct 26, 2014 at 1:48 PM, Dibyendu Bhattacharya
> <[email protected]> wrote:
>
>> I am using the following settings when indexing via Spark...
>>
>> BlurOutputFormat.setIndexLocally(conf, false);
>>
>> Is the merge not happening because of this? I can see that the
>> GenericBlurRecordWriter flush method calls maybeMerge only when it is
>> using the local temp index.
>>
> The index locally setting controls whether the MR job indexes directly in
> HDFS. The normal behavior is to index the update locally through the
> BlurOutputFormat; once complete, the index is then copied and optimized
> in flight to HDFS where the Blur table is being served. So when index
> locally = false, one or more segments are added to the index. When index
> locally = true, only one segment is added to the index regardless of size.
>
> Aaron
>
>> Dibyendu
>>
>> On Sun, Oct 26, 2014 at 9:47 PM, Aaron McCurry <[email protected]> wrote:
>>
>> > On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya
>> > <[email protected]> wrote:
>> >
>> > > Some more observations.
>> > >
>> > > As I said, when I index using the Spark RDD saveAsHadoop* API, a
>> > > bunch of .lnk files and inuse folders get created which never get
>> > > merged/deleted. But when I stopped the Spark job that uses the
>> > > Hadoop API and started another Spark job for the same table that
>> > > uses the Blur Thrift enqueue mutate call, I can see that all the
>> > > previous .lnk files and inuse folders are eventually merged and
>> > > deleted. The index counts are fine and new documents also keep
>> > > getting added to the index.
>> > >
>> > Ok, we are in the process of testing this issue. We will let you know
>> > what we find.
>> >
>> > > So I do not think there is any issue with permissions. Probably the
>> > > merge logic is not getting started when indexing happens through
>> > > the BlurOutputFormat.
>> > >
>> > Not sure when the maybeMerge call was integrated into the external
>> > index loading, but it should be merging the segments.
>> >
>> > Aaron
>> >
>> > > Regards,
>> > > Dibyendu
>> > >
>> > > On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
>> > > wrote:
>> > >
>> > >> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
>> > >> wrote:
>> > >>
>> > >> > Didn't reply to all.
>> > >> >
>> > >> > ---------- Forwarded message ----------
>> > >> > From: Aaron McCurry <[email protected]>
>> > >> > Date: Fri, Oct 24, 2014 at 3:47 PM
>> > >> > Subject: Re: Some Performance number of Spark Blur Connector
>> > >> > To: Dibyendu Bhattacharya <[email protected]>
>> > >> >
>> > >> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya
>> > >> > <[email protected]> wrote:
>> > >> >
>> > >> >> Hi Aaron,
>> > >> >>
>> > >> >> Here are some performance numbers comparing enqueue mutate and
>> > >> >> RDD saveAsHadoopFile, both using Spark Streaming.
>> > >> >>
>> > >> >> The setup I used is not a very optimized one, but it can give
>> > >> >> an idea about both methods of indexing via Spark Streaming.
>> > >> >>
>> > >> >> I used a 4-node EMR m1.xlarge cluster and installed Blur with
>> > >> >> 1 controller and 3 shard servers. My Blur table has 9 partitions.
>> > >> >>
>> > >> >> On the same cluster, I was running Spark with 1 master and 3
>> > >> >> workers. This is not a good setup, but anyway, here are the
>> > >> >> numbers.
>> > >> >>
>> > >> >> The enqueueMutate index rate is around 800 messages/second.
>> > >> >>
>> > >> >> The RDD saveAsHadoopFile index rate is around 12,000
>> > >> >> messages/second.
>> > >> >>
>> > >> >> This is more than an order of magnitude faster.
>> > >> >>
>> > >> > That's awesome, thank you for sharing!
>> > >> >
>> > >> >> Not sure if this is an issue with the saveAsHadoopFile approach,
>> > >> >> but I can see that the shard folder in HDFS has lots of small
>> > >> >> Lucene *.lnk files getting created (probably one for each
>> > >> >> saveAsHadoopFile call), and there are that many "inuse" folders,
>> > >> >> as you can see in the screenshot.
>> > >> >>
>> > >> >> And these entries keep increasing to a huge number if the Spark
>> > >> >> Streaming job keeps running for some time. Not sure if this has
>> > >> >> any impact on indexing and search performance?
>> > >> >>
>> > >> > They should be merged and removed over time; however, if there
>> > >> > is a permission problem, Blur might not be able to remove the
>> > >> > inuse folders.
>> > >> >
>> > >> In a situation where permissions are the problem, the .lnk files
>> > >> are properly cleaned up while the .inuse dirs hang around. If both
>> > >> are hanging around, I suspect it's not permissions. A lesson
>> > >> freshly learned over here. :)
>> > >>
>> > >> --tim
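
As a rough illustration of the fix referenced at the top of the thread,
here is a minimal sketch of the general pattern, assuming the external
index loading path adds the bulk-built segments to a live Lucene
IndexWriter. The linked commit is the authoritative change; the helper
below is hypothetical.

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

// Hypothetical helper (not the actual Blur code): after externally built
// segments are added to the serving index, maybeMerge() gives the merge
// policy a chance to fold the many small segments together instead of
// letting .lnk files and inuse folders pile up.
public class ExternalIndexLoadSketch {
  static void addAndMaybeMerge(IndexWriter writer, Directory... newSegments)
      throws IOException {
    writer.addIndexes(newSegments); // segments produced by the Spark/MR bulk load
    writer.maybeMerge();            // the call that was missing
  }
}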

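For reference, here is a hedged sketch of the job-side setting discussed
above. Only the BlurOutputFormat.setIndexLocally(conf, false) call comes
from the thread itself; the package name and surrounding boilerplate are
assumptions.

import org.apache.hadoop.conf.Configuration;
// Assumed package for BlurOutputFormat; check your Blur version's artifacts.
import org.apache.blur.mapreduce.lib.BlurOutputFormat;

public class IndexLocallySketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // false: segments are written directly against HDFS, so each output
    //        task/flush can add one or more segments to the table.
    // true:  each task builds a single local index that is copied and
    //        optimized in flight to HDFS, so only one segment is added
    //        regardless of size.
    BlurOutputFormat.setIndexLocally(conf, false);
  }
}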