I am using the following setting when indexing via Spark:

BlurOutputFormat.setIndexLocally(conf, false);
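For context, here is a minimal sketch of where that setting lives in a Hadoop job configuration (0.2.x-era API from memory; treat the setupJob/table-descriptor wiring as an assumption, not a verified call sequence):

```java
// Sketch only: wiring BlurOutputFormat into a Hadoop job configuration.
// setIndexLocally(conf, false) makes the record writer build the index
// directly against HDFS instead of a local temp directory -- the path on
// which, per the observation below, flush() never reaches maybeMerge.
Configuration conf = new Configuration();
BlurOutputFormat.setIndexLocally(conf, false);

// (assumed) remaining job wiring, roughly:
// Job job = Job.getInstance(conf, "blur-index");
// BlurOutputFormat.setupJob(job, tableDescriptor);
```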
Could this be why the merge is not happening? I can see that
GenericBlurRecordWriter's flush method calls maybeMerge only when it is
using a local temp index.

Dibyendu

On Sun, Oct 26, 2014 at 9:47 PM, Aaron McCurry <[email protected]> wrote:

> On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya <
> [email protected]> wrote:
>
> > Some more observations.
> >
> > As I said, when I index using the Spark RDD saveAsHadoop* API, a bunch
> > of .lnk files and inuse folders get created which never get
> > merged/deleted. But when I stopped the Spark job which uses the Hadoop
> > API and started another Spark job for the same table which uses the
> > Blur Thrift enqueue mutate call, I can see all the previous .lnk files
> > and inuse folders are eventually merged and deleted. The index count is
> > fine and new documents also keep getting added to the index.
>
> Ok, we are in the process of testing this issue. We will let you know what
> we find.
>
> > So I do not think there is any issue with permissions. Probably the
> > merge logic is not getting started when indexing happens using
> > BlurOutputFormat.
>
> Not sure when the maybeMerge call was integrated into the external index
> loading, but it should be merging the segments.
>
> Aaron
>
> > Regards,
> > Dibyendu
> >
> > On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
> > wrote:
> >
> >> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
> >> wrote:
> >>
> >> > Didn't reply to all.
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Aaron McCurry <[email protected]>
> >> > Date: Fri, Oct 24, 2014 at 3:47 PM
> >> > Subject: Re: Some Performance number of Spark Blur Connector
> >> > To: Dibyendu Bhattacharya <[email protected]>
> >> >
> >> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
> >> > [email protected]> wrote:
> >> >
> >> >> Hi Aaron,
> >> >>
> >> >> Here are some performance numbers for enqueue mutate and RDD
> >> >> saveAsHadoopFile, both using Spark Streaming.
> >> >>
> >> >> The setup I used is not a very optimized one, but it can give an
> >> >> idea about both methods of indexing via Spark Streaming.
> >> >>
> >> >> I used a 4-node EMR m1.xlarge cluster, and installed Blur as 1
> >> >> controller and 3 shard servers. My Blur table has 9 partitions.
> >> >>
> >> >> On the same cluster, I was running Spark with 1 master and 3
> >> >> workers. This is not a good setup, but anyway, here are the numbers.
> >> >>
> >> >> The enqueue mutate index rate is around 800 messages/second.
> >> >>
> >> >> The RDD saveAsHadoopFile index rate is around 12,000
> >> >> messages/second.
> >> >>
> >> >> This is an order of magnitude faster.
> >> >
> >> > That's awesome, thank you for sharing!
> >> >
> >> >> Not sure if this is an issue with the saveAsHadoopFile approach, but
> >> >> I can see the shard folder in HDFS has lots of small Lucene *.lnk
> >> >> files getting created (probably one per saveAsHadoopFile call), and
> >> >> there are that many "inuse" folders, as you can see in the
> >> >> screenshot.
> >> >>
> >> >> And these entries keep increasing to a huge number if the Spark
> >> >> streaming job keeps running for some time. Not sure if this has any
> >> >> impact on indexing and search performance?
> >> >
> >> > They should be merged and removed over time; however, if there is a
> >> > permission problem Blur might not be able to remove the inuse
> >> > folders.
> >> >
> >>
> >> In a situation where permissions are the problem, the .lnk files are
> >> properly cleaned while the .inuse dirs hang around. If both are
> >> hanging around I suspect it's not permissions. A lesson freshly
> >> learned over here :)
> >>
> >> --tim
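To make the suspicion at the top of the thread concrete, here is a tiny self-contained sketch of the gating pattern being described. The names mirror the discussion (flush/maybeMerge) but this is not the actual GenericBlurRecordWriter source, just an illustration: flush only reaches maybeMerge on the local-index path, so with setIndexLocally(conf, false) no merge is attempted at write time.

```java
// Illustrative sketch of the merge gating described above; NOT Blur source.
public class MergeGateSketch {

    // Stand-in for maybeMerge(): real code would ask the IndexWriter to
    // fold small segments together.
    static boolean maybeMerge() {
        return true;
    }

    // Stand-in for flush(): returns true if a segment merge was attempted.
    static boolean flush(boolean indexLocally) {
        if (indexLocally) {
            return maybeMerge(); // local temp index path: merges happen here
        }
        // HDFS path (setIndexLocally(conf, false)): no merge at write time;
        // segment cleanup is left to the shard server after the import.
        return false;
    }

    public static void main(String[] args) {
        System.out.println("local merged: " + flush(true));  // true
        System.out.println("hdfs merged:  " + flush(false)); // false
    }
}
```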
