Some more observations. As I said, when I index using the Spark RDD saveAsHadoop* API, a bunch of .lnk files and inuse folders get created and are never merged/deleted. But when I stopped the Spark job that uses the Hadoop API and started another Spark job for the same table that uses the Blur Thrift enqueue mutate call, I could see all the previous .lnk files and inuse folders eventually get merged and deleted. The index counts are fine and new documents also keep getting added to the index.
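For reference, the two ingestion paths look roughly like this from Spark. This is only a sketch: the controller address, table name, column family, and id scheme are illustrative, and the Blur 0.2.x signatures (BlurClient.getClient, enqueueMutate, the BlurMutate constructor, BlurOutputFormat.setupJob) should be verified against the version you run. Note that the mapreduce-lib BlurOutputFormat is a new-API OutputFormat, so the Spark call is saveAsNewAPIHadoopFile.

    import org.apache.blur.mapreduce.lib.{BlurMutate, BlurOutputFormat}
    import org.apache.blur.thrift.BlurClient
    import org.apache.blur.thrift.generated._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}

    object BlurIngestSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("blur-ingest-sketch"))
        val messages = sc.parallelize(Seq("msg-a", "msg-b")) // stands in for one streaming batch
        val controller = "controller1:40010"                 // illustrative controller address

        // Path 1: Thrift enqueueMutate -- one queued mutate per message,
        // indexed by the shard servers as they drain the queue.
        messages.foreachPartition { part =>
          val client = BlurClient.getClient(controller)
          part.foreach { msg =>
            val rec = new Record()
            rec.setRecordId(msg)                             // illustrative id scheme
            rec.setFamily("fam0")
            rec.addToColumns(new Column("message", msg))
            val recMut = new RecordMutation()
            recMut.setRecordMutationType(RecordMutationType.REPLACE_ENTIRE_RECORD)
            recMut.setRecord(rec)
            val rowMut = new RowMutation()
            rowMut.setTable("mytable")
            rowMut.setRowId(msg)
            rowMut.setRowMutationType(RowMutationType.REPLACE_ROW)
            rowMut.addToRecordMutations(recMut)
            client.enqueueMutate(rowMut)
          }
        }

        // Path 2: bulk indexing through BlurOutputFormat (the saveAsHadoop* route).
        // Each call writes out a new index that Blur then has to pick up, which
        // appears to be where the .lnk files and inuse folders come from.
        val job = Job.getInstance(new Configuration())
        val bulkClient = BlurClient.getClient(controller)
        BlurOutputFormat.setupJob(job, bulkClient.describe("mytable"))
        val pairs = messages.map { msg =>
          val m = new BlurMutate(BlurMutate.MUTATE_TYPE.REPLACE, msg, msg, "fam0")
          m.addColumn("message", msg)
          (new Text(msg), m)
        }
        pairs.saveAsNewAPIHadoopFile("/tmp/blur-out", classOf[Text],
          classOf[BlurMutate], classOf[BlurOutputFormat], job.getConfiguration)

        sc.stop()
      }
    }

A quick way to watch whether the leftovers actually go away is counting them with the plain Hadoop FileSystem API (shard path illustrative):

    import org.apache.hadoop.fs.{FileSystem, Path}
    val fs = FileSystem.get(new Configuration())
    val names = fs.listStatus(new Path("/blur/tables/mytable/shard-00000000")).map(_.getPath.getName)
    println(s"lnk=${names.count(_.endsWith(".lnk"))} inuse=${names.count(_.contains("inuse"))}")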
So I do not think there is any issue with permissions. Probably the merge logic is not getting started when indexing happens through BlurOutputFormat.

Regards,
Dibyendu

On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]> wrote:
> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]> wrote:
> >
> > Didn't reply to all.
> >
> > ---------- Forwarded message ----------
> > From: Aaron McCurry <[email protected]>
> > Date: Fri, Oct 24, 2014 at 3:47 PM
> > Subject: Re: Some Performance number of Spark Blur Connector
> > To: Dibyendu Bhattacharya <[email protected]>
> >
> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya
> > <[email protected]> wrote:
> >
> >> Hi Aaron,
> >>
> >> Here are some performance numbers for enqueue mutate versus RDD
> >> saveAsHadoopFile, both using Spark Streaming.
> >>
> >> The setup I used is not a very optimized one, but it can give an idea
> >> about both methods of indexing via Spark Streaming.
> >>
> >> I used a 4-node EMR m1.xlarge cluster and installed Blur as 1
> >> controller and 3 shard servers. My Blur table has 9 partitions.
> >>
> >> On the same cluster, I was running Spark with 1 master and 3 workers.
> >> This is not a good setup, but anyway, here are the numbers.
> >>
> >> The enqueueMutate index rate is around 800 messages/second.
> >>
> >> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
> >>
> >> That is more than an order of magnitude faster.
> >
> > That's awesome, thank you for sharing!
> >
> >> Not sure if this is an issue with the saveAsHadoopFile approach, but I
> >> can see the shard folder in HDFS has lots of small Lucene *.lnk files
> >> getting created (probably one per saveAsHadoopFile call), and there are
> >> that many "inuse" folders, as you can see in the screenshot.
> >>
> >> And these entries keep increasing to a huge number if the Spark
> >> Streaming job keeps running for some time. Not sure if this has any
> >> impact on indexing and search performance?
> >
> > They should be merged and removed over time; however, if there is a
> > permission problem Blur might not be able to remove the inuse folders.
>
> In a situation where permissions are the problem, the .lnk files are
> properly cleaned while the .inuse dirs hang around. If both are hanging
> around, I suspect it's not permissions. A lesson freshly learned over
> here :)
>
> --tim
