Denes,
Please find my Morphline config file below. I tried the memory channel, but
found that ingestion runs faster with the file channel.
solrLocator : {
  collection : esearch
  zkHost : "codesolr-as-r2p:2181"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { detectMimeType { includeDefaultMimeTypes : true } }
      {
        solrCell {
          solrLocator : ${solrLocator}
          captureAttr : true
          lowernames : true
          capture : [_attachment_body, _attachment_mimetype, basename, content,
                     content_encoding, content_type, file, meta, text]
          parsers : [
            { parser : org.apache.tika.parser.txt.TXTParser }
          ]
          fmap : { content : text }
        }
      }
      { generateUUID { field : id } }
      { sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      { loadSolr : { solrLocator : ${solrLocator} } }
    ]
  }
]
A sample text file looks like this:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Received: from abc.net ([11.222.333.444])
by abc.abc.net with bizsmtp
id djfAJSD*jKDHJKD; Sun, 01 Jan 2010 12:31:51 +0000
Received: from xya.xyz.net ([99.888.777.666])
by xyz.xyz.net with SMTP
id jhcfhchABHDJHDD*HDJhsdjcfjh; Sun, 01 Jan 2019 02:31:50 +0000
Received: from smtp.abccbc.abcbcbcb.com ([11.111.22.34])
by pqrs.pqrs.net with SMTP
id JHDJHJDHJHD*USDHCFJNHSD*; Sun, 01 Jan 2010 02:31:51 +0000
X-Xfinity-Message-Heuristics: IPv6:N;TLS=0;SPF=1;DMARC=
Received: from portalmail (unknown [777.33.2.90])
by smtp.ajhjhdjjdfh-ajhdjkjsd.com (Postfix) with ESMTP id HDJHDJDSJKS
for <[email protected]>; Sat, 31 Dec 2010 18:31:49 -0800 (PST)
From: "[email protected]"
To: [email protected]
Message-ID: <999999999.888.3449859489586.JavaMail.VV@mortalmail>
Subject: 111-2343444434 You got a email, LLC ("abc")
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-CMAE-Envelope:
kjsjdsjdjdjvf9jd/12djhfjhd83hjnr38/jfjjvgf95kjg905j95ygjmt59ytjmgh95ijmhjkt6h
9085jghty89jhn596ijyiuh96ijmhj90t5ui9kjio6i5uy096i5jki650ui6o7kuoki
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
ABC of jdjdhjhjdhjvfh Use of fkjkfjo9r5nmfkf90trmbklgftob
ABC ID: 111-34345454545
Action Date: 01 Jan 2010 02:31:33 GMT
ABC Corporation
Dear Sir or Madam:
dhjdhjfddsfjufnkdfjkjdjnhjfdjk832nhjkfg8nsdvjvnhjvjkffdjkvjhdfhjfjbhjfhnb
jchjvhfjvjhjjnxj4328uiwejf3uivcnj3490uncvrgu890jvkfjviujrfig94uvnjfvvgjhg89
hdfg9urvnjfijuhvirjsgu9rjdnvidj9ujbvgbi9rbdfgjbi9tujfbvkrniujv bnrtbjiuj
jdfjvb9utrjgnbg90ujrjmf043ikvjkfjvfrjopfr0gjvkfdjvfjovgfdovofdodopigif04jvkerj
ibjhidfjbikjfdbjibr9gikfdjgvr905jfkjgvgvj9ufkjbvfiugtjgkjb90tvbjkjfdjbffkjjfb
kjffkjbkfjkjff9g4rjdf044jn v90dfjvgr0irkjkvjfb09ua[vbjksoohfrijugb9jkvjkjkfjf
Regards,
XYZ
*pgp public key is available on the key server at http://xyz.git.edu
Note: The information transmitted in this Notice is intended only for the p=
erson or entity to which it is addressed and may contain confidential and/o=
r privileged material. Any review, reproduction, retransmission, dissemina=
tion or other use of, or taking of any action in reliance upon, this inform=
ation by persons or entities other than the intended recipient is prohibite=
d. If you received this in error, please contact the sender and delete the=
material from all computers.
This infringement notice contains an XML tag that can be used to automate t=
he processing of this data. If you would like more information on how to u=
se this tag please contact XYZ.
- - ---Start ACNS XML
<?xml version=3D"1.0" encoding=3D"UTF-8"?>
<Infringement xmlns=3D"http://www.acns.net/ACNS" xmlns:xsi=3D"http://www.w3=
.org/2001/XMLSchema-instance" xsi:schemaLocation=3D"http://www.acns.net/ACN=
S http://www.acns.net/v1.2/ACNS2v1_2.xsd">
<Case>
<ID>00000000</ID>
<Status>Open</Status>
</Case>
<Complainant>
<Entity>XYZ USA, Inc</Entity>
<Contact>XYZ</Contact>
<Address>P.O. Box 000, North XYZ, KA 00000</Address>
<Phone>999999999</Phone>
<Email>[email protected]</Email>
</Complainant>
<Service_Provider>
<Entity>ABC Corporation</Entity>
<Email>[email protected]</Email>
</Service_Provider>
<Source>
<TimeStamp>2016-12-31T23:15:40.000Z</TimeStamp>
<IP_Address>11.22.33.444</IP_Address>
<Port>55555</Port>
<Type>BitTorrent</Type>
<Number_Files>1</Number_Files>
<Deja_Vu>No</Deja_Vu>
</Source>
<Content>
<Item>
<TimeStamp>2016-12-31T23:15:40.000Z</TimeStamp>
<Title>Power</Title>
<FileName>Power </FileName>
<FileSize>000000000</FileSize>
<URL>dht</URL>
</Item>
</Content>
</Infringement>
- - ---End ACNS XML
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (MingW32)
xjsdh78h23e7u2he3279y3hjdhe7823jhd3783gddey373hyfu37ru3rh892rhf2
23897EBHCA8ENHD q0jc39ujdkjd9rj8287hcd833hrnj390unce90ru3jrifj9r
930jh3ier390hnd9d23ujf3249u9uifoje9frjfij90fvu394ujfjc0f9u9vjfv9
-----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I will try profiling it.
Regards,
~Sri
From: Denes Arvay [mailto:[email protected]]
Sent: Thursday, February 23, 2017 10:40 AM
To: [email protected]
Subject: Re: Ingestion to Solr is very slow
Hi,
The Flume config seems OK to me. One minor thing: I'd suggest trying the
memory channel, which can speed things up a little.
The morphline part might be a bottleneck; could you please share its config as
well?
Some sample input files would also be useful for debugging.
Besides these, I'd recommend profiling it with a Java profiler (e.g.
jvisualvm).
Regards,
Denes
On Fri, Feb 17, 2017 at 12:00 AM Anatharaman, Srinatha (Contractor)
<[email protected]<mailto:[email protected]>>
wrote:
Hi,
I have a large set of small files; each file is around 7 – 10 KB in size.
In total I have 350K files, around 6 GB.
I have tried many Flume configuration options, but whatever I change,
Solr takes 2 seconds to ingest each file.
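To put these figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the files are ingested strictly one after another, that each one takes the full 2 seconds, and that 1 GB means 10**9 bytes. None of these assumptions is confirmed by the setup itself.

```python
# Back-of-the-envelope estimate from the figures quoted above.
# Assumptions (illustrative only): sequential ingestion, a constant
# 2 s per file, and 1 GB = 10**9 bytes.
FILE_COUNT = 350_000
TOTAL_BYTES = 6 * 10**9
SECONDS_PER_FILE = 2.0

total_seconds = FILE_COUNT * SECONDS_PER_FILE
total_hours = total_seconds / 3600
total_days = total_hours / 24

# Average size implied by the totals, and the resulting data rate.
avg_file_kb = TOTAL_BYTES / FILE_COUNT / 1000
throughput_kb_s = avg_file_kb / SECONDS_PER_FILE

print(f"total ingest time: {total_hours:.1f} h ({total_days:.1f} days)")
print(f"average file size: {avg_file_kb:.1f} KB")
print(f"effective throughput: {throughput_kb_s:.1f} KB/s")
```

Under these assumptions the full set would take about 194 hours (roughly 8 days) at under 10 KB/s. The totals also imply an average file size of about 17 KB, somewhat above the 7 to 10 K per file mentioned above, so at least one of the figures is approximate.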
agent.sources = SpoolDirSrc
agent.channels = FileChannel
agent.sinks = SolrSink
# Configure Source
agent.sources.SpoolDirSrc.channels = FileChannel
agent.sources.SpoolDirSrc.type = spooldir
agent.sources.SpoolDirSrc.spoolDir = /app/home/solr/final
agent.sources.SpoolDirSrc.basenameHeader = true
#agent.sources.SpoolDirSrc.batchSize = 100000
agent.sources.SpoolDirSrc.fileHeader = true
agent.sources.SpoolDirSrc.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Use a channel that buffers events on disk (file channel)
agent.channels.FileChannel.type = file
agent.channels.FileChannel.capacity = 1000
agent.channels.FileChannel.transactionCapacity = 1000
#agent.channels.FileChannel.transactionCapacity = 10000
# Configure Solr Sink
agent.sinks.SolrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.SolrSink.morphlineFile = /etc/flume/conf/morphline.conf
#agent.sinks.SolrSink.batchsize = 100000
#agent.sinks.SolrSink.batchDurationMillis = 5000
agent.sinks.SolrSink.channel = FileChannel
agent.sinks.SolrSink.morphlineId = morphline1
agent.sinks.SolrSink.tika.config = tikaConfig.xml
agent.sinks.SolrSink.rollCount = 0
agent.sinks.SolrSink.rollInterval = 0
agent.sinks.SolrSink.rollsize = 100000000
agent.sinks.SolrSink.idleTimeout = 0
agent.sinks.SolrSink.batchSize = 100000
agent.sinks.SolrSink.txnEventMax = 10000000
agent.sources.SpoolDirSrc.channels = FileChannel
agent.sinks.SolrSink.channel = FileChannel
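Component names in a Flume properties file are compared as exact strings, so a channel reference that differs from its declaration only in case (for example `fileChannel` versus `FileChannel`) points at a channel that was never declared. The sketch below is a hypothetical stand-alone checker, not part of Flume; the function names and the simplified `key = value` parsing are assumptions made for illustration, and the sample text is a shortened stand-in rather than the full configuration.

```python
# Hypothetical helper (not part of Flume): reports channel references in
# sources and sinks that do not exactly match a declared channel name.

def parse_properties(text):
    """Parse simple 'key = value' lines; later duplicates override earlier ones."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def undeclared_channels(props, agent="agent"):
    """Return {property key: [channel names not declared for the agent]}."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = {}
    for key, value in props.items():
        parts = key.split(".")
        if len(parts) != 4 or parts[0] != agent:
            continue
        is_source_ref = parts[1] == "sources" and parts[3] == "channels"
        is_sink_ref = parts[1] == "sinks" and parts[3] == "channel"
        if is_source_ref or is_sink_ref:
            missing = [name for name in value.split() if name not in declared]
            if missing:
                problems[key] = missing
    return problems


# Shortened illustrative sample with one deliberately mismatched reference.
sample = """
agent.sources = SpoolDirSrc
agent.channels = FileChannel
agent.sinks = SolrSink
agent.sources.SpoolDirSrc.channels = fileChannel
agent.sinks.SolrSink.channel = FileChannel
"""

print(undeclared_channels(parse_properties(sample)))
```

For this sample the checker flags only the source's `fileChannel` reference. Because the parser keeps the last value for a repeated key, a correct assignment later in the file hides an earlier misspelled one.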
My collection has 2 shards and 1 replica.
Kindly let me know how I can make this better.
Regards,
~Sri