I got some time to look into it. It appears that Spark (latest git) is
performing this operation much more often compared to the Aug 1 version.
Here is the log from the operation I am referring to:

14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not
found, computing it
14/08/19 12:37:26 INFO rdd.HadoopRDD: Input split:
hdfs://test/test_flows/test-2014-05-06.csv:9529458688+134217728
14/08/19 12:37:41 INFO python.PythonRDD: Times: total = 16312, boot = 8,
init = 134, finish = 16170
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[dstip]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@6374d682,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[dstport]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@baf23d1,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[srcport]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@17587455,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[stime]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@303d846c,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[endtime]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@16c0e732,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[srcip]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@528a8f49,
ratio: 1.0
14/08/19 12:37:41 INFO storage.MemoryStore: ensureFreeSpace(64834432)
called with curMem=1556288334, maxMem=9446715555
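
For reference, a minimal sketch of the kind of PySpark job that hits this
code path (the HDFS path and column names are taken from the log above;
the table name, column order and query are just placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="flows")
sqlContext = SQLContext(sc)

# Read the CSV from HDFS and turn each line into a Row
lines = sc.textFile("hdfs://test/test_flows/test-2014-05-06.csv")
rows = lines.map(lambda l: l.split(",")).map(lambda p: Row(
    srcip=p[0], srcport=p[1], dstip=p[2],
    dstport=p[3], stime=p[4], endtime=p[5]))

# Register the SchemaRDD and cache the table; the first query materializes
# the in-memory columnar buffers (the StringColumnBuilder messages above)
flows = sqlContext.inferSchema(rows)
flows.registerTempTable("flows")
sqlContext.cacheTable("flows")
sqlContext.sql("SELECT dstip, count(*) FROM flows GROUP BY dstip").collect()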

With the Aug 1 version, the log file from processing the same amount of data
is approximately 136 KB, whereas with the latest git it is 23 MB. The only
messages that make the log file grow are the ones shown above. It appears the
latest git version has an issue when reading data and converting it to the
columnar format. This conversion happens when Spark creates an RDD and should
occur once per RDD; in the latest git version it might simply be doing it for
each record in the RDD. That would explain the slow reads from disk, as time
is spent in this operation.
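
As a workaround for the log growth (though not for the slowdown itself), I
assume the per-column compressor messages can be silenced by raising the
level of the columnar logger in conf/log4j.properties:

# Quiet the in-memory columnar builders (StringColumnBuilder etc.)
log4j.logger.org.apache.spark.sql.columnar=WARN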

Any suggestions or help in this regard would be appreciated.

- Gurvinder
On 08/14/2014 10:27 AM, Gurvinder Singh wrote:
> Hi,
> 
> I am running Spark directly from git. I recently compiled the newer
> Aug 13 version and it shows a 2-3x performance drop in reads from HDFS
> compared to the Aug 1 git version. So I am wondering which commit could
> have caused such a regression in read performance. The performance is
> almost the same once the data is cached in memory, but reading from HDFS
> is much slower compared to the Aug 1 version.
> 
> - Gurvinder
> 

