[ https://issues.apache.org/jira/browse/PHOENIX-5410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manohar Chamaraju updated PHOENIX-5410:
---------------------------------------
    Description: 
While using the phoenix-spark connector 1.0.0-SNAPSHOT 
([https://github.com/apache/phoenix-connectors/tree/master/phoenix-spark]) to 
write to HBase, we found that writes were taking a very long time.

Profiling the connector showed that about 90% of CPU time is spent in the 
SparkJdbcUtil.toRow() method.

!https://files.slack.com/files-pri/T037D1PV9-FKYGD504A/image.png!

Looking into the code, SparkJdbcUtil.toRow() is called for every field of a 
row, and a new RowEncoder(schema).resolveAndBind() object is created on every 
invocation. As a result, large numbers of short-lived encoder objects are 
created and collected by the GC, burning CPU cycles and degrading performance.
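
To make the allocation pattern concrete, here is a minimal sketch of the 
per-call shape described above; it is illustrative rather than the actual 
connector source, and assumes the Spark 2.x RowEncoder/ExpressionEncoder API:

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

object PerCallEncoderSketch {
  // Per-call pattern: every invocation resolves and binds a fresh encoder,
  // so converting each value allocates encoder machinery that immediately
  // becomes garbage for the GC to collect.
  def toRow(schema: StructType, row: Row): InternalRow =
    RowEncoder(schema).resolveAndBind().toRow(row)
}
{code}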

Moreover, SparkJdbcUtil.toRow() is called from PhoenixDataWriter.write(), where 
the writer's schema is the same for every row. We can therefore build the 
resolved encoder once and reuse it, avoiding the unnecessary object creation 
and gaining a significant performance improvement, as sketched below.
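
A minimal sketch of that optimization, again against the Spark 2.x API; the 
class and field names here are hypothetical and stand in for the shape of the 
patch, not its exact contents:

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.types.StructType

// The schema is fixed for the lifetime of the writer, so the encoder can
// be resolved and bound exactly once and reused for every row.
class ReusedEncoderWriter(schema: StructType) {
  // Built once per writer instance instead of once per row/field.
  private val encoder: ExpressionEncoder[Row] = RowEncoder(schema).resolveAndBind()

  def write(row: Row): InternalRow =
    encoder.toRow(row) // reuses the bound encoder; no per-row allocation
}
{code}

With this shape, the encoder setup cost is paid once per writer (i.e. once per 
Spark task) rather than once per value, which is where the GC and CPU savings 
come from.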

 

With the changes in the attached patch, the write time in our test environment 
dropped from 30 minutes to under 40 seconds.



> Phoenix spark to hbase connector takes long time persist data
> -------------------------------------------------------------
>
>                 Key: PHOENIX-5410
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5410
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: connectors-1.0.0
>            Reporter: Manohar Chamaraju
>            Priority: Major
>         Attachments: PHOENIX-5410.patch
>


