Hi Liang,

Thank you very much. I am now running my tests with the updated version and will share the results with you. I have one question: can I adjust the table block size in a DataFrame write?
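To make the question concrete: I don't know whether the DataFrame writer exposes a block-size option directly, but one workaround that only uses constructs already shown in this thread would be to pre-create the table with the desired TABLE_BLOCKSIZE and then insert the DataFrame's rows into it. A rough sketch (table names, block-size value, and the temp-view indirection are illustrative):

```scala
// Pre-create the Carbon table with an explicit TABLE_BLOCKSIZE.
// The block size is a table property, so it applies to all
// subsequent writes into this table.
carbon.sql("""
  CREATE TABLE IF NOT EXISTS lineitem_bs64 (
    orderkey BIGINT, partkey BIGINT, linenumber BIGINT
  ) STORED BY 'carbondata'
  TBLPROPERTIES('TABLE_BLOCKSIZE'='64')
""")

// Route the DataFrame's rows into the pre-created table via a temp view.
val df = carbon.sql("SELECT orderkey, partkey, linenumber FROM lineitem_4")
df.createOrReplaceTempView("src")
carbon.sql("INSERT INTO TABLE lineitem_bs64 SELECT * FROM src")
```

Is something like this the recommended way, or is there a writer option I am missing?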
Regards
Faisal

On 13.04.2017 07:07, Liang Chen wrote:
Hi Rana,

That would be very nice; you could join us in testing TPC-H. A contributor will contact you and share the TPC-H script and DDL. Note that you are using an old version (the January build); the current master has many optimizations for TPC-H, so you may want to clone the master version.

Regards
Liang

2017-04-13 4:38 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:

Hi Liang,

Thank you very much for your reply. I am answering your questions inline.

1. Did you use the latest master version, or 1.0? Suggest you use master to test.

I have downloaded the latest version from Git and compiled it. It is carbondata_2.11-1.0.0-incubating-shade-hadoop2.2.0.

2. Have you tested other TPC-H queries which include where/filter?

I just started recently, and my plan is to move on to the full set of TPC-H queries to see CarbonData's performance improvements over Parquet. Right now I am only running my own queries with different projected columns to see how well CarbonData can push down the projection.

3. In your case, is the query slow, or is the "write.format" below slow?

write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

I have the same line for Parquet, and Parquet works perfectly fine, so I don't think the write is causing the problem.

4. Use the master version for the query test, and set "ENABLE_VECTOR_READER" to true:

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Thanks for this suggestion. I will enable it and share the updated results with you.

The community is also running TPC-H tests currently; do you want to participate in the test together?

It would be nice to be part of this. Could you please guide me on how I can contribute?
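On point 3, one way to separate the query cost from the write cost is to time a non-writing action (e.g. a count, which forces the scan) before timing the full scan-plus-CSV-write pipeline. A rough sketch, run in the same session as the queries below (the `time` helper and output path are illustrative; timings are wall-clock and should be averaged over warm runs):

```scala
// Helper to wall-clock an arbitrary Spark action.
def time[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
  result
}

val proj = carbon.sql("SELECT orderkey, partkey, linenumber FROM lineitem_4")

// Scan-only cost: count() forces the query to execute without writing output.
time("scan")(proj.count())

// Scan + write cost: the full pipeline used in the benchmark.
time("scan+write")(
  proj.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1_timed/"))
```

If "scan" is already slow, the reader (not the CSV write) is the bottleneck.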
Thank you
Regards
Faisal

-----Original Message-----
From: Liang Chen [mailto:chenliang6...@gmail.com]
Sent: Thursday, April 13, 2017 1:00 AM
To: dev@carbondata.incubator.apache.org
Subject: Re: CarbonData performance benchmarking

Hi

1. Did you use the latest master version, or 1.0? Suggest you use master to test.

2. Have you tested other TPC-H queries which include where/filter?

3. In your case, is the query slow, or is the "write.format" below slow?

write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

4. Use the master version for the query test, and set "ENABLE_VECTOR_READER" to true:

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

The community is also running TPC-H tests currently; do you want to participate in the test together?

Regards
Liang

2017-04-13 1:18 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:

Dear all,

I am running some experiments to benchmark the performance of both Parquet and CarbonData. I am using the TPC-H lineitem table of size 8 GB. It has 16 columns, and I am running different projection queries that read different numbers of columns (3, 6, 9, 12, 15). I am facing a problem with CarbonData: it seems to be very slow when I select more than 8 columns. It takes almost hours to process my request, whereas Parquet is very quick. Could anybody please help me understand this behavior?
This is my cluster configuration:

3 machines: 1 driver machine (128 GB, 24 cores), 2 worker machines (128 GB, 24 cores)

My configuration settings for Spark are:

spark.executor.instances 12
spark.executor.memory 18g
spark.driver.memory 57g
spark.executor.cores 3
spark.driver.cores 5
spark.default.parallelism 72

carbon.sort.file.buffer.size=20
carbon.graph.rowset.size=100000
carbon.number.of.cores.while.loading=6
carbon.sort.size=500000
carbon.enableXXHash=true
carbon.number.of.cores.while.compacting=2
carbon.compaction.level.threshold=4,3
carbon.major.compaction.size=1024
carbon.number.of.cores=4
carbon.inmemory.record.size=120000
carbon.enable.quick.filter=false

My queries:

carbon.sql("CREATE TABLE IF NOT EXISTS lineitem_4 (orderkey BIGINT, partkey BIGINT, suppkey BIGINT, linenumber BIGINT, quantity DOUBLE, extendedprice DOUBLE, discount DOUBLE, tax DOUBLE, returnflag STRING, linestatus STRING, shipdate DATE, commitdate DATE, receiptdate DATE, shipinstruct STRING, shipmode STRING, comment STRING) STORED BY 'carbondata' TBLPROPERTIES('TABLE_BLOCKSIZE'='128 MB')")

carbon.sql("LOAD DATA INPATH 'hdfs://hdfsmaster/input/lineitem/' INTO TABLE lineitem_4 OPTIONS ('FILEHEADER' = 'orderkey,partkey,suppkey,linenumber,quantity,extendedprice,discount,tax,returnflag,linestatus,shipdate,commitdate,receiptdate,shipinstruct,shipmode,comment', 'USE_KETTLE' = 'false', 'DELIMITER'='|')")

val proj1 = carbon.sql("SELECT orderkey,partkey,linenumber FROM lineitem_4")
proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

val proj2 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag FROM lineitem_4")
proj2.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj2/")

val proj3 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate FROM lineitem_4")
proj3.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj3/")

val proj4 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate,shipinstruct,shipmode,comment FROM lineitem_4")
proj4.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj4/")

Thank you
Regards
Faisal

--
Regards
Liang
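For reference, the Parquet side of the comparison uses the same write pattern. A minimal sketch of an equivalent Parquet run (the input path, view name, and `spark` session variable are illustrative, not from the original setup):

```scala
// Same projection query against a Parquet copy of lineitem,
// written out with the same CSV write pattern as the Carbon runs above.
val lineitemPq = spark.read.parquet("hdfs://hdfsmaster/input/lineitem_parquet/")
lineitemPq.createOrReplaceTempView("lineitem_pq")

val projPq = spark.sql("SELECT orderkey, partkey, linenumber FROM lineitem_pq")
projPq.write.format("csv").save("hdfs://hdfsmaster/output/parquet/proj1/")
```

Keeping the write stage identical on both sides isolates the difference to the scan path of each format.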