[DISCUSSION] Improve insert process

2017-10-26 Thread cenyuhai11
When I insert data into a CarbonData table from another table, I have to do
the following:
1. select count(1) from table1
and then
2. insert into table table1 select * from table1

Why should I execute "select count(1) from table1" first? Because the
number of tasks is computed by CarbonData, and it depends on how many
executor hosts are available at that moment.

I don't think this is the right way. We should let Spark control the number
of tasks; setting the parameter "mapred.max.split.size" is the common way
to adjust the number of tasks.
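
For reference, a minimal spark-shell sketch of that approach (an
illustration only: "mapred.max.split.size" is the legacy Hadoop key, and
whether it takes effect depends on the input format of the source table;
134217728 bytes = 128 MB):

scala> spark.sparkContext.hadoopConfiguration.setLong("mapred.max.split.size", 134217728L)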

Even when I do step 2, some tasks still fail, which increases the insert
time.

So I suggest that we do not adjust the number of tasks and just use the
default behavior of Spark.
Then, if there are small files, add a fast merge job (merging data at the
blocklet level).

In that case, we also need to set the default value of
"carbon.number.of.cores.while.loading" to 1.
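
A sketch of that default in carbon.properties form (the property name is
taken from this thread; the file location depends on the deployment):

# carbon.properties
carbon.number.of.cores.while.loading=1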


Fw: Fw: CarbonData throws an error when creating a table; what is the cause?

2017-10-26 Thread 胡健军

---------- Forwarded message ----------
From: "胡健军"
Date: 2017-10-23 19:42:42
To: u...@carbondata.apache.org
Subject: Fw: CarbonData throws an error when creating a table; what is the cause?

---------- Forwarded message ----------
From: "胡健军"
Date: 2017-10-23 19:28:59
To: u...@carbondata.apache.org
Subject: CarbonData throws an error when creating a table; what is the cause?

scala> carbon.sql("CREATE TABLE IF NOT EXISTS carbon_table(id string, name string, city string, age Int) STORED BY 'carbondata'")
17/10/23 19:13:52 AUDIT command.CarbonCreateTableCommand: [master][root][Thread-1]Creating Table with Database name [clb_carbon] and Table name [carbon_table]
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.catalog.CatalogTable.copy(Lorg/apache/spark/sql/catalyst/TableIdentifier;Lorg/apache/spark/sql/catalyst/catalog/CatalogTableType;Lorg/apache/spark/sql/catalyst/catalog/CatalogStorageFormat;Lorg/apache/spark/sql/types/StructType;Lscala/Option;Lscala/collection/Seq;Lscala/Option;Ljava/lang/String;JJLscala/collection/immutable/Map;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/Seq;Z)Lorg/apache/spark/sql/catalyst/catalog/CatalogTable;
  at org.apache.spark.sql.CarbonSource$.updateCatalogTableWithCarbonSchema(CarbonSource.scala:253)
  at org.apache.spark.sql.execution.strategy.DDLStrategy.apply(DDLStrategy.scala:154)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  at org.apache.spark.sql.execution.command.CarbonCreateTableCommand.processSchema(CarbonCreateTableCommand.scala:84)
  at org.apache.spark.sql.execution.command.CarbonCreateTableCommand.run(CarbonCreateTableCommand.scala:36)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.
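
A NoSuchMethodError on a Spark catalyst class such as CatalogTable.copy
usually indicates that the CarbonData jar was built against a different
Spark version than the cluster is running. As a minimal sketch for checking
the runtime side, compare the output below with the Spark version the
CarbonData assembly was built for:

scala> spark.version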

Re: [Discussion] Merging carbonindex files for each segment and across segments

2017-10-26 Thread Liang Chen
Yes, Jin Zhou.
Merging all index files into one per segment would be a useful feature; it
would significantly improve query performance.

Regards
Liang


Jin Zhou wrote
> Hi, ravipesala
> 
> Thank you for your proposal. Merging index files is a very useful
> feature, as we have already met a serious performance issue caused by too
> many index files (case 1).
> 
> But I think the core problem of case 2 is too many loads, which should
> mainly be addressed by segment compaction. Also, the "one segment, one
> index file" design seems clearer and simpler.
> 
> Regards, 
> Jin Zhou
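
For illustration, a hypothetical segment layout before and after such a
merge (all paths and file names below are made up for this sketch; today a
segment holds one .carbonindex file per load task):

Fact/Part0/Segment_0/
  part-0-0_batchno0-0.carbonindex
  part-0-1_batchno0-0.carbonindex
  part-0-2_batchno0-0.carbonindex

Fact/Part0/Segment_0/          (after the proposed merge)
  Segment_0.carbonindexmerge   (single merged index file; hypothetical name)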


Re: [Discussion] Support user-specified segments in major compaction

2017-10-26 Thread Liang Chen
Hi Jin Zhou

Thanks for starting this discussion.
1. On your first proposal: currently, a segment is a system-internal
concept that is not exposed to the outside.
Can you describe the exact problems you encounter? We can then look for
alternative solutions.

> 1) we can precisely control which part of the table is to be merged when
> the table is very large.

2. On your second proposal, my comment is +1, agree. Can you please create
an Apache JIRA for this?
We would like to invite you to participate in implementing this feature
together :)

> 2) each table can have its own compaction strategy, controlled by the
> user app.

Regards
Liang


Jin Zhou wrote
> Hi community,
> CarbonData currently supports two types of compaction: minor and major.
> CarbonData does major compaction according to the user-defined segment
> size, but which segments get merged is transparent to users.
> We plan to extend major compaction to support user-specified segments,
> which will be useful in the cases below:
> 1) we can precisely control which part of the table is to be merged when
> the table is very large.
> 2) each table can have its own compaction strategy, controlled by the
> user app.
> 
> The proposed syntax (a usage sketch follows this quote):
> ALTER TABLE [db_name].table_name COMPACT [SEGMENT seg_id1,seg_id2] 'MAJOR'
> in which [SEGMENT seg_id1,seg_id2] is optional and compatible with the
> original syntax.
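
For illustration, statements under the proposed syntax might look like this
(the database, table, and segment ids are hypothetical):

ALTER TABLE sales_db.sales COMPACT SEGMENT 2,3 'MAJOR'   -- merge only segments 2 and 3
ALTER TABLE sales_db.sales COMPACT 'MAJOR'               -- original behavior, unchanged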


Re: [DISCUSSION] Optimize the default value for some parameters

2017-10-26 Thread Liang Chen
Hi Ravi

Yes, so we need to provide a table-level property for blocklet size that
can be configured while creating a table. Can you please create one JIRA
for this?

How about something like this:
CREATE TABLE IF NOT EXISTS table_name (column_name column_type) STORED BY
'carbondata' TBLPROPERTIES ('TABLE_BLOCKLETSIZE'='128')

Regards
Liang

ravipesala wrote
> Hi Liang,
> 
> Now TABLE_BLOCKSIZE only limits the size of a carbondata file; it is not
> considered when allocating tasks. So the exact value of TABLE_BLOCKSIZE
> does not matter much, but yes, we can consider 512M.
> 
> We can also change the default blocklet size
> (carbon.blockletgroup.size.in.mb) to 128MB. Currently, it is only
> 64MB. Since the number of allocated tasks is derived from blocklets, it
> is better to increase the blocklet size. And we should also add a
> table-level property for blocklet size to configure while creating a
> table.
> 
> Regards,
> Ravindra.
> 
> On 11 October 2017 at 13:36, Liang Chen <chenliang6136@...> wrote:
> 
>> Hi All
>>
>> As you know, the default values of some parameters need to be adjusted
>> for most cases. This discussion is for collecting which parameters'
>> default values need to be optimized:
>>
>> 1. TABLE_BLOCKSIZE:
>> the current default is 1G; propose to adjust it to 512M
>>
>> 2.
>> Please append here if you propose adjusting other parameters' default
>> values.
>>
>> Regards
>> Liang
>>
> 
> -- 
> Thanks & Regards,
> Ravi
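
If these defaults are adopted, a carbon.properties sketch for setting the
blocklet group size explicitly (the property name and the 128MB value are
from the quoted mail; the file location depends on the deployment):

# carbon.properties
carbon.blockletgroup.size.in.mb=128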


Re: [DISCUSSION] Optimize the default value for some parameters

2017-10-26 Thread xm_zzc
Hi ravipesala:
  OK, I will raise a JIRA for this and try to implement it.