[jira] [Created] (CARBONDATA-705) Make the partition distribution as configurable and keep spark distribution as default

2017-02-15 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-705:
--

 Summary: Make the partition distribution configurable and keep the Spark 
distribution as the default
 Key: CARBONDATA-705
 URL: https://issues.apache.org/jira/browse/CARBONDATA-705
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Make the partition distribution configurable and keep the Spark distribution as 
the default.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-706) Multiple OR operators do not work properly in carbondata

2017-02-15 Thread SWATI RAO (JIRA)
SWATI RAO created CARBONDATA-706:


 Summary: Multiple OR operators do not work properly in 
carbondata
 Key: CARBONDATA-706
 URL: https://issues.apache.org/jira/browse/CARBONDATA-706
 Project: CarbonData
  Issue Type: Bug
  Components: sql
Affects Versions: 1.1.0-incubating
 Environment: Spark 2.1
Reporter: SWATI RAO
Priority: Minor
 Attachments: 100_hive_test.csv

The result of multiple OR operators does not match Hive.

Steps to Reproduce:
1: Create the table using the following command:
 " create table Carbon_automation (imei string,deviceInformationId int,MAC 
string,deviceColor string,device_backColor string,modelId string,marketName 
string,AMSize string,ROMSize string,CUPAudit string,CPIClocked string,series 
string,productionDate timestamp,bomCode string,internalModels string, 
deliveryTime string, channelsId string, channelsName string , deliveryAreaId 
string, deliveryCountry string, deliveryProvince string, deliveryCity 
string,deliveryDistrict string, deliveryStreet string, oxSingleNumber string, 
ActiveCheckTime string, ActiveAreaId string, ActiveCountry string, 
ActiveProvince string, Activecity string, ActiveDistrict string, ActiveStreet 
string, ActiveOperatorId string, Active_releaseId string, Active_EMUIVersion 
string, Active_operaSysVersion string, Active_BacVerNumber string, 
Active_BacFlashVer string, Active_webUIVersion string, Active_webUITypeCarrVer 
string,Active_webTypeDataVerNumber string, Active_operatorsVersion string, 
Active_phonePADPartitionedVersions string, Latest_YEAR int, Latest_MONTH int, 
Latest_DAY int, Latest_HOUR string, Latest_areaId string, Latest_country 
string, Latest_province string, Latest_city string, Latest_district string, 
Latest_street string, Latest_releaseId string, Latest_EMUIVersion string, 
Latest_operaSysVersion string, Latest_BacVerNumber string, Latest_BacFlashVer 
string, Latest_webUIVersion string, Latest_webUITypeCarrVer string, 
Latest_webTypeDataVerNumber string, Latest_operatorsVersion string, 
Latest_phonePADPartitionedVersions string, Latest_operatorId string, 
gamePointDescription string,gamePointId double,contractNumber double,imei_count 
int) STORED BY 'org.apache.carbondata.format' TBLPROPERTIES 
('DICTIONARY_INCLUDE'='deviceInformationId,Latest_YEAR,Latest_MONTH,Latest_DAY')"

2: Load data with the following command:
 " LOAD DATA INPATH 'HDFS_URL/BabuStore/Data/HiveData' INTO TABLE 
Carbon_automation 
OPTIONS('DELIMITER'=',','QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='imei,deviceInformationId,MAC,deviceColor,device_backColor,modelId,marketName,AMSize,ROMSize,CUPAudit,CPIClocked,series,productionDate,bomCode,internalModels,deliveryTime,channelsId,channelsName,deliveryAreaId,deliveryCountry,deliveryProvince,deliveryCity,deliveryDistrict,deliveryStreet,oxSingleNumber,contractNumber,ActiveCheckTime,ActiveAreaId,ActiveCountry,ActiveProvince,Activecity,ActiveDistrict,ActiveStreet,ActiveOperatorId,Active_releaseId,Active_EMUIVersion,Active_operaSysVersion,Active_BacVerNumber,Active_BacFlashVer,Active_webUIVersion,Active_webUITypeCarrVer,Active_webTypeDataVerNumber,Active_operatorsVersion,Active_phonePADPartitionedVersions,Latest_YEAR,Latest_MONTH,Latest_DAY,Latest_HOUR,Latest_areaId,Latest_country,Latest_province,Latest_city,Latest_district,Latest_street,Latest_releaseId,Latest_EMUIVersion,Latest_operaSysVersion,Latest_BacVerNumber,Latest_BacFlashVer,Latest_webUIVersion,Latest_webUITypeCarrVer,Latest_webTypeDataVerNumber,Latest_operatorsVersion,Latest_phonePADPartitionedVersions,Latest_operatorId,gamePointId,gamePointDescription,imei_count')"

3: Now run the select query:
" select imei,gamePointId, channelsId,series  from Carbon_automation where 
channelsId >=10 OR channelsId <=1 or series='7Series' "

4: Result displayed:

" 0: jdbc:hive2://localhost:1> select imei,gamePointId, channelsId,series  
from Carbon_automation where channelsId >=10 OR channelsId <=1 or 
series='7Series';
+----------+--------------+-------------+----------+
| imei     | gamePointId  | channelsId  | series   |
+----------+--------------+-------------+----------+
| 1AA1     | 2738.562     | 4           | 7Series  |
| 1AA10    | 1714.635     | 4           | 7Series  |
| 1AA100   | 1271.0       | 6           | 5Series  |
| 1AA1000  | 692.0        | 3           | 5Series  |
| 1AA1     | 2175.0       | 1           | 7Series  |
| 1AA10    | 136.0        | 6           | 9Series  |
| 1AA100   | 1600.0       | 6           | 7Series  |
| 1AA11    | 505.0        | 7           | 0Series  |
| 1AA12    | 1341.0       | 3           | 0Series  |
| 1AA13    | 2239.0       | 3           | 5Series  |
| 1AA14    | 2970.0       | 2           | 4Series  |
| 1AA15    | 2593.0       | 1           | 1Series  |
| 1AA16    | 2572.0       | 2           | 6Series  |
| 1AA17    | 1991.

Re: Exception throws when I load data using carbondata-1.0.0

2017-02-15 Thread Xiaoqiao He
Hi Manish Gupta,

Thanks for your attention. I am actually trying to load data following
https://github.com/apache/incubator-carbondata/blob/master/docs/quick-start-guide.md
to deploy carbondata-1.0.0.

1. When I run CarbonData via `bin/spark-shell`, it throws the exception above.
2. When I run CarbonData via `bin/spark-shell --jars
carbonlib/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar`, it throws
another exception, as below:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 0.0 (TID 3, [task hostname]): org.apache.spark.SparkException: File
> ./carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar exists and does
> not match contents of
> http://master:50843/jars/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar


I checked the assembly jar, and CarbonBlockDistinctValuesCombineRDD is actually
present.
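
For reference, a quick driver-side check from spark-shell (just a sketch; this
only verifies the driver classpath, while the failure above happens on the
executor side):

scala> Class.forName("org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD")

If the class is not on the classpath, this call throws
java.lang.ClassNotFoundException; otherwise it simply returns the Class object.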

Has anyone else met the same problem?

Best Regards,
Hexiaoqiao


On Wed, Feb 15, 2017 at 12:56 AM, manish gupta 
wrote:

> Hi,
>
> I think the carbon jar is compiled properly. Can you use any decompiler and
> decompile carbondata-spark-common-1.1.0-incubating-SNAPSHOT.jar present in
> spark-common module target folder and check whether the required class file
> org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD is
> present or not.
>
> If you are using only the assembly jar then decompile and check in assembly
> jar.
>
> Regards
> Manish Gupta
>
> On Tue, Feb 14, 2017 at 11:19 AM, Xiaoqiao He  wrote:
>
> >  hi, dev,
> >
> > The latest release version apache-carbondata-1.0.0-incubating-rc2 which
> > takes Spark-1.6.2 to build throws exception `
> > java.lang.ClassNotFoundException:
> > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD`
> when
> > i
> > load data following Quick Start Guide.
> >
> > Env:
> > a. CarbonData-1.0.0-incubating-rc2
> > b. Spark-1.6.2
> > c. Hadoop-2.7.1
> > d. CarbonData on "Spark on YARN" Cluster and run yarn-client mode.
> >
> > any suggestions? Thank you.
> >
> > The exception stack trace as below:
> >
> > 
> > ERROR 14-02 12:21:02,005 - main generate global dictionary failed
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0
> > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> > 0.0 (TID 3, nodemanger): java.lang.ClassNotFoundException:
> > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD
> >  at
> > org.apache.spark.repl.ExecutorClassLoader.findClass(
> > ExecutorClassLoader.scala:84)
> >
> >  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> >  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> >  at java.lang.Class.forName0(Native Method)
> >  at java.lang.Class.forName(Class.java:274)
> >  at
> > org.apache.spark.serializer.JavaDeserializationStream$$
> > anon$1.resolveClass(JavaSerializer.scala:68)
> >
> >  at
> > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> >  at
> > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> >  at
> > java.io.ObjectInputStream.readOrdinaryObject(
> ObjectInputStream.java:1771)
> >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> java:1350)
> >  at
> > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> >  at
> > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> >  at
> > java.io.ObjectInputStream.readOrdinaryObject(
> ObjectInputStream.java:1798)
> >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> java:1350)
> >  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> >  at
> > org.apache.spark.serializer.JavaDeserializationStream.
> > readObject(JavaSerializer.scala:76)
> >
> >  at
> > org.apache.spark.serializer.JavaSerializerInstance.
> > deserialize(JavaSerializer.scala:115)
> >
> >  at
> > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:64)
> >  at
> > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
> >  at org.apache.spark.scheduler.Task.run(Task.scala:89)
> >  at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> >  at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1145)
> >
> >  at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:615)
> >
> >  at java.lang.Thread.run(Thread.java:745)
> >
> > Driver stacktrace:
> >  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$
> > scheduler$DAGScheduler$$failJobAndIndependentStages(
> > DAGScheduler.scala:1431)
> >
> >  at
> > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
> > DAGScheduler.scala:1419)
> >
> >  at
> > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
> > DAGScheduler.scala:1418)
> >
> >  at
> > scala.collection.mu

[jira] [Created] (CARBONDATA-707) Less than ( < ) operator does not work properly in carbondata.

2017-02-15 Thread SWATI RAO (JIRA)
SWATI RAO created CARBONDATA-707:


 Summary: Less than ( < ) operator does not work properly in 
carbondata. 
 Key: CARBONDATA-707
 URL: https://issues.apache.org/jira/browse/CARBONDATA-707
 Project: CarbonData
  Issue Type: Bug
  Components: sql
Affects Versions: 1.1.0-incubating
 Environment: Spark 2.1
Reporter: SWATI RAO
Priority: Minor
 Attachments: 100_hive_test.csv

An incorrect result is displayed.

Steps to Reproduce:
1: Create the table using the following command:
" create table Carbon_automation (imei string,deviceInformationId int,MAC 
string,deviceColor string,device_backColor string,modelId string,marketName 
string,AMSize string,ROMSize string,CUPAudit string,CPIClocked string,series 
string,productionDate timestamp,bomCode string,internalModels string, 
deliveryTime string, channelsId string, channelsName string , deliveryAreaId 
string, deliveryCountry string, deliveryProvince string, deliveryCity 
string,deliveryDistrict string, deliveryStreet string, oxSingleNumber string, 
ActiveCheckTime string, ActiveAreaId string, ActiveCountry string, 
ActiveProvince string, Activecity string, ActiveDistrict string, ActiveStreet 
string, ActiveOperatorId string, Active_releaseId string, Active_EMUIVersion 
string, Active_operaSysVersion string, Active_BacVerNumber string, 
Active_BacFlashVer string, Active_webUIVersion string, Active_webUITypeCarrVer 
string,Active_webTypeDataVerNumber string, Active_operatorsVersion string, 
Active_phonePADPartitionedVersions string, Latest_YEAR int, Latest_MONTH int, 
Latest_DAY int, Latest_HOUR string, Latest_areaId string, Latest_country 
string, Latest_province string, Latest_city string, Latest_district string, 
Latest_street string, Latest_releaseId string, Latest_EMUIVersion string, 
Latest_operaSysVersion string, Latest_BacVerNumber string, Latest_BacFlashVer 
string, Latest_webUIVersion string, Latest_webUITypeCarrVer string, 
Latest_webTypeDataVerNumber string, Latest_operatorsVersion string, 
Latest_phonePADPartitionedVersions string, Latest_operatorId string, 
gamePointDescription string,gamePointId double,contractNumber double,imei_count 
int) STORED BY 'org.apache.carbondata.format' TBLPROPERTIES 
('DICTIONARY_INCLUDE'='deviceInformationId,Latest_YEAR,Latest_MONTH,Latest_DAY')"

2: Load data with the following command:
" LOAD DATA INPATH 'HDFS_URL/BabuStore/Data/HiveData' INTO TABLE 
Carbon_automation 
OPTIONS('DELIMITER'=',','QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='imei,deviceInformationId,MAC,deviceColor,device_backColor,modelId,marketName,AMSize,ROMSize,CUPAudit,CPIClocked,series,productionDate,bomCode,internalModels,deliveryTime,channelsId,channelsName,deliveryAreaId,deliveryCountry,deliveryProvince,deliveryCity,deliveryDistrict,deliveryStreet,oxSingleNumber,contractNumber,ActiveCheckTime,ActiveAreaId,ActiveCountry,ActiveProvince,Activecity,ActiveDistrict,ActiveStreet,ActiveOperatorId,Active_releaseId,Active_EMUIVersion,Active_operaSysVersion,Active_BacVerNumber,Active_BacFlashVer,Active_webUIVersion,Active_webUITypeCarrVer,Active_webTypeDataVerNumber,Active_operatorsVersion,Active_phonePADPartitionedVersions,Latest_YEAR,Latest_MONTH,Latest_DAY,Latest_HOUR,Latest_areaId,Latest_country,Latest_province,Latest_city,Latest_district,Latest_street,Latest_releaseId,Latest_EMUIVersion,Latest_operaSysVersion,Latest_BacVerNumber,Latest_BacFlashVer,Latest_webUIVersion,Latest_webUITypeCarrVer,Latest_webTypeDataVerNumber,Latest_operatorsVersion,Latest_phonePADPartitionedVersions,Latest_operatorId,gamePointId,gamePointDescription,imei_count')"

3: Run the query:
" Select imei,gamePointId, channelsId,series from Carbon_automation where  
channelsId < 4 ORDER BY gamePointId limit 5 "

4: Incorrect result displayed as follows:
+------------+--------------+-------------+----------+
| imei       | gamePointId  | channelsId  | series   |
+------------+--------------+-------------+----------+
| 1AA100050  | 29.0         | 1           | 2Series  |
| 1AA100014  | 151.0        | 3           | 5Series  |
| 1AA100011  | 202.0        | 1           | 0Series  |
| 1AA100018  | 441.0        | 4           | 8Series  |
| 1AA100060  | 538.0        | 4           | 8Series  |
+------------+--------------+-------------+----------+
5 rows selected (0.237 seconds)

5: CSV attached: "100_hive_test.csv"

Expected result: rows with channelsId 4 should not be displayed, as per the query.
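
A possible cross-check, since channelsId is declared as a string in the DDL
above, is to make the intended numeric comparison explicit with a cast (a
suggested verification query):

" Select imei,gamePointId, channelsId,series from Carbon_automation where 
cast(channelsId as int) < 4 ORDER BY gamePointId limit 5 "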



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Introducing V3 format.

2017-02-15 Thread Ravindra Pesala
Problems in the current format:
1. IO reads are slow because multiple seeks on the file are needed to read the
column blocklets. The current blocklet size is 12, so the file has to be read
multiple times to scan the data of a single column. Alternatively, we could
increase the blocklet size, but then filter queries suffer because they get a
much bigger blocklet to filter.
2. Decompression is slow in the current format. We use an inverted index for
faster filter queries and NumberCompressor to bit-pack the inverted index,
which turns out to be slow, so we should avoid the number compressor. One
alternative is to keep the blocklet size within 32000 so that the inverted
index can be written with shorts, but then IO reads suffer a lot.

To overcome these two issues we are introducing the new V3 format.
Each blocklet has multiple pages of 32000 rows, and the number of pages per
blocklet is configurable. Since a page stays within the short limit, there is
no need to compress the inverted index. We also maintain the max/min for each
page to further prune filter queries. A blocklet is read with all its pages at
once and kept in off-heap memory. During filtering, the max/min range is
checked first, and only if the page qualifies is it decompressed and filtered
further.

Please find the attached V3 format thrift file.
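
To make the filtering flow concrete, here is a rough sketch of the page-level
min/max pruning idea (illustrative Scala only, not the actual CarbonData
classes):

// Illustrative only: each page keeps its min/max so a filter can skip
// decompression when the predicate cannot match anything in the page.
case class Page(min: Int, max: Int, compressed: Array[Byte])
case class Blocklet(pages: Seq[Page])   // e.g. pages of 32000 rows each

// Stand-in for the real decoder.
def decompress(page: Page): Seq[Int] = page.compressed.map(_.toInt)

// Equality filter: consult page min/max first, decompress only candidate pages.
def scanEqualTo(blocklet: Blocklet, value: Int): Seq[Int] =
  blocklet.pages
    .filter(p => value >= p.min && value <= p.max)     // prune by page min/max
    .flatMap(p => decompress(p).filter(_ == value))    // decompress and filter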

-- 
Thanks & Regards,
Ravi


Re: Introducing V3 format.

2017-02-15 Thread Ravindra Pesala
Please find the thrift file at the location below.
https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b1NqSTU2b2g4dkhkVDRj

On 15 February 2017 at 17:14, Ravindra Pesala  wrote:

> Problems in current format.
> 1. IO read is slower since it needs to go for multiple seeks on the file
> to read column blocklets. Current size of blocklet is 12, so it needs
> to read multiple times from file to scan the data on that column.
> Alternatively we can increase the blocklet size but it suffers for filter
> queries as it gets big blocklet to filter.
> 2. Decompression is slower in current format, we are using inverted index
> for faster filter queries and using NumberCompressor to compress the
> inverted index in bit wise packing. It becomes slower so we should avoid
> number compressor. One alternative is to keep blocklet size with in 32000
> so that inverted index can be written with short, but IO read suffers a lot.
>
> To overcome from above 2 issues we are introducing new format V3.
> Here each blocklet has multiple pages with size 32000, number of pages in
> blocklet is configurable. Since we keep the page with in short limit so no
> need compress the inverted index here.
> And maintain the max/min for each page to further prune the filter queries.
> Read the blocklet with pages at once and keep in offheap memory.
> During filter first check the max/min range and if it is valid then go for
> decompressing the page to filter further.
>
> Please find the attached V3 format thrift file.
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-708) Between operator does not work properly in carbondata.

2017-02-15 Thread SWATI RAO (JIRA)
SWATI RAO created CARBONDATA-708:


 Summary: Between operator does not work properly in carbondata.
 Key: CARBONDATA-708
 URL: https://issues.apache.org/jira/browse/CARBONDATA-708
 Project: CarbonData
  Issue Type: Bug
  Components: sql
Affects Versions: 1.1.0-incubating
 Environment: Spark 2.1
Reporter: SWATI RAO
Priority: Minor
 Attachments: 100_hive_test.csv

An incorrect result is displayed.

Steps to reproduce:

1: Create the table using the following command:

" create table Carbon_automation (imei string,deviceInformationId int,MAC 
string,deviceColor string,device_backColor string,modelId string,marketName 
string,AMSize string,ROMSize string,CUPAudit string,CPIClocked string,series 
string,productionDate timestamp,bomCode string,internalModels string, 
deliveryTime string, channelsId string, channelsName string , deliveryAreaId 
string, deliveryCountry string, deliveryProvince string, deliveryCity 
string,deliveryDistrict string, deliveryStreet string, oxSingleNumber string, 
ActiveCheckTime string, ActiveAreaId string, ActiveCountry string, 
ActiveProvince string, Activecity string, ActiveDistrict string, ActiveStreet 
string, ActiveOperatorId string, Active_releaseId string, Active_EMUIVersion 
string, Active_operaSysVersion string, Active_BacVerNumber string, 
Active_BacFlashVer string, Active_webUIVersion string, Active_webUITypeCarrVer 
string,Active_webTypeDataVerNumber string, Active_operatorsVersion string, 
Active_phonePADPartitionedVersions string, Latest_YEAR int, Latest_MONTH int, 
Latest_DAY int, Latest_HOUR string, Latest_areaId string, Latest_country 
string, Latest_province string, Latest_city string, Latest_district string, 
Latest_street string, Latest_releaseId string, Latest_EMUIVersion string, 
Latest_operaSysVersion string, Latest_BacVerNumber string, Latest_BacFlashVer 
string, Latest_webUIVersion string, Latest_webUITypeCarrVer string, 
Latest_webTypeDataVerNumber string, Latest_operatorsVersion string, 
Latest_phonePADPartitionedVersions string, Latest_operatorId string, 
gamePointDescription string,gamePointId double,contractNumber double,imei_count 
int) STORED BY 'org.apache.carbondata.format' TBLPROPERTIES 
('DICTIONARY_INCLUDE'='deviceInformationId,Latest_YEAR,Latest_MONTH,Latest_DAY')"

2: Load data with the following command:

" LOAD DATA INPATH 'HDFS_URL/BabuStore/Data/HiveData' INTO TABLE 
Carbon_automation 
OPTIONS('DELIMITER'=',','QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='imei,deviceInformationId,MAC,deviceColor,device_backColor,modelId,marketName,AMSize,ROMSize,CUPAudit,CPIClocked,series,productionDate,bomCode,internalModels,deliveryTime,channelsId,channelsName,deliveryAreaId,deliveryCountry,deliveryProvince,deliveryCity,deliveryDistrict,deliveryStreet,oxSingleNumber,contractNumber,ActiveCheckTime,ActiveAreaId,ActiveCountry,ActiveProvince,Activecity,ActiveDistrict,ActiveStreet,ActiveOperatorId,Active_releaseId,Active_EMUIVersion,Active_operaSysVersion,Active_BacVerNumber,Active_BacFlashVer,Active_webUIVersion,Active_webUITypeCarrVer,Active_webTypeDataVerNumber,Active_operatorsVersion,Active_phonePADPartitionedVersions,Latest_YEAR,Latest_MONTH,Latest_DAY,Latest_HOUR,Latest_areaId,Latest_country,Latest_province,Latest_city,Latest_district,Latest_street,Latest_releaseId,Latest_EMUIVersion,Latest_operaSysVersion,Latest_BacVerNumber,Latest_BacFlashVer,Latest_webUIVersion,Latest_webUITypeCarrVer,Latest_webTypeDataVerNumber,Latest_operatorsVersion,Latest_phonePADPartitionedVersions,Latest_operatorId,gamePointId,gamePointDescription,imei_count')"

3: Run the query:
select Latest_DAY,Latest_HOUR,count(distinct AMSize) as 
AMSize_number,sum(gamePointId+contractNumber) as total from Carbon_automation 
where Latest_HOUR between 12 and 15 group by Latest_DAY,Latest_HOUR order by 
total desc

4: No result is displayed:
+-------------+--------------+----------------+--------+
| Latest_DAY  | Latest_HOUR  | AMSize_number  | total  |
+-------------+--------------+----------------+--------+
+-------------+--------------+----------------+--------+
No rows selected (2.133 seconds).

5: CSV attached: "100_hive_test.csv"

Expected result: the correct result should be displayed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Introducing V3 format.

2017-02-15 Thread Kumar Vishal
+1
This will ease the IO bottleneck. Page-level min/max will improve block
pruning, and fewer false-positive blocks will improve filter query
performance. Separating data decompression from the reader layer will improve
the overall query performance.

-Regards
Kumar Vishal

On Wed, Feb 15, 2017 at 7:50 PM, Ravindra Pesala 
wrote:

> Please find the thrift file in below location.
> https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b
> 1NqSTU2b2g4dkhkVDRj
>
> On 15 February 2017 at 17:14, Ravindra Pesala 
> wrote:
>
> > Problems in current format.
> > 1. IO read is slower since it needs to go for multiple seeks on the file
> > to read column blocklets. Current size of blocklet is 12, so it needs
> > to read multiple times from file to scan the data on that column.
> > Alternatively we can increase the blocklet size but it suffers for filter
> > queries as it gets big blocklet to filter.
> > 2. Decompression is slower in current format, we are using inverted index
> > for faster filter queries and using NumberCompressor to compress the
> > inverted index in bit wise packing. It becomes slower so we should avoid
> > number compressor. One alternative is to keep blocklet size with in 32000
> > so that inverted index can be written with short, but IO read suffers a
> lot.
> >
> > To overcome from above 2 issues we are introducing new format V3.
> > Here each blocklet has multiple pages with size 32000, number of pages in
> > blocklet is configurable. Since we keep the page with in short limit so
> no
> > need compress the inverted index here.
> > And maintain the max/min for each page to further prune the filter
> queries.
> > Read the blocklet with pages at once and keep in offheap memory.
> > During filter first check the max/min range and if it is valid then go
> for
> > decompressing the page to filter further.
> >
> > Please find the attached V3 format thrift file.
> >
> > --
> > Thanks & Regards,
> > Ravi
> >
>
>
>
> --
> Thanks & Regards,
> Ravi
>


Want to contribute

2017-02-15 Thread Sandeep Purohit
Hello, I also have an interest in this product and want to contribute to it.


Re: Want to contribute

2017-02-15 Thread Kumar Vishal
Hi Sandeep,
Greetings. Please let me know which area you are interested in.
1) Data loading
2) Data query

-Regards
Kumar Vishal

On Wed, Feb 15, 2017 at 8:23 PM, Sandeep Purohit  wrote:

> Hello i also have interest in this product and want to contribute it.
>


[GitHub] incubator-carbondata-site issue #15: Added Search,Security and Faqs

2017-02-15 Thread chenliang613
Github user chenliang613 commented on the issue:

https://github.com/apache/incubator-carbondata-site/pull/15
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata-site pull request #15: Added Search,Security and Faqs

2017-02-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata-site/pull/15


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Want to contribute

2017-02-15 Thread Jean-Baptiste Onofré
Hi

Welcome. By the way it's not a product, it's a project ;)

Regards
JB

On Feb 15, 2017, 10:30, at 10:30, Sandeep Purohit  wrote:
>Hello i also have interest in this product and want to contribute it.


Re: Introducing V3 format.

2017-02-15 Thread Jean-Baptiste Onofré
Agree.

+1

Regards
JB

On Feb 15, 2017, 09:09, at 09:09, Kumar Vishal  
wrote:
>+1
>This will improve the IO bottleneck. Page level min max will improve
>the
>block pruning and less number of false positive blocks will improve the
>filter query performance. Separating uncompression of data from reader
>layer will improve the overall query performance.
>
>-Regards
>Kumar Vishal
>
>On Wed, Feb 15, 2017 at 7:50 PM, Ravindra Pesala
>
>wrote:
>
>> Please find the thrift file in below location.
>> https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b
>> 1NqSTU2b2g4dkhkVDRj
>>
>> On 15 February 2017 at 17:14, Ravindra Pesala 
>> wrote:
>>
>> > Problems in current format.
>> > 1. IO read is slower since it needs to go for multiple seeks on the
>file
>> > to read column blocklets. Current size of blocklet is 12, so it
>needs
>> > to read multiple times from file to scan the data on that column.
>> > Alternatively we can increase the blocklet size but it suffers for
>filter
>> > queries as it gets big blocklet to filter.
>> > 2. Decompression is slower in current format, we are using inverted
>index
>> > for faster filter queries and using NumberCompressor to compress
>the
>> > inverted index in bit wise packing. It becomes slower so we should
>avoid
>> > number compressor. One alternative is to keep blocklet size with in
>32000
>> > so that inverted index can be written with short, but IO read
>suffers a
>> lot.
>> >
>> > To overcome from above 2 issues we are introducing new format V3.
>> > Here each blocklet has multiple pages with size 32000, number of
>pages in
>> > blocklet is configurable. Since we keep the page with in short
>limit so
>> no
>> > need compress the inverted index here.
>> > And maintain the max/min for each page to further prune the filter
>> queries.
>> > Read the blocklet with pages at once and keep in offheap memory.
>> > During filter first check the max/min range and if it is valid then
>go
>> for
>> > decompressing the page to filter further.
>> >
>> > Please find the attached V3 format thrift file.
>> >
>> > --
>> > Thanks & Regards,
>> > Ravi
>> >
>>
>>
>>
>> --
>> Thanks & Regards,
>> Ravi
>>


Re: Introducing V3 format.

2017-02-15 Thread Liang Chen
Hi Ravi

Thank you for bringing this discussion to the mailing list. I have one question:
how do we ensure backward compatibility after introducing the new format?

Regards
Liang

Jean-Baptiste Onofré wrote
> Agree.
> 
> +1
> 
> Regards
> JB
> 
> On Feb 15, 2017, 09:09, at 09:09, Kumar Vishal <

> kumarvishal1802@

> > wrote:
>>+1
>>This will improve the IO bottleneck. Page level min max will improve
>>the
>>block pruning and less number of false positive blocks will improve the
>>filter query performance. Separating uncompression of data from reader
>>layer will improve the overall query performance.
>>
>>-Regards
>>Kumar Vishal
>>
>>On Wed, Feb 15, 2017 at 7:50 PM, Ravindra Pesala
>><

> ravi.pesala@

> >
>>wrote:
>>
>>> Please find the thrift file in below location.
>>> https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b
>>> 1NqSTU2b2g4dkhkVDRj
>>>
>>> On 15 February 2017 at 17:14, Ravindra Pesala <

> ravi.pesala@

> >
>>> wrote:
>>>
>>> > Problems in current format.
>>> > 1. IO read is slower since it needs to go for multiple seeks on the
>>file
>>> > to read column blocklets. Current size of blocklet is 12, so it
>>needs
>>> > to read multiple times from file to scan the data on that column.
>>> > Alternatively we can increase the blocklet size but it suffers for
>>filter
>>> > queries as it gets big blocklet to filter.
>>> > 2. Decompression is slower in current format, we are using inverted
>>index
>>> > for faster filter queries and using NumberCompressor to compress
>>the
>>> > inverted index in bit wise packing. It becomes slower so we should
>>avoid
>>> > number compressor. One alternative is to keep blocklet size with in
>>32000
>>> > so that inverted index can be written with short, but IO read
>>suffers a
>>> lot.
>>> >
>>> > To overcome from above 2 issues we are introducing new format V3.
>>> > Here each blocklet has multiple pages with size 32000, number of
>>pages in
>>> > blocklet is configurable. Since we keep the page with in short
>>limit so
>>> no
>>> > need compress the inverted index here.
>>> > And maintain the max/min for each page to further prune the filter
>>> queries.
>>> > Read the blocklet with pages at once and keep in offheap memory.
>>> > During filter first check the max/min range and if it is valid then
>>go
>>> for
>>> > decompressing the page to filter further.
>>> >
>>> > Please find the attached V3 format thrift file.
>>> >
>>> > --
>>> > Thanks & Regards,
>>> > Ravi
>>> >
>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Ravi
>>>





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Introducing-V3-format-tp7609p7622.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.


Re: Exception throws when I load data using carbondata-1.0.0

2017-02-15 Thread Liang Chen
Hi He Xiaoqiao,

The quick start uses Spark in local mode. Your case is a YARN cluster, so
please check:
https://github.com/apache/incubator-carbondata/blob/master/docs/installation-guide.md

Regards
Liang

2017-02-15 3:29 GMT-08:00 Xiaoqiao He :

> hi Manish Gupta,
>
> Thanks for you focus, actually i try to load data following
> https://github.com/apache/incubator-carbondata/blob/
> master/docs/quick-start-guide.md
> for deploying carbondata-1.0.0.
>
> 1.when i execute carbondata by `bin/spark-shell`, it throws as above.
> 2.when i execute carbondata by `bin/spark-shell --jars
> carbonlib/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar`, it
> throws another exception as below:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> > 0.0 (TID 3, [task hostname]): org.apache.spark.SparkException: File
> > ./carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar exists and does
> > not match contents of
> > http://master:50843/jars/carbondata_2.10-1.0.0-
> incubating-shade-hadoop2.7.1.jar
>
>
> I check the assembly jar and CarbonBlockDistinctValuesCombineRDD is
> present
> actually.
>
> anyone who meet the same problem?
>
> Best Regards,
> Hexiaoqiao
>
>
> On Wed, Feb 15, 2017 at 12:56 AM, manish gupta 
> wrote:
>
> > Hi,
> >
> > I think the carbon jar is compiled properly. Can you use any decompiler
> and
> > decompile carbondata-spark-common-1.1.0-incubating-SNAPSHOT.jar present
> in
> > spark-common module target folder and check whether the required class
> file
> > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD is
> > present or not.
> >
> > If you are using only the assembly jar then decompile and check in
> assembly
> > jar.
> >
> > Regards
> > Manish Gupta
> >
> > On Tue, Feb 14, 2017 at 11:19 AM, Xiaoqiao He 
> wrote:
> >
> > >  hi, dev,
> > >
> > > The latest release version apache-carbondata-1.0.0-incubating-rc2
> which
> > > takes Spark-1.6.2 to build throws exception `
> > > java.lang.ClassNotFoundException:
> > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD`
> > when
> > > i
> > > load data following Quick Start Guide.
> > >
> > > Env:
> > > a. CarbonData-1.0.0-incubating-rc2
> > > b. Spark-1.6.2
> > > c. Hadoop-2.7.1
> > > d. CarbonData on "Spark on YARN" Cluster and run yarn-client mode.
> > >
> > > any suggestions? Thank you.
> > >
> > > The exception stack trace as below:
> > >
> > > 
> > > ERROR 14-02 12:21:02,005 - main generate global dictionary failed
> > > org.apache.spark.SparkException: Job aborted due to stage failure:
> Task
> > 0
> > > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> stage
> > > 0.0 (TID 3, nodemanger): java.lang.ClassNotFoundException:
> > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD
> > >  at
> > > org.apache.spark.repl.ExecutorClassLoader.findClass(
> > > ExecutorClassLoader.scala:84)
> > >
> > >  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> > >  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> > >  at java.lang.Class.forName0(Native Method)
> > >  at java.lang.Class.forName(Class.java:274)
> > >  at
> > > org.apache.spark.serializer.JavaDeserializationStream$$
> > > anon$1.resolveClass(JavaSerializer.scala:68)
> > >
> > >  at
> > > java.io.ObjectInputStream.readNonProxyDesc(
> ObjectInputStream.java:1612)
> > >  at
> > > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> > >  at
> > > java.io.ObjectInputStream.readOrdinaryObject(
> > ObjectInputStream.java:1771)
> > >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> > java:1350)
> > >  at
> > > java.io.ObjectInputStream.defaultReadFields(
> ObjectInputStream.java:1990)
> > >  at
> > > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> > >  at
> > > java.io.ObjectInputStream.readOrdinaryObject(
> > ObjectInputStream.java:1798)
> > >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> > java:1350)
> > >  at java.io.ObjectInputStream.readObject(ObjectInputStream.
> java:370)
> > >  at
> > > org.apache.spark.serializer.JavaDeserializationStream.
> > > readObject(JavaSerializer.scala:76)
> > >
> > >  at
> > > org.apache.spark.serializer.JavaSerializerInstance.
> > > deserialize(JavaSerializer.scala:115)
> > >
> > >  at
> > > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> > ShuffleMapTask.scala:64)
> > >  at
> > > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> > ShuffleMapTask.scala:41)
> > >  at org.apache.spark.scheduler.Task.run(Task.scala:89)
> > >  at
> > > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> > >  at
> > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > ThreadPoolExecutor.java:1145)
> > >
> > >  at
> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > > ThreadPoolExecutor.java:

Re: Exception throws when I load data using carbondata-1.0.0

2017-02-15 Thread Xiaoqiao He
Hi Liang Chen,

Thanks for your help. It is true that I installed and configured CarbonData on
a "Spark on YARN" cluster following the installation guide (
https://github.com/apache/incubator-carbondata/blob/master/docs/installation-guide.md#installing-and-configuring-carbondata-on-spark-on-yarn-cluster
).

Best Regards,
Hexiaoqiao


On Thu, Feb 16, 2017 at 7:47 AM, Liang Chen  wrote:

> Hi He xiaoqiao
>
> Quick start is local model spark.
> Your case is yarn cluster , please check :
> https://github.com/apache/incubator-carbondata/blob/
> master/docs/installation-guide.md
>
> Regards
> Liang
>
> 2017-02-15 3:29 GMT-08:00 Xiaoqiao He :
>
> > hi Manish Gupta,
> >
> > Thanks for you focus, actually i try to load data following
> > https://github.com/apache/incubator-carbondata/blob/
> > master/docs/quick-start-guide.md
> > for deploying carbondata-1.0.0.
> >
> > 1.when i execute carbondata by `bin/spark-shell`, it throws as above.
> > 2.when i execute carbondata by `bin/spark-shell --jars
> > carbonlib/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar`, it
> > throws another exception as below:
> >
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0
> > > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> stage
> > > 0.0 (TID 3, [task hostname]): org.apache.spark.SparkException: File
> > > ./carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar exists and
> does
> > > not match contents of
> > > http://master:50843/jars/carbondata_2.10-1.0.0-
> > incubating-shade-hadoop2.7.1.jar
> >
> >
> > I check the assembly jar and CarbonBlockDistinctValuesCombineRDD is
> > present
> > actually.
> >
> > anyone who meet the same problem?
> >
> > Best Regards,
> > Hexiaoqiao
> >
> >
> > On Wed, Feb 15, 2017 at 12:56 AM, manish gupta <
> tomanishgupt...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I think the carbon jar is compiled properly. Can you use any decompiler
> > and
> > > decompile carbondata-spark-common-1.1.0-incubating-SNAPSHOT.jar
> present
> > in
> > > spark-common module target folder and check whether the required class
> > file
> > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD is
> > > present or not.
> > >
> > > If you are using only the assembly jar then decompile and check in
> > assembly
> > > jar.
> > >
> > > Regards
> > > Manish Gupta
> > >
> > > On Tue, Feb 14, 2017 at 11:19 AM, Xiaoqiao He 
> > wrote:
> > >
> > > >  hi, dev,
> > > >
> > > > The latest release version apache-carbondata-1.0.0-incubating-rc2
> > which
> > > > takes Spark-1.6.2 to build throws exception `
> > > > java.lang.ClassNotFoundException:
> > > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD`
> > > when
> > > > i
> > > > load data following Quick Start Guide.
> > > >
> > > > Env:
> > > > a. CarbonData-1.0.0-incubating-rc2
> > > > b. Spark-1.6.2
> > > > c. Hadoop-2.7.1
> > > > d. CarbonData on "Spark on YARN" Cluster and run yarn-client mode.
> > > >
> > > > any suggestions? Thank you.
> > > >
> > > > The exception stack trace as below:
> > > >
> > > > 
> > > > ERROR 14-02 12:21:02,005 - main generate global dictionary failed
> > > > org.apache.spark.SparkException: Job aborted due to stage failure:
> > Task
> > > 0
> > > > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> > stage
> > > > 0.0 (TID 3, nodemanger): java.lang.ClassNotFoundException:
> > > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD
> > > >  at
> > > > org.apache.spark.repl.ExecutorClassLoader.findClass(
> > > > ExecutorClassLoader.scala:84)
> > > >
> > > >  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> > > >  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> > > >  at java.lang.Class.forName0(Native Method)
> > > >  at java.lang.Class.forName(Class.java:274)
> > > >  at
> > > > org.apache.spark.serializer.JavaDeserializationStream$$
> > > > anon$1.resolveClass(JavaSerializer.scala:68)
> > > >
> > > >  at
> > > > java.io.ObjectInputStream.readNonProxyDesc(
> > ObjectInputStream.java:1612)
> > > >  at
> > > > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> > > >  at
> > > > java.io.ObjectInputStream.readOrdinaryObject(
> > > ObjectInputStream.java:1771)
> > > >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> > > java:1350)
> > > >  at
> > > > java.io.ObjectInputStream.defaultReadFields(
> > ObjectInputStream.java:1990)
> > > >  at
> > > > java.io.ObjectInputStream.readSerialData(
> ObjectInputStream.java:1915)
> > > >  at
> > > > java.io.ObjectInputStream.readOrdinaryObject(
> > > ObjectInputStream.java:1798)
> > > >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> > > java:1350)
> > > >  at java.io.ObjectInputStream.readObject(ObjectInputStream.
> > java:370)
> > > >  at
> > > > org.apache.spark.serializer.JavaDeserializationStream.
> > > > readObject(JavaSerializer.scala:76)
> > > >
> > >

Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Yinwei Li
Thanks, Ravindra.


I've run the script as:


scala> import org.apache.carbondata.core.util.CarbonProperties
scala> CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")



but it threw an exception: java.lang.RuntimeException: carbon.kettle.home is
not set


The configuration in my carbon.properties is
carbon.kettle.home=/opt/spark-2.1.0/carbonlib/carbonplugins, but it does not
seem to take effect.


How can I solve this problem?


--


Hi Liang Chen,


would you add a more detailed document about bad records that shows us how to
use this feature? Thanks.










------------------ Original Message ------------------
From: "Ravindra Pesala";;
Sent: Wednesday, 15 February 2017, 11:36 AM
To: "dev";

Subject: Re: data lost when loading data from csv file to carbon table



Hi,

I guess you are using spark-shell, so it is better to set the bad record
location in the CarbonProperties class before creating the carbon session, as below.

CarbonProperties.getInstance().addProperty("carbon.badRecords.location","<bad record location>").


1. And while loading data you need to enable bad record logging as below.

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true', 'use_kettle
'='true')").

Please check the bad records which are added to that bad record location.


2. You can alternatively verify by ignoring the bad records by using
following command
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true',
'bad_records_action'='ignore')").

Regards,
Ravindra.

On 15 February 2017 at 07:37, Yinwei Li <251469...@qq.com> wrote:

> Hi,
>
>
> I've set the properties as:
>
>
> carbon.badRecords.location=hdfs://localhost:9000/data/
> carbondata/badrecords
>
>
> and add 'bad_records_action'='force' when loading data as:
>
>
> carbon.sql(s"load data inpath '$src/web_sales.csv' into table
> _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")
>
>
> but the configurations seems not work as there are no path or file
> created under the path hdfs://localhost:9000/data/carbondata/badrecords.
>
>
> here are the way I created carbonContext:
>
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.CarbonSession._
> import org.apache.spark.sql.catalyst.util._
> val carbon = SparkSession.builder().config(sc.getConf).
> getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
>
>
>
>
> and the following are bad record logs:
>
>
> INFO  15-02 09:43:24,393 - [Executor task launch
> worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be]
> Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/
> web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
> ERROR 15-02 09:43:24,393 - [Executor task launch
> worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be]
> Data Load is partially success for table web_sales
> INFO  15-02 09:43:24,393 - Bad Record Found
>
>
>
>
> --  --
> ??: "Ravindra Pesala";;
> : 2017??2??14??(??) 10:41
> ??: "dev";
>
> : Re: data lost when loading data from csv file to carbon table
>
>
>
> Hi,
>
> Please set carbon.badRecords.location in carbon.properties and check any
> bad records are added to that location.
>
>
> Regards,
> Ravindra.
>
> On 14 February 2017 at 15:24, Yinwei Li <251469...@qq.com> wrote:
>
> > Hi all,
> >
> >
> >   I met an data lost problem when loading data from csv file to carbon
> > table, here are some details:
> >
> >
> >   Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
> >   Total Records:719,384
> >   Loaded Records:606,305 (SQL: select count(1) from table)
> >
> >
> >   My Attemps:
> >
> >
> > Attemp1: Add option bad_records_action='force' when loading data. It
> > also doesn't work, it's count equals to 606,305;
> > Attemp2: Cut line 1 to 300,000 into a csv file and load, the result
> is
> > right, which equals to 300,000;
> > Attemp3: Cut line 1 to 350,000 into a csv file and load, the result
> is
> > wrong, it equals to 305,631;
> > Attemp4: Cut line 300,000 to 350,000 into a csv file and load, the
> > result is right, it equals to 50,000;
> > Attemp5: Count the separator '|' of my csv file, it equals to lines *
> > columns,  so the source data may in the correct format;
> >
> >
> > In spark log, each attemp logs out : "Bad Record Found".
> >
> >
> > Anyone have any ideas?
>
>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi

Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Ravindra Pesala
Please set 'use_kettle'='false' and try running again.

Regards,
Ravindra
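
For reference, a sketch of the same load command used earlier in this thread,
with the kettle flow disabled:

scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='false')")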

On 16 February 2017 at 08:44, Yinwei Li <251469...@qq.com> wrote:

> thx Ravindra.
>
>
> I've run the script as:
>
>
> scala> import org.apache.carbondata.core.util.CarbonProperties
> scala> CarbonProperties.getInstance().addProperty("carbon.
> badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
> scala> val carbon = SparkSession.builder().config(sc.getConf).
> getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
> scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table
> _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true',
> 'use_kettle'='true')")
>
>
>
> but it occured an Exception: java.lang.RuntimeException:
> carbon.kettle.home is not set
>
>
> the configuration in my carbon.properties is:
> carbon.kettle.home=/opt/spark-2.1.0/carbonlib/carbonplugins, but it seems
> not work.
>
>
> how can I solve this problem.
>
>
> --
>
>
> Hi Liang Chen,
>
>
> would you add a more detail document about the badRecord shows us how
> to use it, thx~~
>
>
>
>
>
>
>
>
>
>
> -- Original Message --
> From: "Ravindra Pesala";;
> Sent: Wednesday, 15 February 2017, 11:36 AM
> To: "dev";
>
> Subject: Re: data lost when loading data from csv file to carbon table
>
>
>
> Hi,
>
> I guess you are using spark-shell, so better set bad record location to
> CarbonProperties class before creating carbon session like below.
>
> CarbonProperties.getInstance().addProperty("carbon.
> badRecords.location"," record location>").
>
>
> 1. And while loading data you need to enable bad record logging as below.
>
> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
> OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true', 'use_kettle
> '='true')").
>
> Please check the bad records which are added to that bad record location.
>
>
> 2. You can alternatively verify by ignoring the bad records by using
> following command
> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
> OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true',
> 'bad_records_action'='ignore')").
>
> Regards,
> Ravindra.
>
> On 15 February 2017 at 07:37, Yinwei Li <251469...@qq.com> wrote:
>
> > Hi,
> >
> >
> > I've set the properties as:
> >
> >
> > carbon.badRecords.location=hdfs://localhost:9000/data/
> > carbondata/badrecords
> >
> >
> > and add 'bad_records_action'='force' when loading data as:
> >
> >
> > carbon.sql(s"load data inpath '$src/web_sales.csv' into table
> > _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")
> >
> >
> > but the configurations seems not work as there are no path or file
> > created under the path hdfs://localhost:9000/data/carbondata/badrecords.
> >
> >
> > here are the way I created carbonContext:
> >
> >
> > import org.apache.spark.sql.SparkSession
> > import org.apache.spark.sql.CarbonSession._
> > import org.apache.spark.sql.catalyst.util._
> > val carbon = SparkSession.builder().config(sc.getConf).
> > getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
> >
> >
> >
> >
> > and the following are bad record logs:
> >
> >
> > INFO  15-02 09:43:24,393 - [Executor task launch
> > worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-
> 031d602fe2be]
> > Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/
> > web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
> > ERROR 15-02 09:43:24,393 - [Executor task launch
> > worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-
> 031d602fe2be]
> > Data Load is partially success for table web_sales
> > INFO  15-02 09:43:24,393 - Bad Record Found
> >
> >
> >
> >
> > -- Original Message --
> > From: "Ravindra Pesala";;
> > Sent: Tuesday, 14 February 2017, 10:41 PM
> > To: "dev";
> >
> > Subject: Re: data lost when loading data from csv file to carbon table
> >
> >
> >
> > Hi,
> >
> > Please set carbon.badRecords.location in carbon.properties and check any
> > bad records are added to that location.
> >
> >
> > Regards,
> > Ravindra.
> >
> > On 14 February 2017 at 15:24, Yinwei Li <251469...@qq.com> wrote:
> >
> > > Hi all,
> > >
> > >
> > >   I met an data lost problem when loading data from csv file to carbon
> > > table, here are some details:
> > >
> > >
> > >   Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
> > >   Total Records:719,384
> > >   Loaded Records:606,305 (SQL: select count(1) from table)
> > >
> > >
> > >   My Attemps:
> > >
> > >
> > > Attemp1: Add option bad_records_action='force' when loading data.
> It
> > > also doesn't work, it's count equals to 606,305;
> > > Attemp2: Cut line 1 to 300,000 into a csv file and load, the result
> > is
> > > right, which equals to 300,000;
> > > Attemp3: Cut line 1 to 350,000 into a csv file and load, the result
> > is
> > > wrong, it equals to 305,631;
> > > Attemp4: Cut line 300,000 to 350,000 into a csv file a

Re: Introducing V3 format.

2017-02-15 Thread Ravindra Pesala
Hi Liang,

Backward compatibility is already handled in the 1.0.0 version: to read an old
store, the V1/V2 format readers are used. So backward compatibility still works
even though we jump to the V3 format.
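
As an illustration of that dispatch (the names below are illustrative only, not
the actual CarbonData reader classes):

// Illustrative only: pick a reader based on the format version recorded in
// the file footer, so old V1/V2 stores stay readable alongside V3.
sealed trait BlockletReader { def read(path: String): Unit }

class V1Reader extends BlockletReader { def read(path: String): Unit = println(s"reading $path with V1 reader") }
class V2Reader extends BlockletReader { def read(path: String): Unit = println(s"reading $path with V2 reader") }
class V3Reader extends BlockletReader { def read(path: String): Unit = println(s"reading $path with V3 reader") }

object ReaderFactory {
  def forVersion(version: Int): BlockletReader = version match {
    case 1 => new V1Reader
    case 2 => new V2Reader
    case 3 => new V3Reader
    case v => throw new IllegalArgumentException(s"unsupported format version $v")
  }
}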

Regards,
Ravindra.

On 16 February 2017 at 04:18, Liang Chen  wrote:

> Hi Ravi
>
> Thank you bringing the discussion to mailing list, i have one question: how
> to ensure backward-compatible after introducing the new format.
>
> Regards
> Liang
>
> Jean-Baptiste Onofré wrote
> > Agree.
> >
> > +1
> >
> > Regards
> > JB
> >
> > On Feb 15, 2017, 09:09, at 09:09, Kumar Vishal <
>
> > kumarvishal1802@
>
> > > wrote:
> >>+1
> >>This will improve the IO bottleneck. Page level min max will improve
> >>the
> >>block pruning and less number of false positive blocks will improve the
> >>filter query performance. Separating uncompression of data from reader
> >>layer will improve the overall query performance.
> >>
> >>-Regards
> >>Kumar Vishal
> >>
> >>On Wed, Feb 15, 2017 at 7:50 PM, Ravindra Pesala
> >><
>
> > ravi.pesala@
>
> > >
> >>wrote:
> >>
> >>> Please find the thrift file in below location.
> >>> https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b
> >>> 1NqSTU2b2g4dkhkVDRj
> >>>
> >>> On 15 February 2017 at 17:14, Ravindra Pesala <
>
> > ravi.pesala@
>
> > >
> >>> wrote:
> >>>
> >>> > Problems in current format.
> >>> > 1. IO read is slower since it needs to go for multiple seeks on the
> >>file
> >>> > to read column blocklets. Current size of blocklet is 12, so it
> >>needs
> >>> > to read multiple times from file to scan the data on that column.
> >>> > Alternatively we can increase the blocklet size but it suffers for
> >>filter
> >>> > queries as it gets big blocklet to filter.
> >>> > 2. Decompression is slower in current format, we are using inverted
> >>index
> >>> > for faster filter queries and using NumberCompressor to compress
> >>the
> >>> > inverted index in bit wise packing. It becomes slower so we should
> >>avoid
> >>> > number compressor. One alternative is to keep blocklet size with in
> >>32000
> >>> > so that inverted index can be written with short, but IO read
> >>suffers a
> >>> lot.
> >>> >
> >>> > To overcome from above 2 issues we are introducing new format V3.
> >>> > Here each blocklet has multiple pages with size 32000, number of
> >>pages in
> >>> > blocklet is configurable. Since we keep the page with in short
> >>limit so
> >>> no
> >>> > need compress the inverted index here.
> >>> > And maintain the max/min for each page to further prune the filter
> >>> queries.
> >>> > Read the blocklet with pages at once and keep in offheap memory.
> >>> > During filter first check the max/min range and if it is valid then
> >>go
> >>> for
> >>> > decompressing the page to filter further.
> >>> >
> >>> > Please find the attached V3 format thrift file.
> >>> >
> >>> > --
> >>> > Thanks & Regards,
> >>> > Ravi
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks & Regards,
> >>> Ravi
> >>>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Introducing-V3-
> format-tp7609p7622.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Yinwei Li
Hi Ravindra,


I loaded the TPC-DS benchmark data in two ways; there are 25 tables in
total:


First way (using the new data loading solution):


val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|')")





Second way (using the kettle solution):


scala> import org.apache.carbondata.core.util.CarbonProperties
scala> CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> CarbonProperties.getInstance().addProperty("carbon.kettle.home","/opt/spark-2.1.0/carbonlib/carbonplugins")
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")



Unfortunately, only 23 of the tables have correct results; the exceptions are
the two tables named store_returns and web_sales. After loading data into these
two tables, the kettle solution produces a correct result while the new
solution in 1.0.0 seems to lose data. I suspect there is a bug.






-- Original Message --
From: "ﻬ.贝壳里的海";<251469...@qq.com>;
Sent: Thursday, 16 February 2017, 11:14 AM
To: "dev";

Subject: Re: data lost when loading data from csv file to carbon table



thx Ravindra.


I've run the script as:


scala> import org.apache.carbondata.core.util.CarbonProperties
scala> 
CarbonProperties.getInstance().addProperty("carbon.badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
scala> val carbon = 
SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table 
_1g.web_sales 
OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true','use_kettle'='true')")



but it occured an Exception: java.lang.RuntimeException: carbon.kettle.home is 
not set


the configuration in my carbon.properties is: 
carbon.kettle.home=/opt/spark-2.1.0/carbonlib/carbonplugins, but it seems not 
work.


how can I solve this problem.


--


Hi Liang Chen,


would you add a more detail document about the badRecord shows us how to 
use it, thx~~










-- Original Message --
From: "Ravindra Pesala";;
Sent: Wednesday, 15 February 2017, 11:36 AM
To: "dev";

Subject: Re: data lost when loading data from csv file to carbon table



Hi,

I guess you are using spark-shell, so better set bad record location to
CarbonProperties class before creating carbon session like below.

CarbonProperties.getInstance().addProperty("carbon.badRecords.location","").


1. And while loading data you need to enable bad record logging as below.

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true', 'use_kettle
'='true')").

Please check the bad records which are added to that bad record location.


2. Alternatively, you can verify by ignoring the bad records using the
following command:
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true',
'bad_records_action'='ignore')").
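Putting the two steps above together, here is a minimal end-to-end sketch for
spark-shell; the bad-record location below is only an illustrative HDFS path
(taken from the paths used earlier in this thread), not a required value:

// Set the bad-record location BEFORE creating the CarbonSession.
import org.apache.carbondata.core.util.CarbonProperties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

CarbonProperties.getInstance().addProperty(
  "carbon.badRecords.location",
  "hdfs://master:9000/data/carbondata/badrecords/")  // illustrative path

val carbon = SparkSession.builder().config(sc.getConf)
  .getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")

// Load with bad-record logging enabled, then inspect the location above,
// or switch to 'bad_records_action'='ignore' to skip the bad rows instead.
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales " +
  "OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true')")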

Regards,
Ravindra.

On 15 February 2017 at 07:37, Yinwei Li <251469...@qq.com> wrote:

> Hi,
>
>
> I've set the properties as:
>
>
> carbon.badRecords.location=hdfs://localhost:9000/data/
> carbondata/badrecords
>
>
> and add 'bad_records_action'='force' when loading data as:
>
>
> carbon.sql(s"load data inpath '$src/web_sales.csv' into table
> _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")
>
>
> but the configurations seems not work as there are no path or file
> created under the path hdfs://localhost:9000/data/carbondata/badrecords.
>
>
> here are the way I created carbonContext:
>
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.CarbonSession._
> import org.apache.spark.sql.catalyst.util._
> val carbon = SparkSession.builder().config(sc.getConf).
> getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
>
>
>
>
> and the following are bad record logs:
>
>
> INFO  15-02 09:43:24,393 - [Executor task launch
> worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be]
> Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/
> web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
> ERROR 15-02 09:43:24,393 - [Executor task launch
> worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be]
> Data Load is partially success for table web_sales
> INFO  15-02 09:43:24,393 - Bad Record Found
>
>
>
>
> -- Original Message --
> From: "Ravindra Pesala";;
> Sent: Tuesday, 14 February 2017, 10:41 PM
> To: "dev";
>
> Subject: Re:

Re: Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread QiangCai
Maybe you can check PR594, it will fix a bug which will impact the result of
loading.



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/data-lost-when-loading-data-from-csv-file-to-carbon-table-tp7554p7639.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.


Re: Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Ravindra Pesala
Hi Yinwei,

Thank you for pointing out the issue, I will check with TPC-DS data and
verify the data load with new flow.

Regards,
Ravindra.

On 16 February 2017 at 09:35, QiangCai  wrote:

> Maybe you can check PR594, it will fix a bug which will impact the result
> of
> loading.
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/data-lost-when-
> loading-data-from-csv-file-to-carbon-table-tp7554p7639.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-709) Incorrect documentation for bucketing in ddl section

2017-02-15 Thread anubhav tarar (JIRA)
anubhav tarar created CARBONDATA-709:


 Summary: Incorrect documentation for bucketing in ddl section
 Key: CARBONDATA-709
 URL: https://issues.apache.org/jira/browse/CARBONDATA-709
 Project: CarbonData
  Issue Type: Bug
  Components: docs
Affects Versions: 1.0.0-incubating
Reporter: anubhav tarar


In the docs DDL bucketing section, the line "Columns in the BUCKETCOLUMN
parameter must be either a dimension or a measure but combination of both is
not supported" is incorrect. Here is the example:

0: jdbc:hive2://localhost:1> CREATE TABLE uniqData_t11(ID Int,name 
string)stored by 'carbondata' 
TBLPROPERTIES("DICTIONARY_EXCLUDE"="name","bucketnumber"="1", 
"bucketcolumns"="ID");
Error: java.lang.RuntimeException: Bucket field must be dimension column and 
should not be measure or complex column: ID (state=,code=0)

So the bucketing column should be a dimension only; a working example is
sketched below.

Also, in the parameter description, tableName is not required, and the example
added for bucketing is also wrong:

0: jdbc:hive2://localhost:1> CREATE TABLE IF NOT EXISTS 
productSchema.productSalesTable ( productNumber Int, productName String, 
storeCity String, storeProvince String, productCategory String, productBatch 
String, saleQuantity Int, revenue Int) STORED BY 'carbondata' TBLPROPERTIES 
('COLUMN_GROUPS'='(productName,productCategory)', 
'DICTIONARY_EXCLUDE'='productName', 'DICTIONARY_INCLUDE'='productNumber', 
'NO_INVERTED_INDEX'='productBatch', 'BUCKETNUMBER'='4', 
'BUCKETCOLUMNS'='productNumber,saleQuantity');
Error: org.apache.carbondata.spark.exception.MalformedCarbonCommandException: 
Invalid column group,column in group should be contiguous as per schema. 
(state=,code=0) 
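For reference, a bucketing DDL along the lines suggested by the error message
(bucket column is a string dimension) would look roughly like the sketch
below; the table name and columns are only illustrative, not taken from the
documentation:

// Minimal sketch: bucket on the string dimension 'name' instead of the
// measure column 'ID'.
carbon.sql("CREATE TABLE IF NOT EXISTS uniqdata_bucket_demo(ID Int, name String) " +
  "STORED BY 'carbondata' " +
  "TBLPROPERTIES('BUCKETNUMBER'='1','BUCKETCOLUMNS'='name')")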




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Introducing V3 format.

2017-02-15 Thread Liang Chen
Hi

Thanks for the detailed explanation.
+1 for introducing new format to improve performance further.

Regards
Liang


ravipesala wrote
> Hi Liang,
> 
> Backward compatibility is already handled in 1.0.0 version, so to read old
> store then  it uses V1/V2 format readers to read data from old store. So
> backward compatibility works even though we jump to V3 format.
> 
> Regards,
> Ravindra.
> 
> On 16 February 2017 at 04:18, Liang Chen <

> chenliang6136@

> > wrote:
> 
>> Hi Ravi
>>
>> Thank you bringing the discussion to mailing list, i have one question:
>> how
>> to ensure backward-compatible after introducing the new format.
>>
>> Regards
>> Liang
>>
>> Jean-Baptiste Onofré wrote
>> > Agree.
>> >
>> > +1
>> >
>> > Regards
>> > JB
>> >
>> > On Feb 15, 2017, 09:09, at 09:09, Kumar Vishal <
>>
>> > kumarvishal1802@
>>
>> > > wrote:
>> >>+1
>> >>This will improve the IO bottleneck. Page level min max will improve
>> >>the
>> >>block pruning and less number of false positive blocks will improve the
>> >>filter query performance. Separating uncompression of data from reader
>> >>layer will improve the overall query performance.
>> >>
>> >>-Regards
>> >>Kumar Vishal
>> >>
>> >>On Wed, Feb 15, 2017 at 7:50 PM, Ravindra Pesala
>> >><
>>
>> > ravi.pesala@
>>
>> > >
>> >>wrote:
>> >>
>> >>> Please find the thrift file in below location.
>> >>> https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b
>> >>> 1NqSTU2b2g4dkhkVDRj
>> >>>
>> >>> On 15 February 2017 at 17:14, Ravindra Pesala <
>>
>> > ravi.pesala@
>>
>> > >
>> >>> wrote:
>> >>>
>> >>> > Problems in current format.
>> >>> > 1. IO read is slower since it needs to go for multiple seeks on the
>> >>file
>> >>> > to read column blocklets. Current size of blocklet is 12, so it
>> >>needs
>> >>> > to read multiple times from file to scan the data on that column.
>> >>> > Alternatively we can increase the blocklet size but it suffers for
>> >>filter
>> >>> > queries as it gets big blocklet to filter.
>> >>> > 2. Decompression is slower in current format, we are using inverted
>> >>index
>> >>> > for faster filter queries and using NumberCompressor to compress
>> >>the
>> >>> > inverted index in bit wise packing. It becomes slower so we should
>> >>avoid
>> >>> > number compressor. One alternative is to keep blocklet size with in
>> >>32000
>> >>> > so that inverted index can be written with short, but IO read
>> >>suffers a
>> >>> lot.
>> >>> >
>> >>> > To overcome from above 2 issues we are introducing new format V3.
>> >>> > Here each blocklet has multiple pages with size 32000, number of
>> >>pages in
>> >>> > blocklet is configurable. Since we keep the page with in short
>> >>limit so
>> >>> no
>> >>> > need compress the inverted index here.
>> >>> > And maintain the max/min for each page to further prune the filter
>> >>> queries.
>> >>> > Read the blocklet with pages at once and keep in offheap memory.
>> >>> > During filter first check the max/min range and if it is valid then
>> >>go
>> >>> for
>> >>> > decompressing the page to filter further.
>> >>> >
>> >>> > Please find the attached V3 format thrift file.
>> >>> >
>> >>> > --
>> >>> > Thanks & Regards,
>> >>> > Ravi
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Thanks & Regards,
>> >>> Ravi
>> >>>
>>
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-carbondata-
>> mailing-list-archive.1130556.n5.nabble.com/Introducing-V3-
>> format-tp7609p7622.html
>> Sent from the Apache CarbonData Mailing List archive mailing list archive
>> at Nabble.com.
>>
> 
> 
> 
> -- 
> Thanks & Regards,
> Ravi





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Introducing-V3-format-tp7609p7645.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.
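
To make the quoted V3 proposal above more concrete, here is a small
illustrative sketch (invented names and types, not CarbonData's actual reader
API) of page-level min/max pruning: a blocklet is read as a set of pages, and
only pages whose min/max range can contain the filter value are decompressed
and scanned.

// Illustrative only: one page of a blocklet, with its min/max metadata.
case class PageMeta(min: Int, max: Int, compressed: Array[Byte])

def scanBlocklet(pages: Seq[PageMeta],
                 filterValue: Int,
                 decompress: Array[Byte] => Array[Int]): Seq[Int] =
  pages
    .filter(p => filterValue >= p.min && filterValue <= p.max) // prune whole pages by min/max
    .flatMap(p => decompress(p.compressed).filter(_ == filterValue)) // decode only surviving pages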


[GitHub] incubator-carbondata-site pull request #16: Resolved Image not displaying is...

2017-02-15 Thread PallaviSingh1992
GitHub user PallaviSingh1992 opened a pull request:

https://github.com/apache/incubator-carbondata-site/pull/16

Resolved Image not displaying issue in Troubleshooting



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PallaviSingh1992/incubator-carbondata-site 
feature/ImageNotDisplayingInTS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata-site/pull/16.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16


commit cdcdbec0f4922977fba6b62cc91600d092c3a583
Author: PallaviSingh1992 
Date:   2017-02-16T05:55:36Z

Resolved Image not displaying issue in Troubleshooting




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Ravindra Pesala
Hi Yinwei,

Can you provide create table scripts for both the tables store_returns and
web_sales.

Regards,
Ravindra.

On 16 February 2017 at 10:07, Ravindra Pesala  wrote:

> Hi Yinwei,
>
> Thank you for pointing out the issue, I will check with TPC-DS data and
> verify the data load with new flow.
>
> Regards,
> Ravindra.
>
> On 16 February 2017 at 09:35, QiangCai  wrote:
>
>> Maybe you can check PR594, it will fix a bug which will impact the result
>> of
>> loading.
>>
>>
>>
>> --
>> View this message in context: http://apache-carbondata-maili
>> ng-list-archive.1130556.n5.nabble.com/data-lost-when-load
>> ing-data-from-csv-file-to-carbon-table-tp7554p7639.html
>> Sent from the Apache CarbonData Mailing List archive mailing list archive
>> at Nabble.com.
>>
>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi


Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Yinwei Li
Hi Ravindra:


I add DICTIONARY_INCLUDE for each of them:


carbon.sql("create table if not exists _1g.store_returns(sr_returned_date_sk 
integer, sr_return_time_sk integer, sr_item_sk integer, sr_customer_sk integer, 
sr_cdemo_sk integer, sr_hdemo_sk integer, sr_addr_sk integer, sr_store_sk 
integer, sr_reason_sk integer, sr_ticket_number integer, sr_return_quantity 
integer, sr_return_amt decimal(7,2), sr_return_tax decimal(7,2), 
sr_return_amt_inc_tax decimal(7,2), sr_fee decimal(7,2), sr_return_ship_cost 
decimal(7,2), sr_refunded_cash decimal(7,2), sr_reversed_charge decimal(7,2), 
sr_store_credit decimal(7,2), sr_net_loss decimal(7,2)) STORED BY 'carbondata' 
TBLPROPERTIES ('DICTIONARY_INCLUDE'='sr_returned_date_sk, sr_return_time_sk, 
sr_item_sk, sr_customer_sk, sr_cdemo_sk, sr_hdemo_sk, sr_addr_sk, sr_store_sk, 
sr_reason_sk, sr_ticket_number')");




carbon.sql("create table if not exists _1g.web_sales(ws_sold_date_sk integer, 
ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk integer, 
ws_bill_customer_sk integer, ws_bill_cdemo_sk integer, ws_bill_hdemo_sk 
integer, ws_bill_addr_sk integer, ws_ship_customer_sk integer, ws_ship_cdemo_sk 
integer, ws_ship_hdemo_sk integer, ws_ship_addr_sk integer, ws_web_page_sk 
integer, ws_web_site_sk integer, ws_ship_mode_sk integer, ws_warehouse_sk 
integer, ws_promo_sk integer, ws_order_number integer, ws_quantity integer, 
ws_wholesale_cost decimal(7,2), ws_list_price decimal(7,2), ws_sales_price 
decimal(7,2), ws_ext_discount_amt decimal(7,2), ws_ext_sales_price 
decimal(7,2), ws_ext_wholesale_cost decimal(7,2), ws_ext_list_price 
decimal(7,2), ws_ext_tax decimal(7,2), ws_coupon_amt decimal(7,2), 
ws_ext_ship_cost decimal(7,2), ws_net_paid decimal(7,2), ws_net_paid_inc_tax 
decimal(7,2), ws_net_paid_inc_ship decimal(7,2), ws_net_paid_inc_ship_tax 
decimal(7,2), ws_net_profit decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES 
('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk, 
ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk, 
ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk, ws_ship_hdemo_sk, 
ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk, ws_ship_mode_sk, 
ws_warehouse_sk, ws_promo_sk, ws_order_number')");



And here is my script for generating the TPC-DS data:
[hadoop@master tools]$ ./dsdgen -scale 1 -suffix '.csv' -dir /data/tpc-ds/data/








-- Original Message --
From: "Ravindra Pesala";;
Sent: Thursday, 16 February 2017, 3:15
To: "dev";

Subject: Re: Re: data lost when loading data from csv file to carbon table



Hi Yinwei,

Can you provide create table scripts for both the tables store_returns and
web_sales.

Regards,
Ravindra.

On 16 February 2017 at 10:07, Ravindra Pesala  wrote:

> Hi Yinwei,
>
> Thank you for pointing out the issue, I will check with TPC-DS data and
> verify the data load with new flow.
>
> Regards,
> Ravindra.
>
> On 16 February 2017 at 09:35, QiangCai  wrote:
>
>> Maybe you can check PR594, it will fix a bug which will impact the result
>> of
>> loading.
>>
>>
>>
>> --
>> View this message in context: http://apache-carbondata-maili
>> ng-list-archive.1130556.n5.nabble.com/data-lost-when-load
>> ing-data-from-csv-file-to-carbon-table-tp7554p7639.html
>> Sent from the Apache CarbonData Mailing List archive mailing list archive
>> at Nabble.com.
>>
>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi

Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Ravindra Pesala
Hi,

I have generated tpcds data using https://github.com/brownsys/tpcds .
And I have loaded the data both with the kettle flow and with the new flow;
both give the same number of rows when using a select count(*) query on the
table. I have even counted the rows in the Excel file, and it matches the
count query.

Number of rows loaded
store_returns : 288279
web_sales : 718931

Scripts :

spark.sql("""
   CREATE TABLE IF NOT EXISTS STORE_RETURNS
  (sr_returned_date_sk  bigint,
   sr_return_time_skbigint,
   sr_item_sk   bigint,
   sr_customer_sk   bigint,
   sr_cdemo_sk  bigint,
   sr_hdemo_sk  bigint,
   sr_addr_sk   bigint,
   sr_store_sk  bigint,
   sr_reason_sk bigint,
   sr_ticket_number bigint,
   sr_return_quantity   bigint,
   sr_return_amtdecimal(7,2),
   sr_return_taxdecimal(7,2),
   sr_return_amt_inc_tax decimal(7,2),
   sr_fee   decimal(7,2),
   sr_return_ship_cost  decimal(7,2),
   sr_refunded_cash decimal(7,2),
   sr_reversed_charge   decimal(7,2),
   sr_store_credit  decimal(7,2),
   sr_net_loss  decimal(7,2))
   STORED BY 'carbondata'
   
TBLPROPERTIES('dictionary_include'='sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number')
   """)

spark.sql("create table if not exists web_sales(ws_sold_date_sk
integer, ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk
integer, ws_bill_customer_sk integer, ws_bill_cdemo_sk integer,
ws_bill_hdemo_sk integer, ws_bill_addr_sk integer, ws_ship_customer_sk
integer, ws_ship_cdemo_sk integer, ws_ship_hdemo_sk integer,
ws_ship_addr_sk integer, ws_web_page_sk integer, ws_web_site_sk
integer, ws_ship_mode_sk integer, ws_warehouse_sk integer, ws_promo_sk
integer, ws_order_number integer, ws_quantity integer,
ws_wholesale_cost decimal(7,2), ws_list_price decimal(7,2),
ws_sales_price decimal(7,2), ws_ext_discount_amt decimal(7,2),
ws_ext_sales_price decimal(7,2), ws_ext_wholesale_cost decimal(7,2),
ws_ext_list_price decimal(7,2), ws_ext_tax decimal(7,2), ws_coupon_amt
decimal(7,2), ws_ext_ship_cost decimal(7,2), ws_net_paid decimal(7,2),
ws_net_paid_inc_tax decimal(7,2), ws_net_paid_inc_ship decimal(7,2),
ws_net_paid_inc_ship_tax decimal(7,2), ws_net_profit decimal(7,2))
STORED BY 'carbondata' TBLPROPERTIES
('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk,
ws_ship_date_sk, ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk,
ws_bill_hdemo_sk, ws_bill_addr_sk, ws_ship_customer_sk,
ws_ship_cdemo_sk, ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk,
ws_web_site_sk, ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk,
ws_order_number')")

spark.sql(s"""
   LOAD DATA LOCAL INPATH
'/home/root1/Downloads/store_returns.csv' into table STORE_RETURNS
options('DELIMITER'='|',
'FILEHEADER'='sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number,sr_return_quantity,sr_return_amt,sr_return_tax,sr_return_amt_inc_tax,sr_fee,sr_return_ship_cost,sr_refunded_cash,sr_reversed_charge,sr_store_credit,sr_net_loss',
'use_kettle'='false')
   """)

spark.sql(s"""
   LOAD DATA LOCAL INPATH
'hdfs://localhost:9000/tpcds/web_sales/part-r-0-dca70590-4d9d-4cc9-aff4-e20b85970d2b'
into table web_sales options('DELIMITER'='|',
'FILEHEADER'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk,
ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk,
ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk,
ws_ship_hdemo_sk, ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk,
ws_ship_mode_sk, ws_warehouse_sk, ws_promo_sk, ws_order_number,
ws_quantity, ws_wholesale_cost, ws_list_price, ws_sales_price,
ws_ext_discount_amt, ws_ext_sales_price, ws_ext_wholesale_cost,
ws_ext_list_price, ws_ext_tax, ws_coupon_amt, ws_ext_ship_cost,
ws_net_paid, ws_net_paid_inc_tax, ws_net_paid_inc_ship,
ws_net_paid_inc_ship_tax, ws_net_profit', 'use_kettle'='false')
   """)


Did I miss something here?


Regards,
Ravindra.

On 16 February 2017 at 12:24, Yinwei Li <251469...@qq.com> wrote:

> Hi Ravindra:
>
>
> I add DICTIONARY_INCLUDE for each of them:
>
>
> carbon.sql("create table if not exists _1g.store_returns(sr_returned_date_sk
> integer, sr_return_time_sk integer, sr_item_sk integer, sr_customer_sk
> integer, sr_cdemo_sk integer, sr_hdemo_sk integer, sr_addr_sk integer,
> sr_store_sk integer, sr_reason_sk integer, sr_ticket_number integer,
> sr_return_quantity integer, sr_return_amt decimal(7,2), sr_return_tax
> decimal(7,2), sr_return_amt_inc_tax decimal(7,2), sr_fee decimal(7,2),
> sr_return_ship_cost decimal(7,2), sr_refunded_cash decimal(7,2),
> sr_reversed_charge decimal(7,2), sr_store_credit decimal(7,2), sr_net_loss
> decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES
> ('DICTIONARY_INCLUDE'='sr_returned_date_sk, sr_return_time_sk,
> sr_item_sk, sr_customer_sk, sr_cdemo_sk, sr_hdemo_sk, sr_addr_sk,
> sr_store_sk,