Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

2017-04-13 Thread Sanoj MG
Thanks, Jacky. I have created a JIRA for this:
https://issues.apache.org/jira/browse/CARBONDATA-909



Thanks,
Sanoj



Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

2017-04-11 Thread Jacky Li
Hi Sanoj,

This is because the CarbonData loading flow needs to scan the input data twice
(once to generate the global dictionary, and once for the actual load). If you
are writing a DataFrame to CarbonData and the input DataFrame is costly to
compute, it is better to save it as a temporary CSV file first and load that
into CarbonData, instead of computing the DataFrame twice.
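
To illustrate the trade-off described above, here is a minimal plain-Scala sketch (no Spark or CarbonData involved). Every name in it (TwoPassLoadSketch, expensiveRows, buildDictionary, encode) is hypothetical and only stands in for the real loading flow: a two-pass load recomputes a costly input on each pass, while writing it to a temporary file first means it is computed only once.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object TwoPassLoadSketch {
  // Counts how many times the "expensive" input is computed.
  var computeCount = 0

  // Hypothetical stand-in for a costly DataFrame: each call recomputes all rows.
  def expensiveRows(): Iterator[String] = {
    computeCount += 1
    Iterator("2", "1", "2", "3")
  }

  // Pass 1: assign an id to each distinct value (stand-in for dictionary generation).
  def buildDictionary(rows: Iterator[String]): Map[String, Int] =
    rows.toList.distinct.zipWithIndex.toMap

  // Pass 2: encode each row using the dictionary (stand-in for the actual load).
  def encode(rows: Iterator[String], dict: Map[String, Int]): List[Int] =
    rows.map(dict).toList

  // Direct load: the input is computed once per pass, so twice in total.
  def loadDirect(): List[Int] = {
    val dict = buildDictionary(expensiveRows())
    encode(expensiveRows(), dict)
  }

  // Temp-file load: compute once, write to disk, then scan the file twice.
  def loadViaTempFile(): List[Int] = {
    val tmp = File.createTempFile("rows", ".txt")
    tmp.deleteOnExit()
    val out = new PrintWriter(tmp)
    try expensiveRows().foreach(line => out.println(line)) finally out.close()

    def scan(): List[String] = {
      val src = Source.fromFile(tmp)
      try src.getLines().toList finally src.close()
    }
    val dict = buildDictionary(scan().iterator)
    encode(scan().iterator, dict)
  }
}
```

Both paths produce the same encoded rows; the difference is only in how often expensiveRows() runs, which is the cost the temporary CSV is meant to avoid.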

However, there is another option that does a single-pass data load:
.option("single_pass", "true"). In this case, the input DataFrame should be
computed only once. But when I checked the code just now, it seems this behavior
is not implemented. :(
Feel free to create a JIRA ticket for it if you want.
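
As a generic illustration of the single-pass idea (not CarbonData's actual single_pass code, which is not shown in this thread), the sketch below builds the dictionary incrementally while encoding, so the input is scanned only once. SinglePassSketch and encodeSinglePass are hypothetical names.

```scala
import scala.collection.mutable

object SinglePassSketch {
  // Encodes values in one scan, assigning a new dictionary id the first
  // time each distinct value is seen.
  def encodeSinglePass(rows: Iterator[String]): (List[Int], Map[String, Int]) = {
    val dict = mutable.LinkedHashMap.empty[String, Int]
    // toList forces the whole scan before the dictionary is snapshotted below.
    val encoded = rows.map(v => dict.getOrElseUpdate(v, dict.size)).toList
    (encoded, dict.toMap)
  }
}
```

Because ids are handed out in order of first appearance, no separate dictionary-generation scan is needed before encoding.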

Regards,
Jacky




Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

2017-04-10 Thread Sanoj MG
Hi All,

In CarbonDataFrameWriter, there is an option to load using CSV file.

if (options.tempCSV) {
  loadTempCSV(options)
} else {
  loadDataFrame(options)
}

Why is this choice required? Is there any issue if we load directly
without using CSV?

I have many dimension tables with commas in string columns, so I always use
.option("tempCSV", "false"). In CarbonOption, can we set the default value
to "false", as below?

def tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean
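
A minimal sketch of what the proposed change would do, assuming the current default is "true" (the request implies this, but it is not confirmed in this thread). WriterOptions and chooseLoadPath are hypothetical stand-ins for CarbonOption and the branch in CarbonDataFrameWriter; the default is passed in as a parameter so both behaviors can be compared.

```scala
object TempCsvOptionSketch {
  // Hypothetical stand-in for CarbonOption: wraps the writer's option map.
  final case class WriterOptions(options: Map[String, String], defaultTempCSV: String) {
    // Same shape as the accessor above, with the default made configurable.
    def tempCSV: Boolean = options.getOrElse("tempCSV", defaultTempCSV).toBoolean
  }

  // Mirrors the if/else quoted earlier, returning the name of the chosen path.
  def chooseLoadPath(opts: WriterOptions): String =
    if (opts.tempCSV) "loadTempCSV" else "loadDataFrame"
}
```

With a default of "false", a write that sets no tempCSV option would take the loadDataFrame path, and users who want the CSV path would have to ask for it explicitly.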

Thanks,
Sanoj




[jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

2017-03-30 Thread Sanoj MG (JIRA)
Sanoj MG created CARBONDATA-836:
---

         Summary: Error in load using dataframe - columns containing comma
             Key: CARBONDATA-836
             URL: https://issues.apache.org/jira/browse/CARBONDATA-836
         Project: CarbonData
      Issue Type: Bug
      Components: spark-integration
Affects Versions: 1.1.0-incubating
     Environment: HDP sandbox 2.5, Spark 1.6.2
        Reporter: Sanoj MG
        Priority: Minor
         Fix For: NONE


While trying to load data into a CarbonData table using a DataFrame, columns
containing commas are not loaded properly.

Eg: 
scala> df.show(false)
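+-------+------+-----------+------------+---------+------+
|Country|Branch|Name       |Address     |ShortName|Status|
+-------+------+-----------+------------+---------+------+
|2      |1     |Main Branch|, Dubai, UAE|UHO      |256   |
+-------+------+-----------+------------+---------+------+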


scala>  df.write.format("carbondata").option("tableName", 
"Branch1").option("compress", "true").mode(SaveMode.Overwrite).save()


scala> cc.sql("select * from branch1").show(false)

+-------+------+-----------+-------+---------+------+
|country|branch|name       |address|shortname|status|
+-------+------+-----------+-------+---------+------+
|2      |1     |Main Branch|       | Dubai   |null  |
+-------+------+-----------+-------+---------+------+
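
The shifted output above is what one would expect if the row passed through an unquoted CSV round trip. The plain-Scala sketch below reproduces that pattern under two assumptions that this report does not confirm: that the intermediate CSV is written without quoting, and that status is an integer column. CommaShiftSketch and its helpers are hypothetical names, not CarbonData code.

```scala
import scala.util.Try

object CommaShiftSketch {
  val columns = List("country", "branch", "name", "address", "shortname", "status")
  val row = List("2", "1", "Main Branch", ", Dubai, UAE", "UHO", "256")

  // Writes a row the naive way: fields joined by commas with no quoting.
  def toUnquotedCsv(fields: List[String]): String = fields.mkString(",")

  // Reads a line back by splitting on every comma; zip keeps only as many
  // fields as there are columns, so the extra trailing fields are dropped.
  def parseNaive(line: String): Map[String, String] =
    columns.zip(line.split(",", -1).toList).toMap

  // Integer column: text that is not a number becomes None
  // (the counterpart of null in the query output).
  def parseStatus(record: Map[String, String]): Option[Int] =
    Try(record("status").toInt).toOption
}
```

Here the address field splits into three pieces, so address receives the empty piece, shortname receives " Dubai", and status receives " UAE", which cannot be parsed as an integer.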






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)