Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Sanoj MG Thu, 13 Apr 2017 01:16:24 -0700

Thanks Jacky. I have created a JIRA -
https://issues.apache.org/jira/browse/CARBONDATA-909 for this.




Thanks,
Sanoj

On Tue, Apr 11, 2017 at 5:42 PM, Jacky Li <jacky.li...@qq.com> wrote:

> Hi Sanoj,
>
> This is because the CarbonData loading flow needs to scan the input data
> twice (once to generate the global dictionary, and once for the actual
> loading). If the user is writing a DataFrame to CarbonData and the input
> DataFrame is costly to compute, it is better to save it as a temporary CSV
> file first and load that into CarbonData instead of computing the DataFrame
> twice.
>
> However, there is another option that can do a single-pass data load:
> .option("single_pass", "true"). In this case, the input dataframe
> should be computed only once. But when I checked the code just now, it seems
> this behavior is not implemented. :(
> I think you are free to create a JIRA ticket if you want.
>
> Regards,
> Jacky
>
> > On Apr 11, 2017, at 10:36 AM, Sanoj MG <sanoj.george....@gmail.com> wrote:
> >
> > Hi All,
> >
> > In CarbonDataFrameWriter, there is an option to load using CSV file.
> >
> > if (options.tempCSV) {
> >  loadTempCSV(options)
> > } else {
> >  loadDataFrame(options)
> > }
> >
> > Why is this choice required? Is there any issue if we load it directly
> > without using CSV?
> >
> > I have many dimension tables with commas in string columns, and so always
> > use .option("tempCSV", "false"). In CarbonOption, can we set the default
> > value to "false", as below?
> >
> > def tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean
> >
> > Thanks,
> > Sanoj
> >
> >
> > On Thu, Mar 30, 2017 at 12:14 PM, Sanoj MG (JIRA) <j...@apache.org> wrote:
> >
> >> Sanoj MG created CARBONDATA-836:
> >> -----------------------------------
> >>
> >>             Summary: Error in load using dataframe - columns containing comma
> >>                 Key: CARBONDATA-836
> >>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-836
> >>             Project: CarbonData
> >>          Issue Type: Bug
> >>          Components: spark-integration
> >>    Affects Versions: 1.1.0-incubating
> >>         Environment: HDP sandbox 2.5, Spark 1.6.2
> >>            Reporter: Sanoj MG
> >>            Priority: Minor
> >>             Fix For: NONE
> >>
> >>
> >> While trying to load data into a CarbonData table using a dataframe,
> >> columns containing commas are not properly loaded.
> >>
> >> Eg:
> >> scala> df.show(false)
> >> +-------+------+-----------+----------------+---------+------+
> >> |Country|Branch|Name       |Address         |ShortName|Status|
> >> +-------+------+-----------+----------------+---------+------+
> >> |2      |1     |Main Branch|XXXX, Dubai, UAE|UHO      |256   |
> >> +-------+------+-----------+----------------+---------+------+
> >>
> >>
> >> scala>  df.write.format("carbondata").option("tableName",
> >> "Branch1").option("compress", "true").mode(SaveMode.Overwrite).save()
> >>
> >>
> >> scala> cc.sql("select * from branch1").show(false)
> >>
> >> +-------+------+-----------+-------+---------+------+
> >> |country|branch|name       |address|shortname|status|
> >> +-------+------+-----------+-------+---------+------+
> >> |2      |1     |Main Branch|XXXX   | Dubai   |null  |
> >> +-------+------+-----------+-------+---------+------+
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> This message was sent by Atlassian JIRA
> >> (v6.3.15#6346)
> >>
>
>
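
A note on the symptom reported in CARBONDATA-836: the shifted columns are what one would expect if the row passes through a comma-delimited text form without quoting somewhere in the load path. The sketch below is plain Scala with no Spark or CarbonData dependency. The object and method names are hypothetical, and this is not CarbonData's actual writer code. It only shows how an unquoted, comma-joined line splits the Address value into three fields, and how quoting keeps it intact.

```scala
object CommaInCsvSketch {
  // Column names and the single row taken from the report above.
  val columns: Seq[String] =
    Seq("country", "branch", "name", "address", "shortname", "status")
  val row: Seq[String] =
    Seq("2", "1", "Main Branch", "XXXX, Dubai, UAE", "UHO", "256")

  // Naive writer: joins values with commas and does no quoting.
  def writeNaive(values: Seq[String]): String = values.mkString(",")

  // Quoting writer: wraps each value in double quotes, doubling embedded quotes.
  def writeQuoted(values: Seq[String]): String =
    values.map(v => "\"" + v.replace("\"", "\"\"") + "\"").mkString(",")

  // Naive reader: splits on every comma, including commas that are data.
  def readNaive(line: String): Seq[String] = line.split(",", -1).toSeq

  // Quote-aware reader: commas inside double quotes are kept as data.
  def readQuoted(line: String): Seq[String] = {
    val fields = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          current.append('"') // doubled quote inside a quoted field
          i += 1
        } else if (c == '"') {
          inQuotes = false
        } else {
          current.append(c)
        }
      } else if (c == '"') {
        inQuotes = true
      } else if (c == ',') {
        fields += current.toString
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    fields += current.toString
    fields.toSeq
  }
}
```

With the naive pair, the row comes back as eight fields, so address receives "XXXX", shortname receives " Dubai", and status receives " UAE", which is not a number; that would be consistent with the null shown in the report. With the quoting pair, the six original values round-trip unchanged.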

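The default proposed in the thread for tempCSV can be looked at in isolation. This is a minimal stand-in, assuming the options are held as a Map[String, String]. OptionSketch, WriterSketch and chooseLoadPath are hypothetical names; only the tempCSV accessor and the if/else branch are taken from the thread.

```scala
// Hypothetical stand-in for CarbonOption; only the tempCSV accessor is modelled.
final case class OptionSketch(options: Map[String, String]) {
  // Default proposed in the thread: skip the temporary CSV unless it is requested.
  def tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean
}

object WriterSketch {
  // Mirrors the branch quoted from CarbonDataFrameWriter, but returns the name
  // of the path that would be taken instead of loading any data.
  def chooseLoadPath(opts: OptionSketch): String =
    if (opts.tempCSV) "loadTempCSV" else "loadDataFrame"
}
```

With this default, a caller who sets no option takes the direct DataFrame path, and passing "tempCSV" -> "true" opts back in to the temporary CSV path.
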