Hi,
+1 for the JSON loading proposal.
This will help with loading nested complex data types.
Currently, CSV loading supports only two levels of delimiters; JSON loading
can solve this problem.
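For illustration (hypothetical data), two delimiter levels can encode an array of structs, but a third level of nesting has no natural CSV encoding, while JSON expresses it directly:

```json
{
  "id": 1,
  "orders": [
    {
      "orderId": 10,
      "items": [
        { "sku": "A", "qty": 2 }
      ]
    }
  ]
}
```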

While adding JSON support for the SDK, I have already handled your points 1)
and 3); you can refer to and reuse the same:
"org.apache.carbondata.processing.loading.jsoninput.{*JsonInputFormat,
JsonStreamReader*}"
"org.apache.carbondata.processing.loading.parser.impl.*JsonRowParser*"

Yes, regarding point 2), you have to implement the iterator. While doing
this, try to support reading JSON and CSV files together from a folder:
give the CSV files to the CSV iterator and the JSON files to the JSON
iterator, so that loading them together is supported.
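A minimal sketch of that dispatch idea, routing each file by extension (the class and method names here are placeholders for illustration, not actual CarbonData classes):

```java
import java.util.ArrayList;
import java.util.List;

public class MixedFormatDispatch {

    // Decide which format-specific iterator should consume a file,
    // based on its extension. "CSV" / "JSON" stand in for the real
    // CSVRecordReaderIterator / JsonRecordReaderIterator.
    static String route(String fileName) {
        String lower = fileName.toLowerCase();
        if (lower.endsWith(".json")) {
            return "JSON";
        } else if (lower.endsWith(".csv")) {
            return "CSV";
        }
        return "UNSUPPORTED";
    }

    public static void main(String[] args) {
        List<String> files = List.of("part-0.csv", "part-1.json", "part-2.CSV");
        List<String> routes = new ArrayList<>();
        for (String f : files) {
            routes.add(f + " -> " + route(f));
        }
        System.out.println(routes);
    }
}
```

The real implementation would hand each group of files to its corresponding RecordReader iterator instead of returning a label.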

Also, for the insert-into-by-select flow, you can always send it to the
JSON flow by making loadModel.isJsonFileLoad() always true in
AbstractDataLoadProcessorStep, so that insert into / CTAS with nested
complex type data can be supported.

Also, I suggest you create a JIRA for this and attach a design document
there. In the document, also mention which load options are newly supported
for this (like record_identifier, to identify JSON data spanning multiple
lines).
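To show why such an option is needed, here is a rough sketch of splitting a stream of pretty-printed (multiline) JSON objects into individual records. This is only an illustration of the idea, not the actual JsonStreamReader logic: it tracks brace depth and, as a stated simplification, ignores braces inside string values, which a real reader must handle.

```java
import java.util.ArrayList;
import java.util.List;

public class MultilineJsonSplitter {

    // Split concatenated, possibly multiline JSON objects into records
    // by tracking '{' / '}' nesting depth. Simplification: braces that
    // appear inside quoted string values are not handled here.
    static List<String> split(String input) {
        List<String> records = new ArrayList<>();
        int depth = 0;
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c == '{') {
                depth++;
            }
            if (depth > 0) {
                current.append(c);
            }
            if (c == '}') {
                depth--;
                if (depth == 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String stream = "{\n  \"id\": 1\n}\n{\n  \"id\": 2,\n  \"nested\": { \"x\": 3 }\n}\n";
        System.out.println(split(stream).size()); // prints 2
    }
}
```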

Thanks,
AB

On Wed, Dec 5, 2018 at 3:54 PM Indhumathi <indhumathi...@gmail.com> wrote:

> Hello All,
>
> I am working on supporting data load using JSON file for CarbonSession.
>
> 1. JSON file loading will use JsonInputFormat. The JsonInputFormat can read
> two types of JSON-formatted data.
> i) The default expectation is that each JSON record is newline-delimited.
> This method is generally faster and is backed by the LineRecordReader you
> are likely familiar with. It will use SimpleJsonRecordReader to read a line
> of JSON and return it as a Text object.
> ii) The other method is 'pretty print' JSON records, where records span
> multiple lines and often have some type of root identifier. This method is
> likely slower, but respects record boundaries much like the
> LineRecordReader. The user has to provide the identifier by setting
> "json.input.format.record.identifier". This will use JsonRecordReader to
> read JSON records from a file; it respects split boundaries to complete
> full JSON records, as specified by the root identifier. JsonStreamReader
> handles byte-by-byte reading of a JSON stream, creating records based on a
> base 'identifier'.
>
> 2. Implement JsonRecordReaderIterator similar to CSVRecordReaderIterator
>
> 3. Use JsonRowParser, which converts JSON to a Carbon record
> (jsonToCarbonRecord) and generates a Carbon row.
>
> Please feel free to provide your comments and suggestions.
>
> Regards,
> Indhumathi M
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
