Normally, parallel processing of text input files is handled via Hadoop's 
TextInputFormat, which supports splitting files on line boundaries at 
(roughly) HDFS block boundaries.

There are various XML Hadoop InputFormats available, which try to sync up with 
splittable locations (typically by scanning forward from a split boundary to 
the next record start tag). The one I’ve used in the past 
<https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java>
 is part of the Mahout project.
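
For example, here's a minimal sketch of wiring that up in a plain MapReduce 
driver. The <record> start/end markers and the input path are made up, and I'm 
going from memory on the "xmlinput.start"/"xmlinput.end" config keys, so 
double-check those against the class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.mahout.classifier.bayes.XmlInputFormat;

    public class XmlSplitJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Everything between these two markers becomes one record, even when
        // a split boundary lands in the middle of a record.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = Job.getInstance(conf, "xml-splitting");
        job.setJarByClass(XmlSplitJob.class);
        job.setInputFormatClass(XmlInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("hdfs:///data/records.xml"));
        // ...set your mapper, output format/path, etc. as usual, then submit.
      }
    }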

If each JSON record is on its own line, then you can just use a regular text 
source and parse each line in a subsequent map function.
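
E.g., a sketch assuming Flink's DataSet API and Jackson, with a made-up path 
and field name (substitute whatever engine and parser you're actually using):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    public class JsonLinesJob {
      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One complete JSON record per line, so the normal line-based text
        // source splits cleanly across parallel readers.
        DataSet<String> lines = env.readTextFile("hdfs:///data/records.jsonl");

        DataSet<String> names = lines.map(new RichMapFunction<String, String>() {
          private transient ObjectMapper mapper;

          @Override
          public void open(Configuration parameters) {
            mapper = new ObjectMapper();  // one parser instance per parallel task
          }

          @Override
          public String map(String line) throws Exception {
            JsonNode record = mapper.readTree(line);
            return record.get("name").asText();
          }
        });

        names.print();
      }
    }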

Otherwise you can still create a custom input format, as long as there’s some 
unique JSON text that marks the beginning/end of each record (see the sketch 
after the link below).

See 
https://stackoverflow.com/questions/18593595/custom-inputformat-for-reading-json-in-hadoop
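
The skeleton of such a format (modeled on the Mahout XmlInputFormat above, not 
the exact code from that Stack Overflow answer) looks roughly like this; the 
config key names and marker strings would be whatever you pick:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class JsonRecordInputFormat extends FileInputFormat<LongWritable, Text> {
      public static final String START_TAG_KEY = "jsoninput.start";
      public static final String END_TAG_KEY = "jsoninput.end";

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
          TaskAttemptContext context) {
        return new JsonRecordReader();
      }

      public static class JsonRecordReader extends RecordReader<LongWritable, Text> {
        private byte[] startTag;
        private byte[] endTag;
        private long start;
        private long end;
        private FSDataInputStream fsin;
        private final DataOutputBuffer buffer = new DataOutputBuffer();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit is, TaskAttemptContext context) throws IOException {
          FileSplit split = (FileSplit) is;
          Configuration conf = context.getConfiguration();
          startTag = conf.get(START_TAG_KEY).getBytes(StandardCharsets.UTF_8);
          endTag = conf.get(END_TAG_KEY).getBytes(StandardCharsets.UTF_8);
          start = split.getStart();
          end = start + split.getLength();
          fsin = split.getPath().getFileSystem(conf).open(split.getPath());
          fsin.seek(start);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          // Only start a record whose begin marker falls inside this split.
          if (fsin.getPos() < end && readUntilMatch(startTag, false)) {
            try {
              buffer.write(startTag);
              if (readUntilMatch(endTag, true)) {
                key.set(fsin.getPos());
                value.set(buffer.getData(), 0, buffer.getLength());
                return true;
              }
            } finally {
              buffer.reset();
            }
          }
          return false;
        }

        // Scan forward a byte at a time until the marker is seen (optionally
        // copying bytes into the record buffer), or we hit EOF / pass the split end.
        private boolean readUntilMatch(byte[] match, boolean withinRecord) throws IOException {
          int i = 0;
          while (true) {
            int b = fsin.read();
            if (b == -1) return false;
            if (withinRecord) buffer.write(b);
            if (b == match[i]) {
              if (++i >= match.length) return true;
            } else {
              i = 0;
            }
            if (!withinRecord && i == 0 && fsin.getPos() >= end) return false;
          }
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException {
          return (fsin.getPos() - start) / (float) (end - start);
        }
        @Override public void close() throws IOException { fsin.close(); }
      }
    }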

And failing that, you can always build a list of file paths as your input, and 
then in your map function explicitly open/read each file and process it as you 
would any JSON file. In a past project where we had a similar requirement, the 
only interesting challenge was building N lists of files (for N mappers) such 
that the total file size was roughly equal for each parallel map task, as there 
was significant skew in the file sizes; a sketch of that balancing is below.
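
The sketch (not the actual project code) is just the classic greedy approach: 
sort the files largest-first and always hand the next file to whichever of the 
N lists currently has the fewest total bytes:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;

    public class FileListBalancer {

      /** One input list (paths plus running byte total) for one parallel map task. */
      public static class Bucket {
        public final List<String> paths = new ArrayList<>();
        public long totalBytes = 0;
      }

      public static List<Bucket> balance(Map<String, Long> fileSizes, int numBuckets) {
        // Largest files first, so the big ones get spread across buckets early.
        List<Map.Entry<String, Long>> files = new ArrayList<>(fileSizes.entrySet());
        files.sort(Comparator.comparingLong((Map.Entry<String, Long> e) -> e.getValue()).reversed());

        PriorityQueue<Bucket> buckets =
            new PriorityQueue<>(numBuckets, Comparator.comparingLong((Bucket b) -> b.totalBytes));
        for (int i = 0; i < numBuckets; i++) {
          buckets.add(new Bucket());
        }

        for (Map.Entry<String, Long> file : files) {
          Bucket smallest = buckets.poll();   // bucket with the fewest bytes so far
          smallest.paths.add(file.getKey());
          smallest.totalBytes += file.getValue();
          buckets.add(smallest);              // re-insert with its updated total
        }
        return new ArrayList<>(buckets);
      }
    }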

— Ken


> On Feb 4, 2019, at 9:12 AM, madan <madan.yella...@gmail.com> wrote:
> 
> Hi,
> 
> Can someone please tell me how to split a JSON/XML data file. Since they are 
> in structured form (i.e., parent/child hierarchy), is it possible to split 
> the file and process it in parallel with 2 or more instances of the source 
> operator? Also, please confirm whether my understanding of CSV splitting, as 
> described below, is correct:
> 
> When parallelism greater than 1 is used, the file will be split into more or 
> less equal parts, and each operator instance will get the start position of 
> its file partition. The start position of a partition may fall in the middle 
> of a delimited line, as shown below. When reading starts, the initial partial 
> record will be ignored by the respective operator instance, which then reads 
> the full records that come afterwards. I.e.,
> # Operator1 reads the emp1 and emp2 records (it reads emp2 since that 
> record's starting char position fell in its reading range)
> # Operator2 ignores the partial emp2 record and reads emp3 and emp4
> # Operator3 ignores the partial emp4 record and reads emp5
> The record delimiter is used to skip the partial record and to identify each 
> new record.
> 
> <csv_reader_positions.jpg>
> 
> 
> 
> -- 
> Thank you,
> Madan.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
