Fwd: XML to TEXT

2014-02-12 Thread Ranjini Rathinam

 Please help me convert this XML to text.

 I have attached the XML; please find the attachment.

 Some students have two address tags, some have one, and some have
 none.

 I need to convert the XML into a string.

 This is my desired output:

 100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
 101,nivetha,HOME,a street,ad street,ads street,chennai,tn
 102,siva


 In plain Java I have written this using recursion, but how do I write it
 in MapReduce? Please help.

 Thanks in advance.
  Regards,
 Ranjini R
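
 The recursive flattening asked about here can be sketched in plain Java as
 below. The attached XML is not shown in the thread, so the element names
 (student, address, type, street, city) are hypothetical, and the class and
 method names are made up for illustration. The sketch assumes Java 8 or
 later and that each call receives one complete student record.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StudentFlattener {

    // Recursively appends the text of every leaf element under 'node', in document order.
    static void collectLeaves(Node node, List<String> out) {
        boolean hasElementChild = false;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collectLeaves(child, out);
            }
        }
        // An element without element children is a leaf: keep its text.
        if (!hasElementChild) {
            out.add(node.getTextContent().trim());
        }
    }

    // Turns one student record (a well-formed XML fragment) into a single CSV line.
    static String flatten(String studentXml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(studentXml)))
                .getDocumentElement();
        List<String> fields = new ArrayList<>();
        collectLeaves(root, fields);
        return String.join(",", fields);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical element names; the real attachment may differ.
        System.out.println(flatten(
                "<student><id>101</id><name>nivetha</name>"
                + "<address><type>HOME</type><street>a street</street>"
                + "<city>chennai</city></address></student>"));
        System.out.println(flatten("<student><id>102</id><name>siva</name></student>"));
    }
}
```

 In a MapReduce job, a map call that receives one whole student record
 (for example from an XML-aware input format) could call flatten on the
 record text and write the result as its output. Because the recursion
 simply walks whatever children exist, zero, one, or two address elements
 need no special handling.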


 On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam 
 ranjinibe...@gmail.com wrote:

 Hi,

 It's working fine. The problem was in the XML: the spaces I had put in it.

 Thanks a lot.

 Regards,
 Ranjini.R

  On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez 
 diego.gutier...@ucsp.edu.pe wrote:

  Hi,

 I'm sending you the eclipse project with the code. Hope this helps.

 Regards
 Diego Gutiérrez



 2014/1/9 Ranjini Rathinam ranjinibe...@gmail.com

 Hi,

 I am using Java 1.6, Hadoop 0.20 and Ubuntu 12.04.

 If possible please send the jar and code for review.

 Thanks for the support,

 Ranjini

  On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez 
 diego.gutier...@ucsp.edu.pe wrote:

   Hi,

 I've noticed that your XML file has line breaks. By default, Hadoop
 splits every file into lines and passes them to the map function; in other
 words, each map call processes one line of the file. Please remove the
 line breaks from your XML and try again. I've tested here with your XML
 file, changing only this line:

 DTMNodeList list = (DTMNodeList)
 getNode("/Company/Employee", doc, XPathConstants.NODESET);

 and this is the output in result.txt:


 id,name
 100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
 1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur


 Note: I don't know whether the Java or Hadoop version could be the
 problem here. I'm using Ubuntu 12.04, Oracle Java 7 and Hadoop 2.2.0.


 If you want, I can send you the jar file with the code :)

 Regards
 Diego Gutiérrez.




Re: XML to TEXT

2014-02-12 Thread Shekhar Sharma
Which input format are you using? Use an XML input format.
XML to TEXT

2014-01-08 Thread Ranjini Rathinam
Hi,

As suggested, I tried the code, but result.txt contained only the
header. Nothing else was printed.

After debugging I found that parsing produces no value.

The problem is in the lines marked in bold below. When I added a SysOut,
no value was printed for them.

String xmlContent = value.toString();

InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
DocumentBuilderFactory factory =
        DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
try {
    builder = factory.newDocumentBuilder();
    *Document doc = builder.parse(is);*
    *String ed = doc.getDocumentElement().getNodeName();*
    out.write(ed.getBytes());
    DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
            doc, XPathConstants.NODESET);


When I print

out.write(xmlContent.getBytes()): the whole XML is printed.

Then I added a SysOut for list, and nothing was printed.
out.write(ed.getBytes()): nothing is printed.

Please suggest where I am going wrong and help me fix this.

Thanks in advance.

I have attached my code. Please review.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory =
            XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);

            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



Main class:

public class MainXml {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        if (args.length != 2) {
            System.err.println(
                    "Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);

        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}



My XML file:

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>


Thanks in advance
Ranjini. R


Re: XML to TEXT

2014-01-07 Thread Ranjini Rathinam
Hi,

I am using Hive. As suggested, I am using xpath in the select clause, but
I get an "invalid expression" error.

Please give a sample showing how to process XML in Hive.

Thanks in advance

Ranjini

Re: XML to TEXT

2014-01-06 Thread Ranjini Rathinam
Hi,

Thanks a lot .

Ranjini

Re: XML to TEXT

2014-01-06 Thread Rajesh Nagaraju
Hi Ranjini,

Can you use Hive? Then you can just use XPath expressions in your select
clause.

cheers
R+


Re: XML to TEXT

2014-01-03 Thread Ranjini Rathinam
Hi,

I used XMLInputFormat with the RecordReader class, the same as you
gave.

The whole XML is split into parts. For example, consider the XML below:

<Comp><Emp><id></id><name></name></Emp><Emp><id></id><name></name></Emp></Comp>

After using the RecordReader class, the XML output is

<Emp><id></id><name></name></Emp><Emp><id></id><name></name></Emp>

The start and end tags are <Emp> and </Emp>.

It is not converted into text.

Please suggest and help.

Thanks in advance

Ranjini
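
When a record reader hands the mapper one <Emp>...</Emp> block at a time,
each block is a small document of its own whose root is Emp, so paths have
to start at /Emp rather than /Comp/Emp. A minimal sketch of converting one
such record to text, using only JDK classes, might look like this (the
class and method names are made up for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EmpRecordToText {

    // Evaluates an XPath expression against one record and returns its string value.
    // An expression that matches nothing yields the empty string.
    static String valueAt(String record, String path) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(record)));
        return XPathFactory.newInstance().newXPath().evaluate(path, doc);
    }

    // Converts a single <Emp>...</Emp> record into "id,name".
    static String toText(String record) throws Exception {
        // The record is its own document, so paths start at /Emp.
        return valueAt(record, "/Emp/id") + "," + valueAt(record, "/Emp/name");
    }

    public static void main(String[] args) throws Exception {
        String record = "<Emp><id>100</id><name>RR</name></Emp>";
        System.out.println(toText(record));
        // A path rooted at the original document element matches nothing here.
        System.out.println("[" + valueAt(record, "/Comp/Emp/id") + "]");
    }
}
```

Inside a mapper, the record text would come from value.toString(), and the
resulting line would be written as the map output.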

Re: XML to TEXT

2014-01-03 Thread Diego Gutierrez
Hi,

I suggest using XPath, which Java supports natively for querying XML.

For the main problem, as with the distcp command (
http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a
reduce function, because you can parse the XML input file and create the
file you need in the map function. For example, the following code reads an
XML file in HDFS, parses it and creates a new file ( /result.txt ) in the
expected format:
id,name
100,RR


Mapper function:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;

public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final XPathFactory xpathFactory =
            XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String resultFileName = "/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));

        InputStream resultIS = new ByteArrayInputStream(new byte[0]);

        String header = "id,name\n";
        out.write(header.getBytes());

        String xmlContent = value.toString();
        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
                    XPathConstants.NODESET);

            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }

        } catch (ParserConfigurationException e) {
            MyLogguer.log("error: " + e.getMessage());
            e.printStackTrace();
        } catch (SAXException e) {
            MyLogguer.log("error: " + e.getMessage());
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            MyLogguer.log("error: " + e.getMessage());
            e.printStackTrace();
        }

        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}
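
The parsing core of this mapper can be tried outside Hadoop. The following
is a minimal sketch, not the code from this thread: it assumes Java 8 or
later, casts the XPath result to the public org.w3c.dom.NodeList interface
instead of the JDK-internal DTMNodeList, and skips whitespace-only text
nodes so that indented XML does not produce empty fields. The class and
method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlToCsv {

    // Returns one comma-separated line per node matched by recordPath.
    static List<String> toCsvLines(String xml, String recordPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // NODESET results implement the public NodeList interface.
        NodeList records = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(recordPath, doc, XPathConstants.NODESET);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            List<String> fields = new ArrayList<>();
            NodeList children = records.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Keep element children only; text nodes between them are whitespace.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.add(child.getTextContent().trim());
                }
            }
            lines.add(String.join(",", fields));
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<main><data><id>100</id><name>RR</name></data></main>";
        System.out.println("id,name");
        for (String line : toCsvLines(xml, "/main/data")) {
            System.out.println(line);
        }
    }
}
```

In a mapper, each resulting line could be emitted with context.write
rather than written to a fixed HDFS path, which avoids every map call
recreating the same file.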



--
Main class:


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println(
                    "Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Map-only job: no reduce phase is needed.
        job.setMapperClass(XmlToTextMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}

To execute the job you can use :

 bin/hadoop Main /data.xml /output.


Then you can use this to see result.txt file:

  hadoop fs -cat /result.txt


I'm using this XML as input:

<main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>

and the content in result.txt is like

XML to TEXT

2014-01-02 Thread Ranjini Rathinam
Hi,

I need to convert XML into text using MapReduce.

I have used the DOM and SAX parsers.

After using SAX Builder in the mapper class, the child node acts as the
root element.

Looking at the SysOut output, I found that the root element is taken from
the child elements.

For example,

<Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
when this XML is passed to the mapper and the root element is printed with
SysOut, I get the root element as

id
name

Please suggest how to fix this.

I need to convert the XML into text using MapReduce code. Please provide
an example.

The required output is

id,name
100,RR

Please help.

Thanks in advance,
Ranjini R


Re: XML to TEXT

2014-01-02 Thread Azuryy Yu
Hi,

You can use org.apache.hadoop.streaming.StreamInputFormat with MapReduce
to convert XML to text.

Suppose your XML looks like this:
<xml>
  <name>lll</name>
</xml>

You need to specify stream.recordreader.begin and stream.recordreader.end
in the Configuration:
Configuration conf = new Configuration();
conf.set("stream.recordreader.begin", "<xml>");
conf.set("stream.recordreader.end", "</xml>");
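
The idea behind the begin and end markers can be illustrated in plain Java.
This is only a sketch of marker-based record extraction, not the Hadoop
streaming implementation; the class and method names are made up for
illustration, and it assumes the markers appear literally in the text and
that records do not nest.

```java
import java.util.ArrayList;
import java.util.List;

class MarkerRecordSplitter {

    // Returns every substring that starts at 'begin' and ends with the next 'end'.
    static List<String> split(String text, String begin, String end) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(begin, from);
            if (start < 0) {
                break; // no further record starts
            }
            int stop = text.indexOf(end, start + begin.length());
            if (stop < 0) {
                break; // unterminated record: ignore it
            }
            records.add(text.substring(start, stop + end.length()));
            from = stop + end.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<Comp><Emp><id>100</id><name>RR</name></Emp>"
                + "<Emp><id>101</id><name>SS</name></Emp></Comp>";
        // Each printed record would reach the mapper as one value.
        for (String record : split(xml, "<Emp>", "</Emp>")) {
            System.out.println(record);
        }
    }
}
```

Each extracted record is a complete element, so the mapper can parse it
as a standalone document even when the input file spans many lines.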





