hadoop data structures
hi,

I've got this code which extracts timeframes from a logfile and does some calculation on them. Input lines look like this:

1000,T,0,104,1000,1100,27147,80,80,80,80,81,81,98,98,98,101,137,137,139,177,177,177,173,166,149,134,130,124,119,111,104,92
1000,T,1,743,300,300,4976,492,492,492,492,492,497,497,856,856,863,866,875,875,954,954,954,954,954,954,954,954,770,770,770,770,743
1000,T,2,40,800,1000,11922,29,29,29,29,29,29,29,44,46,46,50,51,51,65,65,65,61,52,47,47,47,44,42,40,32,30
2001,T,0,103,6700,7000,44658,80,80,80,80,80,81,96,98,98,101,134,137,139,220,192,176,168,162,156,149,144,132,122,112,104,95
1002,U,

The first value is the time in ms, T marks the lines I'm interested in, 0/1/2 is a product ID, and 104/743/40/103 is the price I want.

Now I need to extract all prices for some specific timeframe, let's say 3000 ms. The code at the end works, but it has the problem that the variable "numberOfRuns" is counted up and used to calculate the time, and I guess that approach doesn't work in Hadoop. So I need a way to extract the "timeframes" in the mapper. What data structure would you use?
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Test {

    // Splits the log lines into consecutive buckets of timeFrame milliseconds,
    // keeping only the "T" lines. Relies on the input being sorted by time.
    public List<List<String>> splitFileByTime(List<String> lines, int timeFrame) {
        List<List<String>> myTimes = new ArrayList<List<String>>();
        List<String> lines_new = new ArrayList<String>();
        int numberOfRuns = 1;
        for (String current : lines) {
            String[] parts = current.split(",");
            int time = Integer.parseInt(parts[0]);
            if (time < 0) {
                continue; // times before the simulation starts, not interesting
            }
            if (parts[1].contains("T")) {
                lines_new.add(current);
            }
            if (time >= timeFrame * numberOfRuns) {
                numberOfRuns++;
                myTimes.add(lines_new);
                lines_new = new ArrayList<String>();
            }
        }
        return myTimes;
    }

    // Prints open/high/low/close for product 0 in every time bucket.
    public void getOpenAndClose(List<List<String>> lines) {
        int section = 1;
        for (List<String> x : lines) {
            System.out.println("Section: " + section);
            List<Integer> prices = new ArrayList<Integer>();
            int high = Integer.MIN_VALUE;
            int low = Integer.MAX_VALUE; // was 1 in the original, so low was never updated for prices above 1
            for (String y : x) {
                String[] parts = y.split(",");
                if ("0".equals(parts[2])) { // product ID 0; the original contains("0") would also match "10", "20", ...
                    int price = Integer.parseInt(parts[3]);
                    if (price > high) high = price;
                    if (price < low) low = price;
                    System.out.println("Product " + parts[2] + " was traded at " + parts[0] + " with price " + price);
                    prices.add(price);
                }
            }
            System.out.println("open: " + prices.get(0));
            System.out.println("close: " + prices.get(prices.size() - 1));
            System.out.println("high: " + high);
            System.out.println("low: " + low);
            section++;
        }
    }

    public List<String> readFile(String filename) {
        List<String> lines = new ArrayList<String>();
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return lines;
    }

    public static void main(String[] args) {
        //String filename = "Standard-2014-04-29-12-04.csv";
        String filename = "Standard-small.txt";
        // time span per bucket in milliseconds
        int timeFrame = 3000;

        Test x = new Test();
        List<String> lines = x.readFile(filename);
        List<List<String>> lines_split = x.splitFileByTime(lines, timeFrame);
        x.getOpenAndClose(lines_split);
    }
}
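One way around the stateful numberOfRuns counter: each mapper only sees its own input split, and lines may arrive in any order, so a running counter cannot work there. The frame index can instead be computed directly from the timestamp (time / timeFrame). A standalone sketch of that idea (class and method names are mine, not from the original; in an actual Hadoop job the frame index would become the mapper's output key, e.g. an IntWritable, with "productId,price" as the value):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Groups prices by timeframe index derived from the timestamp alone,
// so the grouping is independent of input order and of how the file is
// split across mappers.
public class TimeframeExtractor {

    // Input lines look like "time,type,productId,price,...".
    public static TreeMap<Integer, List<Integer>> pricesByFrame(List<String> lines, int timeFrameMs) {
        TreeMap<Integer, List<Integer>> frames = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length < 4 || !"T".equals(parts[1])) {
                continue; // skip incomplete lines such as "1002,U,"
            }
            int time = Integer.parseInt(parts[0]);
            if (time < 0) {
                continue; // times before the simulation starts
            }
            int frame = time / timeFrameMs; // 0 for 0..2999 ms, 1 for 3000..5999 ms, ...
            frames.computeIfAbsent(frame, k -> new ArrayList<>())
                  .add(Integer.parseInt(parts[3]));
        }
        return frames;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "1000,T,0,104,1000,1100",
            "2001,T,0,103,6700,7000",
            "1002,U,",
            "6500,T,0,99,1,2");
        System.out.println(pricesByFrame(lines, 3000));
    }
}
```

In a MapReduce translation, `context.write(new IntWritable(frame), value)` replaces the map insertion, and the shuffle does the grouping for you.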
Re: hadoop data structures
Are you asking about the type for the numberOfRuns variable, which you are declaring as a Java primitive int? If yes, then you can use the IntWritable class in Hadoop to define an integer variable which will work with M/R.

Regards,
Shahab

On Tue, Dec 9, 2014 at 3:47 AM, steven wrote:
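The reduce side of such a job would redo what getOpenAndClose does per bucket. A standalone sketch of that calculation, assuming the prices of one timeframe arrive already sorted by timestamp (in a real reducer the values are unordered, so this needs a secondary sort on the time field), and with the low tracker initialised correctly (the original's low = 1 could never rise to an actual minimum above 1):

```java
import java.util.List;

// Open/high/low/close over a time-ordered list of prices for one timeframe.
public class Ohlc {
    final int open, high, low, close;

    Ohlc(List<Integer> pricesInTimeOrder) {
        open = pricesInTimeOrder.get(0);
        close = pricesInTimeOrder.get(pricesInTimeOrder.size() - 1);
        int hi = Integer.MIN_VALUE, lo = Integer.MAX_VALUE; // not 0 and 1 as in the original
        for (int p : pricesInTimeOrder) {
            hi = Math.max(hi, p);
            lo = Math.min(lo, p);
        }
        high = hi;
        low = lo;
    }

    public static void main(String[] args) {
        Ohlc o = new Ohlc(List.of(104, 743, 40, 103));
        System.out.println("open=" + o.open + " high=" + o.high
                + " low=" + o.low + " close=" + o.close);
    }
}
```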
Re: How to get hadoop issues data for research?
You can use the REST API. Example:

https://issues.apache.org/jira/rest/api/2/search?jql=project%20%3D%20HADOOP

This general@ mailing list is for announcements and project management. For end-user questions and discussions, please use the user@ mailing list.

Regards,
Akira

(12/9/14, 18:22), zfx wrote:
> Hi, all
> I am a graduate student at Peking University; our lab does research on open source projects. This is our introduction: https://passion-lab.org/
> Now we need Hadoop issues data for research. I found the issues list: https://issues.apache.org/jira/issues/?jql=project%20%3D%20HADOOP
> I want to download the Hadoop issues data. Could anyone tell me how to download it? Or is there a link or API for downloading the data?
> Many thanks!
> Best regards,
> Feixue, Zhang
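The search endpoint is paged via the startAt and maxResults query parameters, so a bulk download walks the pages until the reported total is reached. A small sketch that just builds the paged URLs (the JQL string and page size here are examples, not from the mail above; fetch each URL with any HTTP client):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds paged JIRA REST search URLs for bulk-downloading issue data.
public class JiraSearchUrl {

    static final String BASE = "https://issues.apache.org/jira/rest/api/2/search";

    static String page(String jql, int startAt, int maxResults) {
        return BASE + "?jql=" + URLEncoder.encode(jql, StandardCharsets.UTF_8)
             + "&startAt=" + startAt + "&maxResults=" + maxResults;
    }

    public static void main(String[] args) {
        // First page of HADOOP issues, 50 at a time; increment startAt by 50
        // on each request until it reaches the "total" field of the response.
        System.out.println(page("project = HADOOP", 0, 50));
    }
}
```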
Eclipse plugin for Hadoop2.5.2
Hi hadoopers,

I am new to Hadoop. I am using Hadoop 2.5.2 with YARN as the MR framework. I would like to ask about the two ports, the M/R (v2) master port and the DFS master port, that are to be configured in the Eclipse Hadoop plugin view.

Which properties do these ports correspond to in the Hadoop configuration files, e.g. yarn-site.xml?

Thanks.
Re: Eclipse plugin for Hadoop2.5.2
bq. M/R(v2) master port

Did you mean the port for the ResourceManager?

Take a look at ./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml, where you can find:

yarn.resourcemanager.bind-host
yarn.nodemanager.bind-host

Cheers

On Tue, Dec 9, 2014 at 6:51 AM, Todd wrote:
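For reference, the Eclipse plugin predates YARN (its "M/R Master" originally meant the MRv1 JobTracker), so the mapping is only approximate: the closest YARN equivalent is the ResourceManager client RPC port from yarn.resourcemanager.address, and the "DFS Master" port is the NameNode RPC port embedded in fs.defaultFS. A hedged example, with placeholder hostnames and the usual default ports (verify against your own configuration):

```xml
<!-- yarn-site.xml: ResourceManager client RPC endpoint (default port 8032) -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager-host:8032</value>
</property>

<!-- core-site.xml: NameNode RPC endpoint; its port is the "DFS Master" port -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>
```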
Re:Re: Eclipse plugin for Hadoop2.5.2
I pasted the image below to show what I mean: there are two ports there, M/R (v2) Master and DFS Master. I wonder where these two ports come from.

At 2014-12-09 22:59:17, "Ted Yu" wrote:
Re:Re:Re: Eclipse plugin for Hadoop2.5.2
I figured out that the default is 50020. Thanks, Ted.

At 2014-12-09 23:06:37, "Todd" wrote:
Re: API to find current active namenode.
I was more interested in a way to do it programmatically. I found out today that

Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
String ns = conf.get("fs.defaultFS");
FileSystem fs = FileSystem.get(conf);

does what I need without having to care about which namenode is active.

/Magnus

On 2014-12-08 22:18, Andras POTOCZKY wrote:
> hi
>
> # sudo -u hdfs hdfs haadmin -getServiceState nn1
> active
> # sudo -u hdfs hdfs haadmin -getServiceState nn2
> standby
>
> Where nn1 and nn2 are the dfs.ha.namenodes.mycluster property values.
>
> Is this what you need?
>
> Andras
>
> On 2014.12.08. 21:12, Magnus Runesson wrote:
>> I am developing an application that will access HDFS. Is there a single API to get the current active namenode? I want it to be independent of whether my cluster has an HA NameNode deployed or a single NameNode. The typical Hadoop client configuration files will be installed on the host.
>>
>> /Magnus
Possible typo in the Hadoop "Latest Stable Release Page"
I'm looking at this page: http://hadoop.apache.org/docs/stable/

Is it a typo that Hadoop 2.6.0 is said to be based on 2.4.1?

Thanks.
Question about container recovery
Hi, all

Here is my question: is there a mechanism by which, when one container exits abnormally, YARN will prefer to dispatch the replacement container on another NM?

We have a cluster with 3 NMs (each NM has 135 GB of memory) and 1 RM, and we are running a job which starts 13 containers (= 1 AM + 12 executor containers). Each NM runs 4 executor containers, and the memory configured for each executor container is 30 GB.

Here is an interesting test: when we killed 4 containers on NM1, only 2 containers restarted on NM1; the other 2 containers were reserved on NM2 and NM3. Any ideas?

Fei.