I have created an umbrella JIRA, HAMA-536, for creating the InputFormats/OutputFormats, with three sub-tasks. For now I have assigned the tasks to myself; let me know if anyone is interested.
Praveen

On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <[email protected]> wrote:

> > I can open a JIRA. I need input on which InputFormats make sense and
> > their priority. Some we can port from Hadoop.
>
> Yep, you're right. I guess a single JIRA would be enough for the formats
> already implemented in Hadoop; for the others we need subclasses.
> Formats that I really wanted to have would be:
>
> - DBInputFormat[1]
> - XMLInputFormat
> - NLineInputFormat
> - CSVInputFormat (we could use OpenCSV for that in conjunction with
>   TextInputFormat)
> - JSONInputFormat (for OpenGraph stuff)
> - The graph DB formats: Neo4J and whatever the others are called
>
> Anything I missed for "full" coverage?
>
> > Could you please elaborate on this?
>
> Sure, DMOZ is some kind of crawled website database. It is used in some
> pagerank examples for testing; I don't know if it was in Mahout. We could
> also use it since we have pagerank as well.
> CommonCrawl is a new, upcoming DMOZ-like database of many crawled sites; it
> is hosted on S3 in the Amazon cloud. We run on EC2 via Whirr, so this could
> be a cool example as well.
>
> [1]
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>
> On 25 March 2012 at 14:56, Praveen Sripati <[email protected]> wrote:
>
> > Thomas et al,
> >
> > > Would someone please open JIRAs for that?
> >
> > I can open a JIRA. I need input on which InputFormats make sense and
> > their priority. Some we can port from Hadoop.
> >
> > > Based on XML we can implement a format that parses DMOZ or commoncrawl
> > > on Amazon S3.
> >
> > Could you please elaborate on this?
> >
> > Praveen
> >
> > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <[email protected]> wrote:
> >
> > > As I understand, many iterative applications don't require key-value
> > > input/output and additionally need random access (read/write) to a
> > > particular file. An I/O interface, e.g. MPI, may increase flexibility
> > > here.
> > >
> > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> > >
> > > On 25 March 2012 10:01, Praveen Sripati <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > For Hama there are limited input formats:
> > > >
> > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > > SequenceFileInputFormat, TextInputFormat
> > > >
> > > > Does it make sense to have more input formats? I was thinking of
> > > > InputFormats for graph databases.
> > > >
> > > > Any feedback on the different input formats is welcome.
> > > >
> > > > I quickly glanced at Giraph and Hadoop, and they have more
> > > > InputFormats, which makes it easy to plug them into external systems.
> > > >
> > > > Praveen
>
> --
> Thomas Jungblut
> Berlin <[email protected]>
