Nice discussion! BTW, is anyone interested in contributing HBase table input/output formatters?

On Mon, Mar 26, 2012 at 2:27 AM, Thomas Jungblut <[email protected]> wrote:
> Thanks for your time.
> I have tweeted about the graph DB formats. I know some of my followers are
> working with them, so they might be interested.
>
> On 25 March 2012 at 19:25, Praveen Sripati <[email protected]> wrote:
>
>> I have created umbrella JIRA HAMA-536 for creating the
>> InputFormats/OutputFormats, with three sub-tasks. For now I have assigned
>> the tasks to myself; let me know if anyone is interested.
>>
>> Praveen
>>
>> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <[email protected]> wrote:
>>
>> > > I can open a JIRA. I need input on which InputFormats make sense and
>> > > their priority. Some we can port from Hadoop.
>> >
>> > Yep, you're right. I guess a single JIRA would be enough for the formats
>> > already implemented in Hadoop; for the others we need subclasses.
>> > Formats that I really wanted to have would be:
>> >
>> > - DBInputFormat[1]
>> > - XMLInputFormat
>> > - NLineInputFormat
>> > - CSVInputFormat (we could use OpenCSV for that in conjunction with
>> >   TextInputFormat)
>> > - JSONInputFormat (for OpenGraph stuff)
>> > - The graph DB formats (Neo4J and whatever the others are called)
>> >
>> > Anything I missed for a "full" coverage?
>> >
>> > > Could you please elaborate on this?
>> >
>> > Sure, DMOZ is some kind of crawled website database. It is used in some
>> > PageRank examples for testing; I don't know if it was in Mahout. We could
>> > also use it, since we have PageRank as well.
>> > CommonCrawl is a new, upcoming DMOZ-like database of many crawled sites.
>> > It is hosted on S3 in the Amazon cloud. We run on EC2 via Whirr, so this
>> > could be a cool example as well.
>> >
>> > [1]
>> > http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>> >
>> > On 25 March 2012 at 14:56, Praveen Sripati <[email protected]> wrote:
>> >
>> > > Thomas et al,
>> > >
>> > > > Would someone please open JIRAs for that?
>> > >
>> > > I can open a JIRA. I need input on which InputFormats make sense and
>> > > their priority. Some we can port from Hadoop.
>> > >
>> > > > Based on XML we can implement a format that parses DMOZ or
>> > > > commoncrawl on Amazon S3.
>> > >
>> > > Could you please elaborate on this?
>> > >
>> > > Praveen
>> > >
>> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <[email protected]> wrote:
>> > >
>> > > > As I understand it, many iterative applications don't require
>> > > > key/value input/output and additionally need random access
>> > > > (read/write) to a particular file. An I/O interface, e.g. MPI, may
>> > > > increase flexibility here.
>> > > >
>> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
>> > > >
>> > > > On 25 March 2012 10:01, Praveen Sripati <[email protected]> wrote:
>> > > > > Hi,
>> > > > >
>> > > > > For Hama there are limited input formats:
>> > > > >
>> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
>> > > > > SequenceFileInputFormat, TextInputFormat
>> > > > >
>> > > > > Does it make sense to have more input formats? I was thinking of
>> > > > > InputFormats for graph databases.
>> > > > >
>> > > > > Any feedback on the different input formats is welcome.
>> > > > >
>> > > > > I quickly glanced at Giraph and Hadoop, and they have more
>> > > > > InputFormats, which makes it easy to plug them into external
>> > > > > systems.
>> > > > >
>> > > > > Praveen
>> >
>> > --
>> > Thomas Jungblut
>> > Berlin <[email protected]>
>
> --
> Thomas Jungblut
> Berlin <[email protected]>

--
Best Regards, Edward J. Yoon
@eddieyoon
