Hi Antonio,

Certainly, a JIRA ticket with a patch would be fantastic.
Thanks!

Josh

On Wed, Sep 28, 2016 at 12:08 PM, Antonio Murgia <[email protected]> wrote:
> Thank you very much for your insights, Josh. If I decide to develop a small
> Phoenix library that does, through Spark, what the CSV loader does, I'll
> surely write to the mailing list, or open a JIRA, or maybe even open a PR,
> right?
>
> Thank you again
>
> #A.M.
>
> On 09/28/2016 05:10 PM, Josh Mahonin wrote:
>
> Hi Antonio,
>
> You're correct: the phoenix-spark output uses the Phoenix Hadoop
> OutputFormat under the hood, which effectively does a parallel, batched
> JDBC upsert. It should scale with the number of Spark executors, the
> RDD/DataFrame parallelism, and the number of HBase RegionServers, though
> admittedly there's a lot of overhead involved.
>
> The CSV bulk loading tool uses MapReduce; it's not integrated with Spark.
> It's likely possible to do so, but it's probably a non-trivial amount of
> work. If you're interested in taking it on, I'd start by looking at the
> following classes:
>
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
> https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala
>
> Good luck,
>
> Josh
>
> On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia <[email protected]> wrote:
>
>> Hi,
>>
>> I would like to perform a bulk insert to HBase using Apache Phoenix from
>> Spark. I tried the Apache Phoenix Spark library but, as far as I could
>> tell from the code, it performs a JDBC batch of upserts (am I right?).
>> Instead, I want to perform a bulk load like the one described in this
>> blog post (https://zeyuanxy.github.io/HBase-Bulk-Loading/), but taking
>> advantage of the automatic conversion from Java/Scala types to bytes.
>>
>> I'm currently using Phoenix 4.5.2, so I cannot use Hive to manipulate
>> the Phoenix table, and if possible I want to avoid spawning a MapReduce
>> job that reads data from CSV
>> (https://phoenix.apache.org/bulk_dataload.html). Really, I just want to
>> do what the CSV loader does with MapReduce, but programmatically with
>> Spark (since the data I want to persist is already loaded in memory).
>>
>> Thank you all!
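For reference, the phoenix-spark path Josh describes looks roughly like this (a minimal sketch following the phoenix-spark documentation; the table name and ZooKeeper URL are placeholders, and a DataFrame whose columns match the target Phoenix table is assumed):

    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.apache.spark.{SparkConf, SparkContext}

    object PhoenixUpsertSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("phoenix-upsert"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Columns must match the target Phoenix table's schema.
        val df = Seq((1L, "foo"), (2L, "bar")).toDF("ID", "COL1")

        // Under the hood this goes through PhoenixOutputFormat: each Spark
        // task opens a Phoenix JDBC connection and executes batched UPSERTs,
        // so throughput scales with DataFrame parallelism and RegionServers.
        df.write
          .format("org.apache.phoenix.spark")
          .mode(SaveMode.Overwrite)         // required by the connector; semantics are still UPSERT
          .option("table", "EXAMPLE_TABLE") // placeholder table name
          .option("zkUrl", "zkhost:2181")   // placeholder ZooKeeper quorum
          .save()
      }
    }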
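And for the bulk-load route Antonio is after, here is a rough sketch of the HFile approach from the linked blog post, adapted to Spark. All table, family, column, and path names are hypothetical, the HBase 1.1+ client API is assumed, and note that it writes raw HBase bytes rather than Phoenix-encoded values; producing proper Phoenix encodings is exactly the gap the CsvBulkLoadTool classes above would fill:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkHFileBulkLoadSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hfile-bulk-load"))
        val conf = HBaseConfiguration.create()
        val table = TableName.valueOf("EXAMPLE_TABLE") // hypothetical table

        // Data already in memory, as in Antonio's use case.
        val data = sc.parallelize(Seq(("row1", "foo"), ("row2", "bar")))

        // HFileOutputFormat2 requires cells in total row-key order, hence the
        // sort. A production version would also partition to match region
        // boundaries (see HFileOutputFormat2.configureIncrementalLoad).
        val kvs = data.sortByKey().map { case (row, value) =>
          val kv = new KeyValue(Bytes.toBytes(row), Bytes.toBytes("0"),
            Bytes.toBytes("COL1"), Bytes.toBytes(value)) // raw bytes, not Phoenix-encoded
          (new ImmutableBytesWritable(Bytes.toBytes(row)), kv)
        }

        // Write HFiles to a staging directory instead of upserting over JDBC.
        kvs.saveAsNewAPIHadoopFile("/tmp/hfiles", classOf[ImmutableBytesWritable],
          classOf[KeyValue], classOf[HFileOutputFormat2], conf)

        // Hand the HFiles to the RegionServers (the "completebulkload" step).
        val connection = ConnectionFactory.createConnection(conf)
        try {
          new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"),
            connection.getAdmin, connection.getTable(table),
            connection.getRegionLocator(table))
        } finally {
          connection.close()
        }
      }
    }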
