Hi Antonio,

You're correct: the phoenix-spark integration uses the Phoenix Hadoop OutputFormat under the hood, which effectively performs a parallel, batched JDBC upsert. It scales with the number of Spark executors, the RDD/DataFrame parallelism, and the number of HBase RegionServers, though admittedly there's a fair amount of overhead involved.
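For reference, that upsert path looks roughly like this from the Spark side (a minimal sketch; the table name and ZooKeeper URL are placeholders for your environment, and you'd need the phoenix-spark artifact matching your Phoenix version on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
// Brings in the saveToPhoenix implicits from DataFrameFunctions.scala
import org.apache.phoenix.spark._

object PhoenixUpsertSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("phoenix-upsert"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder data; column names must match the Phoenix table's columns.
    val df = sc.parallelize(Seq((1L, "foo"), (2L, "bar"))).toDF("ID", "NAME")

    // Each partition opens a Phoenix JDBC connection and runs batched
    // UPSERTs via PhoenixOutputFormat/PhoenixRecordWriter.
    df.saveToPhoenix("MY_TABLE", zkUrl = Some("zkhost:2181"))
  }
}
```

Parallelism here comes from the DataFrame's partitioning, so repartitioning before the save is the main knob you have for write throughput.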
The CSV bulk loading tool uses MapReduce; it's not integrated with Spark. Integrating the two is likely possible, but it's probably a non-trivial amount of work. If you're interested in taking it on, I'd start by looking at the following classes:

https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala

Good luck,

Josh

On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia <antonio.mur...@eng.it> wrote:
> Hi,
>
> I would like to perform a bulk insert into HBase using Apache Phoenix from
> Spark. I tried the Apache Phoenix Spark library but, as far as I could
> tell from the code, it looks like it performs a JDBC batch of upserts (am
> I right?). Instead I want to perform a bulk load like the one described
> in this blog post (https://zeyuanxy.github.io/HBase-Bulk-Loading/), but
> taking advantage of the automatic conversion between Java/Scala types and
> bytes.
>
> I'm currently using Phoenix 4.5.2, so I cannot use Hive to manipulate the
> Phoenix table, and if possible I want to avoid spawning an MR job that
> reads the data from CSV
> (https://phoenix.apache.org/bulk_dataload.html). I just want to do what
> the CSV loader does with MR, but programmatically with Spark (since the
> data I want to persist is already loaded in memory).
>
> Thank you all!
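P.S. In case it helps scope the work: the Spark-side skeleton of an HFile-based bulk load would look roughly like the sketch below. This is only an outline under heavy assumptions; the genuinely hard part, which I've omitted, is producing row keys and cell values in Phoenix's own byte encoding (salting, composite row key construction, column qualifiers, secondary index maintenance), which is what AbstractBulkLoadTool handles for you via the Phoenix runtime. Paths, table and column names here are hypothetical.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object SparkHFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hfile-sketch"))
    val conf = HBaseConfiguration.create()
    val cf = Bytes.toBytes("0") // Phoenix's default column family

    // Placeholder data; in reality each row key and value would have to be
    // encoded exactly as Phoenix expects for the target table's schema.
    val kvs = sc.parallelize(Seq("a" -> "1", "b" -> "2"))
      .map { case (k, v) =>
        val row = Bytes.toBytes(k)
        (new ImmutableBytesWritable(row),
         new KeyValue(row, cf, Bytes.toBytes("VAL"), Bytes.toBytes(v)))
      }
      // HFileOutputFormat2 requires cells in total row-key order.
      .sortByKey()

    kvs.saveAsNewAPIHadoopFile("/tmp/hfiles-out",
      classOf[ImmutableBytesWritable], classOf[KeyValue],
      classOf[HFileOutputFormat2], conf)

    // Afterwards, hand the generated HFiles to the RegionServers with
    // HBase's LoadIncrementalHFiles (completebulkload) tool.
  }
}
```

Getting the Phoenix encoding right is where the bulk of the effort would go, which is why reusing the logic in AbstractBulkLoadTool is the place to start.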