/cc Spark user list
I'm confused here: you mentioned that you were writing Parquet files
using MR jobs. What's the relation between that Parquet writing task and
this JavaPairRDD one? Is it a separate problem?
Spark supports dynamic partitioning (e.g. df.write.partitionBy("col1",
"col2").format("<data source name>").save(path)), and there's a
spark-avro
data source. If you are writing Avro records to multiple partitions,
these two should help.
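For example, the write could look roughly like this (an untested sketch,
assuming a DataFrame df with partition columns col1/col2 and the
Databricks spark-avro package on the classpath; all names here are
illustrative):

df.write
  .partitionBy("col1", "col2")
  .format("com.databricks.spark.avro")  // spark-avro data source
  .save("/path/to/output")              // one sub-directory per partition value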
Cheng
On 11/19/15 4:30 PM, Shushant Arora wrote:
Thanks Cheng.
I have used AvroParquetOutputFormat and it works fine.
My requirement is now to handle writing to multiple folders at the same
time. Basically, I want to write the JavaPairRDD<Void, GenericRecord> to
multiple folders based on the final Hive partitions where this RDD will
land. Have you used multiple output formats in Spark?
On Fri, Nov 13, 2015 at 3:56 PM, Cheng Lian <lian.cs....@gmail.com
<mailto:lian.cs....@gmail.com>> wrote:
Oh I see. Then parquet-avro should probably be more useful. AFAIK,
parquet-hive is only used internally in Hive. I don't see anyone
using it directly.
In general, you can first parse your text data, assemble the parsed
rows into Avro records, and then write those records to Parquet.
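A rough sketch of that flow (assuming parquet-mr/parquet-avro 1.8+ and an
Avro schema named schema with "key"/"value" fields; the parsing logic
below is only illustrative):

import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Writer backed by the Avro Parquet data model (AvroWriteSupport)
val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/data.parquet"))
  .withSchema(schema)
  .build()

for (line <- scala.io.Source.fromFile("/tmp/data.txt").getLines()) {
  val Array(key, value) = line.split("\t", 2)
  val record = new GenericData.Record(schema)  // assemble an Avro record
  record.put("key", key.toInt)
  record.put("value", value)
  writer.write(record)                         // write it as a Parquet record
}
writer.close()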
BTW, Spark 1.2 also provides Parquet support. Since you're trying
to convert text data, I guess you probably don't have any nested
data. In that case, Spark 1.2 should be enough. It's not that
Spark 1.2 can't deal with nested data; the concern is interoperability
with Hive, because in the early days the Parquet spec itself didn't
specify how to write nested data. You may refer to this link for
more details:
http://spark.apache.org/docs/1.2.1/sql-programming-guide.html#parquet-files
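For instance, the flow in that guide is roughly the following (an
untested sketch; it assumes sc is a SparkContext and the input is
tab-separated key/value text):

case class Record(key: Int, value: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicitly converts RDD[Record] to a SchemaRDD

val records = sc.textFile("/tmp/data.txt").map { line =>
  val Array(key, value) = line.split("\t", 2)
  Record(key.toInt, value)
}
records.saveAsParquetFile("/tmp/data.parquet")  // Spark 1.2.x API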
Cheng
On 11/13/15 6:11 PM, Shushant Arora wrote:
No, I don't have the data loaded into Hive in text form - that question
was just to understand the internals of the approach Hive takes.
I want to write Parquet files directly from an MR job. For that, which
approach is better: the Hive Parquet data model or the Avro Parquet
data model?
On Fri, Nov 13, 2015 at 3:24 PM, Cheng Lian
<lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>> wrote:
If you are already able to load the text data into Hive, then
using Hive itself to convert the data is obviously the
easiest and most compatible way. For example:
CREATE TABLE text_table (key INT, value STRING);
LOAD DATA LOCAL INPATH '/tmp/data.txt' INTO TABLE text_table;
CREATE TABLE parquet_table
STORED AS PARQUET
AS SELECT * FROM text_table;
Cheng
On 11/13/15 5:13 PM, Shushant Arora wrote:
Thanks !
So which one is better for dumping text data into Hive using a
custom MR/Spark job - the Hive Parquet data model using Hive
Writables, or the Avro Parquet data model using Avro objects?
On Fri, Nov 13, 2015 at 12:45 PM, Cheng Lian
<lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>> wrote:
ParquetOutputFormat is not a data model. A data model
provides a WriteSupport to ParquetOutputFormat to tell
Parquet how to convert upper-level domain objects (Hive
Writables, Avro records, etc.) into Parquet records. So
all the other data models use it for writing Parquet files.
Hive does have a Parquet data model. If you create a
Parquet table in Hive like "CREATE TABLE t (key INT,
value STRING) STORED AS PARQUET", it invokes the Hive
Parquet data model when reading/writing table t. In the
case you mentioned, records in the text table are first
extracted by Hive into Hive Writables, and then the Hive
Parquet data model converts those Writables into Parquet
records.
Cheng
On 11/13/15 2:37 PM, Shushant Arora wrote:
Thanks Cheng.
I have Spark version 1.2 deployed on my cluster, so for
the time being I cannot use the Spark SQL functionality directly.
I will try AvroParquetOutputFormat. I just want to
know how AvroParquetOutputFormat is better than using
ParquetOutputFormat directly. Also, is there a Hive object
model? I mean, when I create a Parquet table in Hive
and insert data into it from a text table, which
object model does Hive use internally?
Thanks
Shushant
On Fri, Nov 13, 2015 at 9:14 AM, Cheng Lian
<lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>>
wrote:
If I understand your question correctly, you are
trying to write Parquet files using a specific
Parquet data model, and expect to load them into
Hive, right?
Spark also implements a Parquet data model, which
converts Spark SQL rows into Parquet records. If
you're already using Spark, then this can be very
convenient. For example, in Spark 1.5, assuming
that you've already deployed Spark against an
existing Hive metastore, you may easily save a
DataFrame as a Hive Parquet table like this:
sqlContext.range(10).write.format("parquet").saveAsTable("t")
Then you should be able to read this Parquet table
from the Hive side.
If you are going to use parquet-mr directly, then
parquet-avro is always recommended since it's the
most standard one out there. But please use
parquet-mr 1.8+, because earlier versions don't
write correct LIST and MAP structures. Also, when
interacting with Hive, all fields in your Avro
schema should be marked as optional because Hive
doesn't support required fields.
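For illustration, a schema with every field made optional (nullable
with a null default) could look like this; the record and field
names here are made up:

val schema = new org.apache.avro.Schema.Parser().parse(
  """{
    |  "type": "record",
    |  "name": "Example",
    |  "fields": [
    |    {"name": "key",   "type": ["null", "int"],    "default": null},
    |    {"name": "value", "type": ["null", "string"], "default": null}
    |  ]
    |}""".stripMargin)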
Cheng
On Fri, Nov 13, 2015 at 1:56 AM, Shushant Arora
<shushantaror...@gmail.com
<mailto:shushantaror...@gmail.com>> wrote:
Hi Cheng
I got your reference from the Spark mailing
list. I have a few doubts about Parquet - it would
be helpful if you could help here.
I have a requirement to write data in Parquet
format to a Hive table. I am using a Spark job.
In Parquet there are so many object models to
write to files:
AvroParquetOutputFormat, ProtoParquetOutputFormat, ThriftParquetOutputFormat, ParquetOutputFormat.
1. Which one is better and recommended over the others?
2. Which one is best for Hive?
3. Do you have any sample program in Hadoop MR
or Spark for this?
Thanks
Shushant