[ https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792267#comment-17792267 ]

ASF GitHub Bot commented on PARQUET-1822:
-----------------------------------------

amousavigourabi commented on PR #1111:
URL: https://github.com/apache/parquet-mr/pull/1111#issuecomment-1836931968

   > Our project needs this feature as well; is there a date for the next major
   > release?

   @drealeed if you just need to be able to drop the Hadoop Path dependency,
   you might want to consider copying the InputFile and OutputFile
   implementations from this pull request before the next release is out. If
   you need to fully drop Hadoop, that is still being worked on.
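
   For anyone following along, here is a minimal sketch of what copying those
   implementations amounts to, assuming only the org.apache.parquet.io.OutputFile
   and PositionOutputStream types that already ship in parquet-common; the class
   name StreamOutputFile is made up for illustration and is not taken from the PR:

      import java.io.IOException;
      import java.io.OutputStream;
      import org.apache.parquet.io.OutputFile;
      import org.apache.parquet.io.PositionOutputStream;

      // Wraps any plain OutputStream as a Parquet OutputFile, tracking the
      // write position so the writer can record column chunk offsets.
      public class StreamOutputFile implements OutputFile {

        private final OutputStream out;

        public StreamOutputFile(OutputStream out) {
          this.out = out;
        }

        @Override
        public PositionOutputStream create(long blockSizeHint) {
          return new PositionOutputStream() {
            private long pos = 0;

            @Override
            public long getPos() {
              return pos;
            }

            @Override
            public void write(int b) throws IOException {
              out.write(b);
              pos++;
            }

            @Override
            public void write(byte[] b, int off, int len) throws IOException {
              out.write(b, off, len);
              pos += len;
            }

            @Override
            public void flush() throws IOException {
              out.flush();
            }

            @Override
            public void close() throws IOException {
              out.close();
            }
          };
        }

        @Override
        public PositionOutputStream createOrOverwrite(long blockSizeHint) {
          // No separate overwrite semantics for a raw stream.
          return create(blockSizeHint);
        }

        @Override
        public boolean supportsBlockSize() {
          return false;
        }

        @Override
        public long defaultBlockSize() {
          return 0;
        }
      }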




> Parquet without Hadoop dependencies
> -----------------------------------
>
>                 Key: PARQUET-1822
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1822
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-avro
>    Affects Versions: 1.11.0
>         Environment: Amazon Fargate (linux), Windows development box.
> We are writing Parquet to be read by the Snowflake and Athena databases.
>            Reporter: mark juchems
>            Priority: Minor
>              Labels: documentation, newbie
>             Fix For: 1.14.0
>
>
> I have been trying for weeks to create a Parquet file from Avro and write it
> to S3 in Java.  This has been incredibly frustrating and odd, as Spark can do
> it easily (I'm told).
> I have assembled the correct jars through luck and diligence, but now I find
> out that I have to have Hadoop installed on my machine. I am currently
> developing on Windows, and it seems a DLL and an EXE can fix that up, but I am
> wondering about Linux, as the code will eventually run in Fargate on AWS.
> *Why do I need external dependencies and not pure java?*
> The thing really is how utterly complex all this is.  I would like to create
> an Avro file, convert it to Parquet, and write it to S3, but I am trapped in
> "ParquetWriter" hell!
> *Why can't I get a normal OutputStream and write it wherever I want?*
> I have scoured the web for examples and there are a few, but we really need
> some documentation on this stuff.  I understand that there may be reasons for
> all this, but I can't find them on the web anywhere.  Any help?  Can't we get
> a "SimpleParquet" jar that does this:
>  
> ParquetWriter<GenericData.Record> writer =
>     AvroParquetWriter.<GenericData.Record>builder(outputStream)
>         .withSchema(avroSchema)
>         .withConf(conf)
>         .withCompressionCodec(CompressionCodecName.SNAPPY)
>         .withWriteMode(Mode.OVERWRITE) // probably not good for prod (overwrites files)
>         .build();
>  
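
For context on the snippet above: parquet-avro's builder already accepts an
org.apache.parquet.io.OutputFile in place of a Hadoop Path, which gets close to
what the reporter asks for. A minimal sketch against that overload, assuming
the hypothetical StreamOutputFile wrapper sketched earlier in this message (the
Hadoop jars are still required on the classpath in released versions; only the
Hadoop installation goes away):

    import java.io.FileOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetFileWriter.Mode;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class SimpleParquetExample {
      public static void main(String[] args) throws Exception {
        // A one-field record schema, just for the example.
        Schema avroSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Example\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        // StreamOutputFile is the hypothetical wrapper sketched above; any
        // OutputStream works, e.g. one backed by an S3 client upload.
        try (ParquetWriter<GenericData.Record> writer =
            AvroParquetWriter.<GenericData.Record>builder(
                    new StreamOutputFile(new FileOutputStream("example.parquet")))
                .withSchema(avroSchema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .withWriteMode(Mode.OVERWRITE) // overwrites; convenient for dev only
                .build()) {
          GenericData.Record record = new GenericData.Record(avroSchema);
          record.put("id", 1L);
          writer.write(record);
        }
      }
    }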



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
