I haven’t tried this, but maybe you can use a PDF library to write the
binary contents out as PDF files.
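Assuming the bytes really are PDF content, a rough sketch of that idea
with Apache PDFBox (the library choice and the output directory are my
assumptions) could look like:

  import org.apache.pdfbox.pdmodel.PDDocument

  // Parse the raw bytes as a PDF and write them back out as a .pdf file.
  def saveAsPdf(name: String, bytes: Array[Byte]): Unit = {
    val doc = PDDocument.load(bytes)
    try doc.save(s"/tmp/pdfs/$name.pdf") // hypothetical local output dir
    finally doc.close()
  }

  // e.g. rdd.foreach { case (name, content) => saveAsPdf(name, content.getBytes()) }

If the bytes are already a valid PDF, simply writing them to a file
without any PDF library should work just as well.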
On Wed, Oct 10, 2018 at 11:30 AM ☼ R Nair wrote:
> All,
>
> I am reading a zipped file into an RDD and getting rdd._1 as the name
> and rdd._2.getBytes() as the content. How can I save the content as PDF
> files?
Hi,
I need to process millions of PDFs in HDFS using Spark. First I’m trying
with some 40k files. I’m using the binaryFiles API, with which I’m facing a
couple of issues:
1. It creates only 4 tasks, and I can’t seem to increase the parallelism
there (see the sketch below).
2. It took 2276 seconds, and that means for millions
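Two knobs that may help with (1), sketched under assumptions (the path and
the numbers are made up): binaryFiles takes a minPartitions hint, and the
resulting RDD can be repartitioned before the per-file work.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("pdf-batch").getOrCreate()

  // minPartitions is only a hint; the actual split count is also governed
  // by spark.files.maxPartitionBytes and spark.files.openCostInBytes.
  val pdfs = spark.sparkContext
    .binaryFiles("hdfs:///data/pdfs/", minPartitions = 200)
    .repartition(200) // bluntly force wider parallelism before heavy work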
Sent from my iPhone
>
> On Sep 28, 2018, at 12:10 PM, Joel D wrote:
>
> I'm trying to extract text from PDF files in HDFS using pdfBox.
>
> However it throws an error:
>
> "Exception in thread "main" org.apache.spark.SparkException: ...
>
> java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
> (No such file or directory)"
I'm trying to extract text from PDF files in HDFS using pdfBox.
However it throws an error:
"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
(No such file or directory)"
What am I missing? Should I be working with
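A hedged guess at the cause: PDDocument.load(new java.io.File(path))
resolves the path on the executor's local filesystem, so an hdfs:// file
looks missing. One sketch of a workaround is to let Spark fetch the bytes
and hand them to pdfBox (the path is taken from the error above; everything
else is an assumption):

  import org.apache.pdfbox.pdmodel.PDDocument
  import org.apache.pdfbox.text.PDFTextStripper
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("pdf-text").getOrCreate()

  val texts = spark.sparkContext
    .binaryFiles("hdfs://nnAlias:8020/tmp/*.pdf")
    .map { case (path, stream) =>
      val doc = PDDocument.load(stream.toArray) // parse bytes, not a local File
      try (path, new PDFTextStripper().getText(doc))
      finally doc.close()
    }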
One workaround is to rename the fid column in each DataFrame before
joining, as sketched below.
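A minimal sketch of that workaround (the frames and data are stand-ins for
the ones in the thread):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("rename-join").getOrCreate()
  import spark.implicits._

  val df1 = Seq((1, "a")).toDF("fid", "x")
  val df2 = Seq((1, "b")).toDF("fid", "y")

  // Rename the ambiguous column on each side so the analyzer doesn't trip
  // over duplicate attribute references during the join.
  val left   = df1.withColumnRenamed("fid", "fid_left")
  val right  = df2.withColumnRenamed("fid", "fid_right")
  val joined = left.join(right, $"fid_left" === $"fid_right")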
On Thu, Jul 19, 2018 at 9:50 PM wrote:
> Spark 2.3.0 has this problem; upgrade it to 2.3.1.
>
> Sent from my iPhone
>
> On Jul 19, 2018, at 2:13 PM, Nirav Patel wrote:
>
> Corrected subject line. It's a missing-attribute error.
We’ve developed our own testing framework covering different areas of
checking, sometimes providing expected data and comparing it with the
resultant data from the data object.
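One shape such a check can take (a sketch of the general idea, not the
framework described above): diff expected against actual in both directions
and fail on any gap.

  import org.apache.spark.sql.DataFrame

  def assertSameData(expected: DataFrame, actual: DataFrame): Unit = {
    val missing = expected.except(actual).count() // rows we wanted but lost
    val extra   = actual.except(expected).count() // rows we never expected
    assert(missing == 0 && extra == 0,
      s"data mismatch: $missing missing row(s), $extra unexpected row(s)")
  }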
Cheers.
On Tue, May 22, 2018 at 1:48 PM Steve Pruitt wrote:
> Something more on
Hi,
I have a job that runs a HiveQL query joining 2 tables (2.5 TB and 45 GB),
repartitions to 100, and then does some other transformations. This
executed fine earlier.
Job stages:
Stage 0: Hive table 1 scan
Stage 1: Hive table 2 scan
Stage 2: Tungsten exchange for the join
Stage 3: Tungsten exchange for
Have you tried increasing driver and executor memory (and the memory
overhead for GC, if required)? Your code snippet and stack trace would be
helpful.
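For what it's worth, a sketch of where those settings go, applied to a
JSON-to-ORC job like the quoted one (paths and sizes are assumptions; in
client mode, driver memory has to be passed to spark-submit as
--driver-memory rather than set in code):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("json-to-orc")
    .config("spark.executor.memory", "8g")         // larger executor heap
    .config("spark.executor.memoryOverhead", "2g") // off-heap/GC headroom
    .getOrCreate()
  // Note: on Spark 2.2 and earlier the overhead key is
  // spark.yarn.executor.memoryOverhead.

  spark.read
    .json("hdfs:///in/json/")        // Snappy-codec JSON input
    .write
    .option("compression", "zlib")   // ORC with the ZLIB codec
    .orc("hdfs:///out/orc/")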
On Mon, Nov 13, 2017 at 7:23 PM Alec Swan wrote:
> Hello,
>
> I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB
> format.
Hi,
We are trying to come up with the best storage format for handling schema
changes in ingested data.
We noticed that both Avro and Parquet allow one to select columns by name
instead of by index/position in the data. However, we are inclined towards
Parquet for its better read performance.
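A small sketch of the by-name selection above, using Parquet schema merging
(the path and column names are hypothetical):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("schema-evolution").getOrCreate()

  val df = spark.read
    .option("mergeSchema", "true") // reconcile schemas across ingested files
    .parquet("hdfs:///data/ingested/")

  // Columns resolve by name, so files written before a column was added
  // simply yield nulls for it.
  df.select("id", "newly_added_col").show()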