Re: getBytes : save as pdf

2018-10-10 Thread Joel D
I haven’t tried this, but maybe you can use a PDF library to write the binary contents as a PDF. On Wed, Oct 10, 2018 at 11:30 AM ☼ R Nair wrote: > All, > > I am reading a zipped file into an RDD and getting the rdd._1 as the name > and rdd._2.getBytes() as the content. How can I save the
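
A minimal sketch of the per-entry write step, assuming the bytes taken from each zip entry already form a complete PDF, in which case they can be written out unchanged and no PDF library is needed. It is plain Python with no Spark dependency; the helper name `save_pdf_bytes`, the output directory, and the sample bytes are hypothetical and do not come from the thread.

```python
import os
import tempfile


def save_pdf_bytes(name: str, content: bytes, out_dir: str) -> str:
    """Write raw bytes to out_dir under the entry's base name, with a .pdf suffix."""
    base = os.path.basename(name)      # drop any directory part of the entry name
    if not base.lower().endswith(".pdf"):
        base += ".pdf"                 # add the extension only if it is missing
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, base)
    with open(path, "wb") as fh:       # binary mode: the bytes are written unchanged
        fh.write(content)
    return path


if __name__ == "__main__":
    sample = b"%PDF-1.4\n%%EOF\n"      # placeholder bytes, not a real document
    with tempfile.TemporaryDirectory() as tmp:
        print(save_pdf_bytes("archive/report.pdf", sample, tmp))
```

In a Spark job a function like this would be called once per (name, bytes) pair on the executors, so `out_dir` would have to be a path each executor can write to. That part is an assumption and is not shown here.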

Process Million Binary Files

2018-10-10 Thread Joel D
Hi, I need to process millions of PDFs in HDFS using Spark. First I’m trying with some 40k files. I’m using the binaryFiles API, with which I’m facing a couple of issues: 1. It creates only 4 tasks and I can’t seem to increase the parallelism there. 2. It took 2276 seconds and that means for millions

Re: Text from pdf spark

2018-09-28 Thread Joel D
Sent from my iPhone > > On Sep 28, 2018, at 12:10 PM, Joel D wrote: > > I'm trying to extract text from pdf files in hdfs using pdfBox. > > However it throws an error: > > "Exception in thread "main" org.apache.spark.SparkException: ... > > java.io.FileNo

Text from pdf spark

2018-09-28 Thread Joel D
I'm trying to extract text from PDF files in HDFS using PDFBox. However, it throws an error: "Exception in thread "main" org.apache.spark.SparkException: ... java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf (No such file or directory)" What am I missing? Should I be working with

Re: Multiple joins with same Dataframe throws AnalysisException: resolved attribute(s)

2018-07-19 Thread Joel D
One workaround is to rename the fid column for each df before joining. On Thu, Jul 19, 2018 at 9:50 PM wrote: > Spark 2.3.0 has this problem; upgrade to 2.3.1 > > Sent from my iPhone > > On Jul 19, 2018, at 2:13 PM, Nirav Patel wrote: > > corrected subject line. It's missing attribute error

Re: [EXTERNAL] - Re: testing frameworks

2018-05-22 Thread Joel D
We’ve developed our own testing framework covering different kinds of checks, sometimes providing expected data and comparing it with the resulting data from the data object. Cheers. On Tue, May 22, 2018 at 1:48 PM Steve Pruitt wrote: > Something more on

No Tasks have reported metrics yet

2018-01-10 Thread Joel D
Hi, I have a job that runs a HiveQL query joining 2 tables (2.5 TB, 45 GB), repartitions to 100 and then does some other transformations. This executed fine earlier. Job stages: Stage 0: Hive table 1 scan Stage 1: Hive table 2 scan Stage 2: Tungsten exchange for the join Stage 3: Tungsten exchange for

Re: Process large JSON file without causing OOM

2017-11-13 Thread Joel D
Have you tried increasing driver and executor memory (and the GC overhead too, if required)? Your code snippet and stack trace would be helpful. On Mon, Nov 13, 2017 at 7:23 PM Alec Swan wrote: > Hello, > > I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB > format.

Schema Evolution Parquet vs Avro

2017-05-29 Thread Joel D
Hi, We are trying to come up with the best storage format for handling schema changes in ingested data. We noticed that both Avro and Parquet allow one to select by column name instead of by the index/position of the data. However, we are inclined towards Parquet for better read performance