Re: getBytes : save as pdf

2018-10-10 Thread Joel D
I haven’t tried this, but maybe you can try using some PDF library to write the binary contents as PDF. On Wed, Oct 10, 2018 at 11:30 AM ☼ R Nair wrote: > All, > > I am reading a zipped file into an RDD and getting the rdd._1 as the name > and rdd._2.getBytes() as the content. How can I save the
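Since rdd._2.getBytes() should already hold the raw PDF bytes, one option is to skip the PDF library entirely and write the bytes verbatim to .pdf files on HDFS. A minimal sketch, assuming an RDD of (fileName, bytes) pairs; the output directory is hypothetical:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // rdd: RDD[(String, Array[Byte])] -- entry name and raw PDF bytes
  rdd.foreachPartition { iter =>
    // One FileSystem handle per partition, created on the executor
    val fs = FileSystem.get(new Configuration())
    iter.foreach { case (name, bytes) =>
      val out = fs.create(new Path("/output/pdfs/" + new Path(name).getName))
      try out.write(bytes) finally out.close()
    }
  }

Writing the bytes unchanged preserves the PDF structure; a PDF library is only needed if the documents have to be parsed or modified.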

Process Million Binary Files

2018-10-10 Thread Joel D
Hi, I need to process millions of PDFs in HDFS using Spark. First I’m trying with some 40k files. I’m using the binaryFiles API, with which I’m facing a couple of issues: 1. It creates only 4 tasks and I can’t seem to increase the parallelism there. 2. It took 2276 seconds, and that means for millions
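On the parallelism point, two things worth checking, sketched assuming Spark 2.x with illustrative partition counts: binaryFiles accepts a minPartitions hint, and on versions where the hint is not honored (see SPARK-22357) an explicit repartition spreads the work instead:

  // Ask for more input splits up front (a hint, not a guarantee)
  val files = sc.binaryFiles("hdfs:///data/pdfs", minPartitions = 400)

  // If the hint is ignored on your version, force a shuffle instead
  val spread = files.repartition(400)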

getBytes : save as pdf

2018-10-10 Thread ☼ R Nair
All, I am reading a zipped file into an RDD and getting rdd._1 as the name and rdd._2.getBytes() as the content. How can I save the latter as a PDF? In fact the zipped file is a set of PDFs. I tried saveAsObjectFile and saveAsTextFile, but cannot read back the PDF. Any clue please? Best,
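For the reading side, a sketch of unpacking zip entries into (entryName, bytes) pairs, assuming each entry is one complete PDF; the input path is hypothetical:

  import java.io.ByteArrayOutputStream
  import java.util.zip.ZipInputStream

  val entries = sc.binaryFiles("hdfs:///input/pdfs.zip").flatMap { case (_, stream) =>
    val zis = new ZipInputStream(stream.open())
    try {
      Iterator.continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .filter(!_.isDirectory)
        .map { entry =>
          // Copy the current entry fully into memory
          val buf = new ByteArrayOutputStream()
          val chunk = new Array[Byte](8192)
          var n = zis.read(chunk)
          while (n != -1) { buf.write(chunk, 0, n); n = zis.read(chunk) }
          (entry.getName, buf.toByteArray)
        }
        .toList // materialize before the stream is closed
    } finally zis.close()
  }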

Bad Message 413 Request Entity too large - Spark History UI through Knox

2018-10-10 Thread Theyaa Matti
Hi, I am getting the message below when trying to access the Spark History UI through Knox: Bad Message 413, reason: Request Entity Too Large. It is worth mentioning that the issue appears when I enable SSL on Knox. If Knox is not running with SSL, the issue disappears. Doing some
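One knob often suggested for this symptom is the Jetty header buffer size in Knox's gateway-site.xml, since SSL setups can push request headers past the default; the property name and value below are an assumption to verify against your Knox version:

  <property>
    <!-- Assumption: raises Jetty's request header limit from the 8 KB default -->
    <name>gateway.httpserver.requestHeaderBuffer</name>
    <value>32768</value>
  </property>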

Triangle Apache Spark Meetup

2018-10-10 Thread Jean Georges Perrin
Hi, Just a small plug: Triangle Apache Spark Meetup (TASM) covers Raleigh, Durham, and Chapel Hill in North Carolina, USA. The group started back in July 2015. More details here: https://www.meetup.com/Triangle-Apache-Spark-Meetup/ .

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-10 Thread Gourav Sengupta
Hi Dillon, yes, we can see the number of executors that are running, but the question is more about understanding the relation between YARN containers, their persistence, and Spark executors. Regards, Gourav On Wed, Oct 10, 2018 at 6:38 AM Dillon Dukek wrote: > There is documentation here
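For what it's worth, on YARN each Spark executor runs inside exactly one YARN container (plus one container for the ApplicationMaster), and that container stays allocated for the executor's lifetime unless dynamic allocation releases it. Illustrative numbers only:

  # Requests 10 executor containers + 1 ApplicationMaster container;
  # each container is sized to executor memory plus spark.executor.memoryOverhead.
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 10 \
    --executor-cores 4 \
    --executor-memory 8g \
    myapp.jar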

sparksql exception when using regexp_replace

2018-10-10 Thread 付涛
Hi, Spark users: I am using Spark SQL to insert some values into a directory; the SQL looks like this: insert overwrite directory '/temp/test_spark' ROW FORMAT DELIMITED FIELDS TERMINATED BY '~' select regexp_replace('a~b~c', '~', ''), 123456 However, an exception was thrown:
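For readability, the statement from the post, laid out as it would be run in spark-sql:

  INSERT OVERWRITE DIRECTORY '/temp/test_spark'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
  SELECT regexp_replace('a~b~c', '~', ''), 123456;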
