Re: getBytes : save as pdf

2018-10-10 Thread Joel D
I haven’t tried this, but you could try using a PDF library to write the
binary contents out as a PDF.
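
Something along these lines might work once you have the (name, bytes) pairs,
writing each entry straight back out to HDFS via the FileSystem API (just a
rough sketch, untested; the output directory is a placeholder, and a PDF
library could be swapped in if you need to validate the bytes):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// rdd is assumed to be the (fileName, bytes) pairs you already have,
// i.e. an RDD[(String, Array[Byte])].
rdd.foreachPartition { iter =>
  // One FileSystem handle per partition, created on the executors.
  val fs = FileSystem.get(new Configuration())
  iter.foreach { case (name, bytes) =>
    // "/tmp/output-pdfs" is only a placeholder output directory.
    val out = fs.create(new Path("/tmp/output-pdfs/" + new Path(name).getName))
    try out.write(bytes) finally out.close()
  }
}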

On Wed, Oct 10, 2018 at 11:30 AM ☼ R Nair 
wrote:

> All,
>
> I am reading a zipped file into an RDD and getting rdd._1 as the name
> and rdd._2.getBytes() as the content. How can I save the latter as a PDF?
> In fact, the zipped file is a set of PDFs. I tried saveAsObjectFile and
> saveAsTextFile, but cannot read the PDF back. Any clue, please?
>
> Best, Ravion
>


Process Million Binary Files

2018-10-10 Thread Joel D
Hi,

I need to process millions of PDFs in HDFS using Spark. As a first step I’m
trying with about 40k files. I’m using the binaryFiles API, with which I’m
facing a couple of issues:

1. It creates only 4 tasks and I can’t seem to increase the parallelism
there.
2. It took 2276 seconds, which means that for millions of files it will take
ages to complete. I’m also expecting it to fail for millions of records with
some timeout or GC overhead exception.

val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()

val fileContentRdd = files.map(file => myFunc(file))



Do you have any guidance on how I can process millions of files using the
binaryFiles API?

How can I increase the number of tasks/parallelism when creating the files
RDD?
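
For reference, the kind of change I have in mind is an explicit repartition
right after the read, though I am not sure it addresses the real bottleneck
(just a sketch, not verified at the million-file scale; the partition count
is arbitrary):

val files = sparkSession.sparkContext
  .binaryFiles(filePath, 200)  // minPartitions seems to be only a hint here
  .repartition(1000)           // arbitrary; spreads the (path, stream) records across more tasks
  .cache()

val fileContentRdd = files.map(file => myFunc(file))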

Thanks


getBytes : save as pdf

2018-10-10 Thread ☼ R Nair
All,

I am reading a zipped file into an RDD and getting rdd._1 as the name
and rdd._2.getBytes() as the content. How can I save the latter as a PDF?
In fact, the zipped file is a set of PDFs. I tried saveAsObjectFile and
saveAsTextFile, but cannot read the PDF back. Any clue, please?

Best, Ravion


Bad Message 413 Request Entity too large - Spark History UI through Knox

2018-10-10 Thread Theyaa Matti
Hi,
I am getting the message below when trying to access the Spark History
UI through Knox:

Bad Message 413

reason: Request Entity Too Large

It is worth mentioning that the issue only appears when I enable SSL on Knox;
if Knox is not running with SSL, the issue disappears.

From some research, this looks like a Jetty setting for increasing the HTTP
request header buffer, but I do not see any Spark config for it.

Would you please advise on what to do?

Regards,


Triangle Apache Spark Meetup

2018-10-10 Thread Jean Georges Perrin
Hi,


Just a small plug for the Triangle Apache Spark Meetup (TASM), which covers
Raleigh, Durham, and Chapel Hill in North Carolina, USA. The group started back
in July 2015. More details here:
https://www.meetup.com/Triangle-Apache-Spark-Meetup/

Can you add our meetup to http://spark.apache.org/community.html ?

jg




Re: Spark on YARN not utilizing all the YARN containers available

2018-10-10 Thread Gourav Sengupta
Hi Dillon,

yes, we can understand the number of executors that are running, but the
question is more about understanding the relation between YARN containers,
their persistence, and Spark executors.
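
To make the question concrete, this is the sort of setup I am reasoning
about; my current understanding is that, with dynamic allocation off, each
requested executor should come up in its own YARN container, but I would like
to confirm that (the numbers below are purely illustrative):

import org.apache.spark.sql.SparkSession

// Purely illustrative resource numbers, not a recommendation.
val spark = SparkSession.builder()
  .appName("container-executor-question")
  .config("spark.executor.instances", "6")
  .config("spark.executor.cores", "4")
  .config("spark.dynamicAllocation.enabled", "false")
  .getOrCreate()

// Quick check from the driver of how many executors actually registered
// (this listing typically includes the driver itself).
println(spark.sparkContext.statusTracker.getExecutorInfos.length)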

Regards,
Gourav

On Wed, Oct 10, 2018 at 6:38 AM Dillon Dukek 
wrote:

> There is documentation here
> http://spark.apache.org/docs/latest/running-on-yarn.html about running
> spark on YARN. Like I said before you can use either the logs from the
> application or the Spark UI to understand how many executors are running at
> any given time. I don't think I can help much further without more
> information about the specific use case.
>
>
> On Tue, Oct 9, 2018 at 2:54 PM Gourav Sengupta 
> wrote:
>
>> Hi Dillon,
>>
>> I do think there is a setting available wherein, once YARN sets up the
>> containers, they are not deallocated; I had used it previously in Hive, and
>> it just saves the processing time of allocating containers. That said, I am
>> still trying to understand how we determine that one YARN container = one
>> executor in Spark.
>>
>> Regards,
>> Gourav
>>
>> On Tue, Oct 9, 2018 at 9:04 PM Dillon Dukek
>>  wrote:
>>
>>> I'm still not sure exactly what you mean by saying that you have
>>> 6 YARN containers. YARN should just be aware of the total available
>>> resources in your cluster and then be able to launch containers based on
>>> the executor requirements you set when you submit your job. If you can, I
>>> think it would be helpful to send me the command you're using to launch
>>> your Spark process. You should also be able to use the logs and/or the
>>> Spark UI to determine how many executors are running.
>>>
>>> On Tue, Oct 9, 2018 at 12:57 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 hi,

 Maybe I am not quite clear in my head on this one. But how do we know
 that 1 YARN container = 1 executor?

 Regards,
 Gourav Sengupta

 On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek
  wrote:

> Can you send how you are launching your streaming process? Also what
> environment is this cluster running in (EMR, GCP, self managed, etc)?
>
> On Tue, Oct 9, 2018 at 10:21 AM kant kodali 
> wrote:
>
>> Hi All,
>>
>> I am using Spark 2.3.1 and using YARN as a cluster manager.
>>
>> I currently have:
>>
>> 1) 6 YARN containers (executors = 6) with 4 executor cores for each
>> container.
>> 2) 6 Kafka partitions from one topic.
>> 3) You can assume every other configuration is set to whatever the
>> default values are.
>>
>> I spawned a simple streaming query and I see all the tasks get
>> scheduled on one YARN container. Am I missing any config?
>>
>> Thanks!
>>
>



sparksql exception when using regexp_replace

2018-10-10 Thread 付涛
Hi, Spark users:
 I am using Spark SQL to insert some values into a directory; the SQL looks
like this:
 
 insert overwrite directory '/temp/test_spark'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
 select regexp_replace('a~b~c', '~', ''), 123456

 However, an exception is thrown:
 
 Caused by: org.apache.hadoop.hive.serde2.SerDeException:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 4 elements
while columns.types has 2 elements!
at
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:163)
at
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:90)
at
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:116)
at
org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
at
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more

   The Hive version used is 2.0.1.

   When I add an alias to regexp_replace, the SQL succeeds:
   
   insert overwrite directory '/temp/test_spark'
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
   select regexp_replace('a~b~c', '~', '') as kv, 123456
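
For anyone who wants to reproduce this, roughly the equivalent spark-shell
steps (a minimal sketch; Hive support is needed for the ROW FORMAT serde
path, and the output directory is just a scratch location):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Fails with the LazySimpleSerDe "columns has 4 elements" error:
spark.sql(
  """insert overwrite directory '/temp/test_spark'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
    |select regexp_replace('a~b~c', '~', ''), 123456""".stripMargin)

// Succeeds once the regexp_replace column has an explicit alias:
spark.sql(
  """insert overwrite directory '/temp/test_spark'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
    |select regexp_replace('a~b~c', '~', '') as kv, 123456""".stripMargin)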



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org