Re: How to combine all rows into a single row in DataFrame

2019-08-19 Thread Alex Landa
Hi,

It sounds similar to what we do in our application.
We don't serialize every row individually; instead, we first group the rows
into the desired representation and then apply protobuf serialization using
map and a lambda.
I suggest not serializing the entire DataFrame into a single protobuf
message, since it may cause OOM errors.
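
As a rough illustration of the grouping approach (just a sketch, not our
actual code; the groupId column and toProto are placeholders for your own
grouping key and protobuf builder):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// placeholder: build one protobuf message from all rows of a group
def toProto(rows: Seq[Row]): Array[Byte] = ???

val groupedMessages = df                                       // df: your existing DataFrame
  .groupBy(col("groupId"))                                     // placeholder grouping key
  .agg(collect_list(struct(df.columns.map(col): _*)).as("rows"))
  .rdd
  .map(row => toProto(row.getSeq[Row](row.fieldIndex("rows"))))

This way each task builds one message per group instead of one huge message
on the driver.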

Thanks,
Alex

On Mon, Aug 19, 2019 at 11:24 PM Rishikesh Gawade 
wrote:

> Hi All,
> I have been trying to serialize a dataframe in protobuf format. So far, I
> have been able to serialize every row of the dataframe by using the map
> function, with the serialization logic inside the lambda function. The
> resultant dataframe consists of rows in serialized format (1 row = 1
> serialized message).
> I wish to form a single protobuf serialized message for this dataframe, and
> in order to do that I need to combine all the serialized rows using some
> custom logic, very similar to the one used in the map operation.
> I am assuming that this would be possible by using the reduce operation on
> the dataframe; however, I am unaware of how to go about it.
> Any suggestions/approach would be much appreciated.
>
> Thanks,
> Rishikesh
>


Re: Spark Standalone - Failing to pass extra java options to the driver in cluster mode

2019-08-19 Thread Alex Landa
Thanks Jungtaek Lim,
I upgraded the cluster to 2.4.3 and it worked fine.

Thanks,
Alex

On Mon, Aug 19, 2019 at 10:01 PM Jungtaek Lim  wrote:

> Hi Alex,
>
> You seem to be hitting SPARK-26606 [1], which has been fixed in 2.4.1.
> Could you try it out with the latest version?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-26606
>
> On Tue, Aug 20, 2019 at 3:43 AM Alex Landa  wrote:
>
>> Hi,
>>
>> We are using Spark Standalone 2.4.0 in production and publishing our
>> Scala app using cluster mode.
>> I noticed that extra Java options passed to the driver are not actually applied.
>> A submit example:
>> *spark-submit --deploy-mode cluster --master spark://:7077
>> --driver-memory 512mb --conf
>> "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError" --class
>> App  app.jar *
>>
>> This doesn't pass -XX:+HeapDumpOnOutOfMemoryError as a JVM argument, but
>> passes -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError instead.
>> I created a test app to check it:
>>
>> val spark = SparkSession.builder()
>>   .master("local")
>>   .appName("testApp").getOrCreate()
>> import spark.implicits._
>>
>> // get a RuntimeMXBean reference
>> val runtimeMxBean = ManagementFactory.getRuntimeMXBean
>>
>> // get the jvm's input arguments as a list of strings
>> val listOfArguments = runtimeMxBean.getInputArguments
>>
>> // print the arguments
>> listOfArguments.asScala.foreach(a => println(s"ARG: $a"))
>>
>>
>> I see that in client mode I get:
>> ARG: -XX:+HeapDumpOnOutOfMemoryError
>> while in cluster mode I get:
>> ARG: -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError
>>
>> I would appreciate your help in working around this issue.
>> Thanks,
>> Alex
>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


Running Spark history Server at Context localhost:18080/sparkhistory

2019-08-19 Thread Sandish Kumar HN
Hi,

I want to run the Spark History Server at the context path
localhost:18080/sparkhistory instead of at the root localhost:18080.

The end goal is to access the Spark History Server under a domain name, i.e.
domainname/sparkhistory.

Are there any hacks or Spark config options to achieve this?
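
One untested idea (assuming the history server honors spark.ui.proxyBase and
sits behind a reverse proxy that maps /sparkhistory to port 18080) would be to
set, e.g. in spark-defaults.conf:

spark.ui.proxyBase    /sparkhistory

I'm not sure that is enough on its own, hence the question.
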
-- 

Thanks,
Regards,
SandishKumar HN


Re: How to combine all rows into a single row in DataFrame

2019-08-19 Thread Marcin Tustin
It sounds like you want to aggregate your rows in some way. I actually just
wrote a blog post about that topic:
https://medium.com/@albamus/spark-aggregating-your-data-the-fast-way-e37b53314fad
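
For example, a typed Aggregator over the already-serialized rows could look
roughly like this (a rough sketch, not taken from the post; mergeMessages is
a placeholder for your protobuf merge logic, and it must be associative and
handle the empty initial buffer):

import org.apache.spark.sql.{Dataset, Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// placeholder: merge two serialized protobuf messages into one
def mergeMessages(a: Array[Byte], b: Array[Byte]): Array[Byte] = ???

val combineAll = new Aggregator[Array[Byte], Array[Byte], Array[Byte]] {
  def zero: Array[Byte] = Array.emptyByteArray                // empty message
  def reduce(buf: Array[Byte], row: Array[Byte]): Array[Byte] = mergeMessages(buf, row)
  def merge(b1: Array[Byte], b2: Array[Byte]): Array[Byte] = mergeMessages(b1, b2)
  def finish(reduction: Array[Byte]): Array[Byte] = reduction
  def bufferEncoder: Encoder[Array[Byte]] = Encoders.BINARY
  def outputEncoder: Encoder[Array[Byte]] = Encoders.BINARY
}

// serializedRows: Dataset[Array[Byte]], one serialized message per row
def combine(serializedRows: Dataset[Array[Byte]]): Array[Byte] =
  serializedRows.select(combineAll.toColumn).head()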

On Mon, Aug 19, 2019 at 4:24 PM Rishikesh Gawade 
wrote:

> Hi All,
> I have been trying to serialize a dataframe in protobuf format. So far, I
> have been able to serialize every row of the dataframe by using the map
> function, with the serialization logic inside the lambda function. The
> resultant dataframe consists of rows in serialized format (1 row = 1
> serialized message).
> I wish to form a single protobuf serialized message for this dataframe, and
> in order to do that I need to combine all the serialized rows using some
> custom logic, very similar to the one used in the map operation.
> I am assuming that this would be possible by using the reduce operation on
> the dataframe; however, I am unaware of how to go about it.
> Any suggestions/approach would be much appreciated.
>
> Thanks,
> Rishikesh
>


How to combine all rows into a single row in DataFrame

2019-08-19 Thread Rishikesh Gawade
Hi All,
I have been trying to serialize a dataframe in protobuf format. So far, I
have been able to serialize every row of the dataframe by using the map
function, with the serialization logic inside the lambda function. The
resultant dataframe consists of rows in serialized format (1 row = 1
serialized message).
I wish to form a single protobuf serialized message for this dataframe, and
in order to do that I need to combine all the serialized rows using some
custom logic, very similar to the one used in the map operation.
I am assuming that this would be possible by using the reduce operation on
the dataframe; however, I am unaware of how to go about it.
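
Roughly, what I have in mind is something like this (an untested sketch;
serializeRow and mergeMessages are placeholders for my protobuf logic, and I
understand the combining function would need to be associative for reduce to
be correct):

import org.apache.spark.sql.Row

// placeholders for the actual protobuf logic
def serializeRow(row: Row): Array[Byte] = ???
def mergeMessages(a: Array[Byte], b: Array[Byte]): Array[Byte] = ???

val singleMessage: Array[Byte] = df.rdd   // df: the dataframe in question
  .map(serializeRow)                      // 1 row -> 1 serialized message (what I already have)
  .reduce(mergeMessages)                  // combine all messages into a single one
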
Any suggestions/approach would be much appreciated.

Thanks,
Rishikesh


Re: Spark Standalone - Failing to pass extra java options to the driver in cluster mode

2019-08-19 Thread Jungtaek Lim
Hi Alex,

You seem to be hitting SPARK-26606 [1], which has been fixed in 2.4.1. Could
you try it out with the latest version?

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-26606

On Tue, Aug 20, 2019 at 3:43 AM Alex Landa  wrote:

> Hi,
>
> We are using Spark Standalone 2.4.0 in production and publishing our Scala
> app using cluster mode.
> I noticed that extra Java options passed to the driver are not actually applied.
> A submit example:
> *spark-submit --deploy-mode cluster --master spark://:7077
> --driver-memory 512mb --conf
> "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError" --class
> App  app.jar *
>
> This doesn't pass -XX:+HeapDumpOnOutOfMemoryError as a JVM argument, but
> passes -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError instead.
> I created a test app to check it:
>
> val spark = SparkSession.builder()
>   .master("local")
>   .appName("testApp").getOrCreate()
> import spark.implicits._
>
> // get a RuntimeMXBean reference
> val runtimeMxBean = ManagementFactory.getRuntimeMXBean
>
> // get the jvm's input arguments as a list of strings
> val listOfArguments = runtimeMxBean.getInputArguments
>
> // print the arguments
> listOfArguments.asScala.foreach(a => println(s"ARG: $a"))
>
>
> I see that in client mode I get:
> ARG: -XX:+HeapDumpOnOutOfMemoryError
> while in cluster mode I get:
> ARG: -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError
>
> I would appreciate your help in working around this issue.
> Thanks,
> Alex
>
>

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior


Spark Standalone - Failing to pass extra java options to the driver in cluster mode

2019-08-19 Thread Alex Landa
Hi,

We are using Spark Standalone 2.4.0 in production and publishing our Scala
app using cluster mode.
I noticed that extra Java options passed to the driver are not actually applied.
A submit example:
spark-submit --deploy-mode cluster --master spark://:7077
--driver-memory 512mb --conf
"spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError" --class
App  app.jar

This doesn't pass -XX:+HeapDumpOnOutOfMemoryError as a JVM argument, but
passes -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError instead.
I created a test app to check it:

import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("testApp").getOrCreate()
import spark.implicits._

// get a RuntimeMXBean reference
val runtimeMxBean = ManagementFactory.getRuntimeMXBean

// get the JVM's input arguments as a list of strings
val listOfArguments = runtimeMxBean.getInputArguments

// print the arguments
listOfArguments.asScala.foreach(a => println(s"ARG: $a"))


I see that in client mode I get:
ARG: -XX:+HeapDumpOnOutOfMemoryError
while in cluster mode I get:
ARG: -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError

I would appreciate your help in working around this issue.
Thanks,
Alex


Spark and ZStandard

2019-08-19 Thread Antoine DUBOIS
Hello,
I'm using Hadoop 3.1.2 with YARN and Spark 2.4.2.
I'm trying to read files compressed with the zstd command-line tool from the
Spark shell. After a long struggle with library imports and other issues, I no
longer get errors when trying to read those files.
However, if I read a simple zstd file in spark-shell, it opens, but the output
is scrambled with unprintable characters all over the place and the file is
not really readable at all.
Is there any way to read a zstd file generated by the official Zstandard tool
in Spark, or any way to generate a zstd file that Spark can read?

Best regards 

Antoine 

