Re: SparkR + binary type + how to get value

2019-02-19 Thread Felix Cheung
from the second image it looks like there is a protocol mismatch. I’d check
whether the SparkR package running on the Livy machine matches the Spark Java
release.

But in any case this seems more of an issue with the Livy config. I’d suggest
checking with the community there:




From: Thijs Haarhuis 
Sent: Tuesday, February 19, 2019 5:28 AM
To: Felix Cheung; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

Hi Felix,

Thanks. I got it working now by using the unlist function.

I have another question you may be able to help me with, since I saw your name
popping up regarding the spark.lapply function.
I am using Apache Livy and am having trouble using this function; I have
reported a Jira ticket for it at:
https://jira.apache.org/jira/browse/LIVY-558

When I call the spark.lapply function it reports that SparkR is not initialized.
I have looked into the spark.lapply function and it seems there is no Spark
context.
Any idea how I can debug this?

I hope you can help.

Regards,
Thijs


From: Felix Cheung 
Sent: Sunday, February 17, 2019 7:18 PM
To: Thijs Haarhuis; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

A byte buffer in R is the raw vector type, so it seems like it is working as
expected. What do you have in the raw bytes? You could convert them into other
types or access individual bytes directly...

https://stat.ethz.ch/R-manual/R-devel/library/base/html/raw.html



From: Thijs Haarhuis 
Sent: Thursday, February 14, 2019 4:01 AM
To: Felix Cheung; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

Hi Felix,
Sure..

I have the following code:

  # print the schema of the Spark DataFrame
  printSchema(results)
  cat("\n\n\n")

  # take the first row and pull out the value column
  firstRow <- first(results)
  value <- firstRow$value

  # print the R type of the collected value, then the value itself
  cat(paste0("Value Type: '", typeof(value), "'\n\n\n"))
  cat(paste0("Value: '", value, "'\n\n\n"))

results is a Spark Data Frame here.

When I run this code, the following is printed to the console:

[inline screenshot of the console output: the schema shows a single column of type binary]

You can see there is only a single column of type binary in this Spark DataFrame.
When I collect this value and print its type, it prints that it is a list.

Any idea how to get the actual value, or how to process the individual bytes?

Thanks
Thijs


From: Felix Cheung 
Sent: Thursday, February 14, 2019 5:31 AM
To: Thijs Haarhuis; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

Please share your code



From: Thijs Haarhuis 
Sent: Wednesday, February 13, 2019 6:09 AM
To: user@spark.apache.org
Subject: SparkR + binary type + how to get value


Hi all,



Does anybody have any experience in accessing the data from a column which has 
a binary type in a Spark Data Frame in R?

I have a Spark Data Frame which has a column which is of a binary type. I want 
to access this data and process it.

In my case I collect the Spark data frame to an R data frame and access the
first row.

When I print this row to the console it does print all the hex values correctly.



However, when I access the column it prints that it is a list of 1 … when I print
the type of the child element, it again prints that it is a list.

I expected this value to be of a raw type.



Does anybody have some experience with this?



Thanks

Thijs




Losing system properties on executor side, if context is checkpointed

2019-02-19 Thread Dmitry Goldenberg
Hi all,

I'm seeing an odd behavior where if I switch the context from regular to
checkpointed, the system properties are no longer automatically carried
over into the worker / executors and turn out to be null there.

This is in Java, using spark-streaming_2.10, version 1.5.0.

I'm placing the properties into a Properties object and passing it over to the
worker logic. I would think I shouldn't have to do this; if I just set the
properties on a regular (non-checkpointed) context, they get automatically
carried over to the worker side.

Is this something fixed or changed in the later versions of Spark?

This is what I ended up doing in the driver program:

  private JavaStreamingContext createCheckpointedContext(SparkConf sparkConf,
      Parameters params) {
    JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
      @Override
      public JavaStreamingContext create() {
        return createContext(sparkConf, params);
      }
    };
    return JavaStreamingContext.getOrCreate(params.getCheckpointDir(), factory);
  }

  private JavaStreamingContext createContext(SparkConf sparkConf,
      Parameters params) {
    // Create context with the specified batch interval, in milliseconds.
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
        Durations.milliseconds(params.getBatchDurationMillis()));

    // Set the checkpoint directory, if we're checkpointing.
    if (params.isCheckpointed()) {
      jssc.checkpoint(params.getCheckpointDir());
    }

    // Copy the executor-env entries from the SparkConf into a plain Properties
    // object (forEach instead of a bare map(), which would never execute).
    Properties props = new Properties();
    JavaConverters.seqAsJavaListConverter(sparkConf.getExecutorEnv())
        .asJava()
        .forEach(x -> props.setProperty(x._1(), x._2()));

    // ... Create Direct Stream from Kafka ...

    // Generic types below are restored on the assumption that messageBodies
    // is a JavaDStream<String>.
    messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) throws Exception {
        ProcessPartitionFunction func = new ProcessPartitionFunction(
            props, // <-- Had to pass that through, so this works in a
                   // checkpointed scenario
            params.getAppName(),
            params.getTopic(),
            ...);
        rdd.foreachPartition(func);
        return null;
      }
    });

Would appreciate any recommendations/clues,
Thanks,
- Dmitry


Re: Difference between dataset and dataframe

2019-02-19 Thread Vadim Semenov
>
> 1) Is there any difference in terms performance when we use datasets over
> dataframes? Is it significant to choose 1 over other. I do realise there
> would be some overhead due case classes but how significant is that? Are
> there any other implications.


As long as you use the DataFrame functions the performance is going to be the
same, since they operate directly on Tungsten rows. But as soon as you try to do
any typed operations like `.map`, performance takes a hit because Spark has to
create Java objects from Tungsten memory.
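
To make the boundary concrete, here is a minimal Scala sketch (the SparkSession
setup and the Trade case class are illustrative, not from this thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("df-vs-ds").getOrCreate()
import spark.implicits._

// Hypothetical record type, just to have something typed to work with.
case class Trade(id: Long, amount: Double)
val ds = Seq(Trade(1L, 10.0), Trade(2L, 20.0)).toDS()

// Column-based transformation: Catalyst plans it directly on Tungsten rows,
// no per-record JVM objects are created.
val viaColumns = ds.select(col("id"), (col("amount") * 1.1).as("amount"))

// Typed operation: each row is first deserialized into a Trade object, the
// lambda runs on that object, and the result is serialized back.
val viaTypedMap = ds.map(t => t.copy(amount = t.amount * 1.1))

Both produce the same result; only the second one crosses the object boundary
described above.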

> 2) Is the Tungsten code generation done only for datasets or is there any
> internal process to generate bytecode for dataframes as well? Since its
> related to jvm , I think its just for datasets but I couldn’t find anything
> that tells it specifically. If its just for datasets , does that mean we
> miss out on the project tungsten optimisation for dataframes?


Code generation is done for both
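
If you want to see the generated code for yourself, Spark exposes it through
the debug helpers; a hedged Scala sketch, assuming a SparkSession named spark
and an arbitrary query:

import org.apache.spark.sql.execution.debug._  // adds debug() / debugCodegen()

val df = spark.range(0, 1000).selectExpr("id * 5 as id")

// Prints the whole-stage generated Java code for this plan; the same call
// works on a typed Dataset as well.
df.debugCodegen()

Dhaval's PySpark snippet further down in this thread does the same thing
through _jdf.queryExecution().debug().codegen().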



On Mon, Feb 18, 2019 at 9:09 PM Akhilanand  wrote:

>
> Hello,
>
> I have been recently exploring about dataset and dataframes. I would
> really appreciate if someone could answer these questions:
>
> 1) Is there any difference in terms performance when we use datasets over
> dataframes? Is it significant to choose 1 over other. I do realise there
> would be some overhead due case classes but how significant is that? Are
> there any other implications.
>
> 2) Is the Tungsten code generation done only for datasets or is there any
> internal process to generate bytecode for dataframes as well? Since its
> related to jvm , I think its just for datasets but I couldn’t find anything
> that tells it specifically. If its just for datasets , does that mean we
> miss out on the project tungsten optimisation for dataframes?
>
>
>
> Regards,
> Akhilanand BV
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Sent from my iPhone


Re: Difference between dataset and dataframe

2019-02-19 Thread Koert Kuipers
dataframe operations are expressed as transformations on columns, basically
on locations inside the row objects. this specificity can be exploited by
catalyst to optimize these operations. since catalyst knows exactly what
positions in the row object you modified (or not) at any point, and often also
what operation you did on them, it can reason about these and do optimizations
like re-ordering operations, compiling operations, and running operations on
the serialized/internal format.

when you use case classes and lambda operations, not as much information is
available and the operation cannot be performed on the internal
representation, so conversions and/or deserializations are necessary.
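
you can see those conversions show up in the physical plan; a small scala
sketch (the Person case class and the spark session are made up for
illustration):

import spark.implicits._
import org.apache.spark.sql.functions.upper

case class Person(name: String, age: Int)
val people = Seq(Person("ann", 35), Person("bob", 42)).toDS()

// column expression: the plan is a plain Project over the internal rows
people.select(upper($"name")).explain()

// lambda on the case class: the plan gains DeserializeToObject, MapElements
// and SerializeFromObject nodes -- the conversions mentioned above
people.map(p => p.name.toUpperCase).explain()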

On Tue, Feb 19, 2019 at 12:59 AM Lunagariya, Dhaval <
dhaval.lunagar...@citi.com> wrote:

> It does for dataframes also. Please try this example.
>
>
>
> df1 = spark.range(2, 1000, 2)
>
> df2 = spark.range(2, 1000, 4)
>
> step1 = df1.repartition(5)
>
> step12 = df2.repartition(6)
>
> step2 = step1.selectExpr("id * 5 as id")
>
> step3 = step2.join(step12, ["id"])
>
> step4 = step3.selectExpr("sum(id)")
>
> step4.collect()
>
>
>
> step4._jdf.queryExecution().debug().codegen()
>
>
>
> You will see the generated code.
>
>
>
> Regards,
>
> Dhaval
>
>
>
> *From:* [External] Akhilanand 
> *Sent:* Tuesday, February 19, 2019 10:29 AM
> *To:* Koert Kuipers 
> *Cc:* user 
> *Subject:* Re: Difference between dataset and dataframe
>
>
>
> Thanks for the reply. But can you please tell me why dataframes are more
> performant than datasets? Any specifics would be helpful.
>
>
>
> Also, could you comment on the tungsten code gen part of my question?
>
>
> On Feb 18, 2019, at 10:47 PM, Koert Kuipers  wrote:
>
> in the api DataFrame is just Dataset[Row]. so this makes you think Dataset
> is the generic api. interestingly enough under the hood everything is
> really Dataset[Row], so DataFrame is really the "native" language for spark
> sql, not Dataset.
>
>
>
> i find DataFrame to be significantly more performant. in general if you
> use Dataset you miss out on some optimizations. also Encoders are not very
> pleasant to work with.
>
>
>
> On Mon, Feb 18, 2019 at 9:09 PM Akhilanand 
> wrote:
>
>
> Hello,
>
> I have been recently exploring about dataset and dataframes. I would
> really appreciate if someone could answer these questions:
>
> 1) Is there any difference in terms performance when we use datasets over
> dataframes? Is it significant to choose 1 over other. I do realise there
> would be some overhead due case classes but how significant is that? Are
> there any other implications.
>
> 2) Is the Tungsten code generation done only for datasets or is there any
> internal process to generate bytecode for dataframes as well? Since its
> related to jvm , I think its just for datasets but I couldn’t find anything
> that tells it specifically. If its just for datasets , does that mean we
> miss out on the project tungsten optimisation for dataframes?
>
>
>
> Regards,
> Akhilanand BV
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Looking for an apache spark mentor

2019-02-19 Thread Robert Kaye


> On Feb 19, 2019, at 2:26 PM, Shyam P  wrote:
> 
> Which IRC channel should we join?

I should’ve included info in the first place, heh. Sorry:

#metabrainz on freenode, please.

I am ruaok, but pristine and iliekcomputers are also very much interested in 
learning more about Spark.

Thanks!

--

--ruaok

Robert Kaye -- r...@metabrainz.org -- http://metabrainz.org



Re: SparkR + binary type + how to get value

2019-02-19 Thread Thijs Haarhuis
Hi Felix,

Thanks. I got it working now by using the unlist function.

I have another question you may be able to help me with, since I saw your name
popping up regarding the spark.lapply function.
I am using Apache Livy and am having trouble using this function; I have
reported a Jira ticket for it at:
https://jira.apache.org/jira/browse/LIVY-558

When I call the spark.lapply function it reports that SparkR is not initialized.
I have looked into the spark.lapply function and it seems there is no Spark
context.
Any idea how I can debug this?

I hope you can help.

Regards,
Thijs



Re: Looking for an apache spark mentor

2019-02-19 Thread Shyam P
Which IRC channel should we join?

On Tue, 19 Feb 2019, 17:56 Robert Kaye,  wrote:

> Hello!
>
> I’m Robert Kaye from the MetaBrainz Foundation — we’re the people behind
> MusicBrainz ( https://musicbrainz.org ) and more recently ListenBrainz (
> https://listenbrainz.org ). ListenBrainz is aiming to re-create what
> last.fm used to be — we’ve already got 200M listens (AKA scrobbles) from
> our users (which is not a lot, really). We’ve set up an Apache Spark cluster
> and are starting to build user listening statistics using this setup.
>
> While our setup is working, we can see that we’re not going to scale up
> well given our current approach. We’ve been trying to read the docs and ask
> for help on the IRC channel, but we continue to miss important bits about how
> we should be doing things. Best practices around Spark seem to be hard to
> come by. :(
>
> MetaBrainz is all open source and open data — any of the data we use is
> available for anyone to download — we’re a non-profit working hard towards
> creating open source music recommendation engines. We’re hoping that
> someone could take us under their wing, turn up in our IRC channel and help
> us find the right path towards using Spark much more effectively than we’ve
> been so far.
>
> Is anyone on this list interested in helping out? Perhaps you know someone
> who might?
>
> Thanks!
>
> --
>
> --ruaok
>
> Robert Kaye -- r...@metabrainz.org -- http://metabrainz.org
>
>


Spark on Kubernetes with persistent local storage

2019-02-19 Thread Arne Zachlod
Hello,

I'm trying to host Spark applications on a Kubernetes cluster and want
to provide localized persistent storage to the Spark workers in a small
research project I'm currently doing.
I googled around a bit and found that HDFS seems to be pretty well
supported with Spark, but some problems arise with the localization of
data if I want to do this as outlined in this talk [1].
As far as I understand it, most of the configuration for deploying this
is in their git repo [2]. But the Spark driver needs some patch to map
the workers and the HDFS datanodes correctly to the Kubernetes nodes; is
something like this already part of the current Spark codebase as of
Spark 2.4.0? I had a look at the code but couldn't find anything related
to HDFS localization (pretty sure I just didn't look in the right place).

So, my question now is: is this even a viable option at the current
state of the project(s)? What storage solution would be recommended
instead if Spark on Kubernetes is a given (so no YARN/Mesos)?

Looking forward to your input.

Arne

[1] https://databricks.com/session/hdfs-on-kubernetes-lessons-learned
[2] https://github.com/apache-spark-on-k8s/kubernetes-HDFS

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Looking for an apache spark mentor

2019-02-19 Thread Robert Kaye
Hello!

I’m Robert Kaye from the MetaBrainz Foundation — we’re the people behind
MusicBrainz ( https://musicbrainz.org ) and more recently ListenBrainz
( https://listenbrainz.org ). ListenBrainz is aiming to re-create what
last.fm used to be — we’ve already got 200M listens (AKA scrobbles) from our
users (which is not a lot, really). We’ve set up an Apache Spark cluster and
are starting to build user listening statistics using this setup.

While our setup is working, we can see that we’re not going to scale up well
given our current approach. We’ve been trying to read the docs and ask for help
on the IRC channel, but we continue to miss important bits about how we should
be doing things. Best practices around Spark seem to be hard to come by. :(

MetaBrainz is all open source and open data — any of the data we use is 
available for anyone to download — we’re a non-profit working hard towards 
creating open source music recommendation engines. We’re hoping that someone 
could take us under their wing, turn up in our IRC channel and help us find the 
right path towards using Spark much more effectively than we’ve been so far.

Is anyone on this list interested in helping out? Perhaps you know someone who 
might?

Thanks!

--

--ruaok

Robert Kaye -- r...@metabrainz.org -- http://metabrainz.org