I was curious whether there is any introspection of typed functions, and ran
the following two queries:
ds.where($"col" > 1).explain
ds.filter(_.col > 1).explain
I found that the typed function does NOT result in a PushedFilter. I
imagine this is because the optimizer has only a limited view of the function, so I have
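For anyone reproducing this, here is a self-contained sketch of the comparison. The schema, paths, and session setup are my own assumptions, not from the original message:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: compare the physical plans of an untyped (Column) filter and a
// typed (lambda) filter over a Parquet source. The schema below is made up.
object PushdownDemo {
  case class Rec(col: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pushdown").getOrCreate()
    import spark.implicits._

    // Write a small Parquet file so the scan has something to push filters into.
    val path = java.nio.file.Files.createTempDirectory("pushdown").resolve("data").toString
    Seq(Rec(0), Rec(1), Rec(2)).toDS().write.parquet(path)
    val ds = spark.read.parquet(path).as[Rec]

    // Column expression: Catalyst can inspect the predicate, so the scan
    // node should list it under PushedFilters.
    ds.where($"col" > 1).explain()

    // Typed lambda: an opaque JVM function that is only applied after
    // deserializing each row, so no PushedFilters entry for it.
    ds.filter(_.col > 1).explain()

    spark.stop()
  }
}
```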
And to be clear, are you doing a self-join for approx similarity? Or
joining to another dataset?
On Thu, 23 Feb 2017 at 02:01, nguyen duc Tuan wrote:
> Hi Seth,
> Here's the parameters that I used in my experiments.
> - Number of executors: 16
> - Executor's memories:
Hi,
Is there any limit on the number of tasks per stage attempt?
Thanks,
Parag
Thanks!
If Spark does not log these events in the event log, then why does the Spark
history server provide an API to get RDD information?
From the documentation,
/applications/[app-id]/storage/rdd A list of stored RDDs for the given
application.
/applications/[app-id]/storage/rdd/[rdd-id] Details for
Hi,
I am writing a DataFrame as TSV using DataFrameWriter as follows:
myDF.write.mode("overwrite").option("sep","\t").csv("/out/path")
The problem is that all part files have a .csv extension instead of .tsv, e.g.:
part-r-00012-f9f06712-1648-4eb6-985b-8a9c79267eef.csv
All the records are stored in TSV
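There is no option on the writer for the part-file suffix; one workaround (a sketch only, and it assumes the output is on the local filesystem; for HDFS you would use Hadoop's FileSystem.rename instead) is to rename the files after the write:

```scala
import java.io.File

// Sketch: rename every part-*.csv file in an output directory to .tsv.
// The file contents are already tab-separated; only the suffix changes.
object CsvToTsv {
  def rename(dir: String): Unit =
    Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.getName.startsWith("part-") && f.getName.endsWith(".csv"))
      .foreach { f =>
        val renamed = new File(dir, f.getName.stripSuffix(".csv") + ".tsv")
        f.renameTo(renamed)
      }
}
```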
It is too verbose, and would significantly increase the size of the event log.
Here is the comment in the code:
// No-op because logging every update would be overkill
> override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {}
>
>
On Thu, Feb 23, 2017 at 11:42 AM, Parag Chaudhari
Thanks a lot for the information!
Is there any reason why EventLoggingListener ignores this event?
Thanks,
Parag
On Wed, Feb 22, 2017 at 7:11 PM, Saisai Shao wrote:
> AFAIK, Spark's EventLoggingListener ignores the BlockUpdated event, so it will
> not be written into
AFAIK, Spark's EventLoggingListener ignores the BlockUpdated event, so it will
not be written into the event log; I think that's why you cannot get such info
in the history server.
On Thu, Feb 23, 2017 at 9:51 AM, Parag Chaudhari
wrote:
> Hi,
>
> I am running spark shell in spark
Hi,
I am running the Spark shell on Spark version 2.0.2. Here is my program:
val myrdd = sc.parallelize(Array.range(1, 10))
myrdd.setName("test")
myrdd.cache
myrdd.collect
But I am not able to see any RDD info in the "storage" tab of the Spark history
server.
I looked at this
If you're talking about the version of scala used to build the broker,
that shouldn't matter.
If you're talking about the version of scala used for the kafka client
dependency, it shouldn't have compiled at all to begin with.
On Wed, Feb 22, 2017 at 12:11 PM, Muhammad Haseeb Javed
Hello,
Had a question. When I look at the executors tab in Spark UI, I notice that
some RDD blocks are assigned to the driver as well. Can someone please tell me
why?
Thanks for the help.
Hi Seth,
Here's the parameters that I used in my experiments.
- Number of executors: 16
- Executor memory: varied from 1G -> 2G -> 3G
- Number of cores per executor: 1 -> 2
- Driver memory: 1G -> 2G -> 3G
- Similarity threshold: 0.6
MinHash:
- number of hash tables: 2
SignedRandomProjection:
Hi,
As I am investigating, among other things, the fault recovery capabilities of Spark,
I’ve been curious: which source code artifact initiates the parallel recovery
process? In addition, how is a faulty node detected (from the driver's point of
view)?
Thanks in advance,
Dominik
I just noticed that the Spark version I am using (2.0.2) is built with
Scala 2.11. However, I am using Kafka 0.8.2 built with Scala 2.10. Could
this be the reason why we are getting this error?
On Mon, Feb 20, 2017 at 5:50 PM, Cody Koeninger wrote:
> So there's no reason to
If I understand correctly, you need to create a UDF. If you are using Java,
extend the appropriate UDF interface (e.g. UDF1, UDF2, etc., depending on the
number of arguments) and keep this static list as a member variable in your class.
You can use this UDF as a filter in your stream directly.
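The advice above is phrased for the Java API; a minimal Scala equivalent might look like the following sketch (the allow-list, column name, and data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Sketch: keep a static allow-list as a member variable and use a UDF
// built from it directly as a filter.
object AllowListFilter {
  val allowed: Set[String] = Set("a", "b", "c")
  def keep(s: String): Boolean = allowed.contains(s)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-filter").getOrCreate()
    import spark.implicits._

    val inAllowList = udf((s: String) => keep(s))
    val df = Seq("a", "x", "b").toDF("value")
    // Rows whose value is not in the allow-list are dropped.
    df.filter(inAllowList($"value")).show()

    spark.stop()
  }
}
```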
On Tue, Feb 21, 2017 at
Hello,
I'm using Spark 2.1.0 and Hadoop 2.2.0.
When I launch jobs on YARN, I can retrieve their information on the Spark History
Server, except that the links to the stdout/stderr of the executors are wrong:
they point to the URLs that were used while the job was running.
We have the flag
I'm looking into this a bit further, thanks for bringing it up! Right now
the LSH implementation only uses OR-amplification. The practical
consequence of this is that it will select too many candidates when doing
approximate near neighbor search and approximate similarity join. When we
add
Spark offers a receiver-based approach or direct approach with Kafka (
https://spark.apache.org/docs/2.1.0/streaming-kafka-0-8-integration.html),
and a note in the receiver-based approach says "topic partitions in Kafka
does not correlate to partitions of RDDs generated in Spark Streaming."
A fix
Your error message is not clear about what really happened.
Was your container killed by YARN, or did it indeed run OOM?
When I run a Spark job on big data, here is what I normally do:
1) Enable GC output. You need to monitor the GC output in the executor, to
understand the GC pressure.
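For step 1, the GC output can be enabled with standard JVM logging flags on the executors, e.g. via spark-defaults.conf or --conf on spark-submit (these are the usual JVM GC flags; exact choices depend on your JVM and workload):

```
# spark-defaults.conf (or pass with --conf on spark-submit)
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```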
The Apache Bahir community is pleased to announce the release
of Apache Bahir 2.1.0 which provides the following extensions for
Apache Spark 2.1.0:
- Akka Streaming
- MQTT Streaming
- MQTT Structured Streaming
- Twitter Streaming
- ZeroMQ Streaming
For more information about