Is there a list of missing optimizations for typed functions?

2017-02-22 Thread Justin Pihony
I was curious if there was introspection of certain typed functions and ran the following two queries:

    ds.where($"col" > 1).explain
    ds.filter(_.col > 1).explain

and found that the typed function does NOT result in a PushedFilter. I imagine this is due to a limited view of the function, so I have
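
A minimal sketch of the comparison (assuming import spark.implicits._ and a Dataset[T] of a case class with an integer field col; names are illustrative):

    // Untyped Column expression: Catalyst can inspect the predicate,
    // so a data source can report it under PushedFilters in the plan.
    ds.where($"col" > 1).explain()

    // Typed lambda: the predicate is an opaque Scala closure, so Catalyst
    // cannot analyze it and no PushedFilters entry appears.
    ds.filter(_.col > 1).explain()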

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread Nick Pentreath
And to be clear, are you doing a self-join for approx similarity? Or joining to another dataset?

Is there any limit on number of tasks per stage attempt?

2017-02-22 Thread Parag Chaudhari
Hi, Is there any limit on the number of tasks per stage attempt? Thanks, Parag

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-22 Thread Parag Chaudhari
Thanks! If Spark does not log these events in the event log, then why does the Spark history server provide an API to get RDD information? From the documentation:

    /applications/[app-id]/storage/rdd            A list of stored RDDs for the given application.
    /applications/[app-id]/storage/rdd/[rdd-id]   Details for
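
For reference, a hedged example of calling those endpoints on a history server (host, port, and application ID here are hypothetical; 18080 is the default history server port):

    curl http://history-server:18080/api/v1/applications/app-20170222120000-0001/storage/rdd
    curl http://history-server:18080/api/v1/applications/app-20170222120000-0001/storage/rdd/0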

DataframeWriter - How to change filename extension

2017-02-22 Thread Nirav Patel
Hi, I am writing a DataFrame as TSV using DataFrameWriter as follows:

    myDF.write.mode("overwrite").option("sep", "\t").csv("/out/path")

The problem is that all part files get a .csv extension instead of .tsv, e.g.:

    part-r-00012-f9f06712-1648-4eb6-985b-8a9c79267eef.csv

All the records are stored in TSV
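
As far as I know, DataFrameWriter does not expose an option to change the extension, so one possible workaround (a sketch, assuming a SparkSession named spark and the output path above) is to rename the part files afterwards with the Hadoop FileSystem API:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path("/out/path"))
      .map(_.getPath)
      .filter(_.getName.endsWith(".csv"))
      .foreach { p =>
        // Rename part-...csv to part-...tsv in place.
        fs.rename(p, new Path(p.getParent, p.getName.stripSuffix(".csv") + ".tsv"))
      }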

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-22 Thread Saisai Shao
It is too verbose, and would significantly increase the size of the event log. Here is the comment in the code:

    // No-op because logging every update would be overkill
    override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {}

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-22 Thread Parag Chaudhari
Thanks a lot for the information! Is there any reason why EventLoggingListener ignores this event? Thanks, Parag

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-22 Thread Saisai Shao
AFAIK, Spark's EventLoggingListener ignores block update events, so they will not be written into the event log. I think that's why you cannot get such info in the history server.

Why spark history server does not show RDD even if it is persisted?

2017-02-22 Thread Parag Chaudhari
Hi, I am running the spark shell with Spark version 2.0.2. Here is my program:

    var myrdd = sc.parallelize(Array.range(1, 10))
    myrdd.setName("test")
    myrdd.cache
    myrdd.collect

But I am not able to see any RDD info in the "Storage" tab of the Spark history server. I looked at this

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-22 Thread Cody Koeninger
If you're talking about the version of Scala used to build the broker, that shouldn't matter. If you're talking about the version of Scala used for the Kafka client dependency, it shouldn't have compiled at all to begin with.

RDD blocks on Spark Driver

2017-02-22 Thread prithish
Hello, I had a question. When I look at the Executors tab in the Spark UI, I notice that some RDD blocks are assigned to the driver as well. Can someone please tell me why? Thanks for the help.

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread nguyen duc Tuan
Hi Seth, Here are the parameters that I used in my experiments:
- Number of executors: 16
- Executor memory: varied from 1G to 2G to 3G
- Number of cores per executor: 1 to 2
- Driver memory: 1G to 2G to 3G
- Similarity threshold: 0.6
MinHash:
- number of hash tables: 2
SignedRandomProjection:
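
For reference, a minimal sketch of the kind of job under discussion (the input Dataset and its "features" vector column are illustrative; note that approxSimilarityJoin takes a distance threshold):

    import org.apache.spark.ml.feature.MinHashLSH

    val mh = new MinHashLSH()
      .setNumHashTables(2)   // as in the experiments above
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = mh.fit(dataset)
    // Self-join of the dataset against itself at the 0.6 threshold.
    val pairs = model.approxSimilarityJoin(dataset, dataset, 0.6)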

Spark Streaming - parallel recovery

2017-02-22 Thread Dominik Safaric
Hi, While investigating, among other things, the fault recovery capabilities of Spark, I've been curious: which source code artifact initiates the parallel recovery process? In addition, how is a faulty node detected (from the driver's point of view)? Thanks in advance, Dominik

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-22 Thread Muhammad Haseeb Javed
I just noticed that the Spark version I am using (2.0.2) is built with Scala 2.11. However, I am using Kafka 0.8.2 built with Scala 2.10. Could this be the reason why we are getting this error?

Re: Spark Streaming: Using external data during stream transformation

2017-02-22 Thread Abhisheks
If I understand correctly, you need to create a UDF (if you are using Java, extend the appropriate UDF interface, e.g. UDF1, UDF2, etc., depending on the number of arguments) and keep this static list as a member variable in your class. You can then use this UDF as a filter on your stream directly; see the sketch below.
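
A minimal sketch of that suggestion in Scala (the allow-list, column name, and input DataFrame are illustrative):

    import org.apache.spark.sql.functions.{col, udf}

    // Static reference data captured in the UDF closure.
    val allowed = Set("a", "b", "c")
    val isAllowed = udf((key: String) => allowed.contains(key))

    // Filter the stream-derived DataFrame directly with the UDF.
    val filtered = inputDF.filter(isAllowed(col("key")))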

Executor links in Job History

2017-02-22 Thread yohann jardin
Hello, I'm using Spark 2.1.0 and Hadoop 2.2.0. When I launch jobs on YARN, I can retrieve their information on the Spark History Server, except that the links to the stdout/stderr of executors are wrong: they lead to the URLs that were used while the job was running. We have the flag

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread Seth Hendrickson
I'm looking into this a bit further, thanks for bringing it up! Right now the LSH implementation only uses OR-amplification. The practical consequence of this is that it will select too many candidates when doing approximate near neighbor search and approximate similarity join. When we add
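
For context (my own gloss, not from the thread): if a pair collides in a single hash table with probability p, then with OR-amplification over b independent tables it becomes a candidate with probability

    P(\text{candidate}) = 1 - (1 - p)^{b}

which only grows as b increases, so OR-amplification alone enlarges the candidate set; AND-amplification (concatenating hashes within a table) is what prunes it.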

Re: Spark executors in streaming app always uses 2 executors

2017-02-22 Thread Jon Gregg
Spark offers a receiver-based approach or a direct approach with Kafka ( https://spark.apache.org/docs/2.1.0/streaming-kafka-0-8-integration.html), and a note in the receiver-based approach says "topic partitions in Kafka does not correlate to partitions of RDDs generated in Spark Streaming." A fix
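
For comparison, a sketch of the direct approach from the 0.8 integration guide (assumes an existing StreamingContext named ssc; the broker address and topic are illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // Unlike the receiver approach, this creates one RDD partition
    // per Kafka topic partition.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))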

Re: Spark SQL : Join operation failure

2017-02-22 Thread Yong Zhang
Your error message is not clear about what really happens. Is your container killed by YARN, or does it indeed run OOM? When I run a Spark job with big data, here is normally what I do: 1) Enable GC output. You need to monitor the GC output on the executors to understand the GC pressure.
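
For point 1, a hedged example of turning on GC output for the executors (standard JVM flags passed via Spark conf; adjust to taste):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      ...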

[ANNOUNCE] Apache Bahir 2.1.0 Released

2017-02-22 Thread Christian Kadner
The Apache Bahir community is pleased to announce the release of Apache Bahir 2.1.0, which provides the following extensions for Apache Spark 2.1.0:
- Akka Streaming
- MQTT Streaming
- MQTT Structured Streaming
- Twitter Streaming
- ZeroMQ Streaming
For more information about