Found it: SPARK-21435
On Mon, May 7, 2018 at 2:18 PM Victor Tso-Guillen <v...@paxata.com> wrote:
> It appears that between 2.2.0 and 2.3.0 DataFrame.write.parquet() skips
> writing empty parquet files for empty partitions. Is this configurable? Is
> there a Jira that tracks this change?
It appears that between 2.2.0 and 2.3.0 DataFrame.write.parquet() skips
writing empty parquet files for empty partitions. Is this configurable? Is
there a Jira that tracks this change?
Thanks,
Victor
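For anyone who lands here from a search, a minimal sketch of the behavior change (paths and values are mine, not from this thread):

// Every partition is empty after the filter.
val ds = spark.range(100).repartition(8).filter(_ => false)
ds.write.parquet("/tmp/empty-parts")
// 2.2.0: eight zero-row part-*.parquet files are written.
// 2.3.0: empty partitions are skipped (SPARK-21435), so no part files
// appear; as far as I can tell there is no config to restore the old
// behavior.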
Along with Priya's email slightly earlier than this one, we also are seeing
this happen on Spark 1.5.2.
On Wed, Jul 13, 2016 at 1:26 AM Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> In Spark Streaming job, I see a Batch failed with following error. Haven't
> seen anything like
On Mon, Apr 13, 2015 at 3:24 PM, Victor Tso-Guillen v...@paxata.com
wrote:
How about this?
input.distinct().combineByKey((v: V) => 1, (agg: Int, x: V) => agg + 1,
(agg1: Int, agg2: Int) => agg1 + agg2).collect()
On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler deanwamp...@gmail.com
wrote:
The problem with using collect is that it will fail for large data sets,
as
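A distributed alternative to collect (a sketch; assumes input is an RDD of key/value pairs as above):

// Same per-key distinct count, but the result stays an RDD instead of
// an Array on the driver:
val counts = input.distinct().mapValues(_ => 1).reduceByKey(_ + _)
counts.saveAsTextFile("counts")  // or counts.take(20) to inspect a sample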
Something like this?
(2 to alphabetLength).toList.map(shift => Object.myFunction(inputRDD,
shift).map(v => shift -> v)).foldLeft(Object.myFunction(inputRDD, 1).map(v
=> 1 -> v))(_ union _)
which is an RDD[(Int, Char)]
Problem is that you can't play with RDDs inside of RDDs. The recursive
structure
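A compacted sketch of the flattening idea above (assuming Object.myFunction returns an RDD[Char]): compose the per-shift RDDs on the driver, then union, so no RDD ever nests inside another.

val perShift = (1 to alphabetLength).map { shift =>
  Object.myFunction(inputRDD, shift).map(v => shift -> v)
}
val flattened: RDD[(Int, Char)] = sc.union(perShift)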
That particular class you did find is under parquet/... which means it was
shaded. Did you build your application against a hadoop2.6 dependency? The
maven central repo only has 2.2 but HDP has its own repos.
On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist tsind...@gmail.com wrote:
I am running
, so
I'll fix the IP address reporting side for local mode in my code.
On Thu, Feb 26, 2015 at 8:32 PM, Victor Tso-Guillen v...@paxata.com wrote:
Of course, breakpointing on every status update and revive offers
invocation kept the problem from happening. Where could the race be?
On Thu, Feb 26
or the
local backend is not kicking the revive offers messaging at the right time,
but I have to dig into the code some more to nail the culprit. Anyone on
these lists have experience in those code areas who could help?
On Thu, Feb 26, 2015 at 2:27 AM, Victor Tso-Guillen v...@paxata.com wrote:
Thanks
(spark.rdd.compress,true) )
Thanks
Best Regards
On Thu, Feb 26, 2015 at 10:15 AM, Victor Tso-Guillen v...@paxata.com
wrote:
I'm getting this really reliably on Spark 1.2.1. Basically I'm in local
mode with parallelism at 8. I have 222 tasks and I never seem to get far
past 40. Usually in the 20s to 30s
https://issues.apache.org/jira/browse/SPARK-4516
Thanks
Best Regards
On Thu, Feb 26, 2015 at 3:26 PM, Victor Tso-Guillen v...@paxata.com
wrote:
Is there any potential problem from 1.1.1 to 1.2.1 with shuffle
dependencies that produce no data?
On Thu, Feb 26, 2015 at 1:56 AM, Victor Tso-Guillen v...@paxata.com
wrote:
The data is small. The job is composed of many small stages.
* I found that with fewer than 222 the problem exhibits
Love to hear some input on this. I did get a standalone cluster up on my
local machine and the problem didn't present itself. I'm pretty confident
that means the problem is in the LocalBackend or something near it.
On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen v...@paxata.com wrote:
Okay
Of course, breakpointing on every status update and revive offers
invocation kept the problem from happening. Where could the race be?
On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen v...@paxata.com wrote:
Love to hear some input on this. I did get a standalone cluster up on my
local
I'm getting this really reliably on Spark 1.2.1. Basically I'm in local
mode with parallelism at 8. I have 222 tasks and I never seem to get far
past 40. Usually in the 20s to 30s it will just hang. The last logging is
below, and a screenshot of the UI.
2015-02-25 20:39:55.779 GMT-0800 INFO
is
understandable, but how about heterogeneity with respect to memory?
On Thu, Dec 4, 2014 at 12:18 PM, Victor Tso-Guillen v...@paxata.com
wrote:
You'll have to decide which is more expensive in your heterogeneous
environment and optimize for the utilization of that. For example, you may
decide
I don't have a great answer for you. For us, we found a common divisor, not
necessarily a whole gigabyte, of the available memory of the different
hardware and used that as the amount of memory per worker and scaled the
number of cores accordingly so that every core in the system has the same
clarify this
doubt?
Regards
Karthik
On Wed, Dec 3, 2014 at 11:11 PM, Victor Tso-Guillen v...@paxata.com
wrote:
I don't have a great answer for you. For us, we found a common divisor,
not necessarily a whole gigabyte, of the available memory of the different
hardware and used
dirs.par.foreach { case (src, dest) =>
sc.textFile(src).process.saveAsFile(dest) }
Is that sufficient for you?
On Tuesday, December 2, 2014, Anselme Vignon anselme.vig...@flaminem.com
wrote:
Hi folks,
We have written a Spark job that scans multiple HDFS directories and
perform
Do you have a newline or some other strange character in an argument you
pass to Spark that includes that 61608?
On Wed, Sep 24, 2014 at 4:11 AM, jishnu.prat...@wipro.com wrote:
Hi ,
*I am getting this weird error while starting Worker. *
-bash-4.1$ spark-class
Really? What should we make of this?
24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor -
Exception in task ID 40599
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:789)
at
Never mind: https://issues.apache.org/jira/browse/SPARK-1476
On Wed, Sep 24, 2014 at 11:10 AM, Victor Tso-Guillen v...@paxata.com
wrote:
Really? What should we make of this?
24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor -
Exception in task ID 40599
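SPARK-1476 is the 2 GB ceiling on a single block (a java.nio ByteBuffer can't map more than Integer.MAX_VALUE bytes). The usual workaround is more, smaller partitions; a sketch with an assumed multiplier:

// No single cached or shuffle block should approach 2 GB:
val smaller = bigRdd.repartition(bigRdd.partitions.length * 4)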
You could pluck out each column into separate RDDs, sort them independently,
and zip them :)
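A sketch of that suggestion (the column accessors are assumed). One wrinkle: zip() needs identical partition sizes, which two independent sorts don't guarantee, so pairing by rank is safer:

val c1 = table.map(_._1).sortBy(identity).zipWithIndex.map(_.swap)
val c2 = table.map(_._2).sortBy(identity).zipWithIndex.map(_.swap)
val byRank = c1.join(c2).sortByKey().values  // (sorted col1, sorted col2) pairs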
On Tue, Sep 23, 2014 at 2:40 PM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
abaghdasa...@bloomberg.net wrote:
Hello,
So I have created a table in an RDD in Spark in this format:
col1 col2
So sorry about teasing you with the Scala. But the method is there in Java
too, I just checked.
On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen v...@paxata.com wrote:
It might not be the same as a real hadoop reducer, but I think it would
accomplish the same. Take a look at:
import
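The import above is cut off in the archive. As a stand-in, one reducer-style pattern (my guess, not necessarily the method referenced; pairs: RDD[(String, Int)] assumed):

// groupByKey hands you (key, all values for that key), much like a
// Hadoop reducer's iterator:
val reduced = pairs.groupByKey().mapValues(vs => vs.reduce(_ + _))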
, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Thu, Sep 18, 2014 at 8:55 PM, Victor Tso-Guillen v...@paxata.com
wrote:
Is it possible to express a diamond DAG and have the leaf dependency
evaluate only once?
Well, strictly speaking your graph is not a tree, and also the meaning
of leaf
Is it possible to express a diamond DAG and have the leaf dependency
evaluate only once? So say data flows left to right (and the dependencies
are oriented right to left):
[image: Inline image 1]
Is it possible to run d.collect() and have a evaluate its iterator only
once?
Caveat: all arrows are shuffle dependencies.
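As I understand the scheduler, for a single collect() the shuffle map stage for a runs once anyway and both downstream branches read its map output; persisting a additionally protects you across separate jobs. A sketch using the names from the diagram (everything else assumed):

import org.apache.spark.storage.StorageLevel

val a = base.reduceByKey(_ + _).persist(StorageLevel.MEMORY_AND_DISK)
val b = a.mapValues(_ * 2)                 // one branch of the diamond
val c = a.filter { case (_, v) => v > 0 }  // the other branch
val d = b.join(c)
d.collect()  // a is evaluated once; b and c reuse its output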
I'm supposing that there's no good solution to having heterogeneous hardware
in a cluster. What are the prospects of having something like this in the
future? Am I missing an architectural detail that precludes this
possibility?
Thanks,
Victor
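For concreteness, a spark-env.sh sketch of the common-divisor approach described earlier in this list (all values assumed): pick one worker size whose memory divides every machine's RAM, then vary only the instance count.

# 64 GB / 16-core machine:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=12g
# 32 GB / 8-core machine:
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=12g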
Ping...
On Thu, Sep 11, 2014 at 5:44 PM, Victor Tso-Guillen v...@paxata.com wrote:
So I have a bunch of hardware with different core and memory setups. Is
there a way to do one of the following:
1. Express a ratio of cores to memory to retain. The spark worker config
would represent all
Iterating an RDD gives you each partition in order of their split index.
I'd like to be able to get each partition in reverse order, but I'm having
difficulty implementing the compute() method. I thought I could do
something like this:
override def getDependencies: Seq[Dependency[_]] = {
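For the archive, a sketch of one way (untested; strictly the dependency should be a custom NarrowDependency mapping partition i to n - 1 - i rather than the default one-to-one, but compute() shows the idea):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class ReversePartitionsRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev) {
  override def getPartitions: Array[Partition] = prev.partitions
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    // serve split i from the parent's mirror-image partition; rows
    // inside each partition keep their original order
    val mirror = prev.partitions(prev.partitions.length - 1 - split.index)
    prev.iterator(mirror, context)
  }
}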
So I have a bunch of hardware with different core and memory setups. Is
there a way to do one of the following:
1. Express a ratio of cores to memory to retain. The spark worker config
would represent all of the cores and all of the memory usable for any
application, and the application would
I ran into the same issue. What I did was use the maven-shade-plugin to shade
my version of the httpcomponents libraries into another package.
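For reference, the relocation stanza looked roughly like this (package names assumed):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>shaded.org.apache.http</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>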
On Fri, Sep 5, 2014 at 4:33 PM, Penny Espinoza
pesp...@societyconsulting.com wrote:
Hey - I’m struggling with some dependency issues with
Interestingly, there was an almost identical question posed on Aug 22 by
cjwang. Here's the link to the archive:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664
On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG)
if possible. Those are two reasons you
might see only 4 executors active. If you mean only 4 executors exist
at all, is it possible the other 4 can't provide the memory you're
asking for?
On Tue, Sep 2, 2014 at 5:56 PM, Victor Tso-Guillen v...@paxata.com
wrote:
Actually one more question
I'm thinking of local mode where multiple virtual executors occupy the same
VM. Can we have the same configuration in Spark standalone cluster mode?
also manually launch them with more worker threads than you have cores.
What cluster manager are you on?
Matei
On August 29, 2014 at 11:24:33 AM, Victor Tso-Guillen (v...@paxata.com)
wrote:
I'm thinking of local mode where multiple virtual executors occupy the
same vm. Can we have
Any more thoughts on this? I'm not sure how to do this yet.
On Fri, Aug 29, 2014 at 12:10 PM, Victor Tso-Guillen v...@paxata.com
wrote:
Standalone. I'd love to tell it that my one executor can simultaneously
serve, say, 16 tasks at once for an arbitrary number of distinct jobs.
On Fri, Aug
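For context, in standalone mode the per-executor concurrency is just cores divided by spark.task.cpus; a sketch (values assumed):

// With a 16-core worker and these settings, the one executor runs 16
// tasks at a time:
val conf = new org.apache.spark.SparkConf()
  .set("spark.cores.max", "16")  // total cores this app may take
  .set("spark.task.cpus", "1")   // the default; 16 cores => 16 concurrent tasks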
-Spark applications on the same cluster.
Matei
On August 26, 2014 at 6:53:33 PM, Victor Tso-Guillen (v...@paxata.com)
wrote:
Yes, we are standalone right now. Do you have literature on why one would
want to consider Mesos or YARN for Spark deployments?
Sounds like I should try upgrading my
to manage cluster status, and this info
(e.g. worker number) is also available through the Spark metrics system.
From the user application's point of view, can you give an example
of why you need this info and what you plan to do with it?
Best Regards,
Raymond Liu
From: Victor Tso-Guillen
I'm curious not only about what they do, but what their relationship is to
the rest of the system. I find that I get listener events for n block
managers added where n is also the number of workers I have available to
the application. Is this a stable constant?
Also, are there ways to determine
I wanted to make sure that there's full compatibility between minor
releases. I have a project that has a dependency on spark-core so that it
can be a driver program and that I can test locally. However, when
connecting to a cluster you don't necessarily know what version you're
connecting to. Is
the driver and executors.
Matei
On August 26, 2014 at 6:10:57 PM, Victor Tso-Guillen (v...@paxata.com)
wrote:
I wanted to make sure that there's full compatibility between minor
releases. I have a project that has a dependency on spark-core so that it
can be a driver program and that I can
not user program).
Best Regards,
Raymond Liu
From: Victor Tso-Guillen [mailto:v...@paxata.com]
Sent: Tuesday, August 26, 2014 11:42 PM
To: user@spark.apache.org
Subject: What is a Block Manager?
I'm curious not only about what they do, but what their relationship is to
the rest of the system. I
Do you want to do this on one column or all numeric columns?
On Mon, Aug 25, 2014 at 7:09 AM, Hingorani, Vineet vineet.hingor...@sap.com
wrote:
Hello all,
Could someone help me with the manipulation of CSV file data? I have
semicolon-separated CSV data including doubles and strings. I
()
stats.mean
stats.max
Thank you
Vineet
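A sketch of what those stats calls presumably sat on (file name and column index assumed):

// Parse one numeric column out of semicolon-separated text, then use
// DoubleRDDFunctions.stats() for a StatCounter:
val col = sc.textFile("data.csv").map(_.split(';')(2).toDouble)
val stats = col.stats()  // count, mean, stdev, min, max
println(s"mean=${stats.mean} max=${stats.max}")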
*From:* Victor Tso-Guillen [mailto:v...@paxata.com]
*Sent:* Montag, 25. August 2014 18:34
*To:* Hingorani, Vineet
*Cc:* user@spark.apache.org
*Subject:* Re: Manipulating columns in CSV file or Transpose of
Array[Array[String]] RDD
Do you
Using mapPartitions, you could get the neighbors within a partition, but if
you think about it, it's much more difficult to accomplish this for the
complete dataset.
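For the within-partition version, a sketch (sorted: RDD[Int] assumed):

// Neighbor pairs inside each partition; pairs that straddle a
// partition boundary are the hard part and are missed here:
val pairs = sorted.mapPartitions(_.sliding(2).map(_.toList))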
On Fri, Aug 22, 2014 at 11:24 AM, cjwang c...@cjwang.us wrote:
It would be nice if an RDD that was massaged by
I think I emailed about a similar issue, but in standalone mode. I haven't
investigated much so I don't know what's a good fix.
On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou dearji...@gmail.com wrote:
Hi,
I am having this FetchFailed issue when the driver is about to collect
about
2.5M
I did not try the Akka configs. I was doing a shuffle operation, I believe
a sort, but two copies of the operation at the same time. It was a 20M row
dataset of reasonable horizontal size.
On Sat, Aug 23, 2014 at 2:23 PM, Jiayu Zhou dearji...@gmail.com wrote:
I saw your post. What are the
Anyone know why I would see this in a bunch of executor logs? Is it just
classical overloading of the cluster network, OOM, or something else? If
anyone's seen this before, what do I need to tune to make some headway here?
Thanks,
Victor
Caused by: org.apache.spark.FetchFailedException: Fetch
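The knobs fetch failures of that era usually pointed at (a guess, not a diagnosis; values assumed):

val conf = new org.apache.spark.SparkConf()
  .set("spark.akka.frameSize", "128")                    // MB; raise for large task results
  .set("spark.core.connection.ack.wait.timeout", "600")  // seconds; survive long GC pauses mid-fetch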
How about this:
val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition })
new RDD[V](prev) {
  protected def getPartitions = prev.partitions
  def compute(split: Partition, context: TaskContext) = {
    context.addOnCompleteCallback(() => /*cleanup()*/)
And duh, of course, you can do the setup in that new RDD as well :)
On Wed, Aug 20, 2014 at 1:59 AM, Victor Tso-Guillen v...@paxata.com wrote:
How about this:
val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition })
new RDD[V](prev) {
protected def getPartitions