Re: Spark 2.3.0 DataFrame.write.parquet() behavior change from 2.2.0

2018-05-07 Thread Victor Tso-Guillen
Found it: SPARK-21435 On Mon, May 7, 2018 at 2:18 PM Victor Tso-Guillen <v...@paxata.com> wrote: > It appears that between 2.2.0 and 2.3.0 DataFrame.write.parquet() skips > writing empty parquet files for empty partitions. Is this configurable? Is > there a Jira that tracks this change?

Spark 2.3.0 DataFrame.write.parquet() behavior change from 2.2.0

2018-05-07 Thread Victor Tso-Guillen
It appears that between 2.2.0 and 2.3.0 DataFrame.write.parquet() skips writing empty parquet files for empty partitions. Is this configurable? Is there a Jira that tracks this change? Thanks, Victor
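
A minimal way to observe the change described above (the JIRA reference is in the reply higher up); the local master, app name, and output path here are illustrative, not from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-partition-check")
      .getOrCreate()

    // 3 rows spread over 8 partitions leaves at least 5 partitions empty.
    val df = spark.range(3).toDF("id").repartition(8)
    df.write.mode("overwrite").parquet("/tmp/empty-partition-check")
    // On 2.2.0 the output directory holds a part file per partition, empty ones
    // included; on 2.3.0 the empty partitions produce no part files.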

Re: Any Idea about this error : IllegalArgumentException: File segment length cannot be negative ?

2017-04-20 Thread Victor Tso-Guillen
Along with Priya's email slightly earlier than this one, we also are seeing this happen on Spark 1.5.2. On Wed, Jul 13, 2016 at 1:26 AM Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > In Spark Streaming job, I see a Batch failed with following error. Haven't > seen anything like

Re: Need some guidance

2015-04-14 Thread Victor Tso-Guillen
://polyglotprogramming.com On Mon, Apr 13, 2015 at 3:24 PM, Victor Tso-Guillen v...@paxata.com wrote: How about this? input.distinct().combineByKey((v: V) => 1, (agg: Int, x: Int) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2).collect() On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler deanwamp...@gmail.com

Re: Need some guidance

2015-04-13 Thread Victor Tso-Guillen
How about this? input.distinct().combineByKey((v: V) => 1, (agg: Int, x: Int) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2).collect() On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler deanwamp...@gmail.com wrote: The problem with using collect is that it will fail for large data sets, as
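
A cleaned-up, concrete version of that snippet, assuming sc is an existing SparkContext; it counts the distinct values per key:

    val input = sc.parallelize(Seq("a" -> 1, "a" -> 1, "a" -> 2, "b" -> 3))
    val distinctValueCounts = input
      .distinct()                            // drop duplicate (key, value) pairs
      .combineByKey(
        (_: Int) => 1,                       // first distinct value seen for a key
        (count: Int, _: Int) => count + 1,   // another distinct value for the same key
        (c1: Int, c2: Int) => c1 + c2        // merge partial counts across partitions
      )
      .collect()                             // Array(("a", 2), ("b", 1)), in some order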

Re: 'nested' RDD problem, advise needed

2015-03-22 Thread Victor Tso-Guillen
Something like this? (2 to alphabetLength toList).map(shift => Object.myFunction(inputRDD, shift).map(v => shift -> v)).foldLeft(Object.myFunction(inputRDD, 1).map(v => 1 -> v))(_ union _) which is an RDD[(Int, Char)] Problem is that you can't play with RDDs inside of RDDs. The recursive structure

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Victor Tso-Guillen
That particular class you did find is under parquet/... which means it was shaded. Did you build your application against a hadoop2.6 dependency? The maven central repo only has 2.2 but HDP has its own repos. On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist tsind...@gmail.com wrote: I am running

Re: Scheduler hang?

2015-02-28 Thread Victor Tso-Guillen
, so I'll fix the IP address reporting side for local mode in my code. On Thu, Feb 26, 2015 at 8:32 PM, Victor Tso-Guillen v...@paxata.com wrote: Of course, breakpointing on every status update and revive offers invocation kept the problem from happening. Where could the race be? On Thu, Feb 26

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
or the local backend is not kicking the revive offers messaging at the right time, but I have to dig into the code some more to nail the culprit. Anyone on these list have experience in those code areas that could help? On Thu, Feb 26, 2015 at 2:27 AM, Victor Tso-Guillen v...@paxata.com wrote: Thanks

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
(spark.rdd.compress,true) ) Thanks Best Regards On Thu, Feb 26, 2015 at 10:15 AM, Victor Tso-Guillen v...@paxata.com wrote: I'm getting this really reliably on Spark 1.2.1. Basically I'm in local mode with parallelism at 8. I have 222 tasks and I never seem to get far past 40. Usually in the 20s to 30s

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
/browse/SPARK-4516 Thanks Best Regards On Thu, Feb 26, 2015 at 3:26 PM, Victor Tso-Guillen v...@paxata.com wrote: Is there any potential problem from 1.1.1 to 1.2.1 with shuffle dependencies that produce no data? On Thu, Feb 26, 2015 at 1:56 AM, Victor Tso-Guillen v...@paxata.com wrote

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Is there any potential problem from 1.1.1 to 1.2.1 with shuffle dependencies that produce no data? On Thu, Feb 26, 2015 at 1:56 AM, Victor Tso-Guillen v...@paxata.com wrote: The data is small. The job is composed of many small stages. * I found that with fewer than 222 the problem exhibits

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Love to hear some input on this. I did get a standalone cluster up on my local machine and the problem didn't present itself. I'm pretty confident that means the problem is in the LocalBackend or something near it. On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen v...@paxata.com wrote: Okay

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Of course, breakpointing on every status update and revive offers invocation kept the problem from happening. Where could the race be? On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen v...@paxata.com wrote: Love to hear some input on this. I did get a standalone cluster up on my local

Scheduler hang?

2015-02-25 Thread Victor Tso-Guillen
I'm getting this really reliably on Spark 1.2.1. Basically I'm in local mode with parallelism at 8. I have 222 tasks and I never seem to get far past 40. Usually in the 20s to 30s it will just hang. The last logging is below, and a screenshot of the UI. 2015-02-25 20:39:55.779 GMT-0800 INFO

Re: heterogeneous cluster setup

2014-12-04 Thread Victor Tso-Guillen
is understandable, but how about heterogeneity with respect to memory? On Thu, Dec 4, 2014 at 12:18 PM, Victor Tso-Guillen v...@paxata.com wrote: You'll have to decide which is more expensive in your heterogenous environment and optimize for the utilization of that. For example, you may decide

Re: heterogeneous cluster setup

2014-12-03 Thread Victor Tso-Guillen
I don't have a great answer for you. For us, we found a common divisor, not necessarily a whole gigabyte, of the available memory of the different hardware and used that as the amount of memory per worker and scaled the number of cores accordingly so that every core in the system has the same
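As a hypothetical illustration of that divisor approach: with one machine at 60 GB / 20 cores and another at 64 GB / 16 cores, a common divisor of 4 GB per core means the second machine runs all 16 of its cores while the first exposes only 15 of its 20, so every core in the cluster is backed by the same 4 GB.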

Re: heterogeneous cluster setup

2014-12-03 Thread Victor Tso-Guillen
clarify this doubt? Regards Karthik On Wed, Dec 3, 2014 at 11:11 PM, Victor Tso-Guillen v...@paxata.com wrote: I don't have a great answer for you. For us, we found a common divisor, not necessarily a whole gigabyte, of the available memory of the different hardware and used

Re: Parallelize independent tasks

2014-12-02 Thread Victor Tso-Guillen
dirs.par.foreach { case (src, dest) => sc.textFile(src).process.saveAsFile(dest) } Is that sufficient for you? On Tuesday, December 2, 2014, Anselme Vignon anselme.vig...@flaminem.com wrote: Hi folks, We have written a spark job that scans multiple hdfs directories and perform
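
A self-contained sketch of that pattern, with hypothetical directory pairs and a trivial per-line transform standing in for the real processing; the driver-side .par collection submits the independent jobs concurrently and Spark schedules them in parallel as resources allow (on Scala 2.13+ the .par method needs the separate scala-parallel-collections module):

    val dirs = Seq(
      "hdfs:///in/a" -> "hdfs:///out/a",
      "hdfs:///in/b" -> "hdfs:///out/b"
    )
    dirs.par.foreach { case (src, dest) =>
      // each iteration is an independent Spark job submitted from its own driver thread
      sc.textFile(src).map(_.toUpperCase).saveAsTextFile(dest)
    }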

Re: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread Victor Tso-Guillen
Do you have a newline or some other strange character in an argument you pass to spark that includes that 61608? On Wed, Sep 24, 2014 at 4:11 AM, jishnu.prat...@wipro.com wrote: Hi , *I am getting this weird error while starting Worker. * -bash-4.1$ spark-class

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2014-09-24 Thread Victor Tso-Guillen
Really? What should we make of this? 24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor - Exception in task ID 40599 java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:789) at

Re: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2014-09-24 Thread Victor Tso-Guillen
Never mind: https://issues.apache.org/jira/browse/SPARK-1476 On Wed, Sep 24, 2014 at 11:10 AM, Victor Tso-Guillen v...@paxata.com wrote: Really? What should we make of this? 24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor - Exception in task ID 40599

Re: Sorting a Table in Spark RDD

2014-09-23 Thread Victor Tso-Guillen
You could pluck out each column in separate RDDs, sort them independently, and zip them :) On Tue, Sep 23, 2014 at 2:40 PM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -) abaghdasa...@bloomberg.net wrote: Hello, So I have created a table in an RDD in Spark in the format: col1 col2
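
A hedged sketch of that column-wise idea. zip requires the two RDDs to have identical partition sizes, which independent sortBy calls don't guarantee, so this version aligns the sorted columns by rank with zipWithIndex and a join instead:

    val rows = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    val col1Sorted = rows.map(_._1).sortBy(identity).zipWithIndex().map(_.swap)
    val col2Sorted = rows.map(_._2).sortBy(identity).zipWithIndex().map(_.swap)
    // join the columns on their rank, then restore row order by rank
    val sortedTable = col1Sorted.join(col2Sorted).sortByKey().values
    sortedTable.collect()   // Array((1, "a"), (2, "b"), (3, "c"))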

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Victor Tso-Guillen
So sorry about teasing you with the Scala. But the method is there in Java too, I just checked. On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen v...@paxata.com wrote: It might not be the same as a real hadoop reducer, but I think it would accomplish the same. Take a look at: import

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Victor Tso-Guillen
-Guillen v...@paxata.com wrote: So sorry about teasing you with the Scala. But the method is there in Java too, I just checked. On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen v...@paxata.com wrote: It might not be the same as a real hadoop reducer, but I think it would accomplish

Re: diamond dependency tree

2014-09-19 Thread Victor Tso-Guillen
, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Thu, Sep 18, 2014 at 8:55 PM, Victor Tso-Guillen v...@paxata.com wrote: Is it possible to express a diamond DAG and have the leaf dependency evaluate only once? Well, strictly speaking your graph is not a tree, and also the meaning of leaf

diamond dependency tree

2014-09-18 Thread Victor Tso-Guillen
Is it possible to express a diamond DAG and have the leaf dependency evaluate only once? So say data flows left to right (and the dependencies are oriented right to left): [image: Inline image 1] Is it possible to run d.collect() and have a evaluate its iterator only once?
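
A hedged sketch of the diamond, assuming sc is an existing SparkContext; the names are illustrative rather than taken from the inline image. Since the follow-up below notes that every arrow is a shuffle dependency, each shuffle would otherwise recompute the leaf, so persisting it is one way to make it evaluate only once:

    import org.apache.spark.storage.StorageLevel

    val a = sc.textFile("hdfs:///input")
      .map(line => (line, 1))
      .persist(StorageLevel.MEMORY_AND_DISK)    // the shared leaf
    val b = a.reduceByKey(_ + _)                // shuffle dependency on a
    val c = a.groupByKey().mapValues(_.size)    // a second, separate shuffle dependency on a
    val d = b.union(c)
    d.collect()   // without the persist, a's stage runs once per shuffle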

Re: diamond dependency tree

2014-09-18 Thread Victor Tso-Guillen
Caveat: all arrows are shuffle dependencies. On Thu, Sep 18, 2014 at 8:55 PM, Victor Tso-Guillen v...@paxata.com wrote: Is it possible to express a diamond DAG and have the leaf dependency evaluate only once? So say data flows left to right (and the dependencies are oriented right to left

Re: Configuring Spark for heterogeneous hardware

2014-09-17 Thread Victor Tso-Guillen
I'm supposing that there's no good solution to having heterogeneous hardware in a cluster. What are the prospects of having something like this in the future? Am I missing an architectural detail that precludes this possibility? Thanks, Victor On Fri, Sep 12, 2014 at 12:10 PM, Victor Tso-Guillen

Re: Configuring Spark for heterogeneous hardware

2014-09-17 Thread Victor Tso-Guillen
. On Wed, Sep 17, 2014 at 8:35 AM, Victor Tso-Guillen v...@paxata.com wrote: I'm supposing that there's no good solution to having heterogeneous hardware in a cluster. What are the prospects of having something like this in the future? Am I missing an architectural detail that precludes

Re: Configuring Spark for heterogeneous hardware

2014-09-12 Thread Victor Tso-Guillen
Ping... On Thu, Sep 11, 2014 at 5:44 PM, Victor Tso-Guillen v...@paxata.com wrote: So I have a bunch of hardware with different core and memory setups. Is there a way to do one of the following: 1. Express a ratio of cores to memory to retain. The spark worker config would represent all

Backwards RDD

2014-09-11 Thread Victor Tso-Guillen
Iterating an RDD gives you each partition in order of their split index. I'd like to be able to get each partition in reverse order, but I'm having difficulty implementing the compute() method. I thought I could do something like this: override def getDependencies: Seq[Dependency[_]] = {
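
A rough sketch of one way to finish that idea, declaring the mirrored dependency explicitly so the scheduler knows partition i of the wrapper reads partition n - 1 - i of the parent; reversing within a partition buffers it in memory, so this only suits partitions that fit on a single executor:

    import scala.reflect.ClassTag
    import org.apache.spark.{NarrowDependency, Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    class ReversedRDD[T: ClassTag](parent: RDD[T]) extends RDD[T](
      parent.sparkContext,
      Seq(new NarrowDependency[T](parent) {
        override def getParents(partitionId: Int): Seq[Int] =
          Seq(parent.partitions.length - 1 - partitionId)
      })
    ) {
      override protected def getPartitions: Array[Partition] = parent.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        // read the mirrored parent partition and reverse its elements
        val mirrored = parent.partitions(parent.partitions.length - 1 - split.index)
        parent.iterator(mirrored, context).toArray.reverseIterator
      }
    }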

Configuring Spark for heterogeneous hardware

2014-09-11 Thread Victor Tso-Guillen
So I have a bunch of hardware with different core and memory setups. Is there a way to do one of the following: 1. Express a ratio of cores to memory to retain. The spark worker config would represent all of the cores and all of the memory usable for any application, and the application would

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-06 Thread Victor Tso-Guillen
I ran into the same issue. What I did was use maven shade plugin to shade my version of httpcomponents libraries into another package. On Fri, Sep 5, 2014 at 4:33 PM, Penny Espinoza pesp...@societyconsulting.com wrote: Hey - I’m struggling with some dependency issues with

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Victor Tso-Guillen
Interestingly, there was an almost identical question posed on Aug 22 by cjwang. Here's the link to the archive: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664 On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG)

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-09-02 Thread Victor Tso-Guillen
if possible. Those are two reasons you might see only 4 executors active. If you mean only 4 executors exist at all, is it possible the other 4 can't provide the memory you're asking for? On Tue, Sep 2, 2014 at 5:56 PM, Victor Tso-Guillen v...@paxata.com wrote: Actually one more question

Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
I'm thinking of local mode where multiple virtual executors occupy the same vm. Can we have the same configuration in spark standalone cluster mode?

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
also manually launch them with more worker threads than you have cores. What cluster manager are you on? Matei On August 29, 2014 at 11:24:33 AM, Victor Tso-Guillen (v...@paxata.com) wrote: I'm thinking of local mode where multiple virtual executors occupy the same vm. Can we have

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
Any more thoughts on this? I'm not sure how to do this yet. On Fri, Aug 29, 2014 at 12:10 PM, Victor Tso-Guillen v...@paxata.com wrote: Standalone. I'd love to tell it that my one executor can simultaneously serve, say, 16 tasks at once for an arbitrary number of distinct jobs. On Fri, Aug

Re: Upgrading 1.0.0 to 1.0.2

2014-08-27 Thread Victor Tso-Guillen
-Spark applications on the same cluster. Matei On August 26, 2014 at 6:53:33 PM, Victor Tso-Guillen (v...@paxata.com) wrote: Yes, we are standalone right now. Do you have literature why one would want to consider Mesos or YARN for Spark deployments? Sounds like I should try upgrading my

Re: What is a Block Manager?

2014-08-27 Thread Victor Tso-Guillen
to manage cluster status, and these info (e.g. worker number) is also available through spark metrics system. While from the user application's point of view, can you give an example why you need these info, what would you plan to do with them? Best Regards, Raymond Liu From: Victor Tso-Guillen

What is a Block Manager?

2014-08-26 Thread Victor Tso-Guillen
I'm curious not only about what they do, but what their relationship is to the rest of the system. I find that I get listener events for n block managers added where n is also the number of workers I have available to the application. Is this a stable constant? Also, are there ways to determine

Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Victor Tso-Guillen
I wanted to make sure that there's full compatibility between minor releases. I have a project that has a dependency on spark-core so that it can be a driver program and that I can test locally. However, when connecting to a cluster you don't necessarily know what version you're connecting to. Is

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Victor Tso-Guillen
the driver and executors. Matei On August 26, 2014 at 6:10:57 PM, Victor Tso-Guillen (v...@paxata.com) wrote: I wanted to make sure that there's full compatibility between minor releases. I have a project that has a dependency on spark-core so that it can be a driver program and that I can

Re: What is a Block Manager?

2014-08-26 Thread Victor Tso-Guillen
not user program). Best Regards, Raymond Liu From: Victor Tso-Guillen [mailto:v...@paxata.com] Sent: Tuesday, August 26, 2014 11:42 PM To: user@spark.apache.org Subject: What is a Block Manager? I'm curious not only about what they do, but what their relationship is to the rest of the system. I

Re: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Victor Tso-Guillen
Do you want to do this on one column or all numeric columns? On Mon, Aug 25, 2014 at 7:09 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Hello all, Could someone help me with the manipulation of csv file data. I have 'semicolon' separated csv data including doubles and strings. I

Re: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Victor Tso-Guillen
() stats.mean stats.max Thank you Vineet *From:* Victor Tso-Guillen [mailto:v...@paxata.com] *Sent:* Montag, 25. August 2014 18:34 *To:* Hingorani, Vineet *Cc:* user@spark.apache.org *Subject:* Re: Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD Do you
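
A hedged sketch of the stats() route referenced in the quoted reply, with a hypothetical path and column index; splitting on the semicolon and converting one column to Double gives an RDD[Double], whose built-in stats() exposes mean, max, and so on:

    val lines = sc.textFile("/data/input.csv")       // hypothetical path
    val col = lines.map(_.split(";")(2).toDouble)    // hypothetical column index
    val stats = col.stats()                          // StatCounter over the column
    println(s"mean=${stats.mean} max=${stats.max}")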

Re: Finding previous and next element in a sorted RDD

2014-08-23 Thread Victor Tso-Guillen
Using mapPartitions, you could get the neighbors within a partition, but if you think about it, it's much more difficult to accomplish this for the complete dataset. On Fri, Aug 22, 2014 at 11:24 AM, cjwang c...@cjwang.us wrote: It would be nice if an RDD that was massaged by
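
A small sketch of the per-partition version, assuming sc is an existing SparkContext; sliding(2) pairs each element with its successor inside a partition, and the pairs that straddle partition boundaries are exactly what it misses:

    val sorted = sc.parallelize(1 to 20).sortBy(identity)
    val neighborPairs = sorted.mapPartitions { it =>
      it.sliding(2).collect { case Seq(prev, next) => (prev, next) }
    }
    neighborPairs.collect()   // within-partition neighbor pairs only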

Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Victor Tso-Guillen
I think I emailed about a similar issue, but in standalone mode. I haven't investigated much so I don't know what's a good fix. On Fri, Aug 22, 2014 at 12:00 PM, Jiayu Zhou dearji...@gmail.com wrote: Hi, I am having this FetchFailed issue when the driver is about to collect about 2.5M

Re: FetchFailed when collect at YARN cluster

2014-08-23 Thread Victor Tso-Guillen
I did not try the Akka configs. I was doing a shuffle operation, I believe a sort, but two copies of the operation at the same time. It was a 20M row dataset of reasonable horizontal size. On Sat, Aug 23, 2014 at 2:23 PM, Jiayu Zhou dearji...@gmail.com wrote: I saw your post. What are the

FetchFailedException from Block Manager

2014-08-22 Thread Victor Tso-Guillen
Anyone know why I would see this in a bunch of executor logs? Is it just classical overloading of the cluster network, OOM, or something else? If anyone's seen this before, what do I need to tune to make some headway here? Thanks, Victor Caused by: org.apache.spark.FetchFailedException: Fetch

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-20 Thread Victor Tso-Guillen
How about this: val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition }) new RDD[V](prev) { protected def getPartitions = prev.partitions def compute(split: Partition, context: TaskContext) = { context.addOnCompleteCallback(() => /*cleanup()*/)

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-20 Thread Victor Tso-Guillen
And duh, of course, you can do the setup in that new RDD as well :) On Wed, Aug 20, 2014 at 1:59 AM, Victor Tso-Guillen v...@paxata.com wrote: How about this: val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition }) new RDD[V](prev) { protected def getPartitions
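
For reference, a self-contained sketch of the same setup/cleanup pattern against recent Spark APIs, where TaskContext.addTaskCompletionListener takes the place of the addOnCompleteCallback used above; the JDBC connection and the per-record lookup are hypothetical stand-ins for whatever resource the task needs:

    import java.sql.{Connection, DriverManager}
    import org.apache.spark.TaskContext
    import org.apache.spark.rdd.RDD

    def lookup(conn: Connection, row: String): String = row   // placeholder for real per-record work

    def enrichWithDb(rows: RDD[String], jdbcUrl: String): RDD[String] =
      rows.mapPartitions { partition =>
        val conn = DriverManager.getConnection(jdbcUrl)                        // setup, once per partition
        TaskContext.get().addTaskCompletionListener[Unit](_ => conn.close())   // cleanup when the task ends
        partition.map(row => lookup(conn, row))                                // lazily consumed downstream
      }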