Re: Spark 3.2 - ReusedExchange not present in join execution plan

2022-01-06 Thread Albert
I happened to encounter something similar. It's probably because you are just `explain`-ing it; when you actually run it, you will get the final Spark plan, in which case the exchange will be reused. Right, this is different from 3.1, probably because of the upgraded AQE. Not sure whether this

Error while getting RDD partitions for a parquet dataframe in Spark 3

2020-09-01 Thread Albert Butterscotch
Hi, When migrating to Spark 3, I'm getting a NoSuchElementException when getting partitions for a Parquet DataFrame. The code I'm trying to execute is: val df = sparkSession.read.parquet(inputFilePath) val partitions = df.rdd.partitions and the Spark session is created
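For readability, the snippet quoted in the message amounts to the following (the session setup and `inputFilePath` are assumptions for illustration, not from the original thread):

```scala
import org.apache.spark.sql.SparkSession

object PartitionsRepro {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .appName("partitions-repro")
      .master("local[*]")             // illustrative; the poster's config is not shown
      .getOrCreate()
    val inputFilePath = "/tmp/data.parquet" // illustrative path

    val df = sparkSession.read.parquet(inputFilePath)
    // The reported failure point: materializing the RDD lineage to ask for partitions.
    val partitions = df.rdd.partitions
    println(partitions.length)
    sparkSession.stop()
  }
}
```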

are functions deserialized once per task?

2015-10-02 Thread Michael Albert
Greetings! Is it true that functions, such as those passed to RDD.map(), are deserialized once per task? This seems to be the case looking at Executor.scala, but I don't really understand the code. I'm hoping the answer is yes, because that makes it easier to write code without worrying about

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-21 Thread Michael Albert
This is something of a wild guess, but I find that when executors start disappearing for no obvious reason, it is usually because the YARN node managers have decided that the containers are using too much memory and have terminated the executors. Unfortunately, to see evidence of this, one

spark-dataflow + Spark Streaming + Kafka

2015-07-25 Thread Albert Strasheim
code out there that is doing this already that we can look at for some inspiration? Any advice appreciated. Thanks, Albert

Re: Wired Problem: Task not serializable[Spark Streaming]

2015-06-08 Thread Michael Albert
Note that in Scala, return is a non-local return: https://tpolecat.github.io/2014/05/09/return.html So that return is *NOT* returning from the anonymous function, but attempting to return from the enclosing method, i.e., main, which is running on the driver, not on the workers. So on the workers,
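A minimal sketch of the trap being described (names are illustrative, not the thread's code): `return` inside a closure compiles, but it is implemented by throwing a control exception to unwind to the enclosing method, which cannot work once the closure has been shipped to a remote executor.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReturnTrap {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("return-trap").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 10)
    val doubled = rdd.map { x =>
      // if (x > 5) return   // <- non-local return: tries to exit main(), which runs
      //                     //    on the driver, not in this executor. Don't do this.
      if (x > 5) x else x * 2 // use an expression instead of `return`
    }
    doubled.collect()
    sc.stop()
  }
}
```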

Re: variant record by case classes in shell fails?

2015-04-03 Thread Michael Albert
My apologies for following up on my own post, but a friend just pointed out that if I use Kryo with reference counting AND copy-and-paste, this runs. However, if I try to load the file, it fails as described below. I thought load was supposed to be equivalent? Thanks! -Mike

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Michael Albert
Owen so...@cloudera.com To: Michael Albert m_albert...@yahoo.com Cc: User user@spark.apache.org Sent: Monday, March 23, 2015 7:31 AM Subject: Re: How to check that a dataset is sorted after it has been written out? Data is not (necessarily) sorted when read from disk, no. A file might have

How to check that a dataset is sorted after it has been written out? [repost]

2015-03-22 Thread Michael Albert
Greetings! [My apologies for this repost; I'm not certain that the first message made it to the list.] I sorted a dataset in Spark and then wrote it out in Avro/Parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first

How to check that a dataset is sorted after it has been written out?

2015-03-20 Thread Michael Albert
Greetings! I sorted a dataset in Spark and then wrote it out in Avro/Parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first partition (i.e., as seen in the partition index of mapPartitionsWithIndex) is not the same as
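A hedged sketch of one way to do the check being asked about (not the thread's own code): record each partition's first and last element with `mapPartitionsWithIndex`, then verify that every partition is internally sorted and that consecutive partitions' boundaries do not overlap. Since the read order of part files is not guaranteed, the boundaries are compared by partition index.

```scala
import org.apache.spark.rdd.RDD

// Sketch: verify global sortedness of an RDD[Long] after a round trip to disk.
def checkSorted(rdd: RDD[Long]): Boolean = {
  // (partition index, first element, last element, sorted-within-partition?)
  val stats = rdd.mapPartitionsWithIndex { (idx, it) =>
    val xs = it.toArray
    if (xs.isEmpty) Iterator.empty
    else {
      val within = xs.sliding(2).forall(p => p.length < 2 || p(0) <= p(1))
      Iterator((idx, xs.head, xs.last, within))
    }
  }.collect().sortBy(_._1)

  val withinOk = stats.forall(_._4)
  val acrossOk = stats.sliding(2).forall {
    case Array((_, _, lastA, _), (_, firstB, _, _)) => lastA <= firstB
    case _ => true
  }
  withinOk && acrossOk
}
```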

Re: How to debug a Hung task

2015-02-28 Thread Michael Albert
For what it's worth, I was seeing mysterious hangs, but they went away when upgrading from Spark 1.2 to 1.2.1. I don't know if this is your problem. Also, I'm using AWS EMR images, which were also upgraded. Anyway, that's my experience. -Mike From: Manas Kar manasdebashis...@gmail.com To:

Spark stalls or hangs: is this a clue? remote fetches seem to never return?

2015-02-05 Thread Michael Albert
Greetings! Again, thanks to all who have given suggestions. I am still trying to diagnose a problem where I have processes that run for one or several hours but intermittently stall or hang. By stall I mean that there is no CPU usage on the workers or the driver, nor network activity, nor do I

Re: Spark stalls or hangs: is this a clue? remote fetches seem to never return?

2015-02-05 Thread Michael Albert
completely confused :-). Thanks! -Mike From: Michael Albert m_albert...@yahoo.com.INVALID To: user@spark.apache.org user@spark.apache.org Sent: Thursday, February 5, 2015 9:04 PM Subject: Re: Spark stalls or hangs: is this a clue? remote fetches seem to never return? Greetings! Again, thanks

Re: Spark Job running on localhost on yarn cluster

2015-02-04 Thread Michael Albert
1) Parameters like --num-executors should come before the jar. That is, you want something like $SPARK_HOME --num-executors 3 --driver-memory 6g --executor-memory 7g \ --master yarn-cluster --class EDDApp target/scala-2.10/eddjar \ outputPath That is, *your* parameters come after the jar,
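Laid out on separate lines, the ordering being described looks like this (assuming the launcher is spark-submit; the class name, jar path, and argument are the thread's placeholders):

```shell
# Everything BEFORE the jar is consumed by spark-submit itself;
# everything AFTER the jar is passed to the application's main().
$SPARK_HOME/bin/spark-submit \
  --num-executors 3 --driver-memory 6g --executor-memory 7g \
  --master yarn-cluster --class EDDApp \
  target/scala-2.10/eddjar \
  outputPath
```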

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-04 Thread Michael Albert
From: Sandy Ryza sandy.r...@cloudera.com To: Imran Rashid iras...@cloudera.com Cc: Michael Albert m_albert...@yahoo.com; user@spark.apache.org user@spark.apache.org Sent: Wednesday, February 4, 2015 12:54 PM Subject: Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job? Also, do

Re: 2GB limit for partitions?

2015-02-03 Thread Michael Albert
Thank you! This is very helpful. -Mike From: Aaron Davidson ilike...@gmail.com To: Imran Rashid iras...@cloudera.com Cc: Michael Albert m_albert...@yahoo.com; Sean Owen so...@cloudera.com; user@spark.apache.org user@spark.apache.org Sent: Tuesday, February 3, 2015 6:13 PM Subject: Re

Re: 2GB limit for partitions?

2015-02-03 Thread Michael Albert
)    at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57) From: Sean Owen so...@cloudera.com To: Michael Albert m_albert...@yahoo.com Cc: user@spark.apache.org user@spark.apache.org Sent: Monday, February 2, 2015 10:13 PM Subject: Re: 2GB limit for partitions? The limit is on blocks

advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-03 Thread Michael Albert
GB of physical memory and, as far as I can determine, no swap space. The messages bracketing the stall are shown below. Any advice is welcome. Thanks! Sincerely, Mike Albert Before the stall: 15/02/03 21:45:28 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 5.0, whose tasks have all

2GB limit for partitions?

2015-02-02 Thread Michael Albert
? Admittedly, this is an odd use case Thanks! Sincerely, Mike Albert

How does unmanaged memory work with the executor memory limits?

2015-01-12 Thread Michael Albert
Greetings! My executors apparently are being terminated because they are running beyond physical memory limits, according to the yarn-hadoop-nodemanager logs on the worker nodes (/mnt/var/log/hadoop on AWS EMR). I'm setting the driver-memory to 8G. However, looking at stdout in the userlogs, I can
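When YARN kills containers for exceeding physical memory while the JVM heap looks fine, the usual suspect is off-heap usage not covered by the heap setting. In Spark 1.x the headroom YARN reserves beyond the executor heap is controlled by `spark.yarn.executor.memoryOverhead` (in MB), and raising it is a common fix. A sketch (application name, jar, and values are illustrative, not from the thread):

```shell
# Reserve extra container memory beyond the JVM heap for off-heap usage
# (direct buffers, netty, JVM overhead).
spark-submit \
  --master yarn-cluster \
  --driver-memory 8g \
  --executor-memory 7g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --class MyApp myapp.jar
```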

Re: a vague question, but perhaps it might ring a bell

2015-01-05 Thread Michael Albert
writing, but perhaps there is some subtle difference in the context? Thank you. Sincerely, Mike From: Akhil Das ak...@sigmoidanalytics.com To: Michael Albert m_albert...@yahoo.com Cc: user@spark.apache.org user@spark.apache.org Sent: Monday, January 5, 2015 1:21 AM Subject: Re: a vague

Reading one partition at a time

2015-01-04 Thread Michael Albert
Greetings! I would like to know if the code below will read one partition at a time, and whether I am reinventing the wheel. If I may explain, upstream code has managed (I hope) to save an RDD such that each partition file (e.g., part-r-0, part-r-1) contains exactly the data subset
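One way to sketch the "read one partition at a time" idea (an assumption for illustration, not the poster's code): open each saved part file as its own single-file RDD, so only one partition's data is touched per step.

```scala
import org.apache.spark.SparkContext

// Sketch: process each saved partition file independently.
// The part-file naming scheme and `path` are assumptions for illustration.
def onePartitionAtATime(sc: SparkContext, path: String, numParts: Int): Unit = {
  (0 until numParts).foreach { i =>
    val partFile = f"$path/part-r-$i%05d"
    val part = sc.textFile(partFile)          // one part file -> its own small RDD
    part.foreachPartition { it =>
      it.foreach(line => ())                  // do the per-partition work here
    }
  }
}
```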

a vague question, but perhaps it might ring a bell

2015-01-04 Thread Michael Albert
Greetings! So, I think I have data saved so that each partition (part-r-0, etc.) is exactly what I want to translate into an output file of a format not related to Hadoop. I believe I've figured out how to tell Spark to read the data set without re-partitioning (in another post I mentioned

Re: unable to do group by with 1st column

2014-12-28 Thread Michael Albert
6E7 values, and the data is (DataKey(Int,Int), Option[Float]), so that shouldn't need 5g? Anyway, thanks for the info. Best wishes, Mike From: Sean Owen so...@cloudera.com To: Michael Albert m_albert...@yahoo.com Cc: user@spark.apache.org Sent: Friday, December 26, 2014 3:23 PM Subject

Re: unable to do group by with 1st column

2014-12-26 Thread Michael Albert
Greetings! I'm trying to do something similar, and having a very bad time of it. What I start with is key1: (col1: val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4, ...) key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val-2-4, ...) What I want (what I have been asked to produce
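A hedged sketch of the transformation being described, assuming the goal is to gather all (column, value) pairs for a key into a single row (the types are illustrative):

```scala
import org.apache.spark.rdd.RDD

// Sketch: from flat (key, (column, value)) pairs to one row per key.
def gatherRows(pairs: RDD[(String, (String, Float))]): RDD[(String, Map[String, Float])] =
  pairs
    .groupByKey()        // beware: pulls ALL of a key's values onto one executor,
                         // which can blow memory for heavily skewed keys
    .mapValues(_.toMap)  // key -> Map(col -> value)
```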

Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Albert Manyà
MatrixFactorizationModel cannot be accessed in object RecommendALS val model = new MatrixFactorizationModel (8, userFeatures, productFeatures) Any ideas? Thanks! -- Albert Manyà alber...@eml.cc
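Since the constructor is not accessible from user code, a common workaround (a sketch, not verified against every mllib version) is to persist the model's pieces, its rank and the two factor RDDs, and rebuild from those rather than serializing the model object itself:

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Sketch: persist the pieces of a MatrixFactorizationModel instead of the object.
// The HDFS paths are illustrative.
def saveModel(model: MatrixFactorizationModel): Unit = {
  model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
  model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")
  // also record model.rank somewhere, e.g. a small text file
}

// Later, from code that can see the constructor (e.g. compiled inside the
// org.apache.spark.mllib.recommendation package):
// val model = new MatrixFactorizationModel(rank, userFeatures, productFeatures)
```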

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Albert Manyà
In that case, what is the strategy to train a model in some background batch process and make recommendations for some other service in real time? Run both processes in the same spark cluster? Thanks. -- Albert Manyà alber...@eml.cc On Mon, Dec 15, 2014, at 05:58 PM, Sean Owen wrote

Re: Exception using amazonaws library

2014-12-12 Thread Albert Manyà
who is compiled against such an old version of httpclient? I see in the project dependencies that amazonaws 1.9.10 depends on httpclient 4.3... Is it Spark that is compiled against an old version of amazonaws? Thanks. -- Albert Manyà alber...@eml.cc On Fri, Dec 12, 2014, at 09:27 AM, Akhil Das

Exception using amazonaws library

2014-12-11 Thread Albert Manyà
signature for setSoKeepalive: public static void setSoKeepalive(HttpParams params, boolean enableKeepalive) At this point I'm stuck and don't know where to keep looking... some help would be greatly appreciated :) Thank you very much! -- Albert Manyà alber...@eml.cc

Re: avro + parquet + vectorstring + NullPointerException while reading

2014-11-06 Thread Michael Albert
. Hive at 0.13.1 still can't read it, though... Thanks! -Mike From: Michael Armbrust mich...@databricks.com To: Michael Albert m_albert...@yahoo.com Cc: user@spark.apache.org user@spark.apache.org Sent: Tuesday, November 4, 2014 2:37 PM Subject: Re: avro + parquet + vectorstring

avro + parquet + vectorstring + NullPointerException while reading

2014-11-03 Thread Michael Albert
stumped. I can read and write records and maps, but arrays/vectors elude me. Am I missing something obvious? Thanks! Sincerely, Mike Albert

BUG: when running as extends App, closures don't capture variables

2014-10-29 Thread Michael Albert
Greetings! This might be a documentation issue as opposed to a coding issue, in that perhaps the correct answer is "don't do that", but as this is not obvious, I am writing. The following code produces output most would not expect: package misc import org.apache.spark.SparkConf import
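The surprise comes from how `extends App` works: the object body runs via `DelayedInit`, so its vals are fields that are initialized at runtime, and a closure serialized to executors can capture the field before it has been assigned. The usual fix, sketched here with illustrative names, is an explicit `main` method so the captured values are plain locals:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Prefer an explicit main() over `extends App`, so that `n` is a plain local
// variable captured by value when the closure is serialized.
object Misc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("misc").setMaster("local[*]"))
    val n = 5
    val out = sc.parallelize(1 to 3).map(_ + n).collect()
    println(out.mkString(","))
    sc.stop()
  }
}
```

With `extends App`, the same `map(_ + n)` can see `n` as 0 on the executors; with `main`, it reliably sees 5.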

Is Spark streaming suitable for our architecture?

2014-10-23 Thread Albert Vila
Hi, I'm evaluating Spark Streaming to see if it fits to scale our current architecture. We are currently downloading and processing 6M documents per day from online and social media. We have a different workflow for each type of document, but some of the steps are keyword extraction, language

Re: Is Spark streaming suitable for our architecture?

2014-10-23 Thread Albert Vila
Hi Jayant, On 23 October 2014 11:14, Jayant Shekhar jay...@cloudera.com wrote: Hi Albert, Have a couple of questions: - You mentioned near real-time. What exactly is your SLA for processing each document? The lower, the better :). Right now it's between 30s - 5m, but I would like

process local vs node local subtlety question/issue

2014-06-13 Thread Albert Chu
should be independent of that. I'm sure there's something subtle I'm missing or not understanding, thanks in advance. Al -- Albert Chu ch...@llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory