GC problem doing fuzzy join

2019-06-18 Thread Arun Luthra
I'm trying to do a brute force fuzzy join where I compare N records against N other records, for N^2 total comparisons. The table is medium size and fits in memory, so I collect it and put it into a broadcast variable. The other copy of the table is in an RDD. I am basically calling the RDD map
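
A minimal sketch of the broadcast-plus-map approach described above (the record type, similarity measure, and threshold below are illustrative assumptions, not details from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    case class Record(id: Long, text: String)
    // Stand-in similarity: character-set Jaccard; the real metric is not given in the thread.
    def similarity(a: Record, b: Record): Double = {
      val (x, y) = (a.text.toSet, b.text.toSet)
      if (x.isEmpty && y.isEmpty) 1.0 else x.intersect(y).size.toDouble / x.union(y).size
    }

    val sc = new SparkContext(new SparkConf().setAppName("fuzzy-join").setMaster("local[*]"))
    val bigSide    = sc.parallelize(Seq(Record(1, "spark"), Record(2, "sparc")))   // stays an RDD
    val mediumSide = Seq(Record(10, "spark"), Record(11, "hadoop"))                // the collected table

    val bc = sc.broadcast(mediumSide)
    val matches = bigSide.flatMap { left =>              // N x N comparisons overall
      bc.value.iterator
        .map(right => (left, right, similarity(left, right)))
        .filter(_._3 >= 0.8)                             // threshold assumed
    }
    matches.collect().foreach(println)

The GC pressure in a setup like this usually comes from the per-comparison tuples; emitting only the matches above the threshold (as in the flatMap) keeps the output small even though the comparison count stays N^2.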

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-24 Thread Arun Luthra
Also for the record, turning on kryo was not able to help. On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra <arun.lut...@gmail.com> wrote: > Splitting up the Maps to separate objects did not help. > > However, I was able to work around the problem by reimplementing it with > RDD

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-23 Thread Arun Luthra
Splitting up the Maps to separate objects did not help. However, I was able to work around the problem by reimplementing it with RDD joins. On Aug 18, 2016 5:16 PM, "Arun Luthra" <arun.lut...@gmail.com> wrote: > This might be caused by a few large Map objects that Spark is tr
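
A rough sketch of the workaround described here, assuming the original code looked records up in a large broadcast/closure Map inside a map(); turning the Map into an RDD and joining avoids materializing the whole Map on every executor (all names below are illustrative):

    // `sc` is an existing SparkContext.
    val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)                 // the formerly-broadcast Map
    val records = sc.parallelize(Seq(("a", "row1"), ("b", "row2")))        // RDD[(key, payload)]

    // Broadcast-Map version (the shape that reportedly hit the OOM with large Maps):
    //   val bigMap = sc.broadcast(lookup)
    //   records.map { case (k, v) => (k, v, bigMap.value.get(k)) }

    // RDD-join version (the workaround): the lookup table becomes an RDD and is joined.
    val lookupRdd = sc.parallelize(lookup.toSeq)                           // RDD[(String, Int)]
    val joined    = records.join(lookupRdd)                                // RDD[(key, (payload, looked-up value))]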

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-18 Thread Arun Luthra
me? What if I manually split them up into numerous Map variables? On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra <arun.lut...@gmail.com> wrote: > I got this OOM error in Spark local mode. The error seems to have been at > the start of a stage (all of the stages on the UI showed

Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-15 Thread Arun Luthra
I got this OOM error in Spark local mode. The error seems to have been at the start of a stage (all of the stages on the UI showed as complete; there were more stages to do but they had not yet shown up on the UI). There appears to be ~100G of free memory at the time of the error. Spark 2.0.0 200G

Re: groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
> RDD but now returns another dataset or an unexpected implicit conversion. > Just add rdd() before the groupByKey call to push it into an RDD. That > being said - groupByKey generally is an anti-pattern so please be careful > with it. > > On Wed, Aug 10, 2016 at 8:07 PM, Arun
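
A sketch of the suggested fix, assuming pair_rdd became a Dataset after the 2.0 upgrade so that groupByKey resolved to the Dataset overload rather than PairRDDFunctions.groupByKey (the flatMap body is a placeholder, since the original is elided):

    // Before (fails to compile as shown in the thread):
    //   val some_rdd = pair_rdd.groupByKey().flatMap { case (mk: MyKey, md_iter: Iterable[MyData]) => { ... } }

    // After: drop back to the RDD API first.
    val some_rdd = pair_rdd.rdd                      // Dataset[(MyKey, MyData)] -> RDD[(MyKey, MyData)]
      .groupByKey()                                  // RDD[(MyKey, Iterable[MyData])]
      .flatMap { case (mk, md_iter) => md_iter.map(md => (mk, md)) }   // placeholder body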

groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
Here is the offending line: val some_rdd = pair_rdd.groupByKey().flatMap { case (mk: MyKey, md_iter: Iterable[MyData]) => { ... [error] .scala:249: overloaded method value groupByKey with alternatives: [error] [K](func: org.apache.spark.api.java.function.MapFunction[(aaa.MyKey,

Re: TaskCommitDenied (Driver denied task commit)

2016-01-22 Thread Arun Luthra
21, 2016 at 6:19 PM, Arun Luthra <arun.lut...@gmail.com> wrote: > Two changes I made that appear to be keeping various errors at bay: > > 1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of > https://mail-archives.apache.org/mod_mbox/spar

MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-21 Thread Arun Luthra
WARN MemoryStore: Not enough space to cache broadcast_4 in memory! (computed 60.2 MB so far) WARN MemoryStore: Persisting block broadcast_4 to disk instead. Can I increase the memory allocation for broadcast variables? I have a few broadcast variables that I create with sc.broadcast(). Are
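
The warning means the executor's storage memory pool is full rather than a per-broadcast limit being hit (the block is spilled to disk and the job still proceeds), so the usual levers are more executor memory or a larger storage pool. A sketch with assumed values; the exact knobs depend on the release in use:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "8g")            // more heap per executor; value assumed
      // On 1.6+ (unified memory manager); values assumed:
      .set("spark.memory.fraction", "0.8")
      .set("spark.memory.storageFraction", "0.6")
      // On earlier releases the equivalent knob is spark.storage.memoryFraction (default 0.6).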

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
e partitions? What is the > action you are performing? > > On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <arun.lut...@gmail.com> > wrote: > >> Example warning: >> >> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID >> 4436, XXX): TaskCom

TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Example warning: 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 2168, attempt: 4436 Is there a solution for this? Increase driver memory? I'm using just 1G driver memory but ideally I

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
count > will mask this exception because the coordination does not get triggered in > non save/write operations. > > On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <hol...@pigscanfly.ca> wrote: > >> Before we dig too far into this, the thing which most quickly jumps out >> t

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
n Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra <arun.lut...@gmail.com> > wrote: > >> Usually the pipeline works, it just failed on this particular input data. >> The other data it has run on is of similar size. >> >> Speculation is enabled. >> >> I'm usin
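
Since speculation is enabled here, one quick check is to turn it off and see whether the warnings disappear: TaskCommitDenied is the output commit coordinator refusing a second (often speculative) attempt to commit the same partition, which is normally harmless. A sketch:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.speculation", "false")   // default is false; disable to rule out duplicate speculative commits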

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
mind. On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra <arun.lut...@gmail.com> wrote: > Looking into the yarn logs for a similar job where an executor was > associated with the same error, I find: > > ... > 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive &g

groupByKey does not work?

2016-01-04 Thread Arun Luthra
I tried groupByKey and noticed that it did not group all values into the same group. In my test dataset (a Pair rdd) I have 16 records, where there are only 4 distinct keys, so I expected there to be 4 records in the groupByKey object, but instead there were 8. Each of the 4 distinct keys appear

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
see that each key is repeated 2 times but each key should only appear once. Arun On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Can you give a bit more information ? > > Release of Spark you're using > Minimal dataset that shows the problem >

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
nt so we can count out any issues in object > equality. > > On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra <arun.lut...@gmail.com> wrote: > >> Spark 1.5.0 >> >> data: >> >> p1,lo1,8,0,4,0,5,20150901|5,1,1.0 >> p1,lo2,8,0,4,0,5,20150901|
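
A sketch of the object-equality check being discussed: groupByKey buckets rows by the key's hashCode/equals, so a key type without value equality can split what looks like one key into several groups ("sc" is an existing SparkContext):

    // Keys with value equality (case classes, Strings, tuples of them) group as expected:
    case class GoodKey(p: String, lo: String)
    sc.parallelize(Seq((GoodKey("p1", "lo1"), 1.0), (GoodKey("p1", "lo1"), 2.0)))
      .groupByKey().count()                            // 1 group

    // A key class without equals/hashCode falls back to reference equality,
    // so identical-looking keys land in different groups:
    class BadKey(val p: String, val lo: String) extends Serializable
    sc.parallelize(Seq((new BadKey("p1", "lo1"), 1.0), (new BadKey("p1", "lo1"), 2.0)))
      .groupByKey().count()                            // 2 groups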

types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test with an RDD[Array[String]], but when I tried to read back the result with sc.objectFile(path).take(5).foreach(println), I got a non-promising output looking like: [Ljava.lang.String;@46123a [Ljava.lang.String;@76123b

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
Ah, yes, that did the trick. So more generally, can this handle any serializable object? On Thu, Aug 27, 2015 at 2:11 PM, Jonathan Coveney jcove...@gmail.com wrote: Array[String] doesn't pretty print by default. Use .mkString(",") for example. On Thursday, August 27, 2015, Arun Luthra
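
For reference, a sketch of the round trip: saveAsObjectFile writes Java-serialized elements, so Serializable element types are what it expects, and the "[Ljava.lang.String;@..." output above was simply Array's default toString rather than a corrupt file (path below is assumed):

    val rdd = sc.parallelize(Seq(Array("a", "b"), Array("c", "d")))
    rdd.saveAsObjectFile("/tmp/arrays")

    sc.objectFile[Array[String]]("/tmp/arrays")
      .take(5)
      .foreach(arr => println(arr.mkString(",")))      // pretty-print instead of Array.toString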

How to ignore features in mllib

2015-07-09 Thread Arun Luthra
Is it possible to ignore features in mllib? In other words, I would like to have some 'pass-through' data, Strings for example, attached to training examples and test data. A related stackoverflow question:
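
mllib of this era has no built-in pass-through columns, so a common workaround is to carry the extra Strings alongside each LabeledPoint and hand only the points to the trainer; a sketch with assumed names and DecisionTree chosen arbitrarily as the algorithm:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.rdd.RDD

    // tagged: (passThroughString, point) pairs; the String never enters the model.
    def trainAndTag(tagged: RDD[(String, LabeledPoint)]) = {
      val model = DecisionTree.trainClassifier(
        tagged.values,        // RDD[LabeledPoint] only
        2,                    // numClasses
        Map[Int, Int](),      // categoricalFeaturesInfo
        "gini", 5, 32)        // impurity, maxDepth, maxBins
      tagged.map { case (tag, p) => (tag, p.label, model.predict(p.features)) }
    }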

How to change hive database?

2015-07-07 Thread Arun Luthra
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext I'm getting org.apache.spark.sql.catalyst.analysis.NoSuchTableException from: val dataframe = hiveContext.table(other_db.mytable) Do I have to change current database to access it? Is it possible to
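
Two workarounds commonly used for this (sketch; whether the db-qualified form of table() is accepted depends on the Spark version):

    // Option 1: switch the session's current database, then use the bare table name.
    hiveContext.sql("USE other_db")
    val dataframe = hiveContext.table("mytable")

    // Option 2: keep the current database and qualify the name through SQL instead.
    val dataframe2 = hiveContext.sql("SELECT * FROM other_db.mytable")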

Re: Spark launching without all of the requested YARN resources

2015-07-02 Thread Arun Luthra
as possible should make it complete earlier and increase the utilization of resources On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra arun.lut...@gmail.com wrote: Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark (via spark-submit) will begin its processing even though

Spark launching without all of the requested YARN resources

2015-06-23 Thread Arun Luthra
Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark (via spark-submit) will begin its processing even though it apparently did not get all of the requested resources; it is running very slowly. Is there a way to force Spark/YARN to only begin when it has the full set of
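
Two scheduler settings control how long the driver waits for executors to register before it starts scheduling tasks; a sketch with assumed values:

    val conf = new org.apache.spark.SparkConf()
      // Wait until 100% of the requested executor resources have registered...
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
      // ...or until this much time has passed, whichever comes first (value assumed).
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "300s")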

Missing values support in Mllib yet?

2015-06-19 Thread Arun Luthra
Hi, Is there any support for handling missing values in mllib yet, especially for decision trees where this is a natural feature? Arun

Re: Problem getting program to run on 15TB input

2015-06-09 Thread Arun Luthra
level usage of spark. @Arun, can you kindly confirm if Daniel’s suggestion helped your usecase? Thanks, Kapil Malik | kma...@adobe.com | 33430 / 8800836581 *From:* Daniel Mahler [mailto:dmah...@gmail.com] *Sent:* 13 April 2015 15:42 *To:* Arun Luthra *Cc:* Aaron Davidson; Paweł Szulc

Re: Efficient saveAsTextFile by key, directory for each key?

2015-04-22 Thread Arun Luthra
On Tue, Apr 21, 2015 at 5:45 PM, Arun Luthra arun.lut...@gmail.com wrote: Is there an efficient way to save an RDD with saveAsTextFile in such a way that the data gets shuffled into separated directories according to a key? (My end goal is to wrap the result in a multi-partitioned Hive table

Efficient saveAsTextFile by key, directory for each key?

2015-04-21 Thread Arun Luthra
Is there an efficient way to save an RDD with saveAsTextFile in such a way that the data gets shuffled into separated directories according to a key? (My end goal is to wrap the result in a multi-partitioned Hive table) Suppose you have: case class MyData(val0: Int, val1: string, directory_name:
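
One standard approach with the old Hadoop API is MultipleTextOutputFormat, which routes each record into a file named after its key, so each directory_name becomes a subdirectory that a partitioned Hive table can point at. A sketch (base path, partition count, and the line format are assumptions):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner

    class KeyedOutput extends MultipleTextOutputFormat[Any, Any] {
      // Place each record under a subdirectory named after its key.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      // Drop the key from the file contents; only the value line is written.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    case class MyData(val0: Int, val1: String, directory_name: String)
    val myDataRdd = sc.parallelize(Seq(MyData(1, "a", "dir1"), MyData(2, "b", "dir2")))

    myDataRdd
      .map(d => (d.directory_name, s"${d.val0},${d.val1}"))
      .partitionBy(new HashPartitioner(100))             // partition count assumed
      .saveAsHadoopFile("/output/base", classOf[Any], classOf[Any], classOf[KeyedOutput])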

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-12 Thread Arun Luthra
On Tue, Mar 10, 2015 at 6:54 PM, Arun Luthra arun.lut...@gmail.com wrote: Does anyone know how to get the HighlyCompressedMapStatus to compile? I will try turning off kryo in 1.2.0 and hope things don't break. I want to benefit from the MapOutputTracker fix in 1.2.0. On Tue, Mar 3, 2015 at 5:41

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-12 Thread Arun Luthra
, Mar 12, 2015 at 12:47 PM, Arun Luthra arun.lut...@gmail.com wrote: I'm using a pre-built Spark; I'm not trying to compile Spark. The compile error appears when I try to register HighlyCompressedMapStatus in my program: kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus
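
HighlyCompressedMapStatus is private[spark], which would explain why the classOf[...] form does not compile in user code; registering it by name sidesteps the compile-time reference. A sketch of a kryo registrator along those lines:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
        kryo.register(classOf[org.roaringbitmap.RoaringArray])
        // private[spark] class, so look it up by name instead of using classOf:
        kryo.register(Class.forName("org.apache.spark.scheduler.HighlyCompressedMapStatus"))
      }
    }
    // Wire it up with: .set("spark.kryo.registrator", classOf[MyRegistrator].getName)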

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-10 Thread Arun Luthra
Luthra arun.lut...@gmail.com wrote: I think this is a Java vs scala syntax issue. Will check. On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra arun.lut...@gmail.com wrote: Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import

Re: Problem getting program to run on 15TB input

2015-03-02 Thread Arun Luthra
Everything works smoothly if I do the 99%-removal filter in Hive first. So, all the baggage from garbage collection was breaking it. Is there a way to filter() out 99% of the data without having to garbage collect 99% of the RDD? On Sun, Mar 1, 2015 at 9:56 AM, Arun Luthra arun.lut...@gmail.com

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-02 Thread Arun Luthra
I think this is a Java vs scala syntax issue. Will check. On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra arun.lut...@gmail.com wrote: Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import

Re: Problem getting program to run on 15TB input

2015-03-01 Thread Arun Luthra
But then, reduceByKey() fails with: java.lang.OutOfMemoryError: Java heap space On Sat, Feb 28, 2015 at 12:09 PM, Arun Luthra arun.lut...@gmail.com wrote: The Spark UI names the line number and name of the operation (repartition in this case) that it is performing. Only if this information

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
A correction to my first post: There is also a repartition right before groupByKey to help avoid too-many-open-files error: rdd2.union(rdd1).map(...).filter(...).repartition(15000).groupByKey().map(...).flatMap(...).saveAsTextFile() On Sat, Feb 28, 2015 at 11:10 AM, Arun Luthra arun.lut

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
Tachyon). Burak On Fri, Feb 27, 2015 at 11:39 AM, Arun Luthra arun.lut...@gmail.com wrote: My program in pseudocode looks like this: val conf = new SparkConf().setAppName(Test) .set(spark.storage.memoryFraction,0.2) // default 0.6 .set(spark.shuffle.memoryFraction,0.12

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
://rabbitonweb.com Sat., 28 Feb 2015, 6:22 PM, Arun Luthra (user arun.lut...@gmail.com) wrote: So, actually I am removing the persist for now, because there is significant filtering that happens after calling textFile()... but I will keep that option in mind. I just tried a few different

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
you base that assumption only on logs? Sat., 28 Feb 2015, 8:12 PM, Arun Luthra (user arun.lut...@gmail.com) wrote: A correction to my first post: There is also a repartition right before groupByKey to help avoid too-many-open-files error: rdd2.union(rdd1).map(...).filter

Problem getting program to run on 15TB input

2015-02-27 Thread Arun Luthra
My program in pseudocode looks like this: val conf = new SparkConf().setAppName("Test") .set("spark.storage.memoryFraction", "0.2") // default 0.6 .set("spark.shuffle.memoryFraction", "0.12") // default 0.2 .set("spark.shuffle.manager", "SORT") // preferred setting for optimized joins

Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-02-26 Thread Arun Luthra
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import org.roaringbitmap._ ... kryo.register(classOf[org.roaringbitmap.RoaringBitmap]) kryo.register(classOf[org.roaringbitmap.RoaringArray])

Open file limit settings for Spark on Yarn job

2015-02-10 Thread Arun Luthra
Hi, I'm running Spark on Yarn from an edge node, and the tasks run on the Data Nodes. My job fails with the Too many open files error once it gets to groupByKey(). Alternatively I can make it fail immediately if I repartition the data when I create the RDD. Where do I need to make sure that
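
Two things usually matter here: the OS open-file limit (nofile / ulimit -n) on the worker nodes where the executors actually run, and how many shuffle files Spark opens at once. A sketch of the Spark-side settings from this era (values assumed):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.consolidateFiles", "true")   // fewer files when using the hash shuffle manager
      .set("spark.shuffle.manager", "sort")            // the sort shuffle keeps far fewer files open per task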

Re: Spark job ends abruptly during setup without error message

2015-02-05 Thread Arun Luthra
: Are you submitting the job from your local machine or on the driver machine? Have you set YARN_CONF_DIR? On Thu, Feb 5, 2015 at 10:43 PM, Arun Luthra arun.lut...@gmail.com wrote: While a spark-submit job is setting up, the yarnAppState goes into Running mode, then I get a flurry of typical

Spark job ends abruptly during setup without error message

2015-02-05 Thread Arun Luthra
While a spark-submit job is setting up, the yarnAppState goes into Running mode, then I get a flurry of typical looking INFO-level messages such as INFO BlockManagerMasterActor: ... INFO YarnClientSchedulerBackend: Registered executor: ... Then, spark-submit quits without any error message and

Re: SQL query in scala API

2014-12-06 Thread Arun Luthra
) => (count + 1, seen + user) }, { case ((count0, seen0), (count1, seen1)) => (count0 + count1, seen0 ++ seen1) }).mapValues { case (count, seen) => (count, seen.size) } On 12/5/14 3:47 AM, Arun Luthra wrote: Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3
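
Pieced together, the fragment above looks like an aggregateByKey over (zip, user) pairs; a sketch of the whole expression (the zero value and sample data are assumed):

    // users: RDD[(zip, userId)]
    val users = sc.parallelize(Seq(("94110", "u1"), ("94110", "u1"), ("94110", "u2")))

    val perZip = users
      .aggregateByKey((0L, Set.empty[String]))(
        { case ((count, seen), user) => (count + 1, seen + user) },                       // within a partition
        { case ((count0, seen0), (count1, seen1)) => (count0 + count1, seen0 ++ seen1) }  // across partitions
      )
      .mapValues { case (count, seen) => (count, seen.size) }
    // perZip: RDD[(zip, (COUNT(user), COUNT(DISTINCT user)))], e.g. ("94110", (3, 2))

For very high-cardinality keys, countApproxDistinctByKey avoids holding the full Set of users per zip, at the cost of an approximate distinct count.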

Re: SQL query in scala API

2014-12-04 Thread Arun Luthra
Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote: You may do this: table("users").groupBy('zip)('zip, count('user), countDistinct('user)) On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how

SQL query in scala API

2014-12-03 Thread Arun Luthra
I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API, I can make an RDD (called users) of key-value pairs where the keys are zip (as in ZIP code) and the values are user id's. Then I can

Re: rack-topology.sh no such file or directory

2014-11-25 Thread Arun Luthra
at your hdfs-site.xml and remove the setting for a rack topology script there (or it might be in core-site.xml). Matei On Nov 19, 2014, at 12:13 PM, Arun Luthra arun.lut...@gmail.com wrote: I'm trying to run Spark on Yarn on a hortonworks 2.1.5 cluster. I'm getting this error: 14/11/19 13

rack-topology.sh no such file or directory

2014-11-19 Thread Arun Luthra
solution? Arun Luthra

How to configure build.sbt for Spark 1.2.0

2014-10-08 Thread Arun Luthra
I built Spark 1.2.0 successfully, but was unable to build my Spark program under 1.2.0 with sbt assembly and my build.sbt file. In it I tried: "org.apache.spark" %% "spark-sql" % "1.2.0", "org.apache.spark" %% "spark-core" % "1.2.0", and "org.apache.spark" %% "spark-sql" % "1.2.0-SNAPSHOT",

Re: How to configure build.sbt for Spark 1.2.0

2014-10-08 Thread Arun Luthra
-SNAPSHOT), you'll need to first build spark and publish-local so your application build can find those SNAPSHOTs in your local repo. Just append publish-local to your sbt command where you build Spark. -Pat On Wed, Oct 8, 2014 at 5:35 PM, Arun Luthra arun.lut...@gmail.com wrote: I built Spark
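
A sketch of the resulting build.sbt fragment (1.2.0 was not yet published at the time, hence the SNAPSHOT coordinates after running publish-local from the Spark checkout; the provided scope and Scala version are assumptions):

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.2.0-SNAPSHOT" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.2.0-SNAPSHOT" % "provided"
    )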

Re: PrintWriter error in foreach

2014-09-10 Thread Arun Luthra
path to the file you want to write, and make sure the directory exists and is writable by the Spark process. On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra arun.lut...@gmail.com wrote: I have a spark program that worked in local mode, but throws an error in yarn-client mode on a cluster
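
For context, the body of a foreach runs on the executors (the cluster's worker nodes in yarn-client mode), not on the machine that ran spark-submit, so the target directory has to exist and be writable on every worker. A sketch of writing one file per partition on the workers, plus the collect-to-driver alternative (paths and data are assumed):

    import java.io.{File, PrintWriter}

    val rdd = sc.parallelize(1 to 100).map(_.toString)                  // stand-in data

    rdd.foreachPartition { iter =>
      // Runs on an executor: /tmp must exist and be writable on that node.
      val out = new PrintWriter(new File(s"/tmp/part-${java.util.UUID.randomUUID()}.txt"))
      try iter.foreach(rec => out.println(rec)) finally out.close()
    }

    // If the output belongs on the driver/edge node instead, collect first (small data only):
    //   val out = new PrintWriter("local-output.txt")
    //   rdd.collect().foreach(rec => out.println(rec)); out.close()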