I'm trying to do a brute-force fuzzy join where I compare N records against
N other records, for N^2 total comparisons.
The table is medium-sized and fits in memory, so I collect it and put it
into a broadcast variable.
The other copy of the table is in an RDD. I am basically calling the RDD
map
Also, for the record, turning on Kryo did not help.
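A hedged sketch of the setup described above, assuming an existing SparkContext sc; the record type and the similarity test are placeholders, not from this thread:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class Rec(id: Long, name: String)
// Placeholder similarity test standing in for the real fuzzy-match logic.
def similar(a: Rec, b: Rec): Boolean = a.name.take(3) == b.name.take(3)

def fuzzyJoin(sc: SparkContext, smallTable: Seq[Rec], bigTable: RDD[Rec]): RDD[(Long, Long)] = {
  val smallB = sc.broadcast(smallTable)  // the collected, in-memory copy
  bigTable.flatMap { r =>                // each of N records is compared against all N broadcast records
    smallB.value.collect { case l if similar(l, r) => (l.id, r.id) }
  }
}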
On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
Splitting up the Maps into separate objects did not help.
However, I was able to work around the problem by reimplementing it with
RDD joins.
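A hedged sketch of that workaround, assuming an existing SparkContext sc and made-up data: key both sides and let Spark shuffle, so no large Map has to be serialized from the driver:

import org.apache.spark.rdd.RDD

val facts: RDD[(String, String)] = sc.parallelize(Seq("k1" -> "a", "k2" -> "b"))
val dims: RDD[(String, Int)] = sc.parallelize(Seq("k1" -> 1, "k2" -> 2))
val joined: RDD[(String, (String, Int))] = facts.join(dims)  // (key, (left value, right value))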
On Aug 18, 2016 5:16 PM, "Arun Luthra" <arun.lut...@gmail.com> wrote:
> This might be caused by a few large Map objects that Spark is trying to
> serialize. Can anyone confirm this for me?
What if I manually split them up into numerous Map variables?
On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
I got this OOM error in Spark local mode. The error seems to have been at
the start of a stage (all of the stages on the UI showed as complete, there
were more stages to do, but they had not shown up on the UI yet).
There appears to be ~100G of free memory at the time of the error.
Spark 2.0.0
200G
> RDD but now returns another Dataset or an unexpected implicit conversion.
> Just add rdd() before the groupByKey call to push it into an RDD. That
> being said - groupByKey generally is an anti-pattern so please be careful
> with it.
>
> On Wed, Aug 10, 2016 at 8:07 PM, Arun
Here is the offending line:
val some_rdd = pair_rdd.groupByKey().flatMap { case (mk: MyKey, md_iter:
Iterable[MyData]) => {
...
[error] .scala:249: overloaded method value groupByKey with
alternatives:
[error] [K](func:
org.apache.spark.api.java.function.MapFunction[(aaa.MyKey,
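A hedged sketch of Holden's suggested fix; pair_ds is an assumed name for what pair_rdd evidently is in Spark 2.0, a Dataset of (MyKey, MyData) pairs:

val some_rdd = pair_ds.rdd  // drop to the RDD API so RDD.groupByKey is selected, not Dataset.groupByKey
  .groupByKey()
  .flatMap { case (mk: MyKey, md_iter: Iterable[MyData]) =>
    md_iter.map(md => (mk, md))  // placeholder body
  }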
21, 2016 at 6:19 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
> Two changes I made that appear to be keeping various errors at bay:
>
> 1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of
> https://mail-archives.apache.org/mod_mbox/spar
WARN MemoryStore: Not enough space to cache broadcast_4 in memory!
(computed 60.2 MB so far)
WARN MemoryStore: Persisting block broadcast_4 to disk instead.
Can I increase the memory allocation for broadcast variables?
I have a few broadcast variables that I create with sc.broadcast() . Are
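There is no broadcast-specific pool: broadcast blocks live in the storage region of Spark's unified memory, so the levers are the storage settings or the executor heap itself. A hedged sketch with illustrative values:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.7")         // heap share for execution + storage
  .set("spark.memory.storageFraction", "0.5")  // portion of that protected for storage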
…partitions? What is the action you are performing?
On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
Example warning:
16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 2168, attempt: 4436
Is there a solution for this? Increase driver memory? I'm using just 1G of
driver memory, but ideally I
count
> will mask this exception because the coordination does not get triggered in
> non-save/write operations.
>
> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Before we dig too far into this, the thing which most quickly jumps out
>> t
On Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra <arun.lut...@gmail.com>
> wrote:
>
>> Usually the pipeline works, it just failed on this particular input data.
>> The other data it has run on is of similar size.
>>
>> Speculation is enabled.
>>
>> I'm usin
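A hedged note on the above: TaskCommitDenied usually means the output commit coordinator refused a duplicate task attempt, and speculation produces duplicate attempts by design, so turning it off is one way to test that theory:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.speculation", "false")  // no speculative duplicate attempts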
mind.
On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
> Looking into the yarn logs for a similar job where an executor was
> associated with the same error, I find:
>
> ...
> 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive
I tried groupByKey and noticed that it did not group all values into the
same group.
In my test dataset (a Pair rdd) I have 16 records, where there are only 4
distinct keys, so I expected there to be 4 records in the groupByKey
object, but instead there were 8. Each of the 4 distinct keys appears
twice: I can see that each key is repeated 2 times, but each key should only
appear once.
Arun
On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Can you give a bit more information ?
>
> Release of Spark you're using
> Minimal dataset that shows the problem
>
nt so we can count out any issues in object
> equality.
>
> On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra <arun.lut...@gmail.com> wrote:
>
>> Spark 1.5.0
>>
>> data:
>>
>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo2,8,0,4,0,5,20150901|
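A hedged illustration of one common cause of the duplicate groups reported above (an assumption, not something confirmed in this thread): a key type without value-based equals/hashCode makes every instance its own group. Assuming an existing SparkContext sc:

class RawKey(val p: String, val lo: String) extends Serializable  // reference equality only
case class GoodKey(p: String, lo: String)                         // structural equality

val rows = Seq(("p1", "lo1"), ("p1", "lo1"), ("p1", "lo2"), ("p1", "lo2"))
println(sc.parallelize(rows).map { case (p, lo) => (new RawKey(p, lo), 1) }
  .groupByKey().count())  // 4: every RawKey instance is a distinct key
println(sc.parallelize(rows).map { case (p, lo) => (GoodKey(p, lo), 1) }
  .groupByKey().count())  // 2: case class keys compare by value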
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test
with an RDD[Array[String]], but when I tried to read back the result with
sc.objectFile(path).take(5).foreach(println), I got a non-promising output
looking like:
[Ljava.lang.String;@46123a
[Ljava.lang.String;@76123b
Ah, yes, that did the trick.
So more generally, can this handle any serializable object?
On Thu, Aug 27, 2015 at 2:11 PM, Jonathan Coveney jcove...@gmail.com
wrote:
Array[String] doesn't pretty-print by default. Use .mkString(","), for
example.
On Thursday, August 27, 2015, Arun Luthra
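A hedged round-trip sketch, assuming an existing SparkContext sc and a writable scratch path:

val path = "/tmp/arrays"  // assumed location
sc.parallelize(Seq(Array("a", "b"), Array("c", "d"))).saveAsObjectFile(path)
sc.objectFile[Array[String]](path)
  .take(5)
  .foreach(arr => println(arr.mkString(",")))  // prints a,b and c,d rather than [Ljava.lang.String;@...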
Is it possible to ignore features in mllib? In other words, I would like to
have some 'pass-through' data, Strings for example, attached to training
examples and test data.
A related stackoverflow question:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
I'm getting org.apache.spark.sql.catalyst.analysis.NoSuchTableException
from:
val dataframe = hiveContext.table("other_db.mytable")
Do I have to change the current database to access it? Is it possible to
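A hedged workaround sketch: some versions resolve only unqualified names in the current database, so switch first and then look the table up bare:

hiveContext.sql("USE other_db")
val dataframe = hiveContext.table("mytable")  // now resolved inside other_db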
as possible should make it complete earlier and increase the
utilization of resources
On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra arun.lut...@gmail.com
wrote:
Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark (via
spark-submit) will begin its processing even though it apparently did not
get all of the requested resources; it is running very slowly.
Is there a way to force Spark/YARN to only begin when it has the full set
of
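A hedged sketch of the scheduler knobs that make Spark wait for executors to register before launching tasks; the values are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")         // wait for all requested executors
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120s")  // but cap the wait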
Hi,
Is there any support for handling missing values in mllib yet, especially
for decision trees where this is a natural feature?
Arun
level usage of Spark.
@Arun, can you kindly confirm if Daniel's suggestion helped your use case?
Thanks,
Kapil Malik | kma...@adobe.com | 33430 / 8800836581
From: Daniel Mahler [mailto:dmah...@gmail.com]
Sent: 13 April 2015 15:42
To: Arun Luthra
Cc: Aaron Davidson; Paweł Szulc
On Tue, Apr 21, 2015 at 5:45 PM, Arun Luthra arun.lut...@gmail.com wrote:
Is there an efficient way to save an RDD with saveAsTextFile in such a way
that the data gets shuffled into separated directories according to a key?
(My end goal is to wrap the result in a multi-partitioned Hive table)
Suppose you have:
case class MyData(val0: Int, val1: String, directory_name:
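The reply above is cut off; what follows is a hedged sketch of one standard approach (not necessarily the one the reply went on to describe), assuming rdd: RDD[MyData]: key each record by its target directory and fan the output out with Hadoop's MultipleTextOutputFormat.

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class DirPerKeyFormat extends MultipleTextOutputFormat[Any, Any] {
  // Route each record into a subdirectory named after its key, e.g. dirA/part-00000.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"${key.toString}/$name"
  // Keep the key out of the written lines; only the value text remains.
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

case class MyData(val0: Int, val1: String, directory_name: String)

rdd.map(d => (d.directory_name, s"${d.val0},${d.val1}"))
  .saveAsHadoopFile("/output/base", classOf[String], classOf[String], classOf[DirPerKeyFormat])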
On Tue, Mar 10, 2015 at 6:54 PM, Arun Luthra arun.lut...@gmail.com
wrote:
Does anyone know how to get the HighlyCompressedMapStatus to compile?
I will try turning off kryo in 1.2.0 and hope things don't break. I want
to benefit from the MapOutputTracker fix in 1.2.0.
On Tue, Mar 3, 2015 at 5:41
, Mar 12, 2015 at 12:47 PM, Arun Luthra arun.lut...@gmail.com
wrote:
I'm using a pre-built Spark; I'm not trying to compile Spark.
The compile error appears when I try to register
HighlyCompressedMapStatus in my program:
kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus])
Everything works smoothly if I do the 99%-removal filter in Hive first. So,
all the baggage from garbage collection was breaking it.
Is there a way to filter() out 99% of the data without having to garbage
collect 99% of the RDD?
On Sun, Mar 1, 2015 at 9:56 AM, Arun Luthra arun.lut...@gmail.com
I think this is a Java vs scala syntax issue. Will check.
On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra arun.lut...@gmail.com wrote:
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949
I tried this as a workaround:
import org.apache.spark.scheduler._
import org.roaringbitmap._
But then, reduceByKey() fails with:
java.lang.OutOfMemoryError: Java heap space
On Sat, Feb 28, 2015 at 12:09 PM, Arun Luthra arun.lut...@gmail.com wrote:
The Spark UI names the line number and name of the operation (repartition
in this case) that it is performing. Only if this information
A correction to my first post:
There is also a repartition right before groupByKey to help avoid
too-many-open-files error:
rdd2.union(rdd1).map(...).filter(...).repartition(15000).groupByKey().map(...).flatMap(...).saveAsTextFile()
On Sat, Feb 28, 2015 at 11:10 AM, Arun Luthra arun.lut
Tachyon).
Burak
On Fri, Feb 27, 2015 at 11:39 AM, Arun Luthra arun.lut...@gmail.com
wrote:
://rabbitonweb.com
On Sat, Feb 28, 2015 at 6:22 PM, Arun Luthra <arun.lut...@gmail.com>
wrote:
So, actually I am removing the persist for now, because there is
significant filtering that happens after calling textFile()... but I will
keep that option in mind.
I just tried a few different
…you base that assumption
only on logs?
On Sat, Feb 28, 2015 at 8:12 PM, Arun Luthra <arun.lut...@gmail.com>
wrote:
My program in pseudocode looks like this:
val conf = new SparkConf().setAppName("Test")
  .set("spark.storage.memoryFraction", "0.2")   // default 0.6
  .set("spark.shuffle.memoryFraction", "0.12")  // default 0.2
  .set("spark.shuffle.manager", "SORT")         // preferred setting for optimized joins
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949
I tried this as a workaround:
import org.apache.spark.scheduler._
import org.roaringbitmap._
...
kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
kryo.register(classOf[org.roaringbitmap.RoaringArray])
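A hedged sketch of wiring those registrations in through a custom registrator referenced from the Spark conf:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class BitmapRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
    kryo.register(classOf[org.roaringbitmap.RoaringArray])
  }
}
// conf.set("spark.kryo.registrator", classOf[BitmapRegistrator].getName)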
Hi,
I'm running Spark on YARN from an edge node, and the tasks run on the Data
Nodes. My job fails with the "Too many open files" error once it gets to
groupByKey(). Alternatively, I can make it fail immediately if I repartition
the data when I create the RDD.
Where do I need to make sure that
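A hedged two-part answer for the Spark 1.x hash-shuffle era: consolidate shuffle files, and raise the open-file limit (ulimit -n) on the Data Nodes where the executors actually run, not on the edge node:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")  // far fewer shuffle files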
:
Are you submitting the job from your local machine or from the driver
machine?
Have you set YARN_CONF_DIR?
On Thu, Feb 5, 2015 at 10:43 PM, Arun Luthra arun.lut...@gmail.com
wrote:
While a spark-submit job is setting up, the yarnAppState goes into Running
mode, then I get a flurry of typical looking INFO-level messages such as
INFO BlockManagerMasterActor: ...
INFO YarnClientSchedulerBackend: Registered executor: ...
Then, spark-submit quits without any error message and
{ case ((count, seen), user) =>
  (count + 1, seen + user)
}, { case ((count0, seen0), (count1, seen1)) =>
  (count0 + count1, seen0 ++ seen1)
}).mapValues { case (count, seen) =>
  (count, seen.size)
}
On 12/5/14 3:47 AM, Arun Luthra wrote:
Is that Spark SQL? I'm wondering if it's possible without Spark SQL.
On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote:
You may do this:
table("users").groupBy('zip)('zip, count('user), countDistinct('user))
On 12/4/14 8:47 AM, Arun Luthra wrote:
I'm wondering how to do this kind of SQL query with PairRDDFunctions.
SELECT zip, COUNT(user), COUNT(DISTINCT user)
FROM users
GROUP BY zip
In the Spark Scala API, I can make an RDD (called users) of key-value
pairs where the keys are zip (as in ZIP code) and the values are user IDs.
Then I can
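A hedged, self-contained sketch of the PairRDDFunctions approach in the replies above, assuming an existing SparkContext sc; the sample data is made up:

val users = sc.parallelize(Seq(("94110", "u1"), ("94110", "u1"), ("94110", "u2"), ("10001", "u3")))
val result = users.aggregateByKey((0, Set.empty[String]))(
  { case ((count, seen), user) => (count + 1, seen + user) },  // fold one user into the accumulator
  { case ((c0, s0), (c1, s1)) => (c0 + c1, s0 ++ s1) }         // merge per-partition accumulators
).mapValues { case (count, seen) => (count, seen.size) }       // (COUNT(user), COUNT(DISTINCT user))
result.collect().foreach(println)  // e.g. (94110,(3,2)) and (10001,(1,1))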
Look at your hdfs-site.xml
and remove the setting for a rack topology script there (or it might be in
core-site.xml).
Matei
On Nov 19, 2014, at 12:13 PM, Arun Luthra arun.lut...@gmail.com wrote:
I'm trying to run Spark on Yarn on a hortonworks 2.1.5 cluster. I'm
getting this error:
14/11/19 13
solution?
Arun Luthra
I built Spark 1.2.0 successfully, but was unable to build my Spark program
under 1.2.0 with sbt assembly. My build.sbt file contains:
I tried:
"org.apache.spark" %% "spark-sql" % "1.2.0",
"org.apache.spark" %% "spark-core" % "1.2.0",
and
"org.apache.spark" %% "spark-sql" % "1.2.0-SNAPSHOT",
-SNAPSHOT),
you'll need to first build Spark and publish-local so your application
build can find those SNAPSHOTs in your local repo.
Just append publish-local to your sbt command where you build Spark.
-Pat
On Wed, Oct 8, 2014 at 5:35 PM, Arun Luthra arun.lut...@gmail.com wrote:
I built Spark
path to the file you want to write, and make sure the
directory exists and is writable by the Spark process.
On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra arun.lut...@gmail.com
wrote:
I have a spark program that worked in local mode, but throws an error in
yarn-client mode on a cluster