In my experience, Standalone works very well in a small cluster where there isn't
anything else running.
For a bigger cluster or shared resources, YARN wins, despite the overhead of
spawning containers as opposed to an already-running background worker.
Best is if you try both; if standalone is good
Hello there,
I have a Spark program in 1.6.1; however, when I submit it to the cluster, it
randomly picks the driver.
I know there is a driver specification option, but along with it, it is
mandatory to define many other options I am not familiar with. The trouble is,
the .jars I am launching need
Hi thanks for the assistance,
1. Standalone
2. df.orderBy(field).limit(5000).write.parquet(...)
From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, August 05, 2016 4:33 PM
To: Ellafi, Saif A.
Cc: user @spark
Subject: Re: Applying a limit after orderBy of big
Hi all,
I am working with a 1.5 billion-row dataframe in a small cluster and trying to
apply an orderBy operation on one of the Long type columns.
If I limit the output to some number, say 5 million, then trying to count,
persist or store the dataframe makes Spark crash, losing executors
Hi all,
I was playing around with SPARK_LOCAL_DIRS in spark-env in order
to add additional shuffle storage.
But since I did this, I am getting a "too many open files" error if the total
executor cores is high. I am also getting low parallelism, judging by the
running tasks on some
Appreciate the follow up.
I am not entirely sure how or why my question is related to bucketization
capabilities. It indeed sounds like a powerful feature for avoiding shuffling, but
in my case I am referring to straightforward processes of reading data and
writing to parquet.
If bucket tables
Hello everyone!
I was noticing that, when reading parquet files or actually any kind of source
data frame data (spark-csv, etc.), the default partitioning is not fair.
Action tasks usually run very fast on some partitions and very slowly on some
others, and frequently run fast on all but the last
Hi,
I am submitting a jar file to spark-submit, which has some content inside
src/main/resources.
I am unable to access those resources, since the application jar is not being
added to the classpath.
This works fine if I also include the application jar in the --driver-class-path
entry.
Is
Appreciated, Michael, but this doesn't help my case; the filter string is being
submitted from outside my program. Is there any other alternative? Some literal
string parser or anything I can do beforehand?
Saif
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, April 13, 2016
Hi,
I am debugging a program, and for some reason a line calling the following is
failing:
df.filter("sum(OpenAccounts) > 5").show
It says it cannot find the column OpenAccounts, as if it were applying the sum()
function and then looking for a column with that name, which does not exist. This
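For reference, the usual pattern when a predicate refers to an aggregate is to compute the aggregate explicitly and then filter on it. A minimal sketch, with a hypothetical grouping column:
import org.apache.spark.sql.functions.{col, sum}

val totals = df.groupBy("customer_id").agg(sum("OpenAccounts").as("TotalOpen"))  // "customer_id" is hypothetical
totals.filter(col("TotalOpen") > 5).show()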
Hello all, I am facing a silly data question.
If I have 100+ csv files which are part of the same data, but each csv is, for
example, a year in a timeframe column (i.e. partitioned by year),
what would you suggest instead of loading all those files and joining them?
The final target would be
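One common approach is to skip the joins entirely and read all the yearly files into a single DataFrame. A sketch, assuming the spark-csv package and hypothetical paths:
val all = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/data/by_year/*.csv")   // a glob pulls in every yearly file; a unionAll of per-file reads is an equivalent alternative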
Hello,
In general, I am usually able to run spark-submit jobs in local mode, on a
32-core node with plenty of RAM. The performance is significantly
faster in local mode than when using a cluster of Spark workers.
How can this be explained, and what measures can one take in order to
Thank you, this looks useful indeed for what I have in mind.
Saif
From: Jiří Syrový [mailto:syrovy.j...@gmail.com]
Sent: Friday, January 15, 2016 12:06 PM
To: Ellafi, Saif A.
Cc: user@spark.apache.org
Subject: Re: jobs much slower in cluster mode vs local
Hi,
you can try to use spark job
Hello,
I am tinkering with Spark 1.6. I have this 1.5 billion-row dataset, to which I
apply several window functions such as lag, first, etc. The job is quite
expensive; I am running a small cluster with executors running with 70GB of RAM.
Using the new memory management system, the job fails around
Where can I read more about the Dataset API at a user level? I am failing to
find an API doc or understand when to use DataFrame or Dataset, their advantages, etc.
Thanks,
Saif
-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
Sent: Monday, January 04, 2016 2:01 PM
To:
Hello everyone,
I am testing some parallel program submission to a standalone cluster.
Everything works alright; the problem is, for some reason, I can't submit more
than 3 programs to the cluster.
The fourth one, whether legacy or REST, simply hangs until one of the first
three completes.
I
Data:
+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|0.13271745268556925|[-0.2006809895664...|
|0.23956421080605234|[-0.0938342314459...|
|0.47464690691431843|[0.14124846466227...|
|
Hi, just a couple of cents.
Are your joining columns StringTypes (the id field)? I have recently reported a bug
where I was getting inconsistent results when filtering String fields in group
operations.
Saif
From: Colin Alstad [mailto:colin.als...@pokitdok.com]
Sent: Wednesday, October 28, 2015 12:39 PM
Hi all,
I have a parquet file, which I am loading in a shell. When I launch the shell
with --driver-java-options="-Dspark.serializer=...kryo", a couple of fields end up
looking like:
03-?? ??-?? ??-???
when calling > data.first
I will confirm shortly, but I am utterly sure it happens
Hello Everyone,
I need urgent help with a data consistency issue I am having.
Standalone cluster of five servers. The sqlContext instance is a HiveContext
(the default in spark-shell).
No special options other than driver memory and executor memory.
Parquet partitions are 512, where there are 160
Hello everyone,
I am doing some analytics experiments on a 4-server standalone cluster in a
spark shell, mostly involving a huge database with groupBy and aggregations.
I am picking 6 groupBy columns and returning various aggregated results in a
dataframe. The groupBy fields are of two types,
Thanks; sorry, I cannot share the data, and I am not sure how significant it
would be for you.
I am reproducing the issue on a smaller piece of the content to see whether I
can find a reason for the inconsistency.
val res2 = data.filter($"closed" === $"ever_closed").groupBy("product", "band",
"aget",
Never mind my last email. res2 is filtered, so my test does not make sense. The
issue is not reproduced there. I have the problem somewhere else.
From: Ellafi, Saif A.
Sent: Thursday, October 22, 2015 12:57 PM
To: 'Xiao Li'
Cc: user
Subject: RE: Spark groupby and agg inconsistent and missing data
Hello,
Is there an optimal number of partitions per number of rows when writing to
disk, so we can later re-read from the source in a distributed way?
Any thoughts?
Thanks
Saif
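For what it's worth, there is no universal formula; a common rule of thumb is to size partitions so each output file lands in roughly the 100 MB to 1 GB range. A sketch with a hypothetical target:
val targetPartitions = 200                                   // hypothetical value for this dataset
df.repartition(targetPartitions).write.parquet("/data/out")  // hypothetical output path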
Hi,
Is it possible to load a spark-shell, in which we do any number of operations
on a dataframe, then register it as a temporary table and get to see it through
the thriftserver?
P.S.: or even better, submit a full job and store the dataframe in the
thriftserver's memory before the job completes.
I
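For reference, one way this is usually done in Spark 1.x is to start the Thrift server inside the same JVM, so it shares the session's temporary tables. A sketch, assuming sqlContext is a HiveContext and a hypothetical table name:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

df.registerTempTable("my_results")
HiveThriftServer2.startWithContext(sqlContext)
// a JDBC client (e.g. beeline) can then query my_results while the session is alive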
Has anyone tried the shuffle service in standalone cluster mode? I want to enable
it for dynamic allocation, but my jobs never start when I submit them.
This happens with all my jobs.
15/10/13 08:29:45 INFO DAGScheduler: Job 0 failed: json at DataLoader.scala:86,
took 16.318615 s
Exception in thread "main"
Hi, thanks
Executors are simply failing to connect to a shuffle server:
15/10/13 08:29:34 INFO BlockManagerMaster: Registered BlockManager
15/10/13 08:29:34 INFO BlockManager: Registering executor with local external
shuffle service.
15/10/13 08:29:34 ERROR BlockManager: Failed to connect to
I believe the confusion here is self-answered.
The thing is that the documentation only covers running the Spark shuffle service
under YARN, while here we are talking about a standalone cluster.
The proper question is: how do you launch a shuffle service for standalone?
Saif
From:
Thanks, I missed that one.
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Tuesday, October 13, 2015 2:36 PM
To: Ellafi, Saif A.
Cc: user@spark.apache.org
Subject: Re: Spark shuffle service does not work in stand alone
You have to manually start the shuffle service if you're not running
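For reference, in standalone mode the external shuffle service has to be enabled on the worker side (newer 1.x releases also ship sbin/start-shuffle-service.sh), and the application then opts in through its configuration. A sketch of the application-side settings:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")   // only if dynamic allocation is the goal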
Hi all,
I am in the process of learning big data.
Right now, I am bringing huge databases into Spark through JDBC (a 250 million-row
table can take around 3 hours), and then re-saving them as JSON, which is
fast, simple, distributed, fail-safe and stores data types, although without
any
Where can we find other available functions such as lit()? I can't find lit in
the API.
Thanks
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, October 09, 2015 4:04 PM
To: unk1102
Cc: user
Subject: Re: How to calculate percentile of a column of DataFrame?
You can use
Yes, but I mean, this is rather curious. How does def lit(literal: Any) become
a percentile function via lit(25)?
Thanks for the clarification
Saif
From: Umesh Kacha [mailto:umesh.ka...@gmail.com]
Sent: Friday, October 09, 2015 4:10 PM
To: Ellafi, Saif A.
Cc: Michael Armbrust; user
Subject: Re: How to
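For reference, lit() lives in org.apache.spark.sql.functions, and in the percentile suggestion it only wraps the constant so it can be passed as a Column to the Hive UDAF. A sketch, assuming a HiveContext and hypothetical column names:
import org.apache.spark.sql.functions.{callUDF, col, lit}

df.groupBy("group_col")
  .agg(callUDF("percentile_approx", col("value_col"), lit(0.25)).as("p25"))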
Hi all, would this be a bug?
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

val ws = Window.partitionBy("clrty_id").orderBy("filemonth_dtt")
val nm = "repeatMe"
df.select(df.col("*"), rowNumber().over(ws).cast("int").as(nm))
Hi, thanks for looking into it. Version 1.5.1. I am really worried.
I don't actually have Hive/Hadoop in the environment.
Saif
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, October 08, 2015 2:57 PM
To: Ellafi, Saif A.
Cc: user
Subject: Re: RowNumber in HiveContext returns null or
It turns out this does not happen in local[32] mode. It only happens when
submitting to the standalone cluster. I don't have YARN/Mesos to compare.
Will keep diagnosing.
Saif
From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com]
Sent: Thursday, October 08, 2015 3:01 PM
To:
Setting repartition and default parallelism to 1 in cluster mode is still broken.
So the problem is not the parallelism but the cluster mode itself. Something is
wrong with HiveContext + cluster mode.
Saif
From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com]
Sent: Thursday, October
It can be large, yes. But that still does not resolve the question of why it
works in a smaller environment, i.e. local[32], or in cluster mode when using
SQLContext instead of HiveContext.
The process, in general, is a RowNumber() HiveQL operation; that is why I need
HiveContext.
I have the
When running a standalone cluster-mode job, the process hangs randomly during
a DataFrame flatMap or explode operation in HiveContext:
df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r)
This does not happen with SQLContext in the cluster, nor with Hive/SQL in local
mode, where it
Hi all,
In a standalone cluster operation, with more than 80 GB of RAM in each node,
I am trying to:
1. load a partitioned json dataframe which weighs around 100GB as input
2. apply transformations such as casting some column types
3. get some percentiles, which involves a sort by,
Hi,
I have a HiveContext job which takes less than 1 minute to complete in local
mode with 16 cores.
However, when I launch it over the standalone cluster, it takes forever and
probably can't even finish, even when the only node up is the same one on which
I execute it locally.
How could I
Hi,
I would like to read the full spark-submit log once a job has completed. Since
the output does not go to stdout, how could I redirect Spark's output to a file in Linux?
Thanks,
Saif
Thank you, exactly what I was looking for. I had read about it before but never
made the association.
Saif
From: Adrian Tanase [mailto:atan...@adobe.com]
Sent: Friday, October 02, 2015 8:24 AM
To: Ellafi, Saif A.; user@spark.apache.org
Subject: Re: Accumulator of rows?
Have you seen window functions?
Given ID and DATE, I need all dates sorted per ID. What is the easiest way?
I got this, but I don't like it:
val l = zz.groupBy("id", "dt").agg($"dt".as("dummy")).sort($"dt".asc)
Saif
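For reference, a window-function version (as suggested in the thread above) makes the per-id ordering explicit. A sketch, assuming a HiveContext on 1.x:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rowNumber}

val w = Window.partitionBy("id").orderBy(col("dt").asc)
zz.select(col("id"), col("dt"), rowNumber().over(w).as("date_rank"))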
Hi, thank you
I would prefer to leave writing-to-disk as a last resort. Is it a last resort?
Saif
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, October 02, 2015 3:54 PM
To: Ellafi, Saif A.
Cc: user
Subject: Re: how to broadcast huge lookup table?
Have you considered using external
I tried broadcasting a key-value RDD, but then I cannot perform any RDD actions
inside a map/foreach function of another RDD.
Any tips? If I go through Scala collections, I end up with huge memory bottlenecks.
Saif
Hi all,
I need to repeat a couple of rows from a dataframe n times each. To do so, I
plan to create a new DataFrame, but I have been unable to find a way to
accumulate "Rows" somewhere; as this might get huge, I can't accumulate into a
mutable Array, I think.
Thanks,
Saif
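For reference, the repetition can be built distributedly with flatMap instead of accumulating Rows on the driver. A sketch, assuming a hypothetical count column "n":
val repeatedRows = df.flatMap { r => Seq.fill(r.getAs[Int]("n"))(r) }   // RDD[Row], nothing collected on the driver
val repeatedDF   = sqlContext.createDataFrame(repeatedRows, df.schema)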
Hi all,
I have a process where I do some calculations on each one of the columns of a
dataframe.
Intrinsically, I run across each column with a for loop. On the other hand,
each process itself is not entirely distributable.
To speed up the process, I would like to submit a spark program for
Hi all,
So I have two dataframes, each with two columns: DATE and VALUE.
Performing this:
data = data.join(cur_data, data("DATE") === cur_data("DATE"), "outer")
returns
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference
'DATE' is ambiguous, could be: DATE#0, DATE#3.;
But
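For reference, the usual workaround is to alias each side so the join keys can be qualified, then keep a single DATE (coalesce covers the outer-join case where either side may be null). A sketch with hypothetical output column names:
import org.apache.spark.sql.functions.{coalesce, col}

val l = data.as("l")
val r = cur_data.as("r")
val joined = l.join(r, col("l.DATE") === col("r.DATE"), "outer")
  .select(coalesce(col("l.DATE"), col("r.DATE")).as("DATE"),
          col("l.VALUE").as("VALUE_LEFT"),
          col("r.VALUE").as("VALUE_RIGHT"))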
Hi all,
I am failing to find a proper guide or tutorial on how to write proper UDF
functions in Scala.
Appreciate the effort saving,
Saif
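For reference, a minimal sketch of a DataFrame UDF in Scala (the column names are hypothetical):
import org.apache.spark.sql.functions.{col, udf}

// a plain Scala function wrapped with udf(); Option makes it null-safe
val normalize = udf((s: String) => Option(s).map(_.trim.toLowerCase).orNull)
df.withColumn("name_norm", normalize(col("name")))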
Hi,
I am trying to figure out a way to compress the output of df.write.json(), but
have not been successful, even after changing spark.io.compression.
Any thoughts?
Saif
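For what it's worth, one workaround in 1.x is to go through the RDD API, where the codec can be passed explicitly. A sketch with a hypothetical output path:
import org.apache.hadoop.io.compress.GzipCodec

df.toJSON.saveAsTextFile("/data/out_json", classOf[GzipCodec])   // df.toJSON is an RDD[String] in Spark 1.x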
Thank you, so I was inversely confused. At first I thought MLlib was the
future but, based on what you say, MLlib will be the past. Interesting.
This means that if I move toward using the Pipelines system, I shouldn't
become obsolete.
Any more insights welcome,
Saif
-Original Message-
Hi all,
I am a little bit confused, having been introduced to Spark only recently. What is
going on with ML? Is it going to be deprecated? Or are all of its features
still valid and being built upon? It has a set of features and ML constructors which
I like to use, but I need to confirm whether the future of
Hi all,
What would be a good way to filter rows in a dataframe where any value in the
row is null? I wouldn't want to go through each column manually.
Thanks,
Saif
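For reference, DataFrameNaFunctions can do this without listing columns. A sketch:
val noNulls = df.na.drop()          // drops rows where any column is null
// df.na.drop(Seq("colA", "colB"))  // or restrict the check to specific (hypothetical) columns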
Hello, thank you for the response.
I found a blog where someone explains that it is not possible to join columns
from different data frames.
I was trying to modify one column's information, so I selected it and then
tried to replace the original dataframe column. I found another way,
Thanks
Saif
Hi,
Is there any way to store/load RDDs keeping their original object type instead of
strings?
I am having trouble with parquet (there is always some error at class
conversion), and I don't use Hadoop. Looking for alternatives.
Thanks in advance
Saif
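For reference, object files keep the original element type through Java serialization and only need a local or shared path. A sketch with a hypothetical element type and path, assuming an existing rdd: RDD[Record]:
case class Record(id: Long, value: Double)

rdd.saveAsObjectFile("/data/records_obj")
val restored = sc.objectFile[Record]("/data/records_obj")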
This simple command call:
val final_df = data.select("month_balance").withColumn("month_date", data.col("month_date_curr"))
Is throwing:
org.apache.spark.sql.AnalysisException: resolved attribute(s)
month_date_curr#324 missing from month_balance#234 in operator !Project
[month_balance#234,
I can reproduce this even more simply with the following:
val gf = sc.parallelize(Array(3,6,4,7,3,4,5,5,31,4,5,2)).toDF("ASD")
val ff = sc.parallelize(Array(4,6,2,3,5,1,4,6,23,6,4,7)).toDF("GFD")
gf.withColumn("DSA", ff.col("GFD"))
org.apache.spark.sql.AnalysisException: resolved attribute(s) GFD#421
Hi all,
I have SomeClass[TYPE] { def some_method(args: fixed_type_args): TYPE }
And at runtime, I create instances of this class with different AnyVal + String
types, but the return type of some_method varies.
I know I could do this with an implicit object, IF some_method received a type,
but
Hi all,
I am using pre-compiled Spark with Hadoop 2.6. The LZ4 codec is not in Hadoop's
native libraries, so I am not able to use it.
Can anyone suggest how to proceed? Hopefully I won't have to recompile
Hadoop. I tried changing --driver-library-path to point directly at the lz4
stand
Hi,
Is it possible in standalone mode to set the worker ID names, to avoid the
worker-19248891237482379-ip..-port ones?
Thanks,
Saif
Hi, thank you all for the assistance.
It is odd: it works when creating a new java.math.BigDecimal object, but not if
I work directly with
scala> 5 match { case x: java.math.BigDecimal => 2 }
<console>:23: error: scrutinee is incompatible with pattern type;
found : java.math.BigDecimal
required:
It is okay; this methodology works very well for mapping objects of my Seq[Any].
It is indeed very cool :-)
Saif
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, August 19, 2015 10:47 AM
To: Ellafi, Saif A.
Cc: sujitatgt...@gmail.com; wrbri...@gmail.com; user@spark.apache.org
Subject:
Hi, thank you for the further assistance.
You can reproduce this by simply running
5 match { case java.math.BigDecimal => 2 }
In my personal case, I am applying a map action to a Seq[Any], so the elements
inside are of type Any, to which I need to apply a proper
.asInstanceOf[WhoYouShouldBe].
Saif
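For reference, the typed-pattern form needs a binder for the class, after which the conversion is direct. A minimal sketch:
def toDouble(x: Any): Double = x match {
  case bd: java.math.BigDecimal => bd.doubleValue()   // bind the value, then convert
  case d: Double                => d
  case other                    => sys.error(s"unexpected value: $other")
}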
Hi, Can you please elaborate? I am confused :-)
Saif
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Tuesday, August 18, 2015 5:15 PM
To: Ellafi, Saif A.
Cc: wrbri...@gmail.com; user@spark.apache.org
Subject: Re: Scala: How to match a java object
On Tue,
Hi all,
I am trying to run a Spark job in which I receive java.math.BigDecimal
objects instead of the Scala equivalents, and I am trying to convert them into
Doubles.
If I try to match-case this object's class, I get:
error: object java.math.BigDecimal is not a value
How could I get around
Hi All,
I have a working program in which I create two big Tuple2s out of the data.
This seems to work in local mode, but when I switch over to standalone cluster
mode, I get this error at the very beginning:
15/08/14 10:22:25 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 10,
162.101.194.44):
Hello all,
I am writing a program which calls a database. I run a couple of
computations, but at the end I have a while loop in which I make a
modification to the persisted data, e.g.:
val data = PairRDD... persist()
var i = 0
while (i < 10) {
val data_mod = data.map(_._1 + 1, _._2)
Hi all,
I don't have any Hadoop FS installed in my environment, but I would like to
store dataframes in parquet files. I am failing to do so; if possible, does anyone
have any pointers?
Thank you,
Saif
I am launching my spark-shell
spark-1.4.1-bin-hadoop2.6/bin/spark-shell
15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF
scala> data.write.parquet("/var/data/Saif/pq")
I confirm that it works,
I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450
Saif
From: Ellafi, Saif A.
Sent: Tuesday, August 11, 2015 12:01 PM
To: Ellafi, Saif A.; deanwamp...@gmail.com
Cc: user@spark.apache.org
Subject: RE: Parquet without hadoop: Possible?
Sorry,
Sorry, I provided bad information. This example worked fine with reduced
parallelism.
It seems my problem has to do with something specific to the real data frame
at reading time.
Saif
From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com]
Sent: Tuesday, August 11, 2015
Hi all,
Silly question: does logging info messages, whether printed or to a file, or event
logging, have any impact on Spark's general performance?
Saif
Hi,
A silly question here. The Driver Web UI dies when the spark-submit program
finishes. I would like some time to analyze it after the program ends, as the page
does not refresh itself, and when I hit F5 I lose all the info.
Thanks,
Saif
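For reference, the usual way to keep the UI data after the application exits is the event log plus the history server. A sketch with a hypothetical log directory:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///var/spark-events")
// sbin/start-history-server.sh, pointed at the same directory, then serves the finished applications' UIs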
Hello, thank you, but that port is unreachable for me. Can you please share
where I can find that port's equivalent in my environment?
Thank you
Saif
From: François Pelletier [mailto:newslett...@francoispelletier.org]
Sent: Friday, August 07, 2015 4:38 PM
To: user@spark.apache.org
Subject: Re:
Hello,
When enabling speculation, my first job launches a partitioned JDBC
dataframe query, in which some partitions take longer than others to respond.
This triggers speculation, which launches duplicate attempts of the query on
other nodes. When one of those attempts finishes the query, the speculative one remains
Thank you both, I will take a look, but:
1. For high-shuffle tasks, is it right for the system to have the size
and thresholds set high? I hope there are no bad consequences.
2. I will try to get around needing admin access and see if I can get anywhere
with only user rights
From: Ted Yu
Hi all,
I am experimenting and learning about performance on big tasks locally, with a
32-core node and more than 64GB of RAM; data is loaded from a database through a
JDBC driver, and heavy computations are launched against it. I am presented with
two questions:
1. My RDD is poorly distributed. I
Thank you for your response Zhen,
I am using some vendor-specific JDBC driver JAR file (honestly, I don't know
where it came from). Its API is NOT like JdbcRDD; instead, it is more like jdbc from
DataFrameReader
Hi all,
I would like some insight. I am currently computing huge databases, and playing
with monitoring and tuning.
When monitoring the multiple cores I have, I see that even when RDDs are
parallelized, computation on the RDD jumps from core to core sporadically (I
guess depending on where
I tried with an RDD[DenseVector], but RDDs are not covariant (no +T), so an
RDD[DenseVector] is not an RDD[Vector], and I can't get to use the RDD input
method of correlation.
Thanks,
Saif
Hello, thank you for your time.
Seq[String] works perfectly fine. I also tried running a for loop through all
the elements to see if any access to a value was broken, but no, they are alright.
For now, I solved it by calling this. Sadly, it takes a lot of time, but it
works:
var data_sas =
Hi,
In a hundred-column dataframe, I wish to either select all of them except some,
or drop the ones I don't want.
I am failing at such a simple task; I tried two ways:
val clean_cols = df.columns.filterNot(col_name =>
col_name.startsWith("STATE_")).mkString(", ")
df.select(clean_cols)
But this throws
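For reference, the varargs form usually works here: keep the names as a sequence and splat them into select as Columns, instead of joining them into one comma-separated string. A sketch:
import org.apache.spark.sql.functions.col

val keep = df.columns.filterNot(_.startsWith("STATE_"))
df.select(keep.map(col): _*)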
Thank you, extending Serializable solved the issue. I am left with more
questions than answers though :-).
Regards,
Saif
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Monday, July 13, 2015 2:49 PM
To: Ellafi, Saif A.
Cc: user@spark.apache.org; Liu, Weicheng
Subject: Re:
Thank you very much for your time. Here is how I designed the case classes; as
far as I know, they apply properly.
P.S.: By the way, what do you mean by "the programming guide"?
abstract class Validator {
// positions to access with Row.getInt(x)
val shortsale_in_pos = 10
val
Hi,
For an experiment I am doing, I am trying to do the following:
1. Created an abstract class Validator, and case objects from Validator with a
validate(row: Row): Boolean method.
2. Added all the case objects to a list.
3. Each validate takes a Row into account and returns itself if validate
Hello all,
I have one big RDD, in which there is a column of groups A1, A2, B1, B2, B3,
C1, D1, ..., XY.
Out of it, I am using map() to transform into an RDD[LabeledPoint] with dense
vectors for later use in Logistic Regression, which takes an RDD[LabeledPoint].
I would like to run a logistic
Hello,
Reading from spark-csv, I got some lines with missing data (not invalid).
I am applying map() to create a LabeledPoint with a dense vector, using
map( row => row.getDouble(col_index) ).
To this point:
res173: org.apache.spark.mllib.regression.LabeledPoint =
Hello all,
I am learning Scala and Spark and going through some applications with data I have.
Please allow me to ask a couple of questions:
spark-csv: The data I have isn't malformed, but there are empty values in some
rows, properly comma-separated and not caught by DROPMALFORMED mode.
These