Guys,
Do you have any thoughts on this?
Thanks, Robert
On Sunday, April 12, 2015 5:35 PM, Grandl Robert
rgra...@yahoo.com.INVALID wrote:
Hi guys,
I was trying to figure out some counters in Spark related to the amount of CPU
or memory used (in some metric) by a
As far as I know, createStream doesn't let you specify where receivers are
run.
createDirectStream in 1.3 doesn't use long-running receivers, so it is
likely to give you a more even distribution of consumers across your workers.
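For reference, a minimal sketch of the 1.3 direct API (the broker list and
topic name here are hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// assumes an existing StreamingContext `ssc`; no long-running receivers are used
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))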
On Mon, Apr 13, 2015 at 11:31 AM, Laeeq Ahmed
Hey Jonathan,
Are you referring to disk space used for storing persisted RDD's? For
that, Spark does not bound the amount of data persisted to disk. It's
a similar story to how Spark's shuffle disk output works (Hadoop and other
frameworks make the same assumption for their
shuffle
Correct. Prediction doesn't touch that code path. -Xiangrui
On Mon, Apr 13, 2015 at 9:58 AM, Jianguo Li flyingfromch...@gmail.com
wrote:
Hi,
In the GeneralizedLinearAlgorithm, which Logistic Regression relies on, it
says if userFeatureScaling is enabled, we will standardize the training
Nothing so complicated... we are seeing mesos kill off our executors
immediately. When I reroute logging to an NFS directory we have available,
the executors survive fine. As such I am wondering if the spark workers are
getting killed by mesos for exceeding their disk quota (which atm is 0).
This
Hi guys
Does anyone know how to stop Spark from opening all Parquet files before
starting a job? This is quite a show stopper for me, since I have 5000 Parquet
files on S3.
Recap of what I tried:
1. Disable schema merging with: sqlContext.load("parquet", Map("mergeSchema" ->
"false", "path" ->
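For reference, a minimal sketch of that 1.3-era call with the option spelled
out (the S3 path is hypothetical):

val df = sqlContext.load("parquet", Map(
  "mergeSchema" -> "false",
  "path" -> "s3n://my-bucket/parquet-dir"))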
Tom,
According to GitHub's public activity log, Reynold Xin (in CC) deleted
his sort-benchmark branch yesterday. I didn't have a local copy aside
from the Daytona Partitioner (attached).
Reynold, is it possible to reinstate your branch?
-Ewan
On 13/04/15 16:41, Tom Hubregtsen wrote:
Thank
That's why I think it's the OOM killer. There are several cases of
memory overuse / errors:
1 - The application tries to allocate more than the heap limit and GC
cannot free more memory => OutOfMemoryError: Java heap space exception from the JVM
2 - The JVM is configured with a max heap size larger
I'm not 100% sure of Spark's implementation, but in the MR frameworks it
would have a much larger shuffle write size, because that node is dealing
with a lot more data and as a result has a lot more to shuffle
2015-04-13 13:20 GMT-04:00 java8964 java8...@hotmail.com:
If it is really due to data
I think I found where the problem comes from.
I am writing lzo compressed thrift records using elephant-bird, my guess is
that perhaps one side is computing the checksum based on the uncompressed
data and the other on the compressed data, thus getting a mismatch.
When writing the data as strings
The problem is likely that the underlying avro library is reusing objects
for speed. You probably need to explicitly copy the values out of the
reused record before the collect.
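A minimal sketch of the idea, assuming the RDD was read via newAPIHadoopFile
as (AvroKey[GenericRecord], NullWritable) pairs; the field names are
hypothetical:

val safe = raw.map { case (key, _) =>
  val r = key.datum()
  // copy the interesting fields out of the reused record before collecting
  (r.get("id").toString, r.get("value").toString)
}
safe.collect()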
On Sat, Apr 11, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
The read seems to be successful, as the
That appears to work, with a few changes to get the types correct:
input.distinct().combineByKey((s: String) => 1, (agg: Int, s: String) =>
agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2)
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
How about this?
input.distinct().combineByKey((v: V) => 1, (agg: Int, x: Int) => agg + 1,
(agg1: Int, agg2: Int) => agg1 + agg2).collect()
On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler deanwamp...@gmail.com
wrote:
The problem with using collect is that it will fail for large data sets,
as
I want to use Rack locality feature of Apache Spark in my application.
Is YARN the only resource manager which supports it as of now?
After going through the source code, I found that the default implementation of
the getRackForHost() method returns None in TaskSchedulerImpl, which (I suppose)
would be
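As far as I can tell, YARN is the only built-in resource manager that
overrides it. A minimal sketch of the same trick with a hypothetical
topology resolver (note TaskSchedulerImpl is private[spark], so this would
have to live under an org.apache.spark package):

class RackAwareScheduler(sc: SparkContext) extends TaskSchedulerImpl(sc) {
  // TaskSchedulerImpl.getRackForHost returns None by default
  override def getRackForHost(host: String): Option[String] =
    Option(myTopologyResolver(host)) // hypothetical host -> rack lookup
}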
Here is the stack trace. The first part shows the log when the session is
started in Tableau. It is using the init sql option on the data
connection to create the TEMPORARY table myNodeTable.
Ah, I see. Thanks for providing the error. The problem here is that
temporary tables do not exist in
Note that I am running pyspark in local mode (I do not have a hadoop cluster
connected) as I want to be able to work with the avro file outside of
hadoop.
Hi Riya,
As far as I know, that is correct, unless Mesos fine-grained mode handles
this in some mysterious way.
-Sandy
On Mon, Apr 13, 2015 at 2:09 PM, rcharaya riya.char...@gmail.com wrote:
I want to use Rack locality feature of Apache Spark in my application.
Is YARN the only resource
On the worker side, it all happens in Executor. The task result is
computed here:
https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L210
then it's serialized along with some other goodies, and finally sent
Those funny class names come from Scala's specialization -- it's compiling a
different version of OpenHashMap for each primitive you stick in the type
parameter. Here's a super simple example:
➜ ~ more Foo.scala
class Foo[@specialized X]
➜ ~ scalac Foo.scala
➜ ~ ls
I tried to start the Spark Worker using the registered IP but this error
occurred:
15/04/13 21:35:59 INFO Worker: Registered signal handlers for [TERM, HUP,
INT]
Exception in thread "main" java.net.UnknownHostException: 10.240.92.75/:
Name or service not known
at
Hi,
I am trying to register classes with KryoSerializer. This has worked with
other programs. Usually the error messages are helpful in indicating which
classes need to be registered. But with my current program, I get the
following cryptic error message:
Caused by:
yes, it sounds like a good use of an accumulator to me
val counts = sc.accumulator(0L)
rdd.map { x =>
  counts += 1
  x
}.saveAsObjectFile(file2)
On Mon, Mar 30, 2015 at 12:08 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Sean
Yes I know that I can use persist() to persist
Hi Xiangrui,
Here is the class:
object ALSNew {
def main (args: Array[String]) {
val conf = new SparkConf()
  .setAppName("TrainingDataPurchase")
  .set("spark.executor.memory", "4g")
conf.set("spark.shuffle.memoryFraction", "0.65") // default is 0.2
Hi All,
I am having trouble building a fat jar file through sbt-assembly.
[warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename'
[warn] Merging 'META-INF/NOTICE' with strategy 'rename'
[warn] Merging 'META-INF/LICENSE.txt' with strategy 'rename'
[warn] Merging 'META-INF/LICENSE' with
Thanks Vadim, I can certainly consume data from a Kinesis stream when
running locally. I'm currently in the process of extending my work to a
proper cluster (i.e. using a spark-submit job via uber jar). Feel free to
add me to gmail chat and maybe we can help each other.
On Mon, Apr 13, 2015 at
I don't believe the Kinesis ASL dependency should be marked provided. I used
mergeStrategy successfully to produce an uber jar.
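For what it's worth, a minimal sketch of the kind of mergeStrategy block I
mean, for build.sbt with sbt-assembly (exact setting keys depend on your
plugin version):

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "NOTICE.txt")  => MergeStrategy.rename
  case PathList("META-INF", "LICENSE.txt") => MergeStrategy.rename
  case PathList("META-INF", xs @ _*)       => MergeStrategy.discard
  case _                                   => MergeStrategy.first
}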
Fyi, I've been having trouble consuming data out of Kinesis with Spark with no
success :(
Would be curious to know if you got it working.
Vadim
On Apr 13, 2015, at 9:36 PM, Mike
Hi all,
Does anyone know how to access PostgreSQL from Spark SQL? Do I need to add the
PostgreSQL dependency in build.sbt and set the classpath for it?
Thanks.
Regards, Yi
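Not an authoritative answer, but a minimal sketch of the 1.3-era JDBC data
source (the PostgreSQL driver jar does have to be on the classpath, e.g. a
build.sbt dependency or --jars; the URL and table name are hypothetical):

val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
  "dbtable" -> "my_table"))
df.show()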
Thanks Mike. I was having trouble on EC2.
On Apr 13, 2015, at 10:25 PM, Mike Trienis mike.trie...@orcsol.com wrote:
Thanks Vadim, I can certainly consume data from a Kinesis stream when running
locally. I'm currently in the process of extending my work to a proper
cluster (i.e. using a
Hello,
Many thanks for your reply. From the code I found that the reason my
program scans all the edges is that the EdgeDirection I passed
in is EdgeDirection.Either.
However, I still have the problem that the time taken by each iteration does
not decrease over time. Thus I have two
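For context, a minimal sketch of where that parameter enters the Pregel API
(assuming a Graph[Double, Double]; the max-propagation vertex program is just
a placeholder):

import org.apache.spark.graphx._

val result = Pregel(graph, initialMsg = 0.0,
    activeDirection = EdgeDirection.Out)(
  (id, attr, msg) => math.max(attr, msg),                // vertex program
  triplet => Iterator((triplet.dstId, triplet.srcAttr)), // send messages
  (a, b) => math.max(a, b))                              // merge messages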
Hi guys,
I want to add my custom Rules(whatever the rule is) when the sql Logical
Plan is being analysed.
Is there a way to do that in the spark application code?
Thanks
Broadcast variables count towards spark.storage.memoryFraction, so they
use the same pool of memory as cached RDDs.
That being said, I'm really not sure why you are running into problems, it
seems like you have plenty of memory available. Most likely it's got
nothing to do with broadcast
Hi Linlin,
have you found a solution to this issue? If yes, what needed to be corrected?
I am also getting the same error when submitting a
spark job in cluster mode; the error is as under -
2015-04-14 18:16:43 DEBUG Transaction - Transaction rolled back in 0 ms
2015-04-14
Hello,
I am experimenting with DataFrame. I tried to construct two DataFrames with:
1. case class A(a: Int, b: String)
scala> adf.printSchema()
root
|-- a: integer (nullable = false)
|-- b: string (nullable = true)
2. case class B(a: String, c: Int)
scala> bdf.printSchema()
root
|-- a: string
Hi,
It's a syntax error in Spark 1.3.
The next release of Spark supports this kind of UDF call on DataFrames.
See a link below.
https://issues.apache.org/jira/browse/SPARK-6379
On Sat, Apr 11, 2015 at 3:30 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi, I'm running into some trouble
Hi Zork,
From the exception, it is still caused by hdp.version not being propagated
correctly. Can you check whether there is any typo?
[root@c6402 conf]# more java-opts -Dhdp.version=2.2.0.0–2041
[root@c6402 conf]# more spark-defaults.conf
spark.driver.extraJavaOptions
If it is really due to data skew, would the hanging task have a much bigger
Shuffle Write Size in this case?
In this case, the shuffle write size for that task is 0, and the rest of the IO
of this task is not much larger than that of the fast-finished tasks. Is that normal?
I am also interested in this case, as
The problem with using collect is that it will fail for large data sets, as
you'll attempt to copy the entire RDD to the memory of your driver program.
The following works (Scala syntax, but similar to Python):
scala> val i1 = input.distinct.groupByKey
scala> i1.foreach(println)
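If you only need the counts (the [(1,3),(2,2),(3,1)] shape from the original
question), a sketch that avoids shipping whole groups anywhere:

scala> val counts = input.distinct().mapValues(_ => 1).reduceByKey(_ + _)
scala> counts.collect() // e.g. Array((1,3), (2,2), (3,1))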
Any idea what this means? Many thanks.
==
logs/spark-.-org.apache.spark.deploy.worker.Worker-1-09.out.1
==
15/04/13 07:07:22 INFO Worker: Starting Spark worker 09:39910 with 4
cores, 6.6 GB RAM
15/04/13 07:07:22 INFO Worker: Running Spark version 1.3.0
15/04/13 07:07:22 INFO
Hi all,
Manning (the publisher) is looking for a co-author for the GraphX in Action
book. The book currently has one author (Michael Malak), but they are
looking for a co-author to work closely with Michael and improve the
writing and make it more consumable.
Early access page for the book:
Hello,
Thank you for your answer.
I'm already registering my classes as you're suggesting...
Regards
From: tsingfu [via Apache Spark User List]
[mailto:ml-node+s1001560n22468...@n3.nabble.com]
Sent: Monday, April 13, 2015 03:48
To: Mehdi Singer
Subject: Re: Kryo exception : Encountered
You need to do a few more things or you will eventually run into these issues:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.appDataTtl=<seconds>"
On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Does anybody have an answer for
Very likely to be this :
http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2
Your worker ran out of memory => maybe you're asking for too much memory
for the JVM, or something else is running on the worker
Guillaume
Any idea what this means, many thanks
==
How about using mapToPair and exchanging the two? Will it be efficient?
Below is the code; will it be efficient to convert like this?
JavaPairRDD<Long, MatcherReleventData> RddForMarch =
  matchRdd.zipWithIndex().mapToPair(new
    PairFunction<Tuple2<VendorRecord, Long>, Long, MatcherReleventData>() {
Does it also clean up the spark local dirs? I thought it was only cleaning
$SPARK_HOME/work/
Guillaume
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.appDataTtl=<seconds>"
On 11.04.2015, at
Hi,
When I submit a spark job with --master yarn-cluster and the below
command/options, I get a driver
memory error -
spark-submit --jars
./libs/mysql-connector-java-5.1.17.jar,./libs/log4j-1.2.17.jar --files
datasource.properties,log4j.properties --master yarn-cluster --num-executors
1
Thank you Peter.
I just want to be sure.
even if I use the classification setting, does the GBT use regression trees
and not classification trees?
I know the difference between the two (theoretically) is only in the loss
and impurity functions.
Thus, in case it uses classification trees, doing what you
Hi Zhan,
Alas, setting:
-Dhdp.version=2.2.0.0–2041
does not help. I still get the same error:
15/04/13 09:53:59 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1428918838408
Hi, I want to play with the Criteo 1 TB dataset. The files are located on Azure
storage. Here's a command to download them:
curl -O
http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_{`seq
-s ',' 0 23`}.gz
is there any way to read files through http protocol with spark without
My app works fine with the single, uber jar file containing my app and
all its dependencies. However, it takes about 20 minutes to copy the 65MB
jar file up to the node on the cluster, so my code, compile, test cycle
has become a code, compile, cooppp, test cycle.
I'd like to have a
Thanks Yijie! Also cc the user list.
Cheng
On 4/13/15 9:19 AM, Yijie Shen wrote:
I opened a new Parquet JIRA ticket here:
https://issues.apache.org/jira/browse/PARQUET-251
Yijie
On April 12, 2015 at 11:48:57 PM, Cheng Lian (lian.cs@gmail.com) wrote:
Sometimes a large number of partitions leads to memory problems.
Something like
val rdd1 = sc.textFile(file1).coalesce(500). ...
val rdd2 = sc.textFile(file2).coalesce(500). ...
may help.
On Mon, Mar 2, 2015 at 6:26 PM, Arun Luthra arun.lut...@gmail.com wrote:
Everything works smoothly if I
I faced the exact same issue. The way I solved it was:
1. Copy the entire project.
2. Delete all the sources, keeping only the dependencies in pom.xml. This will
create a fat jar, with no sources but the deps only.
3. Keep the original project as is, and build it. This will create a JAR
(no deps, by default).
Now
Refer this post
http://blog.prabeeshk.com/blog/2015/04/07/self-contained-pyspark-application/
On 13 April 2015 at 17:41, Punya Biswal pbis...@palantir.com wrote:
Dear Spark users,
My team is working on a small library that builds on PySpark and is
organized like PySpark as well -- it has a
Thank you for your response Ewan. I quickly looked yesterday and it was
there, but today at work I tried to open it again to start working on it,
but it appears to be removed. Is this correct?
Thanks,
Tom
On 12 April 2015 at 06:58, Ewan Higgs ewan.hi...@ugent.be wrote:
Hi all.
The code is
Dear Spark users,
My team is working on a small library that builds on PySpark and is organized
like PySpark as well -- it has a JVM component (that runs in the Spark driver
and executor) and a Python component (that runs in the PySpark driver and
executor processes). What's a good approach
Hi, I imported a table from MS SQL Server with Sqoop 1.4.5 in Parquet format.
But when I try to load it from the Spark shell, it throws an error like:
scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown
during a parallel
Try this
./bin/spark-submit -v --master yarn-cluster --jars
./libs/mysql-connector-java-5.1.17.jar,./libs/log4j-1.2.17.jar --files
datasource.properties,log4j.properties --num-executors 1 --driver-memory
4g --driver-java-options -XX:MaxPermSize=1G --executor-memory 2g
--executor-cores 1
**Learning the ropes**
I'm trying to grasp the concept of using the pipeline in pySpark...
Simplified example:
list = [(1,"alpha"),(1,"beta"),(1,"foo"),(1,"alpha"),(2,"alpha"),(2,"alpha"),(2,"bar"),(3,"foo")]
Desired outcome:
[(1,3),(2,2),(3,1)]
Basically for each key, I want the number of unique values.
I've
Hi,
I am not sure my problem is related to Spark, but perhaps someone else has had
the same error. When I try to write files that need multipart upload to S3
from a job on EMR I always get this error:
com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you
specified did not match
You mean there is a tuple in either RDD that has itemId = 0 or null?
And what is a catch-all?
Does that imply it is a good idea to run a filter on each RDD first? We do
not do this using Pig on M/R. Is it required in the Spark world?
On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney
I can promise you that this is also a problem in the Pig world :) Not sure
why it's not a problem for this data set, though... are you sure that the
two are running the exact same code?
You should inspect your source data. Make a histogram for each and see what
the data distribution looks like. If
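A minimal sketch of that histogram check, reusing the RDD names from the
code quoted elsewhere in the thread (treat them as assumptions):

val keyHist = lstgItem.mapValues(_ => 1L).reduceByKey(_ + _)
keyHist.map(_.swap).top(20).foreach(println) // 20 hottest join keys, count first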
Hi,
In the GeneralizedLinearAlgorithm, which Logistic Regression relies on, it
says if userFeatureScaling is enabled, we will standardize the training
features, and train the model in the scaled space. Then we transform
the coefficients from the scaled space to the original space
My
Linux OOM throws SIGTERM, but if I remember correctly JVM handles heap
memory limits differently and throws OutOfMemoryError and eventually sends
SIGINT.
Not sure what happened but the worker simply received a SIGTERM signal, so
perhaps the daemon was terminated by someone or a parent process.
Hello,
I'm trying to use a Spark Streaming (1.2.0-cdh5.3.2) consumer
(via spark-streaming-kafka lib of the same version) with Kafka's Confluent
Platform 1.0.
I managed to make a Producer that produces my data and can check it via the
avro-console-consumer:
./bin/kafka-avro-console-consumer
Hi all, I have a JavaPairRDD<Long,String> where each long key has 4
string values associated with it. I want to fire an HBase query to look
up each String part of the RDD.
This lookup will give a result of around 7K integers, so for each key I will
have 7K values. Now my input RDD always
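One common pattern for this kind of per-key external lookup is to open one
HBase connection per partition rather than per record; a minimal sketch,
assuming the HBase 1.0 client API and a hypothetical table name:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val looked = rdd.mapPartitions { iter =>
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("lookupTable"))
  val out = iter.map { case (id, value) =>
    val result = table.get(new Get(Bytes.toBytes(value)))
    (id, result.value()) // first cell's bytes; parse the ~7K ints from here
  }.toList // materialize before closing the connection
  table.close(); conn.close()
  out.iterator
}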
My guess would be data skew. Do you know if there is some item id that is a
catch-all? Can it be null? Item id 0? Lots of data sets have this sort of
value, and it always kills joins.
2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
Code:
val viEventsWithListings: RDD[(Long,
Whether I use 1 or 2 machines, the results are the same... Here are the
results I got using 1 and 2 receivers with 2 machines:
2 machines, 1 receiver:
sbt/sbt "run-main Benchmark 1 machine1 1000" 2>&1 | grep -i "Total
delay\|record"
15/04/13 16:41:34 INFO JobScheduler: Total delay: 0.156 s
Code:
val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))]
= lstgItem.join(viEvents).map {
case (itemId, (listing, viDetail)) =>
val viSummary = new VISummary
viSummary.leafCategoryId = listing.getLeafCategId().toInt
viSummary.itemSiteId =
I'm surprised that I haven't been able to find this via Google, but I
haven't...
What is the setting that requests some amount of disk space for the
executors? Maybe I'm misunderstanding how this is configured...
Thanks for any help!
I need to have my own scheduler to point to a proprietary remote execution
framework.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2152
I'm looking at where it decides on the backend and it doesn't look like
there is a hook. Of course I can