In general, I don't think that means you should exclude something;
it's still needed.
The problem is that commons-config depends *only* on *beanutils-core
1.8.0*, so it ends up managing that artifact's version only, and not
the main beanutils one.
In this particular instance, which I've seen
Hello everyone,
I am a Spark novice facing a nontrivial problem to solve with Spark.
I have an RDD consisting of many elements (say, 60K), where each element
is a d-dimensional vector.
I want to implement an iterative algorithm which does the following. At each
iteration, I want to apply an
By default Spark 1.3 has bindings to Hive 0.13.1, though you can bind it to
Hive 0.12 if you specify it in the profile when building Spark, as per
https://spark.apache.org/docs/1.3.0/building-spark.html.
If you are downloading a pre-built version of Spark 1.3, then by default
it is set to Hive
Hi,
I have some code that creates ~ 80 RDD and then a sc.union is applied to
combine all 80 into one for the next step (to run topByKey for example)...
While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3
hrs (I am validating these numbers)...
Is there any checkpoint
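For what it's worth, a minimal sketch of combining many RDDs with a single union call; loadPart and the element type are stand-ins for your own per-RDD creation code:

import org.apache.spark.rdd.RDD

val parts: Seq[RDD[(String, Double)]] = (1 to 80).map(i => loadPart(i))
val combined = sc.union(parts)          // one UnionRDD over all 80 inputs
val compacted = combined.coalesce(200)  // optionally reduce the partition count before the next step

Calling sc.union once over the whole sequence avoids chaining 80 pairwise unions, which would otherwise build a very deep lineage.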
Ok, what do i need to do in order to migrate the patch?
Thanks
Alex
On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma scrapco...@gmail.com
wrote:
This is the jira I referred to
https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not
working on it is evaluating priority
Sorry, I was getting those errors because my workload was not sustainable.
However, I noticed that, by just running the spark-streaming-benchmark (
https://github.com/tdas/spark-streaming-benchmark/blob/master/Benchmark.scala
), I get no difference on the execution time, number of processed
Is there a custom class involved in your application?
I assume you have called sparkConf.registerKryoClasses() for such class(es).
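For example, a minimal sketch; MyRecord stands in for whatever class your job actually serializes:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))  // register every custom class Kryo will serialize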
Cheers
On Thu, Apr 9, 2015 at 7:15 AM, mehdisinger mehdi.sin...@lampiris.be
wrote:
Hi,
I'm facing an issue when I try to run my Spark application. I keep getting
Hi,
Thanks a lot for such a detailed response.
On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel guillaume.pi...@exensa.com
wrote:
Hi Muhammad,
There are lots of ways to do it. My company actually develops a text
mining solution which embeds a very fast Approximate Neighbours solution (a
Typo in previous email, pardon me.
Set spark.driver.maxResultSize to 1068 or higher.
On Thu, Apr 9, 2015 at 8:57 AM, Ted Yu yuzhih...@gmail.com wrote:
Please set spark.kryoserializer.buffer.max.mb to 1068 (or higher).
Cheers
On Thu, Apr 9, 2015 at 8:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
Well, maybe a Linux configuration problem...
I have a cluster that is about to be exposed to the public, and I want everyone
who uses my cluster to own a user account (without sudo permissions, etc.) (e.g.
'guest') and be able to submit tasks to Spark, which is working on Mesos that is
running with a different,
Can I see the current values of all configs from the UI, similar to the
configuration page in the Hadoop world?
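One way to inspect them, as a rough sketch assuming a live SparkContext (the Environment tab of the web UI lists the same Spark properties):

// prints every explicitly set Spark property as key=value
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }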
Sent from my iPhone
On 09-Apr-2015, at 11:07 pm, Marcelo Vanzin van...@cloudera.com wrote:
Set spark.yarn.maxAppAttempts=1 if you don't want retries.
On Thu, Apr 9, 2015 at 10:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)
Hi,
I am working in local mode.
The following code
hiveContext.setConf("hive.metastore.warehouse.dir",
  "/home/spark/hive/warehouse")
hiveContext.sql("create database if not exists db1")
throws
15/04/09 13:53:16 ERROR RetryingHMSHandler: MetaException(message:Unable to
create database path
Thanks Sandy, appreciate it.
On Thu, Apr 9, 2015 at 10:32 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Deepak,
I'm going to shamelessly plug my blog post on tuning Spark:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
It talks about tuning executor size
Set spark.yarn.maxAppAttempts=1 if you don't want retries.
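A minimal sketch of setting it in code; note that the value has to be in place when the application is submitted to YARN, so it is usually passed to spark-submit as --conf spark.yarn.maxAppAttempts=1 instead:

// disable YARN-level retries for this application
val conf = new SparkConf().set("spark.yarn.maxAppAttempts", "1")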
On Thu, Apr 9, 2015 at 10:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Hello,
I have a spark job with 5 stages. After it runs the 3rd stage, the console shows
15/04/09 10:25:57 INFO yarn.Client: Application report for
Maybe I'm wrong, but what you are doing here is basically a bunch of
cartesian products, one for each key. So if 'hello' appears 100 times in your
corpus, it will produce 100*100 elements in the join output.
I don't understand what you're doing here, but it's normal that your join
takes forever; it makes
I would try something like that :
val a = rdd.sample(false, 0.1, 1).zipWithIndex.map { case (vector, index) =>
  (index, vector) }
val b = rdd.sample(false, 0.1, 2).zipWithIndex.map { case (vector, index) =>
  (index, vector) }
a.join(b).map { case (_, (vectora, vectorb)) => yourOperation }
Grouping by blocks is
Hi,
I have been trying to use Spark in an OSGi bundle but have had no luck so far.
I have seen similar mails in the past, so I am wondering: has anyone
successfully run Spark inside an OSGi bundle?
I am running Spark in a bundle created with the Maven shade plugin and even tried
adding the Akka JARs
commons-beanutils is brought in transitively:
[INFO] | +- org.apache.hadoop:hadoop-common:jar:2.4.0:compile
[INFO] | | +- commons-cli:commons-cli:jar:1.2:compile
[INFO] | | +- xmlenc:xmlenc:jar:0.52:compile
[INFO] | | +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] | | +-
Hello,
How do I override log4j.properties for a specific Spark job?
BR,
Patcharee
Hello guys,
I am trying to run the following dummy example for Spark,
on a dataset of 250MB, using 5 machines with 10GB RAM
each, but the join seems to be taking too long (> 2 hrs).
I am using Spark 0.8.0 but I have also tried the same example
on more recent versions, with the same results.
Do
Hi,
I'm running a spark streaming job in local mode (--master local[4]), and
I'm seeing tons of these messages, roughly once every second -
WARN BlockManager: Block input-0-1428527584600 replicated to only 0 peer(s)
instead of 1 peers
We're using spark 1.2.1. Even with TRACE logging enabled,
Yes, I had tried that.
Now I see this:
15/04/09 07:58:08 INFO scheduler.DAGScheduler: Job 0 failed: collect at
VISummaryDataProvider.scala:38, took 275.334991 s
15/04/09 07:58:08 ERROR yarn.ApplicationMaster: User class threw exception:
Job aborted due to stage failure: Total size of serialized
Hi Friends,
I am trying to solve a use case in Spark Streaming; I need help on getting to the
right approach for looking up / updating the master data.
Use case (simplified):
I've a dataset of entities with three attributes and an identifier/row key in a
persistent store.
Each attribute along with row key
Generally, you can ignore these things. They mean some artifacts
packaged other artifacts, and so two copies show up when all the JAR
contents are merged.
But here you do show a small dependency convergence problem; beanutils
1.7 is present but beanutils-core 1.8 is too, even though these should
I changed the JDK to Oracle but I still get this error. Not sure what it
means by Stream class is incompatible with local class. I am using the
following build on the server spark-1.2.1-bin-hadoop2.4
15/04/09 15:26:24 ERROR JobScheduler: Error running job streaming job
1428607584000 ms.0
In aggregateMessagesWithActiveSet, Spark still has to read all edges. It
means that a fixed cost which scales with graph size is unavoidable in a
pregel-like iteration.
But what if I have to run nearly 100 iterations and in the last 50
iterations only 0.1% of the nodes need to be updated
Though the warnings can be ignored, they add up in the log files while
compiling other projects too. And there are a lot of those warnings. Any
workaround? How do we modify the pom.xml file to exclude these unnecessary
dependencies?
On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen so...@cloudera.com
In Spark 1.3+, PySpark also supports this kind of narrow dependency,
for example,
N = 10
a1 = a.partitionBy(N)
b1 = b.partitionBy(N)
then a1.union(b1) will only have N partitions.
So, a1.join(b1) does not need a shuffle anymore.
On Thu, Apr 9, 2015 at 11:57 AM, pop xia...@adobe.com wrote:
In
One method: By putting your custom log4j.properties file in your /resources
directory.
As an example, please see: http://stackoverflow.com/a/2736/236007
Kind regards,
Emre Sevinç
http://www.bigindustries.be/
On Thu, Apr 9, 2015 at 2:17 PM, patcharee patcharee.thong...@uni.no wrote:
Hi,
I just checked and I can see that there is a method called withColumn:
def withColumn(colName: String, col: Column): DataFrame
(Column: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html,
DataFrame: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html)
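A small usage sketch, assuming a DataFrame df with a numeric column named "value":

// DataFrames are immutable, so withColumn returns a new DataFrame with the extra column
val withDoubled = df.withColumn("doubled", df("value") * 2)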
Hi,
How do I get a Spark job progress-style report on the console?
I tried to set --conf spark.ui.showConsoleProgress=true but it
thanks
Can you create the database directly within Hive? If you're getting the
same error within Hive, it sounds like a permissions issue as per Bojan.
More info can be found at:
http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error
On Thu, Apr 9,
Responses inline. Hope they help.
On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani aassud...@impetus.com wrote:
Hi Friends,
I am trying to solve a use case in spark streaming, I need help on
getting to right approach on lookup / update the master data.
Use case ( simplified )
I’ve a
Hi Mohammed,
Sorry, I guess I was not really clear in my response. Yes, sbt fails; the
-DskipTests is for mvn, as I showed in the example of how I built it.
I do not believe that -DskipTests has any impact in sbt, but I could be
wrong. sbt package should skip tests. I did not try to track
Hi Deepak,
I'm going to shamelessly plug my blog post on tuning Spark:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
It talks about tuning executor size as well as how the number of tasks for
a stage is calculated.
-Sandy
On Thu, Apr 9, 2015 at 9:21 AM,
Pressed send early.
I had tried that with these settings
buffersize=128 maxbuffersize=1024
val conf = new SparkConf()
  .setAppName("detail")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
My Spark (1.3.0) job is failing with
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 1+details
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available:
0, required: 1
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at
I think it uses a local dir; an HDFS dir path starts with hdfs://
Check permissions on the folders, and also check the logs. There should be more
info about the exception.
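For instance, a sketch of pointing the warehouse at HDFS explicitly; the path below is hypothetical and the user running the job needs write permission to it:

hiveContext.setConf("hive.metastore.warehouse.dir",
  "hdfs:///user/spark/warehouse")  // hypothetical HDFS path
hiveContext.sql("create database if not exists db1")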
Best
Bojan
JavaRDD<String> lineswithoutStopWords = nonEmptylines
  .map(new Function<String, String>() {
    /**
     *
     */
    private static final long
Your point #1 is a bit misleading.
(1) The mappers are not executed in parallel when processing
independently the same RDD.
To clarify, I'd say: In one stage of execution, when pipelining occurs,
mappers are not executed in parallel when processing independently the same
RDD partition.
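A small sketch of what that pipelining looks like; rdd and the functions are illustrative:

// two narrow transformations form one stage: within each partition, a single task
// applies the doubling and then toString element by element, rather than running
// the two map functions as separate parallel passes over that partition
val doubled = rdd.map(x => x * 2)
val rendered = doubled.map(x => x.toString)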
On Thu,
Hi-
Was this the JIRA issue? https://issues.apache.org/jira/browse/SPARK-2988
Any help in getting this working would be much appreciated!
Thanks
Alex
On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma scrapco...@gmail.com
wrote:
You are right this needs to be done. I can work on it soon, I was
I have a spark job that has multiple stages. For now I start it with 100
executors, each with 12G mem (max is 16G). I am using Spark 1.3 over YARN
2.4.x.
For now i start the Spark Job with a very limited input (1 file of size
2G), overall there are 200 files. My first run is yet to complete as its
Thanks a lot TD for the detailed answers. The answers lead to a few more questions.
1. "the transform RDD-to-RDD function runs on the driver" - I didn't
understand this. Does it mean that when I use the transform function on a DStream,
it is not parallelized? Surely I'm missing something here.
2.
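Regarding question 1, my understanding as a hedged sketch: the function passed to transform is evaluated on the driver once per batch to build the RDD lineage, while the RDD operations inside it still run in parallel on the executors. Assuming lines is a DStream[String]:

val cleaned = lines.transform { rdd =>
  // this block runs on the driver at every batch interval to set up the computation;
  // the map and filter below are ordinary RDD operations executed on the executors
  rdd.map(_.trim).filter(_.nonEmpty)
}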
Most likely you have an existing Hive installation with data in it. In this
case I was not able to get Spark 1.3 to communicate with the existing Hive meta
store. Hence when I read any table created in Hive, Spark SQL used to
complain "Data table not found".
If you get it working, please share the steps.
Are you running # of receivers = # machines?
TD
On Thu, Apr 9, 2015 at 9:56 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
Sorry, I was getting those errors because my workload was not sustainable.
However, I noticed that, by just running the spark-streaming-benchmark (
I agree, but as I say, most are out of the control of Spark. They
aren't because of unnecessary dependencies.
On Thu, Apr 9, 2015 at 5:14 PM, Ritesh Kumar Singh
riteshoneinamill...@gmail.com wrote:
Though the warnings can be ignored, they add up in the log files while
compiling other projects
I found this jira https://jira.codehaus.org/browse/MSHADE-128 when
googling for fixes. Wonder if it can fix anything here.
But anyways, thanks for the help :)
On Fri, Apr 10, 2015 at 2:46 AM, Sean Owen so...@cloudera.com wrote:
I agree, but as I say, most are out of the control of Spark. They
Please take a look at
https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236
, starting line 27.
In Spark, you can control the maxBufferSize
with spark.kryoserializer.buffer.max.mb
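For example, a minimal sketch; the 1068 figure comes from this thread, so size the max to the largest object you expect to serialize:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "128")      // initial buffer size, in MB
  .set("spark.kryoserializer.buffer.max.mb", "1068") // ceiling the buffer may grow to, in MB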
Cheers
Hi,
I have the following scenario and need some help ASAP.
1. Ad hoc queries on Spark Streaming.
How can I run Spark queries on an ongoing streaming context?
Scenario: a streaming job is running to find the min and max value in the last
5 min (which I am able to do).
Now I want to run an interactive query to
If your data has special characteristics, like one side small and the other large,
then you can think of doing a map-side join in Spark using broadcast values;
this will speed things up.
Otherwise, as Pitel mentioned, if there is nothing special and it's just a
cartesian product, it might take forever, or you might
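A hedged sketch of that broadcast (map-side) join, assuming smallRdd and largeRdd are pair RDDs and the small side fits comfortably in driver memory:

// collect the small side to the driver and broadcast it to every executor
val smallMap = smallRdd.collectAsMap()
val smallBc = sc.broadcast(smallMap)

// join in a map over the large side: the large RDD is never shuffled
val joined = largeRdd.flatMap { case (k, v) =>
  smallBc.value.get(k).map(w => (k, (v, w)))
}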
Hello,
I have a spark job with 5 stages. After it runs the 3rd stage, the console
shows
15/04/09 10:25:57 INFO yarn.Client: Application report for
application_1427705526386_127168 (state: RUNNING)
15/04/09 10:25:58 INFO yarn.Client: Application report for
application_1427705526386_127168 (state:
You can use toDebugString to see all the steps in a job.
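For example, assuming myRdd is the RDD at the end of your chain of transformations:

// prints the lineage, one indented line per dependency/stage
println(myRdd.toDebugString)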
Best
Bojan
Well, you are running in local mode, so it cannot find another peer to
replicate the blocks received from the receivers. That's it. It's not a real
concern, and that error will go away when you run it in a cluster.
On Thu, Apr 9, 2015 at 11:24 AM, Nandan Tammineedi nan...@defend7.com
wrote:
Hi,
Actually, GraphX doesn't need to scan all the edges, because it
maintains a clustered index on the source vertex id (that is, it sorts
the edges by source vertex id and stores the offsets in a hash table).
If the activeDirection is appropriately set, it can then jump only to
the clusters with
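A hedged sketch of how activeDirection is used with the public Pregel API, borrowing the standard single-source shortest-paths example; it assumes graph: Graph[Double, Double] whose vertex attributes are already initialised to 0.0 at the source and Double.PositiveInfinity elsewhere:

import org.apache.spark.graphx._

val sssp = graph.pregel(Double.PositiveInfinity,
    activeDirection = EdgeDirection.Out)(
  (id, dist, newDist) => math.min(dist, newDist),          // vertex program
  triplet =>                                               // sendMsg: only run on edges whose
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)  // source received a message last round
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                                 // merge messages
)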
In Scala, we can make two RDDs use the same partitioner so that they are
co-partitioned:
val partitioner = new HashPartitioner(5)
val a1 = a.partitionBy(partitioner).cache()
val b1 = b.partitionBy(partitioner).cache()
How can we achieve the same in python? It would be great if
You are right, this needs to be done. I can work on it soon; I was not sure
if there is anyone even using the Scala 2.11 Spark REPL. Actually there is a
patch in the Scala 2.10 shell to support adding jars (lost the JIRA ID), which
has to be ported for Scala 2.11 too. If however, you (or anyone else) are
Thanks Ted, using HiveTest as my context worked. It still left a metastore
directory and Derby log in my current working directory though; I manually
added a shutdown hook to delete them and all was well.
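For what it's worth, a rough sketch of such a hook, assuming the default names Derby writes into the working directory (a metastore_db directory and a derby.log file):

import java.io.File

def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) f.listFiles().foreach(deleteRecursively)
  f.delete()
}

// clean up the local Derby metastore artifacts when the JVM exits
sys.addShutdownHook {
  Seq(new File("metastore_db"), new File("derby.log"))
    .filter(_.exists())
    .foreach(deleteRecursively)
}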
On Wed, Apr 8, 2015 at 4:33 PM, Ted Yu yuzhih...@gmail.com wrote:
Please take a look at