Ok, I'll wait until -Pscala-2.11 is more stable and used by more people.
Thanks for the help!
Jianshi
On Tue, Nov 18, 2014 at 3:49 PM, Ye Xianjin advance...@gmail.com wrote:
Hi Prashant Sharma,
It's not even ok to build with scala-2.11 profile on my machine.
Just check out the
Hi,
I want to do log aggregation in a Spark standalone mode cluster, using
Apache Flume. But something weird happened. Here are my operations:
(1) Start a flume agent, listening on port 3. ( flume-conf.properties
I'm a newbie in Spark. I know that the work to be done is written as an RDD.
But I want to make the worker load a native lib so that I can do something to
change the content of the lib in memory.
How can I do this? I can do it on the driver, but not on the worker. I always get a fatal
error. The JVM reports a
Not sure how to solve this, but spotted these lines in the logs:
14/11/18 14:28:23 INFO YarnAllocationHandler: Container marked as
*failed*: container_1415961020140_0325_01_02
14/11/18 14:28:38 INFO YarnAllocationHandler: Container marked as
*failed*: container_1415961020140_0325_01_03
Hi,
I'm operating Spark in standalone cluster configuration (3 slaves) and
have some question.
1. How can I stop a slave on the specific node?
Under `sbin/` directory, there are
`start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but
no `stop-slave`. Is there any way to stop
Did you set spark.executor.extraLibraryPath to the directory in which your
native library exists?
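A minimal sketch of setting that property when building the context; the directory below is just a placeholder for wherever your .so files actually live:

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("native-lib-example")
  .set("spark.executor.extraLibraryPath", "/opt/native/libs") // placeholder directory
val sc = new SparkContext(conf)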
2014-11-18 16:13 GMT+08:00 tangweihan tangwei...@360.cn:
I'm a newbie in Spark. I know that the work to be done is written as an
RDD.
But I want to make the worker load a native lib and I can do
Hi,
I have created a spark streaming application based on spark-1.1.0. While
running, it failed with an akka jar version mismatch. Some projects are using
akka 2.3.6, so I cannot change the akka version, as it would affect
others.
What should I do?
Caused by: akka.ConfigurationException:
OK. I don't put it in the path, because this is not a lib I want to use
permanently.
Here is my code in the RDD:
val fileaddr = SparkFiles.get("segment.so")
System.load(fileaddr)
val config = SparkFiles.get("qsegconf.ini")
val segment = new Segment // this
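Note that SparkFiles.get only resolves files that were shipped with the job, e.g. via sc.addFile on the driver (or --files with spark-submit). A minimal sketch with placeholder paths; rdd stands for whatever RDD the work runs on:

import org.apache.spark.SparkFiles
// On the driver: ship the native library and config with the job (placeholder paths).
sc.addFile("/local/path/segment.so")
sc.addFile("/local/path/qsegconf.ini")
// Inside a task on a worker: resolve the shipped copies and load the library.
rdd.foreachPartition { _ =>
  System.load(SparkFiles.get("segment.so"))
  val confPath = SparkFiles.get("qsegconf.ini")
  // ... use the native library here ...
}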
At 2014-11-11 01:51:43 +, Buttler, David buttl...@llnl.gov wrote:
I am building a graph from a large CSV file. Each record contains a couple
of nodes and about 10 edges. When I try to load a large portion of the
graph, using multiple partitions, I get inconsistent results in the number
1. You can comment out the rest of the workers in the conf/slaves file and run
stop-slaves.sh from that machine to stop the specific worker.
2. There is no direct command for it, but you can do something like the
following:
$ curl localhost:8080 | grep Applications -C 10 | head -n20
Where
At 2014-11-13 21:28:52 +, Ommen, Jurgen omme0...@stthomas.edu wrote:
I'm using GraphX and playing around with its PageRank algorithm. However, I
can't see from the documentation how to use edge weight when running PageRank.
Is it possible to consider edge weights, and how would I do it?
Thanks Akhil.
-Naveen.
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, November 18, 2014 1:19 PM
To: Naveen Kumar Pokala
Cc: user@spark.apache.org
Subject: Re: Null pointer exception with larger datasets
Make sure your list is not null; if it is null then it's more like
At 2014-11-15 18:01:22 -0700, tom85 tom.manha...@gmail.com wrote:
This line: val newPR = oldPR + (1.0 - resetProb) * msgSum
makes no sense to me. Should it not be:
val newPR = resetProb/graph.vertices.count() + (1.0 - resetProb) * msgSum
?
This is an unusual version of PageRank where the
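For reference, the textbook per-vertex update being compared against here, writing resetProb for the reset probability and N for the number of vertices, is

PR(v) = \frac{\text{resetProb}}{N} + (1 - \text{resetProb}) \sum_{u \to v} \frac{PR(u)}{\text{outdeg}(u)}

which is what the resetProb/graph.vertices.count() suggestion above corresponds to.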
Hi,
I am using Spark-1.0.0. There are two GraphX directories that I can see here
1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx
which contains LiveJournalPageRank.scala
2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib which
contains
At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
I was going through the graphx section in the Spark API in
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$
Here, I find the word landmark. Can anyone explain to me
So a landmark can contain just one vertex, right?
Which algorithm has been used to compute the shortest path?
Thank You
On Tue, Nov 18, 2014 at 2:53 PM, Ankur Dave ankurd...@gmail.com wrote:
At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com
wrote:
I was going through the
At 2014-11-18 12:02:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
I just ran the PageRank code in GraphX with some sample data. What I am
seeing is that the total rank changes drastically if I change the number of
iterations from 10 to 100. Why is that so?
As far as I understand, the
There are no vertices of zero outdegree.
The total rank for the graph with numIter = 10 is 4.99 and for the graph
with numIter = 100 is 5.99
I do not know why there is so much variation.
On Tue, Nov 18, 2014 at 3:22 PM, Ankur Dave ankurd...@gmail.com wrote:
At 2014-11-18 12:02:52 +0530, Deep Pradhan
At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
So landmark can contain just one vertex right?
Right.
Which algorithm has been used to compute the shortest path?
It's distributed Bellman-Ford.
Ankur
Does Bellman-Ford give the best solution?
On Tue, Nov 18, 2014 at 3:27 PM, Ankur Dave ankurd...@gmail.com wrote:
At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com
wrote:
So landmark can contain just one vertex right?
Right.
Which algorithm has been used to compute the
The code present in 2 can be run with the command
$SPARK_HOME/bin/spark-submit --master local[*] --class
org.apache.spark.graphx.lib.Analytics
$SPARK_HOME/assembly/target/scala-2.10/spark-assembly-*.jar pagerank
/edge-list-file.txt --numEPart=8 --numIter=10
At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
I am using Spark-1.0.0. There are two GraphX directories that I can see here
1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx
which contains LiveJournalPageRank.scala
2.
Hi Kenichi
1. How can I stop a slave on the specific node?
Under `sbin/` directory, there are
`start-{all,master,slave,slaves}` and `stop-{all,master,slaves}`, but
no `stop-slave`. Is there any way to stop the specific (e.g., the
2nd) slave via command line?
You can use
At 2014-11-18 15:29:08 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
Does Bellman-Ford give the best solution?
It gives the same solution as any other algorithm, since there's only one
correct solution for shortest paths and it's guaranteed to find it eventually.
There are probably
Hi,
While running my spark streaming application built on spark 1.1.0 I am
getting the error below.
14/11/18 15:35:30 ERROR ReceiverTracker: Deregistered receiver for stream
0: Error starting receiver 0 - java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:52)
at
What command should I use to run the LiveJournalPageRank.scala?
If you want to write a separate application, the ideal way is to do it in
a separate project that links in Spark as a dependency [1].
But even for this, I have to do the build every time I change the code,
right?
Thank You
On Tue,
At 2014-11-18 15:35:13 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
Now, how do I run the LiveJournalPageRank.scala that is there in 1?
I think it should work to use
MASTER=local[*] $SPARK_HOME/bin/run-example graphx.LiveJournalPageRank
/edge-list-file.txt --numEPart=8 --numIter=10
Yes, the above command works, but there is this problem. Most of the time,
the total rank is NaN (Not a Number). Why is it so?
Thank You
On Tue, Nov 18, 2014 at 3:48 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
What command should I use to run the LiveJournalPageRank.scala?
If you want
At 2014-11-18 15:44:31 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
I meant to ask whether it gives the solution faster than other algorithms.
No, it's just that it's much simpler and easier to implement than the others.
Section 5.2 of the Pregel paper [1] justifies using it for a graph
At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
Yes, the above command works, but there is this problem. Most of the time,
the total rank is NaN (Not a Number). Why is it so?
I've also seen this, but I'm not sure why it happens. If you could find out
which vertices
Hi guys,
Has anyone already tried doing this work?
Thanks
--
Privacy notice: http://www.unibs.it/node/8155
You can implement a custom receiver
http://spark.apache.org/docs/latest/streaming-custom-receivers.html to
connect to Kestrel and use it. I think someone has already tried it, not
sure if it is working though. Here's the link
You can refer to the following link:
https://github.com/prabeesh/Spark-Kestrel
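A minimal skeleton of such a receiver; the Receiver API below is Spark's, but the Kestrel client calls are hypothetical placeholders:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class KestrelReceiver(host: String, port: Int, queue: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Run the blocking poll loop on its own thread so onStart returns quickly.
    new Thread("Kestrel Receiver") { override def run(): Unit = receive() }.start()
  }

  def onStop(): Unit = { /* close the Kestrel connection here */ }

  private def receive(): Unit = {
    // val client = new KestrelClient(host, port)   // hypothetical client
    while (!isStopped) {
      // Option(client.get(queue)).foreach(store)   // push each message into Spark
    }
  }
}

It would then be hooked up with ssc.receiverStream(new KestrelReceiver(host, port, queue)).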
On Tue, Nov 18, 2014 at 3:51 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can implement a custom receiver
http://spark.apache.org/docs/latest/streaming-custom-receivers.html to
connect to Kestrel and use it. I
Hi,
In my spark application, I am loading some rows from database into Spark
RDDs
Each row has several fields, and a string key. Due to my requirements I
need to work with consecutive numeric ids (starting from 1 to N, where N is
the number of unique keys) instead of string keys. Also several
Ah... Thanks Ted! And Hao, sorry for being the original trouble maker :)
On 11/18/14 1:50 AM, Ted Yu wrote:
Looks like this was where you got that commandline:
http://search-hadoop.com/m/JW1q5RlPrl
Cheers
On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com
mailto:inv...@gmail.com
Hello everyone,
I'm new to Spark and I have the following problem:
I have this large JavaRDD<MyClass> collection, which I group by
creating a hashcode from some fields in MyClass:
JavaRDD<MyClass> collection = ...;
JavaPairRDD<Integer, Iterable<MyClass>> grouped =
collection.groupBy(...); //
A not so efficient way can be this:
val r0: RDD[OriginalRow] = ...
val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
val r2 = r1.keys.distinct().zipWithIndex()
val r3 = r2.join(r1).values
On 11/18/14 8:54 PM, shahab wrote:
Hi,
In my spark application, I am loading some
I am writing yet another Spark job server and have been able to submit jobs
and return/save results. I let multiple jobs use the same spark context but
I set job group while firing each job so that I can in future cancel jobs.
Further, what I want to do is provide some kind of status
I am trying to figure out if sorting is persisted after applying Pair RDD
transformations and I am not able to decisively tell after reading the
documentation.
For example:
val numbers = .. // RDD of numbers
val pairedNumbers = numbers.map(number => (number % 100, number))
val sortedPairedNumbers
I started some quick hack for that in the notebook, you can head to:
https://github.com/andypetrella/spark-notebook/blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala
On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
I am writing yet
nvm, but it would be better if the correctness of flags could be checked by sbt
during the build.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-with-hive-does-not-work-tp19072p19183.html
Sent from the Apache Spark User List mailing list archive at
Thanks Andy. This is very useful. This gives me all active stages and their
percentage completion, but I am unable to tie stages to a job group (or
specific job). I looked at Spark's code and to me, it
seems org.apache.spark.scheduler.ActiveJob's group ID should get propagated
to StageInfo (possibly in
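One hedged workaround in the meantime: a SparkListener's onJobStart event carries both the stage ids and the submission properties, and the job group id is stored in those properties under an internal key. A sketch (the property key is what SparkContext sets internally, not public API):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val group = Option(jobStart.properties)
      .map(_.getProperty("spark.jobGroup.id"))   // internal key, may change
      .orNull
    val stages = jobStart.stageIds
    // record group -> stages in your own bookkeeping here
  }
})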
yep, we should also propose adding this stuff to the public API.
Any other ideas?
On Tue Nov 18 2014 at 4:03:35 PM Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
Thanks Andy. This is very useful. This gives me all active stages and their
percentage completion, but I am unable to tie stages
Hello Experts,
I need to get the total of an amount field for a specified date range. Now that
group by on a calculated field does not work
(https://issues.apache.org/jira/browse/SPARK-4296), what is the best way to
get this done?
I thought of doing it using Spark, but I suspect I will miss the performance
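One hedged workaround while SPARK-4296 is open is to compute the grouping expression in an inner query (or an RDD transformation) so the outer GROUP BY only references a plain column. Table and column names below are placeholders, and which string functions parse depends on whether you use SQLContext or HiveContext:

val totals = sqlContext.sql("""
  SELECT day, SUM(amount) AS total
  FROM (SELECT substr(ts, 1, 10) AS day, amount FROM transactions) t
  WHERE day >= '2014-11-01' AND day <= '2014-11-18'
  GROUP BY day
""")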
Yes,
The case is convincing for PMML with Oryx. I will also investigate
parameter server.
Cheers,
Charles
On Tuesday, November 18, 2014, Sean Owen so...@cloudera.com wrote:
I'm just using PMML. I haven't hit any limitation of its
expressiveness, for the model types it supports. I don't think
Are nightly releases posted anywhere? There are quite a few vital bugfixes
and performance improvements being committed to Spark, and using the latest
commits is useful (or even necessary for some jobs).
Is there a place to post them? It doesn't seem like it would be difficult to
run make-dist nightly
Of course we can run this as well to get the latest, but the build is
fairly long and this seems like a resource many would need.
On Tue, Nov 18, 2014 at 10:21 AM, Arun Ahuja aahuj...@gmail.com wrote:
Are nightly releases posted anywhere? There are quite a few vital
bugfixes and performance
First use groupByKey(); you get a pair RDD of (key: K, value: ArrayBuffer[V]).
Then use map() on this RDD with a function that performs different operations
depending on the key, which acts as a parameter of that function.
On Nov 18, 2014, at 8:59 PM, jelgh johannes.e...@gmail.com wrote:
Hello everyone,
I'm
groupByKey does not run a combiner so be careful about the
performance... groupByKey does a shuffle even for local groups...
reduceByKey and aggregateByKey do run a combiner, but if you want a
separate function for each key, you can have a key-to-closure map that you
can broadcast and use it in
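A hedged sketch of that idea: broadcast a map from key to reduce function and apply it after grouping (the keys and functions here are made up for illustration):

val fnByKey: Map[String, (Int, Int) => Int] = Map(
  "sum" -> ((a: Int, b: Int) => a + b),
  "max" -> ((a: Int, b: Int) => math.max(a, b))
)
val fnBc = sc.broadcast(fnByKey)

val pairs: org.apache.spark.rdd.RDD[(String, Int)] = ???   // placeholder for your keyed data
val reducedPerKey = pairs.groupByKey().map { case (k, values) =>
  (k, values.reduce(fnBc.value(k)))
}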
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to
increment them like so:
val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)
If the number of distinct keys is relatively small, you might consider
collecting them into a map and broadcasting them rather than using
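A hedged sketch of that broadcast-map variant, assuming the set of distinct keys fits in driver memory (keyed below is a placeholder for the (stringKey, row) RDD described above):

val keyed: org.apache.spark.rdd.RDD[(String, Seq[Any])] = ???   // placeholder: (stringKey, row fields)
val keyToId = keyed.keys.distinct().collect()
  .zipWithIndex.map { case (k, i) => (k, i + 1L) }.toMap        // ids start at 1
val keyToIdBc = sc.broadcast(keyToId)
val withNumericIds = keyed.map { case (k, row) => (keyToIdBc.value(k), row) }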
ah makes sense - Thanks Michael!
On Mon, Nov 17, 2014 at 6:08 PM, Michael Armbrust mich...@databricks.com
wrote:
You are perhaps hitting an issue that was fixed by #3248
https://github.com/apache/spark/pull/3248?
On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood sadhan.s...@gmail.com
wrote:
I can see this being valuable for users wanting to live on the cutting edge
without building CI infrastructure themselves, myself included. I think
Patrick's recent work on the build scripts for 1.2.0 will make delivering
nightly builds to a public maven repo easier.
On Tue, Nov 18, 2014 at
As it is difficult to explain this, I will show what I want. Let us say
I have an RDD A with the following value:
A = [word1, word2, word3]
I want to have an RDD with the following value
B = [(1, word1), (2, word2), (3, word3)]
That is, it gives a unique number to each entry as a key value.
Map the key/value into a (key, Tuple2<key, value>) and process that -
Also ask the Spark maintainers for a version of keyed operations where the
key is passed in as an argument - I run into these cases all the time
/**
* map a tuple into a key/tuple pair to ensure subsequent processing has
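A hedged sketch of that mapping, so the reduce function can see the key inside the value (the per-key logic is made up for illustration):

val keyed: org.apache.spark.rdd.RDD[(String, Int)] = ???        // placeholder input
val withKeyInValue = keyed.map { case (k, v) => (k, (k, v)) }   // key rides along in the value
val reduced = withKeyInValue.reduceByKey { case ((k, a), (_, b)) =>
  (k, if (k == "special") a * b else a + b)                     // made-up per-key behaviour
}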
Hi Guys,
I am doing some tests with JavaKafkaWordCount. My cluster is composed of 8
workers and 1 driver with spark-1.1.0; I am using Kafka too and I have some
questions about it.
1 - When I launch the command:
bin/spark-submit --class org.apache.spark.examples.streaming.JavaKafkaWordCount
Use zipWithIndex but cache the data before you run zipWithIndex...that way
your ordering will be consistent (unless the bug has been fixed where you
don't have to cache the data)...
Normally these operations are used for dictionary building and so I am
hoping you can cache the dictionary of
Hi Folks!
I'm running Spark on YARN cluster installed with Cloudera Manager Express.
The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G
RAM.
My Spark job is working fine; however, it seems that just 2 of the 3 slaves
are working (htop shows 2 slaves working 100% on 32 cores,
I run my Spark on YARN jobs as:
HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit
--master yarn --jars test-job.jar --executor-cores 4 --num-executors 10
--executor-memory 16g --driver-memory 4g --class TestClass test.jar
It uses HADOOP_CONF_DIR to schedule executors and
Can you check in your RM's web UI how much of each resource Yarn
thinks you have available? You can also check that in the Yarn
configuration directly.
Perhaps it's not configured to use all of the available resources. (If
it was set up with Cloudera Manager, CM will reserve some room for
Hey Alan,
Spark's application master will take up 1 core on one of the nodes on the
cluster. This means that that node will only have 31 cores remaining, not
enough to fit your third executor.
-Sandy
On Tue, Nov 18, 2014 at 10:03 AM, Alan Prando a...@scanboo.com.br wrote:
Hi Folks!
I'm
My guess is you're asking for all cores of all machines but the driver
needs at least one core, so one executor is unable to find a machine to fit
on.
On Nov 18, 2014 7:04 PM, Alan Prando a...@scanboo.com.br wrote:
Hi Folks!
I'm running Spark on YARN cluster installed with Cloudera Manager
Hi Jerry,
I looked at the KafkaUtils.createStream API and found that
spark.default.parallelism is actually specified in SparkConf instead. I do not
remember the exact stack trace of the exception, but the exception occurred
when createStream was called if we did not specify the
My best guess would be a networking issue--it looks like the Python
socket library isn't able to connect to whatever hostname you're
providing Spark in the configuration.
On 11/18/14 9:10 AM, amin mohebbi wrote:
Hi there,
I have already downloaded the pre-built spark-1.1.0, I want to run
Hi all,
This is somewhat related to my previous question (
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html
, for additional context) but for all practical purposes this is its own
issue.
As in my previous question, I'm making
To clarify about what, precisely, is impossible: the crash happens with
INDEX == 1 in func2, but func2 is only called in the reduceByKey
transformation when INDEX == 0. And according to the output of the
foreach() in line 4, that reduceByKey(func2) works just fine. How is it
then invoked again
It seems that `localhost` cannot be resolved on your machines. I have
filed https://issues.apache.org/jira/browse/SPARK-4475 to track it.
On Tue, Nov 18, 2014 at 6:10 AM, amin mohebbi
aminn_...@yahoo.com.invalid wrote:
Hi there,
I have already downloaded Pre-built spark-1.1.0, I want to run
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com wrote:
Use zipWithIndex but cache the data before you run zipWithIndex...that way
your ordering will be consistent (unless the bug has been fixed where you
don't have to cache the data)...
Could you point some link about
Sorry everyone--turns out an oft-forgotten single line of code was
required to make this work:
index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
*M.cache()*
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)
Those are probably related. It looks like we are somehow not being thread
safe when initializing various parts of the scala compiler.
Since code gen is pretty experimental we probably won't have the resources
to investigate backporting a fix. However, if you can reproduce the
problem in Spark
On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote:
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com
wrote:
Use zipWithIndex but cache the data before you run zipWithIndex...that way
your ordering will be consistent (unless the bug has been fixed
I've developed a Spark application using the 1.2.0-SNAPSHOT branch that
leverages Spark Streaming and Hive and can run it locally with no problem (I
need some fixes in the 1.2.0 branch). I successfully launched my EC2 cluster
by specifying a git commit hash from the 1.2.0-SNAPSHOT branch as the
I'm having problems running the twitter graph on a cluster with 4 nodes, each
having over 100GB of RAM and 32 virtual cores per node.
I do have a pre-installed spark version (built against hadoop 2.3, because
it didn't compile on my system), but I'm loading my graph file from disk
without hdfs.
Abdul,
Please send an email to user-unsubscr...@spark.apache.org
On Tue, Nov 18, 2014 at 2:05 PM, Abdul Hakeem alhak...@gmail.com wrote:
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional
Hi,
I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
shell.
I am running a job that essentially reads a bunch of HBase keys, looks up
HBase data, and performs some filtering and aggregation. The job works fine
on smaller datasets, but when I try to execute on the full
Hi Pala,
Do you have access to your YARN NodeManager logs? Are you able to check
whether they report killing any containers for exceeding memory limits?
-Sandy
On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia mchett...@rocketfuelinc.com
wrote:
Hi,
I am using Spark 1.0.1 on Yarn 2.5, and
I'm essentially loading a file and saving output to another location:
val source = sc.textFile("/tmp/testfile.txt")
source.saveAsTextFile("/tmp/testsparkoutput")
when I do so, I'm hitting this error:
14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
<console>:15
I see, thanks!
On Tue, Nov 18, 2014 at 12:12 PM, Sean Owen so...@cloudera.com wrote:
On Tue, Nov 18, 2014 at 8:26 PM, Davies Liu dav...@databricks.com wrote:
On Tue, Nov 18, 2014 at 9:06 AM, Debasish Das debasish.da...@gmail.com
wrote:
Use zipWithIndex but cache the data before you run
It can be a serialization issue.
Happens when there are different versions installed on the same system.
What do you mean by the first time you installed and tested it out?
On Wed, Nov 19, 2014 at 3:29 AM, Anson Abraham anson.abra...@gmail.com
wrote:
I'm essentially loading a file and saving
I see the default and max cores settings but these seem to control total cores
per cluster.
My cobbled together home cluster needs the Master to not use all its cores or
it may lock up (it does other things). Is there a way to control max cores used
for a particular cluster machine in
Looks like I can do this by not using start-all.sh but starting each worker
separately passing in a '--cores n' to the master? No config/env way?
On Nov 18, 2014, at 3:14 PM, Pat Ferrel p...@occamsmachete.com wrote:
I see the default and max cores settings but these seem to control total cores
Hi,
Are there any examples of using JdbcRDD in Java available?
It's not clear what the last argument is in this example (
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
):
sc = new SparkContext("local", "test")
val rdd = new JdbcRDD(
sc,
() =>
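For reference, a hedged sketch of the full constructor: the last argument is the function that maps each ResultSet row to a value. The connection URL, query, and bounds below are placeholders; the SQL must contain the two '?' that get bound to the lower/upper bounds:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:h2:mem:testdb"),      // placeholder URL
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
  1L, 100L,   // lower and upper bounds, bound to the two '?'
  3,          // number of partitions
  (rs: ResultSet) => (rs.getInt("id"), rs.getString("name"))    // the last argument: mapRow
)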
This seems to work only on a ‘worker’ not the master? So I’m back to having no
way to control cores on the master?
On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote:
Looks like I can do this by not using start-all.sh but starting each worker
separately passing in a '--cores
I had the same problem using JdbcRDD in Java.
For me, I have written a class in Scala to get the JdbcRDD, and I call this
instance from Java.
For instance, JdbcRDDWrapper.scala looks like this:
...
import java.sql._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD
import
Thanks Kidong. I'll try your approach.
On Tue, Nov 18, 2014 at 4:22 PM, mykidong mykid...@gmail.com wrote:
I had the same problem using JdbcRDD in Java.
For me, I have written a class in Scala to get the JdbcRDD, and I call this
instance from Java. For instance, JdbcRDDWrapper.scala looks like
for instance, JdbcRDDWrapper.scala like
Hi all,
I am running a Spark Streaming job. It was able to produce the correct
results for some time. Later on, the job was still running but producing
no results. I checked the Spark Streaming UI and found that 4 tasks of a
stage failed.
The error messages showed that Job aborted due to stage
OK hacking the start-slave.sh did it
On Nov 18, 2014, at 4:12 PM, Pat Ferrel p...@occamsmachete.com wrote:
This seems to work only on a ‘worker’ not the master? So I’m back to having no
way to control cores on the master?
On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote:
If there is one big XML file (e.g., the 44GB Wikipedia dump, or the larger dump
that has all revision information as well) that is stored in HDFS, is it possible
to parse it in parallel / faster using Spark? Or do we have to use something
like a PullParser or Iteratee?
My current solution is to read the single
Sandy,
Good point - I forgot about NM logs.
When I looked up the NM logs, I only see the following statements that
align with the driver-side log about the lost executor. Many executors show the
same log statement at the same time, so it seems like the decision to kill
many if not all executors
Okay, thank you Michael.
On Wed, Nov 19, 2014 at 3:45 AM, Michael Armbrust mich...@databricks.com
wrote:
Those are probably related. It looks like we are somehow not being thread
safe when initializing various parts of the scala compiler.
Since code gen is pretty experimental we probably
Hi,
do you have some logging backend (log4j, logback) on your classpath? This
seems a bit like there is no particular implementation of the abstract
`log()` method available.
Tobias
Hi,
see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one
solution.
One issue with those XML files is that they cannot be processed line by
line in parallel; plus you inherently need shared/global state to parse XML
or check for well-formedness, I think. (Same issue with
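A hedged sketch of the Hadoop-input-format approach mentioned in that thread, assuming an XmlInputFormat such as the one shipped with Apache Mahout is on the classpath (the class name, package, and config keys come from that project, not from Spark):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat   // assumption: Mahout on the classpath

val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set(XmlInputFormat.START_TAG_KEY, "<page>")
hadoopConf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val pages = sc.newAPIHadoopFile(
  "hdfs:///data/enwiki.xml",   // placeholder path
  classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], hadoopConf
).map { case (_, text) => text.toString }   // each record is one <page>...</page> block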
Hi guys,
We ultimately needed to add 8 EC2 XLs to get better performance. As was
suspected, we could not fit all the data into RAM.
This worked great with files around 100-350MB in size, as our initial
export task produced. Unfortunately, for the partition settings that we
were able to
Hi Akhil and Kousuke,
Thank you for your quick response.
Monitoring through JSON API seems straightforward and cool.
Thanks again!
2014-11-18 19:06 GMT+09:00 Kousuke Saruta saru...@oss.nttdata.co.jp:
Hi Kenichi
1. How can I stop a slave on the specific node?
Under `sbin/` directory,
Hi,
I'm loading a JSON file into an RDD and then saving that RDD as Parquet.
One of the fields is a map of keys and values, but it is being translated and
stored as a struct.
How can I convert the field into a map?
Thanks,
Daniel
Hi there,
I would like to do text clustering using k-means and Spark on a massive
dataset. As you know, before running k-means, I have to apply pre-processing
steps such as TF-IDF and NLTK on my big dataset. The following is my code in
Python:
if __name__ == '__main__':
#
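A hedged Scala sketch of the MLlib side of that pre-processing (tokenization here is a naive whitespace split; NLTK has no direct MLlib equivalent):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}

val docs = sc.textFile("hdfs:///data/docs.txt")               // placeholder path
  .map(_.toLowerCase.split("\\s+").toSeq)                     // naive tokenization
val tf = new HashingTF().transform(docs)
tf.cache()                                                    // IDF makes more than one pass
val tfidf = new IDF().fit(tf).transform(tf)
val model = KMeans.train(tfidf, 10, 20)                       // placeholder k and maxIterations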
Hi,
Just to give some context: we are using a Hive metastore with CSV/Parquet
files as part of our ETL pipeline. We query these with Spark SQL to do
some downstream work.
I'm curious what's the best way to go about testing Hive with Spark SQL? I'm
using 1.1.0.
I see that the LocalHiveContext has been
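One hedged option is the TestHive fixture that Spark's own Hive suites use; it lives in the spark-hive test artifact, so treat the dependency and import below as assumptions:

import org.apache.spark.sql.hive.test.TestHive

// TestHive spins up a throwaway local metastore/warehouse for the test JVM.
TestHive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
TestHive.sql("SELECT COUNT(*) FROM src").collect()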
Hi all,
I tested the *partitionBy* feature in a wordcount application, and I'm
puzzled by a phenomenon. In this application, I created an RDD from some
text files in HDFS (about 100GB in size), each of which has lines composed
of words separated by the character '#'. I wanted to count the occurrence for
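A minimal sketch of that setup, with placeholder path and partition count:

import org.apache.spark.HashPartitioner

val lines = sc.textFile("hdfs:///data/words")        // placeholder path
val counts = lines
  .flatMap(_.split("#"))
  .map(word => (word, 1))
  .partitionBy(new HashPartitioner(128))             // placeholder partition count
  .reduceByKey(_ + _)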
Something like this?
val map_rdd = json_rdd.map(json => {
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val myMap = mapper.readValue[Map[String,String]](json)
myMap
})
Thanks
Best Regards
On Wed, Nov 19, 2014 at