Thanks MLnick,
I fixed the error.
First I compiled Spark with the original version, then I downloaded this pom
file to the examples folder:
https://github.com/tedyu/spark/commit/70fb7b4ea8fd7647e4a4ddca4df71521b749521c
Then I recompiled with Maven:
mvn -Dhbase.profile=hadoop-provided -Phadoop-2.4
AFAIK Spark doesn't restart worker nodes itself. You can have multiple
worker nodes, and in that case, if one worker node goes down, Spark will
recompute the lost RDD partitions on the workers that are still alive.
Thanks
Best Regards
On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu
Given that I have multiple worker nodes, when Spark reschedules the job on
the worker nodes that are alive, does it store the data in Elasticsearch and
then Flume all over again, or does it only run the functions that store to Flume?
Regards,
Madhu Jahagirdar
From the Spark Streaming Programming Guide (
http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node
):
*...output operations (like foreachRDD) have at-least once semantics, that
is, the transformed data may get written to an external entity more than
once in the event of a worker failure.*
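A common way to live with at-least-once output is to make the writes
idempotent, e.g. key each record by a deterministic id so that a re-executed
task overwrites instead of duplicating. A rough sketch in Scala (Store and
makeId are hypothetical stand-ins for your sink):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // hypothetical sink with put-by-key semantics: a retried task writes
    // the same ids again, overwriting rather than duplicating
    val store = Store.connect()
    records.foreach(r => store.put(makeId(r), r))
    store.close()
  }
}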
Doesn't Spark keep track of the DAG lineage and restart from where it
stopped? Does it always have to start from the beginning of the lineage when
the job fails?
From: Massimiliano Tomassi [max.toma...@gmail.com]
Sent: Monday, October 06, 2014 2:40 PM
To:
Arko,
It would be useful to know more details on the use case you are trying to
solve. As Tobias wrote, Spark Streaming works on DStreams, which are
continuous series of RDDs.
Do check out performance tuning:
I have created
https://issues.apache.org/jira/browse/SPARK-3814
https://issues.apache.org/jira/browse/SPARK-3815
Will probably try my hand at 3814, seems like a good place to get started...
On Fri, Oct 3, 2014 at 3:06 PM, Michael Armbrust mich...@databricks.com
wrote:
Thanks for digging in!
Hi,
I would like to ask if it is possible to use a generator that generates data
bigger than the size of RAM across all the machines as the input for sc =
SparkContext(), sc.parallelize(generator). I would like to create an RDD this way.
When I am trying to create an RDD by sc.textFile(file), where file
We have used the strategy that you suggested, Andrew - using many workers
per machine and keeping the heaps small (< 20gb).
Using a large heap resulted in workers hanging or not responding (leading
to timeouts). The same dataset/job for us will fail (most often due to
akka disassociated or fetch
sc.parallelize() distributes a list of data into a number of partitions, but
a generator cannot be split up and serialized automatically.
If you can partition your generator, then you can try this:
sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x))
such as you want to generate
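The same idea in Scala, with generatePartition a placeholder for whatever
produces one partition's worth of items:

// one task per partition index; each task generates its own data locally
val rdd = sc.parallelize(1 to n, n).flatMap(i => generatePartition(i))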
Hi again,
Finally, I found the time to play around with your suggestions.
Unfortunately, I noticed some unusual behavior in the MLlib results, which
is more obvious when I compare them against their scikit-learn equivalent.
Note that I am currently using Spark 0.9.2. Long story short: I find it
One difference I can find is that you may have different kernel functions in
your training. In Spark, you end up using a linear kernel, whereas in
scikit-learn you are using the RBF kernel. That can explain the difference in
the coefficients you are getting.
On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais
After disabling the client-side authorization, with nothing in the
SPARK_CLASSPATH, I am still getting the class-not-found error.
<property>
  <name>hive.security.authorization.enabled</name>
  <value>false</value>
  <description>Perform authorization checks on the client</description>
</property>
Am I hitting a
No, not yet. Only Hive UDAFs are supported.
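For scalar functions (as opposed to aggregates), registerFunction does work,
roughly like this (a sketch; the "people" table is assumed to be registered
already):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// register a Scala closure as a SQL function
sqlContext.registerFunction("strLen", (s: String) => s.length)
sqlContext.sql("SELECT strLen(name) FROM people")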
On Mon, Oct 6, 2014 at 2:18 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
Does Spark SQL currently support user-defined custom aggregation functions
in Scala, the way UDFs are defined with sqlContext.registerFunction? (not
Hive UDAF)
Thanks,
--
Hi All,
Would really appreciate it if anyone in the community has implemented the
saveAsNewAPIHadoopFiles method in Java, found in
org.apache.spark.streaming.api.java.JavaPairDStream<K,V>.
Any code snippet or online link would be greatly appreciated.
Regards,
Jacob
TD has addressed this. It should be available in 1.2.0.
https://issues.apache.org/jira/browse/SPARK-3495
On Thu, Oct 2, 2014 at 9:45 AM, maddenpj madde...@gmail.com wrote:
I am seeing this same issue. Bumping for visibility.
Here's an example:
https://github.com/OryxProject/oryx/blob/master/oryx-lambda/src/main/java/com/cloudera/oryx/lambda/BatchLayer.java#L131
On Mon, Oct 6, 2014 at 7:39 PM, Abraham Jacob abe.jac...@gmail.com wrote:
Hi All,
Would really appreciate from the community if anyone has implemented the
Is the RDD partition index you get when you call mapPartitionsWithIndex
consistent under fault-tolerance conditions?
I.e.
1. Say index is 1 for one of the partitions when you call
data.mapPartitionsWithIndex((index, rows) => ...) // Say index is 1
2. The partition fails (maybe along with a bunch
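For context, a minimal example of the API in question, using the index to
build ids that are unique across partitions (a common pattern):

val withIds = data.mapPartitionsWithIndex { (index, rows) =>
  rows.zipWithIndex.map { case (row, i) =>
    // partition index in the high bits, position within the partition below
    ((index.toLong << 32) | i, row)
  }
}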
Unfortunately not. Again, I wonder if adding support targeted at this
small files problem would make sense for Spark core, as it is a common
problem in our space.
Right now, I don't know of any other options.
Nick
On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn lan...@janrain.com wrote:
Hi,
The partition information for the Spark master is not updated after a stage
failure. In the case of HDFS, Spark gets partition information from the
InputFormat, and if a data node in HDFS is down while Spark is performing
computation for a certain stage, that stage will fail and be resubmitted using the
Hi,
Thank you for your advice. It really might work, but to specify my problem a
bit more: think of my data as one generated item being one parsed Wikipedia
page. I am getting this generator from the parser and I don't want to save it
to storage, but directly apply parallelize and
Any suggestions? I'm thinking of submitting a feature request for mutable
broadcast. Is it doable?
On Mon, Oct 6, 2014 at 1:08 PM, jan.zi...@centrum.cz wrote:
Hi,
Thank you for your advice. It really might work, but to specify my problem a
bit more: think of my data as one generated item being one parsed
Wikipedia page. I am getting this generator from the parser and I don't want
to
Try a custom Hadoop InputFormat - I can give you some samples.
While I have not tried this, an input split has only a length (which can be
ignored if the format treats the input as non-splittable) and a String for a
location. If the location is a URL into Wikipedia, the whole thing should work.
Hadoop
Sean,
Thanks a ton... This is exactly what I was looking for.
As mentioned in the code -
// This horrible, separate declaration is necessary to appease the compiler
@SuppressWarnings("unchecked")
Class<? extends OutputFormat<?,?>> outputFormatClass =
    (Class<? extends OutputFormat<?,?>>)
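For comparison, the Scala API avoids that cast because classOf carries the
type parameters; a rough equivalent, assuming a DStream of Text/IntWritable
pairs:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.streaming.StreamingContext._ // pair-DStream functions on 1.x

// stream: DStream[(Text, IntWritable)]
stream.saveAsNewAPIHadoopFiles(
  "hdfs:///out/prefix", "txt",
  classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]])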
Hi,
I see that this type of question has been asked before; however, I am still a
little confused about it in practice. For example, there are two ways I could
deal with a series of RDD transformations before I do an RDD action; which way
is faster:
Way 1:
val data = sc.textFile()
val data1 = data.map(x =>
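Presumably the elided code compares naming each intermediate RDD against
chaining the calls, i.e. something like this (step1/step2 are placeholders):

// Way 1: intermediate vals
val data = sc.textFile("hdfs:///input")
val data1 = data.map(x => step1(x))
val data2 = data1.map(x => step2(x))
data2.saveAsTextFile("hdfs:///output")

// Way 2: one chained expression
sc.textFile("hdfs:///input").map(x => step1(x)).map(x => step2(x)).saveAsTextFile("hdfs:///output")

Both build the same lazy lineage; nothing executes until the action runs.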
@Davies
I know that gensim.corpora.wikicorpus.extract_pages will for sure be the
bottleneck on the master node.
Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes
to store the whole data set that needs to be parsed by extract_pages. I have my
data on S3 and I kind
I think you mean that data2 is a function of data1 in the first
example. I imagine that the second version is a little bit more
efficient.
But it has nothing to do with memory or caching. You don't have to
cache anything here if you don't want to; you can cache what you like.
Once memory for the
Another rule of thumb: definitely cache the RDD over which you need
to do iterative analysis...
For the rest of them, only cache if you have a lot of free memory!
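A small sketch of that rule (the path and the per-iteration analysis are
placeholders):

val parsed = sc.textFile("hdfs:///data").map(_.split(','))
parsed.cache() // iterated over repeatedly below, so caching pays off
for (i <- 1 to 10) {
  // stands in for one iteration of the analysis
  println(parsed.filter(_.length > i).count())
}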
On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote:
I think you mean that data2 is a function of data1 in the
Hi,
I have started a Spark cluster on EC2 using the Spark standalone cluster
manager, but Spark is trying to identify the workers using
hostnames which are not accessible publicly.
So when I try to submit jobs from Eclipse it fails; is there some way
Spark can use IP addresses instead
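One thing that may help (untested here): the standalone daemons can be told
which address to bind to and which one to advertise, via conf/spark-env.sh on
each node (the address below is a placeholder for that node's public IP):

export SPARK_LOCAL_IP=203.0.113.10
export SPARK_PUBLIC_DNS=203.0.113.10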
I'm using CDH 5.1.0 with Spark 1.0.0. There is spark-sql-1.0.0 in Cloudera's
Maven repository. After putting it on the classpath, I can use spark-sql in
my application.
One issue is that I couldn't make the join a hash join. It gives a
CartesianProduct when I join two SchemaRDDs as follows:
Rohit,
Thank you very much for releasing the Hadoop 2 version; now my app compiles
fine and there is no more runtime error w.r.t. a Hadoop 1.x class or interface.
Tian
On Saturday, October 4, 2014 9:47 AM, Rohit Rai ro...@tuplejump.com wrote:
Hi Tian,
We have published a build against Hadoop 2.0
For two large key-value data sets, if they have the same set of keys, what is
the fastest way to join them into one? Suppose all keys are unique in each
data set, and we only care about those keys that appear in both data sets.
Input data I have: (k, v1) and (k, v2).
Data I want to get from the
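Assuming pair RDDs, join already does exactly this: an inner join that keeps
only the keys present on both sides. A minimal sketch:

import org.apache.spark.SparkContext._ // pair-RDD functions on 1.x
import org.apache.spark.HashPartitioner

val a = sc.parallelize(Seq((1, "x"), (2, "y"))) // (k, v1)
val b = sc.parallelize(Seq((2, 2.0), (3, 3.0))) // (k, v2)
val joined = a.join(b) // RDD[(Int, (String, Double))]: here just (2, ("y", 2.0))
// for two large inputs, pre-partitioning both sides with the same partitioner
// lets the join itself run without a further shuffle
val p = new HashPartitioner(8)
val joined2 = a.partitionBy(p).join(b.partitionBy(p))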
Does Spark 1.1.0 work with Hadoop 2.5.0?
The Maven build instructions only have command options up to Hadoop 2.4.
Has anybody ever made it work?
I am trying to run spark-sql with Hive 0.12 on top of Hadoop 2.5.0 but can't
make it work.
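If I remember the 1.1 build docs correctly, Hadoop versions after 2.4 reuse
the hadoop-2.4 profile with hadoop.version overridden, e.g.:

mvn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Phive -DskipTests clean package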