Hello,
We were comparing performance of some of our production hive queries
between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
Spark 0.9 and 1.1. We could see that the performance gains have been good
in Spark.
We tried a very simple query,
select count(*) from T where
Hi,
I would like to know if there is an update on this?
rgds
On Mon, Jan 12, 2015 at 10:44 AM, Niranda Perera niranda.per...@gmail.com
wrote:
Hi,
I found out that Spark SQL currently supports only a relatively small subset of the
SQL dialect.
I would like to know the roadmap for the coming
Seems like it is a bug rather than a feature.
I filed a bug report: https://issues.apache.org/jira/browse/SPARK-5363
and post the code (if possible).
In a nutshell, your processing time exceeds the batch interval, resulting in an
ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic
is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das
Hi Gerard,
Thanks for the response.
The messages get deserialised from msgpack format, and one of the strings is
deserialised to JSON. Certain fields are checked to decide if further processing
is required. If so, it goes through a series of in-memory filters to check if more
processing is
This is not normal. It's a huge scheduling delay!! Can you tell me more
about the application?
- cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump? Would really appreciate input
http://spark.apache.org/docs/latest/
Follow this. It's easy to get started. Use the prebuilt version of Spark for
now :D
On Thu, Jan 22, 2015 at 5:06 PM, Sudipta Banerjee
asudipta.baner...@gmail.com wrote:
Hi Apache-Spark team ,
What are the system requirements for installing Hadoop and Apache
Thanks Xiangrui Meng will try this.
And, found this https://github.com/kaushikranjan/knnJoin also.
Will this work with double data ? Can we find out z value of
*Vector(10.3,4.5,3,5)* ?
On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng men...@gmail.com wrote:
For large datasets, you need
Hate to do this...but...erm...bump? Would really appreciate input from others
using Streaming. Or at least some docs that would tell me if these are expected
or not.
From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015
Hi TD,
Here's some information:
1. Cluster has one standalone master, 4 workers. Workers are co-hosted with
Apache Cassandra. Master is set up with external Zookeeper.
2. Each machine has 2 cores and 4GB of ram. This is for testing. All machines
are vmware vms. Spark has 2GB dedicated to it on
Hi Apache-Spark team ,
What are the system requirements for installing Hadoop and Apache Spark?
I have attached the screen shot of Gparted.
Thanks and regards,
Sudipta
--
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099
Hi,
Let me reword your request so you understand how (too) generic your question
is
Hi, I have $10,000, please find me some means of transportation so I can get
to work.
Please provide (a lot) more details. If you can't, consider using one of the
pre-built express VMs from either
Hi,
I'm also trying to use the insertInto method, but end up getting the
assertion error.
Is there any workaround to this?
rgds
Hi
There are many different variants of gradient descent mostly dealing with what
the step size is and how it might be adjusted as the algorithm proceeds. Also
if it uses a stochastic variant (as opposed to batch descent) then there are
variations there too. I don’t know off-hand what MLlib’s
I have a problem running the Spark shell on Windows 7. I took the following
steps:
1. downloaded and installed Scala 2.11.5
2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git
3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean
package (in git bash)
Hi All,
I'm using the saveAsNewAPIHadoopFile API to write SequenceFiles but I'm
getting the following runtime exception:
*Exception in thread "main" org.apache.spark.SparkException: Task not
serializable*
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
Hello,
I try to execute a simple program that runs the ShortestPaths algorithm
(org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph.
I use Spark 1.2.0 downloaded from spark.apache.org.
The program's code is the following:
object GraphXGridSP {
def main(args : Array[String])
First as an aside I am pretty sure you cannot reuse one Text and
IntWritable object here. Spark does not necessarily finish with one's value
before the next call(). Although it should not be directly related to the
serialization problem I suspect it is. Your function is not serializable
since it
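(As an illustration of the point above, a minimal sketch that builds the Writables inside the closure instead of reusing shared instances, and keeps non-serializable objects out of the captured scope. The word-count pair type and the output path are assumptions for the example.)

import org.apache.spark.SparkContext._
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

// Sketch: create fresh Writables per record inside the function, so no shared
// mutable instance (and nothing non-serializable) is captured by the closure.
val counts: org.apache.spark.rdd.RDD[(String, Int)] = ???  // placeholder input
counts
  .map { case (word, n) => (new Text(word), new IntWritable(n)) }
  .saveAsNewAPIHadoopFile[SequenceFileOutputFormat[Text, IntWritable]]("hdfs:///tmp/wordcounts")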
Ok, thanks for the clarifications. I didn't know this list has to remain
as the only official list.
Nabble is really not the best solution in the world, but we're stuck
with it, I guess.
That's it from me on this subject.
Petar
On 22.1.2015. 3:55, Nicholas Chammas wrote:
I think a few
Yeah, it worked like a charm!! Thank you!
On Thu, Jan 22, 2015 at 2:28 PM, Sean Owen so...@cloudera.com wrote:
First as an aside I am pretty sure you cannot reuse one Text and
IntWritable object here. Spark does not necessarily finish with one's value
before the next call(). Although it should
But voting is done on dev list, right? That could stay there...
Overlay might be a fine solution, too, but that still gives two user
lists (SO and Nabble+overlay).
On 22.1.2015. 10:42, Sean Owen wrote:
Yes, there is some project business like votes of record on releases
that needs to be
Update: I deployed a standalone Spark on localhost, then set the master as
spark://localhost:7077, and it hit the same issue.
I don't know how to solve it.
Yes, there is some project business like votes of record on releases that
needs to be carried on in standard, simple accessible place and SO is not
at all suitable.
Nobody is stuck with Nabble. The suggestion is to enable a different
overlay on the existing list. SO remains a place you can ask
Hi Ashic Mahtab,
Are Cassandra and ZooKeeper installed as part of the YARN
architecture, or are they installed in a separate layer with Apache Spark?
Thanks and Regards,
Sudipta
On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote:
Hi Guys,
So I changed the interval
I have been contributing to SO for a while now. Here are a few
observations I'd like to contribute to the discussion:
The level of questions on SO is often more entry-level. Harder
questions (that require expertise in a certain area) remain unanswered for
a while. Same questions here on the
If you are using CDH, you would be shutting down services with
Cloudera Manager. I believe you can do it manually using Linux
'services' if you do the steps correctly across your whole cluster.
I'm not sure if the stock stop-all.sh script is supposed to work.
Certainly, if you are using CM, by far
You are right that this isn't implemented. I presume you could propose
a PR for this. The impurity calculator implementations already receive
category counts. The only drawback I see is having to store N
probabilities at each leaf, not 1.
On Wed, Jan 21, 2015 at 3:36 PM, Zsolt Tóth
I agree with Sean that a Spark-specific Stack Exchange likely won't help
and almost certainly won't make it out of Area 51. The idea certainly
sounds nice from our perspective as Spark users, but it doesn't mesh with
the structure of Stack Exchange or the criteria for creating new sites.
On Thu
Sean
You said
Ø If you know that this number is too high you can request a number of
partitions when you read it.
How to do that? Can you give a code snippet? I want to read it into 8
partitions, so I do
val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
we could implement some ‘load balancing’ policies:
I think Gerard’s suggestions are good. We need some “official” buy-in from
the project’s maintainers and heavy contributors and we should move forward
with them.
I know that at least Josh Rosen, Sean Owen, and Tathagata Das, who are
active on
I am not sure if you get the same exception as I do -- spark-shell2.cmd
works fine for me. Windows 7 as well. I've never bothered looking to fix it
as it seems spark-shell just calls spark-shell2 anyway...
On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com
wrote:
I have a
Another quick question... I've got 4 nodes with 2 cores each. I've assigned the
streaming app 4 cores. It seems to be using one per node. I imagine forwarding
from the receivers to the executors is causing unnecessary processing. Is
there a way to specify that I want 2 cores from the same
Yes, this isn't a well-formed question, and got maybe the response it
deserved, but the tone is veering off the rails. I just got a much
ruder reply from Sudipta privately, which I will not forward. Sudipta,
I suggest you take the responses you've gotten so far as about as much
answer as can be
Hi
I get this exception when I run a Spark test case on my local machine:
An exception or error caused a run to abort:
Venkat,
No problem!
So, creating a custom InputFormat or using sc.binaryFiles alone is not the
right solution. We also need the modified version of RDD.pipe to support
binary data? Is my understanding correct?
Yep! That is correct. The custom InputFormat allows Spark to load binary
Thank you Jerry,
Does the window operation create new RDDs for each slide duration?
I am asking this because I see a constant increase in memory even when
no logs are received.
If not checkpointing, is there any alternative that you would suggest?
On Tue, Jan 20, 2015 at 7:08 PM,
Hi Sudipta,
Standalone spark master. Separate Zookeeper cluster. 4 worker nodes with
cassandra + spark on each. No hadoop / hdfs / yarn.
Regards,
Ashic.
Date: Thu, 22 Jan 2015 20:42:43 +0530
Subject: Re: Are these numbers abnormal for spark streaming?
From: asudipta.baner...@gmail.com
To:
Given that the process, and in particular the setup of connections, is
bound to the number of partitions (in x.foreachPartition{ x => ??? }), I
think it would be worth trying to reduce them.
Increasing 'spark.streaming.blockInterval' will do the trick (you can
read the tuning details here:
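(The tuning link above is cut off. For reference, a minimal sketch of raising the block interval; in Spark 1.x the value is in milliseconds, and the 1000 ms / 15 s figures here are only examples. Fewer blocks per batch means fewer partitions per batch.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: a larger block interval produces fewer, larger blocks per batch,
// which in turn means fewer partitions (and fewer connection setups).
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.streaming.blockInterval", "1000")  // milliseconds in Spark 1.x
val ssc = new StreamingContext(conf, Seconds(15))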
How much time would it take to port it?
Spark committers: Please let us know your thoughts.
Regards,
Venkat
From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu]
Sent: Thursday, January 22, 2015 9:08 AM
To: Venkat, Ankam
Cc: Nick Allen; user@spark.apache.org
Subject: Re: How to 'Pipe' Binary
I'm trying to process a large dataset; mapping/filtering works OK, but
as soon as I try to reduceByKey, I get out of memory errors:
http://pastebin.com/70M5d0Bn
Any ideas how I can fix that?
Thanks.
Hey Andrew,
Thanks for the response. Is this the issue you're referring to (the
duplicate linked there has an associated patch):
https://issues.apache.org/jira/browse/SPARK-5162 ?
Just to confirm that I understand this: with this patch, Python jobs can be
submitted to YARN, and a node from the
Hi Guys,
So I changed the interval to 15 seconds. There are obviously a lot more messages
per batch, but (I think) it looks a lot healthier. Can you see any major
warning signs? I think that with 2 second intervals, the setup / teardown per
partition was what was causing the delays.
Hi Marco,
Thanks for the confirmation. Please let me know what further details you
need to answer a very specific question: WHAT IS THE MINIMUM
HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN on a
system? Please let me know if you need any further information and if
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761
secs.
I think there's evidence that setup costs are quite high in this case and
increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee
asudipta.baner...@gmail.com wrote:
Hi Ashic
Yup...looks like it. I can do some tricks to reduce setup costs further, but
this is much better than where I was yesterday. Thanks for your awesome input :)
-Ashic.
From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 16:34:38 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
Sudipta,
Use the Docker image [1] and play around with Hadoop and Spark in the VM for a
while.
Decide on your use case(s) and then you can move ahead for installing on a
cluster, etc.
This Docker image has all you want [HDFS + MapReduce + Spark + YARN].
All the best!
[1]:
Thanks Frank for your response.
So, creating a custom InputFormat or using sc.binaryFiles alone is not the
right solution. We also need the modified version of RDD.pipe to support
binary data? Is my understanding correct?
If yes, this can be added as new enhancement Jira request?
Nick:
Hi Sudipta,
I would also suggest asking this question on the Cloudera mailing list
since you have HDFS, MAPREDUCE and YARN requirements. Spark can work with
HDFS and YARN but it is more like a client to those clusters. Cloudera can
provide services to answer your question more clearly. I'm
I looked into the namenode log and found this message:
2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header or
version mismatch from 10.33.140.233:53776 got version 9 expected version 4
What should I do to fix this?
Thanks.
Ey-Chih
From: eyc...@hotmail.com
To:
Hi Kane-
http://spark.apache.org/docs/latest/tuning.html has excellent information that
may be helpful. In particular increasing the number of tasks may help, as well
as confirming that you don’t have more data than you're expecting landing on a
key.
Also, if you are using spark 1.2.0,
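(On the last point, a quick sketch for checking whether a single key is receiving far more records than expected; `pairs` is a placeholder for the pair RDD that feeds reduceByKey.)

import org.apache.spark.SparkContext._

// Sketch: count records per key and look at the heaviest keys.
val pairs: org.apache.spark.rdd.RDD[(String, String)] = ???  // placeholder
val keyCounts = pairs.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)
keyCounts.top(10)(Ordering.by(_._2)).foreach(println)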
One output file is produced per partition. If you want fewer, use
coalesce() before saving the RDD.
On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote:
How I can reduce number of output files? Is there a parameter to
saveAsTextFile?
Thanks.
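(A minimal sketch of that suggestion; the output path is only an example, and `result` stands for whatever RDD the job produces.)

// Sketch: coalesce to a single partition so only one part file is written.
val result: org.apache.spark.rdd.RDD[String] = ???  // placeholder
result.coalesce(1).saveAsTextFile("hdfs:///tmp/output")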
Hi All,
I wrote a custom reader to read a DB, and it is able to return the key and value
as expected, but after it finished it never returned to the driver.
here is output of worker log :
15/01/23 15:51:38 INFO worker.ExecutorRunner: Launch command: java -cp
I'm trying to process 5TB of data, not doing anything fancy, just
map/filter and reduceByKey. I spent the whole day today trying to get it
processed, but never succeeded. I've tried to deploy to EC2 with the
script provided with spark on pretty beefy machines (100 r3.2xlarge
nodes). Really frustrated
Hi Devan and Xiangrui,
Can you please explain the cost and optimization function of the KNN
algorithm that is being used?
Thanks and Regards,
Sudipta
On Thu, Jan 22, 2015 at 6:59 PM, DEVAN M.S. msdeva...@gmail.com wrote:
Thanks Xiangrui Meng will try this.
And, found this
I am following the repo on github about pyspark cassandra connector at
https://github.com/Parsely/pyspark-cassandra
On executing the line :
./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py run test
It ends up with an exception:
ERROR Executor: Exception in task 9.0 in
It means your client app is using Hadoop 2.x and your HDFS is Hadoop 1.x.
On Thu, Jan 22, 2015 at 10:32 PM, ey-chih chow eyc...@hotmail.com wrote:
I looked into the namenode log and found this message:
2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header
or version
At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote:
I try to execute a simple program that runs the ShortestPaths algorithm
(org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph.
I use Spark 1.2.0 downloaded from spark.apache.org.
This program runs more than 2
Hi,
histogram() returns an object that is a pair of Arrays. There appears to be
no saveAsTextFile() for this paired object.
Currently I am using the following to save the output to a file:
val hist = a.histogram(10)
val arr1 = sc.parallelize(hist._1).saveAsTextFile(file1)
val arr2 =
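(If a single output is preferred over two files, one sketch — the output path is only an example — is to zip the bucket boundaries and counts together before saving; buckets has one more element than counts, and zip pairs each bucket's lower bound with its count.)

// Sketch: save "bucketLowerBound,count" lines in a single output.
val (buckets, counts) = a.histogram(10)   // (Array[Double], Array[Long])
val lines = buckets.zip(counts).map { case (lower, n) => s"$lower,$n" }
sc.parallelize(lines.toSeq).saveAsTextFile("file:///tmp/histogram")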
Hi,
A new RDD will be created in each slide duration; if there's no data coming, an
empty RDD will be generated.
I'm not sure there's a way to alleviate your problem from the Spark side. Does your
application design have to build such a large window? Can you change your
implementation if it is easy
Spark can definitely process data with optional fields. It kinda depends
on what you want to do with the results -- it's more of an object design /
knowing Scala types question.
E.g., Scala has a built-in type Option specifically for handling optional
data, which works nicely with pattern matching.
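(As a small illustration of that suggestion — the field names and record format here are made up — parsing a record with an optional field into an Option and pattern matching on it:)

// Sketch: model an optional field with Option and pattern match on it.
case class Record(id: String, score: Option[Double])

def parse(line: String): Record = {
  val parts = line.split(",", -1)
  val score =
    if (parts.length > 1 && parts(1).nonEmpty) Some(parts(1).toDouble) else None
  Record(parts(0), score)
}

parse("abc,3.5") match {
  case Record(id, Some(s)) => println(s"$id has score $s")
  case Record(id, None)    => println(s"$id has no score")
}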
Hi,
My team is using Spark 1.0.1 and the project we're working on needs to
compute exact numbers, which are then saved to S3, to be reused later in
other Spark jobs to compute other numbers. The problem we noticed yesterday:
one of the output partition files in S3 was missing :/ (some
Also, Setting spark.locality.wait=100 did not work for me.
I think you should also just be able to provide an input format that never
splits the input data. This has come up before on the list, but I couldn't
find it.
I think this should work, but I can't try it out at the moment. Can you
please try and let us know if it works?
class
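(The class definition above is cut off. A minimal sketch of the kind of input format being described — assuming the new Hadoop API and text input — simply refuses to split files:)

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Sketch: an input format that never splits its input, so each file is
// read as a single split.
class NonSplittingTextInputFormat extends TextInputFormat {
  override def isSplitable(context: JobContext, file: Path): Boolean = false
}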
Thanks. But after I replace the Maven dependency from
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.5.0-cdh5.2.0</version>
rdd.coalesce(1) will coalesce the RDD and give only one output file;
coalesce(2) will give 2, and so on.
On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote:
One output file is produced per partition. If you want fewer, use
coalesce() before saving the RDD.
On Thu, Jan 22, 2015 at 10:46
You need to install these libraries on all the slaves, or submit via
spark-submit:
spark-submit --py-files xxx
On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote:
Hi,
I might be asking something very trivial, but what's the recommended way of
using third-party libraries?
Did you try it with a smaller subset of the data first?
On 23 Jan 2015 05:54, Kane Kim kane.ist...@gmail.com wrote:
I'm trying to process 5TB of data, not doing anything fancy, just
map/filter and reduceByKey. I spent the whole day today trying to get it
processed, but never succeeded. I've
Yes, that second argument is what I was referring to, but yes it's a
*minimum*, oops, right. OK, you will want to coalesce then, indeed.
On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Ø If you know that this number is too high you can request a
I posted a question on Stack Overflow and haven't gotten any answer yet.
http://stackoverflow.com/questions/28079037/how-to-make-spark-partition-sticky-i-e-stay-with-node
Is there a way to make a partition stay with a node in Spark Streaming? I
need this since I have to load a large amount
Hi,
I might be asking something very trivial, but what's the recommended way of
using third-party libraries?
I am using tables to read an HDF5 format file.
And here is the error trace:
print rdd.take(2)
File /tmp/spark/python/pyspark/rdd.py, line , in take
res =
On Thu, Jan 22, 2015 at 10:21 AM, Sean Owen so...@cloudera.com wrote:
I think a Spark site would have a lot less traffic. One annoyance is
that people can't figure out when to post on SO vs Data Science vs
Cross Validated.
Another is that a lot of the discussions we see on the Spark users
list
Hi, I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast
method. When I call the broadcast function on a small dataset on a 5-node
cluster, I experience the "Error sending message as driverActor is null"
error after broadcasting the variables several times (apps running under JBoss).
I deployed Spark Streaming applications to a standalone cluster. After a cluster
restart, all the deployed applications are gone and I cannot see any
applications through the Spark Web UI.
How can I make the Spark Streaming applications durable so that they auto-restart
after a cluster restart?
Maybe you are using the wrong approach - try something like HyperLogLog or bitmap
structures, as you can find them, for instance, in Redis. They are much
smaller.
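(If staying inside Spark is an option, a related sketch: RDDs already expose a HyperLogLog-based approximate distinct count, which avoids holding all the values in memory. `ids` is a placeholder input and 0.05 is just an example relative error.)

// Sketch: approximate distinct count backed by HyperLogLog.
val ids: org.apache.spark.rdd.RDD[String] = ???  // placeholder
val approxDistinct = ids.countApproxDistinct(relativeSD = 0.05)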
On 22 Jan 2015 17:19, Balakrishnan Narendran balu.na...@gmail.com
wrote:
Thank you Jerry,
Does the window operation create new
Nick,
Have you tried https://github.com/kaitoy/pcap4j
I’ve used this in a Spark app already and didn’t have any issues. My use case
was slightly different than yours, but you should give it a try.
From: Nick Allen n...@nickallen.orgmailto:n...@nickallen.org
Date: Friday, January 16, 2015 at
I have downloaded spark-1.2.0.tgz on each of my nodes and executed ./sbt/sbt
assembly on each of them. Then I execute ./sbin/start-master.sh on my master
and ./bin/spark-class org.apache.spark.deploy.worker.Worker
spark://IP:PORT.
However, when I go to http://localhost:8080 I cannot see any
NoSuchMethodError almost always means that you have compiled some code
against one version of a library but are running against another. I
wonder if you are including different versions of Spark in your
project, or running against a cluster on an older version?
On Thu, Jan 22, 2015 at 3:57 PM,
Folks,
Just a gentle reminder we owe to ourselves:
- this is a public forum and we need to behave accordingly; it is not a place to
vent frustration in a rude way
- getting attention here is an earned privilege and not entitlement
- this is not a “Platinum Support” department of your vendor
Love it!
There is a reason why SO is so effective and popular. Search is excellent,
you can quickly find very thoughtful answers about sometimes thorny
problems, and it is easy to contribute, format code, etc. Perhaps the most
useful feature is that the best answers naturally bubble up to the
FWIW I am a moderator for datascience.stackexchange.com, and even that
hasn't really achieved the critical mass that SE sites are supposed
to: http://area51.stackexchange.com/proposals/55053/data-science
I think a Spark site would have a lot less traffic. One annoyance is
that people can't figure
I use spark 1.1.0-SNAPSHOT and the test I'm running is in local mode. My test
case uses org.apache.spark.streaming.TestSuiteBase
val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided"
excludeAll(
val sparkStreaming = "org.apache.spark" % "spark-streaming_2.10" %
"1.1.0-SNAPSHOT" %
I use spark 1.1.0-SNAPSHOT
val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided"
excludeAll(
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: January-22-15 11:39 AM
To: Adrian Mocanu
Cc: u...@spark.incubator.apache.org
Subject: Re: Exception:
Hi Nicos, Taking forward your argument,please be a smart a$$ and dont use
unprofessional language just for the sake of being a moderator.
Paco Nathan is respected for the dignity he carries in sharing his
knowledge and making it available free for a$$es like us right!
So just mind your tongue next
Thank you very much Marco! Really appreciate your support.
On Thu, Jan 22, 2015 at 10:57 PM, Marco Shaw marco.s...@gmail.com wrote:
(Starting over...)
The best place to look for the requirements would be at the individual
pages of each technology.
As for absolute minimum requirements, I
Sudipta - Please don't ever come here or post here again.
On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee
asudipta.baner...@gmail.com wrote:
Hi Nicos, Taking forward your argument,please be a smart a$$ and dont use
unprofessional language just for the sake of being a moderator.
Paco Nathan
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're
missing --master. In addition, it's a good idea to pass --master
exactly the Spark master's endpoint as shown on your UI under
http://localhost:8080. But that should do it. If that's not working, you
can look at the
Hi,
I wanted to understand how a join on two pair RDDs works. Would it result
in shuffling data from both RDDs with the same key into the same partition? If
that is the case, would it be better to use the partitionBy function to
partition (by the join attribute) the RDDs at creation for less
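(A sketch of the idea in the question — the names and partition count are examples: pre-partition both pair RDDs with the same partitioner so the join does not need to re-shuffle either side.)

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

// Sketch: co-partition both pair RDDs by key before joining.
val leftRdd: org.apache.spark.rdd.RDD[(String, Int)] = ???     // placeholder
val rightRdd: org.apache.spark.rdd.RDD[(String, String)] = ??? // placeholder
val partitioner = new HashPartitioner(8)
val left  = leftRdd.partitionBy(partitioner).cache()
val right = rightRdd.partitionBy(partitioner).cache()
val joined = left.join(right)  // both sides already hash-partitioned by key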
I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast
method. When I call the broadcast function on a small dataset on a 5-node
cluster, I experience the "Error sending message as driverActor is null"
error after broadcasting the variables several times (apps running under JBoss).
Any
Sudipta, with all due respect... "don't respond to me if you don't like
what I say" is not the same as not being a jerk about it. One earns social
capital by being respectful and by respecting the social norms during
interaction; by everything I've seen, you've been demanding and
disrespectful
(Starting over...)
The best place to look for the requirements would be at the individual
pages of each technology.
As for absolute minimum requirements, I would suggest 50GB of disk space
and at least 8GB of memory. This is the absolute minimum.
Architecting a solution like you are looking
+1
On 22.1.2015 18:30, Marco Shaw wrote:
Sudipta - Please don't ever come here or post here again.
On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee
asudipta.baner...@gmail.com mailto:asudipta.baner...@gmail.com wrote:
Hi Nicos, Taking forward your argument,please be a smart a$$ and
Dont ever reply to my queries :D
On Thu, Jan 22, 2015 at 11:02 PM, Lukas Nalezenec
lukas.naleze...@firma.seznam.cz wrote:
+1
On 22.1.2015 18:30, Marco Shaw wrote:
Sudipta - Please don't ever come here or post here again.
On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee
Python couldn't find your module. Do you have that on each worker node? You
will need to have that on each one
--- Original Message ---
From: Davies Liu dav...@databricks.com
Sent: January 22, 2015 9:12 PM
To: Mohit Singh mohit1...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Using third
Often when this happens to me, it is actually an exception parsing a few
messages. It is easy to miss this, as error messages aren't always informative. I
was ready to blame Spark, but in reality it was missing fields in a CSV file.
As has been said, make a file with a few records and see if your job