Why don't you apply the filter first and then group the data and run the
aggregations?
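For example (a sketch; df and the column names are hypothetical):

import org.apache.spark.sql.functions._
// Filter first so the shuffle for the groupBy only moves surviving rows,
// then group and aggregate.
val result = df.filter(col("Subject") === "SEC Inquiry")
  .groupBy(col("From"))
  .agg(count(col("To")))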
On Jul 3, 2015 1:29 PM, Megha Sridhar- Cynepia megha.sridh...@cynepia.com
wrote:
Hi,
I have a Spark DataFrame object, which when trimmed, looks like,
From    To    Subject
I have this problem with a job. A random executor gets this
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
Almost always at the same point in the processing of the data. I am
processing 1 million files with sc.wholeTextFiles. At around the 600,000th file, a
container receives
You cannot update a broadcast variable; the change won't get reflected on the
workers.
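If you need the workers to pick up changes, one common workaround (a sketch;
fetchFilterConfig and the matches method are hypothetical) is to publish a
fresh broadcast from the driver on each batch:

// Rebroadcast from the driver every batch so workers see the latest value.
var filterBc = ssc.sparkContext.broadcast(fetchFilterConfig())
dstream.foreachRDD { rdd =>
  filterBc.unpersist()  // discard the stale copy on the executors
  filterBc = ssc.sparkContext.broadcast(fetchFilterConfig())
  val bc = filterBc     // stable reference for the closure
  rdd.filter(event => bc.value.matches(event)).count()
}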
On Jul 3, 2015 12:18 PM, James Cole ja...@binarism.net wrote:
Hi all,
I'm filtering a DStream using a function. I need to be able to change this
function while the application is running (I'm polling a service to
This is the basic design of Spark: it runs all actions in different
stages...
Not sure you can achieve what you are looking for.
On Jul 3, 2015 12:43 PM, Marius Danciu marius.dan...@gmail.com wrote:
Hi all,
If I have something like:
rdd.join(...).mapPartitionToPair(...)
It looks like
Kryo serialization is used internally by Spark for spilling or shuffling
intermediate results, not for writing out an RDD as an action. Look at Sandy
Ryza's examples for some hints on how to do this:
https://github.com/sryza/simplesparkavroapp
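The gist of it (a sketch, not the app verbatim; records is assumed to be an
RDD[GenericRecord] and schema its Avro schema):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
AvroJob.setOutputKeySchema(job, schema)
records.map(r => (new AvroKey[GenericRecord](r), NullWritable.get()))
  .saveAsNewAPIHadoopFile(
    "hdfs:///tmp/output-avro",  // output path is hypothetical
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration)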
Regards,
Will
On July 3, 2015, at 2:45 AM,
Hello all,
I am learning Scala/Spark and going through some applications with data I have.
Please allow me to ask a couple of questions:
spark-csv: The data I have isn't malformed, but there are empty values in some
rows, properly comma-separated and not caught by DROPMALFORMED mode
These
It’ll help to see the code or at least understand what transformations you’re
using.
Also, you have 15 nodes but are not using all of them, which means you may be
losing data locality. You can see this in the Spark job UI if any tasks do
not run at NODE_LOCAL or PROCESS_LOCAL locality.
From: diplomatic Guru
Alternatively, setting spark.driver.extraClassPath should work.
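For example (jar paths hypothetical), when launching the thrift server:
./sbin/start-thriftserver.sh --conf spark.driver.extraClassPath=/path/to/hadoop-azure.jar:/path/to/azure-storage.jar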
Cheers
On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran ste...@hortonworks.com
wrote:
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it
Hello guys,
I'm after some advice on Spark performance.
I have a MapReduce job that reads inputs, carries out a simple calculation and
writes the results into HDFS. I've implemented the same logic in a Spark job.
When I tried both jobs on the same datasets, I got different execution
times, which is
updateStateByKey will run for all keys, whether or not they have new data in a
batch, so you should still be able to use it.
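For example (a sketch assuming Int counts; wordCounts is hypothetical): the
update function fires for every known key each batch, even with an empty Seq
of new values, so a global decay is possible:

val decayed = wordCounts.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  val next = state.getOrElse(0) + newValues.sum - 1  // decrease every word by 1
  if (next > 0) Some(next) else None                 // None drops the key from the state
}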
On 7/3/15, 7:34 AM, micvog mich...@micvog.com wrote:
UpdateStateByKey is useful, but what if I want to perform an operation on all
existing keys (not only the ones in
Hello,
I'm having trouble with float type coercion on SparkR with hiveContext.
result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")
show(result)
DataFrame[offset:float, percentage:float]
head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot
Hi all,
Do you know if there is an option to specify how many replicas we want when
caching a table in memory in the SparkSQL Thrift server? I have not seen
any option so far, but I assumed there is one, as you can see in the
Storage section of the UI that there is 1 x replica of your
https://issues.apache.org/jira/browse/SPARK-8817
On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers ko...@tresata.com wrote:
I see the relaxation to allow duplicate field names was done on purpose,
since some data sources can have dupes due to case-insensitive resolution.
Apparently the issue is
Hi all,
I'm wondering if anybody has any experience with centralised logging for
Spark - or even has felt that there was a need for this given the WebUI.
At my organization we use Log4j2 and Flume as the front end of our
centralised logging system. I was looking into modifying Spark to use that
You need to add -Psparkr to build SparkR code
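For example, Akhil's command below becomes:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Psparkr -DskipTests clean package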
Shivaram
On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Did you try:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Thanks
Best Regards
On Fri, Jul 3, 2015 at 2:27 PM,
I see the relaxation to allow duplicate field names was done on purpose,
since some data sources can have dupes due to case-insensitive resolution.
Apparently the issue is now dealt with during query analysis.
Although this might work for SQL, it does not seem like a good thing for
DataFrame to me. It
@bastien, in those situations, I prefer to use Unix timestamps (millisecond
or second granularity) because you can apply math operations to them easily.
If you don't have a Unix timestamp, you can use unix_timestamp() from Hive
SQL to get one with second granularity. Then grouping by hour
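A sketch of that last step (table and column names hypothetical), via a HiveContext:

// Truncate second-granularity timestamps to hour buckets, then group on the bucket.
val byHour = hiveContext.sql(
  "SELECT floor(unix_timestamp(event_time) / 3600) AS hour_bucket, count(*) AS n " +
  "FROM events GROUP BY floor(unix_timestamp(event_time) / 3600)")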
Thanks for your feedback. Yes, I am aware of the stage design, and Silvio,
what you are describing is essentially a map-side join, which is not
applicable when both RDDs are quite large.
It appears that with
rdd.join(...).mapToPair(f)
f is piggybacked inside the join stage (right in the reducers, I believe).
Hello,
I am trying to just start spark-shell... it starts, the prompt appears,
then a never-ending (literally) stream of these log lines proceeds.
What is it trying to do? Why is it failing?
To start it I do:
$ docker run -it ncssm/spark-base /spark/bin/spark-shell --master spark://
I'm writing a Spark Streaming application that uses RabbitMQ to consume
events. One feature of RabbitMQ that I intend to make use of is bulk ack of
messages, i.e. no need to ack one-by-one, but only ack the last event in a
batch and that would ack the entire batch.
Before I commit to doing so,
No. SparkR is a language binding for Spark. MLlib is the machine learning
project on top of Spark core.
On 4 Jul 2015 12:23, praveen S mylogi...@gmail.com wrote:
Hi,
Are SparkR and Spark MLlib the same?
Hi,
Are SparkR and Spark MLlib the same?
Ted,
Thanks very much for your reply. It took me almost a week but I have
finally had a chance to implement what you noted and it appears to be
working locally. However, when I launch this onto a cluster on EC2 -- this
doesn't work reliably.
To expand, I think the issue is that some of the code
I don't think you can expect any ordering guarantee except for the records
within one partition.
On Jul 4, 2015 7:43 AM, khaledh khal...@gmail.com wrote:
I'm writing a Spark Streaming application that uses RabbitMQ to consume
events. One feature of RabbitMQ that I intend to make use of is bulk ack of
How many partitions do you have? It might be that one partition is too
large and there is an integer overflow. Could you double your number of
partitions?
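For example (labeledPoints stands for your input RDD):
val repartitioned = labeledPoints.repartition(labeledPoints.partitions.size * 2)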
Burak
On Fri, Jul 3, 2015 at 4:41 AM, Danny kont...@dannylinden.de wrote:
Hi,
I want to run a multiclass classification with 390 classes
With the binary I think it might not be possible, although you can download
the sources, remove this function
https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023
which initializes the SQLContext, and then build it.
The main reason is Spark's startup time and the need to configure a component I
don't really need (without configs the HiveContext takes more time to load).
Thanks,
Daniel
On 3 July 2015, at 11:13, Robin East robin.e...@xense.co.uk wrote:
As Akhil mentioned there isn’t AFAIK any kind of
Hi all,
If I have something like:
rdd.join(...).mapPartitionToPair(...)
It looks like mapPartitionToPair runs in a different stage than join. Is
there a way to piggyback this computation inside the join stage? ... such
that each result partition after the join is passed to
the mapPartitionToPair
I think you can open up a JIRA; not sure if this PR
https://github.com/apache/spark/pull/2209/files (SPARK-2890
https://issues.apache.org/jira/browse/SPARK-2890) broke the validation
piece.
Thanks
Best Regards
On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote:
i am
Hi,
I have a Spark DataFrame object, which when trimmed, looks like,
From            To                              Subject        Message-ID
karen@xyz.com   ['vance.me...@enron.com',       SEC Inquiry    19952575.1075858
                'jeannie.mandel...@enron.com',
HiveContext should be a superset of SQLContext, so you should be able to
perform all your tasks. Are you facing any problems with HiveContext?
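For instance (a sketch; sc is the usual SparkContext):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// HiveContext extends SQLContext, so code written against SQLContext
// (sql, read, createDataFrame, ...) works unchanged.
hiveContext.sql("SELECT 1").show()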
On 3 Jul 2015 17:33, Daniel Haviv daniel.ha...@veracity-group.com wrote:
Thanks
I was looking for a less hack-ish way :)
Daniel
On Fri, Jul 3, 2015 at
Hi all,
Has anyone built the Spark 1.4 source code for SparkR with Maven/sbt? What's
the command? To use SparkR you must build it from the 1.4 source.
Thank you
1106944...@qq.com
Hello Spark developers,
After upgrading to Spark 1.4 on Mesos 0.22.1, existing code started to throw
this exception when calling sparkContext.stop:
(SparkListenerBus) [ERROR -
org.apache.spark.Logging$class.logError(Logging.scala:96)] Listener
EventLoggingListener threw an exception
Thanks
I was looking for a less hack-ish way :)
Daniel
On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
With the binary I think it might not be possible, although you can download
the sources and then build them after removing this function
Hi,
I want to run a multiclass classification with 390 classes on 120k labeled
points (tf-idf vectors), but I get the following exception. If I reduce the
number of classes to ~20, everything works fine. How can I fix this?
I use the LogisticRegressionWithLBFGS class for my classification on an 8
I have a rather simple Avro schema to serialize tweets (message, username,
timestamp).
Kryo and Twitter chill are used to do so.
For my dev environment the Spark context is configured as below
val conf: SparkConf = new SparkConf()
conf.setAppName("kryo_test")
conf.setMaster("local[4]")
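For reference, a sketch of the registration step that usually goes with this
(Tweet stands for your tweet class):
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Tweet]))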
Hi Salih,
Thanks for the links :) This seems very promising to me.
When do you think this would be available in the Spark codebase?
Thanks,
Suraj
On Fri, Jul 3, 2015 at 2:02 AM, Salih Oztop soz...@yahoo.com wrote:
Hi Suraj,
It seems your requirement is Record Linkage/Entity Resolution.
Hi all,
I'm filtering a DStream using a function. I need to be able to change this
function while the application is running (I'm polling a service to see if
a user has changed their filtering). The filter function is a
transformation and runs on the workers, so that's where the updates need to
In the driver when running spark-submit with --master yarn-client
On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com
wrote:
Where does it return null? Within the driver or in the executor? I just
tried System.console.readPassword in spark-shell and it worked.
Thanks
Best
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it Azure's blob storage jars,
but I'm failing on:
Caused by: java.io.IOException: No FileSystem for scheme: wasb
Can you paste the code? Something is missing
Thanks
Best Regards
On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote:
In the driver when running spark-submit with --master yarn-client
On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com
wrote:
Where does
Hi, I need to join with multiple conditions. Can anyone tell me how to specify
that? E.g., this is what I am trying to do:
val Lead_all = Leads.
  join(Utm_Master,
    Leaddetails.columns("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign")
    ==
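One way to write it (a sketch using the column names above; combine the
per-column equality conditions with &&):

val Lead_all = Leaddetails.join(Utm_Master,
  Leaddetails("LeadSource") === Utm_Master("LeadSource") &&
  Leaddetails("Utm_Source") === Utm_Master("Utm_Source") &&
  Leaddetails("Utm_Medium") === Utm_Master("Utm_Medium") &&
  Leaddetails("Utm_Campaign") === Utm_Master("Utm_Campaign"))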
Hi,
We have an application that requires a username/password to be entered from
the command line. To hide a password in Java you need to use
System.console.readPassword; however, when running with Spark, System.console
returns null. Any ideas on how to get the console from Spark?
Thanks,
Jem
Did you try:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Thanks
Best Regards
On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com wrote:
Hi all,
Has anyone built the Spark 1.4 source code for SparkR with Maven/sbt? What's
the command? To use
Where does it return null? Within the driver or in the executor? I just
tried System.console.readPassword in spark-shell and it worked.
Thanks
Best Regards
On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote:
Hi,
We have an application that requires a username/password to
I have shown two scenarios below:
// setup spark context
val user = readLine("username: ")
val pass = System.console.readPassword("password: ") // null pointer exception here
and
// setup spark context
val user = readLine("username: ")
val console = System.console // null pointer exception here
UpdateStateByKey is useful, but what if I want to perform an operation on all
existing keys (not only the ones in this RDD)?
Word count, for example - is there a way to decrease *all* words seen so far
by 1?
I was thinking of keeping a static class per node with the count information
and issuing a
Hi,
I want to run a multiclass classification with 390 classes on 120k labeled
points (tf-idf vectors), but I get the following exception. If I reduce the
number of classes to ~20, everything works fine. How can I fix this?
I use the LogisticRegressionWithLBFGS class for my classification on an 8