Hi
I am running Spark 1.4 in Standalone mode on top of Hadoop 2.3 on a CentOS
node. I am trying to run grid search on an RF classifier to classify a small
dataset using the pyspark.ml.tuning module, specifically the
ParamGridBuilder and CrossValidator classes. I get the following error when
I try
I have just upgraded to Spark 1.4.0 and it seems that spark-streaming-kafka
has a dependency on org.spark-project.spark:unused:1.0.0, but it also embeds
that jar in its artifact, causing a problem while creating a fat jar.
This is the error:
[Step 1/1] (*:assembly) deduplicate: different file
Hi,
For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of time
(16+ minutes).
Most of that time is spent in this task:
org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
Can this be
Hi Sean,
Thank you for your quick response. By very little data, do you mean that
the matrix is too sparse? Or are there too few data points? There
are 3856988 ratings in my dataset currently.
Regards,
Benedict
On Mon, Jul 13, 2015 at 7:07 PM, Sean Owen so...@cloudera.com wrote:
Honestly I don't believe this kind of functionality belongs within
spark-jobserver.
For serving of factor-type models, you are typically in the realm of
recommendations or ad-serving scenarios - i.e. needing to score a user /
context against many possible items and return a top-k list of those.
Hi Everyone,
I am developing an application which handles a bulk of data, around millions of
records (this may vary as per the user's requirement). As of now I am using
MS SQL Server as the back-end and it works fine, but when I perform some
operations on large data I get overflow exceptions. I heard about
Spark
Is the data set synthetic, or does it have very few items? Or is it indeed very
sparse? Those could be reasons. However, usually this kind of thing
happens with very small data sets. I could be wrong about what's going
on, but it's a decent guess at the immediate cause given the error
messages.
On Mon, Jul
Hello,
I would like to share an RDD between an application and SparkR.
I understand we have the job-server and the IBM kernel for sharing the context
between different applications, but I am not sure how we can use them with SparkR,
as it is some sort of front end (an R shell) to Spark.
Any insights appreciated.
Hari
Hi,
I have noticed that when StreamingContext.stop is called before any receiver
has started, the context is not really stopped. Watching the logs,
it looks like a stop signal is sent to 0 receivers (because the receivers
have not started yet), and then the receivers are started and the
Even for 2L (200,000) records, MySQL will be better.
Regards,
Sandeep Giri,
+1-253-397-1945 (US)
+91-953-899-8962 (IN)
www.KnowBigData.com
https://linkedin.com/company/knowbigdata
Hi, I am trying to run the MovieALS example with an implicit dataset and am
receiving this error:
Got 3856988 ratings from 144250 users on 378937 movies.
Training: 3085522, test: 771466.
15/07/13 10:43:07 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
15/07/13
I interpret this to mean that the input to the Cholesky decomposition
wasn't positive definite. I think this can happen if the input matrix
is singular or very near singular -- maybe, very little data? Ben that
might at least address why this is happening; different input may work
fine.
Xiangrui
MySQL and PgSQL scale to millions. Spark or any distributed/clustered
computing environment would be inefficient for the kind of data size you
mention. That's because of coordination of processes, moving data around
etc.
On Mon, Jul 13, 2015 at 5:34 PM, Sandeep Giri sand...@knowbigdata.com
wrote:
Konstantinos,
Sure, if you have a resource leak then the collector can't free up
memory and the process will use more memory. Time to break out the
profiler and see where the memory is going.
The usual suspects are handles to resources (open file streams, sockets,
etc) kept in containers
Hi, Michal,
SparkR comes with a JVM backend that supports Java object instantiation,
calling Java instance and static methods from R side. As defined in
https://github.com/apache/spark/blob/master/R/pkg/R/backend.R,
newJObject() is to create an instance of a Java class;
callJMethod() is to call
Hello Rui Sun,
Thanks for your reply.
On reading the file readme.md, in the section Using SparkR from RStudio,
it mentions to set .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R",
"lib"), .libPaths()))
Please tell me how I can set this in a Windows environment? What I mean is
how to setup
Hi Sean,
This user dataset is organic. What do you think is a good ratings threshold
then? I am only encountering this with the implicit type; the
explicit type works fine (though it is not suitable for this
dataset).
Thank you,
Benedict
On Mon, Jul 13, 2015 at 7:15 PM, Sean Owen
Thanks Burak.
Now it takes minutes to repartition;
Active Stages (1) -- from the Spark UI (columns: Stage Id / Description / Submitted /
Duration / Tasks: Succeeded/Total / Input / Output / Shuffle Read / Shuffle Write):
Stage 42 (kill: http://localhost:4040/stages/stage/kill/?id=42&terminate=true) --
repartition at UnsupervisedSparkModelBuilder.java:120
Can it be the limited memory causing this slowness?
On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando nir...@wso2.com wrote:
Thanks Burak.
Now it takes minutes to repartition;
Active Stages (1) (columns: Stage Id / Description / Submitted / Duration / Tasks:
Succeeded/Total / Input / Output / Shuffle Read / Shuffle Write)
The checkpointed RDD computed twice, why not do the checkpoint for the RDD
once it is computed?
Is there any special reason for this?
--
*Regards,*
*Zhaojie*
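A minimal Scala sketch of the usual workaround for the double computation, assuming an already built RDD named data (the name is a placeholder): persisting before checkpointing lets the separate checkpoint job reuse the cached blocks instead of recomputing the lineage.
```
data.cache()        // keep the first materialization in memory
data.checkpoint()   // the checkpoint job can then read the cached blocks instead of recomputing
data.count()        // an action triggers both the computation and the checkpoint write
```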
Hi Deepak,
Not 100% sure, but please try increasing --executor-cores to twice the
number of physical cores on your machine.
Thanks and Regards
Aniruddh
On Tue, Jul 14, 2015 at 9:49 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
It's been 30 minutes and still the partitioner has not
Hi,
I am doing my PhD thesis on large scale machine learning, e.g. online
learning, batch and mini-batch learning.
Could somebody help me with ideas, especially in the context of Spark and
the above learning methods?
Some ideas like improvements to existing algorithms, or implementing new
features
I reduced the number of partitions to 1/4 (to 76) in order to reduce the
time to 1/4 (from 33 to 8), but the repartition is still running beyond 15
mins.
@Nirmal
Clicking on details shows the code lines but does not show why it is slow. I
know that repartition is slow and want to speed it up.
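A hedged aside on the question above (rdd below is a placeholder name, not from the thread): when the goal is only to lower the partition count, coalesce without a shuffle is usually far cheaper than repartition, which always performs a full shuffle.
```
val merged     = rdd.coalesce(76, shuffle = false)  // merges partitions locally, no shuffle
val reshuffled = rdd.repartition(76)                // full shuffle; evens out skew but is slow
```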
If you click on +details you can see the code that takes time. Did
you already check it?
On Tue, Jul 14, 2015 at 9:56 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Job view. Others are fast, but the first one (repartition) is taking 95%
of job run time.
On Mon, Jul 13, 2015 at 9:23 PM,
Hello,
I'm facing a strange behavior in a larger data processing
pipeline consisting of multiple steps involving Spark core and GraphX.
When increasing the network transfer rate in the 5-node cluster from 100
Mbit/s to 1 Gbit/s, the runtime also increases, from around 15 minutes to
19 minutes.
Hello Arun,
Thank you for the descriptive response.
And thank you for providing the sample file too. It certainly is a great
help.
Sincerely,
Ashish
On Mon, Jul 13, 2015 at 10:30 PM, Arun Verma arun.verma...@gmail.com
wrote:
PFA sample file
On Mon, Jul 13, 2015 at 7:37 PM, Arun Verma
Many thanks for your response.
Regards,
Ashish
Hi,
Try this:
Sys.setenv(SPARK_HOME="C:\\spark-1.4.0") # The path to your Spark installation
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR, lib.loc="C:\\spark-1.4.0\\lib") # The path to the lib folder in the Spark location
library(SparkR)
Akhil Das:
Thanks for your reply. I am using exactly the same installation everywhere.
Actually, the spark directory is shared among all nodes, including the
place where I start pyspark. So, I believe this is not the problem.
Regards,
Eduardo
On Mon, Jul 13, 2015 at 3:56 AM, Akhil Das
Hi, all
I am using Spark 1.4, and find that some SQL is not supported,
especially subqueries, such as a subquery in the select items,
in the where clause, and in predicate conditions.
So I want to know whether Spark supports subqueries, or am I using Spark SQL
the wrong way?
If subqueries are not supported, is there a plan
I would probably also look at what kind of analytical use case is to be
served; for example, unification of streaming, batch and machine learning
workloads can be easily achieved in Spark. This is one of the USPs of Spark.
But if SQL is the only use case, and the data volume is 1 million records or
100 GB, I think
Hi,
I would like to know: has any optimization been done for window
functions in Spark SQL?
For example.
select key,
max(value1) over(partition by key) as m1,
max(value2) over(partition by key) as m2,
max(value3) over(partition by key) as m3
from table
The query above creates 3
For the second question:
I am comparing 2 situations of processing a KafkaRDD.
Case I - when I used foreachPartition to process the Kafka stream, I am not able
to see any stream job timing interval like "Time: 142905487 ms"
displayed on the driver console at the start of each stream batch. But it processed
In Jira, it says in progress
https://issues.apache.org/jira/browse/SPARK-4226
On Mon, Jul 13, 2015 at 11:10 PM, Louis Hust louis.h...@gmail.com wrote:
Hi, all
I am using Spark 1.4, and find that some SQL is not supported,
especially subqueries, such as a subquery in the select items,
in the where clause,
Reading from kafka is always going to be bounded by the number of kafka
partitions you have, regardless of what you're using to read it.
If most of your time is coming from calculation, not reading, then yes a
spark repartition will help. If most of your time is coming just from
reading, you
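A minimal Scala sketch of the direct stream plus repartition pattern being discussed; the broker address, topic name, batch interval and partition count are assumed placeholders, and sc is an existing SparkContext.
```
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

// The read parallelism stays bounded by the Kafka partition count;
// repartition only pays off when the per-record computation is the bottleneck.
val widened = stream.repartition(32)
```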
Yeah, I had brought that up a while back, but didn't get agreement on
removing the stub. Seems to be an intermittent problem. You can just add
an exclude:
mergeStrategy in assembly := {
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") =>
    MergeStrategy.first
  case x =>
Regarding your first question, having more partitions than you do executors
usually means you'll have better utilization, because the workload will be
distributed more evenly. There's some degree of per-task overhead, but as
long as you don't have a huge imbalance between number of tasks and
Please can you explain how you set this second step in a Windows
environment?
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
I mean to ask: where do I type this command, at the R prompt or at the command
prompt?
Thanks for your time.
Regards,
Ashish
I had been facing this problem for a long time and this practically forced me
to move to pyspark.
This is what I tried after reading the posts here
Sys.setenv(SPARK_HOME="C:\\spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR,
PFA sample file
On Mon, Jul 13, 2015 at 7:37 PM, Arun Verma arun.verma...@gmail.com wrote:
Hi,
Yes it is. To do it follow these steps;
1. cd spark/installation/path/.../conf
2. cp spark-env.sh.template spark-env.sh
3. vi spark-env.sh
4. SPARK_MASTER_PORT=9000 (or any other available port)
Hi Andrew, here's what I found. Maybe it would be relevant for people with
the same issue:
1) There are 3 types of local resources in YARN (public, private,
application). More about it here:
http://hortonworks.com/blog/management-of-application-dependencies-in-yarn/
2) Spark cache is of
Hi,
Yes it is. To do it follow these steps;
1. cd spark/installation/path/.../conf
2. cp spark-env.sh.template spark-env.sh
3. vi spark-env.sh
4. SPARK_MASTER_PORT=9000 (or any other available port)
PFA sample file. I hope this will help.
On Mon, Jul 13, 2015 at 7:24 PM, ashishdutt
Have you seen https://issues.apache.org/jira/browse/SPARK-6910? I opened
https://issues.apache.org/jira/browse/SPARK-6984, which I think is related
to this as well. There are a bunch of issues attached to it, but basically
yes, Spark interactions with a large metastore are bad...very bad if your
It's been 30 minutes and still the partitioner has not completed yet; it seems
to run for ever.
Without repartition, I see this error
https://issues.apache.org/jira/browse/SPARK-5928
FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028),
shuffleId=1, mapId=0, reduceId=0, message=
A new configuration named spark.streaming.minRememberDuration was added
in 1.2.1 to control the file stream input; the default value is 60
seconds. You can change this value to a larger value to include older files
(older than 1 minute).
You can get the detail from this jira:
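A minimal Scala sketch of setting that configuration; the directory and batch interval are placeholders, and depending on the Spark version the value may need to be plain seconds (as below) or a duration string such as "600s".
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("file-stream")
  .set("spark.streaming.minRememberDuration", "600")   // remember files up to 10 minutes old

val ssc = new StreamingContext(conf, Seconds(30))
val lines = ssc.textFileStream("hdfs:///incoming")      // hypothetical input directory
```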
Thank you Feynman for the lead.
I was able to modify the code using clues from the RegressionMetrics example.
Here is what I got now.
val deviceAggregateLogs =
sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache()
// Calculate statistics based on bytes-transferred
val
Dimensions mismatch when adding new sample. Expecting 8 but got 14.
Make sure all the vectors you are summarizing over have the same dimension.
Why would you want to write a MultivariateOnlineSummarizer object (which can
be represented with a couple of Doubles) into a distributed filesystem like
Hello all,
The configuration of my cluster is as follows:
# 4-node cluster running CentOS 6.4
# spark-1.3.0 installed on all nodes
I would like to use SparkR shipped with spark-1.4.0. I checked Cloudera and
found that the latest release, CDH 5.4, still does not have spark-1.4.0.
Forums like
Hello Feynman,
Actually in my case, the vectors I am summarizing over will not have the same
dimension since many devices will be inactive on some days. This is at best a
sparse matrix where we take only the active days and attempt to fit a moving
average over it.
The reason I would like to
It's a bit hard to tell from the snippets of code, but it's likely related
to the fact that when you serialize instances, the enclosing class, if any,
also gets serialized, as well as any other place where fields used in the
closure come from... e.g. check this discussion:
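A generic Scala sketch of that effect, not the poster's code (class and field names are made up):
```
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class Pipeline(@transient val sc: SparkContext) {   // Pipeline itself is not Serializable
  val threshold = 10

  def runBroken(rdd: RDD[Int]): RDD[Int] =
    rdd.filter(_ > threshold)   // referencing the field captures "this",
                                // so Spark tries to serialize the whole Pipeline

  def runFixed(rdd: RDD[Int]): RDD[Int] = {
    val t = threshold           // copy into a local val: only the Int is captured
    rdd.filter(_ > t)
  }
}
```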
I have to do the following tasks on a dataset using Apache Spark with Scala as
the programming language:
- Read the dataset from HDFS. A few sample lines look like this:
deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626
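A hedged Scala sketch of the read-and-parse step for data shaped like the sample above; the case class, path and header handling are assumptions, not from the post.
```
case class DeviceRecord(deviceId: Long, bytes: Long, eventDate: String)

val raw = sc.textFile("hdfs:///path/to/device_bytes.csv")   // hypothetical path
val header = raw.first()                                     // "deviceid,bytes,eventdate"
val records = raw.filter(_ != header).map { line =>
  val Array(id, bytes, date) = line.split(",")
  DeviceRecord(id.toLong, bytes.toLong, date)
}
```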
Hi folks,
I have a question regarding scheduling of Spark job on Yarn cluster.
Let's say there are 5 nodes on Yarn cluster: A,B,C, D, E
In Spark job I'll be reading some huge text file (sc.textFile(fileName))
from HDFS and create an RDD.
Assume that only nodes A, E contain the blocks of that
To clear one thing up: the space taken up by data that Spark caches on disk
is not related to YARN's local resource / application cache concept.
The latter is a way that YARN provides for distributing bits to worker
nodes. The former is just usage of disk by Spark, which happens to be in a
local
What are the other parameters? Are you just setting k=3? What about # of
runs? How many partitions do you have? How many cores does your machine
have?
Thanks,
Burak
On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com wrote:
Hi Burak,
k = 3
dimension = 785 features
Spark 1.4
I'm still getting acquainted with the Spark ecosystem, and wanted to make sure
my understanding of the different API layers is correct.
Is this an accurate picture of the major API layers, and their associated
client support?
Thanks,
-Lincoln
Spark Core:
- Scala
- Java
-
Does the issue only happen when you have no traffic on the topic?
Have you profiled to see what's using heap space?
On Mon, Jul 13, 2015 at 1:05 PM, Apoorva Sareen apoorva.sar...@gmail.com
wrote:
Hi,
I am running spark streaming 1.4.0 on Yarn (Apache distribution 2.6.0)
with java 1.8.0_45
Good points, Michael.
The underlying assumption in my statement is that cost is an issue. If cost is
not an issue and the only requirement is to query structured data, then there
are several databases such as Teradata, Exadata, and Vertica that can handle
4-6 TB of data and outperform Spark.
Hi Oded,
I'm not sure I completely understand your question, but it sounds like you
could have the READER receiver produce a DStream which is
windowed/processed in Spark Streaming, and use foreachRDD to do the OUTPUT.
However, streaming in SparkR is not currently supported (SPARK-6803
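A minimal Scala (not SparkR) sketch of that shape; MyReaderReceiver and writeOut stand in for the READER and OUTPUT pieces and are made-up placeholders.
```
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val readerStream = ssc.receiverStream(new MyReaderReceiver())   // hypothetical custom receiver (extends Receiver)
val windowed = readerStream.window(Seconds(60), Seconds(10))    // re-process the last minute every 10 seconds
windowed.foreachRDD { rdd =>
  rdd.foreachPartition(records => writeOut(records))            // hypothetical OUTPUT step
}
ssc.start()
ssc.awaitTermination()
```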
Can you send the error messages again? I'm not seeing them.
On Mon, Jul 13, 2015 at 2:45 AM, shivamverma shivam13ve...@gmail.com
wrote:
Hi
I am running Spark 1.4 in Standalone mode on top of Hadoop 2.3 on a CentOS
node. I am trying to run grid search on an RF classifier to classify a
small
Just once.
You can see this by printing the optimized logical plan.
You will see just one repartition operation.
So do:
val df = sql("your sql...")
println(df.queryExecution.analyzed)
On Mon, Jul 13, 2015 at 6:37 AM, Hao Ren inv...@gmail.com wrote:
Hi,
I would like to know: Is there any
The call to Sorting.quicksort is not working. Perhaps I am calling it the
wrong way.
allaggregates.toArray allocates and creates a new array separate from
allaggregates; that copy is what Sorting.quickSort sorts, not allaggregates. Try:
val sortedAggregates = allaggregates.toArray
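A hedged completion of that suggestion (variable names taken from the snippet; it assumes the element type has an Ordering in scope):
```
import scala.util.Sorting

val sortedAggregates = allaggregates.toArray   // copy the collection into an Array
Sorting.quickSort(sortedAggregates)            // sorts the copy in place
// use sortedAggregates from here on; allaggregates itself is left untouched
```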
I'm using:
org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
CPU cores: 8 (using the default Spark conf though)
On partitions, I'm not sure how to find that.
On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote:
What are the other parameters? Are you just
Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also
.cache()?
Something like this (I'm assuming you are using Java):
```
JavaRDD<Vector> input = data.repartition(8).cache();
org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
```
On Mon, Jul 13, 2015 at 11:10 AM,
Thank you, extending Serializable solved the issue. I am left with more
questions than answers though :-).
Regards,
Saif
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Monday, July 13, 2015 2:49 PM
To: Ellafi, Saif A.
Cc: user@spark.apache.org; Liu, Weicheng
Subject: Re:
Hi Sushant/Cody,
For question 1, the following is my understanding (I am not 100% sure and
this is only my understanding; I have asked this question in other words
to TD for confirmation, which is not confirmed as of now).
In accordance with tasks created in
Thanks. Spark-testing-base works pretty well.
On Fri, Jul 10, 2015 at 3:23 PM, Burak Yavuz brk...@gmail.com wrote:
I can +1 Holden's spark-testing-base package.
Burak
On Fri, Jul 10, 2015 at 12:23 PM, Holden Karau hol...@pigscanfly.ca
wrote:
Somewhat biased of course, but you can also
Hi Burak,
k = 3
dimension = 785 features
Spark 1.4
On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote:
Hi,
How are you running K-Means? What is your k? What is the dimension of your
dataset (columns)? Which Spark version are you using?
Thanks,
Burak
On Mon, Jul 13,
Hi,
I am running spark streaming 1.4.0 on Yarn (Apache distribution 2.6.0) with
java 1.8.0_45 and also Kafka direct stream. I am also using spark with scala
2.11 support.
The issue I am seeing is that both driver and executor containers are gradually
increasing the physical memory usage till
Hi,
How are you running K-Means? What is your k? What is the dimension of your
dataset (columns)? Which Spark version are you using?
Thanks,
Burak
On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote:
Hi,
For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot
Thank you very much for your time. Here is how I designed the case classes; as
far as I know they apply properly.
PS: By the way, what do you mean by "the programming guide"?
abstract class Validator {
// positions to access with Row.getInt(x)
val shortsale_in_pos = 10
val
I would certainly try to mark the Validator class as Serializable...If that
doesn't do it you can also try and see if this flag sheds more light:
-Dsun.io.serialization.extendedDebugInfo=true
By programming guide I mean this:
https://spark.apache.org/docs/latest/programming-guide.html I could
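A minimal sketch of the suggested change (the validate signature is taken from earlier in the thread; everything else is assumed):
```
import org.apache.spark.sql.Row

abstract class Validator extends Serializable {   // make the whole hierarchy Serializable
  def validate(row: Row): Boolean
}
```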
'
to
'/var/log/mcsvc/mesostmpdir/slaves/20150713-133618-421011372-5050-8867-S5/frameworks/20150713-152326-421011372-5050-12921-0002/executors/9/runs/9e44b2ea-c738-4e76-8103-3a85ce752b58/spark-1.4.0-bin-hadoop2.4.tgz'
I0713 15:59:50.700959 1327 fetcher.cpp:78] Extracted resource
'/var/log/mcsvc
Hi! I was just raising this issue, I already solved it by excluding that
transitive dependency. Thanks for your help anyway :)
2015-07-13 14:43 GMT+01:00 Cody Koeninger c...@koeninger.org:
Yeah, I had brought that up a while back, but didn't get agreement on
removing the stub. Seems to be an
-1.4.0-bin-hadoop2.4.tgz'
to
'/var/log/mcsvc/mesostmpdir/slaves/20150713-133618-421011372-5050-8867-S5/frameworks/20150713-152326-421011372-5050-12921-0002/executors/9/runs/9e44b2ea-c738-4e76-8103-3a85ce752b58/spark-1.4.0-bin-hadoop2.4.tgz'
I0713 15:59:50.700959 1327 fetcher.cpp:78] Extracted
Hi, I did Kafka streaming through Spark Streaming. I have a use case where I
would like to stream data from a database table. I see JdbcRDD is there, but
that is not what I am looking for; I need continuous streaming, like
JavaSparkStreaming, which continuously runs and listens to changes in a
database
Oh, this is very interesting -- can you explain about your dependencies?
I'm running Tomcat 7 and ended up using spark-assembly from WEB-INF/lib and
removing the javax/servlet package out of it...but it's a pain in the neck.
If I'm reading your first message correctly you use hadoop-common and
fetcher.cpp:135] Downloading
'http://s3-eu-west-1.amazonaws.com/int-mesos-data/frameworks/spark/spark-1.4.0-bin-hadoop2.4.tgz'
to
'/var/log/mcsvc/mesostmpdir/slaves/20150713-133618-421011372-5050-8867-S5/frameworks/20150713-152326-421011372-5050-12921-0002/executors/9/runs/9e44b2ea-c738-4e76
If you want to exploit properly the 8 nodes of your cluster, you should use ~
2 times that number for partitioning.
You can specify the number of partitions when calling parallelize, as
follows:
JavaRDD<Point> pnts = sc.parallelize(points, 16);
Hi,
I have several issues related to HDFS, that may have different roots. I'm
posting as much information as I can, with the hope that I can get your
opinion on at least some of them. Basically the cases are:
- HDFS classes not found
- Connections with some datanode seems to be slow/
Can you please share your application code?
I suspect that you're not making good use of the cluster by configuring a
wrong number of partitions in your RDDs.
Hello,
I was reading the Learning Spark book and saw a tip in chapter 9 that reads:
"In Spark 1.2, the regular cache() method on RDDs also results in a
cacheTable()."
Is that true? When I cache an RDD and cache the same data as a DataFrame, I see
that the memory usage for the DataFrame cache is way less than for the RDD
Hi,
For some experiments I am doing, I am trying to do the following:
1. Created an abstract class Validator. Created case objects from Validator with a
validate(row: Row): Boolean method.
2. Added all the case objects to a list.
3. Each validate takes a Row into account, returns itself if validate
Thanks Michael for your answer.
But the YARN of today does not manage HDFS. How does the YARN RM get to know the
HDFS blocks on each data node?
Do you mean that the YARN RM contacts the NameNode for HDFS block data on each
node, and then decides to launch executors on the nodes which have the required
input data blocks
Hi,
I'm seeing quite a bit of information on Spark memory management. I'm just
trying to set the heap size, e.g. Xms as 512m and Xmx as 1g or some such.
Per
http://apache-spark-user-list.1001560.n3.nabble.com/Use-of-SPARK-DAEMON-JAVA-OPTS-tt10479.html#a10529:
SPARK_DAEMON_JAVA_OPTS is not
Hey everyone,
We are running into an issue where spark jobs will sometimes hang
indefinitely. We are on Spark 1.3.1 (working on upgrading soon), Java 8,
and using mesos with spark.mesos.coarse=false. I'm fairly certain that the
issue comes up when we do shuffle operations.
My pipeline reads data
Your query will be partitioned once. Then, a single Window operator will
evaluate these three functions. As mentioned by Harish, you can take a look
at the plan (sql("your sql...").explain()).
On Mon, Jul 13, 2015 at 12:26 PM, Harish Butani rhbutani.sp...@gmail.com
wrote:
Just once.
You can see
Hello all,
I have one big RDD in which there is a column of groups: A1, A2, B1, B2, B3,
C1, D1, ..., XY.
Out of it, I am using map() to transform into an RDD[LabeledPoint] with dense
vectors for later use in Logistic Regression, which takes RDD[LabeledPoint].
I would like to run a logistic
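A hedged Scala sketch of the map() step being described; bigRdd, the labeling rule and the feature extraction are placeholders, not the poster's code.
```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val labeled = bigRdd.map { row =>
  val label = if (row.group.startsWith("A")) 1.0 else 0.0   // hypothetical labeling rule per group
  LabeledPoint(label, Vectors.dense(row.features))          // row.features: Array[Double], assumed
}
```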
Sorry; I think I may have used poor wording. SparkR will let you use R to
analyze the data, but it has to be loaded into memory using SparkR (see SparkR
DataSources
http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html).
You will still have to write a Java receiver to store the data
Hi,
I have a question for Spark SQL. Is there a way to be able to use
Spark SQL on YARN without having to submit a job?
Bottom line here is I want to be able to reduce the latency of
running queries as a job. I know that the spark sql default submission
is like a job, but was wondering if
A good example is RegressionMetrics
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48)
and its use of MultivariateOnlineSummarizer to aggregate statistics across
labels and residuals; take a look at how aggregateByKey is used
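A hedged sketch of the aggregateByKey pattern referred to above; byDevice is an assumed RDD of (key, feature vector) pairs, and all vectors must share the same dimension.
```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

def summarizePerKey(byDevice: RDD[(Long, Vector)]): RDD[(Long, MultivariateOnlineSummarizer)] =
  byDevice.aggregateByKey(new MultivariateOnlineSummarizer())(
    (summary, v) => summary.add(v),   // fold each vector into the running summary
    (s1, s2) => s1.merge(s2))         // merge partial summaries across partitions
```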
On Mon, Jul 13, 2015 at 11:06 AM, Lincoln Atkinson lat...@microsoft.com wrote:
I’m still getting acquainted with the Spark ecosystem, and wanted to make
sure my understanding of the different API layers is correct.
Is this an accurate picture of the major API layers, and their associated
It happens irrespective of whether there is traffic or no traffic on the Kafka
topic. Also, there is no clue I could see in the heap space. The heap looks
healthy and stable. It's something off-heap which is constantly growing. I also
checked the JNI reference count from the dumps, which appear
Hello,
I want to expose the result of a Spark computation to external tools. I plan to
do this with the Thrift server JDBC interface by registering the result DataFrame
as a temp table.
I wrote a sample program in spark-shell to test this.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import
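A hedged sketch of the registration step (query and table names are placeholders); for external tools to see the table, the Thrift server has to share this same HiveContext, e.g. when started in-process.
```
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val result = hiveContext.sql("SELECT something FROM some_source")   // hypothetical computation
result.registerTempTable("computation_result")
// JDBC clients of a Thrift server sharing this HiveContext can now query computation_result
```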
Well for adhoc queries you can use the CLI
On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid
wrote:
Hi,
I have a question for Spark SQL. Is there a way to be able to use Spark
SQL on YARN without having to submit a job?
Bottom line here is I want to be able to
Hi all,
I have conf/hive-site.xml pointing to my Hive metastore, but the Spark SQL
CLI doesn't pick it up (copying the same conf/ files to Spark 1.4 and 1.2
works fine). Just wondering if someone has seen this before.
Thanks
Thank you Feynman for your response. Since I am very new to Scala I may need a
bit more hand-holding at this stage.
I have been able to incorporate your suggestion about sorting - and it now
works perfectly. Thanks again for that.
I tried to use your suggestion of using
I'd look at the JDBC server (a long-running YARN job you can submit queries
to):
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang jerrickho...@gmail.com
wrote:
Well for adhoc queries you can use
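A hedged Scala sketch of submitting a query to a running Thrift JDBC server; host, port, table and credentials are placeholders, and the Hive JDBC driver must be on the classpath.
```
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  val rs = conn.createStatement().executeQuery("SELECT count(*) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}
```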
There was a discussion on this earlier; let me re-post it for you.
For the following code:
val df = sqlContext.parquetFile(path)
df remains columnar (actually it just reads from the columnar Parquet
file on disk).
For the following code:
val cdf = df.cache()
cdf is
any help / idea will be appreciated :)
thanks
Regards,
Oded Maimon
Scene53.
On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon o...@scene53.com wrote:
Hi All,
We are evaluating Spark for real-time analytics. What we are trying to do
is the following:
- READER APP - use a custom receiver to get
Hi, please can anyone help me on this post? It seems to be a show stopper
for our current project.
Thanks in advance.
Regards,
Dinesh