However this returns a single column of c, without showing the original col1.
On Thu, May 21, 2015 at 11:25 PM Ram Sriharsha sriharsha@gmail.com
wrote:
df.groupBy($"col1").agg(count($"col1").as("c")).show
On Thu, May 21, 2015 at 3:09 AM, SLiZn Liu sliznmail...@gmail.com wrote:
Hi Spark
Hi All,
I'm facing the same problem with Spark 1.3.0 from cloudera cdh 5.4.x. Any
luck solving the issue?
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unsupported language features in query: select * from
everest_marts_test.hive_ql_test where
Dani, this appears to be addressed in SPARK-5567, scheduled for Spark 1.5.0.
Ken
On May 21, 2015, at 11:12 PM, user-digest-h...@spark.apache.org wrote:
From: Dani Qiu zongmin@gmail.com
Subject: LDA prediction on new document
Date: May 21, 2015 at 8:48:40 PM PDT
To:
Despite the odd usage, it does the trick, thanks Reynold!
On Fri, May 22, 2015 at 2:47 PM Reynold Xin r...@databricks.com wrote:
In 1.4 it actually shows col1 by default.
In 1.3, you can add col1 to the output, i.e.
df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show()
On Thu, May 21,
You can look at the logic for offloading data from memory by looking at the
ensureFreeSpace call
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L416
and at dropFromMemory.
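(A heavily simplified sketch of the eviction idea only, not the actual MemoryStore code: before caching a new block, walk the cached blocks in LRU order and collect the ones that would have to be dropped, writing them to disk if their storage level allows it.)
def blocksToDropSketch(spaceNeeded: Long, memoryUsed: Long, maxMemory: Long,
    blocksInLruOrder: Seq[(String, Long)]): Seq[String] = {
  val toDrop = scala.collection.mutable.ArrayBuffer[String]()
  var freed = 0L
  val iter = blocksInLruOrder.iterator
  while (memoryUsed - freed + spaceNeeded > maxMemory && iter.hasNext) {
    val (blockId, size) = iter.next()
    toDrop += blockId // would be dropped from memory (written to disk for MEMORY_AND_DISK levels)
    freed += size
  }
  toDrop.toSeq
}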
Experts,
This is an academic question. Since Spark runs on the JVM, how is it able to
do things like offloading RDDs from memory to disk when the data cannot fit
into memory? How are the calculations performed? Does it use the methods
available in the java.lang.Runtime class to get free/available
The Spark source has a Dockerfile; you can use this.
-- Original --
From: tridib;tridib.sama...@live.com;
Date: Fri, May 22, 2015 03:25 AM
To: useruser@spark.apache.org;
Subject: Official Docker container for Spark
Hi,
I am using spark 1.2.0. Can you
That works, thank you!
On 2015-05-22 03:15, Davies Liu wrote:
Could you try specifying PYSPARK_PYTHON as the path to the Python in
your virtual env? For example:
PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py
On Mon, Apr 20, 2015 at 12:51 AM, Karlson ksonsp...@siberie.de wrote:
Hi All,
I'm deploying an architecture that uses Flume for sending log information
to a sink.
Spark Streaming reads from this sink (pull strategy) and processes all this
information; during this process I would like to do some event
processing... for example:
Log appender writes information about
Use this:
sequenceiq/docker
Here's a link to their github repo:
docker-spark https://github.com/sequenceiq/docker-spark
They have repos for other big data tools too, which are again really nice.
It's being maintained properly by their devs and
You can do something like this:
val myRdd = ...
val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i =>
Random.nextDouble() < 0.1) // this samples 10% of the partitions
rddSampledByPartition.mapPartitions { iter => iter.take(10) } // take the
first 10 elements out of each partition
thanks, Ken
but I am planning to use Spark LDA in production. I cannot wait for the
future release.
At least, please provide some workaround.
PS: SPARK-5567 https://issues.apache.org/jira/browse/SPARK-5567
mentions "This will require inference but should be able to use the same
code",
Can you post the modified code?
Thanks
On May 21, 2015, at 11:11 PM, donhoff_h 165612...@qq.com wrote:
Hi,
Thanks very much for the reply. I have tried the SecurityUtil. I can see
from the log that this statement executed successfully, but I still cannot pass
the
Hi,
My modified code is listed below; I just added the SecurityUtil API. I don't know
which propertyKeys I should use, so I made 2 propertyKeys of my own to find the
keytab and principal.
object TestHBaseRead2 {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new
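(A hedged alternative sketch, not the poster's code: an explicit keytab login through Hadoop's UserGroupInformation before touching HBase; the principal and keytab path below are placeholders.)
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val hadoopConf = new Configuration()
UserGroupInformation.setConfiguration(hadoopConf)
// hypothetical principal and keytab path
UserGroupInformation.loginUserFromKeytab("sparkuser/host@EXAMPLE.COM", "/path/to/sparkuser.keytab")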
This is just a friendly ping, just to remind you of my query.
Also, is there a possible explanation/example on the usage of
AsyncRDDActions in Java ?
On Thu, May 21, 2015 at 7:18 PM, Gautam Bajaj gautam1...@gmail.com wrote:
I am receiving data on UDP port 8060 and doing processing on it using
Hi,
Thanks very much for the reply. I have tried the SecurityUtil. I can see
from the log that this statement executed successfully, but I still cannot pass
the authentication of HBase. And with more experiments, I found a new
interesting scenario. If I run the program with yarn-client mode, the
In 1.4 it actually shows col1 by default.
In 1.3, you can add col1 to the output, i.e.
df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show()
On Thu, May 21, 2015 at 11:22 PM, SLiZn Liu sliznmail...@gmail.com wrote:
However this returns a single column of c, without showing the original
In the Spark source, see this class: org.apache.spark.deploy.worker.WorkerArguments
def inferDefaultCores(): Int = {
  Runtime.getRuntime.availableProcessors()
}
def inferDefaultMemory(): Int = {
  val ibmVendor = System.getProperty("java.vendor").contains("IBM")
  var totalMb = 0
  try {
    val bean =
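(A hedged illustration, not the truncated continuation of the snippet above: total physical memory can be read through the OperatingSystemMXBean, with a fallback when the vendor-specific interface is unavailable.)
import java.lang.management.ManagementFactory

val osBean = ManagementFactory.getOperatingSystemMXBean
val totalMb = osBean match {
  case b: com.sun.management.OperatingSystemMXBean =>
    (b.getTotalPhysicalMemorySize / 1024 / 1024).toInt
  case _ => 2048 // fallback when only the java.lang.management interface is available
}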
I am new to MLlib and to Spark (I use Scala).
I'm trying to understand how LogisticRegressionWithLBFGS and
LogisticRegressionWithSGD work.
I usually use R to do logistic regressions but now I do it on Spark
to be able to analyze Big Data.
The model only returns weights and intercept. My problem
What is also strange is that this seems to work on external JSON data, but not
Parquet. I’ll try to do more verification of that next week.
On May 22, 2015, at 16:24, yana yana.kadiy...@gmail.com wrote:
There is an open Jira on Spark not pushing predicates to metastore. I have a
large
There is an open Jira on Spark not pushing predicates to the metastore. I have a
large dataset with many partitions, but doing anything with it is very
slow... But I am surprised Spark 1.2 worked for you: it has this problem...
-- Original message -- From: Andrew Otto
Can you show us the rest of the program? When are you starting or stopping
the context? Is the exception occurring right after start or stop? What
about the log4j logs, what do they say?
On Fri, May 22, 2015 at 7:12 AM, Cody Koeninger c...@koeninger.org wrote:
I just verified that the following
Hi all,
(This email was easier to write in markdown, so I’ve created a gist with its
contents here: https://gist.github.com/ottomata/f91ea76cece97444e269. I'll paste the
markdown content in the email body here too.)
---
We’ve recently
If the message consumption rate is higher than the time required to process ALL
data for a micro batch (i.e. the next RDD produced for your stream) the
following happens – let's say that e.g. your micro batch time is 3 sec:
1. Based on your message streaming and consumption rate, you
… and measure 4 is to implement a custom Feedback Loop to e.g. monitor the
amount of free RAM and number of queued jobs and automatically decrease the
message consumption rate of the Receiver until the number of clogged RDDs and
Jobs subsides (again here you artificially decrease your
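(A minimal sketch of the rate-limit knob mentioned above, assuming Scala, a 3-second batch interval and an assumed cap of 1000 records/sec; spark.streaming.receiver.maxRate limits each receiver's ingestion rate.)
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("RateLimitedStream")
  .set("spark.streaming.receiver.maxRate", "1000") // max records/sec per receiver
val ssc = new StreamingContext(conf, Seconds(3))   // 3-second micro batches, as in the example above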
I'm using the spark-cassandra-connector from DataStax in a Spark Streaming
job launched from my own driver. It is connecting to a standalone cluster
on my local box which has two workers running.
This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT. I have
added the following entry to
Great to see the result comparable to R in the new ML implementation.
Since the majority of users will still use the old mllib api, we plan to
call the ML implementation from MLlib to handle the intercept
correctly with regularization.
JIRA is created.
https://issues.apache.org/jira/browse/SPARK-7780
In Spark 1.4, Logistic Regression with elasticNet is implemented in the ML
pipeline framework. Model selection can be achieved through a high
lambda, resulting in lots of zeros in the coefficients.
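(A hedged sketch of that 1.4 ML pipeline API, with assumed column names and regularization values: a large regParam with elasticNetParam = 1.0, i.e. pure L1, pushes many coefficients to zero.)
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setFitIntercept(true)
  .setElasticNetParam(1.0) // L1 penalty
  .setRegParam(0.5)        // large lambda -> sparse coefficients
val model = lr.fit(trainingDF) // trainingDF: a DataFrame with "label" and "features" columns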
Sincerely,
DB Tsai
---
Blog: https://www.dbtsai.com
On
I used a Spark standalone cluster on Windows 2008. I kept on getting the
following error when trying to save an RDD to a Windows shared folder
rdd.saveAsObjectFile(file:///T:/lab4-win02/IndexRoot01/tobacco-07/myrdd.obj)
15/05/22 16:49:05 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID
Hi All,
I have a cluster of four nodes (three workers and one master, with one core
each) which consumes data from Kinesis at 15 second intervals using two
streams (i.e. receivers). The job simply grabs the latest batch and pushes
it to MongoDB. I believe that the problem is that all tasks are
I guess each receiver occupies an executor. So there was only one executor
available for processing the job.
On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com
wrote:
Hi All,
I have a cluster of four nodes (three workers and one master, with one core
each) which consumes data
The stack trace is related to hdfs.
Can you tell us which hadoop release you are using ?
Is this a secure cluster ?
Thanks
On Fri, May 22, 2015 at 1:55 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I used a Spark standalone cluster on Windows 2008. I kept on getting the
Hello,
I am using Spark 1.3 and Hive 0.13.1 in AWS.
From Spark SQL, when running a Hive query to export the query result into AWS
S3, it failed with the following message:
==
org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:
Hi,
1. Dynamic allocation is currently only supported with YARN, correct?
2. In Spark Streaming, is it possible to change the number of executors
while an application is running? If so, can the allocation be controlled by
the application, instead of using any already defined automatic policy?
Sorry, but I can't see in TD's comments how to allocate executors on
demand. It seems to me that he's talking about resources within an
executor, mapping shards to cores. I want to be able to decommission
executors/workers/machines.
On Sat, May 23, 2015 at 3:31 AM, Ted Yu yuzhih...@gmail.com
Or should I shutdown the streaming context gracefully and then start it
again with a different number of executors?
On Sat, May 23, 2015 at 4:00 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
Sorry, but I can't see in TD's comments how to allocate executors on
demand. It seems to me that he's
For #1, the answer is yes.
For #2, See TD's comments on SPARK-7661
Cheers
On Fri, May 22, 2015 at 6:58 PM, Saiph Kappa saiph.ka...@gmail.com wrote:
Hi,
1. Dynamic allocation is currently only supported with YARN, correct?
2. In Spark Streaming, is it possible to change the number of
Todd, I don't have any answers for you...other than the file is actually
named spark-defaults.conf (not sure if you made a typo in the email or
misnamed the file...). Do any other options from that file get read?
I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3
Hi,
Is there an easy way to see how a SparkSQL query plan maps to different
stages of the generated Spark job? The WebUI is entirely in terms of RDD
stages and I'm having a hard time mapping it back to my query.
Pramod
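(One hedged starting point, not a full mapping: DataFrame.explain(true) prints the logical and physical plans, which can then be correlated with the RDD stages shown in the WebUI; the query below is hypothetical.)
val df = sqlContext.sql("SELECT col1, count(*) FROM some_table GROUP BY col1")
df.explain(true) // prints logical and physical plans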
That should do.
Cheers
On Fri, May 22, 2015 at 8:28 PM, Saiph Kappa saiph.ka...@gmail.com wrote:
Or should I shutdown the streaming context gracefully and then start it
again with a different number of executors?
On Sat, May 23, 2015 at 4:00 AM, Saiph Kappa saiph.ka...@gmail.com
wrote:
Maybe because of snappy-java:
https://issues.apache.org/jira/browse/SPARK-5081
On May 23, 2015, at 1:23 AM, Josh Rosen rosenvi...@gmail.com wrote:
I don't think that 0.9.3 has been released, so I'm assuming that you're
running on branch-0.9.
There's been over 4000 commits between 0.9.3 and
Could you show the schema and confirm that they are LongType?
df.printSchema()
On Mon, Apr 27, 2015 at 5:44 AM, jamborta jambo...@gmail.com wrote:
hi all,
I have just come across a problem where I have a table that has a few bigint
columns; it seems that if I read that table into a dataframe
Something does not make sense. Receivers (currently) do not get blocked
(unless a rate limit has been set) due to processing load. The receiver will
continue to receive data and store it in memory and until it is processed.
So I am still not sure how the data loss is happening. Unless you are
DataFrames have a lot more information about the data, so there is a whole
class of optimizations that are possible there that we cannot do in RDDs.
This is why we are focusing a lot of effort on this part of the project.
In Spark 1.4 you can accomplish what you want using the new window function
If you want to select specific variable combinations by hand, then you will
need to modify the dataset before passing it to the ML algorithm. The
DataFrame API should make that easy to do.
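(For example, a hedged sketch with hypothetical column names, keeping only the predictors you want before handing the data to the ML algorithm:)
val selected = df.select("label", "age", "income") // hypothetical columns; everything else is dropped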
If you want to have an ML algorithm select variables automatically, then I
would recommend using L1
Hi,
Environment: Spark standalone cluster running with a master and a worker on a
small Vagrant VM. The Jetty webapp on the same node calls the spark-submit
script to start the job.
From the contents of the stdout I can see that it's running successfully.
However, the spark-submit process never
A receiver occupies a CPU core; an executor is simply a JVM instance and as
such it can be granted any number of cores and amount of RAM.
So check how many cores you have per executor.
Sent from Samsung Mobile
-- Original message -- From: Mike Trienis
mike.trie...@orcsol.com
on the worker/container that fails, the file not found is the first error
-- the output below is from the yarn log. There were some python worker
crashes for another job/stage earlier (see the warning at 18:36) but I
expect those to be unrelated to this file not found error.
Or you can run Drools in a Central Server Mode, i.e. as a common/shared service,
but that would slow down your Spark Streaming job due to the remote network call
which will have to be generated for every single message
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, May 22, 2015
I am not aware of existing examples but you can always “ask” Google
Basically from Spark Streaming perspective, Drools is a third-party Software
Library, you would invoke it in the same way as any other third-party software
library from the Tasks (maps, filters etc) within your DAG job
Hi,
I'm trying to connect to two topics of Kafka with Spark with DirectStream
but I get an error. I don't know if there's any limitation on doing it,
because when I access just one topic everything is fine.
val ssc = new StreamingContext(sparkConf, Seconds(5))
val kafkaParams =
You can deploy and invoke Drools as a Singleton on every Spark Worker Node /
Executor / Worker JVM
You can invoke it from e.g. map, filter etc and use the result from the Rule to
make a decision on how to transform/filter an event/message
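(A minimal, self-contained sketch of that singleton-per-executor-JVM pattern. The trivial rule below, amount > threshold, is a hypothetical stand-in for a real Drools KieSession; the point is the lazy val, which is initialised at most once per executor JVM and reused by every task running in it.)
object RuleEngine {
  lazy val threshold: Double = 100.0 // imagine loading the Drools rule base here instead

  def evaluate(amount: Double): Boolean = amount > threshold
}

// inside the Streaming DAG, e.g. filtering a DStream of amounts:
// dstream.filter(amount => RuleEngine.evaluate(amount))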
From: Antonio Giambanco [mailto:antogia...@gmail.com]
Hi,
Some time back I played with distributed rule processing by integrating
Drools with HBase co-processors and invoking rules on any incoming data:
https://github.com/dibbhatt/hbase-rule-engine
You can get some idea of how to use Drools rules if you look at this
RegionObserverCoprocessor.
Can you share the exception(s) you encountered ?
Thanks
On May 22, 2015, at 12:33 AM, donhoff_h 165612...@qq.com wrote:
Hi,
My modified code is listed below; I just added the SecurityUtil API. I don't
know which propertyKeys I should use, so I made 2 propertyKeys of my own to find
the
Hi
I was using the wrong version of the spark-hive jar. I downloaded the
right version of the jar from the cloudera repo and it works now.
Thanks,
Skanda
On Fri, May 22, 2015 at 2:36 PM, Skanda skanda.ganapa...@gmail.com wrote:
Hi All,
I'm facing the same problem with Spark 1.3.0 from
The only “tricky” bit would be when you want to manage/update the Rule Base in
your Drools Engines already running as Singletons in Executor JVMs on Worker
Nodes. The invocation of Drools from Spark Streaming to evaluate a Rule already
loaded in Drools is not a problem.
From: Evo Eftimov
This is added to 1.4.0
https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote:
Hi,
wouldn't df.rdd.partitionBy() return a new RDD that I would then need to
make into a Dataframe again? Maybe like this:
Hi,
I am currently experimenting with linear regression (SGD) (Spark +
MLlib, ver. 1.2). At this point in time I need to fine-tune the
hyper-parameters. I do this (for now) by an exhaustive grid search of
the step size and the number of iterations. Currently I am on a dual
core that acts as
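(A minimal sketch of the exhaustive grid search described above, using the Spark 1.2-era mllib API; training and validation are assumed to be RDD[LabeledPoint] splits already in scope.)
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val grid = for {
  stepSize      <- Seq(0.01, 0.1, 1.0)
  numIterations <- Seq(50, 100, 200)
} yield {
  val model = LinearRegressionWithSGD.train(training, numIterations, stepSize)
  val rmse = math.sqrt(validation.map { p =>
    val err = model.predict(p.features) - p.label
    err * err
  }.mean())
  ((stepSize, numIterations), rmse)
}
val best = grid.minBy(_._2) // (hyper-parameters, RMSE) with the lowest validation error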
Alright, that doesn't seem to have made it into the Python API yet.
On 2015-05-22 15:12, Silvio Fiorito wrote:
This is added to 1.4.0
https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote:
Hi,
wouldn't df.rdd.partitionBy() return a new RDD
Hi,
wouldn't df.rdd.partitionBy() return a new RDD that I would then need to
make into a Dataframe again? Maybe like this:
df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird
to me, though, and I'm not sure if the DF will be aware of its
partitioning.
On 2015-05-22
Dani,
"Folding in", I believe, refers to setting up your Gibbs sampler (or other
model) with the learned word and document topic proportions as computed by
Spark.
You might look at
https://lists.cs.princeton.edu/pipermail/topic-models/2014-May/002763.html
Where Jones suggests summing across
I just verified that the following code works on 1.3.0 :
val stream1 = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topic1)
val stream2 = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topic2)
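(A hedged follow-up sketch: since the third argument is a Set of topic names, the two streams above can also be combined into a single direct stream over both topics, assuming topic1 and topic2 are those Sets.)
val both = KafkaUtils.createDirectStream[String, String,
  StringDecoder, StringDecoder](ssc, kafkaParams, topic1 ++ topic2)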
I have a table in a Hive database partitioned by date. I notice that when
I query this table using HiveContext the created data frame has a specific
number of partitions.
Does this partitioning correspond to my original table partitioning in Hive?
Thanks
--
Cesar Flores
You might also enable debug in: hadoop-env.sh
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true
-Dsun.security.krb5.debug=true ${HADOOP_OPTS}"
and check that the principals are the same on the NameNode and DataNode.
and you can
Looking at python/pyspark/sql/dataframe.py :
@since(1.4)
def coalesce(self, numPartitions):
@since(1.3)
def repartition(self, numPartitions):
Would the above methods serve the purpose ?
Cheers
On Fri, May 22, 2015 at 6:57 AM, Karlson ksonsp...@siberie.de wrote:
Alright, that
Hi.
I have a job that takes
~50 min with Spark 0.9.3 and
~1.8 hrs on Spark 1.3.1 on the same cluster.
The only code difference between the two code bases is to fix the Seq ->
Iter changes that happened in the Spark 1.x series.
Are there any other changes in the defaults from Spark 0.9.3 -> 1.3.1
Hi.
I have an RDD that I use repeatedly through many iterations of an
algorithm. To prevent recomputation, I persist the RDD (and incidentally I
also persist and checkpoint its parents)
val consCostConstraintMap = consCost.join(constraintMap).map {
  case (cid, (costs, (mid1, _, mid2, _, _))) => {
Thanks a lot Evo,
do you know where I can find some examples?
Have a great one
A G
2015-05-22 12:00 GMT+02:00 Evo Eftimov evo.efti...@isecc.com:
You can deploy and invoke Drools as a Singleton on every Spark Worker Node
/ Executor / Worker JVM
You can invoke it from e.g. map, filter etc
I'm not sure if it is possible to overload the map function twice, once for
just KV pairs, and another for K and V separately.
On Fri, May 22, 2015 at 10:26 AM, Justin Pihony justin.pih...@gmail.com
wrote:
This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved
the RDD API, but
The (crude) proof of concept seems to work:
class RDD[V](value: List[V]) {
  def doStuff = println("I'm doing stuff")
}
object RDD {
  implicit def toPair[V](x: RDD[V]) = new PairRDD(List((1, 2)))
}
class PairRDD[K, V](value: List[(K, V)]) extends RDD(value) {
  def doPairs = println("I'm using pairs")
}
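(A hypothetical usage of that crude sketch, just to show the implicit conversion firing, mirroring how PairRDDFunctions works in Spark:)
val r = new RDD(List(1, 2, 3))
r.doStuff // defined directly on RDD
r.doPairs // resolves through the implicit toPair conversion in object RDD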
I have a small Spark launcher app able to instantiate a service via a Spring
XML application context and then broadcast it in order to make it
available on remote nodes.
I suppose when a Spring service is instantiated, all dependencies are
instantiated and injected at the same time, so
This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved the
RDD API, but it could be even more discoverable if made available via the
API directly. I assume this was originally an omission that now needs to be
kept for backwards compatibility, but would any of the repo owners be open
hmm, sorry I think that disproves my theory. Nothing else is immediately
coming to mind. It's possible there is more info in the logs from the
driver, couldn't hurt to send those (though I don't have high hopes of
finding anything that way). Off chance this could be from too many open
files or
I guess not. Spark partitions correspond to the number of splits.
On 23 May 2015 00:02, Cesar Flores ces...@gmail.com wrote:
I have a table in a Hive database partitioned by date. I notice that when
I query this table using HiveContext the created data frame has a specific
number of partitions.
I don't think that 0.9.3 has been released, so I'm assuming that you're
running on branch-0.9.
There's been over 4000 commits between 0.9.3 and 1.3.1, so I'm afraid that
this question doesn't have a concise answer:
https://github.com/apache/spark/compare/branch-0.9...v1.3.1
To narrow down the
Thank you guys for the prompt help.
I ended up building Spark master and verified what DB suggested.
val lr = (new MlLogisticRegression)
  .setFitIntercept(true)
  .setMaxIter(35)
val model = lr.fit(sqlContext.createDataFrame(training))
val scoreAndLabels =