If you want to specify a mapping, you must first create the mappings for your
index types before indexing.
As far as I know there is no way to specify this via ES-Hadoop, but it's best
practice to explicitly create mappings prior to indexing, or to use index
templates when dynamically creating indices.
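For example, one way to create the mapping up front before indexing from Spark (a sketch only; the index, type and field names are placeholders, and it just PUTs against the ES REST API):

    import java.io.OutputStreamWriter
    import java.net.{HttpURLConnection, URL}

    // Create the index with an explicit mapping so the field is not analyzed.
    val mapping =
      """{
        |  "mappings": {
        |    "my_type": {
        |      "properties": {
        |        "my_field": { "type": "string", "index": "not_analyzed" }
        |      }
        |    }
        |  }
        |}""".stripMargin

    val conn = new URL("http://localhost:9200/my_index")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    val out = new OutputStreamWriter(conn.getOutputStream)
    out.write(mapping)
    out.close()
    println(s"Create index response: ${conn.getResponseCode}")

Once the mapping exists, the es-spark job can write to the index as usual and the field will keep the not_analyzed setting.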
Now we have Spark 1.3.0 on CDH 5.1.0.
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for:
ActorSelection[Anchor(akka.tcp://sparkdri...@dmslave13.et2.tbsite.net:5908/), Path(/user/OutputCommitCoordinator)]
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
Can you provide the JDBC connector JAR version, possibly the full JAR name,
and the full command you ran Spark with?
On Wed, Apr 15, 2015 at 11:27 AM, Nathan McCarthy <
nathan.mccar...@quantium.com.au> wrote:
> Just an update, tried with the old JdbcRDD and that worked fine.
>
> From: Nathan
> Da
The problem lies with getting the driver classes into the primordial class
loader when running on YARN.
Basically I need to somehow set SPARK_CLASSPATH or compute_classpath.sh
when running on YARN. I’m not sure how to do this when YARN is handling all the
file copying.
From: Nathan
mailto:na
Greetings!
How about medical data sets, specifically longitudinal vital signs?
Can people send good pointers?
Thanks in advance,
-- ttfn
Simon Edelhaus
California 2015
On Wed, Apr 15, 2015 at 6:01 PM, Matei Zaharia
wrote:
> Very neat, Olivier; thanks for sharing this.
>
> Matei
>
> > On
Thanks Olivier. Good work.
Interesting in more than one way - including training, benchmarking,
testing new releases, et al.
One quick question - do you plan to make it available as an S3 bucket?
Cheers
On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle
wrote:
> Dear Spark users,
>
> I would l
Very neat, Olivier; thanks for sharing this.
Matei
> On Apr 15, 2015, at 5:58 PM, Olivier Chapelle wrote:
>
> Dear Spark users,
>
> I would like to draw your attention to a dataset that we recently released,
> which is as of now the largest machine learning dataset ever released; see
> the fol
Dear Spark users,
I would like to draw your attention to a dataset that we recently released,
which is as of now the largest machine learning dataset ever released; see
the following blog announcements:
- http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
-
http://blogs.technet.com/
It may be worthwhile to architect the computation in a different way.

    dstream.foreachRDD { rdd =>
      rdd.foreach { record =>
        // do different things for each record based on filters
      }
    }
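For example, a fleshed-out version of that skeleton (a sketch only; the predicates and actions are placeholders for whatever the real filters are):

    dstream.foreachRDD { rdd =>
      rdd.foreach { record =>
        // Branch per record based on simple predicates (placeholders).
        if (record.contains("typeA")) println(s"handling A: $record")
        else if (record.contains("typeB")) println(s"handling B: $record")
        else println(s"ignoring: $record")
      }
    }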
TD
On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang
wrote:
> Hi,
>
> I have a Kafka topic that cont
Hi Guys -
I'm having trouble figuring out the semantics of using the alias function on
the final sum and count aggregations.
>>> cool_summary = reviews.select(reviews.user_id,
cool_cnt("votes.cool").alias("cool_cnt")).groupBy("user_id").agg({"cool_cnt":"sum","*":"count"})
>>> cool_summary
DataFram
Hi,
Is there a way to pass the mapping to define a field as not analyzed
via es-spark settings?
I am just wondering if I can set the mapping type for a field as not
analyzed using the set function on the Spark conf, similar to the other
ES settings.
val sconf = new SparkConf()
.setMaster("local
Can you clarify more on what you want to do after querying? Is the batch
not completed until the querying and subsequent processing has completed?
On Tue, Apr 14, 2015 at 10:36 PM, Krzysztof Zarzycki
wrote:
> Thank you Tathagata, very helpful answer.
>
> Though, I would like to highlight that r
DataFrames are available from the latest 1.3 release I believe – in 1.2 (our
case at the moment) I guess the options are more limited.
PS: agreed that DStreams are just an abstraction for a sequence / stream of
(ordinary) RDDs – when I use “DStreams” I mean the DStream OO API in Spark, not
t
Well, DStream joins are nothing but RDD joins at their core. However, there
are more optimizations available when you use DataFrames and Spark SQL joins. With
the schema, there is a greater scope for optimizing the joins. So
convert the RDDs from streaming and the batch RDDs to DataFrames, and then
apply
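A rough sketch of that conversion (assuming Spark 1.3, an existing SQLContext named sqlContext, a DStream[(String, String)] named stream, and a batch DataFrame batchDF with a "key" column; all of these names are illustrative):

    import sqlContext.implicits._

    stream.foreachRDD { rdd =>
      // Give the streaming records a schema so Catalyst can optimize the join.
      val streamDF = rdd.toDF("key", "value")
      val joined = streamDF.join(batchDF, streamDF("key") === batchDF("key"))
      joined.count() // replace with whatever output operation you need
    }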
Just add the following line "spark.ui.showConsoleProgress true" to your
conf/spark-defaults.conf file.
Thank you Sir, and one final confirmation/clarification - are all forms of
joins in the Spark API for DStream RDDs based on cogroup in terms of their
internal implementation?
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Wednesday, April 15, 2015 9:48 PM
To: Evo Eftimov
Cc: user
Su
Agreed.
On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov wrote:
> That has been done Sir and represents further optimizations – the
> objective here was to confirm whether cogroup always results in the
> previously described “greedy” explosion of the number of elements included
> and RAM allocated f
That has been done Sir and represents further optimizations – the objective
here was to confirm whether cogroup always results in the previously described
“greedy” explosion of the number of elements included and RAM allocated for the
result RDD
The optimizations mentioned still don’t chang
Significant optimizations can be made by doing the joining/cogroup in a
smart way. If you have to join streaming RDDs with the same batch RDD, then
you can first partition the batch RDD using a partitioner and cache it, and
then use the same partitioner on the streaming RDDs. That would make sure
t
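A minimal sketch of that pattern (assuming key/value pairs and hypothetical batchRdd and stream variables; the partition count is arbitrary):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    val partitioner = new HashPartitioner(100)

    // Shuffle the batch RDD once into the desired partitioning and keep it cached.
    val partitionedBatch = batchRdd.partitionBy(partitioner).cache()

    // Use the same partitioner on every streaming micro-batch so the join is
    // co-partitioned and the cached batch side is not re-shuffled.
    val joined = stream.transform { rdd =>
      rdd.partitionBy(partitioner).join(partitionedBatch)
    }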
There are indications that joins in Spark are implemented with / based on the
cogroup function/primitive/transform. So let me focus first on cogroup - it
returns a result which is an RDD consisting of essentially ALL elements of the
cogrouped RDDs. Said in another way - for every key in each of the co
Yep, you are looking at operations on DStream, which is not what I'm
talking about. You should look at DStream.foreachRDD (or Java
equivalent), which hands you an RDD. Makes more sense?
The rest may make more sense when you try it. There is actually a lot
less complexity than you think.
On Wed, A
The OO API in question was mentioned several times - the transform method of
the DStream RDD, which is the ONLY way to join/cogroup/union a DStream RDD with a batch
RDD aka JavaRDD.
Here is a paste from the Spark javadoc:
JavaPairDStream transformToPair(Function>
transformFunc)
Return a new DStream in
What API differences are you talking about? a DStream gives a sequence
of RDDs. I'm not referring to DStream or its API.
Spark in general can execute many pipelines at once, ones that even
refer to the same RDD. What I mean is, you seem to be looking for a way to
change one shared RDD, but in fact, yo
I keep seeing only common statements
Re DStream RDDs and Batch RDDs - There is certainly something to "keep me from
using them together" and it is the OO API differences I have described
previously, several times ...
Re the batch RDD reloading from file and that "there is no need for threads" -
Yes, I mean there's nothing to keep you from using them together other
than their very different lifetime. That's probably the key here: if
you need the streaming data to live a long time it has to live in
persistent storage first.
I do exactly this and what you describe for the same purpose.
I do
Hi Sean, well, there is certainly a difference between "batch" RDDs and
"streaming" RDDs, and in the previous reply you have already outlined some. Other
differences are in the object-oriented model / API of Spark, which also matters
besides the RDD / Spark cluster platform architecture.
Secondly, i
What do you mean by "batch RDD"? They're just RDDs, though they store their
data in different ways and come from different sources. You can union
an RDD from an HDFS file with one from a DStream.
It sounds like you want streaming data to live longer than its batch
interval, but that's not something you
The only way to join / union / cogroup a DStream RDD with a batch RDD is via the
"transform" method, which returns another DStream RDD and hence it gets
discarded at the end of the micro-batch.
Is there any way to e.g. union a DStream RDD with a batch RDD which produces a
new batch RDD containing the el
Hi Spark users,
Trying to upgrade to Spark 1.2 and running into the following: I'm
seeing some very slow queries and wondering if someone can point me in the
right direction for debugging. My Spark UI shows a job with duration 15s
(see attached screenshot), which would be great, but client-side measurem
Hi All
I am getting the below exception while running foreach after zipWithIndex
and flatMapValues.
Inside the foreach I am doing a lookup in a broadcast variable.
java.util.concurrent.RejectedExecutionException: Worker has already been
shutdown
at
org.jboss.netty.channel.socket.nio.Abst
CC Leah, who added Bernoulli option to MLlib's NaiveBayes. -Xiangrui
On Wed, Apr 15, 2015 at 4:49 AM, 姜林和 wrote:
>
> Dear Meng:
> Thanks for the great work on Spark machine learning, and I saw the
> changes to the NaiveBayes algorithm,
> separating the algorithm into a multinomial model and Ber
The setting to increase is spark.yarn.executor.memoryOverhead
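For instance, it can go in conf/spark-defaults.conf or on the SparkConf before the SparkContext is created (a sketch; the values are arbitrary, and in Spark 1.x the overhead is given in megabytes):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")
      // Extra off-heap room YARN reserves for the executor container.
      .set("spark.yarn.executor.memoryOverhead", "768")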
On Wed, Apr 15, 2015 at 6:35 AM, Brahma Reddy Battula <
brahmareddy.batt...@huawei.com> wrote:
> Hello Sean Owen,
>
> Thanks for your reply..I"ll increase overhead memory and check it..
>
>
> Bytheway ,Any difference between 1.1 and 1.
Env - Spark 1.3, Hadoop 2.3, Kerberos.
xx.saveAsTextFile(path, codec) gives the following trace. The same works with
Spark 1.2 in the same environment.
val codec = classOf[]
val a = sc.textFile("/some_hdfs_file")
a.saveAsTextFile("/some_other_hdfs_file", codec) fails with following trace
in Spark 1.3, works i
Schema merging is not the feature you are looking for. It is designed when
you are adding new records (that are not associated with old records),
which may or may not have new or missing columns.
In your case it looks like you have two datasets that you want to load
separately and join on a key.
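Something like the following is probably closer to what you want (a sketch for Spark 1.3; the paths are placeholders and "single" stands in for whatever the join key is):

    val left  = sqlContext.parquetFile("hdfs:///path/to/dataset1")
    val right = sqlContext.parquetFile("hdfs:///path/to/dataset2")

    // Join the two datasets on the shared key instead of relying on schema merging.
    val joined = left.join(right, left("single") === right("single"))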
This worked with Java serialization. I am using 1.2.0; you are right that if I use
1.2.1 or 1.3.0 this issue will not occur.
I will test this and let you know.
On 15 April 2015 at 19:48, Imran Rashid wrote:
> oh interesting. The suggested workaround is to wrap the result from
> collectAsMap into another
I have found that it works if you place the sqljdbc41.jar directly in the
following folder:
YOUR_SPARK_HOME/core/target/jars/
So Spark will have the SQL Server jdbc driver when it computes its
classpath.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Micr
Oh interesting. The suggested workaround is to wrap the result from
collectAsMap into another HashMap; you should try that:
Map matchData =RddForMarch.collectAsMap();
Map tmp = new HashMap(matchData);
final Broadcast> dataMatchGlobal =
jsc.broadcast(tmp);
Can you please clarify:
* Does it work w
This looks like a known issue? Check this out:
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-InvalidClassException-org-apache-spark-api-java-JavaUtils-SerializableMapWrapper-no-valid-cor-td20034.html
Can you please suggest any workaround? I am broadcasting a HashMap returned
from RDD.colle
this is a really strange exception ... I'm especially surprised that it
doesn't work w/ java serialization. Do you think you could try to boil it
down to a minimal example?
On Wed, Apr 15, 2015 at 8:58 AM, Jeetendra Gangele
wrote:
> Yes Without Kryo it did work out.when I remove kryo registrati
Yes, without Kryo it did work out; when I removed the Kryo registration it
worked.
On 15 April 2015 at 19:24, Jeetendra Gangele wrote:
> its not working with the combination of Broadcast.
> Without Kyro also not working.
>
>
> On 15 April 2015 at 19:20, Akhil Das wrote:
>
>> Is it working witho
It's not working with the combination of Broadcast.
Without Kryo it's also not working.
On 15 April 2015 at 19:20, Akhil Das wrote:
> Is it working without kryo?
>
> Thanks
> Best Regards
>
> On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele
> wrote:
>
>> Hi All I am getting below exception while us
Is it working without kryo?
Thanks
Best Regards
On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele
wrote:
> Hi All I am getting below exception while using Kyro serializable with
> broadcast variable. I am broadcating a hasmap with below line.
>
> Map matchData =RddForMarch.collectAsMap();
> fi
Tried with 1.3.0 release (built myself) & the most recent 1.3.1 Snapshot off
the 1.3 branch.
Haven't tried with 1.4/master.
From: Wang, Daoyuan [daoyuan.w...@intel.com]
Sent: Wednesday, April 15, 2015 5:22 PM
To: Nathan McCarthy; user@spark.apache.org
Subject: RE
Hello Sean Owen,
Thanks for your reply. I'll increase the overhead memory and check it.
By the way, does any difference between 1.1 and 1.2 cause this issue? Since it was
passing with Spark 1.1 and throwing the following error in 1.2... (this makes me
doubtful.)
Thanks & Regards
Brahma Reddy Battula
_
Hi all, I am getting the below exception while using Kryo serialization with a
broadcast variable. I am broadcasting a HashMap with the below line.
Map matchData =RddForMarch.collectAsMap();
final Broadcast> dataMatchGlobal =
jsc.broadcast(matchData);
15/04/15 12:58:51 ERROR executor.Executor: Exception i
Hi all,
If you follow the example of schema merging in the spark documentation
http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging
you obtain the following results when you want to load the result data :
single  triple  double
1       3       null
2       6       null
4       1
Hello!
The result of correlation in Spark MLlib is of type
org.apache.spark.mllib.linalg.Matrix (see
http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations)
val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
I would like to save the
@Shushant: In my case, the receivers will be fixed till the end of the
application. This one is for the Kafka case only; if you have a filestream
application, you will not have any receivers. Also, for Kafka, the next time
you run the application, it's not fixed that the receivers will get
launched on the
So the receivers will be fixed for every run of the streaming interval job? Say I
have set the stream duration to be 10 minutes; then after each 10 minutes a job
will be created, and the same executor nodes, say in your
case (spark-akhil-slave2.c.neat-axis-616.internal
and spark-akhil-slave1.c.neat-axis-616.internal), wi
Once you start your streaming application to read from Kafka, it will
launch receivers on the executor nodes. And you can see them on the
streaming tab of your driver UI (which runs on port 4040).
These receivers will be fixed till the end of your pipeline (unless it
crashes, etc.). Y
Hi
I want to understand the flow of Spark Streaming with Kafka.
In Spark Streaming, are the executor nodes the same at each run of the streaming
interval, or does the cluster manager assign new executor nodes at each stream
interval for processing that batch's input? If the latter, then at each batch interval new
executors r
Hi guys
Regarding Parquet files: I have Spark 1.2.0, and reading 27 Parquet files
(250 MB/file) takes 4 minutes.
I have a cluster with 4 nodes, and that seems too slow to me.
The "load" function is not available in Spark 1.2, so I can't test it
Regards.
Miguel.
On Mon, Apr 13, 2015 at 8:12 PM,
This is not related to executor memory, but the extra overhead
subtracted from the executor's size in order to avoid using more than
the physical memory that YARN allows. That is, if you declare a 32G
executor YARN lets you use 32G physical memory but your JVM heap must
be significantly less than 3
Thanks a lot for your reply.
There is no issue with Spark 1.1; the following issue came when I upgraded to
Spark 1.2, hence I did not decrease spark.executor.memory.
I mean to say, I used the same config for Spark 1.1 and Spark 1.2.
Is there any issue with Spark 1.2?
Or does YARN lead to this?
And why exe
All this means is that your JVM is using more memory than it requested
from YARN. You need to increase the YARN memory overhead setting,
perhaps.
On Wed, Apr 15, 2015 at 9:59 AM, Brahma Reddy Battula
wrote:
> Hello Sparkers
>
>
> I am newbie to spark and need help.. We are using spark 1.2, we ar
Did you try reducing your spark.executor.memory?
Thanks
Best Regards
On Wed, Apr 15, 2015 at 2:29 PM, Brahma Reddy Battula <
brahmareddy.batt...@huawei.com> wrote:
> Hello Sparkers
>
>
> I am newbie to spark and need help.. We are using spark 1.2, we are
> getting the following error and execu
It's printing on the console, but on HDFS all folders are still empty.
On Wed, Apr 15, 2015 at 2:29 PM, Shushant Arora
wrote:
> Thanks !! Yes message types on this console is seen on another console.
>
> When I closed another console, spark streaming job is printing messages on
> console .
>
> Isn't
Thanks!! Yes, messages typed on this console are seen on another console.
When I closed the other console, the Spark Streaming job started printing messages on
the console.
Isn't a message written to a port using netcat available to multiple
consumers?
On Wed, Apr 15, 2015 at 2:22 PM, bit1...@163.com wrot
Hello Sparkers
I am a newbie to Spark and need help. We are using Spark 1.2; we are getting
the following error and the executor is getting killed. I have seen SPARK-1930 and it
should be in 1.2.
Any pointer on the following error, like what might lead to this error?
2015-04-15 11:55:39,697 | WARN | Cont
Looks like the message is consumed by the other console? (You can see messages typed
on this port from another console.)
bit1...@163.com
From: Shushant Arora
Date: 2015-04-15 17:11
To: Akhil Das
CC: user@spark.apache.org
Subject: Re: spark streaming printing no output
When I launched spark-shell us
When I launched spark-shell using spark-shell --master local[2],
same behaviour: no output on the console, only timestamps.
When I did lines.saveAsTextFiles("hdfslocation", suffix);
I get empty files of 0 bytes on HDFS.
On Wed, Apr 15, 2015 at 12:46 PM, Akhil Das
wrote:
> Just make sure you hav
I am setting spark.executor.memory as 1024m on a 3 node cluster with each
node having 4 cores and 7 GB RAM. The combiner functions are taking scala
case classes as input and are generating mutable.ListBuffer of scala case
classes. Therefore, I am guessing hashCode and equals should be taken care
of
Hi Akhil,
It's running fine when running through the NameNode (RM) but fails while running
through the gateway; if I add the hadoop-core jars to the Hadoop
directory (/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/) it
works fine.
It's really strange, since I am running the job through spark-submit an
What are your JVM heap size settings? The OOM in SizeEstimator is caused by a
lot of entries in an IdentityHashMap.
A quick guess is that the object in your dataset is a custom class and you
didn't implement the hashCode and equals methods correctly.
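For illustration (a sketch only; the class and fields are made up), a plain class used this way needs consistent equals and hashCode, whereas a Scala case class gets both generated automatically:

    class Reading(val id: String, val value: Double) {
      override def equals(other: Any): Boolean = other match {
        case r: Reading => r.id == id && r.value == value
        case _          => false
      }
      override def hashCode: Int = (id, value).hashCode()
    }

    // Equivalent case class: equals and hashCode come for free.
    case class ReadingCC(id: String, value: Double)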
On Wednesday, April 15, 2015 at 3:10 PM, Aniket
Can you provide your spark version?
Thanks,
Daoyuan
From: Nathan McCarthy [mailto:nathan.mccar...@quantium.com.au]
Sent: Wednesday, April 15, 2015 1:57 PM
To: Nathan McCarthy; user@spark.apache.org
Subject: Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0
Just an update, trie
Just make sure you have at least 2 cores available for processing. You can
try launching it in local[2] and make sure it's working fine.
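A minimal sketch of such a test (the host, port and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // One core for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingTest")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()
    ssc.start()
    ssc.awaitTermination()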
Thanks
Best Regards
On Tue, Apr 14, 2015 at 11:41 PM, Shushant Arora
wrote:
> Hi
>
> I am running a spark streaming application but on console nothing is
> gett
I am aggregating a dataset using the combineByKey method, and for a certain
input size, the job fails with the following error. I have enabled heap
dumps to better analyze the issue and will report back if I have any
findings. Meanwhile, if you guys have any idea of what could possibly
result in this er
So the time interval is much less than the 1000 ms you set in the code?
That's weird. Could you check the whole outputs to confirm that the content
won't be flushed by logs?
Best Regards,
Shixiong(Ryan) Zhu
2015-04-15 15:04 GMT+08:00 Shushant Arora :
> Yes only Time: 142905487 ms strings get
Yes, only "Time: 142905487 ms" strings get printed on the console.
No output is getting printed.
And the time interval between two strings of the form (time: ms) is much less
than the streaming duration set in the program.
On Wed, Apr 15, 2015 at 5:11 AM, Shixiong Zhu wrote:
> Could you see something like this