Sorry I was on vacation for a few days. Yes, it is on. This is what I have
in the logs:
15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from
server sessionid 0x14dd82e22f70ef1, likely server has closed socket,
closing socket connection and attempting reconnect
15/06/22 10:44:00
Yes, just put the Cassandra connector on the Spark classpath and set the
connector config properties in the interpreter settings.
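For illustration, a minimal sketch of the kind of connector property meant here, assuming the DataStax spark-cassandra-connector jar is already on the classpath (the host value is a placeholder; in Zeppelin the same key would go into the interpreter settings rather than code):

import org.apache.spark.{SparkConf, SparkContext}

// placeholder host; spark.cassandra.connection.host is the connector's contact-point property
val conf = new SparkConf()
  .setAppName("cassandra-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)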
From: Mohammed Guller
Date: Monday, June 22, 2015 at 11:56 AM
To: Matthew Johnson, shahid ashraf
Cc: user@spark.apache.org
Subject: RE:
I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for sure,
but it should not be difficult.
Mohammed
From: Matthew Johnson [mailto:matt.john...@algomi.com]
Sent: Monday, June 22, 2015 2:15 AM
To: Mohammed Guller; shahid ashraf
Cc: user@spark.apache.org
Subject: RE: Code
Hi Pawan,
Looking at the changes for that git pull request, it looks like it just
pulls in the dependency (and transitives) for “spark-cassandra-connector”.
Since I am having to build Zeppelin myself anyway, would it be ok to just
add this myself for the connector for 1.4.0 (as found here
Hello,
A colleague of mine ran the following Spark SQL query:
select
count(*) as uses,
count (distinct cast(id as string)) as users
from usage_events
where
from_unixtime(cast(timestamp_millis/1000 as bigint))
between '2015-06-09' and '2015-06-16'
The table contains billions of rows, but
Hi James,
What version of Spark are you using? In Spark 1.2.2 I had an issue where
Spark would report a job as complete but I couldn’t find my results
anywhere – I just assumed it was me doing something wrong as I am still
quite new to Spark. However, since upgrading to 1.4.0 I have not seen
Hi,
The documentation does not explicitly mention support for windowing and analytics functions in Spark SQL, and it looks like they are not supported.
I tried running a query like Select Lead(column_name, 1) over (Partition By column_name order by column_name) from table_name, and I got an error saying
Hi,
Zeppelin has a cassandra-spark-connector built into the build. I have not
tried it yet; maybe you could let us know.
https://github.com/apache/incubator-zeppelin/pull/79
To build a Zeppelin version with the *Datastax Spark/Cassandra connector
Running perfectly in the local system, but not writing to the file in cluster
mode. Any suggestions please?
// msgid is a long counter
JavaDStream<String> newinputStream = inputStream.map(new Function<String, String>() {
    @Override
    public String call(String v1) throws Exception {
        String
No, what I'm seeing is that while the cluster is running, I can't see the
app info after the app is completed. That is to say, when I click on the
application name on master:8080, no info is shown. However, when I examine
the same file on the History Server, the application information opens fine.
Great, thank you, Silvio. In your experience, is there any way to instrument
a callback into Coda Hale or the Spark consumers from the metrics sink? If
the sink performs some steps once it has received the metrics, I'd like to
be able to make the consumers aware of that via some sort of a
Any response to this guys?
On Fri, Jun 19, 2015 at 2:34 PM, Nitin kak nitinkak...@gmail.com wrote:
Any other suggestions guys?
On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak nitinkak...@gmail.com wrote:
With Sentry, only hive user has the permission for read/write/execute on
the subdirectories
Hey, I have exactly this question. Did you get an answer to it?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200p23431.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Can we not write some data to a txt file in parallel, with multiple
executors running in parallel?
--
Thanks Regards,
Anshu Shukla
private HTable table;
You should declare the table variable within the apply() method.
BTW, which HBase release are you using?
I see you implement caching yourself. You can make use of the following
HTable method:
public void setWriteBufferSize(long writeBufferSize) throws
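As a rough sketch only (not the poster's code, and the exact HTable calls vary by HBase release): create the table inside the function that runs on the executor instead of holding it in a field, and set the write buffer there. The table name, column family and buffer size are illustrative assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

def writePartition(rows: Iterator[String]): Unit = {
  val table = new HTable(HBaseConfiguration.create(), "my_table") // created on the executor, never serialized
  table.setAutoFlush(false, false)              // buffer puts on the client side
  table.setWriteBufferSize(4L * 1024 * 1024)    // 4 MB write buffer
  rows.foreach { r =>
    val put = new Put(Bytes.toBytes(r))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(r))
    table.put(put)
  }
  table.flushCommits()
  table.close()
}
// hypothetical usage from an RDD[String]: rdd.foreachPartition(writePartition _)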
I am using Spark Streaming. What I am trying to do is send a few messages
to some Kafka topic, where it's failing.
java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at
Hi,
Is there a way to get the YARN application ID inside a Spark application,
when running a Spark job on YARN?
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-application-ID-for-Spark-job-on-Yarn-tp23429.html
Sent from the Apache Spark User List
Hi James,
There are a few configurations that you can try:
https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
From my experience, the codegen really boosts things up. Just run
sqlContext.sql("SET spark.sql.codegen=true") before you execute your query. But
keep
Hi,
I can't get this to work using CDH 5.4, Spark 1.4.0 in yarn-cluster mode.
@andrew did you manage to get it to work with the latest version?
On Tue, Apr 21, 2015 at 00:02, Andrew Lee alee...@hotmail.com wrote:
Hi Marcelo,
Exactly what I need to track, thanks for the JIRA pointer.
Date:
Thanks for the responses, guys!
Sorry, I forgot to mention that I'm using Spark 1.3.0, but I'll test with
1.4.0 and try the codegen suggestion then report back.
On 22 June 2015 at 12:37, Matthew Johnson matt.john...@algomi.com wrote:
Hi James,
What version of Spark are you using? In Spark
Thanks,
I've updated my code to use updateStateByKey but am still getting these
errors when I resume from a checkpoint.
One thought of mine was that I used sc.parallelize to generate the RDDs for
the queue, but perhaps on resume, it doesn't recreate the context needed?
--
Shaanan Cohney
PhD
Where does task_batches come from?
On 22 Jun 2015 4:48 pm, Shaanan Cohney shaan...@gmail.com wrote:
Thanks,
I've updated my code to use updateStateByKey but am still getting these
errors when I resume from a checkpoint.
One thought of mine was that I used sc.parallelize to generate the RDDs
It's a generated set of shell commands to run (written in C, highly
optimized numerical code), which is created from a set of user-provided
parameters.
The snippet above is:
task_outfiles_to_cmds = OrderedDict(run_sieving.leftover_tasks)
Have you tested this on a smaller set to verify that the query is correct?
On Mon, Jun 22, 2015 at 2:59 PM, ayan guha guha.a...@gmail.com wrote:
You may also want to change count(*) to specific column.
On 23 Jun 2015 01:29, James Aley james.a...@swiftkey.com wrote:
Hello,
A colleague of mine
I have been using Spark for the last 6 months, with version 1.2.0.
I am trying to migrate to 1.3.0, but the same program I have written is
not working.
It's giving a class-not-found error when I try to load some dependent jars
from the main program.
This used to work in 1.2.0 when set
Yes.
Thanks
Best Regards
On Mon, Jun 22, 2015 at 8:33 PM, Murthy Chelankuri kmurt...@gmail.com
wrote:
I have more than one jar. Can we call sc.addJar multiple times, once for each
dependent jar?
On Mon, Jun 22, 2015 at 8:30 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Try sc.addJar instead
When I call rdd() on a DataFrame, it ends the current stage and starts a
new one that just maps the DataFrame to an rdd and nothing else. It doesn't
seem to do a shuffle (which is good and expected), but then why is there a
separate stage?
I also thought that stages only end when there's a
Hi Matthew,
you could add the dependencies yourself by using the %dep command in
zeppelin ( https://zeppelin.incubator.apache.org/docs/interpreter/spark.html).
I have not tried with zeppelin but have used spark-notebook
https://github.com/andypetrella/spark-notebook and got Cassandra
connector
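For reference, a hedged sketch of what such a %dep paragraph might look like; the artifact coordinate is an assumption based on the connector version mentioned elsewhere in this thread:

%dep
z.reset()
z.load("com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M1")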
Sorry thought it was scala/spark
On 6/22/2015 9:49 PM, Bob Corsaro rcors...@gmail.com wrote:
That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a
query here and not actually doing an equality operation.
On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco
Hi,
Our Spark job on YARN suddenly started failing silently without showing
any error; the following is the trace.
Using properties file: /usr/lib/spark/conf/spark-defaults.conf
Adding default property:
spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property:
Hi,
I am processing an RDD of key-value pairs. The key is a user_id, and the
value is a website URL the user has visited.
Since I need to know all the URLs each user has visited, I am tempted to
call groupByKey on this RDD. However, since there could be millions of
users and URLs,
There is reduceByKey, which works on (K, V). You need to accumulate partial
results and proceed. Does your computation allow that?
On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li flyingfromch...@gmail.com
wrote:
Hi,
I am processing an RDD of key-value pairs. The key is an user_id, and the
value is
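A minimal sketch of the reduceByKey approach suggested above, runnable in spark-shell (data and names are illustrative); it only helps if the accumulated structure per key stays manageable:

val visits = sc.parallelize(Seq(("u1", "a.com"), ("u1", "b.com"), ("u2", "a.com")))
// accumulate partial results (a set of urls) per key instead of grouping everything at once
val urlsPerUser = visits.mapValues(url => Set(url)).reduceByKey(_ ++ _)
urlsPerUser.collect().foreach(println)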
Right now, we cannot figure out which column you referenced in
`select` if there are multiple columns with the same name in the joined
DataFrame (for example, two `value` columns).
A workaround could be:
numbers2 = numbers.select(df.name, df.value.alias('other'))
rows = numbers.join(numbers2,
Hi,
suddenly our Spark job on YARN started failing silently without showing
any error; the following is the trace in verbose mode
Using properties file: /usr/lib/spark/conf/spark-defaults.conf
Adding default property:
spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default
Models that I am looking for are mostly factorization based models (which
includes both recommendation and topic modeling use-cases).
For recommendation models, I need a combination of Spark SQL and ml model
prediction api...I think spark job server is what I am looking for and it
has fast http
Did you restart your master / workers? On the master node, run
`sbin/stop-all.sh` followed by `sbin/start-all.sh`
2015-06-20 17:59 GMT-07:00 Raghav Shankar raghav0110...@gmail.com:
Hey Andrew,
I tried the following approach: I modified my Spark build on my local
machine. I downloaded
1) Can you try with yarn-cluster?
2) Does your queue have enough capacity?
On Mon, Jun 22, 2015 at 11:10 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
Hi,
I am running a simple spark streaming application on hadoop 2.7.0/YARN
(master: yarn-client) cluster with 2 different machines (12GB RAM
hi,
have you tested
s3://ww-sandbox/name_of_path/ instead of s3://ww-sandbox/name_of_path
or have you tested adding your file extension with a placeholder (*), like:
s3://ww-sandbox/name_of_path/*.gz
or
s3://ww-sandbox/name_of_path/*.csv
depending on your files. If it does not work, please test with
Many thanks, will look into this. I don't particularly want to reuse the
custom Hive UDAF I have; I would prefer writing a new one if that is
cleaner. I am just using the JVM.
On 5 June 2015 at 00:03, Holden Karau hol...@pigscanfly.ca wrote:
My current example doesn't use a Hive UDAF, but you
hi,
I am unfortunately not very well versed in the whole MLlib stuff, so I would
appreciate a little help:
Which multi-class classification algorithm should I use if I want to train
texts (100-1000 words each) into categories? The number of categories is
between 100-500 and the number of training
Hi,
I was running a WordCount application on Spark, and the machine I used has
4 physical cores. However, in spark-env.sh file, I set SPARK_WORKER_CORES
= 32. The web UI says it launched one executor with 32 cores and the
executor could execute 32 tasks simultaneously. Does spark create 32
Silvio,
Suppose my RDD is (K-1, v1, v2, v3, v4).
If I want to do simple addition I can use reduceByKey or aggregateByKey.
What if my processing needs to check all the items in the value list each
time? The above two operations do not get all the values; they just get two
at a time (v1, v2), you do some
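A hedged sketch of one way aggregateByKey can still see every value for a key, by accumulating them into a collection (runnable in spark-shell; key and value names are illustrative):

val pairs = sc.parallelize(Seq(("K-1", "v1"), ("K-1", "v2"), ("K-1", "v3"), ("K-1", "v4")))
val allValuesPerKey = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // add each value seen within a partition
  (a, b) => a ::: b       // merge the per-partition lists
)
// each key now carries its full value list, so the final step can inspect every item
allValuesPerKey.collect().foreach(println)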
OK, I figured out this.
The maximum number of containers YARN can create per node is based on the
total available RAM and the maximum allocation per container (
yarn.scheduler.maximum-allocation-mb ). The default is 8192; setting to a
lower value allowed me to create more containers per node.
On
Yes, I have the producer in the classpath. And I am using it in standalone mode.
Sent from my iPhone
On 23-Jun-2015, at 3:31 am, Tathagata Das t...@databricks.com wrote:
Do you have Kafka producer in your classpath? If so how are adding that
library? Are you running on YARN, or Mesos or
Hello All,
I am new to Spark. I have a very basic question: how do I write the output
of an action on an RDD to HDFS?
Thanks in advance for the help.
Cheers,
Ravi
Hi Chris,
Thanks for the quick reply and the welcome. I am trying to read a file from
HDFS and then write back just the first line to HDFS.
I am calling first() on the RDD to get the first line.
Sent from my iPhone
On Jun 22, 2015, at 7:42 PM, Chris Gore cdg...@cdgore.com wrote:
Hi Ravi,
Do you have the Kafka producer in your classpath? If so, how are you adding that
library? Are you running on YARN, or Mesos, or Standalone, or local? These
details will be very useful.
On Mon, Jun 22, 2015 at 8:34 AM, Murthy Chelankuri kmurt...@gmail.com
wrote:
I am using spark streaming. what i am trying
Hi everyone,
I want to announce that we have created a Spark Meetup group in Munich (Germany).
We are currently planning our first event, which will take place in July. There we will
show the basics of Spark to reach a lot of people who are new to this framework.
In the following events we will go deeper
Hi Ravi,
Welcome, you probably want RDD.saveAsTextFile(“hdfs:///my_file”)
Chris
On Jun 22, 2015, at 5:28 PM, ravi tella ddpis...@gmail.com wrote:
Hello All,
I am new to Spark. I have a very basic question.How do I write the output of
an action on a RDD to HDFS?
Thanks in advance
You're right of course, I'm sorry. I was typing before thinking about what you
actually asked!
On second thought, what is the ultimate outcome you want the sequence of
pages for? Do they need to actually all be grouped? Could you
instead partition by user id then use a mapPartitions
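A rough sketch of the partition-then-mapPartitions idea above, runnable in spark-shell (purely illustrative; assumes an RDD of (user_id, page) pairs):

import org.apache.spark.HashPartitioner
import scala.collection.mutable

val pages = sc.parallelize(Seq(("u1", "/home"), ("u2", "/a"), ("u1", "/b")))
val perUser = pages
  .partitionBy(new HashPartitioner(8))   // all records for a given user land in one partition
  .mapPartitions { iter =>
    val acc = mutable.Map.empty[String, List[String]]
    iter.foreach { case (user, page) => acc(user) = page :: acc.getOrElse(user, Nil) }
    acc.iterator                         // (user_id, pages seen in this partition)
  }
perUser.collect().foreach(println)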
Hi,
In Spark 1.4, you may use DataFrame.stat.crosstab to generate the confusion
matrix. This would be very simple if you are using the ML Pipelines Api,
and are working with DataFrames.
Best,
Burak
On Mon, Jun 22, 2015 at 4:21 AM, CD Athuraliya cdathural...@gmail.com
wrote:
Hi,
I am looking
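For illustration only, a minimal Spark 1.4 sketch of the crosstab suggestion, runnable in spark-shell (column names and data are assumptions):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val predictions = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0)).toDF("prediction", "label")
val confusion = predictions.stat.crosstab("prediction", "label") // rows: prediction, columns: label
confusion.show()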
Hello, everyone! I'm new to Spark. I have already written programs in
Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I
want to move my code to Spark using the Java language. The first problem I
encountered is how to transform a big txt file in local storage to an RDD, which
is
Hi Team,
How do I split and put the read JavaDStream<String> into MySQL in Java?
Is there any existing API in Spark 1.3/1.4?
Team, can you please share a code snippet if anybody has one?
Thanks,
Manohar
The Spark docs address this pretty well. Look for the usage patterns of
foreachRDD.
On 22 Jun 2015 17:09, Manohar753 manohar.re...@happiestminds.com wrote:
Hi Team,
How to split and put the red JavaDStreamString in to mysql in java.
any existing api in sark 1.3/1.4.
team can you please share
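A hedged Scala sketch of the foreachRDD pattern referred to above (the original question is about Java; the JDBC URL, credentials and table are illustrative assumptions, and the MySQL driver must be on the classpath):

import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

def saveToMySql(stream: DStream[String]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // one connection per partition, opened on the executor
      val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO messages(msg) VALUES (?)")
      records.foreach { r => stmt.setString(1, r); stmt.executeUpdate() }
      stmt.close()
      conn.close()
    }
  }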
Counts is a list (counts = []) in the driver, used to collect the results.
It seems like it's also not the best way to be doing things, but I'm new to
spark and editing someone else's code so still learning.
Thanks!
def update_state(out_files, counts, curr_rdd):
    try:
        for c in
I am trying to run a function on every line of a parquet file. The function
is in an object. When I run the program, I get an exception that the object
is not serializable. I read around the internet and found that I should use
Kryo Serializer. I changed the setting in the spark conf and
I'm receiving the SPARK-5063 error (RDD transformations and actions can
only be invoked by the driver, not inside of other transformations)
whenever I try and restore from a checkpoint in spark streaming on my app.
I'm using python3 and my RDDs are inside a queuestream DStream.
This is the
But I just want to update the RDD by appending a unique message ID to
each element of the RDD, which should be automatically (m++ ...) updated every
time a new element comes into the RDD.
On Mon, Jun 22, 2015 at 7:05 AM, Michal Čizmazia mici...@gmail.com wrote:
StreamingContext.sparkContext()
On 21
Hello,
I'm writing an application in Scala to connect to Cassandra to read the
data.
My setup is IntelliJ with Maven. When I try to compile the application I
get the following error: object datastax is not a member of package com
error: value cassandraTable is not a member of
You can use fileStream for that; look at the XmlInputFormat
https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
of Mahout. It should give you a full XML object as one record (as opposed to
an XML
What does counts refer to?
Could you also paste the code of your update_state function?
On 22 Jun 2015 12:48 pm, Shaanan Cohney shaan...@gmail.com wrote:
I'm receiving the SPARK-5063 error (RDD transformations and actions can
only be invoked by the driver, not inside of other transformations)
It's fixed now; adding the dependency in pom.xml fixed it:
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector-embedded_2.10</artifactId>
  <version>1.4.0-M1</version>
</dependency>
On Mon, Jun 22, 2015 at 10:46 AM, Koen Vantomme koen.vanto...@gmail.com
wrote:
Hello,
Could you elaborate a bit more? What do you mean by setting up a standalone
server? And what is leading you to those exceptions?
Thanks
Best Regards
On Mon, Jun 22, 2015 at 2:22 AM, nizang ni...@windward.eu wrote:
hi,
I'm trying to setup a standalone server, and in one of my tests, I got the
Hi,
I am looking for a way to get confusion matrix for binary classification. I
was able to get confusion matrix for multiclass classification using this
[1]. But I could not find a proper way to get confusion matrix in similar
class available for binary classification [2]. Later I found this
I would suggest you have a look at the updateStateByKey transformation in
the Spark Streaming programming guide which should fit your needs better
than your update_state function.
On 22 Jun 2015 1:03 pm, Shaanan Cohney shaan...@gmail.com wrote:
Counts is a list (counts = []) in the driver, used
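A minimal Scala sketch of the updateStateByKey pattern being suggested (the poster's code is Python; key and value types here are illustrative, and a checkpoint directory is assumed to be set on the StreamingContext):

import org.apache.spark.streaming.dstream.DStream

// keep a running count per key across batches
def runningCounts(events: DStream[(String, Int)]): DStream[(String, Int)] =
  events.updateStateByKey { (newValues: Seq[Int], state: Option[Int]) =>
    Some(state.getOrElse(0) + newValues.sum)
  }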
On 22 Jun 2015, at 04:08, Shawn Garbett
shawn.garb...@gmail.com wrote:
2015-06-21 11:03:22,029 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Container [pid=39288,containerID=container_1434751301309_0015_02_01]
Thanks Mohammed, it’s good to know I’m not alone!
How easy is it to integrate Zeppelin with Spark on Cassandra? It looks like
it would only support Hadoop out of the box. Is it just a case of dropping
the Cassandra Connector onto the Spark classpath?
Cheers,
Matthew
*From:* Mohammed
If I am not mistaken, one way to see the accumulators is that they are just
write-only for the workers and their value can be read by the driver.
Therefore they cannot be used for ID generation as you wish.
On 22 June 2015 at 04:30, anshu shukla anshushuk...@gmail.com wrote:
But i just want to
Totally depends on the use case that you are solving with Spark. For
instance, there was some discussion around the same topic, which you could read
over here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-one-decide-no-of-executors-cores-memory-allocation-td23326.html
Thanks
Best Regards
It's pretty straightforward; this would get you started:
http://stackoverflow.com/questions/24896233/how-to-save-apache-spark-schema-output-in-mysql-database
Thanks
Best Regards
On Mon, Jun 22, 2015 at 12:39 PM, Manohar753
manohar.re...@happiestminds.com wrote:
Hi Team,
How to split and
1.4 supports it
On 23 Jun 2015 02:59, Sourav Mazumder sourav.mazumde...@gmail.com wrote:
Hi,
Though the documentation does not explicitly mention support for Windowing
and Analytics function in Spark SQL, looks like it is not supported.
I tried running a query like Select Lead(column name,
Hi James,
Maybe it's the DISTINCT causing the issue.
I rewrote the query as follows. Maybe this one can finish faster.
select
sum(cnt) as uses,
count(id) as users
from (
select
count(*) cnt,
cast(id as string) as id
from usage_events
where
I have been unsuccessful with incorporating an external Jar into a SparkR
program. Does anyone know how to do this successfully?
JarTest.java
=
package com.myco;
public class JarTest {
public static double myStaticMethod() {
return 5.515;
}
}
=
Unfortunately there is not a great way to do it without modifying Spark to
print more things it reads from the stream.
2015-06-20 23:10 GMT-07:00 John Meehan meeh...@dls.net:
Yes, it seems to be consistently "port out of range:1315905645". Is there
any way to see what the python process is
You may also want to change count(*) to a specific column.
On 23 Jun 2015 01:29, James Aley james.a...@swiftkey.com wrote:
Hello,
A colleague of mine ran the following Spark SQL query:
select
count(*) as uses,
count (distinct cast(id as string)) as users
from usage_events
where
Probably you should use === instead of == and !== instead of !=
Can anyone explain why the dataframe API doesn't work as I expect it to
here? It seems like the column identifiers are getting confused.
https://gist.github.com/dokipen/4b324a7365ae87b7b0e5
Hi,
I am running a simple spark streaming application on hadoop 2.7.0/YARN
(master: yarn-client) cluster with 2 different machines (12GB RAM with 8
CPU cores each).
I am launching my application like this:
~/myapp$ ~/my-spark/bin/spark-submit --class App --master yarn-client
--driver-memory 4g
Generally (not only Spark SQL specific), you should not cast in the WHERE
part of a SQL query. It is also not necessary in your case. Getting rid of
the casts in the whole query will also be beneficial.
On Mon, Jun 22, 2015 at 17:29, James Aley james.a...@swiftkey.com wrote:
Hello,
A colleague
Is spoutLog just a non-Spark file writer? If you run that in the map call
on a cluster, it's going to be writing to the filesystem of the executor it's
being run on. I'm not sure if that's what you intended.
On Mon, Jun 22, 2015 at 1:35 PM, anshu shukla anshushuk...@gmail.com
wrote:
Running
Thanks for the reply!!
YES, either it should write on any machine of the cluster, or can you please
help me with how to do this. Previously I was writing using
collect(), so some of my tuples are missing while writing.
//previous logic that was just creating the file on master -
Can anyone explain why the dataframe API doesn't work as I expect it to
here? It seems like the column identifiers are getting confused.
https://gist.github.com/dokipen/4b324a7365ae87b7b0e5
That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a
query here and not actually doing an equality operation.
On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco elnopin...@gmail.com wrote:
Probably you should use === instead of == and !== instead of !=
Can anyone explain why
Hi Ravi,
For this case, you could simply do
sc.parallelize([rdd.first()]).saveAsTextFile(“hdfs:///my_file”) using pyspark
or sc.parallelize(Array(rdd.first())).saveAsTextFile(“hdfs:///my_file”) using
Scala
Chris
On Jun 22, 2015, at 5:53 PM, ddpis...@gmail.com wrote:
Hi Chris,
Thanks for
My hunch is that you changed spark.serializer to Kryo but left
spark.closureSerializer unmodified, so it's still using Java for closure
serialization. Kryo doesn't really work as a closure serializer but
there's an open pull request to fix this:
https://github.com/apache/spark/pull/6361
On Mon,
Hi Burak,
Thanks for the response. I am using Spark version 1.3.0 through Java API.
Regards,
CD
On Tue, Jun 23, 2015 at 5:11 AM, Burak Yavuz brk...@gmail.com wrote:
Hi,
In Spark 1.4, you may use DataFrame.stat.crosstab to generate the
confusion matrix. This would be very simple if you are
Is there any way to retrieve the time of each message's arrival into a Kafka
topic, when streaming in Spark, whether with receiver-based or direct
streaming?
Thanks.
Yes, it should be with HiveContext, not SQLContext.
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Tuesday, June 23, 2015 2:51 AM
To: smazumder
Cc: user
Subject: Re: Support for Windowing and Analytics functions in Spark SQL
1.4 supports it
On 23 Jun 2015 02:59, Sourav Mazumder
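For illustration, a hedged Spark 1.4 sketch of running a LEAD window query through HiveContext, runnable in spark-shell (table and column names are placeholders):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // window functions require HiveContext in 1.4
val withNext = hiveContext.sql(
  """SELECT col_a,
    |       LEAD(col_a, 1) OVER (PARTITION BY col_b ORDER BY col_c) AS next_a
    |FROM my_table""".stripMargin)
withNext.show()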
How large are your models?
Spark job server does allow synchronous job execution and with a warm
long-lived context it will be quite fast - but still in the order of a second
or a few seconds usually (depending on model size - for very large models
possibly quite a lot more than that).
It’s actually not that tricky.
SPARK_WORKER_CORES is the max task thread pool size of the executor,
which matches the statement "one executor with 32 cores and the executor could execute
32 tasks simultaneously". Spark doesn't care how many real physical
CPUs/cores you have (the OS does), so
How are you submitting the application? Could you paste the code that you
are running?
Thanks
Best Regards
On Mon, Jun 22, 2015 at 5:37 PM, Sean Barzilay sesnbarzi...@gmail.com
wrote:
I am trying to run a function on every line of a parquet file. The
function is in an object. When I run the
Like this?
val rawXmls = ssc.fileStream[LongWritable, Text, XmlInputFormat](path)
Thanks
Best Regards
On Mon, Jun 22, 2015 at 5:45 PM, Yong Feng fengyong...@gmail.com wrote:
Thanks a lot, Akhil
I saw this mail thread before, but still do not understand how
Hi All ,
What is the best way to install a Spark cluster alongside a Hadoop
cluster? Any recommendation for the deployment topology below will be a great
help.
Also, is it necessary to put the Spark Worker on the DataNodes, so that when it
reads a block from HDFS it will be local to the server/worker, or
My program is written in Scala. I am creating a jar and submitting it using
spark-submit.
My code is on a computer in an internal network with no internet, so I
can't send it.
On Mon, Jun 22, 2015, 3:19 PM Akhil Das ak...@sigmoidanalytics.com wrote:
How are you submitting the application? Could
Thanks a lot, Akhil
I saw this mail thread before, but still do not understand how to use the
XmlInputFormat of Mahout in Spark Streaming (I am not a Spark Streaming expert
yet ;-)). Can you show me some sample code as an explanation?
Thanks in advance,
Yong
On Mon, Jun 22, 2015 at 6:44 AM, Akhil Das
Option 1 should be fine; Option 2 would be bound a lot by the network as the data
increases over time.
Thanks
Best Regards
On Mon, Jun 22, 2015 at 5:59 PM, Ashish Soni asoni.le...@gmail.com wrote:
Hi All ,
What is the Best Way to install and Spark Cluster along side with Hadoop
Cluster , Any
Thanks Akhil
I will have a try and then go back to you
Yong
On Mon, Jun 22, 2015 at 8:25 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Like this?
val rawXmls = ssc.fileStream[LongWritable, Text, XmlInputFormat](path)
Thanks
Best Regards
On Mon, Jun
I stumbled upon zipWithUniqueId/zipWithIndex. Is this what you are looking
for?
https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDDLike.html#zipWithUniqueId()
On 22 June 2015 at 06:16, Michal Čizmazia mici...@gmail.com wrote:
If I am not mistaken, one way to see
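A tiny sketch of the zipWithUniqueId suggestion, runnable in spark-shell (data is illustrative):

val msgs = sc.parallelize(Seq("tuple-a", "tuple-b", "tuple-c"))
val withIds = msgs.zipWithUniqueId()   // RDD[(String, Long)] with a unique Long per element
withIds.collect().foreach { case (msg, id) => println(s"$id -> $msg") }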
Hi Das,
Thanks for your reply. Somehow I missed it..
I am using Spark 1.3. The data source is from kafka.
Yeah, not sure why the delay is 0. I'll run against 1.4 and give a screenshot.
Thanks,
Mike
From: Akhil Das ak...@sigmoidanalytics.com
Date: Thursday, June
Hi Gerard,
Have there been any responses? Any insights as to what you ended up doing to
enable custom metrics? I'm thinking of implementing a custom metrics sink,
not sure how doable that is yet...
Thanks.
Hi,
I was wondering if there've been any responses to this?
Thanks.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Metrics-Sink-tp10068p23425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.