Many Thanks Silvio,
Someone also suggested using something similar :
./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
Regards
jk
On Fri, May 8, 2015 at 2:12 AM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Hi James,
If you’re on Spark 1.3 you can use
I think you could use checkpoint to cut the lineage of `MyRDD`; I have a
similar scenario and I use checkpoint to work around this problem :)
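A minimal sketch of that workaround (the RDD contents and checkpoint directory below are made up for illustration; `sc` is an existing SparkContext, e.g. from the shell):

  sc.setCheckpointDir("hdfs:///tmp/checkpoints")    // checkpointed data goes to a reliable store
  val myRDD = sc.parallelize(1 to 1000).map(_ * 2)  // stand-in for the real MyRDD
  myRDD.checkpoint()                                // mark it; lineage is truncated once it is materialized
  myRDD.count()                                     // an action forces the checkpoint to be written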
Thanks
Jerry
-Original Message-
From: yaochunnan [mailto:yaochun...@gmail.com]
Sent: Friday, May 8, 2015 1:57 PM
To: user@spark.apache.org
Resolved by creating a UDF to match the different formats of DateType in Spark SQL
and Hive.
I suggest fixing this in the next release.
On Fri, May 8, 2015 at 11:07 AM, Gerald-G shadowinl...@gmail.com wrote:
Hi:
Spark version is 1.3.1
I used sparksql insert data into a hive table with datetype partition
The
Thank you for this suggestion! But may I ask what the advantage of using
checkpoint instead of cache is here? Because they both cut lineage. I only know
that checkpoint saves the RDD to disk, while cache keeps it in memory. So maybe
it's for reliability?
Also on
Have a look at this SO
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application
question; it has a discussion of the various ways of accessing S3.
Thanks
Best Regards
On Fri, May 8, 2015 at 1:21 AM, in4maniac sa...@skimlinks.com wrote:
Hi
Imagine an input stream transformed by updateStateByKey, based on some
state.
As an output of the transformation, I would like to have a stream of the state
changes only - not the stream of states themselves.
What is the natural way of obtaining such a stream?
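One possible shape for this, sketched purely as an illustration (keeping a "changed" flag next to the state and filtering on it is my assumption, not something proposed in the thread; 'events' is a hypothetical DStream[(String, Int)]):

  val states = events.updateStateByKey[(Int, Boolean)] {
    (values: Seq[Int], old: Option[(Int, Boolean)]) =>
      val prev = old.map(_._1).getOrElse(0)
      val next = prev + values.sum
      Some((next, next != prev))   // keep the state plus a "changed in this batch" flag
  }
  val changesOnly = states.filter { case (_, (_, changed)) => changed }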
Hi Cui,
Try to read the Scala version of LDAExample:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
The matrix you're referring to is the corpus after vectorization.
One example, given a dict, [apple, orange, banana]
3
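As an illustration of what that vectorized corpus looks like (the document text below is made up; only the three-word dict comes from the mail):

  import org.apache.spark.mllib.linalg.Vectors
  // dict: apple -> index 0, orange -> index 1, banana -> index 2
  // the document "apple apple banana" becomes a length-3 term-count vector:
  val doc = Vectors.dense(2.0, 0.0, 1.0)
  // LDA then takes an RDD[(Long, Vector)] of (document id, term-count vector)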
What's your use case and what are you trying to achieve? Maybe there's a
better way of doing it.
Thanks
Best Regards
On Fri, May 8, 2015 at 10:20 AM, Richard Alex Hofer rho...@andrew.cmu.edu
wrote:
Hi,
I'm working on a project in Spark and am trying to understand what's going
on. Right now to
Hi!
I've been using Spark for the last few months and it is awesome. I'm pretty new to
this topic so don't be too harsh on me.
Recently I've been doing some simple tests with Spark Streaming for log
processing and I'm considering different ETL input solutions such as Flume or
PDI+Kafka.
My use
I want to filter a DataFrame based on a Date column.
If the DataFrame object is constructed from a Scala case class, it works
(comparing either as String or as Date). But if the DataFrame is
generated by applying a schema to an RDD, it doesn't work. Below are
the exception and test code.
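Purely as an illustration of the second case being described (an explicit schema with a DateType column applied to an RDD, then a filter on that column), a rough sketch with made-up names and data:

  import java.sql.Date
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

  val rowRDD = sc.parallelize(Seq(Row("a", Date.valueOf("2015-05-08"))))
  val schema = StructType(Seq(
    StructField("name", StringType),
    StructField("d", DateType)))
  val df = sqlContext.createDataFrame(rowRDD, schema)
  df.filter(df("d") >= "2015-01-01").show()   // the kind of comparison that reportedly fails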
IIUC only checkpoint will clear the lineage information; cache will not cut the
lineage. Also, checkpoint puts the data in HDFS, not on local disk :)
I think you can use foreachRDD to do such RDD update work; it looks OK as far as I
can tell from your code snippet.
From: Chunnan Yao
Since it's loading 24 records, it could be that your CSV is corrupted (maybe
the newline char isn't \n but \r\n if it comes from a Windows
environment; you can check this with *cat -v yourcsvfile.csv | more*).
Thanks
Best Regards
On Fri, May 8, 2015 at 11:23 AM, luohui20...@sina.com wrote:
I don't think you can use rawSocketStream since the RSVP is from a web
server and you will have to send a GET request first to initialize the
communication. You are better off writing a custom receiver
https://spark.apache.org/docs/latest/streaming-custom-receivers.html for
your use case. For a
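Purely as a sketch of what such a custom receiver could look like (the URL and the line-by-line polling logic are hypothetical, not from the thread):

  import scala.io.Source
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class HttpReceiver(url: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      new Thread("HTTP Receiver") {
        override def run(): Unit = {
          while (!isStopped) {
            // issue a GET request and push each line of the response into Spark
            Source.fromURL(url).getLines().foreach(line => store(line))
          }
        }
      }.start()
    }

    def onStop(): Unit = { }  // the polling thread checks isStopped and exits on its own
  }

  // usage: ssc.receiverStream(new HttpReceiver("http://example.com/rsvps"))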
So does this sleep occur before allocating resources for the first few
executors to start the job?
On Fri, May 8, 2015 at 6:23 AM Taeyun Kim taeyun@innowireless.com
wrote:
I think I’ve found the (maybe partial, but major) reason.
It’s between the following lines, (it’s newly captured,
Imagine an input stream transformed by updateStateByKey, based on some state.
As an output of the transformation, I would like to have a stream of the state
changes only - not the stream of states themselves.
What is the natural way of obtaining such a stream?
From S3, as the dependency of df will be on S3, and because RDDs are not
replicated.
On 8 May 2015 23:02, Peter Rudenko petro.rude...@gmail.com wrote:
Hi, I have a question:
val data = sc.textFile("s3:///")
val df = data.toDF
df.saveAsParquetFile("hdfs://")
df.someAction(...)
if during
Hm, thanks.
Do you know what this setting mean:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1178
?
Thanks,
Peter Rudenko
On 2015-05-08 17:48, ayan guha wrote:
From S3, as the dependency of df will be on S3, and because RDDs are
This is an interesting question. I don't have a solution for you, but you
may be interested in taking a look at Anaconda Cluster
http://continuum.io/anaconda-cluster.
It's made by the same people behind Conda (an alternative to pip focused on
data science packages) and may offer a better way of
Why does this not work:
./spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class SomeApp --deploy-mode
cluster --supervise --master spark://host01:7077,host02:7077 Some.jar
With exception:
Caused by: java.lang.NumberFormatException: For input string:
7077,host02:7077
It seems to accept only one
Dear Usergroup,
I am struggling to use the SparkR package that comes with Apache Spark 1.4.0.
I am having trouble getting the tutorials in the original
amplabs-extra/SparkR-pkg working.
Please see my stackoverflow question with a bounty for more details...here
I have two hosts host01 and host02 (let's call them)
I run one Master and two Workers on host01
I also run one Master and two Workers on host02
Now I have 1 LIVE Master on host01 and a STANDBY Master on host02
The LIVE Master is aware of all Workers in the cluster
Now I submit a Spark
BTW I'm using Spark 1.3.0.
Thanks
On Fri, May 8, 2015 at 5:22 PM, James King jakwebin...@gmail.com wrote:
I have two hosts host01 and host02 (let's call them)
I run one Master and two Workers on host01
I also run one Master and two Workers on host02
Now I have 1 LIVE Master on host01 and a
Newbie question...
Can I use any of the main ML capabilities of MLlib in a Java-only
environment, without any native library dependencies?
According to the documentation, java-netlib provides a JVM fallback. This
suggests that native netlib libraries are not required.
It appears that such a
i searched the jiras but couldn't find any recent mention of this. let me
try with 1.4.0 branch and see if it goes away...
On Wed, May 6, 2015 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote:
hello all,
i build spark 1.3.1 (for cdh 5.3 with yarn) twice: for scala 2.10 and
scala 2.11. i am
Hi John,
I have been using MLlib without installing the jblas native dependency.
Functionally I have not gotten stuck. I still need to explore whether there are any
performance hits.
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Fri, May 8,
Hi
I have an application that currently runs using MR. It currently starts by
extracting information from a proprietary binary file that is copied to
HDFS. The application starts by creating business objects from information
extracted from the binary files. Later those objects are used for further
That's a feature flag for a new code path for reading Parquet files. It's
only there in case bugs are found in the old path and will be removed once
we are sure the new path is solid.
On Fri, May 8, 2015 at 8:04 AM, Peter Rudenko petro.rude...@gmail.com
wrote:
Hm, thanks.
Do you know what this
I am using the Spark Cassandra connector to work with a table with 3
million records, using the .where() API to work with only certain rows in
this table. The where clause filters the data to 1 row.
CassandraJavaUtil.javaFunctions(sparkContext) .cassandraTable(KEY_SPACE,
MY_TABLE,
Any update to the above mail?
Also, can anyone tell me the logic: I have to filter tweets and submit tweets
with a particular #hashtag1 to Spark SQL databases, while tweets with
#hashtag2 will be passed to a sentiment analysis phase. The problem is how to
split the input data into two streams using hashtags.
On
If you’re using multiple masters with ZooKeeper then you should set your master
URL to be
spark://host01:7077,host02:7077
And the property spark.deploy.recoveryMode=ZOOKEEPER
See here for more info:
http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper
What version of Spark are you using? It appears that at least in master we
are doing the conversion correctly, but it's possible older versions of
applySchema do not. If you can reproduce the same bug in master, can you
open a JIRA?
On Fri, May 8, 2015 at 1:36 AM, Haopu Wang hw...@qilinsoft.com
Yes, at this point I believe you'll find jblas used for historical reasons,
to not change some APIs. I don't believe it's used for much if any
computation in 1.4.
On May 8, 2015 5:04 PM, John Niekrasz john.niekr...@gmail.com wrote:
Newbie question...
Can I use any of the main ML capabilities
I am implementing the lambda architecture using Apache Spark for both
streaming and batch processing.
For real-time queries I'm using Spark Streaming with Cassandra, and for batch
queries I'm using Spark SQL and Spark MLlib.
The problem I'm facing now is: I need to implement one serving layer,
Do you have any more specific profiling data that you can share? I'm
curious to know where AppendOnlyMap.changeValue is being called from.
On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com
wrote:
+dev
On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:
Why are you not using Cassandra for storing the pre-computed views?
Mohammed
-Original Message-
From: rafac [mailto:rafaelme...@hotmail.com]
Sent: Friday, May 8, 2015 1:48 PM
To: user@spark.apache.org
Subject: Lambda architecture using Apache Spark
I am implementing the lambda
Hi,
I am pretty new to Spark/Spark Streaming so please excuse my naivety. I have a
streaming event stream that is timestamped and I would like to aggregate it
into, let's say, hourly buckets. Now the simple answer is to use a window
operation with a window length of 1 hr and a sliding interval of
If I understand you correctly, you need a window duration of 1 hour and a sliding
interval of 5 seconds.
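A rough sketch of that (the socket source and keys are made up; only the 1-hour window and 5-second slide come from the thread; `sc` is an existing SparkContext):

  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(5))
  val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))
  // 1-hour window, recomputed every 5 seconds
  val hourlyCounts = events.reduceByKeyAndWindow(_ + _, Minutes(60), Seconds(5))
  hourlyCounts.print()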
Mohammed
-Original Message-
From: Ankur Chauhan [mailto:achau...@brightcove.com]
Sent: Friday, May 8, 2015 2:27 PM
To: u...@spark.incubator.apache.org
Subject: Spark streaming
Hi, I would like to create a Hive table on top of an existing Parquet file as
described here:
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
Due to network restrictions, I need to store the metadata definition in a
different path than '/user/hive/warehouse', so I
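A rough sketch of the kind of DDL being described, issued through a HiveContext (the path and columns are made up for illustration):

  sqlContext.sql("""
    CREATE EXTERNAL TABLE my_table (id INT, name STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///some/other/path/my_table'
  """)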
Hi, I have a question:
val data = sc.textFile("s3:///")
val df = data.toDF
df.saveAsParquetFile("hdfs://")
df.someAction(...)
If during someAction some workers die, would the recomputation download
the files from S3 or read from the HDFS Parquet file?
Thanks,
Peter Rudenko
So, why isn't this comment/question being posted to the list?
Can you attach the full stack trace?
Thanks,
Yin
On Fri, May 8, 2015 at 4:44 PM, barmaley o...@solver.com wrote:
Given a registered table from a data frame, I'm able to execute queries like
sqlContext.sql("SELECT STDDEV(col1) FROM table") from the Spark shell just
fine.
However, when I run exactly
hey mike-
as you pointed out here from my docs, changing the stream name is sometimes
problematic due to the way the Kinesis Client Library manages leases and
checkpoints, etc in DynamoDB.
I noticed this directly while developing the Kinesis connector which is why I
highlighted the issue
Thanks Michael for the quick reply. I was looking forward to the automatic
schema inference (I think that's what you mean by 'schema merging'?), and I
think the STORED AS would still require me to define the table columns,
right?
Anyway, I am glad to hear you guys are already working to fix this in a future
What are you trying to accomplish? Internally Spark SQL will add Exchange
operators to make sure that data is partitioned correctly for joins and
aggregations. If you are going to do other RDD operations on the result of
dataframe operations and you need to manually control the partitioning,
Just trying to make sure that something I know in advance (the joins will
always have an equality test on one specific field) is used to optimize the
partitioning so the joins only use local data.
Thanks for the info.
Ron
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday,
Actually, I was talking about the support for inferring different but
compatible schemata from various files, automatically merging them into a
single schema. However, you are right that I think you need to specify the
columns / types if you create it as a Hive table.
On Fri, May 8, 2015 at 3:11
Finally got it working.
I was trying to access Hive using the JDBC driver, the same way I was
accessing Teradata.
It took me some time to figure out that the default sqlContext created by Spark
supports Hive and uses the hive-site.xml in the Spark conf folder to access
Hive.
I had to use my
HI GUYS... I realised that it was a bug in my code that caused the code to
break. I was running the filter on a SchemaRDD when I was supposed to be
running it on an RDD.
But I still don't understand why the stderr was about an S3 request rather than
a type-checking error such as No tuple position
I am just wondering if CREATE TABLE supports the syntax
CREATE TABLE db.tablename
instead of the two-step process of USE db and then CREATE TABLE tablename?
On 9 May 2015 08:17, Michael Armbrust mich...@databricks.com wrote:
Actually, I was talking about the support for inferring different but
Hey Chris!
I was happy to see the documentation outlining that issue :-) However, I
must have gotten into a pretty terrible state, because I had to delete and
recreate the Kinesis streams as well as the DynamoDB tables.
Thanks for the reply, everything is sorted.
Mike
On Fri, May 8, 2015 at
Do as Evo suggested: rdd1 = rdd.filter(...), rdd2 = rdd.filter(...).
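A minimal sketch of that (the DStream name 'tweets' and the hashtags are placeholders, not from the thread):

  // 'tweets' is assumed to be an existing DStream[String] of tweet texts
  val forSql       = tweets.filter(_.contains("#hashtag1"))  // goes to the Spark SQL path
  val forSentiment = tweets.filter(_.contains("#hashtag2"))  // goes to the sentiment analysis path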
On 9 May 2015 05:19, anshu shukla anshushuk...@gmail.com wrote:
Any update to above mail
and Can anyone tell me logic - I have to filter tweets and submit tweets
with particular #hashtag1 to SparkSQL databases and tweets with
- [Kinesis stream name]: The Kinesis stream that this streaming application receives from
- The application name used in the streaming context becomes the Kinesis application name
- The application name must be unique for a given account and region.
- The Kinesis
Given a registered table from a data frame, I'm able to execute queries like
sqlContext.sql("SELECT STDDEV(col1) FROM table") from the Spark shell just fine.
However, when I run exactly the same code in a standalone app on a cluster,
it throws an exception: java.util.NoSuchElementException: key not found:
Hi All,
I am submitting the assembled fat jar file by the command:
bin/spark-submit --jars /spark-streaming-kinesis-asl_2.10-1.3.0.jar --class
com.xxx.Consumer -0.1-SNAPSHOT.jar
It reads the data from Kinesis using the stream name defined in a
configuration file. It turns out that it