Hi all,
While running the Spark word count Python example with an intentional mistake in *Yarn
cluster mode*, the Spark terminal reports the final status as SUCCEEDED, but the log
files report the correct result, indicating that the job failed.
Why do the terminal log output and the application log output contradict each other?
If
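In case it helps others hitting this, a minimal Scala sketch of the general principle (not the original Python example): in yarn-cluster mode the final status normally tracks whether an exception escapes the driver's main method, so an error that is caught and only logged can leave the status SUCCEEDED while the logs show the failure.
import org.apache.spark.{SparkConf, SparkContext}
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    try {
      sc.textFile(args(0))
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .saveAsTextFile(args(1))
      // let any failure propagate out of main so YARN can report FAILED;
      // catching it here and merely logging it would leave the status SUCCEEDED
    } finally {
      sc.stop()
    }
  }
}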
Hi,
When I ran the following simple Spark program with spark-submit:
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark._
import SparkContext._
object TEST2{
def
On 23 Jul 2015, at 10:47, Greg Anderson gregory.ander...@familysearch.org
wrote:
So when I go to ~/ephemeral-hdfs/bin/hadoop and check its version, it says
Hadoop 2.0.0-cdh4.2.0. If I run pyspark and use the s3a address, things
should work, right? What am I missing? And thanks so
Hi,
When I create a DataFrame through the Spark SQLContext and then register a temp
table, I can use the %sql Zeppelin interpreter to open a nice SQL paragraph.
If, on the other hand, I do the same through a HiveContext, I can't see those
tables in %sql show tables.
Is there a way to query the
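A hunch, sketched below with a hypothetical file path: temp tables live in the context that registered them, so a HiveContext you instantiate yourself keeps its own registry, invisible to the sqlContext Zeppelin binds to %sql.
// registering on Zeppelin's shared sqlContext makes the table visible to %sql
val df = sqlContext.read.json("/tmp/events.json")  // hypothetical input path
df.registerTempTable("events")
// by contrast, new org.apache.spark.sql.hive.HiveContext(sc) keeps a separate
// temp table registry, so tables registered there won't show in %sql show tables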
I am new to Spark and needed help in figuring out why my Hive databases are
not accessible to perform a data load through Spark.
Background:
1.
I am running Hive, Spark, and my Java program on a single machine. It's
a Cloudera QuickStart VM, CDH5.4x, on a VirtualBox.
2.
I have
Please check if your metastore service is running. You may need to switch
on automatic metastore service restart on restart of the VM.
On 24 Jul 2015 06:20, Mithila Joshi joshi.mith...@gmail.com wrote:
I am new to Spark and needed help in figuring out why my Hive databases
are not accessible to
You are probably listening to the sample stream, and THEN filtering. This
means you listen to 1% of the Twitter stream and then look for the tweet by
Bloomberg, so there is a very good chance you don't see the particular tweet.
In order to get all Bloomberg related tweets, you must connect to
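A minimal sketch of the alternative: pass the keywords to createStream so Twitter sends only matching tweets, instead of filtering the 1% sample locally.
import org.apache.spark.streaming.twitter.TwitterUtils
// the filtered endpoint delivers all tweets matching the track keywords
val stream = TwitterUtils.createStream(ssc, None, Seq("Bloomberg"))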
After several tests, it turns out that wurfl itself is not thread-safe. That
causes the problem when more than one mapPartitions is running: the wurfl engines
conflict. I don't know if there is a better way than handling the wurfl
lookup outside.
Thanks,
Zhongxiao
From:
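In case it's useful, one containment pattern, sketched with a hypothetical WurflEngine/lookup API: create one engine per partition inside mapPartitions so no instance is shared between concurrently running tasks.
val enriched = records.mapPartitions { iter =>
  // hypothetical API: a private engine per partition sidesteps the thread-safety issue
  val engine = new WurflEngine()
  iter.map(record => engine.lookup(record))
}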
Hi There,
I am testing Spark DataFrame and haven't been able to get my code to finish
due to what I suspect are GC issues. My guess is that GC interferes with
heartbeating and executors are detected as failed. The data is ~50 numeric
columns, ~100 million rows in a CSV file.
We are doing a groupBy
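If the executors really are GC-pausing through their heartbeats, one thing to try (a sketch; the keys are the 1.4-era names and the values are only a starting point):
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.heartbeatInterval", "30s")  // more slack between heartbeats
  .set("spark.network.timeout", "300s")            // raise the timeout that marks executors lost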
I tried with an RDD[DenseVector], but RDDs are not covariant in T, so an
RDD[DenseVector] is not an RDD[Vector], and I can't use the RDD input method
of correlation.
Thanks,
Saif
hello spark community,
i have built an application with geomesa, accumulo and spark.
it works in spark local mode, but not on the spark
cluster. in short it says: No space left on device. Asked to remove
non-existent executor XY.
I'm confused, because there were many GBs of free
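"No space left on device" during shuffles usually means the scratch directories (often /tmp) filled up, not HDFS. A hedged sketch of pointing them at a larger volume (on YARN the node manager's local dirs take precedence instead):
val conf = new org.apache.spark.SparkConf()
  .set("spark.local.dir", "/data/spark-tmp")  // hypothetical path with room for shuffle spills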
Hi All,
I am trying to set up Eclipse (LUNA) with Maven so that I can create
Maven projects for developing Spark programs. I am having some issues and I
am not sure what the issue is.
Can anyone share a nice step-by-step document on configuring Eclipse with Maven
for Spark development?
Hi Akhil,
Thank you for sending this code. My apologies if I ask something obvious
here, since I'm a newbie in Scala, but I still don't see how I can
use this code. Maybe my original question was not very clear.
What I need is to get each Twitter Status that contains one of the
Exie,
Reported your issue: https://issues.apache.org/jira/browse/SPARK-9302
SparkR has support for the long (bigint) type in serde. This issue is related to
supporting complex Scala types in serde.
-Original Message-
From: Exie [mailto:tfind...@prodevelop.com.au]
Sent: Friday, July 24, 2015
Interestingly, after more digging, df.printSchema() in raw Spark shows the
columns as long, not bigint.
root
|-- localEventDtTm: timestamp (nullable = true)
|-- asset: string (nullable = true)
|-- assetCategory: string (nullable = true)
|-- assetType: string (nullable = true)
|-- event:
Hi Folks,
Using Spark to read in JSON files and detect the schema, it gives me a
dataframe with a bigint field. R then fails to import the dataframe as it
can't convert the type.
head(mydf)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class jobj to a data.frame
Or Spark on HBase:
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
--
Ruslan Dautkhanov
On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. that is, key-value stores
Please consider HBase for this purpose :-)
On Tue, Jul 14, 2015 at 5:55 PM,
Hi,
I am wondering if anyone has successfully enabled
mapreduce.input.fileinputformat.list-status.num-threads in Spark jobs. I
usually set this property to 25 to speed up file listing in MR jobs (Hive
and Pig). But for some reason, this property does not take effect in Spark
HadoopRDD resulting
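For reference, this is how I set it on the Hadoop configuration Spark hands to its input formats; whether HadoopRDD's listing path honors it is exactly the open question:
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.list-status.num-threads", "25")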
Hi Yana,
Sorry for the late response. I just saw your email. In the end I ended up with the
following pom: https://www.dropbox.com/s/19fldb9qnnfieck/pom.xml?dl=0
There were multiple problems I had to struggle with. One of these was that
my application had REST implemented with JBoss Jersey, which got
He should still see something. I think you need to subscribe to the
screen name first and not filter it out only in the filter method. I do not
have the APIs at hand on mobile, but there should be a method.
On Thu, Jul 23, 2015 at 10:30 PM, Enno Shioji eshi...@gmail.com wrote:
You need to pay
How can I tell if it's the sample stream or full stream ?
Thanks
Sent from my iPhone
On Jul 23, 2015, at 4:17 PM, Enno Shioji eshi...@gmail.com wrote:
You are probably listening to the sample stream, and THEN filtering. This means
you listen to 1% of the twitter
Ahh
Makes sense - thanks for the help
Sent from my iPhone
On Jul 23, 2015, at 4:29 PM, Enno Shioji eshi...@gmail.com wrote:
You need to pay a lot of money to get the full stream, so unless you are doing
that, it's the sample stream!
On Thu, Jul 23, 2015 at 9:26 PM,
You need to pay a lot of money to get the full stream, so unless you are
doing that, it's the sample stream!
On Thu, Jul 23, 2015 at 9:26 PM, Patrick McCarthy pmccar...@eatonvance.com
wrote:
How can I tell if it's the sample stream or full stream ?
Thanks
Sent from my iPhone
On Jul 23,
Can you explain which transformation is failing? Here's a simple example.
http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/
On Thu, Jul 23, 2015 at 5:37 AM, saif.a.ell...@wellsfargo.com wrote:
I tried with a RDD[DenseVector] but RDDs are not transformable, so T+
Hi there!
Only a couple of hours left to our first webinar on *IoT data ingestion in
Spark Streaming using Kaa*.
During the webinar we will build a solution that ingests real-time data
from Intel Edison into Apache Spark for stream processing. This solution
includes a client, middleware, and
Thanks Michael, using backticks resolves the issue.
Wouldn't this fix also be something that should go into Spark 1.4.2, or at
least have the limitation noted in the documentation?
From: Michael Armbrust [mich...@databricks.com]
Sent: Wednesday, July 22, 2015
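For the archives, an illustration of the backtick quoting (table and column names hypothetical):
// backticks let Spark SQL reference identifiers that would otherwise be parsed as keywords
sqlContext.sql("SELECT `timestamp` FROM events")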
Hi there,
Per your analytical and real-time recommendation request, I would
recommend you use Spark SQL and the Hive thriftserver
to store and process your Spark Streaming data. As the thriftserver runs as
a long-term application, it is
quite feasible to cyclically consume data
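A rough sketch of that shape (table name hypothetical, and assuming the stream carries case-class instances so the schema can be inferred): append each micro-batch to a Hive table that thriftserver clients can then query over JDBC.
stream.foreachRDD { rdd =>
  val df = hiveContext.createDataFrame(rdd)               // schema inferred by reflection
  df.write.mode("append").saveAsTable("streaming_events") // hypothetical table
}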
The OP’s problem is he gets this:
console:47: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
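Since DenseVector is a subtype of Vector but RDD is invariant, the fix is a one-line element-wise upcast; a sketch, with denseRdd standing in for the RDD[DenseVector] in question:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics
val vectors = denseRdd.map(v => v: Vector)  // upcast each element to the Vector interface
val corrMatrix = Statistics.corr(vectors)   // Pearson by default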
Hi,
I'm trying to read an avro file into a Spark RDD, but I'm hitting an
'Exception while getting task result'.
The avro schema file has the following content:
{
  "type" : "record",
  "name" : "sample_schema",
  "namespace" : "com.adomik.avro",
  "fields" : [ {
    "name" : "username",
    "type" : "string",
    "doc" :
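One way to read it, sketched under the assumption that the exception comes from shipping non-serializable Avro records back to the driver; converting to plain types in the same stage avoids that:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("/path/to/sample.avro")  // hypothetical path
// GenericRecord is not java-serializable; extract plain values before collect()
val usernames = records.map { case (key, _) => key.datum.get("username").toString }
usernames.collect().foreach(println)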
Hi,
I use Spark to read binary files using SparkContext.binaryFiles(), and then
do some calculations, processing, and manipulations to get new objects (also
binary).
The next thing I want to do is write the results back to binary files on
disk.
Is there an equivalent of saveAsTextFile, just
Currently, the only way for you would be to create a proper schema for the
data. This is not a bug, but you could open a JIRA for the feature (since this
would help others solve similar use-cases), and it could be implemented and
included in a future version.
Thanks
Best Regards
On Tue, Jul 21,
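A sketch of that workaround: declare the schema yourself so the JSON reader never infers bigint (field names here are placeholders):
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("asset", StringType),
  StructField("eventCount", IntegerType)  // hypothetical field; forced to int, not inferred bigint
))
val mydf = sqlContext.read.schema(schema).json("/path/to/events.json")  // hypothetical path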
The problem should be toMap, as I tested that val maps2 = maps.collect
runs OK. When I run spark-shell, I run with the --master
mesos://cluster-1:5050 parameter, which is the same as with spark-submit.
Confused here.
2015-07-22 20:01 GMT-05:00 Yana Kadiyska yana.kadiy...@gmail.com:
Is it complaining
Hopefully this is an easy one. I am trying to filter a Twitter DStream by
user screen name - my code is as follows:
val stream = TwitterUtils.createStream(ssc, None)
  .filter(_.getUser.getScreenName.contains(markets))
However, nothing gets returned, and I can see that Bloomberg has tweeted.
Hi,
I need to create a table in Spark. For that I have uploaded a csv file
to HDFS and created a table using the following query:
"CREATE EXTERNAL TABLE IF NOT EXISTS " + tableName +
" (teams string, runs int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
"LOCATION '" + hdfspath + "'"
May I know is
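Assembled and executed against a HiveContext, that would look roughly like (a sketch using the thread's own tableName and hdfspath variables):
val query = "CREATE EXTERNAL TABLE IF NOT EXISTS " + tableName +
  " (teams string, runs int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
  "LOCATION '" + hdfspath + "'"
hiveContext.sql(query)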
Here are a few more configurations:
https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-ConfigurationPropertiesinthehive-site.xmlFile
Can't find anything on the timeouts though.
Thanks
Best Regards
On Wed, Jul 22, 2015 at 1:01 AM, Judy Nash
Hi folks, having trouble expressing IN and COLLECT_SET on a dataframe. In
other words, I'd like to figure out how to write the following query:
select collect_set(b),a from mytable where c in (1,2,3) group by a
I've started with
someDF
.where( -- not sure what to do for c here ---
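Since collect_set is a Hive UDAF, one route that should work is falling back to the SQL string on a HiveContext rather than the DataFrame DSL; a sketch:
val result = hiveContext.sql(
  "SELECT collect_set(b), a FROM mytable WHERE c IN (1, 2, 3) GROUP BY a")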
You can try adding that jar to SPARK_CLASSPATH (it's deprecated though) in the
spark-env.sh file.
Thanks
Best Regards
On Tue, Jul 21, 2015 at 7:34 PM, Michal Haris michal.ha...@visualdna.com
wrote:
I have a spark program that uses dataframes to query hive and I run it
both as a spark-shell for
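A less deprecated route, sketched as configuration (jar path hypothetical):
val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.extraClassPath", "/opt/jars/extra.jar")    // hypothetical path
  .set("spark.executor.extraClassPath", "/opt/jars/extra.jar")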
Did you try:
val data = indexed_files.groupByKey
val modified_data = data.map { a =>
  val name = a._2.mkString(",")
  (a._1, name)
}
// collect first: SparkContext can't be used inside an executor-side foreach
modified_data.collect().foreach { a =>
  val file = sc.textFile(a._2)
  println(file.count)
}
Thanks
Best Regards
On Wed, Jul 22, 2015 at 2:18 AM, MorEru
It looks like it's picking up the wrong namenode URI from
HADOOP_CONF_DIR; make sure that is correct. Also, for submitting a Spark job to
a remote cluster, you might want to look at spark.driver.host and
spark.driver.port.
Thanks
Best Regards
On Wed, Jul 22, 2015 at 8:56 PM, rok
On 23 Jul 2015, at 01:50, Ewan Leith ewan.le...@realitymine.com wrote:
I think the standard S3 driver used in Spark from the Hadoop project (S3n)
doesn't support IAM role based authentication.
However, S3a should support it. If you're running Hadoop 2.6 via the
spark-ec2 scripts (I'm
So when I go to ~/ephemeral-hdfs/bin/hadoop and check its version, it says
Hadoop 2.0.0-cdh4.2.0. If I run pyspark and use the s3a address, things should
work, right? What am I missing? And thanks so much for the help so far!
From: Steve Loughran
Did you happen to look into esDF?
https://github.com/elastic/elasticsearch-hadoop/issues/441
You can open an issue over here if that doesn't solve your problem:
https://github.com/elastic/elasticsearch-hadoop/issues
Thanks
Best Regards
On Tue, Jul 21, 2015 at 5:33 PM, ayan guha
I was just wondering if there were plans to implement class weights and
prediction probabilities in random forest? Is anyone working on this?
You can look into .saveAsObjectFile
Thanks
Best Regards
On Thu, Jul 23, 2015 at 8:44 PM, Oren Shpigel o...@yowza3d.com wrote:
Hi,
I use Spark to read binary files using SparkContext.binaryFiles(), and then
do some calculations, processing, and manipulations to get new objects
(also
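As far as I know there is no saveAsBinaryFiles mirror of binaryFiles, but two sketches that come close, assuming the processed output ends up as results: RDD[Array[Byte]]:
import org.apache.spark.SparkContext._  // implicits for writable conversions
import org.apache.hadoop.io.{BytesWritable, NullWritable}
results.saveAsObjectFile("/out/objects")  // Java-serialized; read back with sc.objectFile
results.map(bytes => (NullWritable.get, new BytesWritable(bytes)))
  .saveAsSequenceFile("/out/seq")         // raw bytes in a Hadoop SequenceFile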
My Spark Streaming on Kafka application is running on Spark 1.3.
I want to upgrade Spark to 1.4 now.
How should I deal with the Spark Streaming application?
Save the Kafka topic partition offsets, then kill the application, then
upgrade, then run Spark Streaming again?
Is there a more elegant way?
Currently that is the best way.
On Thu, Jul 23, 2015 at 12:51 AM, JoneZhang joyoungzh...@gmail.com wrote:
My Spark Streaming on Kafka application is running on Spark 1.3.
I want to upgrade Spark to 1.4 now.
How should I deal with the Spark Streaming application?
Save the Kafka topic partition
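For the record, a sketch of capturing offsets with the direct stream API so they can be persisted before the upgrade (directStream standing in for a KafkaUtils.createDirectStream result; the storage side is elided):
import org.apache.spark.streaming.kafka.HasOffsetRanges
directStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // persist topic/partition/untilOffset durably, then feed them back into
  // KafkaUtils.createDirectStream's fromOffsets argument after the upgrade
  offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.untilOffset}"))
}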
Sure, and Spark SQL supports Hive UDFs.
ISTM that the UDF 'DATE_FORMAT' is just not registered in your metastore.
Did you run 'CREATE FUNCTION' in advance?
Thanks,
On Tue, Jul 14, 2015 at 6:30 PM, Ravisankar Mani rrav...@gmail.com wrote:
Hi Everyone,
As mentioned in the Spark SQL programming
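If DATE_FORMAT is a custom UDF, the registration would look like this (class name entirely hypothetical):
hiveContext.sql(
  "CREATE TEMPORARY FUNCTION date_format AS 'com.example.hive.DateFormatUDF'")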
Thanks for the reply and your valuable suggestions.
I have 10 GB of data generated every day, which I need to write to my
database. This data is schema-based, but the schema changes frequently, so
consider it unstructured data. Sometimes I may have to serve 1
write/sec with 4 m1.xlarge
Hi Everyone,
I need to use a table from MS SQL Server in Spark. Can anyone please
share the optimal way to do that?
Thanks in advance,
Vinod
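The usual starting point is the JDBC data source; a sketch (connection details hypothetical, and the Microsoft JDBC driver jar needs to be on the classpath):
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")  // hypothetical
  .option("dbtable", "dbo.mytable")                                 // hypothetical
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .load()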
I think the standard S3 driver used in Spark from the Hadoop project (S3n)
doesn't support IAM role based authentication.
However, S3a should support it. If you're running Hadoop 2.6 via the spark-ec2
scripts (I'm not sure what it launches with by default) try accessing your
bucket via s3a://
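Once the hadoop-aws bits from Hadoop 2.6+ are on the classpath, using it is just the scheme change; a sketch:
val logs = sc.textFile("s3a://my-bucket/path/to/data")  // hypothetical bucket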
It sort of depends on what you mean by "optimized". There is a good thread on the topic at
http://search-hadoop.com/m/q3RTtJor7QBnWT42/Spark+and+SQL+server/v=threaded
If you have an archival-type strategy, you could do daily BCP extracts
to load the data into HDFS / S3 / etc. This would result in minimal impact