Hi Patrick,
The fix you need is SPARK-6954: https://github.com/apache/spark/pull/5704.
If possible, you may cherry-pick the following commit into your Spark
deployment and it should resolve the issue:
https://github.com/apache/spark/commit/98ac39d2f5828fbdad8c9a4e563ad1169e3b9948
Note that this
I have 10 folders, each with 6000 files. Each folder is roughly 500 GB, so about 5 TB of data in total.
The data is formatted as key \t value (tab-separated). After the union, I want to remove
the duplicates among keys, so each key should be unique and have only one
value.
Here is what I am doing.
folders =
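For illustration, a minimal Scala sketch of one way to do the de-duplication in a single shuffle (the paths, the tab format, and the "first value wins" rule are assumptions; sc is the usual spark-shell SparkContext):

val folders = (1 to 10).map(i => s"hdfs:///data/folder$i")   // hypothetical paths
val all = sc.union(folders.map(sc.textFile(_)))
val deduped = all.map { line =>
    val Array(k, v) = line.split("\t", 2)    // "key<TAB>value"
    (k, v)
  }
  .reduceByKey((first, _) => first)          // keep one value per key
deduped.map { case (k, v) => k + "\t" + v }.saveAsTextFile("hdfs:///data/deduped")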
I have the following Hadoop/Spark cluster node configuration:
Nodes 1 and 2 are the ResourceManager and NameNode respectively.
Nodes 3, 4, and 5 each include a NodeManager and a DataNode.
Node 7 is the Spark master, configured to run in yarn-client or yarn-cluster mode.
I have tested it and it works fine.
Is there any
Yes, it's all just RDDs under the covers. DataFrames/SQL is just a more
concise way to express your parallel programs.
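Purely as an illustration of that point, a small Scala sketch (assuming a spark-shell session where sc and sqlContext exist; the path and column name are made up) showing the same word count both ways:

import sqlContext.implicits._
case class Word(word: String)

val lines = sc.textFile("hdfs:///tmp/words.txt")

// RDD version
val rddCounts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

// DataFrame/SQL version of the same parallel computation
val df = lines.flatMap(_.split("\\s+")).map(Word).toDF()
df.registerTempTable("words")
val sqlCounts = sqlContext.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")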
On Sat, Jun 13, 2015 at 5:25 PM, Rex X dnsr...@gmail.com wrote:
Thanks, Don! Does SQL implementation of spark do parallel processing on
records by default?
-Rex
On Sat,
Hi,
Not sure if this is it, but could you please try
com.databricks.spark.avro instead of just avro.
Thanks,
Burak
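For reference, the Scala equivalent of that suggestion (a sketch assuming spark-shell was started with --packages com.databricks:spark-avro_2.10:1.0.0, reusing the file path from the original question):

val df = sqlContext.read
  .format("com.databricks.spark.avro")     // fully qualified data source name
  .load("file:///home/matmsh/myfile.avro")
df.printSchema()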
On Jun 13, 2015 9:55 AM, Shing Hing Man mat...@yahoo.com.invalid wrote:
Hi,
I am trying to read an Avro file in SparkR (in Spark 1.4.0).
I started R using the following.
Yep - Burak's answer should work. FWIW the error message from the stack
trace that shows this is the line
Failed to load class for data source: avro
Thanks
Shivaram
On Sat, Jun 13, 2015 at 6:13 PM, Burak Yavuz brk...@gmail.com wrote:
Hi,
Not sure if this is it, but could you please try
I got it. Thanks!
Patcharee
On 13/06/15 23:00, Will Briggs wrote:
The context that is created by spark-shell is actually an instance of
HiveContext. If you want to use it programmatically in your driver, you need to
make sure that your context is a HiveContext, and not a SQLContext.
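For illustration, a minimal sketch of doing that in a standalone driver (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-example"))
val sqlContext = new HiveContext(sc)   // a HiveContext, not a plain SQLContext
val df = sqlContext.sql("SHOW TABLES") // HiveQL is now available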
As the error message says, were you using a SQLContext instead of a
HiveContext to create the DataFrame?
In the Spark shell, although the variable name is sqlContext, the type of
that variable is actually org.apache.spark.sql.hive.HiveContext, which
has the ability to communicate with
hey guys
I tried the following settings as well. No luck
--total-executor-cores 24 --executor-memory 4G
BTW, on the same cluster, Impala absolutely kills it: same query, 9 seconds, no
memory issues, no issues at all.
In fact I am pretty disappointed with Spark SQL. I have worked with Hive during
the
Hi
I have a question about Spark Twitter stream processing in Spark 1.3.1. The
code sample below just opens
up a Twitter stream, uses auth keys, splits out hash tags and creates a temp
table. However, when I try to compile
it using sbt (CentOS 6.5) I get the error
[error]
Thanks, Don! Does SQL implementation of spark do parallel processing on
records by default?
-Rex
On Sat, Jun 13, 2015 at 10:13 AM, Don Drake dondr...@gmail.com wrote:
Take a look at https://github.com/databricks/spark-csv to read in the
tab-delimited file (change the default delimiter)
and
Try using Spark 1.4.0 with SQL code generation turned on; this should make
a huge difference.
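For reference, one way to turn it on in the 1.x line (a sketch; the property can also be passed with --conf at submit time):

sqlContext.setConf("spark.sql.codegen", "true")
// or: spark-submit --conf spark.sql.codegen=true ...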
On Sat, Jun 13, 2015 at 5:08 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com wrote:
hey guys
I tried the following settings as well. No luck
--total-executor-cores 24 --executor-memory 4G
trying again
On 13 Jun 2015, at 10:15, Robin East robin.e...@xense.co.uk wrote:
Here’s a typical way to do it:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
// Load and parse
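Robin's snippet is cut off after the imports; for reference, the standard MLlib k-means example it appears to follow continues roughly like this (sample data path from the Spark docs, assuming sc from spark-shell):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse whitespace-delimited numeric features, one point per line.
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing the within-set sum of squared errors.
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)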
Try build/sbt clean first.
On Tue, May 26, 2015 at 4:45 PM, Justin Yip yipjus...@prediction.io wrote:
Hello,
I am trying to build scala doc from the 1.4 branch. But it failed due to
[error] (sql/compile:compile) java.lang.AssertionError: assertion failed:
List(object package$DebugNode,
Are you using a build for scala 2.11? I’ve encountered the same behaviour
trying to run on Yarn with scala 2.11 and Spark 1.3.0, 1.3.1 and 1.4.0.RC3
and raised JIRA issue here: https://issues.apache.org/jira/browse/SPARK-7944.
Would be good to know if this is identical to what you’re seeing on
That's the Tachyon FS there, which appears to be missing a method override.
On 12 Jun 2015, at 19:58, Peter Haumer phau...@us.ibm.com wrote:
Exception in thread main java.lang.UnsupportedOperationException: Not
implemented by the TFS FileSystem implementation
at
Perfect! I'll start working on it
2015-06-13 2:23 GMT+02:00 Amit Ramesh a...@yelp.com:
Hi Juan,
I have created a ticket for this:
https://issues.apache.org/jira/browse/SPARK-8337
Thanks!
Amit
On Fri, Jun 12, 2015 at 3:17 PM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com
Take a look at https://github.com/databricks/spark-csv to read in the
tab-delimited file (change the default delimiter)
and once you have that as a DataFrame, SQL can do the rest.
https://spark.apache.org/docs/latest/sql-programming-guide.html
-Don
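For illustration, a minimal Scala sketch of that approach (the package version, path and options are assumptions; start spark-shell with --packages com.databricks:spark-csv_2.10:1.0.3):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")      // tab-delimited instead of the default comma
  .option("header", "false")
  .load("hdfs:///data/input.tsv")

df.registerTempTable("records")
sqlContext.sql("SELECT COUNT(*) FROM records").show()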
On Fri, Jun 12, 2015 at 8:46 PM, Rex X
I have 10 folders, each with 6000 files. Each folder is roughly 500 GB, so about 5 TB of data in total.
The data is formatted as key \t value (tab-separated). After the union, I want to remove the
duplicates among keys, so each key should be unique and have only one value.
Here is what I am doing.
folders =
Hi Patrick,
I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic
allocation in 1.4 that permitted requesting negative numbers of executors.
Any chance you'd be able to try with the newer version and see if the
problem persists?
-Sandy
On Fri, Jun 12, 2015 at 7:42 PM, Patrick
I am currently using filter inside a loop of all severity levels to do this,
which I think is pretty inefficient. It has to read the entire data set once
for each severity. I wonder if there is a more efficient way that takes just
one pass of the data? Thanks.
Best,
Hao Wang
On Jun 13, 2015,
Check out this recent post by Cheng Lian regarding dynamic partitioning in
Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html
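For reference, a sketch of that Spark 1.4 approach (logsDF and the column/path names are hypothetical): write the DataFrame once, partitioned by the severity column, so each severity lands in its own directory in a single pass.

import org.apache.spark.sql.SaveMode

logsDF.write
  .partitionBy("severity")
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .save("hdfs:///logs/by_severity")
// e.g. hdfs:///logs/by_severity/severity=ERROR/..., severity=INFO/...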
On June 13, 2015, at 5:41 AM, Hao Wang bill...@gmail.com wrote:
Hi,
I have a bunch of large log files on Hadoop. Each line contains a log and
Hi, I am trying to read an Avro file in SparkR (in Spark 1.4.0).
I started R using the following.
matmsh@gauss:~$ sparkR --packages com.databricks:spark-avro_2.10:1.0.0
Inside the R shell, when I issue the following,
read.df(sqlContext, "file:///home/matmsh/myfile.avro", "avro")
I get the following
Hey everyone,
I’m trying to figure out how to silence all of the logging info that gets
printed to the console when dealing with Parquet files. I’ve seen that there
have been several PRs addressing this issue, but I can’t seem to figure out how
to actually change the logging config. I’ve
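One hedged way to quiet it from the driver, assuming log4j is the active logging backend (some Parquet messages go through java.util.logging, so this may only silence part of the output):

import org.apache.log4j.{Level, Logger}

Logger.getLogger("parquet").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.sql.parquet").setLevel(Level.ERROR)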
I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing
Python Spark application against it and got the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
: java.lang.RuntimeException: Failed to load class for data source:
Hi every one,
I am trying to run the FP growth example. I have tried to compile the
following POM file:
<project>
  <groupId>com.oreilly.learningsparkexamples.mini</groupId>
  <artifactId>learning-spark-mini-example</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>example</name>
I would like to have a Spark Streaming *SQS Receiver* which deletes SQS
messages only *after* they were successfully stored on S3.
For this a *Custom Receiver* can be implemented with the semantics of the
Reliable Receiver.
The store(multiple-records) call blocks until the given records have been
Yes, if you have enabled WAL and checkpointing, then after the store you
can simply delete the SQS messages from your receiver.
Thanks
Best Regards
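For illustration, a rough Scala sketch of that pattern; SqsBatchClient and its methods are hypothetical placeholders standing in for whatever SQS client is used, not a real AWS API:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Placeholder interface for fetching and deleting SQS messages.
trait SqsBatchClient extends Serializable {
  def fetchBatch(): Seq[(String, String)]            // (receiptHandle, body)
  def deleteBatch(receiptHandles: Seq[String]): Unit
}

class ReliableSqsReceiver(client: SqsBatchClient)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  def onStart(): Unit = {
    new Thread("sqs-receiver") { override def run(): Unit = receive() }.start()
  }

  def onStop(): Unit = {}

  private def receive(): Unit = {
    while (!isStopped()) {
      val batch = client.fetchBatch()
      if (batch.nonEmpty) {
        // Blocks until the records are stored; with the receiver WAL and
        // checkpointing enabled this means they are safely persisted.
        store(ArrayBuffer(batch.map(_._2): _*))
        // Only now is it safe to remove the messages from the queue.
        client.deleteBatch(batch.map(_._1))
      }
    }
  }
}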
On Sat, Jun 13, 2015 at 6:14 AM, Michal Čizmazia mici...@gmail.com wrote:
I would like to have a Spark Streaming SQS Receiver which deletes SQS
Thanks for the answer. Any example?
On Jun 13, 2015 2:13 PM, Sonal Goyal sonalgoy...@gmail.com wrote:
I think you need to add a dependency on spark-mllib too.
On Jun 13, 2015 11:10 AM, masoom alam masoom.a...@wanclouds.net wrote:
Hi every one,
I am trying to run the FP growth example. I have
Hi,
I have a bunch of large log files on Hadoop. Each line contains a log and
its severity. Is there a way that I can use Spark to split the entire data
set into different files on Hadoop according to the severity field? Thanks.
Below is an example of the input and output.
Input:
[ERROR] log1
Are you looking for something like filter? See a similar example here
https://spark.apache.org/examples.html
Thanks
Best Regards
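For illustration, a small Scala sketch of that filter approach (paths and the severity list are assumptions); note it scans the data once per severity level, which is exactly the cost the original poster wants to avoid:

val logs = sc.textFile("hdfs:///logs/input")
val severities = Seq("ERROR", "WARN", "INFO")

severities.foreach { sev =>
  logs.filter(_.startsWith("[" + sev + "]"))
      .saveAsTextFile("hdfs:///logs/output/" + sev)
}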
On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang bill...@gmail.com wrote:
Hi,
I have a bunch of large log files on Hadoop. Each line contains a log and
its severity. Is
I think the straight answer would be No, but yes you can actually hardcode
these parameters if you want. Look in the SparkContext.scala
https://github.com/apache/spark/blob/master/core%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2FSparkContext.scala#L364
where all these properties are being
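For illustration, a sketch of setting properties programmatically instead of on the command line (the property values are placeholders); anything set on the SparkConf takes effect when the SparkContext is constructed:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("yarn-client")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "4")

val sc = new SparkContext(conf)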
Does the pre-built package come with Hive support?
Namely, has it been built with -Phive and -Phive-thriftserver?
On Fri, Jun 12, 2015, 9:32 AM ayan guha guha.a...@gmail.com wrote:
Thanks guys, my question must look like a stupid one today :) Looking
forward to test out 1.4.0, just downloaded it.
Thanks Akhil!
I just looked it up in the code as well.
Receiver.store(ArrayBuffer[T], ...)
ReceiverSupervisorImpl.pushArrayBuffer(ArrayBuffer[T], ...)
ReceiverSupervisorImpl.pushAndReportBlock(...)
WriteAheadLogBasedBlockHandler.storeBlock(...)
This implementation
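For reference, a minimal sketch of the configuration that routes stores through the WriteAheadLogBasedBlockHandler path above: enable the receiver write-ahead log and set a checkpoint directory (the directory is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-example")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/wal-example")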
Hi,
I am using Spark 0.14. I am trying to insert data into a Hive table (in ORC
format) from a DataFrame.
format) from DF.
partitionedTestDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
.mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone", "z", "year", "month").saveAsTable("testorc")
When this job is submitted by
Hey Sandy,
I'll test it out on 1.4. Do you have a bug number or PR that I could reference
as well?
Thanks!
-Pat
Sent from my iPhone
On Jun 13, 2015, at 11:38 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Patrick,
I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic