Hi again,
On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Now I'm wondering where this comes from (I haven't touched this component
in a while, nor upgraded Spark etc.) [...]
So the reason that the error is showing up now is that suddenly data from a
different
Hi Tobias,
Can you show how you create the JsonRDD?
Thanks,
Daoyuan
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 4:01 PM
To: user
Subject: Re: MatchError in JsonRDD.toLong
Hi again,
On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer
Hi,
On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
Can you show how you create the JsonRDD?
This should be reproducible in the Spark shell:
-
import org.apache.spark.sql._
val sqlc = new SQLContext(sc)
The second parameter of jsonRDD is the sampling ratio when we infer schema.
Thanks,
Daoyuan
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 5:11 PM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong
Hi,
On Fri, Jan 16, 2015 at 5:55 PM, Wang,
And you can use jsonRDD(json: RDD[String], schema: StructType) to specify
your schema explicitly. For numbers larger than Long, we can use DecimalType.
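A minimal sketch of that suggestion (Spark 1.2-era API; the field names and the choice of DecimalType.Unlimited are illustrative assumptions):

import org.apache.spark.sql._

val sqlc = new SQLContext(sc)
// Give "id" DecimalType rather than letting schema inference pick LongType,
// which is what triggers the MatchError in JsonRDD.toLong.
val schema = StructType(Seq(
  StructField("id", DecimalType.Unlimited, nullable = true),
  StructField("name", StringType, nullable = true)))
val json = sc.parallelize(Seq("""{"id": 12345678901234567890123, "name": "a"}"""))
val table = sqlc.jsonRDD(json, schema)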
Thanks,
Daoyuan
From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Friday, January 16, 2015 5:14 PM
To: Tobias Pfeiffer
Cc: user
Subject: RE:
I am trying to use Spark MLib ALS with implicit feedback for collaborative
filtering. Input data has only two fields `userId` and `productId`. I have
**no product ratings**, just info on what products users have bought,
that's all. So to train ALS I use:
def trainImplicit(ratings:
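Presumably along these lines; a minimal sketch (MLlib 1.x API; the name purchases and the hyperparameter values are placeholders):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// purchases: RDD[(Int, Int)] of (userId, productId) pairs, a placeholder.
// With no explicit ratings, each purchase is encoded with strength 1.0.
val ratings = purchases.map { case (userId, productId) =>
  Rating(userId, productId, 1.0)
}
// rank = 10, iterations = 10, lambda = 0.01, alpha = 1.0 are illustrative.
val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)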
Hi all!
In Spark SQL 1.2.0, I create a Hive table with a custom Parquet input format
and output format, like this:

CREATE TABLE test(
  id string,
  msg string)
CLUSTERED BY (id)
SORTED BY (id ASC)
INTO 10 BUCKETS
ROW FORMAT SERDE
  'com.a.MyParquetHiveSerDe'
STORED AS INPUTFORMAT
Hi all, I tried to run a TeraSort benchmark on my Spark cluster, but I
found it hard to find a standard Spark TeraSort program, except a PR from
rxin and one from ewan higgs:
https://github.com/apache/spark/pull/1242
https://github.com/ehiggs/spark/tree/terasort
The example which rxin provided without
Thank you Abhishek. The code works.
+spark-user
-- Forwarded message --
From: lonely Feb lonely8...@gmail.com
Date: 2015-01-16 19:09 GMT+08:00
Subject: Re: Problems with TeraValidate
To: Ewan Higgs ewan.hi...@ugent.be
Thanks a lot.
By the way, here is my output:
1. when dataset is 1000g:
num records: 100
checksum:
I have seen similar results.
I have looked at the code and I think there are a couple of contributors:
Encoding/decoding Java Strings to UTF-8 bytes is quite expensive. I'm not
sure what you can do about that.
But there are options for optimization due to the repeated decoding of the
same String
Created JIRA issue
https://issues.apache.org/jira/browse/SPARK-5281
Thanks a lot for clarifying it.
I have some follow-up questions:
1. Normally we have 1 executor per machine. If we have a cluster with
different hardware capacities, for example one 8-core worker and one 4-core
worker (ignoring the driver machine), and we set executor-cores=4,
then
I think you might need to set
spark.sql.hive.convertMetastoreParquet to false if I understand that flag
correctly
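A one-line sketch of that setting (assuming a HiveContext named sqlContext; setConf is the standard API):

// Use the table's own SerDe instead of Spark SQL's built-in Parquet support.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")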
-------- Original message --------
From: Xiaoyu Wang wangxy...@gmail.com
Date: 01/16/2015 5:09 AM
Thanks yana!
I will try it!
On Jan 16, 2015, at 20:51, yana yana.kadiy...@gmail.com wrote:
I think you might need to set
spark.sql.hive.convertMetastoreParquet to false if I understand that flag
correctly
Does parallel processing mean it is executed on multiple workers, or executed on
one worker but in multiple threads? For example, if I have only one worker but my
RDD has 4 partitions, will it be executed in parallel in 4 threads?
The reason I am asking is to decide whether I need to configure Spark
I have asked this question before but got no answer. Asking again.
Can I save an RDD to the local file system and then read it back on a Spark
cluster with multiple nodes?

rdd.saveAsObjectFile("file:///home/data/rdd1")

val rdd2 =
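Presumably the read-back uses objectFile; a minimal sketch of the round trip (the element type is assumed for illustration):

// With multiple nodes, a file:// path must be readable from every worker,
// e.g. a shared NFS mount; otherwise only the writing node can see it.
val rdd = sc.parallelize(1 to 100)
rdd.saveAsObjectFile("file:///home/data/rdd1")
val rdd2 = sc.objectFile[Int]("file:///home/data/rdd1")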
So one worker is enough and it will use all 4 cores? In what situation shall I
configure more workers in my single node cluster?
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent:
I have an RDD containing binary data. I would like to use 'RDD.pipe' to
pipe that binary data to an external program that will translate it to
string/text data. Unfortunately, it seems that Spark is mangling the binary
data before it gets passed to the external program.
This code is representative
On Fri, Jan 16, 2015 at 9:58 AM, Zork Sail zorks...@gmail.com wrote:
And then train ALSL:
val model = ALS.trainImplicit(ratings, rank, numIter)
I get RMSE 0.9, which is a big error in case of preferences taking 0 or 1
value:
This is likely the problem. RMSE is not an appropriate
Well it looks like you're reading some kind of binary file as text.
That isn't going to work, in Spark or elsewhere, as binary data is not
even necessarily the valid encoding of a string. There are no line
breaks to delimit lines and thus elements of the RDD.
Your input has some record structure
I was interested in this as I had some Spark code in Python that was too slow
and wanted to know whether Scala would fix it for me. So I rewrote my code
in Scala.
In my particular case the Scala version was 10 times faster. But I think
that is because I did an awful lot of computation in my
Per your last comment, it appears I need something like this:
https://github.com/RIPE-NCC/hadoop-pcap
Thanks a ton. That got me oriented in the right direction.
On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen so...@cloudera.com wrote:
Well it looks like you're reading some kind of binary file
Hi Experts!
I am using the Kinesis dependency as follows:
groupId = org.apache.spark
artifactId = spark-streaming-kinesis-asl_2.10
version = 1.2.0
With this, AWS SDK version 1.8.3 is being used, and in that SDK multiple records
cannot be put in a single request. Is it possible to put multiple records
in a
Sorry, I couldn't understand the issue. Are you trying to send data to
Kinesis from a Spark batch/real-time job?
- Aniket
On Fri, Jan 16, 2015, 9:40 PM Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi Experts!
I am using the Kinesis dependency as follows:
groupId = org.apache.spark
artifactId =
I'm not positive, but I think this is very unlikely to work.
First, when you call sc.objectFile(...), I think the *driver* will need to
know something about the file, e.g. to know how many tasks to create. But it
won't even be able to see the file, since it only lives on the local
filesystem of
Hi,
I believe this is some kind of timeout problem but can't figure out how to
increase it.
I am running Spark 1.2.0 on YARN (all from CDH 5.3.0). I submit a Python task
which first loads a big RDD from HBase. I can see in the screen output all
executors fire up, then no more logging output for
Hi Tobias,
Can you share more information about how you do that? I also have a similar
question about this.
Thanks a lot,
Regards,
Shuai
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Wednesday, November 26, 2014 12:25 AM
To: Sandy Ryza
Cc: Yanbo Liang; user
Subject:
Are you referring to the PutRecords method, which was added in 1.9.9? (See
http://aws.amazon.com/releasenotes/1369906126177804) If so, can't you just
depend upon this later version of the SDK in your app even though
spark-streaming-kinesis-asl is depending upon this earlier 1.9.3 version that
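A hedged sketch of what that dependency override might look like in a Maven POM (coordinates assumed from the versions mentioned above):

<!-- Pin a newer AWS SDK ahead of the transitive one from
     spark-streaming-kinesis-asl so PutRecords is available. -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>1.9.9</version>
</dependency>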
I need to save an RDD to the file system and then restore my RDD from the file
system in the future. I don't have any HDFS file system and don't want to go
through the hassle of setting one up. So how can I achieve this? The application
needs to run on a cluster with multiple nodes.
Regards,
FYI, I just confirmed with the latest Spark 1.3 snapshot that the
spark.yarn.maxAppAttempts setting that SPARK-2165 refers to works
perfectly. Great to finally get rid of this problem. Also caused an issue
when the eventLogs were enabled since the spark-events/appXXX folder
already exists the
I'm talking about RDD1 (not persisted or checkpointed) in this situation:
...(somewhere) -> RDD1 -> RDD2
                    |       |
                    V       V
                  RDD3 -> RDD4 -> Action!

In my experience the change RDD1 get
I just wanted to reiterate the solution for the benefit of the community.
The problem is not from my use of 'pipe', but that 'textFile' cannot be
used to read in binary data. (Doh) There are a couple of options to move
forward.
1. Implement a custom 'InputFormat' that understands the binary input
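As a hedged aside (not from the original message): for fixed-length records, Spark also ships a binary reader, so a custom InputFormat may not be needed; the path and record length below are placeholders:

// Read raw fixed-length records as RDD[Array[Byte]]; no string decoding,
// so nothing gets mangled before RDD.pipe hands data to the external program.
val recordLength = 512 // placeholder
val records = sc.binaryRecords("hdfs:///data/input.bin", recordLength)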
Hey Phil,
Thank you for sharing this. The result didn't surprise me a lot; it's normal to do
the prototype in Python, and once it gets stable and you really need the
performance, rewriting part of it in C, or all of it in another language, does
make sense and will not cost you much time.
Davies
On
I want to add some Java options when submitting an application:
--conf spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures
-XX:+FlightRecorder
But it looks like they don't get set. Where can I add them to make this work?
Thanks.
Hi Kane,
What's the complete command line you're using to submit the app? Where
do you expect these options to appear?
On Fri, Jan 16, 2015 at 11:12 AM, Kane Kim kane.ist...@gmail.com wrote:
I want to add some Java options when submitting an application:
--conf
Hi Kane,
Here's the command line you sent me privately:
./spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class
SimpleApp --conf
spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures
-XX:+FlightRecorder --master local simpleapp.jar ./test.log
You're running the app in local mode. In that
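A hedged sketch of the likely continuation: in local mode the executor runs inside the driver JVM, so the driver's JVM options are the ones that take effect (--driver-java-options is the standard spark-submit flag):

./spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class SimpleApp \
  --driver-java-options "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder" \
  --master local simpleapp.jar ./test.log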
The question is about the ways to create a Windows desktop-based and/or
web-based application client that is able to connect and talk to the server
containing a Spark application (either local or on-premise cloud
distributions) at run time.
Any language/architecture may work. So far, I've seen
There's also an example of running a SparkContext in a java servlet
container from Calrissian: https://github.com/calrissian/spark-jetty-server
On Fri, Jan 16, 2015 at 2:31 PM, olegshirokikh o...@solver.com wrote:
The question is about the ways to create a Windows desktop-based and/or
Send email to user-subscr...@spark.apache.org
Cheers
On Fri, Jan 16, 2015 at 11:51 AM, Andrew Musselman
andrew.mussel...@gmail.com wrote:
Just got the latest from Github and tried running `mvn test`; is this error
common and do you have any advice on fixing it?
Thanks!
[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
spark-core_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal
Can you try doing this before running mvn ?
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
What OS are you using ?
Cheers
On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman
andrew.mussel...@gmail.com wrote:
Just got the latest from Github and tried running `mvn
Hi,
You can take a look at the Spark Kernel project:
https://github.com/ibm-et/spark-kernel
The Spark Kernel's goal is to serve as the foundation for interactive
applications. The project provides a client library in Scala that abstracts
connecting to the kernel (containing a Spark Context),
Thanks Ted, got farther along but now have a failing test; is this a known
issue?
---
T E S T S
---
Running org.apache.spark.JavaAPISuite
Tests run: 72, Failures: 0, Errors: 1, Skipped: 0,
Thanks Sean
On Fri, Jan 16, 2015 at 12:06 PM, Sean Owen so...@cloudera.com wrote:
Hey Andrew, you'll want to have a look at the Spark docs on building:
http://spark.apache.org/docs/latest/building-spark.html
It's the first thing covered there.
The warnings are normal as you are probably
Hi,
I observed some weird performance issue using Spark in combination with
Theano, and I have no real explanation for that. To exemplify the issue I am
using the pi.py example of spark that computes pi:
When I modify the function from the example:
# unmodified code from Spark's pi.py example
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0
You can get the internal RDD using: schemaRDD.queryExecution.toRDD. This
is used internally and does not copy. This is an unstable developer API.
On Thu, Jan 15, 2015 at 11:26 PM, Nathan McCarthy
nathan.mccar...@quantium.com.au wrote:
Thanks Cheng!
Is there any API I can get access to
I got the same error:
testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 261.111 sec
ERROR!
org.apache.spark.SparkException: Job aborted due to stage failure: Master
removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org
Hello all,
I have a custom RDD for fast loading of data from a non-partitioned source.
The partitioning happens in the RDD implementation by pushing data from the
source into queues picked up by the current active partitions in worker
threads.
This works great on a multi-threaded single host
Hi,
I have been playing around with the new version of Spark MLlib Random
forest implementation, and while in the process, tried it with a file with
String Features.
While training, it fails with:
java.lang.NumberFormatException: For input string.
Is MLlib Random Forest adapted to run on top of
Thanks a lot, Robert – I’ll definitely investigate this and probably would come
back with questions.
P.S. I’m new to this Spark forum. I’m getting responses through emails but they
are not appearing as “replies” in the thread – it’s kind of inconvenient. Is it
something that I should tweak?
The implementation accepts an RDD of LabeledPoint only, so you
couldn't feed in strings from a text file directly. LabeledPoint is a
wrapper around double values rather than strings. How were you trying
to create the input then?
No, it only accepts numeric values, although you can encode
I'd open an issue on GitHub to ask us to allow you to use Hadoop's glob
file format for the path.
On Thu, Jan 15, 2015 at 4:57 AM, David Jones letsnumsperi...@gmail.com
wrote:
I've tried this now. Spark can load multiple avro files from the same
directory by passing a path to a directory.
An alternative approach would be to translate your categorical variables
into dummy variables. If your strings represent N classes/categories you
would generate N-1 dummy variables containing 0/1 values.
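A minimal sketch of that encoding (hypothetical category list; each string becomes N-1 dummy values feeding a LabeledPoint):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical categories; the first acts as the reference level.
val categories = Seq("red", "green", "blue")
def dummies(s: String): Array[Double] =
  categories.tail.map(c => if (c == s) 1.0 else 0.0).toArray

// e.g. category "green" with label 1.0 becomes features [1.0, 0.0]
val point = LabeledPoint(1.0, Vectors.dense(dummies("green")))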
Auto-magically creating dummy variables from categorical data definitely
comes in handy. I
Hi Asaf,
featurestream [1] is an internal project I'm playing with that includes
support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields
Would this be of interest? It's not open source, but if there's sufficient
demand I
Is Spark 1.2 compatible with HDP 2.1?
I think you cannot use textFile(), binaryFiles(), or pickleFile()
here; those expect a different format than WAV.
You could get a list of paths for all the files, then
sc.parallelize() them, and foreach():

import subprocess

def process(path):
    # use subprocess to launch a process to do the job,
    # and read its stdout as the result
    return subprocess.check_output(["decode-wav", path])  # placeholder command
What's a good way to calculate similarities between all vector-rows in a
matrix or RDD[Vector]?
I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
going down a good path to transpose a matrix in order to run that.
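For reference, a hedged sketch of columnSimilarities on a RowMatrix (MLlib 1.2 API; the sample vectors are placeholders, and rows must first be transposed into columns if row similarities are wanted):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0)))
val mat = new RowMatrix(rows)
// Upper-triangular CoordinateMatrix of cosine similarities between columns.
val sims = mat.columnSimilarities()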
Yes. It's compatible with HDP 2.1
-Original Message-
From: bhavyateja [mailto:bhavyateja.potin...@gmail.com]
Sent: Friday, January 16, 2015 3:17 PM
To: user@spark.apache.org
Subject: spark 1.2 compatibility
Is Spark 1.2 compatible with HDP 2.1?
I should clarify on this: I personally have used HDP 2.1 + Spark 1.2 and have not
seen a problem.
However, officially HDP 2.1 + Spark 1.2 is not a supported scenario.
-Original Message-
From: Judy Nash
Sent: Friday, January 16, 2015 5:35 PM
To: 'bhavyateja'; user@spark.apache.org
The Apache Spark project should work with it, but I'm not sure you can get
support from HDP (if you have that).
Matei
On Jan 16, 2015, at 5:36 PM, Judy Nash judyn...@exchange.microsoft.com
wrote:
I should clarify on this: I personally have used HDP 2.1 + Spark 1.2 and have
not seen a
Hello Everyone,
I am encountering trouble running Spark applications when I shut down my
EC2 instances. Everything else seems to work except Spark. When I try
running a simple Spark application, like sc.parallelize(), I get the message
that the HDFS name node is in safe mode.
Has anyone else had this
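(A hedged aside, not from the original thread: if the name node stays stuck, safe mode can usually be released manually with the standard Hadoop admin command.)

hadoop dfsadmin -safemode leave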
Hao,
Thanks so much for the links! This is exactly what I'm looking for. If I
understand correctly, I can extend PrunedFilteredScan, PrunedScan, and
TableScan and I should be able to support all the sql semantics?
I'm a little confused about the Array[Filter] that is used with the
Filtered scan.
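For reference, a hedged sketch of the trait under discussion (signatures as in the Spark 1.2 sources API; the relation name is hypothetical):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StructType}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}

// buildScan receives only the columns the query needs, plus the filters
// Spark could translate. Filters the relation ignores are re-applied by
// Spark afterwards, so pushing them down is an optimization, not a must.
class MyRelation(val sqlContext: SQLContext, val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = ???
}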
Hi Dib,
For our use case I want my Spark job1 to read from hdfs/cache and write to
Kafka queues. Similarly, Spark job2 should read from Kafka queues and write
to Kafka queues.
Is writing to Kafka queues from a Spark job supported in your code?
Thanks
Deb
On Jan 15, 2015 11:21 PM, Akhil Das
Although this helped to improve it significantly, I still run into this problem
despite increasing spark.yarn.executor.memoryOverhead vastly:

export SPARK_EXECUTOR_MEMORY=24G
spark.yarn.executor.memoryOverhead=6144

yet I am getting this:

2015-01-17 04:47:40,389 WARN
My code handles the Kafka consumer part, but writing to Kafka should not be a
big challenge; you can easily do it in your driver code.
dibyendu
On Sat, Jan 17, 2015 at 9:43 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Dib,
For our usecase I want my spark job1 to read from hdfs/cache
I tried the following but still didn't see test output :-(
diff --git a/pom.xml b/pom.xml
index f4466e5..dae2ae8 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1131,6 +1131,7 @@
+        <spark.driver.allowMultipleContexts>true</spark.driver.allowMultipleContexts>
       </systemProperties>
You can use K-means
https://spark.apache.org/docs/latest/mllib-clustering.html with a
suitably large k. Each cluster should correspond to rows that are similar
to one another.
On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman
andrew.mussel...@gmail.com wrote:
What's a good way to calculate
Hi experts,
I got an error while unpersisting an RDD.
Any ideas?

java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at
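A hedged workaround sketch: unpersist blocks by default and can hit this timeout; the non-blocking variant of the standard RDD API sidesteps the wait:

// Returns immediately instead of awaiting block removal confirmation.
rdd.unpersist(blocking = false)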