I was able to get this working by extending KryoRegistrator and setting the
spark.kryo.registrator property.
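For reference, a minimal sketch of that approach (MyClass, MyClassSerializer, and the package name are illustrative stand-ins for your own types):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // attach the custom Kryo Serializer to the class it handles
    kryo.register(classOf[MyClass], new MyClassSerializer)
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")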
On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet cjno...@gmail.com wrote:
I'm trying to register a custom class that extends Kryo's Serializer
interface. I can't tell exactly what Class
How many nodes do you have in your cluster, how many cores, what is the
size of the memory?
On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar manasdebashis...@gmail.com
wrote:
Hi Arush,
Mine is a CDH5.3 with Spark 1.2.
The only change to my Spark programs is
-Dspark.driver.maxResultSize=3g
I've been experimenting with my configuration for a couple of days and gained
quite a bit of power through small optimizations, but it may very well be
something crazy I'm doing that is causing this problem.
To give a little bit of background, I am in the early stages of a project
that consumes a
Hi,
Please correct me if I'm wrong: in Spark Streaming, the next batch will
not start processing until the previous batch has completed. Is there
any way to start processing the next batch if the previous
batch is taking longer to process than the batch interval?
The problem I am facing
I thought more about it... can we provide access to the IndexedRDD through
the Thrift server API and let mapPartitions query the API? I am not sure if
the Thrift server is as performant as opening up an API using other Akka-based
frameworks (like Play or Spray)...
Any pointers will be really helpful...
Hi,
I have a Hidden Markov Model running with 200MB of data.
Once the program finishes (i.e. all stages/jobs are done) the program
hangs for 20 minutes or so before killing the master.
In the Spark master, the following log appears.
2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught
What is your cluster configuration? Did you try looking at the Web UI?
There are many tips here
http://spark.apache.org/docs/1.2.0/tuning.html
Did you try these?
On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com
wrote:
Hi,
I have a Hidden Markov Model running with 200MB
Suppose I have an object to broadcast and then use it in a mapper function,
something like the following (Python code):
obj2share = sc.broadcast(some_object)
someRdd.map(createMapper(obj2share)).collect()
The createMapper function will create a mapper function using the shared
object's value. Another
I have 5 workers, each with 8GB of executor memory. My driver memory is 8
GB as well. They are all 8-core machines.
To answer Imran's question, my configuration is thus:
executor_total_max_heapsize = 18GB
This problem happens at the end of my program.
I don't have to run a lot of jobs to see
The thing is that my time is perfectly valid...
On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
It's to do with the timezone, actually. You can either use NTP to maintain
an accurate system clock, or you can adjust your system time to match the
AWS one. You can do it
Hi Abe,
I'm new to Spark as well, so someone else could answer better. A few
thoughts, which may or may not be the right line of thinking:
1) Spark properties can be set on the SparkConf and with flags in
spark-submit, but settings on SparkConf take precedence. I think your jars
flag for
What version of Java are you using? CoreNLP dropped support for Java 7 in
its 3.5.0 release.
Also, the correct command line option is --jars, not --addJars.
On Thu, Feb 12, 2015 at 12:03 PM, Deborah Siegel deborah.sie...@gmail.com
wrote:
Hi Abe,
I'm new to Spark as well, so someone else
Hi Patrick,
Really glad to get your reply.
Yes, we are doing group-by operations in our work. We know that this is common
for growTable when processing large data sets.
The actual question is: is there any way we can tune the initialCapacity
ourselves, specifically for our
Thanks, Kelvin :)
The error seems to disappear after I decreased both
spark.storage.memoryFraction and spark.shuffle.memoryFraction to 0.2,
and increased the driver memory a bit.
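(For reference, a sketch of those settings in code:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.2")
  .set("spark.shuffle.memoryFraction", "0.2")
// driver memory was raised separately via spark-submit --driver-memory,
// since it must be set before the driver JVM starts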
Best,
Yifan LI
On 10 Feb 2015, at 18:58, Kelvin Chu 2dot7kel...@gmail.com wrote:
Since the stacktrace
I'm trying to use multiple threads to run Spark SQL queries.
Some sample code, just like this:
val sqlContext = new SQLContext(sc)
val rdd_query = sc.parallelize(data, part)
rdd_query.registerTempTable("MyTable")
sqlContext.cacheTable("MyTable")
val serverPool = Executors.newFixedThreadPool(3)
val
* We have an inbound stream of sensor data for millions of devices (which
have unique identifiers). Spark Streaming can handle events in the ballpark
of 100-500K records/sec/node, so you need to size the cluster
accordingly. And it's scalable.
* We need to perform aggregation of this stream
Marcelo and Burak,
Thank you very much for your explanations. Now I'm able to see my logs.
On Wed, Feb 11, 2015 at 7:52 PM, Marcelo Vanzin van...@cloudera.com wrote:
For Yarn, you need to upload your log4j.properties separately from
your app's jar, because of some internal issues that are too
OK, I would suggest adding SPARK_DRIVER_MEMORY in spark-env.sh, with a larger
amount of memory than the default 512m.
Check that your timezone is correct as well; an incorrect timezone can make
it look like your time is correct when it is skewed.
cheers
On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim kane.ist...@gmail.com wrote:
The thing is that my time is perfectly valid...
On Tue, Feb 10, 2015 at 10:50 PM,
Hello
Has Spark implemented computing statistics for Parquet files? Or is there
any other way I can enable broadcast joins between parquet file RDDs in
Spark SQL?
Thanks
Dima
The NM logs only seem to contain something similar to the following. Nothing
else in the same time range. Any help?
2015-02-12 20:47:31,245 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
Event EventType: KILL_CONTAINER sent to absent container
So you have come across spark.streaming.concurrentJobs already :)
Yeah, that is an undocumented feature that does allow multiple output
operations to be submitted in parallel. However, this is not made public for
the exact reasons that you realized: the semantics in the case of stateful
operations is
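(For anyone who wants to experiment anyway, a sketch of enabling it; the value is illustrative:)

import org.apache.spark.SparkConf

// undocumented knob: number of streaming jobs allowed to run concurrently
val conf = new SparkConf().set("spark.streaming.concurrentJobs", "2")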
I looked at the environment in which I ran the spark-submit command, and it
looks like there is nothing that could be messing with the classpath.
Just to be sure, I checked the web UI which says the classpath contains:
- The two jars I added: /path/to/avro-mapred-1.7.4-hadoop2.jar and
So I tried this:
.mapPartitions(itr => {
  itr.grouped(300).flatMap(items => {
    myFunction(items)
  })
})
and I tried this:
.mapPartitions(itr => {
  itr.grouped(300).flatMap(myFunction)
})
I tried making myFunction a method, a function val, and even moving it
into a singleton
It seems unlikely to me that it would be a 2.2 issue, though not entirely
impossible. Are you able to find any of the container logs? Is the
NodeManager launching containers and reporting some exit code?
-Sandy
On Thu, Feb 12, 2015 at 1:21 PM, Anders Arpteg arp...@spotify.com wrote:
No, not
You would probably write to hdfs or check out
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
You might be able to retrofit it to your use case.
--- Original Message ---
From: Su She suhsheka...@gmail.com
Sent: February 11, 2015 10:55 PM
To: Felix C
Not now, but see https://issues.apache.org/jira/browse/SPARK-3066
As an aside, it's quite expensive to make recommendations for all
users. IMHO this is not something to do, if you can avoid it
architecturally. For example, consider precomputing recommendations
only for users whose probability of
Thanks, Sean! Glad to know it will be in the future release.
On Thu, Feb 12, 2015 at 2:45 PM, Sean Owen so...@cloudera.com wrote:
Not now, but see https://issues.apache.org/jira/browse/SPARK-3066
As an aside, it's quite expensive to make recommendations for all
users. IMHO this is not
It looks to me like perhaps your SparkContext has shut down due to too many
failures. I'd look in the logs of your executors for more information.
On Thu, Feb 12, 2015 at 2:34 AM, lihu lihu...@gmail.com wrote:
I try to use the multi-thread to use the Spark SQL query.
some sample code just
The important thing here is the master's memory; that's where you're
getting the GC overhead limit.
your finished app when your app finishes, which would cause a spike in
memory usage.
I wouldn't expect the master to need a ton of memory just to serve the
Hi,
I wonder if there is a way to do fast top-N product recommendations for all
users in training using MLlib's ALS algorithm.
I am currently calling
public Rating[] recommendProducts(int user,
(Rating: http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/mllib/recommendation/Rating.html)
You can start a JDBC server with an existing context. See my answer here:
http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html
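(In short, something along these lines; a sketch, assuming a Spark build with the thrift server and an existing SparkContext sc:)

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
// register tables on this context first, then expose the same context over JDBC/ODBC
HiveThriftServer2.startWithContext(hiveContext)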
On Thu, Feb 12, 2015 at 7:24 AM, Todd Nist tsind...@gmail.com wrote:
I have a question with regards to accessing
Looks like my clock is in sync:
-bash-4.1$ date; curl -v s3.amazonaws.com
Thu Feb 12 21:40:18 UTC 2015
* About to connect() to s3.amazonaws.com port 80 (#0)
*   Trying 54.231.12.24... connected
* Connected to s3.amazonaws.com (54.231.12.24) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.19.7
In Spark 1.3, Parquet tables that are created through the data sources API
will automatically calculate the sizeInBytes, which is used to broadcast.
On Thu, Feb 12, 2015 at 12:46 PM, Dima Zhiyanov dimazhiya...@hotmail.com
wrote:
Hello
Has Spark implemented computing statistics for Parquet
Hi Sean,
I am reading the paper on implicit training:
Collaborative Filtering for Implicit Feedback Datasets
http://labs.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf
It mentioned
To this end, let us introduce
a set of binary variables p_ui, which indicates the preference of user u to
item i. The
We are using Gradient Boosting/Random Forests, which I have found provide the
best results for our recommendations. My issue is that I need the
probability of the 0/1 label, and not the predicted label. In the Spark
Scala API, I see that the predict method also has an option to provide the
OK, I think I have to use None instead of null; then it works. Still switching
over from Java.
I can also just use the field name, as I assumed.
Great experience.
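(For anyone else making the same switch, a minimal sketch of the Option-based pattern; the RDDs and the default value are illustrative:)

val left = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
val right = sc.parallelize(Seq(("k1", "a")))
// leftOuterJoin yields Option[W] on the right side: None where there is no match
val joined = left.leftOuterJoin(right)
  .map { case (k, (v, opt)) => (k, v, opt.getOrElse("missing")) }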
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: spark left outer join with java.lang.UnsupportedOperationException:
empty
Any ideas on how to figure out what is going on when using json4s 3.2.11?
I need to use 3.2.11, and just to see if things would work I downgraded
to 3.2.10, and things started working.
On Wed, Feb 11, 2015 at 11:45 AM, Charles Feduke charles.fed...@gmail.com
wrote:
I was having a similar
Thanks very much, you're right.
I called sc.stop() before the thread pool shut down.
On Fri, Feb 13, 2015 at 7:04 AM, Michael Armbrust mich...@databricks.com
wrote:
It looks to me like perhaps your SparkContext has shut down due to too
many failures. I'd look in the logs of your
Where there is no user-item interaction, you provide no interaction,
not an interaction with strength 0. Otherwise your input is fully
dense.
On Thu, Feb 12, 2015 at 11:09 PM, Crystal Xing crystalxin...@gmail.com wrote:
Hi,
I have some implicit rating data, such as the purchasing data. I read
The more I think about this, I may try this instead:
val myChunkedRDD: RDD[List[Event]] = inputRDD.mapPartitions(_.grouped(300).map(_.toList))
I wonder if this would work. I'll try it when I get back to work tomorrow.
Yuyhao, I tried your approach too but it seems to be somehow moving all the
Hi all - I've spent a while playing with this. Two significant sources of speed
up that I've achieved are
1) Manually multiplying the feature vectors and caching either the user or
product vector
2) By doing so, if one of the RDDs is a global it becomes possible to
parallelize this step by
This all describes how the implementation operates, logically. The
matrix P is never formed, for sure, certainly not by the caller.
The implementation actually extends to handle negative values in R too
but it's all taken care of by the implementation.
On Thu, Feb 12, 2015 at 11:29 PM, Crystal
Many thanks!
On Thu, Feb 12, 2015 at 3:31 PM, Sean Owen so...@cloudera.com wrote:
This all describes how the implementation operates, logically. The
matrix P is never formed, for sure, certainly not by the caller.
The implementation actually extends to handle negative values in R too
but
Hi,
I have some implicit rating data, such as purchasing data. I read the
paper about the implicit training algorithm used in Spark, and it mentioned
that for user-product pairs which do not have implicit rating data, such as
no purchase, we need to provide the value 0.
This is different
Hi,
I am using Spark 1.2.0 with Hadoop 2.2. I have 2 CSV files, each with 8
fields. I know that the first field in both files is an ID. I want to find all
the IDs that exist in the first file, but NOT in the 2nd file.
I came up with the following code in spark-shell.
case class origAsLeft
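(An alternative sketch that avoids the case classes, assuming the ID is the first comma-separated field; the paths are illustrative:)

val ids1 = sc.textFile("file1.csv").map(_.split(",")(0))
val ids2 = sc.textFile("file2.csv").map(_.split(",")(0))
// IDs present in the first file but not the second
val onlyInFirst = ids1.subtract(ids2)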
Thanks Michael. I will give it a try.
On Thu, Feb 12, 2015 at 6:00 PM, Michael Armbrust mich...@databricks.com
wrote:
You can start a JDBC server with an existing context. See my answer here:
KMeans.train actually returns a KMeansModel, so you can use the predict() method
of the model, e.g.
clusters.predict(pointToPredict)
or
clusters.predict(pointsToPredict)
The first takes a single Vector, the second an RDD[Vector].
Robin
On 12 Feb 2015, at 06:37, Shi Yu shiyu@gmail.com wrote:
Hi there,
I
No, mr1 should not be the issue here, and I think that would break
other things. The OP is not using mr1.
client 4 / server 7 means roughly client is Hadoop 1.x, server is
Hadoop 2.0.x. Normally, I'd say I think you are packaging Hadoop code
in your app by bringing in Spark and its deps. Your app
The map will start with a capacity of 64, but will grow to accommodate
new data. Are you using the groupBy operator in Spark or are you using
Spark SQL's group by? This usually happens if you are grouping or
aggregating in a way that doesn't sufficiently condense the data
created from each input
Interesting to hear that it works for you. Are you using Yarn 2.2 as well?
No strange log messages during startup, and I can't see any other log messages
since no executor gets launched. It does not seem to work in yarn-client mode
either, failing with the exception below.
Exception in thread main
When you log in, you have root access. Then you can do “su hdfs” or switch to
any other account. Then you can create the HDFS directory, change permissions, etc.
Thanks
Zhan Zhang
On Feb 11, 2015, at 11:28 PM, guxiaobo1982
guxiaobo1...@qq.commailto:guxiaobo1...@qq.com wrote:
Hi Zhan,
Yes, I found
d...@spark.apache.org
http://apache-spark-developers-list.1001551.n3.nabble.com/ mentioned on
http://spark.apache.org/community.html seems to be bouncing. Is there
another one?
Can you give me the whole logs?
TD
On Tue, Feb 10, 2015 at 10:48 AM, Jon Gregg jonrgr...@gmail.com wrote:
OK that worked and getting close here ... the job ran successfully for a
bit and I got output for the first couple buckets before getting a
java.lang.Exception: Could not compute split,
1. Can you try count()? take() often does not force the entire computation.
2. Can you give the full log? From the log it seems that the blocks are
added to two nodes but the tasks seem to be launched on different nodes. I
don't see any message removing the blocks. So I need the whole log to debug
Hi Tim,
I think this code will still introduce a shuffle even when you call
repartition on each input stream. Actually, this style of implementation
will generate more jobs (one job per input stream) than unioning into one
stream via DStream.union(), and union normally has no special
overhead, as
Apache Zeppelin also has a scheduler and then you can reload your chart
periodically.
Check it out:
http://zeppelin.incubator.apache.org/docs/tutorial/tutorial.html
On Fri Feb 13 2015 at 7:29:00 AM Silvio Fiorito
silvio.fior...@granturing.com wrote:
One method I’ve used is to publish each
Could you come up with a minimal example through which I can reproduce the
problem?
On Tue, Feb 10, 2015 at 12:30 PM, conor fennell.co...@gmail.com wrote:
I am getting the following error when I kill the spark driver and restart
the job:
15/02/10 17:31:05 INFO CheckpointReader: Attempting to
Hi,
I've been reading about Spark SQL, and people suggest that using HiveContext
is better. So can anyone please suggest a solution to the above problem?
This is stopping me from moving forward with HiveContext.
Thanks
Harika
I have a temporal data set that I'd like to be able to query using
Spark SQL. The dataset is actually in Accumulo, and I've already written a
CatalystScan implementation and RelationProvider[1] to register with the
SQLContext so that I can apply my SQL statements.
With my current
Hi Corey,
I would not recommend using CatalystScan for this. It's lower level,
and not stable across releases.
You should be able to do what you want with PrunedFilteredScan
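(A rough skeleton of that trait against the 1.3 data sources API, written from memory, so treat it as a sketch; the relation class and schema are illustrative:)

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class AccumuloRelation(@transient val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("key", StringType), StructField("value", StringType)))

  // Spark pushes down the required columns and (best-effort) filters;
  // translate the filters into Accumulo ranges here instead of returning an empty RDD
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}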
Thank you. That worked.
2015-02-12 20:03 GMT+04:00 Imran Rashid iras...@cloudera.com:
You need to import the implicit conversions to PairRDDFunctions with
import org.apache.spark.SparkContext._
(note that this requirement will go away in 1.3:
Your earlier call stack clearly states that it fails because the Derby
metastore has already been started by another instance, so I think that is
explained by your attempt to run this concurrently.
Are you running Spark standalone? Do you have a cluster? You should be able to
run Spark in
Hi Gerard,
Great write-up and really good guidance in there.
I have to be honest, I don't know why, but setting the # of partitions for each
dStream to a low number (5-10) just causes the app to choke/crash. Setting
it to 20 gets the app going, but with not-so-great delays. Bump it up to 30
and I
This looks like your executors aren't running a version of Spark with Hive
support compiled in.
On Feb 12, 2015 7:31 PM, Wush Wu w...@bridgewell.com wrote:
Dear Michael,
After using the org.apache.spark.sql.hive.HiveContext, the exception:
java.util.NoSuchElementException: key not found: hour
dev@spark is active.
e.g. see:
http://search-hadoop.com/m/JW1q5zQ1Xw/renaming+SchemaRDD+-%253E+DataFramesubj=renaming+SchemaRDD+gt+DataFrame
Cheers
On Thu, Feb 12, 2015 at 8:09 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
d...@spark.apache.org
Sorry for the late response. With the amount of data you are planning to join,
any system would take time. However, between Hive's MapReduce joins,
Spark's basic shuffle, and Spark SQL's join, the latter wins hands down.
Furthermore, with the APIs of Spark and Spark Streaming, you will have to
do
Hey Tim,
Let me get the key points.
1. If you are not writing back to Kafka, the delay is stable? That is,
instead of foreachRDD { // write to kafka } if you do dstream.count,
then the delay is stable. Right?
2. If so, then Kafka is the bottleneck. Is the number of partitions, that
you spoke of
1) Yes, if I disable writing out to Kafka and replace it with some very
lightweight action like rdd.take(1), the app is stable.
2) The partitions I spoke of in the previous mail are the number of
partitions I create from each dStream. But yes, since I do processing and
writing out, per partition,
outdata.foreachRDD(rdd => rdd.foreachPartition(rec => {
  val writer = new KafkaOutputService(otherConf("kafkaProducerTopic").toString, propsMap)
  writer.output(rec)
}))
So this is creating a new Kafka producer for every new
No, not submitting from windows, from a debian distribution. Had a quick
look at the rm logs, and it seems some containers are allocated but then
released again for some reason. Not easy to make sense of the logs, but
here is a snippet from the logs (from a test in our small test cluster) if
you'd
One method I’ve used is to publish each batch to a message bus or queue with a
custom UI listening on the other end, displaying the results in d3.js or some
other app. As far as I’m aware there isn’t a tool that will directly take a
DStream.
Spark Notebook seems to have some support for
Increasing your driver memory might help.
Thanks
Best Regards
On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com
wrote:
Hi,
I have a Hidden Markov Model running with 200MB data.
Once the program finishes (i.e. all stages/jobs are done) the program
hangs for 20 minutes
Hello Everyone,
I am writing simple word counts to HDFS using
messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
String.class, (Class) TextOutputFormat.class);
1) However, every 2 seconds I get a new *directory* that is titled as a
csv. So I'll have test.csv, which will be a
Hi folks,
My Spark cluster has 8 machines, each of which has 377GB of physical memory,
and thus the total maximum memory that can be used for Spark is more than
2400GB. In my program, I have to deal with 1 billion (key, value) pairs,
where the key is an integer and the value is an integer array with
I replaced the writeToKafka statements with an rdd.count() and, sure enough,
I have a stable app with total delay well within the batch window (20
seconds). Here are the total delay lines from the driver log:
15/02/13 06:14:26 INFO JobScheduler: Total delay: 6.521 s for time
142380806 ms
I am new to Scala. I have a dataset with many columns, and each column has a
name. Given several column names (these column names are not fixed;
they are generated dynamically), I need to sum up the values of these
columns. Is there an efficient way of doing this?
I worked out a way by using
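(For what it's worth, a minimal sketch of one approach, assuming each row is available as a Map from column name to value; the names are illustrative:)

val cols = Seq("colA", "colB") // column names, generated dynamically
val rows = sc.parallelize(Seq(Map("colA" -> 1.0, "colB" -> 2.0, "colC" -> 3.0)))
// per-row sum of just the chosen columns
val rowSums = rows.map(row => cols.map(row).sum)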
Yes, you can try it. For example, if you have a cluster of 10 executors and 60
Kafka partitions, you can try to choose 10 receivers * 2 consumer threads,
so each thread will consume 3 partitions ideally; if you increase the
threads to 6, each thread will consume 1 partition ideally. What I think
For a streaming application, for every batch it will create a new directory
and put the data in it. If you don't want multiple part- files inside
the directory, then you can do a repartition before the saveAs*
call, as sketched below.
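(A sketch of that, assuming a DStream[(String, String)] called messages; the prefix and suffix are illustrative:)

import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.spark.streaming.StreamingContext._ // pair DStream functions

// collapse each batch to one partition so each output directory holds a single part- file
messages.repartition(1)
  .saveAsHadoopFiles[TextOutputFormat[String, String]]("hdfs://.../wordcount", "csv")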
Thanks for the reply. I solved the problem by importing
org.apache.spark.SparkContext._, per Imran Rashid's suggestion.
Out of interest, is JavaPairRDD intended for use from Java? What
is the purpose of this class? Is my RDD implicitly converted to it in
some circumstances?
2015-02-12 19:42
I haven't been paying close attention to the JIRA tickets for
PrunedFilteredScan but I noticed some weird behavior around the filters
being applied when OR expressions were used in the WHERE clause. From what
I was seeing, it looks like it could be possible that the start and end
ranges you
Just to add to what Arush said, you can go through these links:
http://stackoverflow.com/questions/1162375/apache-port-proxy
http://serverfault.com/questions/153229/password-protect-and-serve-apache-site-by-port-number
Thanks
Best Regards
On Thu, Feb 12, 2015 at 10:43 PM, Arush Kharbanda
Thanks Kevin for the link. I have had issues trying to install Zeppelin, as
I believe it is not yet supported for CDH 5.3 and Spark 1.2. Please
correct me if I am mistaken.
On Thu, Feb 12, 2015 at 7:33 PM, Kevin (Sangwoo) Kim kevin...@apache.org
wrote:
Apache Zeppelin also has a scheduler and
Hi Tim,
I think maybe you can try this way:
create a Receiver per executor and specify more than one thread per
topic; the total number of consumer threads will be: total consumers =
(receiver count) * (threads per receiver), and make sure this total consumer
count is less than or equal to Kafka
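(A sketch of that layout with the receiver-based API, assuming a StreamingContext ssc; the numbers, addresses, and topic are illustrative:)

import org.apache.spark.streaming.kafka.KafkaUtils

// e.g. 10 receivers, each consuming "myTopic" with 2 threads = 20 consumer threads total
val streams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("myTopic" -> 2))
}
val unified = ssc.union(streams)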
Hi Saisai,
If I understand correctly, you are suggesting to control parallelism by
having the number of consumers/executors at least 1:1 with the number of Kafka
partitions. For example, if I have 50 partitions for a Kafka topic, then
either have:
- 25 or more executors, 25 receivers, each receiver set
Looking at the script, I'm not sure whether --driver-memory is
supposed to work in standalone client mode. It's too late to set the
driver's memory if the driver is what's already running. It specially
handles the case where the value is the environment config though. Not
sure, this might be on
Hi,
I need to apply a function to all the elements of a RowMatrix.
How can I do that?
Here there is a more detailed question
http://stackoverflow.com/questions/28438908/spark-mllib-apply-function-to-all-the-elements-of-a-rowmatrix
Thanks a lot!
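(One way, sketched against the RDD of rows underneath the matrix; mat and f are illustrative:)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val f = (x: Double) => math.log(x + 1) // example element-wise function
// map over the underlying RDD[Vector] and rebuild the matrix
val mapped = new RowMatrix(mat.rows.map(v => Vectors.dense(v.toArray.map(f))))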
Can I invoke updateStateByKey twice on the same RDD? My requirement is as
follows:
1. Get the event stream from Kafka
2. UpdateStateByKey to aggregate and filter events based on timestamp
3. Do some processing and save results to Cassandra DB
4. UpdateStateByKey to remove keys based on logout
This is tricky to debug. Check the logs of YARN's node manager and resource
manager to see if you can trace the error. In the past I have had to look
closely at the arguments getting passed to the YARN container (they get
logged before attempting to launch containers). If I still didn't get a clue,
I had to check the
Very interesting. It works.
When I set SPARK_DRIVER_MEMORY=83971m in spark-env.sh or spark-defaults.conf,
it works.
However, when I set the --driver-memory option with spark-submit, the memory
is not allocated to the Spark master. (The web UI shows the correct value of
spark.driver.memory
Hi again,
I narrowed down the issue a bit more -- it seems to have to do with the
Kryo serializer. When I use it, this results in a null pointer:
rdd = sc.parallelize(range(10))
d = {}
from random import random
for i in range(10):
    d[i] = random()
rdd.map(lambda x: d[x]).collect()
Hi,
I believe that partitionBy will use the same (default) partitioner on
both RDDs.
On 2015-02-12 17:12, Sean Owen wrote:
Doesn't this require that both RDDs have the same partitioner?
On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com
wrote:
Hi Karlson,
I think your
Hi All,
Thanks in advance for your help. I have a timestamp which I need to convert to
a datetime using Scala. A folder contains the three needed jar files:
joda-convert-1.5.jar, joda-time-2.4.jar, nscala-time_2.11-1.8.0.jar
Using the Scala REPL and adding the jars: scala -classpath *.jar
I can use
Hi. I am stuck on how to save a file to HDFS from Spark.
I have written MyOutputFormat extends FileOutputFormat[String, MyObject],
then in Spark I call this:
rddres.saveAsHadoopFile[MyOutputFormat]("hdfs://localhost/output") or
rddres.saveAsHadoopFile("hdfs://localhost/output", classOf[String],
I have a question with regards to accessing SchemaRDDs and Spark SQL temp
tables via the thrift server. It appears that a SchemaRDD, when created, is
only available in the local namespace/context and is unavailable to
external services accessing Spark through the thrift server via ODBC; is this
Hi All,
using Pyspark, I create two RDDs (one with about 2M records (~200MB),
the other with about 8M records (~2GB)) of the format (key, value).
I've done a partitionBy(num_partitions) on both RDDs and verified that
both RDDs have the same number of partitions and that equal keys reside
on
Hi,
I have a model and I am trying to predict regPoints. Here is the code that
I have used.
A more detailed question is available at
http://stackoverflow.com/questions/28482476/spark-mllib-predict-error-with-map
scala> model
res26: org.apache.spark.mllib.regression.LinearRegressionModel =
You could apply a password using a filter on the server, though this doesn't
look like the right group for the question. It can be done for Spark too,
for the Spark UI.
On Thu, Feb 12, 2015 at 10:19 PM, MASTER_ZION (Jairo Linux)
master.z...@gmail.com wrote:
Hi everyone,
I'm creating a development