Well said Will. I would add that you might want to investigate GraphChi which
claims to be able to run a number of large-scale graph processing tasks on a
workstation much quicker than a very large Hadoop cluster. It would be
interesting to know how widely applicable the approach GraphChi takes
Hi,
I am trying one transformation by calling a Scala method.
This Scala method returns MutableList[AvroObject]:
def processRecords(id: String, list1: Iterable[(String, GenericRecord)]):
scala.collection.mutable.MutableList[AvroObject]
Hence, the output of the transformation is
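A self-contained sketch of that shape may help. AvroObject and GenericRecord aren't available here, so a stand-in Record class is used, and ListBuffer stands in for MutableList (which was removed in newer Scala versions):

```scala
import scala.collection.mutable

// Stand-in for GenericRecord / AvroObject, which aren't available here.
case class Record(value: String)

// Same shape as the processRecords signature from the post,
// with the Avro types replaced by the stand-in.
def processRecords(id: String, list1: Iterable[(String, Record)]): mutable.ListBuffer[Record] = {
  val out = mutable.ListBuffer[Record]()
  list1.foreach { case (_, rec) => out += rec }
  out
}

// In Spark this would typically be invoked per key, e.g.
//   rdd.groupByKey().map { case (id, recs) => processRecords(id, recs) }
// giving an RDD whose elements are mutable lists of records.
val result = processRecords("id-1", Seq(("id-1", Record("a")), ("id-1", Record("b"))))
```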
I personally would try to avoid updateStateByKey for sessionization when
you have long sessions / a lot of keys, because it's linear in the number
of keys.
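For context, the update function that updateStateByKey takes has the shape (Seq[V], Option[S]) => Option[S], and Spark Streaming invokes it for every tracked key on each batch, which is where the linear cost comes from. A minimal sketch of such a function, using events-per-session counts as a hypothetical choice of state (testable without a StreamingContext):

```scala
// Sessionization update function of the shape updateStateByKey expects:
// (new events this batch, previous state) => new state.
// State here is just the event count for the session (a hypothetical choice).
val updateSession: (Seq[Int], Option[Int]) => Option[Int] =
  (newEvents, prevState) =>
    if (newEvents.isEmpty && prevState.isEmpty) None
    else Some(prevState.getOrElse(0) + newEvents.size)

// Spark Streaming applies this to *every* key seen so far on each batch --
// hence the linear-in-keys cost the post warns about.
```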
On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das t...@databricks.com wrote:
[Apologies for repost, for those who have seen this response
Assuming you run Spark locally (i.e. either local mode or a standalone cluster
on your local m/c):
1. You need to have the Hadoop binaries locally
2. You need to have hdfs-site.xml on the Spark classpath of your local m/c
I would suggest you start off with local files to play around.
If you need to run spark
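As a sketch of step 2 above, assuming a typical Hadoop layout (the path is illustrative, adjust to where your Hadoop conf actually lives), you can point Spark at the Hadoop conf directory in conf/spark-env.sh:

```shell
# conf/spark-env.sh  (illustrative path -- adjust to your install)
# Directory containing hdfs-site.xml and core-site.xml; Spark picks it up
# from this variable and puts it on the classpath.
export HADOOP_CONF_DIR=/etc/hadoop/conf
```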
Hi,
I have been playing with the SparkR API that was introduced in Spark 1.4.
Can we use any MLlib functionality from R as of now? From the
documentation it looks like we can only use SQL/DataFrame functionality as
of now. I know there is a separate SparkR project but it does not
Thanks.
t <- getField(df$hashtags, "text") does return a Column. But when I tried
to call t <- getField(df$hashtags, "text"), it would give an error:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* getField(df$hashtags, "text")
In fact, the text field in df
There seems to be a bit of confusion here - the OP (doing the PhD) had the
thread hijacked by someone with a similar name asking a mundane question.
It would be a shame to rudely send away someone who may do valuable
work on Spark.
Sashidar (not Sashid!) I'm personally interested in running
Hi,
There is no MLlib support in SparkR in 1.4. There will be some support in
1.5. You can check these JIRAs for progress:
https://issues.apache.org/jira/browse/SPARK-6805
https://issues.apache.org/jira/browse/SPARK-6823
Best,
Burak
On Wed, Jul 15, 2015 at 6:00 AM, madhu phatak
An in-memory hash data structure keyed by session, so that you're close to
linear in the number of items in a batch, not the number of outstanding
keys. That's more complex, because you have to deal with expiration for
keys that never get hit, and for unusually long sessions you have to either
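A minimal sketch of that idea, assuming the state you track per key is just a last-seen timestamp (class and method names here are invented for illustration):

```scala
import scala.collection.mutable

// Hash map keyed by session key, storing last-seen time, swept for
// expired entries periodically.
class SessionStore(timeoutMs: Long) {
  private val lastSeen = mutable.HashMap[String, Long]()

  // Touch only the keys present in this batch -- cost is linear in the
  // batch size, not in the number of outstanding keys.
  def update(batch: Seq[String], now: Long): Unit =
    batch.foreach(k => lastSeen(k) = now)

  // Expire keys that never got hit again. The sweep itself walks all keys,
  // so how often to run it is a tuning choice.
  def expire(now: Long): Seq[String] = {
    val dead = lastSeen.collect { case (k, t) if now - t > timeoutMs => k }.toSeq
    dead.foreach(lastSeen -= _)
    dead
  }

  def size: Int = lastSeen.size
}
```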
Hello all,
I upgraded from spark 1.3.1 to 1.4.0, but I'm experiencing a massive drop in
performance for the application I'm running. I've (somewhat) reproduced this
behaviour in the attached file.
My current spark setup may not be optimal exactly for this reproduction, but
I see that Spark
I use a simple map/reduce step in a Java/Spark program to remove duplicated
documents from a large (10 TB compressed) sequence file containing some
html pages. Here is the partial code:
JavaPairRDD<BytesWritable, NullWritable> inputRecords =
    sc.sequenceFile(args[0], BytesWritable.class, NullWritable.class);
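The dedup step described here boils down to keying each document and keeping one value per key (what a reduceByKey((a, b) => a) would do on the Spark side). A plain-Scala sketch of the same idea, with the page content itself standing in for whatever key the real program derives:

```scala
// Plain-collections version of the dedup idea: key each document (here by
// its content -- a hypothetical key choice) and keep one document per key.
val docs = Seq("pageA", "pageB", "pageA")
val deduped = docs
  .map(d => (d.hashCode, d))                // map step: key by content
  .groupBy(_._1)                            // shuffle analogue
  .map { case (_, group) => group.head._2 } // reduce step: keep one per key
  .toSeq
```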
I just installed Spark 1.3.1 on Windows 2008 Server. When I start spark-shell,
I get the following error:
Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could not
initialize class org.fusesource.jansi.internal.Kernel32
Please advise. Thanks.
Ningjun
Hi,
I have been working a bit with RDDs, and am now taking a look at DataFrames.
The schema definition using case classes looks very attractive;
https://spark.apache.org/docs/1.4.0/sql-programming-guide.html#inferring-the-schema-using-reflection
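Spark's reflection-based inference reads the schema off the case class's fields; a minimal illustration of the field names it would pick up, runnable without Spark (the Person class is just an example):

```scala
// Example case class -- Spark 1.4 would infer the schema (name string,
// age int) from its constructor fields via reflection. In Spark the
// equivalent would be:
//   import sqlContext.implicits._
//   val df = Seq(Person("Ann", 30)).toDF()
case class Person(name: String, age: Int)

// Plain JVM reflection shows the fields the inference works from.
val fieldNames = classOf[Person].getDeclaredFields.map(_.getName).toSeq
```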
Yes. I did this recently. You need to copy the Cloudera cluster's conf
files onto the local machine
and set HADOOP_CONF_DIR or YARN_CONF_DIR.
The local machine should also be able to ssh to the Cloudera cluster.
On Wed, Jul 15, 2015 at 8:51 AM, ayan guha guha.a...@gmail.com wrote:
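For example (the paths below are illustrative), after copying the cluster's conf directory down to the local machine:

```shell
# Hypothetical local copy of the cluster's yarn-site.xml, core-site.xml,
# hdfs-site.xml, etc.
export YARN_CONF_DIR=/opt/cluster-conf
# Then submit against the cluster, e.g.:
#   spark-submit --master yarn-client ...
```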
Hi forum
I have downloaded the latest Spark version, 1.4.0, and started using it.
But I couldn't find the compute-classpath.sh file in bin/, which I was using
in previous versions to provide third-party libraries to my application.
Can anyone please let me know where I can provide CLASSPATH with my
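For what it's worth, compute-classpath.sh was removed in 1.4.0; the supported routes for adding third-party jars are the --jars flag at submit time or the extraClassPath properties (paths below are illustrative):

```shell
# Either pass the jars when submitting:
spark-submit --jars /path/to/thirdparty1.jar,/path/to/thirdparty2.jar ...

# or set the classpath properties in conf/spark-defaults.conf:
#   spark.driver.extraClassPath    /path/to/libs/*
#   spark.executor.extraClassPath  /path/to/libs/*
```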