Hi Chetan,
You can create a static Parquet file and, when you create the DataFrame, pass the locations of both files with the option mergeSchema set to true. That way you will always get a DataFrame back, even if the original file is not present.
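For example (the paths here are just placeholders, and this assumes the Spark 1.3+ DataFrameReader API with sqlContext from the shell):

// Read both Parquet files together; mergeSchema reconciles the schemas of
// all the files that are read into one DataFrame schema.
val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/data/static_schema.parquet", "/data/original.parquet")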
Kuchekar, Nilesh
On Sat, May 9
that tables that are most frequently joined together are co-located.
Any thoughts on how I can do this with Spark? Any internal hack ideas are
welcome too. :)
Cheers,
Nilesh
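One common way to get this kind of co-location for plain RDD joins (just a sketch, not something from this thread) is to hash-partition both sides with the same partitioner and cache them, so that repeated joins of the two datasets avoid a full shuffle:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits on older Spark versions

// Assumes a SparkContext `sc` (e.g. from the shell); the data is a placeholder.
val partitioner = new HashPartitioner(100)
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).cache()
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner).cache()

// Both sides now share the same partitioner, so this join needs no extra shuffle.
val joined = left.join(right)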
the probability. Can you provide pointers to any documentation I can reference for implementing this? Thanks!
-Nilesh
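For what it's worth, one workaround often used with the old MLlib tree-ensemble API (my own sketch, assuming a binary-classification RandomForestModel, not something from any official docs) is to take the fraction of trees voting for a class as an approximate probability:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Approximate P(class = 1.0) as the fraction of trees that vote for class 1.0.
// `model` is a trained binary-classification RandomForestModel.
def positiveProbability(model: RandomForestModel, features: Vector): Double = {
  val votes = model.trees.map(_.predict(features))
  votes.count(_ == 1.0).toDouble / votes.length
}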
to (org.apache.spark.mllib.linalg.Vector)
val transformedValues = idfModel.transform(values)
It seems to be getting confused between the multiple (Java and Scala) transform methods.
Any insights?
Thanks,
Nilesh
Did some digging in the documentation. It looks like IDFModel.transform only accepts an RDD as input, and not individual elements. Is this a bug? I am saying this because HashingTF.transform accepts both an RDD and individual vector elements as its input.
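If the version in use really only has the RDD overload, one workaround (my sketch, with placeholder values, assuming a fitted IDFModel `idfModel` and a SparkContext `sc`) is to wrap the single vector in a one-element RDD and pull the result back out:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Placeholder vector standing in for one hashed document.
val singleVector: Vector = Vectors.dense(1.0, 0.0, 3.0)

// The explicit Seq[Vector] keeps the element type unambiguous, so the
// RDD[Vector] overload of IDFModel.transform is selected.
val transformedSingle: Vector =
  idfModel.transform(sc.parallelize(Seq[Vector](singleVector))).first()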
From your post replying to Jatin, looks like
).
This will avoid writing to HDFS (the replication stays in Spark memory), while still truncating the lineage (by creating new BlockRDDs) and avoiding the stack overflow error.
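I'm not sure this is exactly the mechanism meant above, but for reference, later Spark versions (1.5+) expose RDD.localCheckpoint(), which has a similar effect: the lineage is cut using the caching layer on the executors instead of writing to HDFS. A minimal sketch, assuming a SparkContext `sc` from the shell:

// Mark the RDD for local checkpointing; the checkpoint data stays in executor
// storage rather than a reliable filesystem.
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.localCheckpoint()
rdd.count()  // running an action materializes the checkpoint and drops the old lineage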
Now I'm not sure how to do no. 3. Any ideas? I'm CC'ing Tathagata too.
Cheers,
Nilesh
, and (b) I wouldn't even be able to retrieve the dense vector iteratively, and my vector would be bound by the driver node's memory.
Any ideas how I can make this work for me?
Cheers,
Nilesh
Spark workers that are local to the worker actors should get the data fast, and I assume some optimization like this is already done?
I suppose the only benefit of HDFS would be better fault tolerance, and the ability to checkpoint and recover even if the master fails.
Cheers,
Nilesh
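For the checkpoint-and-recover part, the usual pattern is to put the checkpoint directory on HDFS and rebuild the StreamingContext with getOrCreate on restart. A sketch with placeholder app name, batch interval, source, and paths:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder checkpoint directory on HDFS.
val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Placeholder source and output; replace with the real pipeline.
  ssc.socketTextStream("localhost", 9999).print()
  ssc
}

// After a driver failure, this rebuilds the context from the checkpoint data
// instead of calling createContext again from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()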