Re: AnalysisException - Infer schema for the Parquet path

2020-05-10 Thread Nilesh Kuchekar
Hi Chetan, you can have a static Parquet file created, and when you create a DataFrame you can pass the locations of both files, with the option mergeSchema set to true. This will always get you a DataFrame even if the original file is not present. Kuchekar, Nilesh …
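
For reference, a minimal sketch of the suggested read (the paths and app name are hypothetical placeholders; mergeSchema is a documented Parquet option on the DataFrameReader):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-schema-sketch").getOrCreate()

    // Read both locations together; mergeSchema reconciles differing column
    // sets across the files, so the static "schema anchor" file pins the
    // expected schema. Both paths are placeholders.
    val df = spark.read
      .option("mergeSchema", "true")
      .parquet("/data/static_schema.parquet", "/data/incoming.parquet")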

Custom positioning/partitioning Dataframes

2016-06-03 Thread Nilesh Chakraborty
… so that tables that are most frequently joined together are located together locally. Any thoughts on how I can do this with Spark? Any internal hack ideas are welcome too. :) Cheers, Nilesh
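
One way to get this effect with public APIs (no internal hacks) is to hash-partition both sides of a frequent join with the same Partitioner, so matching keys are co-located and the join becomes a narrow, shuffle-free dependency. A minimal sketch, assuming two pair RDDs leftRaw and rightRaw of type RDD[(K, V)]:

    import org.apache.spark.HashPartitioner

    // Co-partition both RDDs with the same partitioner and keep them cached.
    val part  = new HashPartitioner(64)      // partition count is an arbitrary choice
    val left  = leftRaw.partitionBy(part).persist()
    val right = rightRaw.partitionBy(part).persist()

    // Both sides share a partitioner, so this join avoids a full shuffle.
    val joined = left.join(right)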

Predicting Class Probability with Gradient Boosting/Random Forest

2015-02-12 Thread nilesh
… the probability. Can you provide any pointers to documentation that I can reference for implementing this? Thanks! -Nilesh
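
For later readers: the RDD-based MLlib API of that era only exposed a hard 0/1 label through predict, so a common workaround was to average the votes of the individual trees yourself. A rough sketch for a binary-classification RandomForestModel (the model and feature vector are assumed to exist):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.RandomForestModel

    // Approximate P(class = 1) as the fraction of trees that vote 1.
    def probabilityOfOne(model: RandomForestModel, features: Vector): Double = {
      val votes = model.trees.map(_.predict(features))  // each tree returns 0.0 or 1.0
      votes.sum / votes.length
    }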

Re: New API for TFIDF generation in Spark 1.1.0

2014-10-09 Thread nilesh
… to (org.apache.spark.mllib.linalg.Vector): val transformedValues = idfModel.transform(values). It seems to be getting confused between the multiple (Java and Scala) transform methods. Any insights? Thanks, Nilesh
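
For context, the usual shape of the TF-IDF flow in the Spark 1.1.0 mllib.feature API, where both fit and transform operate on an RDD of term-frequency vectors (documents stands in for an assumed RDD[Seq[String]] of tokenized text):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()                                   // IDF.fit makes a pass over the data

    val idfModel = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idfModel.transform(tf)  // RDD in, RDD out in 1.1.0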

Re: New API for TFIDF generation in Spark 1.1.0

2014-10-09 Thread nilesh
Did some digging in the documentation. It looks like IDFModel.transform only accepts an RDD as input, and not individual elements. Is this a bug? I am saying this because HashingTF.transform accepts both an RDD and individual vector elements as its input. From your post replying to Jatin, it looks like …
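
Not a bug so much as a gap in the 1.1.0 API; a per-Vector overload of IDFModel.transform only appeared in a later release. Until then, one stopgap is to wrap the single vector in a one-element RDD (inefficient, but it works; idfModel and the SparkContext sc are assumed to be in scope):

    import org.apache.spark.mllib.linalg.Vector

    // Workaround: transform one vector by wrapping it in a one-element RDD.
    def transformOne(tfVector: Vector): Vector =
      idfModel.transform(sc.parallelize(Seq(tfVector))).first()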

Alternative to checkpointing and materialization for truncating lineage in high iteration jobs

2014-06-28 Thread Nilesh Chakraborty
… This will avoid writing to HDFS (replicating within Spark memory) but truncate the lineage (by creating new BlockRDDs) and avoid the stack-overflow error. Now I'm not sure how to do no. 3. Any ideas? I'm CC'ing Tathagata too. Cheers, Nilesh
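
For later readers: Spark eventually added RDD.localCheckpoint(), which does essentially what is described here, truncating the lineage against cached blocks instead of writing to reliable storage (at the cost of losing recoverability if an executor is lost). A sketch of its use in a high-iteration loop (initialRDD, step, and numIterations are assumed from context):

    var data = initialRDD
    for (i <- 1 to numIterations) {
      data = step(data)              // step: RDD[T] => RDD[T]
      if (i % 10 == 0) {
        data.localCheckpoint()       // cache on executors and cut the lineage here
        data.count()                 // action to force materialization
      }
    }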

Accumulable with huge accumulated value?

2014-06-14 Thread Nilesh Chakraborty
… and (b) I wouldn't even be able to retrieve the dense vector iteratively, so my vector would become driver-node-memory bound. Any ideas how I can make this work for me? Cheers, Nilesh [1]: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html [2 …
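
One accumulator-free alternative is to keep the reduction on the cluster instead of on the driver: represent the contributions as (index, value) pairs, reduce them distributedly, and only save or collect what actually fits. A sketch, assuming an RDD[(Int, Double)] named contributions and a hypothetical output path:

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    // Sum per vector index on the executors; nothing large lands on the driver.
    val summed = contributions.reduceByKey(_ + _)

    // Persist the result distributedly, or pull back only the non-zero entries.
    summed.saveAsObjectFile("/tmp/summed-vector")
    val nonZero = summed.filter { case (_, v) => v != 0.0 }.collect()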

Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
… Spark workers local to the worker actors should get the data fast, and I assume some optimization like this is done? I suppose the only benefit of HDFS would be better fault tolerance, and the ability to checkpoint and recover even if the master fails. Cheers, Nilesh
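
On the recovery point: the standard pattern for surviving a driver restart in Spark Streaming is StreamingContext.getOrCreate with a checkpoint directory, which rebuilds the context from checkpointed state when one exists. A minimal sketch with placeholder host, port, and paths:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("socket-stream-sketch")
      val ssc  = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      ssc.socketTextStream("worker-host", 9999).count().print()  // placeholder endpoint
      ssc
    }

    // Recover from the checkpoint if present, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()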