Re: How VectorIndexer works in Spark ML pipelines

2015-10-18 Thread Jorge Sánchez
Vishnu, VectorIndexer will add metadata regarding which features are categorical and which are continuous, depending on the threshold: if a feature has more distinct values than the *maxCategories* parameter, it will be treated as continuous.
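
For illustration, a minimal sketch of that behavior (assuming a hypothetical DataFrame `data` with a vector column named "features"):

    import org.apache.spark.ml.feature.VectorIndexer

    // Features with more than maxCategories distinct values are left as
    // continuous; the rest are indexed as categorical, and the decision is
    // recorded in the output column's metadata for downstream stages.
    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(10)

    val indexerModel = indexer.fit(data)   // learns which features are categorical
    val indexed = indexerModel.transform(data)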

Re: dataframes and numPartitions

2015-10-18 Thread Jorge Sánchez
Alex, if not, you can try using the functions coalesce(n) or repartition(n). As per the API, coalesce will not trigger a shuffle but repartition will. Regards. 2015-10-16 0:52 GMT+01:00 Mohammed Guller: > You may find the spark.sql.shuffle.partitions property useful. The
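
As a minimal sketch (with a hypothetical DataFrame `df`): repartition shuffles the data into exactly n partitions, while coalesce only merges existing partitions, so it avoids a full shuffle but can only decrease the count.

    // coalesce: narrow dependency, no shuffle; only lowers the partition count
    val narrowed = df.coalesce(10)

    // repartition: full shuffle; can set any target partition count
    val reshuffled = df.repartition(200)

    // The property quoted above sets the partition count Spark SQL uses
    // after shuffles (joins, aggregations); 200 is the default.
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")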

Re: Implement "LIKE" in SparkSQL

2015-09-14 Thread Jorge Sánchez
I think after you get your table as a DataFrame, you can do a filter over it, something like:

    val t = sqlContext.sql("select * from table t")
    val df = t.filter(t("a").contains(t("b")))

Let us know the results. 2015-09-12 10:45 GMT+01:00 liam: > OK, I got another way, it
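
For comparison, a minimal sketch (hypothetical DataFrame `t` with string columns a and b): Column.contains tests for a substring taken from another column, while like() takes a literal SQL pattern with % wildcards.

    // Rows where the value of column b occurs as a substring of column a
    val bySubstring = t.filter(t("a").contains(t("b")))

    // Rows where column a matches a literal SQL LIKE pattern
    val byPattern = t.filter(t("a").like("%foo%"))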

Re: an error when I read data from parquet

2016-02-22 Thread Jorge Sánchez
Hi Alex, it seems there is a problem with Spark Notebook itself, so I suggest you follow the issue there (or try Apache Zeppelin or the Spark shell directly if notebooks are not a requirement): https://github.com/andypetrella/spark-notebook/issues/380 Regards.

Re: Sqoop on Spark

2016-04-06 Thread Jorge Sánchez
Ayan, there was a talk at Spark Summit: https://spark-summit.org/2015/events/Sqoop-on-Spark-for-Data-Ingestion/ Apparently they had a lot of problems and the project seems abandoned. If you just have to do simple ingestion of a full table or a simple query, just use Sqoop as suggested by Mich,
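
If the goal is to do that ingestion in Spark itself, a minimal sketch using the built-in JDBC data source (all connection details hypothetical):

    // Reads a full table over JDBC into a DataFrame; a simple query can be
    // used instead by passing "(SELECT ...) AS t" as the dbtable option.
    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb")  // hypothetical URL
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .load()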

Re: how to merge dataframe write output files

2016-11-10 Thread Jorge Sánchez
Do you have the logs of the containers? This seems like a memory issue. 2016-11-10 7:28 GMT+00:00 lk_spark: > hi, all: > when I call the df.write.parquet API, there are a lot of small files; how > can I merge them into one file? I tried df.coalesce(1).write.parquet, but > it
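
A minimal sketch of the two options (hypothetical DataFrame `df` and output paths): both produce a single output file, but they put the memory pressure in different places.

    // coalesce(1) avoids a shuffle, but the whole upstream computation
    // then runs in a single task, which can exhaust that task's memory.
    df.coalesce(1).write.parquet("/tmp/out")

    // repartition(1) pays for a full shuffle, but the upstream stages
    // still run in parallel; only the final write is a single task.
    df.repartition(1).write.parquet("/tmp/out2")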

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Jorge Sánchez
Hi Gerard, have you tried running in yarn-client mode? If so, do you still get that same error? Regards. 2016-12-05 12:49 GMT+00:00 Gerard Casey: > Edit. From here I read that
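
A minimal sketch of such a client-mode submission (app jar, class, principal, and keytab all hypothetical):

    # Same job, but with the driver running locally in client mode, to see
    # whether the Kerberos error is specific to cluster mode.
    spark-submit \
      --master yarn \
      --deploy-mode client \
      --principal user@EXAMPLE.COM \
      --keytab /path/to/user.keytab \
      --class com.example.App \
      app.jar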