Re: Please Help with DecisionTree/FeatureIndexer

2017-12-16 Thread Weichen Xu
Hi Marco, Yes, you can apply `VectorAssembler` first in the pipeline to assemble multiple feature columns. Thanks. On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni wrote: > Hello Wei > Thanks, I should have checked the data > My data has this format >

Re: Please Help with DecisionTree/FeatureIndexer

2017-12-16 Thread Marco Mistroni
Hello Wei Thanks, I should have checked the data. My data has this format |col1|col2|col3|label| so it looks like I cannot use VectorIndexer directly (it accepts a Vector column). I am guessing what I should do is something like this (given I have few categorical features) val assembler = new

Re: How to...UNION ALL of two SELECTs over different data sources in parallel?

2017-12-16 Thread Silvio Fiorito
Hi Jacek, Just replied to the SO thread as well, but… Yes, your first statement is correct. The DFs in the union are read in the same stage, so in your example where each DF has 8 partitions then you have a stage with 16 tasks to read the 2 DFs. There's no need to define the DF in a separate

Re: Please Help with DecisionTree/FeatureIndexer

2017-12-16 Thread Weichen Xu
Hi, Marco, val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") The data now include a feature column with name "features", val featureIndexer = new VectorIndexer() .setInputCol("features") <-- Here specify the "features" column to index.

How to...UNION ALL of two SELECTs over different data sources in parallel?

2017-12-16 Thread Jacek Laskowski
Hi, I've been trying to find out the answer to the question about UNION ALL and SELECTs @ https://stackoverflow.com/q/47837955/1305344 > If I have Spark SQL statement of the form SELECT [...] UNION ALL SELECT [...], will the two SELECT statements be executed in parallel? In my specific use case

Re: is Union or Join Supported for Spark Structured Streaming Queries in 2.2.0?

2017-12-16 Thread Jacek Laskowski
Hi, join between streaming and batch/static Datasets is supported for sure --> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations I'm not sure about union, but that's easy to check (and I am leaving it as your home exercise). You cannot have

Stateful Aggregation Using flatMapGroupsWithState

2017-12-16 Thread Sandip Mehta
Hi All, I am getting the following error message while applying *flatMapGroupsWithState.* *Exception in thread "main" org.apache.spark.sql.AnalysisException: flatMapGroupsWithState in update mode is not supported with aggregation on a streaming DataFrame/Dataset;;* Following is what I am trying to

Windows10 + pyspark + ipython + csv file loading with timestamps

2017-12-16 Thread Esa Heikkinen
Hi, Does anyone have any hints or example code for getting this combination to work: Windows10 + pyspark + ipython notebook + csv file loading with timestamps (timeseries data) into a dataframe or RDD? I have already installed Windows10 + pyspark + ipython notebook and they seem to work, but my

Re: NASA CDF files in Spark

2017-12-16 Thread Jörn Franke
Develop your own HadoopFileFormat and use https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/SparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class) to load. The Spark datasource API will be relevant for you in