Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
On 3 Aug 2016, at 22:01, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: OK, in other words, the result set of joining two datasets ends up inconsistent because a header from one DS is joined with a row from another DS? I am not 100% sure I got this point. Let me check if I

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Mich Talebzadeh
OK, in other words, the result set of joining two datasets ends up inconsistent because a header from one DS is joined with a row from another DS? You really need to get rid of the headers one way or another before joining, or try to register them as temp tables before the join to see where the fa

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Hi Mich, Thanks again. My issue is not when I read the CSV from a file. It is when you have a Dataset that is the output of some join operations. Any help on that? Many Thanks, Best, Carlo On 3 Aug 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: hm odd. Otherwise you ca

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Mich Talebzadeh
hm odd. Otherwise you can try using Databricks to read the CSV file. This is a Scala example: //val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868") val df = sqlContext.read.format

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
One more: it seems that the steps == Step 1: transform the Dataset into JavaRDD JavaRDD dataPointsWithHeader = dataset1_Join_dataset2.toJavaRDD(); and List someRows = dataPointsWithHeader.collect(); someRows.forEach(System.out::println); do not print the header. So, could I assume

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Thanks Mich. Yes, I know both headers (categoryRankSchema, categorySchema ) as expressed below: this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath); this.dataset2 = d2_DFR.schema(categorySchema).csv(categoryFilePath); Can you use filter to get rid of the

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Mich Talebzadeh
Do you know the headers? Can you use filter to get rid of the header from both CSV files before joining them? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
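
A minimal sketch of the "filter out the header" idea suggested above, written in plain Java without Spark so that it runs standalone. The class name HeaderFilter, the method dropHeader, and the sample rows are hypothetical and only for illustration. With Spark, the same predicate could presumably be applied through a filter on the raw lines of each file before parsing and joining.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class HeaderFilter {
    // Hypothetical helper: keeps every line that is not equal to the known header line.
    public static List<String> dropHeader(List<String> lines, String header) {
        return lines.stream()
                .filter(line -> !line.equals(header))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the lines of one CSV file.
        List<String> d1 = Arrays.asList("category,rank", "books,1", "music,2");
        // Prints only the data rows; the header line is filtered out.
        dropHeader(d1, "category,rank").forEach(System.out::println);
    }
}
```

This only works when the header text is known in advance and no data row is identical to it.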

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Hi Aseem, Thank you very much for your help. Please allow me to be more specific about my case (to some extent I already do what you suggested): Let us imagine that I have two CSV datasets d1 and d2. I generate the Dataset as in the following: == Reading d1: sparkSession=spark; options =

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Aseem Bansal
Hi, Depending on how you are reading the data in the first place, can you simply use the header as a header instead of a row? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq) See the header option On Wed, Aug 3, 2016 at 10:14 PM, Car

Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Hi All, I would like to apply a regression to my data. One step of the workflow is to prepare my data as a JavaRDD starting from a Dataset with its header. So, what I did was the following: == Step 1: transform the Dataset into JavaRDD JavaRDD dataPointsWithHeader = modelDS.toJavaRDD();