Kryo: class not registered
Hello,

I'm on Spark 2.1.0 with Scala, and I'm registering all classes with Kryo. I have a problem registering this class:

org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation[]

I can't register it with

classOf[Array[Class.forName("org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation").type]]

I have also tried creating a Java registrator class and registering the class there as

org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation[].class;

Any clue is appreciated.

Thanks.
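A minimal sketch of one workaround, assuming the goal is simply to get the private inner class and its array type registered: look the class up reflectively with Class.forName and build the array class via java.lang.reflect.Array. The SparkConf wiring is the standard registerKryoClasses call; nothing here relies on Spark-private API:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")

    // The class is private to Spark, so it can only be obtained reflectively.
    val blockLocationClass = Class.forName(
      "org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex" +
        "$SerializableFileStatus$SerializableBlockLocation")

    // Build the corresponding array class, i.e. SerializableBlockLocation[].
    val blockLocationArrayClass =
      java.lang.reflect.Array.newInstance(blockLocationClass, 0).getClass

    conf.registerKryoClasses(
      Array[Class[_]](blockLocationClass, blockLocationArrayClass))

The array class should also be reachable by its JVM binary name, "[Lorg.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation;", which Class.forName accepts directly.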
Re: Parquet file generated by Spark, but not readable by Hive
Hello,

Do you use df.write, or do you write with hivecontext.sql("insert into ...")?

Angel.

On Jun 12, 2017, 11:07 PM, "Yong Zhang" wrote:

> We are using Spark *1.6.2* as ETL to generate parquet files for one
> dataset, partitioned by "brand" (a string representing the brand in this
> dataset).
>
> After the partition files are generated in HDFS, in folders like
> "brand=a", we add the partitions in Hive.
>
> The Hive version is *1.2.1* (in fact, we are using HDP 2.5.0).
>
> Now the problem is that for 2 brand partitions we cannot query the data
> generated in Spark, but it works fine for the rest of the partitions.
>
> Below is the error in the Hive CLI and hive.log that I get if I query a
> bad partition, like "select * from tablename where brand='BrandA' limit 3;"
>
> Failed with exception java.io.IOException:
> org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.UnsupportedOperationException: Cannot inspect
> org.apache.hadoop.io.LongWritable
>
> Caused by: java.lang.UnsupportedOperationException: Cannot inspect
> org.apache.hadoop.io.LongWritable
>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
>   at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
>   at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
>   at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
>   at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
>   at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
>   at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
>   ... 22 more
>
> There is not much I can find by googling this error message, but it
> points to the schema in Hive being different from the schema in the
> parquet file. This is a very strange case, though, as the same schema
> works fine for the other brands: brand is just a partition column, and
> all partitions share the same Hive schema.
>
> If I query "select * from tablename where brand='BrandB' limit 3;",
> everything works fine.
>
> So is this really caused by the Hive schema mismatching the parquet files
> generated by Spark, by the data within the different partition keys, or
> by a real compatibility issue between Spark and Hive?
>
> Thanks
>
> Yong
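For reference, a hedged sketch of the two write paths Angel is asking about, on the Spark 1.6 API; the table, path, and source names are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("etl"))
    val hiveContext = new HiveContext(sc)
    val df = hiveContext.table("staging_dataset") // hypothetical source

    // Path 1: write partitioned parquet files directly, then add the
    // partitions to the Hive metastore afterwards.
    df.write
      .partitionBy("brand")                 // creates brand=a/, brand=b/, ... folders
      .parquet("hdfs:///warehouse/dataset") // hypothetical location
    // followed in Hive by something like:
    //   ALTER TABLE tablename ADD PARTITION (brand='a') LOCATION '.../brand=a';

    // Path 2: write through the Hive table definition instead, letting Hive
    // control the file schema.
    hiveContext.sql(
      "INSERT INTO TABLE tablename PARTITION (brand) SELECT * FROM staging_dataset")

The distinction matters because in path 1 the parquet schema is whatever the DataFrame's schema happened to be at write time, so a column written as bigint in some partitions will not match a Hive column declared as string; that is exactly the kind of mismatch the "Cannot inspect org.apache.hadoop.io.LongWritable" error from ParquetStringInspector suggests.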
Re: Joins in Spark
Sorry, I had a typo; I meant repartitionby("fieldofjoin").

On May 2, 2017, 9:44 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:

Hi Angel,

I am trying the code below, but I don't see the partitioning on the dataframe.

val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry)
import sqlContext._
import sqlContext.implicits._
datapoint_prq_df.join(geoCacheLoc_df)
Val tableA = DfA.partitionby("joinField").filter("firstSegment")

The columns I have are Lat3, Lon3, VIN, Time. Lat3 and Lon3 are my join columns on both dataframes, and the rest are select columns.

Thanks,
Asmath

On Tue, May 2, 2017 at 1:38 PM, Angel Francisco Orta <angel.francisco.o...@gmail.com> wrote:

> Have you tried partitioning by the join field and running it in segments,
> filtering both tables to the same segment of data?
>
> Example:
>
> Val tableA = DfA.partitionby("joinField").filter("firstSegment")
> Val tableB = DfB.partitionby("joinField").filter("firstSegment")
>
> TableA.join(TableB)
>
> On May 2, 2017, 8:30 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>
>> Table 1 (192 GB) is partitioned by year and month; the 192 GB of data is
>> for one month, i.e. April.
>>
>> Table 2: 92 GB, not partitioned.
>>
>> I have to perform a join on these tables now.
>>
>> On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta <angel.francisco.o...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Are the tables partitioned?
>>> If yes, what is the partition field?
>>>
>>> Thanks
>>>
>>> On May 2, 2017, 8:22 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I am trying to join two big tables in Spark, and the job runs for a
>>> long time without producing any results.
>>>
>>> Table 1: 192 GB
>>> Table 2: 92 GB
>>>
>>> Does anyone have a better solution to get the results fast?
>>>
>>> Thanks,
>>> Asmath
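For what it's worth, a sketch of the suggested approach in the actual DataFrame API: partitionby/repartitionby are not DataFrame methods, and repartition(Column*) is the closest call. The segment values and column names here are hypothetical, and it assumes Spark 2.x (on 1.6, use unionAll instead of union):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Hypothetical segment keys, e.g. weekly slices of April.
    val segments = Seq("2017-04-w1", "2017-04-w2", "2017-04-w3", "2017-04-w4")

    def segmentedJoin(dfA: DataFrame, dfB: DataFrame): DataFrame =
      segments.map { seg =>
        // Filter both sides to the same segment, then co-partition on the
        // join key so matching rows meet in the same partitions.
        val left  = dfA.filter(col("segment") === seg).repartition(col("joinField"))
        val right = dfB.filter(col("segment") === seg).repartition(col("joinField"))
        left.join(right, Seq("joinField"))
      }.reduce(_ union _)

Whether this beats a single large shuffle join depends on the data; the usual first checks are that the join keys are not skewed and that spark.sql.shuffle.partitions is sized for the roughly 284 GB of input.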
Re: Joins in Spark
Have you tried partitioning by the join field and running it in segments, filtering both tables to the same segment of data?

Example:

Val tableA = DfA.partitionby("joinField").filter("firstSegment")
Val tableB = DfB.partitionby("joinField").filter("firstSegment")

TableA.join(TableB)

On May 2, 2017, 8:30 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:

> Table 1 (192 GB) is partitioned by year and month; the 192 GB of data is
> for one month, i.e. April.
>
> Table 2: 92 GB, not partitioned.
>
> I have to perform a join on these tables now.
>
> On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta <angel.francisco.o...@gmail.com> wrote:
>
>> Hello,
>>
>> Are the tables partitioned?
>> If yes, what is the partition field?
>>
>> Thanks
>>
>> On May 2, 2017, 8:22 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am trying to join two big tables in Spark, and the job runs for a
>> long time without producing any results.
>>
>> Table 1: 192 GB
>> Table 2: 92 GB
>>
>> Does anyone have a better solution to get the results fast?
>>
>> Thanks,
>> Asmath
Re: Joins in Spark
Hello,

Are the tables partitioned?
If yes, what is the partition field?

Thanks

On May 2, 2017, 8:22 PM, "KhajaAsmath Mohammed" wrote:

Hi,

I am trying to join two big tables in Spark, and the job runs for a long time without producing any results.

Table 1: 192 GB
Table 2: 92 GB

Does anyone have a better solution to get the results fast?

Thanks,
Asmath