unsubscribe

2023-05-09 Thread Balakumar iyer S



Re: Spark 2.3 DataFrame GroupBy operation throws IllegalArgumentException on large dataset

2019-07-23 Thread Balakumar iyer S
Hi Bobby Evans,

I apologise for the delayed response. You are right, I missed pasting the
complete stack trace of the exception. I have attached the complete YARN log
for the same herewith.

Thank you. It would be helpful if you could assist me with this error.

-
Regards
Balakumar Seetharaman


On Mon, Jul 22, 2019 at 7:05 PM Bobby Evans  wrote:

> You are missing a lot of the stack trace that could explain the
> exception. All it shows is that an exception happened while writing out
> the ORC file, not what the underlying exception is; there should be at
> least one more "Caused by" section below the one you included.
>
> Thanks,
>
> Bobby
>
> On Mon, Jul 22, 2019 at 5:58 AM Balakumar iyer S 
> wrote:
>
>> Hi,
>>
>> I am trying to perform a groupBy followed by an aggregate collect_set
>> operation on a two-column dataset with the schema (LeftData int, RightData
>> int).
>>
>> Code snippet:
>>
>>   val wind_2 =
>> dframe.groupBy("LeftData").agg(collect_set(array("RightData")))
>>
>>   wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
>>
>> The above code works fine on a smaller dataset but throws the following
>> error on a large dataset (where each key in the LeftData column needs to be
>> grouped with approximately 64k values).
>>
>> Could someone assist me with this? Should I set any configuration to
>> accommodate such large values?
>>
>> ERROR
>> -
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>> at scala.Option.foreach(Option.scala:257)
>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>>
>> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>>
>> --
>> REGARDS
>> BALAKUMAR SEETHARAMAN
>>
>>

-- 
REGARDS
BALAKUMAR SEETHARAMAN
SPARK_MAJOR_VERSION is set to 2, using Spark2
19/07/20 16:23:15 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times; aborting job
19/07/20 16:23:15 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 316, DOSSPOCVM2, executor 13): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.e

Spark 2.3 DataFrame GroupBy operation throws IllegalArgumentException on large dataset

2019-07-22 Thread Balakumar iyer S
Hi,

I am trying to perform a groupBy followed by an aggregate collect_set
operation on a two-column dataset with the schema (LeftData int, RightData
int).

Code snippet:

  val wind_2 =
dframe.groupBy("LeftData").agg(collect_set(array("RightData")))

  wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))

The above code works fine on a smaller dataset but throws the following
error on a large dataset (where each key in the LeftData column needs to be
grouped with approximately 64k values).

Could someone assist me with this? Should I set any configuration to
accommodate such large values?

ERROR
-
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)

Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
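
For reference only, here is a minimal sketch of the same aggregation with a
couple of configuration knobs that are sometimes adjusted for very large
groups. This is not a fix confirmed in this thread: the input source, the
property values, and the object name are assumptions/placeholders.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{array, collect_set}

object GroupByCollectSetSketch {
  def main(args: Array[String]): Unit = {
    // Example settings only; values are placeholders to be tuned per cluster.
    val spark = SparkSession.builder
      .appName("groupby-collect-set")
      .config("spark.sql.shuffle.partitions", "2000") // more, smaller shuffle partitions
      .getOrCreate()

    // The thread does not show how dframe is built; ORC input is assumed here.
    val dframe = spark.read.orc(args(0))

    // Same aggregation as in the question: one set of single-element arrays per LeftData key.
    val wind_2 = dframe
      .groupBy("LeftData")
      .agg(collect_set(array("RightData")))

    wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
    spark.stop()
  }
}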

-- 
REGARDS
BALAKUMAR SEETHARAMAN


The following Java MR code works for a small dataset but throws an ArrayIndexOutOfBoundsException for a large dataset

2019-05-09 Thread Balakumar iyer S
Hi All,

I am trying to read an ORC file and perform a groupBy operation on it, but
when I run it on a large dataset we get the following error message.

Input data format:

|178111256|  107125374|
|178111256|  107148618|
|178111256|  107175361|
|178111256|  107189910|

We are trying to group by the first column.

As far as the logic and syntax go, the code is appropriate, and it works
well on a small dataset. I have attached the code in a text file.

Thank you for your time.

ERROR MESSAGE:
Error: java.lang.ArrayIndexOutOfBoundsException
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:273)
at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:253)
at org.apache.hadoop.io.Text.write(Text.java:330)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:98)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:82)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1149)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:610)
at orc_groupby.orc_groupby.Orc_groupBy$MyMapper.map(Orc_groupBy.java:73)
at orc_groupby.orc_groupby.Orc_groupBy$MyMapper.map(Orc_groupBy.java:39)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
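
Since this is the Spark users list, a hedged point of comparison: the same
group-by expressed with the Spark DataFrame API rather than the attached
MapReduce job. The column names follow the "viewedid,viewerid" declaration in
the attached code; the paths, output format, and object name are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_set

object OrcGroupBySparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("orc-groupby-sketch").getOrCreate()

    // Assumes the ORC files carry the column names declared in the MR job's SerDe.
    val df = spark.read.orc(args(0))

    // Group by the first column and collect the second column's values per key.
    val grouped = df
      .groupBy("viewedid")
      .agg(collect_set("viewerid").as("viewerids"))

    grouped.write.orc(args(1))
    spark.stop()
  }
}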



-- 
REGARDS
BALAKUMAR SEETHARAMAN
package orc_groupby.orc_groupby;


import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Orc_groupBy extends Configured implements Tool {

    public static int A_ID = 0;
    public static int B_ID = 1;

    // Mapper that reads OrcStruct rows and extracts the two int columns as strings.
    public static class MyMapper<K>
            extends MapReduceBase implements Mapper<K, OrcStruct, Text, Text> {

        private StructObjectInspector oip;
        private final OrcSerde serde = new OrcSerde();

        public void configure(JobConf job) {
            // Declare the two int columns of the ORC input for the SerDe.
            Properties table = new Properties();
            table.setProperty("columns", "viewedid,viewerid");
            table.setProperty("columns.types", "int,int");

            serde.initialize(job, table);

            try {
                oip = (StructObjectInspector) serde.getObjectInspector();
            } catch (SerDeException e) {
                e.printStackTrace();
            }
        }

        public void map(K key, OrcStruct val,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            List<? extends StructField> fields = oip.getAllStructFieldRefs();
            // Inspector taken from the second field but used for both columns (both are int).
            WritableIntObjectInspector bInspector =
                    (WritableIntObjectInspector) fields.get(B_ID).getFieldObjectInspector();
            String a = "";
            String b = "";
            try {
                // Extract both int columns from the deserialized ORC row as strings.
                a = bInspector.getPrimitiveJavaObject(
                        oip.getStructFieldData(serde.deserialize(val), fields.get(A_ID))).toString();
                b = bInspector.getPrimitiveJavaObject(
                        oip.getStructFieldData(serde.deserialize(val), fields.get(B_ID))).toString();
                //System.out.print("A=" + a + " B=" + b);
                //System.exit(0);
            } catch (SerDeException e1) {

An alternative logic to collaborative filtering works fine, but we are facing runtime issues in executing the job

2019-04-16 Thread Balakumar iyer S
Hi,

When I run the following Spark code on the cluster with the configuration
below, it is spread across 3 job IDs.

CLUSTER CONFIGURATION

3-NODE CLUSTER
NODE 1 - 64 GB, 16 CORES
NODE 2 - 64 GB, 16 CORES
NODE 3 - 64 GB, 16 CORES

In job ID 2, the job gets stuck at stage 51 of 254 and then starts using
disk space. I am not sure why this is happening, and my work is completely
stalled. Could someone help me with this?

I have attached a screenshot of the stuck Spark stages for reference.

Please let me know if you need more details about the setup and code.
Thanks



code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

def main(args: Array[String]): Unit = {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val ss = SparkSession
    .builder
    .appName("join_association")
    .master("local[*]")
    .getOrCreate()

  import ss.implicits._

  val dframe = ss.read.option("inferSchema", value = true)
    .option("delimiter", ",").csv("in/matrimony.txt")

  dframe.show()
  dframe.printSchema()

  // Left and right copies of the same data, keyed on the first column (_c0).
  val dfLeft = dframe.withColumnRenamed("_c1", "left_data")
  val dfRight = dframe.withColumnRenamed("_c1", "right_data")

  // Self-join on _c0, keeping only pairs where the two renamed columns differ.
  val joined = dfLeft.join(dfRight, dfLeft.col("_c0") === dfRight.col("_c0"))
    .filter(col("left_data") !== col("right_data"))

  joined.show()

  val result = joined.select(col("left_data"), col("right_data") as "similar_ids")

  result.write.csv("/output")

  ss.stop()
}
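
As a point of comparison only, and not a fix confirmed in this thread, here is
a hedged sketch of the same self-join written so that the master comes from
spark-submit rather than being hard-coded to local[*], with both sides
repartitioned on the join key before the join. The partition count, the
paths-as-arguments, and the object name are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object JoinAssociationSketch {
  def main(args: Array[String]): Unit = {
    // No .master() here: let spark-submit (e.g. --master yarn) decide where to run.
    val ss = SparkSession.builder.appName("join_association").getOrCreate()

    val dframe = ss.read
      .option("inferSchema", value = true)
      .option("delimiter", ",")
      .csv(args(0)) // input path passed as an argument (assumption)

    // Repartition both sides on the join key; 200 is a placeholder, not a tuned value.
    val dfLeft  = dframe.withColumnRenamed("_c1", "left_data").repartition(200, col("_c0"))
    val dfRight = dframe.withColumnRenamed("_c1", "right_data").repartition(200, col("_c0"))

    val joined = dfLeft.join(dfRight, dfLeft.col("_c0") === dfRight.col("_c0"))
      .filter(col("left_data") =!= col("right_data"))

    joined.select(col("left_data"), col("right_data").as("similar_ids"))
      .write.csv(args(1)) // output path passed as an argument (assumption)

    ss.stop()
  }
}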



-- 
REGARDS
BALAKUMAR SEETHARAMAN

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org