[ https://issues.apache.org/jira/browse/HAWQ-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming LI resolved HAWQ-1408. --------------------------- Resolution: Fixed > PANICs during COPY ... FROM STDIN > --------------------------------- > > Key: HAWQ-1408 > URL: https://issues.apache.org/jira/browse/HAWQ-1408 > Project: Apache HAWQ > Issue Type: Bug > Components: Core > Affects Versions: backlog > Reporter: Ming LI > Assignee: Ming LI > Fix For: 2.1.0.0-incubating > > > We found PANIC (and respective core dumps). From the initial analysis from > the logs and core dump, the query causing this PANIC is a "COPY ... FROM > STDIN". This query does not always panic. > This kind of queries are executed from Java/Scala code (by one of IG Spark > Jobs). Connection to the DB is managed by connection pool (commons-dbcp2) and > validated on borrow by “select 1” validation query. IG is using > postgresql-9.4-1206-jdbc41 as a java driver to create those connections. I > believe they should be using the driver from DataDirect, available in PivNet; > however, I haven't found hard evidence pointing the driver as a root cause. > My initial analysis on the packcore for the master PANIC. Not sure if this > helps or makes sense. > This is the backtrace of the packcore for process 466858: > {code} > (gdb) bt > #0 0x00007fd875f906ab in raise () from > /data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0 > #1 0x00000000008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, > processName=<optimized out>) at elog.c:4519 > #2 <signal handler called> > #3 0x000000000053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, > existing_segnos@entry=0x0, relid=relid@entry=1195061, > segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', > keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166 > #4 0x000000000053c08f in assignPerRelSegno > (all_relids=all_relids@entry=0x2b96d68, segment_num=6) at > appendonlywriter.c:1212 > #5 0x00000000005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, > queryString=<optimized out>) at copy.c:1591 > #6 0x00000000007ef737 in ProcessUtility > (parsetree=parsetree@entry=0x2b2a3d8, queryString=0x2c2f550 "COPY > mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, > tracking_id, name, value_string, value_timestamp, value_number, > value_boolean, environment, account, channel, device, feat"..., > params=0x0, isTopLevel=isTopLevel@entry=1 '\001', > dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 > "") at utility.c:1076 > #7 0x00000000007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, > utilityStmt=utilityStmt@entry=0x2b2a3d8, isTopLevel=isTopLevel@entry=1 > '\001', dest=dest@entry=0x2b2a7c8, > completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1969 > #8 0x00000000007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, > isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, > altdest=altdest@entry=0x2b2a7c8, > completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:2079 > #9 0x00000000007ede95 in PortalRun (portal=portal@entry=0x2b8eab0, > count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', > dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, > completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1596 > #10 0x00000000007e5ad9 in exec_simple_query > (query_string=query_string@entry=0x2b29100 "COPY > mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, > tracking_id, name, value_string, value_timestamp, value_number, > value_boolean, environment, account, channel, device, feat"..., > seqServerHost=seqServerHost@entry=0x0, > seqServerPort=seqServerPort@entry=-1) at postgres.c:1816 > #11 0x00000000007e6cb2 in PostgresMain (argc=<optimized out>, argv=<optimized > out>, argv@entry=0x29d7820, username=0x29d75d0 "mis_ig") at postgres.c:4840 > #12 0x0000000000799540 in BackendRun (port=0x29afc50) at postmaster.c:5915 > #13 BackendStartup (port=0x29afc50) at postmaster.c:5484 > #14 ServerLoop () at postmaster.c:2163 > #15 0x000000000079c309 in PostmasterMain (argc=<optimized out>, > argv=<optimized out>) at postmaster.c:1454 > #16 0x00000000004a4209 in main (argc=9, argv=0x29af010) at main.c:226 > {code} > Jumping into the frame 3 and running info locals, we found something odd for > "status" variable: > {code} > (gdb) f 3 > #3 0x000000000053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, > existing_segnos@entry=0x0, relid=relid@entry=1195061, > segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', > keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166 > 1166 appendonlywriter.c: No such file or directory. > (gdb) info locals > status = 0x0 > [...] > {code} > This panic comes from this piece of code in "appendonlywritter.c": > {code} > for (int i = 0; i < segment_num; i++) > { > AOSegfileStatus *status = maxSegno4Segment[i]; > status->inuse = true; > status->xid = CurrentXid; > existing_segnos = lappend_int(existing_segnos, status->segno); > } > {code} > So, we are pulling a 0x0 (null ?!) entry from _maxSegno4Segment_... That's > extrange, because earlier in this function we populate this array, and we > should not reach this section unless this _maxSegno4Segment_ array completely > populated. > I'm assuming that the variable "segment_num" (used in the for loop stop > condition) is 6 because we got (4096 MB, 1.00 CORE) x 6 resources, according > to master log (?) > {code} > 2017-02-15 00:12:20.121004 > GMT,"mis_ig","ig",p466858,th2037315872,"10.33.188.8","35247",2017-02-14 > 08:41:58 > GMT,30177010,con20836,cmd48711,seg-10000,,,x30177010,sx1,"LOG","00000","ConnID > 42180. Acquired resource from resource manager, (4096 MB, 1.00 > 0000 CORE) x 6." > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)