Re: Using NOT NULL in a Pig FILTER statement.

Parth Sawant Fri, 19 Feb 2016 12:59:26 -0800

Hi Chandeep,
Thanks for your help. I figured it out too.

On Fri, Feb 19, 2016 at 9:30 AM, Chandeep Singh <[email protected]> wrote:


> Yes, I did filter using the same conditions you’ve mentioned. I tested it
> earlier with comma as the delimiter (previous email has logs) and now with
> ^A.
>
> [csingh~]$ cat -v test.txt
> 1^A2^A76
> 1^A^A^A76
> ^A2^A^A76
> 1^A1^A2^A
> 1^A1^A1^A76
> 1^A2^A1^A76
>
> grunt> D = LOAD 'test.txt' USING PigStorage('\\u001') AS (IS_REPORTED:INT,
> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> grunt> DUMP D;
> (1,2,76,)
> (1,,,76)
> (,2,,76)
> (1,1,2,)
> (1,1,1,76)
> (1,2,1,76)
>
> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>
> grunt> DUMP X;
> (1,2,1,76)
>
>
> So, the filter for NULLs is working, as you can see when I dump after
> filtering.
>
> > On Feb 19, 2016, at 12:13 AM, Parth Sawant <[email protected]> wrote:
> >
> > Did you put a Filter on the values to remove the null? I'm trying to filter
> > the NULL values using the Pig Filter Keyword and then use the Phoenix Pig
> > integration to store the data. I have '\\u001' as the delimiter for
> > multiple files. It is supported by Pig BulkLoader too.
> >
> > Snippet:
> >
> > D = LOAD 'src_dest' using PigStorage('\\u001') AS (IS_REPORTED:INT,
> > PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> >
> > X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> > null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> > (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> >
> > On Thu, Feb 18, 2016 at 3:06 PM, Chandeep Singh <[email protected]> wrote:
> >
> >> So, I added one record to your sample to match all the conditions you have
> >> in your filter statement.
> >>
> >> New input:
> >> [csingh]$ hadoop fs -cat test.txt
> >> 1,,2,76
> >> 1,,,76
> >> ,2,,76
> >> 1,1,2,
> >> 1,1,1,76
> >> 1,2,1,76
> >>
> >> I modified the load statement to use PigStorage delimited by comma.
> >>
> >> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
> >> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> >>
> >> Output:
> >> (1,2,1,76)
> >>
> >> So, the NOT NULL checks seem to be working.
> >>
> >> Pig logs:
> >>
> >> grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
> >> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> >> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
> >> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> >> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> >> grunt> DUMP X;
> >> 2016-02-18 23:01:06,336 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
> >> 2016-02-18 23:01:06,366 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
> >> 2016-02-18 23:01:06,480 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> >> 2016-02-18 23:01:10,798 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
> >> 2016-02-18 23:01:11,345 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1454499131434_9884
> >> 2016-02-18 23:01:11,542 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1454499131434_9884
> >> 2016-02-18 23:01:11,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> >> 2016-02-18 23:01:31,393 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
> >> 2016-02-18 23:01:36,818 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> >> 2016-02-18 23:01:36,875 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> >> 2016-02-18 23:01:36,878 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
> >>
> >> HadoopVersion   PigVersion       UserId  StartedAt            FinishedAt           Features
> >> 2.6.0-cdh5.4.8  0.12.0-cdh5.4.8  csingh  2016-02-18 23:01:06  2016-02-18 23:01:36  FILTER
> >>
> >> Success!
> >>
> >> Job Stats (time in seconds):
> >> JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
> >> job_1454499131434_9884  1       0       8       8       8       8       n/a     n/a     n/a     n/a     D,X     MAP_ONLY
> >>
> >> Input(s):
> >> Successfully read 6 records (418 bytes) from:
> >>
> >> Output(s):
> >> Successfully stored 1 records (10 bytes) in:
> >>
> >> Counters:
> >> Total records written : 1
> >> Total bytes written : 10
> >> Spillable Memory Manager spill count : 0
> >> Total bags proactively spilled: 0
> >> Total records proactively spilled: 0
> >>
> >> Job DAG:
> >> job_1454499131434_9884
> >>
> >> 2016-02-18 23:01:36,976 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> >> 2016-02-18 23:01:36,992 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> >> 2016-02-18 23:01:36,993 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> >> (1,2,1,76)
> >>
> >>
> >>
> >>> On Feb 18, 2016, at 10:13 PM, Parth Sawant <[email protected]> wrote:
> >>>
> >>> Attaching a sample input. Basically 5 rows with only 4 Integer values in
> >>> each. Some are NULL values.
> >>>
> >>> Thanks.
> >>>
> >>> On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh <[email protected]> wrote:
> >>> I’m just looking for one sample record (which has NULLs) and not the
> >>> entire input, so that it's easier for me to debug.
> >>>
> >>>> On Feb 18, 2016, at 9:40 PM, Parth Sawant <[email protected]> wrote:
> >>>>
> >>>> The input is simply too large to relay to others. A simplified schema is
> >>>> below. I only have INT columns with some null values in them. This is my
> >>>> Pig code snippet:
> >>>>
> >>>> D = LOAD 'src_locatn' AS
> >>>> (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
> >>>> AFFINITY_GROUP_ID:INT);
> >>>>
> >>>> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> >>>> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> >>>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> >>>>
> >>>> Thanks
> >>>>
> >>>> On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh <[email protected]> wrote:
> >>>>
> >>>>> Any chance you could share a sample record which has NULLs in it, as well
> >>>>> as your Pig script?
> >>>>>
> >>>>>> On Feb 18, 2016, at 8:36 PM, Parth Sawant <[email protected]> wrote:
> >>>>>>
> >>>>>> I had anticipated it would throw a similar error with this suggestion as
> >>>>>> the last one... and it did. My fields are declared as INT, just to
> >>>>>> reiterate. I don't think they can be compared to regexes. Here is the
> >>>>>> error:
> >>>>>>
> >>>>>> ERROR 1037:
> >>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> >>>>>> CharArray only :(Name: Regex Type: null Uid: null)
> >>>>>>
> >>>>>> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
> >>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> >>>>>> CharArray only :(Name: Regex Type: null Uid: null)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Thanks.
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh <[email protected]> wrote:
> >>>>>>
> >>>>>>> Since you have integers in this field, can you try matching against a
> >>>>>>> regular expression?
> >>>>>>>
> >>>>>>> Something like: X matches '\\d+'
> >>>>>>>
> >>>>>>>> On Feb 18, 2016, at 12:55 AM, Parth Sawant <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Hi Chandeep. I tried that already but it gave me the following error:
> >>>>>>>>
> >>>>>>>> ERROR 1039:
> >>>>>>>> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
> >>>>>>>> types in NotEqual Operator left hand side:int right hand
> >>>>>>>> side:chararray.
> >>>>>>>>
> >>>>>>>> The error makes sense because the fields I have are INT type and hence
> >>>>>>>> cannot be compared to a chararray.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks for the prompt response though.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Feb 17, 2016 16:32, "Chandeep Singh" <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Try adding != '' along with IS NOT NULL.
> >>>>>>>>>
> >>>>>>>>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I'm trying to filter out some null fields in Pig using 'IS NOT NULL'.
> >>>>>>>>>> For some reason the null data values persist.
> >>>>>>>>>> For example, when I store the output of the following filter, it still
> >>>>>>>>>> contains null values for ABC and PQR.
> >>>>>>>>>>
> >>>>>>>>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS
> >>>>>>>>>> NOT NULL) ;
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Can someone help with this?
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> Parth S
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>> <Sample_in.txt>
>
>
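For anyone who wants to reproduce the ^A-delimited test quoted above without a cluster, here is a minimal shell sketch. It was not posted in the thread. It rebuilds the six sample rows from the `cat -v` listing and uses awk as a local stand-in for the Pig FILTER, which is an assumption made purely for illustration: awk is not Pig and only mimics the four equality tests. The file name test.txt is taken from the thread; the variable names SEP and MATCHED are hypothetical.

```shell
#!/bin/sh
# Ctrl-A (octal 001), the delimiter that `cat -v` displays as ^A in the thread.
SEP=$(printf '\001')

# Write the six sample rows with commas first, then swap each comma for Ctrl-A.
printf '%s\n' '1,2,76' '1,,,76' ',2,,76' '1,1,2,' '1,1,1,76' '1,2,1,76' \
  | tr ',' '\001' > test.txt

# Make the control characters visible, as in the thread.
cat -v test.txt

# Local stand-in for the FILTER: keep rows where all four equality tests hold.
# An empty or missing field never equals 1, 2 or 76, so those rows drop out
# without any explicit null check.
MATCHED=$(awk -F "$SEP" '$1 == 1 && $2 == 2 && $3 == 1 && $4 == 76' test.txt)
printf '%s\n' "$MATCHED" | cat -v
```

Only the last row, 1^A2^A1^A76, should survive, which matches the DUMP X output quoted above. In Pig itself a comparison against a null field does not evaluate to true, so FILTER drops such rows; the `is not null` clauses are therefore redundant next to the `==` tests on the same fields, though harmless.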
