Re: Using NOT NULL in a Pig FILTER statement.

Parth Sawant Thu, 18 Feb 2016 16:14:37 -0800

Did you put a filter on the values to remove the nulls? I'm trying to filter
out the NULL values using the Pig FILTER keyword and then use the Phoenix Pig
integration to store the data. I have '\\u001' as the delimiter for
multiple files. It is supported by the Pig BulkLoader too.


Snippet:

D = LOAD 'src_dest' USING PigStorage('\\u001') AS (IS_REPORTED:INT,
PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);

 X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
(PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
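
One way to double-check whether nulls really survive the filter is sketched
below. It is only a sketch and has not been run against the real data: the
path and schema are copied from the snippet above, and the aliases EQ_ONLY,
LEFTOVER, LEFTOVER_GRP and LEFTOVER_CNT are made up for illustration. It
relies on Pig's documented null handling, where a comparison against a null
field evaluates to null and FILTER keeps a row only when its condition is
true.

```pig
-- Sketch only: 'src_dest' and the schema come from the snippet above;
-- EQ_ONLY, LEFTOVER, LEFTOVER_GRP and LEFTOVER_CNT are hypothetical aliases.
D = LOAD 'src_dest' USING PigStorage('\\u001') AS (IS_REPORTED:INT,
    PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);

-- Equality tests only. A comparison against a null field evaluates to null,
-- and FILTER keeps a row only when its condition is true.
EQ_ONLY = FILTER D BY (IS_REPORTED == 1) AND (PROGRAM_ID == 1) AND
    (PROCESSING_STATUS_ID == 2) AND (AFFINITY_GROUP_ID == 76);

-- Look for rows that passed the filter but still hold a null in any column.
LEFTOVER = FILTER EQ_ONLY BY (IS_REPORTED IS NULL) OR
    (PROCESSING_STATUS_ID IS NULL) OR (PROGRAM_ID IS NULL) OR
    (AFFINITY_GROUP_ID IS NULL);

-- Count them; an empty LEFTOVER produces no output row at all.
LEFTOVER_GRP = GROUP LEFTOVER ALL;
LEFTOVER_CNT = FOREACH LEFTOVER_GRP GENERATE COUNT_STAR(LEFTOVER);
DUMP LEFTOVER_CNT;
```

If the DUMP prints nothing, no nulls are left after the filter, which would
point at the load or store step rather than at FILTER itself.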

On Thu, Feb 18, 2016 at 3:06 PM, Chandeep Singh <[email protected]> wrote:

> So, I added one record to your sample to match all the conditions you have
> in your filter statement.
>
> New input:
> [csingh]$ hadoop fs -cat test.txt
> 1,,2,76
> 1,,,76
> ,2,,76
> 1,1,2,
> 1,1,1,76
> 1,2,1,76
>
> I modified the load statement to use PigStorage delimited by comma.
>
> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>
> Output:
> (1,2,1,76)
>
> So, the NOT NULL checks seem to be working.
>
> Pig logs:
>
> grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> grunt> DUMP X;
> 2016-02-18 23:01:06,336 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: FILTER
> 2016-02-18 23:01:06,366 [main] INFO
> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
> {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune,
> DuplicateForEachColumnRewrite, GroupByConstParallelSetter,
> ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter,
> MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten,
> PushUpFilter, SplitFilter, StreamTypeCastInserter],
> RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
> 2016-02-18 23:01:06,480 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size before optimization: 1
> 2016-02-18 23:01:10,798 [JobControl] INFO
> org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is
> deprecated. Instead, use fs.defaultFS
> 2016-02-18 23:01:11,345 [JobControl] INFO
> org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job:
> job_1454499131434_9884
> 2016-02-18 23:01:11,542 [JobControl] INFO
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted
> application application_1454499131434_9884
> 2016-02-18 23:01:11,597 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 0% complete
> 2016-02-18 23:01:31,393 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 50% complete
> 2016-02-18 23:01:36,818 [main] INFO
> org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is
> deprecated. Instead, use mapreduce.job.reduces
> 2016-02-18 23:01:36,875 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 100% complete
> 2016-02-18 23:01:36,878 [main] INFO
> org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt
> Features
> 2.6.0-cdh5.4.8  0.12.0-cdh5.4.8 csingh  2016-02-18 23:01:06     2016-02-18
> 23:01:36     FILTER
>
> Success!
>
> Job Stats (time in seconds):
> JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime
> MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime
>  MedianReducetime        Alias   Feature Outputs
> job_1454499131434_9884  1       0       8       8       8       8
>  n/a     n/a     n/a     n/a     D,X     MAP_ONLY
>
> Input(s):
> Successfully read 6 records (418 bytes) from:
>
> Output(s):
> Successfully stored 1 records (10 bytes) in:
>
> Counters:
> Total records written : 1
> Total bytes written : 10
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_1454499131434_9884
>
> 2016-02-18 23:01:36,976 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Success!
> 2016-02-18 23:01:36,992 [main] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
> to process : 1
> 2016-02-18 23:01:36,993 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
> paths to process : 1
> (1,2,1,76)
>
>
>
> > On Feb 18, 2016, at 10:13 PM, Parth Sawant <[email protected]>
> wrote:
> >
> > Attaching a sample input. Basically 5 rows with only 4 Integer values in
> each. Some are NULL values.
> >
> > Thanks.
> >
> > On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh <[email protected]
> <mailto:[email protected]>> wrote:
> > I’m just looking for one sample record (one that has NULLs), not the
> > entire input, so that it's easier for me to debug.
> >
> > > On Feb 18, 2016, at 9:40 PM, Parth Sawant <[email protected]
> <mailto:[email protected]>> wrote:
> > >
> > > The input is simply too large to relay to others. A simplified schema
> is
> > > below. I only have INT columns with some null values in them. This is
> my
> > > Pig code snippet:
> > >
> > > D = LOAD 'src_locatn' AS
> > > (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
> > > AFFINITY_GROUP_ID:INT);
> > >
> > > X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is
> not
> > > null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> > > (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> > >
> > > Thanks
> > >
> > > On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh <[email protected]
> <mailto:[email protected]>> wrote:
> > >
> > >> Any chance you could share a sample record that has NULLs in it,
> > >> as well as your Pig script?
> > >>
> > >>> On Feb 18, 2016, at 8:36 PM, Parth Sawant <[email protected]
> <mailto:[email protected]>>
> > >> wrote:
> > >>>
> > >>> I had anticipated it would throw a similar error with this
> suggestion as
> > >>> the last one... and it did. My fields are declared as INT, just to
> > >>> re-iterate. I don't think they can be compared to regexes. Here is
> the
> > >>> error:
> > >>>
> > >>> ERROR 1037:
> > >>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> > >>> CharArray only :(Name: Regex Type: null Uid: null)
> > >>>
> > >>> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException:
> ERROR
> > >> 1037:
> > >>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> > >>> CharArray only :(Name: Regex Type: null Uid: null)
> > >>>
> > >>>
> > >>>
> > >>> Thanks.
> > >>>
> > >>>
> > >>> On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh <[email protected]
> <mailto:[email protected]>> wrote:
> > >>>
> > >>>> Since you have integers in this field, can you try matching against
> > >>>> a regular expression?
> > >>>>
> > >>>> Something like: X matches '\\d+'
> > >>>>
> > >>>>> On Feb 18, 2016, at 12:55 AM, Parth Sawant <
> [email protected] <mailto:[email protected]>>
> > >>>> wrote:
> > >>>>>
> > >>>>> Hi Chandeep. I tried that already but it gave me the following
> error:
> > >>>>>
> > >>>>> ERROR 1039:
> > >>>>> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
> > >>>>> types in NotEqual Operator left hand side:int right hand
> > >>>>> side:chararray.
> > >>>>>
> > >>>>> The error makes sense because the fields I have are of type INT and
> > >>>>> hence cannot be compared to a chararray.
> > >>>>>
> > >>>>>
> > >>>>> Thanks for the prompt response though.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Feb 17, 2016 16:32, "Chandeep Singh" <[email protected] <mailto:
> [email protected]>> wrote:
> > >>>>>
> > >>>>> Try adding != '' along with IS NOT NULL.
> > >>>>>>
> > >>>>>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant <
> [email protected] <mailto:[email protected]>
> > >>>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> I'm trying to filter out some null fields in Pig using 'IS NOT NULL'.
> > >>>>>>> For some reason the null data values persist.
> > >>>>>>> For example, after storing the output of the following filter, the
> > >>>>>>> result still contains null values for ABC and PQR.
> > >>>>>>>
> > >>>>>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND
> > >>>>>>> (PQR IS NOT NULL);
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Can someone help with this?
> > >>>>>>>
> > >>>>>>> Thanks
> > >>>>>>>
> > >>>>>>> Parth S
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
> > <Sample_in.txt>
>
>
