Hi Chandeep,

Thanks for your help. I figured it out too.

On Fri, Feb 19, 2016 at 9:30 AM, Chandeep Singh <c...@chandeep.com> wrote:
> Yes, I did filter using the same conditions you’ve mentioned. I tested it
> earlier with comma as the delimiter (previous email has logs) and now with
> ^A.
>
> [csingh~]$ cat -v test.txt
> 1^A2^A76
> 1^A^A^A76
> ^A2^A^A76
> 1^A1^A2^A
> 1^A1^A1^A76
> 1^A2^A1^A76
>
> grunt> D = LOAD 'test.txt' USING PigStorage('\\u001') AS (IS_REPORTED:INT,
> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> grunt> DUMP D;
> (1,2,76,)
> (1,,,76)
> (,2,,76)
> (1,1,2,)
> (1,1,1,76)
> (1,2,1,76)
>
> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>
> grunt> DUMP X;
> (1,2,1,76)
>
> So, the filter for NULLs is working, as you can see when I dump after
> filtering.
>
> On Feb 19, 2016, at 12:13 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
> >
> > Did you put a Filter on the values to remove the null? I'm trying to filter
> > the NULL values using the Pig Filter keyword and then use the Phoenix Pig
> > integration to store the data. I have '\\u001' as the delimiter for
> > multiple files. It is supported by Pig BulkLoader too.
> >
> > Snippet:
> >
> > D = LOAD 'src_dest' using PigStorage('\\u001') AS (IS_REPORTED:INT,
> > PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> >
> > X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> > null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> > (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> >
> > On Thu, Feb 18, 2016 at 3:06 PM, Chandeep Singh <c...@chandeep.com> wrote:
> >
> >> So, I added one record to your sample to match all the conditions you have
> >> in your filter statement.
> >>
> >> New input:
> >> [csingh]$ hadoop fs -cat test.txt
> >> 1,,2,76
> >> 1,,,76
> >> ,2,,76
> >> 1,1,2,
> >> 1,1,1,76
> >> 1,2,1,76
> >>
> >> I modified the load statement to use PigStorage delimited by comma.
> >>
> >> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
> >> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> >>
> >> Output:
> >> (1,2,1,76)
> >>
> >> So, the NOT NULLs seem to be working.
> >>
> >> Pig logs:
> >>
> >> grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
> >> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> >> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
> >> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> >> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> >> grunt> DUMP X;
> >> 2016-02-18 23:01:06,336 [main] INFO
> >> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> >> script: FILTER
> >> 2016-02-18 23:01:06,366 [main] INFO
> >> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
> >> {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune,
> >> DuplicateForEachColumnRewrite, GroupByConstParallelSetter,
> >> ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter,
> >> MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten,
> >> PushUpFilter, SplitFilter, StreamTypeCastInserter],
> >> RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
> >> 2016-02-18 23:01:06,480 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> >> - MR plan size before optimization: 1
> >> 2016-02-18 23:01:10,798 [JobControl] INFO
> >> org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is
> >> deprecated.
> >> Instead, use fs.defaultFS
> >> 2016-02-18 23:01:11,345 [JobControl] INFO
> >> org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job:
> >> job_1454499131434_9884
> >> 2016-02-18 23:01:11,542 [JobControl] INFO
> >> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted
> >> application application_1454499131434_9884
> >> 2016-02-18 23:01:11,597 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> - 0% complete
> >> 2016-02-18 23:01:31,393 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> - 50% complete
> >> 2016-02-18 23:01:36,818 [main] INFO
> >> org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is
> >> deprecated. Instead, use mapreduce.job.reduces
> >> 2016-02-18 23:01:36,875 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> - 100% complete
> >> 2016-02-18 23:01:36,878 [main] INFO
> >> org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
> >>
> >> HadoopVersion PigVersion UserId StartedAt FinishedAt Features
> >> 2.6.0-cdh5.4.8 0.12.0-cdh5.4.8 csingh 2016-02-18 23:01:06 2016-02-18 23:01:36 FILTER
> >>
> >> Success!
> >>
> >> Job Stats (time in seconds):
> >> JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
> >> job_1454499131434_9884 1 0 8 8 8 8 n/a n/a n/a n/a D,X MAP_ONLY
> >>
> >> Input(s):
> >> Successfully read 6 records (418 bytes) from:
> >>
> >> Output(s):
> >> Successfully stored 1 records (10 bytes) in:
> >>
> >> Counters:
> >> Total records written : 1
> >> Total bytes written : 10
> >> Spillable Memory Manager spill count : 0
> >> Total bags proactively spilled: 0
> >> Total records proactively spilled: 0
> >>
> >> Job DAG:
> >> job_1454499131434_9884
> >>
> >> 2016-02-18 23:01:36,976 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> - Success!
> >> 2016-02-18 23:01:36,992 [main] INFO
> >> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
> >> to process : 1
> >> 2016-02-18 23:01:36,993 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
> >> paths to process : 1
> >> (1,2,1,76)
> >>
> >>> On Feb 18, 2016, at 10:13 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
> >>>
> >>> Attaching a sample input. Basically 5 rows with only 4 Integer values in
> >>> each. Some are NULL values.
> >>>
> >>> Thanks.
> >>>
> >>> On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh <c...@chandeep.com> wrote:
> >>> I’m just looking for one sample record (which has NULLs) and not the
> >>> entire input, so that it's easier for me to debug.
> >>>
> >>>> On Feb 18, 2016, at 9:40 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
> >>>>
> >>>> The input is simply too large to relay to others. A simplified schema is
> >>>> below.
> >>>> I only have INT columns with some null values in them. This is my
> >>>> Pig code snippet:
> >>>>
> >>>> D= LOAD 'src_locatn' as
> >>>> IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
> >>>> AFFINITY_GROUP_ID:INT;
> >>>>
> >>>> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> >>>> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> >>>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> >>>>
> >>>> Thanks
> >>>>
> >>>> On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh <c...@chandeep.com> wrote:
> >>>>
> >>>>> Any chance you could share a sample record which has NULLs in it, as well
> >>>>> as your Pig script?
> >>>>>
> >>>>>> On Feb 18, 2016, at 8:36 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
> >>>>>>
> >>>>>> I had anticipated it would throw a similar error with this suggestion as
> >>>>>> the last one... and it did. My fields are declared as INT, just to
> >>>>>> reiterate. I don't think they can be compared to regexes. Here is the
> >>>>>> error:
> >>>>>>
> >>>>>> ERROR 1037:
> >>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> >>>>>> CharArray only :(Name: Regex Type: null Uid: null)
> >>>>>>
> >>>>>> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
> >>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> >>>>>> CharArray only :(Name: Regex Type: null Uid: null)
> >>>>>>
> >>>>>> Thanks.
> >>>>>>
> >>>>>> On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh <c...@chandeep.com> wrote:
> >>>>>>
> >>>>>>> Since you have integers in this field, can you try matching to a regular
> >>>>>>> expression?
> >>>>>>>
> >>>>>>> Something like: X matches '\\d+'
> >>>>>>>
> >>>>>>>> On Feb 18, 2016, at 12:55 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Chandeep. I tried that already but it gave me the following error:
> >>>>>>>>
> >>>>>>>> ERROR 1039:
> >>>>>>>> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
> >>>>>>>> types in NotEqual Operator left hand side:int right hand
> >>>>>>>> side:chararray.
> >>>>>>>>
> >>>>>>>> The error makes sense because the fields I have are INT type and hence
> >>>>>>>> cannot be compared to a chararray.
> >>>>>>>>
> >>>>>>>> Thanks for the prompt response though.
> >>>>>>>>
> >>>>>>>> On Feb 17, 2016 16:32, "Chandeep Singh" <c...@chandeep.com> wrote:
> >>>>>>>>
> >>>>>>>> Try adding != '' along with IS NOT NULL.
> >>>>>>>>>
> >>>>>>>>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I'm trying to filter some null fields in Pig using 'IS NOT NULL'. For
> >>>>>>>>>> some reason the null data values persist.
> >>>>>>>>>> For example, the output of the following filter, when stored, still
> >>>>>>>>>> contains null values for ABC and PQR.
> >>>>>>>>>>
> >>>>>>>>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS
> >>>>>>>>>> NOT NULL) ;
> >>>>>>>>>>
> >>>>>>>>>> Can someone help with this?
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> Parth S
> >>>
> >>> <Sample_in.txt>
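
For anyone reading this thread later: here is a minimal sketch of one way to check which rows Pig actually loads as null before applying the real filter. It assumes the comma-delimited test.txt from Chandeep's message is in the working directory. The aliases WITH_NULLS and COMPLETE are made up for illustration and do not come from the thread, and this is not necessarily how the original problem was resolved.

```pig
-- Assumption: test.txt is the comma-delimited sample quoted in the thread.
D = LOAD 'test.txt' USING PigStorage(',')
    AS (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT,
        PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);

-- Rows where at least one field was loaded as null.
WITH_NULLS = FILTER D BY (IS_REPORTED IS NULL)
                      OR (PROCESSING_STATUS_ID IS NULL)
                      OR (PROGRAM_ID IS NULL)
                      OR (AFFINITY_GROUP_ID IS NULL);

-- Rows where every field has a value.
COMPLETE = FILTER D BY (IS_REPORTED IS NOT NULL)
                   AND (PROCESSING_STATUS_ID IS NOT NULL)
                   AND (PROGRAM_ID IS NOT NULL)
                   AND (AFFINITY_GROUP_ID IS NOT NULL);

DUMP WITH_NULLS;
DUMP COMPLETE;
```

Dumping both relations side by side shows whether the loader is producing nulls where you expect them. If nulls only show up after the STORE step, the filter itself is probably not the cause.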