A question regarding Storages, dependency order and java.lang.StackOverflowError

2015-04-22 Thread Alfonso Nishikawa
Hello,

Tested with Pig 0.12.1 and Pig 0.14.0

I am writing here without much hope, but maybe I will get lucky and someone
knows how to solve this :)

I am writing a Storage for Gora, and if I use an outer bag inside a
FOREACH when storing, I get a java.lang.StackOverflowError.

Specifically:

Pig Stack Trace
---
ERROR 2998: Unhandled internal error. null

java.lang.StackOverflowError
    at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at java.lang.Class.newInstance(Class.java:379)
    at org.apache.pig.impl.util.Utils.mergeCollection(Utils.java:441)
    at org.apache.pig.newplan.DependencyOrderWalker.doAllPredecessors(DependencyOrderWalker.java:84)
    at org.apache.pig.newplan.DependencyOrderWalker.doAllPredecessors(DependencyOrderWalker.java:88)
    at org.apache.pig.newplan.DependencyOrderWalker.doAllPredecessors(DependencyOrderWalker.java:88)
(this last line fills another 1030 lines of the log)
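
The repeated frame at DependencyOrderWalker.java:88 is the walker calling itself once per predecessor. The following is a minimal sketch, not Pig's source: the class PlanWalkSketch, its methods and the string node names are all hypothetical. It shows a recursive predecessor walk that marks a node as seen only after its predecessors have been visited. On an acyclic graph this yields a dependency order, but a cycle among predecessors would recurse until the stack overflows. Whether the plan built for this script actually contains such a cycle is an assumption; the trace alone does not prove it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a recursive dependency-order walk; not Pig's code.
class PlanWalkSketch {
    // node -> list of predecessors (nodes that must be visited first)
    private final Map<String, List<String>> preds = new HashMap<>();

    // Register an edge "from -> to", i.e. "from" is a predecessor of "to".
    void addEdge(String from, String to) {
        preds.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
        preds.computeIfAbsent(from, k -> new ArrayList<>());
    }

    // Visit all predecessors first, then record the node itself.
    private void doAllPredecessors(String node, Set<String> seen, List<String> order) {
        if (seen.contains(node)) {
            return;
        }
        for (String p : preds.getOrDefault(node, Collections.emptyList())) {
            doAllPredecessors(p, seen, order);
        }
        seen.add(node);   // marked only after the recursion returns
        order.add(node);
    }

    List<String> dependencyOrder(String sink) {
        List<String> order = new ArrayList<>();
        doAllPredecessors(sink, new HashSet<>(), order);
        return order;
    }

    public static void main(String[] args) {
        PlanWalkSketch plan = new PlanWalkSketch();
        plan.addEdge("load", "foreach");
        plan.addEdge("foreach", "store");
        System.out.println(plan.dependencyOrder("store"));

        // A back edge makes "store" a predecessor of "load", so the walk
        // keeps recursing around the cycle until the stack is exhausted.
        plan.addEdge("store", "load");
        try {
            plan.dependencyOrder("store");
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError on cyclic plan");
        }
    }
}
```

If a cycle is indeed the cause, the place to look would be whatever makes the STORE side and the bag reference depend on each other, rather than the walker itself.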

With DUMP or with PigStorage everything works perfectly, so the problem
is surely in my Storage implementation.

The script is as follows:


borrar_areas_table = LOAD '.'
   USING org.apache.gora.pig.GoraStorage(
  'java.lang.String',
  'es.indra.innovationlabs.celtic.generated.BorrarAreas',
  'nombre') ;

borrar_areas = FOREACH borrar_areas_table GENERATE key ;
borrar_areas_bag = GROUP borrar_areas ALL ;

-- [2] - Delete from webpage:
--   experta: map area - record = hashmap,
--   and areas: array areas = bag
webpage = LOAD '.'
  USING org.apache.gora.pig.GoraStorage(
  'java.lang.String',
  'org.apache.nutch.storage.WebPage',
  'experta, areas') ;

-- Select the pages whose areas contain any of the areas to delete
-- (in borrar_areas_bag.borrar_areas)
webpage_match = FILTER webpage BY bagContainsFB(areas,
    borrar_areas_bag.borrar_areas) ;
-- Delete the areas (bag) and the keys in experta (map)
webpage_fix   = FOREACH webpage_match
    GENERATE key,
        deleteMapKeys(experta, borrar_areas_bag.borrar_areas) as experta,
        SUBTRACT(areas, borrar_areas_bag.borrar_areas) as areas ;

STORE webpage_fix INTO '.' USING org.apache.gora.pig.GoraStorage(
 'java.lang.String',
 'org.apache.nutch.storage.WebPage',
 'experta, areas') ;

As a workaround, I avoid borrar_areas_bag.borrar_areas and use a CROSS
instead. This gets things done, but the execution is noticeably slower:

borrar_areas_table = LOAD '.'
   USING org.apache.gora.pig.GoraStorage(
  'java.lang.String',
  'es.indra.innovationlabs.celtic.generated.BorrarAreas',
  'nombre') ;

borrar_areas = FOREACH borrar_areas_table GENERATE key ;
borrar_areas_bag = GROUP borrar_areas ALL ;

-- [2] - Delete from webpage: experta: map area - record = hashmap, and
-- areas: array areas = bag
webpage = LOAD '.'
  USING org.apache.gora.pig.GoraStorage(
  'java.lang.String',
  'org.apache.nutch.storage.WebPage',
  'experta, areas') ;

webpage_cross_areas = CROSS webpage, borrar_areas_bag ;
-- Select the pages whose areas contain any of the areas to delete
-- (in borrar_areas_bag::borrar_areas)
webpage_match = FILTER webpage_cross_areas BY
    bagContainsFB(webpage::areas, borrar_areas_bag::borrar_areas) ;
-- Delete the areas (bag) and the keys in experta (map)
webpage_fix   = FOREACH webpage_match
    GENERATE webpage::key AS key,
        deleteMapKeys(experta, borrar_areas_bag::borrar_areas) as experta,
        SUBTRACT(areas, borrar_areas_bag::borrar_areas) as areas:{(chararray)} ;

STORE webpage_fix INTO '.' USING org.apache.gora.pig.GoraStorage(
 'java.lang.String',
 'org.apache.nutch.storage.WebPage',
 'experta, areas') ;


The actual question is: does this combination ring a bell for anyone?
An outer bag in a FOREACH, a custom Storage, dependencies, ...
Is there any method that I should implement? Is it related to some schema?

I know this is a rather vague question, so I am not expecting much :( but
thanks! :)

Regards,

Alfonso Nishikawa


[jira] [Created] (PIG-4513) Lines dropped in delimited text when they begin with null/no-data

2015-04-22 Thread Madhan Sundararajan Devaki (JIRA)
Madhan Sundararajan Devaki created PIG-4513:
---

 Summary: Lines dropped in delimited text when they begin with 
null/no-data
 Key: PIG-4513
 URL: https://issues.apache.org/jira/browse/PIG-4513
 Project: Pig
  Issue Type: Bug
  Components: parser, piggybank
Affects Versions: 0.12.0
 Environment: CDH5.2.x, CDH5.3.x
Reporter: Madhan Sundararajan Devaki
Priority: Blocker


When Pig (0.12) is used to process delimited text files (| delimited), lines 
that do not contain data in the first column are dropped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: Pig-trunk-commit #2104

2015-04-22 Thread Apache Jenkins Server
See https://builds.apache.org/job/Pig-trunk-commit/2104/

--
[...truncated 4407 lines...]
[junit] Running org.apache.pig.test.TestNewPlanListener
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.344 sec
[junit] Running org.apache.pig.test.TestNewPlanLogToPhyTranslationVisitor
[junit] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.269 sec
[junit] Running org.apache.pig.test.TestNewPlanLogicalOptimizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.247 sec
[junit] Running org.apache.pig.test.TestNewPlanOperatorPlan
[junit] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.597 sec
[junit] Running org.apache.pig.test.TestNewPlanPruneMapKeys
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.697 sec
[junit] Running org.apache.pig.test.TestNewPlanPushDownForeachFlatten
[junit] Tests run: 45, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.3 sec
[junit] Running org.apache.pig.test.TestNewPlanPushUpFilter
[junit] Tests run: 46, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.243 sec
[junit] Running org.apache.pig.test.TestNewPlanRule
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.307 sec
[junit] Running org.apache.pig.test.TestNotEqualTo
[junit] Tests run: 28, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.421 sec
[junit] Running org.apache.pig.test.TestNull
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.42 sec
[junit] Running org.apache.pig.test.TestNullConstant
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 23.026 sec
[junit] Running org.apache.pig.test.TestNumberOfReducers
[junit] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 486.471 sec
[junit] Running org.apache.pig.test.TestOptimizeLimit
[junit] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.115 sec
[junit] Running org.apache.pig.test.TestOrderBy3
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.967 sec
[junit] Running org.apache.pig.test.TestPOBinCond
[junit] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.443 sec
[junit] Running org.apache.pig.test.TestPOCast
[junit] Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.865 sec
[junit] Running org.apache.pig.test.TestPODistinct
[junit] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.371 sec
[junit] Running org.apache.pig.test.TestPOGenerate
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.342 sec
[junit] Running org.apache.pig.test.TestPOMapLookUp
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.332 sec
[junit] Running org.apache.pig.test.TestPONegative
[junit] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.49 sec
[junit] Running org.apache.pig.test.TestPOPartialAgg
[junit] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.823 sec
[junit] Running org.apache.pig.test.TestPOPartialAggPlan
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.224 sec
[junit] Running org.apache.pig.test.TestPORegexp
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.353 sec
[junit] Running org.apache.pig.test.TestPOSort
[junit] Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.378 sec
[junit] Running org.apache.pig.test.TestPOSplit
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.34 sec
[junit] Running org.apache.pig.test.TestPOUserFunc
[junit] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.377 sec
[junit] Running org.apache.pig.test.TestPackage
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.773 sec
[junit] Running org.apache.pig.test.TestParamSubPreproc
[junit] Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.721 sec
[junit] Running org.apache.pig.test.TestParser
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.966 sec
[junit] Running org.apache.pig.test.TestPhyOp
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.415 sec
[junit] Running org.apache.pig.test.TestPhyPatternMatch
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.324 sec
[junit] Running org.apache.pig.test.TestPigContext
[junit] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 78.338 sec
[junit] Running org.apache.pig.test.TestPigContextClassCache
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.322 sec
[junit] Running org.apache.pig.test.TestPigException
[junit] Tests 

[jira] Subscription: PIG patch available

2015-04-22 Thread jira
Issue Subscription
Filter: PIG patch available (30 issues)

Subscriber: pigdaily

Key       Summary
PIG-4496  Fix CBZip2InputStream to close underlying stream
          https://issues.apache.org/jira/browse/PIG-4496
PIG-4494  Pig's htrace version conflicts with that of hadoop 2.6.0
          https://issues.apache.org/jira/browse/PIG-4494
PIG-4490  MIN/MAX builtin UDFs return wrong results when accumulating for strings
          https://issues.apache.org/jira/browse/PIG-4490
PIG-4481  e2e tests ComputeSpec_1, ComputeSpec_2, StreamingPerformance_3 and StreamingPerformance_4 produce different result on Windows
          https://issues.apache.org/jira/browse/PIG-4481
PIG-4468  Pig's jackson version conflicts with that of hadoop 2.6.0
          https://issues.apache.org/jira/browse/PIG-4468
PIG-4455  Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
          https://issues.apache.org/jira/browse/PIG-4455
PIG-4452  Embedded SQL using SQL instead of sql fails with string index out of range: -1 error
          https://issues.apache.org/jira/browse/PIG-4452
PIG-4422  Implement visitMergeJoin in SparkCompiler
          https://issues.apache.org/jira/browse/PIG-4422
PIG-4418  NullPointerException in JVMReuseImpl
          https://issues.apache.org/jira/browse/PIG-4418
PIG-4417  Pig's register command should support automatic fetching of jars from repo.
          https://issues.apache.org/jira/browse/PIG-4417
PIG-4377  Skewed outer join produce wrong result in some cases
          https://issues.apache.org/jira/browse/PIG-4377
PIG-4365  TOP udf should implement Accumulator interface
          https://issues.apache.org/jira/browse/PIG-4365
PIG-4341  Add CMX support to pig.tmpfilecompression.codec
          https://issues.apache.org/jira/browse/PIG-4341
PIG-4323  PackageConverter hanging in Spark
          https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
          https://issues.apache.org/jira/browse/PIG-4313
PIG-4276  Fix ordering related failures in TestEvalPipeline for Spark
          https://issues.apache.org/jira/browse/PIG-4276
PIG-4251  Pig on Storm
          https://issues.apache.org/jira/browse/PIG-4251
PIG-4193  Make collected group work with Spark
          https://issues.apache.org/jira/browse/PIG-4193
PIG-4111  Make Pig compiles with avro-1.7.7
          https://issues.apache.org/jira/browse/PIG-4111
PIG-4004  Upgrade the Pigmix queries from the (old) mapred API to mapreduce
          https://issues.apache.org/jira/browse/PIG-4004
PIG-4002  Disable combiner when map-side aggregation is used
          https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
          https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
          https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
          https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
          https://issues.apache.org/jira/browse/PIG-3873
PIG-3866  Create ThreadLocal classloader per PigContext
          https://issues.apache.org/jira/browse/PIG-3866
PIG-3851  Upgrade jline to 2.11
          https://issues.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
          https://issues.apache.org/jira/browse/PIG-3668
PIG-3635  Fix e2e tests for Hadoop 2.X on Windows
          https://issues.apache.org/jira/browse/PIG-3635
PIG-3587  add functionality for rolling over dates
          https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] [Commented] (PIG-4365) TOP udf should implement Accumulator interface

2015-04-22 Thread Rohini Palaniswamy (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507782#comment-14507782 ]

Rohini Palaniswamy commented on PIG-4365:
-

[~eyal],
   Where it returned null before, it now returns an empty bag. That needs to be 
fixed. Could you also add a test with an actual Pig script and a small batch 
size so that the full code path is exercised in the test? Refer to 
TestAccumulator for an example.

Can you also post the new patch on the review board (reviews.apache.org)?

 TOP udf should implement Accumulator interface
 --

 Key: PIG-4365
 URL: https://issues.apache.org/jira/browse/PIG-4365
 Project: Pig
  Issue Type: Task
Affects Versions: 0.15.0
Reporter: Rohini Palaniswamy
Assignee: Eyal Allweil
  Labels: newbie
 Fix For: 0.15.0

 Attachments: PIG-4365.patch








[jira] [Created] (PIG-4514) pig trunk compilation is broken - VertexManagerPluginContext.reconfigureVertex change

2015-04-22 Thread Thejas M Nair (JIRA)
Thejas M Nair created PIG-4514:
--

 Summary: pig trunk compilation is broken - 
VertexManagerPluginContext.reconfigureVertex change
 Key: PIG-4514
 URL: https://issues.apache.org/jira/browse/PIG-4514
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.15.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.15.0



{code}
src/org/apache/pig/backend/hadoop/executionengine/tez/runtime/PigGraceShuffleVertexManager.java:173: error: exception TezException is never thrown in body of corresponding try statement
[javac] } catch (TezException e) {
[javac]   ^
{code}
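
For context, javac rejects a catch clause for a checked exception that the try body cannot throw, which is what can happen when a called method stops declaring that exception. A minimal sketch in plain Java follows. SketchCheckedException, VertexContextSketch and reconfigure are hypothetical names, not the Tez API, and the sketch does not claim to show how this issue was actually fixed.

```java
// Hypothetical caller; the comment inside apply() shows the shape that no longer compiles.
class CatchSketch {
    static int apply(VertexContextSketch context, int newParallelism) {
        // Against the newer signature below, wrapping this call as
        //     try { context.reconfigure(n); } catch (SketchCheckedException e) { ... }
        // fails to compile with "exception SketchCheckedException is never thrown
        // in body of corresponding try statement". Dropping the try/catch compiles.
        context.reconfigure(newParallelism);
        return context.parallelism();
    }

    public static void main(String[] args) {
        System.out.println(apply(new VertexContextSketch(), 4));
    }
}

// Hypothetical checked exception standing in for a library exception type.
class SketchCheckedException extends Exception {
    SketchCheckedException(String message) {
        super(message);
    }
}

// Hypothetical context whose method used to declare "throws SketchCheckedException".
class VertexContextSketch {
    private int parallelism = 1;

    // Newer signature: no checked exception is declared any more.
    void reconfigure(int newParallelism) {
        parallelism = newParallelism;
    }

    int parallelism() {
        return parallelism;
    }
}
```

The rule only applies to checked exceptions; catching Exception or an unchecked exception type around the same call would still compile.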






[jira] [Updated] (PIG-4514) pig trunk compilation is broken - VertexManagerPluginContext.reconfigureVertex change

2015-04-22 Thread Thejas M Nair (JIRA)

 [ https://issues.apache.org/jira/browse/PIG-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-4514:
---
Attachment: PIG-4514.1.patch

 pig trunk compilation is broken - 
 VertexManagerPluginContext.reconfigureVertex change
 -

 Key: PIG-4514
 URL: https://issues.apache.org/jira/browse/PIG-4514
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.15.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.15.0

 Attachments: PIG-4514.1.patch


 {code}
 src/org/apache/pig/backend/hadoop/executionengine/tez/runtime/PigGraceShuffleVertexManager.java:173: error: exception TezException is never thrown in body of corresponding try statement
 [javac] } catch (TezException e) {
 [javac]   ^
 {code}





[jira] [Assigned] (PIG-4295) Enable unit test TestPigContext for spark

2015-04-22 Thread liyunzhang_intel (JIRA)

 [ https://issues.apache.org/jira/browse/PIG-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel reassigned PIG-4295:
-

Assignee: liyunzhang_intel

 Enable unit test TestPigContext for spark
 ---

 Key: PIG-4295
 URL: https://issues.apache.org/jira/browse/PIG-4295
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: TEST-org.apache.pig.test.TestPigContext.txt


 error log is attached





[jira] [Updated] (PIG-4513) Lines dropped in delimited text when they begin with null/no-data

2015-04-22 Thread Rohini Palaniswamy (JIRA)

 [ https://issues.apache.org/jira/browse/PIG-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-4513:

Fix Version/s: 0.15.0

 Lines dropped in delimited text when they begin with null/no-data
 -

 Key: PIG-4513
 URL: https://issues.apache.org/jira/browse/PIG-4513
 Project: Pig
  Issue Type: Bug
  Components: parser, piggybank
Affects Versions: 0.12.0
 Environment: CDH5.2.x, CDH5.3.x
Reporter: Madhan Sundararajan Devaki
Priority: Blocker
 Fix For: 0.15.0


 When Pig (0.12) is used to process delimited text files (| delimited), lines 
 that do not contain data in the first column are dropped.





[jira] [Updated] (PIG-4490) MIN/MAX builtin UDFs return wrong results when accumulating for strings

2015-04-22 Thread Rohini Palaniswamy (JIRA)

 [ https://issues.apache.org/jira/browse/PIG-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-4490:

Fix Version/s: 0.15.0
 Assignee: xplenty

[~opensou...@xplenty.com],
What problems do you have with compiling?  Please ensure that the testcases 
you have added fail without the fix and pass after the fix. 

Let me know if you need any help or clarifications as I would like to get this 
patch into Pig 0.15. 

 MIN/MAX builtin UDFs return wrong results when accumulating for strings
 ---

 Key: PIG-4490
 URL: https://issues.apache.org/jira/browse/PIG-4490
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Affects Versions: 0.12.0, 0.13.0, 0.14.0
Reporter: xplenty
Assignee: xplenty
 Fix For: 0.15.0

 Attachments: fix-min-max-test.patch, fix-min-max.patch


 When using the MIN/MAX UDFs with strings in a job that uses the accumulator 
 interface, the results are wrong: the UDF does not return the correct MIN/MAX 
 value.
 This is caused by a reversed greater-than/less-than comparison in the 
 accumulate() function of both the StringMin and StringMax UDFs.
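
 To illustrate the kind of bug described, here is a minimal sketch in plain Java. It does not use Pig's Accumulator interface or the actual StringMax source; the class StringMaxSketch and its methods are hypothetical stand-ins. An accumulating maximum keeps a running value across batches and must replace it only when the new value compares greater. Flipping that comparison silently turns it into a minimum.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an accumulating string MAX; not Pig's StringMax.
class StringMaxSketch {
    private String intermediateMax = null;

    // Called once per batch of values, in the style of an accumulator UDF.
    void accumulate(List<String> batch) {
        for (String s : batch) {
            if (s == null) {
                continue; // nulls are ignored
            }
            // Correct direction: replace only when s is greater than the running max.
            // A reversed sign here (compareTo(...) < 0) would compute the minimum instead.
            if (intermediateMax == null || s.compareTo(intermediateMax) > 0) {
                intermediateMax = s;
            }
        }
    }

    // Result over everything accumulated so far.
    String getValue() {
        return intermediateMax;
    }

    // Reset state before the next group of values.
    void cleanup() {
        intermediateMax = null;
    }

    public static void main(String[] args) {
        StringMaxSketch max = new StringMaxSketch();
        max.accumulate(Arrays.asList("banana", "apple"));
        max.accumulate(Arrays.asList("cherry", null, "blueberry"));
        System.out.println(max.getValue());
    }
}
```

 A test that feeds the values in more than one batch, as above, is what exercises the accumulate path rather than a single-shot evaluation.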





Re: Branching Pig 0.15

2015-04-22 Thread Rohini Palaniswamy
+1 for frequent releases.

On Fri, Apr 17, 2015 at 8:38 PM, Daniel Dai da...@hortonworks.com wrote:

 It has been almost 5 months since Pig 0.14.0 was released, and we have added
 Hive UDF support, Tez grace parallelism, numerous Tez fixes and quite a few
 other patches. I would like to branch 0.15 by Wednesday of next week. We can
 continue to check important bug fixes into 0.15 after branching. Any objections?

 Thanks,
 Daniel



[jira] [Commented] (PIG-4513) Lines dropped in delimited text when they begin with null/no-data

2015-04-22 Thread Rohini Palaniswamy (JIRA)

[ https://issues.apache.org/jira/browse/PIG-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508038#comment-14508038 ]

Rohini Palaniswamy commented on PIG-4513:
-

This sounds bad. Can you add a reproducible script with data to the jira?

 Lines dropped in delimited text when they begin with null/no-data
 -

 Key: PIG-4513
 URL: https://issues.apache.org/jira/browse/PIG-4513
 Project: Pig
  Issue Type: Bug
  Components: parser, piggybank
Affects Versions: 0.12.0
 Environment: CDH5.2.x, CDH5.3.x
Reporter: Madhan Sundararajan Devaki
Priority: Blocker

 When Pig (0.12) is used to process delimited text files (| delimited), lines 
 that do not contain data in the first column are dropped.


