[
https://issues.apache.org/jira/browse/PIG-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liyunzhang_intel updated PIG-4295:
----------------------------------
Attachment: PIG-4295_2.patch
PIG-4295_1.patch causes an OOM error and introduces three unit test
regressions (see https://builds.apache.org/job/Pig-spark/186/):
org.apache.pig.test.TestAccumulator.testAccumWithRegexp
org.apache.pig.test.TestBestFitCast.testByteArrayCast9
org.apache.pig.test.TestEvalPipeline.testCogroupWithInputFromGroup
The OOM problem is caused by the following code in PIG-4295_1.patch:
{code}
private void saveUdfImportList(PigContext pigContext) {
    String udfImportList =
        Joiner.on(",").join(PigContext.getPackageImportList());
    pigContext.getProperties().setProperty("udf.import.list",
        udfImportList);
}
{code}
Here is why TestBestFitCast.testByteArrayCast9 fails.
TestBestFitCast contains 26 unit tests.
When the first unit test runs
TestBestFitCast.setUp->PigServer.<init>->PigContext.init,
the value of properties.get("udf.import.list") is null, so
PigContext.initializeImportList((String)properties.get("udf.import.list")) is
not executed.
PigContext#init
{code}
private void init() {
    if (properties.get("udf.import.list") != null)
        PigContext.initializeImportList((String) properties.get("udf.import.list"));
}
{code}
When SparkLauncher#saveUdfImportList is then executed, it calls
pigContext.getProperties().setProperty("udf.import.list", udfImportList), so
the value of PigContext.getProperties().get("udf.import.list") becomes
",java.lang.,org.apache.pig.builtin.,org.apache.pig.impl.builtin.".
When the second unit test runs
TestBestFitCast.setUp->PigServer.<init>->PigContext.init,
the value of PigContext.getProperties().get("udf.import.list") is
",java.lang.,org.apache.pig.builtin.,org.apache.pig.impl.builtin." (not null),
so PigContext.initializeImportList((String)properties.get("udf.import.list"))
is executed.
PigContext#initializeImportList
{code}
public static void initializeImportList(String importListCommandLineProperties)
{
    StringTokenizer tokenizer =
        new StringTokenizer(importListCommandLineProperties, ":");
    int pos = 1; // Leave "" as the first import
    ArrayList<String> importList = getPackageImportList();
    while (tokenizer.hasMoreTokens())
    {
        String importItem = tokenizer.nextToken();
        if (!importItem.endsWith("."))
            importItem += ".";
        importList.add(pos, importItem);
        pos++;
    }
}
{code}
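To make the effect of the delimiter mismatch concrete, here is a minimal sketch that replays the loop above on a local list instead of Pig's static one. The class name TokenizerDemo and the method applyImportProperty are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Hypothetical helper: same loop as initializeImportList, but it works on
    // a copy of the given list rather than on PigContext's static list.
    static List<String> applyImportProperty(List<String> current, String property) {
        List<String> importList = new ArrayList<>(current);
        StringTokenizer tokenizer = new StringTokenizer(property, ":");
        int pos = 1; // leave "" as the first import
        while (tokenizer.hasMoreTokens()) {
            String importItem = tokenizer.nextToken();
            if (!importItem.endsWith(".")) {
                importItem += ".";
            }
            importList.add(pos, importItem);
            pos++;
        }
        return importList;
    }

    public static void main(String[] args) {
        List<String> defaults = Arrays.asList("", "java.lang.",
                "org.apache.pig.builtin.", "org.apache.pig.impl.builtin.");
        // The saved value is comma-separated, so ":" never splits it.
        String saved = String.join(",", defaults);
        for (String entry : applyImportProperty(defaults, saved)) {
            System.out.println("[" + entry + "]");
        }
    }
}
```

Because the string contains no ":", the tokenizer returns it as a single token, and that token is inserted as one entry at index 1.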
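initializeImportList splits its argument on ":", but the saved value is joined
with ",", so the whole string is treated as one import item and inserted at
index 1. After that, the value of PigContext#packageImportList is
["",
",java.lang.,org.apache.pig.builtin.,org.apache.pig.impl.builtin.",
"java.lang.",
"org.apache.pig.builtin.",
"org.apache.pig.impl.builtin."].
PigContext#packageImportList should have 4 entries, but now has 5: the string
",java.lang.,org.apache.pig.builtin.,org.apache.pig.impl.builtin." has been
added as a single bogus entry.
If a file contains many unit test cases, PigContext#packageImportList keeps
growing: each test adds one more entry, and since each newly saved string
contains all the earlier entries, the saved property roughly doubles in length
per test.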
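The growth can be estimated with a small simulation. This is only a sketch under stated assumptions: saveUdfImportList is assumed to run once per test after PigContext.init, and both the static import list and the "udf.import.list" property are assumed to survive between tests. The class ImportListGrowth is hypothetical, and it tracks string lengths only so that the sketch does not itself run out of memory.

```java
import java.util.ArrayList;
import java.util.List;

class ImportListGrowth {
    // Returns the length of the saved property after each of `rounds`
    // init + saveUdfImportList cycles (assumed: one cycle per unit test).
    static long[] savedLengths(int rounds) {
        List<Long> entryLengths = new ArrayList<>();
        String[] defaults = {"", "java.lang.", "org.apache.pig.builtin.",
                "org.apache.pig.impl.builtin."};
        for (String s : defaults) {
            entryLengths.add((long) s.length());
        }
        long[] saved = new long[rounds];
        long property = -1; // -1 models "property not set yet"
        for (int i = 0; i < rounds; i++) {
            if (property >= 0) {
                // init: the whole saved string is re-imported as one entry
                entryLengths.add(1, property);
            }
            // save: joined length = sum of entries + one comma per gap
            long total = entryLengths.size() - 1;
            for (long len : entryLengths) {
                total += len;
            }
            property = total;
            saved[i] = property;
        }
        return saved;
    }

    public static void main(String[] args) {
        long[] lengths = savedLengths(26); // TestBestFitCast has 26 tests
        for (int i = 0; i < lengths.length; i++) {
            System.out.println("test " + (i + 1) + ": " + lengths[i] + " chars");
        }
    }
}
```

Under these assumptions the saved string would exceed Integer.MAX_VALUE characters by the 26th test, which is more than a single Java String can hold, so an OOM well before that point is plausible.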
How to avoid the OOM problem:
PIG-4295_2.patch changes saveUdfImportList from
{code}
private void saveUdfImportList(PigContext pigContext) {
    String udfImportList =
        Joiner.on(",").join(PigContext.getPackageImportList());
    pigContext.getProperties().setProperty("udf.import.list",
        udfImportList);
}
{code}
to
{code}
private void saveUdfImportList(PigContext pigContext) {
    String udfImportList =
        Joiner.on(",").join(PigContext.getPackageImportList());
    pigContext.getProperties().setProperty("spark.udf.import.list",
        udfImportList);
}
{code}
If we store the import list under a separate key, i.e.
pigContext.getProperties().setProperty("spark.udf.import.list", udfImportList),
PigContext#init no longer finds a value under "udf.import.list", so it does
not cause the error described above.
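The effect of the rename can be checked in isolation with plain java.util.Properties. The sketch below only models the null check from PigContext#init. The class RenamedKeyDemo, the method reimports, and the placeholder value are hypothetical, and it assumes the same Properties object is seen by every test.

```java
import java.util.Properties;

class RenamedKeyDemo {
    static final String INIT_KEY = "udf.import.list";        // read by PigContext#init
    static final String SPARK_KEY = "spark.udf.import.list"; // written by PIG-4295_2.patch

    // Counts how many of `rounds` setUp calls would re-import the saved list
    // when saveUdfImportList writes under `saveKey`.
    static int reimports(String saveKey, int rounds) {
        Properties props = new Properties();
        int count = 0;
        for (int i = 0; i < rounds; i++) {
            if (props.get(INIT_KEY) != null) {
                count++; // init would call initializeImportList here
            }
            props.setProperty(saveKey, "placeholder-import-list");
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("old key: " + reimports(INIT_KEY, 26) + " re-imports");
        System.out.println("new key: " + reimports(SPARK_KEY, 26) + " re-imports");
    }
}
```

With the old key every test after the first triggers a re-import. With the new key none does, so PigContext#packageImportList keeps its 4 entries.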
> Enable unit test "TestPigContext" for spark
> -------------------------------------------
>
> Key: PIG-4295
> URL: https://issues.apache.org/jira/browse/PIG-4295
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Affects Versions: spark-branch
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4295.patch, PIG-4295_1.patch, PIG-4295_2.patch,
> TEST-org.apache.pig.test.TestPigContext.txt
>
>
> error log is attached
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)