[jira] [Updated] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-05-31 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5246:
--
Status: Patch Available  (was: Open)

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-5246.patch
>
>
> In bin/pig, we copy the assembly jar to Pig's classpath for Spark 1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> After upgrading to Spark 2.0, we may need to modify it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-05-31 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-5246:
--
Attachment: PIG-5246.patch

[~nkollar], [~szita]: please help review.
In Spark 2, spark-assembly*.jar no longer exists, so we need to append all jars under 
$SPARK_HOME/jars/ to the Pig classpath.
{code}
+if [ "$sparkversion" == "21" ]; then
+  if [ -n "$SPARK_HOME" ]; then
+ echo "Using Spark Home: " ${SPARK_HOME}
+  for f in $SPARK_HOME/jars/*.jar; do
+   CLASSPATH=${CLASSPATH}:$f
+  done
+  fi
+ fi
{code}

How to use it:

1. build pig with spark21
{noformat}
   ant clean -v  -Dsparkversion=21   -Dhadoopversion=2 jar
{noformat}
2. run pig with spark21
{noformat}
  ./pig -x $mode -sparkversion 21 -log4jconf $PIG_HOME/conf/log4j.properties 
-logfile $PIG_HOME/logs/pig.log  $PIG_HOME/bin/testJoin.pig
{noformat}
  

> Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
> --
>
> Key: PIG-5246
> URL: https://issues.apache.org/jira/browse/PIG-5246
> Project: Pig
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-5246.patch
>
>
> In bin/pig, we copy the assembly jar to Pig's classpath for Spark 1.6.
> {code}
> # For spark mode:
> # Please specify SPARK_HOME first so that we can locate 
> $SPARK_HOME/lib/spark-assembly*.jar,
> # we will add spark-assembly*.jar to the classpath.
> if [ "$isSparkMode"  == "true" ]; then
> if [ -z "$SPARK_HOME" ]; then
>echo "Error: SPARK_HOME is not set!"
>exit 1
> fi
> # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar 
> to allow YARN to cache spark-assembly*.jar on nodes so that it doesn't need 
> to be distributed each time an application runs.
> if [ -z "$SPARK_JAR" ]; then
>echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs 
> location of spark-assembly*.jar. This allows YARN to cache 
> spark-assembly*.jar on nodes so that it doesn't need to be distributed each 
> time an application runs."
>exit 1
> fi
> if [ -n "$SPARK_HOME" ]; then
> echo "Using Spark Home: " ${SPARK_HOME}
> SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
> CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
> fi
> fi
> {code}
> After upgrading to Spark 2.0, we may need to modify it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032514#comment-16032514
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: I have tested that we can remove JobLogger in spark16.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] Subscription: PIG patch available

2017-05-31 Thread jira
Issue Subscription
Filter: PIG patch available (32 issues)

Subscriber: pigdaily

Key Summary
PIG-5225Several unit tests are not annotated with @Test
https://issues.apache.org/jira/browse/PIG-5225
PIG-5160SchemaTupleFrontend.java is not thread safe, cause PigServer thrown 
NPE in multithread env
https://issues.apache.org/jira/browse/PIG-5160
PIG-5157Upgrade to Spark 2.0
https://issues.apache.org/jira/browse/PIG-5157
PIG-5115Builtin AvroStorage generates incorrect avro schema when the same 
pig field name appears in the alias
https://issues.apache.org/jira/browse/PIG-5115
PIG-5106Optimize when mapreduce.input.fileinputformat.input.dir.recursive 
set to true
https://issues.apache.org/jira/browse/PIG-5106
PIG-5081Can not run pig on spark source code distribution
https://issues.apache.org/jira/browse/PIG-5081
PIG-5080Support store alias as spark table
https://issues.apache.org/jira/browse/PIG-5080
PIG-5057IndexOutOfBoundsException when pig reducer processOnePackageOutput
https://issues.apache.org/jira/browse/PIG-5057
PIG-5029Optimize sort case when data is skewed
https://issues.apache.org/jira/browse/PIG-5029
PIG-4926Modify the content of start.xml for spark mode
https://issues.apache.org/jira/browse/PIG-4926
PIG-4913Reduce jython function initiation during compilation
https://issues.apache.org/jira/browse/PIG-4913
PIG-4849pig on tez will cause tez-ui to crash,because the content from 
timeline server is too long. 
https://issues.apache.org/jira/browse/PIG-4849
PIG-4750REPLACE_MULTI should compile Pattern once and reuse it
https://issues.apache.org/jira/browse/PIG-4750
PIG-4700Pig should call ProcessorContext.setProgress() in TezTaskContext
https://issues.apache.org/jira/browse/PIG-4700
PIG-4684Exception should be changed to warning when job diagnostics cannot 
be fetched
https://issues.apache.org/jira/browse/PIG-4684
PIG-4656Improve String serialization and comparator performance in 
BinInterSedes
https://issues.apache.org/jira/browse/PIG-4656
PIG-4598Allow user defined plan optimizer rules
https://issues.apache.org/jira/browse/PIG-4598
PIG-4551Partition filter is not pushed down in case of SPLIT
https://issues.apache.org/jira/browse/PIG-4551
PIG-4539New PigUnit
https://issues.apache.org/jira/browse/PIG-4539
PIG-4515org.apache.pig.builtin.Distinct throws ClassCastException
https://issues.apache.org/jira/browse/PIG-4515
PIG-4323PackageConverter hanging in Spark
https://issues.apache.org/jira/browse/PIG-4323
PIG-4313StackOverflowError in LIMIT operation on Spark
https://issues.apache.org/jira/browse/PIG-4313
PIG-4251Pig on Storm
https://issues.apache.org/jira/browse/PIG-4251
PIG-4002Disable combiner when map-side aggregation is used
https://issues.apache.org/jira/browse/PIG-4002
PIG-3952PigStorage accepts '-tagSplit' to return full split information
https://issues.apache.org/jira/browse/PIG-3952
PIG-3911Define unique fields with @OutputSchema
https://issues.apache.org/jira/browse/PIG-3911
PIG-3877Getting Geo Latitude/Longitude from Address Lines
https://issues.apache.org/jira/browse/PIG-3877
PIG-3873Geo distance calculation using Haversine
https://issues.apache.org/jira/browse/PIG-3873
PIG-3864ToDate(userstring, format, timezone) computes DateTime with strange 
handling of Daylight Saving Time with location based timezones
https://issues.apache.org/jira/browse/PIG-3864
PIG-3668COR built-in function when atleast one of the coefficient values is 
NaN
https://issues.apache.org/jira/browse/PIG-3668
PIG-3587add functionality for rolling over dates
https://issues.apache.org/jira/browse/PIG-3587
PIG-1804Alow Jython function to implement Algebraic and/or Accumulator 
interfaces
https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] [Commented] (PIG-5247) Investigate stopOnFailure feature with Spark execution engine

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032292#comment-16032292
 ] 

liyunzhang_intel commented on PIG-5247:
---

[~szita]: what confuses me is that this feature is already implemented in the Spark 
engine, yet you created a JIRA for it.
If stopOnFailure is enabled, the remaining jobs will not be executed when an 
exception is thrown.
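
For reference, a minimal usage sketch (assuming the standard -stop_on_failure command-line flag; the script name is a placeholder):
{noformat}
# Abort the remaining jobs as soon as one job fails (stop_on_failure):
./pig -x spark -stop_on_failure myscript.pig
{noformat}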

> Investigate stopOnFailure feature with Spark execution engine
> -
>
> Key: PIG-5247
> URL: https://issues.apache.org/jira/browse/PIG-5247
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
> Fix For: 0.18.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5225) Several unit tests are not annotated with @Test

2017-05-31 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032078#comment-16032078
 ] 

Nandor Kollar commented on PIG-5225:


Ok, uploaded PIG-5225_2.patch with the test case removed.

> Several unit tests are not annotated with @Test
> ---
>
> Key: PIG-5225
> URL: https://issues.apache.org/jira/browse/PIG-5225
> Project: Pig
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5225_2.patch, PIG-5225.patch
>
>
> Several test cases are not annotated with @Test. Since we use JUnit 4, these 
> test cases seem to be excluded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5225) Several unit tests are not annotated with @Test

2017-05-31 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5225:
---
Attachment: PIG-5225_2.patch

> Several unit tests are not annotated with @Test
> ---
>
> Key: PIG-5225
> URL: https://issues.apache.org/jira/browse/PIG-5225
> Project: Pig
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5225_2.patch, PIG-5225.patch
>
>
> Several test cases are not annotated with @Test. Since we use JUnit 4, these 
> test cases seem to be excluded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Build failed in Jenkins: Pig-trunk-commit #2498

2017-05-31 Thread Apache Jenkins Server
See 

--
[...truncated 181.49 KB...]
A src/org/apache/pig/parser/FunctionType.java
A src/org/apache/pig/parser/DuplicatedSchemaAliasException.java
A src/org/apache/pig/parser/InvalidScalarProjectionException.java
A src/org/apache/pig/parser/SourceLocation.java
A src/org/apache/pig/parser/UndefinedAliasException.java
A src/org/apache/pig/parser/QueryParserStreamUtil.java
A src/org/apache/pig/parser/LogicalPlanBuilder.java
A src/org/apache/pig/parser/PlanGenerationFailureException.java
A src/org/apache/pig/parser/QueryParserFileStream.java
A src/org/apache/pig/parser/QueryParserDriver.java
A src/org/apache/pig/parser/PigRecognitionException.java
A src/org/apache/pig/parser/AstPrinter.g
A src/org/apache/pig/parser/PigMacro.java
A src/org/apache/pig/parser/LogicalPlanGenerator.g
A src/org/apache/pig/parser/AliasMasker.g
A src/org/apache/pig/parser/PigParserNode.java
A src/org/apache/pig/parser/AstValidator.g
A src/org/apache/pig/parser/QueryParserUtils.java
A src/org/apache/pig/parser/InvalidCommandException.java
A src/org/apache/pig/parser/RegisterResolver.java
A src/org/apache/pig/parser/StreamingCommandUtils.java
A src/org/apache/pig/parser/QueryLexer.g
A src/org/apache/pig/parser/ParserException.java
A src/org/apache/pig/parser/PigParserNodeAdaptor.java
A src/org/apache/pig/PigConstants.java
A src/org/apache/pig/tools
A src/org/apache/pig/tools/timer
A src/org/apache/pig/tools/timer/PerformanceTimerFactory.java
A src/org/apache/pig/tools/timer/PerformanceTimer.java
A src/org/apache/pig/tools/counters
A src/org/apache/pig/tools/counters/PigCounterHelper.java
A src/org/apache/pig/tools/parameters
A src/org/apache/pig/tools/parameters/ParamLoader.jj
A src/org/apache/pig/tools/parameters/PreprocessorContext.java
A 
src/org/apache/pig/tools/parameters/ParameterSubstitutionException.java
A src/org/apache/pig/tools/parameters/PigFileParser.jj
A 
src/org/apache/pig/tools/parameters/ParameterSubstitutionPreprocessor.java
A src/org/apache/pig/tools/pigscript
A src/org/apache/pig/tools/pigscript/parser
A src/org/apache/pig/tools/pigscript/parser/PigScriptParser.jj
A src/org/apache/pig/tools/ToolsPigServer.java
A src/org/apache/pig/tools/DownloadResolver.java
A src/org/apache/pig/tools/cmdline
A src/org/apache/pig/tools/cmdline/CmdLineParser.java
A src/org/apache/pig/tools/pigstats
A src/org/apache/pig/tools/pigstats/spark
A src/org/apache/pig/tools/pigstats/spark/SparkPigStats.java
A src/org/apache/pig/tools/pigstats/spark/SparkCounter.java
A src/org/apache/pig/tools/pigstats/spark/SparkCounters.java
A src/org/apache/pig/tools/pigstats/spark/SparkScriptState.java
A src/org/apache/pig/tools/pigstats/spark/SparkPigStatusReporter.java
A src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java
A src/org/apache/pig/tools/pigstats/spark/SparkCounterGroup.java
A src/org/apache/pig/tools/pigstats/spark/SparkStatsUtil.java
A src/org/apache/pig/tools/pigstats/PigProgressNotificationListener.java
A src/org/apache/pig/tools/pigstats/tez
A src/org/apache/pig/tools/pigstats/tez/TezVertexStats.java
A 
src/org/apache/pig/tools/pigstats/tez/PigTezProgressNotificationListener.java
A src/org/apache/pig/tools/pigstats/tez/TezPigScriptStats.java
A src/org/apache/pig/tools/pigstats/tez/TezScriptState.java
A src/org/apache/pig/tools/pigstats/tez/TezDAGStats.java
A src/org/apache/pig/tools/pigstats/ScriptState.java
A src/org/apache/pig/tools/pigstats/mapreduce
A src/org/apache/pig/tools/pigstats/mapreduce/MRScriptState.java
A src/org/apache/pig/tools/pigstats/mapreduce/MRJobStats.java
A src/org/apache/pig/tools/pigstats/mapreduce/SimplePigStats.java
A src/org/apache/pig/tools/pigstats/mapreduce/MRPigStatsUtil.java
A src/org/apache/pig/tools/pigstats/PigStatusReporter.java
A src/org/apache/pig/tools/pigstats/PigWarnCounter.java
A src/org/apache/pig/tools/pigstats/EmbeddedPigStats.java
A src/org/apache/pig/tools/pigstats/JobStats.java
A src/org/apache/pig/tools/pigstats/PigStatsUtil.java
A src/org/apache/pig/tools/pigstats/EmptyPigStats.java
A src/org/apache/pig/tools/pigstats/InputStats.java
A src/org/apache/pig/tools/pigstats/PigStats.java
A src/org/apache/pig/tools/pigstats/OutputStats.java
A src/org/apache/pig/tools/streams
A src/org/apache/pig/tools/streams/StreamGenerator.java
A src/org/apache/pig/tools/gru

[jira] [Commented] (PIG-5225) Several unit tests are not annotated with @Test

2017-05-31 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032007#comment-16032007
 ] 

Daniel Dai commented on PIG-5225:
-

The test was added even before my time. The test won't throw an exception; it will get a 
null result and a warning counter, as Rohini points out. However, the test 
name suggests it is testing a failed UDF. I don't think this is valid anymore, and it is fine 
to remove it.

> Several unit tests are not annotated with @Test
> ---
>
> Key: PIG-5225
> URL: https://issues.apache.org/jira/browse/PIG-5225
> Project: Pig
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5225.patch
>
>
> Several test cases are not annotated with @Test. Since we use JUnit 4, these 
> test cases seem to be excluded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Build failed in Jenkins: Pig-trunk-commit #2497

2017-05-31 Thread Apache Jenkins Server
See 


Changes:

[rohini] PIG-5248: Fix TestCombiner#testGroupByLimit after PigOnSpark merge 
(rohini)

[rohini] PIG-5245: TestGrunt.testStopOnFailure is flaky (rohini)

[daijy] PIG-5216: Customizable Error Handling for Loaders in Pig (chenjunz via 
daijy)

--
[...truncated 175.60 KB...]
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/MyRegExLoader.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/JsonMetadata.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/xml
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/xml/XPath.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/xml/XPathAll.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/IsInt.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/MaxTupleBy1stField.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/LOWER.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/Split.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/UPPER.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/INDEXOF.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/HashFNV.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexExtractAll.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexExtract.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/LcFirst.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/Trim.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/REPLACE.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/Reverse.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/HashFNV1.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/UcFirst.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/HashFNV2.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/LASTINDEXOF.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/SUBSTRING.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/LENGTH.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/LookupInFiles.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/REPLACE_MULTI.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/Stuff.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexMatch.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/IsLong.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/SearchQuery.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/ToBag.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/ToTuple.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/apachelogparser
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/apachelogparser/SearchEngineExtractor.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/apachelogparser/DateExtractor.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/apachelogparser/HostExtractor.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/apachelogparser/SearchTermExtractor.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/util/Top.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/IsNumeric.java
A 
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/Stitch.java
A 
contr

[jira] [Commented] (PIG-4059) Pig on Spark

2017-05-31 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031937#comment-16031937
 ] 

Rohini Palaniswamy commented on PIG-4059:
-

[~szita],
With PIG-4941, support for SplitLocationInfo was added which is available 
only from Hadoop 2.5 (MAPREDUCE-5896). So Pig 0.17 will only work with Hadoop 
2.5 and above. Please document that the minimum supported version for 0.17 is 
Hadoop 2.5 in the release notes.  

> Pig on Spark
> 
>
> Key: PIG-4059
> URL: https://issues.apache.org/jira/browse/PIG-4059
> Project: Pig
>  Issue Type: New Feature
>  Components: spark
>Reporter: Rohini Palaniswamy
>Assignee: Praveen Rachabattuni
>  Labels: spork
> Fix For: spark-branch, 0.17.0
>
> Attachments: Pig-on-Spark-Design-Doc.pdf, Pig-on-Spark-Scope.pdf
>
>
> Setting up your development environment:
> 0. Download the Spark release package (currently Pig on Spark only supports Spark 
> 1.6).
> 1. Check out the Pig Spark branch.
> 2. Build Pig by running "ant jar" and "ant -Dhadoopversion=23 jar" for 
> hadoop-2.x versions.
> 3. Configure these environment variables:
> export HADOOP_USER_CLASSPATH_FIRST="true"
> Now we support "local" and "yarn-client" modes; you can export the system variable 
> "SPARK_MASTER" like:
> export SPARK_MASTER=local or export SPARK_MASTER="yarn-client"
> 4. In local mode: ./pig -x spark_local xxx.pig
> In yarn-client mode: 
> export SPARK_HOME=xx; 
> export SPARK_JAR=hdfs://example.com:8020/ (the hdfs location where 
> you upload the spark-assembly*.jar)
> ./pig -x spark xxx.pig
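
For convenience, here are the yarn-client steps quoted above collapsed into one sketch (the Spark path, namenode host and assembly jar location are placeholders):
{noformat}
# Consolidated sketch of the yarn-client setup described in the steps above.
export HADOOP_USER_CLASSPATH_FIRST="true"
export SPARK_MASTER="yarn-client"
export SPARK_HOME=/path/to/spark                                    # local Spark 1.6 installation (placeholder)
export SPARK_JAR=hdfs://namenode:8020/user/pig/spark-assembly.jar   # HDFS location of the uploaded spark-assembly*.jar (placeholder)
./pig -x spark xxx.pig
{noformat}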



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5248) Fix TestCombiner#testGroupByLimit after PigOnSpark merge

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5248:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

bq. Interesting that test is now catching the real issue. +1 on the patch.
   As [~szita] mentioned, the test case was referring to the wrong alias and it 
was corrected in the spark patch making the test case actually work as intended.

Committed to branch-0.17 and trunk. Thanks [~szita] for debugging the issue and 
[~knoguchi] for reviewing it.

> Fix TestCombiner#testGroupByLimit after PigOnSpark merge
> 
>
> Key: PIG-5248
> URL: https://issues.apache.org/jira/browse/PIG-5248
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5248-1.patch
>
>
> This test started failing on TEZ after we merged PoS. The test checks if 
> there is a "Combiner plan" among the vertices of Tez execution plan.
> The last step of the query is {{pigServer.registerQuery("d = limit c 2 ; 
> ");}} which is decisive in the case of Tez. It looks like if we check the 
> plan for "d" there is no combiner part, but there is one if we check it for 
> "c" - so without applying limit.
> The reason this didn't come out before is because the alias supplied to 
> {{checkCombinerUsed}} method was disregarded and alias "c" was checked 
> always. This was recently fixed with the PoS merge. (See diff of TestCombiner 
> [here|https://github.com/apache/pig/commit/e766b6bf29e610b6312f8447fc008bed6beb4090?diff=split#diff-8bcae39a2bb998cdfeb8c7af960eb196L360]
>  )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5201) Null handling on FLATTEN

2017-05-31 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-5201:
--
Attachment: pig-5201-v03.patch

Noticed I forgot to take out some debug print/dump statements.  Taking them out 
and now calling {{Util.checkQueryOutputsAfterSort}} for map output comparisons. 

> Null handling on FLATTEN
> 
>
> Key: PIG-5201
> URL: https://issues.apache.org/jira/browse/PIG-5201
> Project: Pig
>  Issue Type: Bug
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig-5201-v00-testonly.patch, pig-5201-v01.patch, 
> pig-5201-v02.patch, pig-5201-v03.patch
>
>
> Sometimes, FLATTEN(null) or FLATTEN(bag-with-null) seem to produce incorrect 
> results.
> Test code/script to follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5248) Fix TestCombiner#testGroupByLimit after PigOnSpark merge

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5248:

Summary: Fix TestCombiner#testGroupByLimit after PigOnSpark merge  (was: 
Fix TestCombiner#testGroupByLimit after PoS merge)

> Fix TestCombiner#testGroupByLimit after PigOnSpark merge
> 
>
> Key: PIG-5248
> URL: https://issues.apache.org/jira/browse/PIG-5248
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5248-1.patch
>
>
> This test started failing on TEZ after we merged PoS. The test checks if 
> there is a "Combiner plan" among the vertices of Tez execution plan.
> The last step of the query is {{pigServer.registerQuery("d = limit c 2 ; 
> ");}} which is decisive in the case of Tez. It looks like if we check the 
> plan for "d" there is no combiner part, but there is one if we check it for 
> "c" - so without applying limit.
> The reason this didn't come out before is because the alias supplied to 
> {{checkCombinerUsed}} method was disregarded and alias "c" was checked 
> always. This was recently fixed with the PoS merge. (See diff of TestCombiner 
> [here|https://github.com/apache/pig/commit/e766b6bf29e610b6312f8447fc008bed6beb4090?diff=split#diff-8bcae39a2bb998cdfeb8c7af960eb196L360]
>  )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5245:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to branch-0.17 and trunk. [~szita], thanks for the review and 
debugging of the issue.

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, 
> PIG-5245-2.addingSparkLibsToMiniCluster.patch, PIG-5245-2.patch, 
> PIG-5245-3.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PIG-3368) doc pig flatten operator applied to empty vs null bag

2017-05-31 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi resolved PIG-3368.
---
   Resolution: Duplicate
 Assignee: (was: Aniket Mokashi)
Fix Version/s: (was: 0.18.0)

Let's track this in PIG-5201.

> doc pig flatten operator applied to empty vs null bag
> -
>
> Key: PIG-3368
> URL: https://issues.apache.org/jira/browse/PIG-3368
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Andy Schlaikjer
>
> [Pig docs|http://pig.apache.org/docs/r0.11.0/basic.html#flatten] state that 
> FLATTEN(field_of_type_bag) may generate a cross-product in the case when an 
> additional field is projected, e.g.:
> y = FOREACH x GENERATE f1, FLATTEN(fbag) as f2;
> Additionally, for records in x for which fbag is empty (not null), no output 
> record is generated.
> What is expected behavior when fbag is null?
> Some users might expect similar behavior, but FLATTEN actually passes through 
> the null, resulting in an output record (f1, f2) where f2 is null.
> It would be useful to update FLATTEN docs to mention this.
> http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml?view=markup#l5051
> I'm guessing these are the relevant bits which affect this behavior:
> http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java?view=markup#l440
> http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java?view=markup#l468



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031861#comment-16031861
 ] 

Adam Szita commented on PIG-5245:
-

+1 for [^PIG-5245-3.patch]

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, 
> PIG-5245-2.addingSparkLibsToMiniCluster.patch, PIG-5245-2.patch, 
> PIG-5245-3.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5248) Fix TestCombiner#testGroupByLimit after PoS merge

2017-05-31 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031856#comment-16031856
 ] 

Koji Noguchi commented on PIG-5248:
---

bq. It is not a quick fix. Skipped the test for now and created PIG-5249 to fix 
the issue.

Interesting that test is now catching the real issue.  +1 on the patch.

> Fix TestCombiner#testGroupByLimit after PoS merge
> -
>
> Key: PIG-5248
> URL: https://issues.apache.org/jira/browse/PIG-5248
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5248-1.patch
>
>
> This test started failing on TEZ after we merged PoS. The test checks if 
> there is a "Combiner plan" among the vertices of Tez execution plan.
> The last step of the query is {{pigServer.registerQuery("d = limit c 2 ; 
> ");}} which is decisive in the case of Tez. It looks like if we check the 
> plan for "d" there is no combiner part, but there is one if we check it for 
> "c" - so without applying limit.
> The reason this didn't come out before is because the alias supplied to 
> {{checkCombinerUsed}} method was disregarded and alias "c" was checked 
> always. This was recently fixed with the PoS merge. (See diff of TestCombiner 
> [here|https://github.com/apache/pig/commit/e766b6bf29e610b6312f8447fc008bed6beb4090?diff=split#diff-8bcae39a2bb998cdfeb8c7af960eb196L360]
>  )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5248) Fix TestCombiner#testGroupByLimit after PoS merge

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5248:

Assignee: Rohini Palaniswamy
  Status: Patch Available  (was: Open)

> Fix TestCombiner#testGroupByLimit after PoS merge
> -
>
> Key: PIG-5248
> URL: https://issues.apache.org/jira/browse/PIG-5248
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5248-1.patch
>
>
> This test started failing on TEZ after we merged PoS. The test checks if 
> there is a "Combiner plan" among the vertices of Tez execution plan.
> The last step of the query is {{pigServer.registerQuery("d = limit c 2 ; 
> ");}} which is decisive in the case of Tez. It looks like if we check the 
> plan for "d" there is no combiner part, but there is one if we check it for 
> "c" - so without applying limit.
> The reason this didn't come out before is because the alias supplied to 
> {{checkCombinerUsed}} method was disregarded and alias "c" was checked 
> always. This was recently fixed with the PoS merge. (See diff of TestCombiner 
> [here|https://github.com/apache/pig/commit/e766b6bf29e610b6312f8447fc008bed6beb4090?diff=split#diff-8bcae39a2bb998cdfeb8c7af960eb196L360]
>  )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5248) Fix TestCombiner#testGroupByLimit after PoS merge

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5248:

Attachment: PIG-5248-1.patch

It is not a quick fix. Skipped the test for now and created PIG-5249 to fix the 
issue.

> Fix TestCombiner#testGroupByLimit after PoS merge
> -
>
> Key: PIG-5248
> URL: https://issues.apache.org/jira/browse/PIG-5248
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
> Fix For: 0.17.0
>
> Attachments: PIG-5248-1.patch
>
>
> This test started failing on TEZ after we merged PoS. The test checks if 
> there is a "Combiner plan" among the vertices of Tez execution plan.
> The last step of the query is {{pigServer.registerQuery("d = limit c 2 ; 
> ");}} which is decisive in the case of Tez. It looks like if we check the 
> plan for "d" there is no combiner part, but there is one if we check it for 
> "c" - so without applying limit.
> The reason this didn't come out before is because the alias supplied to 
> {{checkCombinerUsed}} method was disregarded and alias "c" was checked 
> always. This was recently fixed with the PoS merge. (See diff of TestCombiner 
> [here|https://github.com/apache/pig/commit/e766b6bf29e610b6312f8447fc008bed6beb4090?diff=split#diff-8bcae39a2bb998cdfeb8c7af960eb196L360]
>  )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5249) Group followed by Limit does not have combiner optimization in Tez

2017-05-31 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-5249:
---

 Summary: Group followed by Limit does not have combiner 
optimization in Tez
 Key: PIG-5249
 URL: https://issues.apache.org/jira/browse/PIG-5249
 Project: Pig
  Issue Type: Bug
Reporter: Rohini Palaniswamy
 Fix For: 0.18.0


Changes done in CombinerOptimizer with PIG-946 do not work in Tez, as Limit is 
followed by POValueOutputTez/POValueInputTez and then Foreach. MR does 
LimitAdjuster after CombinerOptimizer, so it works for MR.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5245:

Attachment: PIG-5245-3.patch

Final patch including [~szita]'s changes for Spark. Also reduced the AM 
percentage to 0.1, as that by itself should allow launching 3 AMs with 16G of RAM. 

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, 
> PIG-5245-2.addingSparkLibsToMiniCluster.patch, PIG-5245-2.patch, 
> PIG-5245-3.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PIG-5216) Customizable Error Handling for Loaders in Pig

2017-05-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-5216.
-
  Resolution: Fixed
Hadoop Flags: Reviewed

Also did rebase after spark merge. Patch committed to trunk. Thanks Iris!

> Customizable Error Handling for Loaders in Pig
> --
>
> Key: PIG-5216
> URL: https://issues.apache.org/jira/browse/PIG-5216
> Project: Pig
>  Issue Type: Improvement
>Reporter: Iris Zeng
>Assignee: Iris Zeng
> Fix For: 0.18.0
>
> Attachments: PIG-5216-1.patch, PIG-5216-2.patch, PIG-5216-3.patch, 
> PIG-5216-4.patch
>
>
> Add Error Handling for Loaders in Pig, so that users can choose to allow 
> errors when loading data, and set error numbers / rates.
> Ideas are based on the error handling for store funcs; see 
> https://issues.apache.org/jira/browse/PIG-4704



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5216) Customizable Error Handling for Loaders in Pig

2017-05-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-5216:

Attachment: PIG-5216-4.patch

Found several issues when running unit tests:
1. In POLoad.setup, we also need to set LoadFuncDecorator
2. When serializing "pig.loads", set POLoad.parentPlan to null, as we don't want to 
serialize the whole physical plan
3. In MRJobStats, we still refer to "pig.inputs"
4. Some formatting issues

Attach PIG-5216-4.patch.

> Customizable Error Handling for Loaders in Pig
> --
>
> Key: PIG-5216
> URL: https://issues.apache.org/jira/browse/PIG-5216
> Project: Pig
>  Issue Type: Improvement
>Reporter: Iris Zeng
>Assignee: Iris Zeng
> Fix For: 0.18.0
>
> Attachments: PIG-5216-1.patch, PIG-5216-2.patch, PIG-5216-3.patch, 
> PIG-5216-4.patch
>
>
> Add Error Handling for Loaders in Pig, so that users can choose to allow 
> errors when loading data, and set error numbers / rates.
> Ideas are based on the error handling for store funcs; see 
> https://issues.apache.org/jira/browse/PIG-4704



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5248) Fix TestCombiner#testGroupByLimit after PoS merge

2017-05-31 Thread Adam Szita (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated PIG-5248:

Description: 
This test started failing on TEZ after we merged PoS. The test checks if there 
is a "Combiner plan" among the vertices of Tez execution plan.
The last step of the query is {{pigServer.registerQuery("d = limit c 2 ; ");}} 
which is decisive in the case of Tez. It looks like if we check the plan for 
"d" there is no combiner part, but there is one if we check it for "c" - so 
without applying limit.

The reason this didn't come out before is because the alias supplied to 
{{checkCombinerUsed}} method was disregarded and alias "c" was checked always. 
This was recently fixed with the PoS merge. (See diff of TestCombiner 
[here|https://github.com/apache/pig/commit/e766b6bf29e610b6312f8447fc008bed6beb4090?diff=split#diff-8bcae39a2bb998cdfeb8c7af960eb196L360]
 )

> Fix TestCombiner#testGroupByLimit after PoS merge
> -
>
> Key: PIG-5248
> URL: https://issues.apache.org/jira/browse/PIG-5248
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
> Fix For: 0.17.0
>
>
> This test started failing on TEZ after we merged PoS. The test checks if 
> there is a "Combiner plan" among the vertices of Tez execution plan.
> The last step of the query is {{pigServer.registerQuery("d = limit c 2 ; 
> ");}} which is decisive in the case of Tez. It looks like if we check the 
> plan for "d" there is no combiner part, but there is one if we check it for 
> "c" - so without applying limit.
> The reason this didn't come out before is because the alias supplied to 
> {{checkCombinerUsed}} method was disregarded and alias "c" was checked 
> always. This was recently fixed with the PoS merge. (See diff of TestCombiner 
> [here|https://github.com/apache/pig/commit/e766b6bf29e610b6312f8447fc008bed6beb4090?diff=split#diff-8bcae39a2bb998cdfeb8c7af960eb196L360]
>  )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5248) Fix TestCombiner#testGroupByLimit after PoS merge

2017-05-31 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031231#comment-16031231
 ] 

Adam Szita commented on PIG-5248:
-

[~rohini] can you take a look on this please?

> Fix TestCombiner#testGroupByLimit after PoS merge
> -
>
> Key: PIG-5248
> URL: https://issues.apache.org/jira/browse/PIG-5248
> Project: Pig
>  Issue Type: Improvement
>Reporter: Adam Szita
> Fix For: 0.17.0
>
>
> This test started failing on TEZ after we merged PoS. The test checks if 
> there is a "Combiner plan" among the vertices of Tez execution plan.
> The last step of the query is {{pigServer.registerQuery("d = limit c 2 ; 
> ");}} which is decisive in the case of Tez. It looks like if we check the 
> plan for "d" there is no combiner part, but there is one if we check it for 
> "c" - so without applying limit.
> The reason this didn't come out before is because the alias supplied to 
> {{checkCombinerUsed}} method was disregarded and alias "c" was checked 
> always. This was recently fixed with the PoS merge. (See diff of TestCombiner 
> [here|https://github.com/apache/pig/commit/e766b6bf29e610b6312f8447fc008bed6beb4090?diff=split#diff-8bcae39a2bb998cdfeb8c7af960eb196L360]
>  )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5248) Fix TestCombiner#testGroupByLimit after PoS merge

2017-05-31 Thread Adam Szita (JIRA)
Adam Szita created PIG-5248:
---

 Summary: Fix TestCombiner#testGroupByLimit after PoS merge
 Key: PIG-5248
 URL: https://issues.apache.org/jira/browse/PIG-5248
 Project: Pig
  Issue Type: Improvement
Reporter: Adam Szita
 Fix For: 0.17.0






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031205#comment-16031205
 ] 

Adam Szita commented on PIG-5245:
-

[~rohini] I've spotted a missing part in the YarnMinicluster classpath 
compilation. If we want to run tests in Spark mode with the 
{{SPARK_MASTER=yarn-client}} config (rather than {{local}}, which is now the 
default), we will need the Spark jars for the forked JVMs. I've attached 
[^PIG-5245-2.addingSparkLibsToMiniCluster.patch] which you can apply after 
[^PIG-5245-2.patch]. It will be helpful in the future.

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, 
> PIG-5245-2.addingSparkLibsToMiniCluster.patch, PIG-5245-2.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Adam Szita (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated PIG-5245:

Attachment: PIG-5245-2.addingSparkLibsToMiniCluster.patch

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, 
> PIG-5245-2.addingSparkLibsToMiniCluster.patch, PIG-5245-2.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031188#comment-16031188
 ] 

Adam Szita commented on PIG-5245:
-

+1 for [^PIG-5245-2.patch] - I verified all Spark unit tests pass (3,338 tests 
in 2hr 7min). I don't think this is a blocker issue, [~nkollar]; we will address 
Spark vs stopOnFailure in Pig 0.18.

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, PIG-5245-2.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030993#comment-16030993
 ] 

Nandor Kollar commented on PIG-5245:


With this test fixed for MR and for Tez, but not for Spark, can we still 
proceed with the release? In my opinion this is not a blocker issue for Spark. 
[~kellyzly] [~szita] [~rohini] what do you think? +1 for PIG-5245-2.patch

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, PIG-5245-2.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030991#comment-16030991
 ] 

Nandor Kollar commented on PIG-5157:


Ok, thanks, then I'll uncomment that and test it. As for 
{{spark.eventLog.enabled}}, it requires {{spark.eventLog.dir}} to be defined too. I 
think for Spark 2.x we don't have to care about it, since the user can set 
these if required. Though my change removed the logger completely, and it seems 
these properties are not available for Spark 1.x. My question is: do we need this 
for Spark 1.x? If so, I'm afraid this should be included in the shims too.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030988#comment-16030988
 ] 

Adam Szita commented on PIG-5245:
-

[~kellyzly] I think that has a different mechanism compared to MR: it doesn't 
fail an existing job, it merely prevents the new one from being submitted. I think it 
would be worth rethinking the stopOnFailure feature wrt. Spark mode. Let's do 
this in PIG-5247.

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, PIG-5245-2.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5247) Investigate stopOnFailure feature with Spark execution engine

2017-05-31 Thread Adam Szita (JIRA)
Adam Szita created PIG-5247:
---

 Summary: Investigate stopOnFailure feature with Spark execution 
engine
 Key: PIG-5247
 URL: https://issues.apache.org/jira/browse/PIG-5247
 Project: Pig
  Issue Type: Improvement
Reporter: Adam Szita
 Fix For: 0.18.0






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030847#comment-16030847
 ] 

liyunzhang_intel commented on PIG-5245:
---

[~rohini]: stop_on_failure is implemented in spark mode in 
[JobGraphBuilder.java|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/spark/JobGraphBuilder.java#L193]


> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, PIG-5245-2.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5245:

Attachment: PIG-5245-2.patch

Skipped this test for Spark as stop on failure is not implemented. Also added 
these configs to YarnMiniCluster.  Running full unit tests for MR and 
Tez. Will update once done.

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch, PIG-5245-2.patch
>
>
>   The test is supposed to run two jobs in parallel, and when one fails the other 
> should be killed when stop on failure is configured. But the test is actually 
> running only one job at a time, and based on the order in which jobs are run it is 
> passing. This is because of the capacity scheduler configuration of the 
> MiniCluster. It runs only one AM at a time due to resource restrictions. In a 
> 16G node, when the first job runs it takes up 1536 (AM) + 1024 (task) 
> memory.mb. Since only 10% of cluster resource is the default for running AMs 
> and a single AM already takes up memory close to 1.6G, second job AM is not 
> launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5246) Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2

2017-05-31 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-5246:
-

 Summary: Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after 
upgrading spark to 2
 Key: PIG-5246
 URL: https://issues.apache.org/jira/browse/PIG-5246
 Project: Pig
  Issue Type: Bug
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel


In bin/pig, we copy the spark-assembly jar to Pig's classpath for spark1.6:
{code}
# For spark mode:
# Please specify SPARK_HOME first so that we can locate $SPARK_HOME/lib/spark-assembly*.jar,
# we will add spark-assembly*.jar to the classpath.
if [ "$isSparkMode"  == "true" ]; then
    if [ -z "$SPARK_HOME" ]; then
        echo "Error: SPARK_HOME is not set!"
        exit 1
    fi

    # Please specify SPARK_JAR which is the hdfs path of spark-assembly*.jar to allow YARN
    # to cache spark-assembly*.jar on nodes so that it doesn't need to be distributed each
    # time an application runs.
    if [ -z "$SPARK_JAR" ]; then
        echo "Error: SPARK_JAR is not set, SPARK_JAR stands for the hdfs location of spark-assembly*.jar. This allows YARN to cache spark-assembly*.jar on nodes so that it doesn't need to be distributed each time an application runs."
        exit 1
    fi

    if [ -n "$SPARK_HOME" ]; then
        echo "Using Spark Home: " ${SPARK_HOME}
        SPARK_ASSEMBLY_JAR=`ls ${SPARK_HOME}/lib/spark-assembly*`
        CLASSPATH=${CLASSPATH}:$SPARK_ASSEMBLY_JAR
    fi
fi

{code}
After upgrading to spark2.0 we need to modify this, since Spark 2 no longer ships a single spark-assembly jar.
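
As an illustration (not the committed patch), the Spark 2 branch could append every jar under $SPARK_HOME/jars instead, since Spark 2 ships individual jars rather than an assembly:
{code}
# Sketch only: in Spark 2 there is no spark-assembly*.jar; add all jars
# under $SPARK_HOME/jars to the classpath instead.
if [ -n "$SPARK_HOME" ] && [ -d "${SPARK_HOME}/jars" ]; then
    echo "Using Spark Home: " ${SPARK_HOME}
    for f in "${SPARK_HOME}"/jars/*.jar; do
        CLASSPATH=${CLASSPATH}:$f
    done
fi
{code}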



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030814#comment-16030814
 ] 

Adam Szita commented on PIG-5245:
-

+1 for [^PIG-5245-1.patch]

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch
>
>
>   The test is supposed to run two tests in parallel, and when one fails the 
> other should be killed, since stop on failure is configured. But the test is 
> actually running only one job at a time, and it passes based on the order in 
> which the jobs are run. This is because of the capacity scheduler 
> configuration of the MiniCluster: it runs only one AM at a time due to 
> resource restrictions. On a 16G node, when the first job runs it takes up 
> 1536 (AM) + 1024 (task) memory.mb. Since only 10% of the cluster resources is 
> the default for running AMs and a single AM already takes close to 1.6G of 
> memory, the second job's AM is not launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5245:

Status: Patch Available  (was: Open)

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch
>
>
>   The test is supposed to run two tests in parallel, and when one fails the 
> other should be killed, since stop on failure is configured. But the test is 
> actually running only one job at a time, and it passes based on the order in 
> which the jobs are run. This is because of the capacity scheduler 
> configuration of the MiniCluster: it runs only one AM at a time due to 
> resource restrictions. On a 16G node, when the first job runs it takes up 
> 1536 (AM) + 1024 (task) memory.mb. Since only 10% of the cluster resources is 
> the default for running AMs and a single AM already takes close to 1.6G of 
> memory, the second job's AM is not launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030799#comment-16030799
 ] 

liyunzhang_intel edited comment on PIG-5157 at 5/31/17 7:54 AM:


[~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason to modify it is that [~rohini] suggested memory usage becomes very high if we update the metric info in onTaskEnd() (suppose there are thousands of tasks).
In org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener for spark21, we should use code like the following.
Note: not fully tested, I cannot guarantee it is correct.
{code}
public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
    int stageId = stageCompleted.stageInfo().stageId();
    int stageAttemptId = stageCompleted.stageInfo().attemptId();
    String stageIdentifier = stageId + "_" + stageAttemptId;
    Integer jobId = stageIdToJobId.get(stageId);
    if (jobId == null) {
        LOG.warn("Cannot find job id for stage[" + stageId + "].");
    } else {
        Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
        if (jobMetrics == null) {
            jobMetrics = Maps.newHashMap();
            allJobMetrics.put(jobId, jobMetrics);
        }
        List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
        if (stageMetrics == null) {
            stageMetrics = Lists.newLinkedList();
            jobMetrics.put(stageIdentifier, stageMetrics);
        }
        stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
    }
}

public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
}
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2
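
For reference, a minimal sketch of what enabling the event log could look like on the SparkConf used to create the SparkContext (the log directory is just an example value):
{code}
// Sketch: rely on Spark's built-in event log instead of the removed JobLogger.
// Uses org.apache.spark.SparkConf; the directory below is an example only.
SparkConf sparkConf = new SparkConf()
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "hdfs:///tmp/spark-events");
{code}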



was (Author: kellyzly):
[~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason to modify it is that [~rohini] suggested memory usage becomes very high if we update the metric info in onTaskEnd() (suppose there are thousands of tasks).
In org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener for spark21, we should use code like the following.
Note: not fully tested, I cannot guarantee it is correct.
{code}
public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
    // if we update taskMetrics in onTaskEnd(), it consumes lot of memory.
    int stageId = stageCompleted.stageInfo().stageId();
    int stageAttemptId = stageCompleted.stageInfo().attemptId();
    String stageIdentifier = stageId + "_" + stageAttemptId;
    Integer jobId = stageIdToJobId.get(stageId);
    if (jobId == null) {
        LOG.warn("Cannot find job id for stage[" + stageId + "].");
    } else {
        Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
        if (jobMetrics == null) {
            jobMetrics = Maps.newHashMap();
            allJobMetrics.put(jobId, jobMetrics);
        }
        List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
        if (stageMetrics == null) {
            stageMetrics = Lists.newLinkedList();
            jobMetrics.put(stageIdentifier, stageMetrics);
        }
        stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
    }
}

public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
}
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2


> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030799#comment-16030799
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason to modify it is that [~rohini] suggested memory usage becomes very high if we update the metric info in onTaskEnd() (suppose there are thousands of tasks).
In org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener for spark21, we should use code like the following.
Note: not fully tested, I cannot guarantee it is correct.
{code}
public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
    // if we update taskMetrics in onTaskEnd(), it consumes lot of memory.
    int stageId = stageCompleted.stageInfo().stageId();
    int stageAttemptId = stageCompleted.stageInfo().attemptId();
    String stageIdentifier = stageId + "_" + stageAttemptId;
    Integer jobId = stageIdToJobId.get(stageId);
    if (jobId == null) {
        LOG.warn("Cannot find job id for stage[" + stageId + "].");
    } else {
        Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
        if (jobMetrics == null) {
            jobMetrics = Maps.newHashMap();
            allJobMetrics.put(jobId, jobMetrics);
        }
        List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
        if (stageMetrics == null) {
            stageMetrics = Lists.newLinkedList();
            jobMetrics.put(stageIdentifier, stageMetrics);
        }
        stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
    }
}

public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
}
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2


> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Issue Comment Deleted] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread hsj (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hsj updated PIG-5157:
-
Comment: was deleted

(was: [~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason to modify it is that [~rohini] suggested memory usage becomes very high if we update the metric info in onTaskEnd() (suppose there are thousands of tasks).
In org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener for spark21, we should use code like the following.
Note: not fully tested, I cannot guarantee it is correct.
{code}
public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
    // if we update taskMetrics in onTaskEnd(), it consumes lot of memory.
    int stageId = stageCompleted.stageInfo().stageId();
    int stageAttemptId = stageCompleted.stageInfo().attemptId();
    String stageIdentifier = stageId + "_" + stageAttemptId;
    Integer jobId = stageIdToJobId.get(stageId);
    if (jobId == null) {
        LOG.warn("Cannot find job id for stage[" + stageId + "].");
    } else {
        Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
        if (jobMetrics == null) {
            jobMetrics = Maps.newHashMap();
            allJobMetrics.put(jobId, jobMetrics);
        }
        List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
        if (stageMetrics == null) {
            stageMetrics = Lists.newLinkedList();
            jobMetrics.put(stageIdentifier, stageMetrics);
        }
        stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
    }
}

public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
}
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2
)

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread hsj (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030795#comment-16030795
 ] 

hsj commented on PIG-5157:
--

[~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason to modify it is that [~rohini] suggested memory usage becomes very high if we update the metric info in onTaskEnd() (suppose there are thousands of tasks).
In org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener for spark21, we should use code like the following.
Note: not fully tested, I cannot guarantee it is correct.
{code}
public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
    // if we update taskMetrics in onTaskEnd(), it consumes lot of memory.
    int stageId = stageCompleted.stageInfo().stageId();
    int stageAttemptId = stageCompleted.stageInfo().attemptId();
    String stageIdentifier = stageId + "_" + stageAttemptId;
    Integer jobId = stageIdToJobId.get(stageId);
    if (jobId == null) {
        LOG.warn("Cannot find job id for stage[" + stageId + "].");
    } else {
        Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
        if (jobMetrics == null) {
            jobMetrics = Maps.newHashMap();
            allJobMetrics.put(jobId, jobMetrics);
        }
        List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
        if (stageMetrics == null) {
            stageMetrics = Lists.newLinkedList();
            jobMetrics.put(stageIdentifier, stageMetrics);
        }
        stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
    }
}

public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
}
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2


> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5245:

Attachment: PIG-5245-1.patch

Reduced the overall resource and heap usage of the AM and task containers for 
mapreduce mode, and raised the share of cluster resources (node RAM) that AM 
containers may use to 20%. Will look into doing the same for YARNMiniCluster 
(tez and spark) in 0.18.
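
For illustration, the kinds of settings involved look roughly like this (the property names are standard Hadoop/YARN ones; the values are examples, not necessarily the exact ones in the patch):
{code}
// Sketch: shrink MR container sizes and raise the AM share so the MiniCluster
// can run two AMs at once. Uses org.apache.hadoop.conf.Configuration.
Configuration conf = new Configuration();
conf.setInt("yarn.app.mapreduce.am.resource.mb", 512);
conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx384m");
conf.setInt("mapreduce.map.memory.mb", 512);
conf.set("mapreduce.map.java.opts", "-Xmx384m");
conf.setInt("mapreduce.reduce.memory.mb", 512);
conf.set("mapreduce.reduce.java.opts", "-Xmx384m");
// allow AMs to use 20% of cluster resources instead of the 10% default
conf.setFloat("yarn.scheduler.capacity.maximum-am-resource-percent", 0.2f);
{code}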

> TestGrunt.testStopOnFailure is flaky
> 
>
> Key: PIG-5245
> URL: https://issues.apache.org/jira/browse/PIG-5245
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-5245-1.patch
>
>
>   The test is supposed to run two tests in parallel, and when one fails the 
> other should be killed, since stop on failure is configured. But the test is 
> actually running only one job at a time, and it passes based on the order in 
> which the jobs are run. This is because of the capacity scheduler 
> configuration of the MiniCluster: it runs only one AM at a time due to 
> resource restrictions. On a 16G node, when the first job runs it takes up 
> 1536 (AM) + 1024 (task) memory.mb. Since only 10% of the cluster resources is 
> the default for running AMs and a single AM already takes close to 1.6G of 
> memory, the second job's AM is not launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5245) TestGrunt.testStopOnFailure is flaky

2017-05-31 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-5245:
---

 Summary: TestGrunt.testStopOnFailure is flaky
 Key: PIG-5245
 URL: https://issues.apache.org/jira/browse/PIG-5245
 Project: Pig
  Issue Type: Bug
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.17.0


  The test is supposed to run two tests in parallel, and when one fails the 
other should be killed, since stop on failure is configured. But the test is 
actually running only one job at a time, and it passes based on the order in 
which the jobs are run. This is because of the capacity scheduler 
configuration of the MiniCluster: it runs only one AM at a time due to 
resource restrictions. On a 16G node, when the first job runs it takes up 
1536 (AM) + 1024 (task) memory.mb. Since only 10% of the cluster resources is 
the default for running AMs and a single AM already takes close to 1.6G of 
memory, the second job's AM is not launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)