Guaranteed processing orders of each batch in Spark Streaming

2015-10-18 Thread Renjie Liu
Hi, all:
I've read the source code, and it seems there is no guarantee on the
processing order of each batch's RDD, since jobs are simply submitted to a
thread pool. I believe this is quite important in streaming, since updates
should be applied in order.
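
For concreteness, here is a minimal sketch of the kind of ordered, stateful
update I have in mind (the source, checkpoint path and batch interval are just
placeholders). As far as I can tell from JobScheduler, the size of that thread
pool is spark.streaming.concurrentJobs, which defaults to 1:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OrderedUpdates {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ordered-updates")
      // Size of the job thread pool discussed above; the default is 1.
      .set("spark.streaming.concurrentJobs", "1")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/checkpoint")                     // required by updateStateByKey

    val words = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val counts = words
      .map(word => (word, 1L))
      // Each batch folds its counts into the running state; if batches were
      // processed out of order, these updates would be applied out of order too.
      .updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + values.sum))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}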


Spark SQL: what does an exclamation mark mean in the plan?

2015-10-18 Thread Xiao Li
Hi, all,

After turning on trace logging, I saw a strange exclamation mark in the
intermediate plans. This happens in the Catalyst analyzer.

Join Inner, Some((col1#0 = col1#6))
 Project [col1#0,col2#1,col3#2,col2_alias#24,col3#2 AS col3_alias#13]
  Project [col1#0,col2#1,col3#2,col2#1 AS col2_alias#24]
   LogicalRDD [col1#0,col2#1,col3#2], MapPartitionsRDD[1] at createDataFrame at SimpleApp.scala:32
 Aggregate [col1#6], [col1#6,count(col1#6) AS count(col1)#5L]
  *!Project [col1#6,col2#7,col3#8,col2_alias#24,col3#8 AS col3_alias#4]*
   Project [col1#6,col2#7,col3#8,col2#7 AS col2_alias#3]
    LogicalRDD [col1#6,col2#7,col3#8], MapPartitionsRDD[1] at createDataFrame at SimpleApp.scala:32

Could anybody give me a hint as to why there is a ! (exclamation mark) before
the node name (Project)? The ! mark does not disappear in the subsequent
query plans.
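
In case it helps, here is a rough reconstruction of the query shape implied by
the plan above (column names come from the plan; the case class, data and
sqlContext are guesses, since SimpleApp.scala isn't shown, so this may not
reproduce the ! exactly):

import org.apache.spark.sql.functions.{col, count}

case class Rec(col1: Int, col2: String, col3: String)

val df = sqlContext.createDataFrame(Seq(Rec(1, "a", "x"), Rec(2, "b", "y")))
  .withColumn("col2_alias", col("col2"))   // Project [..., col2 AS col2_alias]
  .withColumn("col3_alias", col("col3"))   // Project [..., col3 AS col3_alias]

val agg    = df.groupBy("col1").agg(count("col1"))     // Aggregate [col1], [col1, count(col1)]
val joined = df.join(agg, df("col1") === agg("col1"))  // Join Inner, (col1#0 = col1#6)
joined.explain(true)                                   // prints the analyzed plan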

Thank you!

Xiao Li


Re: PMML export for LinearRegressionModel

2015-10-18 Thread Fazlan Nazeem
Hi Joseph,

That's great. It would also be useful if Spark extended PMML support to
models that are not currently exportable, e.g.:

   - Decision Tree
   - Random Forest
   - Naive Bayes
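
For reference, the export that exists today lives on the spark.mllib models via
the PMMLExportable trait (which is also what the confusion further down this
thread turned out to be about). A rough sketch, assuming Spark 1.4+ and an
existing SparkContext sc:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Train the spark.mllib model; the spark.ml LinearRegressionModel does not
// currently mix in PMMLExportable.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(2.0, Vectors.dense(2.0, 4.0))))
val model = LinearRegressionWithSGD.train(training, 10)   // 10 iterations

model.toPMML()                    // PMML document as a String
model.toPMML("/tmp/lr.pmml")      // write to a local file (path is illustrative)
model.toPMML(sc, "/tmp/lr-pmml")  // or to a distributed path via the SparkContext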


On Sun, Oct 18, 2015 at 2:55 AM, Joseph Bradley 
wrote:

> Thanks for bringing this up!  We need to add PMML export methods to the
> spark.ml API.  I just made a JIRA for tracking that:
> https://issues.apache.org/jira/browse/SPARK-11171
>
> Joseph
>
> On Thu, Oct 15, 2015 at 2:58 AM, Fazlan Nazeem  wrote:
>
>> OK, it turns out I was using the wrong LinearRegressionModel: the one in
>> the org.apache.spark.ml.regression package.
>>
>>
>>
>> On Thu, Oct 15, 2015 at 3:23 PM, Fazlan Nazeem  wrote:
>>
>>> This is the API doc for LinearRegressionModel. It does not implement
>>> PMMLExportable
>>>
>>> https://spark.apache.org/docs/latest/api/java/index.html
>>>
>>> On Thu, Oct 15, 2015 at 3:11 PM, canan chen  wrote:
>>>
 The method toPMML is in trait PMMLExportable

 *LinearRegressionModel mixes in this trait, so you should be able to call*
 *LinearRegressionModel#toPMML.*

 On Thu, Oct 15, 2015 at 5:25 PM, Fazlan Nazeem 
 wrote:

> Hi
>
> I am trying to export a LinearRegressionModel in PMML format.
> According to the following resource[1] PMML export is supported for
> LinearRegressionModel.
>
> [1] https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
>
> But there is *no* *toPMML* method in the *LinearRegressionModel* class,
> although LogisticRegressionModel, RidgeRegressionModel, SVMModel, etc. have a
> toPMML method.
>
> Can someone explain what the issue is here?
>
> --
> Thanks & Regards,
>
> Fazlan Nazeem
>
> *Software Engineer*
>
> *WSO2 Inc*
> Mobile : +94772338839
> fazl...@wso2.com
>


>>>
>>>
>>> --
>>> Thanks & Regards,
>>>
>>> Fazlan Nazeem
>>>
>>> *Software Engineer*
>>>
>>> *WSO2 Inc*
>>> Mobile : +94772338839
>>> fazl...@wso2.com
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>>
>> Fazlan Nazeem
>>
>> *Software Engineer*
>>
>> *WSO2 Inc*
>> Mobile : +94772338839
>> fazl...@wso2.com
>>
>
>


-- 
Thanks & Regards,

Fazlan Nazeem

*Software Engineer*

*WSO2 Inc*
Mobile : +94772338839
fazl...@wso2.com


RE: ShuffledHashJoin Possible Issue

2015-10-18 Thread Cheng, Hao
Hi Gsvic, can you please provide detailed code / steps to reproduce that?

Hao

-Original Message-
From: gsvic [mailto:victora...@gmail.com] 
Sent: Monday, October 19, 2015 3:55 AM
To: dev@spark.apache.org
Subject: ShuffledHashJoin Possible Issue

I am doing some experiments with join algorithms in SparkSQL and I am facing 
the following issue:

I have constructed two "dummy" JSON tables, t1.json and t2.json. Each of them
has two columns, ID and Value. ID is an incremental (unique) integer and Value
is a random value. I am running an equi-join query on the ID attribute.
With the SortMergeJoin and BroadcastHashJoin algorithms the returned result is
correct, but with ShuffledHashJoin the count aggregate always returns zero. The
correct result is the row count of t2, since t2.ID is a subset of t1.ID.

The query is *t1.join(t2).where(t1("ID").equalTo(t2("ID")))*










Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-18 Thread Robert Dodier
Nicholas,

FWIW, the --ip option seems to have been deprecated in commit d90d2af1, but
that was a pretty big commit with lots of other changes, and there isn't any
hint in the log message as to why --ip was changed.

best,

Robert Dodier




Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-18 Thread Nicholas Chammas
Good catches, Robert.

I had actually typed up a draft email a couple of days ago citing those same
two blocks of code. I deleted it when I realized, like you, that the snippets
did not explain why IP addresses weren't working.

Something seems wrong here, but I’m not sure what exactly. Maybe this is a
documentation bug or missing deprecation warning.

For example, this line seems to back up your finding that SPARK_MASTER_IP is
deprecated (since the --ip option it maps to is deprecated), yet no warning is
displayed and the docs make no mention of the deprecation:

  private def printUsageAndExit(exitCode: Int) {
    // scalastyle:off println
    System.err.println(
      "Usage: Master [options]\n" +
      "\n" +
      "Options:\n" +
      "  -i HOST, --ip HOST     Hostname to listen on (deprecated, please use --host or -h)\n" +
      "  -h HOST, --host HOST   Hostname to listen on\n" +
      "  -p PORT, --port PORT   Port to listen on (default: 7077)\n" +
      "  --webui-port PORT      Port for web UI (default: 8080)\n" +
      "  --properties-file FILE Path to a custom Spark properties file.\n" +
      "                         Default is conf/spark-defaults.conf.")
    // scalastyle:on println
    System.exit(exitCode)
  }

Hopefully someone with better knowledge of this code can explain what’s
going on.

I’m beginning to think SPARK_MASTER_IP is straight-up deprecated in favor of
SPARK_MASTER_HOST, but a warning actually needs to be raised in the code, and
the template and docs need to be updated.

This code hasn’t been touched much since Spark’s genesis, so I’m not
expecting anyone to know off the top of their head whether this is wrong or
right. Perhaps I should just open a PR and take it from there.

Nick

On Sat, Oct 17, 2015 at 11:21 PM Robert Dodier 
wrote:

> Nicholas Chammas wrote
> > The funny thing is that Spark seems to accept this only if the value of
> > SPARK_MASTER_IP is a DNS name and not an IP address.
> >
> > When I provide an IP address, I get errors in the log when starting the
> > master:
> >
> > 15/10/15 01:47:31 ERROR NettyTransport: failed to bind to
> > /54.210.XX.XX:7077, shutting down Netty transport
>
> A couple of things. (1) That log message appears to originate at line 434 of
> NettyTransport.scala
> (https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/transport/netty/NettyTransport.scala).
> It appears the exception is rethrown; is it caught somewhere else so we can
> see what the actual error was that triggered the log message? I don't see
> anything obvious in the code.
>
> (2) sbin/start-master.sh executes something.Master with --ip
> SPARK_MASTER_IP, which calls something.MasterArguments to handle its
> arguments, which says:
>
>   case ("--ip" | "-i") :: value :: tail =>
> Utils.checkHost(value, "ip no longer supported, please use hostname
> " + value)
> host = value
> parse(tail)
>
>   case ("--host" | "-h") :: value :: tail =>
> Utils.checkHost(value, "Please use hostname " + value)
> host = value
> parse(tail)
>
> So it would appear that the intent is that numerical IP addresses are
> disallowed, however, Utils.checkHost says:
>
> def checkHost(host: String, message: String = "") {
>   assert(host.indexOf(':') == -1, message)
> }
>
> which accepts numerical IP addresses just fine. Is there some other test
> that should be applied in MasterArguments? or maybe checkHost should be
> looking for some other pattern? Is it possible that MasterArguments was
> changed to disallow --ip without propagating that backwards into any
> scripts
> that call it?
>
> Hope this helps in some way.
>
> Robert Dodier
>
>
>
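
As a footnote, here is a self-contained sketch of the check quoted above (a
stand-in for the real Utils.checkHost, not the actual object): it only rejects
values that contain a colon, so a bare IPv4 address passes just like a hostname
does.

// Approximation of the quoted Utils.checkHost, for illustration only.
def checkHost(host: String, message: String = ""): Unit = {
  assert(host.indexOf(':') == -1, message)
}

checkHost("spark-master.example.com")   // passes, as expected
checkHost("54.210.10.20")               // also passes: no ':' in a bare IPv4 address
checkHost("54.210.10.20:7077",
  "ip no longer supported, please use hostname")   // only this form trips the assert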


ShuffledHashJoin Possible Issue

2015-10-18 Thread gsvic
I am doing some experiments with join algorithms in SparkSQL and I am facing
the following issue:

I have constructed two "dummy" JSON tables, t1.json and t2.json. Each of them
has two columns, ID and Value. ID is an incremental (unique) integer and Value
is a random value. I am running an equi-join query on the ID attribute.
With the SortMergeJoin and BroadcastHashJoin algorithms the returned result is
correct, but with ShuffledHashJoin the count aggregate always returns zero. The
correct result is the row count of t2, since t2.ID is a subset of t1.ID.

The query is *t1.join(t2).where(t1("ID").equalTo(t2("ID")))*
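
For completeness, a rough sketch of a reproduction (file paths are
placeholders, and the two setConf calls reflect my assumption about which 1.5
settings steer the planner away from SortMergeJoin and broadcast joins and
toward ShuffledHashJoin):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Assumed knobs for forcing ShuffledHashJoin on 1.5; values are illustrative.
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

val t1 = sqlContext.read.json("/path/to/t1.json")   // columns: ID, Value
val t2 = sqlContext.read.json("/path/to/t2.json")   // t2.ID is a subset of t1.ID

val joined = t1.join(t2).where(t1("ID").equalTo(t2("ID")))
joined.explain()         // check which physical join operator was chosen
println(joined.count())  // expected: t2.count(); observed: 0 with ShuffledHashJoin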









streaming test failure

2015-10-18 Thread Ted Yu
When I ran the following command on Linux with the latest master branch:
~/apache-maven-3.3.3/bin/mvn clean -Phive -Phive-thriftserver -Pyarn
-Phadoop-2.4 -Dhadoop.version=2.7.0 package

I saw some test failures:
http://pastebin.com/1VYZYy5K

Has anyone seen similar test failures before?

Thanks


test failed due to OOME

2015-10-18 Thread Ted Yu
From
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3846/console :

SparkListenerSuite:
- basic creation and shutdown of LiveListenerBus
- bus.stop() waits for the event queue to completely drain
- basic creation of StageInfo
- basic creation of StageInfo with shuffle
- StageInfo with fewer tasks than partitions
- local metrics
- onTaskGettingResult() called when result fetched remotely *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
  stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0
  (TID 0, localhost): java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:182)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:52)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:49)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


Should more heap be given to the test suite?


Cheers


Re: Streaming and storing to Google Cloud Storage or S3

2015-10-18 Thread Steve Loughran

> On 18 Oct 2015, at 03:23, vonnagy  wrote:
> 
> Has anyone tried to go from streaming directly to GCS or S3 and overcome the
> unacceptable performance? It can never keep up.

The problem here is that they aren't really filesystems (certainly not S3 via
the s3n & s3a clients): flush() is a no-op, and it's only on close() that a
bulk upload happens. For bonus fun, anything that does a rename() usually
forces a download/re-upload of the source files.
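
To make that concrete, here is a minimal sketch of the pattern in question
(bucket, source and batch interval are placeholders), with comments marking
where those close()/rename() costs land:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("s3-streaming-sink")
val ssc = new StreamingContext(conf, Seconds(10))

val events = ssc.socketTextStream("localhost", 9999)   // stand-in source

// Each batch writes a fresh directory of part files. With s3n/s3a nothing is
// uploaded until each part file's stream is close()d, and the commit rename of
// the temporary attempt files behaves like a copy, so the write path can
// easily fall behind the batch interval.
events.saveAsTextFiles("s3n://my-bucket/streaming/events")

ssc.start()
ssc.awaitTermination()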




Re: Build spark 1.5.1 branch fails

2015-10-18 Thread Steve Loughran

On 18 Oct 2015, at 11:09, Sean Owen <so...@cloudera.com> wrote:


These are still too low, I think. Try a 4g heap and 1g permgen. That's what the
error tells you, right?

On Sat, Oct 17, 2015, 10:58 PM Chester Chen <ches...@alpinenow.com> wrote:
Yes, I have tried MAVEN_OPTS with

-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m

-Xmx4g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m

-Xmx2g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=512m

None of them works. All failed with the same error.

thanks



FWIW, here's mine. The headless option is a relic of the Apple JDK 6 days that
I could probably cut now.

$ echo $MAVEN_OPTS
-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m -Xms256m 
-Djava.awt.headless=true



Re: Build spark 1.5.1 branch fails

2015-10-18 Thread Sean Owen
These are still too low, I think. Try a 4g heap and 1g permgen. That's what
the error tells you, right?

On Sat, Oct 17, 2015, 10:58 PM Chester Chen  wrote:

> Yes, I have tried MAVEN_OPTS with
>
> -Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
>
> -Xmx4g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
>
> -Xmx2g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=512m
>
> None of them works. All failed with the same error.
>
> thanks
>
>
>
>
>
>
> On Sat, Oct 17, 2015 at 2:44 PM, Ted Yu  wrote:
>
>> Have you set MAVEN_OPTS with the following ?
>> -Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
>>
>> Cheers
>>
>> On Sat, Oct 17, 2015 at 2:35 PM, Chester Chen 
>> wrote:
>>
>>> I was using JDK 1.7, and the Maven version is the same as in the pom file.
>>>
>>> ᚛ |(v1.5.1)|$ java -version
>>> java version "1.7.0_51"
>>> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
>>>
>>> Using build/sbt still fails the same way with -Denforcer.skip; with the
>>> mvn build, it fails with:
>>>
>>>
>>> [ERROR] PermGen space -> [Help 1]
>>> [ERROR]
>>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>>> -e switch.
>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging
>>>
>>> I am giving up on this. Just using 1.5.2-SNAPSHOT for now.
>>>
>>> Chester
>>>
>>>
>>> On Mon, Oct 12, 2015 at 12:05 AM, Xiao Li  wrote:
>>>
 Hi, Chester,

 Please check your pom.xml. Your java.version and maven.version might
 not match your build environment.

 Or use -Denforcer.skip=true from the command line to skip it.

 Good luck,

 Xiao Li

 2015-10-08 10:35 GMT-07:00 Chester Chen :

> Question regarding the branch-1.5 build.
>
> I noticed that the Spark project no longer publishes the spark-assembly jar,
> so we have to build it ourselves (until we find a way to not depend on the
> assembly jar).
>
>
> I checked out the v1.5.1 release tag and, building it with sbt, I get the
> following error:
>
> build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
> -Phive-thriftserver -DskipTests clean package assembly
>
>
> [warn] ::
> [warn] ::  UNRESOLVED DEPENDENCIES ::
> [warn] ::
> [warn] :: org.apache.spark#spark-network-common_2.10;1.5.1:
> configuration not public in
> org.apache.spark#spark-network-common_2.10;1.5.1: 'test'. It was required
> from org.apache.spark#spark-network-shuffle_2.10;1.5.1 test
> [warn] ::
> [warn]
> [warn] Note: Unresolved dependencies path:
> [warn] org.apache.spark:spark-network-common_2.10:1.5.1
> ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
> [warn]  +- org.apache.spark:spark-network-shuffle_2.10:1.5.1
> [info] Packaging
> /Users/chester/projects/alpine/apache/spark/launcher/target/scala-2.10/spark-launcher_2.10-1.5.1.jar
> ...
> [info] Done packaging.
> [warn] four warnings found
> [warn] Note: Some input files use unchecked or unsafe operations.
> [warn] Note: Recompile with -Xlint:unchecked for details.
> [warn] No main class detected
> [info] Packaging
> /Users/chester/projects/alpine/apache/spark/external/flume-sink/target/scala-2.10/spark-streaming-flume-sink_2.10-1.5.1.jar
> ...
> [info] Done packaging.
> sbt.ResolveException: unresolved dependency:
> org.apache.spark#spark-network-common_2.10;1.5.1: configuration not public
> in org.apache.spark#spark-network-common_2.10;1.5.1: 'test'. It was
> required from org.apache.spark#spark-network-shuffle_2.10;1.5.1 test
>
>
> Somehow the network-shuffle module can't find the test jar it needs (not
> sure why the test configuration is still needed, even though -DskipTests is
> specified).
>
> I tried the Maven command as well; the build failed (without the assembly):
>
> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
> -Phive-thriftserver -DskipTests clean package
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce
> (enforce-versions) on project spark-parent_2.10: Some Enforcer rules have
> failed. Look above for specific messages explaining why the rule failed. 
> ->
> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with
> the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>
>
>
> I checkout the branch-1.5 and replaced "1.5.2-SNAPSHOT" with "1.5.1"
> and build/sbt will