Powered By Spark

2018-12-22 Thread Ascot Moss
Hi,

We use Apache Spark for many use cases. Can anyone advise how to add our
entry to "http://spark.apache.org/powered-by.html"?

Thanks and Happy Holidays!


WARN HIVE: Failed to access metastore. This class should not accessed in runtime

2017-08-14 Thread Ascot Moss
Hi,

I got the following error when starting the Spark Thrift Server:

WARN HIVE: Failed to access metastore. This class should not accessed in
runtime.
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I have already copied hive-site.xml to $SPARK_HOME/conf/.

How can I solve it?
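
(For context, this kind of SessionHiveMetaStoreClient instantiation failure can
indicate that the metastore connection settings are not being picked up. One
thing worth double-checking is that hive.metastore.uris in the copied
hive-site.xml points at the running metastore; it can also be passed explicitly
when starting the server. The host and port below are only placeholders:

$SPARK_HOME/sbin/start-thriftserver.sh \
  --hiveconf hive.metastore.uris=thrift://metastore-host:9083
)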

Regards


ThriftServer on HTTPS

2017-08-12 Thread Ascot Moss
Hi,

I have the Spark ThriftServer up and running over HTTP. Where can I find the
steps to set up the Spark ThriftServer over HTTPS?
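
(For reference, one common approach, assuming a JKS keystore is already
available, is to enable SSL on the HiveServer2 Thrift endpoint via --hiveconf
options when starting the server; the keystore path and password below are
placeholders:

$SPARK_HOME/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.use.SSL=true \
  --hiveconf hive.server2.keystore.path=/path/to/keystore.jks \
  --hiveconf hive.server2.keystore.password=changeit

The same properties can also be set in hive-site.xml.)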

Regards


Re: ERROR transport.TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

2017-08-12 Thread Ascot Moss
I fixed the issue (the keytab file lacked the required permissions); please ignore.
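
(For anyone hitting the same error: a quick check along these lines confirms
the keytab is readable by the user running the Thrift Server and that a ticket
can be obtained from it. The path and principal below are placeholders for
your own values:

ls -l /etc/security/keytabs/spark.service.keytab
chown spark:hadoop /etc/security/keytabs/spark.service.keytab
chmod 400 /etc/security/keytabs/spark.service.keytab
kinit -kt /etc/security/keytabs/spark.service.keytab spark/host.example.com@EXAMPLE.COM
klist
)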

On Sun, Aug 13, 2017 at 9:42 AM, Ascot Moss <ascot.m...@gmail.com> wrote:

> Hi,
>
> Spark: 2.1.0
> Hive: 2.1.1
>
> When starting thrift server, I got the following error:
>
> How can I fix it?
>
> Regards
>
>
>
> (error log)
> 17/08/13 09:28:17 DEBUG client.IsolatedClientLoader: shared class:
> java.lang.NoSuchFieldError
> 17/08/13 09:28:17 DEBUG client.IsolatedClientLoader: shared class:
> org.apache.hadoop.security.SecurityUtil
> 17/08/13 09:28:17 DEBUG client.IsolatedClientLoader: shared class:
> org.apache.hadoop.security.SaslRpcServer
> 17/08/13 09:28:17 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.thrift.transport.TSaslTransportException -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/libthrift-0.9.2.jar!/org/apache/thrift/transport/
> TSaslTransportException.class
> 17/08/13 09:28:17 DEBUG client.IsolatedClientLoader: hive class:
> javax.security.sasl.Sasl - jar:file:/opt/jdk1.8.0_102/
> jre/lib/rt.jar!/javax/security/sasl/Sasl.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.thrift.transport.TMemoryInputTransport -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/libthrift-0.9.2.jar!/org/apache/thrift/transport/
> TMemoryInputTransport.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.thrift.TByteArrayOutputStream - jar:file:/home/spark-2.1.0/
> assembly/target/scala-2.11/jars/libthrift-0.9.2.jar!/org/apache/thrift/
> TByteArrayOutputStream.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.thrift.transport.TSaslTransport$SaslParticipant -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/libthrift-0.9.2.jar!/org/apache/thrift/transport/TSaslTransport$
> SaslParticipant.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.thrift.protocol.TProtocolException -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/libthrift-0.9.2.jar!/org/apache/thrift/protocol/
> TProtocolException.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.thrift.protocol.TStruct - jar:file:/home/spark-2.1.0/
> assembly/target/scala-2.11/jars/libthrift-0.9.2.jar!/org/
> apache/thrift/protocol/TStruct.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/hive-metastore-1.2.1.spark2.jar!/org/apache/hadoop/
> hive/metastore/api/ThriftHiveMetastore$Client.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> com.facebook.fb303.FacebookService$Client - jar:file:/home/spark-2.1.0/
> assembly/target/scala-2.11/jars/libfb303-0.9.2.jar!/com/facebook/fb303/
> FacebookService$Client.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.thrift.TServiceClient - jar:file:/home/spark-2.1.0/
> assembly/target/scala-2.11/jars/libthrift-0.9.2.jar!/org/
> apache/thrift/TServiceClient.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.hadoop.hive.metastore.api.InvalidObjectException -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/hive-metastore-1.2.1.spark2.jar!/org/apache/hadoop/
> hive/metastore/api/InvalidObjectException.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.hadoop.hive.metastore.api.UnknownTableException -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/hive-metastore-1.2.1.spark2.jar!/org/apache/hadoop/
> hive/metastore/api/UnknownTableException.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.hadoop.hive.metastore.api.UnknownDBException -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/hive-metastore-1.2.1.spark2.jar!/org/apache/hadoop/
> hive/metastore/api/UnknownDBException.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.hadoop.hive.metastore.api.ConfigValSecurityException -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/hive-metastore-1.2.1.spark2.jar!/org/apache/hadoop/
> hive/metastore/api/ConfigValSecurityException.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.hadoop.hive.metastore.api.UnknownPartitionException -
> jar:file:/home/spark-2.1.0/assembly/target/scala-2.11/
> jars/hive-metastore-1.2.1.spark2.jar!/org/apache/hadoop/
> hive/metastore/api/UnknownPartitionException.class
> 17/08/13 09:28:18 DEBUG client.IsolatedClientLoader: hive class:
> org.apache.hadoop.hi

ThriftServer Start Error

2017-08-12 Thread Ascot Moss
I tried to start the Spark Thrift Server but got the following error:


javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided (Mechanism level:
Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
upInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnection
Failure(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(
ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTran
slatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
ssorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
thodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMeth
od(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(Ret
ryInvocationHandler.java:102)
at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(
DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(
DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSyst
emLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(D
istributedFileSystem.java:1317)

Spark: v2.1.0
Hive: 2.1.1

How can I resolve it?

Regards


ThriftServer Start Error

2017-08-11 Thread Ascot Moss
Hi

I tried to start the Spark Thrift Server but got the following error:


javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided (Mechanism level:
Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(
UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(
Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.
invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslat
orPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(
RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(
RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.
doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.
doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(
DistributedFileSystem.java:1317)

Spark: v2.1.0
Hive: 2.1.1

How can I resolve it?

Regards


Spark ThriftServer Error

2017-08-11 Thread Ascot Moss
Hi,

When starting the Thrift Server, I got the following issue:


17/08/11 16:06:56 ERROR util.Utils: Uncaught exception in thread Thread-3

java.lang.NullPointerException

at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$$anonfun$main$1.apply$mcV$sp(HiveThriftServer2.scala:85)

at
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)

at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)

at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)

at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)

at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)

at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)

at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)

at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)

at scala.util.Try$.apply(Try.scala:192)

at
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)

at
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)

at
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

17/08/11 16:06:56 INFO util.ShutdownHookManager: Shutdown hook called

17/08/11 16:06:56 INFO util.ShutdownHookManager: Deleting directory
/tmp/spark-1c1974f2-5838-42e9-bd0f-d5ba698f3634

17/08/11 16:06:56 INFO util.ShutdownHookManager: Deleting directory
/tmp/spark-a4308e96-6b00-4412-b6cc-eff165488a46

spark version: 2.1.0
hive: 2.1.1 with Kerberos

How can I resolve it?

Regards


Spark Thriftserver ERROR

2017-08-11 Thread Ascot Moss
Hi

I tried to start the Spark Thrift Server but got the following error:


javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided (Mechanism level:
Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)

Spark: v2.1.0
Hive: 2.1.1

How can I resolve it?

Regards


Re: Spark 2.0 Build Failed

2016-07-29 Thread Ascot Moss
I think my Maven installation is broken; I used another node in the cluster to
compile 2.0.0 and the build was successful.


[INFO]

[INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @
java8-tests_2.11 ---

[INFO]

[INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @
java8-tests_2.11 ---

[INFO]


[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... SUCCESS [
1.972 s]

[INFO] Spark Project Tags . SUCCESS [
3.280 s]

[INFO] Spark Project Sketch ... SUCCESS [
5.449 s]

[INFO] Spark Project Networking ... SUCCESS [
5.197 s]

[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
3.816 s]

[INFO] Spark Project Unsafe ... SUCCESS [
8.275 s]

[INFO] Spark Project Launcher . SUCCESS [
5.229 s]

[INFO] Spark Project Core . SUCCESS [01:41
min]

[INFO] Spark Project GraphX ... SUCCESS [
25.111 s]

[INFO] Spark Project Streaming  SUCCESS [
54.290 s]

[INFO] Spark Project Catalyst . SUCCESS [01:15
min]

[INFO] Spark Project SQL .. SUCCESS [01:55
min]

[INFO] Spark Project ML Local Library . SUCCESS [
16.132 s]

[INFO] Spark Project ML Library ... SUCCESS [01:32
min]

[INFO] Spark Project Tools  SUCCESS [
5.136 s]

[INFO] Spark Project Hive . SUCCESS [
53.472 s]

[INFO] Spark Project REPL . SUCCESS [
11.716 s]

[INFO] Spark Project YARN Shuffle Service . SUCCESS [
4.102 s]

[INFO] Spark Project YARN . SUCCESS [
26.685 s]

[INFO] Spark Project Hive Thrift Server ... SUCCESS [
23.611 s]

[INFO] Spark Project Assembly . SUCCESS [
1.342 s]

[INFO] Spark Project External Flume Sink .. SUCCESS [
9.630 s]

[INFO] Spark Project External Flume ... SUCCESS [
15.323 s]

[INFO] Spark Project External Flume Assembly .. SUCCESS [
1.434 s]

[INFO] Spark Integration for Kafka 0.8  SUCCESS [
20.958 s]

[INFO] Spark Project Examples . SUCCESS [
22.080 s]

[INFO] Spark Project External Kafka Assembly .. SUCCESS [
2.421 s]

[INFO] Spark Integration for Kafka 0.10 ... SUCCESS [
16.255 s]

[INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
2.578 s]

[INFO] Spark Project Java 8 Tests . SUCCESS [
5.573 s]

[INFO]


[INFO] BUILD SUCCESS

[INFO]


[INFO] Total time: 12:16 min

[INFO] Finished at: 2016-07-29T15:58:41+08:00

[INFO] Final Memory: 127M/4210M

[INFO]






I tried reinstalling Maven 3.3.9 and deleted .m2, but I still got the following error.

[ERROR] Plugin org.scalastyle:scalastyle-maven-plugin:0.8.0 or one of its
dependencies could not be resolved: Failed to read artifact descriptor for
org.scalastyle:scalastyle-maven-plugin:jar:0.8.0: Could not transfer
artifact org.scalastyle:scalastyle-maven-plugin:pom:0.8.0 from/to central (
https://repo1.maven.org/maven2): sun.security.validator.ValidatorException:
PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target -> [Help 1]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions,
please read the following articles:

[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
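
(The PKIX failure means the JVM running Maven does not trust the certificate
presented for repo1.maven.org, which commonly happens behind an
SSL-intercepting proxy or firewall. One possible workaround, assuming the
proxy's CA certificate is available, is to import it into the JDK truststore;
the alias, certificate path and keystore password below are placeholders:

keytool -importcert -trustcacerts -alias corp-proxy-ca \
  -file /path/to/proxy-ca.crt \
  -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit
)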

On Fri, Jul 29, 2016 at 1:46 PM, Ascot Moss <ascot.m...@gmail.com> wrote:

> I just run
>
> wget https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom,
> can get it without issue.
>
> On Fri, Jul 29, 2016 at 1:44 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
>> Hi thanks!
>>
>> mvn dependency:tree
>>
>> [INFO] Scanning for projects...
>>
>> Downloading:
>> https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom
>>
>> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
>>
>> [FATAL] Non-resolvable parent POM for
>> org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
>> org.apache:apach

Re: Spark 2.0 Build Failed

2016-07-28 Thread Ascot Moss
I just ran

wget https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom

and can get it without issue.

On Fri, Jul 29, 2016 at 1:44 PM, Ascot Moss <ascot.m...@gmail.com> wrote:

> Hi thanks!
>
> mvn dependency:tree
>
> [INFO] Scanning for projects...
>
> Downloading:
> https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom
>
> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
>
> [FATAL] Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
> sun.security.validator.ValidatorException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target and 'parent.relativePath'
> points at wrong local POM @ line 22, column 11
>
>  @
>
> [ERROR] The build could not read 1 project -> [Help 1]
>
> [ERROR]
>
> [ERROR]   The project org.apache.spark:spark-parent_2.11:2.0.0
> (/edh_all_sources/edh_2.6.0/spark-2.0.0/pom.xml) has 1 error
>
> [ERROR] Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
> sun.security.validator.ValidatorException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target and 'parent.relativePath'
> points at wrong local POM @ line 22, column 11 -> [Help 2]
>
> [ERROR]
>
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
>
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>
> [ERROR]
>
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
>
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
>
> [ERROR] [Help 2]
> http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
>
>
>
>
>
>
> On Fri, Jul 29, 2016 at 1:34 PM, Dong Meng <mengdong0...@gmail.com> wrote:
>
>> Before build, first do a "mvn dependency:tree" to make sure the
>> dependency is right
>>
>> On Thu, Jul 28, 2016 at 10:18 PM, Ascot Moss <ascot.m...@gmail.com>
>> wrote:
>>
>>> Thanks for your reply.
>>>
>>> Is there a way to find the correct Hadoop profile name?
>>>
>>> On Fri, Jul 29, 2016 at 7:06 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> You have at least two problems here: wrong Hadoop profile name, and
>>>> some kind of firewall interrupting access to the Maven repo. It's not
>>>> related to Spark.
>>>>
>>>> On Thu, Jul 28, 2016 at 4:04 PM, Ascot Moss <ascot.m...@gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > I tried to build spark,
>>>> >
>>>> > (try 1)
>>>> > mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
>>>> > -Phive-thriftserver -DskipTests clean package
>>>> >
>>>> > [INFO] Spark Project Parent POM ... FAILURE
>>>> [  0.658
>>>> > s]
>>>> >
>>>> > [INFO] Spark Project Tags . SKIPPED
>>>> >
>>>> > [INFO] Spark Project Sketch ... SKIPPED
>>>> >
>>>> > [INFO] Spark Project Networking ... SKIPPED
>>>> >
>>>> > [INFO] Spark Project Shuffle Streaming Service  SKIPPED
>>>> >
>>>> > [INFO] Spark Project Unsafe ... SKIPPED
>>>> >
>>>> > [INFO] Spark Project Launcher . SKIPPED
>>>> >
>>>> > [INFO] Spark Project Core . SKIPPED
>>>> >
>>>> > [INFO] Spark Project GraphX ... SKIPPED
>>>> >
>>>> > [INFO] Spark Project Streaming  SKIPPED
>>>> >
>>>> > [INFO] Spark Project Catalyst . SKIPPED
>>>> >
>>>> > [INFO] Spark Project SQL .. SKIPPED
>>>> >
>>>> > [INFO] Spark Project ML Local Library . SKIPPED
>>>>

Re: Spark 2.0 Build Failed

2016-07-28 Thread Ascot Moss
Hi thanks!

mvn dependency:tree

[INFO] Scanning for projects...

Downloading:
https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom

[ERROR] [ERROR] Some problems were encountered while processing the POMs:

[FATAL] Non-resolvable parent POM for
org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target and 'parent.relativePath'
points at wrong local POM @ line 22, column 11

 @

[ERROR] The build could not read 1 project -> [Help 1]

[ERROR]

[ERROR]   The project org.apache.spark:spark-parent_2.11:2.0.0
(/edh_all_sources/edh_2.6.0/spark-2.0.0/pom.xml) has 1 error

[ERROR] Non-resolvable parent POM for
org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target and 'parent.relativePath'
points at wrong local POM @ line 22, column 11 -> [Help 2]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions,
please read the following articles:

[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException

[ERROR] [Help 2]
http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException






On Fri, Jul 29, 2016 at 1:34 PM, Dong Meng <mengdong0...@gmail.com> wrote:

> Before build, first do a "mvn dependency:tree" to make sure the
> dependency is right
>
> On Thu, Jul 28, 2016 at 10:18 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
>> Thanks for your reply.
>>
>> Is there a way to find the correct Hadoop profile name?
>>
>> On Fri, Jul 29, 2016 at 7:06 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> You have at least two problems here: wrong Hadoop profile name, and
>>> some kind of firewall interrupting access to the Maven repo. It's not
>>> related to Spark.
>>>
>>> On Thu, Jul 28, 2016 at 4:04 PM, Ascot Moss <ascot.m...@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I tried to build spark,
>>> >
>>> > (try 1)
>>> > mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
>>> > -Phive-thriftserver -DskipTests clean package
>>> >
>>> > [INFO] Spark Project Parent POM ... FAILURE [
>>> 0.658
>>> > s]
>>> >
>>> > [INFO] Spark Project Tags . SKIPPED
>>> >
>>> > [INFO] Spark Project Sketch ... SKIPPED
>>> >
>>> > [INFO] Spark Project Networking ... SKIPPED
>>> >
>>> > [INFO] Spark Project Shuffle Streaming Service  SKIPPED
>>> >
>>> > [INFO] Spark Project Unsafe ... SKIPPED
>>> >
>>> > [INFO] Spark Project Launcher . SKIPPED
>>> >
>>> > [INFO] Spark Project Core . SKIPPED
>>> >
>>> > [INFO] Spark Project GraphX ... SKIPPED
>>> >
>>> > [INFO] Spark Project Streaming  SKIPPED
>>> >
>>> > [INFO] Spark Project Catalyst . SKIPPED
>>> >
>>> > [INFO] Spark Project SQL .. SKIPPED
>>> >
>>> > [INFO] Spark Project ML Local Library . SKIPPED
>>> >
>>> > [INFO] Spark Project ML Library ... SKIPPED
>>> >
>>> > [INFO] Spark Project Tools  SKIPPED
>>> >
>>> > [INFO] Spark Project Hive . SKIPPED
>>> >
>>> > [INFO] Spark Project REPL . SKIPPED
>>> >
>>> > [INFO] Spark Project YARN Shuffle Service . SKIPPED
>>> >
>>> > [INFO] Spark Project YARN . SKIPPED
>>> >
>>> > [INFO] Spark Project Hive Thrift Server .

Re: Spark 2.0 Build Failed

2016-07-28 Thread Ascot Moss
Thanks for your reply.

Is there a way to find the correct Hadoop profile name?
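
(For what it's worth, the Hadoop profiles in the Spark 2.0 pom are named by
minor version, e.g. hadoop-2.6 or hadoop-2.7, with the exact patch level passed
via -Dhadoop.version, so a command along these lines should activate an
existing profile, in line with Sean's reply quoted below:

./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 \
  -Phive -Phive-thriftserver -DskipTests clean package
)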

On Fri, Jul 29, 2016 at 7:06 AM, Sean Owen <so...@cloudera.com> wrote:

> You have at least two problems here: wrong Hadoop profile name, and
> some kind of firewall interrupting access to the Maven repo. It's not
> related to Spark.
>
> On Thu, Jul 28, 2016 at 4:04 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
> > Hi,
> >
> > I tried to build spark,
> >
> > (try 1)
> > mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
> > -Phive-thriftserver -DskipTests clean package
> >
> > [INFO] Spark Project Parent POM ... FAILURE [
> 0.658
> > s]
> >
> > [INFO] Spark Project Tags . SKIPPED
> >
> > [INFO] Spark Project Sketch ... SKIPPED
> >
> > [INFO] Spark Project Networking ... SKIPPED
> >
> > [INFO] Spark Project Shuffle Streaming Service  SKIPPED
> >
> > [INFO] Spark Project Unsafe ... SKIPPED
> >
> > [INFO] Spark Project Launcher . SKIPPED
> >
> > [INFO] Spark Project Core . SKIPPED
> >
> > [INFO] Spark Project GraphX ... SKIPPED
> >
> > [INFO] Spark Project Streaming  SKIPPED
> >
> > [INFO] Spark Project Catalyst . SKIPPED
> >
> > [INFO] Spark Project SQL .. SKIPPED
> >
> > [INFO] Spark Project ML Local Library . SKIPPED
> >
> > [INFO] Spark Project ML Library ... SKIPPED
> >
> > [INFO] Spark Project Tools  SKIPPED
> >
> > [INFO] Spark Project Hive . SKIPPED
> >
> > [INFO] Spark Project REPL . SKIPPED
> >
> > [INFO] Spark Project YARN Shuffle Service . SKIPPED
> >
> > [INFO] Spark Project YARN . SKIPPED
> >
> > [INFO] Spark Project Hive Thrift Server ... SKIPPED
> >
> > [INFO] Spark Project Assembly . SKIPPED
> >
> > [INFO] Spark Project External Flume Sink .. SKIPPED
> >
> > [INFO] Spark Project External Flume ... SKIPPED
> >
> > [INFO] Spark Project External Flume Assembly .. SKIPPED
> >
> > [INFO] Spark Integration for Kafka 0.8  SKIPPED
> >
> > [INFO] Spark Project Examples . SKIPPED
> >
> > [INFO] Spark Project External Kafka Assembly .. SKIPPED
> >
> > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> >
> > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> >
> > [INFO] Spark Project Java 8 Tests . SKIPPED
> >
> > [INFO]
> > 
> >
> > [INFO] BUILD FAILURE
> >
> > [INFO]
> > 
> >
> > [INFO] Total time: 1.090 s
> >
> > [INFO] Finished at: 2016-07-29T07:01:57+08:00
> >
> > [INFO] Final Memory: 30M/605M
> >
> > [INFO]
> > 
> >
> > [WARNING] The requested profile "hadoop-2.7.0" could not be activated
> > because it does not exist.
> >
> > [ERROR] Plugin org.apache.maven.plugins:maven-site-plugin:3.3 or one of
> its
> > dependencies could not be resolved: Failed to read artifact descriptor
> for
> > org.apache.maven.plugins:maven-site-plugin:jar:3.3: Could not transfer
> > artifact org.apache.maven.plugins:maven-site-plugin:pom:3.3 from/to
> central
> > (https://repo1.maven.org/maven2):
> sun.security.validator.ValidatorException:
> > PKIX path building failed:
> > sun.security.provider.certpath.SunCertPathBuilderException: unable to
> find
> > valid certification path to requested target -> [Help 1]
> >
> > [ERROR]
> >
> > [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e
> > switch.
> >
> > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> >
> > [ERROR]
> >
> > [ERROR] For more information about the errors and possible solutions,
> please

Spark 2.0 Build Failed

2016-07-28 Thread Ascot Moss
Hi,

I tried to build spark,

(try 1)
mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
-Phive-thriftserver -DskipTests clean package

[INFO] Spark Project Parent POM ... FAILURE [
0.658 s]

[INFO] Spark Project Tags . SKIPPED

[INFO] Spark Project Sketch ... SKIPPED

[INFO] Spark Project Networking ... SKIPPED

[INFO] Spark Project Shuffle Streaming Service  SKIPPED

[INFO] Spark Project Unsafe ... SKIPPED

[INFO] Spark Project Launcher . SKIPPED

[INFO] Spark Project Core . SKIPPED

[INFO] Spark Project GraphX ... SKIPPED

[INFO] Spark Project Streaming  SKIPPED

[INFO] Spark Project Catalyst . SKIPPED

[INFO] Spark Project SQL .. SKIPPED

[INFO] Spark Project ML Local Library . SKIPPED

[INFO] Spark Project ML Library ... SKIPPED

[INFO] Spark Project Tools  SKIPPED

[INFO] Spark Project Hive . SKIPPED

[INFO] Spark Project REPL . SKIPPED

[INFO] Spark Project YARN Shuffle Service . SKIPPED

[INFO] Spark Project YARN . SKIPPED

[INFO] Spark Project Hive Thrift Server ... SKIPPED

[INFO] Spark Project Assembly . SKIPPED

[INFO] Spark Project External Flume Sink .. SKIPPED

[INFO] Spark Project External Flume ... SKIPPED

[INFO] Spark Project External Flume Assembly .. SKIPPED

[INFO] Spark Integration for Kafka 0.8  SKIPPED

[INFO] Spark Project Examples . SKIPPED

[INFO] Spark Project External Kafka Assembly .. SKIPPED

[INFO] Spark Integration for Kafka 0.10 ... SKIPPED

[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED

[INFO] Spark Project Java 8 Tests . SKIPPED

[INFO]


[INFO] BUILD FAILURE

[INFO]


[INFO] Total time: 1.090 s

[INFO] Finished at: 2016-07-29T07:01:57+08:00

[INFO] Final Memory: 30M/605M

[INFO]


[WARNING] The requested profile "hadoop-2.7.0" could not be activated
because it does not exist.

[ERROR] Plugin org.apache.maven.plugins:maven-site-plugin:3.3 or one of its
dependencies could not be resolved: Failed to read artifact descriptor for
org.apache.maven.plugins:maven-site-plugin:jar:3.3: Could not transfer
artifact org.apache.maven.plugins:maven-site-plugin:pom:3.3 from/to central
(https://repo1.maven.org/maven2):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target -> [Help 1]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions,
please read the following articles:

[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException


(try 2)

./build/mvn -DskipTests clean package

[INFO]


[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... FAILURE [
0.653 s]

[INFO] Spark Project Tags . SKIPPED

[INFO] Spark Project Sketch ... SKIPPED

[INFO] Spark Project Networking ... SKIPPED

[INFO] Spark Project Shuffle Streaming Service  SKIPPED

[INFO] Spark Project Unsafe ... SKIPPED

[INFO] Spark Project Launcher . SKIPPED

[INFO] Spark Project Core . SKIPPED

[INFO] Spark Project GraphX ... SKIPPED

[INFO] Spark Project Streaming  SKIPPED

[INFO] Spark Project Catalyst . SKIPPED

[INFO] Spark Project SQL .. SKIPPED

[INFO] Spark Project ML Local Library . SKIPPED

[INFO] Spark Project ML Library ... SKIPPED

[INFO] Spark Project Tools  SKIPPED

[INFO] Spark Project Hive . SKIPPED

[INFO] Spark Project REPL . SKIPPED

[INFO] Spark Project Assembly 

Re: saveAsTextFile at treeEnsembleModels.scala:447, took 2.513396 s Killed

2016-07-28 Thread Ascot Moss
Hi,

Thanks for your reply.

Permissions (access) are not an issue in my case; this issue only happened
when the bigger input file was used to generate the model, i.e. with smaller
inputs everything worked well. It seems to me that ".save" cannot save a big
file.

Q1: Any idea about the size limit that ".save" can handle?
Q2: Any idea how to check the size of the model that will be saved via ".save"?

Regards



On Thu, Jul 28, 2016 at 4:19 PM, Spico Florin <spicoflo...@gmail.com> wrote:

> Hi!
>   There are many reasons that your task is failed. One could be that you
> don't have proper permissions (access) to  hdfs with your user. Please
> check your user rights to write in hdfs. Please have a look also :
>
> http://stackoverflow.com/questions/27427042/spark-unable-to-save-in-hadoop-permission-denied-for-user
> I hope it helps.
>  Florin
>
>
> On Thu, Jul 28, 2016 at 3:49 AM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> Please help!
>>
>> When saving the model, I got following error and cannot save the model to
>> hdfs:
>>
>> (my source code, my spark is v1.6.2)
>> my_model.save(sc, "/my_model")
>>
>> -
>> 16/07/28 08:36:19 INFO TaskSchedulerImpl: Removed TaskSet 69.0, whose
>> tasks have all completed, from pool
>>
>> 16/07/28 08:36:19 INFO DAGScheduler: ResultStage 69 (saveAsTextFile at
>> treeEnsembleModels.scala:447) finished in 0.901 s
>>
>> 16/07/28 08:36:19 INFO DAGScheduler: Job 38 finished: saveAsTextFile at
>> treeEnsembleModels.scala:447, took 2.513396 s
>>
>> Killed
>> -
>>
>>
>> Q1: Is there any limitation on saveAsTextFile?
>> Q2: or where to find the error log file location?
>>
>> Regards
>>
>>
>>
>>
>>
>


A question about Spark Cluster vs Local Mode

2016-07-27 Thread Ascot Moss
Hi,

If I submit the same job to Spark in cluster mode, does that mean it will run
in the cluster's memory pool and fail if it runs out of the cluster's memory?

--driver-memory 64g \

--executor-memory 16g \
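
(For reference, these flags would sit in a full submit command along these
lines; the master, deploy mode, class name and jar are placeholders:

spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 64g --executor-memory 16g \
  --class com.example.MyJob my-job.jar
)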

Regards


DecisionTree currently only supports maxDepth <= 30

2016-07-27 Thread Ascot Moss
Hi,

Is there any reason behind the limit of maxDepth <= 30? Can it go deeper?


Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: DecisionTree currently only supports maxDepth <= 30, but was given
maxDepth = 50.

at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:169)


Regards


saveAsTextFile at treeEnsembleModels.scala:447, took 2.513396 s Killed

2016-07-27 Thread Ascot Moss
Hi,

Please help!

When saving the model, I got following error and cannot save the model to
hdfs:

(my source code, my spark is v1.6.2)
my_model.save(sc, "/my_model")

-
16/07/28 08:36:19 INFO TaskSchedulerImpl: Removed TaskSet 69.0, whose tasks
have all completed, from pool

16/07/28 08:36:19 INFO DAGScheduler: ResultStage 69 (saveAsTextFile at
treeEnsembleModels.scala:447) finished in 0.901 s

16/07/28 08:36:19 INFO DAGScheduler: Job 38 finished: saveAsTextFile at
treeEnsembleModels.scala:447, took 2.513396 s

Killed
-


Q1: Is there any limitation on saveAsTextFile?
Q2: or where to find the error log file location?

Regards


Re: DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651, took 19.556700 s Killed

2016-07-26 Thread Ascot Moss
It is a YARN cluster:

/bin/spark-submit \

--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails"
\

--driver-memory 64G \

--executor-memory 16g \


On Tue, Jul 26, 2016 at 7:00 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> What's the cluster manager? Is this YARN perhaps? Do you have any
> other apps on the cluster? How do you submit your app? What are the
> properties?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jul 26, 2016 at 1:27 AM, Ascot Moss <ascot.m...@gmail.com> wrote:
> > Hi,
> >
> > spark: 1.6.1
> > java: java 1.8_u40
> > I tried random forest training phase, the same code works well if with 20
> > trees (lower accuracy, about 68%).  When trying the training phase with
> more
> > tree, I set to 200 trees, it returned:
> >
> > "DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651,
> took
> > 19.556700 s Killed" .  There is no WARN or ERROR from console, the task
> is
> > just stopped in the end.
> >
> > Any idea how to resolve it? Should the timeout parameter be set to longer
> >
> > regards
> >
> >
> > (below is the log from console)
> >
> > 16/07/26 00:02:47 INFO DAGScheduler: looking for newly runnable stages
> >
> > 16/07/26 00:02:47 INFO DAGScheduler: running: Set()
> >
> > 16/07/26 00:02:47 INFO DAGScheduler: waiting: Set(ResultStage 32)
> >
> > 16/07/26 00:02:47 INFO DAGScheduler: failed: Set()
> >
> > 16/07/26 00:02:47 INFO DAGScheduler: Submitting ResultStage 32
> > (MapPartitionsRDD[75] at map at DecisionTree.scala:642), which has no
> > missing parents
> >
> > 16/07/26 00:02:47 INFO MemoryStore: Block broadcast_48 stored as values
> in
> > memory (estimated size 2.2 MB, free 18.2 MB)
> >
> > 16/07/26 00:02:47 INFO MemoryStore: Block broadcast_48_piece0 stored as
> > bytes in memory (estimated size 436.9 KB, free 18.7 MB)
> >
> > 16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
> memory
> > on x.x.x.x:35450 (size: 436.9 KB, free: 45.8 GB)
> >
> > 16/07/26 00:02:47 INFO SparkContext: Created broadcast 48 from broadcast
> at
> > DAGScheduler.scala:1006
> >
> > 16/07/26 00:02:47 INFO DAGScheduler: Submitting 4 missing tasks from
> > ResultStage 32 (MapPartitionsRDD[75] at map at DecisionTree.scala:642)
> >
> > 16/07/26 00:02:47 INFO TaskSchedulerImpl: Adding task set 32.0 with 4
> tasks
> >
> > 16/07/26 00:02:47 INFO TaskSetManager: Starting task 0.0 in stage 32.0
> (TID
> > 185, x.x.x.x, partition 0,NODE_LOCAL, 1956 bytes)
> >
> > 16/07/26 00:02:47 INFO TaskSetManager: Starting task 1.0 in stage 32.0
> (TID
> > 186, x.x.x.x, partition 1,NODE_LOCAL, 1956 bytes)
> >
> > 16/07/26 00:02:47 INFO TaskSetManager: Starting task 2.0 in stage 32.0
> (TID
> > 187, x.x.x.x, partition 2,NODE_LOCAL, 1956 bytes)
> >
> > 16/07/26 00:02:47 INFO TaskSetManager: Starting task 3.0 in stage 32.0
> (TID
> > 188, x.x.x.x, partition 3,NODE_LOCAL, 1956 bytes)
> >
> > 16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
> memory
> > on x.x.x.x:58784 (size: 436.9 KB, free: 5.1 GB)
> >
> > 16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> > output locations for shuffle 12 to x.x.x.x:44434
> >
> > 16/07/26 00:02:47 INFO MapOutputTrackerMaster: Size of output statuses
> for
> > shuffle 12 is 180 bytes
> >
> > 16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
> memory
> > on x.x.x.x:46186 (size: 436.9 KB, free: 2.2 GB)
> >
> > 16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
> memory
> > on x.x.x.x:50132 (size: 436.9 KB, free: 5.0 GB)
> >
> > 16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> > output locations for shuffle 12 to x.x.x.x:47272
> >
> > 16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> > output locations for shuffle 12 to x.x.x.x:46802
> >
> > 16/07/26 00:02:49 INFO TaskSetManager: Finished task 2.0 in stage 32.0
> (TID
> > 187) in 2265 ms on x.x.x.x (1/4)
> >
> > 16/07/26 00:02:49 INFO TaskSetManager: Finished task 1.0 in stage 32.0
> (TID
> > 186) in 2266 ms on x.x.x.x (2/4)
> >
> > 16/07/26 00:02:50 INFO TaskSetManager: Finished task 0.0 in stage 32.0
> (TID
> > 185) in 2

Re: DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651, took 19.556700 s Killed

2016-07-26 Thread Ascot Moss
any ideas?

On Tuesday, July 26, 2016, Ascot Moss <ascot.m...@gmail.com> wrote:

> Hi,
>
> spark: 1.6.1
> java: java 1.8_u40
> I tried random forest training phase, the same code works well if with 20
> trees (lower accuracy, about 68%).  When trying the training phase with
> more tree, I set to 200 trees, it returned:
>
> "DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651,
> took 19.556700 s Killed" .  There is no WARN or ERROR from console, the
> task is just stopped in the end.
>
> Any idea how to resolve it? Should the timeout parameter be set to longer
>
> regards
>
>
> (below is the log from console)
>
> 16/07/26 00:02:47 INFO DAGScheduler: looking for newly runnable stages
>
> 16/07/26 00:02:47 INFO DAGScheduler: running: Set()
>
> 16/07/26 00:02:47 INFO DAGScheduler: waiting: Set(ResultStage 32)
>
> 16/07/26 00:02:47 INFO DAGScheduler: failed: Set()
>
> 16/07/26 00:02:47 INFO DAGScheduler: Submitting ResultStage 32
> (MapPartitionsRDD[75] at map at DecisionTree.scala:642), which has no
> missing parents
>
> 16/07/26 00:02:47 INFO MemoryStore: Block broadcast_48 stored as values in
> memory (estimated size 2.2 MB, free 18.2 MB)
>
> 16/07/26 00:02:47 INFO MemoryStore: Block broadcast_48_piece0 stored as
> bytes in memory (estimated size 436.9 KB, free 18.7 MB)
>
> 16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
> memory on x.x.x.x:35450 (size: 436.9 KB, free: 45.8 GB)
>
> 16/07/26 00:02:47 INFO SparkContext: Created broadcast 48 from broadcast
> at DAGScheduler.scala:1006
>
> 16/07/26 00:02:47 INFO DAGScheduler: Submitting 4 missing tasks from
> ResultStage 32 (MapPartitionsRDD[75] at map at DecisionTree.scala:642)
>
> 16/07/26 00:02:47 INFO TaskSchedulerImpl: Adding task set 32.0 with 4 tasks
>
> 16/07/26 00:02:47 INFO TaskSetManager: Starting task 0.0 in stage 32.0
> (TID 185, x.x.x.x, partition 0,NODE_LOCAL, 1956 bytes)
>
> 16/07/26 00:02:47 INFO TaskSetManager: Starting task 1.0 in stage 32.0
> (TID 186, x.x.x.x, partition 1,NODE_LOCAL, 1956 bytes)
>
> 16/07/26 00:02:47 INFO TaskSetManager: Starting task 2.0 in stage 32.0
> (TID 187, x.x.x.x, partition 2,NODE_LOCAL, 1956 bytes)
>
> 16/07/26 00:02:47 INFO TaskSetManager: Starting task 3.0 in stage 32.0
> (TID 188, x.x.x.x, partition 3,NODE_LOCAL, 1956 bytes)
>
> 16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
> memory on x.x.x.x:58784 (size: 436.9 KB, free: 5.1 GB)
>
> 16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 12 to x.x.x.x:44434
>
> 16/07/26 00:02:47 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 12 is 180 bytes
>
> 16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
> memory on x.x.x.x:46186 (size: 436.9 KB, free: 2.2 GB)
>
> 16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
> memory on x.x.x.x:50132 (size: 436.9 KB, free: 5.0 GB)
>
> 16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 12 to x.x.x.x:47272
>
> 16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 12 to x.x.x.x:46802
>
> 16/07/26 00:02:49 INFO TaskSetManager: Finished task 2.0 in stage 32.0
> (TID 187) in 2265 ms on x.x.x.x (1/4)
>
> 16/07/26 00:02:49 INFO TaskSetManager: Finished task 1.0 in stage 32.0
> (TID 186) in 2266 ms on x.x.x.x (2/4)
>
> 16/07/26 00:02:50 INFO TaskSetManager: Finished task 0.0 in stage 32.0
> (TID 185) in 2794 ms on x.x.x.x (3/4)
>
> 16/07/26 00:02:50 INFO TaskSetManager: Finished task 3.0 in stage 32.0
> (TID 188) in 3738 ms on x.x.x.x (4/4)
>
> 16/07/26 00:02:50 INFO TaskSchedulerImpl: Removed TaskSet 32.0, whose
> tasks have all completed, from pool
>
> 16/07/26 00:02:50 INFO DAGScheduler: ResultStage 32 (collectAsMap at
> DecisionTree.scala:651) finished in 3.738 s
>
> 16/07/26 00:02:50 INFO DAGScheduler: Job 19 finished: collectAsMap at
> DecisionTree.scala:651, took 19.493917 s
>
> 16/07/26 00:02:51 INFO MemoryStore: Block broadcast_49 stored as values in
> memory (estimated size 1053.9 KB, free 19.7 MB)
>
> 16/07/26 00:02:52 INFO MemoryStore: Block broadcast_49_piece0 stored as
> bytes in memory (estimated size 626.7 KB, free 20.3 MB)
>
> 16/07/26 00:02:52 INFO BlockManagerInfo: Added broadcast_49_piece0 in
> memory on x.x.x.x:35450 (size: 626.7 KB, free: 45.8 GB)
>
> 16/07/26 00:02:52 INFO SparkContext: Created broadcast 49 from broadcast
> at DecisionTree.scala:601
>
> 16/07/26 00:02:52 INFO SparkContext: Starting job: collectAsMap at
> DecisionTree.scala:651
>
> 16/07/26 00:02:52 INFO DAGSchedul

DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651, took 19.556700 s Killed

2016-07-25 Thread Ascot Moss
Hi,

spark: 1.6.1
java: java 1.8_u40
I tried the random forest training phase; the same code works well with 20
trees (lower accuracy, about 68%). When I tried the training phase with more
trees (I set it to 200), it returned:

"DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651,
took 19.556700 s Killed". There is no WARN or ERROR on the console; the
task just stops in the end.

Any idea how to resolve it? Should the timeout parameter be set to a longer value?

regards


(below is the log from console)

16/07/26 00:02:47 INFO DAGScheduler: looking for newly runnable stages

16/07/26 00:02:47 INFO DAGScheduler: running: Set()

16/07/26 00:02:47 INFO DAGScheduler: waiting: Set(ResultStage 32)

16/07/26 00:02:47 INFO DAGScheduler: failed: Set()

16/07/26 00:02:47 INFO DAGScheduler: Submitting ResultStage 32
(MapPartitionsRDD[75] at map at DecisionTree.scala:642), which has no
missing parents

16/07/26 00:02:47 INFO MemoryStore: Block broadcast_48 stored as values in
memory (estimated size 2.2 MB, free 18.2 MB)

16/07/26 00:02:47 INFO MemoryStore: Block broadcast_48_piece0 stored as
bytes in memory (estimated size 436.9 KB, free 18.7 MB)

16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
memory on x.x.x.x:35450 (size: 436.9 KB, free: 45.8 GB)

16/07/26 00:02:47 INFO SparkContext: Created broadcast 48 from broadcast at
DAGScheduler.scala:1006

16/07/26 00:02:47 INFO DAGScheduler: Submitting 4 missing tasks from
ResultStage 32 (MapPartitionsRDD[75] at map at DecisionTree.scala:642)

16/07/26 00:02:47 INFO TaskSchedulerImpl: Adding task set 32.0 with 4 tasks

16/07/26 00:02:47 INFO TaskSetManager: Starting task 0.0 in stage 32.0 (TID
185, x.x.x.x, partition 0,NODE_LOCAL, 1956 bytes)

16/07/26 00:02:47 INFO TaskSetManager: Starting task 1.0 in stage 32.0 (TID
186, x.x.x.x, partition 1,NODE_LOCAL, 1956 bytes)

16/07/26 00:02:47 INFO TaskSetManager: Starting task 2.0 in stage 32.0 (TID
187, x.x.x.x, partition 2,NODE_LOCAL, 1956 bytes)

16/07/26 00:02:47 INFO TaskSetManager: Starting task 3.0 in stage 32.0 (TID
188, x.x.x.x, partition 3,NODE_LOCAL, 1956 bytes)

16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
memory on x.x.x.x:58784 (size: 436.9 KB, free: 5.1 GB)

16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
output locations for shuffle 12 to x.x.x.x:44434

16/07/26 00:02:47 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 12 is 180 bytes

16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
memory on x.x.x.x:46186 (size: 436.9 KB, free: 2.2 GB)

16/07/26 00:02:47 INFO BlockManagerInfo: Added broadcast_48_piece0 in
memory on x.x.x.x:50132 (size: 436.9 KB, free: 5.0 GB)

16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
output locations for shuffle 12 to x.x.x.x:47272

16/07/26 00:02:47 INFO MapOutputTrackerMasterEndpoint: Asked to send map
output locations for shuffle 12 to x.x.x.x:46802

16/07/26 00:02:49 INFO TaskSetManager: Finished task 2.0 in stage 32.0 (TID
187) in 2265 ms on x.x.x.x (1/4)

16/07/26 00:02:49 INFO TaskSetManager: Finished task 1.0 in stage 32.0 (TID
186) in 2266 ms on x.x.x.x (2/4)

16/07/26 00:02:50 INFO TaskSetManager: Finished task 0.0 in stage 32.0 (TID
185) in 2794 ms on x.x.x.x (3/4)

16/07/26 00:02:50 INFO TaskSetManager: Finished task 3.0 in stage 32.0 (TID
188) in 3738 ms on x.x.x.x (4/4)

16/07/26 00:02:50 INFO TaskSchedulerImpl: Removed TaskSet 32.0, whose tasks
have all completed, from pool

16/07/26 00:02:50 INFO DAGScheduler: ResultStage 32 (collectAsMap at
DecisionTree.scala:651) finished in 3.738 s

16/07/26 00:02:50 INFO DAGScheduler: Job 19 finished: collectAsMap at
DecisionTree.scala:651, took 19.493917 s

16/07/26 00:02:51 INFO MemoryStore: Block broadcast_49 stored as values in
memory (estimated size 1053.9 KB, free 19.7 MB)

16/07/26 00:02:52 INFO MemoryStore: Block broadcast_49_piece0 stored as
bytes in memory (estimated size 626.7 KB, free 20.3 MB)

16/07/26 00:02:52 INFO BlockManagerInfo: Added broadcast_49_piece0 in
memory on x.x.x.x:35450 (size: 626.7 KB, free: 45.8 GB)

16/07/26 00:02:52 INFO SparkContext: Created broadcast 49 from broadcast at
DecisionTree.scala:601

16/07/26 00:02:52 INFO SparkContext: Starting job: collectAsMap at
DecisionTree.scala:651

16/07/26 00:02:52 INFO DAGScheduler: Registering RDD 76 (mapPartitions at
DecisionTree.scala:622)

16/07/26 00:02:52 INFO DAGScheduler: Got job 20 (collectAsMap at
DecisionTree.scala:651) with 4 output partitions

16/07/26 00:02:52 INFO DAGScheduler: Final stage: ResultStage 34
(collectAsMap at DecisionTree.scala:651)

16/07/26 00:02:52 INFO DAGScheduler: Parents of final stage:
List(ShuffleMapStage 33)

16/07/26 00:02:52 INFO DAGScheduler: Missing parents: List(ShuffleMapStage
33)

16/07/26 00:02:52 INFO DAGScheduler: Submitting ShuffleMapStage 33
(MapPartitionsRDD[76] at mapPartitions at DecisionTree.scala:622), which
has no missing parents

16/07/26 00:02:52 INFO 

Spark 1.6.2 version displayed as 1.6.1

2016-07-24 Thread Ascot Moss
Hi,

I am trying to upgrade Spark from 1.6.1 to 1.6.2; in the 1.6.2 spark-shell, I
found the version is still displayed as 1.6.1.

Is this a minor typo/bug?

Regards



Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/


Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Ascot Moss
The data set is the training data set for random forest training, about
36,500 records; any idea how to further partition it?
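
(Along the lines of the suggestions quoted below, one thing to try is
resubmitting with Kryo serialization enabled and a higher default parallelism,
so that individual cached or shuffled blocks stay well under the 2 GB block
limit; the training RDD can also be repartitioned explicitly with
repartition(n) before training. The parallelism value here is only
illustrative, added to the existing spark-submit flags:

--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.default.parallelism=200 \
)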

On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich <and...@aehrlich.com>
wrote:

> It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235 which
> limits the size of the blocks in the file being written to disk to 2GB.
>
> If so, the solution is for you to try tuning for smaller tasks. Try
> increasing the number of partitions, or using a more space-efficient data
> structure inside the RDD, or increasing the amount of memory available to
> spark and caching the data in memory. Make sure you are using Kryo
> serialization.
>
> Andrew
>
> On Jul 23, 2016, at 9:00 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
>
> Hi,
>
> Please help!
>
> My spark: 1.6.2
> Java: java8_u40
>
> I am trying random forest training, I got " Size exceeds
> Integer.MAX_VALUE".
>
> Any idea how to resolve it?
>
>
> (the log)
> 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID
> 25)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
> at
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>
> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
> 16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25,
> localhost): java.lang.IllegalArgumentException: Size exceeds
> Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
> at
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>
> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
> Regards
>
>
>


Size exceeds Integer.MAX_VALUE

2016-07-23 Thread Ascot Moss
Hi,

Please help!

My spark: 1.6.2
Java: java8_u40

I am trying random forest training and I got "Size exceeds Integer.MAX_VALUE".

Any idea how to resolve it?


(the log)
16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID
25)
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
at
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)

at
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)

at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)

at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)
16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25,
localhost): java.lang.IllegalArgumentException: Size exceeds
Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
at
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)

at
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)

at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)

at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)


Regards
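
One common workaround (a sketch, not from this thread; trainData stands in for
whatever RDD is being cached): the error comes from memory-mapping a single
block larger than 2 GB in FileChannelImpl.map, so spreading the same data over
more partitions keeps every cached/shuffle block well under that limit.

import org.apache.spark.storage.StorageLevel

// 400 is only an illustrative partition count; choose it so that
// (total data size / number of partitions) stays comfortably below 2 GB.
val repartitioned = trainData.repartition(400)
repartitioned.persist(StorageLevel.MEMORY_AND_DISK_SER)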


Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-23 Thread Ascot Moss
I tried to add -Xloggc:./jvm_gc.log

--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:./jvm_gc.log -XX:+PrintGCDateStamps"

However, I could not find ./jvm_gc.log.

How can I resolve the OOM and the GC log issue?

Regards

On Sun, Jul 24, 2016 at 6:37 AM, Ascot Moss <ascot.m...@gmail.com> wrote:

> My JDK is Java 1.8 u40
>
> On Sun, Jul 24, 2016 at 3:45 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Since you specified +PrintGCDetails, you should be able to get some more
>> detail from the GC log.
>>
>> Also, which JDK version are you using ?
>>
>> Please use Java 8 where G1GC is more reliable.
>>
>> On Sat, Jul 23, 2016 at 10:38 AM, Ascot Moss <ascot.m...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I added the following parameter:
>>>
>>> --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
>>> -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>>> -XX:InitiatingHeapOccupancyPercent=70 -XX:+PrintGCDetails
>>> -XX:+PrintGCTimeStamps"
>>>
>>> Still got Java heap space error.
>>>
>>> Any idea to resolve?  (my spark is 1.6.1)
>>>
>>>
>>> 16/07/23 23:31:50 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID
>>> 22, n1791): java.lang.OutOfMemoryError: Java heap space   at
>>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
>>>
>>> at
>>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
>>>
>>> at
>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:248)
>>>
>>> at
>>> org.apache.spark.util.collection.CompactBuffer.toArray(CompactBuffer.scala:30)
>>>
>>> at
>>> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findSplits$1(DecisionTree.scala:1009)
>>> at
>>> org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)
>>>
>>> at
>>> org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)
>>>
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>
>>> at
>>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>
>>> at
>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>>
>>> at
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>>
>>> at
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>>
>>> at 
>>> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>>
>>> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>>
>>> at
>>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>>
>>> at
>>> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>>
>>> at
>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>>
>>> at
>>> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>>
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
>>>
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
>>>
>>>     at
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>>>
>>> at
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>>>
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>
>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Regards
>>>
>>>
>>
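
A likely explanation for the missing ./jvm_gc.log (a guess, not confirmed in
the thread): spark.executor.extraJavaOptions applies to the executor JVMs, so a
relative -Xloggc path resolves against each executor's working directory on the
worker nodes (in standalone mode, under the worker's work/<app-id>/<executor-id>/
directory), not against the directory where spark-submit was run. A sketch with
an absolute path instead (the /tmp location is only an example):

--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/tmp/spark_executor_gc.log"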

Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-23 Thread Ascot Moss
My JDK is Java 1.8 u40

On Sun, Jul 24, 2016 at 3:45 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Since you specified +PrintGCDetails, you should be able to get some more
> detail from the GC log.
>
> Also, which JDK version are you using ?
>
> Please use Java 8 where G1GC is more reliable.
>
> On Sat, Jul 23, 2016 at 10:38 AM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
>> Hi,
>>
>> I added the following parameter:
>>
>> --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
>> -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> -XX:InitiatingHeapOccupancyPercent=70 -XX:+PrintGCDetails
>> -XX:+PrintGCTimeStamps"
>>
>> Still got Java heap space error.
>>
>> Any idea to resolve?  (my spark is 1.6.1)
>>
>>
>> 16/07/23 23:31:50 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID
>> 22, n1791): java.lang.OutOfMemoryError: Java heap space   at
>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
>>
>> at
>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
>>
>> at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:248)
>>
>> at
>> org.apache.spark.util.collection.CompactBuffer.toArray(CompactBuffer.scala:30)
>>
>> at
>> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findSplits$1(DecisionTree.scala:1009)
>> at
>> org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)
>>
>> at
>> org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)
>>
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>
>> at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>
>> at 
>> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>
>> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>
>> at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>
>> at
>> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>
>> at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>
>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>
>> at
>> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
>>
>> at
>> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
>>
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>>
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>>
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Regards
>>
>>
>>
>> On Sat, Jul 23, 2016 at 9:49 AM, Ascot Moss <ascot.m...@gmail.com> wrote:
>>
>>> Thanks. Trying with extra conf now.
>>>
>>> On Sat, Jul 23, 2016 at 6:59 AM, RK Aduri <rkad...@collectivei.com>
>>> wrote:
>>>
>>>> I can see a large number of collections happening on the driver and
>>>> eventually the driver runs out of memory (I am not sure whether you have
>>>> persisted any RDD or data frame). You may want to avoid doing so
>>>> many collections, or avoid persisting unwanted data in memory.
>>>>
>>>> To begin with, you may want to re-run the job with this following
>>>> config: --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
>>>> -XX:+PrintGCDetails -XX:+P

Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-23 Thread Ascot Moss
Hi,

I added the following parameter:

--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
-XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
-XX:InitiatingHeapOccupancyPercent=70 -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps"

I still got a Java heap space error.

Any idea how to resolve this? (my Spark is 1.6.1)


16/07/23 23:31:50 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID 22,
n1791): java.lang.OutOfMemoryError: Java heap space   at
scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)

at
scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)

at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:248)

at
org.apache.spark.util.collection.CompactBuffer.toArray(CompactBuffer.scala:30)

at
org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findSplits$1(DecisionTree.scala:1009)
at
org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)

at
org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

at scala.collection.AbstractIterator.to(Iterator.scala:1157)

at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)

at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:89)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Regards



On Sat, Jul 23, 2016 at 9:49 AM, Ascot Moss <ascot.m...@gmail.com> wrote:

> Thanks. Trying with extra conf now.
>
> On Sat, Jul 23, 2016 at 6:59 AM, RK Aduri <rkad...@collectivei.com> wrote:
>
>> I can see a large number of collections happening on the driver and eventually
>> the driver runs out of memory (I am not sure whether you have persisted
>> any RDD or data frame). You may want to avoid doing so many
>> collections, or avoid persisting unwanted data in memory.
>>
>> To begin with, you may want to re-run the job with this following config: 
>> --conf
>> "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails
>> -XX:+PrintGCTimeStamps” —> and this will give you an idea of how you are
>> getting OOM.
>>
>>
>> On Jul 22, 2016, at 3:52 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>>
>> Hi
>>
>> Please help!
>>
>>  When running random forest training phase in cluster mode, I got GC
>> overhead limit exceeded.
>>
>> I have used two parameters when submitting the job to cluster
>>
>> --driver-memory 64g \
>>
>> --executor-memory 8g \
>>
>> My Current settings:
>>
>> (spark-defaults.conf)
>>
>> spark.executor.memory   8g
>>
>> (spark-env.sh)
>>
>> export SPARK_WORKER_MEMORY=8g
>>
>> export HADOOP_HEAPSIZE=8000
>>
>>
>> Any idea how to resolve it?
>>
>> Regards
>>
>>
>>
>>
>>
>>
>> ### (the error log) ###
>>
>> 16/07/23 04:34:04 WARN TaskSetManager: Lost task 2.0 in stage 6.1 (TID
>> 30, n1794): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> at
>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
>>
>> at
>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
>>
>> at

ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-22 Thread Ascot Moss
Hi

Please help!

When running the random forest training phase in cluster mode, I got "GC
overhead limit exceeded".

I have used two parameters when submitting the job to cluster

--driver-memory 64g \

--executor-memory 8g \

My Current settings:

(spark-defaults.conf)

spark.executor.memory   8g

(spark-env.sh)

export SPARK_WORKER_MEMORY=8g

export HADOOP_HEAPSIZE=8000


Any idea how to resolve it?

Regards






### (the error log) ###

16/07/23 04:34:04 WARN TaskSetManager: Lost task 2.0 in stage 6.1 (TID 30,
n1794): java.lang.OutOfMemoryError: GC overhead limit exceeded

at
scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)

at
scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)

at
org.apache.spark.util.collection.CompactBuffer.growToSize(CompactBuffer.scala:144)

at
org.apache.spark.util.collection.CompactBuffer.$plus$plus$eq(CompactBuffer.scala:90)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1$$anonfun$10.apply(PairRDDFunctions.scala:505)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1$$anonfun$10.apply(PairRDDFunctions.scala:505)

at
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.mergeIfKeyExists(ExternalAppendOnlyMap.scala:318)

at
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:365)

at
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:265)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

at scala.collection.AbstractIterator.to(Iterator.scala:1157)

at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)

at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:89)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)
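
A small sketch (assuming the training input is an RDD of LabeledPoint named
trainData, which is not shown in the thread): persisting the input in serialized
form trades some CPU for a much smaller heap footprint, which often relieves
"GC overhead limit exceeded" during tree training.

import org.apache.spark.storage.StorageLevel

// Serialized caching keeps one compact byte buffer per partition instead of
// many boxed objects, so the old generation fills up far more slowly.
trainData.persist(StorageLevel.MEMORY_AND_DISK_SER)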


Random Forest generate model failed (DecisionTree.scala:642), which has no missing parents

2016-07-15 Thread Ascot Moss
Hi,

I am trying to create a Random Forest model; my source code is as follows:
val rf_model  = RandomForest.trainClassifier(trainData, 7, Map[Int,Int](),
20, "auto", "entropy", 30, 300)


I got following error:

##

16/07/15 19:55:04 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks
have all completed, from pool

16/07/15 19:55:04 INFO DAGScheduler: ShuffleMapStage 21 (mapPartitions at
DecisionTree.scala:622) finished in 2.685 s

16/07/15 19:55:04 INFO DAGScheduler: looking for newly runnable stages

16/07/15 19:55:04 INFO DAGScheduler: running: Set()

16/07/15 19:55:04 INFO DAGScheduler: waiting: Set(ResultStage 22)

16/07/15 19:55:04 INFO DAGScheduler: failed: Set()

16/07/15 19:55:04 INFO DAGScheduler: Submitting ResultStage 22
(MapPartitionsRDD[43] at map at DecisionTree.scala:642), which has no
missing parents

Killed

##


Any idea what is wrong?

Regards
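
For comparison, a hedged sketch of the same call with less aggressive tree
parameters (trainData is assumed to be an RDD[LabeledPoint]): maxDepth=30 is
the library maximum and, combined with maxBins=300, makes the per-node
statistics that have to be shuffled and aggregated very large, which is a
common reason this stage dies.

import org.apache.spark.mllib.tree.RandomForest

val rf_model = RandomForest.trainClassifier(
  trainData,
  7,                 // numClasses
  Map[Int, Int](),   // categoricalFeaturesInfo
  20,                // numTrees
  "auto",            // featureSubsetStrategy
  "entropy",         // impurity
  10,                // maxDepth: far cheaper than 30
  32)                // maxBins: the MLlib default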


Random Forest Job got killed (DAGScheduler: failed: Set() , DecisionTree.scala:642), which has no missing parents)

2016-07-15 Thread Ascot Moss
Hi,

I am trying to create a Random Forest model; my source code is as follows:
val rf_model  = edhRF.trainClassifier(trainData, 7, Map[Int,Int](), 20,
"auto", "entropy", 30, 300)


I got following error:

##

16/07/15 19:55:04 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks
have all completed, from pool

16/07/15 19:55:04 INFO DAGScheduler: ShuffleMapStage 21 (mapPartitions at
DecisionTree.scala:622) finished in 2.685 s

16/07/15 19:55:04 INFO DAGScheduler: looking for newly runnable stages

16/07/15 19:55:04 INFO DAGScheduler: running: Set()

16/07/15 19:55:04 INFO DAGScheduler: waiting: Set(ResultStage 22)

16/07/15 19:55:04 INFO DAGScheduler: failed: Set()

16/07/15 19:55:04 INFO DAGScheduler: Submitting ResultStage 22
(MapPartitionsRDD[43] at map at DecisionTree.scala:642), which has no
missing parents

Killed

##


Any idea what is wrong?

Regards


LogisticRegression.scala ERROR, require(Predef.scala)

2016-06-23 Thread Ascot Moss
Hi,

My Spark is 1.5.2; when trying MLlib, I got the following error. Any idea how
to fix it?

Regards


==

16/06/23 16:26:20 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)

java.lang.IllegalArgumentException: requirement failed

at scala.Predef$.require(Predef.scala:221)

at
org.apache.spark.mllib.classification.LogisticRegressionModel.predictPoint(LogisticRegression.scala:118)

at
org.apache.spark.mllib.regression.GeneralizedLinearModel$$anonfun$predict$1$$anonfun$apply$1.apply(GeneralizedLinearAlgorithm.scala:65)

at
org.apache.spark.mllib.regression.GeneralizedLinearModel$$anonfun$predict$1$$anonfun$apply$1.apply(GeneralizedLinearAlgorithm.scala:65)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at
org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$27$$anon$1.next(RDD.scala:815)

at
org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$27$$anon$1.next(RDD.scala:808)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)

at org.apache.spark.scheduler.Task.run(Task.scala:70)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

16/06/23 16:26:20 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5,
localhost): java.lang.IllegalArgumentException: requirement failed

at scala.Predef$.require(Predef.scala:221)

at
org.apache.spark.mllib.classification.LogisticRegressionModel.predictPoint(LogisticRegression.scala:118)

at
org.apache.spark.mllib.regression.GeneralizedLinearModel$$anonfun$predict$1$$anonfun$apply$1.apply(GeneralizedLinearAlgorithm.scala:65)

at
org.apache.spark.mllib.regression.GeneralizedLinearModel$$anonfun$predict$1$$anonfun$apply$1.apply(GeneralizedLinearAlgorithm.scala:65)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at
org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$27$$anon$1.next(RDD.scala:815)

at
org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$27$$anon$1.next(RDD.scala:808)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)

at org.apache.spark.scheduler.Task.run(Task.scala:70)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)


16/06/23 16:26:20 ERROR TaskSetManager: Task 0 in stage 5.0 failed 1 times;
aborting job

16/06/23 16:26:20 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks
have all completed, from pool

16/06/23 16:26:20 INFO TaskSchedulerImpl: Cancelling stage 5

16/06/23 16:26:20 INFO DAGScheduler: ResultStage 5 (foreach at P.scala:49)
failed in 0.118 s

16/06/23 16:26:20 INFO DAGScheduler: Job 14 failed: foreach at P.scala:49,
took 0.140928 s

16/06/23 16:26:20 ERROR JobScheduler: Error running job streaming job
146667038 ms.0

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage
5.0 (TID 5, 
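
The require at LogisticRegression.scala:118 usually means a vector passed to
predict has a different size from the number of features the model was trained
on. A small sketch to locate such rows (model and testFeatures are assumed
names; neither appears in the thread):

import org.apache.spark.mllib.linalg.Vector

val expected = model.numFeatures
// Vectors that do not match the model's feature count are the ones that
// would trip the requirement inside predictPoint.
val badRows = testFeatures.filter((v: Vector) => v.size != expected)
println(s"vectors with unexpected size: ${badRows.count()} (expected $expected features)")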

Re: Apache Flink

2016-04-16 Thread Ascot Moss
I compared both last month; it seems to me that Flink's ML library is not yet ready.

On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh  wrote:

> Thanks Ted. I was wondering if someone is using both :)
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 April 2016 at 17:08, Ted Yu  wrote:
>
>> Looks like this question is more relevant on flink mailing list :-)
>>
>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Has anyone used Apache Flink instead of Spark by any chance
>>>
>>> I am interested in its set of libraries for Complex Event Processing.
>>>
>>> Frankly I don't know if it offers far more than Spark offers.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>>
>


ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Ascot Moss
Hi,

I have a Spark Streaming (with Kafka) job; after running for several days, it
failed with the following errors:
ERROR DirectKafkaInputDStream:
ArrayBuffer(java.nio.channels.ClosedChannelException)

Any idea what could be wrong? Could it be a Spark Streaming buffer overflow
issue?



Regards






*** from the log ***

16/03/18 09:15:18 INFO VerifiableProperties: Property zookeeper.connect is
overridden to

16/03/17 12:13:51 ERROR DirectKafkaInputDStream:
ArrayBuffer(java.nio.channels.ClosedChannelException)

16/03/17 12:13:52 INFO SimpleConsumer: Reconnect due to socket error:
java.nio.channels.ClosedChannelException

16/03/17 12:13:52 ERROR JobScheduler: Error generating jobs for time
1458188031800 ms

org.apache.spark.SparkException:
ArrayBuffer(java.nio.channels.ClosedChannelException)

at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)

at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)

at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)

at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)

at scala.Option.orElse(Option.scala:257)

at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)

at
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)

at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)

at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)

at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)

at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)

at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)

at
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)

at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:247)

at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:245)

at scala.util.Try$.apply(Try.scala:161)

at
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:245)

at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)

at
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)

at
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)

at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Exception in thread "main" org.apache.spark.SparkException:
ArrayBuffer(java.nio.channels.ClosedChannelException)

at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)

at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)

at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)

at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)

at scala.Option.orElse(Option.scala:257)

at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)

at
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)

at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)

at
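
A ClosedChannelException from the direct Kafka stream usually means the driver
lost its connection to a broker (for example after a leader re-election or a
broker restart) while fetching the latest offsets. A sketch of the two knobs
usually worth raising (the broker addresses, topic name and ssc are assumptions,
not taken from the thread):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// refresh.leader.backoff.ms lets the consumer wait out a leader re-election;
// spark.streaming.kafka.maxRetries (set on submit, e.g.
// --conf spark.streaming.kafka.maxRetries=3) lets the driver retry the
// latest-offset lookup instead of failing the batch on the first error.
val kafkaParams = Map(
  "metadata.broker.list"      -> "broker1:9092,broker2:9092",
  "refresh.leader.backoff.ms" -> "2000")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my_topic"))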

How to compile Python and use spark-submit

2016-01-08 Thread Ascot Moss
Hi,

Instead of using spark-shell, does anyone know how to build a .zip (or .egg)
for Python and run it with spark-submit?

Regards
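
One common pattern (a sketch with hypothetical file names, not a full build
recipe): zip the package directory and hand the archive to spark-submit with
--py-files so it is shipped to the executors alongside the entry-point script.

zip -r deps.zip mypackage/
spark-submit --py-files deps.zip main.py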


Pivot Data in Spark and Scala

2015-10-29 Thread Ascot Moss
Hi,

I have data as follows:

A, 2015, 4
A, 2014, 12
A, 2013, 1
B, 2015, 24
B, 2013, 4


I need to convert the data to a new format:
A, 4, 12, 1
B, 24, , 4

Any idea how to do this in Spark with Scala?

Thanks
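
One way to do it on a plain RDD (a sketch using the sample rows above and a
fixed list of year columns; sc is the spark-shell SparkContext):

val years = Seq(2015, 2014, 2013)
val rows = sc.parallelize(Seq(
  ("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
  ("B", 2015, 24), ("B", 2013, 4)))

val pivoted = rows
  .map { case (k, year, v) => (k, (year, v)) }
  .groupByKey()
  .map { case (k, yearValues) =>
    val byYear = yearValues.toMap
    // Emit one CSV line per key, leaving a blank field for missing years.
    k + "," + years.map(y => byYear.get(y).map(_.toString).getOrElse("")).mkString(",")
  }

pivoted.collect().foreach(println)   // prints A,4,12,1 and B,24,,4

On Spark 1.6+ the DataFrame API can do the same with
df.groupBy("name").pivot("year").sum("value").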


Spark: How to find similar text title

2015-10-20 Thread Ascot Moss
Hi,

I have an RDD that stores the titles of some articles:
1. "About Spark Streaming"
2. "About Spark MLlib"
3. "About Spark SQL"
4. "About Spark Installation"
5. "Kafka Streaming"
6. "Kafka Setup"
7. 

I need to build a model to find titles by similarity,
e.g. if given "About Spark", I hope to get:

"About Spark Installation", 0.98622 (where 0.98622 is the similarity score,
ranging between 0 and 1)
"About Spark MLlib", 0.95394
"About Spark Streaming", 0.94332
"About Spark SQL", 0.9111

Any idea or reference to do so?

Thanks
Ascot
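
One lightweight baseline (a sketch, not a full model): rank titles by Jaccard
similarity (0 to 1) of their word sets against the query; MLlib's HashingTF/IDF
plus cosine similarity is the usual heavier-weight alternative when plain word
overlap is too crude. The titles below are the examples from the question; sc
is the spark-shell SparkContext.

val titles = sc.parallelize(Seq(
  "About Spark Streaming", "About Spark MLlib", "About Spark SQL",
  "About Spark Installation", "Kafka Streaming", "Kafka Setup"))

def words(s: String): Set[String] = s.toLowerCase.split("\\s+").toSet

def jaccard(a: Set[String], b: Set[String]): Double =
  if (a.isEmpty || b.isEmpty) 0.0
  else (a intersect b).size.toDouble / (a union b).size

val query = words("About Spark")

// Score every title against the query and keep the best matches first.
val ranked = titles
  .map(t => (t, jaccard(words(t), query)))
  .sortBy(-_._2)

ranked.take(5).foreach { case (t, score) => println(f"$t%-30s $score%.5f") }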




