Re: Not Serializable exception when integrating SQL and Spark Streaming
Generally you can use |-Dsun.io.serialization.extendedDebugInfo=true| to enable serialization debugging information when serialization exceptions are raised. On 12/24/14 1:32 PM, bigdata4u wrote: I am trying to use SQL over Spark Streaming using Java, but I am getting a serialization exception.

public static void main(String args[]) {
    SparkConf sparkConf = new SparkConf().setAppName("NumberCount");
    JavaSparkContext jc = new JavaSparkContext(sparkConf);
    JavaStreamingContext jssc = new JavaStreamingContext(jc, new Duration(2000));
    jssc.addStreamingListener(new WorkCountMonitor());
    int numThreads = Integer.parseInt(args[3]);
    Map<String,Integer> topicMap = new HashMap<String,Integer>();
    String[] topics = args[2].split(",");
    for (String topic : topics) {
        topicMap.put(topic, numThreads);
    }
    JavaPairReceiverInputDStream<String,String> data = KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
    data.print();
    JavaDStream<Person> streamData = data.map(new Function<Tuple2<String,String>, Person>() {
        public Person call(Tuple2<String,String> v1) throws Exception {
            String[] stringArray = v1._2.split(",");
            Person Person = new Person();
            Person.setName(stringArray[0]);
            Person.setAge(stringArray[1]);
            return Person;
        }
    });
    final JavaSQLContext sqlContext = new JavaSQLContext(jc);
    streamData.foreachRDD(new Function<JavaRDD<Person>,Void>() {
        public Void call(JavaRDD<Person> rdd) {
            JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd, Person.class);
            subscriberSchema.registerAsTable("people");
            System.out.println("all data");
            JavaSchemaRDD names = sqlContext.sql("SELECT name FROM people");
            System.out.println("afterwards");
            List<String> males = new ArrayList<String>();
            males = names.map(new Function<Row,String>() {
                public String call(Row row) {
                    return row.getString(0);
                }
            }).collect();
            System.out.println("before for");
            for (String name : males) {
                System.out.println(name);
            }
            return null;
        }
    });
    jssc.start();
    jssc.awaitTermination();

Exception is:

14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job 1419378578000 ms.1
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
    at org.apache.spark.rdd.RDD.map(RDD.scala:271)
    at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78)
    at org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42)
    at com.basic.spark.NumberCount$2.call(NumberCount.java:79)
    at com.basic.spark.NumberCount$2.call(NumberCount.java:67)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
    at org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
    at org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
    at org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
    at org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
    at org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.NotSerializableException: org.apache.spark.sql.api.java.JavaSQLContext
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
    at
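For reference, the flag Cheng mentions is worth passing to both the driver and the executors, since closures can be serialized on either side. A minimal spark-submit sketch (the jar name and program arguments are placeholders, not taken from the original post):

spark-submit \
  --class com.basic.spark.NumberCount \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true \
  numbercount.jar <zkQuorum> <group> <topics> <numThreads>

With that in place, the NotSerializableException is followed by the chain of object references that dragged the offending object (here the JavaSQLContext) into the serialized closure.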
Re: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD
Hao and Lam - I think the issue here is that |registerRDDAsTable| only creates a temporary table, which is not seen by the Hive metastore. Michael once gave a workaround for creating an external Parquet table: http://apache-spark-user-list.1001560.n3.nabble.com/persist-table-schema-in-spark-sql-td16297.html Cheng On 12/24/14 9:38 AM, Cheng, Hao wrote: Hi, Lam, I can confirm this is a bug with the latest master, and I filed a jira issue for it: https://issues.apache.org/jira/browse/SPARK-4944 Hope to come up with a solution soon. Cheng Hao *From:* Jerry Lam [mailto:chiling...@gmail.com] *Sent:* Wednesday, December 24, 2014 4:26 AM *To:* user@spark.apache.org *Subject:* SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD Hi spark users, I'm trying to create an external table using HiveContext after creating a SchemaRDD and saving the RDD into a Parquet file on HDFS. I would like to use the schema in the SchemaRDD (rdd_table) when I create the external table. For example:

rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
hiveContext.registerRDDAsTable(rdd_table, "rdd_table")
hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION '/user/spark/my_data.parquet'")

The last line fails with:

org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not found rdd_table
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)
    at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
    at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
    at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)

Is this supported? Best Regards, Jerry
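For readers hitting the same wall before SPARK-4944 is resolved: one workaround along the lines of the post Cheng links is to write the Parquet file and then issue the CREATE EXTERNAL TABLE with the columns spelled out explicitly, instead of LIKE-ing the temporary table. A rough sketch (the column names/types are invented for illustration, and STORED AS PARQUET assumes Hive 0.13+; older Hive needs the explicit Parquet SerDe and input/output formats):

rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
hiveContext.sql(
  "CREATE EXTERNAL TABLE my_data (name STRING, age INT) " +
  "STORED AS PARQUET " +
  "LOCATION '/user/spark/my_data.parquet'")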
Re: Spark Installation Maven PermGen OutOfMemoryException
Java 8 rpm 64bit downloaded from official oracle site solved my problem. And I need not set max heap size, final memory shown at the end of maven build was 81/1943M. I want to learn spark so have no restriction on choosing java version. Guru Medasani, thanks for the tip. I will repeat info, that I wrongly send only to Sean. I have tried export MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and it doesn't work also. Best Regards, Vladimir Protsenko 2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com: Thanks for the clarification Sean. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:39:59 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: gdm...@outlook.com CC: protsenk...@gmail.com; user@spark.apache.org The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is the -XX:MaxPermSize=512M actually. Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote: Hi Vladimir, From the link Sean posted, if you use Java 8 there is this following note. Note: For Java 8 and above this step is not required. So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Installation Maven PermGen OutOfMemoryException
That command is still wrong. It is -Xmx3g with no =. On Dec 24, 2014 9:50 AM, Vladimir Protsenko protsenk...@gmail.com wrote: Java 8 rpm 64bit downloaded from official oracle site solved my problem. And I need not set max heap size, final memory shown at the end of maven build was 81/1943M. I want to learn spark so have no restriction on choosing java version. Guru Medasani, thanks for the tip. I will repeat info, that I wrongly send only to Sean. I have tried export MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and it doesn't work also. Best Regards, Vladimir Protsenko 2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com: Thanks for the clarification Sean. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:39:59 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: gdm...@outlook.com CC: protsenk...@gmail.com; user@spark.apache.org The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is the -XX:MaxPermSize=512M actually. Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote: Hi Vladimir, From the link Sean posted, if you use Java 8 there is this following note. Note: For Java 8 and above this step is not required. So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
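For anyone copying that line later: besides dropping the '=', note that the backticks in the original would be interpreted as command substitution by the shell; plain double quotes are what is wanted. Something like the following, where the sizes are just the values the Spark build docs suggest:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -DskipTests clean package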
got "org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80" from the Hive metastore service when I use the show tables command in the spark-sql shell
this is my problem. I use mysql to store hive meta data. and i can get what i want when I exec show tables in hive shell. but in the same machine. I use spark-sql to execute same command (show tables), I got errors. I look at the log of hive metastore find this errors 2014-12-24 05:04:59,874 ERROR [pool-3-thread-2]: server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 but got ff80 at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503) at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-24 05:05:01,883 ERROR [pool-3-thread-3]: server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 but got ff80 at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503) at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) I compiled e spark from soucecode of spark-1.1.1 and my hive was compiled from apache-hive-0.15.0 I put the same hive-site.xml both hive conf and apark conf. waiting for help thanks
Re: How to build Spark against the latest
Hi Ted, The referenced command works, but where can I get the deployable binaries? Xiaobo Gu -- Original -- From: Ted Yu; yuzhih...@gmail.com; Send time: Wednesday, Dec 24, 2014 12:09 PM To: guxiaobo1...@qq.com; Cc: user@spark.apache.org; Subject: Re: How to build Spark against the latest See http://search-hadoop.com/m/JW1q5Cew0j On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, The official pom.xml file only has a profile for Hadoop version 2.4 as the latest version, but I installed Hadoop version 2.6.0 with Ambari. How can I build Spark against it - just using mvn -Dhadoop.version=2.6.0, or how do I make a corresponding profile for it? Regards, Xiaobo
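On the deployable-binaries question: the Spark source tree includes make-distribution.sh, which runs the Maven build and lays out (optionally tars up) a runnable distribution, so the same Hadoop overrides can be passed straight through. A sketch, assuming YARN is wanted and the hadoop-2.4 profile is the right base for a 2.6.0 build:

./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0

The resulting spark-*-bin-*.tgz is what gets copied to the cluster.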
Need help for Spark-JobServer setup on Maven (for Java programming)
Dear All, We are trying to share RDDs across different sessions of same Web application (Java). We need to share single RDD between those sessions. As we understand from some posts, it is possible through Spark-JobServer. Is there any guidelines you can provide to setup Spark-JobServer for Maven (for Apache Spark - Java programming). We will be glad for your suggestion. Sasi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-for-Spark-JobServer-setup-on-Maven-for-Java-programming-tp20849.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/
Re: got "org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80" from the Hive metastore service when I use the show tables command in the spark-sql shell
Hi Roc, Spark SQL 1.2.0 can only work with Hive 0.12.0 or Hive 0.13.1 (controlled by compilation flags), versions prior 1.2.0 only works with Hive 0.12.0. So Hive 0.15.0-SNAPSHOT is not an option. Would like to add that this is due to backwards compatibility issue of Hive metastore, AFAIK there's no popular system works with multiple versions of Hive metastore simultaneously. Cheng On 12/24/14 6:31 PM, Roc Chu wrote: this is my problem. I use mysql to store hive meta data. and i can get what i want when I exec show tables in hive shell. but in the same machine. I use spark-sql to execute same command (show tables), I got errors. I look at the log of hive metastore find this errors 2014-12-24 05:04:59,874 ERROR [pool-3-thread-2]: server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 but got ff80 at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503) at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-24 05:05:01,883 ERROR [pool-3-thread-3]: server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 but got ff80 at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503) at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) I compiled e spark from soucecode of spark-1.1.1 and my hive was compiled from apache-hive-0.15.0 I put the same hive-site.xml both hive conf and apark conf. waiting for help thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Installation Maven PermGen OutOfMemoryException
Thanks. Bad mistake. 2014-12-24 14:02 GMT+04:00 Sean Owen so...@cloudera.com: That command is still wrong. It is -Xmx3g with no =. On Dec 24, 2014 9:50 AM, Vladimir Protsenko protsenk...@gmail.com wrote: Java 8 rpm 64bit downloaded from official oracle site solved my problem. And I need not set max heap size, final memory shown at the end of maven build was 81/1943M. I want to learn spark so have no restriction on choosing java version. Guru Medasani, thanks for the tip. I will repeat info, that I wrongly send only to Sean. I have tried export MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and it doesn't work also. Best Regards, Vladimir Protsenko 2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com: Thanks for the clarification Sean. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:39:59 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: gdm...@outlook.com CC: protsenk...@gmail.com; user@spark.apache.org The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is the -XX:MaxPermSize=512M actually. Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote: Hi Vladimir, From the link Sean posted, if you use Java 8 there is this following note. Note: For Java 8 and above this step is not required. So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn only appear in examples, so that's unlikely to be it. I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
Hi, I have been using this without any issues with Spark 1.1.0, but after upgrading to 1.2.0, saving an RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. Is anyone having the same issue? (and able to solve it?) Using HBase 0.98.6 and yarn-client mode. thanks, Antony.
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? For reference, my complete pom.xml looks like: project xmlns=http://maven.apache.org/POM/4.0.0; xmlns:xsi= http://www.w3.org/2001/XMLSchema-instance; xsi:schemaLocation=http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd; modelVersion4.0.0/modelVersion groupIdbigcontent/groupId artifactIdbigcontent/artifactId version1.0-SNAPSHOT/version packagingjar/packaging namebigcontent/name urlhttp://maven.apache.org/url properties project.build.sourceEncodingUTF-8/project.build.sourceEncoding /properties build plugins plugin groupIdorg.apache.maven.plugins/groupId artifactIdmaven-shade-plugin/artifactId version2.3/version configuration !-- put your configurations here -- /configuration executions execution phasepackage/phase goals goalshade/goal /goals /execution /executions /plugin plugin groupIdorg.apache.maven.plugins/groupId artifactIdmaven-compiler-plugin/artifactId version3.2/version configuration source1.7/source target1.7/target /configuration /plugin /plugins /build dependencies dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.1.1/version scopeprovided/scope /dependency dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-client/artifactId version2.4.0/version /dependency dependency groupIdcom.google.guava/groupId artifactIdguava/artifactId version16.0/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-mapreduce-client-core/artifactId version2.4.0/version /dependency dependency groupIdjson-mapreduce/groupId artifactIdjson-mapreduce/artifactId version1.0-SNAPSHOT/version exclusions exclusion groupIdjavax.servlet/groupId artifactId*/artifactId /exclusion exclusion groupIdcommons-io/groupId artifactId*/artifactId /exclusion exclusion groupIdcommons-lang/groupId artifactId*/artifactId /exclusion exclusion groupIdorg.apache.hadoop/groupId artifactIdhadoop-common/artifactId /exclusion /exclusions /dependency dependency groupIdorg.apache.avro/groupId artifactIdavro-mapred/artifactId version1.7.7/version exclusions exclusion groupIdjavax.servlet/groupId artifactId*/artifactId /exclusion exclusion groupIdorg.apache.hadoop/groupId artifactIdhadoop-common/artifactId /exclusion /exclusions /dependency dependency groupIdjunit/groupId artifactIdjunit/artifactId version4.11/version scopetest/scope /dependency dependency groupIdorg.apache.avro/groupId artifactIdavro/artifactId version1.7.7/version exclusions exclusion groupIdjavax.servlet/groupId artifactId*/artifactId /exclusion /exclusions /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-common/artifactId version2.4.0/version scopeprovided/scope exclusions exclusion groupIdjavax.servlet/groupId artifactId*/artifactId /exclusion exclusion groupIdcom.google.guava/groupId artifactId*/artifactId /exclusion /exclusions /dependency dependency groupIdorg.slf4j/groupId artifactIdslf4j-log4j12/artifactId version1.7.7/version /dependency /dependencies /project And 'mvn dependency:tree' produces the following output: [INFO] Scanning for projects... 
[INFO] [INFO] [INFO] Building bigcontent 1.0-SNAPSHOT [INFO] [INFO] [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ bigcontent --- [INFO] bigcontent:bigcontent:jar:1.0-SNAPSHOT [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.1.1:provided [INFO] | +- org.apache.spark:spark-core_2.10:jar:1.1.1:provided [INFO] | | +- org.apache.curator:curator-recipes:jar:2.4.0:provided [INFO] | | | \- org.apache.curator:curator-framework:jar:2.4.0:provided [INFO] | | | \-
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
It seems like YARN depends an older version of Jersey, that is 1.9: https://github.com/apache/spark/blob/master/yarn/pom.xml When I've modified my dependencies to have only: dependency groupIdcom.sun.jersey/groupId artifactIdjersey-core/artifactId version1.9.1/version /dependency And then modified the code to use the old Jersey API: Client c = Client.create(); WebResource r = c.resource(http://localhost:/rest;) .path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, r.getURI()); String response = r.accept(MediaType.APPLICATION_JSON_TYPE) //.header() .get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); It seems to work when I use spark-submit to submit the application that includes this code. Funny thing is, now my relevant unit test does not run, complaining about not having enough memory: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (mmap) failed to map 25165824 bytes for committing reserved memory. -- Emre On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn only appear in examples, so that's unlikely to be it. I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/ -- Emre Sevinc
Re: Single worker locked at 100% CPU
Turns out that I was just being idiotic and had assigned so much memory to Spark that the O/S was ending up continually swapping. Apologies for the noise. Phil On Wed, Dec 24, 2014 at 1:16 AM, Andrew Ash and...@andrewash.com wrote: Hi Phil, This sounds a lot like a deadlock in Hadoop's Configuration object that I ran into a while back. If you jstack the JVM and see a thread that looks like the below, it could be https://issues.apache.org/jira/browse/SPARK-2546 Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) The fix for this issue is hidden behind a flag because it might have performance implications, but if it is this problem then you can set spark.hadoop.cloneConf=true and see if that fixes things. Good luck! Andrew On Tue, Dec 23, 2014 at 9:40 AM, Phil Wills otherp...@gmail.com wrote: I've been attempting to run a job based on MLlib's ALS implementation for a while now and have hit an issue I'm having a lot of difficulty getting to the bottom of. On a moderate size set of input data it works fine, but against larger (still well short of what I'd think of as big) sets of data, I'll see one or two workers get stuck spinning at 100% CPU and the job unable to recover. I don't believe this is down to memory pressure as I seem to get the same behaviour at about the same size of input data, even if the cluster is twice as large. GC logs also suggest things are proceeding reasonably with some Full GC's occurring, but no suggestion of the process being GC locked. After rebooting the instance that got into trouble, I can see the stderr log for the task truncated in the middle of a log-line at the time CPU shoots to and sticks at 100%, but no other signs of a problem. I've run into the same issue on 1.1.0 and 1.2.0 in standalone mode and running on YARN. Any suggestions on further steps I could try to get a clearer diagnosis of the issue would be much appreciated. Thanks, Phil
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
That could well be it -- oops, I forgot to run with the YARN profile and so didn't see the YARN dependencies. Try the userClassPathFirst option to try to make your app's copy take precedence. The second error is really a JVM bug, but, is from having too little memory available for the unit tests. http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage On Wed, Dec 24, 2014 at 1:56 PM, Emre Sevinc emre.sev...@gmail.com wrote: It seems like YARN depends an older version of Jersey, that is 1.9: https://github.com/apache/spark/blob/master/yarn/pom.xml When I've modified my dependencies to have only: dependency groupIdcom.sun.jersey/groupId artifactIdjersey-core/artifactId version1.9.1/version /dependency And then modified the code to use the old Jersey API: Client c = Client.create(); WebResource r = c.resource(http://localhost:/rest;) .path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, r.getURI()); String response = r.accept(MediaType.APPLICATION_JSON_TYPE) //.header() .get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); It seems to work when I use spark-submit to submit the application that includes this code. Funny thing is, now my relevant unit test does not run, complaining about not having enough memory: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (mmap) failed to map 25165824 bytes for committing reserved memory. -- Emre On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn only appear in examples, so that's unlikely to be it. I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. 
But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/ -- Emre Sevinc - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
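For completeness, the 'userClassPathFirst' setting Sean refers to is experimental and its exact name has moved around between releases (spark.files.userClassPathFirst for executors in the 1.1/1.2 line, a YARN-specific spark.yarn.user.classpath.first, and spark.driver/spark.executor.userClassPathFirst in later releases). A hedged example of trying it at submit time:

spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --conf spark.yarn.user.classpath.first=true \
  ... your-application.jar

If that doesn't help, shading/relocating the JAX-RS packages inside your own assembly jar is the other common way out of this kind of conflict.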
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
Sean, Thanks a lot for the important information, especially userClassPathFirst. Cheers, Emre On Wed, Dec 24, 2014 at 3:38 PM, Sean Owen so...@cloudera.com wrote: That could well be it -- oops, I forgot to run with the YARN profile and so didn't see the YARN dependencies. Try the userClassPathFirst option to try to make your app's copy take precedence. The second error is really a JVM bug, but, is from having too little memory available for the unit tests. http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage On Wed, Dec 24, 2014 at 1:56 PM, Emre Sevinc emre.sev...@gmail.com wrote: It seems like YARN depends an older version of Jersey, that is 1.9: https://github.com/apache/spark/blob/master/yarn/pom.xml When I've modified my dependencies to have only: dependency groupIdcom.sun.jersey/groupId artifactIdjersey-core/artifactId version1.9.1/version /dependency And then modified the code to use the old Jersey API: Client c = Client.create(); WebResource r = c.resource(http://localhost:/rest;) .path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, r.getURI()); String response = r.accept(MediaType.APPLICATION_JSON_TYPE) //.header() .get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); It seems to work when I use spark-submit to submit the application that includes this code. Funny thing is, now my relevant unit test does not run, complaining about not having enough memory: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (mmap) failed to map 25165824 bytes for committing reserved memory. -- Emre On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn only appear in examples, so that's unlikely to be it. I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. 
But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/ -- Emre Sevinc -- Emre Sevinc
SVDPlusPlus Recommender in MLLib
hi, Is there any plan to add an SVDPlusPlus-based recommender to MLlib? It is implemented in Mahout, based on this paper: http://research.yahoo.com/files/kdd08koren.pdf Regards, Prafulla.
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
bq. even when testing with the example from the stock hbase_outputformat.py

Can you take a jstack of the above and pastebin it? Thanks

On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, I have been using this without any issues with Spark 1.1.0, but after upgrading to 1.2.0, saving an RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. Is anyone having the same issue? (and able to solve it?) Using HBase 0.98.6 and yarn-client mode. thanks, Antony.
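In case it helps with gathering that: the PID of the stuck JVM (the pyspark driver in yarn-client mode, or an executor container on a worker node) can be found with jps, and a thread dump taken with the stock JDK jstack tool, e.g.:

jps -lm
jstack -l <pid> > stack.txt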
null Error in ALS model predict
Hi all! I have an RDD[(Int, Int, Double, Double)] where the first two Int values are id and product, respectively. I trained an implicit ALS algorithm and want to make predictions from this RDD. I tried two approaches, which I thought were equivalent.

1- Convert this RDD to an RDD[(Int, Int)] and use model.predict(RDD[(Int, Int)]); this works for me!
2- Make a map and apply model.predict(int, int), for example:

val ratings = RDD[(int,int,double,double)].map{ case (id, rubro, rating, resp) => model.predict(id, rubro) }

where ratings is an RDD[Double]. Now, with the second way, when I apply ratings.first() I get the following error: Why does this happen? I need to use this second way. Thanks in advance, Franco Barrientos Data Scientist Málaga #115, Of. 1003, Las Condes. Santiago, Chile. (+562)-29699649 (+569)-76347893 mailto:franco.barrien...@exalitica.com franco.barrien...@exalitica.com http://www.exalitica.com/ www.exalitica.com
How to identify erroneous input record ?
hey guys One of my input records has a problem that makes the code fail.

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))
demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

This is possibly happening because one input record may not have 13 fields. If this were Hadoop mapper code, I have 2 ways to solve this:
1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record
How do I implement 1. or 2. in the Spark code? Thanks sanjay
Re: How to identify erroneous input record ?
DOH. Looks like I did not have enough coffee before I asked this :-) I added the if statement...

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => {
  if (line.split('$').length >= 13) {
    line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12)
  }
})

From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: user@spark.apache.org Sent: Wednesday, December 24, 2014 8:28 AM Subject: How to identify erroneous input record ? hey guys One of my input records has a problem that makes the code fail.

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))
demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

This is possibly happening because one input record may not have 13 fields. If this were Hadoop mapper code, I have 2 ways to solve this:
1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record
How do I implement 1. or 2. in the Spark code? Thanks sanjay
Re: How to identify erroneous input record ?
I don't believe that works since your map function does not return a value for lines shorter than 13 tokens. You should use flatMap and Some/None. (You probably want to not parse the string 5 times too.)

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  val tokens = line.split('$')
  if (tokens.length >= 13) {
    val parsed = tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
    Some(parsed)
  } else {
    None
  }
}

On Wed, Dec 24, 2014 at 4:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: DOH. Looks like I did not have enough coffee before I asked this :-) I added the if statement...

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => {
  if (line.split('$').length >= 13) {
    line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12)
  }
})

From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: user@spark.apache.org Sent: Wednesday, December 24, 2014 8:28 AM Subject: How to identify erroneous input record ? hey guys One of my input records has a problem that makes the code fail.

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))
demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

This is possibly happening because one input record may not have 13 fields. If this were Hadoop mapper code, I have 2 ways to solve this:
1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record
How do I implement 1. or 2. in the Spark code? Thanks sanjay

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to identify erroneous input record ?
Although not elegantly I got the output via my code but totally agree on the parsing 5 times (thats really bad).Will add your suggested logic and check it out. I have a long way to the finish line. I am re-architecting my entire hadoop code and getting it onto spark. Check out what I do at www.medicalsidefx.orgPrimarily an iPhone app but underlying is Lucene, Hadoop and hopefully soon in 2015 - Spark :-) From: Sean Owen so...@cloudera.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: user@spark.apache.org user@spark.apache.org Sent: Wednesday, December 24, 2014 8:56 AM Subject: Re: How to identify erroneous input record ? I don't believe that works since your map function does not return a value for lines shorter than 13 tokens. You should use flatMap and Some/None. (You probably want to not parse the string 5 times too.) val demoRddFilterMap = demoRddFilter.flatMap { line = val tokens = line.split('$') if (tokens.length = 13) { val parsed = tokens(0) + ~ + tokens(5) + ~ + tokens(11) + ~ + tokens(12) Some(parsed) } else { None } } On Wed, Dec 24, 2014 at 4:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: DOH Looks like I did not have enough coffee before I asked this :-) I added the if statement... var demoRddFilter = demoRdd.filter(line = !line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) || !line.contains(primaryid$caseid$caseversion)) var demoRddFilterMap = demoRddFilter.map(line = { if (line.split('$').length = 13){ line.split('$')(0) + ~ + line.split('$')(5) + ~ + line.split('$')(11) + ~ + line.split('$')(12) } }) From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: user@spark.apache.org user@spark.apache.org Sent: Wednesday, December 24, 2014 8:28 AM Subject: How to identify erroneous input record ? hey guys One of my input records has an problem that makes the code fail. var demoRddFilter = demoRdd.filter(line = !line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) || !line.contains(primaryid$caseid$caseversion)) var demoRddFilterMap = demoRddFilter.map(line = line.split('$')(0) + ~ + line.split('$')(5) + ~ + line.split('$')(11) + ~ + line.split('$')(12)) demoRddFilterMap.saveAsTextFile(/data/aers/msfx/demo/ + outFile) This is possibly happening because perhaps one input record may not have 13 fields. If this were Hadoop mapper code , I have 2 ways to solve this 1. test the number of fields of each line before applying the map function 2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record How do I implement 1. or 2. in the Spark code ? Thanks sanjay - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
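For the record, the try/catch route (option 2 of the original question) is also easy to express: wrap the parse in scala.util.Try and flatMap, so any record that fails to parse is simply dropped. A sketch that also follows Sean's advice to split each line only once:

import scala.util.Try

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  Try {
    val tokens = line.split('$')
    tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
  }.toOption  // a failed parse (e.g. too few fields) becomes None and the record is skipped
}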
hiveContext.jsonFile fails with Unexpected close marker
I have generally been impressed with the way jsonFile eats just about any JSON data model... but I am getting this error when I try to ingest this file: Unexpected close marker ']': expected '} Here are the commands from the pyspark shell:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
f = hiveContext.jsonFile("sample.json")

Here is some sample JSON:

{"wf_session": [
  {"id":"6021fb91-c9ec-4019-9ab9-f628aee8d259","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:31.373Z","screen":"x","type":1,"time_left_secs":1},
  {"id":"7e696c19-3ba4-4469-be28-5ef1f0c03d63","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.385Z","screen":"x","type":2,"ad_unit_id":null,"spot_started_at":"2014-12-19T15:55:12.364Z","spot_ended_at":"2014-12-19T15:55:32.385Z","spot_duration_secs":20,"impression_count":0,"impressions":[],"engagement_count":0,"engagements":[]},
  {"id":"68a43006-09bc-4c18-af55-1ebdc0e041a3","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.375Z","screen":"x","type":3,"duration_secs":20,"to_ad_unit_id":"developmentbea1f3a4-be08-4119-b9f4-7"}
] }

Any help would be appreciated! :) Merry Xmas!
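A likely culprit: in Spark 1.x, jsonFile expects every line of the input to be one complete, self-contained JSON object (the "JSON lines" convention), so a single document pretty-printed across several lines, like the sample above, is exactly what produces close-marker errors of this kind. Reformatting the file so each record sits on its own line, for example one wf_session event per line, should make it ingestible:

{"id":"6021fb91-c9ec-4019-9ab9-f628aee8d259","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:31.373Z","screen":"x","type":1,"time_left_secs":1}
{"id":"7e696c19-3ba4-4469-be28-5ef1f0c03d63","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.385Z","screen":"x","type":2,"ad_unit_id":null,"spot_started_at":"2014-12-19T15:55:12.364Z","spot_ended_at":"2014-12-19T15:55:32.385Z","spot_duration_secs":20,"impression_count":0,"impressions":[],"engagement_count":0,"engagements":[]}

(If the enclosing wf_session wrapper is needed, the whole object has to be collapsed onto a single line instead.)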
RE: Not Serializable exception when integrating SQL and Spark Streaming
Thanks for the reply. I am testing this with a small amount of data and what is happening is when ever there is data in the Kafka topic Spark does not through Exception otherwise it is. ThanksTarun Date: Wed, 24 Dec 2014 16:23:30 +0800 From: lian.cs@gmail.com To: bigdat...@live.com; user@spark.apache.org Subject: Re: Not Serializable exception when integrating SQL and Spark Streaming Generally you can use -Dsun.io.serialization.extendedDebugInfo=true to enable serialization debugging information when serialization exceptions are raised. On 12/24/14 1:32 PM, bigdata4u wrote: I am trying to use sql over Spark streaming using Java. But i am getting Serialization Exception. public static void main(String args[]) { SparkConf sparkConf = new SparkConf().setAppName(NumberCount); JavaSparkContext jc = new JavaSparkContext(sparkConf); JavaStreamingContext jssc = new JavaStreamingContext(jc, new Duration(2000)); jssc.addStreamingListener(new WorkCountMonitor()); int numThreads = Integer.parseInt(args[3]); MapString,Integer topicMap = new HashMapString,Integer(); String[] topics = args[2].split(,); for (String topic : topics) { topicMap.put(topic, numThreads); } JavaPairReceiverInputDStreamString,String data = KafkaUtils.createStream(jssc, args[0], args[1], topicMap); data.print(); JavaDStreamPerson streamData = data.map(new FunctionTuple2lt;String, String, Person() { public Person call(Tuple2String,String v1) throws Exception { String[] stringArray = v1._2.split(,); Person Person = new Person(); Person.setName(stringArray[0]); Person.setAge(stringArray[1]); return Person; } }); final JavaSQLContext sqlContext = new JavaSQLContext(jc); streamData.foreachRDD(new FunctionJavaRDDlt;Person,Void() { public Void call(JavaRDDPerson rdd) { JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd, Person.class); subscriberSchema.registerAsTable(people); System.out.println(all data); JavaSchemaRDD names = sqlContext.sql(SELECT name FROM people); System.out.println(afterwards); ListString males = new ArrayListString(); males = names.map(new FunctionRow,String() { public String call(Row row) { return row.getString(0); } }).collect(); System.out.println(before for); for (String name : males) { System.out.println(name); } return null; } }); jssc.start(); jssc.awaitTermination(); Exception is 14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job 1419378578000 ms.1 org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1435) at org.apache.spark.rdd.RDD.map(RDD.scala:271) at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78) at org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42) at com.basic.spark.NumberCount$2.call(NumberCount.java:79) at com.basic.spark.NumberCount$2.call(NumberCount.java:67) at org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274) at org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274) at org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529) at org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529) at org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42) at 
Re: null Error in ALS model predict
Hi,

The MatrixFactorizationModel consists of two RDDs. When you use the second method, Spark tries to serialize both RDDs for the .map() function, which is not possible, because RDDs are not serializable. Therefore you receive the NullPointerException. You must use the first method.

Best,
Burak

----- Original Message -----
From: Franco Barrientos franco.barrien...@exalitica.com
To: user@spark.apache.org
Sent: Wednesday, December 24, 2014 7:44:24 AM
Subject: null Error in ALS model predict

Hi all! I have an RDD[(Int, Int, Double, Double)] where the first two Int values are id and product, respectively. I trained an implicit ALS algorithm and want to make predictions from this RDD. I try two things, but I think both ways are the same:

1- Convert this RDD to RDD[(Int, Int)] and use model.predict(RDD[(Int, Int)]) - this works for me!
2- Make a map and apply model.predict(int, int), for example:

val ratings = RDD[(Int, Int, Double, Double)].map { case (id, rubro, rating, resp) => model.predict(id, rubro) }

where ratings is an RDD[Double]. Now, with the second way, when I apply ratings.first() I get the following error. Why does this happen? I need to use this second way.

Thanks in advance,
Franco Barrientos
Data Scientist
Málaga #115, Of. 1003, Las Condes. Santiago, Chile.
(+562)-29699649 (+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com
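A minimal Scala sketch of the first (working) approach, assuming model is the trained MatrixFactorizationModel and data is the RDD[(Int, Int, Double, Double)] from the post:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// data: RDD[(Int, Int, Double, Double)] and model: MatrixFactorizationModel
// are assumed to exist already (names taken from the post above).
val userProducts: RDD[(Int, Int)] = data.map { case (id, rubro, _, _) => (id, rubro) }

// predict() on a pair RDD is computed as a distributed join against the model's
// internal factor RDDs, so nothing has to be serialized into a task closure.
val predictions: RDD[Rating] = model.predict(userProducts)
val scores: RDD[Double] = predictions.map(_.rating)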
Re: Spark metrics for ganglia
Did you get past this issue? I'm trying to get this to work as well. You have to compile the spark-ganglia-lgpl artifact into your application:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-ganglia-lgpl_2.10</artifactId>
  <version>${project.version}</version>
</dependency>

So I added the above snippet to the examples project, and it finds the class now when I try to run the Pi example, but I get this problem instead:

14/12/24 11:47:23 ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.GangliaSink cannot be instantialized
...SNIP...
Caused by: java.lang.NumberFormatException: For input string: 1

On 12/15/14, 11:29 AM, danilopds danilob...@gmail.com wrote:

Thanks tsingfu, I used this configuration based on your post (with ganglia unicast mode):

# Enable GangliaSink for all instances
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=10.0.0.7
*.sink.ganglia.port=8649
*.sink.ganglia.period=15
*.sink.ganglia.unit=seconds
*.sink.ganglia.ttl=1
*.sink.ganglia.mode=unicast

Then, I have the following error now:

ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.GangliaSink cannot be instantialized
java.lang.ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink
Discourse: A proposed alternative to the Spark User list
When people have questions about Spark, there are 2 main places (as far as I can tell) where they ask them:

- Stack Overflow, under the apache-spark tag http://stackoverflow.com/questions/tagged/apache-spark
- This mailing list

The mailing list is valuable as an independent place for discussion that is part of the Spark project itself. Furthermore, it allows for a broader range of discussions than would be allowed on Stack Overflow http://stackoverflow.com/help/dont-ask. As the Spark project has grown in popularity, I see that a few problems have emerged with this mailing list:

- It's hard to follow topics (e.g. Streaming vs. SQL) that you're interested in, and it's hard to know when someone has mentioned you specifically.
- It's hard to search for existing threads and link information across disparate threads.
- It's hard to format code and log snippets nicely, and by extension, hard to read other people's posts with this kind of information.

There are existing solutions to all these (and other) problems based around straight-up discipline or client-side tooling, which users have to conjure up for themselves. I'd like us as a community to consider using Discourse http://www.discourse.org/ as an alternative to, or overlay on top of, this mailing list, one that provides better out-of-the-box solutions to these problems. Discourse is a modern discussion platform built by some of the same people who created Stack Overflow. It has many neat features http://v1.discourse.org/about/ that I believe this community would benefit from. For example:

- When a user starts typing up a new post, they get a panel *showing existing conversations that look similar*, just like on Stack Overflow.
- It's easy to search for posts and link between them.
- *Markdown support* is built into the composer.
- You can *specifically mention people* and they will be notified.
- Posts can be categorized (e.g. Streaming, SQL, etc.).
- There is a built-in option for mailing list support which forwards all activity on the forum to a user's email address and which allows for creation of new posts via email.

What do you think of Discourse as an alternative, more manageable way to discuss Spark? There are a few options we can consider:

1. Work with the ASF as well as the Discourse team to allow Discourse to act as an overlay on top of this mailing list https://meta.discourse.org/t/discourse-as-a-front-end-for-existing-asf-mailing-lists/23167?u=nicholaschammas, allowing people to continue to use the mailing list as-is if they want. (This is the toughest but perhaps most attractive option.)
2. Create a new Discourse forum for Spark that is not bound to this user list. This is relatively easy but will effectively fork the community on this list. (We cannot shut down this mailing list in favor of one managed by Discourse.)
3. Don't use Discourse. Just encourage people on this list to post instead on Stack Overflow whenever possible.
4. Something else.

What does everyone think?

Nick
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK thanks, Antony. On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote: bq. even when testing with the example from the stock hbase_outputformat.py Can you take jstack of the above and pastebin it ? Thanks On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, have been using this without any issues with spark 1.1.0 but after upgrading to 1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. anyone having same issue? (and able to solve?) using hbase 0.98.6 and yarn-client mode. thanks,Antony.
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
I went over the jstack but didn't find any call related to hbase or zookeeper. Do you find anything important in the logs ? Looks like container launcher was waiting for the script to return some result: 1. at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:715) 2. at org.apache.hadoop.util.Shell.runCommand(Shell.java:524) On Wed, Dec 24, 2014 at 3:11 PM, Antony Mayi antonym...@yahoo.com wrote: this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK thanks, Antony. On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote: bq. even when testing with the example from the stock hbase_outputformat.py Can you take jstack of the above and pastebin it ? Thanks On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, have been using this without any issues with spark 1.1.0 but after upgrading to 1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. anyone having same issue? (and able to solve?) using hbase 0.98.6 and yarn-client mode. thanks, Antony.
RE: Not Serializable exception when integrating SQL and Spark Streaming
Thanks. I debugged this further and below is the cause:

Caused by: java.io.NotSerializableException: org.apache.spark.sql.api.java.JavaSQLContext
- field (class com.basic.spark.NumberCount$2, name: val$sqlContext, type: class org.apache.spark.sql.api.java.JavaSQLContext)
- object (class com.basic.spark.NumberCount$2, com.basic.spark.NumberCount$2@69ddbcc7)
- field (class com.basic.spark.NumberCount$2$1, name: this$0, type: class com.basic.spark.NumberCount$2)
- object (class com.basic.spark.NumberCount$2$1, com.basic.spark.NumberCount$2$1@2524beed)
- field (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)

I also tried this: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-access-objects-declared-and-initialized-outside-the-call-method-of-JavaRDD-td17094.html#a17150

Why is there a difference: SQLContext is Serializable but JavaSQLContext is not? Is Spark designed like this?

Thanks
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
I just run it by hand from the pyspark shell. Here are the steps:

pyspark --jars /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar

conf = {"hbase.zookeeper.quorum": "localhost",
...     "hbase.mapred.outputtable": "test",
...     "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
...     "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
...     "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
sc.parallelize([['testkey', 'f1', 'testqual', 'testval']], 1).map(lambda x: (x[0], x)).saveAsNewAPIHadoopDataset(
...     conf=conf,
...     keyConverter=keyConv,
...     valueConverter=valueConv)

Then it prints a few INFO-level messages about submitting a task etc., but then it just hangs. The very same code runs OK on Spark 1.1.0 - the records get stored in HBase.

thanks,
Antony.
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
bq. hbase.zookeeper.quorum: localhost

You are running the hbase cluster in standalone mode? Is the hbase-client jar in the classpath?

Cheers
Re: SchemaRDD to RDD[String]
Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: I want to convert a schemaRDD into RDD of String. How can we do that? Currently I am doing like this which is not converting correctly no exception but resultant strings are empty here is my code Hehe, this is the most Java-ish Scala code I have ever seen ;-) Having said that, are you sure that your rows are not empty? The code looks correct to me, actually. Tobias
Re: SchemaRDD to RDD[String]
You might also try the following, which I think is equivalent:

schemaRDD.map(_.mkString(","))

On Wed, Dec 24, 2014 at 8:12 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: I want to convert a schemaRDD into RDD of String. How can we do that? Currently I am doing like this which is not converting correctly, no exception, but the resultant strings are empty; here is my code Hehe, this is the most Java-ish Scala code I have ever seen ;-) Having said that, are you sure that your rows are not empty? The code looks correct to me, actually. Tobias
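A small self-contained Scala sketch of that suggestion against the Spark 1.2-era API (the case class and sample data are made up for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)   // sc is an existing SparkContext
import sqlContext.createSchemaRDD     // implicit conversion RDD[Person] -> SchemaRDD

val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
val schemaRDD = sqlContext.createSchemaRDD(people)

// Row behaves like a Seq of column values in this API version, so mkString
// joins every column of the row into one comma-separated String.
val strings: RDD[String] = schemaRDD.map(_.mkString(","))
strings.collect().foreach(println)    // prints "alice,30" and "bob,25"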
Re: Escape commas in file names
No, there is not. Can you open a JIRA? On Tue, Dec 23, 2014 at 6:33 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I am trying to load a Parquet file which has a comma in its name. Yes, this is a valid file name in HDFS. However, sqlContext.parquetFile interprets this as a comma-separated list of parquet files. Is there any way to escape the comma so it is treated as part of a single file name? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io
Re: Not Serializable exception when integrating SQL and Spark Streaming
The various spark contexts generally aren't serializable because you can't use them on the executors anyway. We made SQLContext serializable just because it gets pulled into scope more often due to the implicit conversions it contains. You should try marking the variable that holds the context with the annotation @transient.
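For reference, a minimal Scala sketch of the @transient pattern described above (the class name, the case class Person, and the switch from the Java API in the original post to the Scala API are all assumptions made for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Person(name: String, age: Int)

class StreamingJob(@transient sc: SparkContext) extends Serializable {
  // @transient keeps the context out of any serialized closure; it is only
  // ever used on the driver inside foreachRDD, never on the executors.
  @transient lazy val sqlContext = new SQLContext(sc)

  def process(stream: DStream[Person]): Unit = {
    stream.foreachRDD { rdd =>
      import sqlContext.createSchemaRDD   // implicit RDD[Person] -> SchemaRDD
      rdd.registerTempTable("people")
      val names = sqlContext.sql("SELECT name FROM people")
      // The closure below captures nothing from the enclosing class, so only a
      // tiny serializable function is shipped to the executors.
      names.map(_.getString(0)).collect().foreach(println)
    }
  }
}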
Re: hiveContext.jsonFile fails with Unexpected close marker
Each JSON object needs to be on a single line, since this is the boundary the TextFileInputFormat uses when splitting up large files.

On Wed, Dec 24, 2014 at 12:34 PM, elliott cordo elliottco...@gmail.com wrote:

I have generally been impressed with the way jsonFile eats just about any json data model... but I am getting this error when I try to ingest this file:

Unexpected close marker ']': expected '}'

Here are the commands from the pyspark shell:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
f = hiveContext.jsonFile("sample.json")

Here is some sample json:

{"wf_session": [
{"id":"6021fb91-c9ec-4019-9ab9-f628aee8d259","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:31.373Z","screen":"x","type":1,"time_left_secs":1},
{"id":"7e696c19-3ba4-4469-be28-5ef1f0c03d63","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.385Z","screen":"x","type":2,"ad_unit_id":null,"spot_started_at":"2014-12-19T15:55:12.364Z","spot_ended_at":"2014-12-19T15:55:32.385Z","spot_duration_secs":20,"impression_count":0,"impressions":[],"engagement_count":0,"engagements":[]},
{"id":"68a43006-09bc-4c18-af55-1ebdc0e041a3","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.375Z","screen":"x","type":3,"duration_secs":20,"to_ad_unit_id":"developmentbea1f3a4-be08-4119-b9f4-7"}
] }

Any help would be appreciated! :) Merry Xmas!
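To illustrate the point above: the same document loads if the entire top-level object is collapsed onto one physical line (field values abbreviated here with "..."), since jsonFile then sees exactly one complete JSON record per line:

{"wf_session": [{"id": "6021fb91-...", "type": 1, ...}, {"id": "7e696c19-...", "type": 2, ...}, {"id": "68a43006-...", "type": 3, ...}]}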
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
I am running it in yarn-client mode, and I believe hbase-client is part of the spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar which I am submitting at launch.

Adding another jstack taken during the hang - http://pastebin.com/QDQrBw70 - this one is of the CoarseGrainedExecutorBackend process and it is referencing hbase and zookeeper.

thanks,
Antony.
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
also hbase itself works ok:

hbase(main):006:0> scan 'test'
ROW        COLUMN+CELL
 key1      column=f1:asd, timestamp=1419463092904, value=456
1 row(s) in 0.0250 seconds
hbase(main):007:0> put 'test', 'testkey', 'f1:testqual', 'testval'
0 row(s) in 0.0170 seconds
hbase(main):008:0> scan 'test'
ROW        COLUMN+CELL
 key1      column=f1:asd, timestamp=1419463092904, value=456
 testkey   column=f1:testqual, timestamp=1419487275905, value=testval
2 row(s) in 0.0270 seconds
Question on saveAsTextFile with overwrite option
Hi, We have a requirement to save RDD output to HDFS with a saveAsTextFile-like API, but we need to overwrite the data if it already exists. I'm not sure whether current Spark supports this kind of operation, or whether I need to check for it manually. There's a thread in the mailing list that discussed this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), but I'm not sure whether this feature is enabled or not, or which configurations it needs. I'd appreciate your suggestions. Thanks a lot, Jerry
Re: Question on saveAsTextFile with overwrite option
Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
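A minimal Scala sketch of that workaround (the app name and output path are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Disables Hadoop's "output directory already exists" check so saveAsTextFile
// will write into an existing path. Note the caveat discussed below: old part
// files are not removed first, so overwriting with fewer partitions can leave
// a mix of old and new data behind.
val conf = new SparkConf()
  .setAppName("overwrite-example")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)

sc.parallelize(1 to 10).saveAsTextFile("hdfs:///tmp/existing-output-dir")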
RE: Question on saveAsTextFile with overwrite option
I am wondering if we can provide more friendly API, other than configuration for this purpose. What do you think Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Question on saveAsTextFile with overwrite option
So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that has 10 partitions... and if they read back the RDD they would get a corrupted RDD that has a combination of data from the old and new RDD. If users want to circumvent these safety checks, we need to make them explicitly disable them. Given this, I think a config option is as reasonable as any alternatives. This is already pretty easy IMO. - Patrick On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote: I am wondering if we can provide more friendly API, other than configuration for this purpose. What do you think Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Question on saveAsTextFile with overwrite option
Thanks Patrick for your detailed explanation. BR Jerry -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:43 PM To: Cheng, Hao Cc: Shao, Saisai; user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that has 10 partitions... and if they read back the RDD they would get a corrupted RDD that has a combination of data from the old and new RDD. If users want to circumvent these safety checks, we need to make them explicitly disable them. Given this, I think a config option is as reasonable as any alternatives. This is already pretty easy IMO. - Patrick On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote: I am wondering if we can provide more friendly API, other than configuration for this purpose. What do you think Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-S p ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to build Spark against the latest
What options should I use when running the make-distribution.sh script? I tried

./make-distribution.sh --hadoop.version 2.6.0 --with-yarn -with-hive --with-tachyon --tgz

but nothing came out.

Regards

-- Original --
From: guxiaobo1982 guxiaobo1...@qq.com
Send time: Wednesday, Dec 24, 2014 6:52 PM
To: Ted Yu yuzhih...@gmail.com
Cc: user@spark.apache.org
Subject: Re: How to build Spark against the latest

Hi Ted, The referenced command works, but where can I get the deployable binaries? Xiaobo Gu

-- Original --
From: Ted Yu yuzhih...@gmail.com
Send time: Wednesday, Dec 24, 2014 12:09 PM
To: guxiaobo1...@qq.com
Cc: user@spark.apache.org
Subject: Re: How to build Spark against the latest

See http://search-hadoop.com/m/JW1q5Cew0j

On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, The official pom.xml file only has a profile for hadoop version 2.4 as the latest version, but I installed hadoop version 2.6.0 with ambari. How can I build Spark against it, just using mvn -Dhadoop.version=2.6.0, or how do I make a corresponding profile for it? Regards, Xiaobo
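For what it's worth, a hedged sketch of the kind of invocation the 1.x build documentation describes for this case (double-check the profile names against your Spark source tree; at the time there was no dedicated hadoop-2.6 profile, so the hadoop-2.4 profile is reused with the version overridden, and the older --hadoop.version/--with-yarn style flags appear to have been replaced by Maven profile arguments passed straight through to mvn):

./make-distribution.sh --name hadoop2.6 --tgz -Phadoop-2.4 -Dhadoop.version=2.6.0 -Pyarn -Phive -Phive-thriftserver

If the build succeeds, the deployable tarball is written to the top of the source tree (spark-<version>-bin-hadoop2.6.tgz in this example).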