Re: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Cheng Lian
Generally you can use -Dsun.io.serialization.extendedDebugInfo=true to
enable serialization debugging information when serialization exceptions
are raised.
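For example, when launching with spark-submit, the flag can be passed to the driver (and, if needed, to the executors) roughly like this (a sketch; adjust to your deploy mode):

spark-submit \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true" \
  ...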


On 12/24/14 1:32 PM, bigdata4u wrote:


I am trying to use SQL over Spark Streaming using Java, but I am getting a
serialization exception.

public static void main(String args[]) {
    SparkConf sparkConf = new SparkConf().setAppName("NumberCount");
    JavaSparkContext jc = new JavaSparkContext(sparkConf);
    JavaStreamingContext jssc = new JavaStreamingContext(jc, new Duration(2000));
    jssc.addStreamingListener(new WorkCountMonitor());
    int numThreads = Integer.parseInt(args[3]);
    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    String[] topics = args[2].split(",");
    for (String topic : topics) {
        topicMap.put(topic, numThreads);
    }
    JavaPairReceiverInputDStream<String, String> data =
        KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
    data.print();

    JavaDStream<Person> streamData = data.map(new Function<Tuple2<String, String>, Person>() {
        public Person call(Tuple2<String, String> v1) throws Exception {
            String[] stringArray = v1._2.split(",");
            Person person = new Person();
            person.setName(stringArray[0]);
            person.setAge(stringArray[1]);
            return person;
        }
    });

    final JavaSQLContext sqlContext = new JavaSQLContext(jc);
    streamData.foreachRDD(new Function<JavaRDD<Person>, Void>() {
        public Void call(JavaRDD<Person> rdd) {

            JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd, Person.class);

            subscriberSchema.registerAsTable("people");
            System.out.println("all data");
            JavaSchemaRDD names = sqlContext.sql("SELECT name FROM people");
            System.out.println("afterwards");

            List<String> males = new ArrayList<String>();

            males = names.map(new Function<Row, String>() {
                public String call(Row row) {
                    return row.getString(0);
                }
            }).collect();
            System.out.println("before for");
            for (String name : males) {
                System.out.println(name);
            }
            return null;
        }
    });
    jssc.start();
    jssc.awaitTermination();
}

Exception is

14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job
1419378578000 ms.1 org.apache.spark.SparkException: Task not serializable at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at
org.apache.spark.SparkContext.clean(SparkContext.scala:1435) at
org.apache.spark.rdd.RDD.map(RDD.scala:271) at
org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78) at
org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42) at
com.basic.spark.NumberCount$2.call(NumberCount.java:79) at
com.basic.spark.NumberCount$2.call(NumberCount.java:67) at
org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
at
org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
at
org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
at
org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
at
org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at
org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161) at
org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.NotSerializableException:
org.apache.spark.sql.api.java.JavaSQLContext at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181) at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175) at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175) at
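In this trace the non-serializable object is the JavaSQLContext that the inner Row-to-String function drags into its closure. A common way around this (a sketch, not something suggested in the thread) is to avoid capturing the outer context, for example by creating it locally inside the foreachRDD function, or by making the inner function a standalone class:

streamData.foreachRDD(new Function<JavaRDD<Person>, Void>() {
    public Void call(JavaRDD<Person> rdd) {
        // Created as a local here, so the inner Row -> String function no longer
        // pulls a JavaSQLContext field into its serialized closure.
        JavaSQLContext sqlContext = new JavaSQLContext(new JavaSparkContext(rdd.context()));
        JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd, Person.class);
        subscriberSchema.registerAsTable("people");
        List<String> names = sqlContext.sql("SELECT name FROM people")
            .map(new Function<Row, String>() {
                public String call(Row row) { return row.getString(0); }
            }).collect();
        return null;
    }
});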

Re: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-24 Thread Cheng Lian
Hao and Lam - I think the issue here is that registerRDDAsTable only
creates a temporary table, which is not visible to the Hive metastore.


Michael once gave a workaround for creating an external Parquet table:
http://apache-spark-user-list.1001560.n3.nabble.com/persist-table-schema-in-spark-sql-td16297.html
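For reference, the general shape of that kind of workaround (a sketch only: the column list is illustrative, and it assumes a Hive version whose DDL understands STORED AS PARQUET; see Michael's post above for the exact recipe he gave):

rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
hiveContext.sql("""
  CREATE EXTERNAL TABLE my_data (name STRING, age INT)
  STORED AS PARQUET
  LOCATION '/user/spark/my_data.parquet'
""")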


Cheng

On 12/24/14 9:38 AM, Cheng, Hao wrote:

Hi Lam, I can confirm this is a bug with the latest master, and I
filed a JIRA issue for it:


https://issues.apache.org/jira/browse/SPARK-4944

Hopefully we will come up with a solution soon.

Cheng Hao

*From:*Jerry Lam [mailto:chiling...@gmail.com]
*Sent:* Wednesday, December 24, 2014 4:26 AM
*To:* user@spark.apache.org
*Subject:* SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

Hi spark users,

I'm trying to create an external table using HiveContext after creating a
SchemaRDD and saving the RDD into a Parquet file on HDFS.


I would like to use the schema of the SchemaRDD (rdd_table) when I
create the external table.


For example:

rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")

hiveContext.registerRDDAsTable(rdd_table, "rdd_table")

hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION '/user/spark/my_data.parquet'")


the last line fails with:

org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask. Table not found rdd_table


at 
org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)


at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)


at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)


at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)


at 
org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)


at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)


at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)


Is this supported?

Best Regards,

Jerry


​


Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-24 Thread Vladimir Protsenko
The Java 8 64-bit RPM downloaded from the official Oracle site solved my problem.
I did not need to set the max heap size; the final memory shown at the end of the
Maven build was 81/1943M. I want to learn Spark, so I have no restriction on
choosing the Java version.

Guru Medasani, thanks for the tip.

I will repeat the info that I wrongly sent only to Sean: I tried export
MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and it
doesn't work either.

Best Regards,
Vladimir Protsenko

2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com:

 Thanks for the clarification Sean.

 Best Regards,
 Guru Medasani




  From: so...@cloudera.com
  Date: Tue, 23 Dec 2014 15:39:59 +
  Subject: Re: Spark Installation Maven PermGen OutOfMemoryException
  To: gdm...@outlook.com
  CC: protsenk...@gmail.com; user@spark.apache.org

 
  The text there is actually unclear. In Java 8, you still need to set
  the max heap size (-Xmx2g). The optional bit is the
  -XX:MaxPermSize=512M actually. Java 8 no longer has a separate
  permanent generation.
 
  On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com
 wrote:
   Hi Vladimir,
  
   From the link Sean posted, if you use Java 8 there is this following
 note.
  
   Note: For Java 8 and above this step is not required.
  
   So if you have no problems using Java 8, give it a shot.
  
   Best Regards,
   Guru Medasani
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-24 Thread Sean Owen
That command is still wrong. It is -Xmx3g with no =.
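That is, something like the line below (note also the plain double quotes; the backticks in the original command would be run as command substitution by the shell):

export MAVEN_OPTS="-Xmx3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g"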
On Dec 24, 2014 9:50 AM, Vladimir Protsenko protsenk...@gmail.com wrote:

 Java 8 rpm 64bit downloaded from official oracle site solved my problem.
 And I need not set max heap size, final memory shown at the end of maven
 build was 81/1943M. I want to learn spark so have no restriction on
 choosing java version.

 Guru Medasani, thanks for the tip.

 I will repeat info, that I wrongly send only to Sean. I have tried export
 MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and
 it doesn't work also.

 Best Regards,
 Vladimir Protsenko

 2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com:

 Thanks for the clarification Sean.

 Best Regards,
 Guru Medasani




  From: so...@cloudera.com
  Date: Tue, 23 Dec 2014 15:39:59 +
  Subject: Re: Spark Installation Maven PermGen OutOfMemoryException
  To: gdm...@outlook.com
  CC: protsenk...@gmail.com; user@spark.apache.org

 
  The text there is actually unclear. In Java 8, you still need to set
  the max heap size (-Xmx2g). The optional bit is the
  -XX:MaxPermSize=512M actually. Java 8 no longer has a separate
  permanent generation.
 
  On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com
 wrote:
   Hi Vladimir,
  
   From the link Sean posted, if you use Java 8 there is this following
 note.
  
   Note: For Java 8 and above this step is not required.
  
   So if you have no problems using Java 8, give it a shot.
  
   Best Regards,
   Guru Medasani
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 





got ”org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80“ from hive metastroe service when I use show tables command in spark-sql shell

2014-12-24 Thread Roc Chu
This is my problem: I use MySQL to store the Hive metadata, and I can get what
I want when I run show tables in the Hive shell. But on the same machine, when I
use spark-sql to execute the same command (show tables), I get errors.
Looking at the log of the Hive metastore, I find these errors:

2014-12-24 05:04:59,874 ERROR [pool-3-thread-2]: server.TThreadPoolServer
(TThreadPoolServer.java:run(294)) - Thrift error occurred during processing
of message.
org.apache.thrift.protocol.TProtocolException: Expected protocol id
ff82 but got ff80
at
org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503)
at
org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2014-12-24 05:05:01,883 ERROR [pool-3-thread-3]: server.TThreadPoolServer
(TThreadPoolServer.java:run(294)) - Thrift error occurred during processing
of message.
org.apache.thrift.protocol.TProtocolException: Expected protocol id
ff82 but got ff80
at
org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503)
at
org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I compiled Spark from the source code of spark-1.1.1,
and my Hive was compiled from apache-hive-0.15.0.

I put the same hive-site.xml in both the Hive conf and the Spark conf.

waiting for help
thanks


Re: How to build Spark against the latest

2014-12-24 Thread guxiaobo1982
Hi Ted,
The referenced command works, but where can I get the deployable binaries?


Xiaobo Gu
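For what it's worth, a deployable tarball is usually produced from a source checkout with the make-distribution.sh script at the root of the Spark source tree, roughly like this (the profiles shown are only a sketch; adjust them to your Hadoop/YARN setup):

./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0

The resulting .tgz in the source directory is what you can copy to the cluster nodes.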








-- Original --
From:  Ted Yu;yuzhih...@gmail.com;
Send time: Wednesday, Dec 24, 2014 12:09 PM
To: guxiaobo1...@qq.com; 
Cc: user@spark.apache.orguser@spark.apache.org; 
Subject:  Re: How to build Spark against the latest



See http://search-hadoop.com/m/JW1q5Cew0j

On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi,
The official pom.xml file only has a profile for Hadoop 2.4 as the latest version,
but I installed Hadoop 2.6.0 with Ambari. How can I build Spark against it: just by
using mvn -Dhadoop.version=2.6.0, or do I need to make a corresponding profile for it?


Regards,


Xiaobo

Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-24 Thread Sasi
Dear All,

We are trying to share RDDs across different sessions of same Web
application (Java). We need to share single RDD between those sessions. As
we understand from some posts, it is possible through Spark-JobServer.

Are there any guidelines you can provide for setting up Spark-JobServer with Maven
(for Apache Spark Java programming)?

We would be glad for your suggestions.

Sasi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-for-Spark-JobServer-setup-on-Maven-for-Java-programming-tp20849.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Why does consuming a RESTful web service (using javax.ws.rs.* and Jsersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Emre Sevinc
Hello,

I have a piece of code that runs inside Spark Streaming and tries to get
some data from a RESTful web service (that runs locally on my machine). The
code snippet in question is:

Client client = ClientBuilder.newClient();
WebTarget target = client.target("http://localhost:/rest");
target = target.path("annotate")
               .queryParam("text", UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission))
               .queryParam("confidence", 0.3);

logger.warn("!!! DEBUG !!! target: {}", target.getUri().toString());

String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class);

logger.warn("!!! DEBUG !!! Spotlight response: {}", response);

When run inside a unit test as follows:

 mvn clean test -Dtest=SpotlightTest#testCountWords

it contacts the RESTful web service and retrieves some data as expected.
But when the same code is run as part of the application that is submitted
to Spark, using spark-submit script I receive the following error:

  java.lang.NoSuchMethodError:
javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V

I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in
my project's pom.xml:

<dependency>
  <groupId>org.glassfish.jersey.containers</groupId>
  <artifactId>jersey-container-servlet-core</artifactId>
  <version>2.14</version>
</dependency>

So I suspect that when the application is submitted to Spark, somehow
there's a different JAR in the environment that uses a different version of
Jersey / javax.ws.rs.*

Does anybody know which version of Jersey / javax.ws.rs.*  is used in the
Spark environment, or how to solve this conflict?


-- 
Emre Sevinç
https://be.linkedin.com/in/emresevinc/


Re: got ”org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80“ from hive metastroe service when I use show tables command in spark-sql shell

2014-12-24 Thread Cheng Lian

Hi Roc,

Spark SQL 1.2.0 can only work with Hive 0.12.0 or Hive 0.13.1
(controlled by compilation flags); versions prior to 1.2.0 only work with
Hive 0.12.0. So Hive 0.15.0-SNAPSHOT is not an option.


I would like to add that this is due to a backwards compatibility issue of the
Hive metastore; AFAIK there is no popular system that works with multiple
versions of the Hive metastore simultaneously.
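For completeness, a rough sketch of what those compilation flags look like when building 1.2.0 yourself (profile names as I recall them from the 1.2 build docs; please double-check building-spark.html for your exact version):

# Hive 0.13.1 support (the 1.2.0 default when -Phive is enabled)
mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package

# or, to talk to a Hive 0.12.0 metastore
mvn -Pyarn -Phive -Phive-0.12.0 -DskipTests clean package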


Cheng

On 12/24/14 6:31 PM, Roc Chu wrote:
this is my problem. I use mysql to store hive meta data. and i can get 
what i want when I exec show tables in hive shell. but in the same 
machine. I use spark-sql to execute same command (show tables), I got 
errors.

I look at the log of hive metastore find this errors

2014-12-24 05:04:59,874 ERROR [pool-3-thread-2]: 
server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift 
error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Expected protocol id 
ff82 but got ff80
at 
org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503)
at 
org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)
2014-12-24 05:05:01,883 ERROR [pool-3-thread-3]: 
server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift 
error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Expected protocol id 
ff82 but got ff80
at 
org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503)
at 
org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

I compiled e spark from soucecode of spark-1.1.1
and my hive was compiled from apache-hive-0.15.0

I put the same hive-site.xml both hive conf and apark conf.

waiting for help
thanks





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-24 Thread Vladimir Protsenko
Thanks. Bad mistake.

2014-12-24 14:02 GMT+04:00 Sean Owen so...@cloudera.com:

 That command is still wrong. It is -Xmx3g with no =.
 On Dec 24, 2014 9:50 AM, Vladimir Protsenko protsenk...@gmail.com
 wrote:

 Java 8 rpm 64bit downloaded from official oracle site solved my problem.
 And I need not set max heap size, final memory shown at the end of maven
 build was 81/1943M. I want to learn spark so have no restriction on
 choosing java version.

 Guru Medasani, thanks for the tip.

 I will repeat info, that I wrongly send only to Sean. I have tried export
 MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and
 it doesn't work also.

 Best Regards,
 Vladimir Protsenko

 2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com:

 Thanks for the clarification Sean.

 Best Regards,
 Guru Medasani




  From: so...@cloudera.com
  Date: Tue, 23 Dec 2014 15:39:59 +
  Subject: Re: Spark Installation Maven PermGen OutOfMemoryException
  To: gdm...@outlook.com
  CC: protsenk...@gmail.com; user@spark.apache.org

 
  The text there is actually unclear. In Java 8, you still need to set
  the max heap size (-Xmx2g). The optional bit is the
  -XX:MaxPermSize=512M actually. Java 8 no longer has a separate
  permanent generation.
 
  On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com
 wrote:
   Hi Vladimir,
  
   From the link Sean posted, if you use Java 8 there is this following
 note.
  
   Note: For Java 8 and above this step is not required.
  
   So if you have no problems using Java 8, give it a shot.
  
   Best Regards,
   Guru Medasani
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 





Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jsersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Sean Owen
Your guess is right, that there are two incompatible versions of
Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey,
but its transitive dependencies may, or your transitive dependencies
may.

I don't see Jersey in Spark's dependency tree except from HBase tests,
which in turn only appear in examples, so that's unlikely to be it.
I'd take a look with 'mvn dependency:tree' on your own code first.
Maybe you are including JavaEE 6 for example?

On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote:
 Hello,

 I have a piece of code that runs inside Spark Streaming and tries to get
 some data from a RESTful web service (that runs locally on my machine). The
 code snippet in question is:

  Client client = ClientBuilder.newClient();
  WebTarget target = client.target(http://localhost:/rest;);
  target = target.path(annotate)
  .queryParam(text,
 UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission))
  .queryParam(confidence, 0.3);

   logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString());

   String response =
 target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class);

   logger.warn(!!! DEBUG !!! Spotlight response: {}, response);

 When run inside a unit test as follows:

  mvn clean test -Dtest=SpotlightTest#testCountWords

 it contacts the RESTful web service and retrieves some data as expected. But
 when the same code is run as part of the application that is submitted to
 Spark, using spark-submit script I receive the following error:

   java.lang.NoSuchMethodError:
 javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V

 I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in
 my project's pom.xml:

  dependency
   groupIdorg.glassfish.jersey.containers/groupId
   artifactIdjersey-container-servlet-core/artifactId
   version2.14/version
 /dependency

 So I suspect that when the application is submitted to Spark, somehow
 there's a different JAR in the environment that uses a different version of
 Jersey / javax.ws.rs.*

 Does anybody know which version of Jersey / javax.ws.rs.*  is used in the
 Spark environment, or how to solve this conflict?


 --
 Emre Sevinç
 https://be.linkedin.com/in/emresevinc/


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
Hi,
I have been using this without any issues with Spark 1.1.0, but after upgrading to
1.2.0, saving an RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just
hangs - even when testing with the example from the stock hbase_outputformat.py.

Is anyone having the same issue? (And able to solve it?)

Using HBase 0.98.6 and yarn-client mode.

Thanks,
Antony


Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jsersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Emre Sevinc
On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote:

 I'd take a look with 'mvn dependency:tree' on your own code first.
 Maybe you are including JavaEE 6 for example?


For reference, my complete pom.xml looks like:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>bigcontent</groupId>
  <artifactId>bigcontent</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>bigcontent</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.3</version>
        <configuration>
          <!-- put your configurations here -->
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.2</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.1.1</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.glassfish.jersey.containers</groupId>
      <artifactId>jersey-container-servlet-core</artifactId>
      <version>2.14</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.4.0</version>
    </dependency>

    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>16.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.4.0</version>
    </dependency>

    <dependency>
      <groupId>json-mapreduce</groupId>
      <artifactId>json-mapreduce</artifactId>
      <version>1.0-SNAPSHOT</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>commons-io</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>commons-lang</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
      <version>1.7.7</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.7.7</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.4.0</version>
      <scope>provided</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.google.guava</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.7.7</version>
    </dependency>
  </dependencies>
</project>

And 'mvn dependency:tree' produces the following output:



[INFO] Scanning for projects...
[INFO]

[INFO]

[INFO] Building bigcontent 1.0-SNAPSHOT
[INFO]

[INFO]
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ bigcontent ---
[INFO] bigcontent:bigcontent:jar:1.0-SNAPSHOT
[INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.1.1:provided
[INFO] |  +- org.apache.spark:spark-core_2.10:jar:1.1.1:provided
[INFO] |  |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
[INFO] |  |  |  \- org.apache.curator:curator-framework:jar:2.4.0:provided
[INFO] |  |  | \- 

Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jsersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Emre Sevinc
It seems like YARN depends on an older version of Jersey, namely 1.9:

  https://github.com/apache/spark/blob/master/yarn/pom.xml

When I've modified my dependencies to have only:

<dependency>
  <groupId>com.sun.jersey</groupId>
  <artifactId>jersey-core</artifactId>
  <version>1.9.1</version>
</dependency>

And then modified the code to use the old Jersey API:

Client c = Client.create();
WebResource r = c.resource("http://localhost:/rest")
                 .path("annotate")
                 .queryParam("text", UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission))
                 .queryParam("confidence", "0.3");

logger.warn("!!! DEBUG !!! target: {}", r.getURI());

String response = r.accept(MediaType.APPLICATION_JSON_TYPE)
                   //.header()
                   .get(String.class);

logger.warn("!!! DEBUG !!! Spotlight response: {}", response);

It seems to work when I use spark-submit to submit the application that
includes this code.

Funny thing is, now my relevant unit test does not run, complaining about
not having enough memory:

Java HotSpot(TM) 64-Bit Server VM warning: INFO:
os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 25165824 bytes for
committing reserved memory.

--
Emre


On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote:

 Your guess is right, that there are two incompatible versions of
 Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey,
 but its transitive dependencies may, or your transitive dependencies
 may.

 I don't see Jersey in Spark's dependency tree except from HBase tests,
 which in turn only appear in examples, so that's unlikely to be it.
 I'd take a look with 'mvn dependency:tree' on your own code first.
 Maybe you are including JavaEE 6 for example?

 On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com
 wrote:
  Hello,
 
  I have a piece of code that runs inside Spark Streaming and tries to get
  some data from a RESTful web service (that runs locally on my machine).
 The
  code snippet in question is:
 
   Client client = ClientBuilder.newClient();
   WebTarget target = client.target(http://localhost:/rest;);
   target = target.path(annotate)
   .queryParam(text,
  UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission))
   .queryParam(confidence, 0.3);
 
logger.warn(!!! DEBUG !!! target: {},
 target.getUri().toString());
 
String response =
 
 target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class);
 
logger.warn(!!! DEBUG !!! Spotlight response: {}, response);
 
  When run inside a unit test as follows:
 
   mvn clean test -Dtest=SpotlightTest#testCountWords
 
  it contacts the RESTful web service and retrieves some data as expected.
 But
  when the same code is run as part of the application that is submitted to
  Spark, using spark-submit script I receive the following error:
 
java.lang.NoSuchMethodError:
 
 javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V
 
  I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey
 in
  my project's pom.xml:
 
   dependency
groupIdorg.glassfish.jersey.containers/groupId
artifactIdjersey-container-servlet-core/artifactId
version2.14/version
  /dependency
 
  So I suspect that when the application is submitted to Spark, somehow
  there's a different JAR in the environment that uses a different version
 of
  Jersey / javax.ws.rs.*
 
  Does anybody know which version of Jersey / javax.ws.rs.*  is used in the
  Spark environment, or how to solve this conflict?
 
 
  --
  Emre Sevinç
  https://be.linkedin.com/in/emresevinc/
 




-- 
Emre Sevinc


Re: Single worker locked at 100% CPU

2014-12-24 Thread Phil Wills
Turns out that I was just being idiotic and had assigned so much memory to
Spark that the O/S was ending up continually swapping.  Apologies for the
noise.

Phil

On Wed, Dec 24, 2014 at 1:16 AM, Andrew Ash and...@andrewash.com wrote:

 Hi Phil,

 This sounds a lot like a deadlock in Hadoop's Configuration object that I
 ran into a while back.  If you jstack the JVM and see a thread that looks
 like the below, it could be
 https://issues.apache.org/jira/browse/SPARK-2546

 Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 
 nid=0x54b1 runnable [0x7f92d74f1000]
java.lang.Thread.State: RUNNABLE
 at java.util.HashMap.transfer(HashMap.java:601)
 at java.util.HashMap.resize(HashMap.java:581)
 at java.util.HashMap.addEntry(HashMap.java:879)
 at java.util.HashMap.put(HashMap.java:505)
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
 at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662)


 The fix for this issue is hidden behind a flag because it might have
 performance implications, but if it is this problem then you can set
 spark.hadoop.cloneConf=true and see if that fixes things.
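For example, that flag can be passed at submit time:

  spark-submit --conf spark.hadoop.cloneConf=true ...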

 Good luck!
 Andrew

 On Tue, Dec 23, 2014 at 9:40 AM, Phil Wills otherp...@gmail.com wrote:

 I've been attempting to run a job based on MLlib's ALS implementation for
 a while now and have hit an issue I'm having a lot of difficulty getting to
 the bottom of.

 On a moderate size set of input data it works fine, but against larger
 (still well short of what I'd think of as big) sets of data, I'll see one
 or two workers get stuck spinning at 100% CPU and the job unable to
 recover.

 I don't believe this is down to memory pressure as I seem to get the same
 behaviour at about the same size of input data, even if  the cluster is
 twice as large. GC logs also suggest things are proceeding reasonably with
 some Full GC's occurring, but no suggestion of the process being GC locked.

 After rebooting the instance that got into trouble, I can see the stderr
 log for the task truncated in the middle of a log-line at the time CPU
 shoots to and sticks at 100%, but no other signs of a problem.

 I've run into the same issue on 1.1.0 and 1.2.0 in standalone mode and
 running on YARN.

 Any suggestions on further steps I could try to get a clearer diagnosis
 of the issue would be much appreciated.

 Thanks,

 Phil





Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jsersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Sean Owen
That could well be it -- oops, I forgot to run with the YARN profile
and so didn't see the YARN dependencies. Try the userClassPathFirst
option to try to make your app's copy take precedence.

The second error is really a JVM bug, but it comes from having too little
memory available for the unit tests.

http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
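For example (treat the exact keys as assumptions and check the configuration docs for your version; around 1.1/1.2 the experimental settings were spark.files.userClassPathFirst for executors and spark.yarn.user.classpath.first on YARN):

spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --conf spark.yarn.user.classpath.first=true \
  ...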

On Wed, Dec 24, 2014 at 1:56 PM, Emre Sevinc emre.sev...@gmail.com wrote:
 It seems like YARN depends an older version of Jersey, that is 1.9:

   https://github.com/apache/spark/blob/master/yarn/pom.xml

 When I've modified my dependencies to have only:

   dependency
   groupIdcom.sun.jersey/groupId
   artifactIdjersey-core/artifactId
   version1.9.1/version
 /dependency

 And then modified the code to use the old Jersey API:

 Client c = Client.create();
 WebResource r = c.resource(http://localhost:/rest;)
  .path(annotate)
  .queryParam(text,
 UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission))
  .queryParam(confidence, 0.3);

 logger.warn(!!! DEBUG !!! target: {}, r.getURI());

 String response = r.accept(MediaType.APPLICATION_JSON_TYPE)
//.header()
.get(String.class);

 logger.warn(!!! DEBUG !!! Spotlight response: {}, response);

 It seems to work when I use spark-submit to submit the application that
 includes this code.

 Funny thing is, now my relevant unit test does not run, complaining about
 not having enough memory:

 Java HotSpot(TM) 64-Bit Server VM warning: INFO:
 os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (mmap) failed to map 25165824 bytes for
 committing reserved memory.

 --
 Emre


 On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote:

 Your guess is right, that there are two incompatible versions of
 Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey,
 but its transitive dependencies may, or your transitive dependencies
 may.

 I don't see Jersey in Spark's dependency tree except from HBase tests,
 which in turn only appear in examples, so that's unlikely to be it.
 I'd take a look with 'mvn dependency:tree' on your own code first.
 Maybe you are including JavaEE 6 for example?

 On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com
 wrote:
  Hello,
 
  I have a piece of code that runs inside Spark Streaming and tries to get
  some data from a RESTful web service (that runs locally on my machine).
  The
  code snippet in question is:
 
   Client client = ClientBuilder.newClient();
   WebTarget target = client.target(http://localhost:/rest;);
   target = target.path(annotate)
   .queryParam(text,
  UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission))
   .queryParam(confidence, 0.3);
 
logger.warn(!!! DEBUG !!! target: {},
  target.getUri().toString());
 
String response =
 
  target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class);
 
logger.warn(!!! DEBUG !!! Spotlight response: {}, response);
 
  When run inside a unit test as follows:
 
   mvn clean test -Dtest=SpotlightTest#testCountWords
 
  it contacts the RESTful web service and retrieves some data as expected.
  But
  when the same code is run as part of the application that is submitted
  to
  Spark, using spark-submit script I receive the following error:
 
java.lang.NoSuchMethodError:
 
  javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V
 
  I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey
  in
  my project's pom.xml:
 
   dependency
groupIdorg.glassfish.jersey.containers/groupId
artifactIdjersey-container-servlet-core/artifactId
version2.14/version
  /dependency
 
  So I suspect that when the application is submitted to Spark, somehow
  there's a different JAR in the environment that uses a different version
  of
  Jersey / javax.ws.rs.*
 
  Does anybody know which version of Jersey / javax.ws.rs.*  is used in
  the
  Spark environment, or how to solve this conflict?
 
 
  --
  Emre Sevinç
  https://be.linkedin.com/in/emresevinc/
 




 --
 Emre Sevinc

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jsersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Emre Sevinc
Sean,

Thanks a lot for the important information, especially  userClassPathFirst.

Cheers,
Emre

On Wed, Dec 24, 2014 at 3:38 PM, Sean Owen so...@cloudera.com wrote:

 That could well be it -- oops, I forgot to run with the YARN profile
 and so didn't see the YARN dependencies. Try the userClassPathFirst
 option to try to make your app's copy take precedence.

 The second error is really a JVM bug, but, is from having too little
 memory available for the unit tests.


 http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage

 On Wed, Dec 24, 2014 at 1:56 PM, Emre Sevinc emre.sev...@gmail.com
 wrote:
  It seems like YARN depends an older version of Jersey, that is 1.9:
 
https://github.com/apache/spark/blob/master/yarn/pom.xml
 
  When I've modified my dependencies to have only:
 
dependency
groupIdcom.sun.jersey/groupId
artifactIdjersey-core/artifactId
version1.9.1/version
  /dependency
 
  And then modified the code to use the old Jersey API:
 
  Client c = Client.create();
  WebResource r = c.resource(http://localhost:/rest;)
   .path(annotate)
   .queryParam(text,
  UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission))
   .queryParam(confidence, 0.3);
 
  logger.warn(!!! DEBUG !!! target: {}, r.getURI());
 
  String response = r.accept(MediaType.APPLICATION_JSON_TYPE)
 //.header()
 .get(String.class);
 
  logger.warn(!!! DEBUG !!! Spotlight response: {}, response);
 
  It seems to work when I use spark-submit to submit the application that
  includes this code.
 
  Funny thing is, now my relevant unit test does not run, complaining about
  not having enough memory:
 
  Java HotSpot(TM) 64-Bit Server VM warning: INFO:
  os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot
  allocate memory' (errno=12)
  #
  # There is insufficient memory for the Java Runtime Environment to
 continue.
  # Native memory allocation (mmap) failed to map 25165824 bytes for
  committing reserved memory.
 
  --
  Emre
 
 
  On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
 
  Your guess is right, that there are two incompatible versions of
  Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey,
  but its transitive dependencies may, or your transitive dependencies
  may.
 
  I don't see Jersey in Spark's dependency tree except from HBase tests,
  which in turn only appear in examples, so that's unlikely to be it.
  I'd take a look with 'mvn dependency:tree' on your own code first.
  Maybe you are including JavaEE 6 for example?
 
  On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com
  wrote:
   Hello,
  
   I have a piece of code that runs inside Spark Streaming and tries to
 get
   some data from a RESTful web service (that runs locally on my
 machine).
   The
   code snippet in question is:
  
Client client = ClientBuilder.newClient();
WebTarget target = client.target(http://localhost:/rest;);
target = target.path(annotate)
.queryParam(text,
   UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission))
.queryParam(confidence, 0.3);
  
 logger.warn(!!! DEBUG !!! target: {},
   target.getUri().toString());
  
 String response =
  
  
 target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class);
  
 logger.warn(!!! DEBUG !!! Spotlight response: {}, response);
  
   When run inside a unit test as follows:
  
mvn clean test -Dtest=SpotlightTest#testCountWords
  
   it contacts the RESTful web service and retrieves some data as
 expected.
   But
   when the same code is run as part of the application that is submitted
   to
   Spark, using spark-submit script I receive the following error:
  
 java.lang.NoSuchMethodError:
  
  
 javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V
  
   I'm using Spark 1.1.0 and for consuming the web service I'm using
 Jersey
   in
   my project's pom.xml:
  
dependency
 groupIdorg.glassfish.jersey.containers/groupId
 artifactIdjersey-container-servlet-core/artifactId
 version2.14/version
   /dependency
  
   So I suspect that when the application is submitted to Spark, somehow
   there's a different JAR in the environment that uses a different
 version
   of
   Jersey / javax.ws.rs.*
  
   Does anybody know which version of Jersey / javax.ws.rs.*  is used in
   the
   Spark environment, or how to solve this conflict?
  
  
   --
   Emre Sevinç
   https://be.linkedin.com/in/emresevinc/
  
 
 
 
 
  --
  Emre Sevinc




-- 
Emre Sevinc


SVDPlusPlus Recommender in MLLib

2014-12-24 Thread Prafulla Wani
hi ,

Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is 
implemented in Mahout from this paper - 
http://research.yahoo.com/files/kdd08koren.pdf 
http://research.yahoo.com/files/kdd08koren.pdf

Regards,
Prafulla.

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Ted Yu
bq. even when testing with the example from the stock hbase_outputformat.py

Can you take a jstack of the above and pastebin it?
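For example (how the driver shows up in jps is just an assumption; adjust as needed):

jps -lm                         # find the PID of the spark-submit / pyspark driver JVM
jstack -l <PID> > driver.jstack # dump all thread stacks to a file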

Thanks

On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid
wrote:

 Hi,

 have been using this without any issues with spark 1.1.0 but after
 upgrading to 1.2.0 saving a RDD from pyspark
 using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing
 with the example from the stock hbase_outputformat.py.

 anyone having same issue? (and able to solve?)

 using hbase 0.98.6 and yarn-client mode.

 thanks,
 Antony.




null Error in ALS model predict

2014-12-24 Thread Franco Barrientos
Hi all!,

 

I have an RDD[(Int, Int, Double, Double)] where the first two Int values are id
and product, respectively. I trained an implicit ALS model and want to make
predictions from this RDD. I do two things, but I think both ways are the same.

 

1. Convert this RDD to RDD[(Int, Int)] and use model.predict(RDD[(Int, Int)]); this works for me!

2. Make a map and apply model.predict(int, int), for example:

val ratings = RDD[(Int, Int, Double, Double)].map { case (id, rubro, rating, resp) =>
  model.predict(id, rubro)
}

Where ratings is an RDD[Double].

 

Now, with the second way, when I apply ratings.first() I get the following error:



 

Why does this happen? I need to use this second way.
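For comparison, a minimal sketch (not from the thread) of how approach 1 can keep the ids next to the scores, since model.predict on an RDD of (user, product) pairs returns Rating objects; inputRdd is a stand-in name for the RDD[(Int, Int, Double, Double)]:

val userProducts = inputRdd.map { case (id, rubro, _, _) => (id, rubro) }
val predictions = model.predict(userProducts)          // RDD[Rating]
  .map(r => ((r.user, r.product), r.rating))           // keyed by (id, rubro)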

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

 mailto:franco.barrien...@exalitica.com franco.barrien...@exalitica.com 

 http://www.exalitica.com/ www.exalitica.com



 



How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
hey guys,

One of my input records has a problem that makes the code fail.

var demoRddFilter = demoRdd.filter(line =>
  !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") ||
  !line.contains("primaryid$caseid$caseversion"))

var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" +
  line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))

demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

This is possibly happening because one input record may not have 13 fields.
If this were Hadoop mapper code, I have 2 ways to solve this:

1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record

How do I implement 1. or 2. in the Spark code?

Thanks
sanjay

Re: How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
DOH Looks like I did not have enough coffee before I asked this :-) I added the if statement...

var demoRddFilter = demoRdd.filter(line =>
  !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") ||
  !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => {
  if (line.split('$').length >= 13) {
    line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) +
      "~" + line.split('$')(12)
  }
})

  From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID
 To: user@spark.apache.org user@spark.apache.org 
 Sent: Wednesday, December 24, 2014 8:28 AM
 Subject: How to identify erroneous input record ?
   
hey guys 
One of my input records has an problem that makes the code fail.
var demoRddFilter = demoRdd.filter(line = 
!line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) || 
!line.contains(primaryid$caseid$caseversion))

var demoRddFilterMap = demoRddFilter.map(line = line.split('$')(0) + ~ + 
line.split('$')(5) + ~ + line.split('$')(11) + ~ + 
line.split('$')(12))demoRddFilterMap.saveAsTextFile(/data/aers/msfx/demo/ + 
outFile)
This is possibly happening because perhaps one input record may not have 13 
fields.If this were Hadoop mapper code , I have 2 ways to solve this 1. test 
the number of fields of each line before applying the map function2. enclose 
the mapping function in a try catch block so that the mapping function only 
fails for the erroneous recordHow do I implement 1. or 2. in the Spark code 
?Thanks
sanjay


  

Re: How to identify erroneous input record ?

2014-12-24 Thread Sean Owen
I don't believe that works since your map function does not return a
value for lines shorter than 13 tokens. You should use flatMap and
Some/None. (You probably want to not parse the string 5 times too.)

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  val tokens = line.split('$')
  if (tokens.length >= 13) {
    val parsed = tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
    Some(parsed)
  } else {
    None
  }
}
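For completeness, option 2 from the original question (wrapping the parse in a try) could be sketched with scala.util.Try, though the flatMap version above is the simpler fix:

import scala.util.Try

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  Try {
    val tokens = line.split('$')
    tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
  }.toOption   // parse failures (e.g. too few fields) become None and are dropped
}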

On Wed, Dec 24, 2014 at 4:35 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid wrote:
 DOH Looks like I did not have enough coffee before I asked this :-)
 I added the if statement...

 var demoRddFilter = demoRdd.filter(line =
 !line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) ||
 !line.contains(primaryid$caseid$caseversion))
 var demoRddFilterMap = demoRddFilter.map(line = {
   if (line.split('$').length = 13){
 line.split('$')(0) + ~ + line.split('$')(5) + ~ +
 line.split('$')(11) + ~ + line.split('$')(12)
   }
 })


 
 From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID
 To: user@spark.apache.org user@spark.apache.org
 Sent: Wednesday, December 24, 2014 8:28 AM
 Subject: How to identify erroneous input record ?

 hey guys

 One of my input records has an problem that makes the code fail.

 var demoRddFilter = demoRdd.filter(line =
 !line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) ||
 !line.contains(primaryid$caseid$caseversion))

 var demoRddFilterMap = demoRddFilter.map(line = line.split('$')(0) + ~ +
 line.split('$')(5) + ~ + line.split('$')(11) + ~ + line.split('$')(12))

 demoRddFilterMap.saveAsTextFile(/data/aers/msfx/demo/ + outFile)


 This is possibly happening because perhaps one input record may not have 13
 fields.

 If this were Hadoop mapper code , I have 2 ways to solve this

 1. test the number of fields of each line before applying the map function

 2. enclose the mapping function in a try catch block so that the mapping
 function only fails for the erroneous record

 How do I implement 1. or 2. in the Spark code ?

 Thanks


 sanjay








-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
Although not elegantly, I got the output via my code, but I totally agree on the
parsing 5 times (that's really bad). I will add your suggested logic and check it
out. I have a long way to the finish line. I am re-architecting my entire
Hadoop code and getting it onto Spark.

Check out what I do at www.medicalsidefx.org. Primarily an iPhone app, but
underlying it is Lucene, Hadoop and hopefully soon in 2015 - Spark :-)
  From: Sean Owen so...@cloudera.com
 To: Sanjay Subramanian sanjaysubraman...@yahoo.com 
Cc: user@spark.apache.org user@spark.apache.org 
 Sent: Wednesday, December 24, 2014 8:56 AM
 Subject: Re: How to identify erroneous input record ?
   
I don't believe that works since your map function does not return a
value for lines shorter than 13 tokens. You should use flatMap and
Some/None. (You probably want to not parse the string 5 times too.)

val demoRddFilterMap = demoRddFilter.flatMap { line =
  val tokens = line.split('$')
  if (tokens.length = 13) {
    val parsed = tokens(0) + ~ + tokens(5) + ~ + tokens(11) + ~
+ tokens(12)
    Some(parsed)
  } else {
    None
  }
}



On Wed, Dec 24, 2014 at 4:35 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid wrote:
 DOH Looks like I did not have enough coffee before I asked this :-)
 I added the if statement...

 var demoRddFilter = demoRdd.filter(line =
 !line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) ||
 !line.contains(primaryid$caseid$caseversion))
 var demoRddFilterMap = demoRddFilter.map(line = {
  if (line.split('$').length = 13){
    line.split('$')(0) + ~ + line.split('$')(5) + ~ +
 line.split('$')(11) + ~ + line.split('$')(12)
  }
 })


 
 From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID
 To: user@spark.apache.org user@spark.apache.org
 Sent: Wednesday, December 24, 2014 8:28 AM
 Subject: How to identify erroneous input record ?

 hey guys

 One of my input records has an problem that makes the code fail.

 var demoRddFilter = demoRdd.filter(line =
 !line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) ||
 !line.contains(primaryid$caseid$caseversion))

 var demoRddFilterMap = demoRddFilter.map(line = line.split('$')(0) + ~ +
 line.split('$')(5) + ~ + line.split('$')(11) + ~ + line.split('$')(12))

 demoRddFilterMap.saveAsTextFile(/data/aers/msfx/demo/ + outFile)


 This is possibly happening because perhaps one input record may not have 13
 fields.

 If this were Hadoop mapper code , I have 2 ways to solve this

 1. test the number of fields of each line before applying the map function

 2. enclose the mapping function in a try catch block so that the mapping
 function only fails for the erroneous record

 How do I implement 1. or 2. in the Spark code ?

 Thanks


 sanjay








-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



  

hiveContext.jsonFile fails with Unexpected close marker

2014-12-24 Thread elliott cordo
I have generally been impressed with the way jsonFile eats just about any
JSON data model, but I'm getting this error when I try to ingest this file:
Unexpected close marker ']': expected '}'

Here are the commands from the pyspark shell:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
f = hiveContext.jsonFile("sample.json")

Here is some sample json:
{"wf_session": [

{"id":"6021fb91-c9ec-4019-9ab9-f628aee8d259","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:31.373Z","screen":"x","type":1,"time_left_secs":1},

{"id":"7e696c19-3ba4-4469-be28-5ef1f0c03d63","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.385Z","screen":"x","type":2,"ad_unit_id":null,"spot_started_at":"2014-12-19T15:55:12.364Z","spot_ended_at":"2014-12-19T15:55:32.385Z","spot_duration_secs":20,"impression_count":0,"impressions":[],"engagement_count":0,"engagements":[]},

{"id":"68a43006-09bc-4c18-af55-1ebdc0e041a3","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.375Z","screen":"x","type":3,"duration_secs":20,"to_ad_unit_id":"developmentbea1f3a4-be08-4119-b9f4-7"}
  ] }


Any help would be appreciated! :)  Merry Xmas!


RE: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Tarun Garg
Thanks for the reply.
I am testing this with a small amount of data, and what is happening is: whenever 
there is data in the Kafka topic, Spark does not throw the exception; otherwise 
it does.
Thanks,
Tarun

Date: Wed, 24 Dec 2014 16:23:30 +0800
From: lian.cs@gmail.com
To: bigdat...@live.com; user@spark.apache.org
Subject: Re: Not Serializable exception when integrating SQL and Spark Streaming


  

  
  

  Generally you can use -Dsun.io.serialization.extendedDebugInfo=true
to enable serialization debugging information when serialization
exceptions are raised.
  On 12/24/14 1:32 PM,
bigdata4u wrote:
  
  



  I am trying to use sql over Spark streaming using Java. But i am 
getting
Serialization Exception.

public static void main(String args[]) {
SparkConf sparkConf = new SparkConf().setAppName(NumberCount);
JavaSparkContext jc = new JavaSparkContext(sparkConf);
JavaStreamingContext jssc = new JavaStreamingContext(jc, new
Duration(2000));
jssc.addStreamingListener(new WorkCountMonitor());
int numThreads = Integer.parseInt(args[3]);
MapString,Integer topicMap = new HashMapString,Integer();
String[] topics = args[2].split(,);
for (String topic : topics) {
topicMap.put(topic, numThreads);
}
JavaPairReceiverInputDStreamString,String data =
KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
data.print();

JavaDStreamPerson streamData = data.map(new FunctionTuple2lt;String,
String, Person() {
public Person call(Tuple2String,String v1) throws Exception {
String[] stringArray = v1._2.split(,);
Person Person = new Person();
Person.setName(stringArray[0]);
Person.setAge(stringArray[1]);
return Person;
}

});


final JavaSQLContext sqlContext = new JavaSQLContext(jc);
streamData.foreachRDD(new FunctionJavaRDDlt;Person,Void() {
public Void call(JavaRDDPerson rdd) {

JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd,
Person.class);

subscriberSchema.registerAsTable(people);
System.out.println(all data);
JavaSchemaRDD names = sqlContext.sql(SELECT name FROM people);
System.out.println(afterwards);

ListString males = new ArrayListString();

males = names.map(new FunctionRow,String() {
public String call(Row row) {
return row.getString(0);
}
}).collect();
System.out.println(before for);
for (String name : males) {
System.out.println(name);
}
return null;
}
});
jssc.start();
jssc.awaitTermination();

Exception is

14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job
1419378578000 ms.1 org.apache.spark.SparkException: Task not serializable at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at
org.apache.spark.SparkContext.clean(SparkContext.scala:1435) at
org.apache.spark.rdd.RDD.map(RDD.scala:271) at
org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78) at
org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42) at
com.basic.spark.NumberCount$2.call(NumberCount.java:79) at
com.basic.spark.NumberCount$2.call(NumberCount.java:67) at
org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
at
org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
at
org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
at
org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
at
org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at
org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161) at
org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724) 
Caused by: java.io.NotSerializableException:
org.apache.spark.sql.api.java.JavaSQLContext at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181) at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
at

Re: null Error in ALS model predict

2014-12-24 Thread Burak Yavuz
Hi,

The MatrixFactorizationModel consists of two RDDs. When you use the second 
method, Spark tries to serialize both RDDs for the .map() function, 
which is not possible, because RDDs are not serializable. Therefore you 
receive the NullPointerException. You must use the first method.
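
For reference, a minimal sketch of the first (working) approach, assuming the RDD
and variable names from the original mail; MatrixFactorizationModel.predict accepts
an RDD of (user, product) pairs and returns an RDD of Ratings, so nothing in the
model has to be serialized inside a per-element closure:

// data: RDD[(Int, Int, Double, Double)] as described in the original mail
val usersProducts = data.map { case (id, rubro, rating, resp) => (id, rubro) }

// Runs as a distributed job; no per-element call to the model is needed.
val predictions = model.predict(usersProducts)
predictions.take(5).foreach(println)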

Best,
Burak

- Original Message -
From: Franco Barrientos franco.barrien...@exalitica.com
To: user@spark.apache.org
Sent: Wednesday, December 24, 2014 7:44:24 AM
Subject: null Error in ALS model predict

Hi all!,

 

I have an RDD[(Int, Int, Double, Double)] where the first two Int values are id
and product, respectively. I trained an implicit ALS model and want to
make predictions from this RDD. I tried two things, but I think both ways are
the same.

 

1-  Convert this RDD to RDD[(Int, Int)] and use
model.predict(RDD[(Int, Int)]); this works for me!

2-  Make a map and apply model.predict(int, int), for example:

val ratings = RDD[(Int, Int, Double, Double)].map { case (id, rubro, rating,
resp) =>

model.predict(id, rubro)

}

Where ratings is an RDD[Double].

 

Now, with the second way, when I call ratings.first() I get the following error:

Why does this happen? I need to use the second way.

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

 franco.barrien...@exalitica.com

 www.exalitica.com

 



Re: Spark metrics for ganglia

2014-12-24 Thread Tim Harsch
Did you get past this issue?  I'm trying to get this to work as well.  You
have to compile the spark-ganglia-lgpl artifact into your application.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-ganglia-lgpl_2.10</artifactId>
  <version>${project.version}</version>
</dependency>


So I added the above snippet to the examples project, and it finds the
class now when I try to run the Pi example, but I get this problem instead:

14/12/24 11:47:23 ERROR metrics.MetricsSystem: Sink class
org.apache.spark.metrics.sink.GangliaSink cannot be instantialized
...SNIP...
Caused by: java.lang.NumberFormatException: For input string: 1 





On 12/15/14, 11:29 AM, danilopds danilob...@gmail.com wrote:

Thanks tsingfu,

I used this configuration based in your post: (with ganglia unicast mode)
# Enable GangliaSink for all instances
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=10.0.0.7
*.sink.ganglia.port=8649
*.sink.ganglia.period=15
*.sink.ganglia.unit=seconds
*.sink.ganglia.ttl=1
*.sink.ganglia.mode=unicast

Then,
I have the following error now.
ERROR metrics.MetricsSystem: Sink class
org.apache.spark.metrics.sink.GangliaSink  cannot be instantialized
java.lang.ClassNotFoundException:
org.apache.spark.metrics.sink.GangliaSink





--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-metrics-for-gang
lia-tp14335p20690.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Discourse: A proposed alternative to the Spark User list

2014-12-24 Thread Nick Chammas
When people have questions about Spark, there are 2 main places (as far as
I can tell) where they ask them:

   - Stack Overflow, under the apache-spark tag
   http://stackoverflow.com/questions/tagged/apache-spark
   - This mailing list

The mailing list is valuable as an independent place for discussion that is
part of the Spark project itself. Furthermore, it allows for a broader
range of discussions than would be allowed on Stack Overflow
http://stackoverflow.com/help/dont-ask.

As the Spark project has grown in popularity, I see that a few problems
have emerged with this mailing list:

   - It’s hard to follow topics (e.g. Streaming vs. SQL) that you’re
   interested in, and it’s hard to know when someone has mentioned you
   specifically.
   - It’s hard to search for existing threads and link information across
   disparate threads.
   - It’s hard to format code and log snippets nicely, and by extension,
   hard to read other people’s posts with this kind of information.

There are existing solutions to all these (and other) problems based around
straight-up discipline or client-side tooling, which users have to conjure
up for themselves.

I’d like us as a community to consider using Discourse
http://www.discourse.org/ as an alternative to, or overlay on top of,
this mailing list, that provides better out-of-the-box solutions to these
problems.

Discourse is a modern discussion platform built by some of the same people
who created Stack Overflow. It has many neat features
http://v1.discourse.org/about/ that I believe this community would
benefit from.

For example:

   - When a user starts typing up a new post, they get a panel *showing
   existing conversations that look similar*, just like on Stack Overflow.
   - It’s easy to search for posts and link between them.
   - *Markdown support* is built-in to composer.
   - You can *specifically mention people* and they will be notified.
   - Posts can be categorized (e.g. Streaming, SQL, etc.).
   - There is a built-in option for mailing list support which forwards all
   activity on the forum to a user’s email address and which allows for
   creation of new posts via email.

What do you think of Discourse as an alternative, more manageable way to
discuss Spark?

There are a few options we can consider:

   1. Work with the ASF as well as the Discourse team to allow Discourse to
   act as an overlay on top of this mailing list
   
https://meta.discourse.org/t/discourse-as-a-front-end-for-existing-asf-mailing-lists/23167?u=nicholaschammas,
   allowing people to continue to use the mailing list as-is if they want.
   (This is the toughest but perhaps most attractive option.)
   2. Create a new Discourse forum for Spark that is not bound to this user
   list. This is relatively easy but will effectively fork the community on
   this list. (We cannot shut down this mailing list in favor of one managed by
   Discourse.)
   3. Don’t use Discourse. Just encourage people on this list to post
   instead on Stack Overflow whenever possible.
   4. Something else.

What does everyone think?

Nick
​




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-tp20851.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK
thanks, Antony. 

 On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote:
   
 

 bq. even when testing with the example from the stock hbase_outputformat.py
Can you take jstack of the above and pastebin it ?
Thanks
On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid 
wrote:

Hi,
have been using this without any issues with spark 1.1.0 but after upgrading to 
1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just 
hangs - even when testing with the example from the stock hbase_outputformat.py.
anyone having same issue? (and able to solve?)
using hbase 0.98.6 and yarn-client mode.
thanks,Antony.




 
   

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Ted Yu
I went over the jstack but didn't find any call related to hbase or
zookeeper.
Do you find anything important in the logs ?

Looks like container launcher was waiting for the script to return some
result:


   1. at
   
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:715)
   2. at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)


On Wed, Dec 24, 2014 at 3:11 PM, Antony Mayi antonym...@yahoo.com wrote:

 this is it (jstack of particular yarn container) -
 http://pastebin.com/eAdiUYKK

 thanks, Antony.


   On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com
 wrote:



 bq. even when testing with the example from the stock
 hbase_outputformat.py

 Can you take jstack of the above and pastebin it ?

 Thanks

 On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid
  wrote:

 Hi,

 have been using this without any issues with spark 1.1.0 but after
 upgrading to 1.2.0 saving a RDD from pyspark
 using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing
 with the example from the stock hbase_outputformat.py.

 anyone having same issue? (and able to solve?)

 using hbase 0.98.6 and yarn-client mode.

 thanks,
 Antony.







RE: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Tarun Garg
Thanks
I debugged this further and below is the cause:

Caused by: java.io.NotSerializableException: org.apache.spark.sql.api.java.JavaSQLContext
- field (class com.basic.spark.NumberCount$2, name: val$sqlContext, type: class org.apache.spark.sql.api.java.JavaSQLContext)
- object (class com.basic.spark.NumberCount$2, com.basic.spark.NumberCount$2@69ddbcc7)
- field (class com.basic.spark.NumberCount$2$1, name: this$0, type: class com.basic.spark.NumberCount$2)
- object (class com.basic.spark.NumberCount$2$1, com.basic.spark.NumberCount$2$1@2524beed)
- field (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)

I also tried this:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-access-objects-declared-and-initialized-outside-the-call-method-of-JavaRDD-td17094.html#a17150

Why is there a difference: SQLContext is serializable but JavaSQLContext is not? Is 
Spark designed like this?

Thanks
Date: Wed, 24 Dec 2014 16:23:30 +0800
From: lian.cs@gmail.com
To: bigdat...@live.com; user@spark.apache.org
Subject: Re: Not Serializable exception when integrating SQL and Spark Streaming


  

  
  

  Generally you can use -Dsun.io.serialization.extendedDebugInfo=true
to enable serialization debugging information when serialization
exceptions are raised.
  On 12/24/14 1:32 PM,
bigdata4u wrote:
  
  



  I am trying to use sql over Spark streaming using Java. But i am 
getting
Serialization Exception.

public static void main(String args[]) {
SparkConf sparkConf = new SparkConf().setAppName(NumberCount);
JavaSparkContext jc = new JavaSparkContext(sparkConf);
JavaStreamingContext jssc = new JavaStreamingContext(jc, new
Duration(2000));
jssc.addStreamingListener(new WorkCountMonitor());
int numThreads = Integer.parseInt(args[3]);
MapString,Integer topicMap = new HashMapString,Integer();
String[] topics = args[2].split(,);
for (String topic : topics) {
topicMap.put(topic, numThreads);
}
JavaPairReceiverInputDStreamString,String data =
KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
data.print();

JavaDStreamPerson streamData = data.map(new FunctionTuple2lt;String,
String, Person() {
public Person call(Tuple2String,String v1) throws Exception {
String[] stringArray = v1._2.split(,);
Person Person = new Person();
Person.setName(stringArray[0]);
Person.setAge(stringArray[1]);
return Person;
}

});


final JavaSQLContext sqlContext = new JavaSQLContext(jc);
streamData.foreachRDD(new FunctionJavaRDDlt;Person,Void() {
public Void call(JavaRDDPerson rdd) {

JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd,
Person.class);

subscriberSchema.registerAsTable(people);
System.out.println(all data);
JavaSchemaRDD names = sqlContext.sql(SELECT name FROM people);
System.out.println(afterwards);

ListString males = new ArrayListString();

males = names.map(new FunctionRow,String() {
public String call(Row row) {
return row.getString(0);
}
}).collect();
System.out.println(before for);
for (String name : males) {
System.out.println(name);
}
return null;
}
});
jssc.start();
jssc.awaitTermination();

Exception is

14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job
1419378578000 ms.1 org.apache.spark.SparkException: Task not serializable at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at
org.apache.spark.SparkContext.clean(SparkContext.scala:1435) at
org.apache.spark.rdd.RDD.map(RDD.scala:271) at
org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78) at
org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42) at
com.basic.spark.NumberCount$2.call(NumberCount.java:79) at
com.basic.spark.NumberCount$2.call(NumberCount.java:67) at
org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
at
org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
at
org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
at
org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
at
org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at
org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
at

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
I just ran it by hand from the pyspark shell. Here are the steps:

pyspark --jars /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar

conf = {"hbase.zookeeper.quorum": "localhost",
        "hbase.mapred.outputtable": "test",
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
sc.parallelize([['testkey', 'f1', 'testqual', 'testval']], 1).map(lambda x: (x[0], x)).saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=keyConv,
        valueConverter=valueConv)

then it spills a few INFO level messages about submitting a task etc. but then it 
just hangs. The very same code runs ok on Spark 1.1.0 - the records get stored in 
HBase.

thanks,
Antony.

 

 On Thursday, 25 December 2014, 0:37, Ted Yu yuzhih...@gmail.com wrote:
   
 

 I went over the jstack but didn't find any call related to hbase or 
zookeeper.Do you find anything important in the logs ?
Looks like container launcher was waiting for the script to return some result:
   
   -         at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:715)
   -         at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)

On Wed, Dec 24, 2014 at 3:11 PM, Antony Mayi antonym...@yahoo.com wrote:

this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK
thanks, Antony. 

 On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote:
   
 

 bq. even when testing with the example from the stock hbase_outputformat.py
Can you take jstack of the above and pastebin it ?
Thanks
On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid 
wrote:

Hi,
have been using this without any issues with spark 1.1.0 but after upgrading to 
1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just 
hangs - even when testing with the example from the stock hbase_outputformat.py.
anyone having same issue? (and able to solve?)
using hbase 0.98.6 and yarn-client mode.
thanks,Antony.




 




 
   

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Ted Yu
bq. hbase.zookeeper.quorum: localhost

You are running hbase cluster in standalone mode ?
Is hbase-client jar in the classpath ?

Cheers

On Wed, Dec 24, 2014 at 4:11 PM, Antony Mayi antonym...@yahoo.com wrote:

 I just run it by hand from pyspark shell. here is the steps:

 pyspark --jars
 /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar

  conf = {hbase.zookeeper.quorum: localhost,
 ... hbase.mapred.outputtable: test,
 ... mapreduce.outputformat.class:
 org.apache.hadoop.hbase.mapreduce.TableOutputFormat,
 ... mapreduce.job.output.key.class:
 org.apache.hadoop.hbase.io.ImmutableBytesWritable,
 ... mapreduce.job.output.value.class:
 org.apache.hadoop.io.Writable}
  keyConv =
 org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
  valueConv =
 org.apache.spark.examples.pythonconverters.StringListToPutConverter
  sc.parallelize([['testkey', 'f1', 'testqual', 'testval']],
 1).map(lambda x: (x[0], x)).saveAsNewAPIHadoopDataset(
 ... conf=conf,
 ... keyConverter=keyConv,
 ... valueConverter=valueConv)

 then it spills few of the INFO level messages about submitting a task etc
 but then it just hangs. very same code runs ok on spark 1.1.0 - the records
 gets stored in hbase.

 thanks,
 Antony.




   On Thursday, 25 December 2014, 0:37, Ted Yu yuzhih...@gmail.com wrote:



 I went over the jstack but didn't find any call related to hbase or
 zookeeper.
 Do you find anything important in the logs ?

 Looks like container launcher was waiting for the script to return some
 result:


1. at

 org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:715)
2. at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)


 On Wed, Dec 24, 2014 at 3:11 PM, Antony Mayi antonym...@yahoo.com wrote:

 this is it (jstack of particular yarn container) -
 http://pastebin.com/eAdiUYKK

 thanks, Antony.


   On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com
 wrote:



 bq. even when testing with the example from the stock
 hbase_outputformat.py

 Can you take jstack of the above and pastebin it ?

 Thanks

 On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid
  wrote:

 Hi,

 have been using this without any issues with spark 1.1.0 but after
 upgrading to 1.2.0 saving a RDD from pyspark
 using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing
 with the example from the stock hbase_outputformat.py.

 anyone having same issue? (and able to solve?)

 using hbase 0.98.6 and yarn-client mode.

 thanks,
 Antony.










Re: SchemaRDD to RDD[String]

2014-12-24 Thread Tobias Pfeiffer
Hi,

On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:

 I want to convert a schemaRDD into RDD of String. How can we do that?

 Currently I am doing like this which is not converting correctly no
 exception but resultant strings are empty

 here is my code


Hehe, this is the most Java-ish Scala code I have ever seen ;-)

Having said that, are you sure that your rows are not empty? The code looks
correct to me, actually.

Tobias


Re: SchemaRDD to RDD[String]

2014-12-24 Thread Michael Armbrust
You might also try the following, which I think is equivalent:

schemaRDD.map(_.mkString(","))

On Wed, Dec 24, 2014 at 8:12 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com
 wrote:

 I want to convert a schemaRDD into RDD of String. How can we do that?

 Currently I am doing like this which is not converting correctly no
 exception but resultant strings are empty

 here is my code


 Hehe, this is the most Java-ish Scala code I have ever seen ;-)

 Having said that, are you sure that your rows are not empty? The code
 looks correct to me, actually.

 Tobias





Re: Escape commas in file names

2014-12-24 Thread Michael Armbrust
No, there is not.  Can you open a JIRA?

On Tue, Dec 23, 2014 at 6:33 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 I am trying to load a Parquet file which has a comma in its name. Yes,
 this is a valid file name in HDFS. However, sqlContext.parquetFile
 interprets this as a comma-separated list of parquet files.

 Is there any way to escape the comma so it is treated as part of a single
 file name?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



Re: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Michael Armbrust
The various Spark contexts generally aren't serializable because you can't
use them on the executors anyway.  We made SQLContext serializable just
because it gets pulled into scope more often due to the implicit
conversions it contains.  You should try marking the variable that holds
the context with the annotation @transient.
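
A rough sketch of the idea, written in Scala for brevity (the Java code in the
thread would need the equivalent change); the helper class and case class here are
hypothetical, and the point is only that the context lives in a @transient field and
is used inside foreachRDD, which executes on the driver:

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Person(name: String, age: Int)

// The @transient field is dropped rather than serialized if the enclosing
// object ever gets pulled into a closure.
class StreamingQueries(@transient val sqlContext: SQLContext) extends Serializable {
  def run(people: DStream[Person]): Unit = {
    people.foreachRDD { rdd =>
      // foreachRDD bodies run on the driver, so the context is usable here
      val schemaRdd = sqlContext.createSchemaRDD(rdd)
      schemaRdd.registerTempTable("people")
      sqlContext.sql("SELECT name FROM people").collect().foreach(println)
    }
  }
}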

On Wed, Dec 24, 2014 at 7:04 PM, Tarun Garg bigdat...@live.com wrote:

 Thanks

 I debug this further and below is the cause

 Caused by: java.io.NotSerializableException:
 org.apache.spark.sql.api.java.JavaSQLContext
 - field (class com.basic.spark.NumberCount$2, name:
 val$sqlContext, type: class
 org.apache.spark.sql.api.java.JavaSQLContext)
 - object (class com.basic.spark.NumberCount$2,
 com.basic.spark.NumberCount$2@69ddbcc7)
 - field (class com.basic.spark.NumberCount$2$1, name: this$0,
 type: class com.basic.spark.NumberCount$2)
 - object (class com.basic.spark.NumberCount$2$1,
 com.basic.spark.NumberCount$2$1@2524beed)
 - field (class
 org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name:
 fun$1, type: interface org.apache.spark.api.java.function.Function)

 I tried this also
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-access-objects-declared-and-initialized-outside-the-call-method-of-JavaRDD-td17094.html#a17150

 Why there is difference SQLContext is Serializable but JavaSQLContext is
 not? Spark is designed like this.

 Thanks

 --
 Date: Wed, 24 Dec 2014 16:23:30 +0800
 From: lian.cs@gmail.com
 To: bigdat...@live.com; user@spark.apache.org
 Subject: Re: Not Serializable exception when integrating SQL and Spark
 Streaming

  Generally you can use -Dsun.io.serialization.extendedDebugInfo=true to
 enable serialization debugging information when serialization exceptions
 are raised.

 On 12/24/14 1:32 PM, bigdata4u wrote:


  I am trying to use sql over Spark streaming using Java. But i am getting
 Serialization Exception.

 public static void main(String args[]) {
 SparkConf sparkConf = new SparkConf().setAppName(NumberCount);
 JavaSparkContext jc = new JavaSparkContext(sparkConf);
 JavaStreamingContext jssc = new JavaStreamingContext(jc, new
 Duration(2000));
 jssc.addStreamingListener(new WorkCountMonitor());
 int numThreads = Integer.parseInt(args[3]);
 MapString,Integer topicMap = new HashMapString,Integer();
 String[] topics = args[2].split(,);
 for (String topic : topics) {
 topicMap.put(topic, numThreads);
 }
 JavaPairReceiverInputDStreamString,String data =
 KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
 data.print();

 JavaDStreamPerson streamData = data.map(new FunctionTuple2lt;String,
 String, Person() {
 public Person call(Tuple2String,String v1) throws Exception {
 String[] stringArray = v1._2.split(,);
 Person Person = new Person();
 Person.setName(stringArray[0]);
 Person.setAge(stringArray[1]);
 return Person;
 }

 });


 final JavaSQLContext sqlContext = new JavaSQLContext(jc);
 streamData.foreachRDD(new FunctionJavaRDDlt;Person,Void() {
 public Void call(JavaRDDPerson rdd) {

 JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd,
 Person.class);

 subscriberSchema.registerAsTable(people);
 System.out.println(all data);
 JavaSchemaRDD names = sqlContext.sql(SELECT name FROM people);
 System.out.println(afterwards);

 ListString males = new ArrayListString();

 males = names.map(new FunctionRow,String() {
 public String call(Row row) {
 return row.getString(0);
 }
 }).collect();
 System.out.println(before for);
 for (String name : males) {
 System.out.println(name);
 }
 return null;
 }
 });
 jssc.start();
 jssc.awaitTermination();

 Exception is

 14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job
 1419378578000 ms.1 org.apache.spark.SparkException: Task not serializable at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at
 org.apache.spark.SparkContext.clean(SparkContext.scala:1435) at
 org.apache.spark.rdd.RDD.map(RDD.scala:271) at
 org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78) at
 org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42) at
 com.basic.spark.NumberCount$2.call(NumberCount.java:79) at
 com.basic.spark.NumberCount$2.call(NumberCount.java:67) at
 org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
 at
 

Re: hiveContext.jsonFile fails with Unexpected close marker

2014-12-24 Thread Michael Armbrust
Each JSON object needs to be on a single line since this is the boundary
the TextFileInputFormat uses when splitting up large files.
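
If reshaping the file into one object per line isn't convenient, a hedged
workaround (assuming each file is small enough to hold in memory as a single
string) is to read the whole file and hand it to jsonRDD instead of jsonFile;
a sketch against a 1.2-era SQLContext, with the file name taken from the mail:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// wholeTextFiles yields (path, contents); each multi-line file becomes exactly
// one JSON string, which jsonRDD parses as a single record.
val wholeJson = sc.wholeTextFiles("sample.json").map { case (_, content) => content }
val schemaRdd = sqlContext.jsonRDD(wholeJson)
schemaRdd.printSchema()

The same call is available on HiveContext, since it extends SQLContext.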

On Wed, Dec 24, 2014 at 12:34 PM, elliott cordo elliottco...@gmail.com
wrote:

 I have generally been impressed with the way jsonFile eats just about
 any json data model.. but getting this error when i try to ingest this
 file: Unexpected close marker ']': expected '}

 Here are the commands from the pyspark shell:

 from pyspark.sql import  HiveContext
 hiveContext = HiveContext(sc)
 f = hiveContext.jsonFile(sample.json)

 Here is some sample json:
 {wf_session: [

 {id:6021fb91-c9ec-4019-9ab9-f628aee8d259,machine_id:b45c8c4a-7e8e-442d-8d49-fb7c32e2d813,session_id:d65ca338-c6b8-4bff-93b1-7f2364726fb7,event_at:2014-12-19T15:55:31.373Z,screen:x,type:1,time_left_secs:1},

 {id:7e696c19-3ba4-4469-be28-5ef1f0c03d63,machine_id:b45c8c4a-7e8e-442d-8d49-fb7c32e2d813,session_id:d65ca338-c6b8-4bff-93b1-7f2364726fb7,event_at:2014-12-19T15:55:32.385Z,screen:x,type:2,ad_unit_id:null,spot_started_at:2014-12-19T15:55:12.364Z,spot_ended_at:2014-12-19T15:55:32.385Z,spot_duration_secs:20,impression_count:0,impressions:[],engagement_count:0,engagements:[]},

 {id:68a43006-09bc-4c18-af55-1ebdc0e041a3,machine_id:b45c8c4a-7e8e-442d-8d49-fb7c32e2d813,session_id:d65ca338-c6b8-4bff-93b1-7f2364726fb7,event_at:2014-12-19T15:55:32.375Z,screen:x,type:3,duration_secs:20,to_ad_unit_id:developmentbea1f3a4-be08-4119-b9f4-7}
   ] }


 Any help would be appreciated! :)  Merry Xmas!



Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
I am running it in yarn-client mode and I believe hbase-client is part of the 
spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar which I am submitting at 
launch.
adding another jstack taken during the hang - http://pastebin.com/QDQrBw70 - 
this one is of the CoarseGrainedExecutorBackend process, and it is referencing 
hbase and zookeeper.
thanks, Antony. 

 On Thursday, 25 December 2014, 1:38, Ted Yu yuzhih...@gmail.com wrote:
   
 

 bq. hbase.zookeeper.quorum: localhost
You are running hbase cluster in standalone mode ?Is hbase-client jar in the 
classpath ?
Cheers
On Wed, Dec 24, 2014 at 4:11 PM, Antony Mayi antonym...@yahoo.com wrote:

I just run it by hand from pyspark shell. here is the steps:
pyspark --jars 
/usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar
 conf = {hbase.zookeeper.quorum: localhost,
...         hbase.mapred.outputtable: test,...         
mapreduce.outputformat.class: 
org.apache.hadoop.hbase.mapreduce.TableOutputFormat,...         
mapreduce.job.output.key.class: 
org.apache.hadoop.hbase.io.ImmutableBytesWritable,...         
mapreduce.job.output.value.class: org.apache.hadoop.io.Writable} keyConv 
= 
org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
 valueConv = 
org.apache.spark.examples.pythonconverters.StringListToPutConverter 
sc.parallelize([['testkey', 'f1', 'testqual', 'testval']], 1).map(lambda x: 
(x[0], x)).saveAsNewAPIHadoopDataset(...         conf=conf,...         
keyConverter=keyConv,...         valueConverter=valueConv)
then it spills few of the INFO level messages about submitting a task etc but 
then it just hangs. very same code runs ok on spark 1.1.0 - the records gets 
stored in hbase.
thanks,Antony.

 

 On Thursday, 25 December 2014, 0:37, Ted Yu yuzhih...@gmail.com wrote:
   
 

 I went over the jstack but didn't find any call related to hbase or 
zookeeper.Do you find anything important in the logs ?
Looks like container launcher was waiting for the script to return some result:
   
   -         at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:715)
   -         at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)

On Wed, Dec 24, 2014 at 3:11 PM, Antony Mayi antonym...@yahoo.com wrote:

this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK
thanks, Antony. 

 On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote:
   
 

 bq. even when testing with the example from the stock hbase_outputformat.py
Can you take jstack of the above and pastebin it ?
Thanks
On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid 
wrote:

Hi,
have been using this without any issues with spark 1.1.0 but after upgrading to 
1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just 
hangs - even when testing with the example from the stock hbase_outputformat.py.
anyone having same issue? (and able to solve?)
using hbase 0.98.6 and yarn-client mode.
thanks,Antony.




 




 




 
   

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
also hbase itself works ok:

hbase(main):006:0> scan 'test'
ROW                           COLUMN+CELL
 key1                         column=f1:asd, timestamp=1419463092904, value=456
1 row(s) in 0.0250 seconds

hbase(main):007:0> put 'test', 'testkey', 'f1:testqual', 'testval'
0 row(s) in 0.0170 seconds

hbase(main):008:0> scan 'test'
ROW                           COLUMN+CELL
 key1                         column=f1:asd, timestamp=1419463092904, value=456
 testkey                      column=f1:testqual, timestamp=1419487275905, value=testval
2 row(s) in 0.0270 seconds
 

 On Thursday, 25 December 2014, 6:58, Antony Mayi antonym...@yahoo.com 
wrote:
   
 

 I am running it in yarn-client mode and I believe hbase-client is part of the 
spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar which I am submitting at 
launch.
adding another jstack taken during the hanging - http://pastebin.com/QDQrBw70 - 
this is of the CoarseGrainedExecutorBackend process - this one is referencing 
hbase and zookeeper.
thanks,Antony. 

 On Thursday, 25 December 2014, 1:38, Ted Yu yuzhih...@gmail.com wrote:
   
 

 bq. hbase.zookeeper.quorum: localhost
You are running hbase cluster in standalone mode ?Is hbase-client jar in the 
classpath ?
Cheers
On Wed, Dec 24, 2014 at 4:11 PM, Antony Mayi antonym...@yahoo.com wrote:

I just run it by hand from pyspark shell. here is the steps:
pyspark --jars 
/usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar
 conf = {hbase.zookeeper.quorum: localhost,
...         hbase.mapred.outputtable: test,...         
mapreduce.outputformat.class: 
org.apache.hadoop.hbase.mapreduce.TableOutputFormat,...         
mapreduce.job.output.key.class: 
org.apache.hadoop.hbase.io.ImmutableBytesWritable,...         
mapreduce.job.output.value.class: org.apache.hadoop.io.Writable} keyConv 
= 
org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
 valueConv = 
org.apache.spark.examples.pythonconverters.StringListToPutConverter 
sc.parallelize([['testkey', 'f1', 'testqual', 'testval']], 1).map(lambda x: 
(x[0], x)).saveAsNewAPIHadoopDataset(...         conf=conf,...         
keyConverter=keyConv,...         valueConverter=valueConv)
then it spills few of the INFO level messages about submitting a task etc but 
then it just hangs. very same code runs ok on spark 1.1.0 - the records gets 
stored in hbase.
thanks,Antony.

 

 On Thursday, 25 December 2014, 0:37, Ted Yu yuzhih...@gmail.com wrote:
   
 

 I went over the jstack but didn't find any call related to hbase or 
zookeeper.Do you find anything important in the logs ?
Looks like container launcher was waiting for the script to return some result:
   
   -         at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:715)
   -         at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)

On Wed, Dec 24, 2014 at 3:11 PM, Antony Mayi antonym...@yahoo.com wrote:

this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK
thanks, Antony. 

 On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote:
   
 

 bq. even when testing with the example from the stock hbase_outputformat.py
Can you take jstack of the above and pastebin it ?
Thanks
On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid 
wrote:

Hi,
have been using this without any issues with spark 1.1.0 but after upgrading to 
1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just 
hangs - even when testing with the example from the stock hbase_outputformat.py.
anyone having same issue? (and able to solve?)
using hbase 0.98.6 and yarn-client mode.
thanks,Antony.




 




 




 


 
   

Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Hi,

We have a requirement to save RDD output to HDFS with a saveAsTextFile-like 
API, but we need to overwrite the data if it already exists. I'm not sure whether 
current Spark supports this kind of operation, or whether I need to check this manually.

There's a thread in the mailing list that discussed this 
(http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html);
 I'm not sure whether this feature is enabled or not, or whether it needs some configuration.

Appreciate your suggestions.

Thanks a lot
Jerry
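
For the "check this manually" route, one minimal sketch (not the only way) is to
delete an existing output directory with the Hadoop FileSystem API before writing;
the path and RDD name below are made up for illustration:

import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = new Path("hdfs:///data/output/result")  // hypothetical path
val fs = FileSystem.get(outputPath.toUri, sc.hadoopConfiguration)

// Remove the previous output (recursively) if it exists, then write as usual.
if (fs.exists(outputPath)) {
  fs.delete(outputPath, true)
}
rdd.saveAsTextFile(outputPath.toString)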


Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

http://spark.apache.org/docs/latest/configuration.html

- Patrick
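
If disabling the check is acceptable, a hedged sketch of setting it when building
the context (it can equally be passed with --conf on spark-submit); names and paths
are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// With validateOutputSpecs off, saveAsTextFile will overwrite an existing directory.
// See the caveats discussed later in this thread before relying on it.
val conf = new SparkConf()
  .setAppName("OverwriteExample")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)

sc.parallelize(1 to 10).saveAsTextFile("hdfs:///tmp/overwrite-example")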

On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi,



 We have such requirements to save RDD output to HDFS with saveAsTextFile
 like API, but need to overwrite the data if existed. I'm not sure if current
 Spark support such kind of operations, or I need to check this manually?



 There's a thread in mailing list discussed about this
 (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
 I'm not sure this feature is enabled or not, or with some configurations?



 Appreciate your suggestions.



 Thanks a lot

 Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Cheng, Hao
I am wondering if we can provide a friendlier API, rather than a configuration, 
for this purpose. What do you think, Patrick?

Cheng Hao

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Thursday, December 25, 2014 3:22 PM
To: Shao, Saisai
Cc: user@spark.apache.org; d...@spark.apache.org
Subject: Re: Question on saveAsTextFile with overwrite option

Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

http://spark.apache.org/docs/latest/configuration.html

- Patrick

On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi,



 We have such requirements to save RDD output to HDFS with 
 saveAsTextFile like API, but need to overwrite the data if existed. 
 I'm not sure if current Spark support such kind of operations, or I need to 
 check this manually?



 There's a thread in mailing list discussed about this 
 (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp
 ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
 I'm not sure this feature is enabled or not, or with some configurations?



 Appreciate your suggestions.



 Thanks a lot

 Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
So the behavior of overwriting existing directories IMO is something
we don't want to encourage. The reason why the Hadoop client has these
checks is that it's very easy for users to do unsafe things without
them. For instance, a user could overwrite an RDD that had 100
partitions with an RDD that has 10 partitions... and if they read back
the RDD they would get a corrupted RDD that has a combination of data
from the old and new RDD.

If users want to circumvent these safety checks, we need to make them
explicitly disable them. Given this, I think a config option is as
reasonable as any alternatives. This is already pretty easy IMO.

- Patrick

On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote:
 I am wondering if we can provide more friendly API, other than configuration 
 for this purpose. What do you think Patrick?

 Cheng Hao

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Thursday, December 25, 2014 3:22 PM
 To: Shao, Saisai
 Cc: user@spark.apache.org; d...@spark.apache.org
 Subject: Re: Question on saveAsTextFile with overwrite option

 Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

 http://spark.apache.org/docs/latest/configuration.html

 - Patrick

 On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi,



 We have such requirements to save RDD output to HDFS with
 saveAsTextFile like API, but need to overwrite the data if existed.
 I'm not sure if current Spark support such kind of operations, or I need to 
 check this manually?



 There's a thread in mailing list discussed about this
 (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp
 ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
 I'm not sure this feature is enabled or not, or with some configurations?



 Appreciate your suggestions.



 Thanks a lot

 Jerry

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
 commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Thanks Patrick for your detailed explanation.

BR
Jerry

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Thursday, December 25, 2014 3:43 PM
To: Cheng, Hao
Cc: Shao, Saisai; user@spark.apache.org; d...@spark.apache.org
Subject: Re: Question on saveAsTextFile with overwrite option

So the behavior of overwriting existing directories IMO is something we don't 
want to encourage. The reason why the Hadoop client has these checks is that 
it's very easy for users to do unsafe things without them. For instance, a user 
could overwrite an RDD that had 100 partitions with an RDD that has 10 
partitions... and if they read back the RDD they would get a corrupted RDD that 
has a combination of data from the old and new RDD.

If users want to circumvent these safety checks, we need to make them 
explicitly disable them. Given this, I think a config option is as reasonable 
as any alternatives. This is already pretty easy IMO.

- Patrick

On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote:
 I am wondering if we can provide more friendly API, other than configuration 
 for this purpose. What do you think Patrick?

 Cheng Hao

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Thursday, December 25, 2014 3:22 PM
 To: Shao, Saisai
 Cc: user@spark.apache.org; d...@spark.apache.org
 Subject: Re: Question on saveAsTextFile with overwrite option

 Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

 http://spark.apache.org/docs/latest/configuration.html

 - Patrick

 On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi,



 We have such requirements to save RDD output to HDFS with 
 saveAsTextFile like API, but need to overwrite the data if existed.
 I'm not sure if current Spark support such kind of operations, or I need to 
 check this manually?



 There's a thread in mailing list discussed about this 
 (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-S
 p ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
 I'm not sure this feature is enabled or not, or with some configurations?



 Appreciate your suggestions.



 Thanks a lot

 Jerry

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
 additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to build Spark against the latest

2014-12-24 Thread guxiaobo1982
What options should I use when running the make-distribution.sh script?


I tried ./make-distribution.sh --hadoop.version 2.6.0 --with-yarn -with-hive 
--with-tachyon --tgz
but nothing came out.


Regards
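
For what it's worth, in the 1.x build scripts the Hadoop version and the YARN/Hive
support are selected with Maven profiles passed straight through to
make-distribution.sh rather than with dedicated --with-* flags, so something along
these lines (exact profiles may differ for your tree) is the usual shape:

./make-distribution.sh --tgz --with-tachyon -Pyarn -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.6.0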



-- Original --
From:  guxiaobo1982;guxiaobo1...@qq.com;
Send time: Wednesday, Dec 24, 2014 6:52 PM
To: Ted Yuyuzhih...@gmail.com; 
Cc: user@spark.apache.orguser@spark.apache.org; 
Subject:  Re:  How to build Spark against the latest



Hi Ted,
 The reference command works, but where can I get the deployable binaries?


Xiaobo Gu








-- Original --
From:  Ted Yu;yuzhih...@gmail.com;
Send time: Wednesday, Dec 24, 2014 12:09 PM
To: guxiaobo1...@qq.com; 
Cc: user@spark.apache.orguser@spark.apache.org; 
Subject:  Re: How to build Spark against the latest



See http://search-hadoop.com/m/JW1q5Cew0j

On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi,
The official pom.xml file only has a profile for Hadoop version 2.4 as the 
latest version, but I installed Hadoop 2.6.0 with Ambari. How can I build Spark 
against it: just using mvn -Dhadoop.version=2.6.0, or do I need to make a 
corresponding profile for it?


Regards,


Xiaobo