Re: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Cheng Lian
Generally you can use |-Dsun.io.serialization.extendedDebugInfo=true| to enable serialization debugging information when serialization exceptions are raised. On 12/24/14 1:32 PM, bigdata4u wrote: I am trying to use SQL over Spark Streaming using Java. But I am getting Serialization
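A minimal sketch (not from the thread) of one way to pass that flag to a Spark 1.x job: the executor side can be set through SparkConf, while for driver-side task-serialization errors the same flag usually has to go on the launch command (e.g. via --driver-java-options), since the driver JVM is already running when the conf is read.

import org.apache.spark.{SparkConf, SparkContext}

// Ask executor JVMs to print the extended serialization trace when a
// NotSerializableException is raised.
val conf = new SparkConf()
  .setAppName("serialization-debug")
  .set("spark.executor.extraJavaOptions",
    "-Dsun.io.serialization.extendedDebugInfo=true")
val sc = new SparkContext(conf)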

Re: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-24 Thread Cheng Lian
Hao and Lam - I think the issue here is that |registerRDDAsTable| only creates a temporary table, which is not visible to the Hive metastore. Michael once gave a workaround for creating an external Parquet table:
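A rough sketch of that kind of workaround (not Michael's exact code), assuming Spark 1.2, a HiveContext built from the shell's SparkContext, and a Hive version that understands STORED AS PARQUET (0.13+); the path, table name, and schema are made up:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val schemaRDD = hiveContext.sql("SELECT name, age FROM some_temp_table")

// Materialize the data as Parquet files, then register an external Hive
// table that points at that location.
schemaRDD.saveAsParquetFile("/user/hive/warehouse/external/people")
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS people_ext (name STRING, age INT)
    |STORED AS PARQUET
    |LOCATION '/user/hive/warehouse/external/people'""".stripMargin)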

Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-24 Thread Vladimir Protsenko
The 64-bit Java 8 RPM downloaded from the official Oracle site solved my problem, and I did not need to set the max heap size; the final memory shown at the end of the Maven build was 81/1943M. I want to learn Spark, so I have no restriction on choosing the Java version. Guru Medasani, thanks for the tip. I will repeat the info that

Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-24 Thread Sean Owen
That command is still wrong. It is -Xmx3g with no =. On Dec 24, 2014 9:50 AM, Vladimir Protsenko protsenk...@gmail.com wrote: Java 8 rpm 64bit downloaded from official oracle site solved my problem. And I need not set max heap size, final memory shown at the end of maven build was 81/1943M. I

got "org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80" from hive metastore service when I use show tables command in spark-sql shell

2014-12-24 Thread Roc Chu
This is my problem: I use MySQL to store the Hive metadata, and I get what I want when I execute show tables in the Hive shell. But on the same machine, when I use spark-sql to execute the same command (show tables), I get errors. Looking at the Hive metastore log, I find these errors: 2014-12-24 05:04:59,874

Re: How to build Spark against the latest

2014-12-24 Thread guxiaobo1982
Hi Ted, The referenced command works, but where can I get the deployable binaries? Xiaobo Gu -- Original -- From: Ted Yu;yuzhih...@gmail.com; Send time: Wednesday, Dec 24, 2014 12:09 PM To: guxiaobo1...@qq.com; Cc:

Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-24 Thread Sasi
Dear All, We are trying to share RDDs across different sessions of the same Web application (Java). We need to share a single RDD between those sessions. As we understand from some posts, this is possible through Spark-JobServer. Are there any guidelines you can provide to set up Spark-JobServer for Maven

Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Emre Sevinc
Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target =

Re: got "org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80" from hive metastore service when I use show tables command in spark-sql shell

2014-12-24 Thread Cheng Lian
Hi Roc, Spark SQL 1.2.0 can only work with Hive 0.12.0 or Hive 0.13.1 (controlled by compilation flags); versions prior to 1.2.0 only work with Hive 0.12.0. So Hive 0.15.0-SNAPSHOT is not an option. I would like to add that this is due to a backwards compatibility issue of the Hive metastore, AFAIK

Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-24 Thread Vladimir Protsenko
Thanks. Bad mistake. 2014-12-24 14:02 GMT+04:00 Sean Owen so...@cloudera.com: That command is still wrong. It is -Xmx3g with no =. On Dec 24, 2014 9:50 AM, Vladimir Protsenko protsenk...@gmail.com wrote: Java 8 rpm 64bit downloaded from official oracle site solved my problem. And I need

Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Sean Owen
Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn

saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
Hi, have been using this without any issues with Spark 1.1.0, but after upgrading to 1.2.0, saving an RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. Anyone having the same issue? (and able to solve it?)

Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Emre Sevinc
On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? For reference, my complete pom.xml looks like: <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi=

Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Emre Sevinc
It seems like YARN depends on an older version of Jersey, namely 1.9: https://github.com/apache/spark/blob/master/yarn/pom.xml When I modified my dependencies to have only: <dependency> <groupId>com.sun.jersey</groupId> <artifactId>jersey-core</artifactId> <version>1.9.1</version>

Re: Single worker locked at 100% CPU

2014-12-24 Thread Phil Wills
Turns out that I was just being idiotic and had assigned so much memory to Spark that the O/S was ending up continually swapping. Apologies for the noise. Phil On Wed, Dec 24, 2014 at 1:16 AM, Andrew Ash and...@andrewash.com wrote: Hi Phil, This sounds a lot like a deadlock in Hadoop's

Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Sean Owen
That could well be it -- oops, I forgot to run with the YARN profile and so didn't see the YARN dependencies. Try the userClassPathFirst option to make your app's copy take precedence. The second error is really a JVM bug, but it comes from having too little memory available for the unit tests.
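A sketch of what that can look like; the exact property name has varied across Spark versions and deploy modes (and YARN had a separate, similarly named switch in some releases), so treat the name below as an assumption to check against the docs for your release:

import org.apache.spark.SparkConf

// Prefer classes from the application jar over the cluster's copies on the
// executors, so the app's Jersey 2.x wins over the Jersey 1.9 pulled in by YARN.
val conf = new SparkConf()
  .setAppName("rest-client-streaming")
  .set("spark.files.userClassPathFirst", "true")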

Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in unit test but not when submitted to Spark?

2014-12-24 Thread Emre Sevinc
Sean, Thanks a lot for the important information, especially userClassPathFirst. Cheers, Emre On Wed, Dec 24, 2014 at 3:38 PM, Sean Owen so...@cloudera.com wrote: That could well be it -- oops, I forgot to run with the YARN profile and so didn't see the YARN dependencies. Try the

SVDPlusPlus Recommender in MLLib

2014-12-24 Thread Prafulla Wani
Hi, Is there any plan to add an SVDPlusPlus-based recommender to MLlib? It is implemented in Mahout, from this paper: http://research.yahoo.com/files/kdd08koren.pdf Regards, Prafulla.

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Ted Yu
bq. even when testing with the example from the stock hbase_outputformat.py Can you take a jstack of the above and pastebin it? Thanks On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, have been using this without any issues with spark 1.1.0 but after

null Error in ALS model predict

2014-12-24 Thread Franco Barrientos
Hi all! I have an RDD[(Int, Int, Double, Double)] where the first two Int values are the id and product, respectively. I trained an implicit ALS model and want to make predictions from this RDD. I tried two things, but I think both ways are the same. 1- Convert this RDD to RDD[(Int, Int)] and

How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
Hey guys, one of my input records has a problem that makes the code fail. var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion")) var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" +

Re: How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
DOH! Looks like I did not have enough coffee before I asked this :-) I added the if statement... var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion")) var demoRddFilterMap = demoRddFilter.map(line => { if

Re: How to identify erroneous input record ?

2014-12-24 Thread Sean Owen
I don't believe that works since your map function does not return a value for lines shorter than 13 tokens. You should use flatMap and Some/None. (You probably want to not parse the string 5 times too.) val demoRddFilterMap = demoRddFilter.flatMap { line => val tokens = line.split('$') if
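A sketch of the flatMap/Option shape Sean describes, with the separator and first field taken from the snippets above; which columns are kept (and the 13-token threshold) are just for illustration:

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  val tokens = line.split('$')           // parse once instead of five times
  if (tokens.length >= 13) {
    Some(tokens(0) + "~" + tokens(12))   // keep well-formed records
  } else {
    None                                 // drop records that are too short
  }
}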

Re: How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
Although not elegantly, I got the output via my code, but I totally agree on the parsing 5 times (that's really bad). I will add your suggested logic and check it out. I have a long way to the finish line. I am re-architecting my entire Hadoop code and getting it onto Spark. Check out what I do at

hiveContext.jsonFile fails with Unexpected close marker

2014-12-24 Thread elliott cordo
I have generally been impressed with the way jsonFile eats just about any JSON data model, but I am getting this error when I try to ingest this file: Unexpected close marker ']': expected '}'. Here are the commands from the pyspark shell: from pyspark.sql import HiveContext hiveContext =

RE: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Tarun Garg
Thanks for the reply. I am testing this with a small amount of data, and what is happening is that whenever there is data in the Kafka topic, Spark does not throw the exception; otherwise it does. Thanks, Tarun Date: Wed, 24 Dec 2014 16:23:30 +0800 From: lian.cs@gmail.com To: bigdat...@live.com;

Re: null Error in ALS model predict

2014-12-24 Thread Burak Yavuz
Hi, The MatrixFactorizationModel consists of two RDDs. When you use the second method, Spark tries to serialize both RDDs for the .map() function, which is not possible, because RDDs are not serializable. Therefore you receive the NullPointerException. You must use the first method. Best,
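A minimal sketch of the working (first) approach, assuming `data` is the RDD[(Int, Int, Double, Double)] from the question; the rank and iteration counts are placeholders:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Train on Rating(user, product, value) built from the first three fields.
val ratings = data.map { case (user, product, value, _) => Rating(user, product, value) }
val model = ALS.trainImplicit(ratings, 10, 10)

// Predict by handing the whole RDD of (user, product) pairs to the model,
// instead of calling model.predict inside a .map over the data.
val userProducts = data.map { case (user, product, _, _) => (user, product) }
val predictions = model.predict(userProducts)   // RDD[Rating]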

Re: Spark metrics for ganglia

2014-12-24 Thread Tim Harsch
Did you get past this issue? I'm trying to get this to work as well. You have to compile the spark-ganglia-lgpl artifact into your application. <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-ganglia-lgpl_2.10</artifactId>

Discourse: A proposed alternative to the Spark User list

2014-12-24 Thread Nick Chammas
When people have questions about Spark, there are 2 main places (as far as I can tell) where they ask them: - Stack Overflow, under the apache-spark tag http://stackoverflow.com/questions/tagged/apache-spark - This mailing list The mailing list is valuable as an independent place for

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
This is it (jstack of the particular YARN container) - http://pastebin.com/eAdiUYKK Thanks, Antony. On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote: bq. even when testing with the example from the stock hbase_outputformat.py Can you take jstack of the above and

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Ted Yu
I went over the jstack but didn't find any call related to hbase or zookeeper. Did you find anything important in the logs? Looks like the container launcher was waiting for the script to return some result: 1. at

RE: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Tarun Garg
Thanks. I debugged this further and below is the cause: Caused by: java.io.NotSerializableException: org.apache.spark.sql.api.java.JavaSQLContext - field (class com.basic.spark.NumberCount$2, name: val$sqlContext, type: class org.apache.spark.sql.api.java.JavaSQLContext) - object

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
I just ran it by hand from the pyspark shell. Here are the steps: pyspark --jars /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar conf = {"hbase.zookeeper.quorum": "localhost", ... "hbase.mapred.outputtable": "test", ... "mapreduce.outputformat.class":

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Ted Yu
bq. hbase.zookeeper.quorum: localhost You are running the HBase cluster in standalone mode? Is the hbase-client jar in the classpath? Cheers On Wed, Dec 24, 2014 at 4:11 PM, Antony Mayi antonym...@yahoo.com wrote: I just run it by hand from pyspark shell. here is the steps: pyspark --jars

Re: SchemaRDD to RDD[String]

2014-12-24 Thread Tobias Pfeiffer
Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: I want to convert a SchemaRDD into an RDD of String. How can we do that? Currently I am doing it like this, which is not converting correctly; there is no exception, but the resulting strings are empty. Here is my code. Hehe,

Re: SchemaRDD to RDD[String]

2014-12-24 Thread Michael Armbrust
You might also try the following, which I think is equivalent: schemaRDD.map(_.mkString(",")) On Wed, Dec 24, 2014 at 8:12 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: I want to convert a schemaRDD into RDD

Re: Escape commas in file names

2014-12-24 Thread Michael Armbrust
No, there is not. Can you open a JIRA? On Tue, Dec 23, 2014 at 6:33 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I am trying to load a Parquet file which has a comma in its name. Yes, this is a valid file name in HDFS. However, sqlContext.parquetFile interprets this as a

Re: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Michael Armbrust
The various Spark contexts generally aren't serializable because you can't use them on the executors anyway. We made SQLContext serializable just because it gets pulled into scope more often due to the implicit conversions it contains. You should try marking the variable that holds the context
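The quoted message is cut off, but the usual pattern from that era is to keep the context out of the serialized closure, for example via a driver-side singleton; a sketch under that assumption (names are illustrative, not from the thread):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Lazily-created, driver-side SQLContext; @transient keeps it from being
// dragged into any closure that accidentally captures the enclosing object.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

// Inside the streaming job, fetch it per batch rather than capturing it:
// stream.foreachRDD { rdd =>
//   val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
//   ...
// }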

Re: hiveContext.jsonFile fails with Unexpected close marker

2014-12-24 Thread Michael Armbrust
Each JSON object needs to be on a single line since this is the boundary the TextFileInputFormat uses when splitting up large files. On Wed, Dec 24, 2014 at 12:34 PM, elliott cordo elliottco...@gmail.com wrote: I have generally been impressed with the way jsonFile eats just about any json data
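An illustration of that constraint (the path and file contents are made up, and sc is assumed to be the usual shell SparkContext): a file with one complete object per line loads, while a pretty-printed multi-line object produces the "Unexpected close marker" error above.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// /tmp/good.json -- one complete JSON object per line:
//   {"id": 1, "tags": ["a"]}
//   {"id": 2, "tags": ["b"]}
// The same object spread across several lines gets split mid-object on line
// boundaries and fails to parse.
val events = hiveContext.jsonFile("/tmp/good.json")
events.printSchema()
events.registerTempTable("events")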

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
I am running it in yarn-client mode and I believe hbase-client is part of the spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar which I am submitting at launch. Adding another jstack taken during the hang - http://pastebin.com/QDQrBw70 - this is of the CoarseGrainedExecutorBackend

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
Also, HBase itself works OK:
hbase(main):006:0> scan 'test'
ROW     COLUMN+CELL
 key1   column=f1:asd, timestamp=1419463092904, value=456
1

Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Hi, We have a requirement to save RDD output to HDFS with a saveAsTextFile-like API, but we need to overwrite the data if it already exists. I'm not sure whether current Spark supports this kind of operation, or whether I need to check this manually. There's a thread on the mailing list that discussed this

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like
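A sketch of what that looks like in practice; as the follow-up messages explain, use it carefully, since part files left over from a previous, larger output are not cleaned up.

import org.apache.spark.{SparkConf, SparkContext}

// Skip Hadoop's "output directory already exists" check so saveAsTextFile
// writes into an existing path instead of failing.
val conf = new SparkConf()
  .setAppName("overwrite-output")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)

sc.parallelize(1 to 10).saveAsTextFile("hdfs:///tmp/demo-output")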

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Cheng, Hao
I am wondering if we can provide a more friendly API, rather than a configuration setting, for this purpose. What do you think, Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org;

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Thanks Patrick for your detailed explanation. BR Jerry -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:43 PM To: Cheng, Hao Cc: Shao, Saisai; user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile

Re: How to build Spark against the latest

2014-12-24 Thread guxiaobo1982
What options should I use when running the make-distribution.sh script? I tried ./make-distribution.sh --hadoop.version 2.6.0 --with-yarn -with-hive --with-tachyon --tgz but nothing came out. Regards -- Original -- From: guxiaobo1982;guxiaobo1...@qq.com;