[ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-23 Thread Elkhan Dadashov
Hi all, While running the Spark word count Python example with an intentional mistake in *Yarn cluster mode*, the Spark terminal states the final status as SUCCEEDED, but the log files show the correct result, indicating that the job failed. Why do the terminal log output and application log output contradict each other? If

java.lang.NoSuchMethodError for list.toMap.

2015-07-23 Thread Dan Dong
Hi, When I ran with spark-submit the following simple Spark program of: import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import org.apache.spark._ import SparkContext._ object TEST2{ def

Re: Help accessing protected S3

2015-07-23 Thread Steve Loughran
On 23 Jul 2015, at 10:47, Greg Anderson gregory.ander...@familysearch.org wrote: So when I go to ~/ephemeral-hdfs/bin/hadoop and check its version, it says Hadoop 2.0.0-cdh4.2.0. If I run pyspark and use the s3a address, things should work, right? What am I missing? And thanks so

Zeppelin notebook question

2015-07-23 Thread Stefan Panayotov
Hi, When I create a DataFrame through Spark SQLContext and then register a temp table, I can use the %sql Zeppelin interpreter to open a nice SQL paragraph. If, on the other hand, I do the same through HiveContext, I can't see those tables in %sql show tables. Is there a way to query the

Fail to load hive tables through Spark

2015-07-23 Thread Mithila Joshi
I am new to Spark and needed help in figuring out why my Hive databases are not accessible to perform a data load through Spark. Background: 1. I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x, on a VirtualBox. 2. I have

Re: Fail to load hive tables through Spark

2015-07-23 Thread ayan guha
Please check if your metastore service is running. You may need to enable automatic metastore service restart when the VM restarts. On 24 Jul 2015 06:20, Mithila Joshi joshi.mith...@gmail.com wrote: I am new to Spark and needed help in figuring out why my Hive databases are not accessible to

Re: Twitter4J streaming question

2015-07-23 Thread Enno Shioji
You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the Twitter stream and then look for the tweet by Bloomberg, so there is a very good chance you won't see that particular tweet. In order to get all Bloomberg-related tweets, you must connect to

Re: Using Wurfl in Spark

2015-07-23 Thread Zhongxiao Ma
After several tests, it turns out that wurfl itself is not thread-safe. That causes the problem: when more than one mapPartition is running, the wurfl engines conflict. I don't know if there is a better way than handling the wurfl lookup outside. Thanks, Zhongxiao From:

spark dataframe gc

2015-07-23 Thread Mohit Jaggi
Hi There, I am testing Spark DataFrame and haven't been able to get my code to finish due to what I suspect are GC issues. My guess is that GC interferes with heartbeating, and executors are detected as failed. The data is ~50 numeric columns, ~100 million rows in a CSV file. We are doing a groupBy
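If GC pauses really are tripping failure detection as Mohit suspects, the usual first step is to loosen the relevant timeouts. A sketch of spark-defaults.conf entries, not a prescription — the values shown are illustrative, only the property names and defaults are from the Spark 1.x docs:

```
# Illustrative spark-defaults.conf entries (example values, not defaults):
spark.network.timeout            600s   # default 120s; also bounds how long a
                                        # silent executor survives before being
                                        # declared lost
spark.executor.heartbeatInterval 30s    # default 10s; how often executors ping
spark.executor.memory            8g     # more headroom can shorten GC pauses
```

Raising the timeout only masks the symptom; if full GCs are that long, reducing per-partition data size (more partitions) usually helps more.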

[MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Saif.A.Ellafi
I tried with an RDD[DenseVector], but RDD is invariant in its type parameter, so an RDD[DenseVector] is not an RDD[Vector], and I can't use the RDD input method of correlation. Thanks, Saif

Asked to remove non-existent executor exception

2015-07-23 Thread Pa Rö
hello spark community, i have built an application with geomesa, accumulo and spark. it works in spark local mode, but not on the spark cluster. in short it says: No space left on device. Asked to remove non-existent executor XY. I'm confused, because there were many GBs of free

Spark - Eclipse IDE - Maven

2015-07-23 Thread Siva Reddy
Hi All, I am trying to set up Eclipse (LUNA) with Maven so that I can create Maven projects for developing Spark programs. I am having some issues and I am not sure what the issue is. Can anyone share a nice step-by-step document to configure Eclipse with Maven for Spark development?

Re: How to restart Twitter spark stream

2015-07-23 Thread Zoran Jeremic
Hi Akhil, Thank you for sending this code. My apologies if I ask something that is obvious here, since I'm a newbie in Scala, but I still don't see how I can use this code. Maybe my original question was not very clear. What I need is to get each Twitter Status that contains one of the

RE: SparkR Supported Types - Please add bigint

2015-07-23 Thread Sun, Rui
Exie, Reported your issue: https://issues.apache.org/jira/browse/SPARK-9302 SparkR has support for long(bigint) type in serde. This issue is related to support complex Scala types in serde. -Original Message- From: Exie [mailto:tfind...@prodevelop.com.au] Sent: Friday, July 24, 2015

Re: SparkR Supported Types - Please add bigint

2015-07-23 Thread Exie
Interestingly, after more digging, df.printSchema() in raw spark shows the columns as a long, not a bigint. root |-- localEventDtTm: timestamp (nullable = true) |-- asset: string (nullable = true) |-- assetCategory: string (nullable = true) |-- assetType: string (nullable = true) |-- event:

SparkR Supported Types - Please add bigint

2015-07-23 Thread Exie
Hi Folks, Using Spark to read in JSON files and detect the schema, it gives me a dataframe with a bigint field. R then fails to import the dataframe, as it can't convert the type. head(mydf) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class jobj to a data.frame

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-23 Thread Ruslan Dautkhanov
Or Spark on HBase ) http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ -- Ruslan Dautkhanov On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu yuzhih...@gmail.com wrote: bq. that is, key-value stores Please consider HBase for this purpose :-) On Tue, Jul 14, 2015 at 5:55 PM,

Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2015-07-23 Thread Cheolsoo Park
Hi, I am wondering if anyone has successfully enabled mapreduce.input.fileinputformat.list-status.num-threads in Spark jobs. I usually set this property to 25 to speed up file listing in MR jobs (Hive and Pig). But for some reason, this property does not take effect in Spark HadoopRDD resulting
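For reference, this is the property under discussion as it would appear in a Hadoop job configuration (where the poster reports it does take effect); whether Spark's HadoopRDD honors it is exactly the open question in this thread:

```xml
<!-- Number of threads used by FileInputFormat to list input file statuses;
     speeds up job setup when there are many input paths/partitions.
     Shown as a core-site.xml / job-conf entry for MR-based engines. -->
<property>
  <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
  <value>25</value>
</property>
```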

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-23 Thread Zoran Jeremic
Hi Yana, Sorry for late response. I just saw your email. At the end I ended with the following pom https://www.dropbox.com/s/19fldb9qnnfieck/pom.xml?dl=0 There were multiple problems I had to struggle with. One of these were that my application had REST implemented with jboss jersey which got

Re: Twitter4J streaming question

2015-07-23 Thread Jörn Franke
He should still see something. I think you need to subscribe to the screen name first and not only filter it out in the filter method. I do not have the APIs at hand on mobile, but there should be a method. On Thu, 23 Jul 2015 at 22:30, Enno Shioji eshi...@gmail.com wrote: You need to pay

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
How can I tell if it's the sample stream or full stream? Thanks Sent from my iPhone On Jul 23, 2015, at 4:17 PM, Enno Shioji eshi...@gmail.com wrote: You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the twitter

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
Ahh Makes sense - thanks for the help Sent from my iPhone On Jul 23, 2015, at 4:29 PM, Enno Shioji eshi...@gmail.com wrote: You need to pay a lot of money to get the full stream, so unless you are doing that, it's the sample stream! On Thu, Jul 23, 2015 at 9:26 PM,

Re: Twitter4J streaming question

2015-07-23 Thread Enno Shioji
You need to pay a lot of money to get the full stream, so unless you are doing that, it's the sample stream! On Thu, Jul 23, 2015 at 9:26 PM, Patrick McCarthy pmccar...@eatonvance.com wrote: How can I tell if it's the sample stream or full stream ? Thanks Sent from my iPhone On Jul 23,

Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Rishi Yadav
Can you explain which transformation is failing? Here's a simple example: http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/ On Thu, Jul 23, 2015 at 5:37 AM, saif.a.ell...@wellsfargo.com wrote: I tried with a RDD[DenseVector] but RDDs are not transformable, so T+

Schedule lunchtime today for a free webinar: IoT data ingestion in Spark Streaming using Kaa, 11 a.m. PDT (2 p.m. EDT)

2015-07-23 Thread Oleh Rozvadovskyy
Hi there! Only a couple of hours left until our first webinar on *IoT data ingestion in Spark Streaming using Kaa*. During the webinar we will build a solution that ingests real-time data from Intel Edison into Apache Spark for stream processing. This solution includes a client, middleware, and

RE: Issue with column named count in a DataFrame

2015-07-23 Thread Young, Matthew T
Thanks Michael, using backticks resolves the issue. Wouldn't this fix also be something that should go into Spark 1.4.2, or at least have the limitation noted in the documentation? From: Michael Armbrust [mich...@databricks.com] Sent: Wednesday, July 22, 2015
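The backtick fix generalizes to any column whose name collides with a SQL keyword or built-in function such as count. A minimal sketch in plain Python of building such a query string — the quote_ident helper, the reserved-word list, and the events table are hypothetical illustrations, not Spark API:

```python
# Hypothetical helper: wrap an identifier in backticks so Spark SQL parses it
# as a column name rather than as a keyword/aggregate like count.
RESERVED = {"count", "select", "table", "where"}

def quote_ident(name):
    return "`%s`" % name if name.lower() in RESERVED else name

query = "SELECT %s FROM events" % quote_ident("count")
print(query)  # SELECT `count` FROM events
```

In Spark SQL itself the escaped form would be e.g. sqlContext.sql("SELECT `count` FROM mytable").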

Re: Re: Need help in setting up spark cluster

2015-07-23 Thread fightf...@163.com
Hi there, As for your analytics and real-time recommendation needs, I would recommend you use Spark SQL and the Hive Thriftserver to store and process your Spark Streaming data. As the Thriftserver runs as a long-lived application, it is quite feasible to cyclically consume data

Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Robin East
The OP's problem is that he gets this: <console>:47: error: type mismatch; found: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]; required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] Note: org.apache.spark.mllib.linalg.DenseVector :

ERROR TaskResultGetter: Exception while getting task result when reading avro files that contain arrays

2015-07-23 Thread Arbi Akhina
Hi, I'm trying to read an avro file into a spark RDD, but I'm having an Exception while getting task result. The avro schema file has the following content: { type : record, name : sample_schema, namespace : com.adomik.avro, fields : [ { name : username, type : string, doc :

Writing binary files in Spark

2015-07-23 Thread Oren Shpigel
Hi, I use Spark to read binary files using SparkContext.binaryFiles(), and then do some calculations, processing, and manipulations to get new objects (also binary). The next thing I want to do is write the results back to binary files on disk. Is there an equivalent of saveAsTextFile, just

Re: writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-23 Thread Akhil Das
Currently, the only way for you is to create a proper schema for the data. This is not a bug, but you could open a JIRA for the feature (since this would help others solve similar use cases), and it could be implemented and included in a future version. Thanks Best Regards On Tue, Jul 21,

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-23 Thread Dan Dong
The problem should be toMap, as I tested that val maps2=maps.collect runs ok. When I run spark-shell, I run with --master mesos://cluster-1:5050 parameter which is the same with spark-submit. Confused here. 2015-07-22 20:01 GMT-05:00 Yana Kadiyska yana.kadiy...@gmail.com: Is it complaining

Twitter4J streaming question

2015-07-23 Thread pjmccarthy
Hopefully this is an easy one. I am trying to filter a twitter dstream by user ScreenName - my code is as follows val stream = TwitterUtils.createStream(ssc, None) .filter(_.getUser.getScreenName.contains(markets)) however nothing gets returned and I can see that Bloomberg has tweeted.

Create table from local machine

2015-07-23 Thread vinod kumar
Hi, I need to create a table in Spark. For that, I have uploaded a csv file to HDFS and created a table using the following query: CREATE EXTERNAL table IF NOT EXISTS + tableName + (teams string,runs int) + ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION ' + hdfspath + '; May I know is

Re: spark thrift server supports timeout?

2015-07-23 Thread Akhil Das
Here are a few more configurations: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-ConfigurationPropertiesinthehive-site.xmlFile — can't find anything on the timeouts though. Thanks Best Regards On Wed, Jul 22, 2015 at 1:01 AM, Judy Nash

Help with Dataframe syntax ( IN / COLLECT_SET)

2015-07-23 Thread Yana Kadiyska
Hi folks, having trouble expressing IN and COLLECT_SET on a dataframe. In other words, I'd like to figure out how to write the following query: select collect_set(b),a from mytable where c in (1,2,3) group by a I've started with someDF .where( -- not sure what to do for c here ---
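What that SQL computes can be modeled in plain Python over a list of rows — this is only a sketch of the semantics (filter by c IN (1,2,3), group by a, collect a set of b values), not DataFrame API code, and the sample rows are made up:

```python
from collections import defaultdict

rows = [  # hypothetical sample data with columns a, b, c
    {"a": "x", "b": 1, "c": 1},
    {"a": "x", "b": 2, "c": 2},
    {"a": "x", "b": 2, "c": 9},   # dropped by the filter: c not in (1, 2, 3)
    {"a": "y", "b": 5, "c": 3},
]

# WHERE c IN (1, 2, 3)
kept = [r for r in rows if r["c"] in (1, 2, 3)]

# GROUP BY a with COLLECT_SET(b): one deduplicated set of b per key a
groups = defaultdict(set)
for r in kept:
    groups[r["a"]].add(r["b"])

print(dict(groups))  # {'x': {1, 2}, 'y': {5}}
```

On the DataFrame side, the IN part is Column.isin (someDF.where(col("c").isin(1, 2, 3))); in later Spark versions collect_set is also exposed in the functions module for use with groupBy().agg().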

Re: 1.4.0 classpath issue with spark-submit

2015-07-23 Thread Akhil Das
You can try adding that jar in SPARK_CLASSPATH (its deprecated though) in spark-env.sh file. Thanks Best Regards On Tue, Jul 21, 2015 at 7:34 PM, Michal Haris michal.ha...@visualdna.com wrote: I have a spark program that uses dataframes to query hive and I run it both as a spark-shell for

Re: NullPointerException inside RDD when calling sc.textFile

2015-07-23 Thread Akhil Das
Did you try: val data = indexed_files.groupByKey; val modified_data = data.map { a => val name = a._2.mkString(","); (a._1, name) }; modified_data.foreach { a => var file = sc.textFile(a._2); println(file.count) } Thanks Best Regards On Wed, Jul 22, 2015 at 2:18 AM, MorEru

Re: problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-23 Thread Akhil Das
It looks like it's picking up the wrong namenode URI from the HADOOP_CONF_DIR; make sure it is proper. Also, for submitting a spark job to a remote cluster, you might want to look at spark.driver.host and spark.driver.port Thanks Best Regards On Wed, Jul 22, 2015 at 8:56 PM, rok

Re: Help accessing protected S3

2015-07-23 Thread Steve Loughran
On 23 Jul 2015, at 01:50, Ewan Leith ewan.le...@realitymine.com wrote: I think the standard S3 driver used in Spark from the Hadoop project (S3n) doesn't support IAM role based authentication. However, S3a should support it. If you're running Hadoop 2.6 via the spark-ec2 scripts (I'm

RE: Help accessing protected S3

2015-07-23 Thread Greg Anderson
So when I go to ~/ephemeral-hdfs/bin/hadoop and check its version, it says Hadoop 2.0.0-cdh4.2.0. If I run pyspark and use the s3a address, things should work, right? What am I missing? And thanks so much for the help so far! From: Steve Loughran

Re: Using Dataframe write with newHdoopApi

2015-07-23 Thread Akhil Das
Did you happen to look into esDF https://github.com/elastic/elasticsearch-hadoop/issues/441? You can open an issue here if that doesn't solve your problem: https://github.com/elastic/elasticsearch-hadoop/issues Thanks Best Regards On Tue, Jul 21, 2015 at 5:33 PM, ayan guha

Class weights and prediction probabilities in random forest?

2015-07-23 Thread Patrick Crenshaw
I was just wondering if there were plans to implement class weights and prediction probabilities in random forest? Is anyone working on this?

Re: Writing binary files in Spark

2015-07-23 Thread Akhil Das
You can look into .saveAsObjectFile Thanks Best Regards On Thu, Jul 23, 2015 at 8:44 PM, Oren Shpigel o...@yowza3d.com wrote: Hi, I use Spark to read binary files using SparkContext.binaryFiles(), and then do some calculations, processing, and manipulations to get new objects (also

How to deal with the spark streaming application while upgrade spark

2015-07-23 Thread JoneZhang
My Spark Streaming on Kafka application is running on Spark 1.3, and I want to upgrade Spark to 1.4 now. How should I deal with the Spark Streaming application? Save the Kafka topic partition offsets, then kill the application, then upgrade, then run Spark Streaming again? Is there a more elegant way? -- View

Re: How to deal with the spark streaming application while upgrade spark

2015-07-23 Thread Tathagata Das
Currently that is the best way. On Thu, Jul 23, 2015 at 12:51 AM, JoneZhang joyoungzh...@gmail.com wrote: My spark streaming on kafka application is running in spark 1.3. I want upgrade spark to 1.4 now. How to deal with the spark streaming application? Save the kafka topic partition
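TD's answer (save offsets, kill, upgrade, restart) can be sketched minimally. The file path and offset layout below are hypothetical; in a real deployment the numbers would come from the OffsetRange objects of the direct Kafka stream, and the store should be something more durable than a local file (e.g. ZooKeeper or a database):

```python
import json

OFFSET_FILE = "/tmp/kafka_offsets.json"  # hypothetical location

def save_offsets(offsets, path=OFFSET_FILE):
    """Persist offsets as {"topic:partition": next_offset_to_read}."""
    with open(path, "w") as f:
        json.dump(offsets, f)

def load_offsets(path=OFFSET_FILE):
    """Read the offsets back before restarting the upgraded application."""
    with open(path) as f:
        return json.load(f)

save_offsets({"events:0": 1234, "events:1": 5678})
print(load_offsets())  # {'events:0': 1234, 'events:1': 5678}
```

On restart, the loaded map seeds the fromOffsets argument of the direct stream so consumption resumes exactly where the old application stopped.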

Re: Udf's in spark

2015-07-23 Thread Takeshi Yamamuro
Sure, Spark SQL supports Hive UDFs. ISTM that the UDF 'DATE_FORMAT' is just not registered in your metastore? Did you run 'CREATE FUNCTION' in advance? Thanks, On Tue, Jul 14, 2015 at 6:30 PM, Ravisankar Mani rrav...@gmail.com wrote: Hi Everyone, As mentioned in the Spark SQL programming

Re: Re: Need help in setting up spark cluster

2015-07-23 Thread Jeetendra Gangele
Thanks for the reply and your valuable suggestions. I have 10 GB of data generated every day, and I need to write this data to my database. The data is schema-based, but the schema changes frequently, so consider it unstructured data. Sometimes I may have to serve 1 write/sec with 4 m1.xLarge

SQL Server to Spark

2015-07-23 Thread vinod kumar
Hi Everyone, I need to use a table from MS SQL Server in Spark. Could anyone please share the most optimized way to do that? Thanks in advance, Vinod

RE: Help accessing protected S3

2015-07-23 Thread Ewan Leith
I think the standard S3 driver used in Spark from the Hadoop project (S3n) doesn't support IAM role based authentication. However, S3a should support it. If you're running Hadoop 2.6 via the spark-ec2 scripts (I'm not sure what it launches with by default) try accessing your bucket via s3a://
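A hedged sketch of what "try accessing your bucket via s3a://" might look like on a spark-ec2 cluster; the paths and bucket name are placeholders, and whether IAM-role credentials are picked up depends on the exact Hadoop/S3A build:

```
# Sketch only (assumptions: spark-ec2 cluster, placeholder bucket name).
# 1. Check which Hadoop build backs the filesystem; s3a needs Hadoop 2.6+:
#      ~/ephemeral-hdfs/bin/hadoop version
# 2. If it reports 2.6+, try the s3a scheme from pyspark; with an IAM
#    instance role, no access keys should need to be configured:
#      sc.textFile("s3a://my-bucket/some/key").count()
# On older builds (e.g. Hadoop 2.0.0-cdh4.2.0, as reported later in this
# thread) the s3a filesystem classes are absent and the read fails with a
# class-not-found style error rather than authenticating.
```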

Re: SQL Server to Spark

2015-07-23 Thread Denny Lee
It sort of depends on what you mean by optimized. There is a good thread on the topic at http://search-hadoop.com/m/q3RTtJor7QBnWT42/Spark+and+SQL+server/v=threaded If you have an archival-type strategy, you could do daily BCP extracts to load the data into HDFS / S3 / etc. This would result in minimal impact