NullPointerException with joda time

2015-11-10 Thread romain sagean
Hi community, I'm trying to apply the function below during a flatMapValues or a map, but I get a NullPointerException at plusDays(1). What did I miss? def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] = { if (dateSeq.last.isBefore(dateEnd)){ allDates(dateSeq:+
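The tail of the definition is cut off above, so the following completion is a guess at the intended shape (a hedged sketch, not the poster's actual code): the recursion appends one day at a time until dateEnd is reached.

```scala
import org.joda.time.DateTime

// Assumed reconstruction: walk forward one day at a time until dateEnd.
// plusDays(1) throws an NPE if an element's internal chronology is null.
def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] =
  if (dateSeq.last.isBefore(dateEnd))
    allDates(dateSeq :+ dateSeq.last.plusDays(1), dateEnd)
  else
    dateSeq
```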

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Jörn Franke
I would have to check the Spark source code, but theoretically you can limit the number of threads at the JVM level. Maybe Spark does this. Alternatively, you can use cgroups, but this introduces other complexity. > On 10 Nov 2015, at 14:33, Peter Rudenko wrote: > > Hi i

Re: could not understand issue about static spark Function (map / sortBy ...)

2015-11-10 Thread Zhiliang Zhu
I have sorted out the issues after quite a lot of testing. A Function can only be defined inside a static normal function body, or as a static member variable. A Function can also be defined as an inner static class; some of its own member variables or functions can be defined there, and the variable can be passed

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
As I've tried cgroups - it seems the isolation is done by percentage, not by core count. E.g. I've set min share to 256 - I still see all 8 cores, but I can only load about 20% of each core. Thanks, Peter Rudenko On 2015-11-10 15:52, Saisai Shao wrote: From my understanding, it depends on

A question about accumulator

2015-11-10 Thread Tan Tim
Hi all, There is a discussion about the accumulator on Stack Overflow: http://stackoverflow.com/questions/27357440/spark-accumalator-value-is-different-when-inside-rdd-and-outside-rdd I commented on this question (as user Tim). Based on the output I got, I have two questions: 1. Why the
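A minimal sketch of the behavior behind that question (assuming a SparkContext named sc, as in spark-shell): accumulator updates made inside a transformation are re-applied whenever the un-cached RDD is recomputed, while updates made inside an action are counted exactly once.

```scala
val acc = sc.accumulator(0)
val mapped = sc.parallelize(1 to 100).map { x => acc += 1; x }

mapped.count()        // acc.value is 100
mapped.count()        // the map re-runs, so acc.value is now 200
println(acc.value)

// Safe pattern: update the accumulator only inside an action.
val acc2 = sc.accumulator(0)
sc.parallelize(1 to 100).foreach(_ => acc2 += 1)
println(acc2.value)   // exactly 100
```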

Re: Spark 1.5 UDAF ArrayType

2015-11-10 Thread Alex Nastetsky
Hi, I believe I ran into the same bug in 1.5.0, although my error looks like this: Caused by: java.lang.ClassCastException: [Lcom.verve.spark.sql.ElementWithCount; cannot be cast to org.apache.spark.sql.types.ArrayData at

Re: What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
N/m, these are just profiling snapshots :) Sorry for the wide distribution. On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg wrote: > We're seeing a bunch of .snapshot files being created under > /home/spark/Snapshots, such as the following for example: > >

[ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Reynold Xin
Hi All, Spark 1.5.2 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* that all 1.5.x users upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.2

Re: Anybody hit this issue in spark shell?

2015-11-10 Thread Shixiong Zhu
Scala compiler stores some metadata in the ScalaSig attribute. See the following link as an example: http://stackoverflow.com/questions/10130106/how-does-scala-know-the-difference-between-def-foo-and-def-foo/10130403#10130403 As maven-shade-plugin doesn't recognize ScalaSig, it cannot fix the

Re: NullPointerException with joda time

2015-11-10 Thread Ted Yu
Can you show the stack trace for the NPE ? Which release of Spark are you using ? Cheers On Tue, Nov 10, 2015 at 8:20 AM, romain sagean wrote: > Hi community, > I try to apply the function below during a flatMapValues or a map but I > get a nullPointerException with the

Re: Is it possible Running SparkR on 2 nodes without HDFS

2015-11-10 Thread Sanjay Subramanian
Cool, thanks. I have a CDH 5.4.8 (Cloudera Starving Developers Version) cluster with 1 NN and 4 DN, and Spark is running, but it's 1.3.x. I want to leverage this HDFS/Hive cluster for SparkR because we do all data munging here and produce datasets for ML. I am thinking of the following idea: 1. Add 2 datanodes

Re: Querying nested struct fields

2015-11-10 Thread pratik khadloya
I tried the same, didn't work :( scala> hc.sql("select _1.item_id from agg_imps_df limit 10").collect() 15/11/10 14:30:41 INFO parse.ParseDriver: Parsing command: select _1.item_id from agg_imps_df limit 10 org.apache.spark.sql.AnalysisException: missing \' at 'from' near ''; line 1 pos 23

save data as unique file on each slave node

2015-11-10 Thread Chuming Chen
Hi List, I have a paired RDD, and I want to save the data of each partition as a file with a unique file name (path) on each slave node. Then I will invoke an external program from Spark to process those files on the slave nodes. Is it possible to do that? Thanks. Chuming
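One possible sketch (assuming a paired RDD named pairRdd and a /tmp/parts directory that exists on every slave): foreachPartition runs on the node that owns each partition, and TaskContext supplies a partition id to make each file name unique.

```scala
import java.io.PrintWriter
import org.apache.spark.TaskContext

val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 3)

// Each partition becomes a local file on whichever slave runs it; an
// external program on that node can then pick the file up.
pairRdd.foreachPartition { iter =>
  val out = new PrintWriter(s"/tmp/parts/part-${TaskContext.get.partitionId}")
  try iter.foreach { case (k, v) => out.println(s"$k\t$v") }
  finally out.close()
}
```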

Re: Querying nested struct fields

2015-11-10 Thread Michael Armbrust
Use a `.`: hc.sql("select _1.item_id from agg_imps_df limit 10").collect() On Tue, Nov 10, 2015 at 11:24 AM, pratik khadloya wrote: > Hello, > > I just saved a PairRDD as a table, but i am not able to query it > correctly. The below and other variations does not seem to

Querying nested struct fields

2015-11-10 Thread pratik khadloya
Hello, I just saved a PairRDD as a table, but I am not able to query it correctly. The query below and other variations do not seem to work. scala> hc.sql("select * from agg_imps_df").printSchema() |-- _1: struct (nullable = true) |    |-- item_id: long (nullable = true) |    |-- flight_id: long

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-11-10 Thread Holden Karau
So the SF version will be this Friday at https://foursquare.com/v/coffee-mission/561ab392498e1bc38c3a7e8d (next to the 24th st bart) from 2pm till 5:30pm , come by with your Spark questions :D I'll try and schedule the on-line ones shortly but I had some unexpected travel come up. On Tue, Oct 27,

Re: Querying nested struct fields

2015-11-10 Thread Michael Armbrust
Oh sorry _1 is not a valid hive identifier, you need to use backticks to escape it: Seq(((1, 2), 2)).toDF().registerTempTable("test") sql("SELECT `_1`.`_1` FROM test") On Tue, Nov 10, 2015 at 11:31 AM, pratik khadloya wrote: > I tried the same, didn't work :( > > scala>
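Put together, a runnable sketch of the fix (assuming a SQLContext or HiveContext named sqlContext, as in spark-shell):

```scala
import sqlContext.implicits._

// Tuple fields become columns named _1, _2, ...; names starting with an
// underscore are not plain Hive identifiers, hence the backticks.
Seq(((1L, 2L), 2)).toDF().registerTempTable("test")
sqlContext.sql("SELECT `_1`.`_1` FROM test").show()
```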

Re: thought experiment: Can I use spark streaming to replace all of my rest services?

2015-11-10 Thread Jörn Franke
Maybe look at WebSockets/STOMP to get it to the end user? Or HTTP/2 with STOMP in the future > On 10 Nov 2015, at 21:28, Andy Davidson wrote: > > I just finished watching a great presentation from a recent spark summit on > real time movie recommendations using

Re: NullPointerException with joda time

2015-11-10 Thread Romain Sagean
See below for a more complete version of the code. The firstDate (previously minDate) should not be null; I even added an extra "filter( _._2 != null)" before the flatMap and the error is still there. What I don't understand is why I get the error on dateSeq.last.plusDays and not on

Re: How to configure logging...

2015-11-10 Thread Hitoshi
I don't have Akka, but with just Spark I edited log4j.properties to "log4j.rootCategory=ERROR, console", ran the following command, and was able to get only the Time row as output. run-example org.apache.spark.examples.streaming.JavaNetworkWordCount localhost
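For anyone who cannot edit log4j.properties, the same effect can be had programmatically from the driver (a sketch using the log4j 1.x API that Spark 1.x bundles):

```scala
import org.apache.log4j.{Level, Logger}

// Suppress everything below ERROR on the root logger.
Logger.getRootLogger.setLevel(Level.ERROR)
```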

thought experiment: Can I use spark streaming to replace all of my rest services?

2015-11-10 Thread Andy Davidson
I just finished watching a great presentation from a recent spark summit on real time movie recommendations using spark. https://spark-summit.org/east-2015/talk/real-time-recommendations-using-spark . For the purposes of email I am going to really simplify what they did. In general their real time

Re: NullPointerException with joda time

2015-11-10 Thread Ted Yu
I took a look at https://github.com/JodaOrg/joda-time/blob/master/src/main/java/org/joda/time/DateTime.java Looks like the NPE came from line below: long instant = getChronology().days().add(getMillis(), days); Maybe catch the NPE and print out the value of currentDate to see if there is more
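A null chronology is consistent with a frequently reported Spark/Joda pitfall, though that is an assumption here, not something confirmed in this thread: Kryo's default field serialization does not round-trip DateTime's internals, so getChronology() can come back null after a shuffle. A sketch of the usual workaround, using JodaDateTimeSerializer from the third-party de.javakaffee:kryo-serializers artifact:

```scala
import com.esotericsoftware.kryo.Kryo
import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
import org.apache.spark.serializer.KryoRegistrator
import org.joda.time.DateTime

class JodaRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register a serializer that preserves the chronology.
    kryo.register(classOf[DateTime], new JodaDateTimeSerializer())
  }
}
// Enable with: --conf spark.kryo.registrator=JodaRegistrator
```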

thought experiment: use spark ML to real time prediction

2015-11-10 Thread Andy Davidson
Let's say I have used spark ML to train a linear model. I know I can save and load the model to disk. I am not sure how I can use the model in a real time environment. For example I do not think I can return a "prediction" to the client using spark streaming easily. Also for some applications the

Re: Anybody hit this issue in spark shell?

2015-11-10 Thread Ted Yu
In the PR, a new scala style rule is added banning use of @VisibleForTesting Similar rules can be added as seen fit. Cheers On Tue, Nov 10, 2015 at 11:22 AM, Shixiong Zhu wrote: > Scala compiler stores some metadata in the ScalaSig attribute. See the > following link as an

Re: Using model saved by MLlib with out creating spark context

2015-11-10 Thread Viju K
Thank you for your suggestion, Stefano. I was hoping for an easier solution :) Sent from my iPhone > On Oct 30, 2015, at 2:11 PM, Stefano Baghino > wrote: > > One possibility would be to export the model as a PMML (Predictive Model > Markup Language, an
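For reference, a sketch of the PMML route Stefano suggested (assuming a SparkContext named sc; since Spark 1.4 several MLlib models, including binary logistic regression, mix in PMMLExportable): export once, then score in the web tier with any JVM PMML evaluator, no SparkContext required at serving time.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0))
))
val model = new LogisticRegressionWithLBFGS().run(training)

// Write the model out as PMML; the path is a placeholder.
model.toPMML("/tmp/model.pmml")
```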

Re: Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
(accidental keyboard-shortcut sent the message) ... spark-shell from the spark 1.5.2 binary distribution. Also, running "spPublishLocal" has the same effect. thanks, --Jakob On 10 November 2015 at 14:55, Jakob Odersky wrote: > Hi, > I ran into in error trying to run

Python Kafka support?

2015-11-10 Thread Darren Govoni
Hi, I read on this page http://spark.apache.org/docs/latest/streaming-kafka-integration.html about python support for "receiverless" kafka integration (Approach 2) but it says its incomplete as of version 1.4. Has this been updated in version 1.5.1? Darren

Re: Querying nested struct fields

2015-11-10 Thread pratik khadloya
That worked!! Thanks a lot Michael. ~Pratik On Tue, Nov 10, 2015 at 12:02 PM Michael Armbrust wrote: > Oh sorry _1 is not a valid hive identifier, you need to use backticks to > escape it: > > Seq(((1, 2), 2)).toDF().registerTempTable("test") > sql("SELECT `_1`.`_1`

Re: Is it possible Running SparkR on 2 nodes without HDFS

2015-11-10 Thread Ali Tajeldin EDU
make sure "/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv" is accessible on your slave node. -- Ali On Nov 9, 2015, at 6:06 PM, Sanjay Subramanian wrote: > hey guys > > I have a 2 node SparkR (1 master 1 slave)cluster on AWS

RE: thought experiment: use spark ML to real time prediction

2015-11-10 Thread Kothuvatiparambil, Viju
I have a similar issue. I want to load a model saved by a spark machine learning job, in a web application. model.save(jsc.sc(), "myModelPath"); LogisticRegressionModel model = LogisticRegressionModel.load(
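A minimal Scala sketch of that load-and-predict path (the path and feature values are placeholders, and sc is an existing SparkContext): note that load() itself requires a SparkContext, which is exactly the constraint this thread is wrestling with.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

val model = LogisticRegressionModel.load(sc, "myModelPath")
val score = model.predict(Vectors.dense(0.1, 0.9, 0.4))
println(s"prediction: $score")
```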

Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
Hi, I ran into in error trying to run spark-shell with an external package that I built and published locally using the spark-package sbt plugin ( https://github.com/databricks/sbt-spark-package). To my understanding, spark packages can be published simply as maven artifacts, yet after running

Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread David Morales
Hi there, Please consider our real-time aggregation engine, sparkta, fully open source (Apache2 License). Here you have some slides about the project: - http://www.slideshare.net/Stratio/strata-sparkta And the source code: - https://github.com/Stratio/sparkta Sparkta is a real-time

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
Can you be a bit more specific about what “blow up” means? Also what do you mean by “messed up” brokers? Imbalance? Broker(s) dead? We’re also using the direct consumer and so far nothing dramatic has happened: - on READ it automatically reads from backups if leader is dead (machine gone) - or READ

Re: Unwanted SysOuts in Spark Parquet

2015-11-10 Thread Cheng Lian
This is because of PARQUET-369, which prevents users or other libraries from overriding Parquet's JUL logging settings via SLF4J. It has been fixed in the most recent parquet-format master (PR #32

Re: Why is Kryo not the default serializer?

2015-11-10 Thread ozawa_h
The array issue was also discussed in the Apache Hive community. It seems this problem can be resolved by using Kryo 3.x. Will upgrading to Kryo 3.x allow Kryo to become the default SerDes? https://issues.apache.org/jira/browse/HIVE-12174

spark shared RDD

2015-11-10 Thread Ben
Hi, After reading some documentation about Spark and Ignite, I am wondering if a shared RDD from Ignite can be used to share data in memory, without any duplication, between multiple Spark jobs. Running on Mesos I can colocate them, but will this be enough to avoid memory duplication or not? I am
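A sketch of what that looks like with the ignite-spark module, circa Ignite 1.x (the cache name and config path are placeholders, and the exact signatures should be treated as assumptions):

```scala
import org.apache.ignite.spark.IgniteContext

// An IgniteRDD is a live view over an Ignite cache, so Spark jobs attached
// to the same Ignite cluster share one in-memory copy rather than duplicates.
val ic = new IgniteContext[Int, String](sc, "ignite-config.xml")
val sharedRDD = ic.fromCache("shared")
println(sharedRDD.count())
```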

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
I’ve seen this before during an extreme outage on the cluster, where the kafka offsets checkpointed by the directstreamRdd were bigger than what kafka reported. The checkpoint was therefore corrupted. I don’t know the root cause but since I was stressing the cluster during a reliability test I

NoSuchElementException: key not found

2015-11-10 Thread Ankush Khanna
Hi, I was working on a simple task (running locally): just reading a file (35 MB) with about 200 features and building a random forest with 5 trees of depth 5. While saving the file with: predictions.select("VisitNumber", "probability").write.format("json") // tried different formats

could not understand issue about static spark Function (map / sortBy ...)

2015-11-10 Thread Zhiliang Zhu
After more testing: the Function called by map/sortBy etc. must be defined as static, or it can be defined as non-static but must then be called from another static normal function. I am really confused by this. On Tuesday, November 10, 2015 4:12 PM, Zhiliang Zhu wrote:

static spark Function as map

2015-11-10 Thread Zhiliang Zhu
Hi All, I have hit a bug I cannot understand, as follows: class A { private JavaRDD _com_rdd; ... ... // here it must be static, but not every Function used by map etc. would be static, per the code examples in Spark's own official docs static Function mapParseRow = new
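The underlying issue is closure serialization; here is a hedged Scala illustration of the same pattern (class and method names are invented for the example). A function defined as an instance member captures its non-serializable enclosing object, while one defined on an object (the analog of a Java static member) does not.

```scala
import org.apache.spark.rdd.RDD

class A(rdd: RDD[String]) {                       // A is not Serializable
  def parse(line: String): Int = line.length
  // rdd.map(parse) means rdd.map(this.parse): it drags `this` into the
  // closure, and the job fails with "Task not serializable".
  def broken(): Long = rdd.map(parse).count()
  // Referencing a function on an object captures nothing extra.
  def working(): Long = rdd.map(Helpers.parse).count()
}

object Helpers {
  def parse(line: String): Int = line.length
}
```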

Re: Python Kafka support?

2015-11-10 Thread Saisai Shao
Hi Darren, Functionality like messageHandler is missing in python API, still not included in version 1.5.1. Thanks Jerry On Wed, Nov 11, 2015 at 7:37 AM, Darren Govoni wrote: > Hi, > I read on this page >

Re: [ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Fengdong Yu
This is the simplest announcement I've seen. > On Nov 11, 2015, at 12:49 AM, Reynold Xin wrote: > > Hi All, > > Spark 1.5.2 is a maintenance release containing stability fixes. This release > is based on the branch-1.5 maintenance branch of Spark. We *strongly >

Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-10 Thread Shivaram Venkataraman
I think this is happening in the driver. Could you check the classpath of the JVM that gets started ? If you use spark-submit on yarn the classpath is setup before R gets launched, so it should match the behavior of Scala / Python. Thanks Shivaram On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves

Spark-csv error on read AWS s3a in spark 1.4.1

2015-11-10 Thread Zhang, Jingyu
A small CSV file in S3. I use s3a://key:seckey@bucketname/a.csv It works with SparkContext.textFile: pixelsStr = ctx.textFile(s3pathOrg); It works with Java spark-csv as well. Java code: DataFrame careerOneDF = sqlContext.read().format( "com.databricks.spark.csv")
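For comparison, a hedged Scala sketch of the spark-csv read against the same kind of s3a path (bucket and credentials are placeholders, and sqlContext is assumed to exist):

```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("s3a://key:seckey@bucketname/a.csv")
df.show()
```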

Re: NoSuchElementException: key not found

2015-11-10 Thread Ankush Khanna
Any suggestions, anyone? Using version 1.5.1. Regards, Ankush Khanna On Nov 10, 2015, at 11:37 AM, Ankush Khanna wrote: Hi, I was working on a simple task (running locally): just reading a file (35 MB) with about 200 features and building a random forest with 5 trees

Re: Overriding Derby in hive-site.xml giving strange results...

2015-11-10 Thread Gaurav Tiwari
So which one will be used during query execution? What we observed is that, no matter what, queries are getting executed on Hive with Derby. Also, there is no dialect defined for HSQLDB - do you think this will work? On Tue, Nov 10, 2015 at 5:07 AM, Michael Armbrust wrote:

Save to distributed file system from worker processes

2015-11-10 Thread bikash.mnr
I am quite new to pyspark. In my pyspark application, I want to achieve the following: -- Create an RDD from a python list and partition it into some partitions. -- Now use rdd.foreachPartition(func) -- Here, the function "func" performs an iterative operation which reads

Re: Spark Thrift doesn't start

2015-11-10 Thread fightf...@163.com
I think the exception info says clearly that you may be missing some tez-related jar on the Spark thrift server classpath. fightf...@163.com From: DaeHyun Ryu Date: 2015-11-11 14:47 To: user Subject: Spark Thrift doesn't start Hi folks, I configured tez as the execution engine of Hive. After done

Spark SQL reading json with pre-defined schema

2015-11-10 Thread ganesh.tiwari
I have a very large JSON dataset, and I want to avoid having Spark scan the data to infer the schema. Since I already know the data, I would prefer to provide the schema myself with sqlContext.read().schema(mySchema).json(jsonFilePath); however, the problem is the json data format
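A minimal sketch of supplying the schema up front so the inference scan is skipped (field names and the path are placeholders):

```scala
import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val df = sqlContext.read.schema(mySchema).json("/path/to/data.json")
```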

RE: Connecting SparkR through Yarn

2015-11-10 Thread Sun, Rui
Amit, You can simply set “MASTER” as “yarn-client” before calling sparkR.init(). Sys.setenv("MASTER"="yarn-client") I assume that you have set “YARN_CONF_DIR” env variable required for running Spark on YARN. If you want to set more YARN specific configurations, you can for example Sys.setenv

Spark Thrift doesn't start

2015-11-10 Thread DaeHyun Ryu
Hi folks, I configured tez as the execution engine of Hive. After doing that, whenever I start the spark thrift server, it just stops automatically. I checked the log and saw the following messages. My spark version is 1.4.1 and tez version is 0.7.0 (IBM BigInsights 4.1). Does anyone have any idea on

What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
We're seeing a bunch of .snapshot files being created under /home/spark/Snapshots, such as the following for example: CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot SparkSubmit-2015-08-31-shutdown-1.snapshot

Re: Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread Andrés Ivaldi
Hi, We have been evaluating Apache Kylin - how flexible is it? I mean, we need to create the cube structure dynamically and populate it from different sources; the processing time is not too important, but the response time on queries is. Thanks. On Mon, Nov 9, 2015 at 11:01 PM,

PySpark: breakdown application execution time and fine-tuning the application

2015-11-10 Thread saluc
Hello, I am using PySpark to develop my big-data application. I have the impression that most of the execution of my application is spent on the infrastructure (distributing the code and the data in the cluster, IPC between the Python processes and the JVM) rather than on the computation itself.

Terasort on Spark

2015-11-10 Thread Du, Fan
Hi Spark experts, I'm using ehiggs/spark-terasort to exercise my cluster. I don't understand how to run terasort in a standard way when using a cluster. Currently, all the input and output data is put into hdfs, and I can generate/sort/validate all the sample data. But I'm not sure

Re: Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread Andrés Ivaldi
Hi, Cassandra looks very interesting and it seems to fit well, but it looks like it needs too much work to get a proper configuration, which depends on the data. What we need is a generic structure with as little configuration as possible, because the end users don't have the know-how to do

Re: AnalysisException Handling for unspecified field in Spark SQL

2015-11-10 Thread Arvin
Hi, you can add the new column to your DataFrame just before your query. The added column will be null in your new table, so you can get what you want.
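A sketch of that suggestion (the column name is a placeholder, and df is the DataFrame being queried):

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Add the missing field as an always-null column before querying.
val patched = df.withColumn("missing_field", lit(null).cast(StringType))
patched.registerTempTable("patched")
```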

[Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
Hi, I have a question: how does core isolation work with Spark on YARN? E.g. I have a machine with 8 cores, but launched a worker with --executor-cores 1, and then do something like: rdd.foreachPartition(_ => { for each visible core: burn the core in a new thread }) Will it see 1 core or all 8
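A quick way to probe what a task actually sees (a sketch, assuming an sc on the YARN cluster): without CGroup isolation the JVM typically reports every physical core even when --executor-cores 1 was requested.

```scala
// The println lands in the executor's stdout log on YARN.
sc.parallelize(1 to 1, 1).foreachPartition { _ =>
  println(s"cores visible to this task: ${Runtime.getRuntime.availableProcessors}")
}
```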

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Saisai Shao
From my understanding, it depends on whether you enabled CGroup isolation in Yarn. By default it is not enabled, which means you could allocate one core but spawn a lot of threads in your task to occupy the CPU; it is just a logical limitation. For Yarn CPU isolation you may refer to this

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-10 Thread Bryan
Anyone have thoughts or a similar use-case for SparkSQL / Cassandra? Regards, Bryan Jeffrey -----Original Message----- From: "Bryan Jeffrey" Sent: 11/4/2015 11:16 AM To: "user" Subject: Cassandra via SparkSQL/Hive JDBC Hello. I have been