Create multiple output files from Thriftserver

2016-04-28 Thread mayankshete
Is there a way to create multiple output files when connected from beeline to the Thriftserver? Right now I am using beeline -e 'query' > output.txt, which is not efficient as it uses a Linux operator to combine output files.

Re: [2 BUG REPORT] failed to run make-distribution.sh when an older version of Maven is installed on the system, and VersionsSuite test hangs

2016-04-28 Thread Ted Yu
For #1, have you seen this JIRA? [SPARK-14867][BUILD] Remove `--force` option in `build/mvn` On Thu, Apr 28, 2016 at 8:27 PM, Demon King wrote: > BUG 1: I have Maven 3.0.2 installed on my system. When I use make-distribution.sh, it does not seem to use Maven 3.2.2 but uses

Re: executor delay in Spark

2016-04-28 Thread Raghava Mutharaju
Hello Mike, No problem, logs are useful to us anyway. Thank you for all the pointers. We started off with examining only a single RDD but later on added a few more. The persist-count and unpersist-count sequence is the dummy stage that you suggested we use to avoid the initial scheduler delay.

[2 BUG REPORT] failed to run make-distribution.sh when an older version of Maven is installed on the system, and VersionsSuite test hangs

2016-04-28 Thread Demon King
BUG 1: I have Maven 3.0.2 installed on my system. When I use make-distribution.sh, it does not seem to use Maven 3.2.2 but uses /usr/local/bin/mvn to build Spark. So I added the --force option in make-distribution.sh like this: line 130: VERSION=$("$MVN" --force help:evaluate -Dexpression=project.version $@

Re: Spark on AWS

2016-04-28 Thread Fatma Ozcan
Thanks for the responses. Fatma On Apr 28, 2016 3:00 PM, "Renato Perini" wrote: > I have set up a small development cluster using t2.micro machines and an > Amazon Linux AMI (CentOS 6.x). > The whole setup has been done manually, without using the provided > scripts. The

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
Look, you said that you didn’t have continuous data, and you do have continuous data. I just used an analog signal as an example, which can be converted so that you end up with contiguous digital sampling. The point is that you have to consider that micro-batches are still batches and you’re adding

Re: Spark on AWS

2016-04-28 Thread Renato Perini
I have set up a small development cluster using t2.micro machines and an Amazon Linux AMI (CentOS 6.x). The whole setup has been done manually, without using the provided scripts, and is composed of a total of 5 instances: the first machine has an elastic IP and is used as a

Re: Spark on AWS

2016-04-28 Thread Alexander Pivovarov
Fatma, the easiest way to create a Spark cluster on AWS is to create an EMR cluster and select the Spark application (the latest EMR includes Spark 1.6.1). Spark works well with S3 (read and write). However, it's recommended to set spark.speculation to true (it's expected that some tasks fail if you read
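For reference, a minimal sketch of setting that flag programmatically (the bucket path is hypothetical; on EMR the same setting can go in spark-defaults.conf):

    import org.apache.spark.{SparkConf, SparkContext}

    // Speculative execution re-launches straggler tasks, which helps when the
    // occasional S3 read task hangs or fails.
    val conf = new SparkConf()
      .setAppName("s3-example")
      .set("spark.speculation", "true")
    val sc = new SparkContext(conf)

    // s3n:// (or s3a:// with Hadoop 2.6+) paths behave like any Hadoop filesystem.
    val lines = sc.textFile("s3n://some-bucket/some/path/*") // hypothetical bucket
    println(lines.count())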

Spark on AWS

2016-04-28 Thread Fatma Ozcan
What is your experience using Spark on AWS? Are you setting up your own Spark cluster, and using HDFS? Or are you using Spark as a service from AWS? In the latter case, what is your experience of using S3 directly, without having HDFS in between? Thanks, Fatma

Re: Spark 2.0+ Structured Streaming

2016-04-28 Thread Tathagata Das
Hello Benjamin, Have you taken a look at the slides of my talk at Strata San Jose - http://www.slideshare.net/databricks/taking-spark-streaming-to-the-next-level-with-datasets-and-dataframes Unfortunately there is no video, as Strata does not upload videos for everyone. I presented the same talk

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Mich Talebzadeh
Also, on the point about "First there is this thing called analog signal processing…. Is that continuous enough for you?": I agree that analog signal processing (like a sine wave or an AM radio signal) is truly continuous. However, here we are talking about digital data, which will always be sent as

Re: Spark 2.0 Release Date

2016-04-28 Thread Jacek Laskowski
Hi Arun, My bet is...https://spark-summit.org/2016 :) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Apr 28, 2016 at 1:43 PM, Arun Patel

Hadoop Context

2016-04-28 Thread Eric Friedman
Hello, Where in the Spark APIs can I get access to the Hadoop Context instance? I am trying to implement the Spark equivalent of this: public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { if (record == null) {
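Spark does not expose Hadoop's Context object directly; a hedged sketch of the usual translation, with TaskContext standing in for the task-level metadata (the record-handling body is truncated above, so the logic here is illustrative only, and the String pair type is an assumption):

    import org.apache.spark.TaskContext
    import org.apache.spark.rdd.RDD

    // Assuming pairs: RDD[(String, String)] stands in for the (Text, Text) input.
    def reduceEquivalent(pairs: RDD[(String, String)]): RDD[(String, String)] =
      // groupByKey delivers the same (key, all values) view a Hadoop reducer sees.
      pairs.groupByKey().map { case (key, values) =>
        // TaskContext provides what the Hadoop Context would: partition id,
        // attempt number, and so on.
        val ctx = TaskContext.get()
        (key, s"partition=${ctx.partitionId()} values=${values.size}")
      }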

Re: Addign a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Marco Mistroni
Many thx Nick. kr On Thu, Apr 28, 2016 at 8:07 PM, Nick Pentreath wrote: > This should work: > scala> val df = Seq((25.0, "foo"), (30.0, "bar")).toDF("age", "name") > scala> df.withColumn("AgeInt", when(col("age") > 29.0, 1).otherwise(0)).show

Re: Addign a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Nick Pentreath
This should work:

    scala> val df = Seq((25.0, "foo"), (30.0, "bar")).toDF("age", "name")
    scala> df.withColumn("AgeInt", when(col("age") > 29.0, 1).otherwise(0)).show
    +----+----+------+
    | age|name|AgeInt|
    +----+----+------+
    |25.0| foo|     0|
    |30.0| bar|     1|
    +----+----+------+

On Thu, 28 Apr 2016 at
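Outside the spark-shell (which pre-imports org.apache.spark.sql.functions._) the snippet needs one extra line; a minimal self-contained sketch, reusing the df defined above:

    import org.apache.spark.sql.functions.{col, when}

    // when/otherwise produces a Column expression evaluated per row, which is
    // why it works where a plain Scala if/else does not: df("age") > 29.0 is a
    // Column, not a Boolean the driver could branch on.
    val withFlag = df.withColumn("AgeInt", when(col("age") > 29.0, 1).otherwise(0))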

Addign a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Marco Mistroni
Hi all, I have a DataFrame with a column ("Age", type double) and I am trying to create a new column based on the value of the Age column, using the Scala API. This code keeps complaining: scala> df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else lit(0)) :28: error: type mismatch; found :

aggregateByKey - external combine function

2016-04-28 Thread Nirav Patel
Hi, I tried to convert a groupByKey operation to aggregateByKey in the hope of avoiding memory and high GC issues when dealing with 200GB of data. I needed to create a collection of resulting key-value pairs which represents all combinations of a given key. My merge function definition is as follows: private
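The merge function itself is cut off above, but a hedged sketch of the groupByKey-to-aggregateByKey shape being described (the buffer type and merge logic here are assumptions):

    import scala.collection.mutable.ArrayBuffer

    // pairs: RDD[(String, Int)] for illustration. aggregateByKey folds values
    // into a per-partition buffer (seqOp) and then merges buffers across
    // partitions (combOp), instead of materializing one huge Iterable per key
    // the way groupByKey does.
    val combined = pairs.aggregateByKey(ArrayBuffer.empty[Int])(
      (buf, v) => buf += v,   // seqOp: add one value to the local buffer
      (b1, b2) => b1 ++= b2   // combOp: merge two partial buffers
    )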

Re: How does .jsonFile() work?

2016-04-28 Thread Aliaksandr Bedrytski
If your question is about how the schema is inferred for JSON, section 5.1 of this paper https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf explains it quite well (long story short, Spark tries to find the most specific type for the field; otherwise it is a
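A quick way to see that inference in action (a spark-shell sketch; the sample records are made up):

    // Two records disagree on the type of "a": integer vs floating point.
    val rdd = sc.parallelize(Seq("""{"a": 1}""", """{"a": 2.5}"""))
    val df = sqlContext.read.json(rdd)
    df.printSchema()
    // root
    //  |-- a: double (nullable = true)   <- widened to the most specific common type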

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Mich Talebzadeh
In a commercial (C)EP product like, say, StreamBase, or for example its competitor Apama, the arrival of an input event immediately triggers further downstream processing. This is admittedly an asynchronous approach, not a synchronous clock-driven micro-batch approach like Spark's. I suppose if one

Re: slow SQL query with cached dataset

2016-04-28 Thread Imran Akbar
Thanks Dr. Mich, Jorn, It's about 150 million rows in the cached dataset. How do I tell if it's spilling to disk? I didn't really see any logs to that effect. How do I determine the optimal number of partitions for a given input dataset? What's too much? regards, imran On Mon, Apr 25, 2016

Re: slow SQL query with cached dataset

2016-04-28 Thread Mich Talebzadeh
Hi Imran, "How do I tell if it's spilling to disk?" Well, that is a very valid question. I do not have a quantitative metric to state that out of X GB of data in Spark, Y GB has been spilled to disk because of the volume of data. Unlike an RDBMS, Spark uses memory as opposed to shared
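One concrete way to make spilling observable (a hedged sketch): cache with a storage level that permits disk, then check the Storage tab of the web UI, which reports size in memory versus size on disk per cached dataset:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK writes partitions that do not fit in memory to disk,
    // rather than dropping them for recomputation as MEMORY_ONLY does.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count() // force materialization, then inspect the UI's Storage tab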

Re: How does .jsonFile() work?

2016-04-28 Thread harjitdotsingh
From what I know and what I have played with, jsonFile reads JSON records which are defined as one record per line. It's not always the case that you can supply the data that way. If you have custom JSON data where you cannot define a record per line, you will have to write your own
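A hedged sketch of that do-it-yourself route, assuming each input file holds exactly one (possibly multi-line) JSON document and using a hypothetical path:

    // wholeTextFiles yields (path, full file contents), so a document spanning
    // many lines arrives as a single string that read.json can parse whole.
    val docs = sc.wholeTextFiles("/data/multiline-json/").map { case (_, text) => text }
    val df = sqlContext.read.json(docs)
    df.printSchema()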

access to nonnegative flag with ALS trainImplicit

2016-04-28 Thread Roberto Pagliari
I'm using ALS with mllib 1.5.2 in Scala. I do not have access to the nonnegative flag in trainImplicit. Which API is it available from?
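For what it's worth, the newer spark.ml ALS estimator does expose the flag; a hedged sketch against the 1.5-era DataFrame API, with the ratings DataFrame assumed:

    import org.apache.spark.ml.recommendation.ALS

    // ratings: DataFrame with "user" (Int), "item" (Int), "rating" (Float) columns.
    val als = new ALS()
      .setImplicitPrefs(true) // same objective as mllib's ALS.trainImplicit
      .setNonnegative(true)   // the flag trainImplicit does not take
      .setRank(10)
      .setRegParam(0.01)
    val model = als.fit(ratings)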

Re: Could not access Spark webUI on OpenStack VMs

2016-04-28 Thread Ted Yu
What happened when you tried to access port 8080? Checking iptables settings is a good thing to do. At my employer, we use OpenStack clusters daily and don't encounter many problems - including with UI access. Probably some settings should be tuned. On Thu, Apr 28, 2016 at 5:03 AM, Dan Dong

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-28 Thread Xiangrui Meng
It implements CombineInputFormat from Hadoop. isSplittable=false means each individual file cannot be split. If you only see one partition even with a large minPartitions, perhaps the total size of files is not big enough. Those are configurable in Hadoop conf. -Xiangrui On Tue, Apr 26, 2016,
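A hedged sketch of passing the hint and checking the result (the path is hypothetical):

    // minPartitions is only a hint to the combining input format; with a small
    // total input size everything may still collapse into one partition.
    val files = sc.wholeTextFiles("/data/many-small-files/", minPartitions = 32)
    println(files.partitions.length)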

Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-28 Thread Ted Yu
Interesting. The Phoenix dependency wasn't shown in the classpath in your previous email. On Thu, Apr 28, 2016 at 4:12 AM, pierre lacave wrote: > Narrowed down to some version incompatibility with Phoenix 4.7, > Including

Re: Spark 2.0 Release Date

2016-04-28 Thread Benjamin Kim
Next Thursday is Databricks' webinar on Spark 2.0. If you are attending, I bet many are going to ask when the release will be. Last time they did this, Spark 1.6 came out not too long afterward. > On Apr 28, 2016, at 5:21 AM, Sean Owen wrote: > > I don't know if anyone has

Re: Streaming K-means not printing predictions

2016-04-28 Thread Ashutosh Kumar
It is reading the files now but throws another error complaining that the vector sizes do not match. I saw this error reported on Stack Overflow: http://stackoverflow.com/questions/30737361/getting-java-lang-illegalargumentexception-requirement-failed-while-calling-spa Also the example given in Scala
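For context, the "requirement failed" error typically means the dimension given to setRandomCenters disagrees with the parsed vectors; a hedged sketch adapted from the standard streaming k-means pattern (trainingData and testData streams are assumed):

    import org.apache.spark.mllib.clustering.StreamingKMeans

    val numDimensions = 3 // must match the length of every training/test vector

    val model = new StreamingKMeans()
      .setK(2)
      .setDecayFactor(1.0)
      .setRandomCenters(numDimensions, 0.0)

    // trainingData: DStream[Vector]; testData: DStream[LabeledPoint].
    // A vector whose size differs from numDimensions fails the distance
    // computation's requirement check.
    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()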

Re: Spark 2.0 Release Date

2016-04-28 Thread Sean Owen
I don't know if anyone has begun a firm discussion on dates, but there are >100 open issues and ~10 blockers, so still some work to do before code freeze, it looks like. My unofficial guess is mid June before it's all done. On Thu, Apr 28, 2016 at 12:43 PM, Arun Patel

Spark 2.0+ Structured Streaming

2016-04-28 Thread Benjamin Kim
Can someone explain to me how the new Structured Streaming works in the upcoming Spark 2.0+? I’m a little hazy about how data will be stored and referenced if it can be queried and/or batch processed directly from streams, and whether the data will be append-only or there will be some sort of upsert

Could not access Spark webUI on OpenStack VMs

2016-04-28 Thread Dan Dong
Hi all, I'm having a problem accessing the web UI of my Spark cluster. The cluster is composed of a few virtual machines running on an OpenStack platform. The VMs are launched from the CentOS 7.0 server image available from the official site. Spark itself runs well, and the master and worker processes are all

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
I don’t. I believe that there have been a couple of hack-a-thons like one done in Chicago a few years back using public transportation data. The first question is what sort of data do you get from the city? I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y). Or they

Spark 2.0 Release Date

2016-04-28 Thread Arun Patel
A small request. Would you mind providing an approximate date of Spark 2.0 release? Is it early May or Mid May or End of May? Thanks, Arun

Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-28 Thread pierre lacave
Narrowed down to some version incompatibility with Phoenix 4.7. Including $SPARK_HOME/lib/phoenix-4.7.0-HBase-1.1-client-spark.jar in extraClassPath triggers the issue above. I'll have a go at adding the individual dependencies as opposed to this fat jar and see how it goes. Thanks

Re: Reading from Amazon S3

2016-04-28 Thread Gourav Sengupta
Why would you use JAVA (create a problem and then try to solve it)? Have you tried using Scala or Python or even R? Regards, Gourav On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran wrote: > > On 26 Apr 2016, at 18:49, Ted Yu wrote: > > Looking at

Re: spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-04-28 Thread linxi zeng
BTW, I have created a JIRA task to follow this issue: https://issues.apache.org/jira/browse/SPARK-14974 2016-04-28 18:08 GMT+08:00 linxi zeng : > Hi, > > Recently, we often encounter problems using spark sql for inserting data > into a partition table (ex.: insert

Re: Save DataFrame to HBase

2016-04-28 Thread Ted Yu
The HBase 2.0 release would likely come after the Spark 2.0 release. There are other features being developed in HBase 2.0; I am not sure when HBase 2.0 will be released. The refguide is incomplete. Zhan has assigned the doc JIRA to himself. The documentation would be done after fixing bugs in

spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-04-28 Thread linxi zeng
Hi, Recently, we often encounter problems using Spark SQL for inserting data into a partitioned table (e.g.: insert overwrite table $output_table partition(dt) select xxx from tmp_table). After the Spark job starts running on YARN, the app will create too many files (e.g. 2,000,000+, or even 10,000,000+),
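A common mitigation (a hedged sketch; SPARK-14974 tracks the underlying behavior) is to repartition by the dynamic-partition column before the insert, so each partition value is written by few tasks; the table name stands in for $output_table:

    // Each write task creates a file per dynamic-partition value it touches;
    // clustering rows by "dt" first bounds the number of files per partition.
    val result = sqlContext.sql("select xxx from tmp_table") // query from the report
    result.repartition(result("dt"))
          .write
          .mode("overwrite")
          .insertInto("output_table")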

Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-28 Thread pierre lacave
Thanks Ted, I am actually using the Hadoop-free version of Spark (spark-1.5.0-bin-without-hadoop) over Hadoop 2.6.1, so it could very well be related indeed. I have configured spark-env.sh with export SPARK_DIST_CLASSPATH=$($HADOOP_PREFIX/bin/hadoop classpath), which is the only version of Hadoop

Classification or grouping of analyzing tools

2016-04-28 Thread Esa Heikkinen
Hi, I am a newbie in this analytics field. It seems there exist many tools, frameworks, ecosystems, software packages, languages and so on. 1) Do there exist some classifications or groupings for them? 2) What types of tools exist? 3) What are the main purposes of the tools? Regards

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Esa Heikkinen
Do you know any good examples of how to use Spark Streaming for tracking public transportation systems? Or an example with Storm or some other tool? Regards Esa Heikkinen 28.4.2016, 3:16, Michael Segel wrote: Uhm… I think you need to clarify a couple of things… First there is this thing called

Fwd: AuthorizationException while exposing via JDBC client (beeline)

2016-04-28 Thread ram kumar
Hi, I wrote a Spark job which registers a temp table, and I expose it via beeline (JDBC client): $ ./bin/beeline beeline> !connect jdbc:hive2://IP:10003 -n ram -p 0: jdbc:hive2://IP> show
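For context: a temp table is only visible inside the context that registered it, so the usual pattern (a hedged sketch; the JSON source path is hypothetical) is to start the thrift server from the job's own HiveContext:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.read.json("/data/events.json") // hypothetical source
    df.registerTempTable("events")

    // Expose this context's catalog over JDBC; beeline sessions connecting to
    // the started server can then query the "events" temp table.
    HiveThriftServer2.startWithContext(hiveContext)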

Re: DataFrame job fails on parsing error, help?

2016-04-28 Thread Night Wolf
We are hitting the same issue on Spark 1.6.1 with Tungsten enabled, Kryo enabled and sort-based shuffle. Did you find a resolution? On Sat, Apr 9, 2016 at 6:31 AM, Ted Yu wrote: > Not much. > So no chance of a different snappy version? > On Fri, Apr 8, 2016 at 1:26 PM,

Re: EOFException while reading from HDFS

2016-04-28 Thread Saurav Sinha
Are you able to connect to the NameNode UI on MACHINE_IP:50070? Check what the URI is there. If the UI doesn't open, it means your HDFS is not up; try to start it using start-dfs.sh. On Thu, Apr 28, 2016 at 2:59 AM, Bibudh Lahiri wrote: > Hi, > I installed Hadoop 2.6.0 today on