Re: How to accelerate reading json file?

2016-01-05 Thread VISHNU SUBRAMANIAN
Hi, you can try this: sqlContext.read.format("json").option("samplingRatio","0.1").load("path"). If it still takes time, feel free to experiment with the samplingRatio. Thanks, Vishnu On Wed, Jan 6, 2016 at 12:43 PM, Gavin Yue wrote: > I am trying to read json files following the example: >
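A minimal sketch of that suggestion, assuming an existing sqlContext and a hypothetical input path:

    // samplingRatio controls what fraction of the records is scanned for schema inference
    val df = sqlContext.read
      .format("json")
      .option("samplingRatio", "0.1")
      .load("hdfs:///data/json")   // hypothetical path
    df.printSchema()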

How to accelerate reading json file?

2016-01-05 Thread Gavin Yue
I am trying to read json files following the example: val path = "examples/src/main/resources/jsonfile"; val people = sqlContext.read.json(path) I have 1 TB of files in the path. It took 1.2 hours to finish reading them and inferring the schema. But I already know the schema. Could I make this proces
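Since the schema is already known, one way to skip inference entirely is to pass an explicit schema; a sketch, where the two fields below are hypothetical:

    import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType}

    // Replace with the real field names and types
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)))

    // With an explicit schema, Spark does not scan the files to infer one
    val people = sqlContext.read.schema(schema).json(path)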

Re: How to use Java8

2016-01-05 Thread Sea
thanks -- Original Message -- From: "Andy Davidson"; Date: Wednesday, January 6, 2016 11:04; To: "Sea" <261810...@qq.com>; "user"; Subject: Re: How to use Java8 Hi Sea From: Sea <261810...@qq.com> Date: Tuesday, January 5, 2016 at 6:16 PM To: "user @spark

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
I found that in 1.6 a DataFrame can do repartition. Do I still need to do orderBy first, or can I just repartition? On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue wrote: > I tried Ted's solution and it works. But I keep hitting the JVM out > of memory problem. > And grouping the key

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
I tried Ted's solution and it works, but I keep hitting the JVM out-of-memory problem, and grouping by the key causes a lot of data shuffling. So I am trying to order the data by ID first and save it as Parquet. Is there a way to make sure that the data is partitioned so that each ID's data is

Re: pyspark Dataframe and histogram through ggplot (python)

2016-01-05 Thread Felix Cheung
Hi, select() returns a new Spark DataFrame; I would imagine ggplot would not work with it. Could you try df.select("something").toPandas()? _ From: Snehotosh Banerjee Sent: Tuesday, January 5, 2016 4:32 AM Subject: pyspark Dataframe and histogram through ggplot (

Re: sparkR ORC support.

2016-01-05 Thread Felix Cheung
Firstly I don't have ORC data to verify but this should work: df <- loadDF(sqlContext, "data/path", "orc") Secondly, could you check if sparkR.stop() was called? sparkRHive.init() should be called after sparkR.init() - please check if there is any error message there. __

RE: Spark on Apache Ignite?

2016-01-05 Thread Umesh Kacha
Hi Nate, thanks much. I have the exact same use cases you mentioned. My Spark job does heavy writing involving group by and huge data shuffling. Can you please provide any pointers on how I can take my existing Spark job, which is running on YARN, and make it run on Ignite? Please guide. Thanks again. On J

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread ayan guha
Yes there is. It is called a window function over partitions. The equivalent SQL would be: select * from (select a,b,c, rank() over (partition by a order by b) r from df) x where r = 1 You can register your DF as a temp table and use the SQL form. Or (Spark >= 1.4), you can use the window methods a
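A rough DataFrame-API version of that SQL (Spark >= 1.4; in 1.x window functions generally require a HiveContext), assuming a DataFrame df with columns a, b, c:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{rank, col}

    // Rank rows within each group of "a", ordered by "b", then keep the top-ranked row
    val w = Window.partitionBy("a").orderBy("b")
    val result = df.withColumn("r", rank().over(w))
      .where(col("r") === 1)
      .drop("r")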

[Spark-SQL] Custom aggregate function for GrouppedData

2016-01-05 Thread Abhishek Gayakwad
Hello Hivemind, Referring to this thread - https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html. I have learnt that we cannot do much with grouped data apart from using the existing aggregate functions. This blog post was written in May 2015; I don't kno

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jim Lohse
Hey Python 2.6 don't let the door hit you on the way out! haha Drop It No Problem On 01/05/2016 12:17 AM, Reynold Xin wrote: Does anybody here care about us dropping support for Python 2.6 in Spark 2.0? Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) when compar

Out of memory issue

2016-01-05 Thread babloo80
Hello there, I have a spark job that reads 7 parquet files (8 GB, 3 x 16 GB, 3 x 14 GB) in different stages of execution and creates a result parquet file of 9 GB (about 27 million rows containing 165 columns; some columns are map-based, containing at most 200-value histograms). The stages involve, Step 1: Re

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Thanks Mark, a custom configuration file would be better for me. Changing the code would affect all applications; that is too risky for me. On Wed, Jan 6, 2016 at 10:50 AM, Mark Hamstra wrote: > The other way to do it is to build a custom version of Spark where you > have changed the valu

Re: How to use Java8

2016-01-05 Thread Andy Davidson
Hi Sea From: Sea <261810...@qq.com> Date: Tuesday, January 5, 2016 at 6:16 PM To: "user @spark" Subject: How to use Java8 > Hi, all > I want to support java8, I use JDK1.8.0_65 in production environment, but > it doesn't work. Should I build spark using jdk1.8, and set > 1.8 in pom.xml?

Re: UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Tathagata Das
Both mapWithState and updateStateByKey use the HashPartitioner by default, and hash the key in the key-value DStream on which the state operation is applied. The new data and the state are partitioned with the exact same partitioner, so that the same keys from the new data (from the input DStream) get shuffl
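For reference, both operations also accept an explicit partitioner if the default is not what you want; a minimal sketch, assuming pairs is a DStream[(String, Int)]:

    import org.apache.spark.HashPartitioner

    // Sum new values into the running state for each key
    val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
      (values, state) => Some(values.sum + state.getOrElse(0))

    // Pass the partitioner explicitly instead of relying on the default HashPartitioner
    val stateDStream = pairs.updateStateByKey(updateFunc, new HashPartitioner(32))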

Re: 101 question on external metastore

2016-01-05 Thread Yana Kadiyska
Deenar, I have not resolved this issue. Why do you think it's from different versions of Derby? I was playing with this as a fun experiment and my setup was on a clean machine -- no other versions of hive/hadoop/etc... On Sun, Dec 20, 2015 at 12:17 AM, Deenar Toraskar wrote: > apparently it is d

RE: aggregateByKey vs combineByKey

2016-01-05 Thread LINChen
Hi Marco, In your case, since you don't need to perform an aggregation (such as a sum or average) over each key, using groupByKey may perform better. groupByKey inherently utilizes CompactBuffer, which is much more efficient than ArrayBuffer. Thanks. LIN Chen Date: Tue, 5 Jan 2016 21:13:40 + S
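A small sketch of that groupByKey approach on the (word_length, word) pairs from this thread:

    val kv = sc.parallelize(Seq(
      (2, "Hi"), (1, "i"), (2, "am"), (1, "a"), (4, "test"), (6, "string")))

    // Group words by their length; groupByKey buffers all values for a key together
    val grouped = kv.groupByKey().mapValues(_.toList)
    // e.g. (1, List(i, a)), (2, List(Hi, am)), (4, List(test)), (6, List(string))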

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
The other way to do it is to build a custom version of Spark where you have changed the value of DEFAULT_SCHEDULING_MODE -- and if you were paying close attention, I accidentally let it slip that that is what I've done. I previously wrote "schedulingMode = DEFAULT_SCHEDULING_MODE -- i.e. Schedulin

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Yes, the FileInputStream is closed. Maybe I didn't show it in the screenshot. As Spark implements sort-based shuffle, there is a parameter called maximum merge factor which decides the number of files that can be merged at once, and this avoids too many open files. I am suspecting that it is somet

How to use Java8

2016-01-05 Thread Sea
Hi all, I want to use Java 8. I use JDK 1.8.0_65 in the production environment, but it doesn't work. Should I build Spark using JDK 1.8, and set 1.8 in pom.xml? java.lang.UnsupportedClassVersionError: Unsupported major.minor version 52.

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jeff Zhang
+1 On Wed, Jan 6, 2016 at 9:18 AM, Juliet Hougland wrote: > Most admins I talk to about python and spark are already actively (or on > their way to) managing their cluster python installations. Even if people > begin using the system python with pyspark, there is eventually a user who > needs a

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Right, I can override the root pool in the configuration file. Thanks, Mark. On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra wrote: > Just configure the pool with > schedulingMode FAIR in fairscheduler.xml (or > in spark.scheduler.allocation.file if you have overridden the default name > for the config file.) `buildDefaultPo

UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Soumitra Johri
Hi, I am relatively new to Spark and am using updateStateByKey() operation to maintain state in my Spark Streaming application. The input data is coming through a Kafka topic. 1. I want to understand how are DStreams partitioned? 2. How does the partitioning work with mapWithState() or u

DataFrame withColumnRenamed throwing NullPointerException

2016-01-05 Thread Prasad Ravilla
I am joining two data frames as shown in the code below. This is throwing a NullPointerException. I have a number of different joins throughout the program, and the SparkContext throws this NullPointerException randomly on one of the joins. The two data frames are very large data frames ( aroun

Re: problem building spark on centos

2016-01-05 Thread Ted Yu
Which version of maven are you using ? It should be 3.3.3+ On Tue, Jan 5, 2016 at 4:54 PM, Jade Liu wrote: > Hi, All: > > I’m trying to build spark 1.5.2 from source using maven with the following > command: > > ./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0 > -Dscala-2

pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread Wei Chen
Hi, I am trying to retrieve the rows with a minimum value of a column for each group. For example, for the following dataframe:

a | b | c
---------
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9

I group by 'a', and want the rows with the smalles

problem building spark on centos

2016-01-05 Thread Jade Liu
Hi, All: I'm trying to build spark 1.5.2 from source using maven with the following command: ./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests I got the following error: + VERSION='[ERROR] [Help 2] http://cwiki.apache.or

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
Just configure the pool with schedulingMode FAIR in fairscheduler.xml (or in the file named by spark.scheduler.allocation.file if you have overridden the default name for the config file.) `buildDefaultPool()` will only build the pool named "default" with the default properties (such as schedulingMode = DEFAULT_SCHEDULING_MODE -- i.e. Sc
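A sketch of the pieces involved on the application side (the allocation-file path and pool name are hypothetical; the pool properties themselves live in the XML file):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-scheduling-example")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // Optional: pick a specific pool for jobs submitted from this thread
    sc.setLocalProperty("spark.scheduler.pool", "default")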

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
I don't think that we're planning to drop Java 7 support for Spark 2.0. Personally, I would recommend using Java 8 if you're running Spark 1.5.0+ and are using SQL/DataFrames so that you can benefit from improvements to code cache flushing in the Java 8 JVMs. Spark SQL's generated classes can fill

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
Sorry, I didn't make it clear. What I want is for the default pool to use fair scheduling. But it seems that if I want to use fair scheduling now, I have to set spark.scheduler.pool explicitly. On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra wrote: > I don't understand. If you're using fair scheduling and don't

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
hey evil admin:) i think the bit about java was from me? if so, i meant to indicate that the reality for us is java is 1.7 on most (all?) clusters. i do not believe spark prefers java 1.8. my point was that even though java 1.7 is getting old as well it would be a major issue for me if spark drop

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Michael Armbrust
This would also be possible with an Aggregator in Spark 1.6: https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu wrote: > Something like the following: > > val zeroValue = collection.mutable.Set[String]() > > val a

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
> > Note that you _can_ use a Python 2.7 `ipython` executable on the driver > while continuing to use a vanilla `python` executable on the executors Whoops, just to be clear, this should actually read "while continuing to use a vanilla `python` 2.7 executable". On Tue, Jan 5, 2016 at 3:07 PM, Jo

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
Yep, the driver and executors need to have compatible Python versions. I think that there are some bytecode-level incompatibilities between 2.6 and 2.7 which would impact the deserialization of Python closures, so I think you need to be running the same 2.x version for all communicating Spark proce

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python installed since they run Python code in PySpark jobs natively. On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas < > nicholas.cham...@g

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
if python 2.7 only has to be present on the node that launches the app (does it?) than that could be important indeed. On Tue, Jan 5, 2016 at 6:02 PM, Koert Kuipers wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote:

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
interesting i didnt know that! On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas wrote: > even if python 2.7 was needed only on this one machine that launches the > app we can not ship it with our software because its gpl licensed > > Not to nitpick, but maybe this is important. The Python licens

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Ted Yu
Something like the following: val zeroValue = collection.mutable.Set[String]() val aggregated = data.aggregateByKey (zeroValue)((set, v) => set += v, (setOne, setTwo) => setOne ++= setTwo) On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue wrote: > Hey, > > For example, a table df with two columns > id
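A rough end-to-end version of that approach, starting from the DataFrame in the original question (the column types are assumptions):

    // Assume df has columns "id" (int) and "name" (string)
    val pairs = df.select("id", "name").rdd
      .map(row => (row.getInt(0), row.getString(1)))

    val zeroValue = collection.mutable.Set[String]()
    val aggregated = pairs.aggregateByKey(zeroValue)(
      (set, v) => set += v,
      (s1, s2) => s1 ++= s2)

    // Back to (id, array of names)
    val result = aggregated.mapValues(_.toArray)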

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661 On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers wrote: > i do not think so. > > does the python 2.7 need to be installed on all slaves? if so, we do not > have direct access to those. > > also, spark is easy for us to ship with our sof

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
even if python 2.7 was needed only on this one machine that launches the app we can not ship it with our software because its gpl licensed Not to nitpick, but maybe this is important. The Python license is GPL-compatible but not GPL : Note GPL-compatible do

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
i do not think so. does the python 2.7 need to be installed on all slaves? if so, we do not have direct access to those. also, spark is easy for us to ship with our software since its apache 2 licensed, and it only needs to be present on the machine that launches the app (thanks to yarn). even if

How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
Hey, for example, a table df with two columns:

id name
1  abc
1  bdf
2  ab
2  cd

I want to group by the id and concat the strings into an array of strings, like this:

id
1  [abc, bdf]
2  [ab, cd]

How could I achieve this with a dataframe? I am stuck on df.groupBy("id"). ??? Thanks

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I imagine that they're also capable of installing a standalone Python alongside that Spark version (without changing Python systemwide). For instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x without impac

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
yeah, the practical concern is that we have no control over java or python version on large company clusters. our current reality for the vast majority of them is java 7 and python 2.6, no matter how outdated that is. i dont like it either, but i cannot change it. we currently don't use pyspark s

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until 2020. So I'm assuming these large companies will have the option of riding out Python 2.6 until then. Are we seriously saying that Spark should likewise support Python 2.6 for the next several years? Even though the core Pyth

Re: aggregateByKey vs combineByKey

2016-01-05 Thread Ted Yu
Looking at PairRDDFunctions.scala : def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] = self.withScope { ... combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, combOp, part

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Michael Armbrust
On Tue, Jan 5, 2016 at 1:31 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > I'll do, but if you want my two cents, creating a dedicated "optimised" > encoder for Avro would be great (especially if it's possible to do better > than plain AvroKeyValueOutputFormat with saveAsNewAPIHa

sortBy transformation shows as a job

2016-01-05 Thread Soumitra Kumar
Fellows, I have some simple code: sc.parallelize (Array (1, 4, 3, 2), 2).sortBy (i=>i).foreach (println) This results in 2 jobs (sortBy, foreach) in Spark's application master UI. I thought there was a one-to-one relationship between RDD actions and jobs. Here, the only action is foreach, so should be only on

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
I'll do, but if you want my two cents, creating a dedicated "optimised" encoder for Avro would be great (especially if it's possible to do better than plain AvroKeyValueOutputFormat with saveAsNewAPIHadoopFile :) ) Thanks for your time Michael, and happy new year :-) Regards, Olivier. 2016-01-0

aggregateByKey vs combineByKey

2016-01-05 Thread Marco Mistroni
Hi all, I have the following dataset: kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,string)] It's a simple list of tuples containing (word_length, word). What I wanted to do was to group the result by key in order to have a result in the form [(word_length_1, [word1, word2, word3], word_length

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Andy Davidson
Hi Unk1102, I also had trouble when I used coalesce(); repartition() worked much better. Keep in mind that if you have a large number of partitions you are probably going to have high communication costs. Also, my code works a lot better on 1.6.0. DataFrame memory was not spilled in 1.5.2. In 1.6.0 unpersi

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-05 Thread Michael Armbrust
> > I am trying to implement the org.apache.spark.ml.Transformer interface in > Java 8. > My understanding is the pseudo code for transformers is something like > > @Override > > public DataFrame transform(DataFrame df) { > > 1. Select the input column > > 2. Create a new column > > 3. Append the ne
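In Scala, that select / create / append pattern usually reduces to a UDF plus withColumn; a sketch with made-up column and function names:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{udf, col}

    // Hypothetical transformation: upper-case an input string column
    val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)

    def transform(df: DataFrame): DataFrame =
      df.withColumn("outputCol", toUpper(col("inputCol")))   // appends the new column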

RE: Spark on Apache Ignite?

2016-01-05 Thread nate
We started playing with Ignite to back Hadoop, Hive and Spark services, and are looking to move to it as our default for deployments going forward. Still early, but so far it's been pretty nice, and we're excited for the flexibility it will provide for our particular use cases. Would say in general it's worth looki

spark-itemsimilarity No FileSystem for scheme error

2016-01-05 Thread roy
Hi, we are using CDH 5.4.0 with Spark 1.5.2 (it doesn't come with CDH 5.4.0). I am following this link https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html to try to test/create a new algorithm with mahout item-similarity. I am running the following command ./bin/mahout sp

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Umesh Kacha
Hi, DataFrame's coalesce has no boolean option; that is only for RDDs, I believe. sourceFrame.coalesce(1,true) // gives a compilation error On Wed, Jan 6, 2016 at 1:38 AM, Alexander Pivovarov wrote: > try coalesce(1, true). > > On Tue, Jan 5, 2016 at 11:58 AM, unk1102 wrote: > >> hi I am trying to
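A sketch of a shuffle-based alternative for a DataFrame, using repartition(1) along the lines of the original code (the exact spark-csv compression option name is an assumption):

    // repartition(1) always shuffles, unlike coalesce(1)
    sourceFrame.repartition(1)
      .write
      .format("com.databricks.spark.csv")
      .option("codec", "gzip")   // assumption: compression option for spark-csv
      .save("/path/hadoop")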

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Igor Berman
Another option would be to try rdd.toLocalIterator(), though I'm not sure it will help. I had the same problem and ended up moving all parts to local disk (with the Hadoop FileSystem API) and then processing them locally. On 5 January 2016 at 22:08, Alexander Pivovarov wrote: > try coalesce(1, true). > > O

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Alexander Pivovarov
try coalesce(1, true). On Tue, Jan 5, 2016 at 11:58 AM, unk1102 wrote: > hi I am trying to save many partitions of Dataframe into one CSV file and > it > take forever for large data sets of around 5-6 GB. > > > sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").sav

Re: Spark SQL dataframes explode /lateral view help

2016-01-05 Thread Deenar Toraskar
val sparkConf = new SparkConf() .setMaster("local[*]") .setAppName("Dataframe Test") val sc = new SparkContext(sparkConf) val sql = new SQLContext(sc) val dataframe = sql.read.json("orders.json") val expanded = dataframe .explode[::[Long], Long]("items", "item1")(row => row) .explode[::[

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-05 Thread Andy Davidson
Hi Michael, I am not sure you understood my code correctly. I am trying to implement the org.apache.spark.ml.Transformer interface in Java 8. My understanding is the pseudo code for transformers is something like @Override public DataFrame transform(DataFrame df) { 1. Select the input column 2.

Spark SQL dataframes explode /lateral view help

2016-01-05 Thread Deenar Toraskar
Hi All, I have the following Spark SQL query and would like to convert it to use the dataframes API (Spark 1.6). The eee, eep and pfep are all maps of (int -> float). select e.counterparty, epe, mpfe, eepe, noOfMonthseep, teee as effectiveExpectedExposure, teep as expectedExposure , tpfep as

coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread unk1102
hi I am trying to save many partitions of a Dataframe into one CSV file and it takes forever for large data sets of around 5-6 GB. sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop") For small data the above code works well but for large data it hangs f

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Shixiong(Ryan) Zhu
Did you enable "spark.speculation"? On Tue, Jan 5, 2016 at 9:14 AM, Prasad Ravilla wrote: > I am using Spark 1.5.2. > > I am not using Dynamic allocation. > > Thanks, > Prasad. > > > > > On 1/5/16, 3:24 AM, "Ted Yu" wrote: > > >Which version of Spark do you use ? > > > >This might be related: >

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Ted Yu
+1 > On Jan 5, 2016, at 10:49 AM, Davies Liu wrote: > > +1 > > On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas > wrote: >> +1 >> >> Red Hat supports Python 2.6 on REHL 5 until 2020, but otherwise yes, Python >> 2.6 is ancient history and the core Python developers stopped supporting it >> in

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
rhel/centos 6 ships with python 2.6, doesnt it? if so, i still know plenty of large companies where python 2.6 is the only option. asking them for python 2.7 is not going to work so i think its a bad idea On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland wrote: > I don't see a reason Spark 2.0 w

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas wrote: > +1 > > Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python > 2.6 is ancient history and the core Python developers stopped supporting it > in

Re: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Shixiong(Ryan) Zhu
Hey Rachana, There are two jobs in your code actually: `rdd.isEmpty` and `rdd.saveAsTextFile`. Since you don't cache or checkpoint this rdd, it will execute your map function twice for each record. You can move "accum.add(1)" to "rdd.saveAsTextFile" like this: JavaDStream lines = messages.m
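One way to avoid the double execution, sketched in Scala rather than the original Java and assuming lines is a DStream[String]:

    lines.foreachRDD { rdd =>
      rdd.cache()                        // avoid recomputing the map (and accum.add) twice
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("/path/out")  // hypothetical path; in practice derive a per-batch path
      }
      rdd.unpersist()
    }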

Spark on Apache Ignite?

2016-01-05 Thread unk1102
Hi, has anybody tried and had success with Spark on Apache Ignite? It seems promising: https://ignite.apache.org/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-tp25884.html Sent from the Apache Spark User List mailing list archive at Nab

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
I don't understand. If you're using fair scheduling and don't set a pool, the default pool will be used. On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang wrote: > > It seems currently spark.scheduler.pool must be set as localProperties > (associate with thread). Any reason why spark.scheduler.pool ca

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Michael Armbrust
You could try with the `Encoders.bean` method. It detects classes that have getters and setters. Please report back! On Tue, Jan 5, 2016 at 9:45 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > considering the new Datasets API, will there be Encoders defined for >
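A sketch of that suggestion for an Avro-generated, bean-style class (MyAvroRecord is hypothetical, and the spark-avro data source is assumed):

    import org.apache.spark.sql.{Encoder, Encoders}

    implicit val avroEncoder: Encoder[MyAvroRecord] = Encoders.bean(classOf[MyAvroRecord])

    val ds = sqlContext.read
      .format("com.databricks.spark.avro")   // spark-avro package
      .load("/data/records.avro")            // hypothetical path
      .as[MyAvroRecord]                      // Dataset[MyAvroRecord] in Spark 1.6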

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Annabel Melongo
Vijay, Are you closing the FileInputStream at the end of each loop (in.close())? My guess is those streams aren't closed and thus the "too many open files" exception. On Tuesday, January 5, 2016 8:03 AM, Priya Ch wrote: Can someone throw light on this ? Regards, Padma Ch On Mon, Dec 2

Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
Hi everyone, considering the new Datasets API, will there be Encoders defined for reading and writing Avro files ? Will it be possible to use already generated Avro classes ? Regards, -- *Olivier Girardot*

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Yes, that was the case, the app was built with java 8. But that was the case with Spark 1.5.2 as well and it didn't complain. On 5 January 2016 at 16:40, Dean Wampler wrote: > ​Still, it would be good to know what happened exactly. Why did the netty > dependency expect Java 8? Did you build you

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Prasad Ravilla
I am using Spark 1.5.2. I am not using Dynamic allocation. Thanks, Prasad. On 1/5/16, 3:24 AM, "Ted Yu" wrote: >Which version of Spark do you use ? > >This might be related: >https://issues.apache.org/jira/browse/SPARK-8560

RE: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
Thanks a lot for your prompt response. I am pushing one message. HashMap kafkaParams = new HashMap(); kafkaParams.put("metadata.broker.list","localhost:9092"); kafkaParams.put("zookeeper.connect", "localhost:2181"); JavaPairInputDStream messages = KafkaUtils.createDirectStream( jssc, S

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Dean Wampler
​Still, it would be good to know what happened exactly. Why did the netty dependency expect Java 8? Did you build your app on a machine with Java 8 and deploy on a Java 7 machine?​ Anyway, I played with the 1.6.0 spark-shell using Java 7 and it worked fine. I also looked at the distribution's cla

Re: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Jean-Baptiste Onofré
Hi Rachana, don't you have two messages on the kafka broker ? Regards JB On 01/05/2016 05:14 PM, Rachana Srivastava wrote: I have a very simple two lines program. I am getting input from Kafka and save the input in a file and counting the input received. My code looks like this, when I run t

Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
I have a very simple two lines program. I am getting input from Kafka and save the input in a file and counting the input received. My code looks like this, when I run this code I am getting two accumulator count for each input. HashMap kafkaParams = new HashMap(); kafkaParams.put("metadata.

Re: sparkR ORC support.

2016-01-05 Thread Prem Sure
Yes Sandeep, also copy hive-site.xml too to spark conf directory. On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana wrote: > Also, do I need to setup hive in spark as per the link > http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark > ? > > We might need to copy hdfs-site.

Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Also, do I need to setup hive in spark as per the link http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark ? We might need to copy hdfs-site.xml file to spark conf directory ? On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana wrote: > Deepak > > Tried this. Getting this erro

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Hi Dean, thanks so much for the response! It works without a problem now! On 5 January 2016 at 14:33, Dean Wampler wrote: > ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The > Java 7 method returns a Set. Are you running Java 7? What happens if you > run Java 8? > > Dean

Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Deepak, Tried this. Getting this error now: Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") : unused argument ("") On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma wrote: > Hi Sandeep > can you try this ? > > results <- sql(hivecontext, "FROM test SELECT id","") > > Thanks > D

Handling futures from foreachPartitionAsync in Spark Streaming

2016-01-05 Thread Trevor
Hi everyone, I'm working on a spark streaming program where I need to asynchronously apply a complex function across the partitions of an RDD. I'm currently using foreachPartitionAsync to achieve this. What is the idiomatic way of handling the FutureAction that returns from the foreachPartitionAsy

Re: Is there a way to use parallelize function in sparkR spark version (1.6.0)

2016-01-05 Thread Ted Yu
Please take a look at the following for examples: R/pkg/R/RDD.R R/pkg/R/pairRDD.R Cheers On Tue, Jan 5, 2016 at 2:36 AM, Chandan Verma wrote: > >

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Dean Wampler
ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The Java 7 method returns a Set. Are you running Java 7? What happens if you run Java 8? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe

Networking problems in Spark 1.6.0

2016-01-05 Thread Yiannis Gkoufas
Hi there, I have been using Spark 1.5.2 on my cluster without a problem and wanted to try Spark 1.6.0. I have the exact same configuration on both clusters. I am able to start the Standalone Cluster but I fail to submit a job getting errors like the following: 16/01/05 14:24:14 INFO AppClient$Cli

Re: finding distinct count using dataframe

2016-01-05 Thread Kristina Rogale Plazonic
I think it's an expression, rather than a function you'd find in the API (as a function you could do df.select(col).distinct.count) This will give you the number of distinct rows in both columns: scala> df.select(countDistinct("name", "age")) res397: org.apache.spark.sql.DataFrame = [COUNT(DIST

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1 Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python 2.6 is ancient history and the core Python developers stopped supporting it in 2013. RHEL 5 is not a good enough reason to continue support for Python

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep can you try this ? results <- sql(hivecontext, "FROM test SELECT id","") Thanks Deepak On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana wrote: > Thanks Deepak. > > I tried this as well. I created a hivecontext with "hivecontext <<- > sparkRHive.init(sc) " . > > When I tried to r

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Can someone throw light on this ? Regards, Padma Ch On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch wrote: > Chris, we are using spark 1.3.0 version. we have not set the > spark.streaming.concurrentJobs > parameter. It takes the default value. > > Vijay, > > From the stack trace it is evident th

pyspark Dataframe and histogram through ggplot (python)

2016-01-05 Thread Snehotosh Banerjee
Hi, I am facing an issue rendering charts through ggplot while working on a pyspark Dataframe on a dummy dataset. I have created a Spark Dataframe and am trying to draw a histogram through ggplot in python. [image: Inline image 1] [image: Inline image 2] I have a valid schema as below. But, below com

Re: sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Thanks Deepak. I tried this as well. I created a hivecontext with "hivecontext <<- sparkRHive.init(sc) " . When I tried to read hive table from this , results <- sql(hivecontext, "FROM test SELECT id") I get below error, Error in callJMethod(sqlContext, "sql", sqlQuery) : Invalid jobj 2.

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep, I am not sure if ORC can be read directly in R, but there can be a workaround. First create a Hive table on top of the ORC files and then access the Hive table in R. Thanks Deepak On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana wrote: > Hello > > I need to read ORC files in hdfs in R using

sparkR ORC support.

2016-01-05 Thread Sandeep Khurana
Hello I need to read an ORC files in hdfs in R using spark. I am not able to find a package to do that. Can anyone help with documentation or example for this purpose? -- Architect Infoworks.io http://Infoworks.io

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Ted Yu
Which version of Spark do you use ? This might be related: https://issues.apache.org/jira/browse/SPARK-8560 Do you use dynamic allocation ? Cheers > On Jan 4, 2016, at 10:05 PM, Prasad Ravilla wrote: > > I am seeing negative active tasks in the Spark UI. > > Is anyone seeing this? > How is t

RE: Spark Streaming + Kafka + scala job message read issue

2016-01-05 Thread vivek.meghanathan
Hello All, After investigating further using a test program, we were able to read the kafka input messages using spark streaming. Once we add a particular line which performs map and reduce - and groupByKey (all written in single line), we are not seeing the input message details in the logs.

Re: finding distinct count using dataframe

2016-01-05 Thread Arunkumar Pillai
Thanks Yanbo, thanks for the help. But I'm not able to find the countDistinct or approxCountDistinct functions. Are these functions within dataframe or in some other package? On Tue, Jan 5, 2016 at 3:24 PM, Yanbo Liang wrote: > Hi Arunkumar, > > You can use datasetDF.select(countDistinct(col1, col2, col
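Both functions live in org.apache.spark.sql.functions, so the missing piece is usually just the import; a minimal sketch:

    import org.apache.spark.sql.functions.{countDistinct, approxCountDistinct}

    val counts = datasetDF.select(
      countDistinct("col1"),
      approxCountDistinct("col2"))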

Is there a way to use parallelize function in sparkR spark version (1.6.0)

2016-01-05 Thread Chandan Verma

Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Jeff Zhang
It seems currently spark.scheduler.pool must be set as localProperties (associate with thread). Any reason why spark.scheduler.pool can not be used globally. My scenario is that I want my thriftserver started with fair scheduler as the default pool without using set command to set the pool. Is the

Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar, You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or approxCountDistinct for an approximate result. 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai : > Hi > > Is there any function to find the distinct count of all the variables in a > dataframe? > > val sc = new SparkCont

Re: groupByKey does not work?

2016-01-05 Thread Sean Owen
I suspect this is another instance of case classes not working as expected between the driver and executor when used with spark-shell. Search JIRA for some back story. On Tue, Jan 5, 2016 at 12:42 AM, Arun Luthra wrote: > Spark 1.5.0 > > data: > > p1,lo1,8,0,4,0,5,20150901|5,1,1.0 > p1,lo
