Re: ElasticSearch Spark error

2017-05-15 Thread Rohit Verma
Try switching on trace logging. Is your ES cluster running behind Docker? It's possible that your Spark cluster can't communicate using the Docker IPs. Regards Rohit On May 15, 2017, at 4:55 PM, Nick Pentreath wrote: It may be best to
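A minimal sketch (not from the thread; assuming the elasticsearch-hadoop connector, with placeholder hosts and paths) of telling the connector to stick to the declared, externally reachable nodes rather than discovering internal Docker IPs:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class EsWanOnlySketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("es-wan-only").getOrCreate();
            Dataset<Row> df = spark.read().parquet("hdfs:///data/events");   // placeholder input
            df.write()
              .format("org.elasticsearch.spark.sql")
              .option("es.nodes", "es-public-host:9200")    // address reachable from the workers
              .option("es.nodes.wan.only", "true")          // do not follow internal/Docker node IPs
              .save("events/docs");                         // index/type resource
        }
    }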

Spark failing while persisting sorted columns.

2017-03-09 Thread Rohit Verma
details: * Cores in use: 20 Total, 0 Used * Memory in use: 72.2 GB Total, 0.0 B Used And the process configuration is: "spark.cores.max" = "20", "spark.executor.memory" = "3400MB", "spark.kryoserializer.buffer.max" = "1000MB". Any leads would be highly appreciated. Regards Rohit Verma
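For reference, a minimal sketch (an assumption, not the original job) of expressing the same limits on a SparkSession builder; the values mirror the ones quoted above:

    import org.apache.spark.sql.SparkSession;

    public final class JobConfSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("sorted-column-persist")
                .config("spark.cores.max", "20")
                .config("spark.executor.memory", "3400m")            // 3400MB as in the post
                .config("spark.kryoserializer.buffer.max", "1000m")  // 1000MB as in the post
                .getOrCreate();
            System.out.println(spark.conf().get("spark.executor.memory"));
            spark.stop();
        }
    }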

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Rohit Verma
Sending this to the dev list. Can you please help by providing some ideas on the below? Regards Rohit > On Feb 23, 2017, at 3:47 PM, Rohit Verma <rohit.ve...@rokittech.com> wrote: > Hi > While joining two columns of different datasets, how to optimize the join if both the colum

Re: Spark driver CPU usage

2017-03-01 Thread Rohit Verma
Use the conf spark.task.cpus to control the number of CPUs used by a task. On Mar 1, 2017, at 5:41 PM, Phadnis, Varun wrote: > Hello, > Is there a way to control CPU usage for the driver when running applications in client mode? > Currently we are observing that the
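A minimal sketch of the suggestion (values illustrative): each task then reserves two cores out of the executor's pool instead of one.

    import org.apache.spark.sql.SparkSession;

    public final class TaskCpusSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("task-cpus-demo")
                .config("spark.task.cpus", "2")   // cores reserved per task
                .getOrCreate();
            spark.range(1_000_000).count();       // any job; the scheduler now packs 2 cores per task
            spark.stop();
        }
    }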

Spark join over sorted columns of dataset.

2017-02-23 Thread Rohit Verma
Hi While joining two columns of different datasets, how can the join be optimized if both columns are pre-sorted within their datasets, so that when Spark does a sort-merge join the sorting phase can be skipped? Regards Rohit
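One possible route (a hedged sketch, not necessarily what the thread settled on): persist both sides as bucketed, sorted tables on the join key, so a later sort-merge join can reuse the existing layout. Table and path names are placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class PreSortedJoinSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("pre-sorted-join").getOrCreate();

            Dataset<Row> left = spark.read().parquet("hdfs:///data/left");
            Dataset<Row> right = spark.read().parquet("hdfs:///data/right");

            // Write once, bucketed and sorted by the join key (bucketing metadata lives in the catalog).
            left.write().bucketBy(32, "key").sortBy("key").saveAsTable("left_b");
            right.write().bucketBy(32, "key").sortBy("key").saveAsTable("right_b");

            // Later joins on "key" can then skip the shuffle/sort phases.
            Dataset<Row> joined = spark.table("left_b").join(spark.table("right_b"), "key");
            joined.explain(true);   // inspect the plan for missing Exchange/Sort nodes
        }
    }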

Dataset count on database or parquet

2017-02-08 Thread Rohit Verma
Hi Which of the following is the better approach when there are too many values in the database? final Dataset<Row> dataset = spark.sqlContext().read() .format("jdbc") .option("url", params.getJdbcUrl()) .option("driver", params.getDriver())
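The snippet is cut off, so as a hedged comparison sketch (placeholder URL, driver, and table; the original params object is replaced by literals): pushing the count down to the database versus counting a parquet copy on the Spark side.

    import java.util.Properties;
    import org.apache.spark.sql.SparkSession;

    public final class CountComparisonSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("count-compare").getOrCreate();

            Properties props = new Properties();
            props.setProperty("driver", "org.postgresql.Driver");        // placeholder driver

            String url = "jdbc:postgresql://db-host:5432/mydb";          // placeholder URL

            // Option A: the database computes the count; only one row crosses the wire.
            long dbCount = spark.read()
                .jdbc(url, "(SELECT COUNT(*) AS cnt FROM big_table) AS t", props)
                .first().getLong(0);

            // Option B: count a parquet copy (or the full JDBC read) on the Spark side.
            long sparkCount = spark.read().parquet("hdfs:///data/big_table").count();

            System.out.println(dbCount + " vs " + sparkCount);
        }
    }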

Re: Having multiple spark context

2017-01-30 Thread Rohit Verma
Sent: Monday, January 30, 2017 1:33 PM To: vincent gromakowski <vincent.gromakow...@gmail.com> Cc: Rohit Verma <rohit.ve...@rokittech.com>; user@spark.apache.org; Sing

Re: Having multiple spark context

2017-01-29 Thread Rohit Verma
Hi, If I am right, you need to launch the other context from another JVM. If you try to launch another context from the same JVM, it will return the existing context. Rohit On Jan 30, 2017, at 12:24 PM, Mark Hamstra wrote: More than

ToLocalIterator vs collect

2017-01-05 Thread Rohit Verma
Hi all, I am aware that collect will return a list aggregated on the driver, and this will cause an OOM when the list is too big. Is toLocalIterator safe to use with a very big list? I want to access all values one by one. Basically the goal is to compare two sorted RDDs (A and B) to find the top k
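A minimal illustration of the difference (paths and column names are placeholders): toLocalIterator streams one partition at a time to the driver, so only the largest partition has to fit in driver memory, unlike collect.

    import java.util.Iterator;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class LocalIteratorSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("local-iterator").getOrCreate();
            Dataset<Row> sorted = spark.read().parquet("hdfs:///data/a").sort("value");

            // Streams partition by partition instead of materializing everything at once.
            Iterator<Row> it = sorted.toLocalIterator();
            long seen = 0;
            while (it.hasNext() && seen < 100) {   // e.g. only the first 100 rows are needed
                System.out.println(it.next());
                seen++;
            }
            spark.stop();
        }
    }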

Getting values list per partition

2016-12-28 Thread Rohit Verma
Hi, I am trying something like: final Dataset<String> df = spark.read().csv("src/main/resources/star2000.csv").select("_c1").as(Encoders.STRING()); final Dataset arrayListDataset = df.mapPartitions(new MapPartitionsFunction() { @Override public Iterator
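The snippet is truncated; a hedged completion of the same shape (one output row per partition, holding that partition's values) might look like the following. The comma-joined concatenation at the end is illustrative.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.spark.api.java.function.MapPartitionsFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;

    public final class PerPartitionValuesSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("per-partition-values").getOrCreate();
            Dataset<String> df = spark.read()
                .csv("src/main/resources/star2000.csv")
                .select("_c1")
                .as(Encoders.STRING());

            // One output row per partition: that partition's values joined into a single string.
            Dataset<String> perPartition = df.mapPartitions(
                (MapPartitionsFunction<String, String>) input -> {
                    List<String> values = new ArrayList<>();
                    while (input.hasNext()) {
                        values.add(input.next());
                    }
                    return Collections.singletonList(String.join(",", values)).iterator();
                },
                Encoders.STRING());

            perPartition.show(false);
        }
    }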

Re: Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-22 Thread Rohit Verma
ites... > One more thing, make sure you have enough network bandwidth... > Regards, > Yang >> On Dec 22, 2016, at 12:35 PM, Rohit Verma <rohit.ve...@rokittech.com> wrote: >> I am setting up a spark cluste

Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-22 Thread Rohit Verma
I am setting up a spark cluster. I have HDFS data nodes and Spark master nodes on the same instances. To add Elasticsearch to this cluster, should I spawn ES on a different machine or on the same machines? I have only 12 machines: 1 master (Spark and HDFS), 8 Spark workers and HDFS data nodes. I can use 3

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Rohit Verma
@Deepak, This conversion is not suitable for categorical data. But again, as I mentioned, it all depends on the nature of the data and what is intended by the OP. Consider that you want to convert race into numbers (races such as black, white and asian). So, you want numerical variables, and you could just assign a
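If a hand-rolled mapping is not appropriate, Spark ML's StringIndexer is the usual way to turn such a categorical column into indices; a minimal sketch with placeholder input path and column names:

    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class StringIndexerSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("string-indexer").getOrCreate();
            Dataset<Row> people = spark.read().option("header", "true").csv("hdfs:///data/people.csv");

            // Maps each distinct string in "race" to a double index (most frequent label -> 0.0).
            StringIndexer indexer = new StringIndexer()
                .setInputCol("race")
                .setOutputCol("raceIndex");

            Dataset<Row> indexed = indexer.fit(people).transform(people);
            indexed.select("race", "raceIndex").show();
        }
    }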

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Rohit Verma
There are various techniques, but the actual answer will depend on what you are trying to do, the kind of input data, and the nature of the algorithm. You can browse through https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/ ; this should give you a starting

Is selecting different datasets from same parquet file blocking.

2016-11-17 Thread Rohit Verma
Hi I have a dataset with 10 columns, created from a parquet file. I want to perform some operations on each column, so I create 10 datasets as dsBig.select(col). When I submit these 10 jobs, will they block each other, since all of them read from the same parquet file? Is selecting

Re: Map and MapParitions with partition-local variable

2016-11-17 Thread Rohit Verma
Using a map and mapPartitions on the same df at the same time doesn't make much sense to me. Also, without complete info, I am assuming that you have some partition strategy being defined/influenced by the map operation. In that case you can create a hashmap of map values for each partition, do
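A hedged sketch of that per-partition hashmap idea (the lookup built here is just a running count; the real logic depends on the partition strategy in question, and the column access is a placeholder):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.spark.api.java.function.MapPartitionsFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class PartitionLocalMapSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("partition-local-map").getOrCreate();
            Dataset<Row> df = spark.read().parquet("hdfs:///data/events");   // placeholder input

            Dataset<String> mapped = df.mapPartitions(
                (MapPartitionsFunction<Row, String>) rows -> {
                    // Partition-local state: created once per partition, reused for every row in it.
                    Map<String, Long> counts = new HashMap<>();
                    List<String> out = new ArrayList<>();
                    while (rows.hasNext()) {
                        String key = rows.next().getString(0);   // assumes a string first column
                        long n = counts.merge(key, 1L, Long::sum);
                        out.add(key + ":" + n);
                    }
                    return out.iterator();
                },
                Encoders.STRING());

            mapped.show();
        }
    }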

Re: Joining to a large, pre-sorted file

2016-11-15 Thread Rohit Verma
You can try coalesce on the join statement: val result = master.join(transaction, "key").coalesce(<number of partitions in master>) On Nov 15, 2016, at 8:07 PM, Stuart White wrote: It seems that the number of files could possibly get out of

Re: Problem submitting a spark job using yarn-client as master

2016-11-15 Thread Rohit Verma
You can set HDFS as the default: sparksession.sparkContext().hadoopConfiguration().set("fs.defaultFS", "hdfs://master_node:8020"); Regards Rohit On Nov 16, 2016, at 3:15 AM, David Robison wrote: I am trying to submit a spark job

Spark hash function

2016-11-14 Thread Rohit Verma
Hi All, One of the miscellaneous functions in Spark SQL is the hash expression [Murmur3Hash] ("hash"). I was wondering which variant of murmur3 it is: murmurhash3_x64_128 or murmurhash3_x86_32 (the latter is also part of the spark unsafe package). Also, what is the seed for the hash function? I am
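For poking at the behavior from the Dataset side, org.apache.spark.sql.functions.hash exposes the same expression; a small sketch (which Murmur3 variant and seed it uses is exactly the question above, so verify against the Murmur3Hash source of your Spark version):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.hash;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class HashFunctionSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("hash-demo").getOrCreate();
            Dataset<Row> df = spark.range(5).toDF("id");

            // hash(...) wraps the Murmur3-based catalyst expression behind SQL's hash().
            df.select(col("id"), hash(col("id")).alias("hashed")).show();
        }
    }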

Re: Spark joins using row id

2016-11-12 Thread Rohit Verma
ain() method to print out the steps that Spark will execute to satisfy your query. This site explains how all this works: http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/ On Sat, Nov 12, 2016 at 5:11 AM, Rohit Verma <rohit.ve...@rokittech.com>
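For reference, the explain() call the reply alludes to (table and column names are placeholders); the physical plan shows whether the join needs an exchange and sort:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class ExplainJoinSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("explain-join").getOrCreate();
            Dataset<Row> ds1 = spark.read().parquet("hdfs:///data/ds1");
            Dataset<Row> ds2 = spark.read().parquet("hdfs:///data/ds2");

            Dataset<Row> joined = ds1.join(ds2, "rowN");
            // Prints the logical and physical plans; look for Exchange/Sort nodes around the join.
            joined.explain(true);
        }
    }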

Spark joins using row id

2016-11-12 Thread Rohit Verma
For datasets structured as
ds1:
rowN col1
1    A
2    B
3    C
4    C
…
and ds2:
rowN col2
1    X
2    Y
3    Z
…
I want to do a left join: Dataset<Row> joined = ds1.join(ds2, "rowN", "left_outer"); I somewhere read on SO or this mailing list that if Spark is aware of datasets

Spark master shows 0 cores for executors

2016-11-07 Thread Rohit Verma
Facing a strange issue with spark 2.0.1. When creating a spark session with executor properties like 'spark.executor.memory':'3g', 'spark.executor.cores':'12', the Spark master shows 0 cores for the executors. A similar issue I found on Stack Overflow as

Spark dataset cache vs tempview

2016-11-05 Thread Rohit Verma
I have a parquet file which I am reading at least 4-5 times within my application. I was wondering what the most efficient thing to do is. Option 1: While writing the parquet file, immediately read it back into a dataset and call cache. I am assuming that by doing an immediate read I might use some existing
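For reference, a hedged sketch of the two options being weighed (note a temp view alone only gives the data a SQL name; it does not keep anything in memory by itself):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class CacheVsTempViewSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("cache-vs-view").getOrCreate();

            // Option 1: read the freshly written file back and pin it in memory for the 4-5 reuses.
            Dataset<Row> ds = spark.read().parquet("hdfs:///data/out.parquet").cache();
            ds.count();   // first action materializes the cache

            // Option 2: a temp view for SQL access; combine with cache() if it should stay in memory.
            ds.createOrReplaceTempView("out_data");
            spark.sql("SELECT COUNT(*) FROM out_data").show();
        }
    }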

Optimized way to use spark as db to hdfs etl

2016-11-05 Thread Rohit Verma
I am using spark to read from a database and write to HDFS as a parquet file. Here is a code snippet: private long etlFunction(SparkSession spark){ spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY"); Properties properties = new Properties();
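The snippet is truncated; a hedged sketch of the remaining shape (JDBC read, snappy parquet write), with placeholder URL, table, and credentials standing in for the original params object:

    import java.util.Properties;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public final class JdbcToParquetSketch {
        private static long etlFunction(SparkSession spark) {
            spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "snappy");

            Properties properties = new Properties();
            properties.setProperty("user", "etl_user");              // placeholder credentials
            properties.setProperty("password", "secret");
            properties.setProperty("driver", "org.postgresql.Driver");

            Dataset<Row> source = spark.read()
                .jdbc("jdbc:postgresql://db-host:5432/mydb", "public.big_table", properties);

            String out = "hdfs:///warehouse/big_table";
            source.write().mode(SaveMode.Overwrite).parquet(out);

            // Count the written parquet rather than the JDBC source, to avoid a second database scan.
            return spark.read().parquet(out).count();
        }

        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("db-to-hdfs-etl").getOrCreate();
            System.out.println("rows written: " + etlFunction(spark));
        }
    }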

Re: Cogrouping or joining datasets by rownum

2016-10-26 Thread Rohit Verma
The formatting of the message got disturbed, so sending it again. On Oct 27, 2016, at 8:52 AM, Rohit Verma <rohit.ve...@rokittech.com> wrote: Has anyone tried to cogroup datasets / join datasets by row num? DS1 d1 d2 40 AA

Cogrouping or joining datasets by rownum

2016-10-26 Thread Rohit Verma
Has anyone tried to cogroup datasets / join datasets by row num? e.g.
DS1:
43 AA
44 BB
45 CB
DS2:
IN india
AU australia
I want to get:
rownum ds1.1 ds1.2 ds2.1 ds2.2
1      43    AA    IN    india
2      44    BB    AU    australia
3      45    CB    null  null
I don't expect complete code, some pointers on how to do it is
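One common way to manufacture the row number (a suggestion, not the thread's accepted answer) is zipWithIndex on each side followed by a full outer join on that index:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.sql.SparkSession;
    import scala.Tuple2;

    public final class JoinByRowNumSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("join-by-rownum").getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            JavaRDD<String> ds1 = jsc.parallelize(Arrays.asList("43 AA", "44 BB", "45 CB"));
            JavaRDD<String> ds2 = jsc.parallelize(Arrays.asList("IN india", "AU australia"));

            // zipWithIndex assigns a stable 0-based row number to every element.
            JavaPairRDD<Long, String> left =
                ds1.zipWithIndex().mapToPair(t -> new Tuple2<Long, String>(t._2(), t._1()));
            JavaPairRDD<Long, String> right =
                ds2.zipWithIndex().mapToPair(t -> new Tuple2<Long, String>(t._2(), t._1()));

            // Full outer join on the row number; unmatched rows come back as Optional.empty().
            JavaPairRDD<Long, Tuple2<Optional<String>, Optional<String>>> byRow =
                left.fullOuterJoin(right);
            byRow.sortByKey().collect().forEach(System.out::println);
        }
    }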

Help regarding reading text file within rdd operations

2016-10-25 Thread Rohit Verma
oolean>() { @Override public Boolean call(Tuple2<Column, Column> tup) throws Exception { Dataset text1 = spark.read().text(tup._1); <-- same issue Dataset text2 = spark.read().text(tup._2); return text1.