Re: Ingesting data in Elasticsearch from HDFS using Spark, cluster setup and usage

2016-12-22 Thread Rohit Verma
The ingestion rate below is actually with a batch size of 10mb, 10 records. I have tried 20-50 partitions; higher partition counts give bulk-queue exceptions. Anyway, thanks for the suggestion. I would appreciate more input, specifically on cluster design. Rohit
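
For readers tuning the same path: the elasticsearch-hadoop connector exposes its bulk-write knobs on the SparkConf. A minimal Scala sketch, with the host, index name, and setting values as illustrative assumptions rather than details from this thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._   // adds saveToEs to RDDs

    val conf = new SparkConf()
      .setAppName("es-ingest")
      .set("es.nodes", "es-host:9200")              // illustrative host
      .set("es.batch.size.bytes", "10mb")           // size cap per bulk request
      .set("es.batch.size.entries", "10000")        // doc cap per bulk request
      .set("es.batch.write.retry.count", "3")       // retry on bulk-queue rejections
    val sc = new SparkContext(conf)

    val docs = sc.textFile("hdfs:///data/input").map(line => Map("line" -> line))
    docs.saveToEs("myindex/mytype")                 // "myindex/mytype" is a placeholder

Lowering es.batch.size.entries (or raising the retry count) is one way to ease pressure on the ES bulk queue when higher partition counts cause rejections.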

Best Practice for Spark Job Jar Generation

2016-12-22 Thread Chetan Khatri
Hello Spark Community, For Spark job creation I use sbt-assembly to build an uber ("super") jar and then submit it via spark-submit. Example: bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar But other folks have been debating the uber-jar
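
For context, a minimal sbt-assembly setup for such an uber jar typically marks the Spark artifacts as "provided" so they are not bundled. A sketch, with plugin and library versions as illustrative assumptions:

    // project/plugins.sbt (plugin version is illustrative)
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    // build.sbt
    name := "SparkMSAPoc"
    scalaVersion := "2.11.8"
    // "provided" keeps Spark itself out of the uber jar; spark-submit supplies it at runtime
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2" % "provided"
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }

Keeping Spark "provided" is the usual compromise in the uber-jar debate: the jar stays small and version conflicts with the cluster's Spark are avoided.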

Can't access the data in Kafka Spark Streaming globally

2016-12-22 Thread Sree Eedupuganti
I am trying to stream data from Kafka to Spark. JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
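
For comparison, the Scala form of the same direct-stream setup, assuming an existing SparkContext sc; the broker and topic names are illustrative. The usual reason data is not visible "globally" is that closures passed to DStream operations run on executors, so any results that must be seen on the driver have to be brought back explicitly, e.g. inside foreachRDD:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")  // illustrative broker
    val topics = Set("mytopic")                                     // illustrative topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      val batch = rdd.values.collect()   // collect() runs on the driver
      println(s"got ${batch.length} records")
    }
    ssc.start()
    ssc.awaitTermination()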

Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin. I’ll try using the Parquet tools for 1.9 based on the JIRA. If that doesn’t work, I’ll try Kite. Cheers, Ben

Re: Merging Parquet Files

2016-12-22 Thread Hyukjin Kwon
Hi Benjamin, As you might already know, I believe the Hadoop command does not automatically merge column-based formats such as ORC or Parquet; it just simply concatenates them. I haven't tried this myself, but I remember seeing a JIRA in Parquet -
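
The JIRA referenced appears to be the parquet-tools "merge" command, which concatenates the row groups of the inputs into one file without rewriting them. A sketch of its use; the jar name and paths are illustrative, and note that because small row groups are kept as-is, the merged file does not scan faster than the originals:

    hadoop jar parquet-tools-1.9.0.jar merge \
      /out/part-00000.gz.parquet /out/part-00001.gz.parquet /out/merged.parquet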

RE: submit spark task on yarn asynchronously via java?

2016-12-22 Thread Linyuxin
Hi, could anybody help? From: Linyuxin Sent: 22 December 2016 14:18 To: user Subject: submit spark task on yarn asynchronously via java? Hi All, Version: Spark 1.5.1, Hadoop 2.7.2. Is there any way to submit and monitor a Spark task on YARN via Java asynchronously?
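
One approach that fits Spark 1.5 is the SparkLauncher API, which submits the job in a child process and returns immediately. A minimal Scala sketch; paths and class names are illustrative, and note that the richer SparkAppHandle-based startApplication() only arrived in Spark 1.6:

    import org.apache.spark.launcher.SparkLauncher

    // launch() returns a java.lang.Process without blocking for job completion
    val proc = new SparkLauncher()
      .setSparkHome("/opt/spark")             // illustrative path
      .setAppResource("/path/to/my-app.jar")  // illustrative path
      .setMainClass("com.example.MyJob")      // illustrative class
      .setMaster("yarn-cluster")
      .launch()

    // Do other work here; poll or wait on the child process when needed.
    val exitCode = proc.waitFor()
    println(s"spark-submit exited with $exitCode")

For monitoring beyond the exit code, the YARN REST API or YarnClient can be polled with the application id parsed from the child process output.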

Re: streaming performance

2016-12-22 Thread Tathagata Das
From what I understand looking at the code in the Stack Overflow post, I think you are "simulating" the streaming version of your calculation incorrectly. You are repeatedly unioning batch dataframes to simulate streaming and then applying aggregation on the unioned DF. That is not going to compute

Re: Why does Spark 2.0 change the number of partitions when reading a parquet file?

2016-12-22 Thread Daniel Siegmann
Spark 2.0.0 introduced "Automatic file coalescing for native data sources" (http://spark.apache.org/releases/spark-release-2-0-0.html#performance-and-runtime). Perhaps that is the cause? I'm not sure if this feature is mentioned anywhere in the documentation or if there's any way to disable it.
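
If it is this feature, the packing in Spark 2.0 is driven by two SQL options rather than a single off switch, to my knowledge. A hedged sketch of tuning them; the values are illustrative:

    // Spark 2.0 sizes file-scan partitions from these two settings:
    //   maxPartitionBytes: upper bound on bytes packed into one partition (default 128 MB)
    //   openCostInBytes:   padding charged per file, which discourages tiny partitions
    spark.conf.set("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024)  // 16 MB, illustrative
    spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)

    val df = spark.read.parquet("hdfs:///data/table")   // illustrative path
    println(df.rdd.getNumPartitions)  // rises as maxPartitionBytes shrinks

Shrinking maxPartitionBytes far enough should approximate the old one-partition-per-file behavior, at the cost of more tasks.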

Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them into 1 file after they are output from Spark. Doing a coalesce(1) on the Spark cluster will not work; it just does not have the resources to do it. I'm trying to do it from the command line without using Spark. I will

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-22 Thread Jörn Franke
Why not upgrade to ojdbc7? That one is for Java 7 and 8. Keep in mind that the JDBC driver is updated constantly (although simply called ojdbc7). I would be surprised if this did not work with Cloudera, as it runs on the Oracle Big Data Appliance.

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-22 Thread Mich Talebzadeh
Thanks, all sorted. The admin had updated these lines incorrectly in $SPARK_HOME/conf/spark-defaults.conf, setting one of the parameters only for the Oracle ojdbc6.jar and the other only for the Sybase jconn4.jar! spark.driver.extraClassPath /home/hduser/jars/ojdbc6.jar
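
For anyone hitting the same thing: both jars can sit on each setting, colon-separated. A sketch of the corrected spark-defaults.conf, assuming the Sybase jar lives alongside the Oracle one (the jconn4.jar path is an assumption):

    # $SPARK_HOME/conf/spark-defaults.conf
    # Both drivers on both classpaths, colon-separated
    spark.driver.extraClassPath   /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar
    spark.executor.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar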

Spark subscribe

2016-12-22 Thread pradeep s
Hi, can you please add me to the Spark subscription list? Regards, Pradeep S

Re: Ingesting data in Elasticsearch from HDFS using Spark, cluster setup and usage

2016-12-22 Thread genia...@gmail.com
One thing I would look at is how many partitions your dataset has before writing to ES using Spark, as that may be the limiting factor on your parallel writes. You can also tune the batch size of the ES writes... One more thing: make sure you have enough network bandwidth... Regards, Yang
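
Concretely, the write parallelism can be set just before the save, since one Spark task writes per partition. A small Scala sketch; the target count is illustrative (a common rule of thumb is a low multiple of the index's shard count), and df and the index name are placeholders:

    import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

    // Caps the number of concurrent bulk writers hitting the ES cluster
    df.repartition(24)                     // illustrative: ~3x an 8-shard index
      .saveToEs("myindex/mytype")          // placeholder index/type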

Ingesting data in Elasticsearch from HDFS using Spark, cluster setup and usage

2016-12-22 Thread Rohit Verma
I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes on the same instances. To add Elasticsearch to this cluster, should I spawn ES on different machines or on the same ones? I have only 12 machines: 1 master (Spark and HDFS), 8 Spark workers and HDFS data nodes; I can use 3

Re: parsing embedded json in spark

2016-12-22 Thread Shaw Liu
Hi, I guess you can use the 'get_json_object' function. On Thu, Dec 22, 2016 at 9:52 PM +0800, "Irving Duran" wrote: Is it an option to parse that field prior to creating the dataframe? If so, that's what I would do. In terms of your master
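
For reference, a minimal Scala sketch of get_json_object on a string column; the column name and JSON path are illustrative:

    import org.apache.spark.sql.functions.{col, get_json_object}

    // payload holds embedded JSON text, e.g. {"user": {"name": "ann"}}
    val parsed = df.select(
      get_json_object(col("payload"), "$.user.name").alias("user_name"))
    parsed.show()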

Why does Spark 2.0 change the number of partitions when reading a parquet file?

2016-12-22 Thread Kristina Rogale Plazonic
Hi, I write a randomly generated 30,000-row dataframe to parquet. I verify that it has 200 partitions (both in Spark and by inspecting the parquet file in HDFS). When I read it back in, it has 23 partitions?! Is there some optimization going on? (This doesn't happen in Spark 1.5.) How can I force

Re: parsing embedded json in spark

2016-12-22 Thread Irving Duran
Is it an option to parse that field prior to creating the dataframe? If so, that's what I would do. In terms of only your master node working, you would have to share more about your setup: are you using Spark standalone, YARN, or Mesos? Thank You, Irving Duran

RE: spark-shell fails to redefine values

2016-12-22 Thread Spencer, Alex (Santander)
Can you ask for eee in between each reassign? The memory address at the end, 1ec5bf62, != 2c6beb3e or 66cb003 – so what’s going on there? From: Yang [mailto:tedd...@gmail.com] Sent: 21 December 2016 18:37 To: user Subject: spark-shell fails to redefine values summary:

RE: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-22 Thread Alexander Kapustin
Hello, For Spark 1.5 (and 1.6) we use the Oracle JDBC driver via spark-submit.sh --jars /path/to/ojdbc6.jar. We also pass additional Oracle driver properties via --driver-java-options.
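
For completeness, the submit-time variant described here looks roughly like this; the paths, class name, and the specific driver property are illustrative assumptions:

    spark-submit \
      --jars /path/to/ojdbc6.jar \
      --driver-class-path /path/to/ojdbc6.jar \
      --driver-java-options "-Doracle.jdbc.timezoneAsRegion=false" \
      --class com.example.OracleJob \
      my-app.jar

--jars ships the driver to the executors, while --driver-class-path puts it on the driver's own classpath; both are typically needed for JDBC reads.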