Re: Is there a way to merge parquet small files?

2016-05-19 Thread Gavin Yue
For log files I would suggest saving as gzipped text files first. After aggregation, convert them into parquet by merging a few files. > On May 19, 2016, at 22:32, Deng Ching-Mallete wrote: > > IMO, it might be better to merge or compact the parquet files instead of >
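
A minimal sketch of the suggested flow, assuming Spark 1.6-era APIs, a tab-delimited log format, and hypothetical paths (adjust the parsing to the real log schema):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Spark reads .gz text transparently; the glob may cover many small gzipped files.
val logs = sc.textFile("hdfs:///logs/2016-05-19/*.gz")

// Parse each line into columns (here a simple tab split into timestamp and message).
val parsed = logs.map { line =>
  val Array(ts, msg) = line.split("\t", 2)
  (ts, msg)
}.toDF("ts", "message")

// coalesce merges the output into a handful of larger parquet files instead of many tiny ones.
parsed.coalesce(8).write.parquet("hdfs:///logs/parquet/2016-05-19")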

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Deng Ching-Mallete
IMO, it might be better to merge or compact the parquet files instead of keeping lots of small files in HDFS. Please refer to [1] for more info. We also encountered the same issue with slow queries, and it was indeed caused by the many small parquet files. In our case, we were processing
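
A rough compaction sketch, assuming Spark 1.6-era APIs, an existing sqlContext, and hypothetical paths: read the directory of small parquet files, coalesce to a small number of partitions, and write a compacted copy.

val small = sqlContext.read.parquet("hdfs:///logs/parquet/2016-05-19")

// 16 output partitions => roughly 16 larger parquet files instead of thousands of tiny ones.
small.coalesce(16)
  .write
  .mode("overwrite")
  .parquet("hdfs:///logs/parquet_compacted/2016-05-19")

// Repoint the table (or swap directories) once the compacted copy is verified.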

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Alexander Pivovarov
Try the Hadoop setting mapreduce.input.fileinputformat.split.maxsize to control RDD partition size. I heard that a DataFrame can read several files in 1 task. On Thu, May 19, 2016 at 8:50 PM, 王晓龙/0515 wrote: > I’m using a spark streaming program to store log message into
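
A minimal sketch of applying that setting, assuming it is set on the driver before the input is read (path and size are placeholders). It caps the size of each input split; whether small files actually get combined depends on the underlying InputFormat.

// 256 MB maximum split size.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (256 * 1024 * 1024).toString)

val rdd = sc.textFile("hdfs:///logs/2016-05-19")
println(s"partitions: ${rdd.partitions.length}")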

Is there a way to merge parquet small files?

2016-05-19 Thread 王晓龙/01111515
I’m using a spark streaming program to store log messages into parquet files every 10 mins. Now, when I query the parquet, it usually takes hundreds of thousands of stages to compute a single count. I looked into the parquet files’ path and found a great number of small files. Do the small files

Query about how to estimate cpu usage for spark

2016-05-19 Thread Wang Jiaye
For an MR job, there is a job counter that provides CPU ms information, but I cannot find a similar metric in Spark, which would be quite useful. Does anyone know about this?
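
If no direct CPU counter is available (as the question suggests), one rough proxy is to sum executorRunTime (task run time in milliseconds) from task metrics via a SparkListener. The listener class below is an illustrative sketch, not a standard Spark utility.

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class RunTimeListener extends SparkListener {
  val totalRunTimeMs = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Task metrics can be missing for failed tasks, hence the null check.
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      totalRunTimeMs.addAndGet(metrics.executorRunTime)
    }
  }
}

// Register before running the job, then read the counter afterwards.
val listener = new RunTimeListener
sc.addSparkListener(listener)
// ... run the job ...
println(s"total task run time: ${listener.totalRunTimeMs.get()} ms")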

Re: Does spark support Apache Arrow

2016-05-19 Thread Hyukjin Kwon
FYI, there is a JIRA for this, https://issues.apache.org/jira/browse/SPARK-13534 I hope this link is helpful. Thanks! 2016-05-20 11:18 GMT+09:00 Sun Rui : > 1. I don’t think so > 2. Arrow is for in-memory columnar execution. While cache is for in-memory > columnar storage

Re: Does spark support Apache Arrow

2016-05-19 Thread Sun Rui
1. I don’t think so 2. Arrow is for in-memory columnar execution, while cache is for in-memory columnar storage > On May 20, 2016, at 10:16, Todd wrote: > > From the official site http://arrow.apache.org/, Apache Arrow is used for > Columnar In-Memory storage. I have two quick

Does spark support Apache Arrow

2016-05-19 Thread Todd
From the official site http://arrow.apache.org/, Apache Arrow is used for columnar in-memory storage. I have two quick questions: 1. Does Spark support Apache Arrow? 2. When a dataframe is cached in memory, the data is saved in a columnar in-memory style. What is the relationship between this

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
Sure. You can try pySpark, which is the Python API of Spark. > On May 20, 2016, at 06:20, ayan guha wrote: > > Hi > > Thanks for the input. Can it be possible to write it in python? I think I can > use FileUti.untar from hdfs jar. But can I do it from python? > > On 19

Re: dataframe stat corr for multiple columns

2016-05-19 Thread Sun Rui
There is an existing JIRA issue for it: https://issues.apache.org/jira/browse/SPARK-11057 There is also a PR. Maybe we should help review and merge it with higher priority. > On May 20, 2016, at 00:09, Xiangrui Meng

Re: Starting executor without a master

2016-05-19 Thread Marcelo Vanzin
On Thu, May 19, 2016 at 6:06 PM, Mathieu Longtin wrote: > I'm looking to bypass the master entirely. I manage the workers outside of > Spark. So I want to start the driver, then start workers that connect > directly to the driver. It should be possible to do that if you

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
I'm looking to bypass the master entirely. I manage the workers outside of Spark. So I want to start the driver, then start workers that connect directly to the driver. Anyway, it looks like I will have to live with our current solution for a while. On Thu, May 19, 2016 at 8:32 PM Marcelo Vanzin

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Okay: *host=my.local.server* *port=someport* This is the spark-submit command, which runs on my local server: *$SPARK_HOME/bin/spark-submit --master spark://$host:$port --executor-memory 4g python-script.py with args* If I want 200 worker cores, I tell the cluster scheduler to run this command

Re: Starting executor without a master

2016-05-19 Thread Marcelo Vanzin
Hi Mathieu, There's nothing like that in Spark currently. For that, you'd need a new cluster manager implementation that knows how to start executors in those remote machines (e.g. by running ssh or something). In the current master there's an interface you can implement to try that if you

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-19 Thread Mail.com
Hi Muthu, Do you have Kerberos enabled? Thanks, Pradeep > On May 19, 2016, at 12:17 AM, Ramaswamy, Muthuraman > wrote: > > I am using Spark 1.6.1 and Kafka 0.9+ It works for both receiver and > receiver-less mode. > > One thing I noticed when you specify
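
The value-decoder wiring under discussion looks roughly like the sketch below, assuming Spark 1.6's spark-streaming-kafka direct API and Confluent's KafkaAvroDecoder as the value decoder; broker, schema-registry address, and topic name are placeholders.

import io.confluent.kafka.serializers.KafkaAvroDecoder
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "schema.registry.url"  -> "http://schema-registry:8081")
val topics = Set("events")

// Keys decoded as strings, values decoded to Avro records via the schema registry.
val stream = KafkaUtils.createDirectStream[String, Object, StringDecoder, KafkaAvroDecoder](
  ssc, kafkaParams, topics)

stream.map { case (_, value) => value.toString }.print()

ssc.start()
ssc.awaitTermination()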

Re: Filter out the elements from xml file in Spark

2016-05-19 Thread Mail.com
Hi Yogesh, Can you try a map operation with whatever parser you are using and get what you need? You could also look at the spark-xml package. Thanks, Pradeep > On May 19, 2016, at 4:39 AM, Yogesh Vyas wrote: > > Hi, > I had xml files which I am reading through textFileStream,
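
A short sketch of the spark-xml route, assuming the com.databricks:spark-xml package is on the classpath, an existing sqlContext, and a hypothetical <record> row tag with a status field:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("hdfs:///input/events.xml")

// Element filtering then becomes a plain DataFrame filter.
df.filter(df("status") === "ERROR").show()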

Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
In normal operation we tell Spark which nodes the worker processes can run on by adding the node names to conf/slaves. It is not very clear to me whether in your case all the jobs run locally with, say, 100 executor cores like below: ${SPARK_HOME}/bin/spark-submit \ --master local[*] \

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Mostly, the resource management is not up to the Spark master. We routinely start 100 executor-cores for a 5 minute job, and they just quit when they are done. Then those processor cores can do something else entirely; they are not reserved for Spark at all. On Thu, May 19, 2016 at 4:55 PM Mich

Re: Tar File: On Spark

2016-05-19 Thread Ted Yu
See http://memect.co/call-java-from-python-so You can also use Py4J On Thu, May 19, 2016 at 3:20 PM, ayan guha wrote: > Hi > > Thanks for the input. Can it be possible to write it in python? I think I > can use FileUti.untar from hdfs jar. But can I do it from python? > On

Re: How to perform reduce operation in the same order as partition indexes

2016-05-19 Thread ayan guha
You can add the index from mapPartitionsWithIndex to the output and order based on that in the merge step. On 19 May 2016 13:22, "Pulasthi Supun Wickramasinghe" wrote: > Hi Devs/All, > > I am pretty new to Spark. I have a program which does some map reduce > operations with
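
A small sketch of that suggestion: emit one (partitionIndex, result) pair per partition, then sort by index on the driver so the merge runs in partition order (the per-partition sum is only an example).

val rdd = sc.parallelize(1 to 100, 4)

val perPartition = rdd.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, iter.sum))          // one (index, result) pair per partition
}.collect()

val mergedInOrder = perPartition
  .sortBy(_._1)                      // enforce partition order 0, 1, 2, ...
  .map(_._2)
  .reduce(_ + _)

println(mergedInOrder)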

Re: Tar File: On Spark

2016-05-19 Thread ayan guha
Hi, thanks for the input. Would it be possible to write it in Python? I think I can use FileUtil.unTar from the hdfs jar, but can I do it from Python? On 19 May 2016 16:57, "Sun Rui" wrote: > 1. create a temp dir on HDFS, say “/tmp” > 2. write a script to create in the temp dir one

Re: Couldn't find leader offsets

2016-05-19 Thread Colin Hall
Hey Cody, thanks for the response. I looked at connection as a possibility based on your advice and after a lot of digging found a couple of things mentioned on SO and kafka lists about name resolution causing issues. I created an entry in /etc/hosts on the spark host to resolve the broker to

Splitting RDD by partition

2016-05-19 Thread shlomi
Hey Sparkers, I have a workflow where I have to ensure certain keys are always in the same RDD partition (its a mandatory algorithmic invariant). I can easily achieve this by having a custom partitioner. This results in a single RDD that requires further computations. However, currently there
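
A minimal custom Partitioner sketch for that invariant: every key in the same group lands in the same partition. The grouping rule here (the prefix before ':') is a made-up example.

import org.apache.spark.Partitioner

class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
  private def group(key: Any): String = key.toString.split(":")(0)

  override def getPartition(key: Any): Int = {
    val h = group(key).hashCode % numPartitions
    if (h < 0) h + numPartitions else h
  }
}

// Keys "a:1" and "a:2" are guaranteed to share a partition.
val pairs = sc.parallelize(Seq("a:1" -> 1, "a:2" -> 2, "b:9" -> 3))
val partitioned = pairs.partitionBy(new GroupPartitioner(8))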

Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
Then in theory every user can fire multiple spark-submit jobs. Do you cap it with settings in $SPARK_HOME/conf/spark-defaults.conf? I guess in reality every user submits one job only. This is an interesting model for two reasons: - It uses parallel processing across all the nodes or

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Driver memory is the default. Executor memory depends on the job; the caller decides how much memory to use. We don't specify --num-executors as we want all cores assigned to the local master, since they were started by the current user. No local executor. --master=spark://localhost:someport. 1 core per

Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
Thanks Mathieu. So it would be interesting to see what resources are allocated in your case, especially the num-executors and executor-cores. I gather every node has enough memory and cores. ${SPARK_HOME}/bin/spark-submit \ --master local[2] \ --driver-memory 4g \

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
The driver (the process started by spark-submit) runs locally. The executors run on any of thousands of servers. So far, I haven't tried more than 500 executors. Right now, I run a master on the same server as the driver. On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh

Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
OK, so you are using some form of NFS-mounted file system shared among the nodes, and basically you start the processes through spark-submit. In standalone mode, a simple cluster manager included with Spark does the management of resources, so it is not clear to me what you are referring to as

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
No master and no node manager, just the processes that do actual work. We use the "stand alone" version because we have a shared file system and a way of allocating computing resources already (Univa Grid Engine). If an executor were to die, we have other ways of restarting it; we don't need the

Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
Hi Mathieu, What does this approach provide that the norm lacks? So basically each node has its own master in this model. Are these supposed to be individual standalone servers? Thanks Dr Mich Talebzadeh LinkedIn *

Re: Hive 2 database Entity-Relationship Diagram

2016-05-19 Thread Mich Talebzadeh
Thanks. This is the list of tables and views. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On

Re: Latency experiment without losing executors

2016-05-19 Thread Ted Yu
16/05/19 15:51:39 WARN CoarseGrainedExecutorBackend: An unknown (ip-10-171-80-97.ec2.internal:44765) driver disconnected. 16/05/19 15:51:42 ERROR TransportClient: Failed to send RPC 5466711974642652953 to ip-10-171-80-97.ec2.internal/10.171.80.97:44765: java.nio.channels.ClosedChannelException

Hive 2 database Entity-Relationship Diagram

2016-05-19 Thread Mich Talebzadeh
Hi All, I use Hive 2 with metastore created for Oracle Database with hive-txn-schema-2.0.0.oracle.sql. It already includes concurrency stuff added into metastore The RDBMS is Oracle Database 12c Enterprise Edition Release 12.1.0.2.0. I created an Entity-Relationship (ER) diagram from the

Re: Latency experiment without losing executors

2016-05-19 Thread Geet Kumar
Ah, it seems the code did not show up in the email. Here is a link to the original post: http://apache-spark-user-list.1001560.n3.nabble.com/Latency-experiment-without-losing-executors-td26981.html Also, attached are the executor logs. spark-logging.log

Re: dataframe stat corr for multiple columns

2016-05-19 Thread Xiangrui Meng
This is nice to have. Please create a JIRA for it. Right now, you can merge all columns into a vector column using RFormula or VectorAssembler, then convert it into an RDD and call corr from MLlib. On Tue, May 17, 2016, 7:09 AM Ankur Jain wrote: > Hello Team, > > > > In my
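
A sketch of that workaround, assuming Spark 1.6-era APIs (where VectorAssembler emits mllib Vectors), an existing DataFrame df, and hypothetical numeric column names:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics

val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("features")

// Convert the assembled column to RDD[Vector] and compute the full correlation matrix in MLlib.
val vectors = assembler.transform(df)
  .select("features")
  .rdd
  .map(_.getAs[Vector]("features"))

println(Statistics.corr(vectors, "pearson"))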

Re: Couldn't find leader offsets

2016-05-19 Thread Cody Koeninger
Looks like a networking issue to me. Make sure you can connect to the broker on the specified host and port from the spark driver (and the executors too, for that matter) On Wed, May 18, 2016 at 4:04 PM, samsayiam wrote: > I have seen questions posted about this on SO and

Re: Does Structured Streaming support Kafka as data source?

2016-05-19 Thread Cody Koeninger
I went ahead and created https://issues.apache.org/jira/browse/SPARK-15406 to track this. On Wed, May 18, 2016 at 9:55 PM, Todd wrote: > Hi, > I briefly looked at the spark code, and it looks like structured streaming doesn't > support kafka as a data source yet?

Re: HBase / Spark Kerberos problem

2016-05-19 Thread Arun Natva
Some of the Hadoop services cannot make use of the ticket obtained by loginUserFromKeytab. I was able to get past it using a GSS JAAS configuration where you can pass either the keytab file or the ticket cache to the spark executors that access HBase. Sent from my iPhone > On May 19, 2016, at 4:51 AM, Ellis,

Re: Spark Streaming Application run on yarn-cluster mode

2016-05-19 Thread Ted Yu
Yes. See https://spark.apache.org/docs/latest/streaming-programming-guide.html On Thu, May 19, 2016 at 7:24 AM, wrote: > Hi Friends, > > Will a Spark Streaming job run in yarn-cluster mode? > > Thanks > Raj > > > Sent from Yahoo Mail. Get the app

Spark Streaming Application run on yarn-cluster mode

2016-05-19 Thread spark.raj
Hi Friends, Will a Spark Streaming job run in yarn-cluster mode? Thanks, Raj Sent from Yahoo Mail. Get the app

RE: HBase / Spark Kerberos problem

2016-05-19 Thread philipp.meyerhoefer
Thanks Tom & John! Modifying spark-env.sh did the trick - my last line in the file is now: export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt"):`hbase classpath`:/etc/hbase/conf:/etc/hbase/conf/hbase-site.xml Now o.a.s.d.y.Client logs “Added HBase security token to credentials” and

RE: HBase / Spark Kerberos problem

2016-05-19 Thread Ellis, Tom (Financial Markets IT)
Yeah, we ran into this issue. The key part is to have the hbase jars and hbase-site.xml config on the classpath of the spark submitter. We did it slightly differently from Y Bodnar, in that we set the required jars and config on the env var SPARK_DIST_CLASSPATH in our spark env file (rather than

Filter out the elements from xml file in Spark

2016-05-19 Thread Yogesh Vyas
Hi, I have xml files which I am reading through textFileStream, and then filtering out the required elements using traditional conditions and loops. I would like to know if there are any specific packages or functions provided in Spark to perform operations on an RDD of XML? Regards, Yogesh

Re: Latency experiment without losing executors

2016-05-19 Thread Ted Yu
I didn't see the code snippet. Were you using picture(s)? Please pastebin the code. It would be better if you pastebin the executor log for the killed executor. Thanks On Wed, May 18, 2016 at 9:41 PM, gkumar7 wrote: > I would like to test the latency (tasks/s) perceived in

Re: HBase / Spark Kerberos problem

2016-05-19 Thread John Trengrove
Have you had a look at this issue? https://issues.apache.org/jira/browse/SPARK-12279 There is a comment by Y Bodnar on how they successfully got Kerberos and HBase working. 2016-05-18 18:13 GMT+10:00 : > Hi all, > > I have been puzzling over a Kerberos

HBase / Spark Kerberos problem

2016-05-19 Thread philipp.meyerhoefer
Hi all, I have been puzzling over a Kerberos problem for a while now and wondered if anyone can help. For spark-submit, I specify --master yarn-client --keytab x --principal y, which creates my SparkContext fine. Connections to Zookeeper Quorum to find the HBase master work well too. But when

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
1. create a temp dir on HDFS, say “/tmp” 2. write a script to create in the temp dir one file for each tar file. Each file has only one line: 3. Write a spark application. It is like: val rdd = sc.textFile () rdd.map { line => construct an untar command using the path information in
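
A sketch fleshing out that outline, with hypothetical paths: each line of the listing file holds one HDFS tar path; each task copies its tar locally, unpacks it, and pushes the contents to a per-tar output directory.

import scala.sys.process._

val rdd = sc.textFile("hdfs:///tmp/tar-list.txt")

val exitCodes = rdd.map { tarPath =>
  val name = tarPath.split("/").last.stripSuffix(".tar")
  val cmd = Seq("bash", "-c",
    s"hadoop fs -copyToLocal $tarPath . && " +
    s"mkdir -p $name && tar -xf ${name}.tar -C $name && " +
    s"hadoop fs -put -f $name hdfs:///output/$name")
  (tarPath, cmd.!)    // '!' runs the command and returns its exit code
}

exitCodes.collect().foreach { case (path, code) => println(s"$path -> exit $code") }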

Tar File: On Spark

2016-05-19 Thread ayan guha
Hi, I have a few tar files in HDFS in a single folder. Each tar has multiple files in it. tar1: - f1.txt - f2.txt tar2: - f1.txt - f2.txt (each tar file will have the exact same number of files, with the same names) I am trying to find a way (Spark or Pig) to extract them to their own

Any way to pass custom hadoop conf to through spark thrift server ?

2016-05-19 Thread Jeff Zhang
I want to pass a custom hadoop conf to the spark thrift server so that both the driver and executor side can get this conf. I also want this custom hadoop conf to be seen only by the job of the user who set it. Is this possible with the spark thrift server now? Thanks -- Best Regards Jeff Zhang