Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-02 Thread Neil Jonkers
Hi, Can you set the following parameters in your mapred-site.xml file please: mapred.output.direct.EmrFileSystem=true and mapred.output.direct.NativeS3FileSystem=true. You can also configure this at cluster launch time with the following Classification via the EMR console:
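For reference, a minimal sketch of setting the same properties from inside a Spark job instead of mapred-site.xml (a hedged alternative; the property names come from the message above, the bucket path is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object DirectOutputExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("direct-output-example"))

        // Direct committers skip the _temporary -> final rename step, which is
        // what makes saveAsTextFile slow on S3 (assumption: emr-4.0.0 honours these keys).
        sc.hadoopConfiguration.set("mapred.output.direct.EmrFileSystem", "true")
        sc.hadoopConfiguration.set("mapred.output.direct.NativeS3FileSystem", "true")

        sc.parallelize(1 to 1000).saveAsTextFile("s3://my-bucket/output")  // hypothetical bucket
        sc.stop()
      }
    }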

Re: Too many open files issue

2015-09-02 Thread Steve Loughran
On 31 Aug 2015, at 19:49, Sigurd Knippenberg wrote: I know I can adjust the max open files allowed by the OS but I'd rather fix the underlying issue. Bumping up the OS handle limits is step #1 of installing a Hadoop cluster

Re: Custom Partitioner

2015-09-02 Thread Jem Tucker
Alter the range partitioner so that it skews the partitioning and assigns more partitions to the more heavily weighted keys? To do this you will have to know the weighting before you start. On Wed, Sep 2, 2015 at 8:02 AM shahid ashraf wrote: > yes i can take as an example , but
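A minimal sketch of one way to do that, giving each known heavy key its own partition and hashing the rest (all names here are illustrative, not from the thread):

    import org.apache.spark.Partitioner

    // Dedicates one partition to each heavy key and spreads the remaining keys
    // over the rest; the heavy keys must be known before the job starts.
    class SkewAwarePartitioner(val numPartitions: Int, heavyKeys: Seq[String])
        extends Partitioner {
      require(heavyKeys.size < numPartitions, "need spare partitions for the remaining keys")
      private val heavyIndex = heavyKeys.zipWithIndex.toMap
      private val rest = numPartitions - heavyKeys.size

      override def getPartition(key: Any): Int = heavyIndex.get(key.toString) match {
        case Some(i) => i                                            // dedicated partition
        case None    => heavyKeys.size + (((key.hashCode % rest) + rest) % rest)
      }
    }

    // Usage (hypothetical): pairRdd.partitionBy(new SkewAwarePartitioner(10, Seq("A", "B")))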

Save dataframe into hbase

2015-09-02 Thread Hafiz Mujadid
Hi What is the efficient way to save a DataFrame into HBase? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Save-dataframe-into-hbase-tp24552.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

OOM in spark driver

2015-09-02 Thread ankit tyagi
Hi All, I am using spark-sql 1.3.1 with hadoop 2.4.0. I am running SQL queries against Parquet files and wanted to save the result to S3, but it looks like the problem in https://issues.apache.org/jira/browse/SPARK-2984 still occurs while saving data to S3. Hence now I am saving the result on HDFS and with

Re: Spark + Druid

2015-09-02 Thread Paolo Platter
Fantastic!!! I will look into that and I hope to contribute. Paolo Sent from my Windows Phone From: Harish Butani Sent: 02/09/2015 06:04 To: user Subject: Spark + Druid Hi, I am working on the

Re: Custom Partitioner

2015-09-02 Thread shahid ashraf
Yes, I can take that as an example, but my actual use case is that I need to resolve a data skew. When I do grouping based on key (A-Z) the resulting partitions are skewed like (partition no., no_of_keys, total elements with given key) << partition: [(0, 0, 0), (1, 15, 17395), (2, 0, 0), (3, 0, 0), (4,

Simple join of two Spark DataFrame failing with “org.apache.spark.sql.AnalysisException: Cannot resolve column name”

2015-09-02 Thread steve.felsheim
Running into an issue trying to perform a simple join of two DataFrames created from two different parquet files on HDFS. [main] INFO org.apache.spark.SparkContext - Running *Spark version 1.4.1* Using HDFS from Hadoop 2.7.0 Here is a sample to illustrate. public void

Re: Small File to HDFS

2015-09-02 Thread Tao Lu
You may consider storing it in one big HDFS file, and keep appending new messages to it. For instance, one message -> zip it -> append it to the HDFS file as one line. On Wed, Sep 2, 2015 at 12:43 PM, wrote: > Hi, > I already store them in MongoDB in parallel for operational
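A minimal sketch of that append pattern with the Hadoop FileSystem API (assumes HDFS append is enabled on the cluster; the path is made up and the zip step is omitted for brevity):

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsAppendSketch {
      def appendLine(message: String): Unit = {
        val fs = FileSystem.get(new Configuration())
        val path = new Path("/data/events/all-events.txt")  // hypothetical file

        // Create the file on first use, then keep appending one message per line.
        val out = if (fs.exists(path)) fs.append(path) else fs.create(path)
        try out.write((message + "\n").getBytes(StandardCharsets.UTF_8))
        finally out.close()
      }
    }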

Small File to HDFS

2015-09-02 Thread nibiau
Hello, I'm currently using Spark Streaming to collect small messages (events), size being <50 KB, volume is high (several million per day), and I have to store those messages in HDFS. I understood that storing small files can be problematic in HDFS, how can I manage it? Tks Nicolas

Re: Small File to HDFS

2015-09-02 Thread nibiau
Hi, I already store them in MongoDB in parallel for operational access and don't want to add another database in the loop. Is it the only solution? Tks Nicolas - Original Mail - From: "Ted Yu" To: nib...@free.fr Cc: "user" Sent: Wednesday 2

Re: Small File to HDFS

2015-09-02 Thread Ted Yu
Instead of storing those messages in HDFS, have you considered storing them in a key-value store (e.g. HBase)? Cheers On Wed, Sep 2, 2015 at 9:07 AM, wrote: > Hello, > I'am currently using Spark Streaming to collect small messages (events) , > size being <50 KB , volume is high

Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread alexis GILLAIN
Just made some tests on my laptop. Deletion of the files is not immediate, but a System.gc() call does the job on the shuffle files of a checkpointed RDD. It should solve your problem (`sc._jvm.System.gc()` in Python, as pointed out in the Databricks link in my previous message). 2015-09-02 20:55
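A minimal Scala sketch of the pattern being described (in PySpark the last line becomes sc._jvm.System.gc(), as noted above; the checkpoint directory is made up):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical directory

    var rdd = sc.parallelize(1 to 1000000)
    rdd = rdd.repartition(100)
    rdd.checkpoint()   // truncate the lineage at this RDD
    rdd.count()        // action that materializes the checkpoint

    // Old shuffle files are only removed once their RDD references are
    // garbage-collected, so an explicit GC in the driver nudges the cleaner.
    System.gc()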

Inferring JSON schema from a JSON string in a dataframe column

2015-09-02 Thread mstang
Hi, On Spark 1.3, using Scala 10.4 Given an existing dataframe with two columns (col A = JSON string, col B = int), is it possible to create a new dataframe from col A and automatically generate the schema (similar to when JSON is loaded/read from a file)? Alternately... given an existing
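One possible sketch, assuming Spark 1.3's jsonRDD API and a DataFrame df whose column "A" holds the JSON strings (column name is from the question; the approach itself is a suggestion, not from the thread):

    // Pull the JSON strings out of column A and let Spark infer a schema for
    // them, the same way it does when JSON is read from a file.
    val jsonStrings = df.select("A").rdd.map(_.getString(0))
    val inferred = sqlContext.jsonRDD(jsonStrings)  // read.json(rdd) in later versions
    inferred.printSchema()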

Spark MLlib Decision Tree Node Accuracy

2015-09-02 Thread derechan
Hi, I'm working on analyzing a decision tree in Apache Spark 1.2. I've built the tree and I can traverse it to get the splits and impurity. I can't seem to find a way to get which records fall into each node. My goal is to get a count of how accurate each leaf is at identifying the classification.

Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread shahid ashraf
Hi guys, it seems my problem is related to this question as well. I am running standalone Spark 1.4.1 on a local machine. I have 10 partitions with data skew on partitions 1 and 4. partition: [(0, 0), (1, 15593259), (2, 0), (3, 0), (4, 20695601), (5, 0), (6, 0), (7, 0), (8, 0), (9, 0)] and

Kafka Direct Stream join without data shuffle

2015-09-02 Thread Chen Song
I have a stream obtained from Kafka with the direct approach, say inputStream, and I need to 1. Create another DStream derivedStream with map or mapPartitions (with some data enrichment from a reference table) on inputStream 2. Join derivedStream with inputStream In my use case, I don't need to shuffle data.

Re: Kafka Direct Stream join without data shuffle

2015-09-02 Thread Cody Koeninger
No, there isn't a partitioner for KafkaRDD (KafkaRDD may not even be a pair rdd, for instance). It sounds to me like if it's a self-join, you should be able to do it in a single mapPartition operation. On Wed, Sep 2, 2015 at 3:06 PM, Chen Song wrote: > I have a stream
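A minimal sketch of that single-pass approach (the record type and enrich function are stand-ins for the reference-table lookup, not from the thread; inputStream is assumed to be a DStream of that record type):

    case class Event(id: String, value: String)         // hypothetical record type
    def enrich(e: Event): String = e.value.toUpperCase   // stand-in for the lookup

    // Rather than building derivedStream and joining it back (a shuffle),
    // compute the derived value next to the original record in one
    // partition-local pass over the Kafka direct stream.
    val joined = inputStream.mapPartitions { iter =>
      iter.map(e => (e, enrich(e)))
    }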

Parquet partitioning for unique identifier

2015-09-02 Thread Kohki Nishio
Hello experts, I have a huge JSON file (> 40G) and am trying to use Parquet as a file format. Each entry has a unique identifier but other than that, it doesn't have a 'well balanced value' column to partition on. Right now it just throws OOM and I couldn't figure out what to do with it. It would be

Problem while loading saved data

2015-09-02 Thread Amila De Silva
Hi All, I have a two-node Spark cluster, to which I'm connecting using an IPython notebook. To see how data saving/loading works, I simply created a dataframe from people.json using the code below; df = sqlContext.read.json("examples/src/main/resources/people.json") Then called the following to

`sbt core/test` hangs on LogUrlsStandaloneSuite?

2015-09-02 Thread Jacek Laskowski
Hi, Am I doing something off base to execute tests for core module using sbt as follows? [spark]> core/test ... [info] KryoSerializerAutoResetDisabledSuite: [info] - sort-shuffle with bypassMergeSort (SPARK-7873) (53 milliseconds) [info] - calling deserialize() after deserializeStream() (2

Understanding Batch Processing Time

2015-09-02 Thread Snehal Nagmote
Hi All, I have a Spark job where I read data from Kafka at a 5-second interval and query Cassandra based on the Kafka data using the Spark Cassandra Connector. I am using Spark 1.4. Often the batch gets stuck in processing after job id 352. Spark takes a long time to spawn job 353, where it reads from

Re: cached data between jobs

2015-09-02 Thread Eric Walker
Hi Jeff, I think I see what you're saying. I was thinking more of a whole Spark job, where `spark-submit` is run once to completion and then started up again, rather than a "job" as seen in the Spark UI. I take it there is no implicit caching of results between `spark-submit` runs. (In the

Unable to run Group BY on Large File

2015-09-02 Thread SAHA, DEBOBROTA
Hi, I am getting the below error while trying to select data using Spark SQL from an RDD table. java.lang.OutOfMemoryError: GC overhead limit exceeded "Spark Context Cleaner" java.lang.InterruptedException The file or table size is around 113 GB and I am running Spark 1.4 on a

spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Axel Dahl
In my spark-defaults.conf I have: spark.files file1.zip, file2.py spark.master spark://master.domain.com:7077 If I execute bin/pyspark, I can see it adding the files correctly. However, if I execute bin/spark-submit test.py, where test.py relies on file1.zip, I get

Re: Unable to run Group BY on Large File

2015-09-02 Thread Silvio Fiorito
Unfortunately, groupBy is not the most efficient operation. What is it you’re trying to do? It may be possible with one of the other *byKey transformations. From: "SAHA, DEBOBROTA" Date: Wednesday, September 2, 2015 at 7:46 PM To: "'user@spark.apache.org'" Subject:
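For example, if the goal is a per-key aggregate, a reduceByKey sketch like this avoids pulling whole groups into memory (the record type and fields are made up):

    case class Txn(account: String, amount: Double)  // hypothetical record type
    val txns  = sc.parallelize(Seq(Txn("a", 1.0), Txn("a", 2.0), Txn("b", 5.0)))
    val pairs = txns.map(t => (t.account, t.amount))

    // groupByKey would collect every value for a key into one task's memory:
    //   pairs.groupByKey().mapValues(_.sum)
    // reduceByKey combines map-side before the shuffle, so far less data
    // ends up in any single partition.
    val totals = pairs.reduceByKey(_ + _)
    totals.collect().foreach(println)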

Re: Problem while loading saved data

2015-09-02 Thread Guru Medasani
Hi Amila, Error says that the ‘people.parquet’ file does not exist. Can you manually check to see if that file exists? > Py4JJavaError: An error occurred while calling o53840.parquet. > : java.lang.AssertionError: assertion failed: No schema defined, and no > Parquet data file or summary file

FlatMap Explanation

2015-09-02 Thread Ashish Soni
Hi, Can someone please explain the output of flatMap on the RDD data below? {1, 2, 3, 3} rdd.flatMap(x => x.to(3)) gives the output below {1, 2, 3, 2, 3, 3, 3} I am not able to understand how the output came out as above. Thanks,

Alter table fails to find table

2015-09-02 Thread Tim Smith
Spark 1.3.0 (CDH 5.4.4) scala> sqlContext.sql("SHOW TABLES").collect res18: Array[org.apache.spark.sql.Row] = Array([allactivitydata,true], [sample_07,false], [sample_08,false]) sqlContext.sql("SELECT COUNT(*) from allactivitydata").collect res19: Array[org.apache.spark.sql.Row] =

Re: Hbase Lookup

2015-09-02 Thread Jörn Franke
You may check if it makes sense to write a coprocessor doing an upsert for you, if it does not exist already. Maybe Phoenix for HBase supports this already. Another alternative, if the records do not have a unique id, is to put them into a text index engine, such as Solr or Elasticsearch, which

Re: Parquet partitioning for unique identifier

2015-09-02 Thread Raghavendra Pandey
Did you specify a partitioning column while saving the data? On Sep 3, 2015 5:41 AM, "Kohki Nishio" wrote: > Hello experts, > > I have a huge json file (> 40G) and trying to use Parquet as a file > format. Each entry has a unique identifier but other than that, it doesn't > have
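A minimal sketch of what that could look like in Spark 1.4, deriving a coarse partition column from the unique id so the write doesn't create one directory per value (df, the "id" column and the output path are assumptions):

    // Bucket by the first two characters of the id to keep the number of
    // partition directories manageable.
    val withBucket = df.withColumn("id_prefix", df("id").substr(1, 2))

    withBucket.write
      .partitionBy("id_prefix")
      .parquet("hdfs:///data/events_parquet")  // hypothetical output path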

Re: Unable to run Group BY on Large File

2015-09-02 Thread Raghavendra Pandey
You can increase the number of partitions and try... On Sep 3, 2015 5:33 AM, "Silvio Fiorito" wrote: > Unfortunately, groupBy is not the most efficient operation. What is it > you’re trying to do? It may be possible with one of the other *byKey > transformations. > >

Re: Hbase Lookup

2015-09-02 Thread ayan guha
Thanks for your info. I am planning to implement a Pig UDF to do record lookups. Kindly let me know if this is a good idea. Best Ayan On Thu, Sep 3, 2015 at 2:55 PM, Jörn Franke wrote: > > You may check if it makes sense to write a coprocessor doing an upsert for > you,

Re: spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Davies Liu
This should be a bug, could you create a JIRA for it? On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl wrote: > in my spark-defaults.conf I have: > spark.files file1.zip, file2.py > spark.master spark://master.domain.com:7077 > > If I execute: >

Re: large number of import-related function calls in PySpark profile

2015-09-02 Thread Davies Liu
Could you have a short script to reproduce this? On Wed, Sep 2, 2015 at 2:10 PM, Priedhorsky, Reid wrote: > Hello, > > I have a PySpark computation that relies on Pandas and NumPy. Currently, my > inner loop iterates 2,000 times. I’m seeing the following show up in my >

Getting an error when trying to read a GZIPPED file

2015-09-02 Thread Spark Enthusiast
Folks, I have an input file which is gzipped. I use sc.textFile("foo.gz") and I see the following problem. Can someone help me fix this? 15/09/03 10:05:32 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/09/03 10:05:32 INFO CodecPool: Got brand-new

Re: FlatMap Explanation

2015-09-02 Thread Raghavendra Pandey
flatMap is just like map but it flattens out the Seq output of the closure... In your case, you call the "to" function, which returns a list... a.to(b) returns list(a,...,b). So rdd.flatMap(x => x.to(3)) will take each element and return the range up to 3. On Sep 3, 2015 7:36 AM, "Ashish Soni"
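A small worked version of that example in the Scala shell:

    val rdd = sc.parallelize(Seq(1, 2, 3, 3))

    // map keeps one output element per input, so the ranges stay nested:
    rdd.map(x => x.to(3)).collect()      // Array(Range(1, 2, 3), Range(2, 3), Range(3), Range(3))

    // flatMap flattens those ranges into a single collection:
    rdd.flatMap(x => x.to(3)).collect()  // Array(1, 2, 3, 2, 3, 3, 3)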

Re: Problem while loading saved data

2015-09-02 Thread Amila De Silva
Hi Guru, Thanks for the reply. Yes, I checked if the file exists. But instead of a single file what I found was a directory having the following structure. people.parquet └── _temporary └── 0 ├── task_201509030057_4699_m_00 │ └──

Re: Resource allocation in SPARK streaming

2015-09-02 Thread Akhil Das
Well, in Spark you can get the information that you need from the driver UI running on port 4040: click on the active job, then click on the stages, and inside the stages you will find the tasks and the machine address on which each task is being executed. You can also check the CPU load on that

Re: spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Axel Dahl
So a bit more investigation shows that: if I have configured spark-defaults.conf with "spark.files library.py", then if I call "spark-submit.py -v test.py" I see that my "spark.files" default option has been replaced with "spark.files test.py"; basically spark-submit is

Re: Understanding Batch Processing Time

2015-09-02 Thread Tathagata Das
Can you jstack into the driver and see what the process is doing after job 352? Also to confirm, the system is stuck after job 352 finishes, and before job 353 starts (shows up in the UI), right? TD On Wed, Sep 2, 2015 at 12:55 PM, Snehal Nagmote wrote: > Hi All, > > I

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

2015-09-02 Thread unk1102
Hi, I have a Spark DataFrame which I want to save as a Hive table with partitions. I tried the following two statements but they don't work; I don't see any ORC files in the HDFS directory, it's empty. I can see that baseTable is there in the Hive console but obviously it's empty because there are no files inside HDFS. The

Re: Save dataframe into hbase

2015-09-02 Thread Ted Yu
The following JIRA is close to integration: HBASE-14181 Add Spark DataFrame DataSource to HBase-Spark Module, after which HBase would provide better support for DataFrame interaction. On Wed, Sep 2, 2015 at 1:21 PM, ALEX K wrote: > you can use Phoenix-Spark plugin: >

wild cards in spark sql

2015-09-02 Thread Hafiz Mujadid
Hi, does Spark SQL support wildcards to filter data in SQL queries, just like we can filter data in SQL queries in an RDBMS with different wildcards like % and ? etc. In other words, how can I write the following query in Spark SQL: select * from employee where ename like 'a%d' Thanks -- View this

ERROR WHILE REPARTITION

2015-09-02 Thread shahid ashraf
Hi guys, I am running standalone Spark 1.4.1 on a local machine. I have 10 partitions with data skew on partitions 1 and 4. partition: [(0, 0), (1, 15593259), (2, 0), (3, 0), (4, 20695601), (5, 0), (6, 0), (7, 0), (8, 0), (9, 0)] and elements: >> Now I try rdd.repartition(10) and am getting

Re: spark 1.5 sort slow

2015-09-02 Thread Michael Armbrust
Can you include the output of `explain()` for each of the runs? On Tue, Sep 1, 2015 at 1:06 AM, patcharee wrote: > Hi, > > I found spark 1.5 sorting is very slow compared to spark 1.4. Below is my > code snippet > > val sqlRDD = sql("select date, u, v, z from

Re: Save dataframe into hbase

2015-09-02 Thread ALEX K
you can use Phoenix-Spark plugin: https://phoenix.apache.org/phoenix_spark.html On Wed, Sep 2, 2015 at 4:04 AM, Hafiz Mujadid wrote: > Hi > > What is the efficient way to save Dataframe into hbase? > > Thanks > > > > > > > -- > View this message in context: >

Is it required to remove checkpoint when submitting a code change?

2015-09-02 Thread Ricardo Luis Silva Paiva
Hi, Is there a way to submit an app code change while keeping the checkpoint data, or do I need to erase the checkpoint folder every time I re-submit the Spark app with a new jar? I have an app that counts pageviews streaming from Kafka, and delivers a file every hour covering the past 24 hours. I'm using

Re: wild cards in spark sql

2015-09-02 Thread Michael Armbrust
That query should work. On Wed, Sep 2, 2015 at 1:50 PM, Hafiz Mujadid wrote: > Hi > > does spark sql support wild cards to filter data in sql queries just like > we > can filter data in sql queries in RDBMS with different wild cards like % > and > ? etc. In other words

Re: wild cards in spark sql

2015-09-02 Thread Anas Sherwani
Yes, SparkSQL does support wildcards. The query you have written should work as is, if the type of ename is string. You can find all the keywords and a few supported functions at http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSqlSupportedSyntax.html
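A minimal sketch of both ways to express the query (the temp-table registration and the DataFrame name are assumptions; the table and column come from the question):

    // SQL form, after registering the DataFrame as a temporary table
    employeeDf.registerTempTable("employee")
    val viaSql = sqlContext.sql("SELECT * FROM employee WHERE ename LIKE 'a%d'")

    // Equivalent DataFrame form using Column.like
    val viaApi = employeeDf.filter(employeeDf("ename").like("a%d"))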

large number of import-related function calls in PySpark profile

2015-09-02 Thread Priedhorsky, Reid
Hello, I have a PySpark computation that relies on Pandas and NumPy. Currently, my inner loop iterates 2,000 times. I’m seeing the following show up in my profiling: 74804/29102 0.204 0.000 2.173 0.000 :2234(_find_and_load) 74804/29102 0.145 0.000 1.867 0.000

Re: Understanding Batch Processing Time

2015-09-02 Thread Snehal Nagmote
Hi, Thank you for your reply. Yes, I confirm, the system is stuck after job 352 finishes and before job 353 starts (shows up in the UI). I will start the job again and will take a jstack if I can reproduce the problem. Thanks, Snehal On 2 September 2015 at 13:34, Tathagata Das

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-02 Thread Cody Koeninger
Yeah, in general if you're changing the jar you can't recover the checkpoint. If you're just changing parameters, why not externalize those in a configuration file so your jar doesn't change? I tend to stick even my app-specific parameters in an external spark config so everything is in one
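A minimal sketch of that pattern (the property names are made up for illustration):

    // Passed at submit time, e.g.
    //   spark-submit --conf spark.myapp.windowHours=24 --conf spark.myapp.outputDir=/data/out app.jar
    val conf        = sc.getConf
    val windowHours = conf.get("spark.myapp.windowHours", "24").toInt
    val outputDir   = conf.get("spark.myapp.outputDir", "/tmp/out")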

Hbase Lookup

2015-09-02 Thread ayan guha
Hello group, I am trying to use Pig or Spark in order to achieve the following: 1. Write a batch process which will read from a file. 2. Look up HBase to see if the record exists. If so, then compare the incoming values with HBase and update fields which do not match. Else create a new record. My

Re: Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

2015-09-02 Thread Michael Armbrust
Before Spark 1.5, tables created using saveAsTable cannot be queried by Hive because we only store Spark SQL metadata. In Spark 1.5, for Parquet and ORC we store both, but this will not work with partitioned tables because Hive does not support dynamic partition discovery. On Wed, Sep 2, 2015 at
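For reference, a sketch of the two write paths being contrasted (table, column and path names are placeholders; the external-table workaround is a common suggestion, not something stated in the thread):

    // Managed table via saveAsTable: the metadata is Spark SQL's own, so a
    // partitioned table written this way is not usable from Hive (see above).
    df.write.format("orc").partitionBy("dt").saveAsTable("baseTable")

    // Writing the partitioned ORC files to an explicit path, then pointing an
    // external Hive table at that path, is a commonly used alternative.
    df.write.format("orc").partitionBy("dt").save("/user/hive/warehouse/base_table")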

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-02 Thread Alexander Pivovarov
Hi Neil Yes! it helps!!! I do not see _temporary in console output anymore. saveAsTextFile is fast now. 2015-09-02 23:07:00,022 INFO [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Finished task 18.0 in stage 0.0 (TID 18) in 4398 ms on ip-10-0-24-103.ec2.internal

How to Serialize and Reconstruct JavaRDD later?

2015-09-02 Thread Raja Reddy
Hi All, *Context:* I am exploring topic modelling with LDA in Spark MLlib. However, I need my model to be enhanced as more batches of documents come in. As of now I see no way of doing something like this, which gensim does:

Re: Too many open files issue

2015-09-02 Thread Saisai Shao
Here is the code in which NewHadoopRDD registers a close handler that is called when the task is completed ( https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L136 ). From my understanding, possibly the reason is that this `foreach` code in your

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-02 Thread Anders Arpteg
I haven't done a comparative benchmarking between the two, and it would involve some work to do so. A single run with each shuffle manager would probably not say that much, since we have a rather busy cluster and the performance heavily depends on what's currently running in the cluster. I have seen less

Multiple spark-submits vs akka-actors

2015-09-02 Thread srungarapu vamsi
Hi, I am using a Mesos cluster to run my Spark jobs. I have one mesos-master and two mesos-slaves set up on 2 machines. On one machine, master and slave are set up, and on the second machine a mesos-slave is set up. I run these on m3.large EC2 instances. 1. When I try to submit two jobs using

Re: Multiple spark-submits vs akka-actors

2015-09-02 Thread Akhil Das
"Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources". I'm assuming you are submitting the job in coarse-grained mode, in that case make sure you are asking for the available resources. If you want to submit

Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread alexis GILLAIN
Aurélien, From what you're saying, I can think of a couple of things, considering I don't know what you are doing in the rest of the code: - There are a lot of non-HDFS writes; they come from the rest of your code and/or repartition(). Repartition involves a shuffle and the creation of files on disk.

Re: Too many open files issue

2015-09-02 Thread Steve Loughran
ah, now that does sound suspicious... On 2 Sep 2015, at 14:09, Sigurd Knippenberg wrote: Yep. I know. It was set to 32K when I ran this test. If I bump it to 64K the issue goes away. It still doesn't make sense to me that the Spark job

Re: How to Serialize and Reconstruct JavaRDD later?

2015-09-02 Thread Hemant Bhanawat
You want to persist the state between the execution of two RDDs. So, I believe what you need is serialization of your model and not of the JavaRDD. If you can serialize your model, you can persist it in HDFS or some other datastore to be used by the next RDDs. If you are using Spark Streaming, doing
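A minimal sketch of that idea using plain Java serialization to HDFS (assumes the model class is Serializable; the path and the ldaModel name are illustrative):

    import java.io.{ObjectInputStream, ObjectOutputStream}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    def saveModel[T <: Serializable](model: T, pathStr: String): Unit = {
      val fs  = FileSystem.get(new Configuration())
      val out = new ObjectOutputStream(fs.create(new Path(pathStr)))
      try out.writeObject(model) finally out.close()
    }

    def loadModel[T](pathStr: String): T = {
      val fs = FileSystem.get(new Configuration())
      val in = new ObjectInputStream(fs.open(new Path(pathStr)))
      try in.readObject().asInstanceOf[T] finally in.close()
    }

    // Usage (hypothetical): saveModel(ldaModel, "hdfs:///models/lda-latest")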

Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-02 Thread Bertrand
Looking at another forum, I tried : files = sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram","com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text") Traceback (most recent call last): File "", line

Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread Aurélien Bellet
Thanks a lot for the useful link and comments Alexis! First of all, the problem occurs without doing anything else in the code (except of course loading my data from HDFS at the beginning) - so it definitely comes from the shuffling. You're right, in the current version, checkpoint files are

Unable to understand error “SparkListenerBus has already stopped! Dropping event …”

2015-09-02 Thread Adrien Mogenet
Hi there, I'd like to know if anyone has a magic method to avoid such messages in Spark logs: 2015-08-30 19:30:44 ERROR LiveListenerBus:75 - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,WrappedArray()) After further investigations, I understand that

Re: Too many open files issue

2015-09-02 Thread Sigurd Knippenberg
Yep. I know. It was set to 32K when I ran this test. If I bump it to 64K the issue goes away. It still doesn't make sense to me that the Spark job doesn't release its file handles until the end of the job instead of doing that while my loop iterates. Sigurd On Wed, Sep 2, 2015 at 4:33 AM,

Error using SQLContext in spark

2015-09-02 Thread rakesh sharma
Error: application failed with exception java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V at examples.PersonRecordReader.getPersonRecords(PersonRecordReader.java:35) at