Spark as standalone or with Hadoop stack.

2015-09-22 Thread Shiv Kandavelu
Hi All, We currently have a Hadoop cluster with YARN as the resource manager. We are planning to use HBase as the data store due to the C-P aspects of the CAP theorem. We now want to do extensive data processing, both on data stored in HBase as well as stream processing from an online website / API

pyspark question: create RDD from csr_matrix

2015-09-22 Thread jeff saremi
I've tried desperately to create an RDD from a matrix I have. Every combination failed. I have a sparse matrix returned from a call to dv = DictVectorizer(); sv_tf = dv.fit_transform(tf), which is supposed to be a matrix of document terms and their frequencies. I need to convert this to

Re: Deploying spark-streaming application on production

2015-09-22 Thread Adrian Tanase
Btw, I re-read the docs and I want to clarify that a reliable receiver + WAL gives you at-least-once, not exactly-once semantics. Sent from my iPhone On 21 Sep 2015, at 21:50, Adrian Tanase wrote: I'm wondering, isn't this the canonical use case for

How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Clément Frison
Hello, My team and I have a 32-core machine and we would like to use a huge object - for example a large dictionary - in a map transformation and use all our cores in parallel by sharing this object among some tasks. We broadcast our large dictionary. dico_br = sc.broadcast(dico) We use it in

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
Check out: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types On Tue, Sep 22, 2015 at 12:49 PM, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > Michael > > Thank you for your prompt answer. I will repost after I try this again on > 1.5.1 or branch-1.5. In

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Ted Yu
bq. it's relatively harder to use it with HBase I agree with Sean. I work on HBase. To my knowledge, no one runs HBase on top of Mesos. On Tue, Sep 22, 2015 at 12:31 PM, Sean Owen wrote: > Who told you Mesos would make Spark 100x faster? does it make sense > that just the

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
I think that you are hitting a bug (which should be fixed in Spark 1.5.1). I'm hoping we can cut an RC for that this week. Until then you could try building branch-1.5. On Tue, Sep 22, 2015 at 11:13 AM, Deenar Toraskar wrote: > Hi > > I am trying to write a UDAF

Re: How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Utkarsh Sengar
If the broadcast variable doesn't fit in memory, I think it is not the right fit for you. You can think about fitting it into an RDD as a tuple with the other data you are working on. Say you are working on an RDD (rdd in your case); run a map/reduce to convert it to RDD> so now

Re: Mesos Tasks only run on one node

2015-09-22 Thread Tim Chen
What configuration have you used, and what is the slaves' configuration? Possibly all other nodes either don't have enough resources, or are using another role that's preventing the executor from being launched. Tim On Mon, Sep 21, 2015 at 1:58 PM, John Omernik wrote: >

Re: Creating BlockMatrix with java API

2015-09-22 Thread Pulasthi Supun Wickramasinghe
Hi Yanbo, Thanks for the reply. I thought I might be missing something. Anyway, I moved to using Scala since it is the complete API. Best Regards, Pulasthi On Tue, Sep 22, 2015 at 7:03 AM, Yanbo Liang wrote: > This is due to the distributed matrices like >

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
I see, this makes sense. We should probably add this in Spark SQL. However, there's one corner case to note about user-defined Parquet metadata. When committing a write job, ParquetOutputCommitter writes Parquet summary files (_metadata and _common_metadata), and user-defined key-value

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-22 Thread Luciano Resende
localDF is a pure R data frame, so as.vector will work with no problems. As for calling it on the SparkR objects, try calling collect before you call as.vector (or, in your case, the algorithms); that should solve your problem. On Mon, Sep 21, 2015 at 8:48 AM, Ellen Kraffmiller <

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
The indices are definitely necessary. My first solution was just reduceByKey { case (v, _) => v } and that didn't work. I needed to look at both values and see which had the lower index. On Tue, Sep 22, 2015 at 8:54 AM, Sean Owen wrote: > The point is that this only works if

Re: Has anyone used the Twitter API for location filtering?

2015-09-22 Thread Jo Sunad
Thanks Akhil, but I can't seem to get any tweets that include location data. For example, when I do stream.filter(status => status.getPlace().getName) and run the code for 20 minutes I only get null values. It seems like Twitter might purposely be removing the Place for free users? On Tue, Sep

Re: How to speed up MLlib LDA?

2015-09-22 Thread Charles Earl
It seems that the Vowpal Wabbit version is most similar to what is in https://github.com/intel-analytics/TopicModeling/blob/master/src/main/scala/org/apache/spark/mllib/topicModeling/OnlineHDP.scala Although the Intel one seems to implement the Hierarchical Dirichlet Process (topics and subtopics) as

Re: How to speed up MLlib LDA?

2015-09-22 Thread Marko Asplund
How optimized are the Commons math3 methods that showed up in profiling? Are there any higher performance alternatives to these? marko

Re: How to speed up MLlib LDA?

2015-09-22 Thread Pedro Rodriguez
I helped some with the LDA and worked quite a bit on a Gibbs version. I don't know if the Gibbs version might help, but since it is not (yet) in MLlib, Intel Analytics kindly created a spark package with their adapted version plus a couple other LDA algorithms:

Re: Count for select not matching count for group by

2015-09-22 Thread Michael Armbrust
This looks like something is wrong with predicate pushdown. Can you include the output of calling explain, and tell us what format the data is stored in? On Mon, Sep 21, 2015 at 8:06 AM, Michael Kelly wrote: > Hi, > > I'm seeing some strange behaviour with spark

Spark 1.5 UDAF ArrayType

2015-09-22 Thread Deenar Toraskar
Hi, I am trying to write a UDAF, ArraySum, that does an element-wise sum of arrays of Doubles, returning an array of Doubles, following the sample in https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html. I am getting the

Re: WAL on S3

2015-09-22 Thread Tathagata Das
You can keep the checkpoints in the Hadoop-compatible file system and the WAL somewhere else using your custom WAL implementation. Yes, cleaning up the stuff gets complicated as it is not as easy as deleting off the checkpoint directory - you will have to clean up checkpoint directory as well as

Spark standalone/Mesos on top of Ceph

2015-09-22 Thread fightf...@163.com
Hi guys, Here is the info for Ceph: http://ceph.com/ We are investigating and using Ceph for distributed storage and monitoring, and are specifically interested in using Ceph as the underlying file system storage for Spark. However, we have no experience achieving that. Has anybody seen such

Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Do you have specific reasons to use Ceph? I used Ceph before; I'm not too in love with it, especially when I was using the Ceph Object Gateway S3 API. There are some incompatibilities with the AWS S3 API. You really, really need to try it before making the commitment. Did you manage to install it? On

how to submit the spark job outside the cluster

2015-09-22 Thread Zhiliang Zhu
Dear Experts, The Spark job is running on the cluster via YARN. The job can be submitted on a machine from the cluster; however, I would like to submit the job from another machine which does not belong to the cluster. I know that for Hadoop a job could be done by way of another

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
It should be similar to other Hadoop jobs. You need the Hadoop configuration on your client machine, and point HADOOP_CONF_DIR in Spark to the configuration. Thanks Zhan Zhang On Sep 22, 2015, at 6:37 PM, Zhiliang Zhu wrote:

Yarn Shutting Down Spark Processing

2015-09-22 Thread Bryan Jeffrey
Hello. I have a Spark streaming job running on a cluster managed by Yarn. The spark streaming job starts and receives data from Kafka. It is processing well and then after several seconds I see the following error: 15/09/22 14:53:49 ERROR yarn.ApplicationMaster: SparkContext did not initialize

Re: Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Hi Sun, The issue with Ceph as the underlying file system for Spark is that you lose data locality. Ceph is not designed to have Spark run directly on top of the OSDs. I know that CephFS provides data location information via a Hadoop-compatible API. The last time I researched this is that the

Parallel collection in driver programs

2015-09-22 Thread Andy Huang
Hi All, I would like to know if anyone has experience with parallel collections in the driver program, and if there is an actual advantage/disadvantage of doing so, e.g. with a collection of JDBC connections and tables. We have adapted our non-Spark code, which utilizes parallel collections, to the Spark

Re: Streaming Receiver Imbalance Problem

2015-09-22 Thread Tathagata Das
A lot of these imbalances were solved in spark 1.5. Could you give that a spin? https://issues.apache.org/jira/browse/SPARK-8882 On Tue, Sep 22, 2015 at 12:17 AM, SLiZn Liu wrote: > Hi spark users, > > In our Spark Streaming app via Kafka integration on Mesos, we

Re: Streaming Receiver Imbalance Problem

2015-09-22 Thread SLiZn Liu
Cool, we are still sticking with 1.3.1, will upgrade to 1.5 ASAP. Thanks for the tips, Tathagata! On Wed, Sep 23, 2015 at 10:40 AM Tathagata Das wrote: > A lot of these imbalances were solved in spark 1.5. Could you give that a > spin? > >

Re: Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread fightf...@163.com
Hi Jerry, Yeah, we already managed to run and use Ceph in a few of our production environments, especially with OpenStack. The reason we want to use Ceph is that we are looking for a unified storage layer, and the design concepts of Ceph are quite appealing. I am just interested

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread tridib
By skewed, did you mean it's not distributed uniformly across partitions? All of my columns are strings and almost of the same size, i.e. id1,field11,fields12 id2,field21,field22 -- View this message in context:

Re: How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Deenar Toraskar
Clement, In local mode all worker threads run in the driver VM. Your dictionary should not be copied 32 times; in fact it won't be broadcast at all. Have you tried increasing spark.driver.memory to ensure that the driver uses all the memory on the machine? Deenar On 22 September 2015 at 19:42,

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
Hi Zhiliang, I cannot find a specific doc. But as far as I remember, you can log in to one of your cluster machines, find the Hadoop configuration location, for example /etc/hadoop/conf, and copy that directory to your local machine. Typically it has hdfs-site.xml, yarn-site.xml, etc. In Spark, the

RE: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Chirag Dewan
Thanks Ted and Rich. So if I repartition my RDD programmatically and call coalesce on the RDD to 1 partition, would that generate 1 output file? Ahh... is my coalesce operation causing 1 partition, hence 1 output file and 1 executor working on all the data? To summarize, this is what I do:

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
I must have gone mad :) Thanks for pointing it out. I downloaded the 1.5.0 assembly jar and added it to SPARK_CLASSPATH. However, I am getting a new error now >>> kvs = KafkaUtils.createDirectStream(ssc, ['spark'], {"metadata.broker.list": 'localhost:9092'})

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Jonathan Coveney
It's highly conceivable to be able to beat Spark in performance on tiny data sets like this. That's not really what it has been optimized for. On Tuesday, September 22, 2015, juljoin wrote: > Hello, > > I am trying to figure Spark out and I still have some

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhiliang Zhu
Hi Zhan, Yes, I get it now. I have never deployed the Hadoop configuration locally and cannot find the specific doc; would you help provide the doc to do that... Thank you, Zhiliang On Wednesday, September 23, 2015 11:08 AM, Zhan Zhang wrote: There is no

Scala Limitation - Case Class definition with more than 22 arguments

2015-09-22 Thread satish chandra j
Hi All, Do we have any alternative solutions in Scala to avoid the limitation in defining a case class having more than 22 arguments? We are using Scala version 2.10.2; currently I need to define a case class with 37 arguments but am getting an error: "error: Implementation restriction: case classes

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-22 Thread Andy Huang
Alternatively, I would suggest looking at programmatically building the schema; refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema Cheers Andy On Wed, Sep 23, 2015 at 2:07 PM, Ted Yu wrote: > Can you switch to

Re: Invalid checkpoint url

2015-09-22 Thread srungarapu vamsi
@Das, No, I am getting it in cluster mode. I think I understood why I am getting this error; please correct me if I am wrong. The reason is: checkpointing writes the RDD to disk, so this checkpointing happens on all workers. Whenever Spark has to read back the RDD, the checkpoint directory should be

Re: Creating BlockMatrix with java API

2015-09-22 Thread Pulasthi Supun Wickramasinghe
Hi Sabarish Thanks, that would indeed solve my problem Best Regards, Pulasthi On Wed, Sep 23, 2015 at 12:55 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > Hi Pulasthi > > You can always use JavaRDD.rdd() to get the scala rdd. So in your case, > > new BlockMatrix(rdd.rdd(),

How to make Group By/reduceByKey more efficient?

2015-09-22 Thread swetha
Hi, How to make Group By more efficient? Is it recommended to use a custom partitioner and then do a Group By? And can we use a custom partitioner and then use a reduceByKey for optimization? Thanks, Swetha -- View this message in context:

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhiliang Zhu
Hi Zhan, Thanks very much for your helpful comment. I also expected it would be similar to Hadoop job submission; however, I was not sure whether it is like that when it comes to Spark. Have you ever tried that for Spark... Would you give me the deployment doc for the Hadoop and Spark gateway, since this

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Saisai Shao
I think it is something related to the class loader; the behavior is different for the classpath and --jars. If you want to know the details, I think you'd better dig into the source code. Thanks Jerry On Tue, Sep 22, 2015 at 9:10 PM, ayan guha wrote: > I must have gone mad :)

Re: Creating BlockMatrix with java API

2015-09-22 Thread Sabarish Sasidharan
Hi Pulasthi, You can always use JavaRDD.rdd() to get the Scala RDD. So in your case, new BlockMatrix(rdd.rdd(), 2, 2) should work. Regards Sab On Tue, Sep 22, 2015 at 10:50 PM, Pulasthi Supun Wickramasinghe < pulasthi...@gmail.com> wrote: > Hi Yanbo, > > Thanks for the reply. I thought I

Re: SparkR vs R

2015-09-22 Thread Yashwanth Kumar
Hi, 1. The main difference between SparkR and R is that "SparkR" can handle big data. Yes, you can use other core libraries inside SparkR (not algos like lm(), glm(), kmeans()). 2. Yes, core R libraries will not be distributed. You can use functions from these libraries which are applicable for mapper

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
There is no difference between running the client inside or outside of the cluster (assuming there is no firewall or network connectivity issue), as long as you have the Hadoop configuration locally. Here is the doc for running on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html Thanks.

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? Trying to answer this question, I looked into Checkpoint.getCheckpointFiles [1]. It is doing findFirstIn which would probably be calling the S3 LIST operation. S3 LIST is prone to eventual consistency [2]. What would happen when

Re: SparkR for accumulo

2015-09-22 Thread madhvi.gupta
Hi Rui, Can't we use the Accumulo data RDD created from Java in Spark, in SparkR? Thanks and Regards Madhvi Gupta On Tuesday 22 September 2015 04:42 PM, Sun, Rui wrote: I am afraid that there is no support for Accumulo in SparkR now, because: 1. It seems that there is no data source support

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhiliang Zhu
Hi Zhan, I really appreciate your help; I will do that next. And on the local machine, no Hadoop/Spark needs to be installed, only the copied /etc/hadoop/conf... Would the information (for example IP, hostname, etc.) of the local machine need to be set in the conf files... Moreover, do

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-22 Thread Ted Yu
Can you switch to 2.11 ? The following has been fixed in 2.11: https://issues.scala-lang.org/browse/SI-7296 Otherwise consider packaging related values into a case class of their own. On Tue, Sep 22, 2015 at 8:48 PM, satish chandra j wrote: > HI All, > Do we have any

JdbcRDD Constructor

2015-09-22 Thread satish chandra j
Hi All, The JdbcRDD constructor has the following parameters: JdbcRDD(SparkContext

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Deenar Toraskar
Daniel, Can you elaborate on why you are using a broadcast variable to concatenate many Avro files into a single ORC file? Look at wholeTextFiles on the Spark context. SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename,

Re: Partitions on RDDs

2015-09-22 Thread Yashwanth Kumar
Hi, In the first RDD transformation (e.g. reading from a file: sc.textFile("path", partitions)), the partitioning you specify will be applied to all further transformations and actions on this RDD. In a few places repartitioning your RDD will give an added advantage. Repartitioning is usually done during

Partitions on RDDs

2015-09-22 Thread XIANDI
I'm always confused by partitions. We may have many RDDs in the code. Do we need to partition all of them? Do the RDDs get rearranged among all the nodes whenever we partition? What is a wise way of doing partitioning? -- View this message in context:

Re: HDP 2.3 support for Spark 1.5.x

2015-09-22 Thread Zhan Zhang
Hi Krishna, For the time being, you can download from upstream, and it should run OK on HDP 2.3. For HDP-specific problems, you can ask in the Hortonworks forum. Thanks. Zhan Zhang On Sep 22, 2015, at 3:42 PM, Krishna Sankar wrote: Guys,

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
I am trying to use the pluggable WAL, but it can be used only with checkpointing turned on. Thus I still need to have a Hadoop-compatible file system. Is there something like pluggable checkpointing? Or can the WAL be used without checkpointing? What happens when the WAL is available but the checkpoint

Re: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Richard Eggert
If there's only one partition, by definition it will only be handled by one executor. Repartition to divide the work up. Note that this will also result in multiple output files, however. If you absolutely need them to be combined into a single file, I suggest using the Unix/Linux 'cat' command

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
My understanding of pluggable WAL was that it eliminates the need for having a Hadoop-compatible file system [1]. What is the use of pluggable WAL when it can be only used together with checkpointing which still requires a Hadoop-compatible file system? [1]:

Re: Partitions on RDDs

2015-09-22 Thread Richard Eggert
In general, RDDs get partitioned automatically without programmer intervention. You generally don't need to worry about them unless you need to adjust the size/number of partitions or the partitioning scheme according to the needs of your application. Partitions get redistributed among nodes

Re: WAL on S3

2015-09-22 Thread Tathagata Das
1. Currently, the WAL can be used only with checkpointing turned on, because it does not make sense to recover from the WAL if there is no checkpoint information to recover from. 2. Since the current implementation saves the WAL in the checkpoint directory, they share the same fate -- if the checkpoint

SPARK_WORKER_INSTANCES was detected (set to '2')…This is deprecated in Spark 1.0+

2015-09-22 Thread Jacek Laskowski
Hi, This is for Spark 1.6.0-SNAPSHOT (SHA1 a96ba40f7ee1352288ea676d8844e1c8174202eb). I've been toying with Spark Standalone cluster and have the following file in conf/spark-env.sh: ➜ spark git:(master) ✗ cat conf/spark-env.sh SPARK_WORKER_CORES=2 SPARK_WORKER_MEMORY=2g # multiple Spark

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Jacek Laskowski
On Tue, Sep 22, 2015 at 10:03 PM, Ted Yu wrote: > To my knowledge, no one runs HBase on top of Mesos. Hi, That sentence caught my attention. Could you explain the reasons for not running HBase on Mesos, i.e. what makes Mesos inappropriate for HBase? Jacek

Re: Invalid checkpoint url

2015-09-22 Thread Tathagata Das
Are you getting this error in local mode? On Tue, Sep 22, 2015 at 7:34 AM, srungarapu vamsi wrote: > Yes, I tried ssc.checkpoint("checkpoint"), it works for me as long as i > don't use reduceByKeyAndWindow. > > When i start using "reduceByKeyAndWindow" it complains me

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Richard Eggert
Maybe it's just my phone, but I don't see any code. On Sep 22, 2015 11:46 AM, "juljoin" wrote: > Hello, > > I am trying to figure Spark out and I still have some problems with its > speed, I can't figure them out. In short, I wrote two programs that loop > through a

Re: Streaming Receiver Imbalance Problem

2015-09-22 Thread Tathagata Das
Also, you could switch to the Direct Kafka API, which was first released as experimental in 1.3. In 1.5 we graduated it from experimental, but it's quite usable in Spark 1.3.1. TD On Tue, Sep 22, 2015 at 7:45 PM, SLiZn Liu wrote: > Cool, we are still sticking with 1.3.1,

Re: Invalid checkpoint url

2015-09-22 Thread Tathagata Das
Bingo! That is the problem. The solution is now obvious I presume :) On Tue, Sep 22, 2015 at 9:41 PM, srungarapu vamsi wrote: > @Das, > No, i am getting in the cluster mode. > I think i understood why i am getting this error, please correct me if i > am wrong. > Reason

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Tathagata Das
SPARK_CLASSPATH is, I believe, deprecated right now. So I am not surprised that there is some difference in the code paths. On Tue, Sep 22, 2015 at 9:45 PM, Saisai Shao wrote: > I think it is something related to class loader, the behavior is different > for classpath and

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
The point is that this only works if you already knew the file was presented in order within and across partitions, which was the original problem anyway. I don't think it is in general, but in practice, I do imagine it's already in the expected order from textFile. Maybe under the hood this ends

Re: Using Spark for portfolio manager app

2015-09-22 Thread Thúy Hằng Lê
That's a great answer, Adrian. I found a lot of information here. I have a direction for the application now; I will try your suggestion :) On Tuesday, September 22, 2015, Adrian Tanase wrote: > >1. reading from kafka has exactly once guarantees - we are using it in >

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Saisai Shao
I think you're using the wrong version of the Kafka assembly jar. The Python API for the direct Kafka stream is not supported in Spark 1.3.0; you'd better change to version 1.5.0. It looks like you're using Spark 1.5.0, so why did you choose the Kafka assembly for 1.3.0?

Re: passing SparkContext as parameter

2015-09-22 Thread Petr Novak
And probably the original source code https://gist.github.com/koen-dejonghe/39c10357607c698c0b04 On Tue, Sep 22, 2015 at 10:37 AM, Petr Novak wrote: > To complete design pattern: > >

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
Or if there is an option on the MQTT server to block event ingestion towards Spark but still keep receiving and buffering events in MQTT while waiting for an ACK, then it would be possible just to gracefully shut down the Spark job to finish what is in its buffers and restart. Petr On Tue, Sep 22, 2015 at 10:53

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-22 Thread Huy Banh
The header should already be sent from the driver to the workers by Spark, and therefore it works in spark-shell. In the Scala IDE, the code is inside an app class, so you need to check if the app class is serializable. On Tue, Sep 22, 2015 at 9:13 AM Alexis Gillain < alexis.gill...@googlemail.com> wrote: >

Re: Spark Web UI + NGINX

2015-09-22 Thread Akhil Das
Can you not just tunnel it? Like on machine A: ssh -L 8080:127.0.0.1:8080 machineB And on your local machine: ssh -L 80:127.0.0.1:8080 machineA And then simply open http://localhost/ which will show the Spark UI running on machineB. The people at DigitalOcean have made a wonderful article on how

Re: Spark Lost executor && shuffle.FetchFailedException

2015-09-22 Thread Akhil Das
If you can look a bit deeper in the executor logs, then you might find the root cause for this issue. Also make sure the ports (seems 34869 here) are accessible between all the machines. Thanks Best Regards On Mon, Sep 21, 2015 at 12:40 PM, wrote: > Hi All: > > > When I

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-22 Thread Zhiliang Zhu
Dear Sujit, Since you are experienced with Spark, I do not know whether it would be convenient for you to comment on my dilemma in using Spark to deal with an R-background application ... Thank you very much! Zhiliang On Tuesday, September 22, 2015 1:45 AM, Zhiliang Zhu

Re: spark.mesos.coarse impacts memory performance on mesos

2015-09-22 Thread Tim Chen
Hi Utkarsh, Just to be sure, you originally set coarse to false but then to true? Or is it the other way around? Also, what's the exception/stack trace when the driver crashed? Coarse-grained mode pre-starts all the Spark executor backends, so it has the least overhead compared to fine-grained. There

Re: passing SparkContext as parameter

2015-09-22 Thread Petr Novak
To complete design pattern: http://stackoverflow.com/questions/30450763/spark-streaming-and-connection-pool-implementation Petr On Mon, Sep 21, 2015 at 10:02 PM, Romi Kuntsman wrote: > Cody, that's a great reference! > As shown there - the best way to connect to an external

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Borisa Zivkovic
Thanks for the answer. I need this in order to be able to track schema metadata. Basically, when I create Parquet files from Spark I want to be able to "tag" them in some way (giving the schema an appropriate name or attaching some key/values) so that it is fairly easy to get basic metadata about

Uneven distribution of tasks among workers in Spark/GraphX 1.5.0

2015-09-22 Thread dmytro
I have a large list of edges as a 5000 partition RDD. Now, I'm doing a simple but shuffle-heavy operation: val g = Graph.fromEdges(edges, ...).partitionBy(...) val subs = Graph(g.collectEdges(...), g.edges).collectNeighbors() subs.saveAsObjectFile("hdfs://...") The job gets divided into 9

Re: Cache after filter Vs Writing back to HDFS

2015-09-22 Thread Akhil Das
Instead of .map you can try doing a .mapPartitions and see the performance. Thanks Best Regards On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue wrote: > For a large dataset, I want to filter out something and then do the > computing intensive work. > > What I am doing now: >

RE: Support of other languages?

2015-09-22 Thread Sun, Rui
Although the data is RDD[Array[Byte]] whose content is not meaningful to Spark Core, it has to be on the heap, as Spark Core performs RDD transformations on the heap. SPARK-10399 is irrelevant; it aims to manipulate off-heap data using a C++ library via JNI. This is done in-process. From: Rahul

Re: Lost tasks in Spark SQL join jobs

2015-09-22 Thread Akhil Das
If you look a bit into the error logs, you can possibly see other issues like GC overhead, etc., which cause the next set of tasks to fail. Thanks Best Regards On Thu, Sep 17, 2015 at 9:26 AM, Gang Bai wrote: > Hi all, > > I’m joining two tables on a specific

Re: Has anyone used the Twitter API for location filtering?

2015-09-22 Thread Akhil Das
That's because sometimes getPlace returns null, and calling getLang on null throws either a NullPointerException or a NoSuchMethodError. You need to filter out those statuses which don't include location data. Thanks Best Regards On Fri, Sep 18, 2015 at 12:46 AM, Jo Sunad

Re: Stopping criteria for gradient descent

2015-09-22 Thread Yanbo Liang
Hi Nishanth, The convergence tolerance is a condition which decides iteration termination. In LogisticRegression with SGD optimization, it depends on the difference of weight vectors. But in GBT it depends on the validation error on the held-out test set. 2015-09-18 4:09 GMT+08:00 nishanthps

Re: SparkContext declared as object variable

2015-09-22 Thread Akhil Das
Its a "value" not a variable, and what are you parallelizing here? Thanks Best Regards On Fri, Sep 18, 2015 at 11:21 PM, Priya Ch wrote: > Hello All, > > Instead of declaring sparkContext in main, declared as object variable > as - > > object sparkDemo > { > >

Re: How to speed up MLlib LDA?

2015-09-22 Thread Marko Asplund
Hi, I did some profiling for my LDA prototype code that requests topic distributions from a model. According to Java Mission Control more than 80 % of execution time during sample interval is spent in the following methods: org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
If MQTT can be configured with a long enough timeout for the ACK and can buffer enough events while waiting for the Spark job to restart, then I think one could do even without the WAL, assuming that the Spark job shuts down gracefully, possibly saving its own custom metadata somewhere, e.g. ZooKeeper, if required to

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
Ahh, the problem probably is async ingestion into the Spark receiver buffers, hence the WAL is required, I would say. Petr On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak wrote: > If MQTT can be configured with long enough timeout for ACK and can buffer > enough events while waiting for

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread dmytro
Could it be that your data is skewed? Do you have variable-length column types? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24762.html Sent from the Apache Spark User List mailing list

Re: passing SparkContext as parameter

2015-09-22 Thread Priya Ch
I have a scenario like this - I read a DStream of messages from Kafka. Now if my RDD contains 10 messages, for each message I need to query the Cassandra DB, do some modification, and update the records in the DB. If there is no option of passing the SparkContext to workers to read/write into the DB, the only

Streaming Receiver Imbalance Problem

2015-09-22 Thread SLiZn Liu
Hi Spark users, In our Spark Streaming app with Kafka integration on Mesos, we initialized 3 receivers to receive 3 Kafka partitions, but a record receiving rate imbalance has been observed: with spark.streaming.receiver.maxRate set to 120, sometimes 1 of the receivers receives very close to the limit

Re: Spark Ingestion into Relational DB

2015-09-22 Thread Jörn Franke
In these cases you may want to have a separate Oracle instance for the batch process and another one for serving it, to avoid SLA surprises. Nevertheless, if data processing becomes more strategic across projects you may think about job management and HDFS using Hadoop with Spark. On Tue, Sep 22

Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
Hi, I have added the Spark assembly jar to SPARK_CLASSPATH >>> print os.environ['SPARK_CLASSPATH'] D:\sw\spark-streaming-kafka-assembly_2.10-1.3.0.jar Now I am facing the below issue with a test topic >>> ssc = StreamingContext(sc, 2) >>> kvs =

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
I agree, but it's a constraint I have to deal with. The idea is to load these files and merge them into ORC. When using Hive on Tez it takes less than a minute. Daniel > On 22 Sept 2015, at 16:00, Jonathan Coveney wrote: > > having a file per record is pretty inefficient on

Re: Spark Streaming distributed job

2015-09-22 Thread Adrian Tanase
I think you need to dig into the custom receiver implementation. As long as the source is distributed and partitioned, the downstream .map, .foreachXX are all distributed as you would expect. You could look at how the “classic” Kafka receiver is instantiated in the streaming guide and try to

RE: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread java8964
Or at least tell us how many partitions you are using. Yong > Date: Tue, 22 Sep 2015 02:06:15 -0700 > From: belevts...@gmail.com > To: user@spark.apache.org > Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables > > Could it be that your data is skewed? Do you have

Re: AWS_CREDENTIAL_FILE

2015-09-22 Thread Akhil Das
No, you can either set the configurations within your SparkContext's Hadoop configuration: val hadoopConf = sparkContext.hadoopConfiguration hadoopConf.set("fs.s3n.awsAccessKeyId", s3Key) hadoopConf.set("fs.s3n.awsSecretAccessKey", s3Secret) or you can set it in the environment as: export

Re: Slow Performance with Apache Spark Gradient Boosted Tree training runs

2015-09-22 Thread Yashwanth Kumar
Hi vkutsenko, Can you just give partitions to the input labeled RDD, like: data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD().repartition(5); Here, I used 5, since you have 5 cores. Also, for further benchmarking and performance tuning:
