I agree with the others that a dedicated NoSQL datastore can make sense. You
should look at the lambda architecture paradigm. Keep in mind that more memory
does not necessarily mean better performance; what matters is having the right
data structure for your users' queries. Additionally, if your queries
Or you could use sinks like Elasticsearch.
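For instance, with the elasticsearch-hadoop connector on the classpath, a
DataFrame can be written straight to an index. A minimal sketch, assuming the
connector artifact is available and es.nodes is set in the Spark conf; the
input path and index/type name are placeholders, not from this thread:

import org.elasticsearch.spark.sql._  // from the elasticsearch-spark artifact

// Assumes es.nodes in the Spark conf points at the Elasticsearch cluster.
val df = spark.read.parquet("hdfs:///data/events")  // hypothetical input
df.saveToEs("events/doc")                           // target index/type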
Regards,
Noorul
On Mon, Mar 6, 2017 at 10:52 AM, devjyoti patra wrote:
> Timothy, why are you writing application logs to HDFS? In case you want to
> analyze these logs later, you can write to local storage on your slave nodes
> and
Timothy, why are you writing application logs to HDFS? In case you want to
analyze these logs later, you can write to local storage on your slave
nodes and later rotate those files to a suitable location. If they are only
going to be useful for debugging the application, you can always remove them.
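A rolling-file log4j setup is one way to keep them bounded on local disk; a
minimal sketch for a YARN deployment, shipped to the containers with
--files log4j.properties (the size and backup count are arbitrary assumptions):

# log4j.properties
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=100MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n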
Hi,
I am new to Spark ML Lib. I am using FPGrowth model for finding related
items.
The number of transactions is 63K and the total number of items across all
transactions is 200K.
I am running the FPGrowth model to generate frequent itemsets, and it is
taking a huge amount of time.
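For what it's worth, FPGrowth runtime is dominated by the minSupport
threshold: raising it prunes the candidate itemsets sharply, and more
partitions spread the FP-tree mining. A minimal MLlib sketch (the file path
and parameter values are illustrative assumptions):

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// One transaction per line, items separated by spaces (hypothetical file).
val transactions: RDD[Array[String]] =
  sc.textFile("hdfs:///data/transactions.txt").map(_.trim.split(' '))

val model = new FPGrowth()
  .setMinSupport(0.05)   // the biggest runtime lever; tune upward first
  .setNumPartitions(10)  // parallelizes the conditional FP-tree mining
  .run(transactions)

model.freqItemsets.take(20).foreach { fi =>
  println(fi.items.mkString("[", ",", "]") + " : " + fi.freq)
}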
I'm running a single worker EMR cluster for a Structured Streaming job. How
do I deal with my application log filling up HDFS?
/var/log/spark/apps/application_1487823545416_0021_1.inprogress
is currently 21.8 GB
Hi everyone,
We are deploying a Kafka cluster for ingesting streaming data. But sometimes
some of the nodes in the cluster have trouble (a node dies, the Kafka daemon
is killed...). However, recovering data in Kafka can be very slow; it takes
several hours to recover from a disaster. I saw a slide here
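One broker setting worth checking (an assumption about the cause, not a
diagnosis from this thread): log recovery after an unclean shutdown runs
single-threaded per data directory by default, so raising the thread count can
shorten restarts roughly in proportion:

# server.properties on each broker (default is 1)
num.recovery.threads.per.data.dir=8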
Any specific reason to choose Spark? It sounds like you have a
Write-Once-Read-Many dataset, which is logically partitioned across
customers and sits in some data store. And essentially you are looking for
a fast way to access it, and most likely you will use the same partition
key for
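If so, pre-partitioning the data on disk by that key means each query only
touches one partition's files. A minimal sketch with Parquet (the column name
and paths are illustrative assumptions; df is the dataset built elsewhere):

import org.apache.spark.sql.functions.col

// Write once, laid out by the lookup key.
df.write.partitionBy("customer_id").parquet("hdfs:///warehouse/events")

// Reads that filter on the key scan only that customer's directory.
val one = spark.read.parquet("hdfs:///warehouse/events")
  .where(col("customer_id") === "c-42")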
Hello all,
I am seeing strange behaviour that I can't understand. I have a streaming job
using a custom Java receiver that pulls data from a JMS queue; I process the
data and then write it to HDFS as Parquet and Avro files. For some reason my
job keeps failing after 1 hour and 30 minutes. When it fails I get an
Hi Allan,
Where is the data stored right now? If it's in a relational database, and you
are using Spark with Hadoop, I feel like it would make sense to import the
data into HDFS, just because it would be faster to access. You could use
Sqoop to do that.
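A typical one-off import looks something like this (host, database, table,
credentials, and paths are all placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl --password-file /user/etl/.password \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 8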
In terms of having a
How about HTTP2/REST connector for Spark? Is that something we can expect?
Thanks!
On Wed, Feb 22, 2017 at 4:07 AM, Christian Kadner
wrote:
> The Apache Bahir community is pleased to announce the release
> of Apache Bahir 2.1.0 which provides the following extensions for
>
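For context, the 2.1.0 extensions include an MQTT data source for Structured
Streaming. A minimal usage sketch, with the format class as documented by
Bahir (the broker URL and topic are placeholders):

val mqttEvents = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option("topic", "sensors")             // placeholder topic
  .load("tcp://broker.example.com:1883")  // placeholder broker URL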
Hi,
I am looking to use Spark to help execute queries against a reasonably
large dataset (1 billion rows). I'm a bit lost among all the different
libraries and add-ons for Spark, and am looking for some direction as to what
I should look at and what may be helpful.
A couple of relevant points:
- The
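At that scale, a common starting point is plain Spark SQL over a columnar
format, before reaching for any add-on libraries. A hedged sketch (paths and
column names are invented for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("adhoc-queries").getOrCreate()

// Parquet is columnar, so queries read only the columns they touch.
val events = spark.read.parquet("hdfs:///data/events")  // hypothetical path
events.createOrReplaceTempView("events")

spark.sql(
  """SELECT customer_id, count(*) AS n
    |FROM events
    |GROUP BY customer_id
    |ORDER BY n DESC
    |LIMIT 20""".stripMargin).show()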
anyone? please? is this getting any priority?
On Tue, Sep 27, 2016 at 3:38 PM, Ofer Eliassaf
wrote:
> Is there any plan to support python spark running in "cluster mode" on a
> standalone deployment?
>
> There is this famous survey mentioning that more than 50% of the
A better forum would be
https://groups.google.com/forum/#!forum/spark-jobserver
or
https://gitter.im/spark-jobserver/spark-jobserver
Regards,
Noorul
Madabhattula Rajesh Kumar writes:
> Hi,
>
> I am getting the exception below when I start the job-server
>
>
Hi,
I am getting the exception below when I start the job-server:
./server_start.sh: line 41: kill: (11482) - No such process
Please let me know how to resolve this error.
Regards,
Rajesh
Thanks,
very useful!
On Sun, Mar 5, 2017 at 4:55 AM, Yuhao Yang wrote:
>
> Sharing some snippets I accumulated while developing with Apache Spark
> DataFrames (Datasets). Hope they can help you in some way.
>
> https://github.com/hhbyyh/DataFrameCheatSheet.
>
>
Just as a best practice, DataFrames and Datasets are the preferred way, so try
not to resort to RDDs unless you absolutely have to...
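For instance, an aggregation that might tempt one to drop down to RDDs stays
in the optimized API (the input path is a placeholder):

import org.apache.spark.sql.functions.col

// Catalyst/Tungsten can optimize this; a hand-rolled RDD version cannot be.
val events = spark.read.json("hdfs:///data/events.json")  // hypothetical input
events.groupBy("userId").count().orderBy(col("count").desc).show()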
On Sun, 5 Mar 2017 at 7:10 pm, khwunchai jaengsawang
wrote:
> Hi Old-Scool,
>
>
> For the first question, you can specify the number of partitions in
Hi Old-Scool,
For the first question, you can specify the number of partitions in any
DataFrame by using repartition(numPartitions: Int, partitionExprs: Column*).
Example:
val partitioned = data.repartition(numPartitions = 10).cache()
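The second parameter list also lets you partition by column expressions,
which co-locates rows that share a key (the column name is an assumption):

import org.apache.spark.sql.functions.col

// Hash-partitions into 10 partitions by customer_id, so equal keys land together.
val byKey = data.repartition(10, col("customer_id")).cache()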
For your second question, you can transform your RDD