It depends on what your use case is. A generic benchmark does not make sense, because
they are different technologies for different purposes.
> On 23 Sep 2016, at 06:09, ayan guha wrote:
>
> Hi
>
> Is there any benchmark or point of view in terms of pros and cons between AWS
> Redshift vs Spark SQL
>> >> A JVM is always going to use a little
>> >> off-heap memory by itself, so setting a max heap size of 2GB means the
>> >> JVM process may use a bit more than 2GB of memory. With an off-heap
>> >> intensive app like Spark it can be a lot more.
>> >>
>
> There's a built-in 10% overhead, so that if you ask for a 3GB executor
> it will ask for 3.3GB from YARN. You can increase the overhead.
>
> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke
> wrote:
> > All off-heap memory is still managed by the JVM process.
You should also take into account that Spark has different options to represent
data in-memory, such as Java serialized objects, Kryo serialized objects, Tungsten
(columnar, optionally compressed) etc. The Tungsten representation depends heavily on the
underlying data and sorting, especially if compressed.
Then, yo
All off-heap memory is still managed by the JVM process. If you limit the
memory of this process then you limit the memory. I think the memory of the JVM
process could be limited via the -Xms/-Xmx parameters of the JVM. This can be
configured via the Spark options for YARN (be aware that they are diffe
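For illustration, a rough sketch of how these limits could be set on YARN at submit time (the values are placeholders, not recommendations; note that Spark derives the executor's -Xmx from spark.executor.memory itself):

spark-submit \
  --master yarn \
  --conf spark.executor.memory=3g \
  --conf spark.yarn.executor.memoryOverhead=512 \
  --conf spark.executor.extraJavaOptions=-Xms1g \
  ...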
I think there might still be something messed up with the classpath. It
complains in the logs about deprecated jars and deprecated configuration files.
> On 21 Sep 2016, at 22:21, Mich Talebzadeh wrote:
>
> Well I am left to use Spark for importing data from RDBMS table to Hadoop.
>
> You may
Do you mind sharing what your software does? What is the input data size? What
is the Spark version and which APIs are used? How many nodes? What is the input data
format? Is compression used?
> On 21 Sep 2016, at 13:37, Trinadh Kaja wrote:
>
> Hi all,
>
> how to increase spark performance, I am using
I am not sure what you are trying to achieve here. Can you please tell us what the
goal of the program is? Maybe with some example data?
Besides this, I have the feeling that it will fail once it is not used in a
single-node scenario, due to the reference to the global counter variable.
Also unclear wh
Ignite has a special cache for HDFS data (which is not a Java cache), for RDDs
etc. So you are right, it is in this sense very different.
Besides caching, what I see from data scientists is that for interactive
queries and model evaluation they do not browse the complete data anyway. Even
in Tableau you can use the in-memory facilities of the Tableau server.
As said, Apache Ignite could be one way. You can also use it to make Hive
tables in-memory. While reducing IO can make sense, I do not think you will
see so much difference in production systems (at least not 20x). If the
Hi,
I recommend that the third-party application put an empty file with the same
filename as the original file, but with the extension ".uploaded". This is an
indicator that the file has been fully (!) written to the filesystem. Otherwise you
risk reading only parts of the file.
Then, you can have a file sy
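A minimal sketch (Scala, paths are hypothetical) of how a consumer could check for the marker before reading:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val dataFile = new Path("/landing/report.csv")          // hypothetical path
val marker = new Path(dataFile.toString + ".uploaded")  // marker dropped by the third party
if (fs.exists(marker)) {
  val lines = sc.textFile(dataFile.toString)            // safe to read: upload is complete
  // ... process lines
}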
Hmm, is it just a lookup and the values are small? I do not think that in this
case Redis needs to be installed on each worker node. Redis has a rather
efficient protocol, hence one or a few dedicated Redis nodes are probably more
than enough for your purpose. Just try to reuse connections and do not
Hi,
An alternative to Spark could be Flume to store data from Kafka to HDFS. It
also provides some reliability mechanisms, has been explicitly designed for
import/export, and is tested. Not sure if I would go for Spark Streaming if the
use case is only storing, but I do not have the full pict
It could be that by using the RDD it converts the data from the internal format
to Java objects (-> much more memory is needed), which may lead to spill-over
to disk. This conversion takes a lot of time. Then, you need to transfer these
Java objects via the network to one single node (repartition ..
Hi,
DataFrames are more efficient if you have Tungsten activated as the underlying
processing engine (normally by default). However, this only speeds up
processing; saving, as an IO-bound operation, not necessarily.
What exactly is slow? The write?
You could use myDF.write.save()...
Howe
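For reference, a plain DataFrame save looks roughly like this (format and path are only illustrative):

myDF.write
  .format("parquet")        // or "orc"; columnar formats benefit from the Tungsten layout
  .mode("overwrite")
  .save("/tmp/mydf_out")    // hypothetical path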
I fear the issue is that this will create and destroy an XML parser object 2 million
times, which is very inefficient - it does not really look like a parser
performance issue. Can't you do something about the format choice? Ask your
supplier to deliver another format (ideally Avro or something like this)?
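If the format cannot be changed, one common mitigation (not from this mail, just a sketch assuming an RDD[String] of XML documents) is to create the parser once per partition instead of once per record:

import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

val parsed = xmlRecords.mapPartitions { records =>
  // build the parser once per partition and reuse it for every record
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  records.map { xml =>
    builder.reset()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    doc.getDocumentElement.getTagName   // illustrative: extract something from the document
  }
}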
Depends on your data...
How did you split training and test set?
How well does the model fit the data?
You could of course also try to feed more data into the model.
Have you considered alternative machine learning models?
I do not think this is a Spark problem, but you should ask the mac
Both are part of the heap.
> On 16 Aug 2016, at 04:26, Lan Jiang wrote:
>
> Hello,
>
> My understanding is that YARN executor container memory is based on
> "spark.executor.memory" + “spark.yarn.executor.memoryOverhead”. The first one
> is for heap memory and second one is for offheap memory.
Depends on the size of the arrays, but is what you want to achieve similar
to a join?
> On 15 Aug 2016, at 20:12, Eric Ho wrote:
>
> Hi,
>
> I've two nested-for loops like this:
>
> for all elements in Array A do:
>
> for all elements in Array B do:
>
> compare a[3] with b[4] see if they
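If the comparison is in fact an equality check, a keyed join avoids the full nested loop; a rough sketch (field indices taken from the pseudocode above, record types assumed to be indexable, e.g. arrays):

val keyedA = rddA.map(a => (a(3), a))   // key each record of A by the compared field
val keyedB = rddB.map(b => (b(4), b))
val pairs = keyedA.join(keyedB)         // yields (key, (a, b)) for all a(3) == b(4)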
Use a format that has built-in indexes, such as Parquet or ORC. Do not forget
to sort the data on the columns that you filter on.
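A minimal sketch (DataFrame API, column names and path are hypothetical) of writing data sorted so the built-in min/max indexes stay selective:

df.sort("country", "event_date")       // sort on the columns used in filters
  .write
  .format("orc")                       // or "parquet"
  .save("/warehouse/events_sorted")    // hypothetical path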
> On 14 Aug 2016, at 05:03, Taotao.Li wrote:
>
>
> hi, guys, does Spark SQL support indexes? if so, how can I create an index
> on my temp table? if not, how can
If you need to use single inserts, updates, deletes, selects, why not use HBase
with Phoenix? I see it as complementary to the Hive / warehouse offering.
> On 02 Aug 2016, at 22:34, Mich Talebzadeh wrote:
>
> Hi,
>
> I decided to create a catalog table in Hive ORC and transactional. That table
Why don't you write your own Hadoop FileInputFormat? It can be used by Spark...
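A rough sketch of how a custom input format plugs into Spark (ImageInputFormat is a hypothetical class you would implement, e.g. extending FileInputFormat[LongWritable, BytesWritable]):

import org.apache.hadoop.io.{BytesWritable, LongWritable}

val images = sc.newAPIHadoopFile(
  "/data/raw_images",          // hypothetical path
  classOf[ImageInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])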
> On 28 Jul 2016, at 20:04, jtgenesis wrote:
>
> Hey all,
>
> I was wondering what the best course of action is for processing an image
> that has an involved internal structure (file headers, sub-headers, image
> d
is by Hortonworks, so the battle of file formats continues...
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty
>>> wrote:
>>>
>>> Seems like parquet format is better comparatively to orc when the dataset
>>> is
>>>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>>>> [2].
>>>>>
>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>>
>>>>> [1]https://parquet
I think both are very similar, but with slightly different goals. While they
work transparently for each Hadoop application, you need to enable specific
support in the application for predicate push-down.
In the end you have to check which application you are using and do some tests
(with correc
Well, as far as I know there is some update statement planned for Spark, but not
sure for which release. You could alternatively use Hive+ORC.
Another alternative would be to add the deltas in a separate file and, when
accessing the table, filter out the duplicate entries. From time to time you
could
I am not sure if I exactly understand your use case, but for my Hadoop/Spark
format that reads the Bitcoin blockchain I extend from FileInputFormat. I use
the default split mechanism. This could mean that I split in the middle of a
bitcoin block, which is no issue, because the first split can r
I think the comparison with the Oracle RDBMS and Oracle TimesTen is not so good.
There are times when the in-memory database of Oracle is slower than the RDBMS
(especially in the case of Exadata), due to the issue that in-memory - as in Spark -
means everything is in memory and everything is always pro
er way getting the same result. However, my concerns:
>>
>> Spark has a wide user base. I judge this from Spark user group traffic
>> TEZ user group has no traffic I am afraid
>> LLAP I don't know
>> Sounds like Hortonworks promote TEZ and Cloudera does not wa
> "lastName":"Doe"
> },
> {
> "firstName":"Anna",
>"lastName":"Smith"
> },
> {
>"firstName":"Peter",
> "lastName":"Jones"
Memory fragmentation? Quite common with in-memory systems.
> On 08 Jul 2016, at 08:56, aasish.kumar wrote:
>
> Hello everyone:
>
> I have been facing a problem associated spark streaming memory.
>
> I have been running two Spark Streaming jobs concurrently. The jobs read
> data from Kafka with
This does not necessarily need to be the case. If you look at the Hadoop
FileInputFormat architecture, you can even split large multi-line JSONs
without issues. I would need to have a look at it, but one large file does not
mean one executor, independent of the underlying format.
> On 07 Jul 2016,
Still, you need SparkR.
> On 29 Jun 2016, at 19:14, John Aherne wrote:
>
> Microsoft Azure has an option to create a spark cluster with R Server. MS
> bought RevoScale (I think that was the name) and just recently deployed it.
>
>> On Wed, Jun 29, 2016 at 10:53 AM, Xinh Huynh wrote:
>> There is
eh
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destru
eh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data
Bzip2 is splittable for text files.
Btw, for ORC the question of splittability does not matter, because each stripe is
compressed individually.
Have you tried Tez? As far as I recall (at least it was the case in the first version
of Hive), MR uses a single reducer for ORDER BY, which is a bottleneck.
Do you
A DataFrame uses a more efficient binary representation to store and persist
data. You should go for that one in most cases. An RDD is slower.
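An illustrative contrast (a sketch, not a benchmark):

df.cache()       // kept in the compact binary (Tungsten) representation
df.rdd.cache()   // kept as deserialized Java objects, typically far larger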
> On 27 Jun 2016, at 07:54, Brandon White wrote:
>
> What is the difference between persisting a dataframe and a rdd? When I
> persist my RDD, the UI
e at https://twitter.com/jaceklaskowski
>
>
>> On Fri, Jun 24, 2016 at 10:14 AM, Jörn Franke wrote:
>> I would push the Spark people to provide equivalent functionality . In the
>> end it is a deserialization/serialization process which should not be done
>&g
I would push the Spark people to provide equivalent functionality. In the end
it is a deserialization/serialization process which should not be done back and
forth because it is one of the more costly aspects during processing. It needs
to convert Java objects to a binary representation. It is
cessing as we do get 300k
> messages per sec , so lookup will slow down.
>
> Thanks
> Sandesh
>
>> On Wed, Jun 22, 2016 at 3:28 PM, Jörn Franke wrote:
>>
>> Spark Streamig does not guarantee exactly once for output action. It means
>> that one item is only
Spark Streaming does not guarantee exactly-once for output actions. It means that
one item is only processed once within an RDD.
You can achieve at-most-once or at-least-once.
You could however do at-least-once (via checkpointing) and record which messages
have been processed (some identifier available?) and do
I would import the data via Sqoop and put it on HDFS. It has some mechanisms to
handle the lack of reliability of JDBC.
Then you can process the data via Spark. You could also use the JDBC RDD, but I do
not recommend using it, because you do not want to pull data all the time out
of the database when
If you insert the data sorted then there is no need to bucket the data.
You can even create an index in Spark. Simply set the output format
configuration orc.create.index = true.
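A sketch of that idea (the property name is taken from above; whether it is picked up depends on the Spark and ORC versions, and the column and path are hypothetical):

sc.hadoopConfiguration.set("orc.create.index", "true")
df.sort("customer_id")                 // insert sorted on the filter column
  .write
  .format("orc")
  .save("/warehouse/orders_orc")       // hypothetical path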
> On 20 Jun 2016, at 09:10, Mich Talebzadeh wrote:
>
> Right, you concern is that you expect storeindex in ORC fil
I agree here.
However, it always depends on your use case!
Best regards
> On 16 Jun 2016, at 04:58, Gourav Sengupta wrote:
>
> Hi Mahender,
>
> please ensure that for dimension tables you are enabling the broadcast
> method. You must be able to see surprising gains @12x.
>
> Overall I th
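For reference, the broadcast method mentioned above can be hinted explicitly in the DataFrame API (table and column names are hypothetical):

import org.apache.spark.sql.functions.broadcast

val joined = factDF.join(broadcast(dimDF), "dim_key")   // ship the small dimension table to all executors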
What volume do you have? Why do you not use the corresponding Cassandra
functionality directly?
If you do it once and not iteratively in-memory, you cannot expect so much
improvement.
> On 15 Jun 2016, at 16:01, nikita.dobryukha wrote:
>
> We use Cassandra 3.5 + Spark 1.6.1 in 2-node cluster
You do not describe use cases, but technologies. First be aware of your needs
and then check technologies.
Otherwise nobody can help you properly and you will end up with an inefficient
stack for your needs.
> On 14 Jun 2016, at 00:52, KhajaAsmath Mohammed
> wrote:
>
> Hi,
>
> In my current
>>>> An enterprise search server with a REST-like API. You
>>>> put documents in it (called "indexing") via JSON, XML, CSV or binary over
>>>> HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary
>>>> results.
>>>>
>>>> thanks
Before hardware optimization there is always software optimization.
Are you using Dataset / DataFrame? Are you using the right data types (e.g. int
where int is appropriate; try to avoid string and char etc.)?
Do you extract only the stuff needed? What are the algorithm parameters?
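A small illustrative example of the data type point (column names are hypothetical), casting early in the pipeline:

val typed = df
  .withColumn("age", df("age").cast("int"))                 // int instead of string
  .withColumn("event_date", df("event_date").cast("date"))  // proper date type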
> On 07 Jun 201
> I presume this is a typical question.
>
> You mentioned Spark ml (machine learning?) . Is that something viable?
>
> Cheers
>
Spark ML Support Vector Machines or neural networks could be candidates.
For unsupervised learning it could be clustering.
For doing a graph analysis on the followers you can easily use Spark GraphX.
Keep in mind that each tweet contains a lot of metadata (location, followers
etc.) that is more or
Or combine both! It is possible with Spark Streaming to combine streaming data
and data on HDFS. In the end it always depends on what you want to do and when
you need what.
> On 03 Jun 2016, at 10:26, Mich Talebzadeh wrote:
>
> I use twitter data with spark streaming to experiment with twitter data.
>>> an email to the Hive user group to see if anyone has managed to
>>> build a vendor-independent version.
>>>
Well, if you require R then you need to install it (including all additional
packages) on each node. I am not sure why you store the data in Postgres.
Storing it in Parquet or ORC in HDFS is sufficient (sorted on relevant
columns), and you use the SparkR libraries to access them.
> On 30 May 2
h TEZ) or use Impala instead of Hive
> etc as I am sure you already know.
final sentence about this. Both systems
develop and change.
> On 25 May 2016, at 22:14, Reynold Xin wrote:
>
>
>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke wrote:
>> Spark is more for machine learning working iteratively over the whole same
>> dataset in memory. Ad
Hive has a little bit more emphasis on the case where the data that is queried
is much bigger than the available memory, or where you need to query many different
small data subsets or, more recently, interactive queries (LLAP etc.).
Spark is more for machine learning working iteratively over the whole sa
Fuzzy match logic.
>
> How can use map/reduce operations across 2 rdds ?
>
> Thanks,
> Padma Ch
>
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke wrote:
>>
>> Alternatively depending on the exact use case you may employ solr on Hadoop
>> for text
Alternatively, depending on the exact use case, you may employ Solr on Hadoop for
text analytics.
> On 25 May 2016, at 12:57, Priya Ch wrote:
>
> Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD B of
> strings as {"padma","hihi","chch","priya"}. For every string rdd A i nee
No, this is not needed; look at the map/reduce operations and the standard
Spark word count.
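One possible shape, as a sketch only (not the word-count approach itself; it assumes RDD B is small enough to collect and that the check is a substring match as in the example below):

val candidates = sc.broadcast(rddB.collect())
val matches = rddA.flatMap { a =>
  candidates.value.filter(b => b.contains(a)).map(b => (a, b))   // e.g. "hi" -> "hihi", "ch" -> "chch"
}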
> On 25 May 2016, at 12:57, Priya Ch wrote:
>
> Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD B of
> strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
>
What is the use case of this? A Cartesian product is by definition slow in any
system. Why do you need this? How long does your application take now?
> On 25 May 2016, at 12:42, Priya Ch wrote:
>
> I tried
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even
> this i
Hi Mich,
I think these comparisons are useful. One interesting aspect could be hardware
scalability in this context. Additionally, different types of computations.
Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have
the gut feeling that each one can be justified by di
Do you want to replace ELK with Spark? Depending on your queries you could do as
you proposed. However, many of the text analytics queries will probably be much
faster on ELK. If your queries are more interactive and not about batch
processing then it does not make so much sense. I am not sure why
14,000 partitions seem to be way too many to be performant (except for large
data sets). How much data does one partition contain?
> On 22 May 2016, at 09:34, SRK wrote:
>
> Hi,
>
> In my Spark SQL query to insert data, I have around 14,000 partitions of
> data which seems to be causing memory
What is the motivation to use such an old version of Hive? This will lead to
lower performance and other risks.
> On 21 May 2016, at 01:57, "kali.tumm...@gmail.com"
> wrote:
>
> Hi All ,
>
> Is there a way to ask spark and spark-sql to use Hive 0.14 version instead
> of inbuilt hive 1.2.1.
>
Do you have the full source code? Why do you convert a DataFrame to an RDD? This
does not make sense to me.
> On 18 May 2016, at 06:13, Mohanraj Ragupathiraj wrote:
>
> I have created a DataFrame from a HBase Table (PHOENIX) which has 500 million
> rows. From the DataFrame I created an RDD of J
I do not recommend large windows. You can have small windows, store the data
and then do the reports for one hour or one day on stored data.
> On 09 May 2016, at 05:19, "kramer2...@126.com" wrote:
>
> We have some stream data need to be calculated and considering use spark
> stream to do it.
>
Look at the lambda architecture.
What is the motivation of your migration?
> On 04 May 2016, at 03:29, Tapan Upadhyay wrote:
>
> Hi,
>
> We are planning to move our adhoc queries from teradata to spark. We have
> huge volume of queries during the day. What is best way to go about it -
>
> 1) Re
In Spark it is not different compared to any other program. However, a web service
and JSON are probably not very suitable for large data volumes.
> On 03 May 2016, at 04:45, KhajaAsmath Mohammed
> wrote:
>
> Hi,
>
> I am working on a project to pull data from sprinklr for every 15 minutes and
>
Hello,
Spark is a general framework for distributed in-memory processing. You can
always write a highly-specialized piece of code which is faster than Spark, but
then it can do only one thing, and if you need something else you will have to
rewrite everything from scratch. This is why Spark is be
You are oversimplifying here and some of your statements are not correct. There
are also other aspects to consider. Finally, it would be better to support him
with the problem, because Spark supports Java. Java and Scala run on the same
underlying JVM.
> On 02 May 2016, at 17:42, Gourav Sengupt
I do not know your data, but it looks like you have too many partitions for
such a small data set.
> On 26 Apr 2016, at 00:47, Imran Akbar wrote:
>
> Hi,
>
> I'm running a simple query like this through Spark SQL:
>
> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND
> dt_y
You can call any Java/Scala library from R using the package rJava.
> On 25 Apr 2016, at 19:16, ankur.jain wrote:
>
> Hello Team,
>
> Is there any way to call spark code (scala/python) from R?
> I want to use Cloudera spark-ts api with SparkR, if anyone had used that
> please let me know.
>
>
Well, it could also depend on the receiving database. You should also check the
executors. Updating to the latest version of the JDBC driver and to JDK 8, if
supported by the JDBC driver, could help.
> On 20 Apr 2016, at 00:14, Jonathan Gray wrote:
>
> Hi,
>
> I'm trying to write ~60 million rows from
Python can access the JVM - this is how it interfaces with Spark. Some of the
components do not have a wrapper for the corresponding Java API yet and thus
are not accessible in Python.
Same for Elasticsearch. You need to write a more or less simple wrapper.
> On 20 Apr 2016, at 09:53, "kramer2...
I do not think there is a simple how-to for this. First you need to be clear about
volumes in storage, in transit and in processing. Then you need to be aware of
what kind of queries you want to do. Your assumption of milliseconds for the
expected data volumes currently seems to be unrealistic. Howev
I think the easiest would be to use a Hadoop Windows distribution, such as
Hortonworks. However, the Linux version of Hortonworks is a little bit more
advanced.
> On 18 Apr 2016, at 14:13, My List wrote:
>
> Deepak,
>
> The following could be a very dumb questions so pardon me for the same.
>
What is your exact set of requirements for algo trading? Is it reacting in
real-time or analysis over a longer time? In the first case, I do not think a
framework such as Spark or Flink makes sense. They are generic, but in order to
compete with other, usually custom-developed, highly-specialized en
You could also explore the in-memory database of 12c. However, I am not sure
how beneficial it is for OLTP scenarios.
I am excited to see how the performance will be with HBase as a Hive metastore.
Nevertheless, your results on Oracle/SSD will be beneficial for the community.
> On 17 Apr 2016,
Generally a recommendation (besides the issue): do not store dates as strings. I
recommend making them ints here. It will be much faster in both cases.
It could be that you load them differently into the tables. Generally, for these
tables you should insert the data sorted into the tables in both cases
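A small sketch of the dates-as-ints idea (the column name is hypothetical):

import org.apache.spark.sql.functions.date_format

val withIntDate = df.withColumn(
  "event_date_int",
  date_format(df("event_date"), "yyyyMMdd").cast("int"))   // e.g. 2016-04-17 becomes 20160417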
You could use a different format and the Dataset or DataFrame instead of the RDD.
> On 14 Apr 2016, at 23:21, Bibudh Lahiri wrote:
>
> Hi,
> As part of a larger program, I am extracting the distinct values of some
> columns of an RDD with 100 million records and 4 columns. I am running Spark
>
I do not think so. Hadoop provides an ecosystem in which you can deploy
different engines, such as MR, HBase, Tez, Spark, Flink, TitanDB, Hive, Solr...
I also observe that commercial analytical tools use one or more of these
engines to execute their code in a distributed fashion. You need this
a.
>
> Why is the discussion about using anything other than SQOOP still so
> wonderfully on?
>
>
> Regards,
> Gourav
>
>> On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke wrote:
>> Actually I was referring to having an external table in Oracle, which is
>> use
Is the host in /etc/hosts?
> On 13 Apr 2016, at 07:28, Amit Singh Hora wrote:
>
> I am trying to access directory in Hadoop from my Spark code on local
> machine.Hadoop is HA enabled .
>
> val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
> val sc=new SparkContext(conf)
y from it… you can do a very simple bulk load/unload process. However
> you need to know the file’s format.
>
> Not sure what IBM or Oracle has done to tie their RDBMs to Big Data.
>
> As I and other posters to this thread have alluded to… this would be a block
> bulk load/u
ell. It is using JDBC for each connection between data-nodes and their
>>>>> AMP (compute) nodes. There is an additional layer that coordinates all of
>>>>> it.
>>>>> I know Oracle has a similar technology I've used it and had to supply the
>&
. ;-)
>
> Just saying. ;-)
>
> -Mike
>
>> On Apr 5, 2016, at 10:44 PM, Jörn Franke wrote:
>>
>> I do not think you can be more resource efficient. In the end you have to
>> store the data anyway on HDFS . You have a lot of development effort for
>&
ng ingestion, if possible. Also, I can then use Spark stand alone cluster
> to ingest, even if my hadoop cluster is heavily loaded. What you guys think?
>
>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke wrote:
>> Why do you want to reimplement something which is already there?
&g
Why do you want to reimplement something which is already there?
> On 06 Apr 2016, at 06:47, ayan guha wrote:
>
> Hi
>
> Thanks for reply. My use case is query ~40 tables from Oracle (using index
> and incremental only) and add data to existing Hive tables. Also, it would be
> good to have an
If you check the newest Hortonworks distribution then you see that it generally
works. Maybe you can borrow some of their packages. Alternatively, it should also
be available in other distributions.
> On 26 Mar 2016, at 22:47, Mich Talebzadeh wrote:
>
> Hi,
>
> I am running Hive 2 and now Spar
I am not 100% sure of the root cause, but if you need RDD caching then look at
Apache Ignite or similar.
> On 24 Mar 2016, at 16:22, Daniel Imberman wrote:
>
> Hi Takeshi,
>
> Thank you for getting back to me. If this is not possible then perhaps you
> can help me with the root problem that c
How much data are you querying? What is the query? How selective is it supposed
to be? What is the block size?
> On 16 Mar 2016, at 11:23, Joseph wrote:
>
> Hi all,
>
> I have known that ORC provides three level of indexes within each file, file
> level, stripe level, and row level.
> The fi
minal_type = 25080;
> select * from gprs where terminal_type = 25080;
>
> In the gprs table, the "terminal_type" column's value is in [0, 25066]
>
> Joseph
>
I am not sure about this. At least Hortonworks provides its distribution with
Hive and Spark 1.6
> On 14 Mar 2016, at 09:25, Mich Talebzadeh wrote:
>
> I think the only version of Spark that works OK with Hive (Hive on Spark
> engine) is version 1.3.1. I also get OOM from time to time and have