case?
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 5:59 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> That's my interpretation.
>>
>> On Mon, May 2, 2016 at 9:45 AM, Buntu Dev <buntu...@gmail.com> wrote:
>>
not defined.
>
> On Sat, May 7, 2016 at 11:48 PM, Buntu Dev <buntu...@gmail.com> wrote:
> > I'm using the pyspark DataFrame API to sort by a specific column and then
> > saving the dataframe as a Parquet file. But the resulting Parquet file
> > doesn't seem to be sorted.
I'm using the pyspark DataFrame API to sort by a specific column and then
saving the dataframe as a Parquet file. But the resulting Parquet file
doesn't seem to be sorted.
Applying the sort and doing a head() on the result shows the correct rows
sorted by the 'value' column in descending order, as shown below:
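A minimal pyspark sketch of the sort-and-save flow being described, assuming a DataFrame with a 'value' column as in the post; the input and output paths are placeholders:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

sc = SparkContext(appName="sort-and-save")  # in the pyspark shell, sc already exists
sqlContext = SQLContext(sc)

# Load the source data (placeholder path).
df = sqlContext.read.parquet("/path/to/source")

# Sort by the 'value' column in descending order; head() shows the expected
# ordering on the driver.
sorted_df = df.sort(col("value").desc())
print(sorted_df.head(5))

# Write the sorted frame out as Parquet. The output is split into multiple
# part files, and row order is not guaranteed when the files are read back.
sorted_df.write.parquet("/path/to/dest")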
this?
On Mon, May 2, 2016 at 6:21 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Please consider decreasing block size.
>
> Thanks
>
> > On May 1, 2016, at 9:19 PM, Buntu Dev <buntu...@gmail.com> wrote:
> >
> > I got a 10g limit on the executors and am operating on a Parquet dataset
> > with a 70M block size and 200 blocks.
I got a 10g limit on the executors and am operating on a Parquet dataset
with a 70M block size and 200 blocks. I keep hitting the memory limits when
doing a 'select * from t1 order by c1 limit 1000000' (i.e., 1M). It works if
I limit to, say, 100k. What are the options to save a large dataset without
hitting the memory limits?
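For context, a rough pyspark sketch of the query being described, plus one commonly suggested variation: keeping the limited result as a DataFrame and writing it straight to Parquet instead of collecting it to the driver. Table and column names are from the post, paths are placeholders, and this is not guaranteed to avoid the memory limit in every setup:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="order-by-limit")
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet("/path/to/t1")  # placeholder path
df.registerTempTable("t1")

# The query from the post: a global sort plus a large LIMIT.
top_rows = sqlContext.sql("SELECT * FROM t1 ORDER BY c1 LIMIT 1000000")

# Writing the result directly keeps it on the executors instead of pulling
# ~1M rows back through the driver with collect()/head().
top_rows.write.parquet("/path/to/top_rows")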
:01 PM, Krishna <research...@gmail.com> wrote:
> I recently encountered similar network-related errors and was able to fix
> them by applying the ethtool updates described here [
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-5085]
>
>
> On Friday, April 29
e error
I would ultimately want to store the result set as parquet. Are there any
other options to handle this?
Thanks!
On Wed, Apr 27, 2016 at 11:10 AM, Buntu Dev <buntu...@gmail.com> wrote:
> I got 14GB of Parquet data and am trying to apply an order by using Spark
> SQL and save the first 1M rows, but it keeps failing with "Connection reset
> by peer: socket write error" on the executors.
I got 14GB of Parquet data and am trying to apply an order by using Spark
SQL and save the first 1M rows, but it keeps failing with "Connection reset
by peer: socket write error" on the executors.
I've allocated about 10g to both the driver and the executors, along with
setting maxResultSize to 10g, but it still fails.
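A small sketch of how the allocation described above is usually expressed as configuration; the 10g values are from the post, and whether they are sufficient for this workload is a separate question:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# The memory settings from the post, expressed as a SparkConf. Note that
# spark.driver.memory generally has to be set before the driver JVM starts
# (e.g. via spark-submit --driver-memory 10g) rather than inside the program.
conf = (SparkConf()
        .set("spark.executor.memory", "10g")
        .set("spark.driver.memory", "10g")
        .set("spark.driver.maxResultSize", "10g"))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)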
> Can you provide a way to reproduce it (could generate fake
> dataset)?
>
> On Sat, Apr 9, 2016 at 4:33 PM, Buntu Dev <buntu...@gmail.com> wrote:
> > I've allocated about 4g for the driver. For the count stage, I notice the
> > Shuffle Write to be 13.9 GB.
> >
> > On
> Looks like the exception occurred on the driver.
>
> Consider increasing the values for the following config:
>
> conf.set("spark.driver.memory", "10240m")
> conf.set("spark.driver.maxResultSize", "2g")
>
> Cheers
>
> On Sat, Apr 9, 20
> Pozdrawiam,
> Jacek Laskowski
>
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Apr 9, 2016 at 7:51 PM, Buntu Dev <buntu...@gmail.com> wrote:
I'm running this motif pattern against 1.5M vertices (5.5 MB) and 10M edges
(60 MB):
tgraph.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)")
I keep running into Java heap space errors:
ERROR actor.ActorSystemImpl: Uncaught fatal error from thread
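For reference, a rough pyspark sketch of how a motif query like the one above is set up with GraphFrames; only the find() pattern is from the post, while the vertex/edge inputs and paths are hypothetical stand-ins:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from graphframes import GraphFrame

sc = SparkContext(appName="motif-find")
sqlContext = SQLContext(sc)

# Hypothetical vertex and edge inputs; GraphFrames expects an "id" column on
# vertices and "src"/"dst" columns on edges.
vertices = sqlContext.read.parquet("/path/to/vertices")  # ~1.5M rows in the post
edges = sqlContext.read.parquet("/path/to/edges")        # ~10M rows in the post

tgraph = GraphFrame(vertices, edges)

# The motif from the post: b has in-edges from both a and c, and c also
# points to d. Motif finding expands into a series of joins, so intermediate
# results can be much larger than the input edge list.
motifs = tgraph.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)")
print(motifs.count())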
I've allocated about 4g for the driver. For the count stage, I notice the
Shuffle Write to be 13.9 GB.
On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR <ndj...@gmail.com> wrote:
> What's the size of your driver?
> On Sat, 9 Apr 2016 at 20:33, Buntu Dev <buntu...@gmail.com> wrote:
of column :-) !
> df.count() gives the number of rows in your data frame.
> df.columns.size gives the number of columns.
>
> Finally, I suggest you check the size of your driver and customize it
> accordingly.
>
> Cheers,
>
> Ardo
>
> Sent from my iPhone
>
> > On 09 Apr 2016, at 19:
t.block.size", "134217728")
sqlContext.setConf("spark.broadcast.blockSize", "134217728")
df.write.parquet("/path/to/dest")
I tried the same with different block sizes but none had any effect. Is
this the right way to set the properties using setConf()?
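As a point of comparison, a sketch of setting the same sizes on the Hadoop configuration instead of through setConf(). The property names below (parquet.block.size, dfs.blocksize) are the usual Hadoop-side ones rather than anything confirmed in this thread, and whether they take effect for a given Spark/Parquet version is worth verifying:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-block-size")
sqlContext = SQLContext(sc)

# parquet.block.size controls the Parquet row-group size and dfs.blocksize the
# HDFS block size; both are read from the Hadoop configuration the writers see,
# so setting them there is an alternative to sqlContext.setConf().
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("parquet.block.size", "134217728")
hadoop_conf.set("dfs.blocksize", "134217728")

df = sqlContext.read.parquet("/path/to/source")  # placeholder path
df.write.parquet("/path/to/dest")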
You may want to read this post regarding Spark with Drools:
http://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/
On Wed, Nov 4, 2015 at 8:05 PM, Daniel Mahler wrote:
> I am not familiar with any rule engines on Spark
Thanks.. I was using Scala 2.11.1 and was able to
use algebird-core_2.10-0.1.11.jar with spark-shell.
On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell i...@ianoconnell.com wrote:
What's the error with the 2.10 version of algebird?
On Thu, Oct 30, 2014 at 12:49 AM, thadude ohpre...@yahoo.com wrote:
one to your project's class path.
Thanks
Best Regards
On Sun, Oct 19, 2014 at 10:18 PM, bdev buntu...@gmail.com wrote:
I built the latest Spark project and I'm running into these errors when
attempting to run the streaming examples locally on the Mac. How do I fix
these errors?
wrote:
I think you have not imported
org.apache.spark.streaming.StreamingContext._ ? This gets you the
implicits that provide these methods.
On Thu, Oct 9, 2014 at 8:40 PM, bdev buntu...@gmail.com wrote:
I'm using KafkaUtils.createStream for the input stream to pull messages
from Kafka, which
impala, then it may allow it if the schema
changes are append-only. Otherwise existing Parquet files have to be
migrated to the new schema.
- Original Message -
From: Buntu Dev buntu...@gmail.com
To: Soumitra Kumar kumar.soumi...@gmail.com
Cc: u...@spark.incubator.apache.org
Sent: Tuesday
Thanks for the update.. I'm interested in writing the results to MySQL as
well. Can you shed some light or share a code sample on how you set up the
driver/connection pool/etc.?
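A rough sketch of the per-partition connection pattern often used for writing Spark results to MySQL; the table and column names, the sample data, and the mysql-connector-python dependency are illustrative assumptions, not details from this thread:

from pyspark import SparkContext
import mysql.connector  # assumes mysql-connector-python is available on the executors


def write_partition(rows):
    # Open one connection per partition so nothing unserializable is shared
    # across executors; a real job would reuse or pool connections per JVM.
    conn = mysql.connector.connect(host="dbhost", user="spark",
                                   password="secret", database="stats")
    cursor = conn.cursor()
    for page, uniques in rows:
        cursor.execute(
            "INSERT INTO page_uniques (page, uniques) VALUES (%s, %s)",
            (page, uniques))
    conn.commit()
    conn.close()


sc = SparkContext(appName="write-to-mysql")
# results_rdd stands in for whatever (page, unique_count) pairs the job produced.
results_rdd = sc.parallelize([("google", 42), ("yahoo", 17)])
results_rdd.foreachPartition(write_partition)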
On Thu, Sep 25, 2014 at 4:00 PM, maddenpj madde...@gmail.com wrote:
Update for posterity, so once again I solved the problem
I'm processing about 10GB of tab-delimited raw data with a few fields (page
and user id, along with the timestamp when the user viewed the page) on a
40-node cluster, using SparkSQL to compute the number of unique visitors per
page at various intervals. I'm currently just reading the data as
I've got a 40-node CDH 5.1 cluster and am attempting to run a simple Spark
app that processes about 10-15GB of raw data, but I keep running into this
error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Each node has 8 cores and 2GB memory. I notice the heap size on the
executors is set to
I'm looking to write a select statement to get a distinct count of userId
grouped by the keyword column on a Parquet-backed SchemaRDD, the equivalent
of:
SELECT keyword, count(distinct(userId)) from table group by keyword
How do I write this using the chained select().groupBy() operations?
Thanks!
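The thread is about the Spark 1.0 SchemaRDD DSL; as a point of reference, a pyspark sketch of the same aggregation with the later DataFrame API, with the column names taken from the post and the path as a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()

# Load the Parquet-backed table (placeholder path).
df = spark.read.parquet("/path/to/table")

# Equivalent of:
#   SELECT keyword, count(distinct(userId)) FROM table GROUP BY keyword
unique_users = (df.groupBy("keyword")
                  .agg(countDistinct("userId").alias("unique_users")))
unique_users.show()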
--
the same thing in 1.0.0 using DSL only. Just curious,
why don't you use the hql() / sql() methods and pass a query string
in?
[1] https://github.com/apache/spark/pull/1211/files
On Thu, Jul 31, 2014 at 2:20 PM, Buntu Dev buntu...@gmail.com wrote:
Thanks Zongheng for the pointer
Thanks Michael for confirming!
On Thu, Jul 31, 2014 at 2:43 PM, Michael Armbrust mich...@databricks.com
wrote:
The performance should be the same using the DSL or SQL strings.
On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev buntu...@gmail.com wrote:
I was not sure if registerAsTable
If you need to run Spark apps through Hue, see if Ooyala's job server helps:
http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/
--
I'm using the spark-shell locally and working on a dataset of size 900MB. I
initially ran into a java.lang.OutOfMemoryError: GC overhead limit exceeded
error and, after researching, set SPARK_DRIVER_MEMORY to 4g.
Now I run into ArrayIndexOutOfBoundsException, please let me know if there
is some way to
Just wanted to add more info.. I was using SparkSQL reading in the
tab-delimited raw data files converting the timestamp to Date format:
sc.textFile("rawdata/*").map(_.split("\t")).map(p => Point(df.format(new
Date(p(0).trim.toLong * 1000L)), p(1), p(2).trim.toInt, p(3).trim.toInt,
p(4).trim.toInt
I wanted to experiment with using Parquet data with SparkSQL. I got some
tab-delimited files and wanted to know how to convert them to Parquet
format. I'm using standalone spark-shell.
Thanks!
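The thread itself uses the Scala spark-shell with a case class (shown a couple of messages down); for comparison, a pyspark sketch of one way to do the same conversion, with column names borrowed from that case class, placeholder paths, and csv reader options that come from later Spark releases than this 2014 thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the tab-delimited files; the schema is inferred and the columns renamed
# to match the Point case class used later in the thread.
df = (spark.read
      .option("sep", "\t")
      .option("inferSchema", "true")
      .csv("/path/to/rawdata/*")
      .toDF("dt", "uid", "kw", "tz", "success", "code"))

# Write the same data back out in Parquet format.
df.write.parquet("/path/to/parquet_out")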
--
Thanks Michael.
If I read in multiple files and attempt to saveAsParquetFile() I get the
ArrayIndexOutOfBoundsException. I don't see this if I try the same with a
single file:
case class Point(dt: String, uid: String, kw: String, tz: Int, success:
Int, code: String )
val point =
That seems to be the issue; when I reduce the number of fields it works
perfectly fine.
Thanks again Michael.. that was super helpful!!
--
Turns out to be an issue with the number of fields being read; one of the
fields might be missing from the raw data file, causing this error. Michael
Armbrust pointed it out in another thread.
--
I could possibly use the Spark API and write a batch app to provide some
per-web-page stats such as views, uniques, etc. The same can be achieved
using SparkSQL, so I wanted to check:
* what are the best practices and pros/cons of either of the approaches?
* Does SparkSQL require registerAsTable for
Now we are storing data directly from Kafka to Parquet.
We are currently using Camus and wanted to know how you went about storing
to Parquet?
--
Hi --
I tried searching for an Eclipse Spark plugin setup for developing with
Spark and there seems to be some information to go on, but I have not seen a
starter app or project to import into Eclipse and try out. Can anyone
please point me to any Scala projects to import into Scala Eclipse
Hi --
New to Spark and trying to figure out how to generate unique counts per
page by date given this raw data:
timestamp,page,userId
1405377264,google,user1
1405378589,google,user2
1405380012,yahoo,user1
..
I can do a groupBy on a field and get the count:
val lines = sc.textFile("data.csv")
val
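The thread works in the Scala spark-shell; for reference, a pyspark sketch of the per-page, per-date unique-user count described above, using the sample data's column names (converting the epoch timestamps with from_unixtime is an assumption about what "by date" means here):

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, from_unixtime, to_date

spark = SparkSession.builder.getOrCreate()

# The sample file from the post has a header row: timestamp,page,userId.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Convert the epoch-seconds timestamp to a date, then count distinct users
# per page per date.
daily_uniques = (df
    .withColumn("date", to_date(from_unixtime("timestamp")))
    .groupBy("date", "page")
    .agg(countDistinct("userId").alias("uniques")))

daily_uniques.show()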
We have CDH 5.0.2, which doesn't include Spark SQL yet; it may only become
available in CDH 5.1, which is yet to be released.
If Spark SQL is the only option then I might need to hack around to add it
to the current CDH deployment, if that's possible.
--
Thanks Nick.
All I'm attempting is to report the number of unique visitors per page by date.
--
Thanks Sean!! That's what I was looking for -- group by on multiple fields.
I'm gonna play with it now. Thanks again!
--