spark streaming disk hit

2015-07-21 Thread Abhishek R. Singh
Is it fair to say that Storm stream processing is completely in memory, whereas 
spark streaming would take a disk hit because of how shuffle works?

Does spark streaming try to avoid disk usage out of the box?

-Abhishek-



Re: spark streaming disk hit

2015-07-21 Thread Abhishek R. Singh
Thanks TD - appreciate the response!

On Jul 21, 2015, at 1:54 PM, Tathagata Das  wrote:

> Most shuffle files are really kept around in the OS's buffer/disk cache, so 
> it is still pretty much in memory. If you are concerned about performance, 
> you have to compare end-to-end performance holistically. You could take a 
> look at this: 
> 
> https://spark-summit.org/2015/events/towards-benchmarking-modern-distributed-streaming-systems/
> 
> On Tue, Jul 21, 2015 at 11:57 AM, Abhishek R. Singh 
>  wrote:
> Is it fair to say that Storm stream processing is completely in memory, 
> whereas spark streaming would take a disk hit because of how shuffle works?
> 
> Does spark streaming try to avoid disk usage out of the box?
> 
> -Abhishek-
> 
> 
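On the "out of the box" part of the question: one knob that does touch disk is the storage level of receiver-based input DStreams, which defaults to a replicated, disk-backed level (MEMORY_AND_DISK_SER_2). Below is a minimal sketch of asking for memory-only storage instead, assuming a plain socket source; most receiver-based createStream variants accept the same optional StorageLevel argument.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("memory-only-stream")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Keep received blocks serialized in memory only -- no replication, no disk.
    // Trade-off: blocks are lost if an executor dies before the batch is processed.
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)

    lines.count().print()
    ssc.start()
    ssc.awaitTermination()

Shuffle output itself still goes through local files, but as noted above those typically stay in the OS page cache.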



Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Abhishek R. Singh
I am not sure the expectation that workers will process multiple partitions in 
parallel really holds, or that having more partitions than workers will 
necessarily help (unless you are memory bound, in which case extra partitions 
mainly shrink the work size per task rather than increase execution parallelism).

[Disclaimer: I am no authority on Spark, but wanted to throw in my spin based on 
my own understanding.]

Nothing official about it :)

-abhishek-

> On Jul 31, 2015, at 1:03 PM, Sujit Pal  wrote:
> 
> Hello,
> 
> I am trying to run a Spark job that hits an external webservice to get back 
> some information. The cluster is 1 master + 4 workers, each worker has 60GB 
> RAM and 4 CPUs. The external webservice is a standalone Solr server, and is 
> accessed using code similar to that shown below.
> 
>> def getResults(keyValues: Iterator[(String, Array[String])]):
>>     Iterator[(String, String)] = {
>>   val solr = new HttpSolrClient()
>>   initializeSolrParameters(solr)
>>   keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>> }
>>
>> myRDD.repartition(10)
>>      .mapPartitions(keyValues => getResults(keyValues))
>  
> The mapPartitions call does some per-partition initialization of the SolrJ 
> client and then hits Solr for each record in the partition via getResults().
> 
> I repartitioned in the hope that this would result in 10 clients hitting Solr 
> simultaneously (I would like to go up to maybe 30-40 simultaneous clients if I 
> can). However, I counted the number of open connections using "netstat -anp | 
> grep ':8983.*ESTABLISHED'" in a loop on the Solr box and observed that Solr 
> has a constant 4 clients (i.e., equal to the number of workers) over the 
> lifetime of the run.
> 
> My observation leads me to believe that each worker processes a single stream 
> of work sequentially. However, from what I understand about how Spark works, 
> each worker should be able to process a number of tasks in parallel, and 
> repartition() is a hint for it to do so.
> 
> Is there some SparkConf environment variable I should set to increase 
> parallelism in these workers, or should I just configure a cluster with 
> multiple workers per machine? Or is there something I am doing wrong?
> 
> Thank you in advance for any pointers you can provide.
> 
> -sujit
> 
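One way to test whether per-task concurrency is the limit here is to fan the requests out inside each partition. Here is a minimal sketch, reusing the helpers from the snippet above (initializeSolrParameters and process are the poster's; the Solr URL is a placeholder), and assuming the SolrJ client can safely be shared across the threads of a Scala parallel collection. Note that it materializes each partition in memory.

    def getResultsConcurrently(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
      val solr = new HttpSolrClient("http://solr-host:8983/solr/collection")  // placeholder URL
      initializeSolrParameters(solr)
      // A parallel collection issues several Solr requests at once from a single
      // Spark task, so each task keeps more than one connection busy.
      keyValues.toVector.par
        .map { case (key, values) => (key, process(solr, (key, values))) }
        .seq
        .iterator
    }

    myRDD.repartition(10)
         .mapPartitions(getResultsConcurrently)

Whether this is needed at all depends on how many cores each executor actually gets (for example SPARK_WORKER_CORES in standalone mode); with 4 workers x 4 CPUs the scheduler can in principle run all 10 partition tasks at once, so a steady 4 connections does suggest one running task per worker.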


tachyon

2015-08-07 Thread Abhishek R. Singh
Do people use Tachyon in production, or is it experimental grade still?

Regards,
Abhishek




Re: tachyon

2015-08-07 Thread Abhishek R. Singh
Thanks Calvin - much appreciated!

-Abhishek-

On Aug 7, 2015, at 11:11 AM, Calvin Jia  wrote:

> Hi Abhishek,
> 
> Here's a production use case that may interest you: 
> http://www.meetup.com/Tachyon/events/222485713/
> 
> Baidu is using Tachyon to manage more than 100 nodes in production resulting 
> in a 30x performance improvement for their SparkSQL workload. They are also 
> using the tiered storage feature in Tachyon giving them over 2PB of Tachyon 
> managed space.
> 
> Hope this helps,
> Calvin
> 
> On Fri, Aug 7, 2015 at 10:00 AM, Ted Yu  wrote:
> Looks like you would get better response on Tachyon's mailing list:
> 
> https://groups.google.com/forum/?fromgroups#!forum/tachyon-users
> 
> Cheers
> 
> On Fri, Aug 7, 2015 at 9:56 AM, Abhishek R. Singh 
>  wrote:
> Do people use Tachyon in production, or is it experimental grade still?
> 
> Regards,
> Abhishek
> 
> 
> 
> 



Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread Abhishek R. Singh
You had:

> RDD.reduceByKey((x,y) => x+y)
> RDD.take(3)

Maybe try assigning the result to a new variable - reduceByKey is a 
transformation, so it returns a new RDD rather than modifying the original in 
place:

> rdd2 = RDD.reduceByKey((x,y) => x+y)
> rdd2.take(3)

-Abhishek-

On Aug 20, 2015, at 3:05 AM, satish chandra j  wrote:

> HI All,
> I have data in RDD as mentioned below:
> 
> RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
> 
> 
> I am expecting output as Array((0,3),(1,50),(2,40)), i.e. just a sum of the 
> values for each key.
> 
> Code:
> RDD.reduceByKey((x,y) => x+y)
> RDD.take(3)
> 
> Result in console:
> RDD: org.apache.spark.rdd.RDD[(Int,Int)] = ShuffledRDD[1] at reduceByKey at <console>:73
> res: Array[(Int,Int)] = Array()
> 
> Command as mentioned
> 
> dse spark --master local --jars postgresql-9.4-1201.jar -i  
> 
> 
> Please let me know what is missing in my code, as my resultant Array is empty
> 
> 
> 
> Regards,
> Satish
> 
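For completeness, a self-contained spark-shell version of that suggestion; reduceByKey is a transformation that returns a new RDD, so the reduced values have to be read from the new RDD rather than the original one:

    val rdd = sc.parallelize(Seq((0, 1), (0, 2), (1, 20), (1, 30), (2, 40)))

    // reduceByKey is a transformation: it returns a new RDD and leaves `rdd` untouched.
    val summed = rdd.reduceByKey(_ + _)

    // Actions go against the new RDD.
    summed.collect()   // expected: Array((0,3), (1,50), (2,40))

If the array still comes back empty after that change, the problem is likely elsewhere (for example in how the script is submitted), not in reduceByKey itself.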





Re: Grouping runs of elements in a RDD

2015-06-30 Thread Abhishek R. Singh
Could you use a custom partitioner to preserve boundaries, so that all related 
tuples end up in the same partition?

On Jun 30, 2015, at 12:00 PM, RJ Nowling  wrote:

> Thanks, Reynold.  I still need to handle incomplete groups that fall between 
> partition boundaries. So, I need a two-pass approach. I came up with a 
> somewhat hacky way to handle those using the partition indices and key-value 
> pairs as a second pass after the first.
> 
> OCaml's std library provides a function called group() that takes a break 
> function that operates on pairs of successive elements.  It seems a similar 
> approach could be used in Spark and would be more efficient than my approach 
> with key-value pairs since you know the ordering of the partitions.
> 
> Has this need been expressed by others?  
> 
> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin  wrote:
> Try mapPartitions, which gives you an iterator, and you can produce an 
> iterator back.
> 
> 
> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling  wrote:
> Hi all,
> 
> I have a problem where I have a RDD of elements:
> 
> Item1 Item2 Item3 Item4 Item5 Item6 ...
> 
> and I want to run a function over them to decide which runs of elements to 
> group together:
> 
> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
> 
> Technically, I could use aggregate to do this, but I would have to use a List 
> of List of T which would produce a very large collection in memory.
> 
> Is there an easy way to accomplish this?  E.g., it would be nice to have a 
> version of aggregate where the combination function can return a complete 
> group that is added to the new RDD and an incomplete group which is passed to 
> the next call of the reduce function.
> 
> Thanks,
> RJ
> 
> 
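A per-partition sketch of the break-function idea that mapPartitions enables (first pass only; runs that straddle partition boundaries still need the second, stitching pass described above). The RDD and break predicate in the usage comment are hypothetical.

    // Group consecutive elements of an iterator, starting a new run whenever
    // `break(prev, next)` returns true. Only one run is held in memory at a time.
    def groupRuns[T](iter: Iterator[T])(break: (T, T) => Boolean): Iterator[List[T]] =
      new Iterator[List[T]] {
        private val in = iter.buffered
        def hasNext: Boolean = in.hasNext
        def next(): List[T] = {
          val run  = List.newBuilder[T]
          var prev = in.next()
          run += prev
          while (in.hasNext && !break(prev, in.head)) {
            prev = in.next()
            run += prev
          }
          run.result()
        }
      }

    // Usage (hypothetical RDD and break function):
    // val grouped = items.mapPartitions(it => groupRuns(it)(shouldBreak))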



spark sql error with proto/parquet

2015-04-18 Thread Abhishek R. Singh
I have created a bunch of protobuf based parquet files that I want to 
read/inspect using Spark SQL. However, I am running into exceptions and not 
able to proceed much further:

This succeeds (probably because no action has been triggered yet). I can also 
call printSchema() and count() without any issues:

scala> val df = sqlContext.load("my_root_dir/201504101000", "parquet")

scala> df.select(df("summary")).first

15/04/18 17:03:03 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 27, xxx.yyy.com): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://xxx.yyy.com:8020/my_root_dir/201504101000/0.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: optional group …
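(A diagnostic sketch, not a fix: loading a single file and the whole directory separately and comparing the schemas Spark derives for each can confirm whether the files really disagree, which is what the exception suggests. The paths are the ones from the log above.)

    val one = sqlContext.load("my_root_dir/201504101000/0.parquet", "parquet")
    val all = sqlContext.load("my_root_dir/201504101000", "parquet")

    // If these two trees differ, the directory contains files written with
    // incompatible (e.g. evolved) protobuf-derived schemas.
    println(one.schema.treeString)
    println(all.schema.treeString)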


I could convert my protos into JSON and then back to parquet, but that seems 
wasteful!

Also, I will be happy to contribute and make protobuf work with Spark SQL if I 
can get some guidance/help/pointers. Help appreciated.

-Abhishek-

Re: Dataframes Question

2015-04-18 Thread Abhishek R. Singh
I am no expert myself, but from what I understand DataFrame supersedes 
SchemaRDD. This was done for API stability as Spark SQL matured out of alpha as 
part of the 1.3.0 release.

It is forward looking and brings DataFrame-style syntax that was not available 
with the older SchemaRDD.
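A tiny illustration of that syntax, using the people.json file from the Spark examples that the original mail mentions (Spark 1.3-era API; the output depends on the file contents):

    val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

    // DataFrame-style operations that SchemaRDD did not offer as a stable API:
    df.printSchema()
    df.select("name").show()
    df.filter(df("age") > 21).show()
    df.groupBy("age").count().show()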

On Apr 18, 2015, at 4:43 PM, Arun Patel  wrote:

> Experts,
> 
> I have few basic questions on DataFrames vs Spark SQL.  My confusion is more 
> with DataFrames. 
> 
> 1)  What is the difference between Spark SQL and DataFrames?  Are they the same?
> 2)  Documentation says SchemaRDD is renamed as DataFrame. Does this mean 
> SchemaRDD no longer exists in 1.3?  
> 3)  As per the documentation, it looks like creating a DataFrame is no 
> different from creating a SchemaRDD -  df = 
> sqlContext.jsonFile("examples/src/main/resources/people.json").  
> So, my question is: what is the difference?
> 
> Thanks for your help.  
> 
> Arun

