Re: Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread Yi Tian
Hi, Will

We are planning to start implementing these functions.

We hope to have a general design ready in the coming week.



Best Regards,

Yi Tian
tianyi.asiai...@gmail.com




On Sep 23, 2014, at 23:39, Will Benton  wrote:

> Hi Yi,
> 
> I've had some interest in implementing windowing and rollup in particular for 
> some of my applications but haven't had them on the front of my plate yet.  
> If you need them as well, I'm happy to start taking a look this week.
> 
> 
> best,
> wb
> 
> 
> - Original Message -
>> From: "Yi Tian" 
>> To: dev@spark.apache.org
>> Sent: Tuesday, September 23, 2014 2:47:17 AM
>> Subject: Question about SparkSQL and Hive-on-Spark
>> 
>> Hi all,
>> 
>> I have some questions about the SparkSQL and Hive-on-Spark
>> 
>> Will SparkSQL support all the Hive features in the future, or just treat Hive
>> as a data source for Spark?
>> 
>> Since Spark 1.1.0, the Thrift server supports running HQL on Spark. Will
>> this feature be replaced by Hive on Spark?
>> 
>> The reason for asking these questions is that we found some Hive functions
>> do not run well on SparkSQL (such as window functions, cube, and rollup).
>> 
>> Is it worth making the effort to implement these functions in SparkSQL?
>> Could you guys give some advice?
>> 
>> thank you.
>> 
>> 
>> Best Regards,
>> 
>> Yi Tian
>> tianyi.asiai...@gmail.com
>> 
>> 
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: A couple questions about shared variables

2014-09-23 Thread Sandy Ryza
Filed https://issues.apache.org/jira/browse/SPARK-3642 for documenting
these nuances.

-Sandy

On Mon, Sep 22, 2014 at 10:36 AM, Nan Zhu  wrote:

>  I see, thanks for pointing this out
>
>
> --
> Nan Zhu
>
> On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote:
>
> MapReduce counters do not count duplications.  In MapReduce, if a task
> needs to be re-run, the value of the counter from the second task
> overwrites the value from the first task.
>
> -Sandy
>
> On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu  wrote:
>
>  If you think it is necessary to fix, I would like to resubmit that PR
> (it seems to have some conflicts with the current DAGScheduler).
>
> My suggestion is to make this an option on the accumulator: some
> algorithms that use accumulators for result calculation need a
> deterministic accumulator, while others implementing something like Hadoop
> counters may need the current behavior (count everything that happened,
> including the duplications).
>
> Your thoughts?
>
> --
> Nan Zhu
>
> On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:
>
> Hmm, good point, this seems to have been broken by refactorings of the
> scheduler, but it worked in the past. Basically the solution is simple --
> in a result stage, we should not apply the update for each task ID more
> than once -- the same way we don't call job.listener.taskSucceeded more
> than once. Your PR also tried to avoid this for resubmitted shuffle stages,
> but I don't think we need to do that necessarily (though we could).
>
> Matei
>
> On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com)
> wrote:
>
> Hi, Matei,
>
> Can you give some hints on how the current implementation guarantees that the
> accumulator is only applied once?
>
> There is a pending PR trying to achieve this (
> https://github.com/apache/spark/pull/228/files), but from the current
> implementation, I didn't see that this has been done (maybe I missed something).
>
> Best,
>
> --
> Nan Zhu
>
> On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
>
>  Hey Sandy,
>
> On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com)
> wrote:
>
> Hey All,
>
> A couple questions came up about shared variables recently, and I wanted to
> confirm my understanding and update the doc to be a little more clear.
>
> *Broadcast variables*
> Now that task data is automatically broadcast, the only occasions where it
> makes sense to explicitly broadcast are:
> * You want to use a variable from tasks in multiple stages.
> * You want to have the variable stored on the executors in deserialized
> form.
> * You want tasks to be able to modify the variable and have those
> modifications take effect for other tasks running on the same executor
> (usually a very bad idea).
>
> Is that right?
> Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also
> matters. (We might later factor tasks in a different way to avoid 2, but
> it's hard due to things like Hadoop JobConf objects in the tasks).
>
>
> *Accumulators*
> Values are only counted for successful tasks. Is that right? KMeans seems
> to use it in this way. What happens if a node goes away and successful
> tasks need to be resubmitted? Or the stage runs again because a different
> job needed it.
> Accumulators are guaranteed to give a deterministic result if you only
> increment them in actions. For each result stage, the accumulator's update
> from each task is only applied once, even if that task runs multiple times.
> If you use accumulators in transformations (i.e. in a stage that may be
> part of multiple jobs), then you may see multiple updates, from each run.
> This is kind of confusing but it was useful for people who wanted to use
> these for debugging.
>
> Matei
>
>
>
>
>
> thanks,
> Sandy
>
>
>
>
>
>
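
A minimal sketch of the accumulator behavior discussed in this thread, using the
standard Spark 1.x Scala API (the app name, data, and counts are illustrative; as
Matei notes above, the at-most-once guarantee for result stages may currently be
affected by the scheduler refactoring):

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-variables-sketch"))
    val data = sc.parallelize(1 to 1000, 10)

    // An explicitly broadcast lookup table, reused by tasks in both jobs below.
    val lookup = sc.broadcast((1 to 1000).map(i => i -> (i % 7)).toMap)

    // Accumulator updated in an action: one update per task of the result stage.
    val inAction = sc.accumulator(0)
    data.foreach(x => inAction += lookup.value(x))

    // Accumulator updated in a transformation: the second count() recomputes the
    // uncached map stage, so the counter also reflects the rerun.
    val inTransformation = sc.accumulator(0)
    val mapped = data.map { x => inTransformation += 1; lookup.value(x) }
    mapped.count()
    mapped.count()

    println(s"updated in action:         ${inAction.value}")
    println(s"updated in transformation: ${inTransformation.value}")  // about 2 * 1000 here
    sc.stop()
  }
}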


SPARK-3660 : Initial RDD for updateStateByKey transformation

2014-09-23 Thread Soumitra Kumar
Hello fellow developers,

Thanks TD for relevant pointers.

I have created an issue :
https://issues.apache.org/jira/browse/SPARK-3660

Copying the description from JIRA:
"
How do you initialize the state for the updateStateByKey transformation?

I have word counts from a previous spark-submit run, and want to load them in
the next spark-submit job to continue counting from there.

One proposal is to add the following argument to the updateStateByKey methods:
initial: Option[RDD[(K, S)]] = None

This will maintain backward compatibility as well.

I have working code as well.

This thread started on spark-user list at:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-updateStateByKey-operation-td14772.html
"

Please let me know whether I should add the parameter "initial: Option[RDD[(K, S)]]
= None" to all updateStateByKey methods or create new ones.

Thanks,
-Soumitra.
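
For illustration, a sketch of how the proposed parameter might be used. This is
hypothetical: the "initial" argument below is exactly the proposal above and does
not exist in Spark 1.1, and the path and names are made up.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Ordinary updateStateByKey update function: fold new counts into the running state.
def updateCounts(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

def countsSeededFromPreviousRun(ssc: StreamingContext,
                                pairs: DStream[(String, Int)]): DStream[(String, Int)] = {
  // Word counts written by the previous spark-submit run (hypothetical path).
  val previous: RDD[(String, Int)] =
    ssc.sparkContext.objectFile[(String, Int)]("hdfs:///wordcount/previous-run")

  // Proposed overload: seed the state from `previous` instead of starting empty.
  // Today only pairs.updateStateByKey[Int](updateCounts _) exists; `initial` is the
  // new, backward-compatible default argument being proposed.
  pairs.updateStateByKey[Int](updateCounts _, initial = Some(previous))
}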

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread DB Tsai
Hi Will,

We're also very interested in windowing support in SparkSQL. Let us
know once this is available for testing. Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, Sep 23, 2014 at 8:39 AM, Will Benton  wrote:
> Hi Yi,
>
> I've had some interest in implementing windowing and rollup in particular for 
> some of my applications but haven't had them on the front of my plate yet.  
> If you need them as well, I'm happy to start taking a look this week.
>
>
> best,
> wb
>
>
> - Original Message -
>> From: "Yi Tian" 
>> To: dev@spark.apache.org
>> Sent: Tuesday, September 23, 2014 2:47:17 AM
>> Subject: Question about SparkSQL and Hive-on-Spark
>>
>> Hi all,
>>
>> I have some questions about the SparkSQL and Hive-on-Spark
>>
>> Will SparkSQL support all the Hive features in the future, or just treat Hive
>> as a data source for Spark?
>>
>> Since Spark 1.1.0, the Thrift server supports running HQL on Spark. Will
>> this feature be replaced by Hive on Spark?
>>
>> The reason for asking these questions is that we found some Hive functions
>> do not run well on SparkSQL (such as window functions, cube, and rollup).
>>
>> Is it worth making the effort to implement these functions in SparkSQL?
>> Could you guys give some advice?
>>
>> thank you.
>>
>>
>> Best Regards,
>>
>> Yi Tian
>> tianyi.asiai...@gmail.com
>>
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Priya Ch
Hi,

I am using Spark 1.0.0. In my Spark code I am trying to persist an RDD to
disk with rdd.persist(DISK_ONLY), but unfortunately I couldn't find the
location where the RDD has been written to disk. I pointed
SPARK_LOCAL_DIRS and SPARK_WORKER_DIR at a location other than
the default /tmp directory, but still couldn't see anything in the worker
directory or the Spark local directory.

I also tried specifying the local dir and worker dir from the Spark code
while defining the SparkConf, as conf.set("spark.local.dir",
"/home/padma/sparkdir"), but the directories are not used.


In general, which directories does Spark use for map output files,
intermediate writes, and persisting RDDs to disk?


Thanks,
Padma Ch


Re: OutOfMemoryError on parquet SnappyDecompressor

2014-09-23 Thread Aaron Davidson
This may be related: https://github.com/Parquet/parquet-mr/issues/211

Perhaps if we change our configuration settings for Parquet it would get
better, but the performance characteristics of Snappy are pretty bad here
under some circumstances.

On Tue, Sep 23, 2014 at 10:13 AM, Cody Koeninger  wrote:

> Cool, that's pretty much what I was thinking as far as configuration goes.
>
> Running on Mesos.  Worker nodes are amazon xlarge, so 4 core / 15g.  I've
> tried executor memory sizes as high as 6G
> Default hdfs block size 64m, about 25G of total data written by a job with
> 128 partitions.  The exception comes when trying to read the data (all
> columns).
>
> Schema looks like this:
>
> case class A(
>   a: Long,
>   b: Long,
>   c: Byte,
>   d: Option[Long],
>   e: Option[Long],
>   f: Option[Long],
>   g: Option[Long],
>   h: Option[Int],
>   i: Long,
>   j: Option[Int],
>   k: Seq[Int],
>   l: Seq[Int],
>   m: Seq[Int]
> )
>
> We're just going back to gzip for now, but might be nice to help someone
> else avoid running into this.
>
> On Tue, Sep 23, 2014 at 11:18 AM, Michael Armbrust  >
> wrote:
>
> > I actually submitted a patch to do this yesterday:
> > https://github.com/apache/spark/pull/2493
> >
> > Can you tell us more about your configuration.  In particular how much
> > memory/cores do the executors have and what does the schema of your data
> > look like?
> >
> > On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger 
> > wrote:
> >
> >> So as a related question, is there any reason the settings in SQLConf
> >> aren't read from the spark context's conf?  I understand why the sql
> conf
> >> is mutable, but it's not particularly user friendly to have most spark
> >> configuration set via e.g. defaults.conf or --properties-file, but for
> >> spark sql to ignore those.
> >>
> >> On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger 
> >> wrote:
> >>
> >> > After commit 8856c3d8 switched from gzip to snappy as default parquet
> >> > compression codec, I'm seeing the following when trying to read
> parquet
> >> > files saved using the new default (same schema and roughly same size
> as
> >> > files that were previously working):
> >> >
> >> > java.lang.OutOfMemoryError: Direct buffer memory
> >> > java.nio.Bits.reserveMemory(Bits.java:658)
> >> > java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> >> > java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
> >> >
> >> >
> >>
> parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
> >> >
> >> >
> >>
> parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
> >> > java.io.DataInputStream.readFully(DataInputStream.java:195)
> >> > java.io.DataInputStream.readFully(DataInputStream.java:169)
> >> >
> >> >
> >>
> parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
> >> >
> >> >
> parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
> >> >
> >> >
> >>
> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
> >> >
> >> >
> parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
> >> >
> >> > parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:339)
> >> >
> >> >
> >>
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
> >> >
> >> >
> >>
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
> >> >
> >> >
> >>
> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:265)
> >> >
> >>  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
> >> >
> >>  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
> >> >
> >> >
> >>
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
> >> >
> >> >
> >>
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
> >> >
> >> >
> >>
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
> >> >
> >> >
> >>
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
> >> >
> >> >
> >>
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> >> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >> > scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> >> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >> > scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
> >> > scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
> >> >
> >> >
> >>
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
> >> >
> >> >
> >>
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
> >> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.

Re: OutOfMemoryError on parquet SnappyDecompressor

2014-09-23 Thread Cody Koeninger
Cool, that's pretty much what I was thinking as far as configuration goes.

Running on Mesos.  Worker nodes are Amazon xlarge, so 4 cores / 15 GB.  I've
tried executor memory sizes as high as 6 GB.
Default HDFS block size of 64 MB, and about 25 GB of total data written by a
job with 128 partitions.  The exception comes when trying to read the data
(all columns).

Schema looks like this:

case class A(
  a: Long,
  b: Long,
  c: Byte,
  d: Option[Long],
  e: Option[Long],
  f: Option[Long],
  g: Option[Long],
  h: Option[Int],
  i: Long,
  j: Option[Int],
  k: Seq[Int],
  l: Seq[Int],
  m: Seq[Int]
)

We're just going back to gzip for now, but might be nice to help someone
else avoid running into this.
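
For anyone else hitting this, a minimal sketch of the workaround, i.e. switching the
Parquet writer back to gzip through the SQL conf (assumes an existing SparkContext
`sc`; the SchemaRDD and output path are illustrative):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Override the post-8856c3d8 default of "snappy" for newly written Parquet files.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// Files saved after this point are gzip-compressed again, e.g.:
// someSchemaRDD.saveAsParquetFile("hdfs:///data/table.parquet")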

On Tue, Sep 23, 2014 at 11:18 AM, Michael Armbrust 
wrote:

> I actually submitted a patch to do this yesterday:
> https://github.com/apache/spark/pull/2493
>
> Can you tell us more about your configuration.  In particular how much
> memory/cores do the executors have and what does the schema of your data
> look like?
>
> On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger 
> wrote:
>
>> So as a related question, is there any reason the settings in SQLConf
>> aren't read from the spark context's conf?  I understand why the sql conf
>> is mutable, but it's not particularly user friendly to have most spark
>> configuration set via e.g. defaults.conf or --properties-file, but for
>> spark sql to ignore those.
>>
>> On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger 
>> wrote:
>>
>> > After commit 8856c3d8 switched from gzip to snappy as default parquet
>> > compression codec, I'm seeing the following when trying to read parquet
>> > files saved using the new default (same schema and roughly same size as
>> > files that were previously working):
>> >
>> > java.lang.OutOfMemoryError: Direct buffer memory
>> > java.nio.Bits.reserveMemory(Bits.java:658)
>> > java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>> > java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>> >
>> >
>> parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
>> >
>> >
>> parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
>> > java.io.DataInputStream.readFully(DataInputStream.java:195)
>> > java.io.DataInputStream.readFully(DataInputStream.java:169)
>> >
>> >
>> parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
>> >
>> > parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
>> >
>> >
>> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
>> >
>> > parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
>> >
>> > parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:339)
>> >
>> >
>> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>> >
>> >
>> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>> >
>> >
>> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:265)
>> >
>>  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
>> >
>>  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
>> >
>> >
>> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
>> >
>> >
>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
>> >
>> >
>> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
>> >
>> >
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
>> >
>> >
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> > scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> > scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
>> > scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
>> >
>> >
>> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
>> >
>> >
>> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
>> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> >
>> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >
>>  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>> > org.apache.spark.scheduler.Task.run(Task.scala:54)
>> >
>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>> >
>> >
>> java.util.concurrent.ThreadPoolExecutor.ru

Re: OutOfMemoryError on parquet SnappyDecompressor

2014-09-23 Thread Michael Armbrust
I actually submitted a patch to do this yesterday:
https://github.com/apache/spark/pull/2493

Can you tell us more about your configuration?  In particular, how much
memory and how many cores do the executors have, and what does the schema of
your data look like?

On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger  wrote:

> So as a related question, is there any reason the settings in SQLConf
> aren't read from the spark context's conf?  I understand why the sql conf
> is mutable, but it's not particularly user friendly to have most spark
> configuration set via e.g. defaults.conf or --properties-file, but for
> spark sql to ignore those.
>
> On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger 
> wrote:
>
> > After commit 8856c3d8 switched from gzip to snappy as default parquet
> > compression codec, I'm seeing the following when trying to read parquet
> > files saved using the new default (same schema and roughly same size as
> > files that were previously working):
> >
> > java.lang.OutOfMemoryError: Direct buffer memory
> > java.nio.Bits.reserveMemory(Bits.java:658)
> > java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> > java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
> >
> >
> parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
> >
> >
> parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
> > java.io.DataInputStream.readFully(DataInputStream.java:195)
> > java.io.DataInputStream.readFully(DataInputStream.java:169)
> >
> >
> parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
> >
> > parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
> >
> > parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
> >
> > parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
> >
> > parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:339)
> >
> >
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
> >
> >
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
> >
> >
> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:265)
> >
>  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
> >
>  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
> >
> >
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
> >
> >
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
> >
> >
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
> >
> > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
> >
> >
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> > scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> > scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
> > scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
> >
> >
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
> >
> >
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> >
> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >
>  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> > org.apache.spark.scheduler.Task.run(Task.scala:54)
> >
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
> >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > java.lang.Thread.run(Thread.java:722)
> >
> >
> >
>


Re: RFC: Deprecating YARN-alpha API's

2014-09-23 Thread Tom Graves
Any other comments or objections on this?
Thanks,
Tom

On Tuesday, September 9, 2014 4:39 PM, Chester Chen  wrote:

We were using it until recently; we are talking to our customers to see if
we can get off it.

Chester
Alpine Data Labs



On Tue, Sep 9, 2014 at 10:59 AM, Sean Owen  wrote:

> FWIW consensus from Cloudera folk seems to be that there's no need or
> demand on this end for YARN alpha. It wouldn't have an impact if it
> were removed sooner even.
>
> It will be a small positive to reduce complexity by removing this
> support, making it a little easier to develop for current YARN APIs.
>
> On Tue, Sep 9, 2014 at 5:16 PM, Patrick Wendell 
> wrote:
> > Hi Everyone,
> >
> > This is a call to the community for comments on SPARK-3445 [1]. In a
> > nutshell, we are trying to figure out timelines for deprecation of the
> > YARN-alpha API's as Yahoo is now moving off of them. It's helpful for
> > us to have a sense of whether anyone else uses these.
> >
> > Please comment on the JIRA if you have feedback, thanks!
> >
> > [1] https://issues.apache.org/jira/browse/SPARK-3445
> >
> > - Patrick
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>




Re: Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread Will Benton
Hi Yi,

I've had some interest in implementing windowing and rollup in particular for 
some of my applications but haven't had them on the front of my plate yet.  If 
you need them as well, I'm happy to start taking a look this week.


best,
wb


- Original Message -
> From: "Yi Tian" 
> To: dev@spark.apache.org
> Sent: Tuesday, September 23, 2014 2:47:17 AM
> Subject: Question about SparkSQL and Hive-on-Spark
> 
> Hi all,
> 
> I have some questions about the SparkSQL and Hive-on-Spark
> 
> Will SparkSQL support all the Hive features in the future, or just treat Hive
> as a data source for Spark?
> 
> Since Spark 1.1.0, the Thrift server supports running HQL on Spark. Will
> this feature be replaced by Hive on Spark?
> 
> The reason for asking these questions is that we found some Hive functions
> do not run well on SparkSQL (such as window functions, cube, and rollup).
> 
> Is it worth making the effort to implement these functions in SparkSQL?
> Could you guys give some advice?
> 
> thank you.
> 
> 
> Best Regards,
> 
> Yi Tian
> tianyi.asiai...@gmail.com
> 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: OutOfMemoryError on parquet SnappyDecompressor

2014-09-23 Thread Cody Koeninger
So as a related question, is there any reason the settings in SQLConf
aren't read from the Spark context's conf?  I understand why the SQL conf
is mutable, but it's not particularly user-friendly to have most Spark
configuration set via e.g. defaults.conf or --properties-file, only for
Spark SQL to ignore those.

On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger  wrote:

> After commit 8856c3d8 switched from gzip to snappy as default parquet
> compression codec, I'm seeing the following when trying to read parquet
> files saved using the new default (same schema and roughly same size as
> files that were previously working):
>
> java.lang.OutOfMemoryError: Direct buffer memory
> java.nio.Bits.reserveMemory(Bits.java:658)
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>
> parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
>
> parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readFully(DataInputStream.java:169)
>
> parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
>
> parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
>
> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
>
> parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
>
> parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:339)
>
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>
> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:265)
> parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
> parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
>
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
>
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
>
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
>
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
>
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
> scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
>
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
>
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:722)
>
>
>


RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Shao, Saisai
Hi,

spark.local.dir is the directory used to write map output data and persisted RDD 
blocks, but the file paths are hashed, so you cannot directly recognize the 
persisted RDD block files; they will definitely be in those folders on your 
worker nodes.

Thanks
Jerry
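
A minimal sketch of how this usually fits together (the paths are the ones from the
question and purely illustrative; note that spark.local.dir must be set on the
SparkConf before the SparkContext is created, and that on a standalone cluster
SPARK_LOCAL_DIRS in each worker's environment takes precedence over it):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("disk-persist-sketch")
  .set("spark.local.dir", "/home/padma/sparkdir")  // overridden by SPARK_LOCAL_DIRS if set on the workers
val sc = new SparkContext(conf)

val lengths = sc.textFile("hdfs:///some/input").map(_.length)
lengths.persist(StorageLevel.DISK_ONLY)
// count() materializes the RDD; the block files land under hashed subdirectories of
// the local dir (directory names like spark-local-*), not under names that identify the RDD.
lengths.count()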

From: Priya Ch [mailto:learnings.chitt...@gmail.com]
Sent: Tuesday, September 23, 2014 6:31 PM
To: u...@spark.apache.org; dev@spark.apache.org
Subject: spark.local.dir and spark.worker.dir not used

Hi,

I am using Spark 1.0.0. In my Spark code I am trying to persist an RDD to disk 
with rdd.persist(DISK_ONLY), but unfortunately I couldn't find the location where 
the RDD has been written to disk. I pointed SPARK_LOCAL_DIRS and 
SPARK_WORKER_DIR at a location other than the default /tmp 
directory, but still couldn't see anything in the worker directory or the Spark 
local directory.

I also tried specifying the local dir and worker dir from the Spark code while 
defining the SparkConf, as conf.set("spark.local.dir", "/home/padma/sparkdir"), 
but the directories are not used.


In general, which directories does Spark use for map output files, 
intermediate writes, and persisting RDDs to disk?


Thanks,
Padma Ch


resources allocated for an application

2014-09-23 Thread rapelly kartheek
Hi,

I am trying to find out where exactly in the Spark code the resources
get allocated for a newly submitted Spark application.

I have a standalone Spark cluster. Can someone please direct me to the
right part of the code?

regards


Re: Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread Reynold Xin
On Tue, Sep 23, 2014 at 12:47 AM, Yi Tian  wrote:

> Hi all,
>
> I have some questions about the SparkSQL and Hive-on-Spark
>
> Will SparkSQL support all the Hive features in the future, or just treat
> Hive as a data source for Spark?
>

Most likely not *ALL* Hive features, but almost all common features.


>
> Since Spark 1.1.0, the Thrift server supports running HQL on Spark.
> Will this feature be replaced by Hive on Spark?
>

No.


>
> The reason for asking these questions is that we found some Hive functions
> do not run well on SparkSQL (such as window functions, cube, and rollup).


> Is it worth making the effort to implement these functions in SparkSQL?
> Could you guys give some advice?
>

Yes absolutely.


>
> thank you.
>
>
> Best Regards,
>
> Yi Tian
> tianyi.asiai...@gmail.com
>
>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread Yi Tian
Hi all,

I have some questions about the SparkSQL and Hive-on-Spark

Will SparkSQL support all the Hive features in the future, or just treat Hive
as a data source for Spark?

Since Spark 1.1.0, the Thrift server supports running HQL on Spark. Will
this feature be replaced by Hive on Spark?

The reason for asking these questions is that we found some Hive functions do
not run well on SparkSQL (such as window functions, cube, and rollup).

Is it worth making the effort to implement these functions in SparkSQL? Could
you guys give some advice?
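
For reference, the kind of queries in question, written against a hypothetical
`sales` table through HiveContext (illustrative only, assuming an existing
SparkContext `sc`; as noted above, window functions and rollup are among the Hive
features that do not yet run well on SparkSQL):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Window function: rank stores by revenue within each region.
hiveContext.hql(
  """SELECT region, store, revenue,
    |       RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
    |FROM sales""".stripMargin)

// Rollup: per-store totals, per-region subtotals, and a grand total in one query.
hiveContext.hql(
  "SELECT region, store, SUM(revenue) AS total FROM sales GROUP BY region, store WITH ROLLUP")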

thank you.


Best Regards,

Yi Tian
tianyi.asiai...@gmail.com





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org