Re: Parquet divide by zero

2015-01-28 Thread Lukas Nalezenec

Hi Jim,
I am sorry, I know about your patch and I will commit it ASAP.

Lukas Nalezenec


On 28.1.2015 22:28, Jim Carroll wrote:

Hello all,

I've been hitting a divide by zero error in Parquet through Spark, detailed
(and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102

Is anyone else hitting this error? I hit it frequently.

It looks like the Parquet team is preparing to release 1.6.0 and, since they
have been completely unresponsive, I'm assuming it's going to go out with this
bug (without the fix). Other than the fact that the divide by zero mistake
is obvious, perhaps the conditions under which it occurs are rare and I'm doing
something wrong.

Has anyone else hit this and if so, have they resolved it?

Thanks
Jim








Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Lukas Nalezenec

+1

On 22.1.2015 18:30, Marco Shaw wrote:

Sudipta - Please don't ever come here or post here again.

On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee <asudipta.baner...@gmail.com> wrote:


Hi Nicos, Taking forward your argument, please be a smart a$$ and
don't use unprofessional language just for the sake of being a
moderator.
Paco Nathan is respected for the dignity he carries in sharing his
knowledge and making it available free for a$$es like us, right!
So just mind your tongue next time you put such a$$ in your mouth.

Best Regards,
Sudipta

On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis <ikon...@me.com> wrote:

Folks,
Just a gentle reminder we owe to ourselves:
- this is a public forum and we need to behave accordingly; it is not a place to vent frustration in a rude way
- getting attention here is an earned privilege and not an entitlement
- this is not a “Platinum Support” department of your vendor but rather an open source collaboration forum where people volunteer their time to pay attention to your needs
- there are still many gray areas, so be patient and articulate questions in as much detail as possible if you want to get quick help and not just be perceived as a smart a$$

FYI - Paco Nathan is a well respected Spark evangelist, and many people, including myself, owe our jump onto the Spark platform to his passion. People like Sean Owen keep us believing when we feel like we are hitting a dead end.

Please be respectful of the connections you are prized with and act civilized.

Have a great day!
- Nicos


> On Jan 22, 2015, at 7:49 AM, Sean Owen <so...@cloudera.com> wrote:
>
> Yes, this isn't a well-formed question, and got maybe the response it
> deserved, but the tone is veering off the rails. I just got a much
> ruder reply from Sudipta privately, which I will not forward. Sudipta,
> I suggest you take the responses you've gotten so far as about as much
> answer as can be had here and do some work yourself, and come back
> with much more specific questions, and it will all be helpful and
> polite again.
>
> On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
> <asudipta.baner...@gmail.com> wrote:
>> Hi Marco,
>>
>> Thanks for the confirmation. Please let me know what further details
>> you need to answer a very specific question: WHAT IS THE MINIMUM
>> HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN
>> on a system?
>>
>> Please let me know if you need any further information, and if you don't
>> know, please drive across with the $1 to Sir Paco Nathan and get me the
>> answer.
>>
>> Thanks and Regards,
>> Sudipta
>>
>> On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw <marco.s...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Let me reword your request so you understand how (too) generic your
>>> question is
>>>
>>> "Hi, I have $10,000, please find me some means of transportation so I can
>>> get to work."
>>>
>>> Please provide (a lot) more details. If you can't, consider using one of
>>> the pre-built express VMs from either Cloudera, Hortonworks or MapR, for
>>> example.
>>>
>>> Marco
>>>
>>>
>>>
 On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee <asudipta.baner...@gmail.com> wrote:



 Hi Apache-Spark team,

 What are the system requirements for installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.


 Thanks and regards,
 Sudipta




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099 
 


>>
>>
>>
>>
>> --
>> Sudipta Banerjee
>> Consultant, Business Analytics and Cloud Based Architecture
>> Call me +919019578099 
>
>
  

Re: Mapping Hadoop Reduce to Spark

2014-09-05 Thread Lukas Nalezenec

Hi,

FYI: There is a bug in Java mapPartitions - SPARK-3369. In Java, results from
mapPartitions and similar functions must fit in memory. Look at the example
linked below - it returns a List.


Lukas


On 1.9.2014 00:50, Matei Zaharia wrote:
mapPartitions just gives you an Iterator of the values in each 
partition, and lets you return an Iterator of outputs. For instance, 
take a look at 
https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java#L694.
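For reference, a minimal Scala sketch of the same shape (the local setup and toy data here are assumptions, not part of the linked test):

  import org.apache.spark.{SparkConf, SparkContext}

  // Assumed local setup and toy data, just to show the shape of the call.
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("mapPartitions-sketch"))
  val lines = sc.parallelize(Seq("a", "bb", "ccc", "dddd"), numSlices = 2)

  // mapPartitions gets an Iterator over one partition and must return an
  // Iterator of outputs; here we emit a single count per partition.
  val perPartitionCounts = lines.mapPartitions(iter => Iterator(iter.size))
  perPartitionCounts.collect().foreach(println)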


Matei

On August 31, 2014 at 12:26:51 PM, Steve Lewis (lordjoe2...@gmail.com) wrote:



Is there a sample of how to do this?
I see 1.1 is out but cannot find samples of mapPartitions.
A Java sample would be very useful.


On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:


In 1.1, you'll be able to get all of these properties using
sortByKey, and then mapPartitions on top to iterate through the
key-value pairs. Unfortunately sortByKey does not let you control
the Partitioner, but it's fairly easy to write your own version
that does if this is important.

In previous versions, the values for each key had to fit in
memory (though we could have data on disk across keys), and this
is still true for groupByKey, cogroup and join. Those
restrictions will hopefully go away in a later release. But
sortByKey + mapPartitions lets you just iterate through the
key-value pairs without worrying about this.
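A rough Scala sketch of that sortByKey + mapPartitions pattern (assuming a spark-shell sc; the sample data and per-key summing are illustrative assumptions):

  import org.apache.spark.SparkContext._

  // After sortByKey, all pairs with the same key land in the same partition
  // and arrive in sorted order, so we can stream through each run of equal
  // keys without ever materializing a whole group.
  val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("b", 4)))

  val perKeySums = pairs.sortByKey().mapPartitions { iter =>
    val buffered = iter.buffered
    new Iterator[(String, Int)] {
      def hasNext = buffered.hasNext
      def next() = {
        val key = buffered.head._1
        var sum = 0
        while (buffered.hasNext && buffered.head._1 == key) {
          sum += buffered.next()._2
        }
        (key, sum)
      }
    }
  }
  perKeySums.collect().foreach(println)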

Matei

On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:


When programming in Hadoop it is possible to guarantee:
1) All keys sent to a specific partition will be handled by the same machine (thread)
2) All keys received by a specific machine (thread) will be received in sorted order
3) These conditions will hold even if the values associated with a specific key are too large to fit in memory.

In my Hadoop code I use all of these conditions - specifically, with my larger data sets the size of the data I wish to group exceeds the available memory.

I think I understand the operation of groupBy, but my understanding is that this requires that the results for a single key, and perhaps all keys, fit on a single machine.

Is there a way to perform this like Hadoop and not require that an entire group fit in memory?





--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com





Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-18 Thread lukas nalezenec
Hi
Try using *reduceByKeyLocally*.
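For example, a minimal sketch (sample data assumed, spark-shell sc assumed):

  import org.apache.spark.SparkContext._

  // reduceByKeyLocally merges values per key on each partition and returns
  // the combined result to the driver as a Map - no shuffle stage, but the
  // final map must fit in driver memory.
  val counts: scala.collection.Map[String, Int] =
    sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .reduceByKeyLocally(_ + _)

  counts.foreach(println)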
Regards
Lukas Nalezenec


On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia wrote:

> Make sure you set up enough reduce partitions so you don’t overload them.
> Another thing that may help is checking whether you’ve run out of local
> disk space on the machines, and turning on spark.shuffle.consolidateFiles
> to produce fewer files. Finally, there’s been a recent fix in both branch
> 0.9 and master that reduces the amount of memory used when there are small
> files (due to extra memory that was being taken by mmap()):
> https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
> either the 1.0 release candidates on the dev list, or branch-0.9 in git.
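As a rough illustration of those knobs (the partition count, app name, and input path below are assumptions, not recommendations):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  // Illustrative only: consolidate shuffle files and pass an explicit,
  // larger reduce partition count so reducers are not overloaded.
  val conf = new SparkConf()
    .setAppName("reduceByKey-tuning-sketch")
    .set("spark.shuffle.consolidateFiles", "true")
  val sc = new SparkContext(conf)

  val counts = sc.textFile("hdfs:///data/events")
    .map(line => (line, 1))
    .reduceByKey(_ + _, 400)   // second argument = number of reduce partitions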
>
> Matei
>
> On May 17, 2014, at 5:45 PM, Madhu  wrote:
>
> > Daniel,
> >
> > How many partitions do you have?
> > Are they more or less uniformly distributed?
> > We have similar data volume currently running well on Hadoop MapReduce
> > with roughly 30 nodes.
> > I was planning to test it with Spark.
> > I'm very interested in your findings.
> >
> >
> >
> > -
> > Madhu
> > https://www.linkedin.com/in/msiddalingaiah
>
>


Re: Is there any problem on the spark mailing list?

2014-05-11 Thread lukas nalezenec
There was an outage: https://blogs.apache.org/infra/entry/mail_outage



On Fri, May 9, 2014 at 1:27 PM, wxhsdp  wrote:

> i think so, fewer questions and answers these three days
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5522.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Cannot compile SIMR with Spark 0.9.1

2014-04-28 Thread lukas nalezenec
Hi,
I am trying to recompile SIMR with Spark 0.9.1, but it fails on an incompatible
method:

[error]
/home/lukas/src/simr/src/main/scala/org/apache/spark/simr/RelayServer.scala:213:
not enough arguments for method createActorSystem: (name: String, host:
String, port: Int, indestructible: Boolean, conf:
org.apache.spark.SparkConf)(akka.actor.ActorSystem, Int).
[error] Unspecified value parameter conf.
[error] val (as, port) = AkkaUtils.createActorSystem(SIMR_SYSTEM_NAME,
hostname, 0)
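Based only on the signature shown in that error, a guess at the adjusted call would be to pass the missing SparkConf explicitly (untested sketch; whether a default SparkConf is appropriate for SIMR's relay is an assumption):

  // Untested sketch: only the `conf` parameter reported as missing is added;
  // the other arguments stay as in RelayServer.scala.
  val (as, port) = AkkaUtils.createActorSystem(SIMR_SYSTEM_NAME, hostname, 0,
    conf = new org.apache.spark.SparkConf())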


Is anybody using SIMR with Spark 0.9.1?
Is this a known issue?

Thanks in advance
Lukas


Re: about rdd.filter()

2014-04-23 Thread Lukas Nalezenec

Hi,
can you please add a stack trace?
Lukas

On 23.4.2014 11:45, randylu wrote:

my code is like:
  rdd2 = rdd1.filter(_._2.length > 1)
  rdd2.collect()
it works well, but if I use a variable num instead of 1:
  var num = 1
  rdd2 = rdd1.filter(_._2.length > num)
  rdd2.collect()
it fails at rdd2.collect()
so strange?







strange StreamCorruptedException

2014-04-18 Thread Lukas Nalezenec

Hi all,

I am running an algorithm similar to wordcount and I am not sure why it
fails at the end; there are only 200 words, so the result of the computation
should be small.


I have got a SIMR command line with Spark 0.8.1 and 50 workers, each with
~512M RAM.

The dataset is a 100 GB tab-separated text HadoopRDD; it has ~6000 partitions.

My command line is:
dataset.map(x => x.split("\t")).map(x => (x(2), x(3).toInt)).reduceByKey(_ + _).collect


It throws this exception:

java.io.StreamCorruptedException (java.io.StreamCorruptedException: 
invalid type code: AC)

java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:101)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:26)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:53)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:95)

What am I doing wrong?

Thanks!
Best Regards
Lukas